
Springer Texts in Statistics

Series Editors:
G. Casella
S. Fienberg
I. Olkin

For further volumes:


http://www.springer.com/series/417
Anirban DasGupta

Probability for Statistics


and Machine Learning

Fundamentals and Advanced Topics

Anirban DasGupta
Department of Statistics
Purdue University
150 N. University Street
West Lafayette, IN 47907, USA
[email protected]

Mathematica® is a registered trademark of Wolfram Research, Inc.

ISBN 978-1-4419-9633-6 e-ISBN 978-1-4419-9634-3


DOI 10.1007/978-1-4419-9634-3
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2011924777

© Springer Science+Business Media, LLC 2011


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To Persi Diaconis, Peter Hall, Ashok Maitra,
and my mother, with affection
Preface

This is the companion second volume to my undergraduate text Fundamentals of
Probability: A First Course. The purpose of my writing this book is to give gradu-
ate students, instructors, and researchers in statistics, mathematics, and computer
science a lucidly written unique text at the confluence of probability, advanced
stochastic processes, statistics, and key tools for machine learning. Numerous top-
ics in probability and stochastic processes of current importance in statistics and
machine learning that are widely scattered in the literature in many different spe-
cialized books are all brought together under one fold in this book. This is done
with an extensive bibliography for each topic, and numerous worked-out examples
and exercises. Probability, with all its models, techniques, and its poignant beauty,
is an incredibly powerful tool for anyone who deals with data or randomness. The
content and the style of this book reflect that philosophy; I emphasize lucidity, a
wide background, and the far-reaching applicability of probability in science.
The book starts with a self-contained and fairly complete review of basic prob-
ability, and then traverses its way through the classics, to advanced modern topics
and tools, including a substantial amount of statistics itself. Because of its nearly
encyclopaedic coverage, it can serve as a graduate text for a year-long probabil-
ity sequence, or for focused short courses on selected topics, for self-study, and as
a nearly unique reference for research in statistics, probability, and computer sci-
ence. It provides an extensive treatment of most of the standard topics in a graduate
probability sequence, and integrates them with the basic theory and many examples
of several core statistical topics, as well as with some tools of major importance
in machine learning. This is done with unusually detailed bibliographies for the
reader who wants to dig deeper into a particular topic, and with a huge repertoire of
worked-out examples and exercises. The total number of worked-out examples in
this book is 423, and the total number of exercises is 808. An instructor can rotate
the exercises between semesters, and use them for setting exams, and a student can
use them for additional exam preparation and self-study. I believe that the book is
unique in its range, unification, bibliographic detail, and its collection of problems
and examples.
Topics in core probability, such as distribution theory, asymptotics, Markov
chains, martingales, Poisson processes, random walks, and Brownian motion are
covered in the first 14 chapters. In these chapters, a reader will also find basic
coverage of such core statistical topics as confidence intervals, likelihood functions,
maximum likelihood estimates, posterior densities, sufficiency, hypothesis testing,
variance stabilizing transformations, and extreme value theory, all illustrated with
many examples. In Chapters 15, 16, and 17, I treat three major topics of great appli-
cation potential, empirical processes and VC theory, probability metrics, and large
deviations. Chapters 18, 19, and 20 are specifically directed to the statistics and
machine-learning community, and cover simulation, Markov chain Monte Carlo,
the exponential family, bootstrap, the EM algorithm, and kernels.
The book does not make formal use of measure theory. I do not intend to mini-
mize the role of measure theory in a rigorous study of probability. However, I believe
that a large amount of probability can be taught, understood, enjoyed, and applied
without needing formal use of measure theory. We do it around the world every
day. At the same time, some theorems cannot be proved without at least a men-
tion of some measure theory terminology. Even some definitions require a mention
of some measure theory notions. I include some unavoidable mention of measure-
theoretic terms and results, such as the strong law of large numbers and its proof, the
dominated convergence theorem, monotone convergence, Lebesgue measure, and a
few others, but only in the advanced chapters in the book.
Following the table of contents, I have suggested some possible courses with
different themes using this book. I have also marked the nonroutine and harder ex-
ercises in each chapter with an asterisk. Likewise, some specialized sections with
reference value have also been marked with an asterisk. Generally, the exercises
and the examples come with a caption, so that the reader will immediately know the
content of an exercise or an example. The end of the proof of a theorem has been
marked by a □ sign.
My deepest gratitude and appreciation are due to Peter Hall. I am lucky that the
style and substance of this book are significantly molded by Peter’s influence. Out of
habit, I sent him the drafts of nearly every chapter as I was finishing them. It didn’t
matter where exactly he was, I always received his input and gentle suggestions
for improvement. I have found Peter to be a concerned and warm friend, teacher,
mentor, and guardian, and for this, I am extremely grateful.
Mouli Banerjee, Rabi Bhattacharya, Burgess Davis, Stewart Ethier, Arthur
Frazho, Evarist Giné, T. Krishnan, S. N. Lahiri, Wei-Liem Loh, Hyun-Sook Oh,
B. V. Rao, Yosi Rinott, Wen-Chi Tsai, Frederi Viens, and Larry Wasserman
graciously went over various parts of this book. I am deeply indebted to each
of them. Larry Wasserman, in particular, suggested the chapters on empirical pro-
cesses, VC theory, concentration inequalities, the exponential family, and Markov
chain Monte Carlo. The Springer series editors, Peter Bickel, George Casella, Steve
Fienberg, and Ingram Olkin have consistently supported my efforts, and I am so very
thankful to them. Springer’s incoming executive editor Marc Strauss saw through
the final production of this book extremely efficiently, and I have much enjoyed
working with him. I appreciated Marc’s gentility and his thoroughly professional
handling of the transition of the production of this book to his oversight. Valerie
Greco did an astonishing job of copyediting the book. The presentation, display,
and the grammar of the book are substantially better because of the incredible care

and thoughtfulness that she put into correcting my numerous errors. The staff at
SPi Technologies, Chennai, India did an astounding and marvelous job of produc-
ing this book. Six anonymous reviewers gave extremely gracious and constructive
comments, and their input has helped me in various dimensions to make this a
better book. Doug Crabill is the greatest computer systems administrator, and with
an infectious pleasantness has bailed me out of my stupidity far too many times.
I also want to mention my fond memories and deep-rooted feelings for the Indian
Statistical Institute, where I had all of my college education. It was just a wonderful
place for research, education, and friendships. Nearly everything that I know is due
to my years at the Indian Statistical Institute, and for this I am thankful.
This is the third time that I have written a book in contract with John Kimmel.
John is much more than a nearly unique person in the publishing world. To me,
John epitomizes sensitivity and professionalism, a singular combination. I have now
known John for almost six years, and it is very very difficult not to appreciate and
admire him a whole lot for his warmth, style, and passion for the subjects of statis-
tics and probability. Ironically, the day that this book entered production, the news
came that John was leaving Springer. I will remember John’s contribution to my
professional growth with enormous respect and appreciation.

West Lafayette, Indiana Anirban DasGupta


Contents

Suggested Courses with Different Themes . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . xix

1 Review of Univariate Probability . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 1


1.1 Experiments and Sample Spaces . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 1
1.2 Conditional Probability and Independence.. . . . . . . . . .. . . . . . . . . . . . . . . . . 5
1.3 Integer-Valued and Discrete Random Variables . . . . .. . . . . . . . . . . . . . . . . 8
1.3.1 CDF and Independence.. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 9
1.3.2 Expectation and Moments.. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 13
1.4 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 19
1.5 Generating and Moment-Generating Functions . . . . .. . . . . . . . . . . . . . . . . 22
1.6 * Applications of Generating Functions to a Pattern Problem .. . . . . . 26
1.7 Standard Discrete Distributions . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 28
1.8 Poisson Approximation to Binomial . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 34
1.9 Continuous Random Variables.. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 36
1.10 Functions of a Continuous Random Variable . . . . . . . .. . . . . . . . . . . . . . . . . 42
1.10.1 Expectation and Moments.. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 45
1.10.2 Moments and the Tail of a CDF . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 49
1.11 Moment-Generating Function and Fundamental Inequalities .. . . . . . . 51
1.11.1 * Inversion of an MGF and Post's Formula . . . . . . . . . . . . . . . . . 53
1.12 Some Special Continuous Distributions .. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 54
1.13 Normal Distribution and Confidence Interval for a Mean .. . . . . . . . . . . 61
1.14 Stein’s Lemma .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 66
1.15 * Chernoff's Variance Inequality .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 68
1.16 * Various Characterizations of Normal Distributions . . . . . . . . . . . . . . . . 69
1.17 Normal Approximations and Central Limit Theorem . . . . . . . . . . . . . . . . 71
1.17.1 Binomial Confidence Interval .. . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 74
1.17.2 Error of the CLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 76
1.18 Normal Approximation to Poisson and Gamma . . . . .. . . . . . . . . . . . . . . . . 79
1.18.1 Confidence Intervals .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 80
1.19 * Convergence of Densities and Edgeworth Expansions .. . . . . . . . . . . . 82
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 92


2 Multivariate Discrete Distributions. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 95


2.1 Bivariate Joint Distributions and Expectations of Functions .. . . . . . . . 95
2.2 Conditional Distributions and Conditional Expectations .. . . . . . . . . . . .100
2.2.1 Examples on Conditional Distributions
and Expectations .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .101
2.3 Using Conditioning to Evaluate Mean and Variance . . . . . . . . . . . . . . . . .104
2.4 Covariance and Correlation .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .107
2.5 Multivariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .111
2.5.1 Joint MGF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .112
2.5.2 Multinomial Distribution .. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .114
2.6 * The Poissonization Technique .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .116

3 Multidimensional Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .123


3.1 Joint Density Function and Its Role . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .123
3.2 Expectation of Functions.. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .132
3.3 Bivariate Normal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .136
3.4 Conditional Densities and Expectations.. . . . . . . . . . . . .. . . . . . . . . . . . . . . . .140
3.4.1 Examples on Conditional Densities and Expectations .. . . . .142
3.5 Posterior Densities, Likelihood Functions, and Bayes Estimates . . . .147
3.6 Maximum Likelihood Estimates. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .152
3.7 Bivariate Normal Conditional Distributions . . . . . . . . .. . . . . . . . . . . . . . . . .154
3.8 * Useful Formulas and Characterizations for Bivariate Normal . . . . .155
3.8.1 Computing Bivariate Normal Probabilities .. . . . . . . . . . . . . . . . .157
3.9 * Conditional Expectation Given a Set and Borel's Paradox . . . . . . . .158
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .165

4 Advanced Distribution Theory .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .167


4.1 Convolutions and Examples . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .167
4.2 Products and Quotients and the t- and F-Distribution . . . . . . . . . . . . .172
4.3 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .177
4.4 Applications of Jacobian Formula .. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .178
4.5 Polar Coordinates in Two Dimensions . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .180
4.6 * n-Dimensional Polar and Helmert's Transformation .. . . . . . . . . . . . . .182
4.6.1 Efficient Spherical Calculations with Polar
Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .182
4.6.2 Independence of Mean and Variance
in Normal Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .185
4.6.3 The t Confidence Interval . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .187
4.7 The Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .188
4.7.1 * Picking a Point from the Surface of a Sphere .. . . . . . . . . . . .191
4.7.2 * Poincaré's Lemma .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .191
4.8 * Ten Important High-Dimensional Formulas
for Easy Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .191
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .197

5 Multivariate Normal and Related Distributions . . . . . . . . .. . . . . . . . . . . . . . . . .199


5.1 Definition and Some Basic Properties .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .199
5.2 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .202
5.3 Exchangeable Normal Variables .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .205
5.4 Sampling Distributions Useful in Statistics . . . . . . . . . .. . . . . . . . . . . . . . . . .207
5.4.1 * Wishart Expectation Identities .. . . . . . . . . . .. . . . . . . . . . . . . . . . .208
5.4.2 * Hotelling's T^2 and Distribution of Quadratic Forms . . . . .209
5.4.3 * Distribution of Correlation Coefficient .. .. . . . . . . . . . . . . . . . .212
5.5 Noncentral Distributions .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .213
5.6 Some Important Inequalities for Easy Reference .. . .. . . . . . . . . . . . . . . . .214
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .218

6 Finite Sample Theory of Order Statistics and Extremes .. . . . . . . . . . . . . . . .221


6.1 Basic Distribution Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .221
6.2 More Advanced Distribution Theory .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .225
6.3 Quantile Transformation and Existence of Moments .. . . . . . . . . . . . . . . .229
6.4 Spacings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .233
6.4.1 Exponential Spacings and Rényi's Representation . . . . . . . . .233
6.4.2 Uniform Spacings . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .234
6.5 Conditional Distributions and Markov Property .. . . .. . . . . . . . . . . . . . . . .235
6.6 Some Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .238
6.6.1 * Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .238
6.6.2 The Empirical CDF . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .241
6.7 * Distribution of the Multinomial Maximum .. . . . . .. . . . . . . . . . . . . . . . .243
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .247

7 Essential Asymptotics and Applications . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .249


7.1 Some Basic Notation and Convergence Concepts . . .. . . . . . . . . . . . . . . . .250
7.2 Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .254
7.3 Convergence Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .259
7.4 Convergence in Distribution . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .262
7.5 Preservation of Convergence and Statistical Applications . . . . . . . . . . .267
7.5.1 Slutsky’s Theorem .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .268
7.5.2 Delta Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .269
7.5.3 Variance Stabilizing Transformations . . . . . .. . . . . . . . . . . . . . . . .272
7.6 Convergence of Moments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .274
7.6.1 Uniform Integrability .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .275
7.6.2 The Moment Problem and Convergence in Distribution.. . .277
7.6.3 Approximation of Moments.. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .278
7.7 Convergence of Densities and Scheffé’s Theorem .. .. . . . . . . . . . . . . . . . .282
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .292

8 Characteristic Functions and Applications .. . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .293


8.1 Characteristic Functions of Standard Distributions . .. . . . . . . . . . . . . . . . .294
8.2 Inversion and Uniqueness .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .298
8.3 Taylor Expansions, Differentiability, and Moments .. . . . . . . . . . . . . . . . .302
8.4 Continuity Theorems .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .303
8.5 Proof of the CLT and the WLLN . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .305
8.6 * Producing Characteristic Functions . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .306
8.7 Error of the Central Limit Theorem . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .308
8.8 Lindeberg–Feller Theorem for General Independent Case . . . . . . . . . . .311
8.9 * Infinite Divisibility and Stable Laws . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .315
8.10 * Some Useful Inequalities . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .317
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .322

9 Asymptotics of Extremes and Order Statistics . . . . . . . . . . .. . . . . . . . . . . . . . . . .323


9.1 Central-Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .323
9.1.1 Single-Order Statistic. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .323
9.1.2 Two Statistical Applications . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .325
9.1.3 Several Order Statistics. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .326
9.2 Extremes .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .328
9.2.1 Easily Applicable Limit Theorems . . . . . . . . .. . . . . . . . . . . . . . . . .328
9.2.2 The Convergence of Types Theorem . . . . . . .. . . . . . . . . . . . . . . . .332
9.3 * Fisher–Tippett Family and Putting it Together . . . .. . . . . . . . . . . . . . . . .333
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .338

10 Markov Chains and Applications .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .339


10.1 Notation and Basic Definitions . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .340
10.2 Examples and Various Applications as a Model .. . . .. . . . . . . . . . . . . . . . .340
10.3 Chapman–Kolmogorov Equation .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .345
10.4 Communicating Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .349
10.5 Gambler’s Ruin .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .352
10.6 First Passage, Recurrence, and Transience .. . . . . . . . . .. . . . . . . . . . . . . . . . .354
10.7 Long Run Evolution and Stationary Distributions .. .. . . . . . . . . . . . . . . . .359
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .374

11 Random Walks .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .375


11.1 Random Walk on the Cubic Lattice . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .375
11.1.1 Some Distribution Theory .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .378
11.1.2 Recurrence and Transience.. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .379
11.1.3 * Pólya's Formula for the Return Probability .. . . . . . . . . . . . . .382
11.2 First Passage Time and Arc Sine Law .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .383
11.3 The Local Time.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .387
11.4 Practically Useful Generalizations . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .389
11.5 Wald’s Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .390
11.6 Fate of a Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .392

11.7 Chung–Fuchs Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .394


11.8 Six Important Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .396
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .400

12 Brownian Motion and Gaussian Processes . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .401


12.1 Preview of Connections to the Random Walk . . . . . . .. . . . . . . . . . . . . . . . .402
12.2 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .403
12.2.1 Condition for a Gaussian Process to be Markov . . . . . . . . . . . .406
12.2.2 * Explicit Construction of Brownian Motion . . . . . . . . . . . . . . .407
12.3 Basic Distributional Properties . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .408
12.3.1 Reflection Principle and Extremes .. . . . . . . . .. . . . . . . . . . . . . . . . .410
12.3.2 Path Properties and Behavior Near Zero and Infinity .. . . . . .412
12.3.3 * Fractal Nature of Level Sets . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .415
12.4 The Dirichlet Problem and Boundary Crossing Probabilities .. . . . . . .416
12.4.1 Recurrence and Transience.. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .418
12.5 The Local Time of Brownian Motion . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .419
12.6 Invariance Principle and Statistical Applications . . . .. . . . . . . . . . . . . . . . .421
12.7 Strong Invariance Principle and the KMT Theorem .. . . . . . . . . . . . . . . . .425
12.8 Brownian Motion with Drift and Ornstein–Uhlenbeck Process.. . . . .427
12.8.1 Negative Drift and Density of Maximum.. .. . . . . . . . . . . . . . . . .427
12.8.2 * Transition Density and the Heat Equation . . . . . . . . . . . . . . .428
12.8.3 * The Ornstein–Uhlenbeck Process . . . . . . . .. . . . . . . . . . . . . . . . .429
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .435

13 Poisson Processes and Applications.. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .437


13.1 Notation .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .438
13.2 Defining a Homogeneous Poisson Process. . . . . . . . . . .. . . . . . . . . . . . . . . . .439
13.3 Important Properties and Uses as a Statistical Model . . . . . . . . . . . . . . . .440
13.4 * Linear Poisson Process and Brownian Motion: A Connection . . . .448
13.5 Higher-Dimensional Poisson Point Processes . . . . . . .. . . . . . . . . . . . . . . . .450
13.5.1 The Mapping Theorem . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .452
13.6 One-Dimensional Nonhomogeneous Processes . . . . .. . . . . . . . . . . . . . . . .453
13.7 * Campbell's Theorem and Shot Noise . . . . . . . . . . . .. . . . . . . . . . . . . . . . .456
13.7.1 Poisson Process and Stable Laws . . . . . . . . . . .. . . . . . . . . . . . . . . . .458
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .462

14 Discrete Time Martingales and Concentration Inequalities . . . . . . . . . . . . .463


14.1 Illustrative Examples and Applications in Statistics . . . . . . . . . . . . . . . . . .463
14.2 Stopping Times and Optional Stopping . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .468
14.2.1 Stopping Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .469
14.2.2 Optional Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .470
14.2.3 Sufficient Conditions for Optional Stopping Theorem . . . . .472
14.2.4 Applications of Optional Stopping . . . . . . . . .. . . . . . . . . . . . . . . . .474

14.3 Martingale and Concentration Inequalities.. . . . . . . . . .. . . . . . . . . . . . . . . . .477


14.3.1 Maximal Inequality .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .477
14.3.2 * Inequalities of Burkholder, Davis, and Gundy .. . . . . . . . . . .480
14.3.3 Inequalities of Hoeffding and Azuma . . . . . .. . . . . . . . . . . . . . . . .483
14.3.4 * Inequalities of McDiarmid and Devroye .. . . . . . . . . . . . . . . . .485
14.3.5 The Upcrossing Inequality . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .488
14.4 Convergence of Martingales . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .490
14.4.1 The Basic Convergence Theorem .. . . . . . . . . .. . . . . . . . . . . . . . . . .490
14.4.2 Convergence in L1 and L2 . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .493
14.5 * Reverse Martingales and Proof of SLLN . . . . . . . .. . . . . . . . . . . . . . . . .494
14.6 Martingale Central Limit Theorem .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .497
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .503

15 Probability Metrics .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .505


15.1 Standard Probability Metrics Useful in Statistics . . . .. . . . . . . . . . . . . . . . .505
15.2 Basic Properties of the Metrics . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .508
15.3 Metric Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .515
15.4 Differential Metrics for Parametric Families. . . . . . . . .. . . . . . . . . . . . . . . . .519
15.4.1 * Fisher Information and Differential Metrics . . . . . . . . . . . . . .520
15.4.2 * Rao's Geodesic Distances on Distributions . . . . . . . . . . . . . . .522
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .525

16 Empirical Processes and VC Theory . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .527


16.1 Basic Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .527
16.2 Classic Asymptotic Properties of the Empirical Process . . . . . . . . . . . . .529
16.2.1 Invariance Principle and Statistical Applications . . . . . . . . . . .531
16.2.2 * Weighted Empirical Process . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .534
16.2.3 The Quantile Process . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .536
16.2.4 Strong Approximations of the Empirical Process .. . . . . . . . . .537
16.3 Vapnik–Chervonenkis Theory . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .538
16.3.1 Basic Theory .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .538
16.3.2 Concrete Examples . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .540
16.4 CLTs for Empirical Measures and Applications .. . . .. . . . . . . . . . . . . . . . .543
16.4.1 Notation and Formulation .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .543
16.4.2 Entropy Bounds and Specific CLTs. . . . . . . . .. . . . . . . . . . . . . . . . .544
16.4.3 Concrete Examples . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .547
16.5 Maximal Inequalities and Symmetrization .. . . . . . . . . .. . . . . . . . . . . . . . . . .547
16.6 * Connection to the Poisson Process . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .551
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .557

17 Large Deviations .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .559


17.1 Large Deviations for Sample Means . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .560
17.1.1 The Cramér–Chernoff Theorem in R. . . . . . .. . . . . . . . . . . . . . . . .560
17.1.2 Properties of the Rate Function . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .564
17.1.3 Cramér’s Theorem for General Sets . . . . . . . .. . . . . . . . . . . . . . . . .566

17.2 The Gärtner–Ellis Theorem and Markov Chain Large


Deviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .567
17.3 The t-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .570
17.4 Lipschitz Functions and Talagrand’s Inequality . . . . .. . . . . . . . . . . . . . . . .572
17.5 Large Deviations in Continuous Time.. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .574
17.5.1 * Continuity of a Gaussian Process. . . . . . . . .. . . . . . . . . . . . . . . . .576
17.5.2 * Metric Entropy of T and Tail of the Supremum . . . . . . . . . .577
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .582

18 The Exponential Family and Statistical Applications . . .. . . . . . . . . . . . . . . . .583


18.1 One-Parameter Exponential Family . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .583
18.1.1 Definition and First Examples . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .584
18.2 The Canonical Form and Basic Properties . . . . . . . . . . .. . . . . . . . . . . . . . . . .589
18.2.1 Convexity Properties . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .590
18.2.2 Moments and Moment Generating Function .. . . . . . . . . . . . . . .591
18.2.3 Closure Properties . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .594
18.3 Multiparameter Exponential Family. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .596
18.4 Sufficiency and Completeness .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .600
18.4.1 * Neyman–Fisher Factorization and Basu's Theorem . . . . . .602
18.4.2 * Applications of Basu's Theorem to Probability .. . . . . . . . . .604
18.5 Curved Exponential Family .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .607
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .612

19 Simulation and Markov Chain Monte Carlo . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .613


19.1 The Ordinary Monte Carlo. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .615
19.1.1 Basic Theory and Examples.. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .615
19.1.2 Monte Carlo P-Values . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .622
19.1.3 Rao–Blackwellization . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .623
19.2 Textbook Simulation Techniques .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .624
19.2.1 Quantile Transformation and Accept–Reject.. . . . . . . . . . . . . . .624
19.2.2 Importance Sampling and Its Asymptotic Properties . . . . . . .629
19.2.3 Optimal Importance Sampling Distribution .. . . . . . . . . . . . . . . .633
19.2.4 Algorithms for Simulating from Common
Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .634
19.3 Markov Chain Monte Carlo. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .637
19.3.1 Reversible Markov Chains . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .639
19.3.2 Metropolis Algorithms . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .642
19.4 The Gibbs Sampler .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .645
19.5 Convergence of MCMC and Bounds on Errors .. . . . .. . . . . . . . . . . . . . . . .651
19.5.1 Spectral Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .653
19.5.2 * Dobrushin's Inequality and Diaconis–Fill–
Stroock Bound .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .657
19.5.3 * Drift and Minorization Methods .. . . . . . . . .. . . . . . . . . . . . . . . . .659

19.6 MCMC on General Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .662


19.6.1 General Theory and Metropolis Schemes . .. . . . . . . . . . . . . . . . .662
19.6.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .666
19.6.3 Convergence of the Gibbs Sampler .. . . . . . . .. . . . . . . . . . . . . . . . .670
19.7 Practical Convergence Diagnostics .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .673
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .686

20 Useful Tools for Statistics and Machine Learning . . . . . . .. . . . . . . . . . . . . . . . .689


20.1 The Bootstrap.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .689
20.1.1 Consistency of the Bootstrap .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .692
20.1.2 Further Examples .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .696
20.1.3 * Higher-Order Accuracy of the Bootstrap.. . . . . . . . . . . . . . . . .699
20.1.4 Bootstrap for Dependent Data . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .701
20.2 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .704
20.2.1 The Algorithm and Examples .. . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .706
20.2.2 Monotone Ascent and Convergence of EM . . . . . . . . . . . . . . . . .711
20.2.3 * Modifications of EM . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .714
20.3 Kernels and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .715
20.3.1 Smoothing by Kernels .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .715
20.3.2 Some Common Kernels in Use . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .717
20.3.3 Kernel Density Estimation . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .719
20.3.4 Kernels for Statistical Classification . . . . . . . .. . . . . . . . . . . . . . . . .724
20.3.5 Mercer’s Theorem and Feature Maps.. . . . . .. . . . . . . . . . . . . . . . .732
References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .744

A Symbols, Useful Formulas, and Normal Table . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .747


A.1 Glossary of Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .747
A.2 Moments and MGFs of Common Distributions . . . . .. . . . . . . . . . . . . . . . .750
A.3 Normal Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .755

Author Index. . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .757

Subject Index . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .763


Suggested Courses with Different Themes

Duration Theme Chapters


15 weeks Beginning graduate 2–7, 9
15 weeks Advanced graduate 7, 8, 10, 11, 12, 13, 14
15 weeks Special topics for statistics students 9, 10, 15, 16, 17, 18, 20
15 weeks Special topics for computer science students 4, 11, 14, 16, 17, 18, 19
8 weeks Summer course for statistics students 11, 12, 14, 20
8 weeks Summer course for computer science students 14, 16, 18, 20
8 weeks Summer course on modeling and simulation 4, 10, 13, 19

Chapter 1
Review of Univariate Probability

Probability is a universally accepted tool for expressing degrees of confidence or
doubt about some proposition in the presence of incomplete information or uncer-
tainty. By convention, probabilities are calibrated on a scale of 0 to 1; assigning
something a zero probability amounts to expressing the belief that we consider it
impossible, whereas assigning a probability of one amounts to considering it a cer-
tainty. Most propositions fall somewhere in between. Probability statements that we
make can be based on our past experience, or on our personal judgments. Whether
our probability statements are based on past experience or subjective personal judg-
ments, they obey a common set of rules, which we can use to treat probabilities in
a mathematical framework, and also for making decisions on predictions, for un-
derstanding complex systems, or as intellectual experiments and for entertainment.
Probability theory is one of the most applicable branches of mathematics. It is used
as the primary tool for analyzing statistical methodologies; it is used routinely in
nearly every branch of science, such as biology, astronomy and physics, medicine,
economics, chemistry, sociology, ecology, finance, and many others. A background
in the theory, models, and applications of probability is almost a part of basic edu-
cation. That is how important it is.
For a classic and lively introduction to the subject of probability, we recommend
Feller (1968, 1971). Among numerous other expositions of the theory of probabil-
ity, a variety of examples on various topics can be seen in Ross (1984), Stirzaker
(1994), Pitman (1992), Bhattacharya and Waymire (2009), and DasGupta (2010).
Ash (1972), Chung (1974), Breiman (1992), Billingsley (1995), and Dudley (2002)
are masterly accounts of measure-theoretic probability.

1.1 Experiments and Sample Spaces

Treatment of probability theory starts with the consideration of a sample space.


The sample space is the set of all possible outcomes in some physical experiment.
For example, if a coin is tossed twice and after each toss the face that shows is
recorded, then the possible outcomes of this particular coin-tossing experiment, say
$\mathcal{E}$, are HH, HT, TH, TT, with H denoting the occurrence of heads and T denoting
the occurrence of tails. We call

$$\Omega = \{HH, HT, TH, TT\}$$

the sample space of the experiment $\mathcal{E}$.

In general, a sample space is a general set $\Omega$, finite or infinite. An easy example
where the sample space $\Omega$ is infinite is to toss a coin until the first time heads show
up and record the number of the trial at which the first head appeared. In this case,
the sample space $\Omega$ is the countably infinite set

$$\Omega = \{1, 2, 3, \ldots\}.$$

Sample spaces can also be uncountably infinite; for example, consider the experiment
of choosing a number at random from the interval $[0, 1]$. The sample space of
this experiment is $\Omega = [0, 1]$. In this case, $\Omega$ is an uncountably infinite set. In all
cases, individual elements of a sample space are denoted as $\omega$. The first task is to
define events and to explain the meaning of the probability of an event.

Definition 1.1. Let $\Omega$ be the sample space of an experiment $\mathcal{E}$. Then any subset $A$
of $\Omega$, including the empty set $\emptyset$ and the entire sample space $\Omega$, is called an event.

Events may contain even one single sample point $\omega$, in which case the event
is a singleton set $\{\omega\}$. We want to assign probabilities to events. But we want to
assign probabilities in a way that they are logically consistent. In fact, this cannot
be done in general if we insist on assigning probabilities to arbitrary collections of
sample points, that is, arbitrary subsets of the sample space $\Omega$. We can only define
probabilities for such subsets of $\Omega$ that are tied together like a family, the exact
concept being that of a $\sigma$-field. In most applications, including those cases where
the sample space $\Omega$ is infinite, events that we would want to normally think about
will be members of such an appropriate $\sigma$-field. So we do not mention the need
for consideration of $\sigma$-fields any further, and get along with thinking of events as
subsets of the sample space $\Omega$, including in particular the empty set $\emptyset$ and the entire
sample space $\Omega$ itself.
Here is a definition of what counts as a legitimate probability on events.

Definition 1.2. Given a sample space $\Omega$, a probability or a probability measure on
$\Omega$ is a function $P$ on subsets of $\Omega$ such that

(a) $P(A) \ge 0$ for any $A \subseteq \Omega$;
(b) $P(\Omega) = 1$;
(c) Given disjoint subsets $A_1, A_2, \ldots$ of $\Omega$, $P\left(\cup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

Property (c) is known as countable additivity. Note that it is not something that
can be proved, but it is like an assumption or an axiom. In our experience, we have
seen that operating as if the assumption is correct leads to useful and credible an-
swers in many problems, and so we accept it as a reasonable assumption. Not all
probabilists agree that countable additivity is natural; but we do not get into that
debate in this book. One important point is that finite additivity is subsumed in
countable additivity; that is, if there is some finite number $m$ of disjoint subsets
$A_1, A_2, \ldots, A_m$ of $\Omega$, then $P\left(\cup_{i=1}^{m} A_i\right) = \sum_{i=1}^{m} P(A_i)$. Also, it is useful to note
that the last two conditions in the definition of a probability measure imply that
$P(\emptyset)$, the probability of the empty set or the null event, is zero.

One notational convention is that strictly speaking, for an event that is just a
singleton set $\{\omega\}$, we should write $P(\{\omega\})$ to denote its probability. But to reduce
clutter, we simply use the more convenient notation $P(\omega)$.
One pleasant consequence of the axiom of countable additivity is the following
basic result. We do not prove it here as it is a simple result; see DasGupta (2010) for
a proof.

Theorem 1.1. Let $A_1 \supseteq A_2 \supseteq A_3 \supseteq \cdots$ be an infinite family of subsets of a sample
space $\Omega$ such that $A_n \downarrow A$. Then, $P(A_n) \to P(A)$ as $n \to \infty$.
Next, the concept of equally likely sample points is a very fundamental one.

Definition 1.3. Let $\Omega$ be a finite sample space consisting of $N$ sample points. We
say that the sample points are equally likely if $P(\omega) = \frac{1}{N}$ for each sample point $\omega$.

An immediate consequence, due to the additivity axiom, is the following useful
formula.

Proposition. Let $\Omega$ be a finite sample space consisting of $N$ equally likely sample
points. Let $A$ be any event and suppose $A$ contains $n$ distinct sample points. Then

$$P(A) = \frac{n}{N} = \frac{\text{Number of sample points favorable to } A}{\text{Total number of sample points}}.$$

Let us see some examples.

Example 1.1 (The Shoe Problem). Suppose there are five pairs of shoes in a closet
and four shoes are taken out at random. What is the probability that among the four
that are taken out, there is at least one complete pair?

The total number of sample points is $\binom{10}{4} = 210$. Because selection was done
completely at random, we assume that all sample points are equally likely. At least
one complete pair would mean two complete pairs, or exactly one complete pair
and two other nonconforming shoes. Two complete pairs can be chosen in $\binom{5}{2} = 10$
ways. Exactly one complete pair can be chosen in $\binom{5}{1}\binom{4}{2} \times 2 \times 2 = 120$ ways. The
$\binom{5}{1}$ term is for choosing the pair that is complete; the $\binom{4}{2}$ term is for choosing two
incomplete pairs, and then from each incomplete pair, one chooses the left or the
right shoe. Thus, the probability that there will be at least one complete pair among
the four shoes chosen is $(10 + 120)/210 = 13/21 = .62$.
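
For readers who want to double-check such counting arguments, here is a small brute-force enumeration of the shoe problem; the code and its labeling of the shoes are our illustration, not part of the text.

```python
# Enumerate all C(10, 4) = 210 equally likely draws of four shoes and count
# those containing at least one complete pair.
from itertools import combinations
from fractions import Fraction

# Label the ten shoes as (pair index, side), pairs 0..4, sides L and R.
shoes = [(pair, side) for pair in range(5) for side in ("L", "R")]

favorable = total = 0
for draw in combinations(shoes, 4):
    total += 1
    pairs_drawn = [p for p, _ in draw]
    # A complete pair is present iff some pair index occurs twice in the draw.
    if any(pairs_drawn.count(p) == 2 for p in set(pairs_drawn)):
        favorable += 1

print(Fraction(favorable, total))  # 13/21, about .62
```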

Example 1.2 (Five-Card Poker). In five-card poker, a player is given 5 cards from a
full deck of 52 cards at random. Various named hands of varying degrees of rarity
exist. In particular, we want to calculate the probabilities of $A$ = two pairs and
$B$ = a flush. Two pairs is a hand with 2 cards each of 2 different denominations and
the fifth card of some other denomination; a flush is a hand with 5 cards of the same
suit, but the cards cannot be of denominations in a sequence.

Then, $P(A) = \binom{13}{2}\binom{4}{2}^2\binom{44}{1} \big/ \binom{52}{5} = .04754$.

To find $P(B)$, note that there are 10 ways to select 5 cards from a suit such
that the cards are in a sequence, namely, $\{A, 2, 3, 4, 5\}, \{2, 3, 4, 5, 6\}, \ldots, \{10, J, Q, K, A\}$, and so,

$$P(B) = \binom{4}{1}\left[\binom{13}{5} - 10\right] \bigg/ \binom{52}{5} = .00197.$$

These are basic examples of counting arguments that are useful whenever there
is a finite sample space and we assume that all sample points are equally likely.
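
A quick numerical check of the two poker probabilities, using exact binomial coefficients; this sketch is our illustration (math.comb requires Python 3.8+).

```python
from math import comb

total = comb(52, 5)

# Two pairs: 2 denominations for the pairs, 2 suits within each pair, and one
# of the 44 remaining cards of some third denomination.
two_pairs = comb(13, 2) * comb(4, 2) ** 2 * comb(44, 1)
print(two_pairs / total)  # 0.0475390...

# Flush: one suit, any 5 of its 13 cards, minus the 10 straight sequences.
flush = comb(4, 1) * (comb(13, 5) - 10)
print(flush / total)  # 0.0019654...
```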
A major result in combinatorial probability is the inclusion–exclusion formula,
which says the following.

Theorem 1.2. Let $A_1, A_2, \ldots, A_n$ be $n$ general events. Let

$$S_1 = \sum_{i=1}^{n} P(A_i); \quad S_2 = \sum_{1 \le i < j \le n} P(A_i \cap A_j); \quad S_3 = \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k); \quad \cdots$$

Then,

$$P\left(\cup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j) + \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap A_2 \cap \cdots \cap A_n)$$
$$= S_1 - S_2 + S_3 - \cdots + (-1)^{n+1} S_n.$$

Example 1.3 (Missing Suits in a Bridge Hand). Consider a specific player, say
North, in a Bridge game. We want to calculate the probability that North's hand
is void in at least one suit. Towards this, denote the suits as 1, 2, 3, 4 and let
$A_i$ = North's hand is void in suit $i$.

Then, by the inclusion–exclusion formula,

$$P(\text{North's hand is void in at least one suit}) = P(A_1 \cup A_2 \cup A_3 \cup A_4)$$
$$= 4\binom{39}{13}\bigg/\binom{52}{13} - 6\binom{26}{13}\bigg/\binom{52}{13} + 4\binom{13}{13}\bigg/\binom{52}{13} = .051,$$

which is small, but not very small.
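
The three inclusion–exclusion terms are easy to verify numerically; the following short check is ours.

```python
from math import comb

total = comb(52, 13)
p = (4 * comb(39, 13) - 6 * comb(26, 13) + 4 * comb(13, 13)) / total
print(round(p, 3))  # 0.051
```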

The inclusion–exclusion formula can be hard to apply exactly, because the quantities
$S_j$ for large indices $j$ can be difficult to calculate. However, fortunately, the
inclusion–exclusion formula leads to bounds in both directions for the probability
of the union of $n$ general events. We have the following series of bounds.
Theorem 1.3 (Bonferroni Bounds). Given $n$ events $A_1, A_2, \ldots, A_n$, let $p_n = P(\cup_{i=1}^{n} A_i)$. Then,

$$p_n \le S_1; \qquad p_n \ge S_1 - S_2; \qquad p_n \le S_1 - S_2 + S_3; \qquad \ldots$$

In addition,

$$P\left(\cap_{i=1}^{n} A_i\right) \ge 1 - \sum_{i=1}^{n} P(A_i^c).$$
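
To see the Bonferroni bounds in action, consider an illustrative setup of our own choosing (it is not from the text): three independent fair dice, with $A_i$ = die $i$ shows a six. The sketch below computes $S_1, S_2, S_3$ and compares the bounds with the exact union probability; with only $n = 3$ events, the third bound is already exact.

```python
from fractions import Fraction

p6 = Fraction(1, 6)
S1 = 3 * p6       # sum of the three P(A_i)
S2 = 3 * p6 ** 2  # sum of P(A_i ∩ A_j) over the 3 pairs, by independence
S3 = p6 ** 3      # P(A_1 ∩ A_2 ∩ A_3)

exact = 1 - (1 - p6) ** 3  # P(at least one six)
print(float(S1 - S2), "<=", float(exact), "<=", float(S1))
# With n = 3 events, the third Bonferroni bound is exact:
print(exact == S1 - S2 + S3)  # True
```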

1.2 Conditional Probability and Independence

Both conditional probability and independence are fundamental concepts for proba-
bilists and statisticians alike. Conditional probabilities correspond to updating one’s
beliefs when new information becomes available. Independence corresponds to ir-
relevance of a piece of new information, even when it is made available. In addition,
the assumption of independence can and does significantly simplify development,
mathematical analysis, and justification of tools and procedures.

Definition 1.4. Let $A, B$ be general events with respect to some sample space $\Omega$,
and suppose $P(A) > 0$. The conditional probability of $B$ given $A$ is defined as

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$

Some immediate consequences of the definition of a conditional probability are the
following.

Theorem 1.4. (a) (Multiplicative Formula). For any two events $A, B$ such that
$P(A) > 0$, one has $P(A \cap B) = P(A)P(B \mid A)$;
(b) For any two events $A, B$ such that $0 < P(A) < 1$, one has $P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)$;
(c) (Total Probability Formula). If $A_1, A_2, \ldots, A_k$ form a partition of the sample
space $\Omega$ (i.e., $A_i \cap A_j = \emptyset$ for all $i \ne j$, and $\cup_{i=1}^{k} A_i = \Omega$), and if $0 < P(A_i) < 1$ for all $i$, then

$$P(B) = \sum_{i=1}^{k} P(B \mid A_i)P(A_i).$$

(d) (Hierarchical Multiplicative Formula). Let $A_1, A_2, \ldots, A_k$ be $k$ general
events in a sample space $\Omega$. Then

$$P(A_1 \cap A_2 \cap \cdots \cap A_k) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P(A_k \mid A_1 \cap A_2 \cap \cdots \cap A_{k-1}).$$

Example 1.4. One of two urns has $a$ red and $b$ black balls, and the other has $c$ red
and $d$ black balls. One ball is chosen at random from each urn, and then one of these
two balls is chosen at random. What is the probability that this ball is red?

If each ball selected from the two urns is red, then the final ball is definitely red.
If exactly one of those two balls is red, then the final ball is red with probability 1/2.
If neither of those two balls is red, then the final ball cannot be red. Thus,

$$P(\text{The final ball is red}) = \frac{a}{a+b} \cdot \frac{c}{c+d} + \frac{1}{2}\left[\frac{a}{a+b} \cdot \frac{d}{c+d} + \frac{b}{a+b} \cdot \frac{c}{c+d}\right] = \frac{2ac + ad + bc}{2(a+b)(c+d)}.$$

As an example, suppose $a = 99, b = 1, c = 1, d = 1$. Then

$$\frac{2ac + ad + bc}{2(a+b)(c+d)} = .745.$$

Although the total percentage of red balls in the two urns is more than 98%, the
chance that the final ball selected would be red is just about 75%.
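
A short Monte Carlo simulation, our illustration, mimics the two-stage selection described above and confirms the answer $.745$ for $a = 99, b = 1, c = d = 1$.

```python
import random

random.seed(1)
trials = 200_000
red = 0
for _ in range(trials):
    ball1 = random.random() < 99 / 100  # red with probability a/(a + b)
    ball2 = random.random() < 1 / 2     # red with probability c/(c + d)
    # Pick one of the two chosen balls at random.
    red += ball1 if random.random() < 0.5 else ball2
print(red / trials)  # close to 0.745
```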

Example 1.5 (A Clever Conditioning Argument). Coin A gives heads with probability
$s$ and coin B gives heads with probability $t$. They are tossed alternately, starting
off with coin A. We want to find the probability that the first head is obtained on
coin A.

We find this probability by conditioning on the outcomes of the first two tosses;
more precisely, define

$$A_1 = \{H\} = \text{First toss gives } H; \quad A_2 = \{TH\}; \quad A_3 = \{TT\}.$$

Let also

$$A = \text{The first head is obtained on coin A}.$$

One of the three events $A_1, A_2, A_3$ must happen, and they are also mutually
exclusive. Therefore, by the total probability formula,

$$P(A) = \sum_{i=1}^{3} P(A_i)P(A \mid A_i) = s \cdot 1 + (1-s)t \cdot 0 + (1-s)(1-t)P(A)$$
$$\Rightarrow P(A) = s/[1 - (1-s)(1-t)] = s/(s + t - st).$$

As an example, let $s = .4, t = .5$. Note that coin A is biased against heads. Even
then, $s/(s + t - st) = .57 > .5$. We see that there is an advantage in starting first.
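
The closed form $s/(s + t - st)$ can also be checked by simulating the alternating tosses directly; this sketch is ours and assumes nothing beyond the description of the game.

```python
import random

def first_head_on_A(s, t, trials=200_000, seed=2):
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        while True:
            if random.random() < s:  # coin A is tossed first
                wins += 1
                break
            if random.random() < t:  # then coin B
                break
    return wins / trials

s, t = 0.4, 0.5
print(first_head_on_A(s, t), s / (s + t - s * t))  # both near 0.5714
```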

Definition 1.5. A collection of events $A_1, A_2, \ldots, A_n$ is said to be mutually independent
(or just independent) if for each $k$, $1 \le k \le n$, and any $k$ of the events
$A_{i_1}, \ldots, A_{i_k}$, $P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})$. They are called pairwise
independent if this property holds for $k = 2$.

Example 1.6 (Lotteries). Although many people buy lottery tickets out of an expectation of good luck, probabilistically speaking, buying lottery tickets is usually a waste of money. Here is an example. Suppose in a weekly state lottery, five of the numbers $00, 01, \ldots, 49$ are selected without replacement at random, and someone holding exactly those numbers wins the lottery. Then the probability that someone holding one ticket will be the winner in a given week is

$$\frac{1}{\binom{50}{5}} = 4.72 \times 10^{-7}.$$

Suppose this person buys a ticket every week for 40 years. Then the probability that he will win the lottery on at least one week is $1 - (1 - 4.72 \times 10^{-7})^{52 \times 40} = .00098 < .001$, still a very small probability. We assumed in this calculation that the weekly lotteries are all mutually independent, a reasonable assumption. The calculation would fall apart if we did not make this independence assumption.
It is not uncommon to see the conditional probabilities $P(A \mid B)$ and $P(B \mid A)$ confused with each other. Suppose in some group of lung cancer patients, we see a large percentage of smokers. If we define $B$ to be the event that a person is a smoker, and $A$ to be the event that a person has lung cancer, then all we can conclude is that in our group of people $P(B \mid A)$ is large. But we cannot conclude from just this information that smoking increases the chance of lung cancer, that is, that $P(A \mid B)$ is large. In order to calculate a conditional probability $P(A \mid B)$ when we know the other conditional probability $P(B \mid A)$, a simple formula known as Bayes' theorem is useful. Here is a statement of a general version of Bayes' theorem.

Theorem 1.5. Let $\{A_1, A_2, \ldots, A_m\}$ be a partition of a sample space $\Omega$. Let $B$ be some fixed event. Then

$$P(A_j \mid B) = \frac{P(B \mid A_j)P(A_j)}{\sum_{i=1}^{m} P(B \mid A_i)P(A_i)}.$$

Example 1.7 (Multiple Choice Exams). Suppose that the questions in a multiple choice exam have five alternatives each, of which a student has to pick one as the correct alternative. A student either knows the truly correct alternative, with probability .7, or she randomly picks one of the five alternatives as her choice. Suppose a particular problem was answered correctly. We want to know what the probability is that the student really knew the correct answer.
Define

$$A = \text{the student knew the correct answer}; \quad B = \text{the student answered the question correctly}.$$

We want to compute $P(A \mid B)$. By Bayes' theorem,

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)} = \frac{1 \times .7}{1 \times .7 + .2 \times .3} = .921.$$

Before the student answered the question, our probability that she would know the correct answer to the question was .7; but once she answered it correctly, the posterior probability that she knew the correct answer increases to .921. This is exactly what Bayes' theorem does; it updates our prior belief to the posterior belief when new evidence becomes available.
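Because this prior-to-posterior update is entirely mechanical, it can be wrapped in a two-line function; the following Python sketch (the function name is an illustrative choice, not from the text) computes the posterior for any two-event partition $\{A, A^c\}$.

def posterior(prior, p_b_given_a, p_b_given_ac):
    """P(A | B) by Bayes' theorem for the partition {A, A^c}."""
    numerator = p_b_given_a * prior
    return numerator / (numerator + p_b_given_ac * (1 - prior))

# the multiple choice example: P(A) = .7, P(B|A) = 1, P(B|A^c) = 1/5
print(posterior(0.7, 1.0, 0.2))   # 0.92105..., i.e., about .921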

1.3 Integer-Valued and Discrete Random Variables

In some sense, the entire subject of probability and statistics is about distributions of random variables. Random variables, as the very name suggests, are quantities that vary over time, or from individual to individual, and the reason for the variability is some underlying random process. Depending on exactly how an underlying experiment ends, the random variable takes different values. In other words, the value of the random variable is determined by the sample point $\omega$ that prevails when the underlying experiment is actually conducted. We cannot know a priori the value of the random variable, because we do not know a priori which sample point $\omega$ will prevail when the experiment is conducted. We try to understand the behavior of a random variable by analyzing the probability structure of the underlying random experiment.
Random variables, like probabilities, originated in gambling. Therefore, the random variables that come to us most naturally are integer-valued random variables; for example, the sum of the two rolls when a die is rolled twice. Integer-valued random variables are special cases of what are known as discrete random variables. Discrete or not, a common mathematical definition of all random variables is the following.
Definition 1.6. Let $\Omega$ be a sample space corresponding to some experiment, and let $X : \Omega \to \mathbb{R}$ be a function from the sample space to the real line. Then $X$ is called a random variable.
Discrete random variables are those that take a finite or a countably infinite
number of possible values. In particular, all integer-valued random variables are
discrete. From the point of view of understanding the behavior of a random variable,
the important thing is to know the probabilities with which X takes its different
possible values.
Definition 1.7. Let $X : \Omega \to \mathbb{R}$ be a discrete random variable taking a finite or countably infinite number of values $x_1, x_2, x_3, \ldots$. The probability distribution or the probability mass function (pmf) of $X$ is the function $p(x) = P(X = x), x = x_1, x_2, x_3, \ldots$, and $p(x) = 0$ otherwise.
It is common not to explicitly mention the phrase "$p(x) = 0$ otherwise," and we generally follow this convention. Some authors use the phrase mass function instead of probability mass function.
For any pmf, one must have $p(x) \ge 0$ for any $x$, and $\sum_i p(x_i) = 1$. Any function satisfying these two properties for some set of numbers $x_1, x_2, x_3, \ldots$ is a valid pmf.

1.3.1 CDF and Independence

A second important definition is that of a cumulative distribution function (CDF). The CDF gives the probability that a random variable $X$ is less than or equal to any given number $x$. It is important to understand that the notion of a CDF is universal to all random variables; it is not limited to only the discrete ones.
Definition 1.8. The cumulative distribution function of a random variable $X$ is the function $F(x) = P(X \le x), x \in \mathbb{R}$.
Definition 1.9. Let $X$ have the CDF $F(x)$. Any number $m$ such that $P(X \le m) \ge .5$ and also $P(X \ge m) \ge .5$ is called a median of $F$, or equivalently, a median of $X$.
Remark. The median of a random variable need not be unique. A simple way to characterize all the medians of a distribution is available.
Proposition. Let $X$ be a random variable with the CDF $F(x)$. Let $m_0$ be the first $x$ such that $F(x) \ge .5$, and let $m_1$ be the last $x$ such that $P(X \ge x) \ge .5$. Then a number $m$ is a median of $X$ if and only if $m \in [m_0, m_1]$.
The CDF of any random variable satisfies a set of properties. Conversely, any
function satisfying these properties is a valid CDF; that is, it will be the CDF of
some appropriately chosen random variable. These properties are given in the next
result.
Theorem 1.6. A function $F(x)$ is the CDF of some real-valued random variable $X$ if and only if it satisfies all of the following properties.
(a) $0 \le F(x) \le 1$ for all $x \in \mathbb{R}$.
(b) $F(x) \to 0$ as $x \to -\infty$, and $F(x) \to 1$ as $x \to \infty$.
(c) Given any real number $a$, $F(x) \downarrow F(a)$ as $x \downarrow a$.
(d) Given any two real numbers $x, y$ with $x < y$, $F(x) \le F(y)$.
Property (c) is called continuity from the right, or simply right continuity. It is
clear that a CDF need not be continuous from the left; indeed, for discrete random
variables, the CDF has a jump at the values of the random variable, and at the jump
points, the CDF is not left continuous. More precisely, one has the following result.
Proposition. Let $F(x)$ be the CDF of some random variable $X$. Then, for any $x$,
(a) $P(X = x) = F(x) - \lim_{y \uparrow x} F(y) = F(x) - F(x-)$, including those points $x$ for which $P(X = x) = 0$;
(b) $P(X \ge x) = P(X > x) + P(X = x) = (1 - F(x)) + (F(x) - F(x-)) = 1 - F(x-)$.

Example 1.8 (Bridge). Consider the random variable

$$X = \text{number of aces in North's hand in a bridge game}.$$

Clearly, $X$ can take any of the values $x = 0, 1, 2, 3, 4$. If $X = x$, then the other $13 - x$ cards in North's hand must be non-ace cards. Thus, the pmf of $X$ is

$$P(X = x) = \frac{\binom{4}{x}\binom{48}{13-x}}{\binom{52}{13}}, \quad x = 0, 1, 2, 3, 4.$$

In decimals, the pmf of $X$ is:

x      0     1     2     3     4
p(x)  .304  .439  .213  .041  .003

The CDF of $X$ is a jump function, taking jumps at the values $0, 1, 2, 3, 4$, namely the possible values of $X$. The CDF is

$$F(x) = \begin{cases} 0 & \text{if } x < 0; \\ .304 & \text{if } 0 \le x < 1; \\ .743 & \text{if } 1 \le x < 2; \\ .956 & \text{if } 2 \le x < 3; \\ .997 & \text{if } 3 \le x < 4; \\ 1 & \text{if } x \ge 4. \end{cases}$$
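Hypergeometric pmf and CDF values such as these are easy to reproduce with standard software. Here is a possible Python sketch, assuming SciPy is available; SciPy's argument order for the hypergeometric family is (value, population size, number of type I objects, sample size).

from scipy.stats import hypergeom

# population of 52 cards, 4 aces, a hand of 13 cards
for x in range(5):
    print(x, round(hypergeom.pmf(x, 52, 4, 13), 3))   # .304 .439 .213 .041 .003
print(round(hypergeom.cdf(2, 52, 4, 13), 3))          # F(2) = .956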

Example 1.9 (Indicator Variables). Consider the experiment of rolling a fair die twice and now define a random variable $Y$ as follows:

$$Y = 1 \text{ if the sum of the two rolls } X \text{ is an even number}; \quad Y = 0 \text{ if the sum of the two rolls } X \text{ is an odd number}.$$

If we let $A$ be the event that $X$ is an even number, then $Y = 1$ if $A$ happens, and $Y = 0$ if $A$ does not happen. Such random variables are called indicator random variables and are immensely useful in mathematical calculations in many complex situations.
Definition 1.10. Let $A$ be any event in a sample space $\Omega$. The indicator random variable for $A$ is defined as $I_A = 1$ if $A$ happens, and $I_A = 0$ if $A$ does not happen.
Thus, the distribution of an indicator variable is simply $P(I_A = 1) = P(A)$; $P(I_A = 0) = 1 - P(A)$.
An indicator variable is also called a Bernoulli variable with parameter $p$, where $p$ is just $P(A)$. We later show examples of uses of indicator variables in calculation of expectations.
In applications, we are sometimes interested in the distribution of a function, say $g(X)$, of a basic random variable $X$. In the discrete case, the distribution of a function is found in the obvious way.

Proposition (Function of a Random Variable). Let $X$ be a discrete random variable and $Y = g(X)$ a real-valued function of $X$. Then, $P(Y = y) = \sum_{x :\, g(x) = y} p(x)$.

Example 1.10. Suppose $X$ has the pmf

$$p(x) = \frac{c}{1 + x^2}, \quad x = 0, \pm 1, \pm 2, \pm 3.$$

Suppose we want to find the distribution of two functions of $X$:

$$Y = g(X) = X^3; \quad Z = h(X) = \sin\left(\frac{\pi X}{2}\right).$$

First, the constant $c$ must be explicitly evaluated. By directly summing the values,

$$\sum_x p(x) = \frac{13c}{5} \Rightarrow c = \frac{5}{13}.$$

Note that $g(X)$ is a one-to-one function of $X$, but $h(X)$ is not one-to-one. The values of $Y$ are $0, \pm 1, \pm 8, \pm 27$. For example, $P(Y = 0) = P(X = 0) = c = 5/13$; $P(Y = 1) = P(X = 1) = c/2 = 5/26$, and so on. In general, for $y = 0, \pm 1, \pm 8, \pm 27$, $P(Y = y) = P(X = y^{1/3}) = \frac{c}{1 + y^{2/3}}$, with $c = 5/13$.
However, $Z = h(X)$ is not a one-to-one function of $X$. The possible values of $Z$ are as follows:

x     -3  -2  -1   0   1   2   3
h(x)   1   0  -1   0   1   0  -1

So, for example, $P(Z = 0) = P(X = -2) + P(X = 0) + P(X = 2) = \frac{7}{5}c = 7/13$. The pmf of $Z = h(X)$ is:

z          -1     0     1
P(Z = z)  3/13  7/13  3/13
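The bookkeeping in the non-one-to-one case, namely grouping the $x$ values by their image and adding the probabilities, is exactly what the short Python sketch below does (variable names are illustrative, not from the text).

from fractions import Fraction
from math import sin, pi
from collections import defaultdict

c = Fraction(5, 13)
p = {x: c / (1 + x * x) for x in range(-3, 4)}   # pmf of X

pmf_Z = defaultdict(Fraction)
for x, px in p.items():
    z = round(sin(pi * x / 2))   # h(x) takes only the values -1, 0, 1 here
    pmf_Z[z] += px               # sum p(x) over {x : h(x) = z}

print(dict(pmf_Z))               # the fractions 3/13, 7/13, 3/13 for z = 1, 0, -1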
A key concept in probability is that of independence of a collection of random
variables. The collection could be finite or infinite. In the infinite case, we want
each finite subcollection of the random variables to be independent. The definition
of independence of a finite collection is as follows.

Definition 1.11. Let $X_1, X_2, \ldots, X_k$ be $k \ge 2$ discrete random variables defined on the same sample space $\Omega$. We say that $X_1, X_2, \ldots, X_k$ are independent if $P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = P(X_1 = x_1)P(X_2 = x_2) \cdots P(X_k = x_k)$ for all $x_1, x_2, \ldots, x_k$.
It follows from the definition of independence of random variables that if $X_1, X_2$ are independent, then any function of $X_1$ and any function of $X_2$ are also independent. In fact, we have a more general result.

Theorem 1.7. Let $X_1, X_2, \ldots, X_k$ be $k \ge 2$ discrete random variables, and suppose they are independent. Let $U = f(X_1, X_2, \ldots, X_i)$ be some function of $X_1, X_2, \ldots, X_i$, and $V = g(X_{i+1}, \ldots, X_k)$ be some function of $X_{i+1}, \ldots, X_k$. Then $U$ and $V$ are independent.
This result is true for any types of random variables $X_1, X_2, \ldots, X_k$, not just discrete ones.
A common notation of wide use in probability and statistics is now introduced. If $X_1, X_2, \ldots, X_k$ are independent, and moreover have the same CDF, say $F$, then we say that $X_1, X_2, \ldots, X_k$ are iid (or IID) and write $X_1, X_2, \ldots, X_k \overset{\text{iid}}{\sim} F$. The abbreviation iid (IID) means independent and identically distributed.

Example 1.11 (Two Simple Illustrations). Consider the experiment of tossing a fair coin (or any coin) four times. Suppose $X_1$ is the number of heads in the first two tosses, and $X_2$ is the number of heads in the last two tosses. Then it is intuitively clear that $X_1, X_2$ are independent, because the last two tosses carry no information regarding the first two tosses. The independence can be easily mathematically verified by using the definition of independence.
Next, consider the experiment of drawing 13 cards at random from a deck of 52 cards. Suppose $X_1$ is the number of aces and $X_2$ is the number of clubs among the 13 cards. Then $X_1, X_2$ are not independent. For example, $P(X_1 = 4, X_2 = 0) = 0$, but $P(X_1 = 4)$ and $P(X_2 = 0)$ are both $> 0$, and so $P(X_1 = 4)P(X_2 = 0) > 0$. So $X_1, X_2$ cannot be independent.

1.3.2 Expectation and Moments

By definition, a random variable takes different values on different occasions. It is


natural to want to know what value it takes on average. Averaging is a very primitive
concept. A simple average of just the possible values of the random variable will be
misleading, because some values may have so little probability that they are rela-
tively inconsequential. The average or the mean value, also called the expected value
of a random variable is a weighted average of the different values of X , weighted
according to how important the value is. Here is the definition.
Definition 1.12. Let $X$ be a discrete random variable. We say that the expected value of $X$ exists if $\sum_i |x_i|p(x_i) < \infty$, in which case the expected value is defined as

$$\mu = E(X) = \sum_i x_i p(x_i).$$

For notational convenience, we simply write $\sum_x xp(x)$ instead of $\sum_i x_i p(x_i)$. The expected value is also known as the expectation or the mean of $X$.
If the set of possible values of $X$ is infinite, then the infinite sum $\sum_x xp(x)$ can take different values on rearranging the terms of the infinite series unless $\sum_x |x|p(x) < \infty$. So, as a matter of definition, we have to include the qualification that $\sum_x |x|p(x) < \infty$.
If the sample space $\Omega$ of the underlying experiment is finite or countably infinite, then we can also calculate the expectation by averaging directly over the sample space.

Proposition (Change of Variable Formula). Suppose the sample space $\Omega$ is finite or countably infinite and $X$ is a discrete random variable with expectation $\mu$. Then,

$$\mu = \sum_x xp(x) = \sum_{\omega} X(\omega)P(\omega),$$

where $P(\omega)$ is the probability of the sample point $\omega$.
Important Point. Although it is not the focus of this chapter, in applications we are often interested in more than one variable at the same time. To be specific, consider two discrete random variables $X, Y$ defined on a common sample space $\Omega$. Then we could construct new random variables out of $X$ and $Y$, for example, $XY, X + Y, X^2 + Y^2$, and so on. We can then talk of their expectations as well. Here is a general definition of expectation of a function of more than one random variable.

Definition 1.13. Let $X_1, X_2, \ldots, X_n$ be $n$ discrete random variables, all defined on a common sample space $\Omega$, with a finite or a countably infinite number of sample points. We say that the expectation of a function $g(X_1, X_2, \ldots, X_n)$ exists if $\sum_{\omega} |g(X_1(\omega), X_2(\omega), \ldots, X_n(\omega))|P(\omega) < \infty$, in which case, the expected value of $g(X_1, X_2, \ldots, X_n)$ is defined as

$$E[g(X_1, X_2, \ldots, X_n)] = \sum_{\omega} g(X_1(\omega), X_2(\omega), \ldots, X_n(\omega))P(\omega).$$

The next few results summarize the most fundamental properties of expectations.

Proposition. (a) If there exists a finite constant $c$ such that $P(X = c) = 1$, then $E(X) = c$.
(b) If $X, Y$ are random variables defined on the same sample space $\Omega$ with finite expectations, and if $P(X \le Y) = 1$, then $E(X) \le E(Y)$.
(c) If $X$ has a finite expectation, and if $P(X \ge c) = 1$, then $E(X) \ge c$. If $P(X \le c) = 1$, then $E(X) \le c$.

Proposition (Linearity of Expectations). Let $X_1, X_2, \ldots, X_n$ be random variables defined on the same sample space $\Omega$, and $c_1, c_2, \ldots, c_n$ any real-valued constants. Then, provided $E(X_i)$ exists for every $X_i$,

$$E\left(\sum_{i=1}^{n} c_i X_i\right) = \sum_{i=1}^{n} c_i E(X_i);$$

in particular, $E(cX) = cE(X)$ and $E(X_1 + X_2) = E(X_1) + E(X_2)$, whenever the expectations exist.
The following fact also follows easily from the definition of the pmf of a function of a random variable. The result says that the expectation of a function of a random variable $X$ can be calculated directly using the pmf of $X$ itself, without having to calculate the pmf of the function.

Proposition (Expectation of a Function). Let $X$ be a discrete random variable on a sample space $\Omega$ with a finite or countable number of sample points, and $Y = g(X)$ a function of $X$. Then,

$$E(Y) = \sum_{\omega} Y(\omega)P(\omega) = \sum_x g(x)p(x),$$

provided $E(Y)$ exists.

Caution. If $g(X)$ is a linear function of $X$, then, of course, $E(g(X)) = g(E(X))$. But, in general, the two are not equal. For example, $E(X^2)$ is not the same as $(E(X))^2$; indeed, $E(X^2) > (E(X))^2$ for any random variable $X$ that is not a constant.
A very important property of independent random variables is the following factorization result on expectations.

Theorem 1.8. Suppose $X_1, X_2, \ldots, X_n$ are independent random variables. Then, provided each expectation exists,

$$E(X_1 X_2 \cdots X_n) = E(X_1)E(X_2) \cdots E(X_n).$$

Let us now show some more illustrative examples.

Example 1.12. Let $X$ be the number of heads obtained in two tosses of a fair coin. The pmf of $X$ is $p(0) = p(2) = 1/4, p(1) = 1/2$. Therefore, $E(X) = 0 \times 1/4 + 1 \times 1/2 + 2 \times 1/4 = 1$. Because the coin is fair, we expect it to show heads 50% of the number of times it is tossed, which is 50% of 2, that is, 1.

Example 1.13 (Dice Sum). Let $X$ be the sum of the two rolls when a fair die is rolled twice. The pmf of $X$ is $p(2) = p(12) = 1/36$; $p(3) = p(11) = 2/36$; $p(4) = p(10) = 3/36$; $p(5) = p(9) = 4/36$; $p(6) = p(8) = 5/36$; $p(7) = 6/36$. Therefore, $E(X) = 2 \times 1/36 + 3 \times 2/36 + 4 \times 3/36 + \cdots + 12 \times 1/36 = 7$. This can also be seen by letting $X_1$ = the face obtained on the first roll, $X_2$ = the face obtained on the second roll, and by using $E(X) = E(X_1 + X_2) = E(X_1) + E(X_2) = 3.5 + 3.5 = 7$.
Let us now make this problem harder. Suppose that a fair die is rolled 10 times and $X$ is the sum of all 10 rolls. The pmf of $X$ is no longer so simple; it will be cumbersome to write it down. But, if we let $X_i$ = the face obtained on the $i$th roll, it is still true by the linearity of expectations that $E(X) = E(X_1 + X_2 + \cdots + X_{10}) = E(X_1) + E(X_2) + \cdots + E(X_{10}) = 3.5 \times 10 = 35$. We can easily compute the expectation, although the pmf would be difficult to write down.

Example 1.14 (A Random Variable Without a Finite Expectation). Let $X$ take the positive integers $1, 2, 3, \ldots$ as its values with the pmf

$$p(x) = P(X = x) = \frac{1}{x(x+1)}, \quad x = 1, 2, 3, \ldots.$$

This is a valid pmf, because obviously $\frac{1}{x(x+1)} > 0$ for any $x = 1, 2, 3, \ldots$, and also the infinite series $\sum_{x=1}^{\infty} \frac{1}{x(x+1)}$ sums to 1, a fact from calculus. Now,

$$E(X) = \sum_{x=1}^{\infty} xp(x) = \sum_{x=1}^{\infty} x \cdot \frac{1}{x(x+1)} = \sum_{x=1}^{\infty} \frac{1}{x+1} = \sum_{x=2}^{\infty} \frac{1}{x} = \infty,$$

also a fact from calculus.


This example shows that not all random variables have a finite expectation. Here, the reason for the infiniteness of $E(X)$ is that $X$ takes large integer values $x$ with probabilities $p(x)$ that are not adequately small. The large values are realized sufficiently often that on average $X$ becomes larger than any given finite number.
The zero–one nature of indicator random variables is extremely useful for calcu-
lating expectations of certain integer-valued random variables whose distributions
are sometimes so complicated that it would be difficult to find their expectations di-
rectly from definition. We describe the technique and some illustrations of it below.

Proposition. Let $X$ be an integer-valued random variable such that it can be represented as $X = \sum_{i=1}^{m} c_i I_{A_i}$ for some $m$, constants $c_1, c_2, \ldots, c_m$, and suitable events $A_1, A_2, \ldots, A_m$. Then $E(X) = \sum_{i=1}^{m} c_i P(A_i)$.

Example 1.15 (Coin Tosses). Suppose a coin that has probability $p$ of showing heads in any single toss is tossed $n$ times, and let $X$ denote the number of times in the $n$ tosses that a head is obtained. Then $X = \sum_{i=1}^{n} I_{A_i}$, where $A_i$ is the event that a head is obtained in the $i$th toss. Therefore, $E(X) = \sum_{i=1}^{n} P(A_i) = \sum_{i=1}^{n} p = np$. A direct calculation of the expectation would involve finding the pmf of $X$ and obtaining the sum $\sum_{x=0}^{n} xP(X = x)$; it can also be done that way, but that is a much longer calculation.
The random variable $X$ of this example is a binomial random variable with parameters $n$ and $p$. Its pmf is given by the formula $P(X = x) = \binom{n}{x}p^x(1-p)^{n-x}, x = 0, 1, 2, \ldots, n$.

Example 1.16 (Consecutive Heads in Coin Tosses). Suppose a coin with probability $p$ for heads in a single toss is tossed $n$ times. How many times can we expect to see a head followed by at least one more head? For example, if $n = 5$ and we see the outcome HTHHH, then we see a head followed by at least one more head twice. Define $A_i$ = the $i$th and the $(i+1)$th toss both result in heads. Then

$$X = \text{number of times a head is followed by at least one more head} = \sum_{i=1}^{n-1} I_{A_i},$$

and so $E(X) = \sum_{i=1}^{n-1} P(A_i) = \sum_{i=1}^{n-1} p^2 = (n-1)p^2$. For example, if a fair coin is tossed 20 times, we can expect to see a head followed by another head about five times ($19 \times .5^2 = 4.75$).
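The indicator representation also makes a simulation check almost trivial; here is a possible Python sketch (illustrative names, not from the text).

import random

def count_hh(n, p):
    """One experiment: toss n times, count i with heads at tosses i and i+1."""
    t = [random.random() < p for _ in range(n)]
    return sum(t[i] and t[i + 1] for i in range(n - 1))

n, p, reps = 20, 0.5, 100_000
avg = sum(count_hh(n, p) for _ in range(reps)) / reps
print(avg, (n - 1) * p ** 2)   # simulation average vs. the exact value 4.75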
Another useful technique for calculating expectations of nonnegative integer-valued random variables is based on the CDF of the random variable, rather than directly on the pmf. This method is useful when calculating probabilities of the form $P(X > x)$ is logically more straightforward than directly calculating $P(X = x)$. Here is the expectation formula based on the tail CDF.

Theorem 1.9 (Tailsum Formula). Let $X$ take values $0, 1, 2, \ldots$. Then

$$E(X) = \sum_{n=0}^{\infty} P(X > n).$$

Example 1.17 (Family Planning). Suppose a couple will have children until they have at least one child of each sex. How many children can they expect to have? Let $X$ denote the childbirth at which they have a child of each sex for the first time. Suppose the probability that any particular childbirth will be a boy is $p$, and that all births are independent. Then, for $n \ge 1$,

$$P(X > n) = P(\text{the first } n \text{ children are all boys or all girls}) = p^n + (1-p)^n.$$

Therefore, $E(X) = 2 + \sum_{n=2}^{\infty} [p^n + (1-p)^n] = 2 + p^2/(1-p) + (1-p)^2/p = \frac{1}{p(1-p)} - 1$. If boys and girls are equally likely on any childbirth, then this says that a couple waiting to have a child of each sex can expect to have three children.
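A direct simulation confirms the "three children" answer; the following Python sketch (the function name is an illustrative choice) generates births until both sexes appear.

import random

def births_until_both(p):
    """Number of births until the couple has at least one child of each sex."""
    boys = girls = n = 0
    while boys == 0 or girls == 0:
        n += 1
        if random.random() < p:
            boys += 1
        else:
            girls += 1
    return n

p, reps = 0.5, 100_000
avg = sum(births_until_both(p) for _ in range(reps)) / reps
print(avg, 1 / (p * (1 - p)) - 1)   # simulation vs. exact; both near 3 when p = .5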
The expected value is calculated with the intention of understanding what a typical value of a random variable is. But two very different distributions can have exactly the same expected value. A common example is that of a return on an investment in a stock. Two stocks may have the same average return, but one may be much riskier than the other, in the sense that the variability in the return is much higher for that stock. In that case, most risk-averse individuals would prefer to invest in the stock with less variability. Measures of risk or variability are of course not unique. Some natural measures that come to mind are $E(|X - \mu|)$, known as the mean absolute deviation, or $P(|X - \mu| > k)$ for some suitable $k$. However, neither of these two is the most common measure of variability. The most common measure is the standard deviation of a random variable.
Definition 1.14. Let a random variable $X$ have a finite mean $\mu$. The variance of $X$ is defined as

$$\sigma^2 = E[(X - \mu)^2],$$

and the standard deviation of $X$ is defined as $\sigma = \sqrt{\sigma^2}$.
It is easy to prove that $\sigma^2 < \infty$ if and only if $E(X^2)$, the second moment of $X$, is finite. It is not uncommon to mistake the standard deviation for the mean absolute deviation, but they are not the same. In fact, an inequality always holds.
Proposition. $\sigma \ge E(|X - \mu|)$, and $\sigma$ is strictly greater unless $X$ is a constant random variable, namely, $P(X = \mu) = 1$.
We list some basic properties of the variance of a random variable.
Proposition.
(a) $\operatorname{Var}(cX) = c^2\operatorname{Var}(X)$ for any real $c$.
(b) $\operatorname{Var}(X + k) = \operatorname{Var}(X)$ for any real $k$.
(c) $\operatorname{Var}(X) \ge 0$ for any random variable $X$, and equals zero only if $P(X = c) = 1$ for some real constant $c$.
(d) $\operatorname{Var}(X) = E(X^2) - \mu^2$.
The quantity $E(X^2)$ is called the second moment of $X$. The definition of a general moment is as follows.
Definition 1.15. Let $X$ be a random variable, and $k \ge 1$ a positive integer. Then $E(X^k)$ is called the $k$th moment of $X$, and $E(X^{-k})$ is called the $k$th inverse moment of $X$, provided they exist.
We therefore have the following relationships involving moments and the variance:

$$\text{Variance} = \text{Second Moment} - (\text{First Moment})^2; \quad \text{Second Moment} = \text{Variance} + (\text{First Moment})^2.$$

Statisticians often use the third moment around the mean as a measure of lack of symmetry in the distribution of a random variable. The point is that if a random variable $X$ has a symmetric distribution and has a finite mean $\mu$, then all odd moments around the mean, namely $E[(X - \mu)^{2k+1}]$, will be zero, if the moment exists. In particular, $E[(X - \mu)^3]$ will be zero. Likewise, statisticians also use the fourth moment around the mean as a measure of how spiky the distribution is around the mean. To make these indices independent of the choice of unit of measurement (e.g., inches or centimeters), they use certain scaled measures of asymmetry and peakedness. Here are the definitions.
Definition 1.16. (a) Let $X$ be a random variable with $E[|X|^3] < \infty$. The skewness of $X$ is defined as

$$\beta = \frac{E[(X - \mu)^3]}{\sigma^3}.$$

(b) Suppose $X$ is a random variable with $E[X^4] < \infty$. The kurtosis of $X$ is defined as

$$\gamma = \frac{E[(X - \mu)^4]}{\sigma^4} - 3.$$

The skewness $\beta$ is zero for symmetric distributions, but the converse need not be true. The kurtosis $\gamma$ is necessarily $\ge -2$, but can be arbitrarily large, with spikier distributions generally having a larger kurtosis. But a very good interpretation of $\gamma$ is not really available. We later show that $\gamma = 0$ for all normal distributions; hence the motivation for subtracting 3 in the definition of $\gamma$.
Example 1.18 (Variance of Number of Heads). Consider the experiment of two tosses of a fair coin and let $X$ be the number of heads obtained. Then, we have seen that $p(0) = p(2) = 1/4$ and $p(1) = 1/2$. Thus, $E(X^2) = 0 \times 1/4 + 1 \times 1/2 + 4 \times 1/4 = 3/2$, and $E(X) = 1$. Therefore, $\operatorname{Var}(X) = E(X^2) - \mu^2 = 3/2 - 1 = \frac{1}{2}$, and the standard deviation is $\sigma = \sqrt{.5} = .707$.
Example 1.19 (A Random Variable with an Infinite Variance). If a random variable has a finite variance, then it can be shown that it must have a finite mean. This example shows that the converse need not be true.
Let $X$ be a discrete random variable with the pmf

$$P(X = x) = \frac{c}{x(x+1)(x+2)}, \quad x = 1, 2, 3, \ldots,$$

where the normalizing constant $c = 4$. The expected value of $X$ is

$$E(X) = \sum_{x=1}^{\infty} x \cdot \frac{4}{x(x+1)(x+2)} = 4\sum_{x=1}^{\infty} \frac{1}{(x+1)(x+2)} = 4 \times 1/2 = 2.$$

Therefore, by direct verification, $X$ has a finite expectation. Let us now examine the second moment of $X$:

$$E(X^2) = \sum_{x=1}^{\infty} x^2 \cdot \frac{4}{x(x+1)(x+2)} = 4\sum_{x=1}^{\infty} \frac{x}{(x+1)(x+2)} = \infty,$$

because the series

$$\sum_{x=1}^{\infty} \frac{x}{(x+1)(x+2)}$$

is not finitely summable, a fact from calculus. Because $E(X^2)$ is infinite, but $E(X)$ is finite, $\sigma^2 = E(X^2) - [E(X)]^2$ must also be infinite.
If a collection of random variables is independent, then just like the expectation, the variance also adds up. Precisely, one has the following very useful fact.
Theorem 1.10. Let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables. Then,

$$\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + \cdots + \operatorname{Var}(X_n).$$

An important corollary of this result is the following variance formula for the mean, $\bar{X}$, of $n$ independent and identically distributed random variables.
Corollary 1.1. Let $X_1, X_2, \ldots, X_n$ be independent random variables with a common variance $\sigma^2 < \infty$. Let $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$. Then $\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}$.

1.4 Inequalities

The mean and the variance, together, have earned the status of being the two most common summaries of a distribution. A relevant question is whether $\mu, \sigma$ are useful summaries of the distribution of a random variable. The answer is a qualified yes. The inequalities below suggest that knowing just the values of $\mu, \sigma$, it is in fact possible to say something useful about the full distribution.

Theorem 1.11. (a) (Chebyshev's Inequality). Suppose $E(X) = \mu$ and $\operatorname{Var}(X) = \sigma^2$, assumed to be finite. Let $k$ be any positive number. Then

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

(b) (Markov's Inequality). Suppose $X$ takes only nonnegative values, and suppose $E(X) = \mu$, assumed to be finite. Let $c$ be any positive number. Then,

$$P(X \ge c) \le \frac{\mu}{c}.$$
The virtue of these two inequalities is that they make no restrictive assumptions on the random variable $X$. Whenever $\mu, \sigma$ are finite, Chebyshev's inequality is applicable, and whenever $\mu$ is finite, Markov's inequality applies, provided the random variable is nonnegative. However, the universal nature of these inequalities also makes them typically quite conservative.
Although Chebyshev's inequality usually gives conservative estimates for tail probabilities, it does imply a major result in probability theory in a special case.
Theorem 1.12 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be iid random variables, with $E(X_i) = \mu$, $\operatorname{Var}(X_i) = \sigma^2 < \infty$. Then, for any $\epsilon > 0$, $P(|\bar{X} - \mu| > \epsilon) \to 0$, as $n \to \infty$.
There is a stronger version of the weak law of large numbers, which says that in fact, with certainty, $\bar{X}$ will converge to $\mu$ as $n \to \infty$. The precise mathematical statement is that

$$P\left(\lim_{n \to \infty} \bar{X} = \mu\right) = 1.$$

The only condition needed is that $E(|X_i|)$ should be finite. This is called the strong law of large numbers. It is impossible to prove it without using much more sophisticated concepts and techniques than we are using here. The strong law of large numbers is treated later in the book. Inequalities better than Chebyshev's or Markov's inequality are available under additional restrictions on the distribution of the underlying random variable $X$. We state three other inequalities that can sometimes give bounds better than what Chebyshev's or Markov's inequality can give.

Theorem 1.13. (a) (Cantelli's Inequality). Suppose $E(X) = \mu$, $\operatorname{Var}(X) = \sigma^2$, assumed to be finite. Then,

$$P(X - \mu \ge k\sigma) \le \frac{1}{k^2 + 1}; \quad P(X - \mu \le -k\sigma) \le \frac{1}{k^2 + 1}.$$

(b) (Paley–Zygmund Inequality). Suppose $X$ takes only nonnegative values, with $E(X) = \mu$, $\operatorname{Var}(X) = \sigma^2$, assumed to be finite. Then, for $0 < c < 1$,

$$P(X > c\mu) \ge (1 - c)^2\,\frac{\mu^2}{\sigma^2 + \mu^2}.$$

(c) (Alon–Spencer Inequality). Suppose $X$ takes only nonnegative integer values, with $E(X) = \mu$, $\operatorname{Var}(X) = \sigma^2$, assumed to be finite. Then,

$$P(X = 0) \le \frac{\sigma^2}{\sigma^2 + \mu^2}.$$

These inequalities may be seen in Rao (1973), Paley and Zygmund (1932), and Alon and Spencer (2000, p. 58), respectively.
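To get a feel for how conservative such bounds can be, one can compare them with exact tail probabilities for a concrete distribution. A possible Python sketch, assuming SciPy is available and using a Bin(100, .5) random variable purely as an illustration:

from scipy.stats import binom

n, p = 100, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5    # mu = 50, sigma = 5

for k in (1, 2, 3):
    exact = 1 - binom.cdf(mu + k * sigma, n, p)   # exact P(X - mu > k*sigma)
    chebyshev = 1 / k ** 2                        # bound on the two-sided tail
    cantelli = 1 / (k ** 2 + 1)                   # bound on the one-sided tail
    print(k, round(exact, 4), chebyshev, round(cantelli, 3))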
The area of probability inequalities is an extremely rich and diverse area. The
reason for it is that inequalities are tremendously useful in giving approximate an-
swers when the exact answer to a problem, or a calculation, is very hard or perhaps
even impossible to obtain. We periodically present and illustrate inequalities over
the rest of the book. Some really basic inequalities based on moments are presented
in the next theorem.
Theorem 1.14. (a) (Cauchy–Schwarz Inequality). Let $X, Y$ be two random variables such that $E(X^2)$ and $E(Y^2)$ are finite. Then,

$$E(|XY|) \le \sqrt{E(X^2)}\sqrt{E(Y^2)}.$$

(b) (Hölder's Inequality). Let $X, Y$ be two random variables, and $1 < p < \infty$ a real number such that $E(|X|^p) < \infty$. Let $q = \frac{p}{p-1}$, and suppose $E(|Y|^q) < \infty$. Then,

$$E(|XY|) \le [E(|X|^p)]^{1/p}[E(|Y|^q)]^{1/q}.$$

(c) (Minkowski's Inequality). Let $X, Y$ be two random variables, and $p \ge 1$ a real number such that $E(|X|^p), E(|Y|^p) < \infty$. Then,

$$[E(|X + Y|^p)]^{1/p} \le [E(|X|^p)]^{1/p} + [E(|Y|^p)]^{1/p},$$

and, in particular, if $E(|X|), E(|Y|)$ are both finite, then

$$E(|X + Y|) \le E(|X|) + E(|Y|),$$

known as the triangular inequality.
(d) (Lyapounov Inequality). Let $X$ be a random variable, and $0 < \alpha < \beta$ such that $E(|X|^\beta) < \infty$. Then,

$$[E(|X|^\alpha)]^{1/\alpha} \le [E(|X|^\beta)]^{1/\beta}.$$

Example 1.20 (Application of Cauchy–Schwarz Inequality). The most useful applications of Hölder's inequality and the Cauchy–Schwarz inequality are to continuous random variables, which we have not discussed yet. We give a simple application of the Cauchy–Schwarz inequality to a dice problem.
Suppose $X, Y$ are the maximum and the minimum of two rolls of a fair die. Also let $X_1$ be the first roll and $X_2$ be the second roll. Note that $XY = X_1X_2$. Therefore,

$$E\left[\sqrt{X}\sqrt{Y}\right] = E\left[\sqrt{XY}\right] = E\left[\sqrt{X_1X_2}\right] = E\left[\sqrt{X_1}\right]E\left[\sqrt{X_2}\right] = \left[E\sqrt{X_1}\right]^2 = \frac{1}{36}\left(\sqrt{1} + \cdots + \sqrt{6}\right)^2 = \frac{1}{36}(10.83)^2 = 3.26.$$

Therefore, by the Cauchy–Schwarz inequality,

$$\sqrt{E(X)}\sqrt{E(Y)} \ge 3.26 \ \Rightarrow\ \sqrt{E(X)}\sqrt{7 - E(X)} \ge 3.26$$

(because $E(X) + E(Y) = E(X_1) + E(X_2) = 7$)

$$\Rightarrow\ \sqrt{m(7 - m)} \ge 3.26$$

(writing $m$ for $E(X)$)

$$\Rightarrow\ m(7 - m) \ge 10.63 \ \Rightarrow\ m \le 4.77,$$

because the quadratic $m(7 - m) - 10.63 = 0$ has the two roots $m = 2.23, 4.77$.
It is interesting that this bound is reasonably accurate, as the exact value of $m = E(X)$ is $\frac{161}{36} = 4.47$.
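The exact value quoted at the end comes from the pmf of the maximum, $P(X = x) = (2x - 1)/36$; the Python sketch below (not from the text) recomputes it and verifies that the exact mean respects the Cauchy–Schwarz bound derived above.

from fractions import Fraction

# exact E(max of two fair die rolls): P(max = x) = (2x - 1)/36
E_max = sum(x * Fraction(2 * x - 1, 36) for x in range(1, 7))
print(E_max, float(E_max))        # 161/36 = 4.472...

m = float(E_max)
print(m * (7 - m) >= 10.63)       # True: the bound m(7 - m) >= 10.63 holds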

1.5 Generating and Moment-Generating Functions

Studying distributions of random variables and their basic quantitative properties,


such as expressions for moments, occupies a central role in both statistics and prob-
ability. It turns out that a function called the probability-generating function is often
a very useful mathematical tool in studying distributions of random variables. The
moment-generating function, which is related to the probability-generating func-
tion, is also extremely useful as a mathematical tool in numerous problems.
Definition 1.17. The probability generating function (pgf), also called simply the generating function, of a nonnegative integer-valued random variable $X$ is defined as $G(s) = G_X(s) = E(s^X) = \sum_{x=0}^{\infty} s^x P(X = x)$, provided the expectation is finite.
In this definition, $0^0$ is to be understood as being equal to 1. Note that $G(s)$ is always finite for $|s| \le 1$, but it could be finite over a larger interval, depending on the specific random variable $X$.
Two basic properties of the generating function are the following.

Theorem 1.15. (a) Suppose $G(s)$ is finite in some open interval containing the origin. Then $G(s)$ is infinitely differentiable in that open interval, and $P(X = k) = \frac{G^{(k)}(0)}{k!}, k \ge 0$, where $G^{(0)}(0)$ means $G(0)$.
(b) If $\lim_{s \uparrow 1} G^{(k)}(s)$ is finite, then $E[X(X-1)\cdots(X-k+1)]$ exists and is finite, and $G^{(k)}(1) = \lim_{s \uparrow 1} G^{(k)}(s) = E[X(X-1)\cdots(X-k+1)]$.

Definition 1.18. $E[X(X-1)\cdots(X-k+1)]$ is called the $k$th factorial moment of $X$.

Remark. The $k$th factorial moment of $X$ exists if and only if the $k$th moment $E(X^k)$ exists.
One of the most important properties of generating functions is the following.

Theorem 1.16. Let $X_1, X_2, \ldots, X_n$ be independent random variables, with generating functions $G_1(s), G_2(s), \ldots, G_n(s)$. Then the generating function of $X_1 + X_2 + \cdots + X_n$ equals

$$G_{X_1 + X_2 + \cdots + X_n}(s) = \prod_{i=1}^{n} G_i(s).$$

One reason that the generating function is useful as a tool is its distribution-determining property, in the following sense.
Theorem 1.17. Let $G(s)$ and $H(s)$ be the generating functions of two random variables $X, Y$. If $G(s) = H(s)$ in any nonempty open interval, then $X, Y$ have the same distribution.
Summarizing, then, one can find from the generating function of a nonnegative integer-valued random variable $X$ the pmf of $X$, and every moment of $X$, including the moments that are infinite.
Example 1.21 (Discrete Uniform Distribution). Suppose $X$ has the discrete uniform distribution on $\{1, 2, \ldots, n\}$. Then, its generating function is

$$G(s) = E[s^X] = \sum_{x=1}^{n} s^x P(X = x) = \frac{1}{n}\sum_{x=1}^{n} s^x = \frac{s(s^n - 1)}{n(s - 1)},$$

by summing the geometric series $\sum_{x=1}^{n} s^x$. As a check, if we differentiate $G(s)$ once, we get

$$G'(s) = \frac{1 + s^n[n(s - 1) - 1]}{n(s - 1)^2}.$$

On applying L'Hospital's rule, we get that $G'(1) = \frac{n+1}{2}$, which, therefore, is the mean of $X$.
Example 1.22 (The Poisson Distribution). Consider a nonnegative integer-valued random variable $X$ with the pmf $p(x) = \frac{e^{-1}}{x!}, x = 0, 1, 2, \ldots$. This is indeed a valid pmf. First, it is clear that $p(x) \ge 0$ for any $x$. Also,

$$\sum_{x=0}^{\infty} p(x) = e^{-1}\sum_{x=0}^{\infty} \frac{1}{x!} = e^{-1}e = 1.$$

We find the generating function of this distribution. The generating function is

$$G(s) = E[s^X] = \sum_{x=0}^{\infty} s^x\,\frac{e^{-1}}{x!} = e^{-1}\sum_{x=0}^{\infty} \frac{s^x}{x!} = e^{-1}e^s = e^{s-1}.$$

The first derivative of $G(s)$ is $G'(s) = e^{s-1}$, and therefore $G'(1) = e^0 = 1$. From our theorem above, we conclude that $E(X) = 1$. Indeed, the pmf that we have in this example is the pmf of the so-called Poisson distribution with mean one. The pmf of the Poisson distribution with a general mean $\lambda$ is $p(x) = \frac{e^{-\lambda}\lambda^x}{x!}, x = 0, 1, 2, \ldots$. The Poisson distribution is an extremely important distribution in probability theory and is studied in more detail below.
We have defined the probability-generating function only for nonnegative
integer-valued random variables. The moment-generating function is usually dis-
cussed in the context of general random variables, not necessarily integer-valued,
or discrete. The two functions are connected. Here is the formal definition.

Definition 1.19. Let $X$ be a real-valued random variable. The moment-generating function (mgf) of $X$ is defined as

$$\psi_X(t) = \psi(t) = E[e^{tX}],$$

whenever the expectation is finite.

Note that the mgf $\psi(t)$ of a random variable $X$ always exists and is finite if $t = 0$, and $\psi(0) = 1$. It may or may not exist when $t \neq 0$. If it does exist for $t$ in a nonempty open interval containing zero, then many properties of $X$ can be derived by using the mgf $\psi(t)$; it is an extremely useful tool. If $X$ is a nonnegative integer-valued random variable, then writing $s^X$ as $e^{X\log s}$, it follows that the (probability) generating function $G(s)$ is equal to $\psi(\log s)$, whenever $G(s) < \infty$. Thus, the two generating functions, namely the probability-generating function and the moment-generating function, are connected.
The following theorem explains the name of a moment-generating function.

Theorem 1.18. (a) Suppose the mgf $\psi(t)$ of a random variable $X$ is finite in some open interval containing zero. Then $\psi(t)$ is infinitely differentiable in that open interval, and for any $k \ge 1$,

$$E(X^k) = \psi^{(k)}(0).$$

(b) (Distribution-Determining Property). If $\psi_1(t), \psi_2(t)$ are the mgfs of two random variables $X, Y$, and if $\psi_1(t) = \psi_2(t)$ in some nonempty open interval containing zero, then $X, Y$ have the same distribution.
(c) If $X_1, X_2, \ldots, X_n$ are independent random variables, and if each $X_i$ has an mgf $\psi_i(t)$, existing in some open interval around zero, then $X_1 + X_2 + \cdots + X_n$ also has an mgf in that open interval, and

$$\psi_{X_1 + X_2 + \cdots + X_n}(t) = \prod_{i=1}^{n} \psi_i(t).$$

Example 1.23 (Discrete Uniform Distribution). Let $X$ have the pmf $P(X = x) = \frac{1}{n}, x = 1, 2, \ldots, n$. Then, its mgf is

$$\psi(t) = E[e^{tX}] = \frac{1}{n}\sum_{k=1}^{n} e^{tk} = \frac{e^t(e^{nt} - 1)}{n(e^t - 1)}.$$

By direct differentiation,

$$\psi'(t) = \frac{e^t(1 + ne^{(n+1)t} - (n+1)e^{nt})}{n(e^t - 1)^2}.$$

On applying L'Hospital's rule twice, we get the previously derived fact that $E(X) = \frac{n+1}{2}$.

Example 1.24. Suppose $X$ takes only two values 0 and 1, with $P(X = 1) = p$, $P(X = 0) = 1 - p$, $0 < p < 1$. Thus, $X$ is a Bernoulli variable with parameter $p$. Then, the mgf of $X$ is

$$\psi(t) = E[e^{tX}] = pe^t + (1 - p).$$

If we differentiate this, we get $\psi'(t) = pe^t$, $\psi''(t) = pe^t$. Therefore, $\psi'(0) = pe^0 = p$, and also $\psi''(0) = p$. From the general properties of mgfs, it then follows that $E(X) = \psi'(0) = p$, and $E(X^2) = \psi''(0) = p$. Now go back to the pmf of $X$ that we started with in this example, and note that indeed, by direct calculation, $E(X) = E(X^2) = p$.
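Because these differentiations are purely mechanical, a computer algebra system can reproduce them; here is a possible sketch using Python's SymPy library, assuming it is installed.

import sympy as sp

t, p = sp.symbols('t p')
psi = p * sp.exp(t) + (1 - p)     # the Bernoulli(p) mgf

# E(X^k) = psi^(k)(0); for the Bernoulli case every moment equals p
for k in (1, 2, 3):
    print(k, sp.simplify(sp.diff(psi, t, k).subs(t, 0)))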
Closely related to the moments of a random variable are certain quantities known as cumulants. Cumulants arise in accurate approximation of the distribution of sums of independent random variables. They are also used for statistical modeling purposes in some applied sciences. The name cumulant was coined by Sir Ronald Fisher (Fisher (1929)), although cumulants were discussed in the literature by others before Fisher coined the term. We define and describe some basic facts about cumulants below; this material is primarily for reference purposes, and may be omitted at first reading.
We first need to define central moments of a random variable, because cumulants are related to them.

Definition 1.20. Let a random variable $X$ have a finite $j$th moment for some specified $j \ge 1$. The $j$th central moment of $X$ is defined as $\mu_j = E[(X - \mu)^j]$, where $\mu = E(X)$.

Remark. Note that $\mu_1 = E(X - \mu) = 0$ and $\mu_2 = E(X - \mu)^2 = \sigma^2$, the variance of $X$. If $X$ has a distribution symmetric about zero, then every odd-order central moment $E[(X - \mu)^{2k+1}]$ is easily proved to be zero, provided it exists.

Definition 1.21. Let $X$ have a finite mgf $\psi(t)$ in some neighborhood of zero, and let $K(t) = \log \psi(t)$, when it exists. The $r$th cumulant of $X$ is defined as $\kappa_r = \frac{d^r}{dt^r}K(t)\big|_{t=0}$. Equivalently, the cumulants of $X$ are the coefficients in the power series expansion $K(t) = \sum_{n=1}^{\infty} \kappa_n \frac{t^n}{n!}$, within its radius of convergence.
Note that $K(t) = \log \psi(t)$ implies that $e^{K(t)} = \psi(t)$. By equating coefficients in the power series expansion of $e^{K(t)}$ with those in the power series expansion of $\psi(t)$, it is easy to express the first few moments (and therefore, the first few central moments) in terms of the cumulants. Indeed, denoting $c_i = E(X^i)$, $\mu = E(X) = c_1$, $\mu_i = E(X - \mu)^i$, $\sigma^2 = \mu_2$, one obtains the expressions

$$c_1 = \kappa_1; \quad c_2 = \kappa_2 + \kappa_1^2; \quad c_3 = \kappa_3 + 3\kappa_1\kappa_2 + \kappa_1^3;$$
$$c_4 = \kappa_4 + 4\kappa_1\kappa_3 + 3\kappa_2^2 + 6\kappa_1^2\kappa_2 + \kappa_1^4.$$

The corresponding expressions for the central moments are much simpler:

$$\mu_2 = \kappa_2; \quad \mu_3 = \kappa_3; \quad \mu_4 = \kappa_4 + 3\kappa_2^2.$$

In general, the cumulants satisfy the recursion relation

$$\kappa_n = c_n - \sum_{k=1}^{n-1} \binom{n-1}{k-1} c_{n-k}\kappa_k.$$

This results in the specific expressions

$$\kappa_2 = \mu_2; \quad \kappa_3 = \mu_3; \quad \kappa_4 = \mu_4 - 3\mu_2^2.$$

Higher-order cumulants have quite complex expressions in terms of the central moments $\mu_j$; the corresponding expressions in terms of the $c_j$ are even more complex. The derivations of the expressions stated above involve straightforward differentiation. We do not present the algebra. It is useful to know these expressions for some problems in statistics.

1.6 * Applications of Generating Functions to a Pattern Problem

We now describe some problems in discrete probability that are generally known as
problems of patterns. Generating functions turn out to be crucially useful in analyz-
ing many of these problems. Suppose a coin with probability p for heads is tossed
repeatedly. How long does it take before we see three heads in succession for the
first time? Questions such as this which pertain to waiting times for seeing one or
more specified patterns are particularly amenable to use of the generating function.
Theorem 1.19. Suppose a coin with probability $p$ for heads is tossed repeatedly, and $N = N_r$ is the first toss at which a head run of length $r$ is obtained. Then

$$E(N) = \sum_{k=1}^{r} \left(\frac{1}{p}\right)^k; \quad \operatorname{Var}(N) = \frac{1 - p^{1+2r} - qp^r(1 + 2r)}{q^2 p^{2r}}.$$

Proof. Let $p_k = P(N = k)$. The trick is to write a recursion relation for the sequence $p_k$ and then convert it to a generating function problem. This technique has been found to be successful in solving numerous hard combinatorial problems.
Clearly, $p_1 = p_2 = \cdots = p_{r-1} = 0$. Also, $p_r = p^r$. The first head run of length $r$ occurs at the $(r+1)$th trial if and only if the first trial is a tail and the last $r$ trials are all heads; therefore, $p_{r+1} = qp^r$, where $q = 1 - p$. Similarly, $p_{r+2} = q^2p^r + pq \cdot p^r = qp_{r+1} + pqp_r$. For a general $k \ge r + 1$, we have the recursion relation

$$p_k = qp_{k-1} + pqp_{k-2} + \cdots + p^{r-1}qp_{k-r}.$$

Multiplying by $s^k$ and summing over $k$, we get

$$G(s) = p^r s^r + qsG(s) + pqs^2 G(s) + \cdots + p^{r-1}qs^r G(s)$$
$$\Rightarrow G(s) = \frac{p^r s^r}{1 - qs(1 + ps + \cdots + (ps)^{r-1})} = \frac{p^r s^r(1 - ps)}{1 - s + qp^r s^{r+1}},$$

on summing the geometric series $1 + ps + \cdots + (ps)^{r-1}$, and on using the fact that $p + q = 1$. Thus, by using a very clever recursion, the generating function of $N$ has been obtained; it is

$$G(s) = \frac{p^r s^r(1 - ps)}{1 - s + qp^r s^{r+1}}.$$

Inasmuch as we have a closed-form formula for $G(s)$, we can determine $p_k$ for any specified $k$ by simple repeated differentiation; we can also obtain the expected value of $N$, as $E(N) = G'(1)$. By using the fact that $G''(1) = E[N(N-1)]$, we can obtain the second moment, and from there the variance of $N$. □

Example 1.25 (Run of Heads). If the coin is a fair coin, we get the result that the expected number of tosses necessary to get the first head run of length $r$ is $2 + 2^2 + \cdots + 2^r$; for example, it takes on average 14 tosses of a fair coin to obtain a run of three heads for the first time.
On computing using the variance formula in the theorem, one can see that the variance of $N$ is very large for $r > 3$; sometimes one has to wait a very long time to see a run of four or more consecutive heads.
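Both the expectation formula and the heavy right tail of the waiting time are easy to see numerically; the Python sketch below (illustrative names, not from the text) simulates the waiting time for a run of $r$ heads.

import random

def mean_run_time(r, p):
    """E(N) from the theorem: sum of (1/p)^k for k = 1, ..., r."""
    return sum((1 / p) ** k for k in range(1, r + 1))

def simulate_run_time(r, p):
    """Toss until r consecutive heads appear; return the number of tosses."""
    run = tosses = 0
    while run < r:
        tosses += 1
        run = run + 1 if random.random() < p else 0
    return tosses

r, p, reps = 3, 0.5, 100_000
avg = sum(simulate_run_time(r, p) for _ in range(reps)) / reps
print(mean_run_time(r, p), avg)   # exact value 14 vs. the simulation average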

1.7 Standard Discrete Distributions

A few special discrete distributions arise very frequently in applications. Either the underlying probability mechanism of a problem is such that one of these distributions is truly the correct distribution for that problem, or the problem may be such that one of these distributions is a very good choice to model that problem. The special distributions we present are the binomial, the geometric, the negative binomial, the hypergeometric, and the Poisson.
The Binomial Distribution. The binomial distribution represents a sequence of independent coin tossing experiments. Suppose a coin with probability $p$, $0 < p < 1$, for heads in a single trial is tossed independently a prespecified number of times, say $n$ times, $n \ge 1$. Let $X$ be the number of times in the $n$ tosses that a head is obtained. Then the pmf of $X$ is

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n,$$

the $\binom{n}{x}$ term giving the choice of the $x$ tosses out of the $n$ tosses in which the heads occur.
Coin tossing, of course, is just an artifact. Suppose a trial can result in only one of two outcomes, called a success (S) or a failure (F), the probability of obtaining a success being $p$ in any trial. Such a trial is called a Bernoulli trial. Suppose a Bernoulli trial is repeated independently a prespecified number of times, say $n$ times. Let $X$ be the number of times in the $n$ trials that a success is obtained. Then $X$ has the pmf given above, and we say that $X$ has a binomial distribution with parameters $n$ and $p$, and write $X \sim \operatorname{Bin}(n, p)$.
The Geometric Distribution. Suppose a coin with probability $p$, $0 < p < 1$, for heads in a single trial is repeatedly tossed until a head is obtained for the first time. Assume that the tosses are independent. Let $X$ be the number of the toss at which the very first head is obtained. Then the pmf of $X$ is

$$P(X = x) = p(1-p)^{x-1}, \quad x = 1, 2, 3, \ldots.$$

We say that $X$ has a geometric distribution with parameter $p$, and we write $X \sim \operatorname{Geo}(p)$. A geometric distribution measures a waiting time for the first success in a sequence of independent Bernoulli trials, each with the same success probability $p$; that is, the coin cannot change from one toss to another.
The Negative Binomial Distribution. The negative binomial distribution is a generalization of a geometric distribution, when we repeatedly toss a coin with probability $p$ for heads, independently, until a total number of $r$ heads has been obtained, where $r$ is some fixed integer $\ge 1$. The case $r = 1$ corresponds to the geometric distribution. Let $X$ be the number of the first toss at which the $r$th success is obtained. Then the pmf of $X$ is

$$P(X = x) = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \quad x = r, r+1, \ldots,$$

the term $\binom{x-1}{r-1}$ simply giving the choice of the $r - 1$ tosses among the first $x - 1$ tosses where the first $r - 1$ heads were obtained. We say that $X$ has a negative binomial distribution with parameters $r$ and $p$, and we write $X \sim \operatorname{NB}(r, p)$.
The Hypergeometric Distribution. The hypergeometric distribution also represents the number of successes in a prespecified number of Bernoulli trials, but the trials happen to be dependent. A typical example is that of a finite population in which there are in all $N$ objects, of which some $D$ are of type I, and the other $N - D$ are of type II. A without-replacement sample of size $n$, $1 \le n < N$, is chosen at random from the population. Thus, the selected sampling units are necessarily different. Let $X$ be the number of units or individuals of type I among the $n$ units chosen. Then the pmf of $X$ is

$$P(X = x) = \frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}, \quad n - N + D \le x \le D.$$

We say that such an $X$ has a hypergeometric distribution with parameters $n, D, N$, and we write $X \sim \operatorname{Hypergeo}(n, D, N)$.
The Poisson Distribution. The Poisson distribution is perhaps the most used and useful distribution for modeling nonnegative integer-valued random variables. The pmf of a Poisson distribution with parameter $\lambda$ is

$$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots;$$

by using the power series expansion $e^{\lambda} = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}$, it follows that this is indeed a valid pmf.
Three specific situations where a Poisson distribution is almost routinely adopted as a model are the following.
(A) The number of times a specific event happens in a specified period of time, for example, the number of phone calls received by someone over a 24-hour period.
(B) The number of times a specific event or phenomenon is observed in a specified amount of area or volume, for example, the number of bacteria of a certain kind in one liter of a sample of water, or the number of misprints per page of a book, and so on.
(C) The number of times a success is obtained when a Bernoulli trial with success probability $p$ is repeated independently $n$ times, with $p$ being small and $n$ being large, such that the product $np$ has a moderate value, say between .5 and 10. Thus, although the true distribution is a binomial, a Poisson distribution is used as an effective and convenient approximation.
We now present the most important properties of these special discrete distributions.

Theorem 1.20. Let $X \sim \operatorname{Bin}(n, p)$. Then,
(a) $\mu = E(X) = np$; $\sigma^2 = \operatorname{Var}(X) = np(1-p)$;
(b) the mgf of $X$ equals $\psi(t) = (pe^t + 1 - p)^n$ at any $t$;
(c) $E[(X - \mu)^3] = np(1 - 3p + 2p^2)$;
(d) $E[(X - \mu)^4] = np(1-p)[1 + 3(n-2)p(1-p)]$.

Theorem 1.21 (Mean Absolute Deviation and Mode). Let $X \sim \operatorname{Bin}(n, p)$. Let $\nu$ denote the smallest integer $> np$ and let $m = \lfloor np + p \rfloor$. Then,
(a) $E|X - np| = 2\nu(1-p)P(X = \nu)$;
(b) the mode of $X$ equals $m$. In particular, if $np$ is an integer, then the mode is exactly $np$; if $np$ is not an integer, then the mode is one of the two integers just below and just above $np$.
Theorem 1.22. (a) Let $X \sim \operatorname{Geo}(p)$. Let $q = 1 - p$. Then,

$$E(X) = \frac{1}{p}; \quad \operatorname{Var}(X) = \frac{q}{p^2}.$$

(b) Let $X \sim \operatorname{NB}(r, p), r \ge 1$. Then,

$$E(X) = \frac{r}{p}; \quad \operatorname{Var}(X) = \frac{rq}{p^2}.$$

Furthermore, the mgf and the (probability) generating function of $X$ equal

$$\psi(t) = \left(\frac{pe^t}{1 - qe^t}\right)^r, \quad t < \log\left(\frac{1}{q}\right); \qquad G(s) = \left(\frac{ps}{1 - qs}\right)^r, \quad s < \frac{1}{q}.$$

Theorem 1.23. Let $X \sim \operatorname{Hypergeo}(n, D, N)$, and let $p = \frac{D}{N}$. Then,

$$E(X) = np; \quad \operatorname{Var}(X) = np(1-p)\left(\frac{N-n}{N-1}\right).$$

Problems that should truly be modeled as hypergeometric distribution problems are often analyzed as if they were binomial distribution problems. That is, the fact that samples have been taken without replacement is ignored, and one pretends the successive draws are independent. When does it not matter that the dependence between the trials is ignored? Intuitively, we would think that if the population size $N$ were large, and neither $D$ nor $N - D$ were small, the trials would act as if they were independent trials. The following theorem justifies this intuition.
Theorem 1.24 (Convergence of Hypergeometric to Binomial). Let $X = X_N \sim \operatorname{Hypergeo}(n, D, N)$, where $D = D_N$ and $N$ are such that $N \to \infty$, $\frac{D}{N} \to p$, $0 < p < 1$. Then, for any fixed $n$, and for any fixed $x$,

$$P(X = x) = \frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}} \to \binom{n}{x} p^x (1-p)^{n-x},$$

as $N \to \infty$.
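This convergence sets in quite quickly, as a small numerical experiment shows; the Python sketch below, assuming SciPy is available, fixes $n = 5$, $x = 2$, $p = .3$ (arbitrary illustrative values) and grows $N$.

from scipy.stats import binom, hypergeom

n, x, p = 5, 2, 0.3
for N in (20, 100, 1000, 10000):
    D = int(p * N)                           # so that D/N -> p
    # scipy argument order: (value, population size, type I count, sample size)
    print(N, round(hypergeom.pmf(x, N, D, n), 5))
print("binomial limit:", round(binom.pmf(x, n, p), 5))   # 0.3087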

Theorem 1.25. Let $X \sim \operatorname{Poi}(\lambda)$. Then,
(a) $E(X) = \operatorname{Var}(X) = \lambda$;
(b) $E(X - \mu)^3 = \lambda$; $E(X - \mu)^4 = 3\lambda^2 + \lambda$;
(c) the mgf of $X$ equals

$$\psi(t) = e^{\lambda(e^t - 1)};$$

(d) the integer part of $\lambda$ is always a mode of $X$. If $\lambda$ is itself an integer, then $\lambda$ and $\lambda - 1$ are both modes of $X$.

Let us now see some illustrative examples.

Example 1.26 (Guessing on a Multiple Choice Exam). A multiple choice test with 20 questions has five possible answers for each question. A completely unprepared student picks the answer for each question at random and independently. Suppose $X$ is the number of questions that the student answers correctly.
We identify each question with a Bernoulli trial and a correct answer as a success. Because there are 20 questions and the student picks each answer at random from five choices, $X \sim \operatorname{Bin}(n, p)$, with $n = 20, p = \frac{1}{5} = .2$. We can now answer any question we want about $X$. For example,

$$P(\text{the student gets every answer wrong}) = P(X = 0) = .8^{20} = .0115,$$

whereas

$$P(\text{the student gets every answer right}) = P(X = 20) = .2^{20} = 1.05 \times 10^{-14},$$

a near impossibility. Suppose the instructor has decided that it will take at least 13 correct answers to pass this test. Then,

$$P(\text{the student will pass}) = \sum_{x=13}^{20} \binom{20}{x} .2^x .8^{20-x} = .000015,$$

still a very, very small probability.
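All three numbers are one-line computations with standard software; for instance, in Python with SciPy (assuming it is available):

from scipy.stats import binom

n, p = 20, 0.2
print(binom.pmf(0, n, p))         # .8^20 = .0115
print(binom.pmf(20, n, p))        # .2^20, about 1.05e-14
print(1 - binom.cdf(12, n, p))    # P(X >= 13), about .000015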


Example 1.27 (Meeting Someone with the Same Birthday). Suppose you were born on October 15. How many different people do you have to meet before you find someone who was also born on October 15? Under the usual conditions of equally likely birthdays, and independence of the birthdays of all people that you will meet, the number of people $X$ you have to meet to find the first person with the same birthday as yours is geometric: $X \sim \operatorname{Geo}(p)$ with $p = \frac{1}{365}$. The pmf of $X$ is $P(X = x) = p(1-p)^{x-1}$. Thus, for any given $k$,

$$P(X > k) = \sum_{x=k+1}^{\infty} p(1-p)^{x-1} = p\sum_{x=k}^{\infty} (1-p)^x = (1-p)^k.$$

For example, the chance that you will have to meet more than 1000 people to find someone with the same birthday as yours is $(364/365)^{1000} = .064$.

Example 1.28 (Lack of Memory of Geometric Distribution). Let $X \sim \operatorname{Geo}(p)$, and suppose $m, n$ are given positive integers. Then $X$ has the interesting property

$$P(X > m + n \mid X > n) = P(X > m).$$

That is, suppose you are waiting for some event to happen for the first time. You have tried, say, 20 times, and you still have not succeeded. You may feel that success is due anytime now. But the chance that it will take another ten tries is the same as if you had just started, forgetting that you have been patient for a long time and have already tried very hard for success.
The proof is simple. Indeed,

$$P(X > m + n \mid X > n) = \frac{P(X > m + n)}{P(X > n)} = \frac{\sum_{x > m+n} p(1-p)^{x-1}}{\sum_{x > n} p(1-p)^{x-1}} = \frac{(1-p)^{m+n}}{(1-p)^n} = (1-p)^m = P(X > m).$$

Example 1.29 (A Classic Example: Capture–Recapture). An ingenious use of the hypergeometric distribution in estimating the size of a finite population is the capture–recapture method. It was originally used for estimating the total number of fish in a body of water, such as a pond. Let $N$ be the number of fish in the pond. In this method, a certain number of fish, say $D$ of them, are initially captured and tagged with a safe mark or identification device, and are returned to the water. Then, a second sample of $n$ fish is recaptured from the water. Assuming that the fish population has not changed in any way in the intervening time, and that the initially captured fish remixed with the fish population homogeneously, the number of fish in the second sample, say $X$, that bear the mark is a hypergeometric random variable, namely, $X \sim \operatorname{Hypergeo}(n, D, N)$. We know that the expected value of a hypergeometric random variable is $n\frac{D}{N}$. If we set, as a formalism, $X = n\frac{D}{N}$ and solve for $N$, we get $N = \frac{nD}{X}$. This is an estimate of the total number of fish in the pond. Although the idea is extremely original, this estimate can run into various kinds of difficulties if, for example, the first catch of fish cluster together after being returned, or hide, or if the fish population has changed between the two catches due to death or birth, and of course if $X$ turns out to be zero. Modifications of this estimate (known as the Petersen estimate) are widely used in wildlife estimation, census work, and by governments for estimating tax fraud and the number of people afflicted with some infection.

Example 1.30 (Events over Time). April receives three phone calls at her home on
the average per day. On what percentage of days does she receive no phone calls;
more than five phone calls?
Because the number of calls received in a 24-hour period counts the occurrences
of an event in a fixed time period, we model X = number of calls received by April
on one day as a Poisson random variable with mean 3. Then,

P(X = 0) = e^{−3} = .0498; P(X > 5) = 1 − P(X ≤ 5) = 1 − e^{−3} Σ_{x=0}^5 3^x/x! = 1 − .9161 = .0839.

Thus, she receives no calls on 4.98% of the days and she receives more than five
calls on 8.39% of the days. It is important to understand that X has only been modeled
as a Poisson random variable, and other models could also be reasonable.
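Both numbers come straight from the Poisson pmf e^{−λ}λ^x/x!; a quick computational sketch (ours, not from the text):

    from math import exp, factorial

    lam = 3.0
    def pmf(x):
        # Poisson pmf with mean lam
        return exp(-lam) * lam ** x / factorial(x)

    p0 = pmf(0)                                        # P(X = 0)
    p_more_than_5 = 1 - sum(pmf(x) for x in range(6))  # 1 - P(X <= 5)
    print(round(p0, 4), round(p_more_than_5, 4))       # 0.0498 0.0839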

Example 1.31 (A Hierarchical Model with a Poisson Base). Suppose a chick lays
a Poi(λ) number of eggs in some specified period of time, say a month. Each egg
has a probability p of actually developing. We want to find the distribution of the
number of eggs that actually develop during that period of time.

Let X ~ Poi(λ) denote the number of eggs the chick lays, and Y the number of
eggs that develop. For example,

P(Y = 0) = Σ_{x=0}^∞ P(Y = 0 | X = x) P(X = x) = Σ_{x=0}^∞ (1 − p)^x e^{−λ} λ^x/x!
= e^{−λ} Σ_{x=0}^∞ (λ(1 − p))^x/x! = e^{−λ} e^{λ(1−p)} = e^{−λp}.

In general,

P(Y = y) = Σ_{x=y}^∞ (x choose y) p^y (1 − p)^{x−y} e^{−λ} λ^x/x!
= [(p/(1 − p))^y e^{−λ}/y!] Σ_{x=y}^∞ (λ(1 − p))^x/(x − y)!
= [(p/(1 − p))^y e^{−λ}/y!] (λ(1 − p))^y Σ_{n=0}^∞ (λ(1 − p))^n/n!,

on writing n = x − y in the summation,

= [(λp)^y/y!] e^{−λ} e^{λ(1−p)} = e^{−λp} (λp)^y/y!,

and so, we recognize that Y ~ Poi(λp). What is interesting here is that the distribution
still remains Poisson, under assumptions that seem to be very realistic
physically.
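This conclusion, often called Poisson thinning, is easy to verify empirically. The sketch below (ours, not from the text; λ and p are arbitrary illustrative values) simulates the two-stage experiment and compares the observed frequencies of Y with the Poi(λp) pmf:

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(3)
    lam, p = 6.0, 0.3

    x = rng.poisson(lam, size=200_000)   # eggs laid, X ~ Poi(lam)
    y = rng.binomial(x, p)               # eggs that develop, Y | X = x ~ Bin(x, p)

    for k in range(5):
        exact = exp(-lam * p) * (lam * p) ** k / factorial(k)  # Poi(lam*p) pmf
        print(k, round(float(np.mean(y == k)), 4), round(exact, 4))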

1.8 Poisson Approximation to Binomial

A binomial random variable is the sum of n indicator variables. When the expectation
of these indicator variables, namely p, is small, and the number of summands n
is large, the Poisson distribution provides a good approximation to the binomial. The
Poisson distribution can also sometimes serve as a good approximation when the indicators
are independent but have different expectations p_i, or when the indicator
variables have some weak dependence. We start with the Poisson approximation to
the binomial when n is large and p is small.

Theorem 1.26. Let X_n ~ Bin(n, p_n), n ≥ 1. Suppose np_n → λ, 0 < λ < ∞, as
n → ∞. Let Y ~ Poi(λ). Then, for any given k, 0 ≤ k < ∞,

P(X_n = k) → P(Y = k),

as n → ∞.

In fact, the convergence is not just pointwise for each fixed k; it is uniform
in k. This follows from the next theorem.

Theorem 1.27 (Le Cam, Barbour and Hall, Steele). Let X_n = B_1 + B_2 + ... + B_n,
where the B_i are independent Bernoulli variables with parameters p_i = p_{i,n}. Let Y_n ~
Poi(λ), where λ = λ_n = Σ_{i=1}^n p_i. Then,

Σ_{k=0}^∞ |P(X_n = k) − P(Y_n = k)| ≤ 2 [(1 − e^{−λ})/λ] Σ_{i=1}^n p_i².

Here is an application of this Poisson approximation result.

Example 1.32 (Lotteries). Consider a weekly lottery in which 3 numbers from 25


numbers are selected at random, and a person holding exactly those 3 numbers is
the winner of the lottery. Suppose the person plays for n weeks, for large n. What is
the probability that he will win the lottery at least once (at least twice)?
Let X be the number of weeks that the player wins. Then, assuming the weekly
lotteries to be independent, X ~ Bin(n, p), where p = 1/(25 choose 3) = 1/2300 = .00043.
Because p is small, and n is supposed to be large, X ≈ Poi(λ), with λ = np =
.00043n. Therefore,

P(X ≥ 1) = 1 − P(X = 0) ≈ 1 − e^{−.00043n},

and,

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) ≈ 1 − e^{−.00043n} − .00043n e^{−.00043n}
= 1 − (1 + .00043n) e^{−.00043n}.

We can compute these for various n. If he plays for five years, n = 5 × 52, and

1 − e^{−.00043n} = 1 − e^{−.00043×5×52} = .106,

and,

1 − (1 + .00043n) e^{−.00043n} = .006.

If he plays for ten years, n = 10 × 52, and

1 − e^{−.00043n} = 1 − e^{−.00043×10×52} = .200,

and,

1 − (1 + .00043n) e^{−.00043n} = .022.

We can see that the chances of any luck are at best moderate even after prolonged tries.
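A computational sketch (ours, not from the text) comparing the exact binomial answer with the Poisson approximation for the five-year case, together with the error bound of Theorem 1.27, which for these identical p_i reduces to 2[(1 − e^{−λ})/λ] n p²:

    from math import comb, exp

    p = 1 / comb(25, 3)            # 1/2300, the chance of winning in one week
    n = 5 * 52                     # five years of weekly plays
    lam = n * p

    print("exact binomial :", 1 - (1 - p) ** n)   # P(X >= 1)
    print("Poisson approx.:", 1 - exp(-lam))      # 1 - e^{-lambda}

    # Le Cam-type bound of Theorem 1.27 on the total approximation error:
    print("error bound    :", 2 * (1 - exp(-lam)) / lam * n * p * p)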
Sums of random variables arise very naturally in practical applications. For ex-
ample, the revenue over a year is the sum of the monthly revenues; the time taken to
finish a test with ten problems is the sum of the times taken to finish the individual
problems, and so on. Sometimes we can reasonably assume that the various random
variables being added are independent. Thus, the following general question is an
important one.
Suppose X_1, X_2, ..., X_k are k independent random variables, and suppose we
know the distributions of the individual X_i. What is the distribution of the sum
X_1 + X_2 + ... + X_k?

In general, this is a very difficult question. Interestingly, if the individual X_i have
one of the distinguished distributions we have discussed in this chapter, then their
sum is also often a distribution of that same type.

Theorem 1.28. (a) Suppose X_1, X_2, ..., X_k are k independent binomial random
variables, with X_i ~ Bin(n_i, p). Then X_1 + X_2 + ... + X_k ~ Bin(n_1 + n_2 + ... + n_k, p);
(b) Suppose X_1, X_2, ..., X_k are k independent negative binomial random variables,
with X_i ~ NB(r_i, p). Then X_1 + X_2 + ... + X_k ~ NB(r_1 + r_2 + ... + r_k, p);
(c) Suppose X_1, X_2, ..., X_k are k independent Poisson random variables, with
X_i ~ Poi(λ_i). Then X_1 + X_2 + ... + X_k ~ Poi(λ_1 + λ_2 + ... + λ_k).

1.9 Continuous Random Variables

Discrete random variables serve as good examples to develop probabilistic intuition,


but they do not account for all the random variables that one studies in theory and in
applications. We now introduce the so-called continuous random variables, which
typically take all values in some nonempty interval, such as the unit interval, the
entire real line, or the like. The right probabilistic paradigm for continuous variables
cannot be pmfs. Instead of pmfs, we operate with a density function for the variable.
The density function fully describes the distribution.

Definition 1.22. Let X be a real-valued random variable taking values in R, the
real line. A function f(x) is called the density function or the probability density
function (pdf) of X if

P(a ≤ X ≤ b) = ∫_a^b f(x) dx for all a, b, −∞ < a ≤ b < ∞;

in particular, for a function f(x) to be a density function of some random variable,
it must satisfy

f(x) ≥ 0 ∀ x ∈ R; ∫_{−∞}^∞ f(x) dx = 1.

The statement that P(a ≤ X ≤ b) = ∫_a^b f(x) dx is the same as saying that if
we plot the density function f(x), then the area under the graph between a and b
will give the probability that X is between a and b, while the statement that
∫_{−∞}^∞ f(x) dx = 1 is the same as saying that the area under the entire graph must be
one. This is a visually helpful way to think of probabilities for continuous random
variables; larger areas under the graph of the density function correspond to larger
probabilities.

The density function f(x) can in principle be used to calculate the probability
that the random variable X belongs to a general set A, not just an interval. Indeed,
P(X ∈ A) = ∫_A f(x) dx.

Caution. Integrals over completely general sets A in the real line are not defined.
To make this completely rigorous, one has to use measure theory and the concept of
a Lebesgue integral. However, generally we only want to calculate P(X ∈ A) for
sets A that are countable unions of intervals. For such sets, defining the integral
∫_A f(x) dx would not be a problem, and we can proceed as if we were just calculating
ordinary integrals.
The definition of the cumulative distribution function remains the same as before.
Definition 1.23. Let X be a continuous random variable with a pdf f(x). Then the
CDF of X is defined as

F(x) = P(X ≤ x) = P(X < x) = ∫_{−∞}^x f(t) dt.

Remark. At any point x_0 at which f(x) is continuous, the CDF F(x) is differentiable,
and F′(x_0) = f(x_0). In particular, if f(x) is continuous everywhere, then
F′(x) = f(x) at all x.

Again, to be strictly rigorous, one really needs to say in the above sentence that
F′(x) = f(x) at almost all x, a concept in measure theory.

Example 1.33 (Using the Density to Calculate a Probability). Suppose X has the
uniform density on [0, 1] defined by f(x) = 1, 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
We write X ~ U[0, 1]. Consider the events

A = {X is between .4 and .6};
B = {X(1 − X) ≤ .21};
C = {sin(πX/2) ≥ 1/√2};
D = {X is a rational number}.

We calculate each of P(A), P(B), P(C), and P(D). Recall that the probability of
any event, say E, is calculated as P(E) = ∫_E f(x) dx, where f(x) is the density
function; here f(x) = 1 on [0, 1]. Then,

P(A) = ∫_{.4}^{.6} dx = .2.

Next, note that x(1 − x) = .21 has two roots in [0, 1], namely x = .3, .7, and
x(1 − x) ≤ .21 if x ≤ .3 or ≥ .7. Therefore,

P(B) = P(X ≤ .3) + P(X ≥ .7) = ∫_0^{.3} dx + ∫_{.7}^1 dx = .3 + .3 = .6.

For the event C, sin(πX/2) ≥ 1/√2 if (and only if)

πX/2 ≥ π/4 ⟹ X ≥ 1/2.

Thus,

P(C) = P(X ≥ 1/2) = ∫_{1/2}^1 dx = 1/2.

Finally, the set of rationals in [0, 1] is a countable set. Therefore,

P(D) = Σ_{x: x is rational} P(X = x) = Σ_{x: x is rational} 0 = 0.

Example 1.34 (From CDF to PDF and Median). Consider the function F(x) = 0
if x < 0; = 1 − e^{−x} if 0 ≤ x < ∞. This is a nonnegative nondecreasing function
that goes to one as x → ∞, is continuous at any real number x, and is also differentiable
at any x except x = 0. Thus, it is the CDF of a continuous random variable,
and the PDF can be obtained by the relation f(x) = F′(x) = e^{−x}, 0 < x < ∞,
and f(x) = F′(x) = 0, x < 0. At x = 0, F(x) is not differentiable. But we can
define the PDF in any manner we like at one specific point; so, to be specific, we
write our PDF as

f(x) = e^{−x} if 0 ≤ x < ∞;
     = 0 if x < 0.

This density is called the standard exponential density and is enormously important
in practical applications.

From the formula for the CDF, we see that F(m) = .5 ⟹ 1 − e^{−m} = .5 ⟹
e^{−m} = .5 ⟹ m = log 2 = .693. Thus, we have established that the standard
exponential density has median log 2 = .693.

In general, given a number p, there can be infinitely many values x such that
F(x) = p. Any such value splits the distribution into two parts, 100p% of the
probability below it, and 100(1 − p)% above. Such a value is called the pth quantile
or percentile of F. However, in order to give a prescription for choosing a unique
value when there is more than one x at which F(x) = p, the following definition is
adopted.

Definition 1.24. Let X have the CDF F(x). Let 0 < p < 1. The pth quantile or
the pth percentile of X is defined to be the first x such that F(x) ≥ p:

F^{−1}(p) = inf{x : F(x) ≥ p}.

The function F^{−1}(p) is also sometimes denoted as Q(p) and is called the quantile
function of F or X.

Remark. Statisticians call Q(.25) and Q(.75) the first and the third quartile of F
or X.

The distribution of a continuous random variable is completely described if we
describe either its density function or its CDF. For flexible modeling, it is useful
to know how to create new densities or new CDFs out of densities or CDFs that
we have already thought of. This is similar to generating new functions out of old
functions in calculus. The following theorem describes some standard methods to
make new densities or CDFs out of already available ones.

Theorem 1.29. (a) Let f(x) be any density function. Then, for any real number μ
and any σ > 0,

g(x) = g_{μ,σ}(x) = (1/σ) f((x − μ)/σ)

is also a valid density function.
(b) Let f_1, f_2, ..., f_k be k densities for some k, 2 ≤ k < ∞, and let p_1, p_2, ..., p_k
be k constants such that each p_i ≥ 0 and Σ_{i=1}^k p_i = 1. Then,

f(x) = Σ_{i=1}^k p_i f_i(x)

is also a valid density function.

Densities of the form in part (a) are called location-scale parameter densities,
with μ as a location and σ as a scale parameter. Densities of the form in part (b)
are known as finite mixtures. Both types are enormously useful in statistics.
Two other very familiar concepts in probability and statistics are those of sym-
metry and unimodality. Symmetry of a density function means that around some
point, the density has two halves that are exact mirror images of each other. Uni-
modality means that the density has just one peak point at some value. We give the
formal definitions.
Definition 1.25. A density function f(x) is called symmetric around a number M
if f(M + u) = f(M − u) ∀ u > 0. In particular, f(x) is symmetric around zero if
f(−u) = f(u) ∀ u > 0.

Definition 1.26. A density function f(x) is called strictly unimodal at (or around)
a number M if f(x) is increasing for x < M, and decreasing for x > M.

Example 1.35 (The Triangular Density). Consider the density function

f(x) = cx, 0 ≤ x ≤ 1/2;
     = c(1 − x), 1/2 ≤ x ≤ 1,

where c is a normalizing constant. It is easily verified that c = 4. This density
consists of two different linear segments on [0, 1/2] and [1/2, 1]. A plot of this density
(see Fig. 1.1) looks like a triangle, and it is called the triangular density on [0, 1]. Note
that it is symmetric and strictly unimodal.

Fig. 1.1 Triangular density on [0, 1]

Example 1.36 (The Double Exponential Density). We have previously seen the standard
exponential density on [0, ∞) defined as e^{−x}, x ≥ 0. We can extend this to the
negative real numbers by writing −x for x in the above formula; that is, simply
define the density to be e^{x} for x ≤ 0. Then, we have an overall function that equals

e^{−x} for x ≥ 0;
e^{x} for x ≤ 0.

This function integrates to

∫_0^∞ e^{−x} dx + ∫_{−∞}^0 e^{x} dx = 1 + 1 = 2.

So, if we use a normalizing constant of 1/2, then we get a valid density on the entire
real line:

f(x) = (1/2) e^{−x} for x ≥ 0;
f(x) = (1/2) e^{x} for x ≤ 0.

The two lines can be combined into one formula as

f(x) = (1/2) e^{−|x|}, −∞ < x < ∞.

This is the standard double exponential density, and is symmetric, unimodal, and
has a cusp at x = 0; see Fig. 1.2.

Example 1.37 (The Normal Density). The double exponential density tapers off
to zero at the linear exponential rate at both tails (i.e., as x → ±∞). If we force
the density to taper off at a quadratic exponential rate, then we will get a function
like e^{−ax²}, for some chosen a > 0. Although this is obviously nonnegative, and
also has a finite integral over the whole real line, it does not integrate to one. So we
need a normalizing constant to make it a valid density function. Densities of this
form are called normal densities, and occupy the central place among all distributions
in the theory and practice of probability and statistics. Gauss, while using the
method of least squares for analyzing astronomical data, used the normal distribution
to justify least squares methods; the normal distribution is also often called the
Gaussian distribution, although de Moivre and Laplace both worked with it before
Gauss.

Fig. 1.2 Standard double exponential density

Physical data on many types of variables approximately fit a normal distribution.
The theory of statistical methods is often best understood when the underlying
distribution is normal. The normal distributions have many unique properties not
shared by any other distribution. Because of all these reasons, the normal density,
also called the bell curve, is the most used, important, and well-studied distribution.
Let

f(x) = f(x | μ, σ) = c e^{−(x−μ)²/(2σ²)}, −∞ < x < ∞,

where c is a normalizing constant. The normalizing constant can be proved to be
equal to 1/(σ√(2π)). Thus, a normal density with parameters μ and σ is given by

f(x | μ, σ) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}, −∞ < x < ∞.

We write X ~ N(μ, σ²); we show later that the two parameters μ and σ² are
the mean and the variance of this distribution. Note that the N(μ, σ²) density is a
location-scale parameter density.

If μ = 0 and σ = 1, this simplifies to the formula

(1/√(2π)) e^{−x²/2}, −∞ < x < ∞,

and is universally denoted by the notation φ(x). It is called the standard normal
density. The standard normal density, then, is:

φ(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞.
Fig. 1.3 The standard normal density and the CDF

Consequently, the CDF of the standard normal density is the function
∫_{−∞}^x φ(t) dt. It is not possible to express the CDF in terms of the elementary
functions. It is standard practice to denote it by using the notation Φ(x), and to compute
it using widely available tables or software for a given x needed in a specific
application.

A plot of the standard normal density and its CDF is given in Fig. 1.3. Note the
bell shape of the density function φ(x).

1.10 Functions of a Continuous Random Variable

As for discrete random variables, we are often interested in the distribution of some
function g.X / of a continuous random variable X . For example, X could measure
the input into some production process, and g.X / could be a function that describes
the output. For one-to-one functions g.X /, one has the following important formula.
Theorem 1.30 (The Jacobian Formula). Let X have a continuous pdf f(x)
and a CDF F(x), and suppose Y = g(X) is a strictly monotone function of X with
a nonzero derivative. Then Y has the pdf

f_Y(y) = f(g^{−1}(y))/|g′(g^{−1}(y))|,

where y belongs to the range of g.


Example 1.38 (Simple Linear Transformations). Suppose X is any continuous random
variable with a pdf f(x) and let Y = g(X) be the linear function (a location
and scale change on X) g(X) = a + bX, b ≠ 0. This is obviously a strictly
monotone function, as b ≠ 0. Take b > 0. Then the inverse function of g is
g^{−1}(y) = (y − a)/b, and of course g′(x) ≡ b. Putting it all together, from the
theorem above,

f_Y(y) = f(g^{−1}(y))/|g′(g^{−1}(y))| = (1/b) f((y − a)/b);

in general, whether b is positive or negative, the formula is:

f_Y(y) = (1/|b|) f((y − a)/b).

Example 1.39 (From Exponential to Uniform). Suppose X has the standard exponential
density f(x) = e^{−x}, x ≥ 0. Let Y = g(X) = e^{−X}. Again, g(X) is a
strictly monotone function, and the inverse function is found as follows:

g(x) = e^{−x} = y ⟹ x = −log y = g^{−1}(y).

Also, g′(x) = −e^{−x};

⟹ f_Y(y) = f(g^{−1}(y))/|g′(g^{−1}(y))| = e^{−(−log y)}/|−e^{−(−log y)}| = y/y = 1, 0 ≤ y ≤ 1.

We have thus proved that if X has a standard exponential density, then Y = e^{−X} is
uniformly distributed on [0, 1].
There is actually nothing special about choosing X to be the standard exponen-
tial; the following important result says that what we saw in the above example is
completely general for all continuous random variables.

Theorem 1.31. Let X have a continuous CDF F(x). Consider the new random
variables Y = 1 − F(X) and Z = F(X). Then both Y, Z are distributed as
U[0, 1].

It is useful to remember this result in informal notation:

F(X) = U, and F^{−1}(U) = X.

The implication is a truly useful one. Suppose for purposes of computer experiments
we want to have computer-simulated values of some random variable X that has
some CDF F and the quantile function Q = F^{−1}. Then, all we need to do is
to have the computer generate U[0, 1] values, say u_1, u_2, ..., u_n, and use x_1 =
F^{−1}(u_1), x_2 = F^{−1}(u_2), ..., x_n = F^{−1}(u_n) as the set of simulated values for
our random variable of interest, namely X. Thus, the problem can be reduced to
simulation of uniform values, a simple task. The technique has so many uses that
there is a name for this particular function X = F^{−1}(U) of a uniform random
variable U.

Definition 1.27 (Quantile Transformation). Let U be a U[0, 1] random variable
and let F(x) be a continuous CDF. Then the function of U defined as X = F^{−1}(U)
is called the quantile transformation of U, and it has exactly the CDF F.

What we have shown here is that we can simply start with a U[0, 1] random
variable and convert it to any other continuous random variable X we want by
simply using a transformation of U, and that transformation is the quantile transformation.
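Here is a minimal sketch (ours, not from the text) of the quantile transformation in action: uniform draws pushed through F^{−1}(u) = −log(1 − u) produce standard exponential values:

    import numpy as np

    rng = np.random.default_rng(4)
    u = rng.uniform(size=200_000)

    # For the standard exponential, F(x) = 1 - e^{-x}, so F^{-1}(u) = -log(1 - u).
    x = -np.log(1 - u)

    print("sample mean and variance:", x.mean(), x.var())  # both should be near 1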

Example 1.40 (The Cauchy Distribution). The Cauchy density, like the normal
and the double exponential, is also symmetric and unimodal, but the properties are
very different. It is such an atypical density that we often think of the Cauchy density
first when we look for a counterexample to a conjecture. There is a very interest-
ing way to obtain a Cauchy density from a uniform density by using the quantile
transformation. We describe that derivation in this example.
Suppose a person holds a flashlight in her hand, and standing one foot away
from an infinitely long wall, points the beam of light in a random direction. Here,
by random direction, we mean that the point of landing of the light ray makes an
angle X with the individual (considered to be a straight line one foot long), and
this angle X ~ U[−π/2, π/2]. Let Y be the horizontal distance of the point at
which the light lands from the person, with Y being considered negative if the light
lands on the person's left, and positive if it lands on the person's right.

Then, by elementary trigonometry,

tan(X) = Y/1 ⟹ Y = tan(X).

Now g(X) = tan X is a strictly monotone function of X on (−π/2, π/2), and the
inverse function is g^{−1}(y) = arctan(y), −∞ < y < ∞. Also, g′(x) = 1 + tan² x.
Putting it all together,

f_Y(y) = (1/π) · 1/(1 + [tan(arctan y)]²) = 1/(π(1 + y²)), −∞ < y < ∞.

This is the standard Cauchy density.


The Cauchy density is particularly notorious for its heavy tail.

Example 1.41 (An Interesting Function that Is Not Strictly Monotone). Suppose X
has the standard normal density f(x) = (1/√(2π)) e^{−x²/2} on (−∞, ∞). We want to find
the density of Y = g(X) = X². However, we immediately realize that X² is not a
strictly monotone function on the whole real line (its graph is a parabola). Thus, the
general formula given above for densities of strictly monotone functions cannot be
applied in this problem. We attack the problem directly. Thus,

P(Y ≤ y) = P(X² ≤ y) = P(X² ≤ y, X > 0) + P(X² ≤ y, X < 0)
= P(0 < X ≤ √y) + P(−√y ≤ X < 0)
= F(√y) − F(0) + [F(0) − F(−√y)] = F(√y) − F(−√y),

where F is the CDF of X, that is, the standard normal CDF.

Inasmuch as we have obtained the CDF of Y, we now differentiate to get the pdf
of Y:

f_Y(y) = d/dy [F(√y) − F(−√y)] = f(√y)/(2√y) + f(−√y)/(2√y)

(by use of the chain rule)

= f(√y)/(2√y) + f(√y)/(2√y)

(because f is symmetric around zero, i.e., f(−u) = f(u) for any u)

= 2f(√y)/(2√y) = f(√y)/√y = e^{−y/2}/√(2πy),

y > 0. This is a very special density in probability and statistics, and is called the
chi-square density with one degree of freedom. We have thus proved that the square
of a standard normal random variable has a chi-square distribution with one degree
of freedom.
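A quick empirical check (ours, not from the text): squares of simulated standard normals should have CDF F(√y) − F(−√y), exactly as derived above:

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(5)
    y = rng.standard_normal(500_000) ** 2

    F = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF
    for t in (0.5, 1.0, 2.0):
        print(t, float(np.mean(y <= t)), F(sqrt(t)) - F(-sqrt(t)))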
There is an analogous Jacobian formula for transformations g.X / that are not
one-to-one. Basically, we need to break the problem up into disjoint intervals, on
each of which the function g is one-to-one, apply the usual Jacobian technique on
each such subinterval, and then piece them together. Here is the formula.
Theorem 1.32 (Density of a Nonmonotone Transformation). Let X have a
continuous pdf f(x) and let Y = g(X) be a transformation of X such that for a
given y, the equation g(x) = y has at most countably many roots, say x_1, x_2, ...,
where the x_i depend on the given y. Assume also that g has a nonzero derivative at
each x_i. Then, Y has the pdf

f_Y(y) = Σ_i f(x_i)/|g′(x_i)|.

1.10.1 Expectation and Moments

For discrete random variables, expectation was seen to be equal to Σ_x x P(X = x).
Of course, for continuous random variables, the analogous sum Σ_x x f(x) is not
defined. The correct definition of expectation for continuous random variables replaces
sums by integrals.

Definition 1.28. Let X be a continuous random variable with a pdf f(x). We say
that the expectation of X exists if ∫_{−∞}^∞ |x| f(x) dx < ∞, in which case the expectation,
or the expected value, or the mean of X is defined as

E(X) = μ = ∫_{−∞}^∞ x f(x) dx.

Suppose X is a continuous random variable with a pdf f(x) and Y = g(X) is a
function of X. If Y has a density, say f_Y(y), then we can compute the expectation
as ∫ y f_Y(y) dy, or as ∫ g(x) f(x) dx. Because Y need not always be a continuous
random variable just because X is, it may not in general have a density f_Y(y); but
the second expression is always applicable and correct.

Theorem 1.33. Let X be a continuous random variable with pdf f(x). Let g(X) be
a function of X. The expectation of g(X) exists if and only if ∫_{−∞}^∞ |g(x)| f(x) dx <
∞, in which case the expectation of g(X) is

E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx.

The definitions of moments and the variance remain the same as in the discrete case.

Definition 1.29. Let X be a continuous random variable with pdf f(x). Then the
kth moment of X is defined to be E(X^k), k ≥ 1. We say that the kth moment does
not exist if E(|X|^k) = ∞.

Corollary. Suppose X is a continuous random variable with pdf f(x). Then its
variance, provided it exists, is equal to

σ² = ∫_{−∞}^∞ (x − μ)² f(x) dx = ∫_{−∞}^∞ x² f(x) dx − μ².

One simple observation that saves calculations, but is sometimes overlooked, is the
following fact; the proof of it merely uses the integration result that the integral of
the product of an odd function and an even function on a symmetric interval is zero,
if the integral exists.

Proposition. Suppose X has a distribution symmetric around some number a; that
is, X − a and a − X have the same distribution. Then, E[(X − a)^{2k+1}] = 0, for
any k ≥ 0 for which the expectation E[(X − a)^{2k+1}] exists.

For example, if X has a distribution symmetric about zero, then any odd moment
(e.g., E(X), E(X³), etc.), provided it exists, must be zero. There is no need to
calculate it; it is automatically zero.

Example 1.42 (Area of a Random Triangle). Suppose an equilateral triangle is
constructed by choosing the common side length X to be uniformly distributed on
[0, 1]. We want to find the mean and the variance of the area of the triangle.

For a general triangle with sides a, b, c, the area equals

Area = √(s(s − a)(s − b)(s − c)),

where s = (a + b + c)/2. When all the side lengths are equal, say, to a, this reduces to
(√3/4)a². Therefore, in this example, we want the mean and variance of Y = (√3/4)X².
The mean is

E(Y) = (√3/4) E(X²) = (√3/4)(1/3) = 1/(4√3).

The variance equals

var(Y) = E(Y²) − [E(Y)]² = (3/16) E(X⁴) − 1/48 = (3/16)(1/5) − 1/48
= 3/80 − 1/48 = 1/60.
For the next example, we need the definition of the Gamma function. It is repeatedly
necessary for us to work with the Gamma function in this text.

Definition 1.30. The Gamma function is defined as

Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx, α > 0.

In particular,

Γ(n) = (n − 1)! for any positive integer n;
Γ(α + 1) = αΓ(α) ∀ α > 0;
Γ(1/2) = √π.

Example 1.43 (Moments of Exponential). Let X have the standard exponential
density. Then all its moments exist, and indeed,

E(X^n) = ∫_0^∞ x^n e^{−x} dx = Γ(n + 1) = n!.

In particular,

E(X) = 1; E(X²) = 2,

and therefore Var(X) = E(X²) − [E(X)]² = 2 − 1 = 1. Thus, the standard
exponential density has the same mean and variance.

Example 1.44 (Absolute Value of a Standard Normal). This is often required in calculations
in statistical theory. Let X have the standard normal distribution; we want
to find E(|X|). By definition,

E(|X|) = ∫_{−∞}^∞ |x| f(x) dx = (1/√(2π)) ∫_{−∞}^∞ |x| e^{−x²/2} dx = (2/√(2π)) ∫_0^∞ x e^{−x²/2} dx

(because |x| e^{−x²/2} is an even function of x on (−∞, ∞))

= (2/√(2π)) ∫_0^∞ [−d/dx (e^{−x²/2})] dx = (2/√(2π)) [−e^{−x²/2}]_0^∞
= 2/√(2π) = √(2/π).

Example 1.45 (A Random Variable Whose Expectation Does Not Exist). Consider
the standard Cauchy random variable with the density f(x) = 1/(π(1 + x²)),
−∞ < x < ∞. Recall that for E(X) to exist, we must have ∫_{−∞}^∞ |x| f(x) dx < ∞.
But,

∫_{−∞}^∞ |x| f(x) dx = (1/π) ∫_{−∞}^∞ |x|/(1 + x²) dx ≥ (1/π) ∫_0^∞ x/(1 + x²) dx
≥ (1/π) ∫_0^M x/(1 + x²) dx

(for any M < ∞)

= (1/(2π)) log(1 + M²),

and on letting M → ∞, we see that

∫_{−∞}^∞ |x| f(x) dx = ∞.

Therefore the expectation of a standard Cauchy random variable, or synonymously,
the expectation of a standard Cauchy distribution, does not exist.
Example 1.46 (Moments of the Standard Normal). In contrast to the standard
Cauchy, every moment of a standard normal variable exists. The basic reason is that
the tail of the standard normal density is very thin. A formal proof is as follows.

Fix k ≥ 1. Then,

|x|^k e^{−x²/2} = |x|^k e^{−x²/4} e^{−x²/4} ≤ C e^{−x²/4},

where C is a finite constant such that |x|^k e^{−x²/4} ≤ C for any real number x (such
a constant C does exist). Therefore,

∫_{−∞}^∞ |x|^k e^{−x²/2} dx ≤ C ∫_{−∞}^∞ e^{−x²/4} dx < ∞.

Hence, by definition, for any k ≥ 1, E(X^k) exists.

Now, take k to be an odd integer, say k = 2n + 1, n ≥ 0. Then,

E(X^k) = (1/√(2π)) ∫_{−∞}^∞ x^{2n+1} e^{−x²/2} dx = 0,

because x^{2n+1} is an odd function and e^{−x²/2} is an even function. Thus, every odd
moment of the standard normal distribution is zero.

Next, take k to be an even integer, say k = 2n, n ≥ 1. Then,

E(X^k) = (1/√(2π)) ∫_{−∞}^∞ x^{2n} e^{−x²/2} dx = (2/√(2π)) ∫_0^∞ x^{2n} e^{−x²/2} dx
= (2/√(2π)) ∫_0^∞ z^n e^{−z/2} (1/(2√z)) dz = (1/√(2π)) ∫_0^∞ z^{n−1/2} e^{−z/2} dz,

on making the substitution z = x².

Now make a further substitution u = z/2. Then we get

E(X^{2n}) = (1/√(2π)) ∫_0^∞ (2u)^{n−1/2} e^{−u} 2 du = (2^n/√π) ∫_0^∞ u^{n−1/2} e^{−u} du.

Now, we recognize ∫_0^∞ u^{n−1/2} e^{−u} du to be Γ(n + 1/2), and so we get the formula

E(X^{2n}) = 2^n Γ(n + 1/2)/√π, n ≥ 1.

By using the Gamma duplication formula

Γ(n + 1/2) = √π 2^{1−2n} (2n − 1)!/(n − 1)!,

this reduces to

E(X^{2n}) = (2n)!/(2^n n!), n ≥ 1.

1.10.2 Moments and the Tail of a CDF

We now describe methods to calculate moments of a random variable from its survival
function, namely F̄(x) = 1 − F(x). There are important relationships between
the existence of moments and the rapidity with which the survival function goes to
zero as |x| → ∞.

(a) Let X be a nonnegative random variable and suppose E(X) exists. Then

x F̄(x) = x[1 − F(x)] → 0, as x → ∞.

(b) Let X be a nonnegative random variable and suppose E(X) exists. Then
E(X) = ∫_0^∞ F̄(x) dx.
(c) Let X be a nonnegative random variable and suppose E(X^k) exists, where k ≥ 1
is a given positive integer. Then

x^k F̄(x) = x^k [1 − F(x)] → 0, as x → ∞.

(d) Let X be a nonnegative random variable and suppose E(X^k) exists. Then

E(X^k) = ∫_0^∞ (k x^{k−1})[1 − F(x)] dx.

(e) Let X be a general real-valued random variable and suppose E(X) exists. Then

x[1 − F(x) + F(−x)] → 0, as x → ∞.

(f) Let X be a general real-valued random variable and suppose E(X) exists. Then

E(X) = ∫_0^∞ [1 − F(x)] dx − ∫_{−∞}^0 F(x) dx.

Example 1.47 (Expected Value of the Minimum of Several Uniform Variables).
Suppose X_1, X_2, ..., X_n are independent U[0, 1] random variables, and let
m_n = min{X_1, X_2, ..., X_n} be their minimum. By virtue of the independence
of X_1, X_2, ..., X_n,

P(m_n > x) = P(X_1 > x, X_2 > x, ..., X_n > x) = Π_{i=1}^n P(X_i > x) = (1 − x)^n, 0 < x < 1,

and P(m_n > x) = 0 if x ≥ 1. Therefore, by the above theorem,

E(m_n) = ∫_0^∞ P(m_n > x) dx = ∫_0^1 P(m_n > x) dx = ∫_0^1 (1 − x)^n dx
= ∫_0^1 x^n dx = 1/(n + 1).

1.11 Moment-Generating Function and Fundamental Inequalities

The definition previously given of the moment-generating function of a random
variable is completely general. We work out a few examples for some continuous
random variables.

Example 1.48 (Moment-Generating Function of Standard Exponential). Let X have
the standard exponential density. Then,

E(e^{tX}) = ∫_0^∞ e^{tx} e^{−x} dx = ∫_0^∞ e^{−(1−t)x} dx = 1/(1 − t),

if t < 1, and it equals +∞ if t ≥ 1. Thus, the mgf of the standard exponential distribution
is finite if and only if t < 1. So, the moments can be found by differentiating
the mgf, namely, E(X^n) = ψ^{(n)}(0). Now, at any t < 1, by direct differentiation,
ψ^{(n)}(t) = n!/(1 − t)^{n+1} ⟹ E(X^n) = ψ^{(n)}(0) = n!, a result we have derived before
directly.

Example 1.49 (Moment-Generating Function of Standard Normal). Let X have the
standard normal density. Then,

E(e^{tX}) = (1/√(2π)) ∫_{−∞}^∞ e^{tx} e^{−x²/2} dx = [(1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)²/2} dx] · e^{t²/2}
= [(1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} dz] · e^{t²/2} = 1 · e^{t²/2} = e^{t²/2},

because (1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} dz is the integral of the standard normal density, and so
must be equal to one.

We have therefore proved that the mgf of the standard normal distribution exists
at any real t and equals ψ(t) = e^{t²/2}.
The mgf is useful in deriving inequalities on probabilities of tail values of a ran-
dom variable that have proved to be extremely useful in many problems in statistics
and probability. In particular, these inequalities typically give much sharper bounds
on the probability that a random variable would be far from its mean value than
Chebyshev’s inequality can give. Such probabilities are called large deviation prob-
abilities. We treat large deviations in detail in Chapter 17. We present a particular
large deviation inequality below and then present some neat applications.

Theorem 1.34 (Chernoff–Bernstein Inequality). Let X have the mgf ψ(t), and
assume that ψ(t) < ∞ for t < t_0 for some t_0, 0 < t_0 ≤ ∞. Let κ(t) = log ψ(t),
and for a real number x, define

I(x) = sup_{0<t<t_0} [tx − κ(t)].

Then,

P(X ≥ x) ≤ e^{−I(x)}.

See Bernstein (1927) and Chernoff (1952) for this inequality and other refinements
of it.

To apply the Chernoff–Bernstein inequality, it is necessary to be able to find the
mgf ψ(t) and then be able to find the function I(x), which is called the rate function
of X. We now show an example.

Example 1.50 (Testing the Bound in the Standard Normal Case). Suppose X is a
standard normal variable. Then the exact value of the probability P(X > x) =
1 − P(X ≤ x) = 1 − Φ(x) is easily computable, although no formula can be
written for it. The Chebyshev inequality gives, for x > 0,

P(X > x) = (1/2) P(|X| > x) ≤ 1/(2x²).

To apply the Chernoff–Bernstein bound, use the formula ψ(t) = e^{t²/2} ⟹ κ(t) =
t²/2 ⟹ I(x) = sup_{t>0} [tx − t²/2] = x²/2. Therefore,

P(X > x) ≤ e^{−I(x)} = e^{−x²/2}.

Obviously, the Chernoff–Bernstein bound is much smaller than the Chebyshev
bound for large x. There are numerous other moment inequalities on positive and
general real-valued random variables. They have a variety of uses in theoretical calculations.
We present a few fundamental moment inequalities that are special.
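The difference is striking numerically. The sketch below (ours, not from the text) tabulates the exact tail 1 − Φ(x), the Chebyshev bound 1/(2x²), and the Chernoff–Bernstein bound e^{−x²/2}:

    from math import erf, exp, sqrt

    Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF

    for x in (1.0, 2.0, 3.0, 4.0):
        exact = 1 - Phi(x)
        chebyshev = 1 / (2 * x * x)
        chernoff = exp(-x * x / 2)
        print(x, exact, chebyshev, chernoff)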

Theorem 1.35 (Jensen’s Inequality). Let X be a random variable with a finite


mean, and g.x/ W R ! R a convex function. Then g.E.X // E.g.X //:

Example 1.51. Let X be any random variable with a finite mean . Consider the
function g.x/ D e ax ; where a is a real number. Then, by the second derivative test,
g is a convex function on the entire real line, and therefore, by Jensen’s inequality

E.e aX /  e a :

Here are two other important moment inequalities.

Theorem 1.36. (a) (Lyapounov Inequality). Given a nonnegative random variable
X, and 0 < α < β,

(EX^α)^{1/α} ≤ (EX^β)^{1/β}.

(b) (Log Convexity Inequality of Lyapounov). Given a nonnegative random variable
X, and 0 ≤ α_1 < α_2 ≤ β/2,

EX^{α_1} EX^{β−α_1} ≥ EX^{α_2} EX^{β−α_2}.

We finish with an example of a paradox of expectations.

Example 1.52 (An Expectation Paradox). Suppose X, Y are two positive nonconstant
independent random variables, with the same distribution; for example, X, Y
could be independent variables with a uniform distribution on [5, 10]. We need
the assumption that the common distribution of X and Y is such that E(1/X) =
E(1/Y) < ∞.

Let R = X/Y. Then, by Jensen's inequality,

E(R) = E(X/Y) = E(X) E(1/Y) > E(X) · 1/E(Y) = 1.

So, we have proved that E(X/Y) > 1. But we can repeat exactly the same argument
to conclude that E(Y/X) > 1. So, we seem to have the paradoxical conclusion that
we expect X to be somewhat larger than Y, and we also expect Y to be somewhat
larger than X.

There are many other such examples of paradoxes of expectations.

1.11.1 * Inversion of an MGF and Post's Formula

The moment-generating function uniquely determines the distribution. Therefore,
in principle, if we knew the mgf of a distribution, we should be able to find the
distribution to which the mgf corresponds. In practice, this inversion is difficult,
and often we find the distribution corresponding to an mgf by inspection. There
are theoretical formulas for inverting an mgf. One of these formulas uses complex
variable methods, and the other uses only real variable methods. The latter, called
Post's inversion formula, requires that we can calculate all derivatives (at least of
large orders) of the given mgf. This may be impractical. However, with the use of
symbolic software, and efficient numerical means, the Post inversion formula may
be usable in some cases. It is given below; see Widder (1989, p. 463) for a proof.

Proposition. Let X be a nonnegative continuous random variable with density
f(x). Let ψ(t), t ≥ 0, be the one-sided mgf ψ(t) = E[e^{−tX}]. Suppose that f is
everywhere continuous. Then,

f(x) = lim_{k→∞} [(−1)^k/k!] (k/x)^{k+1} ψ^{(k)}(k/x), x > 0.

1.12 Some Special Continuous Distributions

A number of densities, by virtue of their popularity in modeling, or because of


their special theoretical properties, are considered to be special. We discuss, when
suitable, their moments, the form of the CDF, the mgf, shape, and modal properties,
and interesting inequalities. Classic references to standard continuous distributions
are Johnson et al. (1994), and Kendall and Stuart (1976); Everitt (1998) contains
many unusual facts.
Definition 1.31. Let X have the pdf

f(x) = 1/(b − a), a ≤ x ≤ b;
     = 0 otherwise,

where −∞ < a < b < ∞ are given real numbers.

Then we say that X has the uniform distribution on [a, b] and write X ~ U[a, b].
The basic properties of a uniform density are given next.
Theorem 1.37. (a) If X ~ U[0, 1], then a + (b − a)X ~ U[a, b], and if X ~
U[a, b], then (X − a)/(b − a) ~ U[0, 1].
(b) The CDF of the U[a, b] distribution equals:

F(x) = 0, x < a;
     = (x − a)/(b − a), a ≤ x ≤ b;
     = 1, x > b.

(c) The mgf of the U[a, b] distribution equals ψ(t) = (e^{tb} − e^{ta})/((b − a)t).
(d) The nth moment of the U[a, b] distribution equals

E(X^n) = (b^{n+1} − a^{n+1})/((b − a)(n + 1)).

(e) The mean and the variance of the U[a, b] distribution equal

μ = (a + b)/2; σ² = (b − a)²/12.
Example 1.53. A point is selected at random on the unit interval, dividing it into
two pieces with total length 1. Find the probability that the ratio of the length of the
shorter piece to the length of the longer piece is less than 1/4.

Let X ~ U[0, 1]; we want P(min{X, 1 − X}/max{X, 1 − X} < 1/4). This happens only if X < 1/5
or X > 4/5. Therefore, the required probability is P(X < 1/5) + P(X > 4/5) =
1/5 + 1/5 = 2/5.

We defined the standard exponential density in the previous section. We now


introduce the general exponential density. Exponential densities are used to model
waiting times (e.g., waiting times for an elevator or at a supermarket checkout) or
failure times (e.g., the time till the first failure of some equipment) or renewal times
(e.g., time elapsed between successive earthquakes at a location), and so on. The
exponential density also has some very interesting theoretical properties.
Definition 1.32. A nonnegative random variable X has the exponential distribution
with parameter λ > 0 if it has the pdf f(x) = (1/λ) e^{−x/λ}, x > 0. We write X ~
Exp(λ).

Here are the basic properties of an exponential density.

Theorem 1.38. Let X ~ Exp(λ). Then,
(a) X/λ ~ Exp(1).
(b) The CDF is F(x) = 1 − e^{−x/λ}, x > 0 (and zero for x ≤ 0).
(c) E(X^n) = λ^n n!, n ≥ 1.
(d) The mgf is ψ(t) = 1/(1 − λt), t < 1/λ.
Example 1.54 (Mean Is Larger Than Median for Exponential). Suppose X ~
Exp(4). What is the probability that X > 4?

Since X/4 ~ Exp(1),

P(X > 4) = P(X/4 > 1) = ∫_1^∞ e^{−x} dx = e^{−1} = .3679,

quite a bit smaller than 50%. This implies that the median of the distribution has to
be smaller than 4, where 4 is the mean. Indeed, the median is a number m such that
F(m) = 1/2 (the median is unique in this example) ⟹ 1 − e^{−m/4} = 1/2 ⟹ m =
4 log 2 = 2.77.

This phenomenon that the mean is larger than the median is quite typical of
distributions that have a long right tail, as does the exponential.
In general, if X ~ Exp(λ), the median of X is λ log 2.
Example 1.55 (Lack of Memory of the Exponential Distribution). The exponential
densities have a lack of memory property similar to the one we established for the
geometric distribution. Let X ~ Exp(λ), and let s, t be positive numbers. The lack
of memory property is that P(X > s + t | X > s) = P(X > t). So, suppose that X
is the waiting time for an elevator, and suppose that you have already waited s = 3
minutes. Then the probability that you have to wait another two minutes is the same
as the probability that you would have to wait two minutes if you had just arrived. This
is not true if the waiting time distribution is something other than an exponential.

The proof of the property is simple:

P(X > s + t | X > s) = P(X > s + t)/P(X > s) = e^{−(s+t)/λ}/e^{−s/λ}
= e^{−t/λ} = P(X > t).

Example 1.56 (The Weibull Distribution). Suppose X ~ Exp(1), and let Y = X^α,
where α > 0 is a constant. This is a strictly monotone function with the inverse
function g^{−1}(y) = y^{1/α}; thus, the density of Y is

f_Y(y) = f(y^{1/α})/|g′(y^{1/α})| = e^{−y^{1/α}}/(α y^{(α−1)/α}) = (1/α) y^{(1−α)/α} e^{−y^{1/α}}, y > 0.

This final answer can be made to look a little simpler by writing β = 1/α. If we do
so, the density becomes

β y^{β−1} e^{−y^β}, y > 0.

We can introduce an extra scale parameter λ akin to what we do for the exponential
case itself. In that case, we have the general two-parameter Weibull density

f(y | β, λ) = (β/λ) (y/λ)^{β−1} e^{−(y/λ)^β}, y > 0.

This is the Weibull density with parameters β, λ.


The exponential density is decreasing on [0, ∞). A generalization of the exponential
density with a mode usually at some strictly positive number m is the Gamma
distribution. It includes the exponential as a special case, and can be very skewed,
or even almost a bell-shaped density. We later show that it also arises naturally, as
the density of the sum of a number of independent exponential random variables.

Definition 1.33. A positive random variable X is said to have a Gamma distribution
with shape parameter α and scale parameter λ if it has the pdf

f(x | α, λ) = e^{−x/λ} x^{α−1}/(λ^α Γ(α)), x > 0, α, λ > 0;

we write X ~ G(α, λ). The Gamma density reduces to the exponential density with
mean λ when α = 1; for α < 1, the Gamma density is decreasing and unbounded,
whereas for large α, it becomes nearly a bell-shaped curve. A plot of some Gamma
densities in Fig. 1.4 reveals these features.
The basic facts about a Gamma distribution are given in the following theorem.

Theorem 1.39. (a) The CDF of the G(α, λ) density is the normalized incomplete
Gamma function

F(x) = γ(α, x/λ)/Γ(α),

where γ(α, x) = ∫_0^x e^{−t} t^{α−1} dt.

Fig. 1.4 Plot of Gamma densities with λ = 1 and α = .5, 1, 2, 6, 15

(b) The nth moment equals

E(X^n) = λ^n Γ(α + n)/Γ(α), n ≥ 1.

(c) The mgf equals

ψ(t) = (1 − λt)^{−α}, t < 1/λ.

(d) The mean and the variance equal

μ = αλ; σ² = αλ².

An important consequence of the mgf formula is the following result.

Corollary. Suppose X_1, X_2, ..., X_n are independent Exp(λ) variables. Then X_1 +
X_2 + ... + X_n ~ G(n, λ).

Proof. Since X_1, X_2, ..., X_n are independent, for t < 1/λ,

E(e^{t(X_1+X_2+...+X_n)}) = E(e^{tX_1} e^{tX_2} ... e^{tX_n}) = E(e^{tX_1}) E(e^{tX_2}) ... E(e^{tX_n})
= (1 − λt)^{−1} (1 − λt)^{−1} ... (1 − λt)^{−1} = (1 − λt)^{−n},

which agrees with the mgf of a G(n, λ) distribution, and therefore, by the distribution-determining
property of mgfs, it follows that X_1 + X_2 + ... + X_n ~ G(n, λ). ∎
Example 1.57 (The General Chi-Square Distribution). We saw in the previous
section that the distribution of the square of a standard normal variable is the chi-square
distribution with one degree of freedom. A natural question is: what is the
distribution of the sum of squares of several independent standard normal variables?
Although we do not yet have the technical tools necessary to derive this distribution,
it turns out that this distribution is in fact a Gamma distribution. Precisely, if
X_1, X_2, ..., X_m are m independent standard normal variables, then T = Σ_{i=1}^m X_i²
has a G(m/2, 2) distribution, and therefore has the density

f_m(t) = e^{−t/2} t^{m/2−1}/(2^{m/2} Γ(m/2)), t > 0.

This is called the chi-square density with m degrees of freedom, and arises in numerous
contexts in statistics and probability. We write T ~ χ²_m. From the general
formulas for the mean and variance of a Gamma distribution, we get that

Mean of a χ²_m distribution = m;
Variance of a χ²_m distribution = 2m.

The chi-square density is rather skewed for small m, but becomes approximately
bell-shaped when m gets large; we have seen this for general Gamma densities.
One especially important context in which the chi-square distribution arises is in
consideration of the distribution of the sample variance for iid normal observations.
The sample variance of a set of n random variables X_1, X_2, ..., X_n is defined as
s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)², where X̄ = (X_1 + ... + X_n)/n is the mean of X_1, ..., X_n. The
name sample variance derives from the following property.

Theorem 1.40. Suppose X_1, ..., X_n are independent with a common distribution
F having a finite variance σ². Then, for any n, E(s²) = σ².

Proof. First note the algebraic identity

Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n (X_i² − 2X_iX̄ + X̄²) = Σ_{i=1}^n X_i² − 2nX̄² + nX̄² = Σ_{i=1}^n X_i² − nX̄².

Therefore,

E(s²) = (1/(n − 1)) E[Σ_{i=1}^n X_i² − nX̄²] = (1/(n − 1)) [n(σ² + μ²) − n(σ²/n + μ²)] = σ². ∎

If, in particular, X_1, ..., X_n are iid N(μ, σ²), then the X_i − X̄ are also normally distributed,
each with mean zero. However, they are no longer independent. If we sum
their squares, then the sum of the squares will still be distributed as a chi square, but
there will be a loss of one degree of freedom, due to the fact that the X_i − X̄ are not
independent, even though the X_i are independent.

We state this important fact formally.

Theorem 1.41. Suppose X_1, ..., X_n are iid N(μ, σ²). Then Σ_{i=1}^n (X_i − X̄)²/σ² ~ χ²_{n−1}.
Example 1.58 (Inverse Gamma Distribution). Suppose X ~ G(α, λ). The distribution
of 1/X is called the inverse Gamma distribution. We derive its density.

Because Y = g(X) = 1/X is a strictly monotone function with the inverse function
g^{−1}(y) = 1/y, and because the derivative of g is g′(x) = −1/x², the density of
Y is

f_Y(y) = f(1/y)/|g′(1/y)| = [e^{−1/(λy)} y^{1−α}/(λ^α Γ(α))] (1/y²)
= e^{−1/(λy)} y^{−1−α}/(λ^α Γ(α)), y > 0.

The inverse Gamma density is extremely skewed for small values of α; furthermore,
the right tail is so heavy for small α that the mean does not exist if α ≤ 1. Inverse
Gamma distributions are quite popular in studies of economic inequality, reliability
problems, and as prior distributions in Bayesian statistics.
For continuous random variables that take values between 0 and 1, the most standard
family of densities is the family of Beta densities. Their popularity is due to
their analytic tractability, and due to the large variety of shapes that Beta densities
can take when the parameter values change. It is a generalization of the U[0, 1]
density.

Definition 1.34. X is said to have a Beta density with parameters α and β if it has
the density

f(x) = x^{α−1}(1 − x)^{β−1}/B(α, β), 0 ≤ x ≤ 1, α, β > 0,

where B(α, β) = Γ(α)Γ(β)/Γ(α + β). We write X ~ Be(α, β). An important point is
that, by its very notation, 1/B(α, β) must be the normalizing constant of the function
x^{α−1}(1 − x)^{β−1}; thus, another way to think of B(α, β) is that for any α, β > 0,

B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx.

This fact is repeatedly useful in the following.


Theorem 1.42. Let X ~ Be(α, β).
(a) The CDF equals

F(x) = B_x(α, β)/B(α, β),

where B_x(α, β) is the incomplete Beta function ∫_0^x t^{α−1}(1 − t)^{β−1} dt.
(b) The nth moment equals

E(X^n) = Γ(α + n)Γ(α + β)/(Γ(α + β + n)Γ(α)).

(c) The mean and the variance equal

μ = α/(α + β); σ² = αβ/((α + β)²(α + β + 1)).

(d) The mgf equals

ψ(t) = ₁F₁(α, α + β, t),

where ₁F₁(a, b, z) denotes the confluent hypergeometric function.
Example 1.59 (Square of a Beta). Suppose X has a Beta density. Then X² also
takes values in [0, 1], but it does not have a Beta density. To have a specific example,
suppose X ~ Be(7, 7). Then the density of Y = X² is

f_Y(y) = f(√y)/(2√y) = y³(1 − √y)⁶/(B(7, 7) · 2√y) = 6006 y^{5/2} (1 − √y)⁶, 0 ≤ y ≤ 1.

Clearly, this is not a Beta density.
In practical applications, certain types of random variables consistently exhibit
a long right tail, in the sense that a lot of small values are mixed with a few large
or excessively large values in the distributions of these random variables. Economic
variables such as wealth typically manifest such heavy tail phenomena. Other ex-
amples include sizes of oil fields, insurance claims, stock market returns, and river
height in a flood among others. The tails are sometimes so heavy that the random
variable may not even have a finite mean. Extreme value distributions are common
and increasingly useful models for such applications. A brief introduction to two
specific extreme value distributions is provided next. These two distributions are the
Pareto distribution and the Gumbel distribution. One peculiarity of semantics is that
the Gumbel distribution is often called the Gumbel law.
A random variable X is said to have the Pareto density with parameters θ and α
if it has the density

f(x) = αθ^α/x^{α+1}, x ≥ θ > 0, α > 0.

We write X ~ Pa(α, θ). The density is monotone decreasing. It may or may not
have a finite expectation, depending on the value of α. It never has a finite mgf in
any nonempty interval containing zero. The basic facts about a Pareto density are
given in the next result.

Theorem 1.43. Let X ~ Pa(α, θ).
(a) The CDF of X equals

F(x) = 1 − (θ/x)^α, x ≥ θ,

and zero for x < θ.
(b) The nth moment exists if and only if n < α, in which case

E(X^n) = αθ^n/(α − n).

(c) For α > 1, the mean exists; for α > 2, the variance exists. Furthermore, they
equal

E(X) = αθ/(α − 1); Var(X) = αθ²/((α − 1)²(α − 2)).
We next define the Gumbel law. A random variable X is said to have the Gumbel
density with parameters μ, σ if it has the density

f(x) = (1/σ) e^{−e^{−(x−μ)/σ}} e^{−(x−μ)/σ}, −∞ < x < ∞, −∞ < μ < ∞, σ > 0.

If μ = 0 and σ = 1, the density is called the standard Gumbel density. Thus, the
standard Gumbel density has the formula f(x) = e^{−e^{−x}} e^{−x}, −∞ < x < ∞. The
density converges extremely fast (at a superexponential rate) to zero at the left tail, but at
only a regular exponential rate at the right tail. Its relation to the density of the
maximum of a large number of independent normal variables makes it a special
density in statistics and probability; see Chapter 7. The basic facts about a Gumbel
density are collected together in the result below. All Gumbel distributions have a
finite mgf ψ(t) at any t. But no simple formula for it is possible.

Theorem 1.44. Let X have the Gumbel density with parameters μ, σ. Then,
(a) The CDF equals

F(x) = e^{−e^{−(x−μ)/σ}}, −∞ < x < ∞.

(b) E(X) = μ + γσ, where γ ≈ .577216 is the Euler constant.
(c) Var(X) = π²σ²/6.
(d) The mgf of X exists everywhere.

1.13 Normal Distribution and Confidence Interval for a Mean

Empirical data on many types of variables across disciplines tend to exhibit uni-
modality and only a small amount of skewness. It is quite common to use a normal
distribution as a model for such data. The normal distribution occupies the central
place among all distributions in probability and statistics. There is also the cen-
tral limit theorem, which says that the sum of many small independent quantities
approximately follows a normal distribution. By a combination of reputation, con-
venience, mathematical justification, empirical experience, and habit, the normal

distribution has become the most ubiquitous of all distributions. Detailed algebraic
properties can be seen in Rao (1973), Kendall and Stuart (1976), and Feller (1971).
Petrov (1975) is a masterly account of the role of the normal distribution in the limit
theorems of probability.
We have actually already defined a normal density. But let us recall the definition
here.

Definition 1.35. A random variable X is said to have a normal distribution with
parameters μ and σ² if it has the density

f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}, −∞ < x < ∞,

where μ can be any real number, and σ > 0. We write X ~ N(μ, σ²). If X ~
N(0, 1), we call it a standard normal variable.

The density of a standard normal variable is denoted as φ(x), and equals the
function

φ(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞,

and the CDF is denoted as Φ(x). Note that the standard normal density is symmetric
and unimodal about zero. The general N(μ, σ²) density is symmetric and unimodal
about μ.

By definition of a CDF,

Φ(x) = ∫_{−∞}^x φ(z) dz.

The CDF Φ(x) cannot be written in terms of the elementary functions, but can be
computed at a given value x, and tables of the values of Φ(x) are widely available.
For example, here are some selected values.

Example 1.60 (Standard Normal CDF at Selected Values).

 x     Φ(x)
−4     .00003
−3     .00135
−2     .02275
−1     .15866
 0     .5
 1     .84134
 2     .97725
 3     .99865
 4     .99997

Here are the most basic properties of a normal distribution.


Theorem 1.45. (a) If X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0, 1), and if Z ~
N(0, 1), then X = μ + σZ ~ N(μ, σ²).
In words, if X is any normal random variable, then its standardized version
is always a standard normal variable.
(b) If X ~ N(μ, σ²), then

P(X ≤ x) = Φ((x − μ)/σ) ∀ x.

In particular, P(X ≤ μ) = P(Z ≤ 0) = .5; that is, the median of X is μ.
(c) Every moment of any normal distribution exists, and the odd central moments
E[(X − μ)^{2k+1}] are all zero.
(d) If Z ~ N(0, 1), then

E(Z^{2k}) = (2k)!/(2^k k!), k ≥ 1.

(e) The mgf of the N(μ, σ²) distribution exists at all real t, and equals

ψ(t) = e^{tμ + t²σ²/2}.

(f) If X ~ N(μ, σ²),

E(X) = μ; var(X) = σ²; E(X³) = μ³ + 3μσ²;
E(X⁴) = μ⁴ + 6μ²σ² + 3σ⁴.

(g) If X ~ N(μ, σ²), then κ₁ = μ, κ₂ = σ², and κ_r = 0 ∀ r > 2, where κ_j is the
jth cumulant of X.
An important consequence of part (b) of this theorem is the following result.
Corollary. Let X ~ N(μ, σ²), and let 0 < α < 1. Let Z ~ N(0, 1). Suppose x_α is
the (1 − α)th quantile (also called percentile) of X, and z_α is the (1 − α)th quantile
of Z. Then

x_α = μ + σ z_α.
Example 1.61 (Setting a Thermostat). Suppose that when the thermostat is set at
d degrees Celsius, the actual temperature of a certain room is a normal random
variable with parameters μ = d and σ = .5.

If the thermostat is set at 75°C, what is the probability that the actual temperature
of the room will be below 74°C?

By standardizing to an N(0, 1) random variable,

P(X < 74) = P(Z < (74 − 75)/.5) = P(Z < −2) = .02275.

Next, what is the lowest setting of the thermostat that will maintain a temperature
of at least 72°C with a probability of .99?

We want to find the value of d that makes P(X ≥ 72) = .99 ⟹ P(X < 72) =
.01. Now, from a standard normal table, P(Z < −2.326) = .01. Therefore, we
want to find d that makes d + σ · (−2.326) = 72 ⟹ d − .5 × 2.326 = 72 ⟹ d =
72 + .5 × 2.326 = 73.16°C.

Example 1.62 (Rounding a Normal Variable). Suppose X ~ N(0, σ²) and suppose
the absolute value of X is rounded to the nearest integer. We have seen that the
expected value of |X| itself is σ√(2/π). How does rounding affect the expected
value?

Denote the rounded value of |X| by Y. Then, Y = 0 ⟺ |X| < .5; Y = 1 ⟺
.5 < |X| < 1.5; ..., and so on. Therefore,

E(Y) = Σ_{i=1}^∞ i P(i − 1/2 < |X| < i + 1/2)
= Σ_{i=1}^∞ i P(i − 1/2 < X < i + 1/2) + Σ_{i=1}^∞ i P(−i − 1/2 < X < −i + 1/2)
= 2 Σ_{i=1}^∞ i [Φ((i + 1/2)/σ) − Φ((i − 1/2)/σ)] = 2 Σ_{i=1}^∞ [1 − Φ((i − 1/2)/σ)],

on some manipulation.

For example, if σ = 1, then this equals 2 Σ_{i=1}^∞ [1 − Φ(i − 1/2)] = .76358,
whereas the unrounded |X| has the expectation √(2/π) = .79789. The effect of
rounding is not serious when σ = 1.

A plot of the expected value of Y and the expected value of |X| is shown in
Fig. 1.5 to study the effect of rounding.

Fig. 1.5 Expected value of rounded and unrounded |X| when X is N(0, σ²)

We can see that the effect of rounding is uniformly small. There is classic
literature on corrections needed in computing means, variances, and higher mo-
ments when data are rounded. These are known as Sheppard’s corrections. Kendall
and Stuart (1976) give a thorough treatment of these needed corrections.
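The two expected values are simple to compute directly; the sketch below (ours, not from the text) evaluates 2 Σ_{i≥1} [1 − Φ((i − 1/2)/σ)] and σ√(2/π) over a grid of σ, reproducing the .76358 versus .79789 comparison at σ = 1:

    from math import erf, sqrt, pi

    Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF

    def e_rounded(sigma, terms=500):
        # E(Y) = 2 * sum_{i>=1} [1 - Phi((i - 1/2)/sigma)], truncated at `terms`
        return 2 * sum(1 - Phi((i - 0.5) / sigma) for i in range(1, terms + 1))

    for sigma in (0.5, 1.0, 2.0, 5.0):
        print(sigma, e_rounded(sigma), sigma * sqrt(2 / pi))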

Example 1.63 (Lognormal Distribution). Lognormal distributions are common models in studies of economic variables, such as income and wealth, because they can adequately describe the skewness that one sees in data on such variables. If $X \sim N(\mu, \sigma^2)$, then the distribution of $Y = e^X$ is called a lognormal distribution with parameters $\mu, \sigma^2$. Note that the lognormal name can be confusing; a lognormal variable is not the logarithm of a normal variable. A better way to remember its meaning is log is normal.
$Y = e^X$ is a strictly monotone function of $X$; therefore, by the usual formula for the density of a monotone function, $Y$ has the pdf
$$f_Y(y) = \frac{1}{y\sigma\sqrt{2\pi}}\, e^{-\frac{(\log y - \mu)^2}{2\sigma^2}}, \quad y > 0;$$
this is called the lognormal density with parameters $\mu, \sigma^2$. A lognormal variable is defined as $e^X$ for a normal variable $X$; thus its mean and variance are easily found from the mgf of a normal variable. A simple calculation shows that
$$E(Y) = e^{\mu + \frac{\sigma^2}{2}}; \quad \mathrm{Var}(Y) = (e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}.$$
One of the main reasons for the popularity of the lognormal distribution is its skewness; the lognormal density is extremely skewed for large values of $\sigma$. The coefficient of skewness has the formula
$$\beta = (2 + e^{\sigma^2})\sqrt{e^{\sigma^2} - 1} \to \infty, \quad \text{as } \sigma \to \infty.$$
Note that the lognormal densities do not have a finite mgf at any $t > 0$, although all their moments are finite. The lognormal is also the only standard continuous distribution that is not determined by its moments. That is, there exist other distributions besides the lognormal all of whose moments exactly coincide with the moments of a given lognormal distribution. This is not true of any other distribution with a name that we have come across in this chapter. For example, the normal and the Poisson distributions are both determined by their moments.
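The mean and variance formulas are easy to check by simulation; here is a short Python sketch (numpy assumed available; the particular values of $\mu$ and $\sigma$ are arbitrary illustrations):

    # Illustrative check of the lognormal mean/variance formulas
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 1.0, 0.5
    y = np.exp(rng.normal(mu, sigma, size=1_000_000))
    print(y.mean(), np.exp(mu + sigma**2 / 2))                      # both ~ 3.08
    print(y.var(), (np.exp(sigma**2) - 1) * np.exp(2*mu + sigma**2))  # both ~ 2.70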
We had remarked in the above that sums of many independent variables tend to
be approximately normally distributed. A precise version of this is the central limit
theorem, which we study in the next section. What is interesting is that sums of any
number of independent normal variables are exactly normally distributed. Here is
the result.

Theorem 1.46. Let $X_1, X_2, \ldots, X_n$, $n \ge 2$, be independent random variables, with $X_i \sim N(\mu_i, \sigma_i^2)$. Let $S_n = \sum_{i=1}^n X_i$. Then,
$$S_n \sim N\left(\sum_{i=1}^n \mu_i,\ \sum_{i=1}^n \sigma_i^2\right).$$

An important consequence is the following result.

Corollary. Suppose $X_i$, $1 \le i \le n$, are independent, and each distributed as $N(\mu, \sigma^2)$. Then $\bar{X} = \frac{S_n}{n} \sim N(\mu, \frac{\sigma^2}{n})$.

The theorem above implies that any linear function of independent normal variables is also normal:
$$\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i\mu_i,\ \sum_{i=1}^n a_i^2\sigma_i^2\right).$$

Example 1.64 (Confidence Interval and Margin of Error). Suppose some random variable $X \sim N(\mu, \sigma^2)$, and we have $n$ independent observations $X_1, X_2, \ldots, X_n$ on this variable $X$; another way to put it is that $X_1, X_2, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. Therefore, $\bar{X} \sim N(\mu, \sigma^2/n)$, and we have
$$P(\bar{X} - 1.96\sigma/\sqrt{n} \le \mu \le \bar{X} + 1.96\sigma/\sqrt{n})$$
$$= P(-1.96\sigma/\sqrt{n} \le \bar{X} - \mu \le 1.96\sigma/\sqrt{n})$$
$$= P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = \Phi(1.96) - \Phi(-1.96) = .95,$$
from a standard normal table.
Thus, with a 95% probability, for any $n$, $\mu$ is between $\bar{X} \pm 1.96\sigma/\sqrt{n}$. Statisticians call the interval of values $\bar{X} \pm 1.96\sigma/\sqrt{n}$ a 95% confidence interval for $\mu$, with a margin of error $1.96\sigma/\sqrt{n}$.
A tight confidence interval will correspond to a small margin of error. For example, if we want a margin of error $\le .1$, then we need $1.96\sigma/\sqrt{n} \le .1 \Leftrightarrow \sqrt{n} \ge 19.6\sigma \Leftrightarrow n \ge 384.16\sigma^2$. Statisticians call such a calculation a sample size calculation.
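Both the interval and the sample size calculation are one-line computations; here is a hedged Python sketch (numpy assumed; the inputs to the two functions are illustrative, not data from the text):

    # Illustrative 95% confidence interval and sample size calculation
    import numpy as np

    def ci_95(xbar, sigma, n):
        margin = 1.96 * sigma / np.sqrt(n)
        return xbar - margin, xbar + margin

    def n_for_margin(sigma, margin):
        # need 1.96*sigma/sqrt(n) <= margin  <=>  n >= (1.96*sigma/margin)^2
        return int(np.ceil((1.96 * sigma / margin) ** 2))

    print(ci_95(10.0, 1.0, 25))     # (9.608, 10.392)
    print(n_for_margin(1.0, 0.1))   # 385, matching n >= 384.16 sigma^2 with sigma = 1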

1.14 Stein’s Lemma

In 1981, Charles Stein gave a simple lemma for a normal distribution, and extended it to the case of a finite number of independent normal variables, which seems innocuous on its face, but has proved to be a remarkably powerful tool in numerous areas of statistics. It has also had its technical influence on the area of Poisson approximations, which we briefly discussed in this chapter. We present the basic lemma and its extension to the case of several independent variables, and show some applications. It would not be possible to give more than a small glimpse of the applications of Stein's lemma here; the applications are too varied. Regrettably, no comprehensive book or review of the various applications of Stein's lemma is available at this time. The original article is Stein (1981); Wasserman (2006) and Diaconis and Zabell (1991) are two of the best sources to learn more about Stein's lemma.
Theorem 1.47. (a) Let $X \sim N(\mu, \sigma^2)$, and suppose $g: \mathcal{R} \to \mathcal{R}$ is such that $g$ is differentiable at all but at most a finite number of points, and
(i) for some $\tau > \sigma$, $g(x)e^{-\frac{x^2}{2\tau^2}} \to 0$ as $x \to \pm\infty$;
(ii) $E[|g'(X)|] < \infty$.
Then,
$$E[(X - \mu)g(X)] = \sigma^2 E[g'(X)].$$
(b) Let $X_1, X_2, \ldots, X_k$ be independent $N(\mu_i, \sigma^2)$ variables, and suppose $g: \mathcal{R}^k \to \mathcal{R}$ is such that $g$ has a partial derivative with respect to each $x_i$ at all but at most a finite number of points. Then,
$$E[(X_i - \mu_i)g(X_1, X_2, \ldots, X_k)] = \sigma^2 E\left[\frac{\partial}{\partial X_i} g(X_1, X_2, \ldots, X_k)\right].$$

Proof. We prove part (a). By the definition of expectation, and by using integration by parts,
$$E[(X - \mu)g(X)] = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty}(x - \mu)g(x)e^{-(x-\mu)^2/(2\sigma^2)}\,dx$$
$$= -\sigma^2\,\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} g(x)\frac{d}{dx}e^{-(x-\mu)^2/(2\sigma^2)}\,dx$$
$$= -\sigma^2\,\frac{1}{\sigma\sqrt{2\pi}}\,g(x)e^{-(x-\mu)^2/(2\sigma^2)}\Big|_{-\infty}^{\infty} + \sigma^2\,\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} g'(x)e^{-(x-\mu)^2/(2\sigma^2)}\,dx$$
$$= -\sigma^2\,\frac{1}{\sigma\sqrt{2\pi}}\left[g(x)e^{-\frac{x^2}{2\tau^2}}\,e^{-\left(\frac{1}{2\sigma^2} - \frac{1}{2\tau^2}\right)x^2 + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}}\right]\Big|_{-\infty}^{\infty} + \sigma^2\,\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} g'(x)e^{-(x-\mu)^2/(2\sigma^2)}\,dx$$
$$= 0 + \sigma^2 E[g'(X)] = \sigma^2 E[g'(X)],$$
because by assumption
$$g(x)e^{-\frac{x^2}{2\tau^2}} \to 0, \quad \text{as } x \to \pm\infty,$$
and
$$e^{-\left(\frac{1}{2\sigma^2} - \frac{1}{2\tau^2}\right)x^2 + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}} \ \text{is uniformly bounded in } x. \quad \square$$

The principal applications of Stein's lemma are in statistical theory. Here, we show a simple application.

Example 1.65. Suppose $X \sim N(\mu, \sigma^2)$, and $g(x)$ is a differentiable and uniformly bounded function with a bounded derivative; that is, $|g(x)| \le M < \infty \ \forall x$, and $|g'(x)| \le C < \infty \ \forall x$. By Stein's lemma,
$$E(X + g(X) - \mu)^2 = E[(X - \mu)^2 + g^2(X) + 2(X - \mu)g(X)]$$
$$= E[(X - \mu)^2] + E[g^2(X)] + 2E[(X - \mu)g(X)]$$
$$= \sigma^2 + E[g^2(X)] + 2\sigma^2 E[g'(X)] \le \sigma^2 + M^2 + 2C\sigma^2,$$
because $g^2(x) \le M^2 \ \forall x$, and $g'(x) \le C \ \forall x$.
Therefore, by applying Markov's inequality,
$$P(|X + g(X) - \mu| \ge k) \le \frac{E[(X + g(X) - \mu)^2]}{k^2} \le \frac{\sigma^2(1 + 2C) + M^2}{k^2}.$$
An example of such a function $g(x)$ would be $g(x) = \frac{cx}{1 + x^2}$, where $c$ is any real number. Its maximum absolute value is $M = |c|/2$, and its derivative is bounded by $C = |c|$. Plugging into the inequality above,
$$P\left(\left|\left(1 + \frac{c}{1 + X^2}\right)X - \mu\right| \ge k\right) \le \frac{(1 + 2|c|)\sigma^2 + c^2/4}{k^2}.$$
Functions of the general form $(1 + \frac{c}{1 + X^2})X$ are of interest in statistical theory.
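Stein's identity itself is easy to verify by Monte Carlo for this $g$; the following Python sketch (numpy assumed; the constants $\mu, \sigma, c$ are arbitrary) compares the two sides of $E[(X - \mu)g(X)] = \sigma^2 E[g'(X)]$:

    # Monte Carlo sanity check of Stein's lemma with g(x) = c*x/(1 + x^2)
    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, c = 2.0, 1.5, 3.0
    x = rng.normal(mu, sigma, size=2_000_000)
    g = c * x / (1 + x**2)
    gprime = c * (1 - x**2) / (1 + x**2) ** 2   # derivative of g
    print(np.mean((x - mu) * g))                # left side of the identity
    print(sigma**2 * np.mean(gprime))           # right side; the two agree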

1.15 * Chernoff's Variance Inequality

In 1981, Herman Chernoff gave a proof of an inequality for the normal distribution that essentially says that a smoothing operation, such as integration, is going to reduce the variance of a function. The inequality has since been extensively generalized; see Chernoff (1981) for the original result. We present this inequality, but with a different proof, which works more generally.

Theorem 1.48. Let $X \sim N(0,1)$, and let $g: \mathcal{R} \to \mathcal{R}$ be a function such that $g$ is once continuously differentiable, and $E[(g'(X))^2] < \infty$. Then,
$$\mathrm{Var}(g(X)) \le E[(g'(X))^2],$$
with equality holding if and only if $g(X)$ is a linear function, $g(X) = a + bX$ for some $a, b$.

Proof. We need to use the fact that for any random variable $Y$, $[E(Y)]^2 \le E(Y^2)$; we choose the variable $Y$ suitably in the proof below. By the fundamental theorem of calculus,
$$(g(x) - g(0))^2 = \left(\int_0^x g'(t)\,dt\right)^2 = x^2\left(\frac{1}{x}\int_0^x g'(t)\,dt\right)^2$$
$$\le x^2\left(\frac{1}{x}\int_0^x [g'(t)]^2\,dt\right) = x\int_0^x [g'(t)]^2\,dt,$$
by identifying $Y$ here with $g'$ of a uniform random variable on $[0, x]$. Therefore, using the fact that $x\phi(x) = -\phi'(x)$,
$$\mathrm{Var}\,g(X) = E[g(X) - E(g(X))]^2 \le E[g(X) - g(0)]^2$$
$$\le \int_{-\infty}^{\infty}\left(\int_0^x (g'(t))^2\,dt\right)x\phi(x)\,dx$$
$$= \int_{-\infty}^{\infty}\left(\int_0^x (g'(t))^2\,dt\right)(-\phi'(x))\,dx = \int_{-\infty}^{\infty}[g'(x)]^2\phi(x)\,dx$$
(by integration by parts)
$$= E[(g'(X))^2]. \quad \square$$

Example 1.66. Let $X \sim N(0,1)$. As a simple example, let $g(X) = (X - a)^2$, where $a$ is a general constant. With some algebra, the exact variance of $g(X)$ can be found; we use Chernoff's inequality to find an upper bound on the variance. Clearly, $g'(x) = 2(x - a)$, and so, by Chernoff's inequality, $\mathrm{Var}[(X - a)^2] \le E[4(X - a)^2] = 4(1 + a^2)$.

Example 1.67. Consider a general cubic polynomial $g(X) = a + bX + cX^2 + dX^3$, and suppose that $X \sim N(0,1)$. The derivative of $g$ is $g'(x) = 3dx^2 + 2cx + b \Rightarrow (g'(x))^2 = 9d^2x^4 + 12cdx^3 + (4c^2 + 6bd)x^2 + 4bcx + b^2$. Because $X$ is standard normal, $E(X) = E(X^3) = 0$, $E(X^2) = 1$, $E(X^4) = 3$. Thus, by Chernoff's inequality, for a general cubic polynomial,
$$\mathrm{Var}(a + bX + cX^2 + dX^3) \le 27d^2 + 4c^2 + b^2 + 6bd.$$
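A quick simulation shows how tight this bound is for particular coefficients; in the Python sketch below (numpy assumed; the coefficients are arbitrary illustrations), the simulated variance stays below the Chernoff bound:

    # Compare simulated Var(a + bX + cX^2 + dX^3), X ~ N(0,1), with the bound
    import numpy as np

    rng = np.random.default_rng(2)
    a, b, c, d = 1.0, 2.0, 0.5, 0.3
    x = rng.normal(size=1_000_000)
    g = a + b * x + c * x**2 + d * x**3
    print(g.var())                                   # ~ 9.45 here
    print(27 * d**2 + 4 * c**2 + b**2 + 6 * b * d)   # bound: 11.03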

1.16 * Various Characterizations of Normal Distributions

Normal distributions possess a huge number of characterization properties, that is, properties that are not satisfied by any other distribution in large classes of distributions, and sometimes in the class of all distributions. It is not possible to document many of these characterizing properties here; we present a subjective selection of some elegant characterizations of normal distributions. The proofs of most of these characterizing properties are entirely nontrivial, and we do not give the proofs here. This section primarily has reference value. The most comprehensive reference on characterizations of the normal distribution is Kagan, Linnik, and Rao (1973).

Theorem 1.49 (Cramér–Lévy Theorem). Suppose for some given $n \ge 2$, and independent random variables $X_1, X_2, \ldots, X_n$, $S_n = X_1 + X_2 + \cdots + X_n$ has a normal distribution with some mean and some variance. Then each $X_i$ is necessarily normally distributed.

Theorem 1.50 (Chi-Square Distribution of Sample Variance). Suppose $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables. Then
$$\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1}.$$
Conversely, for given $n \ge 2$, if $X_1, X_2, \ldots, X_n$ are independent variables with some common distribution, symmetric about the mean $\mu$, and having a finite variance $\sigma^2$, and if
$$\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1},$$
then the common distribution of the $X_i$ must be $N(\mu, \sigma^2)$ for some $\mu$.

Theorem 1.51 (Independence of Sample Mean and Sample Variance). Suppose $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables. Then, for any $n \ge 2$, $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ are independent. Conversely, for given $n \ge 2$, if $X_1, X_2, \ldots, X_n$ are independent variables with some common distribution, and if $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ are independent, then the common distribution of the $X_i$ must be $N(\mu, \sigma^2)$ for some $\mu, \sigma^2$.

Theorem 1.52 (Independence and Spherical Symmetry). Suppose for given $n \ge 2$, $X_1, X_2, \ldots, X_n$ are independent variables with a common density $f(x)$. If the product $f(x_1)f(x_2)\cdots f(x_n)$ is a spherically symmetric function, that is, if $f(x_1)f(x_2)\cdots f(x_n) = g(x_1^2 + x_2^2 + \cdots + x_n^2)$ for some function $g$, then $f(x)$ must be the density of $N(0, \sigma^2)$ for some $\sigma^2$.

Theorem 1.53 (Independence of Sum and Difference). Suppose $X, Y$ are independent random variables, each with a finite variance. Then $X + Y$ and $X - Y$ are independent if and only if $X, Y$ are normally distributed with an equal variance.

Theorem 1.54 (Independence of Two Linear Functions). Suppose $X_1, X_2, \ldots, X_n$ are independent variables, and $L_1 = \sum_{i=1}^n a_i X_i$, $L_2 = \sum_{i=1}^n b_i X_i$ are two different linear functions. If $L_1, L_2$ are independent, then for each $i$ for which the product $a_i b_i \ne 0$, $X_i$ must be normally distributed.

Theorem 1.55 (Characterization Through Stein's Lemma). Let $X$ have a finite mean $\mu$ and a finite variance $\sigma^2$. If $E[(X - \mu)g(X)] = \sigma^2 E[g'(X)]$ for every differentiable function $g$ with $E[|g'(X)|] < \infty$, then $X$ must be distributed as $N(\mu, \sigma^2)$.

Theorem 1.56 (Characterization Through Chernoff's Inequality). Suppose $X$ is a continuous random variable with a finite variance $\sigma^2$. Let
$$\mathcal{G} = \{g : g \text{ is continuously differentiable},\ E[(g'(X))^2] < \infty\}.$$
Let
$$B(g) = \frac{\mathrm{Var}(g(X))}{\sigma^2 E[(g'(X))^2]}.$$
If $\sup_{g \in \mathcal{G}} B(g) = 1$, then $X$ must be distributed as $N(\mu, \sigma^2)$ for some $\mu$.

1.17 Normal Approximations and Central Limit Theorem

Many of the special discrete and special continuous distributions that we have
discussed can be well approximated by a normal distribution, for suitable config-
urations of their underlying parameters. Typically, the normal approximation works
well when the parameter values are such that the skewness of the distribution is
small. For example, binomial distributions are well approximated by a normal when
n is large and p is not too small or too large. Gamma distributions are well approximated by a normal when the shape parameter $\alpha$ is large. There is a unifying
mathematical result here. The unifying mathematical result is one of the most im-
portant results in all of mathematics, and is called the central limit theorem. The
subject of central limit theorems is incredibly diverse. In this section, we present the
basic or the canonical central limit theorem, and present its applications to certain
problems with which we are already familiar. Among numerous excellent references
on central limit theorems, we recommend Feller (1968, 1971) for lucid exposition
and examples. The subject of central limit theorems also has a really interesting his-
tory; we recommend Le Cam (1986) and Stigler (1986) for reading some history
of the central limit theorem. Careful and comprehensive mathematical treatment is
available in Hall (1992) and Bhattacharya and Rao (1986). For a diverse selection
of examples, see DasGupta (2008).

Theorem 1.57 (Central Limit Theorem). For $n \ge 1$, let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables, each having the same distribution, and suppose this common distribution, say $F$, has a finite mean $\mu$ and a finite variance $\sigma^2$. Let $S_n = X_1 + X_2 + \cdots + X_n$, $\bar{X} = \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$. Then, as $n \to \infty$,

(a) $P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\right) \to \Phi(x) \ \forall\, x \in \mathcal{R}$;
(b) $P\left(\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \le x\right) \to \Phi(x) \ \forall\, x \in \mathcal{R}$.

In words, for large $n$,
$$S_n \approx N(n\mu, n\sigma^2); \qquad \bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right).$$
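The theorem is easy to see in action numerically. The following Python sketch (numpy assumed; the choice of the Exponential(1) distribution, for which $\mu = \sigma = 1$, is ours purely for illustration) compares a simulated CDF value of the standardized mean with $\Phi(x)$:

    # Illustrative simulation of the CLT for Exponential(1) variables
    import numpy as np

    rng = np.random.default_rng(3)
    n, reps = 50, 200_000
    x = rng.exponential(1.0, size=(reps, n))
    z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0   # standardized means
    print(np.mean(z <= 1.0))                        # close to Phi(1) = .8413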

A very important case in which the general central limit theorem applies is the binomial distribution. The CLT allows us to approximate clumsy binomial probabilities involving large factorials by simple and accurate normal approximations. We first give the exact result on normal approximation of the binomial.

Theorem 1.58 (de Moivre–Laplace Central Limit Theorem). Let $X = X_n \sim \mathrm{Bin}(n, p)$. Then, for any fixed $p$ and $x \in \mathcal{R}$,
$$P\left(\frac{X - np}{\sqrt{np(1-p)}} \le x\right) \to \Phi(x),$$
as $n \to \infty$.

The de Moivre–Laplace CLT tells us that if $X \sim \mathrm{Bin}(n, p)$, then we can approximate a probability of the type $P(X \le k)$ as
$$P(X \le k) = P\left(\frac{X - np}{\sqrt{np(1-p)}} \le \frac{k - np}{\sqrt{np(1-p)}}\right) \approx \Phi\left(\frac{k - np}{\sqrt{np(1-p)}}\right).$$

Note that, in applying the normal approximation in the binomial case, we are using a continuous distribution to approximate a discrete distribution taking only integer values. The quality of the approximation improves, sometimes dramatically, if we fill up the gaps between the successive integers. That is, pretend that an event of the form $X = x$ really corresponds to $x - \frac{1}{2} \le X \le x + \frac{1}{2}$. In that case, in order to approximate $P(X \le k)$, we in fact expand the domain of the event to $k + \frac{1}{2}$, and approximate $P(X \le k)$ as
$$P(X \le k) \approx \Phi\left(\frac{k + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right).$$
This adjusted normal approximation is called normal approximation with a continuity correction. Continuity correction should always be done while computing a normal approximation to a binomial probability. Here are the continuity-corrected normal approximation formulas for easy reference:
$$P(X \le k) \approx \Phi\left(\frac{k + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right);$$
$$P(m \le X \le k) \approx \Phi\left(\frac{k + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{m - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right).$$

Example 1.68 (Coin Tossing). This is the simplest example of a normal approximation of binomial probabilities. We solve a number of problems by applying the normal approximation method.
First, suppose a fair coin is tossed 100 times. What is the probability that we obtain between 45 and 55 heads? Denoting $X$ as the number of heads obtained in 100 tosses, $X \sim \mathrm{Bin}(n, p)$, with $n = 100$, $p = .5$, so that $np(1-p) = 25$. Therefore, by using the continuity-corrected normal approximation,
$$P(45 \le X \le 55) \approx \Phi\left(\frac{55.5 - 50}{\sqrt{25}}\right) - \Phi\left(\frac{44.5 - 50}{\sqrt{25}}\right) = \Phi(1.1) - \Phi(-1.1) = .8643 - .1357 = .7286.$$
So, the probability that the percentage of heads is between 45% and 55% is high, but not really high, if we toss the coin 100 times. Here is the next question. How many times do we need to toss a fair coin to be 99% sure that the percentage of heads will be between 45% and 55%? The percentage of heads is between 45% and 55% if and only if the number of heads is between $.45n$ and $.55n$. Using the continuity-corrected normal approximation again, we want
$$.99 = \Phi\left(\frac{.55n + .5 - .5n}{\sqrt{.25n}}\right) - \Phi\left(\frac{.45n - .5 - .5n}{\sqrt{.25n}}\right)$$
$$\Rightarrow .99 = 2\Phi\left(\frac{.55n + .5 - .5n}{\sqrt{.25n}}\right) - 1$$
(because, for any real number $x$, $\Phi(x) - \Phi(-x) = 2\Phi(x) - 1$)
$$\Rightarrow \Phi\left(\frac{.05n + .5}{\sqrt{.25n}}\right) = .995.$$
Now, from a standard normal table, we find that $\Phi(2.575) = .995$. Therefore, we equate
$$\frac{.05n + .5}{\sqrt{.25n}} = 2.575 \Rightarrow .05n + .5 = 2.575 \times .5\sqrt{n} = 1.2875\sqrt{n}.$$
Writing $\sqrt{n} = x$, we have here a quadratic equation $.05x^2 - 1.2875x + .5 = 0$ to solve. The root we want is $x = 25.36$, and squaring it gives $n \approx (25.36)^2 = 642.9$. Thus, an approximate value of $n$ such that in $n$ tosses of a fair coin, the percentage of heads will be between 45% and 55% with a 99% probability is $n = 643$. Most people find that the value of $n$ needed is higher than what they would have guessed.
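The first calculation can be checked against the exact binomial answer; the continuity-corrected approximation is remarkably close. A Python sketch (scipy assumed available):

    # Exact binomial probability versus the continuity-corrected approximation
    from scipy.stats import binom, norm
    import numpy as np

    n, p = 100, 0.5
    exact = binom.cdf(55, n, p) - binom.cdf(44, n, p)
    approx = norm.cdf((55.5 - 50) / 5) - norm.cdf((44.5 - 50) / 5)
    print(exact, approx)   # both approximately .729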
Example 1.69 (Random Walk). The theory of random walks is one of the most beautiful areas of probability. Here, we give an introductory example that makes use of the normal approximation to a binomial.
Suppose a drunkard is standing at time zero (say 11:00 PM) at some point, and every second he either moves one step to the right or one step to the left, with equal probability, of where he is at that time. What is the probability that after two minutes he will be ten or more steps away from where he started? Note that the drunkard will take 120 steps in 2 minutes.
Let the drunkard's movement at the $i$th step be denoted as $X_i$; then $P(X_i = \pm 1) = .5$. So we can think of $X_i$ as $X_i = 2Y_i - 1$, where $Y_i \sim \mathrm{Ber}(.5)$, $1 \le i \le n = 120$. If we assume that the drunkard's successive movements $X_1, X_2, \ldots$ are independent, then $Y_1, Y_2, \ldots$ are also independent, and so $S_n = Y_1 + Y_2 + \cdots + Y_n \sim \mathrm{Bin}(n, .5)$. Furthermore,
$$|X_1 + X_2 + \cdots + X_n| \ge 10 \Leftrightarrow |2(Y_1 + Y_2 + \cdots + Y_n) - n| \ge 10.$$
So, we want to find
$$P(|2(Y_1 + Y_2 + \cdots + Y_n) - n| \ge 10) = P\left(S_n \le \frac{n}{2} - 5\right) + P\left(S_n \ge \frac{n}{2} + 5\right)$$
$$= P\left(\frac{S_n - \frac{n}{2}}{\sqrt{.25n}} \le -\frac{5}{\sqrt{.25n}}\right) + P\left(\frac{S_n - \frac{n}{2}}{\sqrt{.25n}} \ge \frac{5}{\sqrt{.25n}}\right).$$
Using the normal approximation, this is approximately equal to $2[1 - \Phi(\frac{5}{\sqrt{.25n}})] = 2[1 - \Phi(.91)] = 2(1 - .8186) = .3628$.
We present four simulated walks of this drunkard in Fig. 1.6 over a two-minute interval consisting of 120 steps. The different simulations show that the drunkard's random walk could evolve in different ways.
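Such walks are also trivial to simulate directly; here is a Python sketch (numpy assumed). Note that the simulated frequency runs somewhat above .3628, since the normal approximation used in the text skips the continuity correction:

    # Simulate 120-step +-1 walks and estimate P(|final position| >= 10)
    import numpy as np

    rng = np.random.default_rng(4)
    steps = rng.choice([-1, 1], size=(100_000, 120))
    position = steps.sum(axis=1)
    print(np.mean(np.abs(position) >= 10))   # compare with .3628 above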

1.17.1 Binomial Confidence Interval

The normal approximation to the binomial distribution forms the basis for most of the confidence intervals for the parameter $p$ in common use. We describe two of these in this section, the Wald confidence interval and the score confidence interval for $p$. The Wald interval used to be the textbook interval, but the score interval is gaining in popularity due to recent research establishing unacceptably poor properties of the Wald interval. The derivation of each interval is sketched below.
Let $X \sim \mathrm{Bin}(n, p)$. By the normal approximation to the $\mathrm{Bin}(n, p)$ distribution, for large $n$, $X \approx N(np, np(1-p))$, and therefore the standardized binomial variable $\frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$. This implies

Fig. 1.6 Four simulated random walks


$$P\left(-z_{\alpha/2} \le \frac{X - np}{\sqrt{np(1-p)}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
$$\Rightarrow P\left(-z_{\alpha/2}\sqrt{np(1-p)} \le X - np \le z_{\alpha/2}\sqrt{np(1-p)}\right) \approx 1 - \alpha$$
$$\Rightarrow P\left(-z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le \frac{X}{n} - p \le z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right) \approx 1 - \alpha$$
$$\Rightarrow P\left(\frac{X}{n} - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le p \le \frac{X}{n} + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right) \approx 1 - \alpha.$$

This last probability statement almost looks like a confidence statement on the parameter $p$, but not quite, because $\sqrt{\frac{p(1-p)}{n}}$ is not computable. So we cannot use $\frac{X}{n} \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$ as a confidence interval for $p$. We remedy this by substituting $\hat{p} = \frac{X}{n}$ in $\sqrt{\frac{p(1-p)}{n}}$, to finally result in the confidence interval
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$
This is the Wald confidence interval for $p$.
An alternative and much better confidence interval for $p$ can be constructed by manipulating the normal approximation of the binomial in a different way. The steps proceed as follows. Writing once again $\hat{p}$ for $\frac{X}{n}$,
$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
$$\Rightarrow P\left((\hat{p} - p)^2 \le z^2_{\alpha/2}\,\frac{p(1-p)}{n}\right) \approx 1 - \alpha$$
$$\Rightarrow P\left(p^2\left(1 + \frac{z^2_{\alpha/2}}{n}\right) - p\left(2\hat{p} + \frac{z^2_{\alpha/2}}{n}\right) + \hat{p}^2 \le 0\right) \approx 1 - \alpha.$$

Now the quadratic equation
$$p^2\left(1 + \frac{z^2_{\alpha/2}}{n}\right) - p\left(2\hat{p} + \frac{z^2_{\alpha/2}}{n}\right) + \hat{p}^2 = 0$$
has the two real roots
$$p = p_\pm = \frac{\hat{p} + \frac{z^2_{\alpha/2}}{2n}}{1 + \frac{z^2_{\alpha/2}}{n}} \pm \frac{z_{\alpha/2}\sqrt{n}}{n + z^2_{\alpha/2}}\sqrt{\hat{p}(1 - \hat{p}) + \frac{z^2_{\alpha/2}}{4n}}.$$

The interval $[p_-, p_+]$ is the score confidence interval for $p$. It is established theoretically and empirically in Brown, Cai, and DasGupta (2001, 2002) that the score confidence interval performs much better than the Wald interval, even for very large $n$.
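Both intervals are one-line formulas; the following Python sketch (numpy assumed; the data values $x = 3$, $n = 20$ are illustrative) implements them for a 95% level, and shows the well-known pathology that the Wald interval can dip below 0 for small $\hat{p}$:

    # Wald and score intervals for a binomial proportion (z = 1.96 for 95%)
    import numpy as np

    def wald(x, n, z=1.96):
        phat = x / n
        m = z * np.sqrt(phat * (1 - phat) / n)
        return phat - m, phat + m

    def score(x, n, z=1.96):
        phat = x / n
        center = (phat + z**2 / (2 * n)) / (1 + z**2 / n)
        half = (z * np.sqrt(n) / (n + z**2)) * np.sqrt(phat * (1 - phat) + z**2 / (4 * n))
        return center - half, center + half

    print(wald(3, 20))    # (-0.006, 0.306): dips below zero
    print(score(3, 20))   # (0.052, 0.360): pulled toward 1/2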

1.17.2 Error of the CLT

A famous theorem in probability places an upper bound on the error of the normal approximation in the central limit theorem. If we make this upper bound itself small, then we can be confident that the normal approximation will be accurate. This upper bound on the error of the normal approximation is known as the Berry–Esseen bound. Specialized to the binomial case, it says the following; a proof can be seen in Bhattacharya and Rao (1986) or in Feller (1968). The general Berry–Esseen bound is treated in this text in Chapter 8.

Theorem 1.59 (Berry–Esseen Bound for Normal Approximation). Let $X \sim \mathrm{Bin}(n, p)$, and let $Y \sim N(np, np(1-p))$. Then for any real number $x$,
$$|P(X \le x) - P(Y \le x)| \le \frac{4}{5}\cdot\frac{1 - 2p(1-p)}{\sqrt{np(1-p)}}.$$

It should be noted that the Berry–Esseen bound is rather conservative. Thus,


accurate normal approximations are produced even when the upper bound,
a conservative one, is .1 or so. We do not recommend use of the Berry–Esseen
bound to decide when a normal approximation to the binomial can be accurately
done. The bound is simply too conservative. However, it is good to know this bound
due to its classic nature.
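To see just how conservative the bound is, one can tabulate it for a few values of $n$ and $p$; a Python sketch (numpy assumed):

    # The Berry-Esseen bound of Theorem 1.59 for a few (n, p)
    import numpy as np

    def be_bound(n, p):
        return 0.8 * (1 - 2 * p * (1 - p)) / np.sqrt(n * p * (1 - p))

    for n, p in [(100, 0.5), (100, 0.1), (1000, 0.1)]:
        print(n, p, be_bound(n, p))
    # even at n = 1000, p = .1 the bound is about .07, although the
    # actual approximation error is far smaller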
We finish with two more examples on the application of the general central limit
theorem.
Example 1.70 (Sum of Uniforms). We can approximate the distribution of the sum of $n$ independent uniforms on a general interval $[a, b]$ by a suitable normal distribution. However, it is interesting to ask what the exact density of the sum of $n$ independent uniforms on a general interval $[a, b]$ is. Because a uniform random variable on a general interval $[a, b]$ can be transformed to a uniform on the interval $[-1, 1]$ by a linear transformation and vice versa, we ask what the exact density of the sum of $n$ independent uniforms on $[-1, 1]$ is. We want to compare this exact density to a normal approximation for various values of $n$.
When $n = 2$, the density of the sum is a triangular density on $[-2, 2]$, which is a piecewise linear polynomial. In general, the density of the sum of $n$ independent uniforms on $[-1, 1]$ is a piecewise polynomial of degree $n - 1$, there being $n$ different arcs in the graph of the density. The exact formula is:
$$f_n(x) = \frac{1}{2^n(n-1)!}\sum_{k=0}^{\lfloor\frac{n+x}{2}\rfloor}(-1)^k\binom{n}{k}(n + x - 2k)^{n-1}, \quad \text{if } |x| \le n;$$
see Feller (1971, p. 27).
On the other hand, the CLT approximates the density of the sum by the $N(0, \frac{n}{3})$ density. It would be interesting to compare plots of the exact and the approximating normal density for various $n$. We see from Figs. 1.7, 1.8, and 1.9 that the normal approximation is already nearly exact when $n = 8$.
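Feller's formula is straightforward to evaluate; the Python sketch below (numpy/scipy assumed) compares the exact density with the $N(0, n/3)$ density at a few points for $n = 8$, and the agreement is already excellent:

    # Exact density of a sum of n U[-1,1] variables versus N(0, n/3)
    import numpy as np
    from math import comb, factorial, floor
    from scipy.stats import norm

    def f_exact(x, n):
        # Feller's formula quoted above
        if abs(x) > n:
            return 0.0
        s = sum((-1)**k * comb(n, k) * (n + x - 2*k)**(n - 1)
                for k in range(floor((n + x) / 2) + 1))
        return s / (2**n * factorial(n - 1))

    n = 8
    for x in [0.0, 1.0, 2.0]:
        print(x, f_exact(x, n), norm.pdf(x, scale=np.sqrt(n / 3)))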
Example 1.71 (Unwise Use of the CLT). Suppose the checkout time at a supermarket has a mean of 4 minutes and a standard deviation of 1 minute. You have just joined the queue in a lane where there are eight people ahead of you. From just this information, can you say anything useful about the chances that you can be finished checking out within half an hour?
With the information provided being only on the mean and the variance of an individual checkout time, but otherwise nothing about the distribution, a possibility is to use the CLT, although here $n$ is only 9, which is not large. Let $X_i$, $1 \le i \le 8$, be the checkout times taken by the eight customers ahead of you, and $X_9$ the time taken by you yourself. If we use the CLT, then we have
$$S_n = \sum_{i=1}^{9} X_i \approx N(36, 9).$$

Fig. 1.7 Exact and approximating normal density for sum of uniforms; n = 2

Fig. 1.8 Exact and approximating normal density for sum of uniforms; n = 4

Therefore,
$$P(S_n \le 30) \approx \Phi\left(\frac{30 - 36}{3}\right) = \Phi(-2) = .0228.$$
In situations such as this, where the information available is extremely limited, we do sometimes use the CLT, but it is not very wise because the value of $n$ is so small. It may be better to model the distribution of checkout times, and answer the question under that chosen model.

Fig. 1.9 Exact and approximating normal density for sum of uniforms; n = 8

1.18 Normal Approximation to Poisson and Gamma

A Poisson variable with an integer parameter $\lambda = n$ can be thought of as the sum of $n$ independent Poisson variables, each with mean 1. Likewise, a Gamma variable with parameters $\alpha = n$ and $\lambda$ can be thought of as the sum of $n$ independent exponential variables, each with mean $\lambda$. So, in these two cases the CLT already implies that a normal approximation to the Poisson and the Gamma holds when $n$ is large. However, even if the Poisson parameter $\lambda$ is not an integer, and even if the Gamma parameter $\alpha$ is not an integer, if $\lambda$ is large, or if $\alpha$ is large, a normal approximation still holds. These results can be proved directly, by using the mgf technique. Here are the normal approximation results for general Poisson and Gamma distributions.

Theorem 1.60. Let $X \sim \mathrm{Poi}(\lambda)$. Then
$$P\left(\frac{X - \lambda}{\sqrt{\lambda}} \le x\right) \to \Phi(x), \quad \text{as } \lambda \to \infty,$$
for any real number $x$. Notationally, for large $\lambda$,
$$X \approx N(\lambda, \lambda).$$

Theorem 1.61. Let $X \sim G(\alpha, \lambda)$. Then, for every fixed $\lambda$,
$$P\left(\frac{X - \alpha\lambda}{\lambda\sqrt{\alpha}} \le x\right) \to \Phi(x), \quad \text{as } \alpha \to \infty,$$
for any real number $x$. Notationally, for large $\alpha$,
$$X \approx N(\alpha\lambda, \alpha\lambda^2).$$
Example 1.72 (Nuclear Accidents). Suppose the probability of having any nuclear accidents in any single nuclear plant during a given year is .0005, and that a country has 100 such nuclear plants. What is the probability that there will be at least six nuclear accidents in the country during the next 250 years?
Let $X_{ij}$ be the number of accidents in the $i$th year in the $j$th plant. We assume that each $X_{ij}$ has a common Poisson distribution. The parameter, say $\mu$, of this common Poisson distribution is determined from the equation $e^{-\mu} = 1 - .0005 = .9995 \Rightarrow \mu = -\log(.9995) \approx .0005$. Assuming that these $X_{ij}$ are all independent, the number of accidents $T$ in the country during 250 years has a $\mathrm{Poi}(\lambda)$ distribution, where $\lambda = \mu \times 100 \times 250 = .0005 \times 100 \times 250 = 12.5$. If we now do a normal approximation with continuity correction,
$$P(T \ge 6) \approx 1 - \Phi\left(\frac{5.5 - 12.5}{\sqrt{12.5}}\right) = 1 - \Phi(-1.98) = .9761.$$
So we see that although the chances of having any accidents in a particular plant in any particular year are small, collectively, and in the long run, the chances are high that there will be quite a few such accidents.
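For a distribution this skewed, it is worth checking the approximation against the exact Poisson tail; a Python sketch (scipy assumed):

    # Exact Poisson(12.5) tail versus the continuity-corrected normal value
    from scipy.stats import poisson, norm
    import numpy as np

    lam = 12.5
    exact = 1 - poisson.cdf(5, lam)
    approx = 1 - norm.cdf((5.5 - lam) / np.sqrt(lam))
    print(exact, approx)   # about .985 (exact) versus .976 (approximation)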

1.18.1 Confidence Intervals

In Example 1.64, we described a confidence interval for a normal mean. A major use of the various normal approximations described above is the construction of confidence intervals for an unknown parameter of interest. The parameter can be essentially anything. Thus, suppose that based on a sample of observations $X_1, X_2, \ldots, X_n$, a parameter $\theta$ is estimated by $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$. What we need is a theorem that allows us to approximate the distribution of $\hat{\theta}$ by a suitable normal distribution. To fix notation, suppose that for large $n$, $\hat{\theta}$ is approximately distributed as $N(\theta, \sigma^2(\theta))$ for some explicit and computable function $\sigma(\cdot)$. Then we can construct a confidence interval for $\theta$, namely an interval $L \le \theta \le U$, and make probability statements of the form $P(L \le \theta \le U) \approx p$ for some suitable $p$. Here is an important illustration.

Example 1.73 (Confidence Interval for a Poisson Mean). The normal approximation to the Poisson distribution can be used to find a confidence interval for the mean $\lambda$ of a Poisson distribution. We have already seen an example of a confidence interval for a normal mean in this chapter. We now work out the Poisson case, using the normal approximation to Poisson.
Suppose $X \sim \mathrm{Poi}(\lambda)$. By the normal approximation theorem, if $\lambda$ is large, then $\frac{X - \lambda}{\sqrt{\lambda}} \approx N(0,1)$. Now, a standard normal random variable $Z$ has the property $P(-1.96 \le Z \le 1.96) = .95$. Because $\frac{X - \lambda}{\sqrt{\lambda}} \approx N(0,1)$, we have
$$P\left(-1.96 \le \frac{X - \lambda}{\sqrt{\lambda}} \le 1.96\right) \approx .95$$
$$\Leftrightarrow P\left(\frac{(X - \lambda)^2}{\lambda} \le 1.96^2\right) \approx .95$$
$$\Leftrightarrow P\left((X - \lambda)^2 - 1.96^2\lambda \le 0\right) \approx .95$$
$$\Leftrightarrow P\left(\lambda^2 - (2X + 1.96^2)\lambda + X^2 \le 0\right) \approx .95. \quad (*)$$

Now the quadratic equation
$$\lambda^2 - (2X + 1.96^2)\lambda + X^2 = 0$$
has the roots
$$\lambda = \lambda_\pm = \frac{(2X + 1.96^2) \pm \sqrt{(2X + 1.96^2)^2 - 4X^2}}{2} = \frac{(2X + 1.96^2) \pm \sqrt{14.76 + 15.37X}}{2}$$
$$= (X + 1.92) \pm \sqrt{3.69 + 3.84X}.$$

The quadratic $\lambda^2 - (2X + 1.96^2)\lambda + X^2$ is $\le 0$ when $\lambda$ is between these two values $\lambda_\pm$. So we can rewrite $(*)$ as
$$P\left((X + 1.92) - \sqrt{3.69 + 3.84X} \le \lambda \le (X + 1.92) + \sqrt{3.69 + 3.84X}\right) \approx .95. \quad (**)$$

In statistics, one often treats the parameter $\lambda$ as unknown, and uses the data value $X$ to estimate the unknown $\lambda$. The statement $(**)$ is interpreted as saying that with approximately 95% probability, $\lambda$ will fall inside the interval of values
$$(X + 1.92) - \sqrt{3.69 + 3.84X} \le \lambda \le (X + 1.92) + \sqrt{3.69 + 3.84X},$$
and so the interval
$$\left[(X + 1.92) - \sqrt{3.69 + 3.84X},\ (X + 1.92) + \sqrt{3.69 + 3.84X}\right]$$
is called an approximate 95% confidence interval for $\lambda$. We see that it is derived from the normal approximation to a Poisson distribution.
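The interval is a one-line function of the data value $X$; a Python sketch (numpy assumed; the values of $X$ below are illustrative):

    # Approximate 95% confidence interval for a Poisson mean, as derived above
    import numpy as np

    def poisson_ci_95(x):
        half = np.sqrt(3.69 + 3.84 * x)
        return (x + 1.92) - half, (x + 1.92) + half

    for x in [5, 8, 12]:
        print(x, poisson_ci_95(x))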

1.19 * Convergence of Densities and Edgeworth Expansions

If in the central limit theorem each individual $X_i$ is a continuous random variable with a density $f(x)$, then the sum $S_n = \sum_{i=1}^n X_i$ also has a density for each $n$, and hence the standardized sum $\frac{S_n - n\mu}{\sigma\sqrt{n}}$ also has a density for each $n$. It is natural to ask if the density of $\frac{S_n - n\mu}{\sigma\sqrt{n}}$ converges to the standard normal density when $n \to \infty$. This is true, under suitable conditions on the basic density $f(x)$. We present a result in this direction. It is useful to have a general result which ensures that under suitable conditions, in the central limit theorem the density of $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$ converges to the $N(0,1)$ density. The result below is not the best available result in this direction, but it often applies and is easy to state; a proof can be seen in Bhattacharya and Rao (1986).

Theorem 1.62 (Gnedenko's Local Limit Theorem). Suppose $X_1, X_2, \ldots$ are independent random variables with a density $f(x)$, mean $\mu$, and variance $\sigma^2$. If $f(x)$ is uniformly bounded, then the density function of $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$ converges uniformly on the real line $\mathcal{R}$ to the standard normal density $\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$.

One criticism of the normal approximation in the various cases we have described is that any normal distribution is symmetric about its mean, and so, by employing a normal approximation we necessarily ignore any skewness that may be present in the true distribution that we are approximating. For instance, if the individual $X_i$ have an exponential density, then the true density of the sum $S_n$ is a Gamma density, which always has a skewness. But a normal approximation ignores that, and as a result, the quality of the approximation can be poor, unless $n$ is quite large. Refined approximations that address this criticism are available.
We present refined density approximations that adjust the normal approximation for skewness, and one which also adjusts for kurtosis. They are collectively known as Edgeworth expansions; similar higher-order approximations are known for the CDF.
Suppose $X_1, X_2, \ldots, X_n$ are continuous random variables with a density $f(x)$. Suppose each individual $X_i$ has four finite moments. Let $\mu, \sigma^2, \beta, \gamma$ denote the mean, variance, coefficient of skewness, and coefficient of kurtosis of the common distribution of the $X_i$. Let
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}.$$
Define the following three successively more refined density approximations for the density of $Z_n$:
$$\hat{f}_{n,0}(x) = \phi(x);$$
$$\hat{f}_{n,1}(x) = \left(1 + \frac{\beta(x^3 - 3x)}{6\sqrt{n}}\right)\phi(x);$$
$$\hat{f}_{n,2}(x) = \left(1 + \frac{\beta(x^3 - 3x)}{6\sqrt{n}} + \left[\frac{\gamma(x^4 - 6x^2 + 3)}{24} + \beta^2\,\frac{x^6 - 15x^4 + 45x^2 - 15}{72}\right]\frac{1}{n}\right)\phi(x).$$
The functions $\hat{f}_{n,0}(x)$, $\hat{f}_{n,1}(x)$, and $\hat{f}_{n,2}(x)$ are called the CLT approximation, the first-order Edgeworth expansion, and the second-order Edgeworth expansion for the density of the mean.
The approximations are of the form
$$\phi(x) + \frac{p_1(x)}{\sqrt{n}}\phi(x) + \frac{p_2(x)}{n}\phi(x) + \cdots.$$
The relevant polynomials $p_1(x), p_2(x)$ are related to some very special polynomials, known as Hermite polynomials. Hermite polynomials are obtained from successive differentiation of the standard normal density $\phi(x)$. Precisely, the $j$th Hermite polynomial $H_j(x)$ is defined by the relation
$$\frac{d^j}{dx^j}\phi(x) = (-1)^j H_j(x)\phi(x).$$
In particular,
$$H_1(x) = x; \quad H_2(x) = x^2 - 1; \quad H_3(x) = x^3 - 3x; \quad H_4(x) = x^4 - 6x^2 + 3;$$
$$H_5(x) = x^5 - 10x^3 + 15x; \quad H_6(x) = x^6 - 15x^4 + 45x^2 - 15.$$
By comparing the formulas for the refined density approximations to the formulas for the Hermite polynomials, the connection becomes obvious. They arise in the density approximation formulas as a matter of fact; there is no intuition for it.
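A concrete comparison makes the gain from the Edgeworth terms vivid. In the Python sketch below (numpy/scipy assumed), the $X_i$ are Exponential(1), so $\beta = 2$ and, taking $\gamma$ to be the excess kurtosis (our reading of the coefficient of kurtosis above), $\gamma = 6$; the exact density of $Z_n$ comes from the Gamma$(n, 1)$ distribution of $S_n$:

    # CLT, first-, and second-order Edgeworth density approximations
    # for the standardized mean of n Exponential(1) variables
    import numpy as np
    from scipy.stats import norm, gamma

    n, beta, kurt = 10, 2.0, 6.0   # kurt = excess kurtosis (an assumption)
    x = 1.0                        # point at which to compare
    phi = norm.pdf(x)
    edge1 = (1 + beta * (x**3 - 3*x) / (6 * np.sqrt(n))) * phi
    edge2 = edge1 + (kurt * (x**4 - 6*x**2 + 3) / 24
                     + beta**2 * (x**6 - 15*x**4 + 45*x**2 - 15) / 72) * phi / n
    # exact: Z_n = (S_n - n)/sqrt(n) with S_n ~ Gamma(n, 1)
    exact = np.sqrt(n) * gamma.pdf(n + np.sqrt(n) * x, a=n)
    print(phi, edge1, edge2, exact)   # ~ .242, .191, .200, .199

Even at $n = 10$, the second-order expansion is nearly exact here, while the plain CLT value is off by about 20%.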

Exercises

Exercise 1.1. The population of Danville is 20,000. Can it be said with certainty
that there must be two or more people in Danville with exactly the same three
initials?
Exercise 1.2. The letters in the word FULL are rearranged at random. What is the
probability that it still spells FULL?
Exercise 1.3 (Skills Exercise). Let $E$, $F$, and $G$ be three events. Find expressions for the following events:
(a) Only $E$ occurs
(b) Both $E$ and $G$ occur, but not $F$
(c) All three occur
(d) At least one of the events occurs
(e) At most two of them occur

Exercise 1.4. An urn contains 5 red, 5 black, and 5 white balls. If 3 balls are chosen
without replacement at random, what is the probability that they are of exactly 2
different colors?

Exercise 1.5 (Matching Problem). Four men throw their watches into the sea, and
the sea brings each man one watch back at random. What is the probability that at
least one man gets his own watch back?

Exercise 1.6. Which is more likely:

(a) Obtaining at least one six in six rolls of a fair die,


or,
(b) Obtaining at least one double six in six rolls of a pair of fair dice.

Exercise 1.7 * (The General Shoes Problem). There are n pairs of shoes of n dis-
tinct colors in a closet and 2m are pulled out at random from the 2n shoes. What is
the probability that there is at least one complete pair among the shoes pulled?

Exercise 1.8. * There are n people lined up at random for a photograph. What is the probability that a specified set of r people happen to be next to each other?

Exercise 1.9. Calculate the probability that in Bridge, the hand of at least one player
is void in a particular suit.

Exercise 1.10 * (The Rumor Problem). In a town with $n$ residents, someone starts a rumor by telling it to one of the other $n - 1$ residents. Thereafter, each recipient passes the rumor on to one of the other residents, chosen at random. What is the probability that by the $k$th time that the rumor has been told, it has not come back to someone who has already heard it?

Exercise 1.11. Jen will call Cathy on Saturday with a 60% probability. She will
call Cathy on Sunday with a 80% probability. The probability that she will call on
neither of the two days is 10%. What is the probability that she will call on Sunday
if she calls on Saturday?

Exercise 1.12. Two distinct cards are drawn, one at a time, from a deck of 52 cards. The first chosen card is the ace of spades. What is the probability that the second card is neither an ace nor a spade?

Exercise 1.13. Suppose $P(A) = P(B) = .9$. Give a useful lower bound on $P(B \mid A)$.

Exercise 1.14. * The probability that a coin will show all heads or all tails when tossed four times is .25. What is the probability that it will show two heads and two tails?

Exercise 1.15 * (Conditional Independence). Events $A, B$ are called conditionally independent given $C$ if $P(A \cap B \mid C) = P(A \mid C)P(B \mid C)$.
(a) Give an example of events $A, B, C$ such that $A, B$ are not independent, but they are conditionally independent given $C$.
(b) Give an example of events $A, B, C$ such that $A, B$ are independent, but are not conditionally independent given $C$.

Exercise 1.16 (Polygraphs). Polygraphs are routinely administered to job


applicants for sensitive government positions. Suppose someone actually lying
fails the polygraph 90% of the time. But someone telling the truth also fails the
polygraph 15% of the time. If a polygraph indicates that an applicant is lying,
what is the probability that he is in fact telling the truth? Assume a general prior
probability p that the person is telling the truth.

Exercise 1.17 * (Random Matrix). The diagonal elements $a, c$ of a $2 \times 2$ symmetric matrix are chosen independently at random from $\{1, 2, \ldots, 5\}$, and the off-diagonal element is chosen at random from $\{1, \ldots, \min(a, c)\}$. Find the probability that the matrix is nonsingular.

Exercise 1.18 (The Parking Problem). At a parking lot, there are 12 places ar-
ranged in a row. A man observed that there were 8 cars parked, and that the four
empty places were adjacent to each other. Given that there are 4 empty places, is
this arrangement surprising?

Exercise 1.19. Suppose a fair die is rolled twice and suppose X is the absolute
value of the difference of the two rolls. Find the pmf and the CDF of X and plot the
CDF. Find a median of X ; is the median unique?

Exercise 1.20 * (A Two-Stage Experiment). Suppose a fair die is rolled once and
the number observed is N . Then a fair coin is tossed N times. Let X be the number
of heads obtained. Find the pmf, the CDF, and the expected value of X . Does the
expected value make sense intuitively?

Exercise 1.21. * Find a discrete random variable $X$ such that $E(X) = E(X^3) = 0$, $E(X^2) = E(X^4) = 1$.

Exercise 1.22 * (Waiting Time). An urn contains four red and four green balls that
are taken out without replacement, one at a time, at random. Let X be the first draw
at which a green ball is taken out. Find the pmf and the expected value of X .

Exercise 1.23 * (Runs). Suppose a fair die is rolled n times. By using the indicator
variable method, find the expected number of times that a six is followed by at least
two other sixes. Now compute the value when n D 100.

Exercise 1.24. * Suppose a couple will have children until they have at least two
children of each sex. By using the tail sum formula, find the expected value of the
number of children the couple will have.

Exercise 1.25. Suppose $X$ has pmf $P(X = \frac{1}{n}) = \frac{1}{2^n}$, $n \ge 1$. Find the mean of $X$.

Exercise 1.26 (A Calculus Calculation). The best quadratic predictor of some random variable $Y$ is $a + bX + cX^2$, where $a, b$, and $c$ are chosen to minimize $E[(Y - (a + bX + cX^2))^2]$. Determine $a, b$, and $c$.

Exercise 1.27 * (Tail Sum Formula for the Second Moment). Let $X$ be a nonnegative integer-valued random variable. Show that $E(X^2) - E(X) = 2\sum_{n=1}^{\infty} nP(X > n)$.

Exercise 1.28 * (Obtaining Equality in Chebyshev's Inequality). Consider a discrete random variable $X$ with the pmf $P(X = \pm k) = p$, $P(X = 0) = 1 - 2p$, where $k$ is a fixed positive number, and $0 < p < \frac{1}{2}$.
(a) Find the mean $\mu$ and variance of $X$.
(b) Find $P(|X - \mu| \ge k)$.
(c) Can you now choose $p$ in such a way that $P(|X - \mu| \ge k)$ becomes equal to $\frac{1}{k^2}$?

Exercise 1.29 * (Variance of a Product). Suppose $X_1, X_2$ are independent random variables. Give a sufficient condition for it to be true that $\mathrm{Var}(X_1X_2) = \mathrm{Var}(X_1)\mathrm{Var}(X_2)$.

Exercise 1.30 (Existence of Some Moments, but Not All). Give an example of a random variable $X$ taking the values $1, 2, 3, \ldots$ such that $E(X^k) < \infty$ for any $k < p$ ($p$ is specified), but $E(X^p) = \infty$.

Exercise 1.31. Find the generating function and the mgf of the random variable $X$ with the pmf $P(X = n) = \frac{1}{2^n}$, $n = 1, 2, 3, \ldots$.

Exercise 1.32 (MGF of a Linear Function). Suppose $X$ has the mgf $\psi(t)$. Find an expression for the mgf of $aX + b$, where $a, b$ are real constants.

Exercise 1.33 (Convexity of the MGF). Suppose $X$ has the mgf $\psi(t)$, finite in some open interval. Show that $\psi(t)$ is convex in that open interval.

Exercise 1.34. Suppose $G(s), H(s)$ are both generating functions. Show that $pG(s) + (1-p)H(s)$ is also a valid generating function for any $p$ in $(0,1)$. What is an interesting interpretation of the distribution that has $pG(s) + (1-p)H(s)$ as its generating function?

Exercise 1.35. * Give an example of a random variable $X$ such that $X$ has a finite mgf at any $t$, but $X^2$ does not have a finite mgf at any $t > 0$.

Exercise 1.36. Suppose a fair coin is tossed $n$ times. Find the probability that exactly half of the tosses result in heads, when $n = 10, 30, 50$; where does the probability seem to converge as $n$ becomes large?

Exercise 1.37. Suppose one coin with probability .4 for heads, one with proba-
bility .6 for heads, and one that is a fair coin are each tossed once. Find the pmf of
the total number of heads obtained; is it a binomial distribution?

Exercise 1.38. In repeated rolling of a fair die, find the minimum number of rolls necessary in order for the probability of at least one six to be:
(a) $\ge .5$; (b) $\ge .9$.

Exercise 1.39 * (Distribution of Maximum). Suppose $n$ numbers are drawn at random from $\{1, 2, \ldots, N\}$. What is the probability that the largest number drawn is a specified number $k$ if sampling is (a) with replacement; (b) without replacement?

Exercise 1.40 * (Poisson Approximation). One hundred people will each toss a
fair coin 200 times. Approximate the probability that at least 10 of the 100 people
would each have obtained exactly 100 heads and 100 tails.

Exercise 1.41. Suppose a fair coin is tossed repeatedly. Find the probability that 3
heads will be obtained before 4 tails.
Generalize to r heads and s tails.

Exercise 1.42 * (A Pretty Question). Suppose X is a Poisson distributed random


variable. Can three different values of X have an equal probability?

Exercise 1.43 * (Poisson Approximation). There are 20 couples seated at a rect-


angular table, husbands on one side and the wives on the other, in a random
order. Using a Poisson approximation, find the probability that exactly two hus-
bands are seated directly across from their wives; at least three are; at most
three are.

Exercise 1.44 (Poisson Approximation). There are 5 coins on a desk, with prob-
abilities .05, .1, .05, .01, and .04 for heads. By using a Poisson approximation, find
the probability of obtaining at least one head when the five coins are each tossed
once.
Is the number of heads obtained binomially distributed in this problem?
Exercise 1.45. Let $X \sim \mathrm{Bin}(n, p)$. Prove that $P(X \text{ is even}) = \frac{1}{2} + \frac{(1-2p)^n}{2}$. Hence, show that $P(X \text{ is even})$ is larger than $\frac{1}{2}$ for any $n$ if $p < \frac{1}{2}$, but that it is larger than $\frac{1}{2}$ for only even values of $n$ if $p > \frac{1}{2}$.

Exercise 1.46. Let $f(x) = c|x|(1 + x)(1 - x)$, $-1 \le x \le 1$.
(a) Find the normalizing constant $c$ that makes $f(x)$ a density function.
(b) Find the CDF corresponding to this density function. Plot it.
(c) Use the CDF to find:
$$P(X < -.5); \quad P(X > .5); \quad P(-.5 < X < .5).$$

Exercise 1.47. Show that for every $p$, $0 \le p \le 1$, the function $f(x) = p\sin x + (1 - p)\cos x$, $0 \le x \le \pi/2$ (and $f(x) = 0$ otherwise), is a density function. Find its CDF and use it to find all the medians.

Exercise 1.48. * Give an example of a density function on $[0, 1]$, by giving a formula, such that the density is finite at zero, unbounded at one, has a unique minimum in the open interval $(0, 1)$, and such that the median is .5.

Exercise 1.49 * (A Mixed Distribution). Suppose the damage claims on a particular type of insurance policy are uniformly distributed on $[0, 5]$ (in thousands of dollars), but the maximum payout by the insurance company is 2500 dollars. Find the CDF and the expected value of the payout, and plot the CDF. What is unusual about this CDF?

Exercise 1.50 * (Random Division). Jen's dog broke her six-inch long pencil off at a random point on the pencil. Find the density function and the expected value of the ratio of the lengths of the shorter piece and the longer piece of the pencil.

Exercise 1.51 (Square of a PDF Need Not Be a PDF). Give an example of a density function $f(x)$ on $[0, 1]$ such that $cf^2(x)$ cannot be a density function for any $c$.

Exercise 1.52 (Percentiles of the Standard Cauchy). Find the $p$th percentile of the standard Cauchy density for a general $p$, and compute it for $p = .75$.

Exercise 1.53 * (Functional Similarity). Suppose $X$ has the standard Cauchy density. Show that $X - \frac{1}{X}$ also has a Cauchy density.
Can you find another function with this property on your own?
Hint: Think of simple rational functions.

Exercise 1.54 * (An Intriguing Identity). Suppose $X$ has the standard Cauchy density. Give a rigorous proof that $P(X > 1) = P(X > 2) + P(X > 3)$.

Exercise 1.55 * (Integer Part). Suppose $X$ has a uniform distribution on $[0, 10.5]$. Find the expected value of the integer part of $X$.

Exercise 1.56 (The Density Function of the Density Function). Suppose $X$ has a density function $f(x)$. Find the density function of $f(X)$ when $f(x)$ is the standard normal density.

Exercise 1.57 (Minimum of Exponentials). Let $X_1, X_2, \ldots, X_n$ be $n$ independent standard exponential random variables. Find an expression for $E[\min\{X_1, \ldots, X_n\}]$.

Exercise 1.58. Suppose $X$ is a positive random variable with mean one. Show that $E(\log X) \le 0$.

Exercise 1.59. Suppose $X$ is a positive random variable with four finite moments. Show that $E(X)E(X^3) \ge [E(X^2)]^2$.

Exercise 1.60 (Rate Function for Exponential). Derive the rate function $I(x)$ of the Chernoff–Bernstein inequality for the standard exponential density, and hence derive a bound for $P(X > x)$.

Exercise 1.61 * (Rate Function for the Double Exponential). Derive the rate function $I(x)$ of the Chernoff–Bernstein inequality for the double exponential density, and hence derive a bound for $P(X > x)$.

Exercise 1.62. $X$ is uniformly distributed on some interval $[a, b]$. If its mean is 2, and variance is 3, what are the values of $a, b$?

Exercise 1.63. Let $X \sim U[0,1]$. Find the density of each of the following:
(a) $X^3 - 3X$;
(b) $(X - \frac{1}{2})^4$;
(c) $\sin(\frac{\pi X}{2})$.

Exercise 1.64 * (Mode of a Beta Density). Show that if a Beta density has a mode in the open interval $(0, 1)$, then we must have $\alpha > 1$, $\alpha + \beta > 2$, in which case the mode is unique and equals $\frac{\alpha - 1}{\alpha + \beta - 2}$.

Exercise 1.65. An exponential random variable with mean 4 is known to be larger


than 6. What is the probability that it is larger than 8?

Exercise 1.66 * (Sum of Gammas). Suppose $X, Y$ are independent random variables, and $X \sim G(\alpha, \lambda)$, $Y \sim G(\beta, \lambda)$. Find the distribution of $X + Y$ by using moment-generating functions.

Exercise 1.67 (Inverse Gamma Moments). Suppose $X \sim G(\alpha, \lambda)$. Find a formula for $E[(\frac{1}{X})^n]$, when this expectation exists.

Exercise 1.68 (Product of Chi Squares). Suppose $X_1, X_2, \ldots, X_n$ are independent chi square variables, with $X_i \sim \chi^2_{m_i}$. Find the mean and variance of $\prod_{i=1}^n X_i$.

Exercise 1.69 * (Chi-Square Skewness). Let $X \sim \chi^2_m$. Find the coefficient of skewness of $X$ and prove that it converges to zero as $m \to \infty$.

Exercise 1.70 * (A Relation Between Poisson and Gamma). Suppose $X \sim \mathrm{Poi}(\lambda)$. Prove by repeated integration by parts that
$$P(X \le n) = P(G(n+1, 1) > \lambda),$$
where $G(n+1, 1)$ means a Gamma random variable with parameters $n + 1$ and 1.

Exercise 1.71 * (A Relation Between Binomial and Beta). Suppose $X \sim \mathrm{Bin}(n, p)$. Prove that
$$P(X \le k - 1) = P(B(k, n-k+1) > p),$$
where $B(k, n-k+1)$ means a Beta random variable with parameters $k$, $n-k+1$.

Exercise 1.72. Suppose $X$ has the standard Gumbel density. Find the density of $e^{-X}$.

Exercise 1.73. Suppose $X$ is uniformly distributed on $[0, 1]$. Find the density of $\log\log\frac{1}{X}$.

Exercise 1.74. Let $Z \sim N(0,1)$. Find
$$P\left(.5 < \left|Z - \frac{1}{2}\right| < 1.5\right); \quad P(1 + Z + Z^2 > 0); \quad P\left(\frac{e^Z}{1 + e^Z} > \frac{3}{4}\right); \quad P(\Phi(Z) < .5).$$

Exercise 1.75. Let $Z \sim N(0,1)$. Find the density of $\frac{1}{Z}$. Is the density bounded?

Exercise 1.76. The 25th and the 75th percentiles of a normally distributed random variable are $-1$ and $+1$. What is the probability that the random variable is between $-2$ and $+2$?

Exercise 1.77 (Standard Normal CDF in Terms of the Error Function). In some places, instead of the standard normal CDF, one sees use of the error function $\mathrm{erf}(x) = (2/\sqrt{\pi})\int_0^x e^{-t^2}\,dt$. Express $\Phi(x)$ in terms of $\mathrm{erf}(x)$.

Exercise 1.78 * (An Interesting Calculation). Suppose $X \sim N(\mu, \sigma^2)$. Prove that
$$E[\Phi(X)] = \Phi\left(\mu/\sqrt{1 + \sigma^2}\right).$$

Exercise 1.79 * (Useful Normal Distribution Formulas). Prove the following primitive (indefinite integral) formulas, where $t = \sqrt{1 + b^2}$:
(a) $\int x^2\phi(x)\,dx = \Phi(x) - x\phi(x)$.
(b) $\int [\phi(x)]^2\,dx = (1/(2\sqrt{\pi}))\,\Phi(x\sqrt{2})$.
(c) $\int \phi(x)\phi(a + bx)\,dx = (1/t)\,\phi(a/t)\,\Phi(tx + ab/t)$.
(d) $\int x\phi(x)\Phi(bx)\,dx = (b/(\sqrt{2\pi}\,t))\,\Phi(tx) - \phi(x)\Phi(bx)$.

Exercise 1.80 * (Useful Normal Distribution Formulas). Prove the following definite integral formulas, with $t$ as in the previous exercise:
(a) $\int_0^\infty x\phi(x)\Phi(bx)\,dx = (1/(2\sqrt{2\pi}))[1 + b/t]$.
(b) $\int_{-\infty}^\infty x\phi(x)\Phi(bx)\,dx = b/(\sqrt{2\pi}\,t)$.
(c) $\int_{-\infty}^\infty \phi(x)\Phi(a + bx)\,dx = \Phi(a/t)$.
(d) $\int_0^\infty \phi(x)[\Phi(bx)]^2\,dx = (1/(2\pi))[\arctan b + \arctan\sqrt{1 + 2b^2}]$.
(e) $\int_{-\infty}^\infty \phi(x)[\Phi(bx)]^2\,dx = (1/\pi)\arctan\sqrt{1 + 2b^2}$.

Exercise 1.81 (Median and Mode of Lognormal). Show that a general lognormal density is unimodal, and find its mode and median.
Hint: For the median, remember that a lognormal variable is $e^X$, where $X$ is a normal variable.

Exercise 1.82 (Margin of Error of a Confidence Interval). Suppose $X_1, X_2, \ldots, X_n$ are independent $N(\mu, 10)$ variables. What is the smallest $n$ such that the margin of error of a 95% confidence interval for $\mu$ is at most .05?

Exercise 1.83. Suppose $X \sim N(0,1)$, $Y \sim N(0,9)$, and $X, Y$ are independent. Find the value of $P((X - Y)^2 > 5)$.

Exercise 1.84. A fair die is rolled 25 times. Let $X$ be the number of times a six is obtained. Find the exact value of $P(X = 6)$, and compare it to a normal approximation of $P(X = 6)$.

Exercise 1.85. A basketball player has a history of converting 80% of his free
throws. Find a normal approximation with a continuity correction of the probability
that he will make between 18 and 22 throws out of 25 free throws.

Exercise 1.86 (Airline Overbooking). An airline knows from past experience that
10% of fliers with a confirmed reservation do not show up for the flight. Suppose a
flight has 250 seats. How many reservations over 250 can the airline permit, if they
want to be 95% sure that no more than two passengers with a confirmed reservation
would have to be bumped?

Exercise 1.87. * Suppose $X_1, X_2, \ldots, X_n$ are independent $N(0,1)$ variables. Find an approximation to the probability that $\sum_{i=1}^n X_i$ is larger than $\sum_{i=1}^n X_i^2$, when $n = 10, 20, 30$.

Exercise 1.88 (A Product Problem). Suppose $X_1, X_2, \ldots, X_{30}$ are 30 independent variables, each distributed as $U[0,1]$. Find an approximation to the probability that their geometric mean exceeds .4; exceeds .5.

Exercise 1.89 (Comparing a Poisson Approximation and a Normal Approxi-


mation). Suppose 1:5% of residents of a town never read a newspaper. Compute the
exact value, a Poisson approximation, and a normal approximation of the probability
that at least one resident in a sample of 50 residents never reads a newspaper.

Exercise 1.90 (Anything That Can Happen Will Eventually Happen). If you predict in advance the outcomes of 10 tosses of a fair coin, the probability that you get them all correct is $(.5)^{10}$, which is very small. Show that if 2000 people each try to predict the 10 outcomes correctly, the chance that at least one of them succeeds is better than 85%.

Exercise 1.91 * (Random Walk). Consider the drunkard’s random walk example.
Find the probability that the drunkard will be at least 10 steps over on the right from
his starting point after 200 steps. Compute a normal approximation.

Exercise 1.92 (Test Your Intuition). Suppose a fair coin is tossed 100 times.
Which is more likely: you will get exactly 50 heads, or you will get more than
60 heads?

Exercise 1.93 * (Density of Uniform Sums). Give a direct proof that the density of $\frac{S_n}{\sqrt{n/3}}$ at zero converges to $\phi(0)$, where $S_n$ is the sum of $n$ independent $U[-1, 1]$ variables.

Exercise 1.94 (Confidence Interval for Poisson Mean). Derive a formula for an approximate 99% confidence interval for a Poisson mean, by using the normal approximation to a Poisson distribution. Compare your formula to the formula for an approximate 95% confidence interval that was worked out in the text. Compute the 95% and the 99% confidence intervals if $X = 5, 8, 12$.

References

Alon, N. and Spencer, J. (2000). The Probabilistic Method, Wiley, New York.
Ash, R. (1972). Real Analysis and Probability, Academic Press, New York.
Barbour, A. and Hall, P. (1984). On the rate of Poisson convergence, Math. Proc. Camb. Phil. Soc.,
95, 473–480.
Bernstein, S. (1927). Theory of Probability, Nauka, Moscow.
Bhattacharya, R.N. and Rao, R.R. (1986). Normal Approximation and Asymptotic Expansions,
Robert E. Krieger, Melbourne, FL.
Bhattacharya, R.N. and Waymire, E. (2009). A Basic Course in Probability Theory, Springer,
New York.
Billingsley, P. (1995). Probability and Measure, Third Edition, John Wiley, New York.
Breiman, L. (1992). Probability, Addison-Wesley, New York.
Brown, L., Cai, T., and DasGupta, A. (2001). Interval estimation for a binomial proportion, Statist.
Sci., 16, 101–133.
Brown, L., Cai, T., and DasGupta, A. (2002). Confidence intervals for a binomial proportion and
asymptotic expansions, Ann. Statist., 30, 160–201.
Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum
of observations, Ann. Math. Statist., 23, 493–507.
Chernoff, H. (1981). A note on an inequality involving the normal distribution, Ann. Prob., 9,
533–535.
Chung, K. L. (1974). A Course in Probability, Academic Press, New York.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
DasGupta, A. (2010). Fundamentals of Probability: A First Course, Springer, New York.
Diaconis, P. and Zabell, S. (1991). Closed form summation formulae for classical distributions,
Statist. Sci., 6, 284–302.
Dudley, R. (2002). Real Analysis and Probability, Cambridge University Press, Cambridge, UK.
Everitt, B. (1998). Cambridge Dictionary of Statistics, Cambridge University Press, New York.
Feller, W. (1968). Introduction to Probability Theory and its Applications, Vol. I, Wiley, New York.
Feller, W. (1971). Introduction to Probability Theory and Its Applications, Vol. II, Wiley,
New York.
Fisher, R.A. (1929). Moments and product moments of sampling distributions, Proc. London Math.
Soc., 2, 199–238.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer-Verlag, New York.

Johnson, N., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distributions, Vol. I,
Wiley, New York.
Kagan, A., Linnik, Y., and Rao, C.R. (1973). Characterization Problems in Mathematical Statis-
tics, Wiley, New York.
Kendall, M.G. and Stuart, A. (1976). Advanced Theory of Statistics, Vol. I, Wiley, New York.
Le Cam, L. (1960). An approximation theorem for the Poisson binomial distribution, Pacific J.
Math., 10, 1181–1197.
Le Cam, L. (1986). The central limit theorem around 1935, Statist. Sci., 1, 78–96.
Paley, R.E. and Zygmund, A. (1932). A note on analytic functions in the unit circle, Proc. Camb.
Philos. Soc., 28, 266–272.
Petrov, V. (1975). Limit Theorems of Probability Theory, Oxford University Press, Oxford, UK.
Pitman, J. (1992). Probability, Springer-Verlag, New York.
Rao, C.R. (1973), Linear Statistical Inference and Applications, Wiley, New York.
Ross, S. (1984). A First Course in Probability, Macmillan, New York.
Steele, J.M. (1994). Le Cam's inequality and Poisson approximations, Amer. Math. Monthly, 101, 48–54.
Stein, C. (1981). Estimation of the mean of a multivariate normal distribution, Ann. Stat., 9,
1135–1151.
Stigler, S. (1986). The History of Statistics, Belknap Press, Cambridge, MA.
Stirzaker, D. (1994). Elementary Probability, Cambridge University Press, London.
Wasserman, L. (2006). All of Nonparametric Statistics, Springer, New York.
Widder, D. (1989). Advanced Calculus, Dover, New York.
Chapter 2
Multivariate Discrete Distributions

We have provided a detailed overview of distributions of one discrete or one
continuous random variable in the previous chapter. But often in applications, we are
just naturally interested in two or more random variables simultaneously. We may
be interested in them simultaneously because they provide information about each
other, or because they arise simultaneously as part of the data in some scientific
experiment. For instance, on a doctor’s visit, the physician may check someone’s
blood pressure, pulse rate, blood cholesterol level, and blood sugar level, because
together they give information about the general health of the patient. In such cases,
it becomes essential to know how to operate with many random variables simul-
taneously. This is done by using joint distributions. Joint distributions naturally
lead to considerations of marginal and conditional distributions. We study joint,
marginal, and conditional distributions for discrete random variables in this chapter.
The concepts of these various distributions for continuous random variables are not
different; but the techniques are mathematically more sophisticated. The continuous
case is treated in the next chapter.

2.1 Bivariate Joint Distributions and Expectations of Functions

We present the fundamentals of joint distributions of two variables in this section.


The concepts in the multivariate case are the same, although the technicalities are
somewhat more involved. We treat the multivariate case in a later section. The idea
is that there is still an underlying experiment, with an associated sample space Ω.
But now we have two or more random variables on the sample space Ω. Random
variables being functions on the sample space, we now have multiple functions,
say X(ω), Y(ω), …, and so on. We want to study their joint behavior.

Example 2.1 (Coin Tossing). Consider the experiment of tossing a fair coin three
times. Let X be the number of heads among the first two tosses, and Y the number
of heads among the last two tosses. If we consider X and Y individually, we
realize immediately that they are each Bin(2, 0.5) random variables. But the individual
distributions hide part of the full story. For example, if we knew that X was 2,


then that would imply that Y must be at least 1. Thus, their joint behavior cannot
be fully understood from their individual distributions; we must study their joint
distribution.
Here is what we mean by their joint distribution. The sample space Ω of this
experiment is

   Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

Each sample point has an equal probability 1/8. Denoting the sample points as
ω₁, ω₂, …, ω₈, we see that if ω₁ prevails, then X(ω₁) = Y(ω₁) = 2, but if ω₂
prevails, then X(ω₂) = 2, Y(ω₂) = 1. The combinations of all possible values of
(X, Y) are

   (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2).

The joint distribution of (X, Y) provides the probability p(x, y) = P(X = x, Y = y)
for each such combination of possible values (x, y). Indeed, by direct counting
using the eight equally likely sample points, we see that

   p(0,0) = 1/8, p(0,1) = 1/8, p(0,2) = 0, p(1,0) = 1/8, p(1,1) = 1/4;
   p(1,2) = 1/8, p(2,0) = 0, p(2,1) = 1/8, p(2,2) = 1/8.

For example, why is p(0,1) = 1/8? This is because the combination (X = 0, Y = 1) is
favored by only one sample point, namely TTH. It is convenient to present these
nine different probabilities in the form of a table as follows.

            Y
   X      0      1      2
   0     1/8    1/8     0
   1     1/8    1/4    1/8
   2      0     1/8    1/8

Such a layout is a convenient way to present the joint distribution of two discrete
random variables with a small number of values. The distribution itself is called the
joint pmf; here is a formal definition.

Definition 2.1. Let X, Y be two discrete random variables with respective sets of
values x₁, x₂, …, and y₁, y₂, …, defined on a common sample space Ω. The joint
pmf of X, Y is defined to be the function p(xᵢ, yⱼ) = P(X = xᵢ, Y = yⱼ), i, j ≥ 1,
and p(x, y) = 0 at any other point (x, y) in R².

The requirements of a joint pmf are that
(i) p(x, y) ≥ 0 for all (x, y);
(ii) Σ_i Σ_j p(xᵢ, yⱼ) = 1.
Thus, if we write the joint pmf in the form of a table, then all entries should be
nonnegative, and the sum of all the entries in the table should be one.
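To make the bookkeeping concrete, here is a minimal computational sketch (in Python, which is not used elsewhere in this text) that builds the joint pmf of Example 2.1 by enumerating the eight sample points and checks the two requirements above.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of three fair coin tosses.
pmf = {}
for omega in product("HT", repeat=3):
    x = omega[:2].count("H")      # X: heads among the first two tosses
    y = omega[1:].count("H")      # Y: heads among the last two tosses
    pmf[(x, y)] = pmf.get((x, y), Fraction(0)) + Fraction(1, 8)

assert all(p >= 0 for p in pmf.values())   # requirement (i)
assert sum(pmf.values()) == 1              # requirement (ii)
print(pmf[(0, 1)])                         # 1/8, as in the table
```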
As in the case of a single variable, we can define a CDF for more than one
variable also. For the case of two variables, here is the definition of a CDF.

Definition 2.2. Let X, Y be two discrete random variables, defined on a common
sample space Ω. The joint CDF, or simply the CDF, of (X, Y) is a function
F : R² → [0, 1] defined as F(x, y) = P(X ≤ x, Y ≤ y), x, y ∈ R.
Like the joint pmf, the CDF also characterizes the joint distribution of two dis-
crete random variables. But it is not very convenient or even interesting to work with
the CDF in the case of discrete random variables. It is much preferable to work with
the pmf when dealing with discrete random variables.
Example 2.2 (Maximum and Minimum in Dice Rolls). Suppose a fair die is rolled
twice, and let X, Y be the larger and the smaller of the two rolls (note that X can
be equal to Y). Each of X, Y takes the individual values 1, 2, …, 6, but we
necessarily have X ≥ Y. The sample space of this experiment is

   {11, 12, 13, …, 64, 65, 66}.

By direct counting, for example, p(2, 1) = 2/36 = 1/18. Indeed, p(x, y) = 1/18 for each
x, y = 1, 2, …, 6, x > y, and p(x, y) = 1/36 for x = y = 1, 2, …, 6. Here is what
the joint pmf looks like in the form of a table:

            Y
   X      1      2      3      4      5      6
   1     1/36    0      0      0      0      0
   2     1/18   1/36    0      0      0      0
   3     1/18   1/18   1/36    0      0      0
   4     1/18   1/18   1/18   1/36    0      0
   5     1/18   1/18   1/18   1/18   1/36    0
   6     1/18   1/18   1/18   1/18   1/18   1/36

The individual pmfs of X, Y are easily recovered from the joint distribution. For
example,

   P(X = 1) = Σ_{y=1}^{6} P(X = 1, Y = y) = 1/36,  and

   P(X = 2) = Σ_{y=1}^{6} P(X = 2, Y = y) = 1/18 + 1/36 = 1/12,

and so on. The individual pmfs are obtained by summing the joint probabilities
over all values of the other variable. They are:

   x        1      2      3      4      5      6
   p_X(x)  1/36   3/36   5/36   7/36   9/36  11/36

   y        1      2      3      4      5      6
   p_Y(y) 11/36   9/36   7/36   5/36   3/36   1/36

From the individual pmf of X, we can find the expectation of X. Indeed,

   E(X) = 1·(1/36) + 2·(3/36) + ⋯ + 6·(11/36) = 161/36.

Similarly, E(Y) = 91/36. The individual pmfs are called marginal pmfs, and here is
the formal definition.

Definition 2.3. Let p(x, y) be the joint pmf of (X, Y). The marginal pmf of a
function Z = g(X, Y) is defined as p_Z(z) = Σ_{(x,y): g(x,y)=z} p(x, y). In particular,

   p_X(x) = Σ_y p(x, y);   p_Y(y) = Σ_x p(x, y),

and for any event A,

   P(A) = Σ_{(x,y)∈A} p(x, y).
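As an illustration of these marginal formulas, the following sketch (illustrative helper code, not from the text) recovers the marginal pmfs and the two expectations of Example 2.2 by brute-force enumeration.

```python
from fractions import Fraction
from itertools import product

# Joint pmf of (max, min) for two rolls of a fair die (Example 2.2).
pmf = {}
for r1, r2 in product(range(1, 7), repeat=2):
    key = (max(r1, r2), min(r1, r2))
    pmf[key] = pmf.get(key, Fraction(0)) + Fraction(1, 36)

# Marginal pmfs: sum the joint pmf over the other variable.
pX = {x: sum(p for (a, _), p in pmf.items() if a == x) for x in range(1, 7)}
pY = {y: sum(p for (_, b), p in pmf.items() if b == y) for y in range(1, 7)}

print(sum(x * p for x, p in pX.items()))   # E(X) = 161/36
print(sum(y * p for y, p in pY.items()))   # E(Y) = 91/36
```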

Example 2.3. Consider a joint pmf given by the formula

   p(x, y) = c(x + y),  1 ≤ x, y ≤ n,

where c is a normalizing constant. First of all, we need to evaluate c by equating

   Σ_{x=1}^{n} Σ_{y=1}^{n} p(x, y) = 1
   ⇔ c Σ_{x=1}^{n} Σ_{y=1}^{n} (x + y) = 1
   ⇔ c Σ_{x=1}^{n} [nx + n(n+1)/2] = 1
   ⇔ c [n²(n+1)/2 + n²(n+1)/2] = 1
   ⇔ c n²(n+1) = 1
   ⇔ c = 1/(n²(n+1)).

The joint pmf is symmetric between x and y (because x + y = y + x), and so
X, Y have the same marginal pmf. For example, X has the pmf

   p_X(x) = Σ_{y=1}^{n} p(x, y) = [1/(n²(n+1))] Σ_{y=1}^{n} (x + y)
          = [1/(n²(n+1))] [nx + n(n+1)/2]
          = x/(n(n+1)) + 1/(2n),  1 ≤ x ≤ n.

Suppose now we want to compute P(X > Y). This can be found by summing
p(x, y) over all combinations for which x > y. But this longer calculation
can be avoided by using a symmetry argument that is often very useful. Note
that because the joint pmf is symmetric between x and y, we must have
P(X > Y) = P(Y > X) = p (say). But, also,

   P(X > Y) + P(Y > X) + P(X = Y) = 1 ⇒ 2p + P(X = Y) = 1
   ⇒ p = [1 − P(X = Y)]/2.

Now,

   P(X = Y) = Σ_{x=1}^{n} p(x, x) = c Σ_{x=1}^{n} 2x = n(n+1)/(n²(n+1)) = 1/n.

Therefore, P(X > Y) = p = (n − 1)/(2n) ≈ 1/2 for large n.
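The symmetry argument is easy to confirm by direct summation; the short sketch below (a brute-force check, not part of the text) verifies P(X > Y) = (n − 1)/(2n) for a few values of n.

```python
from fractions import Fraction

def p_x_greater_y(n):
    # p(x, y) = c (x + y) with c = 1 / (n^2 (n + 1))
    c = Fraction(1, n * n * (n + 1))
    return sum(c * (x + y)
               for x in range(1, n + 1)
               for y in range(1, n + 1) if x > y)

for n in (2, 5, 50):
    assert p_x_greater_y(n) == Fraction(n - 1, 2 * n)
```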
Example 2.4 (Dice Rolls Revisited). Consider again the example of two rolls of a
fair die, and suppose X, Y are the larger and the smaller of the two rolls. We have
worked out the joint distribution of (X, Y) in Example 2.2. Suppose we want to
find the distribution of the difference, X − Y. The possible values of X − Y are
0, 1, …, 5, and we find P(X − Y = k) by using the joint distribution of (X, Y):

   P(X − Y = 0) = p(1,1) + p(2,2) + ⋯ + p(6,6) = 1/6;
   P(X − Y = 1) = p(2,1) + p(3,2) + ⋯ + p(6,5) = 5/18;
   P(X − Y = 2) = p(3,1) + p(4,2) + p(5,3) + p(6,4) = 2/9;
   P(X − Y = 3) = p(4,1) + p(5,2) + p(6,3) = 1/6;
   P(X − Y = 4) = p(5,1) + p(6,2) = 1/9;
   P(X − Y = 5) = p(6,1) = 1/18.

There is no way to find the distribution of X − Y except by using the joint distribution
of (X, Y).

Suppose now we also want to know the expected value of X − Y. Now that we
have the distribution of X − Y worked out, we can find the expectation by directly
using the definition of expectation:

   E(X − Y) = Σ_{k=0}^{5} k P(X − Y = k)
            = 5/18 + 4/9 + 1/2 + 4/9 + 5/18 = 35/18.

But we can also use linearity of expectations and find E(X − Y) as

   E(X − Y) = E(X) − E(Y) = 161/36 − 91/36 = 35/18

(see Example 2.2 for E(X), E(Y)).
A third possible way to compute E(X − Y) is to treat X − Y as a function of
(X, Y) and use the joint pmf of (X, Y) to find E(X − Y) as Σ_x Σ_y (x − y) p(x, y).
In this particular example, this is an unnecessarily laborious calculation, because
luckily we can find E(X − Y) by other quicker means, as we just saw. But in
general, one has to resort to the joint pmf to calculate the expectation of a function
of (X, Y). Here is the formal formula.
Theorem 2.1 (Expectation of a Function). Let (X, Y) have the joint pmf p(x, y),
and let g(X, Y) be a function of (X, Y). We say that the expectation of g(X, Y)
exists if Σ_x Σ_y |g(x, y)| p(x, y) < ∞, in which case

   E[g(X, Y)] = Σ_x Σ_y g(x, y) p(x, y).
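For instance, the theorem turns the computation of E(X − Y) in Example 2.4 into a short sum over the joint pmf; here is a sketch (using the same brute-force enumeration of the dice pmf as before):

```python
from fractions import Fraction
from itertools import product

pmf = {}
for r1, r2 in product(range(1, 7), repeat=2):
    key = (max(r1, r2), min(r1, r2))
    pmf[key] = pmf.get(key, Fraction(0)) + Fraction(1, 36)

# E[g(X, Y)] = sum of g(x, y) p(x, y) over the support; here g(x, y) = x - y.
assert sum((x - y) * p for (x, y), p in pmf.items()) == Fraction(35, 18)
```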

2.2 Conditional Distributions and Conditional Expectations

Sometimes we want to know the expected value of one of the variables,
say X, if we knew the value of the other variable Y. For example, in the die tossing
experiment above, what should we expect the larger of the two rolls to be if the
smaller roll is known to be 2?
To answer this question, we have to find the probabilities of the various values
of X, conditional on knowing that Y equals some given y, and then average by
using these conditional probabilities. Here are the formal definitions.

Definition 2.4 (Conditional Distribution). Let (X, Y) have the joint pmf p(x, y).
The conditional distribution of X given Y = y is defined to be

   p(x|y) = P(X = x | Y = y) = p(x, y)/p_Y(y),

and the conditional expectation of X given Y = y is defined to be

   E(X | Y = y) = Σ_x x p(x|y) = Σ_x x p(x, y) / p_Y(y) = Σ_x x p(x, y) / Σ_x p(x, y).

The conditional distribution of Y given X = x and the conditional expectation of
Y given X = x are defined analogously, by switching the roles of X and Y in the
above definitions.
We often casually write E(X|y) to mean E(X | Y = y).
Two easy facts that are nevertheless often useful are the following.

Proposition. Let X, Y be random variables defined on a common sample space Ω.
Then,
(a) E(g(Y) | Y = y) = g(y) for all y and for any function g;
(b) E(X g(Y) | Y = y) = g(y) E(X | Y = y) for all y and for any function g.
Recall that in Chapter 1 we defined two random variables to be independent if
P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y) for all x, y ∈ R. This is of course a correct
definition; but in the case of discrete random variables, it is more convenient to
think of independence in terms of the pmf. The definition below puts together some
equivalent definitions of independence of two discrete random variables.

Definition 2.5 (Independence). Let (X, Y) have the joint pmf p(x, y). Then X, Y
are said to be independent if

   p(x|y) = p_X(x) for all x, y such that p_Y(y) > 0
   ⇔ p(y|x) = p_Y(y) for all x, y such that p_X(x) > 0
   ⇔ p(x, y) = p_X(x) p_Y(y) for all x, y
   ⇔ P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y) for all x, y.

The third equivalent condition in the above list is usually the most convenient one
to verify and use.
One more frequently useful fact about conditional expectations is the following.

Proposition. Suppose X, Y are independent random variables. Then, for any
function g(X) such that the expectations below exist, and for any y,

   E[g(X) | Y = y] = E[g(X)].

2.2.1 Examples on Conditional Distributions and Expectations

Example 2.5 (Maximum and Minimum in Dice Rolls). In the experiment of two rolls
of a fair die, we have worked out the joint distribution of X, Y, where X is the larger
and Y the smaller of the two rolls. Using this joint distribution, we can now find the
conditional distributions. For instance,

   P(Y = 1 | X = 1) = 1;  P(Y = y | X = 1) = 0 if y > 1;
   P(Y = 1 | X = 2) = (1/18)/(1/18 + 1/36) = 2/3;
   P(Y = 2 | X = 2) = (1/36)/(1/18 + 1/36) = 1/3;
   P(Y = y | X = 2) = 0 if y > 2;
   P(Y = y | X = 6) = (1/18)/(5/18 + 1/36) = 2/11 if 1 ≤ y ≤ 5;
   P(Y = 6 | X = 6) = (1/36)/(5/18 + 1/36) = 1/11.

Example 2.6 (Conditional Expectation in a 2 × 2 Table). Suppose X, Y are binary
variables, each taking only the values 0, 1, with the following joint distribution.

            Y
   X      0      1
   0      s      t
   1      u      v

We want to evaluate the conditional expectation of X given Y = 0, 1, respectively.
By using the definition of conditional expectation,

   E(X | Y = 0) = [0·p(0,0) + 1·p(1,0)] / [p(0,0) + p(1,0)] = u/(s + u);
   E(X | Y = 1) = [0·p(0,1) + 1·p(1,1)] / [p(0,1) + p(1,1)] = v/(t + v).

Therefore,

   E(X | Y = 1) − E(X | Y = 0) = v/(t + v) − u/(s + u) = (vs − ut)/[(t + v)(s + u)].

It follows that we can now give the single formula

   E(X | Y = y) = u/(s + u) + [(vs − ut)/((t + v)(s + u))] y,  y = 0, 1.

We now realize that the conditional expectation of X given Y = y is a linear
function of y in this example. This is the case whenever both X, Y are binary
variables, as they were in this example.

Example 2.7 (Conditional Expectation in Dice Experiment). Consider again the
example of the joint distribution of the maximum and the minimum of two rolls of
a fair die. Let X denote the maximum, and Y the minimum. We find E(X | Y = y)
for various values of y.
By using the definition of E(X | Y = y), we have, for example,

   E(X | Y = 1) = [1·(1/36) + (1/18)(2 + 3 + ⋯ + 6)] / (1/36 + 5/18) = 41/11 = 3.73;

as another example,

   E(X | Y = 3) = [3·(1/36) + (1/18)(4 + 5 + 6)] / (1/36 + 3/18) = 33/7 = 4.71;

and,

   E(X | Y = 5) = [5·(1/36) + 6·(1/18)] / (1/36 + 1/18) = 17/3 = 5.67.

We notice that E(X | Y = 5) > E(X | Y = 3) > E(X | Y = 1); in fact, it is true that
E(X | Y = y) is increasing in y in this example. This does make intuitive sense.
Just as in the case of a distribution of a single variable, we often also want a measure
of variability in addition to a measure of average for conditional distributions.
This motivates defining a conditional variance.

Definition 2.6 (Conditional Variance). Let (X, Y) have the joint pmf p(x, y).
Let μ_X(y) = E(X | Y = y). The conditional variance of X given Y = y is
defined to be

   Var(X | Y = y) = E[(X − μ_X(y))² | Y = y] = Σ_x (x − μ_X(y))² p(x|y).

We often casually write Var(X|y) to mean Var(X | Y = y).

Example 2.8 (Conditional Variance in Dice Experiment). We work out the conditional
variance of the maximum of two rolls of a die given the minimum. That is,
suppose a fair die is rolled twice, and X, Y are the larger and the smaller of the two
rolls; we want to compute Var(X|y).
For example, if y = 3, then μ_X(y) = E(X | Y = 3) = 4.71 (see the previous
example). Therefore,

   Var(X|y) = Σ_x (x − 4.71)² p(x|3)
            = [(3 − 4.71)²(1/36) + (4 − 4.71)²(1/18) + (5 − 4.71)²(1/18)
               + (6 − 4.71)²(1/18)] / (1/36 + 1/18 + 1/18 + 1/18)
            = 1.06.

To summarize, given that the minimum of two rolls of a fair die is 3, the expected
value of the maximum is 4.71 and the variance of the maximum is 1.06.
These two values, E(X|y) and Var(X|y), change as we change the given
value y. Thus E(X|y) and Var(X|y) are functions of y, and for each separate y a
new calculation is needed. If X, Y happen to be independent, then of course, whatever
y is, E(X|y) = E(X) and Var(X|y) = Var(X).
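Both E(X|y) and Var(X|y) can be read off the joint pmf mechanically. Here is a small sketch (again brute-force enumeration, not the author's code) that reproduces the numbers 4.71 and 1.06 for y = 3.

```python
from fractions import Fraction
from itertools import product

pmf = {}
for r1, r2 in product(range(1, 7), repeat=2):
    key = (max(r1, r2), min(r1, r2))
    pmf[key] = pmf.get(key, Fraction(0)) + Fraction(1, 36)

def cond_mean_var(y):
    py = sum(p for (_, b), p in pmf.items() if b == y)            # P(Y = y)
    mu = sum(a * p for (a, b), p in pmf.items() if b == y) / py   # E(X | Y = y)
    var = sum((a - mu) ** 2 * p
              for (a, b), p in pmf.items() if b == y) / py        # Var(X | Y = y)
    return mu, var

mu, var = cond_mean_var(3)
print(float(mu), float(var))   # 4.714... and 1.061...
```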
The next result is an important one in many applications.

Theorem 2.2 (Poisson Conditional Distribution). Let X, Y be independent
Poisson random variables, with means λ and μ. Then the conditional distribution
of X given X + Y = t is Bin(t, p), where p = λ/(λ + μ).

Proof. Clearly, P(X = x | X + Y = t) = 0 for all x > t. For x ≤ t,

   P(X = x | X + Y = t) = P(X = x, X + Y = t) / P(X + Y = t)
                        = P(X = x, Y = t − x) / P(X + Y = t)
                        = [e^{−λ} λ^x / x!] [e^{−μ} μ^{t−x} / (t − x)!] × t! / [e^{−(λ+μ)} (λ + μ)^t]

(on using the fact that X + Y ~ Poi(λ + μ); see Chapter 1)

                        = [t! / (x!(t − x)!)] λ^x μ^{t−x} / (λ + μ)^t
                        = C(t, x) [λ/(λ + μ)]^x [μ/(λ + μ)]^{t−x},

which is the pmf of the Bin(t, λ/(λ + μ)) distribution. ∎
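The binomial conclusion can also be checked empirically; the following is a minimal simulation sketch (assuming NumPy is available; the values λ = 3, μ = 5, t = 8 are arbitrary illustrations, not from the text).

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu, t = 3.0, 5.0, 8
X = rng.poisson(lam, 500_000)
Y = rng.poisson(mu, 500_000)
cond = X[X + Y == t]   # values of X on the event {X + Y = t}

# Bin(t, lam/(lam+mu)) has mean t*lam/(lam+mu) = 3 and variance 15/8 = 1.875
print(cond.mean(), cond.var())
```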

2.3 Using Conditioning to Evaluate Mean and Variance

Conditioning is often an extremely effective tool for calculating probabilities, means,
and variances of random variables with a complex or clumsy joint distribution. Thus,
in order to calculate the mean of a random variable X, it is sometimes greatly convenient
to follow an iterative process, whereby we first evaluate the mean of X after
conditioning on the value y of some suitable random variable Y, and then average
over y. The random variable Y has to be chosen judiciously, but is often clear from
the context of the specific problem. Here are the precise results on how this technique
works; it is important to note that the next two results hold for any kind of
random variables, not just discrete ones.

Theorem 2.3 (Iterated Expectation Formula). Let X, Y be random variables
defined on the same probability space Ω. Suppose E(X) and E(X | Y = y) exist
for each y. Then,

   E(X) = E_Y[E(X | Y = y)];

thus, in the discrete case,

   E(X) = Σ_y μ_X(y) p_Y(y),

where μ_X(y) = E(X | Y = y).

Proof. We prove this for the discrete case. By definition of conditional expectation,

   μ_X(y) = Σ_x x p(x, y) / p_Y(y)
   ⇒ Σ_y μ_X(y) p_Y(y) = Σ_y Σ_x x p(x, y) = Σ_x Σ_y x p(x, y)
                        = Σ_x x Σ_y p(x, y) = Σ_x x p_X(x) = E(X). ∎

The corresponding variance calculation formula is the following. Its proof uses the
iterated expectation formula above, applied to (X − μ_X)².

Theorem 2.4 (Iterated Variance Formula). Let X, Y be random variables defined
on the same probability space Ω. Suppose Var(X) and Var(X | Y = y) exist for
each y. Then,

   Var(X) = E_Y[Var(X | Y = y)] + Var_Y[E(X | Y = y)].

Remark. These two formulas for iterated expectation and iterated variance are valid
for all types of variables, not just the discrete ones. Thus, these same formulas still
hold when we discuss joint distributions for continuous random variables in the next
chapter.
Some operational formulas that one should be familiar with are summarized
below.

Conditional Expectation and Variance Rules.

   E(g(X) | X = x) = g(x);  E(g(X)h(Y) | Y = y) = h(y) E(g(X) | Y = y);
   E(g(X) | Y = y) = E(g(X)) if X, Y are independent;
   Var(g(X) | X = x) = 0;  Var(g(X)h(Y) | Y = y) = h²(y) Var(g(X) | Y = y);
   Var(g(X) | Y = y) = Var(g(X)) if X, Y are independent.

Let us see some applications of the two iterated expectation and iterated variance
formulas.

Example 2.9 (A Two-Stage Experiment). Suppose n fair dice are rolled. Those that
show a six are rolled again. What are the mean and the variance of the number of
sixes obtained in the second round of this experiment?
Define Y to be the number of dice in the first round that show a six, and X the
number of dice in the second round that show a six. Given Y = y, X ~ Bin(y, 1/6),
and Y itself is distributed as Bin(n, 1/6). Therefore,

   E(X) = E[E(X | Y = y)] = E_Y[Y/6] = n/36.

Also,

   Var(X) = E_Y[Var(X | Y = y)] + Var_Y[E(X | Y = y)]
          = E_Y[Y·(1/6)(5/6)] + Var_Y[Y/6]
          = (5/36)(n/6) + (1/36)·n·(1/6)(5/6)
          = 5n/216 + 5n/1296 = 35n/1296.
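A simulation sketch agrees with these formulas (NumPy assumed; n = 36 is chosen here only so that the exact answers come out as round numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 36, 200_000
first = rng.binomial(n, 1 / 6, size=trials)   # sixes in the first round
second = rng.binomial(first, 1 / 6)           # those dice are rolled again
print(second.mean())   # close to n/36 = 1
print(second.var())    # close to 35n/1296 = 0.9722...
```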
Example 2.10. Suppose a chicken lays a Poisson number of eggs per week with
mean λ. Each egg, independently of the others, has a probability p of being fertilized.
We want to find the mean and the variance of the number of eggs fertilized in a
week.
Let N denote the number of eggs laid and X the number of eggs fertilized.
Then N ~ Poi(λ), and given N = n, X ~ Bin(n, p). Therefore,

   E(X) = E_N[E(X | N = n)] = E_N[Np] = pλ,

and,

   Var(X) = E_N[Var(X | N = n)] + Var_N(E(X | N = n))
          = E_N[Np(1 − p)] + Var_N(Np) = p(1 − p)λ + p²λ = pλ.

Interestingly, the number of eggs actually fertilized has the same mean and variance,
pλ. (Can you see why?)
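One way to see it is that a binomially thinned Poisson count is again Poisson, here with mean pλ, so its mean and variance must coincide. A simulation sketch of this thinning (NumPy assumed; λ = 4, p = 0.3 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, p, trials = 4.0, 0.3, 500_000
N = rng.poisson(lam, trials)        # eggs laid in a week
X = rng.binomial(N, p)              # eggs fertilized, given N
print(X.mean(), X.var(), lam * p)   # mean and variance both close to 1.2
```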
Remark. In all of these examples, it was important to choose wisely the variable Y
on which one should condition. The efficiency of the technique depends on this very
crucially.
Sometimes a formal generalization of the iterated expectation formula when a
third variable Z is present is useful. It is particularly useful in hierarchical statistical
modeling of distributions, where an ultimate marginal distribution for some
X is constructed by first conditioning on a number of auxiliary variables, and then
gradually unconditioning them. We state the more general iterated expectation formula;
its proof is exactly similar to that of the usual iterated expectation formula.

Theorem 2.5 (Higher-Order Iterated Expectation). Let X, Y, Z be random
variables defined on the same sample space Ω. Assume that each conditional
expectation below and the marginal expectation E(X) exist. Then,

   E(X) = E_Y[E_{Z|Y}{E(X | Y = y, Z = z)}].

2.4 Covariance and Correlation

We know that variance is additive for independent random variables; that is, if
X₁, X₂, …, Xₙ are independent random variables, then Var(X₁ + X₂ + ⋯ + Xₙ) =
Var(X₁) + ⋯ + Var(Xₙ). In particular, for two independent random variables
X, Y, Var(X + Y) = Var(X) + Var(Y). However, in general, variance is not additive.
Let us do the general calculation for Var(X + Y):

   Var(X + Y) = E(X + Y)² − [E(X + Y)]²
              = E(X² + Y² + 2XY) − [E(X) + E(Y)]²
              = E(X²) + E(Y²) + 2E(XY) − [E(X)]² − [E(Y)]² − 2E(X)E(Y)
              = E(X²) − [E(X)]² + E(Y²) − [E(Y)]² + 2[E(XY) − E(X)E(Y)]
              = Var(X) + Var(Y) + 2[E(XY) − E(X)E(Y)].

We thus have the extra term 2[E(XY) − E(X)E(Y)] in the expression for
Var(X + Y); of course, when X, Y are independent, E(XY) = E(X)E(Y), and
so the extra term drops out. But, in general, one has to keep the extra term. The
quantity E(XY) − E(X)E(Y) is called the covariance of X and Y.

Definition 2.7 (Covariance). Let X, Y be two random variables defined on a
common sample space Ω, such that E(XY), E(X), E(Y) all exist. The covariance
of X and Y is defined as

   Cov(X, Y) = E(XY) − E(X)E(Y) = E[(X − E(X))(Y − E(Y))].

Remark. Covariance is a measure of whether two random variables X, Y tend to
increase or decrease together. If a larger value of X generally causes an increment in
the value of Y, then often (but not always) they have a positive covariance. For example,
taller people tend to weigh more than shorter people, and height and weight
usually have a positive covariance.
Unfortunately, however, covariance can take arbitrarily large positive and negative
values. Therefore, by looking at its value in a particular problem, we cannot
judge whether it is a large value; we cannot compare a covariance with a standard
to judge if it is large or small. A renormalization of the covariance cures this problem,
and calibrates it to a scale of −1 to +1. We can judge such a quantity as large,
small, or moderate; for example, .95 would be large positive, .5 moderate, and .1
small. The renormalized quantity is the correlation coefficient, or simply the correlation,
between X and Y.

Definition 2.8 (Correlation). Let X, Y be two random variables defined on a
common sample space Ω, such that Var(X), Var(Y) are both finite. The correlation
between X, Y is defined to be

   ρ_{X,Y} = Cov(X, Y) / [√Var(X) √Var(Y)].

Some important properties of covariance and correlation are put together in the next
theorem.

Theorem 2.6 (Properties of Covariance and Correlation). Provided that the required
variances and covariances exist,
(a) Cov(X, c) = 0 for any X and any constant c;
(b) Cov(X, X) = Var(X) for any X;
(c) Cov(Σ_{i=1}^{n} aᵢXᵢ, Σ_{j=1}^{m} bⱼYⱼ) = Σ_{i=1}^{n} Σ_{j=1}^{m} aᵢbⱼ Cov(Xᵢ, Yⱼ),
and in particular,

   Var(aX + bY) = Cov(aX + bY, aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y),

and,

   Var(Σ_{i=1}^{n} Xᵢ) = Σ_{i=1}^{n} Var(Xᵢ) + 2 Σ_{i<j} Cov(Xᵢ, Xⱼ);

(d) For any two independent random variables X, Y, Cov(X, Y) = ρ_{X,Y} = 0;
(e) ρ_{a+bX, c+dY} = sgn(bd) ρ_{X,Y}, where sgn(bd) = 1 if bd > 0, and = −1
if bd < 0;
(f) Whenever ρ_{X,Y} is defined, −1 ≤ ρ_{X,Y} ≤ 1;
(g) ρ_{X,Y} = 1 if and only if for some a and some b > 0, P(Y = a + bX) = 1;
ρ_{X,Y} = −1 if and only if for some a and some b < 0, P(Y = a + bX) = 1.

Proof. For part (a), Cov(X, c) = E(cX) − E(c)E(X) = cE(X) − cE(X) = 0.
For part (b), Cov(X, X) = E(X²) − [E(X)]² = Var(X). For part (c),

   Cov(Σ_{i=1}^{n} aᵢXᵢ, Σ_{j=1}^{m} bⱼYⱼ)
   = E[(Σ_{i=1}^{n} aᵢXᵢ)(Σ_{j=1}^{m} bⱼYⱼ)] − E(Σ_{i=1}^{n} aᵢXᵢ) E(Σ_{j=1}^{m} bⱼYⱼ)
   = E[Σ_{i=1}^{n} Σ_{j=1}^{m} aᵢbⱼ XᵢYⱼ] − [Σ_{i=1}^{n} aᵢE(Xᵢ)][Σ_{j=1}^{m} bⱼE(Yⱼ)]
   = Σ_{i=1}^{n} Σ_{j=1}^{m} aᵢbⱼ E(XᵢYⱼ) − Σ_{i=1}^{n} Σ_{j=1}^{m} aᵢbⱼ E(Xᵢ)E(Yⱼ)
   = Σ_{i=1}^{n} Σ_{j=1}^{m} aᵢbⱼ [E(XᵢYⱼ) − E(Xᵢ)E(Yⱼ)]
   = Σ_{i=1}^{n} Σ_{j=1}^{m} aᵢbⱼ Cov(Xᵢ, Yⱼ).

Part (d) follows on noting that E(XY) = E(X)E(Y) if X, Y are independent. For
part (e), first note that Cov(a + bX, c + dY) = bd Cov(X, Y), by using part (a)
and part (c). Also, Var(a + bX) = b²Var(X) and Var(c + dY) = d²Var(Y)

   ⇒ ρ_{a+bX, c+dY} = bd Cov(X, Y) / [√(b²Var(X)) √(d²Var(Y))]
                    = bd Cov(X, Y) / [|b|√Var(X) |d|√Var(Y)]
                    = (bd/|bd|) ρ_{X,Y} = sgn(bd) ρ_{X,Y}.

The proof of part (f) uses the Cauchy–Schwarz inequality (see Chapter 1): for
any two random variables U, V, [E(UV)]² ≤ E(U²)E(V²). Let

   U = (X − E(X))/√Var(X),  V = (Y − E(Y))/√Var(Y).

Then, E(U²) = E(V²) = 1, and

   ρ_{X,Y} = E(UV) ≤ √[E(U²)E(V²)] = 1.

The lower bound ρ_{X,Y} ≥ −1 follows similarly.
Part (g) uses the condition for equality in the Cauchy–Schwarz inequality: in
order that ρ_{X,Y} = ±1, one must have [E(UV)]² = E(U²)E(V²) in the argument
above, which implies the statement in part (g). ∎
Example 2.11 (Correlation Between Minimum and Maximum in Dice Rolls). Consider
again the experiment of rolling a fair die twice, and let X, Y be the maximum
and the minimum of the two rolls. We want to find the correlation between X, Y.
The joint distribution of (X, Y) was worked out in Example 2.2. From the joint
distribution,

   E(XY) = 1·(1/36) + 2·(1/18) + 4·(1/36) + 3·(1/18) + 6·(1/18) + 9·(1/36)
           + ⋯ + 30·(1/18) + 36·(1/36) = 49/4.

The marginal pmfs of X, Y were also worked out in Example 2.2. From the marginal
pmfs, by direct calculation, E(X) = 161/36, E(Y) = 91/36, Var(X) = Var(Y) =
2555/1296. Therefore,

   ρ_{X,Y} = [E(XY) − E(X)E(Y)] / [√Var(X) √Var(Y)]
           = [49/4 − (161/36)(91/36)] / (2555/1296) = 35/73 = .48.

The correlation between the maximum and the minimum is in fact positive for any
number of rolls of the die, although the correlation converges to zero as the
number of rolls goes to infinity.
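The claim about many rolls can be checked by enumeration; the sketch below (brute force, not from the text) computes ρ for the maximum and minimum of r rolls of a fair die.

```python
from itertools import product

def corr_max_min(rolls, faces=6):
    pts = [(max(t), min(t)) for t in product(range(1, faces + 1), repeat=rolls)]
    m = len(pts)
    ex = sum(x for x, _ in pts) / m
    ey = sum(y for _, y in pts) / m
    cov = sum(x * y for x, y in pts) / m - ex * ey
    vx = sum(x * x for x, _ in pts) / m - ex ** 2
    vy = sum(y * y for _, y in pts) / m - ey ** 2
    return cov / (vx * vy) ** 0.5

print(corr_max_min(2))                                   # 0.4794... = 35/73
print([round(corr_max_min(r), 3) for r in range(2, 8)])  # stays positive, tends to 0
```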

Example 2.12 (Correlation in the Chicken–Eggs Example). Consider again the example
of a chicken laying a Poisson number of eggs N with mean λ, and each egg
fertilizing, independently of the others, with probability p. If X is the number of eggs
actually fertilized, we want to find the correlation between the number of eggs laid
and the number fertilized, that is, the correlation between X and N.
First,

   E(XN) = E_N[E(XN | N = n)] = E_N[n E(X | N = n)]
         = E_N[n²p] = p(λ + λ²).

Next, from our previous calculations, E(X) = pλ, E(N) = λ, Var(X) = pλ,
Var(N) = λ. Therefore,

   ρ_{X,N} = [E(XN) − E(X)E(N)] / [√Var(X) √Var(N)]
           = [p(λ + λ²) − pλ²] / [√(pλ) √λ] = pλ/[√(pλ) √λ] = √p.

Thus, the correlation goes up with the fertility rate of the eggs.

Example 2.13 (Best Linear Predictor). Suppose X and Y are two jointly distributed
random variables, and either by necessity, or by omission, the variable Y was not
observed. But X was observed, and there may be some information in the X value
about Y. The problem is to predict Y by using X. Linear predictors, because of their
functional simplicity, are appealing. The mathematical problem is to choose the best
linear predictor a + bX of Y, where best is defined as the predictor that minimizes
the mean squared error E[Y − (a + bX)]². We show that the answer has something
to do with the covariance between X and Y.
By expanding the square,

   R(a, b) = E[Y − (a + bX)]²
           = a² + b²E(X²) + 2abE(X) − 2aE(Y) − 2bE(XY) + E(Y²).

To minimize this with respect to a, b, we partially differentiate R(a, b) with respect
to a, b, and set the derivatives equal to zero:

   ∂R(a, b)/∂a = 2a + 2bE(X) − 2E(Y) = 0 ⇔ a + bE(X) = E(Y);
   ∂R(a, b)/∂b = 2bE(X²) + 2aE(X) − 2E(XY) = 0 ⇔ aE(X) + bE(X²) = E(XY).

Simultaneously solving these two equations, we get

   b = [E(XY) − E(X)E(Y)]/Var(X),  a = E(Y) − {[E(XY) − E(X)E(Y)]/Var(X)} E(X).

These values do minimize R(a, b), by an easy application of the second derivative
test. So the best linear predictor of Y based on X is

   E(Y) − [Cov(X, Y)/Var(X)] E(X) + [Cov(X, Y)/Var(X)] X
   = E(Y) + [Cov(X, Y)/Var(X)] [X − E(X)].

The best linear predictor is also known as the regression line of Y on X. It is of
widespread use in statistics.
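Here is a sketch of the regression-line formula in action (NumPy assumed; the linear model used to generate data below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50_000)
y = 2.0 + 1.5 * x + rng.normal(size=50_000)   # illustrative data-generating model

b = np.cov(x, y, bias=True)[0, 1] / x.var()   # Cov(X, Y) / Var(X)
a = y.mean() - b * x.mean()                   # E(Y) - b E(X)
print(a, b)                                   # close to (2.0, 1.5)
```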

Example 2.14 (Zero Correlation Does Not Mean Independence). If X, Y are independent,
then necessarily Cov(X, Y) = 0, and hence the correlation is also zero.
The converse is not true. Take a three-valued random variable X with the pmf
P(X = ±1) = p, P(X = 0) = 1 − 2p, 0 < p < 1/2. Let the other variable
Y be Y = X². Then E(XY) = E(X³) = 0, and E(X)E(Y) = 0, because
E(X) = 0. Therefore, Cov(X, Y) = 0. But X, Y are certainly not independent; for
example, P(Y = 0 | X = 0) = 1, but P(Y = 0) = 1 − 2p ≠ 1.
Indeed, if X has a distribution symmetric around zero, and if X has three finite
moments, then X and X² always have a zero correlation, although they are not
independent.

2.5 Multivariate Case

The extension of the concepts for the bivariate discrete case to the multivariate dis-
crete case is straightforward. We give the appropriate definitions and an important
example, namely that of the multinomial distribution, an extension of the binomial
distribution.

Definition 2.9. Let X₁, X₂, …, Xₙ be discrete random variables defined on a common
sample space Ω, with Xᵢ taking values in some countable set 𝒳ᵢ. The joint
pmf of (X₁, X₂, …, Xₙ) is defined as p(x₁, x₂, …, xₙ) = P(X₁ = x₁, …, Xₙ = xₙ),
xᵢ ∈ 𝒳ᵢ, and zero otherwise.

Definition 2.10. Let X₁, X₂, …, Xₙ be random variables defined on a common
sample space Ω. The joint CDF of X₁, X₂, …, Xₙ is defined as F(x₁, x₂, …, xₙ) =
P(X₁ ≤ x₁, X₂ ≤ x₂, …, Xₙ ≤ xₙ), x₁, x₂, …, xₙ ∈ R.

The requirements of a joint pmf are the usual:
(i) p(x₁, x₂, …, xₙ) ≥ 0 for all x₁, x₂, …, xₙ ∈ R;
(ii) Σ_{x₁∈𝒳₁,…,xₙ∈𝒳ₙ} p(x₁, x₂, …, xₙ) = 1.

The requirements of a joint CDF are somewhat more complicated. They are that
(i) 0 ≤ F ≤ 1 for all (x₁, …, xₙ);
(ii) F is nondecreasing in each coordinate;
(iii) F tends to zero if one or more of the xᵢ → −∞;
(iv) F tends to one if all the xᵢ → +∞;
(v) F assigns a nonnegative probability to every n-dimensional rectangle

   [a₁, b₁] × [a₂, b₂] × ⋯ × [aₙ, bₙ].

This last condition, (v), is a notationally clumsy condition to write down. If n = 2,
it reduces to the simple inequality

   F(b₁, b₂) − F(a₁, b₂) − F(b₁, a₂) + F(a₁, a₂) ≥ 0 for all a₁ ≤ b₁, a₂ ≤ b₂.

Once again, we mention that it is not convenient or interesting to work with the
CDF for discrete random variables; for discrete variables, it is preferable to work
with the pmf.

2.5.1 Joint MGF

Analogous to the case of one random variable, we can define the joint mgf for several
random variables. The definition is the same for all types of random variables,
discrete, continuous, or mixed types. As in the one-dimensional case, the
joint mgf of several random variables is also a very useful tool. First, we repeat
the definition of the expectation of a function of several random variables; see
Chapter 1, where it was first introduced and defined. The definition below is equivalent
to what was given in Chapter 1.

Definition 2.11. Let X₁, X₂, …, Xₙ be discrete random variables defined on
a common sample space Ω, with Xᵢ taking values in some countable set 𝒳ᵢ.
Let the joint pmf of X₁, X₂, …, Xₙ be p(x₁, …, xₙ), and let g(x₁, …, xₙ) be
a real-valued function of n variables. We say that E[g(X₁, X₂, …, Xₙ)] exists
if Σ_{x₁∈𝒳₁,…,xₙ∈𝒳ₙ} |g(x₁, …, xₙ)| p(x₁, …, xₙ) < ∞, in which case the
expectation is defined as

   E[g(X₁, X₂, …, Xₙ)] = Σ_{x₁∈𝒳₁,…,xₙ∈𝒳ₙ} g(x₁, …, xₙ) p(x₁, …, xₙ).

A corresponding definition when X₁, X₂, …, Xₙ are all continuous random variables
is given in the next chapter.
Definition 2.12. Let X₁, X₂, …, Xₙ be n random variables defined on a common
sample space Ω. The joint moment-generating function of X₁, X₂, …, Xₙ is defined
to be

   ψ(t₁, t₂, …, tₙ) = E[e^{t₁X₁ + t₂X₂ + ⋯ + tₙXₙ}] = E[e^{t′X}],

provided the expectation exists, and where t′X denotes the inner product of the
vectors t = (t₁, …, tₙ) and X = (X₁, …, Xₙ).

Note that the joint moment-generating function (mgf) always exists at the origin,
namely t = (0, …, 0), and equals one at that point. It may or may not exist at other
points t. If it does exist in a nonempty rectangle containing the origin, then many
important characteristics of the joint distribution of X₁, X₂, …, Xₙ can be derived
by using the joint mgf. As in the one-dimensional case, it is a very useful tool. Here
is the moment-generation property of a joint mgf.

Theorem 2.7. Suppose ψ(t₁, t₂, …, tₙ) exists in a nonempty open rectangle containing
the origin t = 0. Then a partial derivative of ψ(t₁, t₂, …, tₙ) of every order
with respect to each tᵢ exists in that open rectangle, and furthermore,

   E[X₁^{k₁} X₂^{k₂} ⋯ Xₙ^{kₙ}]
   = [∂^{k₁+k₂+⋯+kₙ} / ∂t₁^{k₁} ⋯ ∂tₙ^{kₙ}] ψ(t₁, t₂, …, tₙ) |_{t₁=0, t₂=0, …, tₙ=0}.

A corollary of this result is sometimes useful in determining the covariance between
two random variables.

Corollary. Let X, Y have a joint mgf ψ in some open rectangle around the origin
(0, 0). Then,

   Cov(X, Y) = [∂²ψ/∂t₁∂t₂](0,0) − [∂ψ/∂t₁](0,0) × [∂ψ/∂t₂](0,0).

We also have the distribution-determining property, as in the one-dimensional case.

Theorem 2.8. Suppose (X₁, X₂, …, Xₙ) and (Y₁, Y₂, …, Yₙ) are two sets of
jointly distributed random variables, such that their mgfs ψ_X(t₁, t₂, …, tₙ) and
ψ_Y(t₁, t₂, …, tₙ) exist and coincide in some nonempty open rectangle containing
the origin. Then (X₁, X₂, …, Xₙ) and (Y₁, Y₂, …, Yₙ) have the same joint
distribution.

Remark. It is important to note that the last two theorems are not limited to discrete
random variables; they are valid for general random variables. The proofs of these
two theorems follow the same arguments as in the one-dimensional case, namely
that when an mgf exists in a nonempty open rectangle, it can be differentiated
infinitely often with respect to each variable tᵢ inside the expectation; that is, the
order of the derivative and the expectation can be interchanged.

2.5.2 Multinomial Distribution

One of the most important multivariate discrete distributions is the multinomial
distribution. The multinomial distribution corresponds to n balls being distributed to k
cells, independently, with each ball having the probability pᵢ of being dropped into
the ith cell. The random variables under consideration are X₁, X₂, …, Xₖ, where
Xᵢ is the number of balls that get dropped into the ith cell. Their joint pmf is
the multinomial pmf defined below.

Definition 2.13. A multivariate random vector (X₁, X₂, …, Xₖ) is said to have a
multinomial distribution with parameters n, p₁, p₂, …, pₖ if it has the pmf

   P(X₁ = x₁, X₂ = x₂, …, Xₖ = xₖ) = [n!/(x₁!x₂!⋯xₖ!)] p₁^{x₁} p₂^{x₂} ⋯ pₖ^{xₖ},

   xᵢ ≥ 0, Σ_{i=1}^{k} xᵢ = n;  pᵢ ≥ 0, Σ_{i=1}^{k} pᵢ = 1.

We write (X₁, X₂, …, Xₖ) ~ Mult(n, p₁, …, pₖ) to denote a random vector
with a multinomial distribution.
Example 2.15 (Dice Rolls). Suppose a fair die is rolled 30 times. We want to find
the probabilities that
(i) each face is obtained exactly five times;
(ii) the number of sixes is at least five.
If we denote the number of times face number i is obtained as Xᵢ, then
(X₁, X₂, …, X₆) ~ Mult(n, p₁, …, p₆), where n = 30 and each pᵢ = 1/6.
Therefore,

   P(X₁ = 5, X₂ = 5, …, X₆ = 5) = [30!/(5!)⁶] (1/6)⁵ ⋯ (1/6)⁵
                                 = [30!/(5!)⁶] (1/6)³⁰
                                 = .0004.

Next, each of the 30 rolls will either be a six or not, independently of the other rolls,
with probability 1/6, and so X₆ ~ Bin(30, 1/6). Therefore,

   P(X₆ ≥ 5) = 1 − P(X₆ ≤ 4) = 1 − Σ_{x=0}^{4} C(30, x) (1/6)^x (5/6)^{30−x}
             = .5757.
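Both numbers are easy to reproduce exactly; here is a sketch using only the Python standard library:

```python
from math import comb, factorial

# (i) Each face exactly five times in 30 rolls of a fair die.
p_equal = factorial(30) / factorial(5) ** 6 / 6 ** 30
# (ii) At least five sixes: X6 ~ Bin(30, 1/6).
p_sixes = 1 - sum(comb(30, x) * (1 / 6) ** x * (5 / 6) ** (30 - x)
                  for x in range(5))
print(round(p_equal, 4), round(p_sixes, 4))   # 0.0004 and 0.5757
```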

Example 2.16 (Bridge). Consider a Bridge game with four players, North, South,
East, and West. We want to find the probability that North and South together
have two or more aces. Let Xᵢ denote the number of aces in the hands of player
i, i = 1, 2, 3, 4; we let i = 1, 2 mean North and South. Then we want to find
P(X₁ + X₂ ≥ 2).
The joint distribution of (X₁, X₂, X₃, X₄) is Mult(4, 1/4, 1/4, 1/4, 1/4) (think of each
ace as a ball, and the four players as cells). Then (X₁ + X₂, X₃ + X₄) ~
Mult(4, 1/2, 1/2). Therefore,

   P(X₁ + X₂ ≥ 2) = [4!/(2!2!)](1/2)⁴ + [4!/(3!1!)](1/2)⁴ + [4!/(4!0!)](1/2)⁴
                  = 11/16.
Important formulas and facts about the multinomial distribution are given in the
next theorem.

Theorem 2.9. Let (X₁, X₂, …, Xₖ) ~ Mult(n, p₁, p₂, …, pₖ). Then,
(a) E(Xᵢ) = npᵢ; Var(Xᵢ) = npᵢ(1 − pᵢ);
(b) for all i, Xᵢ ~ Bin(n, pᵢ);
(c) Cov(Xᵢ, Xⱼ) = −npᵢpⱼ for all i ≠ j;
(d) ρ_{Xᵢ,Xⱼ} = −√[pᵢpⱼ/((1 − pᵢ)(1 − pⱼ))] for all i ≠ j;
(e) for all m, 1 ≤ m < k,

   (X₁, X₂, …, X_m) | (X_{m+1} + X_{m+2} + ⋯ + Xₖ = s) ~ Mult(n − s, π₁, π₂, …, π_m),

where πᵢ = pᵢ/(p₁ + p₂ + ⋯ + p_m).

Proof. Define W_{ir} as the indicator of the event that the rth ball lands in the ith cell.
Note that for a given i, the variables W_{ir} are independent. Then,

   Xᵢ = Σ_{r=1}^{n} W_{ir},

and therefore E(Xᵢ) = Σ_{r=1}^{n} E[W_{ir}] = npᵢ, and Var(Xᵢ) = Σ_{r=1}^{n} Var(W_{ir}) =
npᵢ(1 − pᵢ). Part (b) follows from the definition of a multinomial experiment
(the trials are identical and independent, and each ball either lands or does not land
in the ith cell). For part (c),

   Cov(Xᵢ, Xⱼ) = Cov(Σ_{r=1}^{n} W_{ir}, Σ_{s=1}^{n} W_{js})
              = Σ_{r=1}^{n} Σ_{s=1}^{n} Cov(W_{ir}, W_{js})
              = Σ_{r=1}^{n} Cov(W_{ir}, W_{jr})

(because Cov(W_{ir}, W_{js}) would be zero when s ≠ r)

              = Σ_{r=1}^{n} [E(W_{ir}W_{jr}) − E(W_{ir})E(W_{jr})]
              = Σ_{r=1}^{n} [0 − pᵢpⱼ] = −npᵢpⱼ.

Part (d) follows immediately from part (c) and part (a). Part (e) is a calculation, and
is omitted. ∎
Example 2.17 (MGF of the Multinomial Distribution). Let (X₁, X₂, …, Xₖ) ~
Mult(n, p₁, p₂, …, pₖ). Then the mgf ψ(t₁, t₂, …, tₖ) exists at all t, and a formula
follows easily. Indeed,

   E[e^{t₁X₁ + ⋯ + tₖXₖ}]
   = Σ_{xᵢ≥0, Σᵢxᵢ=n} [n!/(x₁!⋯xₖ!)] e^{t₁x₁} e^{t₂x₂} ⋯ e^{tₖxₖ} p₁^{x₁} p₂^{x₂} ⋯ pₖ^{xₖ}
   = Σ_{xᵢ≥0, Σᵢxᵢ=n} [n!/(x₁!⋯xₖ!)] (p₁e^{t₁})^{x₁} (p₂e^{t₂})^{x₂} ⋯ (pₖe^{tₖ})^{xₖ}
   = (p₁e^{t₁} + p₂e^{t₂} + ⋯ + pₖe^{tₖ})ⁿ,

by the multinomial expansion identity

   (a₁ + a₂ + ⋯ + aₖ)ⁿ = Σ_{xᵢ≥0, Σᵢxᵢ=n} [n!/(x₁!⋯xₖ!)] a₁^{x₁} a₂^{x₂} ⋯ aₖ^{xₖ}.

2.6 * The Poissonization Technique

Calculation of complex multinomial probabilities often gets technically simplified
by taking the number of balls to be a random variable, specifically a Poisson random
variable. We give the Poissonization theorem and some examples in this section.

Theorem 2.10. Let N ~ Poi(λ), and suppose that given N = n, (X₁, X₂, …, Xₖ) ~
Mult(n, p₁, p₂, …, pₖ). Then, marginally, X₁, X₂, …, Xₖ are independent
Poisson, with Xᵢ ~ Poi(λpᵢ).

Proof. By the total probability formula,

   P(X₁ = x₁, X₂ = x₂, …, Xₖ = xₖ)
   = Σ_{n=0}^{∞} P(X₁ = x₁, X₂ = x₂, …, Xₖ = xₖ | N = n) e^{−λ}λⁿ/n!
   = Σ_{n=0}^{∞} [(x₁ + x₂ + ⋯ + xₖ)!/(x₁!x₂!⋯xₖ!)] p₁^{x₁} p₂^{x₂} ⋯ pₖ^{xₖ} (e^{−λ}λⁿ/n!) I_{n=x₁+x₂+⋯+xₖ}
   = [1/(x₁!x₂!⋯xₖ!)] e^{−λ} λ^{x₁+x₂+⋯+xₖ} p₁^{x₁} p₂^{x₂} ⋯ pₖ^{xₖ}
   = [1/(x₁!x₂!⋯xₖ!)] e^{−λ} (λp₁)^{x₁} (λp₂)^{x₂} ⋯ (λpₖ)^{xₖ}
   = Π_{i=1}^{k} e^{−λpᵢ}(λpᵢ)^{xᵢ}/xᵢ!,

which establishes that the joint marginal pmf of (X₁, X₂, …, Xₖ) is the product
of k Poisson pmfs, and so X₁, X₂, …, Xₖ must be marginally independent, with
Xᵢ ~ Poi(λpᵢ). ∎

Corollary. Let A be a set in the k-dimensional Euclidean space R^k, and let
(Y₁, Y₂, …, Yₖ) ~ Mult(n, p₁, p₂, …, pₖ). Then P((Y₁, Y₂, …, Yₖ) ∈ A)
equals n! c(n), where c(n) is the coefficient of λⁿ in the power series expansion
of e^{λ} P((X₁, X₂, …, Xₖ) ∈ A). Here X₁, X₂, …, Xₖ are as above: they are
independent Poisson variables, with Xᵢ ~ Poi(λpᵢ).
The corollary is simply a restatement of the identity

   P((X₁, X₂, …, Xₖ) ∈ A) = Σ_{n=0}^{∞} e^{−λ} (λⁿ/n!) P((Y₁, Y₂, …, Yₖ) ∈ A).

Example 2.18 (No Empty Cells). Suppose n balls are distributed independently and
at random into k cells. We want to find a formula for the probability that no cell
remains empty.
We use the Poissonization technique to solve this problem. We want a formula
for P(Y₁ ≠ 0, Y₂ ≠ 0, …, Yₖ ≠ 0).
Marginally, each Xᵢ ~ Poi(λ/k), and therefore,

   P(X₁ > 0, X₂ > 0, …, Xₖ > 0) = (1 − e^{−λ/k})^k
   ⇒ e^{λ} P(X₁ > 0, X₂ > 0, …, Xₖ > 0) = e^{λ}(1 − e^{−λ/k})^k
   = Σ_{x=0}^{k} (−1)^x C(k, x) e^{λ(1−x/k)}
   = Σ_{x=0}^{k} (−1)^x C(k, x) Σ_{n=0}^{∞} [λ(1 − x/k)]ⁿ/n!
   = Σ_{n=0}^{∞} (λⁿ/n!) [Σ_{x=0}^{k} (−1)^x C(k, x) (1 − x/k)ⁿ].

Therefore, by the above corollary,

   P(Y₁ ≠ 0, Y₂ ≠ 0, …, Yₖ ≠ 0) = Σ_{x=0}^{k} (−1)^x C(k, x) (1 − x/k)ⁿ.
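A sketch comparing this inclusion-exclusion formula with direct simulation (NumPy assumed; n = 12 balls and k = 4 cells are chosen arbitrarily):

```python
from math import comb
import numpy as np

def p_no_empty(n, k):
    # The formula obtained above by Poissonization.
    return sum((-1) ** x * comb(k, x) * (1 - x / k) ** n for x in range(k + 1))

rng = np.random.default_rng(4)
n, k, trials = 12, 4, 100_000
cells = rng.integers(0, k, size=(trials, n))       # cell of each of the n balls
no_empty = np.mean([len(set(row)) == k for row in cells])
print(p_no_empty(n, k), no_empty)                  # the two agree closely
```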

Exercises

Exercise 2.1. Consider the experiment of picking one word at random from the
sentence

   ALL IS WELL IN THE NEWELL FAMILY

Let X be the length of the word selected and Y the number of Ls in it. Find in a
tabular form the joint pmf of X and Y, their marginal pmfs, means, and variances,
and the correlation between X and Y.

Exercise 2.2. A fair coin is tossed four times. Let X be the number of heads, Z the
number of tails, and Y = |X − Z|. Find the joint pmf of (X, Y), and E(Y).

Exercise 2.3. Consider the joint pmf p(x, y) = cxy, 1 ≤ x ≤ 3, 1 ≤ y ≤ 3.
(a) Find the normalizing constant c.
(b) Are X, Y independent? Prove your claim.
(c) Find the expectations of X, Y, XY.

Exercise 2.4. Consider the joint pmf p(x, y) = cxy, 1 ≤ x ≤ y ≤ 3.
(a) Find the normalizing constant c.
(b) Are X, Y independent? Prove your claim.
(c) Find the expectations of X, Y, XY.

Exercise 2.5. A fair die is rolled twice. Let X be the maximum and Y the minimum
of the two rolls. By using the joint pmf of (X, Y) worked out in the text, find the
pmf of X/Y, and hence the mean of X/Y.

Exercise 2.6. A hat contains four slips of paper, numbered 1, 2, 3, and 4. Two slips
are drawn at random, without replacement. X is the number on the first slip and Y
the sum of the two numbers drawn. Write in a tabular form the joint pmf of (X, Y).
Hence find the marginal pmfs. Are X, Y independent?

Exercise 2.7 * (Conditional Expectation in Bridge). Let X be the number of
clubs in the hand of North and Y the number of clubs in the hand of South in a
Bridge game. Write a general formula for E(X | Y = y), and compute E(X | Y = 3).
How about E(Y | X = 3)?

Exercise 2.8. A fair die is rolled four times. Find the probabilities that:
(a) at least one six is obtained;
(b) exactly one six and exactly one two are obtained;
(c) exactly one six, one two, and two fours are obtained.

Exercise 2.9 (Iterated Expectation). A household has a Poisson number of cars
with mean 1. Each car that a household possesses has, independently of the other
cars, a 20% chance of being an SUV. Find the mean number of SUVs a household
possesses.

Exercise 2.10 (Iterated Variance). Suppose N ~ Poi(λ), and given N = n, X is
distributed as a uniform on {0, 1, …, n}. Find the variance of the marginal distribution
of X.

Exercise 2.11. Suppose X and Y are independent Geo(p) random variables. Find
P(X ≥ Y); P(X > Y).

Exercise 2.12 *. Suppose X and Y are independent Poi(λ) random variables. Find
P(X ≥ Y); P(X > Y).
Hint: This involves a Bessel function of a suitable kind.

Exercise 2.13. Suppose X and Y are independent and take the values 1, 2, 3, 4 with
probabilities .2, .3, .3, .2. Find the pmf of X + Y.

Exercise 2.14. Two random variables have the joint pmf p(x, x + 1) =
1/(n + 1), x = 0, 1, …, n. Answer the following questions with as little calculation
as possible.
(a) Are X, Y independent?
(b) What is the variance of Y − X?
(c) What is Var(Y | X = 1)?

Exercise 2.15 (Binomial Conditional Distribution). Suppose X, Y are independent
random variables, and that X ~ Bin(m, p), Y ~ Bin(n, p). Show that the
conditional distribution of X given X + Y = t is a hypergeometric distribution;
identify the parameters of this hypergeometric distribution.

Exercise 2.16 * (Poly-Hypergeometric Distribution). A box has D₁ red, D₂
green, and D₃ blue balls. Suppose n balls are picked at random without replacement
from the box. Let X, Y, Z be the numbers of red, green, and blue balls selected.
Find the joint pmf of (X, Y, Z).

Exercise 2.17 (Bivariate Poisson). Suppose U, V, W are independent Poisson random
variables, with means λ, μ, ν. Let X = U + W and Y = V + W.
(a) Find the marginal pmfs of X, Y.
(b) Find the joint pmf of (X, Y).

Exercise 2.18. Suppose a fair die is rolled twice. Let X, Y be the two rolls. Find
the following with as little calculation as possible:
(a) E(X + Y | Y = y);
(b) E(XY | Y = y);
(c) Var(X²Y | Y = y);
(d) ρ_{X+Y, X−Y}.

Exercise 2.19 (A Waiting Time Problem). In repeated throws of a fair die, let X
be the throw in which the first six is obtained, and Y the throw in which the second
six is obtained.
(a) Find the joint pmf of (X, Y).
(b) Find the expectation of Y − X.
(c) Find E(Y − X | X = 8).
(d) Find Var(Y − X | X = 8).

Exercise 2.20 * (Family Planning). A couple want to have a child of each sex, but
they will have at most four children. Let X be the total number of children they
will have and Y the number of girls at the second childbirth. Find the joint pmf of
(X, Y), and the conditional expectation of X given Y = y, y = 0, 2.

Exercise 2.21 (A Standard Deviation Inequality). Let X, Y be two random variables.
Show that σ_{X+Y} ≤ σ_X + σ_Y.

Exercise 2.22 * (A Covariance Fact). Let X, Y be two random variables. Suppose
E(X | Y = y) is nondecreasing in y. Show that ρ_{X,Y} ≥ 0, assuming the correlation
exists.

Exercise 2.23 (Another Covariance Fact). Let X, Y be two random variables.
Suppose E(X | Y = y) is a finite constant c. Show that Cov(X, Y) = 0.

Exercise 2.24 (Two-Valued Random Variables). Suppose X, Y are both two-valued
random variables. Prove that X and Y are independent if and only if they
have a zero correlation.

Exercise 2.25 * (A Correlation Inequality). Suppose X, Y each have mean 0 and
variance 1, and a correlation ρ. Show that E(max{X², Y²}) ≤ 1 + √(1 − ρ²).

Exercise 2.26 (A Covariance Inequality). Let X be any random variable, and
g(X), h(X) two functions such that they are both nondecreasing or both nonincreasing.
Show that Cov(g(X), h(X)) ≥ 0.

Exercise 2.27 (Joint MGF). Suppose a fair die is rolled four times. Let X be the
number of ones and Y the number of sixes. Find the joint mgf of X and Y, and
hence, the covariance between X, Y.

Exercise 2.28 (MGF of Bivariate Poisson). Suppose U, V, W are independent
Poisson random variables, with means λ, μ, ν. Let X = U + W and Y = V + W.
Find the joint mgf of X, Y, and hence E(XY).

Exercise 2.29 (Joint MGF). In repeated throws of a fair die, let X be the throw in
which the first six is obtained, and Y the throw in which the second six is obtained.
Find the joint mgf of X, Y, and hence the covariance between X and Y.

Exercise 2.30 * (Poissonization). A fair die is rolled 30 times. By using the Poissonization
theorem, find the probability that the maximum number of times any face
appears is 9 or more.

Exercise 2.31 * (Poissonization). Individuals in a population can be of one of three
genotypes. Each genotype has the same percentage of individuals. A sample of n
individuals from the population will be taken. What is the smallest n for which,
with probability ≥ .9, there are at least five individuals of each genotype in the
sample?
Chapter 3
Multidimensional Densities

Similar to the case of several discrete random variables, in applications we are
frequently interested in studying several continuous random variables simultaneously.
And, as in the case of one continuous random variable, we do not speak of
pmfs of several continuous variables, but of a pdf, jointly for all the continuous
random variables. The joint density function completely characterizes the joint
distribution of the full set of continuous random variables. We refer to the entire
set of random variables as a random vector. Both the calculational aspects and
the application aspects of multidimensional density functions are generally
sophisticated. As such, the ability to use and operate with multidimensional
densities is among the most important skills one needs to have in probability and
also in statistics. The general concepts and calculations are discussed in this chapter.
Some special multidimensional densities are introduced separately in later
chapters.

3.1 Joint Density Function and Its Role

Exactly as in the one-dimensional case, it is important to note the following points.


(a) The joint density function of all the variables does not equal the probability of
a specific point in the multidimensional space; the probability of any specific
point is still zero.
(b) The joint density function reflects the relative importance of a particular point.
Thus, the probability that the variables together belong to a small set around a
specific point, say x = (x₁, x₂, …, xₙ), is roughly equal to the volume of that
set multiplied by the density function at the specific point x. This volume inter-
pretation for probabilities is useful for intuitive understanding of distributions
of multidimensional continuous random variables.
(c) For a general set A in the multidimensional space, the probability that the ran-
dom vector X belongs to A is obtained by integrating the joint density function
over the set A.


These are all just the most natural extensions of the corresponding one-
dimensional facts to the present multidimensional case. We now formally define a
joint density function.

Definition 3.1. Let X = (X₁, X₂, …, Xₙ) be an n-dimensional random vector,
taking values in Rⁿ, for some n, 1 < n < ∞. We say that f(x₁, x₂, …, xₙ) is the
joint density, or simply the density, of X if for all a₁, a₂, …, aₙ, b₁, b₂, …, bₙ,
−∞ < aᵢ ≤ bᵢ < ∞,

   P(a₁ ≤ X₁ ≤ b₁, a₂ ≤ X₂ ≤ b₂, …, aₙ ≤ Xₙ ≤ bₙ)
   = ∫_{aₙ}^{bₙ} ⋯ ∫_{a₂}^{b₂} ∫_{a₁}^{b₁} f(x₁, x₂, …, xₙ) dx₁ dx₂ ⋯ dxₙ.

In order that a function f : Rⁿ → R be the density function of some n-dimensional
random vector, it is necessary and sufficient that
(i) f(x₁, x₂, …, xₙ) ≥ 0 for all (x₁, x₂, …, xₙ) ∈ Rⁿ;
(ii) ∫_{Rⁿ} f(x₁, x₂, …, xₙ) dx₁ dx₂ ⋯ dxₙ = 1.

The definition of the joint CDF is the same as that given in the discrete case. But
now the joint CDF is an integral of the density rather than a sum. Here is the precise
definition.

Definition 3.2. Let X be an n-dimensional random vector with the density function
f(x₁, x₂, …, xₙ). The joint CDF, or simply the CDF, of X is defined as

   F(x₁, x₂, …, xₙ) = ∫_{−∞}^{xₙ} ⋯ ∫_{−∞}^{x₁} f(t₁, …, tₙ) dt₁ ⋯ dtₙ.

As in the one-dimensional case, both the CDF and the density completely specify
the distribution of a continuous random vector, and one can be obtained from the
other. We know how to obtain the CDF from the density; the reverse relation is that
(for almost all (x₁, x₂, …, xₙ))

   f(x₁, x₂, …, xₙ) = ∂ⁿF(x₁, x₂, …, xₙ)/∂x₁ ⋯ ∂xₙ.

Again, the qualification "almost all" is necessary for a rigorous description of the
interrelation between the CDF and the density, but we operate as though the identity
above holds for all (x₁, x₂, …, xₙ).
Analogous to the case of several discrete variables, the marginal densities are
obtained by integrating out (instead of summing) all the other variables. In fact,
all lower-dimensional marginals are obtained that way. The precise statement is the
following.

Proposition. Let X = (X₁, X₂, …, Xₙ) be a continuous random vector with a
joint density f(x₁, x₂, …, xₙ). Let 1 ≤ p < n. Then the marginal joint density of
X₁, X₂, …, X_p is given by

   f_{1,2,…,p}(x₁, x₂, …, x_p) = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f(x₁, x₂, …, xₙ) dx_{p+1} ⋯ dxₙ.

At this stage, it is useful to give a characterization of independence of a set of n
continuous random variables by using the density function.

Proposition. Let X = (X₁, X₂, …, Xₙ) be a continuous random vector with a
joint density f(x₁, x₂, …, xₙ). Then X₁, X₂, …, Xₙ are independent if and only
if the joint density factorizes as

   f(x₁, x₂, …, xₙ) = Π_{i=1}^{n} fᵢ(xᵢ),

where fᵢ(xᵢ) is the marginal density function of Xᵢ.

Proof. If the joint density factorizes as above, then on integrating both sides of
this factorization identity, one gets F(x₁, x₂, …, xₙ) = Π_{i=1}^{n} Fᵢ(xᵢ) for all
(x₁, x₂, …, xₙ), which is the definition of independence.
Conversely, if they are independent, then take the identity

   F(x₁, x₂, …, xₙ) = Π_{i=1}^{n} Fᵢ(xᵢ),

and partially differentiate both sides successively with respect to x₁, x₂, …, xₙ;
it follows that the joint density factorizes as f(x₁, x₂, …, xₙ) = Π_{i=1}^{n} fᵢ(xᵢ). ∎

Let us see some initial examples.

Example 3.1 (Bivariate Uniform). Consider the function

   f(x, y) = 1 if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
           = 0 otherwise.

Clearly, f is always nonnegative, and

   ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = ∫₀¹ ∫₀¹ f(x, y) dx dy = ∫₀¹ ∫₀¹ dx dy = 1.

Therefore, f is a valid bivariate density function. The marginal density of X is

   f₁(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫₀¹ f(x, y) dy = ∫₀¹ dy = 1

if 0 ≤ x ≤ 1, and zero otherwise. Thus, marginally, X ~ U[0, 1], and similarly,
marginally, Y ~ U[0, 1]. Furthermore, clearly, for all x, y the joint density f(x, y)
factorizes as f(x, y) = f₁(x)f₂(y), and so X, Y are independent too. The joint
density f(x, y) of this example is called the bivariate uniform density. It gives the
constant density of one to all points (x, y) in the unit square [0, 1] × [0, 1] and zero
density outside the unit square. The bivariate uniform, therefore, is the same as
putting two independent U[0, 1] variables together as a bivariate vector.

Example 3.2 (Uniform in a Triangle). Consider the function

   f(x, y) = c if x, y ≥ 0, x + y ≤ 1,
           = 0 otherwise.

The set of points x, y ≥ 0, x + y ≤ 1 forms a triangle in the plane with vertices at
(0, 0), (1, 0), and (0, 1); thus, it is just half the unit square; see Fig. 3.1.

   [Fig. 3.1. Uniform density on a triangle; the density equals c = 2 on this set.]

The normalizing constant c is easily evaluated:

   1 = ∫∫_{x,y≥0, x+y≤1} c dx dy = ∫₀¹ ∫₀^{1−y} c dx dy = c ∫₀¹ (1 − y) dy = c/2
   ⇒ c = 2.

The marginal density of X is

   f₁(x) = ∫₀^{1−x} 2 dy = 2(1 − x),  0 ≤ x ≤ 1.

Similarly, the marginal density of Y is

   f₂(y) = 2(1 − y),  0 ≤ y ≤ 1.

Contrary to the previous example, X, Y are not independent now. There are many
ways to see this. For example,

   P(X > 1/2 | Y > 1/2) = 0.

But P(X > 1/2) = ∫_{1/2}^{1} 2(1 − x) dx = 1/4 ≠ 0. So X, Y cannot be independent. We
can also see that the joint density f(x, y) does not factorize as the product of the
marginal densities, and so X, Y cannot be independent.

Example 3.3. Consider the function $f(x, y) = x e^{-x(1+y)},\ x, y \ge 0$. First, let us verify that it is a valid density function. It is obviously nonnegative. Furthermore,
$$\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y)\, dx\, dy = \int_0^{\infty}\!\int_0^{\infty} x e^{-x(1+y)}\, dx\, dy = \int_0^{\infty} \frac{1}{(1+y)^2}\, dy = \int_1^{\infty} \frac{1}{y^2}\, dy = 1.$$
Hence, $f(x, y)$ is a valid joint density. It is plotted in Fig. 3.2. Next, let us find the marginal densities:

Fig. 3.2 The density $f(x, y) = x e^{-x(1+y)}$

$$f_1(x) = \int_0^{\infty} x e^{-x(1+y)}\, dy = x e^{-x} \int_0^{\infty} e^{-xy}\, dy = x e^{-x} \cdot \frac{1}{x} = e^{-x}, \quad x \ge 0.$$
Therefore, marginally, $X$ is a standard exponential. Next,
$$f_2(y) = \int_0^{\infty} x e^{-x(1+y)}\, dx = \frac{1}{(1+y)^2}, \quad y \ge 0.$$
Clearly, we do not have the factorization identity $f(x, y) = f_1(x) f_2(y)$ for all $x, y$; thus, $X, Y$ are not independent.

Example 3.4 (Nonuniform Joint Density with Uniform Marginals). Let $(X, Y)$ have the joint density function $f(x, y) = c - 2(c - 1)(x + y - 2xy),\ x, y \in [0, 1],\ 0 < c < 2$. This is nonnegative in the unit square, as can be seen by considering the cases $c < 1$, $c = 1$, $c > 1$ separately. Also,
$$\int_0^1\!\int_0^1 f(x, y)\, dx\, dy = c - 2(c - 1)\int_0^1\!\int_0^1 (x + y - 2xy)\, dx\, dy = c - 2(c - 1)\int_0^1\left(\frac{1}{2} + y - y\right) dy = c - (c - 1) = 1.$$

Now, the marginal density of $X$ is
$$f_1(x) = \int_0^1 f(x, y)\, dy = c - 2(c - 1)\left(x + \frac{1}{2} - x\right) = 1.$$
Similarly, the marginal density of $Y$ is also the constant function 1. So each marginal is uniform, although the joint density is not uniform if $c \ne 1$.

Example 3.5 (Using the Density to Calculate a Probability). Suppose $(X, Y)$ has the joint density $f(x, y) = 6xy^2,\ x, y \ge 0,\ x + y \le 1$. Thus, this is yet another density on the triangle with vertices at $(0, 0)$, $(1, 0)$, and $(0, 1)$. We want to find $P(X + Y < \tfrac{1}{2})$. By definition,
$$P\left(X + Y < \frac{1}{2}\right) = \int\!\!\int_{\{(x,y):\, x, y \ge 0,\ x + y < \frac{1}{2}\}} 6xy^2\, dx\, dy = 6\int_0^{1/2}\!\int_0^{\frac{1}{2} - y} x y^2\, dx\, dy$$
$$= 6\int_0^{1/2} y^2\, \frac{\left(\frac{1}{2} - y\right)^2}{2}\, dy = 3\int_0^{1/2} y^2 \left(\frac{1}{2} - y\right)^2 dy = 3 \cdot \frac{1}{960} = \frac{1}{320}.$$
This example gives an elementary illustration of the need to work out the limits of the iterated integrals carefully while using a joint density to calculate the probability of some event. In fact, properly finding the limits of the iterated integrals is the part that requires the greatest care when working with joint densities.
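With the limits handled analytically, the final number is easy to corroborate by naive Monte Carlo over the unit square, because the integrand $6xy^2\,I_{\{x+y<1/2\}}$ vanishes off the triangle. A minimal sketch (Python with NumPy; the code is an illustration, not part of the original computation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000_000
x, y = rng.random(n), rng.random(n)   # uniform points in the unit square
# The average of the integrand over the square equals the double integral,
# since the unit square has area one.
est = np.mean(6 * x * y**2 * (x + y < 0.5))
print(est, 1 / 320)                   # both approximately 0.003125
```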

Example 3.6 (Uniform Distribution in a Circle). Suppose $C$ denotes the unit circle in the plane:
$$C = \{(x, y) : x^2 + y^2 \le 1\}.$$
We pick a point $(X, Y)$ at random from $C$; what that means is that $(X, Y)$ has the density
$$f(x, y) = c \ \text{ if } (x, y) \in C,$$
and is zero otherwise. Because
$$\int\!\!\int_C f(x, y)\, dx\, dy = c\int\!\!\int_C dx\, dy = c \times \text{Area of } C = c\pi = 1,$$

we have that the normalizing constant $c = \frac{1}{\pi}$. Let us find the marginal densities. First,
$$f_1(x) = \int_{\{y:\, x^2 + y^2 \le 1\}} \frac{1}{\pi}\, dy = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\, dy = \frac{2\sqrt{1 - x^2}}{\pi}, \quad -1 \le x \le 1.$$
The joint density $f(x, y)$ is symmetric between $x, y$ (i.e., $f(x, y) = f(y, x)$); thus $Y$ has the same marginal density as $X$, that is,
$$f_2(y) = \frac{2\sqrt{1 - y^2}}{\pi}, \quad -1 \le y \le 1.$$
Because $f(x, y) \ne f_1(x) f_2(y)$, $X, Y$ are not independent. Note that if $X, Y$ have a joint uniform density in the unit square, we find them to be independent; but now, when they have a uniform density in the unit circle, we find them to be not independent. In fact, the following general rule holds.
Suppose a joint density $f(x, y)$ can be written in a form $g(x)h(y),\ (x, y) \in S$, and $f(x, y)$ is zero otherwise. Then $X, Y$ are independent if and only if $S$ is a rectangle (including squares).
Example 3.7 (An Interesting Property of Exponential Variables). Suppose $X, Y$ are independent $\text{Exp}(\lambda)$, $\text{Exp}(\mu)$ variables. We want to find $P(X \le Y)$. A possible application is the following. Suppose you have two televisions at your home, a plasma unit with a mean lifetime of five years, and an ordinary unit with a mean lifetime of ten years. What is the probability that the plasma tv will fail before the ordinary one?
From our general definition of probabilities of events, we need to calculate $\int\!\!\int_{\{x, y > 0,\ x \le y\}} f(x, y)\, dx\, dy$. In general, there need not be an interesting answer for this integral. But here, in the independent exponential case, there is.
Since $X, Y$ are independent, the joint density is $f(x, y) = \frac{1}{\lambda\mu}\, e^{-x/\lambda - y/\mu},\ x, y > 0$. Therefore,
$$P(X \le Y) = \int\!\!\int_{\{x, y > 0,\ x \le y\}} \frac{1}{\lambda\mu}\, e^{-x/\lambda - y/\mu}\, dx\, dy = \int_0^{\infty}\!\int_0^{y} \frac{1}{\lambda\mu}\, e^{-x/\lambda - y/\mu}\, dx\, dy$$
$$= \frac{1}{\mu}\int_0^{\infty} e^{-y/\mu}\left(1 - e^{-y/\lambda}\right) dy = 1 - \frac{1}{\mu}\int_0^{\infty} e^{-y(1/\lambda + 1/\mu)}\, dy$$
$$= 1 - \frac{\frac{1}{\mu}}{\frac{1}{\lambda} + \frac{1}{\mu}} = \frac{\mu}{\lambda + \mu} = \frac{1}{1 + \frac{\lambda}{\mu}}.$$
Thus, the probability that $X$ is less than $Y$ depends in a very simple way on just the quantity $\frac{E(X)}{E(Y)}$.
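The closed form is easily checked by simulation; in the television illustration, $\lambda = 5$ and $\mu = 10$ give $P(X \le Y) = \frac{10}{15} = \frac{2}{3}$. A quick sketch (NumPy assumed; note that NumPy's exponential sampler is parametrized by the mean, matching the convention here):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu = 5.0, 10.0                       # E(X) = 5 (plasma), E(Y) = 10 (ordinary)
x = rng.exponential(lam, size=1_000_000)  # scale parameter = mean lifetime
y = rng.exponential(mu, size=1_000_000)
print(np.mean(x <= y))                    # simulated P(X <= Y), about 0.667
print(mu / (lam + mu))                    # exact: mu/(lam + mu) = 2/3
```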

Example 3.8 (Curse of Dimensionality). A phenomenon that complicates the work of a probabilist in high dimensions (i.e., when dealing with a large number of random variables simultaneously) is that the major portion of the probability in the joint distribution lies away from the central region of the variable space. As a consequence, sample observations taken from the high-dimensional distribution tend to leave the central region sparsely populated. Therefore, it becomes difficult to learn about what the distribution is doing in the central region. This phenomenon has been called the curse of dimensionality.
As an example, consider $n$ independent $U[-1, 1]$ random variables, $X_1, X_2, \ldots, X_n$, and suppose we ask what the probability is that $X = (X_1, X_2, \ldots, X_n)$ lies in the inscribed sphere
$$B_n = \{(x_1, x_2, \ldots, x_n) : x_1^2 + x_2^2 + \cdots + x_n^2 \le 1\}.$$
By definition, the joint density of $X_1, X_2, \ldots, X_n$ is
$$f(x_1, x_2, \ldots, x_n) = c, \quad -1 \le x_i \le 1,\ 1 \le i \le n,$$
where $c = \frac{1}{2^n}$. Also, by definition of probability,
$$P(X \in B_n) = \int_{B_n} c\, dx_1 dx_2 \cdots dx_n = \frac{\text{Vol}(B_n)}{2^n},$$
where $\text{Vol}(B_n)$ is the volume of the $n$-dimensional unit sphere $B_n$, and equals
$$\text{Vol}(B_n) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)}.$$
Thus, finally,
$$P(X \in B_n) = \frac{\pi^{n/2}}{2^n\,\Gamma\left(\frac{n}{2} + 1\right)}.$$
This is a very pretty formula. Let us evaluate this probability for various values of $n$, and examine the effect of increasing the number of dimensions on this probability. Here is a table.

n     P(X ∈ B_n)
2     .785
3     .524
4     .308
5     .164
6     .081
10    .002
12    .0003
15    .00001
18    3.13 × 10⁻⁷

We see that in ten dimensions, there is about a 1 in 500 chance that a uniform random vector will fall in the central inscribed sphere, and in 18 dimensions, the chance is much less than one in a million. Therefore, when you are dealing with a large number of random variables at the same time, you will need a huge amount of sample data to learn about the behavior of their joint distribution in the central region; most of the data will come from the corners! You must have a huge amount of data to have at least some data points in your central region. As stated above, this phenomenon has been termed the curse of dimensionality.
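The entries of the table come straight from the displayed formula; a few lines of Python reproduce them (only the standard library's math module is needed):

```python
from math import gamma, pi

# P(X in B_n) = pi^(n/2) / (2^n * Gamma(n/2 + 1))
for n in (2, 3, 4, 5, 6, 10, 12, 15, 18):
    p = pi ** (n / 2) / (2**n * gamma(n / 2 + 1))
    print(f"n = {n:2d}   P(X in B_n) = {p:.3g}")
```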

3.2 Expectation of Functions

Expectations for multidimensional densities are defined analogously to the one-dimensional case. Here is the definition.

Definition 3.3. Let $(X_1, X_2, \ldots, X_n)$ have a joint density function $f(x_1, x_2, \ldots, x_n)$, and let $g(x_1, x_2, \ldots, x_n)$ be a real-valued function of $x_1, x_2, \ldots, x_n$. We say that the expectation of $g(X_1, X_2, \ldots, X_n)$ exists if
$$\int_{\mathbb{R}^n} |g(x_1, x_2, \ldots, x_n)|\, f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \cdots dx_n < \infty,$$
in which case the expected value of $g(X_1, X_2, \ldots, X_n)$ is defined as
$$E[g(X_1, X_2, \ldots, X_n)] = \int_{\mathbb{R}^n} g(x_1, x_2, \ldots, x_n)\, f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \cdots dx_n.$$

Remark. It is clear from the definition that the expectation of each individual $X_i$ can be evaluated by either interpreting $X_i$ as a function of the full vector $(X_1, X_2, \ldots, X_n)$, or by simply using the marginal density $f_i(x)$ of $X_i$; that is,
$$E(X_i) = \int_{\mathbb{R}^n} x_i\, f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \cdots dx_n = \int_{-\infty}^{\infty} x f_i(x)\, dx.$$
A similar comment applies to any function $h(X_i)$ of just $X_i$ alone. All the properties of expectations that we have previously established, for example, linearity of expectations, continue to hold in the multidimensional case. Thus,
$$E[a\,g(X_1, X_2, \ldots, X_n) + b\,h(X_1, X_2, \ldots, X_n)] = a\,E[g(X_1, X_2, \ldots, X_n)] + b\,E[h(X_1, X_2, \ldots, X_n)].$$
We work out some examples now.

Example 3.9 (Bivariate Uniform). Two numbers $X, Y$ are picked independently at random from $[0, 1]$. What is the expected distance between them? Thus, if $X, Y$ are independent $U[0, 1]$, we want to compute $E(|X - Y|)$, which is
$$E(|X - Y|) = \int_0^1\!\int_0^1 |x - y|\, dx\, dy = \int_0^1\left[\int_0^y (y - x)\, dx + \int_y^1 (x - y)\, dx\right] dy$$
$$= \int_0^1\left[\frac{y^2}{2} + \frac{1 - y^2}{2} - y(1 - y)\right] dy = \int_0^1\left(\frac{1}{2} - y + y^2\right) dy = \frac{1}{2} - \frac{1}{2} + \frac{1}{3} = \frac{1}{3}.$$

Example 3.10 (Independent Exponentials). Suppose $X, Y$ are independently distributed as $\text{Exp}(\lambda)$, $\text{Exp}(\mu)$, respectively. We want to find the expectation of the minimum of $X$ and $Y$. The calculation below requires patience, but is not otherwise difficult.
Denote $W = \min\{X, Y\}$. Then,
$$E(W) = \int_0^{\infty}\!\int_0^{\infty} \min\{x, y\}\, \frac{1}{\lambda\mu}\, e^{-x/\lambda - y/\mu}\, dx\, dy$$
$$= \int_0^{\infty}\!\int_0^{y} x\, \frac{1}{\lambda\mu}\, e^{-x/\lambda - y/\mu}\, dx\, dy + \int_0^{\infty}\!\int_y^{\infty} y\, \frac{1}{\lambda\mu}\, e^{-x/\lambda - y/\mu}\, dx\, dy$$
$$= \int_0^{\infty} \frac{1}{\mu}\, e^{-y/\mu}\left[\frac{1}{\lambda}\int_0^y x\, e^{-x/\lambda}\, dx\right] dy + \int_0^{\infty} \frac{1}{\mu}\, y\, e^{-y/\mu}\left[\frac{1}{\lambda}\int_y^{\infty} e^{-x/\lambda}\, dx\right] dy$$
$$= \int_0^{\infty} \frac{1}{\mu}\, e^{-y/\mu}\left[\lambda - \lambda e^{-y/\lambda} - y e^{-y/\lambda}\right] dy + \int_0^{\infty} \frac{1}{\mu}\, y\, e^{-y/\mu}\, e^{-y/\lambda}\, dy$$
(on integrating the $x$ integral in the first term by parts)
$$= \frac{\lambda^2\mu}{(\lambda + \mu)^2} + \frac{\lambda\mu^2}{(\lambda + \mu)^2}$$
(once again, by integration by parts)
$$= \frac{\lambda\mu}{\lambda + \mu} = \frac{1}{\frac{1}{\lambda} + \frac{1}{\mu}},$$
a very pretty result.

Example 3.11 (Use of Polar Coordinates). Suppose a point $(x, y)$ is picked at random from inside the unit circle. We want to find its expected distance from the center of the circle. Thus, let $(X, Y)$ have the joint density
$$f(x, y) = \frac{1}{\pi}, \quad x^2 + y^2 \le 1,$$
and zero otherwise. We find $E\left[\sqrt{X^2 + Y^2}\right]$. By definition,
$$E\left[\sqrt{X^2 + Y^2}\right] = \frac{1}{\pi}\int\!\!\int_{\{(x,y):\, x^2 + y^2 \le 1\}} \sqrt{x^2 + y^2}\, dx\, dy.$$
It is now very useful to make a transformation by using the polar coordinates
$$x = r\cos\theta, \quad y = r\sin\theta,$$
with $dx\, dy = r\, dr\, d\theta$. Therefore,
$$E\left[\sqrt{X^2 + Y^2}\right] = \frac{1}{\pi}\int_0^1\!\int_{-\pi}^{\pi} r^2\, d\theta\, dr = 2\int_0^1 r^2\, dr = \frac{2}{3}.$$
We later show various calculations finding distributions of functions of many continuous variables where transformation to polar and spherical coordinates often simplifies the integrations involved.
Example 3.12 (A Spherically Symmetric Density). Suppose $(X, Y)$ has a joint density function
$$f(x, y) = \frac{c}{(1 + x^2 + y^2)^{3/2}}, \quad x, y \ge 0,$$
where $c$ is a positive normalizing constant. We prove below that this is a valid joint density and evaluate the normalizing constant $c$. Note that $f(x, y)$ depends on $x, y$ only through $x^2 + y^2$; such a density function is called spherically symmetric, because the density $f(x, y)$ takes the same value at all points on the perimeter of a circle given by $x^2 + y^2 = k$.
To prove that $f$ is a valid density, first note that it is obviously nonnegative. Next, by making a transformation to polar coordinates, $x = r\cos\theta,\ y = r\sin\theta$,
$$\int\!\!\int_{\{x > 0,\ y > 0\}} f(x, y)\, dx\, dy = c\int_0^{\infty}\!\int_0^{\pi/2} \frac{r}{(1 + r^2)^{3/2}}\, d\theta\, dr$$
(here, $0 \le \theta \le \frac{\pi}{2}$, as $x, y$ are both positive)
$$= c\,\frac{\pi}{2}\int_0^{\infty} \frac{r}{(1 + r^2)^{3/2}}\, dr = c\,\frac{\pi}{2} \times 1 = c\,\frac{\pi}{2} \ \Rightarrow\ c = \frac{2}{\pi}.$$
We show that $E(X)$ does not exist. Note that it then follows that $E(Y)$ also does not exist, because $f(x, y) = f(y, x)$ in this example. The expected value of $X$ is, again, by transforming to polar coordinates,
$$E(X) = \frac{2}{\pi}\int_0^{\infty}\!\int_0^{\pi/2} \frac{r^2}{(1 + r^2)^{3/2}}\,\cos\theta\, d\theta\, dr = \frac{2}{\pi}\int_0^{\infty} \frac{r^2}{(1 + r^2)^{3/2}}\, dr = \infty,$$
because the final integrand $\frac{r^2}{(1 + r^2)^{3/2}}$ behaves like the function $\frac{1}{r}$ for large $r$, and $\int_k^{\infty} \frac{1}{r}\, dr$ diverges for any positive $k$.

3.3 Bivariate Normal

The bivariate normal density is one of the most important densities for two jointly distributed continuous random variables, just as the univariate normal density is for one continuous variable. Many correlated random variables across applied and social sciences are approximately distributed as a bivariate normal. A typical example is the joint distribution of two size variables, such as height and weight.

Definition 3.4. The function $f(x, y) = \frac{1}{2\pi}\, e^{-\frac{x^2 + y^2}{2}},\ -\infty < x, y < \infty$, is called the bivariate standard normal density.

Clearly, we see that $f(x, y) = \phi(x)\phi(y)$ for all $x, y$. Therefore, the bivariate standard normal distribution corresponds to a pair of independent standard normal variables $X, Y$. If we make a linear transformation
$$U = \mu_1 + \sigma_1 X,$$
$$V = \mu_2 + \sigma_2\left[\rho X + \sqrt{1 - \rho^2}\, Y\right],$$
then we get the general five-parameter bivariate normal density, with means $\mu_1, \mu_2$, standard deviations $\sigma_1, \sigma_2$, and correlation $\rho_{U,V} = \rho$; here, $-1 < \rho < 1$.

Definition 3.5. The density of the five-parameter bivariate normal distribution is
$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2(1 - \rho^2)}\left[\frac{(x - \mu_1)^2}{\sigma_1^2} + \frac{(y - \mu_2)^2}{\sigma_2^2} - \frac{2\rho(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2}\right]},$$
$-\infty < x, y < \infty$.
If $\mu_1 = \mu_2 = 0,\ \sigma_1 = \sigma_2 = 1$, then the bivariate normal density has just the parameter $\rho$, and it is denoted as $SBVN(\rho)$.
If we sample observations from a general bivariate normal distribution, and plot the data points as points in the plane, then they would roughly plot out to an elliptical shape. The reason for this approximate elliptical shape is that the exponent in the formula for the density function is a quadratic form in the variables. Figure 3.3 is a simulation of 1000 values from a bivariate normal distribution. The roughly elliptical shape is clear. It is also seen in the plot that the center of the point cloud is quite close to the true means of the variables, which were chosen to be $\mu_1 = 4.5,\ \mu_2 = 4$.

Fig. 3.3 Simulation of a bivariate normal with means 4.5, 4; variances 1; correlation .75

From the representation we have given above of the general bivariate normal vector $(U, V)$ in terms of independent standard normals $X, Y$, it follows that
$$E(UV) = \mu_1\mu_2 + \rho\sigma_1\sigma_2 \ \Rightarrow\ \text{Cov}(U, V) = \rho\sigma_1\sigma_2.$$
The symmetric matrix with the variances as diagonal entries and the covariance as the off-diagonal entry is called the variance–covariance matrix, or the dispersion matrix, or sometimes simply the covariance matrix of $(U, V)$. Thus, the covariance matrix of $(U, V)$ is
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$

Fig. 3.4 Bivariate normal densities with zero means, unit variances, and $\rho = 0, .5$

A plot of the $SBVN(\rho)$ density is provided in Fig. 3.4 for $\rho = 0, .5$; the zero correlation case corresponds to independence. We see from the plots that the bivariate density has a unique peak at the mean point $(0, 0)$ and falls off from that point like a mound. The higher the correlation, the more the density concentrates near a plane. In the limiting case, when $\rho = \pm 1$, the density becomes fully concentrated on a plane, and we call it a singular bivariate normal.
When $\rho = 0$, the bivariate normal density does factorize into the product of the two marginal densities. Therefore, if $\rho = 0$, then $U, V$ are actually independent, and so, in that case, $P(U > \mu_1, V > \mu_2) = P(\text{each variable is larger than its mean value}) = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}$. When the parameters are general, one has the following classic formula.

Theorem 3.1 (A Classic Bivariate Normal Formula). Let $(U, V)$ have the five-parameter bivariate normal density with parameters $\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$. Then,
$$P(U > \mu_1, V > \mu_2) = P(U < \mu_1, V < \mu_2) = \frac{1}{4} + \frac{\arcsin\rho}{2\pi}.$$
A derivation of this formula can be seen in Tong (1990).

Example 3.13. Suppose a bivariate normal vector $(U, V)$ has correlation $\rho$. Then, by applying the formula above, whatever $\mu_1, \mu_2$ are,
$$P(U > \mu_1, V > \mu_2) = \frac{1}{4} + \frac{1}{2\pi}\arcsin\frac{1}{2} = \frac{1}{3},$$
when $\rho = \frac{1}{2}$. When $\rho = .75$, the probability increases to .385. In the limit, when $\rho \to 1$, the probability tends to .5. That is, when $\rho \to 1$, all the probability becomes confined to the first and the third quadrants $\{U > \mu_1, V > \mu_2\}$ and $\{U < \mu_1, V < \mu_2\}$, with the probability of each of these two quadrants approaching .5.
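The formula and the representation $V = \rho X + \sqrt{1 - \rho^2}\,Y$ given earlier can be combined into a short numerical check (NumPy assumed; the code is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.5
x = rng.standard_normal(2_000_000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(2_000_000)  # SBVN(rho)
print(np.mean((x > 0) & (y > 0)))             # simulated quadrant probability
print(0.25 + np.arcsin(rho) / (2 * np.pi))    # exact: 1/3 when rho = 1/2
```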
Another important property of a bivariate normal distribution is the following result.

Theorem 3.2. Let $(U, V)$ have a general five-parameter bivariate normal distribution. Then any linear function $aU + bV$ of $(U, V)$ is normally distributed:
$$aU + bV \sim N\left(a\mu_1 + b\mu_2,\ a^2\sigma_1^2 + b^2\sigma_2^2 + 2ab\rho\sigma_1\sigma_2\right).$$
In particular, each of $U, V$ is marginally normally distributed:
$$U \sim N(\mu_1, \sigma_1^2), \quad V \sim N(\mu_2, \sigma_2^2).$$
If $\rho = 0$, then $U, V$ are independent with $N(\mu_1, \sigma_1^2)$, $N(\mu_2, \sigma_2^2)$ marginal distributions.

Proof. First note that $E(aU + bV) = a\mu_1 + b\mu_2$ by linearity of expectations, and $\text{Var}(aU + bV) = a^2\,\text{Var}(U) + b^2\,\text{Var}(V) + 2ab\,\text{Cov}(U, V)$ by the general formula for the variance of a linear combination of two jointly distributed random variables (see Chapter 2). But $\text{Var}(U) = \sigma_1^2$, $\text{Var}(V) = \sigma_2^2$, and $\text{Cov}(U, V) = \rho\sigma_1\sigma_2$. Therefore, $\text{Var}(aU + bV) = a^2\sigma_1^2 + b^2\sigma_2^2 + 2ab\rho\sigma_1\sigma_2$.
Therefore, we only have to prove that $aU + bV$ is normally distributed. For this, we use our representation of $U, V$ in terms of a pair of independent standard normal variables $X, Y$:
$$U = \mu_1 + \sigma_1 X,$$
$$V = \mu_2 + \sigma_2\left[\rho X + \sqrt{1 - \rho^2}\, Y\right].$$

Multiplying the equations by $a, b$ and adding, we get the representation
$$aU + bV = a\mu_1 + b\mu_2 + (a\sigma_1 + b\sigma_2\rho)X + b\sigma_2\sqrt{1 - \rho^2}\, Y.$$
That is, $aU + bV$ can be represented as a linear function $cX + dY + k$ of two independent standard normal variables $X, Y$, and so $aU + bV$ is necessarily normally distributed (see Chapter 1).

In fact, a result stronger than the previous theorem holds. What is true is that any two linear functions of $U, V$ will again be distributed as a bivariate normal. Here is the stronger result.

Theorem 3.3. Let $(U, V)$ have a general five-parameter bivariate normal distribution. Let $Z = aU + bV,\ W = cU + dV$ be two linear functions, such that $ad - bc \ne 0$. Then $(Z, W)$ also has a bivariate normal distribution, with parameters
$$E(Z) = a\mu_1 + b\mu_2; \quad E(W) = c\mu_1 + d\mu_2;$$
$$\text{Var}(Z) = a^2\sigma_1^2 + b^2\sigma_2^2 + 2ab\rho\sigma_1\sigma_2;$$
$$\text{Var}(W) = c^2\sigma_1^2 + d^2\sigma_2^2 + 2cd\rho\sigma_1\sigma_2;$$
$$\rho_{Z,W} = \frac{ac\sigma_1^2 + bd\sigma_2^2 + (ad + bc)\rho\sigma_1\sigma_2}{\sqrt{\text{Var}(Z)\,\text{Var}(W)}}.$$
The proof of this theorem is similar to the proof of the previous theorem, and the details are omitted.
Example 3.14 (Independence of Mean and Variance). Suppose $X_1, X_2$ are two iid $N(\mu, \sigma^2)$ variables. Then, of course, they are also jointly bivariate normal. Define now two linear functions
$$Z = X_1 + X_2, \quad W = X_1 - X_2.$$
Because $(X_1, X_2)$ has a bivariate normal distribution, so does $(Z, W)$. However, plainly,
$$\text{Cov}(Z, W) = \text{Cov}(X_1 + X_2, X_1 - X_2) = \text{Var}(X_1) - \text{Var}(X_2) = 0.$$
Therefore, $Z, W$ must actually be independent. As a consequence, $Z$ and $W^2$ are also independent.
Now note that the sample variance of $X_1, X_2$ is
$$s^2 = \left(X_1 - \frac{X_1 + X_2}{2}\right)^2 + \left(X_2 - \frac{X_1 + X_2}{2}\right)^2 = \frac{(X_1 - X_2)^2}{2} = \frac{W^2}{2}.$$

And, of course, $\bar{X} = \frac{X_1 + X_2}{2} = \frac{Z}{2}$. Therefore, it follows that $\bar{X}$ and $s^2$ are independent.
This is true not just for two observations, but for any number of iid observations from a normal distribution. This is proved after we introduce multivariate normal distributions, and it is also proved in Chapter 18 by using Basu's theorem.

Example 3.15 (Normal Marginals Do Not Guarantee Joint Normal). Although joint bivariate normality of two random variables implies that each variable must be marginally a univariate normal, the converse is in general not true.
Let $Z \sim N(0, 1)$, and let $U$ be a two-valued random variable with the pmf $P(U = \pm 1) = \frac{1}{2}$. Take $U$ and $Z$ to be independent. Define now $X = U|Z|$ and $Y = Z$.
Then each of $X, Y$ has a standard normal distribution. That $X$ has a standard normal distribution is easily seen in many ways, for example, by just evaluating its CDF. Take $x > 0$; then,
$$P(X \le x) = P(X \le x\,|\,U = -1)\cdot\frac{1}{2} + P(X \le x\,|\,U = 1)\cdot\frac{1}{2}$$
$$= 1\cdot\frac{1}{2} + P(|Z| \le x)\cdot\frac{1}{2} = \frac{1}{2} + \frac{1}{2}\,[2\Phi(x) - 1] = \Phi(x).$$
Similarly, also for $x \le 0$, $P(X \le x) = \Phi(x)$.
But jointly, $X, Y$ cannot be bivariate normal, because $X^2 = U^2 Z^2 = Z^2 = Y^2$ with probability one. That is, the joint distribution of $(X, Y)$ lives on just the two lines $y = \pm x$, and so is certainly not bivariate normal.
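The construction is easy to visualize numerically: marginally $X$ is indistinguishable from a standard normal, while the pairs $(X, Y)$ all fall on the two lines $y = \pm x$. A sketch (NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
u = rng.choice([-1.0, 1.0], size=z.size)   # independent random signs
x, y = u * np.abs(z), z

print(stats.kstest(x, "norm").pvalue)      # typically large: X looks N(0,1)
print(np.allclose(np.abs(x), np.abs(y)))   # True: the mass sits on y = +/- x
```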

3.4 Conditional Densities and Expectations

The conditional distribution for continuous random variables is defined analogously to the discrete case, with pmfs replaced by densities. The formal definitions are as follows.

Definition 3.6 (Conditional Density). Let $(X, Y)$ have a joint density $f(x, y)$. The conditional density of $X$ given $Y = y$ is defined as
$$f(x|y) = f(x|Y = y) = \frac{f(x, y)}{f_Y(y)}, \quad \forall y \text{ such that } f_Y(y) > 0.$$

The conditional expectation of $X$ given $Y = y$ is defined as
$$E(X|y) = E(X|Y = y) = \int_{-\infty}^{\infty} x f(x|y)\, dx = \frac{\int_{-\infty}^{\infty} x f(x, y)\, dx}{\int_{-\infty}^{\infty} f(x, y)\, dx}, \quad \forall y \text{ such that } f_Y(y) > 0.$$
For fixed $y$, the conditional expectation $E(X|y) = \mu_X(y)$ is a number. As we vary $y$, we can think of $E(X|y)$ as a function of $y$. The corresponding function of $Y$ is written as $E(X|Y)$ and is a random variable. It is very important to keep this notational distinction in mind.
The conditional density of $Y$ given $X = x$ and the conditional expectation of $Y$ given $X = x$ are defined analogously. That is, for instance,
$$f(y|x) = \frac{f(x, y)}{f_X(x)}, \quad \forall x \text{ such that } f_X(x) > 0.$$
An important relationship connecting the two conditional densities is the following result.

Theorem 3.4 (Bayes' Theorem for Conditional Densities). Let $(X, Y)$ have a joint density $f(x, y)$. Then, for all $x, y$ such that $f_X(x) > 0,\ f_Y(y) > 0$,
$$f(y|x) = \frac{f(x|y)\, f_Y(y)}{f_X(x)}.$$
Proof.
$$\frac{f(x|y)\, f_Y(y)}{f_X(x)} = \frac{\frac{f(x, y)}{f_Y(y)}\, f_Y(y)}{f_X(x)} = \frac{f(x, y)}{f_X(x)} = f(y|x).$$
Thus, we can convert one conditional density to the other one by using Bayes' theorem; note the similarity to Bayes' theorem discussed in Chapter 1.

Definition 3.7 (Conditional Variance). Let $(X, Y)$ have a joint density $f(x, y)$. The conditional variance of $X$ given $Y = y$ is defined as
$$\text{Var}(X|y) = \text{Var}(X|Y = y) = \frac{\int_{-\infty}^{\infty} (x - \mu_X(y))^2 f(x, y)\, dx}{\int_{-\infty}^{\infty} f(x, y)\, dx},$$
$\forall y$ such that $f_Y(y) > 0$, where $\mu_X(y)$ denotes $E(X|y)$.

Remark. All the facts and properties about conditional pmfs and conditional expectations that were presented in the previous chapter for discrete random variables continue to hold verbatim in the continuous case, with densities replacing the pmfs in their statements. In particular, the iterated expectation and variance formulas, and all the rules about conditional expectations and variances in Section 2.3, hold in the continuous case.
An important optimizing property of the conditional expectation is that the best predictor of $Y$ based on $X$ among all possible predictors is the conditional expectation of $Y$ given $X$. Here is the exact result.

Proposition (Best Predictor). Let $(X, Y)$ be jointly distributed random variables (of any kind). Suppose $E(Y^2) < \infty$. Then $E_{X,Y}[(Y - E(Y|X))^2] \le E_{X,Y}[(Y - g(X))^2]$ for any function $g(X)$. Here, the notation $E_{X,Y}$ stands for expectation with respect to the joint distribution of $X, Y$.

Proof. Denote $\mu_Y(x) = E(Y|X = x)$. Then, by the property of the mean of any random variable $U$ that $E(U - E(U))^2 \le E(U - a)^2$ for any $a$, we get that here,
$$E[(Y - \mu_Y(x))^2\,|\,X = x] \le E[(Y - g(x))^2\,|\,X = x],$$
for any $x$. Inasmuch as this inequality holds for any $x$, it also holds on taking an expectation:
$$E_X\left[E\left((Y - \mu_Y(x))^2\,|\,X = x\right)\right] \le E_X\left[E\left((Y - g(x))^2\,|\,X = x\right)\right]$$
$$\Rightarrow\ E_{X,Y}\left[(Y - \mu_Y(X))^2\right] \le E_{X,Y}\left[(Y - g(X))^2\right],$$
where the final line is a consequence of the iterated expectation formula (see Chapter 2).

We now show a number of examples.

3.4.1 Examples on Conditional Densities and Expectations

Example 3.16 (Uniform in a Triangle). Consider the joint density
$$f(x, y) = 2, \quad \text{if } x, y \ge 0,\ x + y \le 1.$$
By using the results derived in Example 3.2,
$$f(x|y) = \frac{f(x, y)}{f_Y(y)} = \frac{2}{2(1 - y)} = \frac{1}{1 - y},$$

if $0 \le x \le 1 - y$, and is zero otherwise. Thus, we have the interesting conclusion that given $Y = y$, $X$ is distributed uniformly in $[0, 1 - y]$. Consequently,
$$E(X|y) = \frac{1 - y}{2}, \quad \forall y,\ 0 < y < 1.$$
Also, the conditional variance of $X$ given $Y = y$ is, by the general variance formula for uniform distributions,
$$\text{Var}(X|y) = \frac{(1 - y)^2}{12}.$$

Example 3.17 (Uniform Distribution in a Circle). Let $(X, Y)$ have a uniform density in the unit circle, $f(x, y) = \frac{1}{\pi},\ x^2 + y^2 \le 1$. We find the conditional expectation of $X$ given $Y = y$. First, the conditional density is
$$f(x|y) = \frac{f(x, y)}{f_Y(y)} = \frac{\frac{1}{\pi}}{\frac{2\sqrt{1 - y^2}}{\pi}} = \frac{1}{2\sqrt{1 - y^2}}, \quad -\sqrt{1 - y^2} \le x \le \sqrt{1 - y^2}.$$
Thus, we have the interesting result that the conditional density of $X$ given $Y = y$ is uniform on $\left[-\sqrt{1 - y^2}, \sqrt{1 - y^2}\right]$. It being an interval symmetric about zero, we have in addition the result that for any $y$, $E(X|Y = y) = 0$.
Let us now find the conditional variance. The conditional distribution of $X$ given $Y = y$ is uniform on $\left[-\sqrt{1 - y^2}, \sqrt{1 - y^2}\right]$; therefore, by the general variance formula for uniform distributions,
$$\text{Var}(X|y) = \frac{\left(2\sqrt{1 - y^2}\right)^2}{12} = \frac{1 - y^2}{3}.$$
Thus, the conditional variance decreases as $y$ moves away from zero, which makes sense intuitively, because as $y$ moves away from zero, the line segment in which $x$ varies becomes smaller.

Example 3.18 (A Two-Stage Experiment). Suppose $X$ is a positive random variable with density $f(x)$, and given $X = x$, a number $Y$ is chosen at random between 0 and $x$. Suppose, however, that you are only told the value of $Y$, and the $x$ value is kept hidden from you. What is your guess for $x$?
The formulation of the problem is:
$$X \sim f(x); \quad Y|X = x \sim U[0, x]; \quad \text{we want to find } E(X|Y = y).$$

To find $E(X|Y = y)$, our first task would be to find $f(x|y)$, the conditional density of $X$ given $Y = y$. This is, by its definition,
$$f(x|y) = \frac{f(x, y)}{f_Y(y)} = \frac{f(y|x)\, f(x)}{f_Y(y)} = \frac{\frac{1}{x}\, I_{\{x \ge y\}}\, f(x)}{\int_y^{\infty} \frac{1}{x} f(x)\, dx}.$$
Therefore,
$$E(X|Y = y) = \int_y^{\infty} x f(x|y)\, dx = \frac{\int_y^{\infty} x\,\frac{1}{x} f(x)\, dx}{\int_y^{\infty} \frac{1}{x} f(x)\, dx} = \frac{1 - F(y)}{\int_y^{\infty} \frac{1}{x} f(x)\, dx},$$
where $F$ denotes the CDF of $X$.
Suppose now, in particular, that $f(x)$ is the $U[0, 1]$ density. Then, by plugging into this general formula,
$$E(X|Y = y) = \frac{1 - F(y)}{\int_y^1 \frac{1}{x} f(x)\, dx} = \frac{1 - y}{-\log y}, \quad 0 < y < 1.$$
The important thing to note is that although $X$ has marginally a uniform density and expectation $\frac{1}{2}$, given $Y = y$, $X$ is not uniformly distributed, and $E(X|Y = y)$ is not $\frac{1}{2}$. Indeed, as Fig. 3.5 shows, $E(X|Y = y)$ is an increasing function of $y$, increasing from zero at $y = 0$ to one at $y = 1$.

Fig. 3.5 Plot of $E(X|Y = y)$ when $X \sim U[0, 1]$, $Y|X = x \sim U[0, x]$
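The formula $E(X|Y = y) = \frac{1 - y}{-\log y}$ can be corroborated by conditioning on a thin window around a chosen $y$; the window width is an approximation device of ours, not part of the exact calculation (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000_000
x = rng.random(n)              # X ~ U[0, 1]
y = x * rng.random(n)          # Y | X = x ~ U[0, x]

y0, eps = 0.3, 0.005           # condition on Y falling near y0
sel = np.abs(y - y0) < eps
print(x[sel].mean())                 # simulated E(X | Y ~ y0)
print((1 - y0) / (-np.log(y0)))      # exact: about 0.5814
```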



Example 3.19 ($E(X|Y = y)$ Exists for Any $y$, but $E(X)$ Does Not). Consider the setup of the preceding example once again (i.e., $X \sim f(x)$, and given $X = x$, $Y \sim U[0, x]$). Suppose $f(x) = \frac{1}{x^2},\ x \ge 1$. Then the marginal expectation $E(X)$ does not exist, because $\int_1^{\infty} x\,\frac{1}{x^2}\, dx = \int_1^{\infty} \frac{1}{x}\, dx$ diverges.
However, from the general formula in the preceding example,
$$E(X|Y = y) = \frac{1 - F(y)}{\int_y^{\infty} \frac{f(x)}{x}\, dx} = \frac{\frac{1}{y}}{\frac{1}{2y^2}} = 2y,$$
and thus $E(X|Y = y)$ exists for every $y$.

Example 3.20 (Using Conditioning to Evaluate Probabilities). We described the iterated expectation technique in the last chapter to calculate expectations. It turns out that it is in fact also a really useful way to calculate probabilities. Because the probability of any event $A$ is also the expectation of $X = I_A$, by the iterated expectation technique, we can calculate $P(A)$ as
$$P(A) = E(I_A) = E(X) = E_Y[E(X|Y = y)] = E_Y[P(A|Y = y)],$$
by using a conditioning variable $Y$ judiciously. The choice of the conditioning variable $Y$ is usually clear from the particular context. Here is an example.
Let $X, Y$ be independent $U[0, 1]$ random variables. Then $Z = XY$ also takes values in $[0, 1]$, and suppose we want to find an expression for $P(Z \le z)$. We can do this by using the iterated expectation technique:
$$P(XY \le z) = E[I_{XY \le z}] = E_Y[E(I_{XY \le z}\,|\,Y = y)] = E_Y\left[E\left(I_{X \le \frac{z}{y}}\,\Big|\,Y = y\right)\right] = E_Y\left[E\left(I_{X \le \frac{z}{y}}\right)\right]$$
(because $X$ and $Y$ are independent)
$$= E_Y\left[P\left(X \le \frac{z}{y}\right)\right].$$
Now, note that $P\left(X \le \frac{z}{y}\right)$ is $\frac{z}{y}$ if $\frac{z}{y} \le 1$, i.e., $y \ge z$, and $P\left(X \le \frac{z}{y}\right) = 1$ if $y < z$. Therefore,
$$E_Y\left[P\left(X \le \frac{z}{y}\right)\right] = \int_0^z 1\, dy + \int_z^1 \frac{z}{y}\, dy = z - z\log z,$$
$0 < z \le 1$. So, the final answer to our problem is $P(XY \le z) = z - z\log z,\ 0 < z \le 1$.
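The answer $z - z\log z$ is again easy to check directly (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(1_000_000), rng.random(1_000_000)
for z in (0.1, 0.5, 0.9):
    print(z, np.mean(x * y <= z), z - z * np.log(z))  # simulated vs. exact
```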

Example 3.21 (Power of the Iterated Expectation Formula). Let $X, Y, Z$ be three independent $U[0, 1]$ random variables. We find the probability that $X^2 \ge YZ$ by once again using the iterated expectation formula.
Towards this,
$$P(X^2 \ge YZ) = 1 - P(X^2 < YZ) = 1 - E[I_{X^2 < YZ}] = 1 - E_{Y,Z}[E(I_{X^2 < YZ}\,|\,Y = y, Z = z)]$$
$$= 1 - E_{Y,Z}[E(I_{X^2 < yz}\,|\,Y = y, Z = z)] = 1 - E_{Y,Z}[E(I_{X^2 < yz})]$$
(because $X, Y, Z$ are independent)
$$= 1 - E_{Y,Z}\left[P(X^2 < yz)\right] = 1 - E_{Y,Z}\left[\sqrt{yz}\right] = 1 - E_Y\left[\sqrt{Y}\right] E_Z\left[\sqrt{Z}\right] = 1 - \left(\frac{2}{3}\right)^2 = \frac{5}{9}.$$
Once again, we see the power of identifying probabilities as expectations of indicator variables and the power of using the iterated expectation formula.

Example 3.22 (Conditional Density Given the Sum). Suppose $X, Y$ are two independent $\text{Exp}(1)$ variables. What is the conditional density of $X$ given that $X + Y = t$? Denote $X + Y = T$. Then we know from Chapter 1 that $T \sim G(2, 1)$. Also, by definition of probabilities for jointly continuous random variables, denoting the joint density of $(X, Y)$ as $f(x, y)$,
$$P(X \le x, T \le t) = \int\!\!\int_{\{u \le x,\ u + v \le t\}} f(u, v)\, du\, dv = \int\!\!\int_{\{0 < u \le x,\ 0 < u + v \le t\}} e^{-u - v}\, du\, dv$$
$$= \int_0^x e^{-u}\left[\int_0^{t - u} e^{-v}\, dv\right] du = \int_0^x e^{-u}\left(1 - e^{u - t}\right) du = \int_0^x e^{-u}\, du - \int_0^x e^{-t}\, du = 1 - e^{-x} - x e^{-t},$$
for $x > 0,\ t > x$.
Therefore, the joint density of $X$ and $T$ is
$$f_{X,T}(x, t) = \frac{\partial^2}{\partial x\,\partial t}\left[1 - e^{-x} - x e^{-t}\right] = e^{-t}, \quad 0 < x < t < \infty.$$

Now, therefore, from the definition of conditional densities,
$$f(x|t) = \frac{f_{X,T}(x, t)}{f_T(t)} = \frac{e^{-t}}{t e^{-t}} = \frac{1}{t}, \quad 0 < x < t.$$
That is, given that the sum $X + Y = t$, $X$ is distributed uniformly on $[0, t]$. In particular,
$$E(X|X + Y = t) = \frac{t}{2}, \quad \text{Var}(X|X + Y = t) = \frac{t^2}{12}.$$
To complete the example, we mention a quick trick to compute the conditional expectation. Note that by symmetry,
$$E(X|X + Y = t) = E(Y|X + Y = t)$$
$$\Rightarrow\ t = E(X + Y|X + Y = t) = 2E(X|X + Y = t)$$
$$\Rightarrow\ E(X|X + Y = t) = \frac{t}{2}.$$
So, if we wanted just the conditional expectation, then the conditional density calculation would not be necessary in this case. This sort of symmetry argument is often very useful in reducing algebraic calculations. But one needs to be absolutely sure that the symmetry argument will be valid in a given problem.
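A window-conditioning simulation, in the same spirit as before, supports the conclusion that $X$ given $X + Y = t$ is uniform on $[0, t]$ (NumPy assumed; the window width is our approximation device):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 5_000_000)
y = rng.exponential(1.0, 5_000_000)

t, eps = 2.0, 0.01
sel = np.abs(x + y - t) < eps        # condition on X + Y falling near t
print(x[sel].mean(), t / 2)          # ~ 1, the mean of U[0, t]
print(x[sel].var(), t**2 / 12)       # ~ 1/3, the variance of U[0, t]
```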

3.5 Posterior Densities, Likelihood Functions, and Bayes Estimates

In Bayesian statistics, parameters of distributions, being unknown, are formally assigned a probability distribution, called a prior distribution. For example, if $X \sim \text{Bin}(n, p)$, then the binomial distribution is interpreted to be the conditional distribution of $X$ given that another random variable taking values in $[0, 1]$ is equal to $p$. This other variable $Y$ is assigned a density $g(p)$, which reflects the statistician's a priori belief about the value of that success probability. One then uses Bayes' theorem to find a conditional density for the parameter given the observed value, $x$, of $X$. This density is called the posterior density of the parameter. The posterior density combines the a priori information with the information coming from the data value $x$ to form a final density for the parameter. One then uses this posterior density to make statements about the parameter, for example, that the posterior probability that the parameter is $> .6$ is $< .25$. One can use the mean of the posterior density as an estimate for the true value of the unknown parameter, and so on. Some examples of this Bayesian approach are worked out below. But first we formally define a posterior density.

Definition 3.8. Suppose for some fixed $n \ge 1$, $(X_1, \ldots, X_n)$ have the joint density, or the joint pmf, $f(x_1, \ldots, x_n|\theta)$, where $\theta$ is a real-valued parameter, taking values in an interval $(a, b)$, where $a, b$ may be $\pm\infty$. Formally, consider $\theta$ itself to be a random variable, and suppose $\theta$ has a density $g(\theta)$ on $(a, b)$. Then the conditional density of $\theta$ given $X_1 = x_1, \ldots, X_n = x_n$ is called the posterior density of $\theta$, and is given by
$$f(\theta|x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n|\theta)\, g(\theta)}{\int_a^b f(x_1, \ldots, x_n|\theta)\, g(\theta)\, d\theta}.$$
The function $l(\theta) = f(x_1, \ldots, x_n|\theta)$ is called the likelihood function, the function $g(\theta)$ is called the prior density, and the conditional expectation of $\theta$ given $X_1 = x_1, \ldots, X_n = x_n$, namely $E(\theta|X_1 = x_1, \ldots, X_n = x_n)$, if it exists, is called the posterior mean of $\theta$.

Remark. Note that in the expression for the posterior density, only the numerator depends on $\theta$. The denominator depends only on $x_1, \ldots, x_n$, because in the denominator $\theta$ is being completely integrated out. So we should think of the denominator in the expression for the posterior density as merely a normalizing constant.
Note also that if $(a, b)$ is a bounded interval, and we take $g$ to be the uniform density on $(a, b)$, then, apart from the normalizing constant in the denominator, the posterior density of $\theta$ is exactly the same as the likelihood function.

Example 3.23 (Posterior Density for Exponential Mean). Suppose we have a single observation $X \sim \text{Exp}(\lambda)$, and that $\lambda$ has the marginal density $g(\lambda) = 2\lambda,\ 0 < \lambda < 1$. Then, by Bayes' theorem,
$$f(\lambda|x) = \frac{f(x|\lambda)\, g(\lambda)}{f_X(x)} = \frac{f(x|\lambda)\, g(\lambda)}{\int_0^1 f(x|\lambda)\, g(\lambda)\, d\lambda} = \frac{\frac{1}{\lambda}\, e^{-\frac{x}{\lambda}} \cdot 2\lambda}{\int_0^1 \frac{1}{\lambda}\, e^{-\frac{x}{\lambda}} \cdot 2\lambda\, d\lambda} = \frac{e^{-\frac{x}{\lambda}}}{k(x)},$$
where $k(x)$ denotes the integral $\int_0^1 e^{-\frac{x}{\lambda}}\, d\lambda$, which exists but does not have a simple final formula. Thus, finally, the posterior density of $\lambda$, given the data value $X = x$, is
$$f(\lambda|x) = \frac{e^{-\frac{x}{\lambda}}}{k(x)}, \quad 0 < \lambda < 1.$$

Fig. 3.6 Prior and posterior density in the exponential example

We give a plot of the prior density for $\lambda$, along with the posterior density for $\lambda$, in Fig. 3.6. A comparison of the two density plots explains the effect of the data value $X = x$ on updating the prior density to the posterior density. We see from the plots that the data value ($x = 3$) makes larger values of $\lambda$ more likely under the posterior than they were under the prior.

Example 3.24 (Posterior Mean for Binomial $p$). Suppose $X \sim \text{Bin}(n, p)$, where the probability of success $p$ is treated as an unknown parameter. For example, you may take a sample of $n$ people independently from a population and count how many are vegetarians. Then $p$ will correspond to the fraction of vegetarians in the entire population, and it seems likely that you cannot really know what that proportion is in the entire population.
In the Bayesian approach, you have to assign the parameter $p$ a distribution. For simplicity of calculations, suppose we give $p$ the $U[0, 1]$ prior. So the Bayes model is:
$$p \sim U[0, 1]; \quad X|p \sim \text{Bin}(n, p).$$
The posterior density, by definition, is the conditional density of $p$ given $X = x$, $x$ being the actual observed value of $X$. Then, from Bayes' theorem for conditional densities,
$$f(p|x) = \frac{f(x|p)\, g(p)}{\int_0^1 f(x|p)\, g(p)\, dp} = \frac{\binom{n}{x}\, p^x (1 - p)^{n - x}}{\int_0^1 \binom{n}{x}\, p^x (1 - p)^{n - x}\, dp} = \frac{p^x (1 - p)^{n - x}}{\int_0^1 p^x (1 - p)^{n - x}\, dp} = \frac{p^x (1 - p)^{n - x}}{\frac{\Gamma(x + 1)\,\Gamma(n - x + 1)}{\Gamma(n + 2)}}.$$
Here, in the last line, the denominator is obtained by actually doing the integration. But if we did not bother to do the integration in the denominator, and just looked

at the numerator, we would have realized that, apart from an as yet unevaluated constant term that the denominator will contribute, the posterior density is a Beta density with parameters $x + 1$ and $n - x + 1$, respectively. From the formula for the mean of a Beta density (see Chapter 1), we get the additional formula for the conditional expectation
$$E(p|X = x) = \frac{x + 1}{x + 1 + n - x + 1} = \frac{x + 1}{n + 2}.$$
This is called the posterior mean. So, if you believe in your $U[0, 1]$ prior for $p$, then as a Bayesian you may wish to estimate $p$ by $\frac{x + 1}{n + 2}$, the posterior mean. This is different from $\frac{x}{n}$, the more common estimate of $p$. The slight alteration is caused by treating $p$ as a random variable, and by adopting the Bayesian approach.
More generally, if $p$ has a general Beta prior density, $p \sim \text{Be}(\alpha, \beta)$, then the same calculation as above shows that the posterior density is another Beta, and it is the $\text{Be}(x + \alpha, n - x + \beta)$ density.
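The conjugate updating is a one-liner in practice. A sketch using SciPy's Beta distribution (the data values $n = 20$, $x = 13$ are hypothetical, chosen only for illustration):

```python
from scipy import stats

alpha, beta = 1.0, 1.0        # Be(1, 1) prior, i.e., the U[0, 1] prior
n, x = 20, 13                 # hypothetical data: 13 successes in 20 trials
post = stats.beta(alpha + x, beta + n - x)   # posterior Be(x+alpha, n-x+beta)
print(post.mean())            # (x+1)/(n+2) = 14/22 under the uniform prior
print(post.sf(0.6))           # posterior probability that p > 0.6
```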
Example 3.25 (Posterior Density for a Poisson Mean). Suppose we have $n$ iid sample values $X_1, X_2, \ldots, X_n$ from a Poisson distribution with mean $\lambda$. The parameter $\lambda$ is considered unknown. Therefore, in the Bayesian approach, we have to choose a prior distribution for it. Suppose we choose a standard exponential prior for $\lambda$; the Bayes model is:
$$f(x_1, x_2, \ldots, x_n|\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\,\lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda}\,\lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!};$$
and $g(\lambda) = e^{-\lambda},\ \lambda > 0$.
Therefore, by definition of a posterior density,
$$f(\lambda|x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n|\lambda)\, g(\lambda)}{\int_0^{\infty} f(x_1, x_2, \ldots, x_n|\lambda)\, g(\lambda)\, d\lambda} = \frac{e^{-(n+1)\lambda}\,\lambda^{\sum_{i=1}^n x_i}}{\int_0^{\infty} e^{-(n+1)\lambda}\,\lambda^{\sum_{i=1}^n x_i}\, d\lambda}$$
$$= \frac{e^{-(n+1)\lambda}\,\lambda^{\sum_{i=1}^n x_i}}{\frac{\Gamma\left(\sum_{i=1}^n x_i + 1\right)}{(n+1)^{1 + \sum_{i=1}^n x_i}}} = \frac{(n + 1)^{1 + \sum_{i=1}^n x_i}\, e^{-(n+1)\lambda}\,\lambda^{\sum_{i=1}^n x_i}}{\Gamma\left(\sum_{i=1}^n x_i + 1\right)}.$$
Once again, the integration in the denominator did not really need to be done. By simply looking at the numerator, we would recognize this to be a Gamma density with shape parameter $\sum_{i=1}^n x_i + 1$ and scale parameter $\frac{1}{n+1}$. From the general formula for the mean of a Gamma density (see Chapter 1), we get the posterior mean formula $E(\lambda|x_1, x_2, \ldots, x_n) = \frac{\sum_{i=1}^n x_i + 1}{n + 1}$. Once again, it is a slight alteration of the estimate one may have thought of intuitively, namely the estimate $\frac{\sum_{i=1}^n x_i}{n}$.

Example 3.26 (Posterior Density of a Normal Mean). Suppose $X_1, X_2, \ldots, X_n \sim N(\mu, 1)$, and suppose $\mu$ is assigned a standard normal prior density. Then, by definition of the posterior density,
$$f(\mu|x_1, \ldots, x_n) = \frac{e^{-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2}\, e^{-\frac{\mu^2}{2}}}{\int_{-\infty}^{\infty} e^{-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2}\, e^{-\frac{\mu^2}{2}}\, d\mu} = \frac{e^{-\frac{n+1}{2}\mu^2 + \mu\sum_{i=1}^n x_i}}{\int_{-\infty}^{\infty} e^{-\frac{n+1}{2}\mu^2 + \mu\sum_{i=1}^n x_i}\, d\mu} = \frac{e^{-\frac{n+1}{2}\left(\mu - \frac{\sum_{i=1}^n x_i}{n+1}\right)^2}}{\int_{-\infty}^{\infty} e^{-\frac{n+1}{2}\left(\mu - \frac{\sum_{i=1}^n x_i}{n+1}\right)^2}\, d\mu}.$$
Once again, it is not necessary to work out the integral in the denominator, although it is certainly not difficult to do so. All we need to recognize is that the numerator, after doing all the algebra that we did, has reduced to yet another normal density in $\mu$, namely, a normal density with mean $\frac{\sum_{i=1}^n x_i}{n+1}$ and variance $\frac{1}{n+1}$. If we did go through the chores of actually performing the integration in the denominator, we would surely find it to be just the normalizing constant of this normal density. The conclusion is that if $X_1, X_2, \ldots, X_n \sim N(\mu, 1)$, and $\mu$ has a standard normal prior, then the posterior density of $\mu$ is the $N\left(\frac{\sum_{i=1}^n x_i}{n+1}, \frac{1}{n+1}\right)$ density. In particular, the posterior mean is $E(\mu|x_1, \ldots, x_n) = \frac{\sum_{i=1}^n x_i}{n+1}$, and the variance of the posterior distribution is $\frac{1}{n+1}$. Note that the posterior variance does not depend on $x_1, \ldots, x_n$! This rather remarkable fact is entirely specific to the choice of a normal prior density; any normal prior will result in this constant posterior variance property.
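A numerical illustration of the normal-normal update; the true mean below is hypothetical and used only to generate data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.2, 1.0, size=50)   # X_i ~ N(mu, 1), with mu = 1.2 for illustration

n = x.size
post_mean = x.sum() / (n + 1)       # posterior mean under the N(0, 1) prior
post_var = 1.0 / (n + 1)            # posterior variance: free of the data
print(post_mean, post_var)
```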

Remark. In each of the last three examples, we obtained something interesting. In the binomial example, the prior density used was $U[0, 1]$, which is a special case of a Beta, and we found the posterior density to be another Beta. In the Poisson example, the prior density used was the standard exponential, which is a special case of a Gamma, and we found the posterior density to be another Gamma. And, in the normal example, the prior density used was the standard normal, and we found the posterior density to be another normal. In other words, the functional forms of the prior and the posterior density are the same in each of these three examples. In updating the prior to the posterior, we only updated the parameters of the prior, but not the functional form. This happens more generally, but only for special types of priors, and the special type depends on the specific problem. This is considered to be of such great convenience that priors which satisfy this neat updating property have been given a name; they are called conjugate priors. Here is a formal definition.

Definition 3.9. Let $X \sim f(x|\theta)$, and suppose $\theta \sim g(\theta)$, where $g$ belongs to some family of densities $\mathcal{G}$. The family $\mathcal{G}$ is called a family of conjugate priors for the model $f$ if the posterior density $f(\theta|x)$ is also a member of $\mathcal{G}$ for any $x$.
The Beta family is a conjugate family for the binomial case. The Gamma family is conjugate for the Poisson case. Normal distributions on $\mu$ form a conjugate family for the mean $\mu$ in the normal case. Conjugate families are not unique, and in each new problem, one has to find a convenient one by inspection.

3.6 Maximum Likelihood Estimates

The posterior density combines the likelihood function $l(\theta)$ and the prior density $g(\theta)$, with suitable normalization. Fisher's idea was to use just the likelihood function itself as the yardstick for assessing the credibility of each $\theta$ for being the true value of the parameter. If the likelihood function $l(\theta)$ is large at some $\theta$, that $\theta$ value is consistent with the data that were obtained; on the other hand, if the likelihood function $l(\theta)$ is small at some $\theta$, that $\theta$ value is inconsistent with the data that were obtained. Fisher suggested maximizing the likelihood function over all possible values of $\theta$, and using the maximizer as an estimate of $\theta$. This is the celebrated maximum likelihood estimate.
Many think that maximum likelihood is the greatest conceptual invention in the history of statistics. Although in some high- or infinite-dimensional problems computation and performance of maximum likelihood estimates are less than desirable, or even poor, in a vast majority of models in practical use, MLEs are about the best that one can do. They have many asymptotic optimality properties that translate into fine performance in finite samples. We give a few illustrative examples, after defining an MLE.

Definition 3.10. Suppose, given a parameter $\theta$, $X^{(n)} = (X_1, \ldots, X_n)$ have a joint pdf or joint pmf $f(x_1, \ldots, x_n|\theta),\ \theta \in \Theta$. Any value $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$ at which the likelihood function $l(\theta) = f(x_1, \ldots, x_n|\theta)$ is maximized is called a maximum likelihood estimate (MLE) of $\theta$, provided $\hat{\theta} \in \Theta$, and $l(\hat{\theta}) < \infty$.
It is important to understand that an MLE need not exist, or be unique. But in many examples, it exists and is unique for any dataset $X^{(n)}$. In maximizing the likelihood function over $\theta$, any pure constant terms not involving $\theta$ may be ignored. Also, in many standard models, it is more convenient to maximize $L(\theta) = \log l(\theta)$; it simplifies the algebra without affecting the correctness of the final answer.

Example 3.27 (MLE of Binomial $p$). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} \text{Ber}(p),\ 0 < p < 1$. Then, writing $X = \sum_{i=1}^n X_i$ (the total number of successes in these $n$ trials),
$$l(p) = p^X (1 - p)^{n - X} \ \Rightarrow\ L(p) = \log l(p) = X\log p + (n - X)\log(1 - p).$$
For $0 < X < n$, $L(p)$ has a unique stationary point, namely a point at which the first derivative $L'(p) = 0$. This point is $p = \frac{X}{n}$. Furthermore, it is easily verified that

L00 .p/ < 0 for all p 2 .0; 1/; that is, L.p/ is strictly concave. So, for 0 < X < n,
there is a unique MLE of p, and it is just the common sense estimate Xn , the sample
proportion of successes. If X D 0 or n, the likelihood function is maximized at a
boundary value p D 0 or 1. In those two cases, an MLE of p does not exist.
iid
Example 3.28 (Mean of an Exponential). Let X1 ; : : : ; Xn Exp. /; > 0. Then,
Pn
1X
1 n
e  i D1 Xi
l. / D n
) L. / D  Xi  n log :
i D1
Pn
X
L. / has a unique stationary point, it being D i D1 n
i
D X . Furthermore,
L00 . / < 0 at D X . Also note that l. / ! 0 as ! 0 or 1. These three facts
together imply that for all possible datasets X1 ; : : : ; Xn , there is a unique MLE of
, and it is the sample mean X .
iid
Example 3.29 (MLE of Normal Mean and Variance). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2),\ -\infty < \mu < \infty,\ \sigma^2 > 0$. This is a two-parameter example. The likelihood function is
$$l(\mu, \sigma^2) = \frac{e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}}{(2\pi\sigma^2)^{n/2}}.$$
Maximizing a function of two variables by calculus methods has to be done carefully, because the second derivative tests are subtle and must be carefully applied. We instead obtain the MLEs of $\mu$ and $\sigma^2$ directly, as follows. The argument uses a sequence of simple inequalities. As usual, let $\bar{X} = \frac{\sum_{i=1}^n X_i}{n}$, and also let $s_0^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$; note that this is different from the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$. The argument below shows that the unique MLEs of $\mu, \sigma^2$ are $\bar{X}$ and $s_0^2$. This follows from the straightforward inequalities
$$l(\mu, \sigma^2) \le l(\bar{X}, \sigma^2) \le l(\bar{X}, s_0^2) = \frac{e^{-n/2}}{(2\pi s_0^2)^{n/2}} < \infty,$$
and therefore $(\bar{X}, s_0^2)$ is the unique global maximum of $l(\mu, \sigma^2)$.

Example 3.30 (Endpoint of a Uniform). This is an example where the MLE cannot be obtained by finding stationary points of the likelihood function, and is found by examining the shape of the likelihood function. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} U[0, \theta],\ \theta > 0$. Then the individual densities are $f(x_i|\theta) = \frac{1}{\theta}\, I_{\{0 \le x_i \le \theta\}}$. Therefore,
$$l(\theta) = \prod_{i=1}^n \frac{1}{\theta}\, I_{\{0 \le x_i \le \theta\}} = \frac{1}{\theta^n}\, I_{\{\max(x_1, \ldots, x_n) \le \theta\}}\, I_{\{\min(x_1, \ldots, x_n) \ge 0\}} = \frac{1}{\theta^n}\, I_{\{\theta \ge \max(x_1, \ldots, x_n)\}},$$
because under the model, with probability one under any $\theta > 0$, $\min(X_1, \ldots, X_n)$ is greater than zero.

The likelihood function is therefore zero on .0; max.x1 ; : : : ; xn //, and on


Œmax.x1 ; : : : ; xn /; 1/ it is strictly decreasing, with a finite value at the jump point
 D max.x1 ; : : : ; xn /. Therefore, X.n/ D max.X1 ; : : : ; Xn / is the unique MLE of
 for all data values X1 ; : : : ; Xn .
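All four MLEs derived above are trivial to compute from data; a sketch with simulated datasets (NumPy assumed; the parameter values are hypothetical, chosen only to generate data):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.binomial(1, 0.3, size=200)        # Ber(p) data
print(x.mean())                           # MLE of p: the sample proportion X/n

y = rng.exponential(2.5, size=200)        # Exp(lambda) data, mean lambda
print(y.mean())                           # MLE of lambda: the sample mean

w = rng.normal(1.0, 2.0, size=200)        # N(mu, sigma^2) data
print(w.mean(), np.mean((w - w.mean())**2))  # MLEs: X-bar and s_0^2

z = rng.uniform(0, 4.0, size=200)         # U[0, theta] data
print(z.max())                            # MLE of theta: max(X_1, ..., X_n)
```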

3.7 Bivariate Normal Conditional Distributions

Suppose $(X, Y)$ has a joint bivariate normal distribution. A very important property of the bivariate normal is that each conditional distribution, the distribution of $Y$ given $X = x$, and that of $X$ given $Y = y$, is a univariate normal, for any $x$ and any $y$. This really helps in easily computing conditional probabilities involving one variable, when the other variable is held fixed at some specific value.

Theorem 3.5. Let $(X, Y)$ have a bivariate normal distribution with parameters $\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$. Then,
(a) $X|Y = y \sim N\left(\mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}(y - \mu_2),\ \sigma_1^2(1 - \rho^2)\right)$;
(b) $Y|X = x \sim N\left(\mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x - \mu_1),\ \sigma_2^2(1 - \rho^2)\right)$.
In particular, the conditional expectations of $X$ given $Y = y$ and of $Y$ given $X = x$ are linear functions of $y$ and $x$, respectively:
$$E(X|Y = y) = \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}(y - \mu_2);$$
$$E(Y|X = x) = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x - \mu_1),$$
and the variance of each conditional distribution is a constant, and does not depend on the conditioning values $x$ or $y$.
The proof of this theorem involves some tedious integration manipulations, and we omit it; the details of the proof are available in Tong (1990).

Remark. We see here that the conditional expectation is linear in the bivariate normal case. Specifically, take $E(Y|X = x) = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x - \mu_1)$. Previously, we have seen in Chapter 2 that the conditional expectation $E(Y|X)$ is, in general, the best predictor of $Y$ based on $X$. Now we see that the conditional expectation is a linear predictor in the bivariate normal case, and it is the best predictor and therefore also the best linear predictor. In Chapter 2, we called the best linear predictor the regression line of $Y$ on $X$. Putting it all together, we have the very special result that in the bivariate normal case, the regression line of $Y$ on $X$ and the best overall predictor are the same:
For bivariate normal distributions, the conditional expectation of one variable given the other coincides with the regression line of that variable on the other variable.

Example 3.31. Suppose incomes of husbands and wives in a population are bivariate normal with means 75 and 60 (in thousands of dollars), standard deviations 20 each, and a correlation of .75. We want to know in what percentage of those families where the wife earns 80,000 dollars the family income exceeds 175,000 dollars.
Denote the income of the husband and the wife by $X$ and $Y$. Then we want to find $P(X + Y > 175|Y = 80)$. By the above theorem, $X|Y = 80 \sim N(75 + .75(80 - 60),\ 400(1 - .75^2)) = N(90, 175)$. Therefore,
$$P(X + Y > 175|Y = 80) = P(X > 95|Y = 80) = P\left(Z > \frac{95 - 90}{\sqrt{175}}\right) = P(Z > .38) = .3520,$$
where $Z$ denotes a standard normal variable.
Example 3.32 (Galton's Observation: Regression to the Mean). This example is similar to the previous example, but makes an interesting different point. It is often found that students who get a very good grade on the first midterm do not do as well on the second midterm. We can try to explain it by doing a bivariate normal calculation.
Denote the grade on the first midterm by $X$, that on the second midterm by $Y$, and suppose $X, Y$ are jointly bivariate normal with means 70, standard deviations 10, and a correlation of .7. Suppose a student scored 90 on the first midterm. What are the chances that she will get a lower grade on the second midterm?
This is
$$P(Y < X|X = 90) = P(Y < 90|X = 90) = P\left(Z < \frac{90 - 84}{\sqrt{51}}\right) = P(Z < .84) = .7995,$$
where $Z$ is a standard normal variable, and we have used the fact that $Y|X = 90 \sim N(70 + .7(90 - 70),\ 100(1 - .7^2)) = N(84, 51)$.
Thus, with a fairly high probability, the student will not be able to match her first midterm grade on the second midterm. The phenomenon of regression to mediocrity was popularized by Galton, who noticed that the offspring of very tall parents tended to be much closer to being of just about average height, and the extreme tallness in the parents was not commonly passed on to the children.
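The two probability calculations above reduce to evaluating a univariate normal CDF at the conditional mean and variance given by Theorem 3.5; a sketch for the midterm example (SciPy assumed):

```python
import numpy as np
from scipy import stats

mu1 = mu2 = 70.0
s1 = s2 = 10.0
rho = 0.7
x0 = 90.0                                        # observed first-midterm grade
cond_mean = mu2 + rho * (s2 / s1) * (x0 - mu1)   # 84
cond_var = s2**2 * (1 - rho**2)                  # 51
print(stats.norm(cond_mean, np.sqrt(cond_var)).cdf(90.0))  # ~ 0.7995
```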

3.8 * Useful Formulas and Characterizations for Bivariate Normal

A number of extremely elegant characterizations and also some very neat formulas
for useful quantities are available for a general bivariate normal distribution. An-
other practical issue is the numerical computation of bivariate normal probabilities.

Although tables are widely available for the univariate standard normal, for the bivariate normal the corresponding tables are found only in specialized sources, and are sketchy. Thus, a simple and reasonably accurate approximation formula is practically useful. We deal with these issues in this section.
First, we need some notation. For jointly distributed random variables $X, Y$ with means $\mu_1, \mu_2$, standard deviations $\sigma_1, \sigma_2$, and positive integers $r, s$, we denote
$$\mu_{r,s} = \frac{E[(X - \mu_1)^r (Y - \mu_2)^s]}{\sigma_1^r \sigma_2^s}; \qquad \nu_{r,s} = \frac{E[|X - \mu_1|^r\, |Y - \mu_2|^s]}{\sigma_1^r \sigma_2^s}.$$
We then have the following useful formulas for a general bivariate normal distribution.

Theorem 3.6 (Bivariate Normal Formulas).
(a) The joint mgf of the bivariate normal distribution is given by
$$\psi(t_1, t_2) = e^{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\left[t_1^2\sigma_1^2 + t_2^2\sigma_2^2 + 2\rho t_1 t_2\sigma_1\sigma_2\right]};$$
(b) $\mu_{r,s} = \mu_{s,r};\ \nu_{r,s} = \nu_{s,r}$;
(c) $\mu_{r,s} = 0$ if $r + s$ is odd;
(d) $\mu_{1,1} = \rho;\ \mu_{1,3} = 3\rho;$
$\mu_{2,2} = 1 + 2\rho^2;\ \mu_{2,4} = 3(1 + 4\rho^2);$
$\mu_{3,3} = 3\rho(3 + 2\rho^2);\ \mu_{4,4} = 3(3 + 24\rho^2 + 8\rho^4);$
(e) $\nu_{1,1} = \frac{2}{\pi}\left[\sqrt{1 - \rho^2} + \rho\arcsin\rho\right];\ \nu_{3,3} = \frac{2}{\pi}\left[(4 + 11\rho^2)\sqrt{1 - \rho^2} + 3\rho(3 + 2\rho^2)\arcsin\rho\right];$
(f) $E(\max\{X, Y\}) = p\mu_1 + (1 - p)\mu_2 + \delta$, where
$$p = \Phi\left(\frac{\Delta}{\xi}\right), \quad \delta = \xi\,\phi\left(\frac{\Delta}{\xi}\right),$$
and
$$\Delta = \mu_1 - \mu_2; \quad \xi^2 = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2.$$
These formulas are proved in Kamat (1953).


Among the numerous characterizations of the bivariate normal distribution, a
few stand out in their clarity and in being useful. We state these characterizations
below.

Theorem 3.7 (Characterizations). Let $X, Y$ be jointly distributed random variables.
(a) $X, Y$ are jointly bivariate normal if and only if every linear combination $aX + bY$ has a univariate normal distribution.
(b) $X, Y$ are jointly bivariate normal if and only if $X$ is univariate normal, and for all $x$, $Y|X = x \sim N(a + bx, c^2)$, for some $a, b, c$.
(c) $X, Y$ are jointly bivariate normal if and only if for all $x$ and all $y$, $Y|X = x$ and $X|Y = y$ are univariate normals, and in addition, either one of the marginals is a univariate normal, or one of the conditional variance functions is a constant function.
See Kagan et al. (1973) and Patel and Read (1996) for these and other characterizations of the bivariate normal distribution.

3.8.1 Computing Bivariate Normal Probabilities

There are many approximations to the CDF of a general bivariate normal distribution. The most accurate ones are too complex for quick use. The relatively simple approximations are not computationally accurate for all configurations of the arguments and the parameters. Keeping a balance between simplicity and accuracy, we present here two approximations.

Mee–Owen Approximation. Let $(X, Y)$ have the general five-parameter bivariate normal distribution. Then,
$$P(X \le \mu_1 + h\sigma_1,\ Y \le \mu_2 + k\sigma_2) \approx \Phi(h)\,\Phi\left(\frac{k - c}{\xi}\right),$$
where $c = -\rho\,\frac{\phi(h)}{\Phi(h)},\ \xi^2 = 1 + \rho h c - c^2$.

Cox–Wermuth Approximation.
$$P(X \ge \mu_1 + h\sigma_1,\ Y \ge \mu_2 + k\sigma_2) \approx \Phi(-h)\,\Phi\left(\frac{\rho\,\lambda(h) - k}{\sqrt{1 - \rho^2}}\right),$$
where $\lambda(h) = \frac{\phi(h)}{1 - \Phi(h)}$.
See Mee and Owen (1983) and Cox and Wermuth (1991) for the motivation behind these approximations. See Plackett (1954) for reducing the dimension of the integral for computing multivariate normal probabilities. Genz (1993) provides some comparison of the different algorithms and approximations for computing multivariate normal probabilities.

Example 3.33 (Testing the Approximations). We compute the Mee–Owen approximation, the Cox–Wermuth approximation, and the corresponding exact probabilities in two trial cases.

Fig. 3.7 $P(X < 2, Y < 2)$ and the Mee–Owen approximation in the standardized case

Fig. 3.8 $P(X > k, Y > k)$ and the Cox–Wermuth approximation in the standardized case, $\rho = .5$

We can see from Fig. 3.8 that the Cox–Wermuth approximation is nearly exact in the trial case. The Mee–Owen approximation in Fig. 3.7 is reasonable, but not very accurate. It should also be noted that the quantity $\xi^2$ in the Mee–Owen approximation can be negative, in which case the approximation is not usable. Generally, the Mee–Owen approximation is inaccurate or unusable if $h, k, \rho$ are large. The Cox–Wermuth approximation should not be used when $\rho$ is large ($> .75$ or so).
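Both approximations take only a few lines of code. The sketch below follows the formulas as stated above, with the sign convention $c = -\rho\,\phi(h)/\Phi(h)$ for the Mee–Owen term (an assumption read off from the statement); the exact values come from SciPy's multivariate normal CDF, which performs its own numerical integration:

```python
import numpy as np
from scipy import stats

norm = stats.norm

def mee_owen(h, k, rho):
    # Approximates P(X <= h, Y <= k), standardized case.
    c = -rho * norm.pdf(h) / norm.cdf(h)
    xi2 = 1 + rho * h * c - c**2        # may be negative: formula then unusable
    return norm.cdf(h) * norm.cdf((k - c) / np.sqrt(xi2))

def cox_wermuth(h, k, rho):
    # Approximates P(X >= h, Y >= k), standardized case.
    lam = norm.pdf(h) / (1 - norm.cdf(h))
    return norm.cdf(-h) * norm.cdf((rho * lam - k) / np.sqrt(1 - rho**2))

h = k = 2.0
for rho in (0.1, 0.3, 0.5):
    bvn = stats.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
    lower = bvn.cdf([h, k])                        # exact P(X <= 2, Y <= 2)
    upper = 1 - norm.cdf(h) - norm.cdf(k) + lower  # exact P(X >= 2, Y >= 2)
    print(rho, lower, mee_owen(h, k, rho), upper, cox_wermuth(h, k, rho))
```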

3.9 * Conditional Expectation Given a Set and Borel's Paradox

In applications, one is very often interested in finding the expectation of one random
variable given that another random variable belongs to some set, rather than given
that it is exactly equal to some value. For instance, we may want to know what the
average income of husbands is among those families where the wife earns more than
$100,000.

The mathematical formulation of the problem is to find $E(X|Y \in A)$, for some given set $A$. It is not possible to talk rigorously about this without using measure theory. In fact, even defining $E(X|Y \in A)$ can be a problem. We limit ourselves to special types of sets $A$.

Definition 3.11. Let $(X, Y)$ have a joint density $f(x, y)$, with marginal densities $f_X(x), f_Y(y)$. Let $A$ be a subset of the real line such that $P(Y \in A) > 0$. Then
$$E(X|Y \in A) = \frac{\int_{-\infty}^{\infty}\int_A x f(x, y)\, dy\, dx}{\int_{-\infty}^{\infty}\int_A f(x, y)\, dy\, dx}.$$

Remark. When the conditioning event $A$ has probability zero, we can get into paradoxical situations when we try to compute $E(X|Y \in A)$. What happens is that it may be possible to rewrite the conditioning event $Y \in A$ as an equivalent event $V \in B$ for some carefully chosen function $V = V(X, Y)$. Yet, when we compute $E(X|Y \in A)$ and $E(X|V \in B)$, we arrive at different answers! The paradox arises because of subtleties of measure zero sets. It is not possible to describe how one avoids such a paradox without the knowledge of abstract measure theory. We do, however, give an example illustrating this paradox, popularly known as Borel's paradox.

Example 3.34 (Borel's Paradox). Let $(X, Y)$ have the joint density
$$f(x, y) = 1, \quad \text{if } 0 \le x \le 1,\ -x \le y \le 1 - x,$$
and zero otherwise. Then the marginal density of $Y$ is
$$f_Y(y) = \int_{-y}^{1} dx = 1 + y, \quad \text{if } -1 \le y \le 0, \quad \text{and}$$
$$f_Y(y) = \int_0^{1 - y} dx = 1 - y, \quad \text{if } 0 \le y \le 1.$$
Therefore, the conditional density of $X$ given $Y = 0$ is
$$f(x|Y = 0) = 1, \quad 0 \le x \le 1.$$
This is just the uniform density on $[0, 1]$, and so we get $E(X|Y = 0) = .5$.
Now transform $(X, Y)$ by the one-to-one transformation $(X, Y) \to (U, V)$, where $U = X,\ V = \frac{X + Y}{X}$. The Jacobian of the transformation is $J = u$, and hence the joint density of $(U, V)$ is
$$f_{U,V}(u, v) = u, \quad 0 < u < 1,\ 0 < v < \frac{1}{u}.$$

In the transformed variables, $X = U$ and $Y = U(V - 1)$. So $Y = 0 \Leftrightarrow V = 1$. Yet, when we compute the conditional density of $U$ given $V = 1$, we get after a little algebra
$$f(u|V = 1) = 2u, \quad 0 \le u \le 1.$$
So $E(U|V = 1) = \frac{2}{3}$, which, paradoxically, is different from .5!
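The paradox can be made vivid by simulation: conditioning on the thin slab $\{|Y| < \epsilon\}$ and conditioning on the thin slab $\{|V - 1| < \epsilon\}$ select different sets of sample points, even though both slabs shrink to the same null event. The slab construction is our device for the illustration (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000
x = rng.random(n)
y = rng.uniform(-x, 1 - x)          # (X, Y) uniform on {0<=x<=1, -x<=y<=1-x}
v = (x + y) / x

eps = 0.001
print(x[np.abs(y) < eps].mean())        # -> 1/2, conditioning through Y ~ 0
print(x[np.abs(v - 1) < eps].mean())    # -> 2/3, conditioning through V ~ 1
```

The second window $\{|V - 1| < \epsilon\}$ equals $\{|Y| < \epsilon X\}$, which favors large values of $X$; this is the measure-theoretic subtlety in concrete form.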

Example 3.35 (Mean Residual Life). In survival analysis and medicine, a quantity of great interest is the mean residual life. Suppose that a person afflicted with some disease has survived five years. How much longer can the patient be expected to survive? Thus, suppose $X$ is a continuous random variable with density $f(x)$. We want to find $E(X - c\,|\,X \ge c)$. Assuming that $P(X \ge c) > 0$,
$$E(X - c\,|\,X \ge c) = E(X\,|\,X \ge c) - c = \frac{\int_c^{\infty} x f(x)\, dx}{1 - F(c)} - c.$$
As one specific example, let $X \sim \text{Exp}(\lambda)$. Then, from this formula,
$$E(X - c\,|\,X \ge c) = \frac{\int_c^{\infty} x\,\frac{1}{\lambda}\, e^{-x/\lambda}\, dx}{e^{-c/\lambda}} - c = \frac{(c + \lambda)\, e^{-c/\lambda}}{e^{-c/\lambda}} - c = \lambda,$$
which is independent of $c$. We recognize that this is just the lack of memory property of an exponential distribution manifesting itself in the mean residual life calculation.
In contrast, suppose $X \sim N(0, 1)$ (of course, in reality a survival time $X$ cannot have mean zero!). Then,
$$E(X - c\,|\,X \ge c) = E(X\,|\,X \ge c) - c = \frac{\int_c^{\infty} x\phi(x)\, dx}{1 - \Phi(c)} - c = \frac{\phi(c)}{1 - \Phi(c)} - c = \frac{1}{R(c)} - c,$$
where $R(c) = \frac{1 - \Phi(c)}{\phi(c)}$ is the Mills ratio. The calculation shows that the Mills ratio arises very naturally in a calculation of interest in survival analysis.
Note that now the mean residual life is no longer independent of $c$. Take $c$ to be positive. From Laplace's expansion for the Mills ratio (see Chapter 1),
$$\frac{1}{R(c)} - c \approx \frac{c^3}{c^2 - 1} - c = \frac{c}{c^2 - 1} \approx \frac{1}{c}.$$
That is, the mean residual life is approximately equal to $\frac{1}{c}$, which is a decreasing function of $c$. So, unlike in the exponential case, if survival time is normal, then a patient who has survived a long time is increasingly more unlikely to survive too much longer.
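The contrast between the exponential and the normal case is easy to corroborate by simulation. Below is a minimal Python sketch (our illustration, not part of the original text; the sample size, seed, and choice $\lambda = 2$ are arbitrary) that estimates $E(X - c \mid X \ge c)$ empirically.

import numpy as np

rng = np.random.default_rng(7)
n = 10**6

def mean_residual_life(sample, c):
    # Monte Carlo estimate of E(X - c | X >= c)
    survivors = sample[sample >= c]
    return (survivors - c).mean()

expo = rng.exponential(scale=2.0, size=n)   # Exp(lambda) with lambda = 2
norm = rng.standard_normal(n)

for c in [0.5, 1.0, 2.0]:
    print(c, mean_residual_life(expo, c),   # stays near lambda = 2
          mean_residual_life(norm, c))      # decreasing in c, roughly 1/c for large c

For the exponential sample all three estimates hover around $\lambda = 2$, while for the normal sample they decrease with $c$, in agreement with the two calculations above.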

Exercises

Exercise 3.1. Suppose .X; Y / have the joint density f .x; y/ D cxy; x; y 2 Œ0; 1.
(a) Find the normalizing constant c.
(b) Are X; Y independent?
(c) Find the marginal densities and expectations of X; Y .
(d) Find the expectation of X Y .

Exercise 3.2. Suppose .X; Y / have the joint density f .x; y/ D cxy; x; y  0I x C
y 1.
(a) Find the normalizing constant c.
(b) Are X; Y independent?
(c) Find the marginal densities and expectations of X; Y .
(d) Find the expectation of X Y .

Exercise 3.3. Suppose .X; Y / have the joint density f .x; y/ D ce y ; 0 x


y < 1.
(a) Find the normalizing constant c.
(b) Are X; Y independent?
(c) Find the marginal densities and expectations of X; Y .
(d) Find the conditional expectation of X given Y D y.
(e) Find the conditional expectation of Y given X D x.
(f) Find the correlation between X and Y .

Exercise 3.4 (Uniform in a Triangle). Suppose X; Y are uniformly distributed in


the triangle bounded by 1 x 1; y  0, and the two lines y D 1 C x and
y D 1  x.
(a) Find P .X  :5/.
(b) Find P .Y  :5/.
(c) Find the marginal densities and expectations of X; Y .

Exercise 3.5 (Uniform Distribution in a Sphere). Suppose .X; Y; Z/ has the den-
sity f .x; y; z/ D c; if x 2 C y 2 C z2 1:
(a) Find the constant c.
(b) Are any of X; Y , or Y; Z or X; Z pairwise independent?
(c) Find the marginal densities and expectations of X; Y; Z.
(d) Find the conditional expectation of $X$ given $Y = y$, and the conditional expectation of $X$ given $Z = z$.
(e) Find the conditional expectation of $X$ given (both) $Y = y, Z = z$.
(f) Find the correlation between any pair, say $X$ and $Y$.

Exercise 3.6. Suppose $X, Y$ are independent $U[0, 1]$ variables. Find the conditional expectation $E(|X - Y| \mid Y = y)$.

Exercise 3.7 (Uniform in a Triangle). Suppose $X, Y$ are uniformly distributed in the triangle $x, y \ge 0, \; x + y \le 1$. Find the conditional expectation $E(|X - Y| \mid Y = y)$.

Exercise 3.8. Suppose $X, Y, Z$ are independent $U[0, 1]$ variables. Find $P(|X - Y| > |Y - Z|)$.
Exercise 3.9 * (Iterated Expectation). Suppose $X, Y$ are independent standard exponential variables. Find $E(X \sqrt{X + Y})$.

Exercise 3.10 (Expectation of a Quotient). Suppose $X, Y$ are independent, and $X \sim \mathrm{Be}(2, 2), \; Y \sim \mathrm{Be}(3, 3)$. Find $E\left(\frac{X^2}{Y^2}\right)$.

Exercise 3.11. Suppose $X, Y, Z$ are three independent standard exponential variables. Find $P(X < 2Y < 3Z)$.
Exercise 3.12 * (Conceptual). Suppose $X \sim U[0, 1]$, and $Y = 2X$. What is the joint distribution of $(X, Y)$? Does the joint distribution have a density?

Exercise 3.13 * (Breaking a Stick). Suppose $X \sim U[0, 1]$, and given that $X = x$, $Y \sim U[0, x]$. Let $U = 1 - X, \; V = Y, \; W = X - Y$. Find the expectation of the maximum of $U, V, W$.
This amounts to breaking a stick, and then breaking the left piece again.

Exercise 3.14 (Iterated Expectation). Suppose $X_1 \sim U[0, 1]$, and for $n \ge 2$, $X_n$ given that $X_{n-1} = x$ is distributed as $U[0, x]$. What is $E(X_n)$, and its limit as $n \to \infty$?

Exercise 3.15 (An MLE). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Poi}(\lambda), \; \lambda > 0$. Show that a unique MLE of $\lambda$ exists unless each $X_i$ is equal to zero, in which case an MLE does not exist.

Exercise 3.16 * (Nonunique MLE). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} U[\mu - 1, \mu + 1], \; -\infty < \mu < \infty$. Show that the MLE of $\mu$ is not unique. Find all the MLEs.

Exercise 3.17 * (A Difficult to Find MLE). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} C(\mu, 1), \; n \ge 2$. Show that the likelihood function is, in general, multimodal. Consider the following data values: -10, 0, 2, 5, 14. Plot the likelihood function and find the MLE of $\mu$.
Exercise 3.18. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \mu), \; \mu > 0$. Show that there is a unique MLE of $\mu$, and find it.

Exercise 3.19 (MLE in a Genetics Problem). According to Mendel's law, the genotypes $aa$, $Aa$, and $AA$ in a population with genetic equilibrium with respect to a single gene having two alleles have proportions $f^2$, $2f(1 - f)$, and $(1 - f)^2$ in the population. Suppose $n$ individuals are sampled from the population and the numbers of observed individuals of each genotype are $n_1, n_2, n_3$, respectively. Find the MLE of $f$.

Exercise 3.20 * (MLE in a Discrete Parameter Problem). Two independent proofreaders A and B are asked to read a manuscript containing $N$ errors; $N \ge 0$ is unknown. $n_1$ errors are found by A alone, $n_2$ by B alone, and $n_{12}$ by both. What is the MLE of $N$? State your assumptions.

Exercise 3.21 * (MLE for Double Exponential Case). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{DoubleExp}(\mu, 1)$. Show that the sample median is one MLE of $\mu$; is it the only MLE?

Exercise 3.22 * (MLE of Common Mean). Suppose $X_1, X_2, \ldots, X_m$ are iid $N(\mu, \sigma_1^2)$ and $Y_1, Y_2, \ldots, Y_n$ are iid $N(\mu, \sigma_2^2)$, and all $m + n$ observations are independent. Find the MLE of $\mu$.

Exercise 3.23 * (MLE Under a Constraint). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, 1)$, where we know that $\mu \ge 0$. Show that there is a unique MLE of $\mu$, and find it.
Hint: Think intuitively.

Exercise 3.24 * (MLE in the Gamma Case). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} G(\alpha, \lambda), \; \alpha > 0, \; \lambda > 0$. Show that there is a unique MLE of $(\alpha, \lambda)$, which is the only stationary point of the logarithm of the likelihood function. Compute it for the following simple dataset $(n = 8)$: $.5, 1, 1.4, 2, 1, 2.5, 1.5, 2$.

Exercise 3.25 (Bivariate Normal Probability). Suppose $X, Y$ are jointly bivariate normal with zero means, unit standard deviations, and correlation $\rho$. Find all values of $\rho$ for which $\frac{1}{4} \le P(X > 0, Y > 0) \le \frac{5}{12}$.

Exercise 3.26. Suppose $X, Y$ are jointly bivariate normal with zero means, unit standard deviations, and correlation $\rho = .75$. Find $P(Y > 2 \mid X = 1)$.

Exercise 3.27. Suppose $X, Y$ are jointly bivariate normal with general parameters. Characterize all constants $a, b$ such that $X + Y$ and $aX + bY$ are independent.

Exercise 3.28 * (Probability of a Diamond). Suppose $X, Y, Z$ are independent $U[-1, 1]$ variables. Find the probability that $|X| + |Y| + |Z| \le 1$.

Exercise 3.29 (Missing the Bus). A bus arrives at a random time between 9:00 AM and 9:15 AM at a stop. Tim will arrive at that stop at a random time between 9:00 AM and 9:15 AM, independently of the bus, and will wait for (at most) five minutes at the stop. Find the probability that Tim will meet the bus.
Exercise 3.30. Cathy and Jen plan to meet at a cafe and each will arrive at the cafe at a random time between 11:00 AM and 11:30 AM, independently of each other. Find the probability that the first to arrive has to wait between 5 and 10 minutes for the other to arrive.

Exercise 3.31 (Bivariate Normal Probability). Suppose the amounts of oil (in barrels) lifted on a given day from two wells are jointly bivariate normal, with means 150 and 200, variances 100 and 25, and correlation .5. What is the probability that the total amount lifted is larger than 400 barrels on one given day? The probability that the amounts lifted from the two wells on one day differ by more than 50 barrels?

Exercise 3.32 * (Conceptual). Suppose $(X, Y)$ have a bivariate normal distribution with zero means, unit standard deviations, and correlation $\rho, \; -1 < \rho < 1$. What is the joint distribution of $(X + Y, X - Y, Y)$? Does this joint distribution have a density?

Exercise 3.33. Suppose $X \sim N(\mu, \sigma^2)$. Find the correlation between $X$ and $Y$, where $Y = X^2$. Find all values of $(\mu, \sigma)$ for which the correlation is zero.

Exercise 3.34 * (Maximum Correlation). Suppose $(X, Y)$ has a general bivariate normal distribution with a positive correlation $\rho$. Show that among all functions $g(X), h(Y)$ with finite variances, the correlation between $g(X)$ and $h(Y)$ is maximized when $g(X) = X, \; h(Y) = Y$.

Exercise 3.35 (Bivariate Normal Calculation). Suppose $X \sim N(0, 1)$, and given $X = x$, $Y \sim N(x + 1, 1)$.
(a) What is the marginal distribution of $Y$?
(b) What is the correlation between $X$ and $Y$?
(c) What is the conditional distribution of $X$ given $Y = y$?

Exercise 3.36 * (Uniform Distribution in a Sphere). Suppose $X, Y, Z$ are uniformly distributed in the unit sphere. Find the mean and the variance of the distance of the point $(X, Y, Z)$ from the origin.

Exercise 3.37 * (Uniform Distribution in a Sphere). Suppose $X, Y, Z$ are uniformly distributed in the unit sphere.
(a) Find the marginal density of $(X, Y)$.
(b) Find the marginal density of $X$.

Exercise 3.38. Suppose $X, Y, Z$ are independent exponentials with means $\lambda, 2\lambda, 3\lambda$. Find $P(X < Y < Z)$.

Exercise 3.39 * (Mean Residual Life). Suppose $X \sim N(\mu, \sigma^2)$. Derive a formula for the mean residual life and investigate its monotonicity behavior with respect to each of $\mu, c, \sigma$, each time holding the other two fixed.
Exercise 3.40 * (Bivariate Normal Conditional Calculation). Suppose the systolic blood pressure $X$ and fasting blood sugar $Y$ are jointly distributed as a bivariate normal in some population with means 120, 105, standard deviations 10, 20, and correlation 0.7. Find the average fasting blood sugar of those with a systolic blood pressure greater than 140.

Exercise 3.41 * (Buffon's Needle). Suppose the plane is gridded by a series of parallel lines, drawn $h$ units apart. A needle of length $l$ is dropped at random on the plane. Let $p(l, h)$ be the probability that the needle intersects one of the parallel lines. Show that
(a) $p(l, h) = \frac{2l}{\pi h}$, if $l \le h$;
(b) $p(l, h) = \frac{2l}{\pi h} - \frac{2}{\pi h}\left[\sqrt{l^2 - h^2} + h \arcsin\left(\frac{h}{l}\right)\right] + 1$, if $l > h$.
References

Cox, D. and Wermuth, N. (1991). A simple approximation for bivariate and trivariate normal integrals, Internat. Statist. Rev., 59, 263–269.
Genz, A. (1993). Comparison of methods for the computation of multivariate normal probabilities, Computing Sciences and Statistics, 25, 400–405.
Kagan, A., Linnik, Y., and Rao, C. R. (1973). Characterization Problems in Mathematical Statistics, Wiley, New York.
Kamat, A. (1953). Incomplete and absolute moments of the multivariate normal distribution, with applications, Biometrika, 40, 20–34.
Mee, R. and Owen, D. (1983). A simple approximation for bivariate normal probability, J. Qual. Tech., 15, 72–75.
Patel, J. and Read, C. (1996). Handbook of the Normal Distribution, Marcel Dekker, New York.
Plackett, R. (1954). A reduction formula for multivariate normal probabilities, Biometrika, 41, 351–360.
Tong, Y. (1990). Multivariate Normal Distribution, Springer-Verlag, New York.
Chapter 4
Advanced Distribution Theory

Studying distributions of functions of several random variables is of primary interest


in probability and statistics. For example, the original variables X1 ; X2 ; : : : ; Xn
could be the inputs into some process or system, and we may be interested in the
output, which is some suitable function of these input variables. Sums, products,
and quotients are special functions that arise quite naturally in applications. These
are discussed with a special emphasis in this chapter, although the general theory is
also presented. Specifically, we present the classic theory of polar transformations
and the Helmert transformation in arbitrary dimensions, and the development of the
Dirichlet, t- and the F -distribution. The t- and the F -distribution arise in numerous
problems in statistics, and the Dirichlet distribution has acquired an extremely spe-
cial role in modeling and also in Bayesian statistics. In addition, these techniques
and results are among the most sophisticated parts of distribution theory.

4.1 Convolutions and Examples

Definition 4.1. Let X; Y be independent random variables. The distribution of their


sum X C Y is called the convolution of X and Y .
Remark. Usually, we study convolutions of two continuous or two discrete random
variables. But, in principle, one could be continuous and the other discrete.
Example 4.1. Suppose X; Y have a joint density function f .x; y/, and suppose
we want to find the density of their sum, namely X C Y . Denote the condi-
tional density of X given Y D y by fXjy .xjy/ and the conditional CDF, namely,
P .X ujY D y/ by FXjY .u/. Then, by the iterated expectation formula,

P .X C Y z/ D EŒIXCY z 
D EY ŒE.IXCY z jY D y/ D EY ŒE.IXCyz jY D y/
D EY ŒP .X z  yjY D y/
Z 1
D EY ŒFXjY .z  y/ D FXjY .z  y/fY .y/dy:
1

In particular, if $X$ and $Y$ are independent, then the conditional CDF $F_{X|Y}(u)$ will be the same as the marginal CDF $F_X(u)$ of $X$. In this case, the expression above simplifies to

$$P(X + Y \le z) = \int_{-\infty}^{\infty} F_X(z - y) f_Y(y)\, dy.$$

The density of $X + Y$ can be obtained by differentiating the CDF of $X + Y$:

$$f_{X+Y}(z) = \frac{d}{dz} P(X + Y \le z) = \frac{d}{dz} \int_{-\infty}^{\infty} F_X(z - y) f_Y(y)\, dy = \int_{-\infty}^{\infty} \frac{d}{dz} F_X(z - y) f_Y(y)\, dy$$

(assuming that the derivative can be carried inside the integral)

$$= \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\, dy.$$

Indeed, this is the general formula for the density of the sum of two real-valued independent continuous random variables.

Theorem 4.1. Let X; Y be independent real-valued random variables with densi-


ties fX .x/; fY .y/; respectively. Let Z D X C Y be the sum of X and Y . Then, the
density of the convolution is
Z 1
fZ .z/ D fX .z  y/fY .y/dy:
1

More generally, if X; Y are not necessarily independent, and have joint density
f .x; y/, then Z D X C Y has the density
Z 1
fZ .z/ D fXjY .z  y/fY .y/dy:
1

Definition 4.2. If X; Y are independent continuous random variables with a com-


mon density f .x/, then the density of the convolution is denoted as f  f . In
general, if X1 ; X2 ; : : : ; Xn are n independent continuous random variables with a
common density f .x/, then the density of their sum X1 C X2 C    C Xn is called
.n/
the n-fold convolution of f and is denoted as f  .

Example 4.2 (Sum of Exponentials). Suppose $X, Y$ are independent $\mathrm{Exp}(\lambda)$ variables, and we want to find the density of $Z = X + Y$. By the convolution formula, for $z > 0$,

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{\lambda} e^{-(z-y)/\lambda} I_{y < z} \cdot \frac{1}{\lambda} e^{-y/\lambda} I_{y > 0}\, dy = \frac{1}{\lambda^2} e^{-z/\lambda} \int_0^z dy = \frac{z e^{-z/\lambda}}{\lambda^2},$$

which is the density of a Gamma distribution with parameters 2 and $\lambda$. Recall that we had proved this earlier in Chapter 1 by using mgfs.
Example 4.3 (Difference of Exponentials). Let $U, V$ be independent standard exponentials. We want to find the density of $Z = U - V$. Writing $X = U$ and $Y = -V$, we notice that $Z = X + Y$, and $X, Y$ are still independent. However, now $Y$ is a negative exponential, and so has density $f_Y(y) = e^{y} I_{y < 0}$. It is also important to note that $Z$ can now take any real value, positive or negative. Substituting into the formula for the convolution density,

$$f_Z(z) = \int_{-\infty}^{\infty} e^{-(z-y)} (I_{y < z})\, e^{y} (I_{y < 0})\, dy.$$

Now, first consider $z > 0$. Then this last expression becomes

$$f_Z(z) = \int_{-\infty}^{0} e^{-(z-y)} e^{y}\, dy = e^{-z} \int_{-\infty}^{0} e^{2y}\, dy = \frac{1}{2} e^{-z}.$$

On the other hand, for $z < 0$, the convolution formula becomes

$$f_Z(z) = \int_{-\infty}^{z} e^{-(z-y)} e^{y}\, dy = e^{-z} \int_{-\infty}^{z} e^{2y}\, dy = e^{-z} \cdot \frac{1}{2} e^{2z} = \frac{1}{2} e^{z}.$$

Combining the two cases, we can write the single formula

$$f_Z(z) = \frac{1}{2} e^{-|z|}, \quad -\infty < z < \infty;$$

that is, if $X, Y$ are independent standard exponentials, then the difference $X - Y$ has a standard double exponential density. This representation of the double exponential is often useful. Also note that although the standard exponential distribution is obviously not symmetric, the distribution of the difference of two independent exponentials is symmetric. This is a useful technique for symmetrizing a random variable.
Definition 4.3 (Symmetrization of a Random Variable). Let $X_1, X_2$ be independent random variables with a common distribution $F$. Then $X_s = X_1 - X_2$ is called the symmetrization of $F$, or symmetrization of $X_1$.

If $X_1$ is a continuous random variable with density $f(x)$, then its symmetrization has the density

$$f_s(z) = \int_{-\infty}^{\infty} f(z + y) f(y)\, dy.$$
Example 4.4 (A Neat General Formula). Suppose $X, Y$ are positive random variables with a joint density of the form $f(x, y) = g(x + y)$. What is the density of the convolution?

Note that now $X, Y$ are in general not independent, because a joint density of the form $g(x + y)$ does not in general factorize into the product form necessary for independence. First, the conditional density

$$f_{X|Y}(x \mid y) = \frac{g(x + y)}{\int_0^{\infty} g(x + y)\, dx} = \frac{g(x + y)}{\int_y^{\infty} g(x)\, dx} = \frac{g(x + y)}{\bar{G}(y)},$$

writing $\bar{G}(y)$ for $\int_y^{\infty} g(x)\, dx$. Also, the marginal density of $Y$ is

$$f_Y(y) = \int_0^{\infty} g(x + y)\, dx = \int_y^{\infty} g(x)\, dx = \bar{G}(y).$$

Substituting into the general case formula for the density of a sum,

$$f_Z(z) = \int_0^{z} \frac{g(z)}{\bar{G}(y)}\, \bar{G}(y)\, dy = z g(z),$$

a very neat formula.

As an application, consider the example of $(X, Y)$ being uniformly distributed in a triangle with the joint density $f(x, y) = 2, \; x, y \ge 0, \; x + y \le 1$. Identifying the function $g$ as $g(z) = 2 I_{0 \le z \le 1}$, we have, from our general formula above, that in this case $Z = X + Y$ has the density $f_Z(z) = 2z, \; 0 \le z \le 1$.
Example 4.5 (Sums of Cauchy Variables). Let $X, Y$ be independent standard Cauchy random variables with the common density function $f(x) = \frac{1}{\pi(1 + x^2)}, \; -\infty < x < \infty$. Then, the density of the convolution is

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\, dy = \int_{-\infty}^{\infty} f(z - y) f(y)\, dy = \frac{1}{\pi^2} \int_{-\infty}^{\infty} \frac{1}{(1 + (z - y)^2)(1 + y^2)}\, dy = \frac{2}{\pi(4 + z^2)},$$

on a partial fraction expansion of the integrand, which gives

$$\int_{-\infty}^{\infty} \frac{dy}{(1 + (z - y)^2)(1 + y^2)} = \frac{2\pi}{4 + z^2}.$$

Therefore, the density of $W = \frac{Z}{2} = \frac{X + Y}{2}$ would be $\frac{1}{\pi(1 + w^2)}$, which is, remarkably, the same standard Cauchy density with which we had started.

By using characteristic functions, which we discuss in Chapter 8, it can be shown that if $X_1, X_2, \ldots, X_n$ are independent standard Cauchy variables, then for any $n \ge 2$, their average $\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$ also has the standard Cauchy distribution.
Example 4.6 (Normal–Poisson Convolution). Here is an example of the convolution of one continuous and one discrete random variable. Let $X \sim N(0, 1)$ and $Y \sim \mathrm{Poi}(\lambda)$. Then their sum $Z = X + Y$ is still continuous, and has the density

$$f_Z(z) = \sum_{y=0}^{\infty} \phi(z - y)\, \frac{e^{-\lambda} \lambda^y}{y!}.$$

More generally, if $X \sim N(0, \sigma^2)$ and $Y \sim \mathrm{Poi}(\lambda)$, then the density of the sum is

$$f_Z(z) = \frac{1}{\sigma} \sum_{y=0}^{\infty} \phi\left(\frac{z - y}{\sigma}\right) \frac{e^{-\lambda} \lambda^y}{y!}.$$

This is not expressible in terms of the elementary functions. However, it is interesting to plot the density. Figure 4.1 shows an unconventional density for the convolution with multiple local maxima and shoulders.
Fig. 4.1 Convolution of N(0, .09) and Poi(4)
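A density of this shape is easy to reproduce numerically. The following minimal Python sketch (ours, not the book's; the truncation point of the Poisson sum is an arbitrary safe choice) evaluates the convolution density of $N(0, .09)$ and $\mathrm{Poi}(4)$ on a grid.

import numpy as np
from scipy.stats import norm, poisson

def normal_poisson_density(z, sigma=0.3, lam=4.0, ymax=40):
    # f_Z(z) = sum over y of the N(y, sigma^2) density at z times P(Y = y),
    # truncated at ymax, beyond which the Poisson mass is negligible
    y = np.arange(ymax + 1)
    return np.sum(norm.pdf(z, loc=y, scale=sigma) * poisson.pmf(y, lam))

zs = np.linspace(-2, 12, 561)
fz = np.array([normal_poisson_density(z) for z in zs])
print(np.trapz(fz, zs))   # integrates to approximately 1

Plotting fz against zs reproduces the multimodal shape of Fig. 4.1.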


172 4 Advanced Distribution Theory

For purposes of summary and easy reference, we list some convolutions of common types below.

Distribution of Summands                 Distribution of Sum
$X_i \sim \mathrm{Bin}(n_i, p)$          $\mathrm{Bin}(\sum n_i, p)$
$X_i \sim \mathrm{Poi}(\lambda_i)$       $\mathrm{Poi}(\sum \lambda_i)$
$X_i \sim \mathrm{NB}(r_i, p)$           $\mathrm{NB}(\sum r_i, p)$
$X_i \sim \mathrm{Exp}(\lambda)$         $\mathrm{Gamma}(n, \lambda)$
$X_i \sim N(\mu_i, \sigma_i^2)$          $N(\sum \mu_i, \sum \sigma_i^2)$
$X_i \sim C(\mu_i, \sigma_i^2)$          $C(\sum \mu_i, (\sum \sigma_i)^2)$
$X_i \sim U[a, b]$                       See Chapter 1
4.2 Products and Quotients and the t- and F-Distribution

Suppose $X, Y$ are two random variables. Then two other functions that arise naturally in many applications are the product $XY$ and the quotient $\frac{X}{Y}$. Following exactly the same technique as for convolutions, one can find the density of each of $XY$ and $\frac{X}{Y}$. More precisely, one first finds the CDF by using the iterated expectation technique, exactly as we did for convolutions, and then differentiates the CDF to obtain the density. Here are the density formulas; they are extremely important and useful. They are proved in exactly the same way that the formula for the density of the convolution was obtained above; you would condition, and then take an iterated expectation. Therefore, the formal detail is omitted.

Theorem 4.2. Let $X, Y$ be continuous random variables with a joint density $f(x, y)$. Let $U = XY, \; V = \frac{X}{Y}$. Then the densities of $U, V$ are given by

$$f_U(u) = \int_{-\infty}^{\infty} \frac{1}{|x|} f\left(x, \frac{u}{x}\right) dx; \qquad f_V(v) = \int_{-\infty}^{\infty} |y|\, f(vy, y)\, dy.$$

Example 4.7 (Product and Quotient of Uniforms). Suppose $X, Y$ are independent $U[0, 1]$ random variables. Then, by the above theorem, the density of the product $U = XY$ is

$$f_U(u) = \int_{-\infty}^{\infty} \frac{1}{|x|} f\left(x, \frac{u}{x}\right) dx = \int_u^1 \frac{1}{x} \cdot 1\, dx = -\log u, \quad 0 < u < 1.$$
Next, again by the above theorem, the density of the quotient $V = \frac{X}{Y}$ is

$$f_V(v) = \int_{-\infty}^{\infty} |y|\, f(vy, y)\, dy = \int_0^{\min\{\frac{1}{v}, 1\}} y\, dy = \frac{\left(\min\{\frac{1}{v}, 1\}\right)^2}{2}, \quad 0 < v < \infty;$$

thus, the density of the quotient $V$ is

$$f_V(v) = \frac{1}{2}, \quad \text{if } 0 < v \le 1; \qquad f_V(v) = \frac{1}{2v^2}, \quad \text{if } v > 1.$$

The density of the quotient is plotted in Fig. 4.2; we see that it is continuous, but not differentiable at $v = 1$.
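Both formulas are easy to corroborate by simulation. Here is a minimal Python check (our own sketch, not part of the original text; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(10**6), rng.random(10**6)
v = x / y

# From the derived density: P(V <= 1) = 1/2, and
# P(V > 2) = integral of 1/(2 v^2) from 2 to infinity = 1/4
print((v <= 1).mean())   # approximately 0.5
print((v > 2).mean())    # approximately 0.25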

Example 4.8 (Ratio of Standard Normals). The distribution of the ratio of two in-
dependent standard normal variables is an interesting one; we show now that it is
in fact a standard Cauchy distribution. Indeed, by applying the general formula, the
density of the quotient V D XY is
Z 1
fV .v/ D jyjf .vy; y/dy
1
Z 1
1 y2 2
D jyje  2 .1Cv / dy
2 1

Fig. 4.2 Density of quotient of two uniforms
$$= \frac{1}{\pi} \int_0^{\infty} y\, e^{-\frac{y^2}{2}(1 + v^2)}\, dy = \frac{1}{\pi(1 + v^2)}, \quad -\infty < v < \infty,$$

by making the substitution $t = \sqrt{1 + v^2}\, y$ in the integral on the last line. This proves that the quotient has a standard Cauchy distribution.

It is important to note that zero means for the normal variables are essential for this result. If either $X$ or $Y$ has a nonzero mean, the quotient has a complicated distribution, and is definitely not Cauchy. The distribution is worked out in Hinkley (1969). It is also highly interesting that there are many other distributions $F$ such that if $X, Y$ are independent with the common distribution $F$, then the quotient $\frac{X}{Y}$ is distributed as a standard Cauchy. One example of such a distribution $F$ is a continuous distribution with the density

$$f(x) = \frac{\sqrt{2}}{\pi} \cdot \frac{1}{1 + x^4}, \quad -\infty < x < \infty.$$
Example 4.9 (The F-Distribution). Let $X \sim G(\alpha, 1), \; Y \sim G(\beta, 1)$, and suppose $X, Y$ are independent. The distribution of the ratio $R = \frac{X/\alpha}{Y/\beta}$ arises in statistics in many contexts and is called an F-distribution. We derive the explicit form of the density here.

First, we find the density of $\frac{X}{Y}$, from which the density of $R = \frac{X/\alpha}{Y/\beta}$ follows easily. Again, by applying the general formula for the density of a quotient, the density of the quotient $V = \frac{X}{Y}$ is

$$f_V(v) = \int_{-\infty}^{\infty} |y|\, f(vy, y)\, dy = \frac{1}{\Gamma(\alpha)\Gamma(\beta)} \int_0^{\infty} y\, e^{-y(1+v)} (vy)^{\alpha - 1} y^{\beta - 1}\, dy$$
$$= \frac{v^{\alpha - 1}}{\Gamma(\alpha)\Gamma(\beta)} \int_0^{\infty} e^{-y(1+v)} y^{\alpha + \beta - 1}\, dy = \frac{1}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha + \beta)\, v^{\alpha - 1}}{(1 + v)^{\alpha + \beta}} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{v^{\alpha - 1}}{(1 + v)^{\alpha + \beta}}, \quad 0 < v < \infty.$$

To complete the example, notice now that $R = cV$, where $c = \frac{\beta}{\alpha}$. Therefore, the density of $R$ is immediately obtained from the density of $V$. Indeed,

$$f_R(r) = \frac{1}{c}\, f_V\left(\frac{r}{c}\right),$$

where $f_V$ is the function we just derived above. If we simplify $f_R(r)$, we get the final expression

$$f_R(r) = \frac{\left(\frac{\beta}{\alpha}\right)^{\beta} r^{\alpha - 1}}{B(\alpha, \beta)\left(r + \frac{\beta}{\alpha}\right)^{\alpha + \beta}}, \quad r > 0.$$

This is the F-density with parameters $\alpha, \beta$; it is common in statistics to refer to $2\alpha$ and $2\beta$ as the degrees of freedom of the distribution.
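For readers who like numerical confirmation, the following Python sketch (ours, not the book's; the parameter values are arbitrary) simulates $R$ from two independent Gammas and compares it with the $F(2\alpha, 2\beta)$ distribution, using the fact that $\chi^2_{2\alpha}/2 \sim G(\alpha, 1)$:

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(2)
alpha, beta = 3.0, 5.0
x = rng.gamma(shape=alpha, scale=1.0, size=10**6)
y = rng.gamma(shape=beta, scale=1.0, size=10**6)
r = (x / alpha) / (y / beta)

for t in [0.5, 1.0, 2.0]:
    print(t, (r <= t).mean(), f_dist.cdf(t, 2 * alpha, 2 * beta))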

Example 4.10 (The Student t-Distribution). Once again, the t-distribution is one
that arises frequently in statistics.qSuppose, X N.0; 1/; Z 2m , and suppose
X; Z are independent. Let Y D Z
m
. Then the distribution of the quotient V D
X
Y
is called t-distribution with m degrees of freedom. We derive its density in this
example.
Recall that Z has the density

e z=2 zm=21
; z > 0:
2m=2 . m2
/

q
Therefore, Y D Z
m
has the density

2
mm=2 e my =2 y m1
fY .y/ D ; y > 0:
2m=2 1 . m2
/

Because, by hypothesis, X and Z are independent, it follows that X and Y are also
independent, and so their joint density f .x; y/ is just the product of the marginal
densities of X and Y .
Once again, by applying our general formula for the density of a quotient,
Z 1
fV .v/ D jyjf .vy; y/dy
1
Z 1
mm=2 2 y 2 =2 2 =2
D p ye v e my y m1 dy
22m=2 1 . m
2
/ 0
Z 1
mm=2 2 Cm/y 2 =2
D p e .v y m dy
22m=2 1 . m
2/ 0

mm=2 . mC1
2 /2
.m1/=2
D p 
22m=2 1 . m
2
/ .m C v2 /.mC1/=2
176 4 Advanced Distribution Theory

mm=2 . mC1 / 1
D p 2
. 2 / .m C v /.mC1/=2
m 2

. mC1
2 /
D p 2
; 1 < v < 1:
m. 2 /.1 C vm /.mC1/=2
m

This is the density of the Student t-distribution with m degrees of freedom.


Note that when the degree of freedom m D 1, this becomes just the stan-
dard Cauchy density. The t-distribution was first derived in 1908 by William
Gossett under the pseudonym Student. The distribution was later named the Student
t-distribution by Ronald Fisher.
The t density, just like the standard normal, is symmetric and unimodal around
zero, although with tails much heavier than those of the standard normal for small
values of m. However, as m ! 1, the density converges pointwise to the standard
normal density, and then the t and the standard normal density look almost the same.
We give a plot of the t density for a few degrees of freedom in Fig. 4.3, and of the
standard normal density for a visual comparison.

Example 4.11 (An Interesting Gaussian Factorization). We exhibit independent


random variables X; Y in this example such that X Y has a standard normal dis-
tribution. Note that if we allow Y to be a constant random variable, then we can
always write such a factorization. After all, we can take X to be standard normal,
and Y to be 1! So, we will exhibit nonconstant X; Y such that they are independent,
and X Y has a standard normal distribution.
2
For this, let X have the density xe x =2 ; x > 0, and let Y have the density
p 1
; 1 < y < 1: Then, by our general formula for the density of a product,
 2
1y
the product U D X Y has the density
Z 1
1  u
fU .u/ D f x; dx
1 jxj x
Z 1
1 1 x 2 =2 1
D xe q dx
 juj x 1 u2
x2

Fig. 4.3 t density for m = 3, 20, 30 degrees of freedom with N(0,1) density superimposed
$$= \frac{1}{\pi} \int_{|u|}^{\infty} \frac{x\, e^{-x^2/2}}{\sqrt{x^2 - u^2}}\, dx = \frac{1}{\pi}\, e^{-u^2/2} \sqrt{\frac{\pi}{2}} = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2},$$

where the final integral is obtained by the substitution $x^2 = u^2 + z^2$.
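This factorization is easy to test by simulation. The $X$ above is a Rayleigh variable, so $X = \sqrt{-2 \log U}$ with $U \sim U[0, 1]$ works, and $Y$ has the arcsine density on $(-1, 1)$, so $Y = \cos(\pi U')$ works. The Python sketch below (our own illustration; seed and sample size arbitrary) checks moments of $XY$ against $N(0, 1)$:

import numpy as np

rng = np.random.default_rng(3)
n = 10**6
x = np.sqrt(-2 * np.log(rng.random(n)))   # density x exp(-x^2/2), x > 0
y = np.cos(np.pi * rng.random(n))         # density 1/(pi sqrt(1 - y^2)), -1 < y < 1
u = x * y

print(u.mean(), u.var())     # approximately 0 and 1
print((u <= 1.0).mean())     # approximately Phi(1) = 0.8413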

4.3 Transformations

The simple technique that we used in the previous section to derive the density of a sum or a product does not extend to functions of a more complex nature. Consider the simple case of just two continuous variables $X, Y$ with some joint density $f(x, y)$, and suppose we want to find the density of some function $U = g(X, Y)$. Then, the general technique is to pair up $U$ with another function $V = h(X, Y)$, and first obtain the joint CDF of $(U, V)$ from the joint CDF of $(X, Y)$. The pairing up has to be done carefully: only some judicious choices of $V$ will work in a given example. Having found the joint CDF of $(U, V)$, by differentiation one finds the joint density of $(U, V)$, and then finally integrates $v$ out to obtain the density of just $U$. Fortunately, this agenda does work out, because the transformation from $(X, Y)$ to $(U, V)$ can be treated as just a change of variable in manipulation with double integrals, and calculus tells us how to find double integrals by making suitable changes of variables (i.e., substitutions). Indeed, the method works out for any number of jointly distributed variables $X_1, X_2, \ldots, X_n$ and a function $U = g(X_1, X_2, \ldots, X_n)$, and the reason it works out is that the whole method is just a change of variables in manipulating a multivariate integral.

Here is the theorem on the density of a multivariate transformation, a major theorem in multivariate distribution theory. It is really nothing but the change of variable theorem of multivariate calculus. After all, probabilities in the continuous case are integrals, and an integral can be evaluated by changing variables to a new set of coordinates. If we do that, then we have to put in the Jacobian term coming from making the change of variable. Translated into densities, the theorem is the following.

Theorem 4.3 (Multivariate Jacobian Formula). Let $X = (X_1, X_2, \ldots, X_n)$ have the joint density function $f(x_1, x_2, \ldots, x_n)$, such that there is an open set $S \subseteq R^n$ with $P(X \in S) = 1$. Suppose $u_i = g_i(x_1, x_2, \ldots, x_n), \; 1 \le i \le n$, are $n$ real-valued functions of $x_1, x_2, \ldots, x_n$ such that
(a) $(x_1, x_2, \ldots, x_n) \to (g_1(x_1, x_2, \ldots, x_n), \ldots, g_n(x_1, x_2, \ldots, x_n))$ is a one-to-one function of $(x_1, x_2, \ldots, x_n)$ on $S$ with range space $T$.
(b) The inverse functions $x_i = h_i(u_1, u_2, \ldots, u_n), \; 1 \le i \le n$, are continuously differentiable on $T$ with respect to each $u_j$.
(c) The Jacobian determinant

$$J = \begin{vmatrix} \frac{\partial x_1}{\partial u_1} & \frac{\partial x_1}{\partial u_2} & \cdots & \frac{\partial x_1}{\partial u_n} \\ \frac{\partial x_2}{\partial u_1} & \frac{\partial x_2}{\partial u_2} & \cdots & \frac{\partial x_2}{\partial u_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial x_n}{\partial u_1} & \frac{\partial x_n}{\partial u_2} & \cdots & \frac{\partial x_n}{\partial u_n} \end{vmatrix}$$

is nonzero.

Then the joint density of $(U_1, U_2, \ldots, U_n)$ is given by

$$f_{U_1, U_2, \ldots, U_n}(u_1, u_2, \ldots, u_n) = f(h_1(u_1, u_2, \ldots, u_n), \ldots, h_n(u_1, u_2, \ldots, u_n))\, |J|,$$

where $|J|$ denotes the absolute value of the Jacobian determinant $J$, and the notation $f$ on the right-hand side means the original joint density of $(X_1, X_2, \ldots, X_n)$.

4.4 Applications of Jacobian Formula

We now show a number of examples of its applications to finding the density of


interesting transformations. We emphasize that quite often only one of the func-
tions ui D gi .x1 ; x2 ; : : : ; xn / is provided, whose density function we want to find.
But unless that function is a really simple one, its density cannot be found directly
without invoking the Jacobian theorem given here. It is necessary to make up the re-
maining .n  1/ functions, and then obtain their joint density by using this Jacobian
theorem. Finally, one would integrate out all these other coordinates to get the den-
sity function of just ui . The other .n  1/ functions need to be found judiciously.

Example 4.12 (A Relation Between Exponential and Uniform). Let $X, Y$ be independent standard exponentials, and define $U = \frac{X}{X + Y}$. We want to find the density of $U$. We have to pair it up with another function $V$ in order to use the Jacobian theorem. We choose $V = X + Y$. We have here a one-to-one function for $x > 0, y > 0$. Indeed, the inverse functions are

$$x = x(u, v) = uv; \qquad y = y(u, v) = v - uv = v(1 - u).$$

The partial derivatives of the inverse functions are

$$\frac{\partial x}{\partial u} = v; \quad \frac{\partial x}{\partial v} = u; \quad \frac{\partial y}{\partial u} = -v; \quad \frac{\partial y}{\partial v} = 1 - u.$$

Thus, the Jacobian determinant equals $J = v(1 - u) + uv = v$. By invoking the Jacobian theorem, the joint density of $U, V$ is

$$f_{U,V}(u, v) = e^{-uv} e^{-v(1 - u)} |v| = v e^{-v}, \quad 0 < u < 1, \; v > 0.$$

Thus, the joint density of $U, V$ has factorized into a product form on a rectangle; the marginals are

$$f_U(u) = 1, \; 0 < u < 1; \qquad f_V(v) = v e^{-v}, \; v > 0,$$

the rectangle being $(0, 1) \times (0, \infty)$. Therefore, we have proved that if $X, Y$ are independent standard exponentials, then $\frac{X}{X + Y}$ and $X + Y$ are independent, and they are, respectively, uniform and a Gamma. Of course, we already knew that $X + Y \sim G(2, 1)$ from our mgf proof in Chapter 1. In Chapter 18 we show that this result can also be proved by using Basu's theorem.
Example 4.13 (A Relation Between Gamma and Beta). The previous example generalizes in a nice way. Let $X, Y$ be independent variables, distributed respectively as $G(\alpha, 1), \; G(\beta, 1)$. Let again $U = \frac{X}{X + Y}, \; V = X + Y$. Then, from our previous example, the Jacobian determinant is still $J = v$. Therefore, the joint density of $U, V$ is

$$f_{U,V}(u, v) = \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\, e^{-v} (uv)^{\alpha - 1} (v(1 - u))^{\beta - 1}\, v = \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha - 1} (1 - u)^{\beta - 1} e^{-v} v^{\alpha + \beta - 1}, \quad 0 < u < 1, \; v > 0.$$

Once again, we have factorized the joint density of $U$ and $V$ as the product of the marginal densities, with $(U, V)$ varying in the rectangle $(0, 1) \times (0, \infty)$, the marginal densities being

$$f_V(v) = \frac{e^{-v} v^{\alpha + \beta - 1}}{\Gamma(\alpha + \beta)}, \quad v > 0; \qquad f_U(u) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha - 1} (1 - u)^{\beta - 1}, \quad 0 < u < 1.$$

That is, if $X, Y$ are independent Gamma variables, then $\frac{X}{X + Y}$ and $X + Y$ are independent, and they are respectively distributed as a Beta and a Gamma. Of course, we already knew from an mgf argument that $X + Y$ is a Gamma.

This relationship between the Gamma and the Beta distribution is useful in simulating values from a Beta distribution.
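As an illustration of that last remark, here is a minimal Python sketch (ours; the parameter values are arbitrary) that simulates $\mathrm{Be}(\alpha, \beta)$ values as $X/(X + Y)$ and checks the first two moments against the exact Beta moments:

import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.5, 4.0
x = rng.gamma(alpha, 1.0, size=10**6)
y = rng.gamma(beta, 1.0, size=10**6)
u = x / (x + y)                      # should be Be(alpha, beta)

print(u.mean(), alpha / (alpha + beta))
print(u.var(), alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))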
4.5 Polar Coordinates in Two Dimensions

Example 4.14 (Transformation to Polar Coordinates). We have already worked out


a few examples where we transformed two variables to their polar coordinates,
in order to calculate expectations of suitable functions, when the variables have
a spherically symmetric density. We now use a transformation to polar coordinates
to do distributional calculations. In any spherically symmetric situation, transfor-
mation to polar coordinates is a technically useful device, and gets the answers out
quickly for many problems.
Let .X; Y / have a spherically symmetric joint density
p p given by f .x; y/ YD
g. x 2 C y 2 /. Consider the polar transformation rD X 2 C Y 2 ; D arctan X :
This is a one-to-one transformation, with the inverse functions given by

x D r cos ; y D r sin :

The partial derivatives of the inverse functions are

@x @x @y @y
D cos ; D r sin ; D sin ; D r cos :
@r @ @r @

Therefore, the Jacobian determinant is J D r cos2  Cr sin2  D r: By the Jacobian


theorem, the density of .r; /is

fr; .r; / D rg.r/;

with r;  belonging to a suitable rectangle, which depends on the exact set of values
.x; y/ on which the original joint density f .x; y/ is strictly positive. But, in any
case, we have established that the joint density of .r; / factorizes into the prod-
uct form on a rectangle, and so in any spherically symmetric situation, the polar
coordinates r and  are independent, a very convenient fact. Always, in a spheri-
cally symmetric case, r will have the density crg.r/ on some interval and for some
suitable normalizing constant c, and  will have a uniform density on some interval.
Now consider three specific choices of the original density function. First con-
sider the uniform case:
1
f .x; y/ D ; 0 < x 2 C y 2 < 1:


Then g.r/ D 1 ; 0 < r < 1: So in this case, r has the density 2r; 0 < r < 1, and 
has the uniform density 2 1
;  <  < .
Next consider the case of two independent standard normals. Indeed, in this case,
the joint density is spherically symmetric, namely,

1 .x 2 Cy 2 /=2
f .x; y/ D e ; 1 < x; y < 1:
2
4.5 Polar Coordinates in Two Dimensions 181
2
1 r =2
Thus, g.r/ D 2 e ; r > 0. Therefore, in this case r has the Weibull density
2
r =2
re ; r > 0, and , again is uniform on .; /.
Finally, consider the caseq of two independent folded standard normals, that is,
2
each of X; Y has the density 2 e x =2 ; x > 0. In this case, r varies on .0; 1/, but
 varies on .0; 2 /. Thus, r and  are still independent, but this time,  is uniform
2
on .0; 2 /, whereas r still has the same Weibull density re r =2 ; r > 0.
Example 4.15 (Usefulness of the Polar Transformation). Suppose $(X, Y)$ are jointly uniform in the unit circle. We use the joint density of $(r, \theta)$ to find the answers to a number of questions.

First, by using the polar transformation,

$$E(X + Y) = E[r(\cos\theta + \sin\theta)] = E(r)\, E(\cos\theta + \sin\theta).$$

Now, $E(r) < \infty$, and

$$E(\cos\theta + \sin\theta) = \frac{1}{2\pi} \int_{-\pi}^{\pi} (\cos\theta + \sin\theta)\, d\theta = \frac{1}{2\pi}(0 + 0) = 0.$$

Therefore, $E(X + Y) = 0$. It should be noted that actually each of $X, Y$ has mean zero in this case, and so we could have proved that $E(X + Y) = 0$ directly too.

Next, suppose we want to find the probability that $(X, Y)$ lies in the intersection of the spherical shell $\frac{1}{4} \le \sqrt{X^2 + Y^2} \le \frac{3}{4}$ and the cone $X, Y > 0, \; \frac{1}{\sqrt{3}} \le \frac{Y}{X} \le \sqrt{3}$. This looks like a hard problem! But polar coordinates will save us. Transforming to the polar coordinates, this probability is

$$P\left(\frac{1}{4} \le r \le \frac{3}{4}, \; \frac{\pi}{6} \le \theta \le \frac{\pi}{3}\right) = P\left(\frac{1}{4} \le r \le \frac{3}{4}\right) P\left(\frac{\pi}{6} \le \theta \le \frac{\pi}{3}\right) = \int_{1/4}^{3/4} 2r\, dr \times \frac{1}{12} = \frac{1}{24}.$$

It would have been a much more tedious calculation to do this using the original rectangular coordinates.
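A quick Monte Carlo check of the $\frac{1}{24}$ answer (our own sketch, not part of the text):

import numpy as np

rng = np.random.default_rng(5)
pts = rng.uniform(-1, 1, size=(4 * 10**6, 2))
x, y = pts[:, 0], pts[:, 1]
keep = x**2 + y**2 < 1                  # points uniform in the unit circle
x, y = x[keep], y[keep]

r = np.hypot(x, y)
theta = np.arctan2(y, x)
hit = (r >= 0.25) & (r <= 0.75) & (theta >= np.pi / 6) & (theta <= np.pi / 3)
print(hit.mean(), 1 / 24)               # both approximately 0.0417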
Example 4.16 (Product of n Uniforms). Let $X_1, X_2, \ldots, X_n$ be independent $U[0, 1]$ variables, and suppose we want to find the density of the product $U = U_n = \prod_{i=1}^n X_i$. According to our general discussion, we have to choose $n - 1$ other functions, and then apply the Jacobian theorem. Define

$$u_1 = x_1, \; u_2 = x_1 x_2, \; u_3 = x_1 x_2 x_3, \; \ldots, \; u_n = x_1 x_2 \cdots x_n.$$

This is a one-to-one transformation, and the inverse functions are $x_i = \frac{u_i}{u_{i-1}}, \; 2 \le i \le n; \; x_1 = u_1$. Thus, the Jacobian matrix of the partial derivatives is lower triangular, and therefore the Jacobian determinant equals the product of the diagonal elements

$$J = \prod_{i=1}^n \frac{\partial x_i}{\partial u_i} = \frac{1}{\prod_{i=1}^{n-1} u_i}.$$

Now applying the Jacobian density theorem, the joint density of $U_1, U_2, \ldots, U_n$ is

$$f_{U_1, U_2, \ldots, U_n}(u_1, u_2, \ldots, u_n) = \frac{1}{\prod_{i=1}^{n-1} u_i}, \quad 0 < u_n < u_{n-1} < \cdots < u_1 < 1.$$

On integrating out $u_1, u_2, \ldots, u_{n-1}$, we get the density of $U_n$:

$$f_{U_n}(u) = \int_u^1 \int_{u_{n-1}}^1 \cdots \int_{u_2}^1 \frac{1}{u_1 u_2 \cdots u_{n-1}}\, du_1\, du_2 \cdots du_{n-2}\, du_{n-1} = \frac{|\log u|^{n-1}}{(n - 1)!}, \quad 0 < u < 1.$$

This example illustrates that applying the Jacobian theorem needs careful manipulation with multiple integrals, and skills in using the Jacobian technique are very important in deriving distributions of functions of many variables.
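A simulation check of this density (ours, not the book's): since $-\log U_n = \sum_{i=1}^n (-\log X_i) \sim G(n, 1)$, the CDF of $U_n$ can be computed from a Gamma survival function and compared with empirical frequencies:

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(6)
n = 5
u = rng.random((10**6, n)).prod(axis=1)

for t in [0.01, 0.1, 0.5]:
    # P(U_n <= t) = P(Gamma(n, 1) >= -log t)
    print(t, (u <= t).mean(), gamma.sf(-np.log(t), a=n))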

4.6  n-Dimensional Polar and Helmert’s Transformation

We saw evidence of practical advantages of transforming to polar coordinates in two


dimensions in the previous section. As a matter of fact, as was remarked in Example
4.14, in any spherically symmetric situation in any number of dimensions, transfor-
mation to the n-dimensional polar coordinates is a standard technical device. The
transformation from rectangular to the polar coordinates often greatly simplifies the
algebraic complexity of the calculations. We first present the n-dimensional polar
transformation in this section.

4.6.1 Efficient Spherical Calculations with Polar Coordinates

Definition 4.4. For $n \ge 3$, the n-dimensional polar transformation is a one-to-one mapping from $R^n \to [0, \infty) \times \prod_{i=1}^{n-2} \Theta_i \times \Theta_{n-1}$, where $\Theta_{n-1} = [0, 2\pi)$, and for $i \le n - 2$, $\Theta_i = [0, \pi]$, with the mapping defined by

$$x_1 = \rho \cos\theta_1,$$
$$x_2 = \rho \sin\theta_1 \cos\theta_2,$$
$$x_3 = \rho \sin\theta_1 \sin\theta_2 \cos\theta_3,$$
$$\vdots$$
$$x_{n-1} = \rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{n-2} \cos\theta_{n-1},$$
$$x_n = \rho \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{n-2} \sin\theta_{n-1}.$$

The transformation has the useful property that $x_1^2 + x_2^2 + \cdots + x_n^2 = \rho^2$ $\forall (x_1, x_2, \ldots, x_n) \in R^n$; that is, $\rho$ is the length of the vector $x = (x_1, x_2, \ldots, x_n)$. The Jacobian determinant of the transformation equals

$$J = \rho^{n-1} \sin^{n-2}\theta_1 \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2}.$$

Consequently, by the Jacobian density theorem, we have the following result.


Theorem 4.4 (Joint Density of Polar Coordinates). Let X1 ; X2 ; : : : ; Xn be n
continuous random variables with a joint density f .x1 ; x2 ; : : : ; xn /. Then the joint
density of .; 1 ; 2 ; : : : ; n1 / is given by

p.; 1 ; 2 ; : : : ; n1 / D n1 f .x1 ; x2 ; : : : ; xn /j sinn2 1 sinn3 2    sin n2 j;

where in the right side, one writes for x1 ; x2 ; : : : ; xn , their defining expressions in
terms of ; 1 ; 2 ; : : : ; n1 , as provided above.
In particular, if X1 ; X2 ; : : : ; Xn have a spherically symmetric joint density
q 
f .x1 ; x2 ; : : : ; xn / D g x12 C Cx22 CC xn2

for some function g, then the joint density of .; 1 ; 2 ; : : : ; n1 / equals

p.; 1 ; 2 ; : : : ; n1 / D n1 g./j sinn2 1 sinn3 2    sin n2 j;

and  is distributed independently of the angles .1 ; 2 ; : : : ; n1 /:


Example 4.17 (Rederiving the Chi-Square Distribution). Suppose that X1 ; X2 ; : : : ;
Xn are independent standard normals, so that their joint density is

1 Pn
1 2
i D1 xi ;
f .x1 ; x2 ; : : : ; xn / D e 2 1 < xi < 1; i D 1; 2; : : : ; n:
.2/n=2

Thus, the joint density is spherically symmetric with


q 
f .x1 ; x2 ; : : : ; xn / D g x12 C Cx22 C    C xn2 ;

where
1 2
g./ D n=2
e 2 :
.2/
184 4 Advanced Distribution Theory
qP
n
Therefore, from our general theorem above,  D 2
i D1 Xi has the density
2
 2
cn1 e for some normalizing constant c. Making the transformation W D 2 ,
we get from the general formula for the density of a monotone transformation in
one dimension (see Chapter 1) that W has the density

fW .w/ D ke w=2 wn=2 1 ; w > 0:


P
It follows that W D 2 D niD1 Xi2 has a 2n density, because the constant k must
necessarily be 2n=21 n , which makes fW .w/ exactly equal to the 2n density. Recall
2
that this was previously proved by using the mgf technique. Now we have a polar
transformation proof of it.
Example 4.18 (Curse of Dimensionality). Suppose $X_1, X_2, \ldots, X_n$ are jointly uniform in the n-dimensional unit sphere $B_n$, with the density $f(x_1, x_2, \ldots, x_n) = \frac{1}{\mathrm{Vol}(B_n)}$, where

$$\mathrm{Vol}(B_n) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)}$$

is the volume of $B_n$. Therefore, $f(x_1, x_2, \ldots, x_n)$ is spherically symmetric with $f(x_1, x_2, \ldots, x_n) = g\left(\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}\right)$, where

$$g(\rho) = \frac{\Gamma\left(\frac{n}{2} + 1\right)}{\pi^{n/2}}, \quad 0 < \rho < 1.$$

Hence, by our general theorem above, $\rho$ has the density $c\, \rho^{n-1}$ for some normalizing constant $c$. The normalizing constant is easily evaluated:

$$1 = \int_0^1 c\, \rho^{n-1}\, d\rho = \frac{c}{n} \; \Rightarrow \; c = n.$$

Thus, the length of an n-dimensional vector picked at random from the unit sphere has the density $n\rho^{n-1}, \; 0 < \rho < 1$. As a consequence, the expected length of an n-dimensional vector picked at random from the unit sphere equals

$$E(\rho) = \int_0^1 n\, \rho^n\, d\rho = \frac{n}{n + 1},$$

which is very close to one for large $n$. So, one can expect that a point chosen at random from a high-dimensional sphere would be very close to the boundary, rather than the center. Once again, we see the curse of dimensionality in action.
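The following Python sketch (ours) samples uniformly from $B_n$, using the representation suggested by this example and Theorem 4.5(c) below: a uniform direction $Z/\|Z\|$ obtained from a normal vector, times an independent radius distributed as $U^{1/n}$, and confirms that the mean length is $\frac{n}{n+1}$:

import numpy as np

rng = np.random.default_rng(8)

def uniform_in_ball(n_dim, size):
    # uniform direction on the sphere boundary, radius with density n r^(n-1)
    z = rng.standard_normal((size, n_dim))
    direction = z / np.linalg.norm(z, axis=1, keepdims=True)
    radius = rng.random((size, 1)) ** (1.0 / n_dim)
    return direction * radius

for n_dim in [3, 10, 50]:
    pts = uniform_in_ball(n_dim, 10**5)
    print(n_dim, np.linalg.norm(pts, axis=1).mean(), n_dim / (n_dim + 1))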
Transformation to polar coordinates also results in some striking formulas and properties for general spherically symmetric distributions. They are collected together in the following theorem. We do not prove this theorem in the text, as each part only requires a transformation to the n-dimensional polar coordinates, and then straightforward calculations.

Theorem 4.5 (General Spherically Symmetric Facts). Let $X = (X_1, X_2, \ldots, X_n)$ have a spherically symmetric joint density $f(x_1, x_2, \ldots, x_n) = g\left(\sqrt{x_1^2 + \cdots + x_n^2}\right)$. Then,
(a) For any $m < n$, the distribution of $(X_1, X_2, \ldots, X_m)$ is also spherically symmetric.
(b) $\rho = \|X\|$ has the density $c\, \rho^{n-1} g(\rho)$, where $c$ is the normalizing constant $\frac{1}{\int_0^{\infty} \rho^{n-1} g(\rho)\, d\rho}$.
(c) Let $U = \frac{X}{\|X\|}$ and $\rho = \|X\|$. Then $U$ and $\rho$ are independent, and $U$ is distributed uniformly on the boundary of the n-dimensional unit sphere.
Conversely, if an n-dimensional random vector $Z$ can be represented as $Z = \rho U$, where $U$ is distributed uniformly on the boundary of the n-dimensional unit sphere, $\rho$ is some nonnegative random variable, and $\rho, U$ are independent, then $Z$ has a spherically symmetric distribution.
(d) For any unit vector $c$, $c_1 X_1 + \cdots + c_n X_n$ has the same distribution as $X_1$.
(e) If $E(|X_i|) < \infty$, then for any vector $c$, $E\left(X_i \mid \sum_{i=1}^n c_i X_i = t\right) = \frac{c_i}{\sum_{i=1}^n c_i^2}\, t$.
(f) If $X$ is uniformly distributed on the boundary of the n-dimensional unit sphere, then a lower-dimensional projection $(X_1, X_2, \ldots, X_k)$ $(k < n)$ has the density

$$f_k(x_1, x_2, \ldots, x_k) = c \left(1 - \sum_{i=1}^k x_i^2\right)^{(n-k)/2 - 1}, \quad \sum_{i=1}^k x_i^2 < 1,$$

where the normalizing constant $c$ equals

$$c = \frac{\Gamma\left(\frac{n}{2}\right)}{\pi^{k/2}\, \Gamma\left(\frac{n-k}{2}\right)};$$

in particular, if $n = 3$, then each $|X_i| \sim U[0, 1]$, and each $X_i \sim U[-1, 1]$.
4.6.2 Independence of Mean and Variance in Normal Case

Another transformation of technical use in spherically symmetric problems is


Helmert’s orthogonal transformation. It transforms an n-dimensional vector to
another n-dimensional vector by making an orthogonal transformation. In a spher-
ically symmetric situation, the orthogonal transformation will not affect the joint
distribution. At the same time, the transformation may unveil structures in the
problem that were not initially apparent. Specifically, when X1 ; X2 ; : : : ; Xn are
independent standard normals, the joint density of X1 ; X2 ; : : : ; Xn is spherically
symmetric. And Helmert’s transformation leads to a number of very important
results in this case. Although quicker proofs of some of these results for the normal
186 4 Advanced Distribution Theory

case are now available, nevertheless the utility of Helmert’s transformation in spher-
ically symmetric situations makes it an important tool. We first need to recall a few
definitions and facts from linear algebra.

Definition 4.5. An n  n real matrix P is called orthogonal if PP 0 D P 0 P D In ,


where In is the n  n identity matrix.
Any orthogonal matrix P has the property that it leaves lengths unchanged,
that is, if x is an n-dimensional vector, then x and P x have the same length:
jjxjj D jjP xjj. Furthermore, the determinant of any orthogonal matrix must be
˙1.jP j D ˙1/.

Definition 4.6 (Helmert's Transformation). Let $X = (X_1, X_2, \ldots, X_n)$ be an n-dimensional random vector. The Helmert transformation of $X$ is the orthogonal transformation $Y = PX$, where $P$ is the $n \times n$ orthogonal matrix

$$P = \begin{pmatrix} \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 & \cdots & 0 \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\frac{2}{\sqrt{6}} & \cdots & 0 \\ \vdots & & & \ddots & \\ \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \cdots & -\frac{n-1}{\sqrt{n(n-1)}} \end{pmatrix}.$$

Verbally, in the first row every element is $\frac{1}{\sqrt{n}}$, and in the subsequent rows, say the $i$th row, every element after the diagonal element in that row is zero.

Two important properties of the Helmert transformation are the following.

Proposition. For all $X$ and $Y = PX$,

$$\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n X_i^2; \qquad \sum_{i=2}^n Y_i^2 = \sum_{i=1}^n (X_i - \bar{X})^2,$$

where $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$.

Proof. $P$ is an orthogonal matrix; thus $\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n X_i^2$. Also, $\sum_{i=2}^n Y_i^2 = \sum_{i=1}^n Y_i^2 - Y_1^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2 = \sum_{i=1}^n (X_i - \bar{X})^2$, by definition of $Y_1$, because the first row of $P$ has all entries equal to $\frac{1}{\sqrt{n}}$, so that $Y_1 = \sqrt{n}\bar{X}$.

These two properties lead to the following two important results.
Theorem 4.6 (Independence of Mean and Variance in Normal Case). Suppose $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables. Then $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ are independently distributed.

Proof. First consider the case $\mu = 0, \sigma = 1$. The Jacobian determinant of the transformation $x \to y = Px$ is $|P| = \pm 1$. Therefore, by the Jacobian density theorem, the joint density of $Y_1, \ldots, Y_n$ is

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\, |J| = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^n x_i^2} = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^n y_i^2},$$

which proves that $Y_1, \ldots, Y_n$ too are independent standard normal variables.

As a consequence, $(Y_2, \ldots, Y_n)$ is independent of $Y_1$, and hence $\sum_{i=2}^n Y_i^2$ is independent of $Y_1$, which means that $\sum_{i=1}^n (X_i - \bar{X})^2$ is independent of $\bar{X}$, due to our earlier observation that $\sum_{i=2}^n Y_i^2 = \sum_{i=1}^n (X_i - \bar{X})^2$, and that $Y_1 = \sqrt{n}\bar{X}$.

Consider now the case of general $\mu, \sigma$. Because $(X_1, X_2, \ldots, X_n)$ has the representation $(X_1, X_2, \ldots, X_n) = (\mu, \mu, \ldots, \mu) + \sigma(Z_1, Z_2, \ldots, Z_n)$, where $Z_1, Z_2, \ldots, Z_n$ are independent standard normals, one has $\bar{X} = \mu + \sigma\bar{Z}$, and $\sum_{i=1}^n (X_i - \bar{X})^2 = \sigma^2 \sum_{i=1}^n (Z_i - \bar{Z})^2$. Therefore, the independence of $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ follows from their independence in the special standard normal case.

Remark. The independence of $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ is a signature property of the normal distribution; we observed it earlier in Chapter 1. A very important consequence of their independence is the following result, of enormous importance in statistics.
Theorem 4.7. Let $X_1, X_2, \ldots, X_n$ be independent $N(\mu, \sigma^2)$ variables. Then,
(a) $\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1}$;
(b) The t statistic $t = \frac{\sqrt{n}(\bar{X} - \mu)}{s} \sim t(n - 1)$, a t-distribution with $n - 1$ degrees of freedom, where $s$ is defined by $(n - 1)s^2 = \sum_{i=1}^n (X_i - \bar{X})^2$.

Proof. It is enough to prove this theorem when $\mu = 0, \sigma = 1$, by the same argument made in the preceding theorem. So we assume that $\mu = 0, \sigma = 1$. Because $\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=2}^n Y_i^2$, and $Y_2, \ldots, Y_n$ are independent $N(0, 1)$ variables, it follows that $\sum_{i=1}^n (X_i - \bar{X})^2$ has a $\chi^2_{n-1}$ distribution. For part (b), note that $\sqrt{n}\bar{X} \sim N(0, 1)$, and write $s = \sqrt{\frac{(n-1)s^2}{n-1}}$, so that $s$ is the square root of a $\chi^2_{n-1}$ random variable divided by $n - 1$, its degrees of freedom. It therefore follows that $\frac{\sqrt{n}\bar{X}}{s}$ has a t-distribution, from the definition of a t-distribution.
4.6.3 The t Confidence Interval

The result in Theorem 4.7 leads to one of the mainstays of statistical methodology, namely the t confidence interval for the mean of a normal distribution, when the variance $\sigma^2$ is unknown. In Section 1.13, we described how to construct a confidence interval for $\mu$ when we know the value of $\sigma^2$. The interval derived there is $\bar{X} \pm z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}$, where $z_{\frac{\alpha}{2}} = \Phi^{-1}(1 - \frac{\alpha}{2})$. Obviously, this interval cannot be used if we do not know the value of $\sigma$. However, we can easily remedy this slight problem by simply using part (b) of Theorem 4.7, which says that if $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$, then $\frac{\sqrt{n}(\bar{X} - \mu)}{s} \sim t(n - 1)$. This result implies, with $t_{\frac{\alpha}{2}, n-1}$ denoting the $1 - \frac{\alpha}{2}$ quantile of the $t(n - 1)$ distribution,

$$P\left(-t_{\frac{\alpha}{2}, n-1} \le \frac{\sqrt{n}(\bar{X} - \mu)}{s} \le t_{\frac{\alpha}{2}, n-1}\right) = 1 - \alpha$$
$$\Leftrightarrow P\left(-t_{\frac{\alpha}{2}, n-1} \frac{s}{\sqrt{n}} \le \bar{X} - \mu \le t_{\frac{\alpha}{2}, n-1} \frac{s}{\sqrt{n}}\right) = 1 - \alpha$$
$$\Leftrightarrow P\left(\bar{X} - t_{\frac{\alpha}{2}, n-1} \frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\frac{\alpha}{2}, n-1} \frac{s}{\sqrt{n}}\right) = 1 - \alpha.$$

The interval $\bar{X} \pm t_{\frac{\alpha}{2}, n-1} \frac{s}{\sqrt{n}}$ is called the $100(1 - \alpha)\%$ t confidence interval. It is based on the assumption of $X_1, \ldots, X_n$ being iid $N(\mu, \sigma^2)$ for some $\mu$ and some $\sigma^2$. In practice, it is often used for very nonnormal or even correlated data. This is unjustified and in fact wrong.
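As a small illustration (ours, not the book's), the interval is a few lines of Python once $\bar{X}$ and $s$ are computed; the simulated data and seed below are arbitrary:

import numpy as np
from scipy.stats import t as t_dist

def t_confidence_interval(x, alpha=0.05):
    # 100(1 - alpha)% t interval for a normal mean, variance unknown
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    half = t_dist.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
    return xbar - half, xbar + half

rng = np.random.default_rng(9)
print(t_confidence_interval(rng.normal(5.0, 2.0, size=25)))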

4.7 The Dirichlet Distribution

The Jacobian density formula, when suitably applied to a set of independent Gamma random variables, results in a hugely useful and important density for random variables in a simplex. In the plane, the standard simplex is the triangle with vertices at $(0, 0), (0, 1)$, and $(1, 0)$. In general $n$ dimensions, the standard simplex is the set of all n-dimensional vectors $x = (x_1, \ldots, x_n)$ such that each $x_i \ge 0$ and $\sum_{i=1}^n x_i \le 1$. If we define an additional $x_{n+1}$ as $x_{n+1} = 1 - \sum_{i=1}^n x_i$, then $(x_1, \ldots, x_{n+1})$ forms a vector of proportions adding to one. Thus, the Dirichlet distribution can be used in any situation where an entity has to necessarily fall into one of $n + 1$ mutually exclusive subclasses, and we want to study the proportion of individuals belonging to the different subclasses. Indeed, when statisticians want to model an ensemble of fractional variables adding to one, they often first look at the Dirichlet distribution as their model. See Aitchison (1986). Dirichlet distributions are also immensely important in Bayesian statistics. Fundamental work on the use of Dirichlet distributions in Bayesian modeling and on calculations using the Dirichlet distribution has been done in Ferguson (1973), Blackwell (1973), and Basu and Tiwari (1982).

Let $X_1, X_2, \ldots, X_{n+1}$ be independent Gamma random variables, with $X_i \sim G(\alpha_i, 1)$. Define

$$p_i = \frac{X_i}{\sum_{j=1}^{n+1} X_j}, \quad 1 \le i \le n,$$

and denote $p_{n+1} = 1 - \sum_{i=1}^n p_i$. Then, we have the following theorem.

Theorem 4.8. $p = (p_1, p_2, \ldots, p_n)$ has the joint density

$$f(p_1, p_2, \ldots, p_n) = \frac{\Gamma\left(\sum_{i=1}^{n+1} \alpha_i\right)}{\prod_{i=1}^{n+1} \Gamma(\alpha_i)} \prod_{i=1}^{n+1} p_i^{\alpha_i - 1}.$$
Proof. This is proved by using the Jacobian density theorem. The transformation

$$(x_1, x_2, \ldots, x_{n+1}) \to \left(p_1, p_2, \ldots, p_n, \; s = \sum_{j=1}^{n+1} x_j\right)$$

is a one-to-one transformation with Jacobian determinant $J = \left(\sum_{j=1}^{n+1} x_j\right)^n = s^n$. Inasmuch as $X_1, X_2, \ldots, X_{n+1}$ are independent Gamma random variables, applying the Jacobian density theorem we get the joint density of $\left(p_1, p_2, \ldots, p_n, \sum_{j=1}^{n+1} X_j\right)$ as

$$f_{p_1, p_2, \ldots, p_n, s}(p_1, p_2, \ldots, p_n, s) = \frac{1}{\prod_{i=1}^{n+1} \Gamma(\alpha_i)}\, e^{-s}\, s^{\sum_{i=1}^{n+1} \alpha_i - 1} \prod_{i=1}^{n+1} p_i^{\alpha_i - 1}.$$

If we now integrate $s$ out (on $(0, \infty)$), we get the joint density of $p_1, p_2, \ldots, p_n$, as stated in this theorem.

Definition 4.7 (Dirichlet Density). An n-dimensional vector $p = (p_1, p_2, \ldots, p_n)$ is said to have the Dirichlet distribution with parameter vector $\alpha = (\alpha_1, \ldots, \alpha_{n+1}), \; \alpha_i > 0$, if it has the joint density

$$f(p_1, p_2, \ldots, p_n) = \frac{\Gamma\left(\sum_{i=1}^{n+1} \alpha_i\right)}{\prod_{i=1}^{n+1} \Gamma(\alpha_i)} \prod_{i=1}^{n+1} p_i^{\alpha_i - 1}, \quad p_i \ge 0, \; \sum_{i=1}^n p_i \le 1.$$

We write $p \sim D_n(\alpha)$.
Remark. When $n = 1$, the Dirichlet density reduces to a Beta density with parameters $\alpha_1, \alpha_2$. Simple integrations give the following moment formulas.

Proposition. Let $p \sim D_n(\alpha)$. Then,

$$E(p_i) = \frac{\alpha_i}{t}, \quad \mathrm{Var}(p_i) = \frac{\alpha_i(t - \alpha_i)}{t^2(t + 1)}, \quad \mathrm{Cov}(p_i, p_j) = -\frac{\alpha_i \alpha_j}{t^2(t + 1)}, \; i \ne j,$$

where $t = \sum_{i=1}^{n+1} \alpha_i$.

Thus, notice that the covariances (and hence the correlations) are always negative.
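The Gamma representation behind Theorem 4.8 is also the standard way to simulate Dirichlet vectors. A minimal Python sketch (ours; the parameter vector is arbitrary) that checks the moment formulas of the proposition:

import numpy as np

rng = np.random.default_rng(10)
alpha = np.array([1.0, 2.0, 3.0, 4.0])        # (alpha_1, ..., alpha_{n+1})
x = rng.gamma(alpha, 1.0, size=(10**6, len(alpha)))
p = x / x.sum(axis=1, keepdims=True)          # rows are Dirichlet vectors

t = alpha.sum()
print(p.mean(axis=0), alpha / t)                              # E(p_i)
print(p.var(axis=0), alpha * (t - alpha) / (t**2 * (t + 1)))  # Var(p_i)
print(np.cov(p[:, 0], p[:, 1])[0, 1],
      -alpha[0] * alpha[1] / (t**2 * (t + 1)))                # Cov(p_1, p_2)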
A convenient fact about the Dirichlet density is that lower-dimensional marginals are also Dirichlet distributions. So are the conditional distributions of suitably renormalized subvectors given the rest.

Theorem 4.9 (Marginal and Conditional Distributions).
(a) Let $p \sim D_n(\alpha)$. Fix $m < n$, and let $p_m = (p_1, \ldots, p_m)$, and $\alpha_m = (\alpha_1, \ldots, \alpha_m, t - \sum_{i=1}^m \alpha_i)$. Then $p_m \sim D_m(\alpha_m)$. In particular, each $p_i \sim \mathrm{Be}(\alpha_i, t - \alpha_i)$.
(b) Let $p \sim D_n(\alpha)$. Fix $m < n$, and let $q_i = \frac{p_i}{1 - \sum_{j=1}^m p_j}, \; i = m + 1, \ldots, n$. Let $\beta_m = (\alpha_{m+1}, \ldots, \alpha_{n+1})$. Then,

$$(q_{m+1}, \ldots, q_n) \mid (p_1, \ldots, p_m) \sim D_{n-m}(\beta_m).$$

These two results follow in a straightforward manner from the definition of conditional densities and the functional form of a Dirichlet density.
It also follows from the representation of a Dirichlet random vector in terms of independent Gamma variables that sums of subsets of the coordinates must have Beta distributions.

Theorem 4.10 (Sums of Subvectors).
(a) Let $p \sim D_n(\alpha)$. Fix $m < n$, and let $S_m = S_{m,n} = \sum_{i=1}^m p_i$. Then $S_m \sim \mathrm{Be}\left(\sum_{i=1}^m \alpha_i, \; t - \sum_{i=1}^m \alpha_i\right)$.
(b) More generally, suppose the entire Dirichlet vector $p$ is partitioned into $k$ subvectors,

$$(p_1, \ldots, p_{m_1}); \; (p_{m_1 + 1}, \ldots, p_{m_2}); \; \ldots; \; (p_{m_{k-1} + 1}, \ldots, p_n).$$

Let $S_1, S_2, \ldots, S_k$ be the sums of the coordinates of these $k$ subvectors. Then

$$(S_1, S_2, \ldots, S_k) \sim D_k\left(\sum_{i=1}^{m_1} \alpha_i, \; \sum_{i=m_1 + 1}^{m_2} \alpha_i, \; \ldots, \; \sum_{i=m_{k-1} + 1}^{n} \alpha_i, \; \alpha_{n+1}\right).$$

The Dirichlet density is obtained by definition as the density of functions of independent Gamma variables; thus it also has connections to the normal distribution, by virtue of the fact that sums of squares of independent standard normal variables are chi square, which are, after all, Gamma random variables. Here is the connection to the normal distribution.
Theorem 4.11 (Dirichlet and Normal). Let $X_1, X_2, \ldots, X_n$ be independent standard normal variables. Fix $k, \; 1 \le k < n$. Let $\|X\| = \sqrt{\sum_{i=1}^n X_i^2}$, the length of $X$. Let $Y_i = \frac{X_i}{\|X\|}, \; i = 1, 2, \ldots, k$. Then,
(a) $(Y_1^2, Y_2^2, \ldots, Y_k^2)$ has the density $D_k(\alpha)$, where $\alpha = (1/2, 1/2, \ldots, 1/2, (n - k)/2)$;
(b) $(Y_1, Y_2, \ldots, Y_k)$ has the density $\frac{\Gamma(\frac{n}{2})}{\pi^{k/2}\, \Gamma(\frac{n-k}{2})} \left(1 - \sum_{i=1}^k y_i^2\right)^{(n-k)/2 - 1}, \; \sum_{i=1}^k y_i^2 < 1$.

Proof. Part (a) is a consequence of the definition of a Dirichlet distribution in terms of independent Gamma variables, and the fact that marginals of a Dirichlet distribution are also Dirichlet. For part (b), first make the transformation from $Y_i^2$ to $|Y_i|, \; 1 \le i \le k$, and use the Jacobian density theorem. Then, observe that the joint density of $(Y_1, Y_2, \ldots, Y_k)$ is symmetric, and so the joint density of $(Y_1, Y_2, \ldots, Y_k)$ and that of $(|Y_1|, |Y_2|, \ldots, |Y_k|)$ are given by essentially the same function, apart from a normalizing constant.
4.7.1 * Picking a Point from the Surface of a Sphere

Actually, part (b) of this last theorem brings out a very interesting connection between the normal distribution and the problem of picking a point at random from the boundary of a high-dimensional sphere. If $U_n = (U_{n1}, U_{n2}, \ldots, U_{nn})$ is uniformly distributed on the boundary of the n-dimensional unit sphere, and if we take $k < n$, hold $k$ fixed, and let $n \to \infty$, then part (b) leads to the very useful fact that the joint distribution of $(U_{n1}, U_{n2}, \ldots, U_{nk})$ is approximately the same as the joint distribution of $k$ independent $N(0, \frac{1}{n})$ variables. That is, if a point was picked at random from the surface of a high-dimensional sphere, and if we then looked at a low-dimensional projection of that point, the projection would act as a set of nearly independent normal variables with zero means and variances $\frac{1}{n}$. This is known as Poincaré's lemma. Compare this with Theorem 15.5, where the exact density of a lower-dimensional projection was worked out; Poincaré's lemma can be derived from there.

4.7.2 * Poincaré's Lemma

Theorem 4.12 (Poincaré's Lemma). Let $U_n = (U_{n1}, U_{n2}, \ldots, U_{nn})$ be uniformly distributed on the boundary of the n-dimensional unit sphere. Let $k \ge 1$ be a fixed positive integer. Then,

$$P\left(\sqrt{n}\, U_{n1} \le x_1, \; \sqrt{n}\, U_{n2} \le x_2, \; \ldots, \; \sqrt{n}\, U_{nk} \le x_k\right) \to \prod_{i=1}^k \Phi(x_i),$$

$$\forall (x_1, x_2, \ldots, x_k) \in R^k.$$

4.8 * Ten Important High-Dimensional Formulas for Easy Reference

Suppose an $n$-dimensional random vector has a spherically symmetric joint density $g\big(\sqrt{\sum_{i=1}^n x_i^2}\big)$. Then, by making the $n$-dimensional polar transformation, we can reduce the expectation of a function $h\big(\sum_{i=1}^n X_i^2\big)$ to just a one-dimensional integral, although in principle, an expectation requires integration on the $n$-dimensional space. The structure of spherical symmetry enables us to make this drastic reduction to just a one-dimensional integral. Similar other reductions follow in many other problems, and essentially all of them are consequences of making a transformation just right for that problem, and then working out the Jacobian. We state a number of such frequently useful formulas in this section.

Theorem 4.13.

(a) Volume of the $n$-dimensional unit sphere $= \dfrac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$.

(b) Surface area of the $n$-dimensional unit sphere $= \dfrac{n\,\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$.

(c) Volume of the $n$-dimensional simplex $= \displaystyle\int_{x_i \ge 0,\, x_1+\cdots+x_n \le 1} dx_1 \cdots dx_n = \frac{1}{n!}$.

(d) $\displaystyle\int_{x_i \ge 0,\, \sum_{i=1}^n (x_i/c_i)^{\alpha_i} \le 1} f\Big(\sum_{i=1}^n \Big(\frac{x_i}{c_i}\Big)^{\alpha_i}\Big) \prod_{i=1}^n x_i^{p_i-1}\, dx_1 \cdots dx_n$
$$= \frac{\prod_{i=1}^n c_i^{p_i}\, \prod_{i=1}^n \Gamma\big(\frac{p_i}{\alpha_i}\big)}{\prod_{i=1}^n \alpha_i\;\; \Gamma\big(\sum_{i=1}^n \frac{p_i}{\alpha_i}\big)} \int_0^1 f(t)\, t^{\sum_{i=1}^n \frac{p_i}{\alpha_i} - 1}\, dt, \quad (c_i, \alpha_i, p_i > 0).$$

(e) $\displaystyle\int_{x_i \ge 0,\, \sum_{i=1}^n x_i^{\alpha_i} \le 1} \frac{\prod_{i=1}^n x_i^{p_i-1}}{\big(\sum_{i=1}^n x_i^{\alpha_i}\big)^{\lambda}}\, dx_1 \cdots dx_n = \frac{\prod_{i=1}^n \Gamma\big(\frac{p_i}{\alpha_i}\big)}{\prod_{i=1}^n \alpha_i\, \big(\sum_{i=1}^n \frac{p_i}{\alpha_i} - \lambda\big)\, \Gamma\big(\sum_{i=1}^n \frac{p_i}{\alpha_i}\big)}, \quad \sum_{i=1}^n \frac{p_i}{\alpha_i} > \lambda$.

(f) $\displaystyle\int_{\sum_{i=1}^n x_i^2 \le 1} \Big(\sum_{i=1}^n p_i x_i\Big)^{2m+1} dx_1 \cdots dx_n = 0 \quad \forall\, p_1, \ldots, p_n,\; \forall\, m \ge 1$.

(g) $\displaystyle\int_{\sum_{i=1}^n x_i^2 \le 1} \Big(\sum_{i=1}^n p_i x_i\Big)^{2m} dx_1 \cdots dx_n = \frac{(2m-1)!}{2^{2m-1}(m-1)!}\; \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+m+1)} \Big(\sum_{i=1}^n p_i^2\Big)^m \quad \forall\, p_1, \ldots, p_n,\; \forall\, m \ge 1$.

(h) $\displaystyle\int_{\sum_{i=1}^{2n} x_i^2 \le 1} e^{\sum_{i=1}^{2n} c_i x_i}\, dx_1 \cdots dx_{2n} = \frac{(2\pi)^n\, I_n\Big(\sqrt{\sum_{i=1}^{2n} c_i^2}\Big)}{\big(\sum_{i=1}^{2n} c_i^2\big)^{n/2}}$,

where $I_n(z)$ denotes the Bessel function defined by that notation.

(i) $\displaystyle\int_{\sum_{i=1}^n x_i^2 \le 1} e^{\sum_{i=1}^n c_i x_i}\, dx_1 \cdots dx_n = \pi^{n/2} \sum_{k=0}^{\infty} \frac{\big(\sum_{i=1}^n c_i^2\big)^k}{4^k\, k!\; \Gamma\big(\frac{n}{2}+k+1\big)}$.

(j) $\displaystyle\int_{\sum_{i=1}^n x_i^2 \le r^2} f\Big(\sqrt{\sum_{i=1}^n x_i^2}\Big)\, dx_1 \cdots dx_n = \frac{2\pi^{n/2}}{\Gamma(\frac{n}{2})} \int_0^r t^{n-1} f(t)\, dt$.
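Formulas such as (a) and (g) can be sanity-checked by Monte Carlo integration over the unit ball. The following is a rough sketch, assuming NumPy and SciPy; the dimension $n$ and the vector $p$ are illustrative choices.

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(2)
n, N = 3, 1_000_000
pts = rng.uniform(-1, 1, size=(N, n))          # uniform on the cube [-1, 1]^n
in_ball = (pts ** 2).sum(axis=1) <= 1.0

# (a): volume of the unit ball vs. pi^{n/2} / Gamma(n/2 + 1).
vol_mc = 2.0 ** n * in_ball.mean()
print(vol_mc, np.pi ** (n / 2) / gamma(n / 2 + 1))

# (g) with m = 1: integral of (sum p_i x_i)^2 over the unit ball.
p = np.array([1.0, 2.0, 3.0])
mc = 2.0 ** n * np.where(in_ball, (pts @ p) ** 2, 0.0).mean()
exact = 0.5 * np.pi ** (n / 2) / gamma(n / 2 + 2) * (p ** 2).sum()
print(mc, exact)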
Exercises

Exercise 4.1. Suppose $X \sim U[0, 1]$, and $Y$ has the density $2y, 0 < y < 1$, and that $X, Y$ are independent. Find the density of $XY$ and of $\frac{X}{Y}$.

Exercise 4.2. Suppose $X \sim U[0, 1]$, and $Y$ has the density $2y, 0 < y < 1$, and that $X, Y$ are independent. Find the density of $X + Y, X - Y, |X - Y|$.

Exercise 4.3. Suppose $(X, Y)$ have the joint pdf $f(x, y) = c(x+y)e^{-x-y}, x, y > 0$.
(a) Are $X, Y$ independent?
(b) Find the normalizing constant $c$.
(c) Find the density of $X + Y$.

Exercise 4.4. Suppose $X, Y$ have the joint density $cxy, 0 < x < y < 1$.
(a) Are $X, Y$ independent?
(b) Find the normalizing constant $c$.
(c) Find the density of $XY$.

Exercise 4.5 * (A Conditioning Argument). Suppose a fair coin is tossed twice and the number of heads obtained is $N$. Let $X, Y$ be independent $U[0, 1]$ variables, and independent of $N$. Find the density of $NXY$.

Exercise 4.6. Suppose $X \sim U[0, a], Y \sim U[0, b], Z \sim U[0, c], 0 < a < b < c$, and that $X, Y, Z$ are independent. Let $m = \min\{X, Y, Z\}$. Find expressions for $P(m = X), P(m = Y), P(m = Z)$.

Exercise 4.7. Suppose $X, Y$ are independent standard exponential random variables. Find the density of $X - Y$, and of $\frac{XY}{(X+Y)^2}$.

Hint: Use $\frac{Y}{X+Y} = 1 - \frac{X}{X+Y}$, and see the examples in text.
Exercise 4.8 * (Uniform in a Circle). Suppose $(X, Y)$ are jointly uniform in the unit circle. By transforming to polar coordinates, find the expectations of $\frac{2XY}{X^2+Y^2}$, and of $\frac{XY}{\sqrt{X^2+Y^2}}$.

Exercise 4.9 * (Length of Bivariate Uniform). Suppose $X, Y$ are independent $U[0, 1]$ variables.
(a) Find the density of $X^2 + Y^2$, and $P(X^2 + Y^2 \le 1)$.
(b) Show that $E(\sqrt{X^2 + Y^2}) \approx .765$.

Hint: It is best to do this directly, and not try polar coordinates.

Exercise 4.10. Suppose $(X, Y)$ have the joint CDF $F(x, y) = x^3 y^2, 0 \le x, y \le 1$. Find the density of $XY$ and of $\frac{X}{Y}$.

Exercise 4.11 * (Distance Between Two Random Points). Suppose $P = (X, Y)$ and $Q = (Z, W)$ are two independently picked points from the unit circle, each according to a uniform distribution in the circle. What is the average distance between $P$ and $Q$?

Exercise 4.12 * (Distance from the Boundary). A point is picked uniformly from the unit square. What is the expected value of the distance of the point from the boundary of the unit square?

Exercise 4.13. Suppose $X, Y$ are independent standard normal variables. Find the values of $P(\frac{X}{Y} < 1)$, and of $P(X < Y)$. Why are they not the same?

Exercise 4.14 (A Normal Calculation). A marksman is going to take two shots at a bull's eye. The distances of the first and the second shot from the bull's eye are distributed as those of $(|X|, |Y|)$, where $X \sim N(0, \sigma^2), Y \sim N(0, \tau^2)$, and $X, Y$ are independent. Find a formula for the probability that the second shot is closer to the target.

Exercise 4.15 * (Quotient in Bivariate Normal). Suppose $(X, Y)$ have a bivariate normal distribution with zero means, unit standard deviations, and a correlation $\rho$. Show that $\frac{X}{Y}$ still has a Cauchy distribution.

Exercise 4.16 * (Product of Beta). Suppose $X, Y$ are independent $Be(\alpha, \beta), Be(\gamma, \delta)$ random variables. Find the density of $XY$. Do you recognize the form?

Exercise 4.17 * (Product of Normals). Suppose $X, Y$ are independent standard normal variables. Find the density of $XY$.
Hint: The answer involves a Bessel function $K_0$.

Exercise 4.18 * (Product of Cauchy). Suppose $X, Y$ are independent standard Cauchy variables. Derive a formula for the density of $XY$.

Exercise 4.19. Prove that the square of a $t$ random variable has an $F$-distribution.

Exercise 4.20 * (Box–Muller Transformation). Suppose $X, Y$ are independent $U[0, 1]$ variables. Let $U = \sqrt{-2\log X}\,\cos(2\pi Y); V = \sqrt{-2\log X}\,\sin(2\pi Y)$. Show that $U, V$ are independent and that each is standard normal.

Remark. This is a convenient method to generate standard normal values, by using only uniform random numbers.

Exercise 4.21 * (Deriving a General Formula). Suppose $(X, Y, Z)$ have a joint density of the form $f(x, y, z) = g(x + y + z), x, y, z > 0$. Find a formula for the density of $X + Y + Z$.

Exercise 4.22. Suppose $(X, Y, Z)$ have a joint density $f(x, y, z) = \frac{6}{(1+x+y+z)^4}, x, y, z > 0$. Find the density of $X + Y + Z$.

Exercise 4.23 (Deriving a General Formula). Suppose $X \sim U[0, 1]$ and $Y$ is an arbitrary continuous random variable. Derive a general formula for the density of $X + Y$.

Exercise 4.24 (Convolution of Uniform and Exponential). Let $X \sim U[0, 1], Y \sim Exp(\lambda)$, and $X, Y$ are independent. Find the density of $X + Y$.

Exercise 4.25 (Convolution of Uniform and Normal). Let $X \sim U[0, 1], Y \sim N(\mu, \sigma^2)$, and $X, Y$ are independent. Find the density of $X + Y$.

Exercise 4.26 (Convolution of Uniform and Cauchy). Let $X \sim U[0, 1], Y \sim C(0, 1)$, and $X, Y$ are independent. Find the density of $X + Y$.

Exercise 4.27 (Convolution of Uniform and Poisson). Let $X \sim U[0, 1], Y \sim Poi(\lambda)$, and let $X, Y$ be independent. Find the density of $X + Y$.

Exercise 4.28 * (Bivariate Cauchy). Suppose $(X, Y)$ has the joint pdf $f(x, y) = \frac{c}{(1+x^2+y^2)^{3/2}}, -\infty < x, y < \infty$.
(a) Find the normalizing constant $c$.
(b) Are $X, Y$ independent?
(c) Find the densities of the polar coordinates $r, \theta$.
(d) Find $P(X^2 + Y^2 \le 1)$.

Exercise 4.29. * Suppose $X, Y, Z$ are independent standard exponentials. Find the joint density of
$$\frac{X}{X+Y+Z}, \quad \frac{X+Y}{X+Y+Z}, \quad X+Y+Z.$$

Exercise 4.30 (Correlation). Suppose $X, Y$ are independent $U[0, 1]$ variables. Find the correlation between $X + Y$ and $\sqrt{XY}$.

Exercise 4.31 (Correlation). Suppose $X, Y$ are jointly uniform in the unit circle. Find the correlation between $XY$ and $X^2 + Y^2$.

Exercise 4.32 * (Sum and Difference of General Exponentials). Suppose $X \sim Exp(\lambda), Y \sim Exp(\mu)$, and that $X, Y$ are independent. Find the density of $X + Y$ and of $X - Y$.

Exercise 4.33 * (Double Exponential Convolution). Suppose $X, Y$ are independent standard double exponentials, each with the density $\frac{1}{2}e^{-|x|}, -\infty < x < \infty$. Find the density of $X + Y$.

Exercise 4.34. * Let $X, Y, Z$ be independent standard normals. Show that $\frac{X + YZ}{\sqrt{1+Z^2}}$ has a normal distribution.

Exercise 4.35 * (Decimal Expansion of a Uniform). Let $X \sim U[0, 1]$ and suppose $X = .n_1 n_2 n_3 \cdots$ is the decimal expansion of $X$. Find the marginal and joint distribution of $n_1, n_2, \ldots, n_k$, for $k \ge 1$.

Exercise 4.36 * (Integer Part and Fractional Part). Let $X$ be a standard exponential variable. Find the joint distribution of the integer part and the fractional part of $X$. Note that they do not have a joint density.

Exercise 4.37 * (Factorization of Chi Square). Suppose $X$ has a chi square distribution with one degree of freedom. Find nonconstant independent random variables $Y, Z$ such that $YZ$ has the same distribution as $X$.
Hint: Look at text.

Exercise 4.38 * (Multivariate Cauchy). Suppose $X_1, X_2, \ldots, X_n$ have the joint density
$$f(x_1, \ldots, x_n) = \frac{c}{(1 + x_1^2 + \cdots + x_n^2)^{\frac{n+1}{2}}},$$
where $c$ is a normalizing constant.
Find the density of $X_1^2 + X_2^2 + \cdots + X_n^2$.

Exercise 4.39 (Ratio of Independent Chi Squares). Suppose $X_1, X_2, \ldots, X_m$ are independent $N(\mu, \sigma^2)$ variables, and $Y_1, Y_2, \ldots, Y_n$ are independent $N(\nu, \tau^2)$ variables. Assume also that all $m + n$ variables are independent. Show that
$$\frac{(n-1)\,\tau^2 \sum_{i=1}^m (X_i - \bar{X})^2}{(m-1)\,\sigma^2 \sum_{i=1}^n (Y_i - \bar{Y})^2}$$
has an $F$ distribution.

Exercise 4.40 * (An Example due to Larry Shepp). Let $X, Y$ be independent standard normals. Show that $\frac{XY}{\sqrt{X^2 + Y^2}}$ has a normal distribution.

Exercise 4.41 (Dirichlet Calculation). Suppose $(p_1, p_2, p_3) \sim D_3(1, 1, 1, 1)$.
(a) What is the marginal density of each of $p_1, p_2, p_3$?
(b) What is the marginal density of $p_1 + p_2$?
(c) Find the conditional probability $P(p_3 > \frac{1}{4} \,|\, p_1 + p_2 = \frac{1}{2})$.

Exercise 4.42 * (Dirichlet Calculation). Suppose $(p_1, p_2, p_3, p_4)$ has a Dirichlet distribution with a general parameter vector $\alpha$. Find each of the following.
(a) $Var(p_1 + p_2 - p_3 - p_4)$.
(b) $P(p_1 + p_2 > p_3 + p_4)$.
(c) $E(p_3 + p_4 \,|\, p_1 + p_2 = c)$.
(d) $Var(p_3 + p_4 \,|\, p_1 + p_2 = c)$.

Exercise 4.43 * (Dirichlet Cross Moments). Suppose $p \sim D_n(\alpha)$. Let $r, s \ge 1$ be integers. Show that
$$E(p_i^r p_j^s) = \frac{\alpha_i(\alpha_i+1)\cdots(\alpha_i+r-1)\;\alpha_j(\alpha_j+1)\cdots(\alpha_j+s-1)}{t(t+1)\cdots(t+r+s-1)},$$
where $t = \sum_{i=1}^{n+1} \alpha_i$.
Chapter 5
Multivariate Normal and Related Distributions

The multivariate normal distribution is the natural extension of the bivariate normal to the case of several jointly distributed random variables. Dating back to the works of Galton, Karl Pearson, Edgeworth, and later Ronald Fisher, the multivariate normal distribution has occupied the central place in modeling jointly distributed continuous random variables. There are several reasons for its special status. Its mathematical properties show a remarkable amount of intrinsic structure; the properties are extremely well studied; statistical methodologies in common use often have their best or optimal performance when the variables are distributed as multivariate normal; and there is the multidimensional central limit theorem and its various consequences, which imply that many kinds of functions of independent random variables are approximately normally distributed, in some suitable sense. We present some of the multivariate normal theory and facts with examples in this chapter.

5.1 Definition and Some Basic Properties

As in the bivariate case, a general multivariate normal distribution is defined as the distribution of a linear function of a standard normal vector. Here is the definition.

Definition 5.1. Let $n \ge 1$. A random vector $Z = (Z_1, Z_2, \ldots, Z_n)$ is said to have an $n$-dimensional standard normal distribution if the $Z_i$ are independent univariate standard normal variables, in which case their joint density is
$$f(z_1, z_2, \ldots, z_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^n z_i^2}, \quad -\infty < z_i < \infty, \; i = 1, 2, \ldots, n.$$
Definition 5.2. Let $n \ge 1$, and let $B$ be an $n \times n$ real matrix of rank $k \le n$. Suppose $Z$ has an $n$-dimensional standard normal distribution. Let $\mu$ be any $n$-dimensional vector of real constants. Then $X = \mu + BZ$ is said to have a multivariate normal distribution with parameters $\mu$ and $\Sigma$, where $\Sigma = BB'$ is an $n \times n$ real symmetric nonnegative definite (nnd) matrix. If $k < n$, the distribution is called a singular multivariate normal. If $k = n$, then $\Sigma$ is positive definite and the distribution of
$X$ is called a nonsingular multivariate normal or often, just a multivariate normal. We use the notation $X \sim N_n(\mu, \Sigma)$, or sometimes $X \sim MVN(\mu, \Sigma)$.
We treat only nonsingular multivariate normals in this chapter.

Theorem 5.1 (Density of Multivariate Normal). Suppose $B$ is full rank. Then, the joint density of $X_1, X_2, \ldots, X_n$ is
$$f(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)},$$
where $x = (x_1, x_2, \ldots, x_n) \in \mathcal{R}^n$.

Proof. By definition of the multivariate normal distribution, and the definition of matrix product, $X_i = \mu_i + \sum_{j=1}^n b_{ij} Z_j, 1 \le i \le n$. This is a one-to-one transformation because $B$ is full rank, and the partial derivatives are $\frac{\partial x_i}{\partial z_j} = b_{ij}$. Hence the determinant of the matrix of the partial derivatives $\frac{\partial x_i}{\partial z_j}$ is $|B|$. The Jacobian determinant is the reciprocal of this determinant, and hence is $|B|^{-1} = |\Sigma|^{-1/2}$, because $|\Sigma| = |BB'| = |B||B'| = |B|^2$. Furthermore, $\sum_{i=1}^n z_i^2 = z'z = (B^{-1}(x-\mu))'(B^{-1}(x-\mu)) = (x-\mu)'(B')^{-1}B^{-1}(x-\mu) = (x-\mu)'\Sigma^{-1}(x-\mu)$. Now the theorem follows from the Jacobian density theorem. □

It follows from the linearity of expectations, and the linearity of covariance, that $E(X_i) = \mu_i$, and $Cov(X_i, X_j) = \sigma_{ij}$, the $(i, j)$th element of $\Sigma$. The vector of expectations is usually called the mean vector, and the matrix of pairwise covariances is called the covariance matrix. Thus, we have the following facts about the physical meanings of $\mu, \Sigma$:

Proposition. The mean vector of $X$ equals $E(X) = \mu$; the covariance matrix of $X$ equals $\Sigma$.

Example 5.1. Figure 5.1 of a simulation of 1000 values from a bivariate normal distribution shows the elliptical shape of the point cloud, as would be expected from the fact that the formula for the density function is a quadratic form in the variables. It is also seen in the plot that the center of the point cloud is quite close to the true means of the variables, namely $\mu_1 = 4.5, \mu_2 = 4$, which were used for the purpose of the simulation.

An important property of a multivariate normal distribution is the property of closure under linear transformations; that is, any number of linear functions of $X$ will also have a multivariate normal distribution. The precise closure property is as follows.

Theorem 5.2 (Density of Linear Transformations). Let $X \sim N_n(\mu, \Sigma)$, and let $A_{k \times n}$ be a matrix of rank $k, k \le n$. Then $Y = AX \sim N_k(A\mu, A\Sigma A')$. In particular, marginally, $X_i \sim N(\mu_i, \sigma_i^2)$, where $\sigma_i^2 = \sigma_{ii}, 1 \le i \le n$, and all lower-dimensional marginals are also multivariate normal in the corresponding dimension.
Fig. 5.1 Simulation of a bivariate normal with means 4.5, 4; variances 1; correlation .75
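A simulation of the kind shown in Fig. 5.1 can be produced in a few lines. The following sketch assumes NumPy and Matplotlib are available, with the parameter values taken from the caption.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
mu = np.array([4.5, 4.0])
Sigma = np.array([[1.0, 0.75], [0.75, 1.0]])   # unit variances, correlation .75

pts = rng.multivariate_normal(mu, Sigma, size=1000)
plt.scatter(pts[:, 0], pts[:, 1], s=5)
plt.xlabel("X"); plt.ylabel("Y")
plt.show()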

Proof. Let $B$ be such that $\Sigma = BB'$. Then, writing $X = \mu + BZ$, we get $Y = A(\mu + BZ) = A\mu + ABZ$, and therefore, using the same representation of a multivariate normal vector once again, $Y \sim N_k(A\mu, (AB)(AB)') = N_k(A\mu, ABB'A') = N_k(A\mu, A\Sigma A')$. □

Corollary (MGF of Multivariate Normal). Let $X \sim N_n(\mu, \Sigma)$. Then the mgf of $X$ exists at all points in $\mathcal{R}^n$ and is given by
$$\psi_{\mu,\Sigma}(t) = e^{t'\mu + \frac{t'\Sigma t}{2}}.$$
This follows on simply observing that the theorem above implies that $t'X \sim N(t'\mu, t'\Sigma t)$, and by then using the formula for the mgf of a univariate normal distribution. □

As we just observed, any linear combination $t'X$ of a multivariate normal vector is univariate normal. A remarkable fact is that the converse is also true.

Theorem 5.3 (Characterization of Multivariate Normal). Let $X$ be an $n$-dimensional random vector, and suppose each linear combination $t'X$ has a univariate normal distribution. Then $X$ has a multivariate normal distribution.

See Tong (1990, p. 29) for a proof of this result.

Example 5.2. Suppose $(X_1, X_2, X_3) \sim N_3(\mu, \Sigma)$, where
$$\mu = \begin{pmatrix} 0 \\ 1 \\ 2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 2 & 2 \\ 1 & 2 & 3 \end{pmatrix}.$$

We want to find the joint distribution of $(X_1 - X_2, X_1 + X_2 + X_3)$. We recognize this to be a linear function $AX$, where
$$A = \begin{pmatrix} 1 & -1 & 0 \\ 1 & 1 & 1 \end{pmatrix}.$$
Therefore, by the theorem above,
$$(X_1 - X_2, X_1 + X_2 + X_3) \sim N_2(A\mu, A\Sigma A'),$$
and by direct matrix multiplication,
$$A\mu = \begin{pmatrix} -1 \\ 3 \end{pmatrix} \qquad \text{and} \qquad A\Sigma A' = \begin{pmatrix} 3 & -2 \\ -2 & 12 \end{pmatrix}.$$
In particular, $X_1 - X_2 \sim N(-1, 3)$; $X_1 + X_2 + X_3 \sim N(3, 12)$, and $Cov(X_1 - X_2, X_1 + X_2 + X_3) = -2$. Therefore, the correlation between $X_1 - X_2$ and $X_1 + X_2 + X_3$ equals $\frac{-2}{\sqrt{3}\sqrt{12}} = -\frac{1}{3}$.
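The matrix computations of this example are easily reproduced numerically; a minimal sketch, assuming NumPy:

import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.0, 1.0],
                  [0.0, 2.0, 2.0],
                  [1.0, 2.0, 3.0]])
A = np.array([[1.0, -1.0, 0.0],
              [1.0,  1.0, 1.0]])

print(A @ mu)            # mean vector of the transform: (-1, 3)
print(A @ Sigma @ A.T)   # covariance matrix: [[3, -2], [-2, 12]]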

5.2 Conditional Distributions

As in the bivariate normal case, zero correlation between a particular pair of variables implies that that pair must be independent. And, as in the bivariate normal case, all lower-dimensional conditional distributions are also multivariate normal.

Theorem 5.4. Suppose $X \sim N_n(\mu, \Sigma)$. Then, $X_i, X_j$ are independent if and only if $\sigma_{ij} = 0$. More generally, if $X$ is partitioned as $X = (X_1, X_2)$, where $X_1$ is $k$-dimensional, and $X_2$ is $n - k$ dimensional, and if $\Sigma$ is accordingly partitioned as
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where $\Sigma_{11}$ is $k \times k$, $\Sigma_{22}$ is $(n-k) \times (n-k)$, then $X_1$ and $X_2$ are independent if and only if $\Sigma_{12}$ is the null matrix.

Proof. The first statement follows immediately because we have proved that all lower-dimensional marginals are also multivariate normal. Therefore, any pair $(X_i, X_j), i \ne j$, is bivariate normal. Now use the bivariate normal property that a zero covariance implies independence.
The second part uses the argument that if $\Sigma_{12}$ is the null matrix, then the joint density of $(X_1, X_2)$ factorizes in a product form, on some calculation, and therefore, $X_1$ and $X_2$ must be independent. Alternatively, the second part also follows from the next theorem given immediately below. □

Theorem 5.5 (Conditional Distributions). Let $n \ge 2$, and $1 \le k < n$. Let $X = (X_1, X_2)$, where $X_1$ is $k$-dimensional, and $X_2$ is $n - k$ dimensional. Suppose $X \sim N_n(\mu, \Sigma)$. Partition $\mu, \Sigma$ as
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where $\mu_1$ is $k$-dimensional, $\mu_2$ is $n - k$ dimensional, and $\Sigma_{11}, \Sigma_{12}, \Sigma_{22}$ are, respectively, $k \times k$, $k \times (n-k)$, $(n-k) \times (n-k)$. Finally, let $\Sigma_{1.2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. Then,
$$X_1 \,|\, X_2 = x_2 \sim N_k\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{1.2}\big).$$

This involves some tedious manipulations with matrices, and we omit the proof. See Tong (1990, pp. 33–35) for the details. An important result that follows from the conditional covariance matrix formula in the above theorem is the following; it says that once you take out the effect of $X_2$ on $X_1$, then $X_2$ and the residual actually become independent.

Corollary. Let $X = (X_1, X_2) \sim N_n(\mu, \Sigma)$. Then, $X_1 - E(X_1 | X_2)$ and $X_2$ are independent.

Example 5.3. Let $(X_1, X_2, X_3) \sim N(\mu, \Sigma)$, where $\mu_1 = \mu_2 = \mu_3 = 0$, $\sigma_{ii} = 1$, $\rho_{12} = \frac{1}{2}$, $\rho_{13} = \rho_{23} = \frac{3}{4}$. It is easily verified that $\Sigma$ is positive definite. We want to find the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3$.
By the above theorem, the conditional mean is
$$\begin{pmatrix} 0 \\ 0 \end{pmatrix} + \begin{pmatrix} \rho_{13} \\ \rho_{23} \end{pmatrix} \frac{1}{\sigma_{33}}(x_3 - 0) = \begin{pmatrix} \frac{3}{4}x_3 \\ \frac{3}{4}x_3 \end{pmatrix} = \left(\frac{3x_3}{4},\; \frac{3x_3}{4}\right).$$
And the conditional covariance matrix is found from the formula in the above theorem as
$$\begin{pmatrix} \frac{7}{16} & -\frac{1}{16} \\ -\frac{1}{16} & \frac{7}{16} \end{pmatrix}.$$
In particular, given $X_3 = x_3$, $X_1 \sim N\big(\frac{3x_3}{4}, \frac{7}{16}\big)$; the distribution of $X_2$ given $X_3 = x_3$ is the same normal distribution. Finally, given $X_3 = x_3$, the correlation between $X_1$ and $X_2$ is
$$\frac{-\frac{1}{16}}{\sqrt{\frac{7}{16}}\,\sqrt{\frac{7}{16}}} = -\frac{1}{7} < 0,$$

although the unconditional correlation between $X_1$ and $X_2$ is positive, because $\rho_{12} = \frac{1}{2} > 0$. The correlation between $X_1$ and $X_2$ given $X_3 = x_3$ is called the partial correlation between $X_1$ and $X_2$ given $X_3 = x_3$.
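The conditional mean and covariance formulas of Theorem 5.5 can be evaluated numerically for this example; a sketch assuming NumPy, with $x_3 = 1$ as an illustrative value:

import numpy as np

Sigma = np.array([[1.0, 0.5, 0.75],
                  [0.5, 1.0, 0.75],
                  [0.75, 0.75, 1.0]])
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

x3 = 1.0
cond_mean = S12 @ np.linalg.inv(S22) @ np.array([x3])   # (3*x3/4, 3*x3/4)
cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S21         # [[7/16, -1/16], [-1/16, 7/16]]
print(cond_mean, cond_cov, sep="\n")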

Example 5.4. Suppose $(X_1, X_2, X_3) \sim N_3(\mu, \Sigma)$, where
$$\Sigma = \begin{pmatrix} 4 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix}.$$
Suppose we want to find all $a, b$ such that $X_3 - aX_1 - bX_2$ is independent of $(X_1, X_2)$. The answer to this question depends only on $\Sigma$; so, we leave the mean vector $\mu$ unspecified.
To answer this question, we first have to note that the three variables $X_3 - aX_1 - bX_2, X_1, X_2$ together have a multivariate normal distribution, by the general theorem on multivariate normality of linear transformations. Because of this fact, $X_3 - aX_1 - bX_2$ is independent of $(X_1, X_2)$ if and only if each of $Cov(X_3 - aX_1 - bX_2, X_1)$, $Cov(X_3 - aX_1 - bX_2, X_2)$ is zero. These are equivalent to
$$\sigma_{31} - a\sigma_{11} - b\sigma_{21} = 0 \quad \text{and} \quad \sigma_{32} - a\sigma_{12} - b\sigma_{22} = 0$$
$$\Leftrightarrow 4a + b = 0 \quad \text{and} \quad a + 2b = 1$$
$$\Rightarrow a = -\frac{1}{7}, \; b = \frac{4}{7}.$$
We conclude this section with a formula for a quadrant probability in the trivariate normal case; no such clean formula is possible in general in dimensions four and more. The formula given below is proved in Tong (1990).

Proposition (Quadrant Probability). Let $(X_1, X_2, X_3)$ have a multivariate normal distribution with means $\mu_i, i = 1, 2, 3$, and pairwise correlations $\rho_{ij}$. Then,
$$P(X_1 > \mu_1, X_2 > \mu_2, X_3 > \mu_3) = \frac{1}{8} + \frac{1}{4\pi}\big[\arcsin \rho_{12} + \arcsin \rho_{23} + \arcsin \rho_{13}\big].$$
Example 5.5 (A Sample Size Problem). Suppose $X_1, X_2, X_3$ are jointly trivariate normal, each with mean zero, and suppose that $\rho_{X_i, X_j} = \rho$ for each pair $X_i, X_j, i \ne j$. Then, by the above proposition, the probability of the first quadrant is $\frac{1}{8} + \frac{3}{4\pi}\arcsin(\rho) = p$ (say). Now suppose that $n$ points $(X_{i1}, X_{i2}, X_{i3})$ have been simulated from this trivariate normal distribution. Let $T$ be the number of such points that fall into the first quadrant. We then have that $T \sim Bin(n, p)$.
Suppose now we need to have at least 100 points fall in the first quadrant, with a large probability, say probability .9. How many points do we need to simulate to ensure this?
By the central limit theorem,
$$P(T \ge 100) \approx 1 - \Phi\left(\frac{99.5 - np}{\sqrt{np(1-p)}}\right) = .9$$
$$\Rightarrow \Phi\left(\frac{99.5 - np}{\sqrt{np(1-p)}}\right) = .1$$

$$\Rightarrow \frac{99.5 - np}{\sqrt{np(1-p)}} = -1.28$$
$$\Rightarrow \sqrt{n} = \frac{1.28\sqrt{p(1-p)} + \sqrt{1.64\,p(1-p) + 398\,p}}{2p}$$
(on solving the quadratic equation in $\sqrt{n}$ from the line before). For instance, if $\rho = .5$, then $p = .25$, and by plugging into the last formula, we get $\sqrt{n} = 21.1 \Rightarrow n \approx 445$.
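The final calculation can be reproduced in code; a sketch assuming NumPy:

import numpy as np

rho = 0.5
p = 1.0 / 8.0 + (3.0 / (4.0 * np.pi)) * np.arcsin(rho)      # p = .25 here

sqrt_n = (1.28 * np.sqrt(p * (1 - p))
          + np.sqrt(1.64 * p * (1 - p) + 398 * p)) / (2 * p)
print(sqrt_n, np.ceil(sqrt_n ** 2))                          # about 21.1 and 445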

5.3 Exchangeable Normal Variables

In some applications, a set of random variables is thought to have identical marginal


distributions, but is not independent. If we also impose the assumption that any pair
has the same bivariate joint distribution, then in the special multivariate normal case
there is a simple description of the joint distribution of the entire set of variables.
This is related to the concept of an exchangeable sequence, which we first define.

Definition 5.3. A sequence of random variables $\{X_1, X_2, \ldots, X_n\}$ is called a finitely exchangeable sequence if the joint distribution of $(X_1, X_2, \ldots, X_n)$ is the same as the joint distribution of $(X_{\pi(1)}, X_{\pi(2)}, \ldots, X_{\pi(n)})$ for any permutation $(\pi(1), \pi(2), \ldots, \pi(n))$ of $(1, 2, \ldots, n)$.

Remark. So, for example, if $n = 2$, and the variables are continuous, then finite exchangeability simply means that the joint density $f(x_1, x_2)$ satisfies $f(x_1, x_2) = f(x_2, x_1) \; \forall\, x_1, x_2$.
An exchangeable multivariate normal sequence has a simple description.

Theorem 5.6. Let $(X_1, X_2, \ldots, X_n)$ have a multivariate normal distribution. Then, $\{X_1, X_2, \ldots, X_n\}$ is an exchangeable sequence if and only if for some common $\mu, \sigma^2, \rho$: $E(X_i) = \mu$, $Var(X_i) = \sigma^2 \; \forall i$, and $\forall\, i \ne j$, $\rho_{X_i, X_j} = \rho$.

Proof. The theorem follows on observing that if all the variances are equal, and if every pair has the same correlation, then $(X_1, X_2, \ldots, X_n)$ and $(X_{\pi(1)}, X_{\pi(2)}, \ldots, X_{\pi(n)})$ have the same covariance matrix for any permutation $(\pi(1), \pi(2), \ldots, \pi(n))$ of $(1, 2, \ldots, n)$. Because they also have the same mean vector (obviously), and because any multivariate normal distribution is fully determined by only its mean vector and its covariance matrix, it follows that $(X_1, X_2, \ldots, X_n)$ and $(X_{\pi(1)}, X_{\pi(2)}, \ldots, X_{\pi(n)})$ have the same multivariate normal distribution, and hence $\{X_1, X_2, \ldots, X_n\}$ is exchangeable. □

Example 5.6 (Constructing Exchangeable Normal Variables). Let $Z_0, Z_1, \ldots, Z_n$ be independent standard normal variables. Now define $X_i = \mu + \sigma(\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i), 1 \le i \le n$, where $0 \le \rho \le 1$. Because $X$ admits the representation $X = \mu + BZ$, where $\mu$ is a vector with each coordinate equal to $\mu$, $X = (X_1, X_2, \ldots, X_n)$ and $Z = (Z_0, Z_1, \ldots, Z_n)$, it follows that $X$ has a multivariate normal distribution. It is clear that $E(X_i) = \mu \; \forall i$. Also, $Var(X_i) = \sigma^2(\rho + 1 - \rho) = \sigma^2 \; \forall i$. Next, $Cov(X_i, X_j) = \sigma^2 \rho\, Cov(Z_0, Z_0) = \sigma^2\rho \; \forall\, i \ne j$. So we have proved that $\{X_1, X_2, \ldots, X_n\}$ is exchangeable. We leave it as an easy exercise to explicitly calculate the matrix $B$ cited in the argument.
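The construction in this example translates directly into a simulation; a sketch assuming NumPy, with illustrative parameter values:

import numpy as np

rng = np.random.default_rng(4)
mu, sigma, rho, n = 0.0, 1.0, 0.6, 5

Z = rng.standard_normal((200_000, n + 1))                    # Z_0, Z_1, ..., Z_n
X = mu + sigma * (np.sqrt(rho) * Z[:, [0]] + np.sqrt(1 - rho) * Z[:, 1:])

print(np.corrcoef(X, rowvar=False).round(2))   # off-diagonal entries close to rho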
In general, the CDF of a multivariate normal distribution is an $n$-dimensional integral, and for $n > 1$, the dimension of the integral can be reduced by one to $n - 1$ by using a conditioning argument. However, in the exchangeable case with the common correlation $\rho \ge 0$, there is a one-dimensional integral representation, given below.
Theorem 5.7 (CDF of Exchangeable Normals). Let $X = (X_1, X_2, \ldots, X_n)$ have an exchangeable multivariate normal distribution with common mean $\mu$, common variance $\sigma^2$, and common correlation $\rho \ge 0$. Let $(x_1, \ldots, x_n) \in \mathcal{R}^n$, and let $a_i = \frac{x_i - \mu}{\sigma}$. Then,
$$P(X_1 \le x_1, \ldots, X_n \le x_n) = \int_{-\infty}^{\infty} \phi(z) \prod_{i=1}^n \Phi\left(\frac{a_i + z\sqrt{\rho}}{\sqrt{1 - \rho}}\right) dz.$$

Proof. We use the representation $X_i = \mu + \sigma(\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i)$, where $Z_0, \ldots, Z_n$ are all independent standard normals. Now, use the iterated expectation technique as
$$P(X_1 \le x_1, \ldots, X_n \le x_n) = E\big[I_{X_1 \le x_1, \ldots, X_n \le x_n}\big]$$
$$= E_{Z_0} E\big[I_{X_1 \le x_1, \ldots, X_n \le x_n} \,|\, Z_0 = z\big]$$
$$= E_{Z_0} E\big[I_{\sqrt{1-\rho}\,Z_1 \le a_1 - \sqrt{\rho}\,z,\; \ldots,\; \sqrt{1-\rho}\,Z_n \le a_n - \sqrt{\rho}\,z} \,|\, Z_0 = z\big]$$
$$= E_{Z_0} E\big[I_{\sqrt{1-\rho}\,Z_1 \le a_1 - \sqrt{\rho}\,z,\; \ldots,\; \sqrt{1-\rho}\,Z_n \le a_n - \sqrt{\rho}\,z}\big]$$
(inasmuch as $Z_0$ is independent of the rest of the $Z_i$)
$$= \int_{-\infty}^{\infty} \phi(z) \prod_{i=1}^n \Phi\left(\frac{a_i - z\sqrt{\rho}}{\sqrt{1-\rho}}\right) dz = \int_{-\infty}^{\infty} \phi(z) \prod_{i=1}^n \Phi\left(\frac{a_i + z\sqrt{\rho}}{\sqrt{1-\rho}}\right) dz,$$
because $\phi(-z) = \phi(z)$. This proves the theorem. □
Example 5.7. Suppose $X_1, X_2, \ldots, X_n$ are exchangeable normal variables with zero means, unit standard deviations, and correlation $\rho$. We are interested in the probability $P(\cap_{i=1}^n \{X_i \le 1\})$, as a function of $\rho$. By the theorem above, this equals the one-dimensional integral $\int_{-\infty}^{\infty} \phi(z)\,\Phi^n\big(\frac{1 + z\sqrt{\rho}}{\sqrt{1-\rho}}\big)\, dz$. Although we cannot simplify this analytically, it is interesting to look at the effect of $\rho$ and $n$ on this probability. Here is a table.
  rho     n = 2    n = 4    n = 6    n = 10
  0       .7079    .5011    .3547    .1777
  .25     .7244    .5630    .4569    .3262
  .5      .7452    .6267    .5527    .4606
  .75     .7731    .6989    .6549    .6004
  .95     .8108    .7823    .7665    .7476

The probabilities decrease with $n$ for fixed $\rho$; this, again, is the curse of dimensionality. On the other hand, for fixed $n$, the probabilities increase with $\rho$; this is because the event under consideration says that the variables, in some sense, act similarly. The probability of their doing so increases when they have a larger correlation.
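The entries of this table come from numerically evaluating the one-dimensional integral; a sketch assuming SciPy:

import numpy as np
from scipy import integrate, stats

def joint_prob(n, rho):
    integrand = lambda z: stats.norm.pdf(z) * stats.norm.cdf(
        (1 + z * np.sqrt(rho)) / np.sqrt(1 - rho)) ** n
    return integrate.quad(integrand, -np.inf, np.inf)[0]

print(joint_prob(6, 0.5))   # about .5527, matching the table entry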

5.4 Sampling Distributions Useful in Statistics

Much as in the case of one dimension, structured results are available for functions of a set of $n$-dimensional independent random vectors, each distributed as a multivariate normal. The applications of most of these results are in statistics. A few major distributional results are collected together in this section.
First, we need some notation. Given $N$ independent random vectors, $X_i, 1 \le i \le N$, each $X_i \sim N_n(\mu, \Sigma)$, we define the sample mean vector and the sample covariance matrix as
$$\bar{X} = \frac{1}{N}\sum_{i=1}^N X_i, \qquad S = \frac{1}{N-1}\sum_{i=1}^N (X_i - \bar{X})(X_i - \bar{X})',$$
where $\sum$ in the definitions above is defined as vector addition, and for a vector $u$, $uu'$ means a matrix product. Note that in one dimension (i.e., when $n = 1$) $\bar{X}$ is also distributed as a normal, and $\bar{X}, S$ are independently distributed. Moreover, in one dimension, $(N-1)S$ has a $\sigma^2\chi^2_{N-1}$ distribution. Analogues of all of these results exist in this general multivariate case. This part of the multivariate normal theory is very classic.
We need another definition.

Definition 5.4 (Wishart Distribution). Let $W_{p \times p}$ be a symmetric positive definite random matrix with elements $w_{ij}, 1 \le i, j \le p$. $W$ is said to have a Wishart distribution with $k$ degrees of freedom ($k \ge p$) and scale matrix $\Sigma$, if the joint density of its elements $w_{ij}, 1 \le i \le j \le p$, is given by
$$f(W) = c\,|W|^{(k-p-1)/2}\, e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1} W)},$$

where the normalizing constant $c$ equals
$$c = \frac{1}{2^{kp/2}\, |\Sigma|^{k/2}\, \pi^{p(p-1)/4}\, \prod_{i=1}^p \Gamma\big(\frac{k-i+1}{2}\big)}.$$
We write $W \sim W_p(k, \Sigma)$.

Theorem 5.8. Let $X_i$ be independent $N_n(\mu, \Sigma)$ random vectors, $1 \le i \le N$. Then,
(a) $\bar{X} \sim N_n\big(\mu, \frac{\Sigma}{N}\big)$.
(b) For $N > n$, $S$ is positive definite with probability one.
(c) For $N > n$, $(N-1)S \sim W_n(N-1, \Sigma)$.
(d) $\bar{X}$ and $S$ are independently distributed.

Part (a) of this theorem follows easily from the representation of a multivariate normal vector in terms of a multivariate standard normal vector. For part (b), see Eaton and Perlman (1973) and Dasgupta (1971). Part (c) is classic. Numerous proofs of part (c) are available. Specifically, see Mahalanobis et al. (1937), Olkin and Roy (1954), and Ghosh and Sinha (2002). Part (d) has also been proved by various methods. Classic proofs are available in Tong (1990). The most efficient proof follows from an application of Basu's theorem, a theorem in statistical inference (Basu (1955); see Chapter 18).

5.4.1 * Wishart Expectation Identities

A series of elegant and highly useful expectation identities for the Wishart distribution were derived in Haff (1977, 1979a,b, 1981). The identities were derived by clever use of the divergence theorem of multidimensional calculus, and resulted in drastic reduction in the amount of algebraic effort involved in classic derivations of various Wishart expectations and moments. Although they can be viewed as results in multivariate probability, their main concrete applications are in multivariate statistics. A selection of these moment and expectation formulas are collected together in the result below.

Theorem 5.9 (Wishart Identities). We need some notation to describe the identities. We start with a general Wishart distributed random matrix:
$$S_{p \times p} \sim W_p(k, \Sigma), \quad \Sigma \text{ nonsingular};$$
$$\mathcal{S} = \text{the set of all } p \times p \text{ nnd matrices};$$
$$f(S), g(S): \text{scalar functions on } \mathcal{S};$$
$$\frac{\partial f}{\partial S} = \text{the matrix with elements } \frac{\partial f}{\partial S_{ij}};$$
$$\mathrm{diag}(M) = \text{diagonal matrix with the same diagonal elements as those of } M;$$
$$M_{(t)} = tM + (1-t)\,\mathrm{diag}(M);$$
$$||M|| = \Big(\sum_{i,j} m_{ij}^2\Big)^{1/2},$$
where $m_{ij}$ are the elements of $M$. We mention that the identities are also valid for the case $p = 1$, which corresponds to the chi-square case.
Theorem 5.10 (Wishart Identities). Let $S \sim W_p(k, \Sigma)$. Suppose $f(S), g(S)$ are twice differentiable with respect to each $s_{ij}$. Let $Q$ be a nonrandom real matrix. Then,
(a) If $kp > 4$: $E[f(S)]\,\mathrm{tr}\,\Sigma^{-1} = (k-p-1)\,\mathrm{tr}\big(E[f(S)S^{-1}]\big) + 2\,\mathrm{tr}\big(E\big[\frac{\partial f}{\partial S}\big]\big)$;
(b) If $kp > 4$: $E[g(S)\,\mathrm{tr}(\Sigma^{-1}Q)] = (k-p-1)\,\mathrm{tr}\big(E[g(S)S^{-1}Q]\big) + 2\,\mathrm{tr}\big(E\big[\frac{\partial g}{\partial S}\cdot Q_{(1/2)}\big]\big)$;
(c) If $kp > 2$: $E[g(S)\,\mathrm{tr}(\Sigma^{-1}S)] = kp\,E[g(S)] + 2\,\mathrm{tr}\big(E\big[\frac{\partial g}{\partial S}\cdot S_{(1/2)}\big]\big)$;
(d) If $kp > 4$: $E[f(S)\,\mathrm{tr}(S^{-1}Q\Sigma^{-1})] = (k-p-2)\,E[f(S)\,\mathrm{tr}(S^{-2}Q)] - E[f(S)(\mathrm{tr}\,S^{-1})(\mathrm{tr}(S^{-1}Q))] + 2\,\mathrm{tr}\big(E\big[\frac{\partial f}{\partial S}\cdot(S^{-1}Q)_{(1/2)}\big]\big)$.

Example 5.8. In identity (b) above, choose $g(S) = 1$, and $Q = I_p$. Then, the identity gives $\mathrm{tr}(\Sigma^{-1}) = (k-p-1)E[\mathrm{tr}\,S^{-1}] \Rightarrow E[\mathrm{tr}\,S^{-1}] = \frac{1}{k-p-1}\,\mathrm{tr}\,\Sigma^{-1}$.
Next, in identity (a), choose $f(S) = \mathrm{tr}(S^{-1})$. Then, the identity gives
$$E[\mathrm{tr}\,S^{-1}]\,\mathrm{tr}\,\Sigma^{-1} = (k-p-1)\,E[(\mathrm{tr}\,S^{-1})^2] - 2E[\mathrm{tr}\,S^{-2}]$$
$$\Rightarrow (k-p-1)\,E[(\mathrm{tr}\,S^{-1})^2] - 2E[\mathrm{tr}\,S^{-2}] = \frac{(\mathrm{tr}\,\Sigma^{-1})^2}{k-p-1}.$$
Using $\mathrm{tr}\,S^{-2} \ge \frac{1}{p}(\mathrm{tr}\,S^{-1})^2$,
$$\frac{(\mathrm{tr}\,\Sigma^{-1})^2}{k-p-1} \le (k-p-1)\,E[(\mathrm{tr}\,S^{-1})^2] - \frac{2}{p}\,E[(\mathrm{tr}\,S^{-1})^2] = \Big(k-p-1-\frac{2}{p}\Big)E[(\mathrm{tr}\,S^{-1})^2]$$
$$\Rightarrow E[(\mathrm{tr}\,S^{-1})^2] \ge \frac{(\mathrm{tr}\,\Sigma^{-1})^2}{(k-p-1)\big(k-p-1-\frac{2}{p}\big)}.$$
Note that we are able to obtain these expectations without doing the hard distributional calculations that accompany classic derivations of such Wishart expectations.
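The first identity derived in this example can be checked by simulation; a sketch assuming SciPy's Wishart generator, with an illustrative $\Sigma$:

import numpy as np
from scipy import stats

p, k = 3, 10
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

# Simulate Wishart matrices and average tr(S^{-1}).
S = stats.wishart(df=k, scale=Sigma).rvs(size=20_000, random_state=5)
tr_inv = np.trace(np.linalg.inv(S), axis1=1, axis2=2).mean()
print(tr_inv, np.trace(np.linalg.inv(Sigma)) / (k - p - 1))   # should be close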

5.4.2 * Hotelling's T² and Distribution of Quadratic Forms

The multidimensional analogue of the $t$ statistic is another important statistic in multivariate statistics. Thus, suppose $X_i, 1 \le i \le N$, are independent $N_n(\mu, \Sigma)$ variables, and let $\bar{X}$ and $S$ be the sample mean vector and covariance matrix. The Hotelling's $T^2$ statistic (proposed in Hotelling (1931)) is the quadratic form
$$T^2 = N(\bar{X} - \mu)'\, S^{-1}\, (\bar{X} - \mu),$$
assuming that $N > n$, so that $S^{-1}$ exists with probability one. This is an extremely important statistic in multivariate statistical analysis. Its distribution was also worked out in Hotelling (1931).
Theorem 5.11. Let $X_i \sim N_n(\mu, \Sigma), 1 \le i \le N$, and suppose they are independent. Assume that $N > n$. Then
$$\frac{N-n}{n(N-1)}\, T^2 \sim F(n, N-n),$$
the $F$-distribution with $n$ and $N - n$ degrees of freedom.
The $T^2$ statistic is an example of a self-normalized statistic, in that the defining quadratic form has been normalized by $S$, the sample covariance matrix, which is a random matrix. Also of great importance in statistical problems is the distribution of quadratic forms that are normalized by some nonrandom matrix, that is, distributions of statistics of the form $(\bar{X} - \mu)'\, A^{-1}\, (\bar{X} - \mu)$, where $A$ is a suitable $n \times n$ nonsingular matrix. It turns out that such quadratic forms are sometimes distributed as chi square, and some classic and very complete results in this direction are available. A particularly important result is the Fisher Cochran theorem. The references for the four theorems below are Rao (1973) and Tong (1990).
Theorem 5.12 (Distribution of Quadratic Forms). Let $X \sim N_n(0, I_n)$, and $B_{n \times n}$ be a symmetric matrix of rank $r \le n$. Then $Q = X'BX$ has a chi-square distribution if and only if $B$ is idempotent, in which case the degrees of freedom of $Q$ are $r = \mathrm{Rank}(B) = \mathrm{tr}(B)$.

Sketch of Proof: By the spectral decomposition theorem, there is an orthogonal matrix $P$ such that $P'BP = \Lambda$, the diagonal matrix of the eigenvalues of $B$. Denote the nonzero eigenvalues as $\lambda_1, \ldots, \lambda_r$, where $r = \mathrm{Rank}(B)$. Then, making the orthogonal transformation $Y = P'X$, $Q = X'BX = Y'P'BPY = Y'\Lambda Y = \sum_{i=1}^r \lambda_i Y_i^2$.
Since $X \sim N_n(0, I_n)$, and $Y = P'X$ is an orthogonal transformation, $Y$ is also $N_n(0, I_n)$, and so, the $Y_i^2$ are independent $\chi_1^2$ variables. This allows one to write the mgf, and hence the cgf (the cumulant generating function) of $Q = \sum_{i=1}^r \lambda_i Y_i^2$. If $Q$ were to be distributed as a chi square, its expectation and hence its degrees of freedom would have to be $r$. If one compares the cgf of a $\chi_r^2$ distribution with that of $Q = \sum_{i=1}^r \lambda_i Y_i^2$, then by matching the coefficients in the Taylor series expansion of both, one gets that $\mathrm{tr}(B^k) = \mathrm{tr}(B) = r \; \forall\, k \ge 1$, which would cause $B$ to be idempotent.
Conversely, if $B$ is idempotent, then necessarily $r$ eigenvalues of $B$ are one, and the others zero, and so $Q = \sum_{i=1}^r \lambda_i Y_i^2$ is a sum of $r$ independent $\chi_1^2$ variables, and hence, is $\chi_r^2$. □

A generalization to arbitrary covariance matrices is the following.

Theorem 5.13. Let $X \sim N_n(\mu, \Sigma)$, and $B_{n \times n}$ a symmetric matrix of rank $r$. Then $Q = (X - \mu)'B(X - \mu) \sim \chi_r^2$ if and only if $B\Sigma B = B$.

The following theorem is of great use in statistics, and especially so in the area of linear statistical models.

Theorem 5.14 (Fisher Cochran Theorem). Let $X \sim N_n(0, I_n)$. Let $B_i, 1 \le i \le k$, be symmetric nonnegative definite matrices of rank $r_i, 1 \le i \le k$, where $\sum_{i=1}^k r_i = n$. Suppose $Q_i = X'B_iX$, and suppose $Q = X'X$ decomposes as $Q = \sum_{i=1}^k Q_i$. Then, each $Q_i \sim \chi_{r_i}^2$, and the $Q_i$ are all independent.

Once again, see Tong (1990) for proofs of the last two theorems. Here is a pretty application of the Fisher Cochran theorem.

Example 5.9 (Independence of Mean and Variance). Suppose $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables. Consider the algebraic identity
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2$$
$$\Rightarrow \sum_{i=1}^n Y_i^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2 + n(\bar{Y})^2,$$
where we let $Y_i = \frac{X_i - \mu}{\sigma}$.
Letting $Q = \sum_{i=1}^n Y_i^2$, $Q_1 = \sum_{i=1}^n (Y_i - \bar{Y})^2$, $Q_2 = n(\bar{Y})^2$, we have the decomposition $Q = Q_1 + Q_2$. Because the $Y_i$ are independent standard normals, $Q \sim \chi_n^2$, and the matrices $B_1, B_2$ corresponding to the quadratic forms $Q_1, Q_2$ have ranks $n - 1$ and 1. Thus, all the assumptions of the Fisher Cochran theorem hold, and so, it follows that
$$Q_1 = \sum_{i=1}^n (Y_i - \bar{Y})^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \sim \chi_{n-1}^2,$$
and that moreover $\sum_{i=1}^n (Y_i - \bar{Y})^2$ and $(\bar{Y})^2$ must be independent. On using the symmetry of the $Y_i$, it follows that $\sum_{i=1}^n (Y_i - \bar{Y})^2$ and $\bar{Y}$ also must be independent, which is the same as saying $\sum_{i=1}^n (X_i - \bar{X})^2$ and $\bar{X}$ must be independent.
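Both conclusions of this example are easy to see in a simulation; a sketch assuming NumPy and SciPy, with illustrative parameter values:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, mu, sigma = 8, 2.0, 1.5
X = rng.normal(mu, sigma, size=(100_000, n))

xbar = X.mean(axis=1)
Q1 = ((X - xbar[:, None]) ** 2).sum(axis=1) / sigma ** 2
print(stats.kstest(Q1, stats.chi2(n - 1).cdf))   # consistent with chi-square(n-1)
print(np.corrcoef(xbar, Q1)[0, 1])               # approximately 0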

In fact, simple general results on independence of linear forms or of quadratic forms are known in the multivariate standard normal case.

Theorem 5.15 (Independence of Forms). Let $X \sim N_n(0, I_n)$. Then,
(a) Two linear functions $c_1'X, c_2'X$ are independent if and only if $c_1'c_2 = 0$;
(b) The linear function $c'X$ and the quadratic form $X'BX$ are independent if and only if $Bc = 0$;

(c) The quadratic forms $X'AX$ and $X'BX$ are independent if and only if $AB = 0$, the null matrix.

Part (a) of this theorem is obvious. Standard proofs of parts (b) and (c) use characteristic functions, which we have not discussed yet.

5.4.3 * Distribution of Correlation Coefficient

In a bivariate normal distribution, independence of the two coordinate variables is equivalent to their being uncorrelated. As a result, there is an intrinsic interest in knowing the value of the correlation coefficient in a bivariate normal distribution. In statistical problems, the correlation $\rho$ is taken to be an unknown parameter, and one tries to estimate it from sample data, $(X_i, Y_i), 1 \le i \le n$, from the underlying bivariate normal distribution. The usual estimate of $\rho$ is the sample correlation coefficient $r$. It is simply a plug-in estimate; that is, one takes the definition of $\rho$:
$$\rho = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y},$$
and for each quantity in this defining expression, substitutes the corresponding sample statistic, which gives
$$r = \frac{\frac{1}{n}\sum_{i=1}^n X_iY_i - \bar{X}\bar{Y}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2}} \qquad (5.1)$$
$$= \frac{\sum_{i=1}^n X_iY_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2\; \sum_{i=1}^n (Y_i - \bar{Y})^2}}. \qquad (5.2)$$
This is the sample correlation coefficient.


Ronald Fisher worked out the density of $r$ for a general bivariate normal distribution. Obviously, the density depends only on $\rho$, and not on $\mu_1, \mu_2, \sigma_1, \sigma_2$. Although the density of $r$ is simple if the true $\rho = 0$, Fisher found it to be extremely complex if $\rho \ne 0$ (Fisher (1915)). This is regarded as a classic calculation.
Theorem 5.16. Let $(X, Y)$ have a bivariate normal distribution with general parameters. Then, the sample correlation coefficient $r$ has the density
$$f_R(r) = \frac{n-2}{\sqrt{2}\,(n-1)\,B(1/2,\, n-1/2)}\,(1-\rho^2)^{(n-1)/2}\,(1-r^2)^{(n-4)/2}\,(1-\rho r)^{3/2-n}\; {}_2F_1\left(\frac{1}{2}, \frac{1}{2};\, n - \frac{1}{2};\, \frac{1+\rho r}{2}\right),$$
where $_2F_1$ denotes the ordinary hypergeometric function, usually denoted by that notation.
In particular, if $X, Y$ are independent, then
$$\frac{\sqrt{n-2}\; r}{\sqrt{1-r^2}} \sim t(n-2),$$
a $t$ distribution with $n - 2$ degrees of freedom.

5.5 * Noncentral Distributions

The $T^2$ statistic of Hotelling is centered at the mean $\mu$ of the distribution. In statistical problems, it is important also to know what the distribution of the $T^2$ statistic is when it is instead centered at some other vector, say $a$, rather than the mean vector $\mu$ itself. Note that the same question can also be asked of just the one-dimensional $t$ statistic, and of the one-dimensional sample variance. These questions are discussed in this section.

The Noncentral t-Distribution. Suppose $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables. The noncentral $t$ statistic is defined as
$$t(a) = \frac{\sqrt{n}(\bar{X} - a)}{s},$$
where $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ is the sample variance. The distribution of $t(a)$ is given in the following result.
Theorem 5.17. The statistic $t(a)$ has the noncentral $t$-distribution with $n - 1$ degrees of freedom and noncentrality parameter $\delta = \frac{\sqrt{n}(\mu - a)}{\sigma}$, with the density function
$$f_{t(a)}(x) = c\, e^{-(n-1)\delta^2/(2(x^2+n-1))}\, (x^2 + n - 1)^{-n/2} \int_0^{\infty} t^{n-1}\, e^{-\frac{1}{2}\left(t - \frac{\delta x}{\sqrt{x^2+n-1}}\right)^2}\, dt, \quad -\infty < x < \infty,$$
where the normalizing constant $c$ equals
$$c = \frac{(n-1)^{(n-1)/2}}{\sqrt{\pi}\,\Gamma\big(\frac{n-1}{2}\big)\, 2^{n/2-1}}.$$
Furthermore, for $n > 2$,
$$E(t(a)) = \delta\,\sqrt{\frac{n-1}{2}}\; \frac{\Gamma\big(\frac{n}{2}-1\big)}{\Gamma\big(\frac{n-1}{2}\big)}.$$

The Noncentral Chi-Square Distribution. Suppose $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables. Then the distribution of $S_a^2 = \sum_{i=1}^n \frac{(X_i - a)^2}{\sigma^2}$ is given in the next theorem.

Theorem 5.18. The statistic $S_a^2$ has the noncentral chi-square distribution with $n$ degrees of freedom and noncentrality parameter $\lambda = \frac{n(\mu - a)^2}{\sigma^2}$, with the density function given by a Poisson mixture of ordinary chi squares
$$f_{S_a^2}(x) = \sum_{k=0}^{\infty} \frac{e^{-\lambda/2}\,(\lambda/2)^k}{k!}\, g_{n+2k}(x),$$
where $g_j(x)$ stands for the density of an ordinary chi-square random variable with $j$ degrees of freedom. Furthermore,
$$E(S_a^2) = n + \lambda; \qquad Var(S_a^2) = 2(n + 2\lambda).$$

5.6 * Some Important Inequalities for Easy Reference

Multivariate normal distributions satisfy a large number of elegant inequalities, many of which are simultaneously intuitive, entirely nontrivial, and also useful. A selection of some of the most prominent such inequalities is presented here for easy reference.
Slepian's Inequality I. Let $X \sim N_n(\mu, \Sigma)$, and $a_1, \ldots, a_n$ any $n$ real constants. Let $\rho_{r,s}$ be the correlation between $X_r, X_s$. Let $(i, j), 1 \le i < j \le n$, be a fixed pair of indices. Then, each of
$$P(\cap_{i=1}^n \{X_i \le a_i\}), \quad P(\cap_{i=1}^n \{X_i \ge a_i\})$$
is strictly increasing in $\rho_{i,j}$, when $\rho_{r,s}$ is held fixed $\forall\, (r, s) \ne (i, j)$.

Slepian's Inequality II. Let $X \sim N_n(\mu, \Sigma)$, and $C \subseteq \mathcal{R}^n$ a convex set, symmetric around $\mu$. Let $\eta \in \mathcal{R}^n$, and $0 \le s < t < \infty$. Then,
$$P(X + s\eta \in C) \ge P(X + t\eta \in C).$$

Anderson's Inequality. Let $X \sim N_n(\mu, \Sigma_1)$, $Y \sim N_n(\mu, \Sigma_2)$, where $\Sigma_2 - \Sigma_1$ is nnd, and $C$ any convex set symmetric around $\mu$. Then,
$$P(X \in C) \ge P(Y \in C).$$

Monotonicity Inequality. Let $X \sim N_n(\mu, \Sigma)$, and let $\rho = \min\{\rho_{i,j}, i \ne j\} \ge 0$. Then,
$$P(\cap_{i=1}^n \{X_i \le a_i\}) \ge \prod_{i=1}^n \Phi\left(\frac{a_i - \mu_i}{\sigma_i}\right);$$
$$P(\cap_{i=1}^n \{X_i \ge a_i\}) \ge \prod_{i=1}^n \Phi\left(\frac{\mu_i - a_i}{\sigma_i}\right).$$

Positive Dependence Inequality. Let $X \sim N_n(\mu, \Sigma)$, and suppose $\sigma_{ij} \ge 0$ for all $i, j, i \ne j$. Then,
$$P(X_i \ge c_i, i = 1, \ldots, n) \ge \prod_{i=1}^n P(X_i \ge c_i).$$

Sidak Inequality. Let $X \sim N_n(\mu, \Sigma)$. Then,
$$P(|X_i| \le c_i, i = 1, \ldots, n) \ge \prod_{i=1}^n P(|X_i| \le c_i),$$
for any constants $c_1, \ldots, c_n$.


Chen Inequality. Let $X \sim N_n(\mu, \Sigma)$, and let $g(x)$ be a real-valued function having all partial derivatives $g_i$, such that $E(|g_i(X)|^2) < \infty$. Then,
$$E(\nabla g(X))'\,\Sigma\, E(\nabla g(X)) \le Var(g(X)) \le E\big[(\nabla g(X))'\,\Sigma\,(\nabla g(X))\big].$$

Cirel'son et al. Concentration Inequality. Let $X \sim N_n(0, I_n)$, and $f: \mathcal{R}^n \to \mathcal{R}$ a Lipschitz function with Lipschitz norm $= \sup_{x,y} \frac{|f(x) - f(y)|}{||x - y||} \le 1$. Then, for any $t > 0$,
$$P(f(X) - Ef(X) \ge t) \le e^{-\frac{t^2}{2}}.$$

Borell Concentration Inequality. Let $X \sim N_n(0, I_n)$, and $f: \mathcal{R}^n \to \mathcal{R}$ a Lipschitz function with Lipschitz norm $= \sup_{x,y} \frac{|f(x) - f(y)|}{||x - y||} \le 1$. Then, for any $t > 0$,
$$P(f(X) - M_f \ge t) \le e^{-\frac{t^2}{2}},$$
where $M_f$ denotes the median of $f(X)$.


Covariance Inequality. Let X Nn .; †/, and let g1 .x/; g2 .x/ be real-valued
functions, monotone nondecreasing in each coordinate xi . Suppose ij  0
8i ¤ j . Then,
Cov.g1 .X/; g2 .X//  0:

References to each of these inequalities can be seen in DasGupta (2008) and


Tong (1990).

Example 5.10. Let $X \sim N_n(\mu, \Sigma)$, and suppose we want to find a bound for the variance of $X'X = \sum_{i=1}^n X_i^2$. Using $g(X) = X'X$ in Chen's inequality, $\nabla g(X) = 2X$, and so, $(\nabla g(X))'\,\Sigma\,(\nabla g(X)) = 4X'\Sigma X$. This gives
$$Var(g(X)) \le 4E[X'\Sigma X] = 4E[\mathrm{tr}(\Sigma XX')] = 4\,\mathrm{tr}\big(\Sigma(\Sigma + \mu\mu')\big) = 4\big(\mathrm{tr}\,\Sigma^2 + \mu'\Sigma\mu\big).$$
Note that it would not be very easy to try to find the variance of $X'X$ directly.
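The bound can be compared with a Monte Carlo estimate of the variance; a sketch assuming NumPy, with illustrative $\mu$ and $\Sigma$:

import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
var_mc = (X ** 2).sum(axis=1).var()                   # Monte Carlo Var(X'X)
bound = 4 * (np.trace(Sigma @ Sigma) + mu @ Sigma @ mu)
print(var_mc, bound)                                  # the bound should dominate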

Exercises

Exercise 5.1. Prove that the density of any multivariate normal distribution is uni-
formly bounded.

Exercise 5.2. Suppose $X_1, X_1 + X_2, X_3 - (X_1 + X_2)$ are jointly multivariate normal. Prove or disprove that $X_1, X_2, X_3$ have a multivariate normal distribution.

Exercise 5.3 (Density Contour). Suppose $X_1, X_2$ have a bivariate normal distribution with means 1, 2, variances 9, 4, and correlation 0.8. Characterize the set of points $(x_1, x_2)$ at which the joint density $f(x_1, x_2) = k$, a given constant, and plot it for $k = .01, .04$.

Exercise 5.4. Suppose $X_1, X_2, X_3, X_4$ have an exchangeable multivariate normal distribution with general parameters. Find all constants $a, b, c, d$ such that $aX_1 + bX_2$ and $cX_3 + dX_4$ are independent.

Exercise 5.5 (Quadrant Probability). Suppose $X_1, X_2, X_3, X_4$ have an exchangeable multivariate normal distribution with zero means, and common variance $\sigma^2$, and common correlation $\rho$. Derive a formula for $P(aX_1 + bX_2 > 0, cX_3 + dX_4 > 0)$.

Exercise 5.6 * (Quadrant Probability). Suppose $X \sim N_n(\mu, \Sigma)$. If $\rho_{ij} = \frac{1}{2} \; \forall\, i, j, i \ne j$, show that $P(X_1 > \mu_1, \ldots, X_n > \mu_n) = \frac{1}{n+1}$.

Exercise 5.7. Suppose $(X_1, X_2, X_3, X_4)$ has a multivariate normal distribution with covariance matrix
$$\Sigma = \begin{pmatrix} 2 & 1 & 1 & 0 \\ 1 & 2 & 1 & 1 \\ 1 & 1 & 3 & 0 \\ 0 & 1 & 0 & 2 \end{pmatrix}.$$
(a) Find the two largest possible subsets of $X_1, X_2, X_3, X_4$ such that one subset is independent of the other subset.
(b) Find the conditional variance of $X_4$ given $X_1$.
(c) Find the conditional variance of $X_4$ given $X_1, X_2$.
(d) Find the conditional covariance matrix of $(X_2, X_4)$ given $(X_1, X_3)$.
(e) Find the variance of $X_1 + X_2 + X_3 + X_4$.
(f) Find all linear combinations $aX_1 + bX_2 + cX_3 + dX_4$ which have a zero correlation with $X_1 + X_2 + X_3 + X_4$.
(g) Find the value of $\max_{c: ||c|| = 1} Var(c'X)$.

Exercise 5.8 * (A Problem of Statistical Importance). Suppose given $Y = y$, $X_1, \ldots, X_n$ are independent $N(y, \sigma^2)$ variables, and $Y$ itself is distributed as $N(m, \tau^2)$. Find the conditional distribution of $Y$ given $X_i = x_i, 1 \le i \le n$.

Exercise 5.9. Let $X, Y$ be independent standard normal variables. Define $Z = 1$ if $Y > 0$; $Z = -1$ if $Y < 0$. Find the marginal distribution of $XZ$, and show that $Z, XZ$ are independent.

Exercise 5.10 * (A Covariance Matrix Calculation). Let $X_1, X_2, \ldots, X_n$ be $n$ independent standard normal variables. For each $k, 1 \le k \le n$, let $\bar{X}_k = \frac{X_1 + \cdots + X_k}{k}$. Let $Y_k = \bar{X}_k - \bar{X}_n$.
(a) Find the joint density of $(Y_1, Y_2, \ldots, Y_{n-1})$.
(b) Prove that $(Y_1, Y_2, \ldots, Y_{n-1})$ is independent of $\bar{X}_n$.

Exercise 5.11 * (A Joint Distribution Calculation). Let $X_0, X_1, \ldots, X_n$ be independent standard normal variables. Let $Y_i = \frac{X_i}{X_0}, 1 \le i \le n$.
(a) Find the joint density of $(Y_1, Y_2, \ldots, Y_n)$.
(b) Prove that $\sum_{i=1}^n c_i Y_i$ has a Cauchy distribution for any $n$-dimensional vector $(c_1, c_2, \ldots, c_n)$.

Exercise 5.12 * (An Interesting Connection). Let $(X_1, \ldots, X_n)$ be distributed uniformly on the boundary of the $n$-dimensional unit sphere. Fix $k < n$ and consider $R_k = \sum_{i=1}^k X_i^2$. By using a connection between the uniform distribution on the boundary of a sphere and the standard normal distribution, prove that $R_k$ has a Beta distribution; identify these Beta parameters.

Exercise 5.13 * (A Second Interesting Connection). Let $(X_1, \ldots, X_n)$ be distributed uniformly on the boundary of the $n$-dimensional unit sphere. Prove that, for any constants $p_1, \ldots, p_n$, such that they are nonnegative and add to one, the quantity $\sum_{i=1}^n \frac{p_i^2}{X_i^2}$ has exactly the same distribution. Hence, what must this distribution be?

Exercise 5.14 (Covariance Between a Linear and a Quadratic Form). Suppose $X \sim N_2(\mu, \Sigma)$. Find the covariance between $c'X$ and $X'BX$, for a general vector $c$ and a general symmetric matrix $B$. When is it zero?

Exercise 5.15. Let $X \sim N_n(0, I_n)$. Find the covariance between $c'X$ and $X'BX$, for a general vector $c$ and a general symmetric matrix $B$. When is it zero?

Exercise 5.16. Let $X \sim N_2(0, \Sigma)$. Characterize all symmetric matrices $B$ such that $X'BX$ has a chi square distribution.

Exercise 5.17 * (Noncentral Chi Square MGF). Show that a noncentral chi-square distribution has the mgf $e^{\frac{\lambda t}{1-2t}}(1 - 2t)^{-k/2}$, where $\lambda$ is the noncentrality parameter and $k$ the degrees of freedom.
For what values of $t$ does this formula apply?

Exercise 5.18 * (Normal Approximation to Noncentral Chi Square). Show that if the degrees of freedom $k$ is large, then a noncentral chi-square random variable is approximately normal, with mean $k + \lambda$ and variance $2(k + 2\lambda)$, where $\lambda$ is the noncentrality parameter.

Exercise 5.19 (Noncentral F Distribution). Suppose $X, Y$ are independent random variables, and $X$ is noncentral chi square with $m$ degrees of freedom and noncentrality parameter $\lambda$, and $Y$ is an ordinary (or central) chi square with $n$ degrees of freedom. Find the density of $\frac{nX}{mY}$.

Remark. The density of $\frac{nX}{mY}$ is called the noncentral $F$-distribution with $m$ and $n$ degrees of freedom, and noncentrality parameter $\lambda$.

Exercise 5.20 (Noncentral F Mean). Suppose $X$ has a noncentral $F$-distribution with $m$ and $n$ degrees of freedom, and noncentrality parameter $\lambda$. Show that $E(X) = \frac{n(m+\lambda)}{m(n-2)}$.
When is this formula valid?
Exercise 5.21 (Noncentral F -Distribution). If X has a noncentral t-distribution,
show that X 2 has a noncentral F distribution.
Exercise 5.22 * (Application of Anderson’s Inequality). Let X have a bivariate
normal distribution with means zero, variances 1; 4, and a correlation of :5. Let Y
qa bivariate normal distribution with means zero, variances 6; 9 and a correlation
have
of 23 . Show that P .X12 C X22 < c/  P .Y12 C Y22 < c/ for all c > 0.
Exercise 5.23. * Let .X; Y; Z/ N3 .0; †/. Show that P ..X  1/2 C .Y  1/2 C
.Z  1/2 1/ > P ..X  2/2 C .Y  2/2 C .Z  2/2 1/.
Hint: Use one of the Slepian inequalities.
Exercise 5.24 (Normal Marginals with Nonnormal Joint). Give an example of
a random vector .X; Y; Z/ such that each of X; Y; Z has a normal distribution, but
jointly .X; Y; Z/ is not multivariate normal.

References

Basu, D. (1955). On statistics independent of a complete sufficient statistic, Sankhyá, 15, 377–380.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer-Verlag, New York.
Dasgupta, S. (1971). Nonsingularity of the sample covariance matrix, Sankhyá, Ser A, 33, 475–478.
Eaton, M. and Perlman, M. (1973). The nonsingularity of generalized sample covariance matrices,
Ann. Stat., 1, 710–717.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples
from an indefinitely large population, Biometrika, 10, 507–515.
Ghosh, M. and Sinha, B. (2002). A simple derivation of the Wishart distribution, Amer. Statist., 56,
100–101.
Haff, L. (1977). Minimax estimators for a multinormal precision matrix, J. Mult. Anal., 7, 374–385.
Haff, L. (1979a). Estimation of the inverse covariance matrix, Ann. Stat., 6, 1264–1276.
Haff, L. (1979b). An identity for the Wishart distribution with applications, J. Mult. Anal., 9,
531–544.
Haff, L. (1981). Further identities for the Wishart distribution with applications in regression,
Canad. J. Stat., 9, 215–224.
Hotelling, H. (1931). The generalization of Student’s ratio, Ann. Math. Statist., 2, 360–378.
Mahalanobis, P., Bose, R. and Roy, S. (1937). Normalization of statistical variates and the use of
rectangular coordinates in the theory of sampling distributions, Sankhyá, 3, 1–40.
Olkin, I. and Roy, S. (1954). On multivariate distribution theory, Ann. Math. Statist., 25, 329–339.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, Wiley, New York.
Tong, Y. (1990). The Multivariate Normal Distribution, Springer-Verlag, New York.
Chapter 6
Finite Sample Theory of Order Statistics
and Extremes

The ordered values of a sample of observations are called the order statistics of the sample, and the smallest and the largest are called the extremes. Order statistics and extremes are among the most important functions of a set of random variables that we study in probability and statistics. There is natural interest in studying the highs and lows of a sequence, and the other order statistics help in understanding the concentration of probability in a distribution, or equivalently, the diversity in the population represented by the distribution. Order statistics are also useful in statistical inference, where estimates of parameters are often based on some suitable functions of the order statistics. In particular, the median is of very special importance. There is a well-developed theory of the order statistics of a fixed number $n$ of observations from a fixed distribution, as well as an asymptotic theory where $n$ goes to infinity. We discuss the case of fixed $n$ in this chapter. A distribution theory for order statistics when the observations are from a discrete distribution is complex, both notationally and algebraically, because of the fact that there could be several observations which are actually equal. These ties among the sample values make the distribution theory cumbersome. We therefore concentrate on the continuous case.
Principal references for this chapter are the books by David (1980), Reiss (1989), Galambos (1987), Resnick (2007), and Leadbetter et al. (1983). Specific other references are given in the sections.

6.1 Basic Distribution Theory

Definition 6.1. Let $X_1, X_2, \ldots, X_n$ be any $n$ real-valued random variables. Let $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote the ordered values of $X_1, X_2, \ldots, X_n$. Then, $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ are called the order statistics of $X_1, X_2, \ldots, X_n$.

Remark. Thus, the minimum among X1 ; X2 ; : : : ; Xn is the first-order statistic, and


the maximum the nth-order statistic. The middle value among X1 ; X2 ; : : : ; Xn is
called the median. But it needs to be defined precisely, because there is really no
middle value when n is an even integer. Here is our definition.


Definition 6.2. Let $X_1, X_2, \ldots, X_n$ be any $n$ real-valued random variables. Then, the median of $X_1, X_2, \ldots, X_n$ is defined to be $M_n = X_{(m+1)}$ if $n = 2m + 1$ (an odd integer), and $M_n = X_{(m)}$ if $n = 2m$ (an even integer). That is, in either case, the median is the order statistic $X_{(k)}$ where $k$ is the smallest integer $\ge \frac{n}{2}$.

Example 6.1. Suppose .3, .53, .68, .06, .73, .48, .87, .42, .89, .44 are ten independent observations from the $U[0, 1]$ distribution. Then, the order statistics are .06, .3, .42, .44, .48, .53, .68, .73, .87, .89. Thus, $X_{(1)} = .06, X_{(n)} = .89$, and because $\frac{n}{2} = 5$, $M_n = X_{(5)} = .48$.

An important connection to understand is the connection order statistics have


with the empirical CDF, a function of immense theoretical and methodological im-
portance in both probability and statistics.
Definition 6.3. Let $X_1, X_2, \ldots, X_n$ be any $n$ real-valued random variables. The empirical CDF of $X_1, X_2, \ldots, X_n$, also called the empirical CDF of the sample, is the function
$$F_n(x) = \frac{\#\{X_i: X_i \le x\}}{n};$$
that is, $F_n(x)$ measures the proportion of sample values that are $\le x$ for a given $x$.

Remark. Therefore, by its definition, $F_n(x) = 0$ whenever $x < X_{(1)}$, and $F_n(x) = 1$ whenever $x \ge X_{(n)}$. It is also a constant, namely $\frac{k}{n}$, for all $x$-values in the interval $[X_{(k)}, X_{(k+1)})$. So $F_n$ satisfies all the properties of being a valid CDF. Indeed, it is the CDF of a discrete distribution, which puts an equal probability of $\frac{1}{n}$ at the sample values $X_1, X_2, \ldots, X_n$. This calls for another definition.
Definition 6.4. Let $P_n$ denote the discrete distribution that assigns probability $\frac{1}{n}$ to each $X_i$. Then, $P_n$ is called the empirical measure of the sample.

Definition 6.5. Let $Q_n(p) = F_n^{-1}(p)$ be the quantile function corresponding to $F_n$. Then, $Q_n = F_n^{-1}$ is called the quantile function of $X_1, X_2, \ldots, X_n$, or the empirical quantile function.

We can now relate the median and the order statistics to the quantile function $F_n^{-1}$.

Proposition. Let $X_1, X_2, \ldots, X_n$ be $n$ random variables. Then,
(a) $X_{(i)} = F_n^{-1}\big(\frac{i}{n}\big)$;
(b) $M_n = F_n^{-1}\big(\frac{1}{2}\big)$.
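The quantities defined above are simple to compute; a sketch assuming NumPy, using the sample of Example 6.1:

import numpy as np

x = np.array([.3, .53, .68, .06, .73, .48, .87, .42, .89, .44])
order_stats = np.sort(x)                        # X_(1), ..., X_(n)
n = len(x)
median = order_stats[int(np.ceil(n / 2)) - 1]   # X_(k), k smallest integer >= n/2
Fn = lambda t: (x <= t).mean()                  # empirical CDF

print(order_stats, median, Fn(0.5))             # median = .48, Fn(.5) = .6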
We now specialize to the case where $X_1, X_2, \ldots, X_n$ are independent random variables with a common density function $f(x)$ and CDF $F(x)$, and work out the fundamental distribution theory of the order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$.

Theorem 6.1 (Joint Density of All the Order Statistics). Let $X_1, X_2, \ldots, X_n$ be independent random variables with a common density function $f(x)$. Then, the joint density function of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ is given by
$$f_{1,2,\ldots,n}(y_1, y_2, \ldots, y_n) = n!\, f(y_1)f(y_2)\cdots f(y_n)\, I_{\{y_1 < y_2 < \cdots < y_n\}}.$$

Proof. A verbal heuristic argument is easy to understand. If $X_{(1)} = y_1, X_{(2)} = y_2, \ldots, X_{(n)} = y_n$, then exactly one of the sample values $X_1, X_2, \ldots, X_n$ is $y_1$, exactly one is $y_2$, and so on; but we can put any of the $n$ observations at $y_1$, any of the other $n-1$ observations at $y_2$, and so on. So the density of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ is $f(y_1) f(y_2) \cdots f(y_n) \times n(n-1)\cdots 1 = n!\, f(y_1) f(y_2) \cdots f(y_n)$, and obviously if the inequality $y_1 < y_2 < \cdots < y_n$ is not satisfied, then at such a point the joint density of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ must be zero.

Here is a formal proof. The multivariate transformation $(X_1, X_2, \ldots, X_n) \to (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ is not one-to-one, as any permutation of a fixed $(X_1, X_2, \ldots, X_n)$ vector has exactly the same set of order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. However, fix a specific permutation $\{\pi(1), \pi(2), \ldots, \pi(n)\}$ of $\{1, 2, \ldots, n\}$ and consider the subset $A_\pi = \{(x_1, x_2, \ldots, x_n) : x_{\pi(1)} < x_{\pi(2)} < \cdots < x_{\pi(n)}\}$. Then the transformation $(x_1, x_2, \ldots, x_n) \to (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ is one-to-one on each such $A_\pi$, and indeed, then $x_{(i)} = x_{\pi(i)}, i = 1, 2, \ldots, n$. The Jacobian of the transformation is 1 in absolute value for each such $A_\pi$. A particular vector $(x_1, x_2, \ldots, x_n)$ falls in exactly one $A_\pi$, and there are $n!$ such regions $A_\pi$, as we exhaust all the $n!$ permutations $\{\pi(1), \pi(2), \ldots, \pi(n)\}$ of $\{1, 2, \ldots, n\}$. By a modification of the Jacobian density theorem, we then get
$$f_{1,2,\ldots,n}(y_1, y_2, \ldots, y_n) = \sum_{\pi} f(x_1) f(x_2) \cdots f(x_n) = \sum_{\pi} f(x_{\pi(1)}) f(x_{\pi(2)}) \cdots f(x_{\pi(n)})$$
$$= \sum_{\pi} f(y_1) f(y_2) \cdots f(y_n) = n!\, f(y_1) f(y_2) \cdots f(y_n). \qquad \square$$

Example 6.2 (Uniform Order Statistics). Let $U_1, U_2, \ldots, U_n$ be independent $U[0,1]$ variables, and $U_{(i)}, 1 \le i \le n$, their order statistics. Then, by our theorem above, the joint density of $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ is
$$f_{1,2,\ldots,n}(u_1, u_2, \ldots, u_n) = n!\, I_{\{0 < u_1 < u_2 < \cdots < u_n < 1\}}.$$
Once we know the joint density of all the order statistics, we can find the marginal density of any subset by simply integrating out the rest of the coordinates, but being extremely careful in using the correct domain over which to integrate the rest of the coordinates. For example, if we want the marginal density of just $U_{(1)}$, that is, of the sample minimum, then we want to integrate out $u_2, \ldots, u_n$, and the correct domain of integration would be, for a given $u_1$ (a value of $U_{(1)}$) in $(0,1)$,
$$u_1 < u_2 < u_3 < \cdots < u_n < 1.$$
So, we integrate down in the order $u_n, u_{n-1}, \ldots, u_2$, to obtain
$$f_1(u_1) = n! \int_{u_1}^{1} \int_{u_2}^{1} \cdots \int_{u_{n-1}}^{1} du_n\, du_{n-1} \cdots du_3\, du_2 = n(1-u_1)^{n-1}, \quad 0 < u_1 < 1.$$

Likewise, if we want the marginal density of just $U_{(n)}$, that is, of the sample maximum, then we want to integrate out $u_1, u_2, \ldots, u_{n-1}$, and now the answer is
$$f_n(u_n) = n! \int_0^{u_n} \int_0^{u_{n-1}} \cdots \int_0^{u_2} du_1\, du_2 \cdots du_{n-1} = n u_n^{n-1}, \quad 0 < u_n < 1.$$

However, it is useful to note that for the special case of the minimum and the maximum, we could have obtained the densities much more easily and directly. Here is why. First take the maximum. Consider its CDF: for $0 < u < 1$,
$$P(U_{(n)} \le u) = P\left(\cap_{i=1}^n \{U_i \le u\}\right) = \prod_{i=1}^n P(U_i \le u) = u^n,$$
and hence the density of $U_{(n)}$ is $f_n(u) = \frac{d}{du}[u^n] = n u^{n-1}, 0 < u < 1$.

Likewise, for the minimum, for $0 < u < 1$, the tail CDF is
$$P(U_{(1)} > u) = P\left(\cap_{i=1}^n \{U_i > u\}\right) = (1-u)^n,$$
and so the density of $U_{(1)}$ is
$$f_1(u) = \frac{d}{du}\left[1 - (1-u)^n\right] = n(1-u)^{n-1}, \quad 0 < u < 1.$$
For a general $r, 1 \le r \le n$, the density of $U_{(r)}$ works out to a Beta density:
$$f_r(u) = \frac{n!}{(r-1)!(n-r)!}\, u^{r-1} (1-u)^{n-r}, \quad 0 < u < 1,$$
which is the $\text{Be}(r, n-r+1)$ density.
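As a quick numerical sanity check (an illustration added here, not from the original text), the following Python sketch simulates uniform order statistics and compares the sample mean and variance of $U_{(r)}$ with the $\text{Be}(r, n-r+1)$ moments $\frac{r}{n+1}$ and $\frac{r(n-r+1)}{(n+1)^2(n+2)}$ implied by the density above.

```python
# Monte Carlo check that U_(r) ~ Be(r, n-r+1): compare simulated mean and
# variance of U_(r) with the Beta moments r/(n+1) and r(n-r+1)/((n+1)^2 (n+2)).
import random

n, r, reps = 15, 5, 200_000
samples = []
for _ in range(reps):
    u = sorted(random.random() for _ in range(n))
    samples.append(u[r - 1])                      # U_(r)

mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps
print(mean, r / (n + 1))                          # both close to 0.3125
print(var, r * (n - r + 1) / ((n + 1) ** 2 * (n + 2)))
```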


As a rule, if the underlying CDF F is symmetric about its median, then the
sample median will also have a density symmetric about the median of F ; see the
exercises. When n is even, one has to be careful about this, because there is no uni-
versal definition of a sample median when n is even. In addition, the density of the
sample maximum will generally be skewed to the right, and that of the sample mini-
mum skewed to the left. For general CDFs, the density of the order statistics usually
will not have a simple formula in terms of elementary functions; but approxima-
tions for large n are often possible. This is treated in a later chapter. Although such
approximations for large n are often available, they may not be very accurate unless
n is very large; see Hall (1979).

Fig. 6.1 Density of minimum, median, and maximum of $U[0,1]$ variables; $n = 15$

Above we have plotted in Fig. 6.1 the density of the minimum, median, and maximum in the $U[0,1]$ case when $n = 15$. The minimum and the maximum clearly have skewed densities, whereas the density of the median is symmetric about .5.

6.2 More Advanced Distribution Theory

Example 6.3 (Density of One and Two Order Statistics). The joint density of any subset of the order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ can be worked out from their joint density, which we derived in the preceding section. The most important case in applications is the joint density of two specific order statistics, say $X_{(r)}$ and $X_{(s)}, 1 \le r < s \le n$, or the density of a specific one, say $X_{(r)}$. A verbal heuristic argument helps in understanding the formula for the joint density of $X_{(r)}$ and $X_{(s)}$, and also the density of a specific one $X_{(r)}$.

First consider the density of just $X_{(r)}$. Fix $u$. To have $X_{(r)} = u$, we must have exactly one observation at $u$, another $r-1$ below $u$, and $n-r$ above $u$. This will suggest that the density of $X_{(r)}$ is
$$f_r(u) = n f(u) \binom{n-1}{r-1} (F(u))^{r-1} (1 - F(u))^{n-r} = \frac{n!}{(r-1)!(n-r)!} (F(u))^{r-1} (1 - F(u))^{n-r} f(u),$$
$-\infty < u < \infty$. This is in fact the correct formula for the density of $X_{(r)}$.
Next, consider the case of the joint density of two order statistics, $X_{(r)}$ and $X_{(s)}$. Fix $u < v$. Then, to have $X_{(r)} = u, X_{(s)} = v$, we must have exactly one observation at $u$, another $r-1$ below $u$, one at $v$, $n-s$ above $v$, and $s-r-1$ between $u$ and $v$. This will suggest that the joint density of $X_{(r)}$ and $X_{(s)}$ is
$$f_{r,s}(u,v) = n f(u) \binom{n-1}{r-1} (F(u))^{r-1} (n-r) f(v) \binom{n-r-1}{n-s} (1 - F(v))^{n-s} (F(v) - F(u))^{s-r-1}$$
$$= \frac{n!}{(r-1)!(n-s)!(s-r-1)!} (F(u))^{r-1} (1 - F(v))^{n-s} (F(v) - F(u))^{s-r-1} f(u) f(v),$$
$-\infty < u < v < \infty$.

Again, this is indeed the joint density of two specific order statistics $X_{(r)}$ and $X_{(s)}$.
The arguments used in this example lead to the following theorem.

Theorem 6.2 (Density of One and Two Order Statistics and Range). Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F(x)$ with density function $f(x)$. Then,
(a) $X_{(n)}$ has the density $f_n(u) = n F^{n-1}(u) f(u), \ -\infty < u < \infty$.
(b) $X_{(1)}$ has the density $f_1(u) = n (1 - F(u))^{n-1} f(u), \ -\infty < u < \infty$.
(c) For a general $r, 1 \le r \le n$, $X_{(r)}$ has the density
$$f_r(u) = \frac{n!}{(r-1)!(n-r)!} F^{r-1}(u) (1 - F(u))^{n-r} f(u), \quad -\infty < u < \infty.$$
(d) For general $1 \le r < s \le n$, $(X_{(r)}, X_{(s)})$ have the joint density
$$f_{r,s}(u,v) = \frac{n!}{(r-1)!(n-s)!(s-r-1)!} (F(u))^{r-1} (1 - F(v))^{n-s} (F(v) - F(u))^{s-r-1} f(u) f(v),$$
$-\infty < u < v < \infty$.
(e) The minimum and the maximum, $X_{(1)}$ and $X_{(n)}$, have the joint density
$$f_{1,n}(u,v) = n(n-1) (F(v) - F(u))^{n-2} f(u) f(v), \quad -\infty < u < v < \infty.$$
(f) (CDF of Range). $W = W_n = X_{(n)} - X_{(1)}$ has the CDF
$$F_W(w) = n \int_{-\infty}^{\infty} [F(x+w) - F(x)]^{n-1} f(x)\, dx.$$
(g) (Density of Range). $W = W_n = X_{(n)} - X_{(1)}$ has the density
$$f_W(w) = n(n-1) \int_{-\infty}^{\infty} [F(x+w) - F(x)]^{n-2} f(x) f(x+w)\, dx.$$

Example 6.4 (Moments of Uniform Order Statistics). The general formulas in the above theorem lead to the following moment formulas in the uniform case. In the $U[0,1]$ case,
$$E(U_{(1)}) = \frac{1}{n+1}, \quad E(U_{(n)}) = \frac{n}{n+1};$$
$$\text{Var}(U_{(1)}) = \text{Var}(U_{(n)}) = \frac{n}{(n+1)^2(n+2)}; \quad 1 - U_{(n)} \stackrel{\mathcal{L}}{=} U_{(1)};$$
$$\text{Cov}(U_{(1)}, U_{(n)}) = \frac{1}{(n+1)^2(n+2)};$$
$$E(W_n) = \frac{n-1}{n+1}, \quad \text{Var}(W_n) = \frac{2(n-1)}{(n+1)^2(n+2)}.$$
For a general order statistic, it follows from the fact that $U_{(r)} \sim \text{Be}(r, n-r+1)$ that
$$E(U_{(r)}) = \frac{r}{n+1}; \quad \text{Var}(U_{(r)}) = \frac{r(n-r+1)}{(n+1)^2(n+2)}.$$
Furthermore, it follows from the formula for their joint density that, for $r < s$,
$$\text{Cov}(U_{(r)}, U_{(s)}) = \frac{r(n-s+1)}{(n+1)^2(n+2)}.$$
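These closed forms are easy to validate numerically. The sketch below (an illustration added here, not from the original text) estimates $\text{Cov}(U_{(r)}, U_{(s)})$ and the mean of the range by simulation and compares them with the formulas above.

```python
# Simulation check of the uniform order-statistic moment formulas:
# Cov(U_(r), U_(s)) = r(n-s+1)/((n+1)^2 (n+2)), E(W_n) = (n-1)/(n+1).
import random

n, r, s, reps = 10, 3, 7, 200_000
ur, us, w = [], [], []
for _ in range(reps):
    u = sorted(random.random() for _ in range(n))
    ur.append(u[r - 1]); us.append(u[s - 1]); w.append(u[-1] - u[0])

mr, ms = sum(ur) / reps, sum(us) / reps
cov = sum(a * b for a, b in zip(ur, us)) / reps - mr * ms
print(cov, r * (n - s + 1) / ((n + 1) ** 2 * (n + 2)))   # both ~ 0.00826
print(sum(w) / reps, (n - 1) / (n + 1))                  # both ~ 0.8182
```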

Example 6.5 (Exponential Order Statistics). A second distribution of importance in the theory of order statistics is the exponential distribution. The mean essentially arises as just a multiplier in the calculations. So, we treat only the standard exponential case.

Let $X_1, X_2, \ldots, X_n$ be independent standard exponential variables. Then, by the general theorem on the joint density of the order statistics, in this case the joint density of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ is
$$f_{1,2,\ldots,n}(u_1, u_2, \ldots, u_n) = n!\, e^{-\sum_{i=1}^n u_i},$$
$0 < u_1 < u_2 < \cdots < u_n < \infty$. Also, in particular, the minimum $X_{(1)}$ has the density
$$f_1(u) = n(1 - F(u))^{n-1} f(u) = n e^{-(n-1)u} e^{-u} = n e^{-nu},$$
$0 < u < \infty$. In other words, we have the quite remarkable result that the minimum of $n$ independent standard exponentials is itself an exponential with mean $\frac{1}{n}$. Also, from the general formula, the maximum $X_{(n)}$ has the density
$$f_n(u) = n(1 - e^{-u})^{n-1} e^{-u} = n \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i e^{-(i+1)u}, \quad 0 < u < \infty.$$
i D0

As a result,
$$E(X_{(n)}) = n \sum_{i=0}^{n-1} (-1)^i \binom{n-1}{i} \frac{1}{(i+1)^2} = \sum_{i=1}^{n} (-1)^{i-1} \binom{n}{i} \frac{1}{i},$$
which also equals $1 + \frac{1}{2} + \cdots + \frac{1}{n}$. We show later in the section on spacings that, by another argument, it also follows that in the standard exponential case $E(X_{(n)}) = 1 + \frac{1}{2} + \cdots + \frac{1}{n}$.
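The identity $n \sum_{i=0}^{n-1} (-1)^i \binom{n-1}{i} (i+1)^{-2} = \sum_{i=1}^{n} \frac{1}{i}$ can be confirmed in exact rational arithmetic; the following sketch (illustrative only, not from the original text) does so with Python's Fraction type.

```python
# Exact check that E(X_(n)) for standard exponentials equals the harmonic
# number H_n, using the alternating-sum formula derived above.
from fractions import Fraction
from math import comb

for n in range(1, 11):
    alt = n * sum(Fraction((-1) ** i * comb(n - 1, i), (i + 1) ** 2)
                  for i in range(n))
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    assert alt == harmonic
print("alternating sum equals H_n for n = 1, ..., 10")
```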
Example 6.6 (Normal Order Statistics). Another clearly important case is that of the order statistics of a normal distribution. Because the general $N(\mu, \sigma^2)$ random variable is a location-scale transformation of a standard normal variable, we have the distributional equivalence that $(X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ have the same joint distribution as $(\mu + \sigma Z_{(1)}, \mu + \sigma Z_{(2)}, \ldots, \mu + \sigma Z_{(n)})$. So, we consider just the standard normal case.

Because of the symmetry of the standard normal distribution around zero, for any $r$, $Z_{(r)}$ has the same distribution as $-Z_{(n-r+1)}$. In particular, $Z_{(1)}$ has the same distribution as $-Z_{(n)}$. From our general formula, the density of $Z_{(n)}$ is
$$f_n(x) = n \Phi^{n-1}(x) \phi(x), \quad -\infty < x < \infty.$$
Again, this is a skewed density. It can be shown, either directly, or by making use of the general theorem on existence of moments of order statistics (see the next section), that every moment, and in particular the mean and the variance, of $Z_{(n)}$ exists. Except for very small $n$, closed-form formulas for the mean or variance are not possible. For small $n$, integration tricks do produce exact formulas. For example,
$$E(Z_{(n)}) = \frac{1}{\sqrt{\pi}}, \text{ if } n = 2; \qquad E(Z_{(n)}) = \frac{3}{2\sqrt{\pi}}, \text{ if } n = 3.$$
Such formulas are available for $n \le 5$; see David (1980).
We tabulate the expected value of the maximum for some values of $n$ to illustrate the slow growth.

    n        E(Z_(n))
    2        .56
    5        1.16
    10       1.54
    20       1.87
    30       2.04
    50       2.25
    100      2.51
    500      3.04
    1000     3.24
    10000    3.85

More about the expected value of $Z_{(n)}$ is discussed later.
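The table entries can be reproduced by one-dimensional numerical integration of $E(Z_{(n)}) = \int_{-\infty}^{\infty} n x \Phi^{n-1}(x) \phi(x)\, dx$. A rough sketch (illustrative only; a crude Riemann sum, using nothing beyond the standard library):

```python
# Numerically compute E(Z_(n)) = \int n x Phi(x)^{n-1} phi(x) dx by a simple
# midpoint Riemann sum over a wide grid; reproduces the table above.
import math

def phi(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def expected_max(n, lo=-10.0, hi=10.0, steps=200_000):
    h = (hi - lo) / steps
    return sum(n * x * Phi(x) ** (n - 1) * phi(x) * h
               for x in (lo + (k + 0.5) * h for k in range(steps)))

for n in (2, 5, 10, 100):
    print(n, round(expected_max(n), 2))    # 0.56, 1.16, 1.54, 2.51
```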



Fig. 6.2 Density of maximum of standard normals; $n = 5, 20, 100$

The density of $Z_{(n)}$ is plotted in Fig. 6.2 for three values of $n$. We can see that the density is shifting to the right, and at the same time getting more peaked. Theoretical asymptotic (i.e., as $n \to \infty$) justifications for these visual findings are possible, and we show some of them in a later chapter.

6.3 Quantile Transformation and Existence of Moments

Uniform order statistics play a very special role in the theory of order statistics, because many problems about order statistics of samples from a general density can be reduced, by a simple and common technique, to the case of uniform order statistics. It is thus especially important to understand and study uniform order statistics. The technique that makes helpful reductions of problems in the general continuous case to the case of a uniform distribution on $[0,1]$ is the quantile transformation. We describe the exact correspondence below.

Theorem 6.3 (Quantile Transformation Theorem). Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F(x)$ on the real line, and let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote their order statistics. Let $F^{-1}(p)$ denote the quantile function of $F$. Let $U_1, U_2, \ldots, U_n$ be independent observations from the $U[0,1]$ distribution, and let $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ denote their order statistics. Also let $g(x)$ be any nondecreasing function and let $Y_i = g(X_i), 1 \le i \le n$, with $Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}$ the order statistics of $Y_1, Y_2, \ldots, Y_n$. Then the following equalities in distribution hold:
(a) $F(X_1) \sim U[0,1]$; that is, $F(X_1)$ and $U_1$ have the same distribution. We write this equality in distribution as $F(X_1) \stackrel{\mathcal{L}}{=} U_1$.
(b) $F^{-1}(U_1) \stackrel{\mathcal{L}}{=} X_1$.
(c) $F(X_{(i)}) \stackrel{\mathcal{L}}{=} U_{(i)}$.
(d) $F^{-1}(U_{(i)}) \stackrel{\mathcal{L}}{=} X_{(i)}$.
(e) $(F(X_{(1)}), F(X_{(2)}), \ldots, F(X_{(n)})) \stackrel{\mathcal{L}}{=} (U_{(1)}, U_{(2)}, \ldots, U_{(n)})$.
(f) $(F^{-1}(U_{(1)}), F^{-1}(U_{(2)}), \ldots, F^{-1}(U_{(n)})) \stackrel{\mathcal{L}}{=} (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$.
(g) $(Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}) \stackrel{\mathcal{L}}{=} (g(F^{-1}(U_{(1)})), g(F^{-1}(U_{(2)})), \ldots, g(F^{-1}(U_{(n)})))$.

Remark. We are already familiar with parts (a) and (b); they are restated here only to provide the context. The parts that we need to focus on are the last two parts. They say that any question about the set of order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ of a sample from a general continuous distribution can be rephrased in terms of the set of order statistics from the $U[0,1]$ distribution. For this, all we need to do is to substitute $F^{-1}(U_{(i)})$ in place of $X_{(i)}$, where $F^{-1}$ is the quantile function of $F$. So, at least in principle, as long as we know how to work skillfully with the joint distribution of the uniform order statistics, we can answer questions about any set of order statistics from a general continuous distribution, because the latter is simply a transformation of the set of order statistics of the uniform. This has proved to be a very useful technique in the theory of order statistics.
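Part (f) is also the standard recipe for simulating order statistics of an arbitrary continuous $F$: sort uniforms and apply $F^{-1}$. A small sketch of this recipe follows (illustrative only; it assumes a standard exponential target, for which $F^{-1}(p) = -\log(1-p)$).

```python
# Simulate order statistics of a standard exponential via part (f) of the
# quantile transformation theorem: X_(i) =_L F^{-1}(U_(i)), F^{-1}(p) = -log(1-p).
import math, random

def exponential_order_stats(n):
    u = sorted(random.random() for _ in range(n))
    return [-math.log(1 - p) for p in u]   # monotone F^{-1} preserves the order

reps, n = 100_000, 10
est = sum(exponential_order_stats(n)[0] for _ in range(reps)) / reps
print(est, 1 / n)    # E(X_(1)) = 1/n = 0.1 for the exponential minimum
```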
As a corollary of part (f) of the above theorem, we have the following connection between order statistics of a general continuous CDF and uniform order statistics.

Corollary. Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics of a sample from a general continuous CDF $F$, and $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ the uniform order statistics. Then, for any $1 \le r_1 < r_2 < \cdots < r_k \le n$,
$$P(X_{(r_1)} \le u_1, \ldots, X_{(r_k)} \le u_k) = P(U_{(r_1)} \le F(u_1), \ldots, U_{(r_k)} \le F(u_k)),$$
$\forall u_1, \ldots, u_k$.
Several important applications of this quantile transformation method are given
below.

Proposition. Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F$. Then, for any $r, s$, $\text{Cov}(X_{(r)}, X_{(s)}) \ge 0$.

Proof. We use the fact that if $g(x_1, x_2, \ldots, x_n)$ and $h(x_1, x_2, \ldots, x_n)$ are two functions that are coordinatewise nondecreasing in each $x_i$, then $\text{Cov}(g(U_{(1)}, \ldots, U_{(n)}), h(U_{(1)}, \ldots, U_{(n)})) \ge 0$. By the quantile transformation theorem, $\text{Cov}(X_{(r)}, X_{(s)}) = \text{Cov}(F^{-1}(U_{(r)}), F^{-1}(U_{(s)})) \ge 0$, as $F^{-1}(U_{(r)})$ is a nondecreasing function of $U_{(r)}$, and $F^{-1}(U_{(s)})$ is a nondecreasing function of $U_{(s)}$, and hence they are also coordinatewise nondecreasing in each order statistic $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$. $\square$

This proposition was first proved in Bickel (1967), but by using a different
method. The next application is to existence of moments of order statistics.

Theorem 6.4 (On the Existence of Moments). Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F$, and let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics. Let $g(x_1, x_2, \ldots, x_n)$ be a general function. Suppose $E[|g(X_1, X_2, \ldots, X_n)|] < \infty$. Then $E[|g(X_{(1)}, X_{(2)}, \ldots, X_{(n)})|] < \infty$.

Proof. By the quantile transformation theorem above,
$$E[|g(X_{(1)}, X_{(2)}, \ldots, X_{(n)})|] = E[|g(F^{-1}(U_{(1)}), F^{-1}(U_{(2)}), \ldots, F^{-1}(U_{(n)}))|]$$
$$= n! \int_{0 < u_1 < u_2 < \cdots < u_n < 1} |g(F^{-1}(u_1), F^{-1}(u_2), \ldots, F^{-1}(u_n))|\, du_1 du_2 \cdots du_n$$
$$\le n! \int_{(0,1)^n} |g(F^{-1}(u_1), F^{-1}(u_2), \ldots, F^{-1}(u_n))|\, du_1 du_2 \cdots du_n$$
$$= n! \int |g(x_1, x_2, \ldots, x_n)| f(x_1) f(x_2) \cdots f(x_n)\, dx_1 dx_2 \cdots dx_n = n!\, E[|g(X_1, X_2, \ldots, X_n)|] < \infty. \qquad \square$$

Corollary. Suppose $F$ is a continuous CDF such that $E_F(|X|^k) < \infty$ for some given $k$. Then $E(|X_{(i)}|^k) < \infty \ \forall i, 1 \le i \le n$. $\square$
Aside from just the existence of the moment, explicit bounds are always useful. Here is a concrete bound (see Reiss (1989)); approximations for moments of order statistics for certain distributions are derived in Hall (1978).

Proposition. (a) $\forall r \le n$, $E(|X_{(r)}|^k) \le \frac{n!}{(r-1)!(n-r)!}\, E_F(|X|^k)$;
(b) $E(|X_{(r)}|^k) < \infty \ \Rightarrow\ |F^{-1}(p)|^k\, p^r (1-p)^{n-r+1} \le C < \infty \ \forall p$;
(c) $|F^{-1}(p)|^k\, p^r (1-p)^{n-r+1} \le C < \infty \ \forall p \ \Rightarrow\ E(|X_{(s)}|^m) < \infty$, if $1 + \frac{mr}{k} \le s \le n - \frac{(n-r+1)m}{k}$.
Example 6.7 (Nonexistence of Every Moment of Every Order Statistic). Consider the continuous CDF $F(x) = 1 - \frac{1}{\log x}, x \ge e$. Setting $1 - \frac{1}{\log x} = p$, we get the quantile function $F^{-1}(p) = e^{\frac{1}{1-p}}$. Fix any $n, k$, and $r \le n$. Consider what happens when $p \to 1$:
$$|F^{-1}(p)|^k\, p^r (1-p)^{n-r+1} = e^{\frac{k}{1-p}}\, p^r (1-p)^{n-r+1} \ge C e^{\frac{k}{1-p}} (1-p)^{n-r+1} = C e^{ky} y^{-(n-r+1)},$$
writing $y$ for $\frac{1}{1-p}$. For any $k > 0$, as $y \to \infty$ (i.e., $p \to 1$), $e^{ky} y^{-(n-r+1)} \to \infty$. Thus the necessary condition of the proposition above is violated, and it follows that for any $r, n, k$, $E(|X_{(r)}|^k) = \infty$.
Remark. The preceding example and the proposition show that existence of moments of order statistics depends on the tail of the underlying CDF (or, equivalently, the tail of the density). If the tail is so thin that the density has a finite mgf in some neighborhood of zero, then all order statistics will have all moments finite. Evaluating them in closed form is generally impossible, however. If the tail of the underlying density is heavy, then existence of moments of the order statistics, and especially of the minimum and the maximum, may be a problem. It is possible for some central order statistics to have a few finite moments, and the minimum or the maximum to have none. In other words, depending on the tail, anything can happen. An especially interesting case is the case of a Cauchy density, notorious for its troublesome tail. The next result describes what happens in that case.
Proposition. Let $X_1, X_2, \ldots, X_n$ be independent $C(\mu, \sigma)$ variables. Then,
(a) $\forall n$, $E(|X_{(n)}|) = E(|X_{(1)}|) = \infty$;
(b) for $n \ge 3$, $E(|X_{(n-1)}|)$ and $E(|X_{(2)}|)$ are finite;
(c) for $n \ge 5$, $E(|X_{(n-2)}|^2)$ and $E(|X_{(3)}|^2)$ are finite;
(d) in general, $E(|X_{(r)}|^k) < \infty$ if and only if $k < \min\{r, n+1-r\}$.
Example 6.8 (Cauchy Order Statistics). From the above proposition we see that the truly problematic order statistics in the Cauchy case are the two extreme ones, the minimum and the maximum. Every other order statistic has a finite expectation for $n \ge 3$, and all but the two most extreme from each tail even have a finite variance, as long as $n \ge 5$. The table below lists the mean of $X_{(n-1)}$ and $X_{(n-2)}$ for some values of $n$.

    n      E(X_(n-1))    E(X_(n-2))
    5      1.17          .08
    10     2.98          1.28
    20     6.26          3.03
    30     9.48          4.67
    50     15.87         7.90
    100    31.81         15.88
    250    79.56         39.78
    500    159.15        79.57

Example 6.9 (Mode of Cauchy Sample Maximum). Although the sample maximum $X_{(n)}$ never has a finite expectation in the Cauchy case, it always has a unimodal density (see a general result in the exercises). So it is interesting to see what the modal values are for various $n$. The table below lists the mode of $X_{(n)}$ for some values of $n$.

    n      Mode of X_(n)
    5      .87
    10     1.72
    20     3.33
    30     4.93
    50     8.12
    100    16.07
    250    39.98
    500    79.76

By comparing the entries in this table with the previous table, we see that the mode of $X_{(n)}$ is quite close to the mean of $X_{(n-2)}$. It would be interesting to find a theoretical result in this regard.

6.4 Spacings

Another set of statistics helpful in understanding the distribution of probability are the spacings, which are the gaps between successive order statistics. They are useful in discerning tail behavior. At the same time, for some particular underlying distributions, their mathematical properties are extraordinarily structured, and in turn lead to results for other distributions. Two instances are the spacings of uniform and exponential order statistics. Some basic facts about spacings are discussed in this section.

Definition 6.6. Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics of a sample of $n$ observations $X_1, X_2, \ldots, X_n$. Then $W_i = X_{(i+1)} - X_{(i)}, 1 \le i \le n-1$, are called the spacings of the sample, or the spacings of the order statistics.

6.4.1 Exponential Spacings and Rényi's Representation

The spacings of an exponential sample have the characteristic property that the spacings are all independent exponentials as well. Here is the precise result.

Theorem 6.5. Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics from an $\text{Exp}(\lambda)$ distribution. Then $W_0 = X_{(1)}, W_1, \ldots, W_{n-1}$ are independent, with $W_i \sim \text{Exp}\left(\frac{\lambda}{n-i}\right), i = 0, 1, \ldots, n-1$.

Proof. The proof follows on transforming to the set of spacings from the set of order statistics, and by applying the Jacobian density theorem. The transformation $(u_1, u_2, \ldots, u_n) \to (w_0, w_1, \ldots, w_{n-1})$, where $w_0 = u_1, w_1 = u_2 - u_1, \ldots, w_{n-1} = u_n - u_{n-1}$, is one-to-one, with the inverse transformation $u_1 = w_0, u_2 = w_0 + w_1, u_3 = w_0 + w_1 + w_2, \ldots, u_n = w_0 + w_1 + \cdots + w_{n-1}$. The Jacobian matrix is triangular, and has determinant one. From our general theorem, the order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ have the joint density
$$f_{1,2,\ldots,n}(u_1, u_2, \ldots, u_n) = n!\, f(u_1) f(u_2) \cdots f(u_n)\, I_{\{0 < u_1 < u_2 < \cdots < u_n < \infty\}}.$$
Therefore, the spacings have the joint density
$$g_{0,1,\ldots,n-1}(w_0, w_1, \ldots, w_{n-1}) = n!\, f(w_0) f(w_0 + w_1) \cdots f(w_0 + w_1 + \cdots + w_{n-1})\, I_{\{w_i > 0\ \forall i\}}.$$
This is completely general for any underlying nonnegative variable. Specializing to the standard exponential case, we get
$$g_{0,1,\ldots,n-1}(w_0, w_1, \ldots, w_{n-1}) = n!\, e^{-\sum_{j=0}^{n-1} \sum_{i=0}^{j} w_i}\, I_{\{w_i > 0\ \forall i\}} = n!\, e^{-n w_0 - (n-1) w_1 - \cdots - w_{n-1}}\, I_{\{w_i > 0\ \forall i\}}.$$
It therefore follows that $W_0, W_1, \ldots, W_{n-1}$ are independent, and also that $W_i \sim \text{Exp}\left(\frac{1}{n-i}\right)$. The case of a general $\lambda$ follows from the standard exponential case. $\square$
Corollary (Rényi). The joint distribution of the order statistics of an exponential distribution with mean $\lambda$ has the representation
$$\left(X_{(r)}\right)\Big|_{r=1}^{n} \stackrel{\mathcal{L}}{=} \left(\sum_{i=1}^{r} \frac{X_i}{n-i+1}\right)\Bigg|_{r=1}^{n},$$
where $X_1, \ldots, X_n$ are themselves independent exponentials with mean $\lambda$. $\square$
Remark. Verbally, the order statistics of an exponential distribution are linear combinations of independent exponentials, with a very special sequence of coefficients. In an obvious way, the representation can be extended to the order statistics of a general continuous CDF by simply using the quantile transformation.
Example 6.10 (Moments and Correlations of Exponential Order Statistics). From the representation in the above corollary, we immediately have
$$E(X_{(r)}) = \lambda \sum_{i=1}^{r} \frac{1}{n-i+1}; \quad \text{Var}(X_{(r)}) = \lambda^2 \sum_{i=1}^{r} \frac{1}{(n-i+1)^2}.$$
Furthermore, by using the same representation, for $1 \le r < s \le n$, $\text{Cov}(X_{(r)}, X_{(s)}) = \lambda^2 \sum_{i=1}^{r} \frac{1}{(n-i+1)^2}$, and therefore the correlation
$$\rho_{X_{(r)}, X_{(s)}} = \sqrt{\frac{\sum_{i=1}^{r} \frac{1}{(n-i+1)^2}}{\sum_{i=1}^{s} \frac{1}{(n-i+1)^2}}}.$$
In particular,
$$\rho_{X_{(1)}, X_{(n)}} = \frac{\frac{1}{n}}{\sqrt{\sum_{i=1}^{n} \frac{1}{i^2}}} \approx \frac{\sqrt{6}}{n\pi}$$
for large $n$. In particular, $\rho_{X_{(1)}, X_{(n)}} \to 0$ as $n \to \infty$. In fact, in large samples, the minimum and the maximum are in general approximately independent.
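The approximation $\rho_{X_{(1)}, X_{(n)}} \approx \sqrt{6}/(n\pi)$ is easy to check against the exact sum; a small sketch follows (illustrative only, not from the original text).

```python
# Compare the exact correlation of (X_(1), X_(n)) for exponential samples,
# rho = (1/n) / sqrt(sum_{i=1}^n 1/i^2), with the approximation sqrt(6)/(n pi).
import math

for n in (10, 100, 1000):
    exact = (1 / n) / math.sqrt(sum(1 / i**2 for i in range(1, n + 1)))
    approx = math.sqrt(6) / (n * math.pi)
    print(n, round(exact, 6), round(approx, 6))   # agreement improves with n
```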

6.4.2 Uniform Spacings

The results on exponential spacings lead to some highly useful and neat representations for the spacings and the order statistics of a uniform distribution. The next result describes the most important properties of uniform spacings and order statistics. Numerous other properties of a more special nature are known. David (1980) and Reiss (1989) are the best references for such additional properties of the uniform order statistics.

Theorem 6.6. Let $U_1, U_2, \ldots, U_n$ be independent $U[0,1]$ variables, and $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ the order statistics. Let $W_0 = U_{(1)}$, $W_i = U_{(i+1)} - U_{(i)}, 1 \le i \le n-1$, and $V_i = \frac{U_{(i)}}{U_{(i+1)}}, 1 \le i \le n-1$, $V_n = U_{(n)}$. Let also $X_1, X_2, \ldots, X_{n+1}$ be $(n+1)$ independent standard exponentials, independent of the $U_i, i = 1, 2, \ldots, n$. Then,
(a) $V_1, V_2^2, \ldots, V_{n-1}^{n-1}, V_n^n$ are independent $U[0,1]$ variables, and $(V_1, V_2, \ldots, V_{n-1})$ are independent of $V_n$.
(b) $(W_0, W_1, \ldots, W_{n-1}) \sim \mathcal{D}(\alpha)$, a Dirichlet distribution with parameter vector $\alpha_{(n+1) \times 1} = (1, 1, \ldots, 1)$. That is, $(W_0, W_1, \ldots, W_{n-1})$ is uniformly distributed in the $n$-dimensional simplex.
(c) $(W_0, W_1, \ldots, W_{n-1}) \stackrel{\mathcal{L}}{=} \left(\frac{X_1}{\sum_{j=1}^{n+1} X_j}, \frac{X_2}{\sum_{j=1}^{n+1} X_j}, \ldots, \frac{X_n}{\sum_{j=1}^{n+1} X_j}\right)$.
(d) $(U_{(1)}, U_{(2)}, \ldots, U_{(n)}) \stackrel{\mathcal{L}}{=} \left(\frac{X_1}{\sum_{j=1}^{n+1} X_j}, \frac{X_1 + X_2}{\sum_{j=1}^{n+1} X_j}, \ldots, \frac{X_1 + X_2 + \cdots + X_n}{\sum_{j=1}^{n+1} X_j}\right)$.

Proof. For part (a), use the fact that the negative of the logarithm of a $U[0,1]$ variable is standard exponential, and then use the result that the exponential spacings are themselves independent exponentials. That $V_1, V_2^2, \ldots, V_{n-1}^{n-1}$ are also uniformly distributed follows from looking at the joint density of $U_{(i)}, U_{(i+1)}$ for any given $i$. It follows trivially from the density of $V_n$ that $V_n^n \sim U[0,1]$.

For parts (b) and (c), first consider the joint density of the uniform order statistics, and then transform to the variables $W_i, i = 0, \ldots, n-1$. This is a one-to-one transformation, and so we can apply the Jacobian density theorem. The Jacobian theorem easily gives the joint density of the $W_i, i = 0, \ldots, n-1$, and we simply recognize it to be the density of a Dirichlet with the parameter vector having each coordinate equal to one. Finally, use the representation of a Dirichlet random vector in the form of ratios of Gammas (see Chapter 4). $\square$

Part (d) is just a restatement of part (c).

Remark. Part (d) of this theorem, representing uniform order statistics in terms of independent exponentials, is one of the most useful results in the theory of order statistics.
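In particular, part (d) gives a one-pass way to generate the entire vector of uniform order statistics without sorting. A brief sketch (illustrative only, not from the original text):

```python
# Generate (U_(1), ..., U_(n)) via part (d): partial sums of n+1 standard
# exponentials, divided by the total. No sorting is required.
import math, random
from itertools import accumulate

def uniform_order_stats(n):
    x = [-math.log(1.0 - random.random()) for _ in range(n + 1)]  # std exponentials
    total = sum(x)
    return [s / total for s in accumulate(x[:n])]

u = uniform_order_stats(5)
print(u)                                       # increasing, all in (0, 1)
print(all(a < b for a, b in zip(u, u[1:])))    # True
```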

6.5 Conditional Distributions and Markov Property

The conditional distributions of a subset of the order statistics given another subset satisfy some really structured properties. An illustration of such a result is that if we know that the sample maximum $X_{(n)} = x$, then the rest of the order statistics would act like the order statistics of a sample of size $n-1$ from the original CDF, but truncated on the right at that specific value $x$. Another prominent property of the conditional distributions is the Markov property. Again, a lot is known about the conditional distributions of order statistics, but we present the most significant and easy to state results. The best references for reading more about the conditional distributions are still David (1980) and Reiss (1989). Each of the following theorems follows on straightforward calculations, and we omit the calculations.

Theorem 6.7. Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F$ with density $f$. Fix $1 \le i < j \le n$. Then the conditional distribution of $X_{(i)}$ given $X_{(j)} = x$ is the same as the unconditional distribution of the $i$th order statistic in a sample of size $j-1$ from a new distribution, namely the original $F$ truncated at the right at $x$. In notation,
$$f_{X_{(i)} \mid X_{(j)} = x}(u) = \frac{(j-1)!}{(i-1)!(j-1-i)!} \left(\frac{F(u)}{F(x)}\right)^{i-1} \left(1 - \frac{F(u)}{F(x)}\right)^{j-1-i} \frac{f(u)}{F(x)}, \quad u < x.$$

Theorem 6.8. Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F$ with density $f$. Fix $1 \le i < j \le n$. Then the conditional distribution of $X_{(j)}$ given $X_{(i)} = x$ is the same as the unconditional distribution of the $(j-i)$th order statistic in a sample of size $n-i$ from a new distribution, namely the original $F$ truncated at the left at $x$. In notation,
$$f_{X_{(j)} \mid X_{(i)} = x}(u) = \frac{(n-i)!}{(j-i-1)!(n-j)!} \left(\frac{F(u) - F(x)}{1 - F(x)}\right)^{j-i-1} \left(\frac{1 - F(u)}{1 - F(x)}\right)^{n-j} \frac{f(u)}{1 - F(x)}, \quad u > x.$$

Theorem 6.9 (The Markov Property). Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F$ with density $f$. Fix $1 \le i < j \le n$. Then the conditional distribution of $X_{(j)}$ given $X_{(1)} = x_1, X_{(2)} = x_2, \ldots, X_{(i)} = x_i$ is the same as the conditional distribution of $X_{(j)}$ given $X_{(i)} = x_i$. That is, given $X_{(i)}$, $X_{(j)}$ is independent of $X_{(1)}, X_{(2)}, \ldots, X_{(i-1)}$.

Theorem 6.10. Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F$ with density $f$. Then the conditional distribution of $X_{(1)}, X_{(2)}, \ldots, X_{(n-1)}$ given $X_{(n)} = x$ is the same as the unconditional distribution of the order statistics in a sample of size $n-1$ from a new distribution, namely the original $F$ truncated at the right at $x$. In notation,
$$f_{X_{(1)}, \ldots, X_{(n-1)} \mid X_{(n)} = x}(u_1, \ldots, u_{n-1}) = (n-1)! \prod_{i=1}^{n-1} \frac{f(u_i)}{F(x)}, \quad u_1 < \cdots < u_{n-1} < x.$$

Remark. A similar and transparent result holds about the conditional distribution of $X_{(2)}, X_{(3)}, \ldots, X_{(n)}$ given $X_{(1)} = x$.

Theorem 6.11. Let $X_1, X_2, \ldots, X_n$ be independent observations from a continuous CDF $F$ with density $f$. Then the conditional distribution of $X_{(2)}, \ldots, X_{(n-1)}$ given $X_{(1)} = x, X_{(n)} = y$ is the same as the unconditional distribution of the order statistics in a sample of size $n-2$ from a new distribution, namely the original $F$ truncated at the left at $x$ and at the right at $y$. In notation,
$$f_{X_{(2)}, \ldots, X_{(n-1)} \mid X_{(1)} = x, X_{(n)} = y}(u_2, \ldots, u_{n-1}) = (n-2)! \prod_{i=2}^{n-1} \frac{f(u_i)}{F(y) - F(x)}, \quad x < u_2 < \cdots < u_{n-1} < y.$$

Example 6.11 (Mean Given the Maximum). Suppose $X_1, X_2, \ldots, X_n$ are independent $U[0,1]$ variables. We want to find $E(\bar{X} \mid X_{(n)} = x)$. We use the theorem above about the conditional distribution of $X_{(1)}, X_{(2)}, \ldots, X_{(n-1)}$ given $X_{(n)} = x$:
$$E(n\bar{X} \mid X_{(n)} = x) = E\left(\sum_{i=1}^{n} X_i \,\Big|\, X_{(n)} = x\right) = E\left(\sum_{i=1}^{n} X_{(i)} \,\Big|\, X_{(n)} = x\right)$$
$$= x + E\left(\sum_{i=1}^{n-1} X_{(i)} \,\Big|\, X_{(n)} = x\right) = x + \sum_{i=1}^{n-1} \frac{ix}{n},$$
because, given $X_{(n)} = x$, $X_{(1)}, X_{(2)}, \ldots, X_{(n-1)}$ act like the order statistics of a sample of size $n-1$ from the $U[0,x]$ distribution. Now, summing the series, we get
$$E(n\bar{X} \mid X_{(n)} = x) = x + \frac{(n-1)x}{2} = \frac{n+1}{2}x \ \Rightarrow\ E(\bar{X} \mid X_{(n)} = x) = \frac{n+1}{2n}x.$$
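A two-line simulation confirms the formula $E(\bar{X} \mid X_{(n)} = x) = \frac{n+1}{2n}x$. Given the maximum $x$, the other $n-1$ observations are iid $U[0,x]$, which is exactly what the sketch below exploits (illustrative only, not from the original text).

```python
# Check E(Xbar | X_(n) = x) = (n+1)x/(2n) for uniforms: given the maximum x,
# the remaining n-1 observations behave as iid U[0, x].
import random

n, x, reps = 10, 0.8, 200_000
total = 0.0
for _ in range(reps):
    rest = [random.uniform(0, x) for _ in range(n - 1)]
    total += (x + sum(rest)) / n
print(total / reps, (n + 1) * x / (2 * n))   # both ~ 0.44
```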

Example 6.12 (Maximum Given the First Half). Suppose $X_1, X_2, \ldots, X_{2n}$ are independent standard exponentials. We want to find $E(X_{(2n)} \mid X_{(1)} = x_1, \ldots, X_{(n)} = x_n)$. By the theorem on the Markov property, this conditional expectation equals $E(X_{(2n)} \mid X_{(n)} = x_n)$. Now, we further use the representation that
$$(X_{(n)}, X_{(2n)}) \stackrel{\mathcal{L}}{=} \left(\sum_{i=1}^{n} \frac{X_i}{2n-i+1}, \ \sum_{i=1}^{2n} \frac{X_i}{2n-i+1}\right).$$
Therefore,
$$E(X_{(2n)} \mid X_{(n)} = x_n) = E\left(\sum_{i=1}^{n} \frac{X_i}{2n-i+1} + \sum_{i=n+1}^{2n} \frac{X_i}{2n-i+1} \,\Bigg|\, \sum_{i=1}^{n} \frac{X_i}{2n-i+1} = x_n\right)$$
$$= x_n + E\left(\sum_{i=n+1}^{2n} \frac{X_i}{2n-i+1} \,\Bigg|\, \sum_{i=1}^{n} \frac{X_i}{2n-i+1} = x_n\right) = x_n + E\left(\sum_{i=n+1}^{2n} \frac{X_i}{2n-i+1}\right),$$
because the $X_i$ are all independent,
$$= x_n + \sum_{i=n+1}^{2n} \frac{1}{2n-i+1}.$$
For example, in a sample of size 4, $E(X_{(4)} \mid X_{(1)} = x, X_{(2)} = y) = E(X_{(4)} \mid X_{(2)} = y) = y + \sum_{i=3}^{4} \frac{1}{5-i} = y + \frac{3}{2}$.

6.6 Some Applications

Order statistics and the related theory have many interesting and important applications in statistics, in modeling of empirical phenomena, for example climate characteristics, and in probability theory itself. We touch on a small number of applications in this section for purposes of reference. For further reading on the vast literature on applications of order statistics, we recommend, among numerous possibilities, Lehmann (1975), Shorack and Wellner (1986), David (1980), Reiss (1989), Martynov (1992), Galambos (1987), Falk et al. (1994), Coles (2001), Embrechts et al. (2008), and DasGupta (2008).

6.6.1 * Records

Record values and their timings are used for the purposes of tracking changes in some process, such as temperature, and for preparation for extremal events, such as protection against floods. They are also interesting in their own right. Let $X_1, X_2, \ldots$ be an infinite sequence of independent observations from a continuous CDF $F$. We first give some essential definitions.

Definition 6.7. We say that a record occurs at time $i$ if $X_i > X_j \ \forall j < i$. By convention, we say that $X_1$ is a record value, and $i = 1$ is a record time. Let $Z_i$ be the indicator of the event that a record occurs at time $i$. The sequence $T_1, T_2, \ldots$ defined as $T_1 = 1$, $T_j = \min\{i > T_{j-1} : Z_i = 1\}$ is called the sequence of record times. The differences $D_{i+1} = T_{i+1} - T_i$ are called the interarrival times. The sequence $X_{T_1}, X_{T_2}, \ldots$ is called the sequence of record values.

Example 6.13. The values 1.46, .28, 2.20, .72, 2.33, .67, .42, .85, .66, .67, 1.54, .76, 1.22, 1.72, .33 are 15 simulated values from a standard exponential distribution. The record values are 1.46, 2.20, 2.33, and the record times are $T_1 = 1, T_2 = 3, T_3 = 5$. Thus, there are three records at time $n = 15$. We notice that no records were observed after the fifth observation in the sequence. In fact, in general, it becomes increasingly more difficult to obtain a record as time passes; justification for this statement is shown in the following theorem.
The following theorem summarizes a number of key results about record values, times, and number of records; this theorem is a superb example of the power of the quantile transformation method, because the results for a general continuous CDF $F$ can be obtained from the $U[0,1]$ case by making a quantile transformation. The details are worked out in Port (1993, pp. 502–509).

Theorem 6.12. Let $X_1, X_2, \ldots$ be an infinite sequence of independent observations from a CDF $F$, and assume that $F$ has the density $f$. Then,
(a) The sequence $Z_1, Z_2, \ldots$ is an infinite sequence of independent Bernoulli variables, with $E(Z_i) = P(Z_i = 1) = \frac{1}{i}, i \ge 1$.
(b) Let $N = N_n = \sum_{i=1}^{n} Z_i$ be the number of records at time $n$. Then,
$$E(N) = \sum_{i=1}^{n} \frac{1}{i}; \quad \text{Var}(N) = \sum_{i=1}^{n} \frac{i-1}{i^2}.$$
(c) Fix $r \ge 2$. Then $D_r$ has the pmf
$$P(D_r = k) = \sum_{i=0}^{k-1} (-1)^i \binom{k-1}{i} (i+2)^{-r}, \quad k \ge 1.$$
(d) The $r$th record value $X_{T_r}$ has the density
$$f_r(x) = \frac{[-\log(1 - F(x))]^{r-1}}{(r-1)!} f(x), \quad -\infty < x < \infty.$$
(e) The first $n$ record values, $X_{T_1}, X_{T_2}, \ldots, X_{T_n}$, have the joint density
$$f_{12\cdots n}(x_1, x_2, \ldots, x_n) = f(x_n) \prod_{i=1}^{n-1} \frac{f(x_i)}{1 - F(x_i)}\, I_{\{x_1 < x_2 < \cdots < x_n\}}.$$
(f) Fix a sequence of reals $t_1 < t_2 < t_3 < \cdots < t_k$, and let, for any given real $t$, $M(t)$ be the total number of record values that are $\le t$:
$$M(t) = \#\{i : X_i \le t \text{ and } X_i \text{ is a record value}\}.$$
Then $M(t_i) - M(t_{i-1}), 2 \le i \le k$, are independently distributed, and
$$M(t_i) - M(t_{i-1}) \sim \text{Poi}\left(\log \frac{1 - F(t_{i-1})}{1 - F(t_i)}\right).$$

Remark. From part (a) of the theorem, we learn that if indeed the sequence of observations keeps coming from the same CDF, then obtaining a record becomes harder as time passes: $P(Z_i = 1) \to 0$. We learn from part (b) that both the mean and the variance of the number of records observed until time $n$ are of the order of $\log n$. The number of records observed until time $n$ is well approximated by a Poisson distribution with mean $\log n$, or a normal distribution with mean and variance equal to $\log n$. We learn from parts (c) and (d) that the interarrival times of the record values do not depend on $F$, but the magnitudes of the record values do. Part (f) is another example of the Poisson distribution providing an approximation in an interesting problem. It is interesting to note the connection between part (b) and part (f). In part (f), if we take $t = F^{-1}(1 - \frac{1}{n})$, then heuristically, $N_n$, the number of records observed up to time $n$, satisfies $N_n \approx M(X_{(n)}) \approx M(F^{-1}(1 - \frac{1}{n})) \sim \text{Poi}(-\log(1 - F(F^{-1}(1 - \frac{1}{n})))) = \text{Poi}(\log n)$, which is what we mentioned in the paragraph above.
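A quick simulation (illustrative only, not from the original text) makes parts (a) and (b) concrete: the number of records among $n$ iid observations concentrates around $\sum_{i=1}^{n} \frac{1}{i} \approx \log n$, regardless of the continuous $F$ used.

```python
# Simulate record counts among n iid observations; the average count should
# be close to the harmonic number H_n = sum_{i=1}^n 1/i ~ log(n).
import random

n, reps = 1000, 2000
counts = []
for _ in range(reps):
    best, count = float("-inf"), 0
    for _ in range(n):
        x = random.random()       # any continuous F gives the same count law
        if x > best:
            best, count = x, count + 1
    counts.append(count)

print(sum(counts) / reps)                      # ~ 7.49
print(sum(1 / i for i in range(1, n + 1)))     # H_1000 = 7.485...
```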

Example 6.14 (Density of Record Values and Times). It is instructive to look at the effect of the tail of the underlying CDF $F$ on the magnitude of the record values. Figure 6.3 gives the density of the third record value for three choices of $F$: $F = N(0,1)$, double exponential (0,1), and $C(0,1)$. Although the modal values are not very different, the effect of the tail of $F$ on the tail of the record density is clear. In particular, for the standard Cauchy case, record values do not have a finite expectation.

Fig. 6.3 Density of the third record value for, top to bottom, $N(0,1)$, double exp (0,1), C(0,1) case

Fig. 6.4 PMF of interarrival time between second and third record

Next, consider the distribution of the gap between the arrival times of the second and the third record. Note the long right tail, akin to an exponential density, in Fig. 6.4.

6.6.2 The Empirical CDF

The empirical CDF $F_n(x)$, defined in Section 6.1, is a tool of tremendous importance in statistics and probability. The reason for its effectiveness as a tool is that if sample observations arise from some CDF $F$, then the empirical CDF $F_n$ will be very close to $F$ for large $n$. So, we can get a very good idea of what the true $F$ is by looking at $F_n$. Furthermore, because $F_n \approx F$, it can be expected that if $T(F)$ is a nice functional of $F$, then the empirical version $T(F_n)$ would be close to $T(F)$. Perhaps the simplest example of this is the mean $T(F) = E_F(X)$. The empirical version then is $T(F_n) = E_{F_n}(X) = \frac{\sum_{i=1}^{n} X_i}{n}$, because $F_n$ assigns the equal probability $\frac{1}{n}$ to just the observation values $X_1, \ldots, X_n$. This means that the mean of the sample values should be close to the expected value under the true $F$. And this is indeed true under simple conditions, and we have already seen some evidence for it in the form of the central limit theorem. We provide some basic properties and applications of the empirical CDF in this section.

Theorem 6.13. Let $X_1, X_2, \ldots$ be independent observations from a CDF $F$. Then,
(a) For any fixed $x$, $n F_n(x) \sim \text{Bin}(n, F(x))$.
(b) (DKW Inequality). Let $\Delta_n = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$. Then, for all $n$, all $\epsilon > 0$, and all $F$,
$$P(\Delta_n > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
(c) Assume that $F$ is continuous. For any given $n$ and $\alpha, 0 < \alpha < 1$, there exist positive constants $D_n$, independent of $F$, such that whatever be $F$,
$$P(\forall x \in \mathbb{R},\ F_n(x) - D_n \le F(x) \le F_n(x) + D_n) \ge 1 - \alpha.$$

Remark. Part (b), the DKW inequality, was first proved in Dvoretzky et al. (1956), but in a weaker form. The inequality stated here is proved in Massart (1990). Furthermore, the constant 2 in the inequality is the best possible choice of the constant; that is, the inequality is false with any other constant $C < 2$. The inequality says that, uniformly in $x$, for large $n$, the empirical CDF is arbitrarily close to the true CDF with a very high probability, and the probability of the contrary is sub-Gaussian. We show more precise consequences of this in a later chapter. Part (c) is important for statisticians, as we show in our next example.

Example 6.15 (Confidence Band for a Continuous CDF). This example is another important application of the quantile transformation method. Imagine a hypothetical sequence of independent $U[0,1]$ variables, $U_1, U_2, \ldots$, and let $G_n$ denote the empirical CDF of this sequence of uniform random variables; that is,
$$G_n(t) = \frac{\#\{i : U_i \le t\}}{n}.$$
By the quantile transformation,
$$\Delta_n = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \stackrel{\mathcal{L}}{=} \sup_{x \in \mathbb{R}} |G_n(F(x)) - F(x)| = \sup_{0 < t < 1} |G_n(t) - t|,$$
which shows that as long as $F$ is a continuous CDF, so that the quantile transformation can be applied, for any $n$, the distribution of $\Delta_n$ is the same for all $F$. This common distribution is just the distribution of $\sup_{0<t<1} |G_n(t) - t|$. Consequently, if $D_n$ is such that $P(\sup_{0<t<1} |G_n(t) - t| > D_n) \le \alpha$, then $D_n$ also satisfies (the apparently stronger statement)
$$P(\forall x \in \mathbb{R},\ F_n(x) - D_n \le F(x) \le F_n(x) + D_n) \ge 1 - \alpha.$$
The probability statement above provides the assurance that with probability $1-\alpha$ or more, the true CDF $F(x)$, as a function, is caught between the pair of functions $F_n(x) \pm D_n$. As a consequence, the band $F_n(x) - D_n \le F(x) \le F_n(x) + D_n, x \in \mathbb{R}$, is called a $100(1-\alpha)\%$ confidence band for $F$. This is of great use in statistics, because statisticians often consider the true CDF $F$ to be unknown.

The constants $D_n$ have been computed and tabulated for small and moderate $n$. We tabulate the values of $D_n$ for some selected $n$ for easy reference and use.

    n      95th Percentile    99th Percentile
    20     .294               .352
    21     .287               .344
    22     .281               .337
    23     .275               .330
    24     .269               .323
    25     .264               .317
    26     .259               .311
    27     .254               .305
    28     .250               .300
    29     .246               .295
    30     .242               .290
    35     .224               .269
    40     .210               .252
    >40    1.36/sqrt(n)       1.63/sqrt(n)
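For $n > 40$ the tabulated values agree with what the DKW inequality itself suggests: solving $2e^{-2nD_n^2} = \alpha$ gives $D_n = \sqrt{\log(2/\alpha)/(2n)}$, and $\sqrt{\log 40 / 2} \approx 1.358$. A tiny sketch (illustrative only; the DKW-based band is slightly conservative for small $n$):

```python
# DKW-based half-width of a 100(1-alpha)% confidence band for F:
# D_n = sqrt(log(2/alpha) / (2n)); compare with the table's 1.36/sqrt(n).
import math

def dkw_band_halfwidth(n, alpha=0.05):
    return math.sqrt(math.log(2 / alpha) / (2 * n))

for n in (50, 100, 1000):
    print(n, round(dkw_band_halfwidth(n), 4), round(1.36 / math.sqrt(n), 4))
```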

6.7 * Distribution of the Multinomial Maximum

The maximum cell frequency in a multinomial distribution is of current interest in several areas of probability and statistics. It is of wide interest in cluster detection, data mining, goodness of fit, and in occupancy problems in probability. It also arises in sequential clinical trials. It turns out that the technique of Poissonization can be used to establish, in principle, the exact distribution of the multinomial maximum cell frequency. This can be of substantial practical use. Precisely, if $N \sim \text{Poisson}(\lambda)$, and given $N = n$, $(f_1, f_2, \ldots, f_k)$ has a multinomial distribution with parameters $(n, p_1, p_2, \ldots, p_k)$, then unconditionally $f_1, f_2, \ldots, f_k$ are independent and $f_i \sim \text{Poisson}(\lambda p_i)$. It follows that with any given fixed value $n$, and any given fixed set $A$ in the $k$-dimensional Euclidean space $\mathbb{R}^k$, the multinomial probability that $(f_1, f_2, \ldots, f_k)$ belongs to $A$ equals $n!\, c(n)$, with $c(n)$ being the coefficient of $\lambda^n$ in the power series expansion of $e^{\lambda} P((X_1, X_2, \ldots, X_k) \in A)$, where now the $X_i$ are independent $\text{Poisson}(\lambda p_i)$. In the equiprobable case (i.e., when the $p_i$ are all equal to $\frac{1}{k}$), this leads to the equality that
$$P(\max\{f_1, f_2, \ldots, f_k\} \le x) = \frac{n!}{k^n} \times \text{the coefficient of } \lambda^n \text{ in } \left(\sum_{j=0}^{x} \frac{\lambda^j}{j!}\right)^k;$$
see Chapter 2. As a result, we can compute $P(\max\{f_1, f_2, \ldots, f_k\} \le x)$ exactly whenever we can compute the coefficient of $\lambda^n$ in the expansion of $\left(\sum_{j=0}^{x} \frac{\lambda^j}{j!}\right)^k$. This is possible to do by using symbolic software; see Ethier (1982) and DasGupta (2009).
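Since only a single polynomial coefficient is needed, exact rational arithmetic also suffices; no symbolic algebra system is required. The sketch below (an illustration added here, not the author's code) extracts the coefficient of $\lambda^n$ in $(\sum_{j=0}^{x} \lambda^j/j!)^k$ by repeated polynomial multiplication, and reproduces the entry $P(\max \ge 10) = .1176$ for $n = 30, k = 6$ in the table of Example 6.16 below.

```python
# Exact P(max{f_1,...,f_k} <= x) for an equiprobable multinomial via
# Poissonization: (n!/k^n) * [coefficient of lambda^n in (sum_{j<=x} lambda^j/j!)^k].
from fractions import Fraction
from math import factorial

def p_max_le(n, k, x):
    """Exact P(max cell count <= x) for Multinomial(n; 1/k, ..., 1/k)."""
    base = [Fraction(1, factorial(j)) for j in range(x + 1)]  # truncated exp series
    poly = [Fraction(1)]
    for _ in range(k):                   # multiply the truncated series k times
        new = [Fraction(0)] * min(len(poly) + x, n + 1)
        for i, a in enumerate(poly):
            for j, b in enumerate(base):
                if i + j <= n:
                    new[i + j] += a * b
        poly = new
    coeff = poly[n] if n < len(poly) else Fraction(0)
    return coeff * factorial(n) / Fraction(k) ** n

print(float(1 - p_max_le(30, 6, 9)))    # 0.1176..., matching the table below
```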

Example 6.16 (Maximum Frequency in Die Throws). Suppose a fair six-sided die is rolled 30 times. Should we be surprised if one of the six faces appears 10 times? The usual probability calculation to quantify the surprise is to calculate $P(\max\{f_1, f_2, \ldots, f_6\} \ge 10)$, namely the P-value, where $f_1, f_2, \ldots, f_6$ are the frequencies of the six faces in the 30 rolls. Because of our Poissonization result, we can compute this probability. From the table of exact probabilities below, we can see that it would not be very surprising if some face appeared 10 times in 30 rolls of a fair die; after all, $P(\max\{f_1, f_2, \ldots, f_6\} \ge 10) = .1176$, not a very small number, for 30 rolls of a fair die. Similarly, it would not be very surprising if some face appeared 15 times in 50 rolls of a fair die, as can be seen in the table below.

    P(max{f_1, f_2, ..., f_k} >= x), k = 6
    x      n = 30     n = 50
    8      .6014      1
    9      .2942      1
    10     .1176      .9888
    11     .0404      .8663
    12     .0122      .6122
    13     .0032      .3578
    14     .00076     .1816
    15     .00016     .0827
    16     .00003     .0344

Exercises

Exercise 6.1. Suppose $X, Y, Z$ are three independent $U[0,1]$ variables. Let $U, V, W$ denote the minimum, median, and the maximum of $X, Y, Z$.
(a) Find the densities of $U, V, W$.
(b) Find the densities of $\frac{U}{V}$ and $\frac{V}{W}$, and their joint density.
(c) Find $E\left(\frac{U}{V}\right)$ and $E\left(\frac{V}{W}\right)$.
Exercise 6.2. Suppose $X_1, \ldots, X_5$ are independent $U[0,1]$ variables. Find the joint density of $X_{(2)}, X_{(3)}, X_{(4)}$, and $E(X_{(4)} + X_{(2)} - 2X_{(3)})$.
Exercise 6.3. * Suppose $X_1, \ldots, X_n$ are independent $U[0,1]$ variables.
(a) Find the probability that all $n$ observations fall within some interval of length at most .9.
(b) Find the smallest $n$ such that $P(X_{(n)} \ge .99, X_{(1)} \le .01) \ge .99$.
Exercise 6.4 (Correlation Between Order Statistics). Suppose $X_1, \ldots, X_5$ are independent $U[0,1]$ variables. Find the exact values of $\rho_{X_{(i)}, X_{(j)}}$ for all $1 \le i < j \le 5$.

Exercise 6.5 * (Correlation Between Order Statistics). Suppose $X_1, \ldots, X_n$ are independent $U[0,1]$ variables. Find the smallest $n$ such that $\rho_{X_{(1)}, X_{(n)}} < \epsilon$, for $\epsilon = .5, .25, .1$.

Exercise 6.6. Suppose $X, Y, Z$ are three independent standard exponential variables, and let $U, V, W$ be their minimum, median, and maximum. Find the densities of $U, V, W$, and $W - U$.

Exercise 6.7 (Comparison of Mean, Median, and Midrange). Suppose $X_1, X_2, \ldots, X_{2m+1}$ are independent observations from $U[\mu - \sigma, \mu + \sigma]$.
(a) Show that the expectation of each of $\bar{X}$, $X_{(m+1)}$, and $\frac{X_{(1)} + X_{(n)}}{2}$ is $\mu$.
(b) Find the variance of each of $\bar{X}$, $X_{(m+1)}$, and $\frac{X_{(1)} + X_{(n)}}{2}$. Is there an ordering among their variances?

Exercise 6.8 * (Waiting Time). Peter, Paul, and Mary went to a bank to do some business. Two counters were open, and Peter and Paul went first. Each of Peter, Paul, and Mary will take, independently, an $\text{Exp}(\lambda)$ amount of time to finish their business, from the moment they arrive at the counter.
(a) What is the density of the epoch of the last departure?
(b) What is the probability that Mary will be the last to finish?
(c) What is the density of the total time taken by Mary from arrival to finishing her business?

Exercise 6.9. Let $X_1, \ldots, X_n$ be independent standard exponential variables.
(a) Derive an expression for the CDF of the maximum of the spacings $W_0 = X_{(1)}, W_i = X_{(i+1)} - X_{(i)}, i = 1, \ldots, n-1$.
(b) Use it to calculate the probability that among 20 independent standard exponential observations, no two consecutive observations are more than .25 apart.

Exercise 6.10 * (A Characterization). Let $X_1, X_2$ be independent observations from a continuous CDF $F$. Suppose that $X_{(1)}$ and $X_{(2)} - X_{(1)}$ are independent. Show that $F$ must be the CDF of an exponential distribution.

Exercise 6.11 * (Range and Midrange). Let $X_1, \ldots, X_n$ be independent $U[0,1]$ variables. Let $W_n = X_{(n)} - X_{(1)}$, $Y_n = \frac{X_{(n)} + X_{(1)}}{2}$. Find the joint density of $(W_n, Y_n)$ (be careful about where the joint density is positive). Use it to find the conditional expectation of $Y_n$ given $W_n = w$.

Exercise 6.12 * (Density of Midrange). Let $X_1, \ldots, X_n$ be independent observations from a continuous CDF $F$ with density $f$. Show that the CDF of $Y_n = \frac{X_{(n)} + X_{(1)}}{2}$ is given by
$$F_Y(y) = n \int_{-\infty}^{y} [F(2y - x) - F(x)]^{n-1} f(x)\, dx,$$
and hence find the density of $Y_n$.

Exercise 6.13 * (Mean Given the Minimum and Maximum). Let $X_1, \ldots, X_n$ be independent $U[0,1]$ variables. Derive a formula for $E(\bar{X} \mid X_{(1)}, X_{(n)})$.

Exercise 6.14 * (Mean Given the Minimum and Maximum). Let $X_1, \ldots, X_n$ be independent standard exponential variables. Derive a formula for $E(\bar{X} \mid X_{(1)}, X_{(n)})$.

Exercise 6.15 * (Distance Between Mean and Maximum). Let $X_1, \ldots, X_n$ be independent $U[0,1]$ variables. Derive as clean a formula as possible for $E(|\bar{X} - X_{(n)}|)$.

Exercise 6.16 * (Distance Between Mean and Maximum). Let $X_1, \ldots, X_n$ be independent standard exponential variables. Derive as clean a formula as possible for $E(|\bar{X} - X_{(n)}|)$.

Exercise 6.17 * (Distance Between Mean and Maximum). Let $X_1, \ldots, X_n$ be independent standard normal variables. Derive as clean a formula as possible for $E(|\bar{X} - X_{(n)}|)$.

Exercise 6.18 * (Relation Between Uniform and Standard Normal). Let $Z_1, Z_2, \ldots$ be independent standard normal variables. Let $U_1, U_2, \ldots$ be independent $U[0,1]$ variables. Prove the distributional equivalence:
$$\left(U_{(r)}\right)\Big|_{r=1}^{n} \stackrel{\mathcal{L}}{=} \left(\frac{\sum_{i=1}^{2r} Z_i^2}{\sum_{i=1}^{2(n+1)} Z_i^2}\right)\Bigg|_{r=1}^{n}.$$

Exercise 6.19 * (Confidence Interval for a Quantile). Let $X_1, \ldots, X_n$ be independent observations from a continuous CDF $F$. Fix $0 < p < 1$, $0 < \alpha < 1$, and let $F^{-1}(p)$ be the $p$th quantile of $F$. Show that for large enough $n$, there exist $1 \le r < s \le n$ such that $P(X_{(r)} \le F^{-1}(p) \le X_{(s)}) \ge 1 - \alpha$. Do such $r, s$ exist for all $n$?

Hint: Use the quantile transformation.

Exercise 6.20. Let $X_1, \ldots, X_n$ be independent observations from a continuous CDF $F$. Find the smallest value of $n$ such that $P(X_{(2)} \le F^{-1}(\frac{1}{2}) \le X_{(n-1)}) \ge .95$.

Exercise 6.21. Let $X_1, \ldots, X_n$ be independent observations from a continuous CDF $F$ with a density symmetric about some $\mu$. Show that for all odd sample sizes $n = 2m+1$, the median $X_{(m+1)}$ has a density symmetric about $\mu$.

Exercise 6.22. Let $X_1, \ldots, X_n$ be independent observations from a continuous CDF $F$ with a density symmetric about some $\mu$. Show that for any $r$, $X_{(n-r+1)} - \mu \stackrel{\mathcal{L}}{=} \mu - X_{(r)}$.

Exercise 6.23 * (Unimodality of Order Statistics). Let $X_1, \ldots, X_n$ be independent observations from a continuous CDF $F$ with a density $f$. Suppose $\frac{1}{f(x)}$ is convex on the support of $f$, namely on $S = \{x : f(x) > 0\}$. Show that for any $r$, the density of $X_{(r)}$ is unimodal. You may assume that $S$ is an interval.

Exercise 6.24. Let $X_1, X_2, \ldots, X_n$ be independent standard normal variables. Prove that the mode of $X_{(n)}$ is the unique root of
$$(n-1)\phi(x) = x\,\Phi(x).$$

Exercise 6.25 (Conditional Expectation Given the Order Statistics). Let $g(x_1, x_2, \ldots, x_n)$ be a general real-valued function of $n$ variables. Let $X_1, X_2, \ldots, X_n$ be independent observations from a common CDF $F$. Find as clean an expression as possible for $E(g(X_1, X_2, \ldots, X_n) \mid X_{(1)}, X_{(2)}, \ldots, X_{(n)})$.

Exercise 6.26. Derive a formula for the expected value of the rth record when the
sample observations are from an exponential density.

Exercise 6.27 * (Record Values in Normal Case). Suppose $X_1, X_2, \ldots$ are independent observations from the standard normal distribution. Compute the expected values of the first ten records.

Exercise 6.28. Let $F_n(x)$ be the empirical CDF of $n$ observations from a CDF $F$. Show that
$$\Delta_n = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| = \max_{1 \le i \le n} \max\left\{\frac{i}{n} - F(X_{(i)}),\ F(X_{(i)}) - \frac{i-1}{n}\right\}.$$

References

Bickel, P. (1967). Some contributions to order statistics, Proc. Fifth Berkeley Symp., I, 575–591, L. Le Cam and J. Neyman Eds., University of California Press, Berkeley.
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values, Springer, New York.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
DasGupta, A. (2009). Exact tail probabilities and percentiles of the multinomial maximum, Technical Report, Purdue University.
David, H. (1980). Order Statistics, Wiley, New York.
Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function, Ann. Math. Statist., 27, 3, 642–669.
Embrechts, P., Klüppelberg, C., and Mikosch, T. (2008). Modelling Extremal Events: For Insurance and Finance, Springer, New York.
Ethier, S. (1982). Testing for favorable numbers on a roulette wheel, J. Amer. Statist. Assoc., 77, 660–665.
Falk, M., Hüsler, J., and Reiss, R. (1994). Laws of Small Numbers, Extremes, and Rare Events, Birkhäuser, Basel.
Galambos, J. (1987). Asymptotic Theory of Extreme Order Statistics, Academic Press, New York.
Hall, P. (1978). Some asymptotic expansions of moments of order statistics, Stoch. Proc. Appl., 7, 265–275.
Hall, P. (1979). On the rate of convergence of normal extremes, J. Appl. Prob., 16, 2, 433–439.
Leadbetter, M., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes, Springer, New York.
Lehmann, E. (1975). Nonparametrics: Statistical Methods Based on Ranks, McGraw-Hill, Columbus, OH.
Martynov, G. (1992). Statistical tests based on empirical processes and related problems, Soviet J. Math., 61, 4, 2195–2271.
Massart, P. (1990). The tight constant in the DKW inequality, Ann. Prob., 18, 1269–1283.
Port, S. (1993). Theoretical Probability for Applications, Wiley, New York.
Reiss, R. (1989). Approximation Theorems of Order Statistics, Springer-Verlag, New York.
Resnick, S. (2007). Extreme Values, Regular Variation, and Point Processes, Springer, New York.
Shorack, G. and Wellner, J. (1986). Empirical Processes with Applications to Statistics, Wiley, New York.
Chapter 7
Essential Asymptotics and Applications

Asymptotic theory is the study of how distributions of functions of a set of random variables behave, when the number of variables becomes large. One practical context for this is statistical sampling, when the number of observations taken is large. Distributional calculations in probability are typically such that exact calculations are difficult or impossible. For example, one of the simplest possible functions of $n$ variables is their sum, and yet in most cases we cannot find the distribution of the sum for fixed $n$ in an exact closed form. But the central limit theorem allows us to conclude that in some cases the sum will behave as a normally distributed random variable, when $n$ is large. Similarly, the role of general asymptotic theory is to provide an approximate answer to exact solutions in many types of problems, and often very complicated problems. The nature of the theory is such that the approximations have remarkable unity of character, and indeed nearly unreasonable unity of character. Asymptotic theory is the single most unifying theme of probability and statistics. Particularly, in statistics, nearly every method or rule or tradition has its root in some result in asymptotic theory. No other branch of probability and statistics has such an incredibly rich body of literature, tools, and applications, in amazingly diverse areas and problems. Skills in asymptotics are nearly indispensable for a serious statistician or probabilist.

Numerous excellent references on asymptotic theory are available. A few among them are Bickel and Doksum (2007), van der Vaart (1998), Lehmann (1999), Hall and Heyde (1980), Ferguson (1996), and Serfling (1980). A recent reference is DasGupta (2008). These references have a statistical undertone. Treatments of asymptotic theory with a probabilistic undertone include Breiman (1968), Ash (1973), Chow and Teicher (1988), Petrov (1975), Bhattacharya and Rao (1986), Cramér (1946), and the all-time classic, Feller (1971). Other specific references are given in the sections.

In this introductory chapter, we lay out the basic concepts of asymptotics with concrete applications. More specialized tools are separately treated in subsequent chapters.


7.1 Some Basic Notation and Convergence Concepts

Some basic definitions, notation, and concepts are put together in this section.
Definition 7.1. Let $a_n$ be a sequence of real numbers. We write $a_n = o(1)$ if $\lim_{n \to \infty} a_n = 0$. We write $a_n = O(1)$ if $\exists\, K < \infty$ such that $|a_n| \le K \ \forall n \ge 1$. More generally, if $a_n, b_n > 0$ are two sequences of real numbers, we write $a_n = o(b_n)$ if $\frac{a_n}{b_n} = o(1)$; we write $a_n = O(b_n)$ if $\frac{a_n}{b_n} = O(1)$.
Remark. Note that the definition allows the possibility that a sequence $a_n$ which is $O(1)$ is also $o(1)$. The converse is always true; that is, $a_n = o(1) \Rightarrow a_n = O(1)$.
Definition 7.2. Let $a_n, b_n$ be two real sequences. We write $a_n \sim b_n$ if $\frac{a_n}{b_n} \to 1$, as $n \to \infty$. We write $a_n \asymp b_n$ if $0 < \liminf \frac{a_n}{b_n} \le \limsup \frac{a_n}{b_n} < \infty$, as $n \to \infty$.
Example 7.1. Let $a_n = \frac{n}{n+1}$. Then $|a_n| \le 1 \ \forall n \ge 1$, so $a_n = O(1)$. But $a_n \to 1$ as $n \to \infty$, so $a_n$ is not $o(1)$.

However, suppose $a_n = \frac{1}{n}$. Then, again, $|a_n| \le 1 \ \forall n \ge 1$, so $a_n = O(1)$. But this time $a_n \to 0$ as $n \to \infty$, so $a_n$ is both $O(1)$ and $o(1)$. But $a_n = O(1)$ is a weaker statement in this case than saying $a_n = o(1)$.

Next, suppose $a_n = n$. Then $|a_n| = n \to \infty$ as $n \to \infty$, so $a_n$ is not $O(1)$, and therefore also cannot be $o(1)$.
Example 7.2. Let $c_n = \log n$, and $a_n = \frac{c_n}{c_{n+k}}$, where $k \ge 1$ is a fixed positive integer. Thus,
$$a_n = \frac{\log n}{\log(n+k)} = \frac{\log n}{\log n + \log(1 + \frac{k}{n})} = \frac{1}{1 + \frac{\log(1 + \frac{k}{n})}{\log n}} \to \frac{1}{1+0} = 1.$$
Therefore, $a_n = O(1)$ and $a_n \sim 1$, but $a_n$ is not $o(1)$. The statement that $a_n \sim 1$ is stronger than saying $a_n = O(1)$.
Example 7.3. Let $a_n = \frac{1}{\sqrt{n+1}}$. Then,
$$a_n = \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{n+1}} - \frac{1}{\sqrt{n}} = \frac{1}{\sqrt{n}} + \frac{\sqrt{n} - \sqrt{n+1}}{\sqrt{n}\,\sqrt{n+1}} = \frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n}\,\sqrt{n+1}\,(\sqrt{n} + \sqrt{n+1})}$$
$$= \frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n}\,\sqrt{n}(1 + o(1))(2\sqrt{n} + o(1))} = \frac{1}{\sqrt{n}} - \frac{1}{2n^{3/2}(1 + o(1))}$$
$$= \frac{1}{\sqrt{n}} - \frac{1}{2n^{3/2}}(1 + o(1)) = \frac{1}{\sqrt{n}} - \frac{1}{2n^{3/2}} + o\left(n^{-3/2}\right).$$
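A quick numeric check of this expansion (an illustration added here, not from the original text):

```python
# Check that 1/sqrt(n+1) - (1/sqrt(n) - 1/(2 n^{3/2})) is o(n^{-3/2}):
# the ratio of the remainder to n^{-3/2} should tend to 0.
import math

for n in (10, 1000, 100000):
    err = 1 / math.sqrt(n + 1) - (1 / math.sqrt(n) - 1 / (2 * n ** 1.5))
    print(n, err / n ** -1.5)    # roughly 3/(8n), tending to 0
```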

In working with asymptotics, it is useful to be skilled in calculations of the type of this example. To motivate the first probabilistic convergence concept, we give an illustrative example.

Example 7.4. For $n \ge 1$, consider the simple discrete random variables $X_n$ with the pmf $P(X_n = \frac1n) = 1 - \frac1n$, $P(X_n = n) = \frac1n$. Then, for large $n$, $X_n$ is close to zero with a large probability. Although for any given $n$, $X_n$ is never equal to zero, the probability of it being far from zero is very small for large $n$. For example, $P(X_n > .001) \le .001$ if $n \ge 1000$. More formally, for any given $\epsilon > 0$, $P(X_n > \epsilon) \le P(X_n > \frac1n) = \frac1n$, if we take $n$ to be so large that $\frac1n < \epsilon$. As a consequence, $P(|X_n| > \epsilon) = P(X_n > \epsilon) \to 0$ as $n \to \infty$. This example motivates the following definition.

Definition 7.3. Let $X_n$, $n \ge 1$, be an infinite sequence of random variables defined on a common sample space $\Omega$. We say that $X_n$ converges in probability to $c$, a specific real constant, if for any given $\epsilon > 0$, $P(|X_n - c| > \epsilon) \to 0$ as $n \to \infty$. Equivalently, $X_n$ converges in probability to $c$ if given any $\epsilon > 0$, $\delta > 0$, there exists an $n_0 = n_0(\epsilon, \delta)$ such that $P(|X_n - c| > \epsilon) < \delta$ for all $n \ge n_0(\epsilon, \delta)$.

If $X_n$ converges in probability to $c$, we write $X_n \xrightarrow{P} c$, or sometimes also $X_n \overset{P}{\Rightarrow} c$.
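The defining condition is easy to check numerically. The following minimal Python sketch (using NumPy; the seed, the sample sizes, and the choice $\epsilon = 0.01$ are our illustrative choices, not from the text) estimates $P(|X_n| > \epsilon)$ by Monte Carlo for the two-point variables of Example 7.4.

```python
import numpy as np

rng = np.random.default_rng(7)
eps = 0.01

def sample_xn(n, size):
    # X_n takes the value n with probability 1/n, and 1/n otherwise
    return np.where(rng.random(size) < 1.0 / n, n, 1.0 / n)

for n in [10, 100, 1000, 10000]:
    draws = sample_xn(n, 200_000)
    # once 1/n < eps, the exact probability P(|X_n| > eps) equals 1/n
    print(n, np.mean(np.abs(draws) > eps))
```

The estimated probabilities shrink like $\frac1n$ once $\frac1n < \epsilon$, exactly as the computation in Example 7.4 predicts.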
However, sometimes the sequence $X_n$ may get close to some random variable, rather than a constant. Here is an example of such a situation.

Example 7.5. Let $X, Y$ be two independent standard normal variables. Define a sequence of random variables $X_n$ as $X_n = X + \frac{Y}{n}$. Then, intuitively, we feel that for large $n$, the $\frac{Y}{n}$ part is small, and $X_n$ is very close to the fixed random variable $X$. Formally, $P(|X_n - X| > \epsilon) = P(|\frac{Y}{n}| > \epsilon) = P(|Y| > n\epsilon) = 2[1 - \Phi(n\epsilon)] \to 0$ as $n \to \infty$. This motivates a generalization of the previous definition.

Definition 7.4. Let $X_n, X$, $n \ge 1$, be random variables defined on a common sample space $\Omega$. We say that $X_n$ converges in probability to $X$ if given any $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$. We denote it as $X_n \xrightarrow{P} X$ or $X_n \overset{P}{\Rightarrow} X$.

Definition 7.5. A sequence of random variables $X_n$ is said to be bounded in probability or tight if, given $\epsilon > 0$, one can find a constant $k$ such that $P(|X_n| > k) \le \epsilon$ for all $n \ge 1$.
Notation. If $X_n \xrightarrow{P} 0$, then we write $X_n = o_p(1)$. More generally, if $a_nX_n \xrightarrow{P} 0$ for some positive sequence $a_n$, then we write $X_n = o_p(\frac{1}{a_n})$. If $X_n$ is bounded in probability, then we write $X_n = O_p(1)$. If $a_nX_n = O_p(1)$, we write $X_n = O_p(\frac{1}{a_n})$.

Proposition. Suppose $X_n = o_p(1)$. Then $X_n = O_p(1)$. The converse is, in general, not true.

Proof. If $X_n = o_p(1)$, then by definition of convergence in probability, given any $c > 0$, $P(|X_n| > c) \to 0$ as $n \to \infty$. Thus, given any fixed $\epsilon > 0$, for all large $n$, say $n \ge n_0(\epsilon)$, $P(|X_n| > 1) < \epsilon$. Next find constants $c_1, c_2, \ldots, c_{n_0}$ such that $P(|X_i| > c_i) < \epsilon$, $i = 1, 2, \ldots, n_0$. Choose $k = \max\{1, c_1, c_2, \ldots, c_{n_0}\}$. Then $P(|X_n| > k) < \epsilon$ for all $n \ge 1$. Therefore, $X_n = O_p(1)$.

To see that the converse is, in general, not true, let $X \sim N(0, 1)$, and define $X_n \equiv X$ for all $n \ge 1$. Then $X_n = O_p(1)$. But $P(|X_n| > 1) \equiv P(|X| > 1)$, which is a fixed positive number, and so $P(|X_n| > 1)$ does not converge to zero.
Definition 7.6. Let $\{X_n; X\}$ be defined on the same probability space. We say that $X_n$ converges almost surely to $X$ (or $X_n$ converges to $X$ with probability 1) if $P(\omega : X_n(\omega) \to X(\omega)) = 1$. We write $X_n \xrightarrow{a.s.} X$ or $X_n \overset{a.s.}{\Rightarrow} X$.

Remark. If the limit $X$ is a finite constant $c$ with probability one, we write $X_n \xrightarrow{a.s.} c$. If $P(\omega : X_n(\omega) \to \infty) = 1$, we write $X_n \xrightarrow{a.s.} \infty$. Almost sure convergence is a stronger mode of convergence than convergence in probability. In fact, a characterization of almost sure convergence is that for any given $\epsilon > 0$,
$$\lim_{m\to\infty} P(|X_n - X| \le \epsilon \ \forall\, n \ge m) = 1.$$
It is clear from this characterization that almost sure convergence is stronger than convergence in probability. However, the following relationships hold.
Theorem 7.1. (a) Let $X_n \xrightarrow{P} X$. Then there is a subsequence $X_{n_k}$ such that $X_{n_k} \xrightarrow{a.s.} X$.
(b) Suppose $X_n$ is a monotone nondecreasing sequence and $X_n \xrightarrow{P} X$. Then $X_n \xrightarrow{a.s.} X$.
Example 7.6 (Pattern in Coin Tossing). For iid Bernoulli trials with a success probability $p = \frac12$, let $T_n$ denote the number of times in the first $n$ trials that a success is followed by a failure. Denoting
$$I_i = I\{i\text{th trial is a success and } (i+1)\text{th trial is a failure}\},$$
we have $T_n = \sum_{i=1}^{n-1} I_i$, and therefore $E(T_n) = \frac{n-1}{4}$, and
$$\mathrm{Var}(T_n) = \sum_{i=1}^{n-1}\mathrm{Var}(I_i) + 2\sum_{i=1}^{n-2}\mathrm{Cov}(I_i, I_{i+1}) = \frac{3(n-1)}{16} - \frac{2(n-2)}{16} = \frac{n+1}{16}.$$
It now follows by an application of Chebyshev's inequality that $\frac{T_n}{n} \xrightarrow{P} \frac14$.
Example 7.7 (Uniform Maximum). Suppose $X_1, X_2, \ldots$ is an infinite sequence of iid $U[0, 1]$ random variables and let $X_{(n)} = \max\{X_1, \ldots, X_n\}$. Intuitively, $X_{(n)}$ should get closer and closer to 1 as $n$ increases. In fact, $X_{(n)}$ converges almost surely to 1. For
$$P(|1 - X_{(n)}| \le \epsilon \ \forall\, n \ge m) = P(X_{(n)} \ge 1 - \epsilon \ \forall\, n \ge m) = P(X_{(m)} \ge 1 - \epsilon) = 1 - (1 - \epsilon)^m \to 1$$
as $m \to \infty$, and hence $X_{(n)} \xrightarrow{a.s.} 1$.

Example 7.8 (Spiky Normals). Suppose $X_n \sim N(\frac1n, \frac1n)$ is a sequence of independent variables. The mean and the variance both converge to zero; therefore one would intuitively expect that the sequence $X_n$ converges to zero in some sense. In fact, it converges almost surely to zero. Indeed,
$$P(|X_n| \le \epsilon \ \forall\, n \ge m) = \prod_{n=m}^{\infty} P(|X_n| \le \epsilon) = \prod_{n=m}^{\infty}\left[\Phi\!\left(\epsilon\sqrt n - \tfrac{1}{\sqrt n}\right) + \Phi\!\left(\epsilon\sqrt n + \tfrac{1}{\sqrt n}\right) - 1\right]$$
$$= \prod_{n=m}^{\infty}\left[1 + O\!\left(\frac{\phi(\epsilon\sqrt n)}{\sqrt n}\right)\right] = 1 + O\!\left(\frac{e^{-\epsilon^2m/2}}{\sqrt m}\right) \to 1,$$
as $m \to \infty$, implying $X_n \xrightarrow{a.s.} 0$. In the above, the last but one equality follows on using the tail property of the standard normal CDF that
$$\frac{1 - \Phi(x)}{\phi(x)} = \frac{1}{x} + O\!\left(\frac{1}{x^3}\right), \quad \text{as } x \to \infty.$$

Next, we introduce the concept of convergence in mean. It often turns out to be a convenient method for establishing convergence in probability.

Definition 7.7. Let $X_n, X$, $n \ge 1$, be defined on a common sample space $\Omega$. Let $p \ge 1$, and suppose $E(|X_n|^p), E(|X|^p) < \infty$. We say that $X_n$ converges in $p$th mean to $X$, or $X_n$ converges in $L_p$ to $X$, if $E(|X_n - X|^p) \to 0$ as $n \to \infty$. If $p = 2$, we also say that $X_n$ converges to $X$ in quadratic mean. If $X_n$ converges in $L_p$ to $X$, we write $X_n \xrightarrow{L_p} X$.

Example 7.9 (Some Counterexamples). Let $X_n$ be the sequence of two-point random variables with pmf $P(X_n = 0) = 1 - \frac1n$, $P(X_n = n) = \frac1n$. Then $X_n$ converges in probability to zero. But $E(|X_n|) = 1$ for all $n$, and hence $X_n$ does not converge in $L_1$ to zero. In fact, it does not converge to zero in $L_p$ for any $p \ge 1$.

Now take the same sequence $X_n$ as above, and assume moreover that they are independent. Take an $\epsilon > 0$, and positive integers $m, k$. Then,
$$P(|X_n| < \epsilon \ \forall\, m \le n \le m+k) = P(X_n = 0 \ \forall\, m \le n \le m+k) = \prod_{n=m}^{m+k}\left(1 - \frac1n\right) = \frac{m-1}{m+k+1}.$$
For any $m$, this converges to zero as $k \to \infty$. Therefore, $\lim_{m\to\infty} P(|X_n| < \epsilon \ \forall\, n \ge m)$ cannot be one, and so $X_n$ does not converge almost surely to zero.
Next, let $X_n$ have the pmf $P(X_n = 0) = 1 - \frac1n$, $P(X_n = \sqrt n) = \frac1n$. Then $X_n$ again converges in probability to zero. Furthermore, $E(|X_n|) = \frac{1}{\sqrt n} \to 0$, and so $X_n$ converges in $L_1$ to zero. But $E(X_n^2) = 1$ for all $n$, and hence $X_n$ does not converge in $L_2$ to zero.

The next result says that convergence in $L_p$ is a useful method for establishing convergence in probability.

Proposition. Let $X_n, X$, $n \ge 1$, be defined on a common sample space $\Omega$. Suppose $X_n$ converges to $X$ in $L_p$ for some $p \ge 1$. Then $X_n \xrightarrow{P} X$.

Proof. Simply observe that, by using Markov's inequality,
$$P(|X_n - X| > \epsilon) = P(|X_n - X|^p > \epsilon^p) \le \frac{E(|X_n - X|^p)}{\epsilon^p} \to 0,$$
by hypothesis.

Remark. It is easily established that if $p > 1$, then $X_n$ converges in $L_p$ to $X$ implies that $X_n$ converges in $L_1$ to $X$.

7.2 Laws of Large Numbers

The definitions and the treatment in the previous section are for general sequences of
random variables. Averages and sums are sequences of special importance in appli-
cations. The classic laws of large numbers, which characterize the long run behavior
of averages, are given in this section. Truly, the behavior of averages and sums in
complete generality is very subtle, and is beyond the scope of this book. Specialized
treatments are available in Feller (1971), Révész (1968), and Kesten (1972).
A very useful tool for establishing almost sure convergence is stated first.

Theorem 7.2 (Borel–Cantelli Lemma). Let $\{A_n\}$ be a sequence of events on a sample space $\Omega$. If
$$\sum_{n=1}^{\infty} P(A_n) < \infty,$$
then $P(\text{infinitely many } A_n \text{ occur}) = 0$.

If $\{A_n\}$ are pairwise independent, and
$$\sum_{n=1}^{\infty} P(A_n) = \infty,$$
then $P(\text{infinitely many } A_n \text{ occur}) = 1$.



Proof. We prove the first statement. In order that infinitely many among the events $A_n$, $n \ge 1$, occur, it is necessary and sufficient that given any $m$, there is at least one event among $A_m, A_{m+1}, \ldots$ that occurs. In other words,
$$\{\text{Infinitely many } A_n \text{ occur}\} = \cap_{m=1}^{\infty}\cup_{n=m}^{\infty} A_n.$$
On the other hand, the events $B_m = \cup_{n=m}^{\infty} A_n$ are decreasing in $m$ (i.e., $B_1 \supseteq B_2 \supseteq \cdots$). Therefore,
$$P(\text{infinitely many } A_n \text{ occur}) = P(\cap_{m=1}^{\infty} B_m) = \lim_{m\to\infty} P(B_m) = \lim_{m\to\infty} P(\cup_{n=m}^{\infty} A_n) \le \limsup_{m\to\infty}\sum_{n=m}^{\infty} P(A_n) = 0,$$
because, by assumption, $\sum_{n=1}^{\infty} P(A_n) < \infty$.
Remark. Although pairwise independence suffices for the conclusion of the second part of the Borel–Cantelli lemma, common applications involve cases where the $A_n$ are mutually independent.

The next example gives an application of the Borel–Cantelli lemma.
Example 7.10 (Tail Runs of Arbitrary Length). Most of us do not expect that, in tossing a coin repeatedly, we will see many tails or many heads in succession. Problems of this kind were discussed in Chapter 1. This example shows that in some sense, that intuition is wrong.

Consider a sequence of independent Bernoulli trials in which success occurs with probability $p$ and failure with probability $q = 1 - p$. Suppose $p > q$, so that successes are more likely than failures. Consider a hypothetical long uninterrupted run of $m$ failures, say $FF\ldots F$, for some fixed $m$. Break up the Bernoulli trials into nonoverlapping blocks of $m$ trials, and consider $A_n$ to be the event that the $n$th block consists of only failures. The probability of each $A_n$ is $q^m$, which is free of $n$. Therefore, $\sum_{n=1}^{\infty} P(A_n) = \infty$, and it follows from the second part of the Borel–Cantelli lemma that no matter how large $p$ may be, as long as $p < 1$, a string of consecutive failures of any given arbitrary length $m$ reappears infinitely many times in the sequence of Bernoulli trials. In particular, if we keep tossing an ordinary coin, then with certainty we will see 1000 tails (or 10,000 tails) in succession, and we will see this occur again and again, infinitely many times, as our coin tosses continue indefinitely.
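The block construction in this example is easy to mimic numerically. The sketch below (NumPy; the values $p = 0.9$, $m = 5$, the seed, and the number of blocks are our illustrative choices) counts all-failure blocks; their number keeps growing roughly linearly with the number of blocks examined, in line with the second part of the Borel–Cantelli lemma.

```python
import numpy as np

rng = np.random.default_rng(7)
p, m = 0.9, 5                       # success probability and run length
n_blocks = 200_000                  # disjoint blocks of m trials each

# Block n gives an "all-failure" event A_n; the A_n are independent,
# each with probability q^m = (1 - p)^m
blocks = rng.random((n_blocks, m)) > p          # True = failure
all_fail = blocks.all(axis=1)
print("expected count:", n_blocks * (1 - p) ** m)
print("observed count:", int(all_fail.sum()))
```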
Here is another important application of the Borel–Cantelli lemma.
Example 7.11 (Almost Sure Convergence of Binomial Proportion). Let $X_1, X_2, \ldots$ be an infinite sequence of independent $\mathrm{Ber}(p)$ random variables, where $0 < p < 1$. Let
$$\bar X_n = \frac{S_n}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
Then, from our previous formula in Chapter 1 for binomial distributions, $E(S_n - np)^4 = np(1-p)[1 + 3(n-2)p(1-p)]$. Thus, by Markov's inequality,
$$P(|\bar X_n - p| > \epsilon) = P(|S_n - np| > n\epsilon) = P((S_n - np)^4 > (n\epsilon)^4) \le \frac{E(S_n - np)^4}{(n\epsilon)^4}$$
$$= \frac{np(1-p)[1 + 3(n-2)p(1-p)]}{(n\epsilon)^4} \le \frac{C}{n^2} + \frac{D}{n^3},$$
for finite constants $C, D$. Therefore,
$$\sum_{n=1}^{\infty} P(|\bar X_n - p| > \epsilon) \le C\sum_{n=1}^{\infty}\frac{1}{n^2} + D\sum_{n=1}^{\infty}\frac{1}{n^3} < \infty.$$
It follows from the Borel–Cantelli lemma that the binomial sample proportion $\bar X_n$ converges almost surely to $p$.

In fact, the convergence of the sample mean $\bar X_n$ to $E(X_1)$ (i.e., the common mean of the $X_i$) holds in general. The general results, due to Khintchine and Kolmogorov, are known as the laws of large numbers, stated below.
Theorem 7.3 (Weak Law of Large Numbers). Suppose $X_1, X_2, \ldots$ are independent and identically distributed (iid) random variables (defined on a common sample space $\Omega$), such that $E(|X_1|) < \infty$, and $E(X_1) = \mu$. Let $\bar X_n = \frac1n\sum_{i=1}^n X_i$. Then $\bar X_n \xrightarrow{P} \mu$.

Theorem 7.4 (Strong Law of Large Numbers). Suppose $X_1, X_2, \ldots$ are independent and identically distributed random variables (defined on a common sample space $\Omega$). Then $\bar X_n$ has an a.s. (almost sure) limit iff $E(|X_1|) < \infty$, in which case $\bar X_n \xrightarrow{a.s.} \mu = E(X_1)$.
Remark. It is not very simple to prove either of the two laws of large numbers in the generality stated above. We prove the weak law in Chapter 8, and the strong law in Chapter 14. If the $X_i$ have a finite variance, then Markov's inequality easily leads to the weak law. If the $X_i$ have four finite moments, then a careful argument on the lines of our special binomial proportion example above does lead to the strong law. Once again, the Borel–Cantelli lemma does the trick.

It is extremely interesting that existence of an expectation is not necessary for the WLLN (weak law of large numbers) to hold. That is, it is possible that $E(|X|) = \infty$, and yet $\bar X_n \xrightarrow{P} a$ for some real number $a$. We describe this more precisely shortly. The SLLN (strong law of large numbers) already tells us that if $X_1, X_2, \ldots$ are independent with a common CDF $F$ (that is, iid), then the sample mean $\bar X_n$ does not have any almost sure limit if $E_F(|X|) = \infty$. An obvious question is what happens to $\bar X_n$ in such a case. A great deal of deep work has been done on this question, and there are book-length treatments of this issue. The following theorem gives a few key results for easy reference.

Definition 7.8. Let $x$ be a real number. The positive part and negative part of $x$ are defined as
$$x^+ = \max\{x, 0\}; \qquad x^- = \max\{-x, 0\}.$$
That is, $x^+ = x$ when $x \ge 0$, and $0$ when $x \le 0$. On the other hand, $x^- = 0$ when $x \ge 0$, and $-x$ when $x \le 0$. Consequently, for any real number $x$,
$$x^+, x^- \ge 0; \qquad x = x^+ - x^-; \qquad |x| = x^+ + x^-.$$

Theorem 7.5 (Failure of the Strong Law). Let $X_1, X_2, \ldots$ be independent observations from a common CDF $F$ on the real line. Suppose $E_F(|X|) = \infty$.
(a) For any sequence of real numbers $a_n$,
$$P\left(\limsup_n |\bar X_n - a_n| = \infty\right) = 1.$$
(b) If $E(X^+) = \infty$ and $E(X^-) < \infty$, then $\bar X_n \xrightarrow{a.s.} \infty$.
(c) If $E(X^-) = \infty$ and $E(X^+) < \infty$, then $\bar X_n \xrightarrow{a.s.} -\infty$.

More refined descriptions of the set of all possible limit points of the sequence of means $\bar X_n$ are worked out in Kesten (1972). See also Chapter 3 in DasGupta (2008).

Example 7.12. Let $F$ be the CDF of the standard Cauchy distribution. Due to the symmetry, we get $E(X^+) = E(X^-) = \frac{1}{\pi}\int_0^{\infty}\frac{x}{1+x^2}\,dx = \infty$. Therefore, from part (a) of the above theorem, with probability one, $\limsup_n |\bar X_n| = \infty$ (i.e., the sequence of sample means cannot remain bounded). Also, from the statement of the strong law itself, the sequence will not settle down near any fixed real number. The four simulated plots in Fig. 7.1 help illustrate these phenomena. In each plot, 1000 standard Cauchy values were simulated, and the sequence of means $\bar X_j = \frac1j\sum_{i=1}^j X_i$ was plotted, for $j = 1, 2, \ldots, 1000$.
Now, we consider the possibility of a WLLN when the expectation does not exist. The answer is that the tail of $F$ must not be too heavy. Here is the precise result.

Theorem 7.6 (Weak Law Without an Expectation). Let $X_1, X_2, \ldots$ be independent observations from a CDF $F$. Then there exist constants $\mu_n$ such that $\bar X_n - \mu_n \xrightarrow{P} 0$ if and only if
$$x(1 - F(x) + F(-x)) \to 0,$$
as $x \to \infty$, in which case the constants $\mu_n$ may be chosen as $\mu_n = E_F(X I_{\{|X| \le n\}})$.

In particular, if $F$ is symmetric, $x(1 - F(x)) \to 0$ as $x \to \infty$, and $\int_0^{\infty}(1 - F(x))\,dx = \infty$, then $E_F(|X|) = \infty$, whereas $\bar X_n \xrightarrow{P} 0$.



[Figure 7.1 appears here: four panels, each showing the sequence of sample means $\bar X_j$, $j = 1, \ldots, 1000$, for a simulated C(0,1) sample.]

Fig. 7.1 Sequence of sample means for simulated C(0,1) data
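Plots such as those in Fig. 7.1 can be regenerated with a few lines of code. The following sketch (NumPy; the seed and run lengths are our choices) prints summary features of four simulated running-mean paths instead of drawing them; the paths neither remain bounded in any useful sense nor settle near any fixed number.

```python
import numpy as np

rng = np.random.default_rng(7)

for run in range(1, 5):
    x = rng.standard_cauchy(1000)
    means = np.cumsum(x) / np.arange(1, 1001)    # running means X-bar_j
    print(f"run {run}: final mean {means[-1]:+8.2f}, "
          f"largest |mean| along the path {np.abs(means).max():8.2f}")
```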

Remark. See Feller (1971, p. 235) for a proof. It should be noted that the two conditions $x(1 - F(x)) \to 0$ and $\int_0^{\infty}(1 - F(x))\,dx = \infty$ are not inconsistent. It is easy to find an $F$ that satisfies both conditions.
We close this section with an important result on the uniform closeness of the
empirical CDF to the underlying CDF in the iid case.

Theorem 7.7 (Glivenko–Cantelli Theorem). Let $F$ be any CDF on the real line, and $X_1, X_2, \ldots$ iid with common CDF $F$. Let $F_n(x) = \frac{\#\{i \le n : X_i \le x\}}{n}$ be the sequence of empirical CDFs. Then $\Delta_n = \sup_{x\in\mathbb R}|F_n(x) - F(x)| \xrightarrow{a.s.} 0$.

Proof. The main idea of the proof is to discretize the problem, and exploit Kolmogorov's SLLN.

Fix $m$ and define $a_0, a_1, \ldots, a_m, a_{m+1}$ by the relationships $[a_i, a_{i+1}) = \{x : \frac{i}{m} \le F(x) < \frac{i+1}{m}\}$, $i = 1, 2, \ldots, m-1$, and $a_0 = -\infty$, $a_{m+1} = \infty$. Now fix an $i$ and look at $x \in [a_i, a_{i+1})$. Then
$$F_n(x) - F(x) \le F_n(a_{i+1}-) - F(a_i) \le F_n(a_{i+1}-) - F(a_{i+1}-) + \frac{1}{m}.$$
Similarly, for $x \in [a_i, a_{i+1})$,
$$F(x) - F_n(x) \le F(a_i) - F_n(a_i) + \frac{1}{m}.$$
Therefore, because any $x$ belongs to one of these intervals $[a_i, a_{i+1})$,
$$\sup_{x\in\mathbb R}|F_n(x) - F(x)| \le \max_i\left[|F(a_i) - F_n(a_i)| + |F(a_{i+1}-) - F_n(a_{i+1}-)|\right] + \frac{1}{m}.$$
For fixed $m$, as $n \to \infty$, by the SLLN each of the terms within the absolute value signs above goes almost surely to zero, and so, for any fixed $m$, almost surely, $\limsup_n \sup_{x\in\mathbb R}|F_n(x) - F(x)| \le \frac{1}{m}$. Now let $m \to \infty$, and the result follows.

7.3 Convergence Preservation

We have already seen the importance of being able to deal with transformations of
random variables in Chapters 3 and 4. This section addresses the question of when
convergence properties are preserved if we suitably transform a sequence of random
variables.
The next important theorem gives some frequently useful results that are analogous to corresponding results on convergence of sequences in calculus.

Theorem 7.8 (Convergence Preservation).
(a) $X_n \xrightarrow{P} X,\ Y_n \xrightarrow{P} Y \Rightarrow X_n \pm Y_n \xrightarrow{P} X \pm Y$.
(b) $X_n \xrightarrow{P} X,\ Y_n \xrightarrow{P} Y \Rightarrow X_nY_n \xrightarrow{P} XY$; and $X_n \xrightarrow{P} X,\ Y_n \xrightarrow{P} Y,\ P(Y \ne 0) = 1 \Rightarrow \frac{X_n}{Y_n} \xrightarrow{P} \frac{X}{Y}$.
(c) $X_n \xrightarrow{a.s.} X,\ Y_n \xrightarrow{a.s.} Y \Rightarrow X_n \pm Y_n \xrightarrow{a.s.} X \pm Y$.
(d) $X_n \xrightarrow{a.s.} X,\ Y_n \xrightarrow{a.s.} Y \Rightarrow X_nY_n \xrightarrow{a.s.} XY$; and $X_n \xrightarrow{a.s.} X,\ Y_n \xrightarrow{a.s.} Y,\ P(Y \ne 0) = 1 \Rightarrow \frac{X_n}{Y_n} \xrightarrow{a.s.} \frac{X}{Y}$.
(e) $X_n \xrightarrow{L_1} X,\ Y_n \xrightarrow{L_1} Y \Rightarrow X_n + Y_n \xrightarrow{L_1} X + Y$.
(f) $X_n \xrightarrow{L_2} X,\ Y_n \xrightarrow{L_2} Y \Rightarrow X_n + Y_n \xrightarrow{L_2} X + Y$.
The proofs of each of these parts use relatively simple arguments, such as the triangle, Minkowski, and Cauchy–Schwarz inequalities (see Chapter 1 for their exact statements). We omit the details of these proofs; Chow and Teicher (1988, pp. 254–256) give the details for several parts of this convergence preservation theorem.

Example 7.13. Suppose $X_1, X_2, \ldots$ are independent $N(\mu_1, \sigma_1^2)$ variables, and $Y_1, Y_2, \ldots$ are independent $N(\mu_2, \sigma_2^2)$ variables. For $n, m \ge 1$, let $\bar X_n = \frac1n\sum_{i=1}^n X_i$, $\bar Y_m = \frac1m\sum_{j=1}^m Y_j$. By the strong law of large numbers (SLLN), as $n, m \to \infty$, $\bar X_n \xrightarrow{a.s.} \mu_1$ and $\bar Y_m \xrightarrow{a.s.} \mu_2$. Then, by the theorem above, $\bar X_n - \bar Y_m \xrightarrow{a.s.} \mu_1 - \mu_2$. Also, by the same theorem, $\bar X_n\bar Y_m \xrightarrow{a.s.} \mu_1\mu_2$.
Definition 7.9 (The Multidimensional Case). Let $\mathbf X_n$, $n \ge 1$, and $\mathbf X$ be $d$-dimensional random vectors, for some $1 \le d < \infty$. We say that $\mathbf X_n \xrightarrow{P} \mathbf X$ if $\|\mathbf X_n - \mathbf X\| \xrightarrow{P} 0$. We say that $\mathbf X_n \xrightarrow{a.s.} \mathbf X$ if $P(\omega : \|\mathbf X_n(\omega) - \mathbf X(\omega)\| \to 0) = 1$. Here, $\|\cdot\|$ denotes Euclidean length (norm).

Operationally, the following equivalent conditions are more convenient.

Proposition. (a) $\mathbf X_n \xrightarrow{P} \mathbf X$ if and only if $X_{n,i} \xrightarrow{P} X_i$ for each $i = 1, 2, \ldots, d$. That is, each coordinate of $\mathbf X_n$ converges in probability to the corresponding coordinate of $\mathbf X$.
(b) $\mathbf X_n \xrightarrow{a.s.} \mathbf X$ if and only if $X_{n,i} \xrightarrow{a.s.} X_i$ for each $i = 1, 2, \ldots, d$.
Theorem 7.9 (Convergence Preservation in Multidimension). Let $\mathbf X_n, \mathbf Y_n$, $n \ge 1$, and $\mathbf X, \mathbf Y$ be $d$-dimensional random vectors. Let $A$ be some $p \times d$ matrix of real elements, where $p \ge 1$. Then,
(a) $\mathbf X_n \xrightarrow{P} \mathbf X,\ \mathbf Y_n \xrightarrow{P} \mathbf Y \Rightarrow \mathbf X_n \pm \mathbf Y_n \xrightarrow{P} \mathbf X \pm \mathbf Y$; $\mathbf X_n'\mathbf Y_n \xrightarrow{P} \mathbf X'\mathbf Y$; $A\mathbf X_n \xrightarrow{P} A\mathbf X$.
(b) Exactly the same results hold when convergence in probability is replaced everywhere by almost sure convergence.

Proof. This theorem follows easily from the convergence preservation theorem in one dimension, and the proposition above, which says that multidimensional convergence is the same as convergence in each coordinate separately.

The next result is one of the most useful results on almost sure convergence and
convergence in probability. It says that convergence properties are preserved if we
make smooth transformations. However, the force of the result is partially lost if
we insist on the transformations being smooth everywhere. To give the most useful
version of the result, we need a technical definition.

Definition 7.10. Let $d, p \ge 1$ be positive integers, and $f : S \subseteq \mathbb R^d \to \mathbb R^p$ a function. Let $C(f) = \{x \in S : f \text{ is continuous at } x\}$. Then $C(f)$ is called the continuity set of $f$.

We can now give the result on preservation of convergence under smooth transformations.
Theorem 7.10 (Continuous Mapping). Let $\mathbf X_n, \mathbf X$ be $d$-dimensional random vectors, and $f : S \subseteq \mathbb R^d \to \mathbb R^p$ a function. Let $C(f)$ be the continuity set of $f$. Suppose the random vector $\mathbf X$ satisfies the condition
$$P(\mathbf X \in C(f)) = 1.$$
Then,
(a) $\mathbf X_n \xrightarrow{P} \mathbf X \Rightarrow f(\mathbf X_n) \xrightarrow{P} f(\mathbf X)$;
(b) $\mathbf X_n \xrightarrow{a.s.} \mathbf X \Rightarrow f(\mathbf X_n) \xrightarrow{a.s.} f(\mathbf X)$.
Proof. We prove part (b). Let $S_1 = \{\omega \in \Omega : f \text{ is continuous at } \mathbf X(\omega)\}$, and let $S_2 = \{\omega \in \Omega : \mathbf X_n(\omega) \to \mathbf X(\omega)\}$. Then $P(S_1 \cap S_2) = 1$, and for each $\omega \in S_1 \cap S_2$, $f(\mathbf X_n(\omega)) \to f(\mathbf X(\omega))$. That means that $f(\mathbf X_n) \xrightarrow{a.s.} f(\mathbf X)$.

We give two important applications of this theorem next.

Example 7.14 (Convergence of Sample Variance). Let $X_1, X_2, \ldots, X_n$ be independent observations from a common distribution $F$, and suppose that $F$ has finite mean $\mu$ and finite variance $\sigma^2$. The sample variance, of immense importance in statistics, is defined as $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$. The purpose of this example is to show that $s^2 \xrightarrow{a.s.} \sigma^2$ as $n \to \infty$.

First note that if we can prove that $\frac1n\sum_{i=1}^n (X_i - \bar X)^2 \xrightarrow{a.s.} \sigma^2$, then it follows that $s^2$ also converges almost surely to $\sigma^2$, because $\frac{n}{n-1} \to 1$ as $n \to \infty$. Now,
$$\frac1n\sum_{i=1}^n (X_i - \bar X)^2 = \frac1n\sum_{i=1}^n X_i^2 - (\bar X)^2$$
(an algebraic identity). Because $F$ has a finite variance, it also possesses a finite second moment, namely, $E_F(X^2) = \sigma^2 + \mu^2 < \infty$. By applying the strong law of large numbers to the sequence $X_1^2, X_2^2, \ldots$, we get $\frac1n\sum_{i=1}^n X_i^2 \xrightarrow{a.s.} E_F(X^2) = \sigma^2 + \mu^2$. By applying the SLLN to the sequence $X_1, X_2, \ldots$, we get $\bar X \xrightarrow{a.s.} \mu$, and therefore by the continuous mapping theorem, $(\bar X)^2 \xrightarrow{a.s.} \mu^2$. Now, by the theorem on preservation of convergence, we get that $\frac1n\sum_{i=1}^n X_i^2 - (\bar X)^2 \xrightarrow{a.s.} \sigma^2 + \mu^2 - \mu^2 = \sigma^2$, which finishes the proof.

Example 7.15 (Convergence of Sample Correlation). Suppose $F(x, y)$ is a joint CDF in $\mathbb R^2$, and suppose that $E(X^2), E(Y^2)$ are both finite. Let
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{E(XY) - E(X)E(Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
be the correlation between $X$ and $Y$. Suppose $(X_i, Y_i)$, $1 \le i \le n$, are $n$ independent observations from the joint CDF $F$. The sample correlation coefficient is defined as
$$r = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^n (X_i - \bar X)^2}\,\sqrt{\sum_{i=1}^n (Y_i - \bar Y)^2}}.$$
The purpose of this example is to show that $r$ converges almost surely to $\rho$.

It is convenient to rewrite $r$ in the equivalent form
$$r = \frac{\frac1n\sum_{i=1}^n X_iY_i - \bar X\bar Y}{\sqrt{\frac1n\sum_{i=1}^n (X_i - \bar X)^2}\,\sqrt{\frac1n\sum_{i=1}^n (Y_i - \bar Y)^2}}.$$
By the SLLN, $\frac1n\sum_{i=1}^n X_iY_i$ converges almost surely to $E(XY)$, and $\bar X, \bar Y$ converge almost surely to $E(X), E(Y)$. By the previous example, $\frac1n\sum_{i=1}^n (X_i - \bar X)^2$ converges almost surely to $\mathrm{Var}(X)$, and $\frac1n\sum_{i=1}^n (Y_i - \bar Y)^2$ converges almost surely to $\mathrm{Var}(Y)$. Now consider the function $f(s, t, u, v, w) = \frac{s - tu}{\sqrt v\,\sqrt w}$, $-\infty < s, t, u < \infty$, $v, w > 0$. This function is continuous on the set $S = \{(s, t, u, v, w) : -\infty < s, t, u < \infty,\ v, w > 0,\ (s - tu)^2 \le vw\}$. The joint distribution of $\left(\frac1n\sum_{i=1}^n X_iY_i, \bar X, \bar Y, \frac1n\sum_{i=1}^n (X_i - \bar X)^2, \frac1n\sum_{i=1}^n (Y_i - \bar Y)^2\right)$ assigns probability one to the set $S$. By the continuous mapping theorem, it follows that $r \xrightarrow{a.s.} \rho$.

7.4 Convergence in Distribution

Studying distributions of random variables is of paramount importance in both


probability and statistics. The relevant random variable may be a member of some
sequence Xn . Its exact distribution may be cumbersome. But it may be possible to
approximate its distribution by a simpler distribution. We can then approximate
probabilities for the true distribution of the random variable by probabilities in
the simpler distribution. The type of convergence concept that justifies this sort of
approximation is called convergence in distribution or convergence in law. Of all
the convergence concepts we are discussing, convergence in distribution is among
the most useful in answering practical questions. For example, statisticians are
usually much more interested in constructing confidence intervals than just point
estimators, and a central limit theorem of some kind is necessary to produce a
confidence interval.
We start with an illustrative example.
Example 7.16. Suppose
$$X_n \sim U\left[\frac12 - \frac{1}{n+1},\ \frac12 + \frac{1}{n+1}\right], \quad n \ge 1.$$
Because the interval $[\frac12 - \frac{1}{n+1}, \frac12 + \frac{1}{n+1}]$ is shrinking to the single point $\frac12$, intuitively we feel that the distribution of $X_n$ is approaching a distribution concentrated at $\frac12$, that is, a one-point distribution. The CDF of the distribution concentrated at $\frac12$ equals the function $F(x) = 0$ for $x < \frac12$, and $F(x) = 1$ for $x \ge \frac12$. Consider now the CDF of $X_n$; call it $F_n(x)$. Fix $x < \frac12$. Then, for all large $n$, $F_n(x) = 0$, and so $\lim_n F_n(x)$ is also zero. Next fix $x > \frac12$. Then, for all large $n$, $F_n(x) = 1$, and so $\lim_n F_n(x)$ is also one. Therefore, if $x < \frac12$, or if $x > \frac12$, $\lim_n F_n(x) = F(x)$. If $x$ is exactly equal to $\frac12$, then $F_n(x) = \frac12$. But $F(\frac12) = 1$. So $x = \frac12$ is a problematic point, and the only problematic point, in that $F_n(\frac12) \not\to F(\frac12)$. Interestingly, $x = \frac12$ is also exactly the only point at which $F$ is not continuous. However, we do not want this one problematic point to ruin our intuitive feeling that $X_n$ is approaching the one-point distribution concentrated at $\frac12$. That is, we do not take into account any points where the limiting CDF is not continuous.
Definition 7.11. Let $X_n, X$, $n \ge 1$, be real-valued random variables defined on a common sample space $\Omega$. We say that $X_n$ converges in distribution (in law) to $X$ if $P(X_n \le x) \to P(X \le x)$ as $n \to \infty$ at every point $x$ that is a continuity point of the CDF $F$ of the random variable $X$. We denote convergence in distribution by $X_n \xrightarrow{\mathcal L} X$.

If $\mathbf X_n, \mathbf X$ are $d$-dimensional random vectors, then the same definition applies by using the joint CDFs of $\mathbf X_n, \mathbf X$; that is, $\mathbf X_n$ converges in distribution to $\mathbf X$ if $P(X_{n1} \le x_1, \ldots, X_{nd} \le x_d) \to P(X_1 \le x_1, \ldots, X_d \le x_d)$ at each point $(x_1, \ldots, x_d)$ that is a continuity point of the joint CDF $F(x_1, \ldots, x_d)$ of the random vector $\mathbf X$.
An important point of caution is the following.

Caution. In order to prove that $d$-dimensional vectors $\mathbf X_n$ converge in distribution to $\mathbf X$, it is not, in general, enough to prove that each coordinate of $\mathbf X_n$ converges in distribution to the corresponding coordinate of $\mathbf X$. However, convergence of general linear combinations is enough, which is the content of the following theorem.

Theorem 7.11 (Cramér–Wold Theorem). Let $\mathbf X_n, \mathbf X$ be $d$-dimensional random vectors. Then $\mathbf X_n \xrightarrow{\mathcal L} \mathbf X$ if and only if $c'\mathbf X_n \xrightarrow{\mathcal L} c'\mathbf X$ for all unit $d$-dimensional vectors $c$.
The shortest proof of this theorem uses a tool called characteristic functions,
which we have not discussed yet. We give a proof in the next chapter by using char-
acteristic functions. Returning to the general concept of convergence in distribution,
two basic facts are the following.
Theorem 7.12. (a) If $X_n \xrightarrow{\mathcal L} X$, then $X_n = O_p(1)$.
(b) If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{\mathcal L} X$.

Proof. We sketch a proof of part (b). Take a continuity point $x$ of the CDF $F$ of $X$, and fix $\epsilon > 0$. Then,
$$F_n(x) = P(X_n \le x) = P(X_n \le x, |X_n - X| \le \epsilon) + P(X_n \le x, |X_n - X| > \epsilon) \le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon).$$
Now let $n \to \infty$ on both sides of the inequality. Then we get $\limsup_n F_n(x) \le F(x + \epsilon)$, because $P(|X_n - X| > \epsilon) \to 0$ by hypothesis. Now, letting $\epsilon \to 0$, we get $\limsup_n F_n(x) \le F(x)$, because $F(x + \epsilon) \to F(x)$ by right continuity of $F$.

The proof will be complete if we show that $\liminf_n F_n(x) \ge F(x)$. This is proved similarly, except we now start with $P(X \le x - \epsilon)$ on the left, and follow the same steps. It should be mentioned that it is in this part that the continuity of $F$ at $x$ is used.
Remark. The fact that if a sequence $X_n$ of random variables converges in distribution, then the sequence must be $O_p(1)$, tells us that there must be sequences of random variables which do not converge in distribution to anything. For example, take $X_n \sim N(n, 1)$, $n \ge 1$. This sequence $X_n$ is not $O_p(1)$, and therefore cannot converge in distribution. The question arises whether the $O_p(1)$ property suffices for convergence. Even that, evidently, is not true; just consider $X_{2n-1} \sim N(0, 1)$ and $X_{2n} \sim N(1, 1)$. However, separately, the odd and the even subsequences do converge. That is, there might be a partial converse to the fact that if a sequence $X_n$ converges in distribution, then it must be $O_p(1)$. This is a famous theorem on convergence in distribution, and is stated below.

Theorem 7.13 (Helly's Theorem). Let $X_n$, $n \ge 1$, be random variables defined on a common sample space $\Omega$, and suppose $X_n$ is $O_p(1)$. Then there is a subsequence $X_{n_j}$, $j \ge 1$, and a random variable $X$ (on the same sample space $\Omega$), such that $X_{n_j} \xrightarrow{\mathcal L} X$. Furthermore, $X_n \xrightarrow{\mathcal L} X$ if and only if every convergent subsequence $X_{n_j}$ converges in distribution to this same $X$.
See Port (1994, p. 625) for a proof. Major generalizations of Helly’s theorem to
much more general spaces are known. Typically, some sort of a metric structure is
assumed in these results; see van der Vaart and Wellner (2000) for such general
results on weak compactness.

Example 7.17 (Various Convergence Phenomena Are Possible). This quick example shows that a sequence of discrete distributions can converge in distribution to a discrete distribution, or a continuous distribution, and a sequence of continuous distributions can converge in distribution to a continuous one, or a discrete one.

A good example of discrete random variables converging in distribution to a discrete random variable is the sequence $X_n \sim \mathrm{Bin}(n, \frac1n)$. Although it was not explicitly put in the language of convergence in distribution, we have seen in Chapter 6 that $X_n$ converges to a Poisson random variable with mean one. A familiar example of discrete random variables converging in distribution to a continuous random variable is the de Moivre–Laplace central limit theorem (Chapter 1), which says that if $X_n \sim \mathrm{Bin}(n, p)$, then $\frac{X_n - np}{\sqrt{np(1-p)}}$ converges to a standard normal variable.

Examples of continuous random variables converging to a continuous random variable are immediately available by using the general central limit theorem (also Chapter 1). For example, if the $X_i$ are independent $U[-1, 1]$ variables, then $\frac{\sqrt n\bar X}{\sigma}$, where $\sigma^2 = \frac13$, converges to a standard normal variable. Finally, as an example of continuous random variables converging to a discrete random variable, consider $X_n \sim \mathrm{Be}(\frac1n, \frac1n)$. Visually, the density of $X_n$ for large $n$ is a symmetric U-shaped density, unbounded at both 0 and 1. It is not hard to show that $X_n$ converges in distribution to $X$, where $X$ is a Bernoulli random variable with parameter $\frac12$.

Thus, we see that any types of random variables can indeed converge to the same or the other type.
By definition of convergence in distribution, if $X_n \xrightarrow{\mathcal L} X$, and if $X$ has a continuous CDF $F$ (continuous everywhere), then $F_n(x) \to F(x)$ for all $x$, where $F_n(x)$ is the CDF of $X_n$. The following theorem says that much more is true, namely that the convergence is actually uniform; see p. 265 in Chow and Teicher (1988).
Theorem 7.14 (Pólya's Theorem). Let $X_n$, $n \ge 1$, have CDF $F_n$, and let $X$ have CDF $F$. If $F$ is everywhere continuous, and if $X_n \xrightarrow{\mathcal L} X$, then
$$\sup_{x\in\mathbb R}|F_n(x) - F(x)| \to 0,$$
as $n \to \infty$.
A large number of equivalent characterizations of convergence in distribution are known. Collectively, these conditions are called the portmanteau theorem. Note that the parts of the theorem are valid for real-valued random variables, or $d$-dimensional random variables, for any $1 \le d < \infty$.

Theorem 7.15 (The Portmanteau Theorem). Let $\{X_n; X\}$ be random variables taking values in a finite-dimensional Euclidean space. The following are characterizations of $X_n \xrightarrow{\mathcal L} X$:
(a) $E(g(X_n)) \to E(g(X))$ for all bounded continuous functions $g$.
(b) $E(g(X_n)) \to E(g(X))$ for all bounded uniformly continuous functions $g$.
(c) $E(g(X_n)) \to E(g(X))$ for all bounded Lipschitz functions $g$. Here a Lipschitz function is such that $|g(x) - g(y)| \le C\|x - y\|$ for some $C$, and all $x, y$.
(d) $E(g(X_n)) \to E(g(X))$ for all continuous functions $g$ with a compact support.
(e) $\liminf P(X_n \in G) \ge P(X \in G)$ for all open sets $G$.
(f) $\limsup P(X_n \in S) \le P(X \in S)$ for all closed sets $S$.
(g) $P(X_n \in B) \to P(X \in B)$ for all (Borel) sets $B$ such that the probability of $X$ belonging to the boundary of $B$ is zero.

See Port (1994, p. 614) for proofs of various parts of this theorem.

Example 7.18. Consider $X_n \sim \mathrm{Uniform}\{\frac1n, \frac2n, \ldots, \frac{n-1}{n}, 1\}$. Then it can be shown easily that the sequence $X_n$ converges in law to the $U[0, 1]$ distribution. Consider now the function $g(x) = x^{10}$, $0 \le x \le 1$. Note that $g$ is continuous and bounded. Therefore, by part (a) of the portmanteau theorem,
$$E(g(X_n)) = \frac1n\sum_{k=1}^n\left(\frac kn\right)^{10} \to E(g(X)) = \int_0^1 x^{10}\,dx = \frac{1}{11}.$$
This can be proved by using convergence of Riemann sums to a Riemann integral. But it is interesting to see the link to convergence in distribution.

Example 7.19 (Weierstrass's Theorem). Weierstrass's theorem says that any continuous function on a closed bounded interval can be uniformly approximated by polynomials. In other words, given a continuous function $f(x)$ on a bounded interval, one can find a polynomial $p(x)$ (of a sufficiently large degree) such that $|p(x) - f(x)|$ is uniformly small. Consider the case of the unit interval; the case of a general bounded interval reduces to this case.

Here we show pointwise convergence by using the portmanteau theorem. Laws of large numbers are needed for establishing uniform approximability. We give a constructive proof. Towards this, for $n \ge 1$, $0 \le p \le 1$, and a given continuous function $g : [0, 1] \to \mathbb R$, define the sequence of Bernstein polynomials, $B_n(p) = \sum_{k=0}^n g(\frac kn)\binom nk p^k(1 - p)^{n-k}$. Note that we can think of $B_n(p)$ as $B_n(p) = E[g(\frac Xn)]$, where $X \sim \mathrm{Bin}(n, p)$. As $n \to \infty$, $\frac Xn \xrightarrow{P} p$, and it follows that $\frac Xn \xrightarrow{\mathcal L} \delta_p$, the one-point distribution concentrated at $p$ (we have already seen that convergence in probability implies convergence in distribution). Because $g$ is continuous and hence bounded, it follows from the portmanteau theorem that $B_n(p) \to g(p)$, at any $p$.

It is not hard to establish that $B_n(p) - g(p)$ converges uniformly to zero, as $n \to \infty$. Here is a sketch. As above, $X$ denotes a binomial random variable with parameters $n$ and $p$. We need to use the facts that a continuous function on $[0, 1]$ is also uniformly continuous and bounded. Thus, for any given $\epsilon > 0$, we can find $\delta > 0$ such that $|x - y| < \delta \Rightarrow |g(x) - g(y)| \le \epsilon$, and also we can find a finite $C$ such that $|g(x)| \le C$ for all $x$. So,
$$|B_n(p) - g(p)| = \left|E\left[g\left(\frac Xn\right) - g(p)\right]\right| \le E\left|g\left(\frac Xn\right) - g(p)\right|$$
$$= E\left[\left|g\left(\frac Xn\right) - g(p)\right| I_{\{|\frac Xn - p| \le \delta\}}\right] + E\left[\left|g\left(\frac Xn\right) - g(p)\right| I_{\{|\frac Xn - p| > \delta\}}\right]$$
$$\le \epsilon + 2C\,P\left(\left|\frac Xn - p\right| > \delta\right).$$
Now, in the last line, just apply Chebyshev's inequality and bound the function $p(1 - p)$ in Chebyshev's inequality by $\frac14$. It easily follows then that for all large $n$, the second term $2C\,P(|\frac Xn - p| > \delta)$ is also $\le \epsilon$, which means that for all large $n$, uniformly in $p$, $|B_n(p) - g(p)| \le 2\epsilon$.
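Bernstein polynomials are straightforward to evaluate numerically. Here is a minimal sketch (NumPy and SciPy assumed; the test function $g(x) = |x - \frac12|$, which is continuous but not differentiable at $\frac12$, and the evaluation grid are our choices) that tracks the sup-norm error $\sup_p |B_n(p) - g(p)|$ as $n$ grows.

```python
import numpy as np
from scipy.stats import binom

def bernstein(g, n, p):
    # B_n(p) = E[g(X/n)] with X ~ Bin(n, p)
    k = np.arange(n + 1)
    return np.sum(g(k / n) * binom.pmf(k, n, p))

g = lambda x: np.abs(x - 0.5)
grid = np.linspace(0.0, 1.0, 201)
for n in [10, 100, 1000]:
    err = max(abs(bernstein(g, n, p) - g(p)) for p in grid)
    print(n, round(err, 4))
```

The printed sup-norm errors decrease with $n$, consistent with the uniform approximation argument sketched above.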
The most important result on convergence in distribution is the central limit the-
orem, which we have already seen in Chapter 1. The proof of the general case is
given later in this chapter; it requires some additional development.

Theorem 7.16 (CLT). Let $X_i$, $i \ge 1$, be iid with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then
$$\frac{\sqrt n(\bar X - \mu)}{\sigma} \xrightarrow{\mathcal L} Z \sim N(0, 1).$$
We also write
$$\frac{\sqrt n(\bar X - \mu)}{\sigma} \xrightarrow{\mathcal L} N(0, 1).$$

The multidimensional central limit theorem is stated next. We show that it eas-
ily follows from the one-dimensional central limit theorem, by making use of the
Cramér–Wold theorem.
Theorem 7.17 (Multivariate CLT). Let $\mathbf X_i$, $i \ge 1$, be iid $d$-dimensional random vectors with $E(\mathbf X_1) = \mu$ and covariance matrix $\mathrm{Cov}(\mathbf X_1) = \Sigma$. Then,
$$\sqrt n(\bar{\mathbf X} - \mu) \xrightarrow{\mathcal L} N_d(0, \Sigma).$$

Remark. If $X_i$, $i \ge 1$, are iid with mean $\mu$ and variance $\sigma^2$, then the CLT in one dimension says that $\frac{S_n - n\mu}{\sigma\sqrt n} \xrightarrow{\mathcal L} N(0, 1)$, where $S_n$ is the $n$th partial sum $X_1 + \cdots + X_n$. In particular, therefore, $\frac{S_n - n\mu}{\sigma\sqrt n} = O_p(1)$. In other words, in a distributional sense, $\frac{S_n - n\mu}{\sigma\sqrt n}$ stabilizes. If we take a large $n$, then for most sample points $\omega$, $\frac{|S_n(\omega) - n\mu|}{\sigma\sqrt n}$ will be, for example, less than 4. But as $n$ changes, this collection of good sample points also changes. Indeed, any fixed sample point $\omega$ is one of the good sample points for certain values of $n$, and falls into the category of bad sample points for (many) other values of $n$. The law of the iterated logarithm says that if we fix $\omega$ and look at $\frac{|S_n(\omega) - n\mu|}{\sigma\sqrt n}$ along such unlucky values of $n$, then $\frac{S_n(\omega) - n\mu}{\sigma\sqrt n}$ will not appear to be stable. In fact, it will keep growing with $n$, although at a slow rate. Here is what the law of the iterated logarithm says.
Theorem 7.18 (Law of the Iterated Logarithm (LIL)). Let $X_i$, $i \ge 1$, be iid with mean $\mu$, variance $\sigma^2 < \infty$, and let $S_n = \sum_{i=1}^n X_i$, $n \ge 1$. Then,
(a) $\limsup_n \frac{S_n - n\mu}{\sqrt{2n\log\log n}} = \sigma$ a.s.
(b) $\liminf_n \frac{S_n - n\mu}{\sqrt{2n\log\log n}} = -\sigma$ a.s.
(c) If finite constants $a, \tau$ satisfy
$$\limsup_n \frac{S_n - na}{\sqrt{2n\log\log n}} = \tau,$$
then necessarily $\mathrm{Var}(X_1) < \infty$, and $a = E(X_1)$, $\tau^2 = \mathrm{Var}(X_1)$.


See Chow and Teicher (1988, p. 355) p for a proof. The main use of the LIL is
in proving other strong laws. Because n log log n grows at a very slow rate, the
practical use of the LIL is quite limited. We remark that the LIL provides another ex-
ample of a sequence which converges in probability (to zero), but does not converge
almost surely.

7.5 Preservation of Convergence and Statistical Applications

Akin to the results on preservation of convergence in probability and almost sure


convergence under various operations, there are similar other extremely useful re-
sults on preservation of convergence in distribution. The first theorem is of particular
importance in statistics.
7.5.1 Slutsky’s Theorem

Theorem 7.19 (Slutsky's Theorem). Let $\mathbf X_n, \mathbf Y_n$ be $d$- and $p$-dimensional random vectors for some $d, p \ge 1$. Suppose $\mathbf X_n \xrightarrow{\mathcal L} \mathbf X$, and $\mathbf Y_n \xrightarrow{P} c$. Let $h(x, y)$ be a scalar or vector-valued jointly continuous function in $(x, y) \in \mathbb R^d \times \mathbb R^p$. Then $h(\mathbf X_n, \mathbf Y_n) \xrightarrow{\mathcal L} h(\mathbf X, c)$.

Proof. We use the part of the portmanteau theorem which says that a random vector $\mathbf Z_n \xrightarrow{\mathcal L} \mathbf Z$ if $E[g(\mathbf Z_n)] \to E[g(\mathbf Z)]$ for all bounded uniformly continuous functions $g$. Now, if we simply repeat the proof of the uniform convergence of the Bernstein polynomials in our example on Weierstrass's theorem, the result is obtained.
The following are some particularly important consequences of Slutsky’s
theorem.
Corollary. (a) Suppose $X_n \xrightarrow{\mathcal L} X$, $Y_n \xrightarrow{P} c$, where $X_n, Y_n$ are of the same dimension. Then $X_n + Y_n \xrightarrow{\mathcal L} X + c$.
(b) Suppose $X_n \xrightarrow{\mathcal L} X$, $Y_n \xrightarrow{P} c$, where the $Y_n$ are scalar random variables. Then $Y_nX_n \xrightarrow{\mathcal L} cX$.
(c) Suppose $X_n \xrightarrow{\mathcal L} X$, $Y_n \xrightarrow{P} c \ne 0$, where the $Y_n$ are scalar random variables. Then $\frac{X_n}{Y_n} \xrightarrow{\mathcal L} \frac Xc$.

Example 7.20 (Convergence of the $t$ to Normal). Let $X_i$, $i \ge 1$, be iid $N(\mu, \sigma^2)$, $\sigma > 0$, and let $T_n = \frac{\sqrt n(\bar X - \mu)}{s}$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$, namely the sample variance. We saw in Chapter 15 that $T_n$ has the central $t(n-1)$ distribution. Write
$$T_n = \frac{\sqrt n(\bar X - \mu)/\sigma}{s/\sigma}.$$
We have seen that $s^2 \xrightarrow{a.s.} \sigma^2$. Therefore, by the continuous mapping theorem, $s \xrightarrow{a.s.} \sigma$, and so $\frac s\sigma \xrightarrow{a.s.} 1$. On the other hand, by the central limit theorem, $\frac{\sqrt n(\bar X - \mu)}{\sigma} \xrightarrow{\mathcal L} N(0, 1)$. Therefore, now by Slutsky's theorem, $T_n \xrightarrow{\mathcal L} N(0, 1)$.

Indeed, this argument shows that whatever the common distribution of the $X_i$ is, if $0 < \sigma^2 = \mathrm{Var}(X_1) < \infty$, then $T_n \xrightarrow{\mathcal L} N(0, 1)$, although the exact distribution of $T_n$ is no longer the central $t$ distribution, unless the common distribution of the $X_i$ is normal.
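The robustness claim at the end of this example can be checked by simulation. In the sketch below (NumPy; Exponential(1) data, for which $\mu = 1$, with the sample size, replication count, and seed being our choices), the exact $t(n-1)$ law no longer applies, yet the distribution of $T_n$ is already close to $N(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 50, 20_000

x = rng.exponential(1.0, size=(reps, n))        # iid Exp(1), so mu = 1
t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)
# compare with Phi(1.645) = 0.95 for the N(0,1) limit
print("P(T_n <= 1.645) is approximately", np.mean(t <= 1.645))
```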
Example 7.21 (A Normal–Cauchy Connection). Consider iid standard normal variables $X_1, X_2, \ldots, X_{2n}$, $n \ge 1$. Let
$$R_n = \frac{X_1}{X_2} + \frac{X_3}{X_4} + \cdots + \frac{X_{2n-1}}{X_{2n}}, \quad \text{and} \quad D_n = X_1^2 + X_2^2 + \cdots + X_n^2.$$
Let $T_n = \frac{R_n}{D_n}$.
Recall that the quotient of two independent standard normals is distributed as a standard Cauchy. Thus,
$$\frac{X_1}{X_2}, \frac{X_3}{X_4}, \ldots, \frac{X_{2n-1}}{X_{2n}}$$
are independent standard Cauchy variables. In the following, we write $C_n$ to denote a random variable with a Cauchy distribution with location parameter zero, and scale parameter $n$. From our results on convolutions, we know that the sum of $n$ independent standard Cauchy random variables is distributed as $C_n$; the scale parameter is $n$. Thus, $R_n \overset{\mathcal L}{=} C_n \sim C(0, n) \overset{\mathcal L}{=} nC(0, 1)$. Therefore,
$$T_n \overset{\mathcal L}{=} \frac{nC_1}{\sum_{i=1}^n X_i^2} = \frac{C_1}{\frac1n\sum_{i=1}^n X_i^2}.$$
Now, by the WLLN, $\frac1n\sum_{i=1}^n X_i^2 \xrightarrow{P} E(X_1^2) = 1$. A sequence of random variables with a distribution identically equal to the fixed $C(0, 1)$ distribution also, tautologically, converges in distribution to $C_1$; thus, by applying Slutsky's theorem, we conclude that $T_n \xrightarrow{\mathcal L} C_1 \sim C(0, 1)$.

7.5.2 Delta Theorem

The next theorem says that convergence in distribution is appropriately preserved by making smooth transformations. In particular, we present a general version of a theorem of fundamental use in statistics, called the delta theorem.

Theorem 7.20 (Continuous Mapping Theorem). (a) Let $\mathbf X_n$ be $d$-dimensional random vectors and let $S \subseteq \mathbb R^d$ be such that $P(\mathbf X_n \in S) = 1$ for all $n$. Suppose $\mathbf X_n \xrightarrow{\mathcal L} \mathbf X$. Let $g : S \to \mathbb R^p$ be a continuous function, where $p$ is a positive integer. Then $g(\mathbf X_n) \xrightarrow{\mathcal L} g(\mathbf X)$.

(b) (Delta Theorem of Cramér). Let $\mathbf X_n$ be $d$-dimensional random vectors and let $S \subseteq \mathbb R^d$ be such that $P(\mathbf X_n \in S) = 1$ for all $n$. Suppose for some $d$-dimensional vector $\theta$, and some sequence of reals $c_n \to \infty$, $c_n(\mathbf X_n - \theta) \xrightarrow{\mathcal L} \mathbf X$. Let $g : S \to \mathbb R^p$ be a function with each coordinate of $g$ once continuously differentiable with respect to every coordinate of $x \in S$ at $x = \theta$. Then
$$c_n(g(\mathbf X_n) - g(\theta)) \xrightarrow{\mathcal L} Dg(\theta)\mathbf X,$$
where $Dg(\theta)$ is the matrix of partial derivatives $\left(\left(\frac{\partial g_i}{\partial x_j}\right)\right)\Big|_{x=\theta}$.

Proof. For part (a), we use the portmanteau theorem. Denote $g(\mathbf X_n) = \mathbf Y_n$, $g(\mathbf X) = \mathbf Y$, and consider bounded continuous functions $f(\mathbf Y_n)$. Now, $f(\mathbf Y_n) = f(g(\mathbf X_n)) = h(\mathbf X_n)$, where $h(\cdot)$ is the composition function $f(g(\cdot))$. Because $h$ is continuous (because $f, g$ are), and $h$ is bounded (because $f$ is), the portmanteau theorem implies that $E(h(\mathbf X_n)) \to E(h(\mathbf X))$, that is, $E(f(\mathbf Y_n)) \to E(f(\mathbf Y))$. Now the reverse implication in the portmanteau theorem implies that $\mathbf Y_n \xrightarrow{\mathcal L} \mathbf Y$.

We prove part (b) for the case $d = p = 1$. First note that it follows from the assumption $c_n \to \infty$ that $X_n - \theta = o_p(1)$. Also, by an application of Taylor's theorem,
$$g(x_0 + h) = g(x_0) + hg'(x_0) + o(h)$$
if $g$ is differentiable at $x_0$. Therefore,
$$g(X_n) = g(\theta) + (X_n - \theta)g'(\theta) + o_p(X_n - \theta).$$
That the remainder term is $o_p(X_n - \theta)$ follows from our observation that $X_n - \theta = o_p(1)$. Taking $g(\theta)$ to the left and multiplying by $c_n$, we obtain
$$c_n[g(X_n) - g(\theta)] = c_n(X_n - \theta)g'(\theta) + c_n\,o_p(X_n - \theta).$$
The term $c_n\,o_p(X_n - \theta) = o_p(1)$, because $c_n(X_n - \theta) = O_p(1)$. Hence, by an application of Slutsky's theorem, $c_n[g(X_n) - g(\theta)] \xrightarrow{\mathcal L} g'(\theta)X$.

Example 7.22 (A Quadratic Form). Let $X_i$, $i \ge 1$, be iid random variables with finite mean $\mu$ and finite variance $\sigma^2$. By the central limit theorem, $\frac{\sqrt n(\bar X - \mu)}{\sigma} \xrightarrow{\mathcal L} Z$, where $Z \sim N(0, 1)$. Therefore, by the continuous mapping theorem, if $Q_n = \frac{n}{\sigma^2}(\bar X - \mu)^2$, then
$$Q_n = \left(\frac{\sqrt n(\bar X - \mu)}{\sigma}\right)^2 \xrightarrow{\mathcal L} Z^2.$$
But $Z^2 \sim \chi_1^2$. Therefore, $Q_n \xrightarrow{\mathcal L} \chi_1^2$.

Example 7.23 (An Important Statistics Example). Let $X = X_n \sim \mathrm{Bin}(n, p)$, $n \ge 1$, $0 < p < 1$. In statistics, $p$ is generally treated as an unknown parameter, and the usual estimate of $p$ is $\hat p = \frac Xn$. Define $T_n = \left|\frac{\sqrt n(\hat p - p)}{\sqrt{\hat p(1 - \hat p)}}\right|$. The goal of this example is to find the limiting distribution of $T_n$. First, by the central limit theorem,
$$\frac{\sqrt n(\hat p - p)}{\sqrt{p(1 - p)}} = \frac{X - np}{\sqrt{np(1 - p)}} \xrightarrow{\mathcal L} N(0, 1).$$
Next, by the WLLN, $\hat p \xrightarrow{P} p$, and hence by the continuous mapping theorem for convergence in probability, $\sqrt{\hat p(1 - \hat p)} \xrightarrow{P} \sqrt{p(1 - p)}$. This gives, by Slutsky's theorem, $\frac{\sqrt n(\hat p - p)}{\sqrt{\hat p(1 - \hat p)}} \xrightarrow{\mathcal L} N(0, 1)$. Finally, because the absolute value function is continuous, by the continuous mapping theorem for convergence in distribution,
$$T_n = \left|\frac{\sqrt n(\hat p - p)}{\sqrt{\hat p(1 - \hat p)}}\right| \xrightarrow{\mathcal L} |Z|,$$
the absolute value of a standard normal.
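A short simulation (a sketch; NumPy, with $n$, $p$, the replication count, and the seed our choices) confirms the half-normal limit: the proportion of simulated $T_n$ values at most $1.96$ should be close to $P(|Z| \le 1.96) = 0.95$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, reps = 500, 0.3, 50_000

x = rng.binomial(n, p, size=reps)
phat = x / n
tn = np.abs(np.sqrt(n) * (phat - p) / np.sqrt(phat * (1 - phat)))
# compare with the half-normal limit: P(|Z| <= 1.96) = 0.95
print("P(T_n <= 1.96) is approximately", np.mean(tn <= 1.96))
```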


Example 7.24. Suppose $X_i$, $i \ge 1$, are iid with mean $\mu$ and variance $\sigma^2 < \infty$. By the central limit theorem, $\frac{\sqrt n(\bar X - \mu)}{\sigma} \xrightarrow{\mathcal L} Z$, where $Z \sim N(0, 1)$. Consider the function $g(x) = x^2$. This is continuously differentiable, in fact at any $x$, and $g'(x) = 2x$. If $\mu \ne 0$, then $g'(\mu) = 2\mu \ne 0$. By the delta theorem, we get that $\sqrt n(\bar X^2 - \mu^2) \xrightarrow{\mathcal L} N(0, 4\mu^2\sigma^2)$. If $\mu = 0$, this last statement is still true, and it then means that $\sqrt n\bar X^2 \xrightarrow{P} 0$, if $\mu = 0$.
Example 7.25 (Sample Variance and Standard Deviation). Suppose again $X_i$, $i \ge 1$, are iid with mean $\mu$, variance $\sigma^2$, and $E(X_1^4) < \infty$. Also let $\mu_j = E(X_1 - \mu)^j$, $1 \le j \le 4$. This example has $d = 2$, $p = 1$. Take
$$\mathbf X_n = \begin{pmatrix} \bar X \\ \frac1n\sum_{i=1}^n X_i^2 \end{pmatrix}, \quad \theta = \begin{pmatrix} EX_1 \\ EX_1^2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_1^2) \\ \mathrm{Cov}(X_1, X_1^2) & \mathrm{Var}(X_1^2) \end{pmatrix}.$$
By the multivariate central limit theorem, $\sqrt n(\mathbf X_n - \theta) \xrightarrow{\mathcal L} N(0, \Sigma)$. Now consider the function $g(u, v) = v - u^2$. This is once continuously differentiable with respect to each of $u, v$ (in fact at any $u, v$), and the partial derivatives are $g_u = -2u$, $g_v = 1$. Using the delta theorem, with a little bit of matrix algebra calculations, it follows that
$$\sqrt n\left(\frac1n\sum_{i=1}^n (X_i - \bar X)^2 - \mathrm{Var}(X_1)\right) \xrightarrow{\mathcal L} N(0, \mu_4 - \sigma^4).$$
If we choose $s_n^2 = \sum_{i=1}^n (X_i - \bar X)^2/(n - 1)$, then
$$\sqrt n(s_n^2 - \sigma^2) = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{(n - 1)\sqrt n} + \sqrt n\left(\frac1n\sum_{i=1}^n (X_i - \bar X)^2 - \sigma^2\right)$$
$$= \sqrt n\left(\frac1n\sum_{i=1}^n (X_i - \bar X)^2 - \sigma^2\right) + o_p(1),$$
which also converges in law to $N(0, \mu_4 - \sigma^4)$ by Slutsky's theorem. By another use of the delta theorem, this time with $d = p = 1$ and with the function $g(u) = \sqrt u$, one gets
$$\sqrt n(s_n - \sigma) \xrightarrow{\mathcal L} N\left(0, \frac{\mu_4 - \sigma^4}{4\sigma^2}\right).$$

Example 7.26 (Sample Correlation). Another use of the delta theorem is the derivation of the limiting distribution of the sample correlation coefficient $r$ for iid bivariate data $(X_i, Y_i)$. We have
$$r_n = \frac{\frac1n\sum X_iY_i - \bar X\bar Y}{\sqrt{\frac1n\sum (X_i - \bar X)^2}\,\sqrt{\frac1n\sum (Y_i - \bar Y)^2}}.$$
By taking
$$T_n = \left(\bar X, \bar Y, \frac1n\sum X_i^2, \frac1n\sum Y_i^2, \frac1n\sum X_iY_i\right), \quad \theta = \left(EX_1, EY_1, EX_1^2, EY_1^2, EX_1Y_1\right),$$
and by taking $\Sigma$ to be the covariance matrix of $(X_1, Y_1, X_1^2, Y_1^2, X_1Y_1)$, and on using the transformation $g(u_1, u_2, u_3, u_4, u_5) = (u_5 - u_1u_2)/\sqrt{(u_3 - u_1^2)(u_4 - u_2^2)}$, it follows from the delta theorem, with $d = 5$, $p = 1$, that
$$\sqrt n(r_n - \rho) \xrightarrow{\mathcal L} N(0, v^2)$$
for some $v > 0$. It is not possible to write a clean formula for $v$ in general. If the $(X_i, Y_i)$ are iid $N_2(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho)$, then the calculation of $v^2$ can be done in closed form, and
$$\sqrt n(r_n - \rho) \xrightarrow{\mathcal L} N(0, (1 - \rho^2)^2).$$
However, convergence to normality is very slow.

7.5.3 Variance Stabilizing Transformations

A major use of the delta theorem is the construction of variance-stabilizing transformations (VST), a technique of fundamental use in statistics. VSTs are useful tools for constructing confidence intervals for unknown parameters. The general idea is the following. Suppose we want to find a confidence interval for some parameter $\theta \in \mathbb R$. If $T_n = T_n(X_1, \ldots, X_n)$ is some natural estimate for $\theta$ (e.g., sample mean as an estimate of a population mean), then often the CLT, or some generalization of the CLT, will tell us that $\sqrt n(T_n - \theta) \xrightarrow{\mathcal L} N(0, \sigma^2(\theta))$, for some suitable function $\sigma^2(\theta)$. This implies that in large samples,
$$P\left(T_n - z_{\alpha/2}\frac{\sigma(\theta)}{\sqrt n} \le \theta \le T_n + z_{\alpha/2}\frac{\sigma(\theta)}{\sqrt n}\right) \approx 1 - \alpha,$$
where $\alpha$ is some specified number in $(0, 1)$ and $z_{\alpha/2} = \Phi^{-1}(1 - \frac\alpha2)$. Finally, plugging in $T_n$ in place of $\theta$ in $\sigma(\cdot)$, a confidence interval for $\theta$ is $T_n \pm z_{\alpha/2}\frac{\sigma(T_n)}{\sqrt n}$. The delta theorem provides an alternative solution that is sometimes preferred. By the delta theorem, if $g(\cdot)$ is once differentiable at $\theta$ with $g'(\theta) \ne 0$, then
$$\sqrt n\,(g(T_n) - g(\theta)) \xrightarrow{\mathcal L} N(0, [g'(\theta)]^2\sigma^2(\theta)).$$
Therefore, if we set
$$[g'(\theta)]^2\sigma^2(\theta) \equiv k^2$$
for some constant $k$, then $\sqrt n\,(g(T_n) - g(\theta)) \xrightarrow{\mathcal L} N(0, k^2)$, and this produces a confidence interval for $g(\theta)$:
$$g(T_n) \pm z_{\alpha/2}\frac{k}{\sqrt n}.$$
By retransforming back to $\theta$, we get another confidence interval for $\theta$:
$$\left[g^{-1}\left(g(T_n) - z_{\alpha/2}\frac{k}{\sqrt n}\right),\ g^{-1}\left(g(T_n) + z_{\alpha/2}\frac{k}{\sqrt n}\right)\right].$$
The reason that this one is sometimes preferred to the first confidence interval, namely $T_n \pm z_{\alpha/2}\frac{\sigma(T_n)}{\sqrt n}$, is that no additional plug-in is necessary to estimate the penultimate variance function in this second confidence interval. The penultimate variance function is already a constant $k^2$ by choice in this second method. The transformation $g(T_n)$ obtained from its defining property $[g'(\theta)]^2\sigma^2(\theta) \equiv k^2$ has the expression
$$g(\theta) = k\int\frac{1}{\sigma(\theta)}\,d\theta,$$
where the integral is to be interpreted as a primitive. The constant $k$ can be chosen as any nonzero real number, and $g(T_n)$ is called a variance-stabilizing transformation. Although the delta theorem is certainly available in $\mathbb R^d$ even when $d > 1$, unfortunately the concept of VSTs does not generalize to multiparameter cases. It is generally infeasible to find a dispersion-stabilizing transformation when the dimension of $\theta$ is more than one. This example is a beautiful illustration of how probability theory leads to useful and novel statistical techniques.
Example 7.27 (VST in the Binomial Case). Suppose $X_n \sim \mathrm{Bin}(n, p)$. Then $\sqrt n\left(\frac{X_n}{n} - p\right) \xrightarrow{\mathcal L} N(0, p(1 - p))$. So, using the notation used above, $\sigma(p) = \sqrt{p(1 - p)}$, and consequently, on taking $k = \frac12$,
$$g(p) = \int\frac{1/2}{\sqrt{p(1 - p)}}\,dp = \arcsin(\sqrt p).$$
Hence, $g(X_n) = \arcsin\left(\sqrt{X_n/n}\right)$ is a variance-stabilizing transformation, and indeed,
$$\sqrt n\left(\arcsin\left(\sqrt{\frac{X_n}{n}}\right) - \arcsin(\sqrt p)\right) \xrightarrow{\mathcal L} N\left(0, \frac14\right).$$
Thus, a confidence interval for $p$ is
$$\sin^2\left(\arcsin\left(\sqrt{\frac{X_n}{n}}\right) \mp \frac{z_{\alpha/2}}{2\sqrt n}\right).$$
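In code, the retransformed interval is nearly a one-liner apart from boundary care. Here is a minimal sketch (NumPy and SciPy assumed; the helper name `arcsine_ci` and the clamping of the endpoints to $[0, \pi/2]$, which keeps the retransformation monotone near $p = 0$ or $p = 1$, are our choices, not from the text):

```python
import numpy as np
from scipy.stats import norm

def arcsine_ci(x, n, alpha=0.05):
    # variance-stabilized confidence interval for a binomial proportion p
    z = norm.ppf(1 - alpha / 2)
    g = np.arcsin(np.sqrt(x / n))               # g evaluated at p-hat = x/n
    half = z / (2 * np.sqrt(n))
    lo = np.sin(max(g - half, 0.0)) ** 2
    hi = np.sin(min(g + half, np.pi / 2)) ** 2
    return lo, hi

print(arcsine_ci(x=37, n=100))                  # e.g., 37 successes in 100 trials
```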

Example 7.28 (Fisher's $z$). Suppose $(X_i, Y_i)$, $i = 1, \ldots, n$, are iid bivariate normal with parameters $\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho$. Then, as we saw above, $\sqrt n(r_n - \rho) \xrightarrow{\mathcal L} N(0, (1 - \rho^2)^2)$, $r_n$ being the sample correlation coefficient. Therefore,
$$g(\rho) = \int\frac{1}{1 - \rho^2}\,d\rho = \frac12\log\frac{1 + \rho}{1 - \rho} = \mathrm{arctanh}(\rho)$$
provides a variance-stabilizing transformation for $r_n$. This is the famous arctanh transformation of Fisher, popularly known as Fisher's $z$. By the delta theorem, $\sqrt n(\mathrm{arctanh}(r_n) - \mathrm{arctanh}(\rho))$ converges in distribution to the $N(0, 1)$ distribution. Confidence intervals for $\rho$ are computed from Fisher's $z$ as
$$\tanh\left(\mathrm{arctanh}(r_n) \pm \frac{z_{\alpha/2}}{\sqrt n}\right).$$
The arctanh transformation $z = \mathrm{arctanh}(r_n)$ attains approximate normality much more quickly than $r_n$ itself.
Example 7.29 (An Unusual VST). Here is a nonregular example of variance stabilization. Suppose we have iid observations $X_1, X_2, \ldots$ from the $U[0, \theta]$ distribution. Then, the usual estimate of $\theta$ is the sample maximum $X_{(n)}$, and $n(\theta - X_{(n)}) \xrightarrow{\mathcal L} \mathrm{Exp}(\theta)$. The asymptotic variance function in the distribution of the sample maximum is therefore simply $\theta^2$, and therefore a VST is
$$g(\theta) = \int\frac{1}{\theta}\,d\theta = \log\theta.$$
So, $g(X_{(n)}) = \log X_{(n)}$ is a variance-stabilizing transformation of $X_{(n)}$. In fact, $n(\log\theta - \log X_{(n)}) \xrightarrow{\mathcal L} \mathrm{Exp}(1)$. However, the interesting fact is that for every $n$, the distribution of $n(\log\theta - \log X_{(n)})$ is exactly a standard exponential. There is no nontrivial example such as this in the regular cases (although $N(\mu, 1)$ is a trivial example).

7.6 Convergence of Moments

If some sequence of random variables $X_n$ converges in distribution to a random variable $X$, then sometimes we are interested in knowing whether moments of $X_n$ converge to moments of $X$. More generally, we may want to find approximations for moments of $X_n$. Convergence in distribution just by itself cannot ensure convergence of any moment. An extra condition that ensures convergence of appropriate moments is uniform integrability. There is another side of this story. If we can show that the moments of $X_n$ converge to moments of some recognizable distribution, then we can sometimes show that $X_n$ converges in distribution to that distinguished distribution. Some of these issues are discussed in this section.

7.6.1 Uniform Integrability

Definition 7.12. Let $\{X_n\}$ be a sequence of random variables on some common sample space $\Omega$. The sequence $\{X_n\}$ is called uniformly integrable if $\sup_{n\ge1} E(|X_n|) < \infty$, and if for any $\epsilon > 0$ there exists a sufficiently small $\delta > 0$ such that whenever $P(A) < \delta$, $\sup_{n\ge1}\int_A |X_n|\,dP < \epsilon$.
Remark. We give two results on the link between uniform integrability and convergence of moments.

Theorem 7.21. Suppose $X, X_n$, $n \ge 1$, are such that $E(|X|^p) < \infty$, and $E(|X_n|^p) < \infty$ for all $n \ge 1$. Suppose $X_n \xrightarrow{P} X$, and $|X_n|^p$ is uniformly integrable. Then $E(|X_n - X|^p) \to 0$ as $n \to \infty$.

For proving this theorem, we need two lemmas. The first one is one of the most fundamental results in real analysis. It uses the terminology of Lebesgue integrals and the Lebesgue measure, which we are not treating in this book. Thus, the statement below uses an undefined concept.

Lemma (Dominated Convergence Theorem). Let $f_n, f$ be functions on $\mathbb R^d$, $d < \infty$, and suppose $f$ and each $f_n$ is (Lebesgue) integrable. Suppose $f_n(x) \to f(x)$, except possibly for a set of $x$ values of Lebesgue measure zero. If $|f_n| \le g$ for some integrable function $g$, then $\int_{\mathbb R^d} f_n(x)\,dx \to \int_{\mathbb R^d} f(x)\,dx$, as $n \to \infty$.

Lemma. Suppose $X, X_n$, $n \ge 1$, are such that $E(|X|^p) < \infty$, and $E(|X_n|^p) < \infty$ for all $n \ge 1$. Then $|X_n|^p$ is uniformly integrable if and only if $|X_n - X|^p$ is uniformly integrable.
Proof of Theorem. Fix $c > 0$, and define $Y_n = |X_n - X|^p$, $Y_{n,c} = Y_nI_{\{|X_n - X| \le c\}}$. Because, by hypothesis, $X_n \xrightarrow{P} X$, by the continuous mapping theorem, $Y_n \xrightarrow{P} 0$, and as a consequence, $Y_{n,c} \xrightarrow{P} 0$. Furthermore, $|Y_{n,c}| \le c^p$, and the dominated convergence theorem implies that $E(Y_{n,c}) \to 0$. Now,
$$E(|X_n - X|^p) = E(Y_n) = E(Y_{n,c}) + E(Y_nI_{\{|X_n - X| > c\}}) \le E(Y_{n,c}) + \sup_{n\ge1} E(Y_nI_{\{|X_n - X| > c\}})$$
$$\Rightarrow \limsup_n E(|X_n - X|^p) \le \sup_{n\ge1} E(Y_nI_{\{|X_n - X| > c\}})$$
$$\Rightarrow \limsup_n E(|X_n - X|^p) \le \inf_c \sup_{n\ge1} E(Y_nI_{\{|X_n - X| > c\}}) = 0,$$
the last equality by the uniform integrability of $|X_n - X|^p$. Therefore, $E(|X_n - X|^p) \to 0$.
Remark. Sometimes we do not need the full force of the result that $E(|X_n - X|^p) \to 0$; all we want is that $E(X_n^p)$ converges to $E(X^p)$. In that case, the conditions in the previous theorem can be relaxed, and in fact from a statistical point of view the relaxed conditions are much more natural. The following result gives the relaxed conditions.

Theorem 7.22. Suppose $X_n, X$, $n \ge 1$, are defined on a common sample space $\Omega$, that $X_n \xrightarrow{\mathcal L} X$, and that for some given $p \ge 1$, $|X_n|^p$ is uniformly integrable. Then $E(X_n^k) \to E(X^k)$ for all $k \le p$.

Remark. To apply these last two theorems, we have to verify that for the appropriate sequence $X_n$, and for the relevant $p$, $|X_n|^p$ is uniformly integrable. Direct verification of uniform integrability from the definition is often cumbersome. But simple sufficient conditions are available, and these are often satisfied in many applications. The next result lists a few useful sufficient conditions.
Theorem 7.23 (Sufficient Conditions for Uniform Integrability).
(a) Suppose for some $\delta > 0$, $\sup_n E|X_n|^{1+\delta} < \infty$. Then $\{X_n\}$ is uniformly integrable.
(b) If $|X_n| \le Y$, $n \ge 1$, and $E(Y) < \infty$, then $\{X_n\}$ is uniformly integrable.
(c) If $|X_n| \le Y_n$, $n \ge 1$, and $Y_n$ is uniformly integrable, then $\{X_n\}$ is uniformly integrable.
(d) If $X_n$, $n \ge 1$, are identically distributed, and $E(|X_1|) < \infty$, then $\{X_n\}$ is uniformly integrable.
(e) If $\{X_n\}$ and $\{Y_n\}$ are uniformly integrable, then $\{X_n + Y_n\}$ is uniformly integrable.
(f) If $\{X_n\}$ is uniformly integrable and $|Y_n| \le M < \infty$, then $\{X_nY_n\}$ is uniformly integrable.

See Chow and Teicher (1988, p. 94) for further details on the various parts of this theorem.
Example 7.30 (Sample Maximum). We saw in Chapter 6 that if $X_1, X_2, \ldots$ are iid, and if $E(|X_1|^k) < \infty$, then any order statistic $X_{(r)}$ satisfies
$$E(|X_{(r)}|^k) \le \frac{n!}{(r-1)!\,(n-r)!}E(|X_1|^k).$$
In particular, for the sample maximum $X_{(n)}$ of $n$ observations,
$$E(|X_{(n)}|) \le nE(|X_1|) \Rightarrow E\left(\frac{|X_{(n)}|}{n}\right) \le E(|X_1|).$$
By itself, this does not ensure that $\frac{|X_{(n)}|}{n}$ is uniformly integrable. However, if we also assume that $E(X_1^2) < \infty$, then the same argument gives $E(|X_{(n)}|^2) \le nE(X_1^2)$, so that $\sup_n E\left(\frac{|X_{(n)}|}{n}\right)^2 < \infty$, which is enough to conclude that $\frac{|X_{(n)}|}{n}$ is uniformly integrable.
However, we do not need the existence of $E(X_1^2)$ for this conclusion. Note that
$$|X_{(n)}| \le \sum_{i=1}^n |X_i| \Rightarrow \frac{|X_{(n)}|}{n} \le \frac{\sum_{i=1}^n |X_i|}{n}.$$
If $E(|X_1|) < \infty$, then in fact $\frac{\sum_{i=1}^n |X_i|}{n}$ is uniformly integrable, and as a consequence, $\frac{|X_{(n)}|}{n}$ is also uniformly integrable under just the condition $E(|X_1|) < \infty$.

7.6.2 The Moment Problem and Convergence in Distribution

We remarked earlier that convergence of moments can be useful to establish convergence in distribution. Clearly, however, if we only know that $E(X_n^k)$ converges to $E(X^k)$ for each $k$, from that alone we cannot conclude that the distributions of $X_n$ converge to the distribution of $X$. This is because there could, in general, be another random variable $Y$ with a distribution distinct from that of $X$ but with all moments equal to the moments of $X$. However, if we rule out that possibility, then convergence in distribution follows.

Theorem 7.24. Suppose for some sequence $\{X_n\}$ and a random variable $X$, $E(X_n^k) \to E(X^k)$ for all $k \ge 1$. If the distribution of $X$ is determined by its moments, then $X_n \xrightarrow{\mathcal L} X$.

When is a distribution determined by its sequence of moments? This is a hard analytical problem, and is commonly known as the moment problem. There is a huge and sophisticated literature on the moment problem. A few easily understood conditions for determinacy by moments are given in the next result.

Theorem 7.25. (a) If a random variable $X$ is uniformly bounded, then it is determined by its moments.
(b) If the mgf of a random variable $X$ exists in a nonempty interval containing zero, then it is determined by its moments.
(c) Let $X$ have a density function $f(x)$. If there exist positive constants $c, \alpha, \beta, k$ such that $f(x) \le c\,e^{-\alpha|x|}|x|^\beta$ for all $x$ such that $|x| > k$, then $X$ is determined by its moments.

Remark. See Feller (1971, pp. 227–228 and p. 251) for the previous two theorems.
Basically, part (b) is the primary result here, because if the conditions in (a) or (c)
hold, then the mgf exists in an interval containing zero. However, (a) and (c) are
useful special sufficient conditions.

Example 7.31 (Discrete Uniforms Converging to Continuous Uniform). Consider random variables $X_n$ with the discrete uniform distribution on $\{\frac1n, \frac2n, \ldots, 1\}$. Fix a positive integer $k$. Then $E(X_n^k) = \frac1n\sum_{i=1}^n\left(\frac in\right)^k$. This is the upper Riemann sum corresponding to the partition

$$\left[\frac{i-1}{n}, \frac{i}{n}\right], \quad i = 1, 2, \ldots, n,$$

for the function $f(x) = x^k$ on $(0, 1]$. Therefore, as $n \to \infty$, $E(X_n^k)$, which is the upper Riemann sum, converges to $\int_0^1 x^k\,dx$, which is the $k$th moment of a random variable $X$ having the uniform distribution on the unit interval. Because $k$ is arbitrary, it follows from part (a) of the above theorem that the discrete uniform distribution on $\{\frac1n, \frac2n, \ldots, 1\}$ converges to the uniform distribution on the unit interval.
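The moment convergence in this example is easy to watch numerically (an illustrative sketch, not part of the text): $E(X_n^k)$ should approach $\int_0^1 x^k\,dx = \frac{1}{k+1}$.

```python
import numpy as np

for n in [5, 50, 500]:
    x = np.arange(1, n + 1) / n          # support of the discrete uniform
    for k in [1, 2, 3]:
        # E(X_n^k) is the upper Riemann sum of x^k; the limit is 1/(k+1)
        print(n, k, np.mean(x ** k), 1 / (k + 1))
```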

7.6.3 Approximation of Moments

Knowing the limiting value of a moment of some sequence of random variables gives only a first-order approximation to the moment. Sometimes we want more refined approximations. Suppose $X_i$, $1 \le i \le d$, are jointly distributed random variables, and $T_d(X_1, X_2, \ldots, X_d)$ is some function of $X_1, X_2, \ldots, X_d$. To find approximations for a moment of $T_d$, one commonly used technique is to approximate the function $T_d(x_1, x_2, \ldots, x_d)$ by a simpler function, say $g_d(x_1, x_2, \ldots, x_d)$, and then use the moment of $g_d(X_1, X_2, \ldots, X_d)$ as an approximation to the moment of $T_d(X_1, X_2, \ldots, X_d)$. Note that this is a formal approximation: it does not come with an automatic quantification of the error of the approximation. Such quantification is usually a harder problem, and limited answers are available. We address these two issues in this section. We consider approximation of the mean and variance of a statistic, because of their special importance.

A natural approximation of a smooth function is obtained by expanding the function around some point in a Taylor series. For the formal approximations, we assume that all the moments of the $X_i$ that are necessary for the approximation to make sense do exist.

It is natural to expand the statistic around the mean vector $\mu = (\mu_1, \ldots, \mu_d)$ of $X = (X_1, \ldots, X_d)$. For notational simplicity, we write $t$ for $T_d$, and write $t_i, t_{ij}$ for the first and second partial derivatives of $t$. Then the first- and second-order Taylor expansions for $t(X_1, X_2, \ldots, X_d)$ are:

$$t(x_1, x_2, \ldots, x_d) \approx t(\mu_1, \ldots, \mu_d) + \sum_{i=1}^d (x_i - \mu_i)t_i(\mu_1, \ldots, \mu_d),$$

and

$$t(x_1, x_2, \ldots, x_d) \approx t(\mu_1, \ldots, \mu_d) + \sum_{i=1}^d (x_i - \mu_i)t_i(\mu_1, \ldots, \mu_d) + \frac12\sum_{1 \le i,j \le d}(x_i - \mu_i)(x_j - \mu_j)t_{ij}(\mu_1, \ldots, \mu_d).$$
If we formally take an expectation on both sides, we get the first- and second-order approximations to $E[T_d(X_1, X_2, \ldots, X_d)]$:

$$E[T_d(X_1, X_2, \ldots, X_d)] \approx T_d(\mu_1, \ldots, \mu_d),$$

and

$$E[T_d(X_1, X_2, \ldots, X_d)] \approx T_d(\mu_1, \ldots, \mu_d) + \frac12\sum_{1 \le i,j \le d}t_{ij}(\mu_1, \ldots, \mu_d)\,\sigma_{ij},$$

where $\sigma_{ij}$ is the covariance between $X_i$ and $X_j$.


Consider now the variance approximation problem. From the first-order Taylor approximation

$$t(x_1, x_2, \ldots, x_d) \approx t(\mu_1, \ldots, \mu_d) + \sum_{i=1}^d (x_i - \mu_i)t_i(\mu_1, \ldots, \mu_d),$$

by formally taking the variance of both sides, we get the first-order variance approximation

$$\mathrm{Var}(T_d(X_1, X_2, \ldots, X_d)) \approx \mathrm{Var}\left(\sum_{i=1}^d (X_i - \mu_i)t_i(\mu_1, \ldots, \mu_d)\right) = \sum_{1 \le i,j \le d}t_i(\mu_1, \ldots, \mu_d)t_j(\mu_1, \ldots, \mu_d)\,\sigma_{ij}.$$

The second-order variance approximation takes more work. By using the second-order Taylor approximation for $t(x_1, x_2, \ldots, x_d)$, the second-order variance approximation is

$$\mathrm{Var}(T_d(X_1, \ldots, X_d)) \approx \mathrm{Var}\left(\sum_i (X_i - \mu_i)t_i(\mu_1, \ldots, \mu_d)\right) + \frac14\mathrm{Var}\left(\sum_{i,j}(X_i - \mu_i)(X_j - \mu_j)t_{ij}(\mu_1, \ldots, \mu_d)\right)$$
$$+\, \mathrm{Cov}\left(\sum_i (X_i - \mu_i)t_i(\mu_1, \ldots, \mu_d),\ \sum_{j,k}(X_j - \mu_j)(X_k - \mu_k)t_{jk}(\mu_1, \ldots, \mu_d)\right).$$

If we denote $E(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k) = m_{3,ijk}$ and $E(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) = m_{4,ijkl}$, then the second-order variance approximation becomes

$$\mathrm{Var}(T_d(X_1, \ldots, X_d)) \approx \sum_{i,j}t_i(\mu_1, \ldots, \mu_d)t_j(\mu_1, \ldots, \mu_d)\,\sigma_{ij} + \sum_{i,j,k}t_i(\mu_1, \ldots, \mu_d)t_{jk}(\mu_1, \ldots, \mu_d)\,m_{3,ijk}$$
$$+\, \frac14\sum_{i,j,k,l}t_{ij}(\mu_1, \ldots, \mu_d)t_{kl}(\mu_1, \ldots, \mu_d)\,[m_{4,ijkl} - \sigma_{ij}\sigma_{kl}].$$

For general $d$, this is a complicated expression. For $d = 1$, it reduces to the reasonably simple approximation

$$\mathrm{Var}(T(X)) \approx [t'(\mu)]^2\sigma^2 + t'(\mu)t''(\mu)E(X - \mu)^3 + \frac14[t''(\mu)]^2\left[E(X - \mu)^4 - \sigma^4\right].$$
Example 7.32. Let $X, Y$ be two jointly distributed random variables, with means $\mu_1, \mu_2$, variances $\sigma_1^2, \sigma_2^2$, and covariance $\sigma_{12}$. We work out the second-order approximation to the expectation of $T(X,Y) = XY$. Writing $t$ for $T$ as above, the various relevant partial derivatives are $t_x = y$, $t_y = x$, $t_{xx} = t_{yy} = 0$, $t_{xy} = 1$. Plugging into the general formula for the second-order approximation to the mean, we get $E(XY) \approx \mu_1\mu_2 + \frac12[\sigma_{12} + \sigma_{21}] = \mu_1\mu_2 + \sigma_{12}$. Thus, in this case, the second-order approximation reproduces the exact mean of $XY$.
Example 7.33 (A Multidimensional Example). Let $X = (X_1, X_2, \ldots, X_d)$ have mean vector $\mu$ and covariance matrix $\Sigma$. Assume that $\mu$ is not the null vector. We find a second-order approximation to $E(\|X\|)$. Denoting $T(x_1, \ldots, x_d) = \|x\|$, the successive partial derivatives are

$$t_i(\mu) = \frac{\mu_i}{\|\mu\|}, \qquad t_{ii}(\mu) = \frac{1}{\|\mu\|} - \frac{\mu_i^2}{\|\mu\|^3}, \qquad t_{ij}(\mu) = -\frac{\mu_i\mu_j}{\|\mu\|^3} \ (i \ne j).$$

Plugging these into the general formula for the second-order approximation of the expectation, on some algebra we get the approximation

$$E(\|X\|) \approx \|\mu\| + \frac{\mathrm{tr}\,\Sigma}{2\|\mu\|} - \frac{\sum_i \mu_i^2\sigma_{ii}}{2\|\mu\|^3} - \frac{\sum_{i\ne j}\mu_i\mu_j\sigma_{ij}}{2\|\mu\|^3} = \|\mu\| + \frac{1}{2\|\mu\|}\left[\mathrm{tr}\,\Sigma - \frac{\mu'\Sigma\mu}{\mu'\mu}\right].$$

The ratio $\frac{\mu'\Sigma\mu}{\mu'\mu}$ varies between the minimum and the maximum eigenvalue of $\Sigma$, whereas $\mathrm{tr}\,\Sigma$ equals the sum of all the eigenvalues. Thus $\mathrm{tr}\,\Sigma - \frac{\mu'\Sigma\mu}{\mu'\mu} \ge 0$, $\Sigma$ being a nnd matrix, which implies that the approximation $\|\mu\| + \frac{1}{2\|\mu\|}\left[\mathrm{tr}\,\Sigma - \frac{\mu'\Sigma\mu}{\mu'\mu}\right]$ is $\ge \|\mu\|$. This is consistent with the bound $E(\|X\|) \ge \|\mu\|$, as is implied by Jensen's inequality.

The second-order variance approximation is difficult to work out in this example. However, the first-order approximation is easily worked out, and gives

$$\mathrm{Var}(\|X\|) \approx \frac{\mu'\Sigma\mu}{\mu'\mu}.$$

Note that no distributional assumption about $X$ was made in deriving the approximations.
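Although the approximations are distribution free, a quick check against a concrete distribution is reassuring (an illustrative sketch; the multivariate normal parameters below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 1.0, -1.0])
Sigma = np.diag([0.5, 0.3, 0.2])

x = rng.multivariate_normal(mu, Sigma, size=200000)
mc = np.linalg.norm(x, axis=1).mean()        # Monte Carlo estimate of E||X||

nm = np.linalg.norm(mu)
approx = nm + (np.trace(Sigma) - mu @ Sigma @ mu / (mu @ mu)) / (2 * nm)
print(mc, approx)    # the two values should be close
```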

Example 7.34 (Variance of the Sample Variance). It is sometimes necessary to calculate the variance of a centered sample moment $m_k = \frac1n\sum_{i=1}^n(X_i - \bar X)^k$, for iid observations $X_1, \ldots, X_n$ from some one-dimensional distribution. Particularly, the case $k = 2$ is of broad interest in statistics. Because we are considering centered moments, we may assume that $E(X_i) = \mu = 0$, so that $E(X_i^k)$ for any $k$ will equal the population centered moment $\mu_k = E(X_i - \mu)^k$. We also recall that $E(m_2) = \frac{n-1}{n}\sigma^2$.

Using the algebraic identity $\sum_{i=1}^n(X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - \left(\sum_{i=1}^n X_i\right)^2/n$, one can make substantial algebraic simplification towards calculating the variance of $m_2$. Indeed, $\mathrm{Var}(m_2) = E[m_2^2] - [E(m_2)]^2$, with

$$E[m_2^2] = \frac{1}{n^2}E\left(\sum_{i=1}^n X_i^2\right)^2 - \frac{2}{n^3}E\left[\left(\sum_{i=1}^n X_i^2\right)\left(\sum_{i=1}^n X_i\right)^2\right] + \frac{1}{n^4}E\left(\sum_{i=1}^n X_i\right)^4,$$

in which each power expands into symmetric sums of the types $\sum_{i=1}^n X_i^4$, $\sum_{i\ne j}X_i^2X_j^2$, $\sum_{i\ne j}X_i^3X_j$, and $\sum_{i\ne j\ne k}X_i^2X_jX_k$.

The expectation of each term can be found by using the independence of the $X_i$ and the zero mean assumption (any term containing a lone factor $X_j$ has zero expectation), and interestingly, the variance of $m_2$ can in fact be thus found exactly for any $n$, namely,

$$\mathrm{Var}(m_2) = \frac{1}{n^3}\left[(n-1)^2(\mu_4 - \sigma^4) + 2(n-1)\sigma^4\right].$$

The approximate methods would have produced the answer

$$\mathrm{Var}(m_2) \approx \frac{\mu_4 - \sigma^4}{n}.$$

It is useful to know that the approximate methods would likewise produce the general first-order variance approximation

$$\mathrm{Var}(m_r) \approx \frac{\mu_{2r} - \mu_r^2 + r^2\sigma^2\mu_{r-1}^2 - 2r\mu_{r-1}\mu_{r+1}}{n}.$$
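Both the exact and the asymptotic formulas are easy to confirm by simulation (an illustrative sketch; Exp(1) is chosen because its central moments $\sigma^2 = 1$, $\mu_4 = 9$ are known):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
mu4, sig4 = 9.0, 1.0                    # central moments of Exp(1)

# np.var uses the 1/n convention, matching m_2
m2 = np.var(rng.exponential(1.0, size=(200000, n)), axis=1)
exact = ((n - 1)**2 * (mu4 - sig4) + 2 * (n - 1) * sig4) / n**3
print(m2.var(), exact, (mu4 - sig4) / n)   # simulation, exact, asymptotic
```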
The formal approximations described above may work well in some cases; however, it is useful to have some theoretical quantification of the accuracy of the approximations. This is difficult in general, and we give one result in a special case, with $d = 1$.

Theorem 7.26. Suppose $X_1, X_2, \ldots$ are iid observations with a finite fourth moment. Let $E(X_1) = \mu$, and $\mathrm{Var}(X_1) = \sigma^2$. Let $g$ be a scalar function with four uniformly bounded derivatives. Then
(a) $E[g(\bar X)] = g(\mu) + \dfrac{g^{(2)}(\mu)\sigma^2}{2n} + O(n^{-2})$;
(b) $\mathrm{Var}[g(\bar X)] = \dfrac{(g'(\mu))^2\sigma^2}{n} + O(n^{-2})$.

See Bickel and Doksum (2007) for a proof of this theorem.
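Part (a) can be checked in a case where $E[g(\bar X)]$ is known exactly (an illustrative sketch, not from the text): for $g(x) = e^x$ and iid $N(0,1)$ data, $\bar X \sim N(0, 1/n)$, so $E[e^{\bar X}] = e^{1/(2n)}$ exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 1.0, 50

xbar = rng.normal(0.0, sigma, size=(200000, n)).mean(axis=1)
mc = np.exp(xbar).mean()
# Theorem 7.26(a): g(mu) + g''(mu) sigma^2 / (2n) = 1 + 1/(2n), exact is e^{1/(2n)}
print(mc, np.exp(sigma**2 / (2 * n)), 1 + sigma**2 / (2 * n))
```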

7.7 Convergence of Densities and Scheffé’s Theorem

Suppose $X_n, X$, $n \ge 1$, are continuous random variables with densities $f_n, f$, and that $X_n$ converges in distribution to $X$. It is natural to ask whether that implies that $f_n$ converges to $f$ pointwise. Simple counterexamples show that this need not be true; we show an example below. However, convergence of densities, when true, is very useful: it ensures a mode of convergence much stronger than convergence in distribution. We discuss convergence of densities in general, and for sample means of iid random variables in particular, in this section.

First, we give an example to show that convergence in distribution does not imply convergence of densities.

Example 7.35 (Convergence in Distribution Is Weaker Than Convergence of Density). Suppose $X_n$ is a sequence of random variables on $[0,1]$ with density $f_n(x) = 1 + \cos(2\pi nx)$. Then $X_n \stackrel{\mathcal{L}}{\Rightarrow} U[0,1]$ by a direct verification of the definition using CDFs. Indeed, $F_n(x) = x + \frac{\sin(2\pi nx)}{2\pi n} \to x$ for all $x \in (0,1)$. However, note that the densities $f_n$ do not converge to the uniform density 1 as $n \to \infty$.
Convergence of densities is useful to have when true, because it ensures a much stronger form of convergence than convergence in distribution. Suppose $X_n$ has CDF $F_n$ and density $f_n$, and $X$ has CDF $F$ and density $f$. If $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$, then we can only assert that $F_n(x) = P(X_n \le x) \to F(x) = P(X \le x)$ for all $x$. However, if we have convergence of the densities, then we can make the much stronger assertion that for any event $A$, $P(X_n \in A) \to P(X \in A)$, not just for events $A$ of the form $A = (-\infty, x]$. This is explained below.

Definition 7.13. Let $X, Y$ be two random variables defined on a common sample space $\Omega$. The total variation distance between the distributions of $X$ and $Y$ is defined as $d_{TV}(X,Y) = \sup_A |P(X \in A) - P(Y \in A)|$.

Remark. Again, actually the set A is not completely arbitrary. We do need the
restriction that A be a Borel set, a concept in measure theory. However, we make no
further mention of this qualification.
The relation between total variation distance and densities when the random variables $X, Y$ are continuous is described by the following result.

Lemma. Let $X, Y$ be continuous random variables with densities $f, g$. Then $d_{TV}(X,Y) = \frac12\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx$.

Proof. The proof is based on two facts:

$$\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx = 2\int_{-\infty}^{\infty}(f - g)^+\,dx,$$

and, for any set $A$,

$$|P(X \in A) - P(Y \in A)| \le \int_{x:\,f(x) > g(x)}(f(x) - g(x))\,dx.$$

Putting these two together,

$$d_{TV}(X,Y) = \sup_A|P(X \in A) - P(Y \in A)| \le \frac12\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx.$$

However, for the particular set $A_0 = \{x: f(x) > g(x)\}$, $|P(X \in A_0) - P(Y \in A_0)| = \frac12\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx$, and so that proves that $\sup_A|P(X \in A) - P(Y \in A)|$ exactly equals $\frac12\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx$. ∎

Example 7.36 (Total Variation Distance Between Two Normals). Total variation distance is usually hard to find in closed analytical form; the absolute value sign makes closed-form calculations difficult. It is, however, possible to write a closed-form formula for the total variation distance between two arbitrary normal distributions in one dimension. No such formula would be possible in higher dimensions.

Let $X \sim N(\mu_1, \sigma_1^2)$, $Y \sim N(\mu_2, \sigma_2^2)$. We use the result that $d_{TV}(X,Y) = \frac12\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx$, where $f, g$ are the densities of $X, Y$. To evaluate the integral of $|f(x) - g(x)|$, we need to find the set of all values of $x$ for which $f(x) \ge g(x)$. We assume that $\sigma_1 > \sigma_2$, and use the notation

$$c = \frac{\sigma_1}{\sigma_2}, \qquad \Delta = \frac{\mu_1 - \mu_2}{\sigma_2},$$
$$A = \frac{-\Delta c - \sqrt{(c^2-1)\,2\log c + \Delta^2}}{c^2 - 1}, \qquad B = \frac{-\Delta c + \sqrt{(c^2-1)\,2\log c + \Delta^2}}{c^2 - 1}.$$

The case $\sigma_1 = \sigma_2$ is commented on below.

With this notation, by making a change of variable,

$$\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx = \int_{-\infty}^{\infty}|\phi(z) - c\,\phi(\Delta + cz)|\,dz,$$

and $\phi(z) \le c\,\phi(\Delta + cz)$ if and only if $A \le z \le B$. Therefore,

$$\int_{-\infty}^{\infty}|f(x) - g(x)|\,dx = \int_A^B[c\phi(\Delta + cz) - \phi(z)]\,dz + \int_{-\infty}^A[\phi(z) - c\phi(\Delta + cz)]\,dz + \int_B^{\infty}[\phi(z) - c\phi(\Delta + cz)]\,dz$$
$$= 2\left[(\Phi(\Delta + cB) - \Phi(B)) - (\Phi(\Delta + cA) - \Phi(A))\right],$$

where the quantities $\Delta + cB$, $\Delta + cA$ work out to

$$\Delta + cB = \frac{c\sqrt{(c^2-1)\,2\log c + \Delta^2} - \Delta}{c^2 - 1}, \qquad \Delta + cA = -\frac{c\sqrt{(c^2-1)\,2\log c + \Delta^2} + \Delta}{c^2 - 1}.$$

When $\sigma_1 = \sigma_2 = \sigma$, the expression reduces to $\Phi\left(\frac{|\Delta|}{2}\right) - \Phi\left(-\frac{|\Delta|}{2}\right)$. In applying the formula we have derived, it is important to remember that the larger of the two standard deviations has been called $\sigma_1$. Finally, now,

$$d_{TV}(X,Y) = (\Phi(\Delta + cB) - \Phi(B)) - (\Phi(\Delta + cA) - \Phi(A)),$$

with $\Delta + cA$, $\Delta + cB$, $A$, $B$ as given above explicitly.

We see from the formula that $d_{TV}(X,Y)$ depends on both individual variances, and on the difference between the means. When the means are equal, the total variation distance reduces to the simpler expression

$$2\left[\Phi\left(\frac{c\sqrt{2\log c}}{\sqrt{c^2-1}}\right) - \Phi\left(\frac{\sqrt{2\log c}}{\sqrt{c^2-1}}\right)\right] \approx \frac{1}{\sqrt{2\pi e}}\,(3 - c)(c - 1),$$

for $c \approx 1$.
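As a numerical sanity check on this closed form (an illustrative sketch; the parameter values are arbitrary), one can compare it against direct numerical integration of $\frac12\int|f - g|$:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def tv_normals(mu1, s1, mu2, s2):
    # Closed form as derived above; requires s1 > s2
    c, D = s1 / s2, (mu1 - mu2) / s2
    r = np.sqrt((c**2 - 1) * 2 * np.log(c) + D**2)
    A, B = (-D * c - r) / (c**2 - 1), (-D * c + r) / (c**2 - 1)
    return (norm.cdf(D + c * B) - norm.cdf(B)) - (norm.cdf(D + c * A) - norm.cdf(A))

mu1, s1, mu2, s2 = 1.0, 2.0, 0.0, 1.0
num = 0.5 * quad(lambda x: abs(norm.pdf(x, mu1, s1) - norm.pdf(x, mu2, s2)),
                 -40, 40)[0]
print(tv_normals(mu1, s1, mu2, s2), num)   # the two values should agree
```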
The next fundamental result asserts that if $X_n, X$ are continuous random variables with densities $f_n, f$, and if $f_n(x) \to f(x)$ for all $x$, then $d_{TV}(X_n, X) \to 0$. This means that pointwise convergence of densities, when true, ensures an extremely strong mode of convergence, namely convergence in total variation.

Theorem 7.27 (Scheffé's Theorem). Let $f_n, f$ be nonnegative integrable functions. Suppose:

$$f_n(x) \to f(x) \ \forall x; \qquad \int_{-\infty}^{\infty}f_n(x)\,dx \to \int_{-\infty}^{\infty}f(x)\,dx.$$

Then $\int_{-\infty}^{\infty}|f_n(x) - f(x)|\,dx \to 0$.
In particular, if $f_n, f$ are all density functions, and if $f_n(x) \to f(x)$ for all $x$, then $\int_{-\infty}^{\infty}|f_n(x) - f(x)|\,dx \to 0$.

Proof. The proof is based on the pointwise algebraic identity

$$|f_n(x) - f(x)| = f_n(x) + f(x) - 2\min\{f_n(x), f(x)\}.$$

Now note that $\min\{f_n(x), f(x)\} \to f(x)$ for all $x$, as $f_n(x) \to f(x)$ for all $x$, and $\min\{f_n(x), f(x)\} \le f(x)$. Therefore, by the dominated convergence theorem (see the previous section), $\int_{-\infty}^{\infty}\min\{f_n(x), f(x)\}\,dx \to \int_{-\infty}^{\infty}f(x)\,dx$. The pointwise algebraic identity now gives that

$$\int_{-\infty}^{\infty}|f_n(x) - f(x)|\,dx \to \int_{-\infty}^{\infty}f(x)\,dx + \int_{-\infty}^{\infty}f(x)\,dx - 2\int_{-\infty}^{\infty}f(x)\,dx = 0,$$

which completes the proof. ∎

Remark. As we remarked before, convergence in total variation is very strong, and should not be expected without some additional structure. The following theorems exemplify the kind of structure that may be necessary. The first theorem below is a general theorem: no assumptions are made on the structural form of the statistic. In the second theorem below, convergence in total variation is considered for sample means of iid random variables: there is a restriction on the structural form of the underlying statistic.

Theorem 7.28 (Ibragimov). Suppose $X_n, X$ are continuous random variables with densities $f_n, f$ that are unimodal. Then $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$ if and only if $\int_{-\infty}^{\infty}|f_n(x) - f(x)|\,dx \to 0$.

See Reiss (1989) for this theorem. The next result for sample means of iid random variables was already given in Chapter 1; we restate it for completeness.

Theorem 7.29 (Gnedenko). Let $X_i, i \ge 1$, be iid continuous random variables with density $f(x)$, mean $\mu$, and finite variance $\sigma^2$. Let $Z_n = \frac{\sqrt n(\bar X - \mu)}{\sigma}$, and let $f_n$ denote the density of $Z_n$. If $f$ is uniformly bounded, then $f_n$ converges uniformly to the standard normal density $\phi(x)$ on $(-\infty, \infty)$, and $\int_{-\infty}^{\infty}|f_n(x) - \phi(x)|\,dx \to 0$.

Remark. This is an easily stated result covering many examples. But better results are available. Feller (1971) is an excellent reference for some of the better results, which, however, involve more complex concepts.
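Gnedenko's theorem can be watched numerically (an illustrative sketch; Exp(1) is chosen because the density of $Z_n$ is then an explicit shifted and scaled Gamma density):

```python
import numpy as np
from scipy.stats import gamma, norm
from scipy.integrate import quad

def l1_dist(n):
    # Density of Z_n = sqrt(n)(Xbar - 1) for iid Exp(1): S_n ~ Gamma(n),
    # so f_n(z) = sqrt(n) * gamma.pdf(n + sqrt(n) z, a=n), supported on z > -sqrt(n)
    fn = lambda z: np.sqrt(n) * gamma.pdf(n + np.sqrt(n) * z, a=n)
    return quad(lambda z: abs(fn(z) - norm.pdf(z)), -np.sqrt(n), 40)[0]

for n in [5, 20, 100]:
    print(n, l1_dist(n))   # the L1 distance decreases toward 0
```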

Example 7.37. Let $X_n \sim N(\mu_n, \sigma_n^2)$, $n \ge 1$. For $X_n$ to converge in distribution, each of $\mu_n, \sigma_n^2$ must converge. This is because, if $X_n$ does converge in distribution, then there is a CDF $F(x)$ such that $\Phi\left(\frac{x - \mu_n}{\sigma_n}\right) \to F(x)$ at all continuity points of $F$. This implies, by selecting two suitable continuity points of $F$, that each of $\mu_n, \sigma_n$ must converge. If $\sigma_n$ converges to zero, then $X_n$ will converge to a one-point distribution. Otherwise, $\mu_n \to \mu$, $\sigma_n \to \sigma$, for some $\mu, \sigma$ with $-\infty < \mu < \infty$, $0 < \sigma < \infty$. It follows that $P(X_n \le x) \to \Phi\left(\frac{x - \mu}{\sigma}\right)$ for any fixed $x$, and so $X_n$ converges in distribution to another normal, namely to $X \sim N(\mu, \sigma^2)$. Now, either by direct verification, or from Ibragimov's theorem, we have that $X_n$ also converges to $X$ in total variation. To summarize: if $X_n \sim N(\mu_n, \sigma_n^2)$, $n \ge 1$, then $X_n$ can only converge either to a one-point distribution, or to another normal distribution, say $N(\mu, \sigma^2)$, in which case $\mu_n \to \mu$, $\sigma_n \to \sigma$, and convergence in total variation also holds. Conversely, if $\mu_n \to \mu$, $\sigma_n \to \sigma > 0$, then $X_n$ converges in total variation to $X \sim N(\mu, \sigma^2)$.

Exercises
Exercise 7.1. (a) Show that $X_n \stackrel{2}{\Rightarrow} c$ (i.e., $X_n$ converges in quadratic mean to $c$) if and only if $E(X_n - c)$ and $\mathrm{Var}(X_n)$ both converge to zero.
(b) Show by an example (different from text) that convergence in probability does not necessarily imply almost sure convergence.

Exercise 7.2. (a) Suppose $E|X_n - c|^\alpha \to 0$, where $0 < \alpha < 1$. Does $X_n$ necessarily converge in probability to $c$?
(b) Suppose $a_n(X_n - \mu) \stackrel{\mathcal{L}}{\Rightarrow} N(0,1)$. Under what condition on $a_n$ can we conclude that $X_n \stackrel{P}{\Rightarrow} \mu$?
(c) $o_p(1) + O_p(1) = ?$
(d) $o_p(1)O_p(1) = ?$
(e) $o_p(1) + o_p(1)O_p(1) = ?$
(f) Suppose $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$. Then $o_p(1)X_n = ?$

Exercise 7.3 (Monte Carlo). Consider the purely mathematical problem of finding a definite integral $\int_0^1 f(x)\,dx$ for some (possibly complicated) function $f(x)$. Show that the SLLN provides a method for approximately finding the value of the integral by using appropriate averages $\frac1n\sum_{i=1}^n f(X_i)$.
Numerical analysts call this Monte Carlo integration.
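As a quick illustration of the exercise's point (a numerical sketch, not part of the text; the integrand is an arbitrary choice), the average of $f$ over iid $U[0,1]$ draws approximates $\int_0^1 f(x)\,dx$:

```python
import numpy as np

# Estimate the integral of f(x) = exp(-x^2) over [0, 1] by the SLLN:
# (1/n) sum f(U_i) -> E f(U) = the integral, for U_i iid U[0, 1].
rng = np.random.default_rng(3)
u = rng.random(10**6)
print(np.exp(-u**2).mean())   # close to 0.746824..., the true value
```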

Exercise 7.4. Suppose $X_1, X_2, \ldots$ are iid and that $E(X_1) = \mu \ne 0$, $\mathrm{Var}(X_1) = \sigma^2 < \infty$. Let $S_{m,p} = \sum_{i=1}^m X_i^p$, $m \ge 1$, $p = 1, 2$.
(a) Identify with proof the almost sure limit of $\frac{S_{m,1}}{S_{n,1}}$ for fixed $m$, and $n \to \infty$.
(b) Identify with proof the almost sure limit of $\frac{S_{n-m,1}}{S_{n,1}}$ for fixed $m$, and $n \to \infty$.
(c) Identify with proof the almost sure limit of $\frac{S_{n,1}}{S_{n,2}}$ as $n \to \infty$.
(d) Identify with proof the almost sure limit of $\frac{S_{n,1}}{S_{n^2,2}}$ as $n \to \infty$.

Exercise 7.5. Let $A_n, n \ge 1$, $A$ be events with respect to a common sample space $\Omega$.
(a) Prove that $I_{A_n} \stackrel{\mathcal{L}}{\Rightarrow} I_A$ if and only if $P(A_n) \to P(A)$.
(b) Prove that $I_{A_n} \stackrel{2}{\Rightarrow} I_A$ if and only if $P(A \,\triangle\, A_n) \to 0$.

Exercise 7.6. Suppose $g: \mathbb{R}^+ \to \mathbb{R}$ is continuous and bounded. Show that

$$\sum_{k=0}^{\infty} e^{-n\lambda}\frac{(n\lambda)^k}{k!}\,g\!\left(\frac kn\right) \to g(\lambda)$$

as $n \to \infty$.

Exercise 7.7 * (Convergence of Medians). Suppose $X_n$ is a sequence of random variables converging in probability to a random variable $X$; $X$ is absolutely continuous with a strictly positive density. Show that the medians of $X_n$ converge to the median of $X$.

Exercise 7.8. Suppose $\{A_n\}$ is an infinite sequence of independent events. Show that $P(\text{infinitely many } A_n \text{ occur}) = 1 \Leftrightarrow P\left(\bigcup_n A_n\right) = 1$.

Exercise 7.9 * (Almost Sure Limit of Mean Absolute Deviation). Suppose $X_i, i \ge 1$, are iid random variables from a distribution $F$ with $E_F(|X|) < \infty$.
(a) Prove that the mean absolute deviation $\frac1n\sum_{i=1}^n|X_i - \bar X|$ has a finite almost sure limit.
(b) Evaluate this limit explicitly when $F$ is standard normal.

Exercise 7.10 *. Let $X_n$ be any sequence of random variables. Prove that one can always find a sequence of numbers $c_n$ such that $\frac{X_n}{c_n} \stackrel{a.s.}{\Rightarrow} 0$.

Exercise 7.11 (Sample Maximum). Let $X_i, i \ge 1$, be an iid sequence, and $X_{(n)}$ the maximum of $X_1, \ldots, X_n$. Let $\xi(F) = \sup\{x: F(x) < 1\}$, where $F$ is the common CDF of the $X_i$. Prove that $X_{(n)} \stackrel{a.s.}{\Rightarrow} \xi(F)$.

Exercise 7.12. Suppose $\{A_n\}$ is an infinite sequence of events. Suppose that $P(A_n) \ge \delta$ for all $n$. Show that $P(\text{infinitely many } A_n \text{ occur}) \ge \delta$.

Exercise 7.13. Let $X_i$ be independent $N(\mu, \sigma_i^2)$ variables.
(a) Find the BLUE (best linear unbiased estimate) of $\mu$.
(b) Suppose $\sum_{i=1}^{\infty}\sigma_i^{-2} = \infty$. Prove that the BLUE converges almost surely to $\mu$.

Exercise 7.14. Suppose $X_i$ are iid standard Cauchy. Show that
(a) $P(|X_n| > n \text{ infinitely often}) = 1$;
(b) * $P(|S_n| > n \text{ infinitely often}) = 1$.

Exercise 7.15. Suppose $X_i$ are iid standard exponential. Show that $\limsup_n \frac{X_n}{\log n} = 1$ with probability 1.

Exercise 7.16 * (Coupon Collection). Cereal boxes contain independently and with equal probability exactly one of $n$ different celebrity pictures. Someone having the entire set of $n$ pictures can cash them in for money. Let $W_n$ be the minimum number of cereal boxes one would need to purchase to own a complete set of the pictures. Find a sequence $a_n$ such that $\frac{W_n}{a_n} \stackrel{P}{\Rightarrow} 1$.
Hint: Approximate the mean of $W_n$.

Exercise 7.17. Let $X_n \sim \mathrm{Bin}(n,p)$. Show that $(X_n/n)^2$ and $X_n(X_n - 1)/(n(n-1))$ both converge in probability to $p^2$. Do they also converge almost surely?

Exercise 7.18. Suppose $X_1, \ldots, X_n$ are iid standard exponential variables, and let $S_n = X_1 + \cdots + X_n$. Apply the Chernoff–Bernstein inequality (see Chapter 1) to show that for $c > 1$,

$$P(S_n > cn) \le e^{-n(c - 1 - \ln c)},$$

and hence that $P(S_n > cn) \to 0$ exponentially fast.

Exercise 7.19. Let $X_1, X_2, \ldots$ be iid nonnegative random variables. Show that $\frac{X_{(n)}}{n} \stackrel{P}{\Rightarrow} 0$ if and only if $nP(X_1 > n) \to 0$.
Is this true in the normal case?

Exercise 7.20 (Failure of Weak Law). Let $X_1, X_2, \ldots$ be a sequence of independent variables, with $P(X_i = i) = P(X_i = -i) = 1/2$. Show that $\bar X$ does not converge in probability to the common mean $\mu = 0$.

Exercise 7.21. Let $X_1, X_2, X_3, \ldots$ be iid $U[0,1]$. Let

$$G_n = (X_1X_2\cdots X_n)^{1/n}.$$

Find $c$ such that $G_n \stackrel{P}{\Rightarrow} c$.

Exercise 7.22 * (Uniform Integrability of Sample Mean). Suppose $X_i, i \ge 1$, are iid from some CDF $F$ with mean zero and variance one. Find a sufficient condition on $F$ for $E(\sqrt n\bar X)^k$ to exist and converge to $E(Z^k)$, where $k$ is fixed, and $Z \sim N(0,1)$.

Exercise 7.23 * (Sufficient Condition for Uniform Integrability). Let $\{X_n\}, n \ge 1$, be a sequence of random variables, and suppose for some function $f: \mathbb{R}^+ \to \mathbb{R}^+$ such that $f$ is nondecreasing and $\frac{f(x)}{x} \to \infty$ as $x \to \infty$, we know that $\sup_n E[f(|X_n|)] < \infty$. Show that $\{X_n\}$ is uniformly integrable.

Exercise 7.24 (Uniform Integrability of IID Sequence). Suppose $\{X_n\}$ is an iid sequence with $E(|X_1|) < \infty$. Show that $\{X_n\}$ is uniformly integrable.

Exercise 7.25. Give an example of a sequence $\{X_n\}$, and an $X$ such that $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$, $E(X_n) \to E(X)$, but $E(|X_n|)$ does not converge to $E(|X|)$.

Exercise 7.26. Suppose $X_n$ has a normal distribution with mean $\mu_n$ and variance $\sigma_n^2$. Let $\mu_n \to \mu$ and $\sigma_n \to \sigma$ as $n \to \infty$. What is the limiting distribution of $X_n$?

Exercise 7.27 (Delta Theorem). Suppose $X_1, X_2, \ldots$ are iid with mean $\mu$ and variance $\sigma^2$, a finite fourth moment, and let $Z \sim N(0,1)$.
(a) Show that $\sqrt n(\bar X^2 - \mu^2) \stackrel{\mathcal{L}}{\Rightarrow} 2\mu\sigma Z$.
(b) Show that $\sqrt n(e^{\bar X} - e^{\mu}) \stackrel{\mathcal{L}}{\Rightarrow} \sigma e^{\mu}Z$.
(c) Show that $\sqrt n\left(\log\left(\frac1n\sum_{i=1}^n(X_i - \bar X)^2\right) - \log\sigma^2\right) \stackrel{\mathcal{L}}{\Rightarrow} \frac{1}{\sigma^2}\left(E(X_1 - \mu)^4 - \sigma^4\right)^{1/2}Z$.

Exercise 7.28 (Asymptotic Variance and True Variance). Let $X_1, X_2, \ldots$ be iid observations from a CDF $F$ with four finite moments. For each of the following cases, find the exact variance of $m_2 = \frac1n\sum_{i=1}^n(X_i - \bar X)^2$ by using the formula in the text, and also find the asymptotic variance by using the formula in the text. Check when the true variance is larger than the asymptotic variance.
(a) $F = N(\mu, \sigma^2)$.
(b) $F = \mathrm{Exp}(\lambda)$.
(c) $F = \mathrm{Poi}(\lambda)$.

Exercise 7.29 (All Distributions as Limits of Discrete). Show that any distribution on $\mathbb{R}^d$ is the limit in distribution of distributions on $\mathbb{R}^d$ that are purely discrete with finitely many values.

Exercise 7.30 (Conceptual). Suppose $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$, and also $Y_n \stackrel{\mathcal{L}}{\Rightarrow} X$. Does this mean that $X_n - Y_n$ converges in distribution to (the point mass at) zero?

Exercise 7.31. (a) Suppose $a_n(X_n - \mu) \stackrel{\mathcal{L}}{\Rightarrow} N(0, \sigma^2)$; what can be said about the limiting distribution of $|X_n|$, when $\mu \ne 0$; $\mu = 0$?
(b) * Suppose $X_i$ are iid Bernoulli($p$); what can be said about the limiting distribution of the sample variance $s^2$ when $p = \frac12$; $p \ne \frac12$?

Exercise 7.32 (Delta Theorem). Suppose $X_1, X_2, \ldots$ are iid $\mathrm{Poi}(\lambda)$. Find the limiting distribution of $e^{\bar X}$.
Remark. It is meant that on suitable centering and norming, you will get a nondegenerate limiting distribution.

Exercise 7.33 (Delta Theorem). Suppose $X_1, X_2, \ldots$ are iid $N(\mu, 1)$. Find the limiting distribution of $\Phi(\bar X)$, where $\Phi$, as usual, is the standard normal CDF.

Exercise 7.34 * (Delta Theorem with Lack of Smoothness). Suppose $X_1, X_2, \ldots$ are iid $N(\mu, 1)$. Find the limiting distribution of $|\bar X|$ when
(a) $\mu \ne 0$;
(b) $\mu = 0$.

Exercise 7.35 (Delta Theorem). For each $F$ below, find the limiting distributions of $\frac{\bar X}{s}$ and $\frac{s}{\bar X}$:
(i) $F = U[0,1]$; (ii) $F = \mathrm{Exp}(\lambda)$; (iii) $F = \chi^2(p)$.

Exercise 7.36 * (Delta Theorem). Suppose $X_1, X_2, \ldots$ are iid $N(\mu, \sigma^2)$. Let

$$b_1 = \frac{\frac1n\sum(X_i - \bar X)^3}{\left(\frac1n\sum(X_i - \bar X)^2\right)^{3/2}} \quad\text{and}\quad b_2 = \frac{\frac1n\sum(X_i - \bar X)^4}{\left(\frac1n\sum(X_i - \bar X)^2\right)^2} - 3$$

be the sample skewness and kurtosis coefficients. Find the joint limiting distribution of $(b_1, b_2)$.

Exercise 7.37 * (Slutsky). Let $X_n, Y_m$ be independent Poisson with means $n, m$; $m, n \ge 1$. Find the limiting distribution of $\frac{X_n + Y_m - (n+m)}{\sqrt{X_n + Y_m}}$ as $n, m \to \infty$.

Exercise 7.38 (Approximation of Mean and Variance). Let $X \sim \mathrm{Bin}(n,p)$. Find the first- and the second-order approximation to the mean and variance of $\frac{X}{n - X}$.

Exercise 7.39 (Approximation of Mean and Variance). Let $X \sim \mathrm{Poi}(\lambda)$. Find the first- and the second-order approximation to the mean and variance of $e^X$. Compare to the exact mean and variance by consideration of the mgf of $X$.

Exercise 7.40 (Approximation of Mean and Variance). Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$. Find the first- and the second-order approximation to the mean and variance of $\Phi(\bar X)$.

Exercise 7.41 * (Approximation of Mean and Variance). Let $X \sim \mathrm{Bin}(n,p)$. Find the first- and the second-order approximation to the mean and variance of $\arcsin\left(\sqrt{\frac Xn}\right)$.

Exercise 7.42 * (Approximation of Mean and Variance). Let $X \sim \mathrm{Poi}(\lambda)$. Find the first- and the second-order approximation to the mean and variance of $\sqrt X$.

Exercise 7.43 * (Multidimensional Approximation of Mean). Let $X$ be a $d$-dimensional random vector. Find a first- and second-order approximation to the mean of $\sqrt{\sum_{i=1}^d X_i^4}$.

Exercise 7.44 * (Expected Length of Poisson Confidence Interval). In Chapter 1, the approximate 95% confidence interval $X + 1.92 \pm \sqrt{3.69 + 3.84X}$ for a Poisson mean was derived. Find a first- and second-order approximation to the expected length of this confidence interval.

Exercise 7.45 * (Expected Length of the t Confidence Interval). The modified $t$ confidence interval for a population mean $\mu$ has the limits $\bar X \pm z_{\alpha/2}\frac{s}{\sqrt n}$, where $\bar X$ and $s$ are the mean and the standard deviation of an iid sample of size $n$, and $z_{\alpha/2} = \Phi^{-1}(1 - \frac\alpha2)$. Find a first- and a second-order approximation to the expected length of the modified $t$ confidence interval when the population distribution is
(a) $N(\mu, \sigma^2)$;
(b) $\mathrm{Exp}(\lambda)$;
(c) $U[\theta - 1, \theta + 1]$.

Exercise 7.46 * (Coefficient of Variation). Given a set of positive iid random variables $X_1, X_2, \ldots, X_n$, the coefficient of variation (CV) is defined as $\mathrm{CV} = \frac{s}{\bar X}$. Find a second-order approximation to its mean, and a first-order approximation to its variance, in terms of suitable moments of the distribution of the $X_i$. Make a note of how many finite moments you need for each approximation to make sense.

Exercise 7.47 * (Variance-Stabilizing Transformation). Let $X_i, i \ge 1$, be iid $\mathrm{Poi}(\lambda)$.
(a) Show that for each $a, b$, $\sqrt{\frac{\sum_{i=1}^n X_i + a}{n + b}}$ is a variance-stabilizing transformation.
(b) Find the first- and the second-order approximation to the mean of $\sqrt{\frac{\sum_{i=1}^n X_i + a}{n + b}}$.
(c) Are there some particular choices of $a, b$ that make the approximation

$$E\left(\sqrt{\frac{\sum_{i=1}^n X_i + a}{n + b}}\right) \approx \sqrt\lambda$$

more accurate? Justify your answer.

Exercise 7.48 * (Variance-Stabilizing Transformation). Let $X_i, i \ge 1$, be iid $\mathrm{Ber}(p)$.
(a) Show that for each $a, b$, $\arcsin\left(\sqrt{\frac{\sum_{i=1}^n X_i + a}{n + b}}\right)$ is a variance-stabilizing transformation.
(b) Find the first- and the second-order approximation to the mean of $\arcsin\left(\sqrt{\frac{\sum_{i=1}^n X_i + a}{n + b}}\right)$.
(c) Are there some particular choices of $a, b$ that make the approximation

$$E\left[\arcsin\left(\sqrt{\frac{\sum_{i=1}^n X_i + a}{n + b}}\right)\right] \approx \arcsin\left(\sqrt p\right)$$

more accurate? Justify your answer.

Exercise 7.49. For each of the following cases, evaluate the total variation distance between the indicated distributions:
(a) $N(0,1)$ and $C(0,1)$;
(b) $N(0,1)$ and $N(0, 10^4)$;
(c) $C(0,1)$ and $C(0, 10^4)$.

Exercise 7.50 (Plotting the Variation Distance). Calculate and plot (as a function of $\mu$) $d_{TV}(X,Y)$ if $X \sim N(0,1)$, $Y \sim N(\mu, 1)$.

Exercise 7.51 (Convergence of Densities). Let $Z \sim N(0,1)$ and $Y$ independent of $Z$. Let $X_n = Z + \frac Yn$, $n \ge 1$.
(a) Prove by direct calculation that the density of $X_n$ converges pointwise to the standard normal density in each of the following cases:
(i) $Y \sim N(0,1)$; (ii) $Y \sim U[0,1]$; (iii) $Y \sim \mathrm{Exp}(1)$.
(b) Hence, or by using Ibragimov's theorem, prove that $X_n \to Z$ in total variation.

Exercise 7.52. Show that $d_{TV}(X,Y) \le P(X \ne Y)$.

Exercise 7.53. Suppose $X_1, X_2, \ldots$ are iid $\mathrm{Exp}(1)$. Does $\sqrt n(\bar X - 1)$ converge to standard normal in total variation? Prove or disprove.

Exercise 7.54 * (Minimization of Variation Distance). Let $X \sim U[-a, a]$ and $Y \sim N(0,1)$. Find $a$ that minimizes $d_{TV}(X,Y)$.

Exercise 7.55. Let $X, Y$ be integer-valued random variables. Show that $d_{TV}(X,Y) = \frac12\sum_k|P(X = k) - P(Y = k)|$.

References

Ash, R. (1973). Real Analysis and Probability, Academic Press, New York.
Bhattacharya, R. and Rao, R. (1986). Normal Approximation and Asymptotic Expansions, Wiley,
New York.
Bickel, P. and Doksum, K. (2007). Mathematical Statistics: Basic Ideas and Selected Topics, Prentice
Hall, Upper Saddle River, NJ.
Breiman, L. (1968). Probability, Addison-Wesley, Reading, MA.
Chow, Y. and Teicher, H. (1988). Probability Theory, 3rd ed., Springer, New York.
Cramér, H. (1946). Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Feller, W. (1971). Introduction to Probability Theory with Applications, Wiley, New York.
Ferguson, T. (1996). A Course in Large Sample Theory, Chapman and Hall, New York.
Hall, P. (1997). Bootstrap and Edgeworth Expansions, Springer, New York.
Hall, P. and Heyde, C. (1980). Martingale Limit Theory and Its Applications, Academic Press,
New York.
Kesten, H. (1972). Sums of independent random variables, Ann. Math. Statist., 43, 701–732.
Lehmann, E. (1999). Elements of Large Sample Theory, Springer, New York.
Petrov, V. (1975). Limit Theorems of Probability Theory, Oxford University Press, London.
Port, S. (1994). Theoretical Probability for Applications, Wiley, New York.
Reiss, R. (1989). Approximate Distributions of Order Statistics, Springer-Verlag, New York.
Révész, P. (1968). The Laws of Large Numbers, Academic Press, New York.
Sen, P. K. and Singer, J. (1993). Large Sample Methods in Statistics, Chapman and Hall, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.
van der Vaart, A. (1998). Asymptotic Statistics, Cambridge University Press, Cambridge, UK.
Chapter 8
Characteristic Functions and Applications

Characteristic functions were first systematically studied by Paul Lévy, although they were used by others before him. They provide an extremely powerful tool in probability in general, and in asymptotic theory in particular. The power of the characteristic function derives from a set of highly convenient properties. Like the mgf, it determines a distribution. But unlike mgfs, existence is not an issue, and it is a bounded function. It is easily transportable for common functions of random variables, such as convolutions. And it can be used to prove convergence of distributions, as well as to recognize the name of a limiting distribution. It is also an extremely handy tool in proving characterizing properties of distributions. For instance, the Cramér–Lévy theorem (see Chapter 1), which characterizes a normal distribution, has so far been proved only by using characteristic function methods.

There are two disadvantages in working with characteristic functions. First, a characteristic function is complex-valued in general, and so familiarity with basic complex analysis is required. Second, characteristic function proofs usually do not lead to any intuition as to why a particular result should be true. All things considered, knowledge of basic characteristic function theory is essential for statisticians, and certainly for students of probability.
We introduce some notation. The set of complex numbers is denoted by $\mathbb{C}$. Given a complex number $z = x + iy$, where $i = \sqrt{-1}$, $x$ is referred to as the real part of $z$, and $y$ the imaginary part of $z$, written as $\Re z$, $\Im z$. The complex conjugate of $z$ is $x - iy$ and denoted as $\bar z$. The absolute value is denoted as $|z|$ and equals $\sqrt{x^2 + y^2}$. We have $z\bar z = |z|^2$. We recall Euler's formula $e^{it} = \cos t + i\sin t$ for all $t \in \mathbb{R}$, and $|e^{it}| = 1$ for all $t \in \mathbb{R}$. Any $z \in \mathbb{C}$ may be represented as $z = r(\cos\theta + i\sin\theta)$, where $r = |z|$, and $\theta \in (-\pi, \pi]$; $\theta$ is called the argument of $z$. Note the formal equivalence of $\mathbb{C}$ to $\mathbb{R}^2$ by identifying the real and the imaginary part of a complex number to $r\cos\theta$ and $r\sin\theta$, where $(r, \theta)$ are the polar coordinates of the point $(x, y) \in \mathbb{R}^2$. We also recall de Moivre's formula $(\cos\theta + i\sin\theta)^n = \cos n\theta + i\sin n\theta$ for all $n \ge 1$.
Although we mostly limit ourselves to the case of real-valued random variables, we define a characteristic function (cf) for general $d$-vectors.

Definition 8.1. Let $X$ be a $d$-dimensional random vector. Then the cf of $X$, $\psi_X: \mathbb{R}^d \to \mathbb{C}$, is defined as $\psi_X(t) = E(e^{it'X}) = E(\cos t'X) + iE(\sin t'X)$.


The simplest properties of a cf in the case $d = 1$ are given first.

Proposition.
(a) For any real-valued random variable $X$, $\psi(t)$ exists for any $t$, $|\psi(t)| \le 1$, and $\psi(0) = 1$.
(b) $\psi_{-X}(t) = \psi_X(-t) = \overline{\psi_X(t)}$.
(c) For any real-valued random variables $X, Y$ such that $X, Y$ are independent, $\psi_{X+Y}(t) = \psi_X(t)\psi_Y(t)$.
(d) If $Y = a + bX$, then $\psi_Y(t) = e^{ita}\psi_X(bt)$.
(e) If $\psi_X(t) = \psi_Y(t)$ for all $t$, then $X, Y$ have the same distribution; that is, a cf determines the distribution.
(f) A random variable $X$ has a distribution symmetric about zero if and only if its characteristic function is real and even.
(g) The cf of any real-valued random variable $X$ is continuous, and even uniformly continuous on the whole real line.

Proof. $|\psi(t)| = |E(e^{itX})| \le E(|e^{itX}|) = E(1) = 1$; it is obvious that $\psi(0) = 1$. Next, $\psi_{-X}(t) = E(e^{it(-X)}) = E(e^{-itX}) = E(\cos tX) - iE(\sin tX) = \psi_X(-t) = \overline{\psi_X(t)}$. If $X, Y$ are independent, $\psi_{X+Y}(t) = E(e^{it(X+Y)}) = E(e^{itX}e^{itY}) = E(e^{itX})E(e^{itY}) = \psi_X(t)\psi_Y(t)$. Part (d) is obvious. Part (e) is proved later. For part (f), $X$ has a distribution symmetric about zero if and only if $X$ and $-X$ have the same distribution, if and only if $\psi_X(t) = \psi_{-X}(t) = \overline{\psi_X(t)}$. Part (g) can be proved by using simple inequalities on the exponential function and, at the same time, the dominated convergence theorem. We leave this as a short exercise for the reader. ∎

8.1 Characteristic Functions of Standard Distributions

Characteristic functions of many standard distributions are collected together in the following table for convenient reference. Some special cases are worked out as examples.

Distribution: Density/pmf; cf

Point mass at $a$: ; $e^{ita}$
Bernoulli($p$): $p^x(1-p)^{1-x}$; $1 - p + pe^{it}$
Binomial($n,p$): $\binom nx p^x(1-p)^{n-x}$; $(1 - p + pe^{it})^n$
Poisson($\lambda$): $e^{-\lambda}\frac{\lambda^x}{x!}$; $e^{\lambda(e^{it}-1)}$
Geometric($p$): $p(1-p)^{x-1}$; $\frac{pe^{it}}{1 - (1-p)e^{it}}$
Negative binomial($n,p$): $\binom{x-1}{n-1}p^n(1-p)^{x-n}$; $\left[\frac{pe^{it}}{1 - (1-p)e^{it}}\right]^n$
Uniform$[-a, a]$: $\frac{1}{2a}$; $\frac{\sin at}{at}$
Triangular$[-2a, 2a]$: $\frac{1}{2a}\left(1 - \frac{|x|}{2a}\right)$; $\left(\frac{\sin at}{at}\right)^2$
Beta($\alpha, \alpha$): $\frac{\Gamma(2\alpha)}{[\Gamma(\alpha)]^2}x^{\alpha-1}(1-x)^{\alpha-1}$; $\Gamma\left(\alpha + \frac12\right)\left(\frac t4\right)^{\frac12-\alpha}J_{\alpha-\frac12}\left(\frac t2\right)\left[\cos\frac t2 + i\sin\frac t2\right]$, $t > 0$
Exponential($\lambda$): $\frac1\lambda e^{-x/\lambda}$; $\frac{1}{1 - i\lambda t}$
Gamma($\alpha, \lambda$): $\frac{e^{-x/\lambda}x^{\alpha-1}}{\lambda^\alpha\Gamma(\alpha)}$; $\left[\frac{1}{1 - i\lambda t}\right]^\alpha$
Double exponential($\lambda$): $\frac{1}{2\lambda}e^{-|x|/\lambda}$; $\frac{1}{1 + \lambda^2t^2}$
Normal($\mu, \sigma^2$): $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$; $e^{it\mu - t^2\sigma^2/2}$
Cauchy($\mu, \sigma^2$): $\frac{1}{\pi\sigma\left(1 + \frac{(x-\mu)^2}{\sigma^2}\right)}$; $e^{it\mu - \sigma|t|}$
$t(a)$: $\frac{\Gamma\left(\frac{a+1}{2}\right)}{\sqrt{a\pi}\,\Gamma\left(\frac a2\right)\left(1 + \frac{x^2}{a}\right)^{(a+1)/2}}$; $\frac{a^{a/4}|t|^{a/2}}{2^{a/2-1}\Gamma\left(\frac a2\right)}K_{a/2}\left(\sqrt a\,|t|\right)$
Multivariate normal $N_d(\mu, \Sigma)$: $\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}e^{-\frac12(x-\mu)'\Sigma^{-1}(x-\mu)}$; $e^{it'\mu - \frac{t'\Sigma t}{2}}$
Uniform in $n$-dimensional unit ball: $\frac{\Gamma\left(\frac n2 + 1\right)}{\pi^{n/2}}$; $2^{n/2}\Gamma\left(\frac n2 + 1\right)\frac{J_{n/2}(\|t\|)}{\|t\|^{n/2}}$
Multinomial($n; p_1, \ldots, p_k$): $\frac{n!}{\prod_{j=1}^k x_j!}\prod_{j=1}^k p_j^{x_j}$; $\left(\sum_{j=1}^k p_je^{it_j}\right)^n$

Example 8.1 (Binomial, Normal, Poisson). In this example we work out the cf of general binomial, normal, and Poisson distributions.

If $X \sim \mathrm{Ber}(p)$, then immediately from the definition its cf is $\psi_X(t) = 1 - p + pe^{it}$. Because the general binomial random variable can be written as the sum of $n$ iid Bernoulli variables, the cf of the $\mathrm{Bin}(n,p)$ distribution is $(1 - p + pe^{it})^n$.

For the normal case, first consider the standard normal case. By virtue of symmetry, from part (f) of the above proposition (or directly from the definition), if $Z \sim N(0,1)$, then its cf is $\psi_Z(t) = \int_{-\infty}^{\infty}(\cos tz)\phi(z)\,dz$. Therefore,

$$\frac{d}{dt}\psi_Z(t) = \frac{d}{dt}\int_{-\infty}^{\infty}(\cos tz)\phi(z)\,dz = \int_{-\infty}^{\infty}\frac{d}{dt}[\cos tz]\,\phi(z)\,dz = \int_{-\infty}^{\infty}(\sin tz)(-z\phi(z))\,dz$$
$$= \int_{-\infty}^{\infty}(\sin tz)\,\frac{d}{dz}[\phi(z)]\,dz = -t\int_{-\infty}^{\infty}(\cos tz)\phi(z)\,dz = -t\,\psi_Z(t),$$

where the interchange of the derivative and the integral is permitted by the dominated convergence theorem, and integration by parts has been used in the final step of the calculation. Because $\psi_Z(0)$ must be one, $\frac{d}{dt}\psi_Z(t) = -t\psi_Z(t) \Rightarrow \psi_Z(t) = e^{-\frac{t^2}{2}}$.

If $X \sim N(\mu, \sigma^2)$, by part (d) of the above proposition, its cf equals $e^{it\mu - \frac{t^2\sigma^2}{2}}$. If $X \sim \mathrm{Poi}(\lambda)$, then

$$\psi_X(t) = \sum_{x=0}^{\infty}e^{-\lambda}\frac{\lambda^x}{x!}e^{itx} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(\lambda e^{it})^x}{x!} = e^{-\lambda}e^{\lambda e^{it}} = e^{\lambda(e^{it}-1)}.$$

The power series representation of the exponential function on $\mathbb{C}$ has been used in summing the series $\sum_{x=0}^{\infty}\frac{(\lambda e^{it})^x}{x!}$.
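These closed forms are easy to confirm numerically (an illustrative sketch; the values of $t$, $n$, $p$, $\lambda$ are arbitrary): compute $E(e^{itX})$ directly from the pmf and compare with the formulas.

```python
import numpy as np
from scipy.stats import binom, poisson

t = 0.7
n, p, lam = 10, 0.3, 2.5

k_bin = np.arange(n + 1)
cf_bin = (binom.pmf(k_bin, n, p) * np.exp(1j * t * k_bin)).sum()
k_poi = np.arange(200)                       # truncation; tail mass is negligible
cf_poi = (poisson.pmf(k_poi, lam) * np.exp(1j * t * k_poi)).sum()

print(cf_bin, (1 - p + p * np.exp(1j * t))**n)
print(cf_poi, np.exp(lam * (np.exp(1j * t) - 1)))
```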
Example 8.2 (Exponential, Double Exponential, and Cauchy). Let $X$ have the standard exponential density. Then its cf is $\psi_X(t) = \int_0^{\infty}e^{itx}e^{-x}\,dx$. One can use methods of integration of complex-valued functions and get the answer. Alternatively, one can separately find the real integrals $\int_0^{\infty}(\cos tx)e^{-x}\,dx$ and $\int_0^{\infty}(\sin tx)e^{-x}\,dx$, and thus obtain the cf as $\psi_X(t) = \int_0^{\infty}(\cos tx)e^{-x}\,dx + i\int_0^{\infty}(\sin tx)e^{-x}\,dx$. Each integral can be evaluated by various methods. For example, one can take the power series expansion $\sum_{k=0}^{\infty}(-1)^k\frac{(tx)^{2k}}{(2k)!}$ for $\cos(tx)$ and integrate $\int_0^{\infty}(\cos tx)e^{-x}\,dx$ term by term and then sum the series, and likewise for the integral $\int_0^{\infty}(\sin tx)e^{-x}\,dx$. Or, one can evaluate the integrals by repeated integration by parts. Any of these gives the formula $\psi_X(t) = \frac{1}{1 - it}$.

If $X$ has the standard double exponential density, then simply use the fact that $X \stackrel{\mathcal{L}}{=} X_1 - X_2$, where $X_1, X_2$ are iid standard exponential. Then, by parts (b) and (c) of the above proposition,

$$\psi_X(t) = \frac{1}{1 - it}\cdot\frac{1}{1 + it} = \frac{1}{1 + t^2}.$$

Note the very interesting outcome that the cf of the standard double exponential density is the renormalized standard Cauchy density. By a theorem (the inversion theorem in the next section) that we have not discussed yet, this implies that the cf of the standard Cauchy density is $e^{-|t|}$, the renormalized standard double exponential density.
Example 8.3 (Uniform Distribution in the $n$-Dimensional Unit Ball). This example well illustrates the value of a skilled calculation. We work out the cf of the uniform distribution in $B_n$, the $n$-dimensional unit ball, with the constant density

$$f(x) = \frac{1}{\mathrm{Vol}(B_n)}I_{\{x \in B_n\}} = \frac{\Gamma\left(\frac n2 + 1\right)}{\pi^{n/2}}I_{\{x \in B_n\}}.$$

We write $f(x) = c_nI_{\{x \in B_n\}}$. Before we start the derivation, it is important to note that the constants in the calculation do not have to be explicitly carried along the steps; a final constant can be found at the end by simply forcing the cf to equal one at $t = 0$.

First note that by virtue of symmetry (around the origin) of the uniform density in the ball, the cf equals $\psi(t) = c_n\int_{B_n}\cos(t'x)\,dx$. Let $P$ be an orthogonal matrix such that $Pt = \|t\|e_1$, where $e_1$ is the first $n$-dimensional basis unit vector $(1, 0, \ldots, 0)'$. Because $\|Px\| = \|x\|$ ($P$ being an orthogonal matrix), and because $|\det P| = 1$, we get

$$\psi(t) = c_n\int_{B_n}\cos(\|t\|x_1)\,dx.$$

Now make the $n$-dimensional polar transformation $(x_1, x_2, \ldots, x_n) \to (\rho, \theta_1, \ldots, \theta_{n-1})$ (consult Chapter 4), which we recall has the Jacobian $\rho^{n-1}\sin^{n-2}\theta_1\sin^{n-3}\theta_2\cdots\sin\theta_{n-2}$. Thus,

$$\psi(t) = c_n\int_0^1\int_0^{\pi}\cdots\int_0^{\pi}\int_0^{2\pi}\rho^{n-1}\cos(\|t\|\rho\cos\theta_1)(\sin\theta_1)^{n-2}\sin^{n-3}\theta_2\cdots\sin\theta_{n-2}\,d\theta_{n-1}\cdots d\theta_2\,d\theta_1\,d\rho$$
$$= k_{1n}\int_0^1\int_0^{\pi}\rho^{n-1}\cos(\|t\|\rho\cos\theta_1)(\sin\theta_1)^{n-2}\,d\theta_1\,d\rho$$

(it is not important to know right now what the constant $k_{1n}$ is)

$$= k_{2n}\int_0^1\int_{-1}^1\rho^{n-1}\cos(\|t\|\rho u)(1 - u^2)^{\frac{n-3}{2}}\,du\,d\rho$$

(on making the change of variable $\cos\theta_1 = u$, a monotone transformation on $(0, \pi)$)

$$= k_{3n}\int_0^1\rho^{n-1}(\|t\|\rho)^{1 - \frac n2}J_{\frac n2 - 1}(\|t\|\rho)\,d\rho$$

(on using the known integral representation of $J_\alpha(x)$)

$$= k_{3n}\|t\|^{1 - \frac n2}\int_0^1\rho^{\frac n2}J_{\frac n2 - 1}(\|t\|\rho)\,d\rho = k_{4n}\|t\|^{1 - \frac n2}\frac{J_{\frac n2}(\|t\|)}{\|t\|}$$

(on using the known formula for the integral in the above line)

$$= k_{4n}\frac{J_{\frac n2}(\|t\|)}{\|t\|^{\frac n2}}.$$

Now, at this final step use the fact that $\lim_{u \to 0}\frac{J_{n/2}(u)}{u^{n/2}} = \frac{1}{2^{n/2}\Gamma\left(\frac n2 + 1\right)}$, and this gives the formula that

$$\psi(t) = 2^{\frac n2}\Gamma\left(\frac n2 + 1\right)\frac{J_{\frac n2}(\|t\|)}{\|t\|^{\frac n2}}.$$

8.2 Inversion and Uniqueness

Given the cf of a random variable, it is possible to recover the CDF at its points of continuity; additionally, if the cf is absolutely integrable, then the random variable must have a density, and the density too can be recovered from the cf. The first result on the recovery of the CDF leads to the distribution determining property of a cf.

Theorem 8.1 (Inversion Theorems). Let $X$ have the CDF $F$ and cf $\psi(t)$.
(a) For any $-\infty < a < b < \infty$ such that $a, b$ are both continuity points of $F$,

$$F(b) - F(a) = \lim_{T \to \infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat} - e^{-ibt}}{it}\psi(t)\,dt;$$

(b) If $\int_{-\infty}^{\infty}|\psi(t)|\,dt < \infty$, then $X$ has a density, say $f(x)$, and

$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\psi(t)\,dt.$$

Furthermore, in this case, $f$ is bounded and continuous.
(c) (Plancherel's Identity) If a random variable $X$ has a density $f$ and the cf $\psi$, then

$$\int_{-\infty}^{\infty}|\psi(t)|^2\,dt < \infty \Leftrightarrow \int_{-\infty}^{\infty}|f(x)|^2\,dx < \infty,$$

and $\frac{1}{2\pi}\int_{-\infty}^{\infty}|\psi(t)|^2\,dt = \int_{-\infty}^{\infty}|f(x)|^2\,dx$.

Proof. We prove part (b), by making use of part (a). From part (a), if $a, b$ are continuity points of $F$, then because $\int_{-\infty}^{\infty}|\psi(t)|\,dt < \infty$, the dominated convergence theorem gives the simpler expression

$$F(b) - F(a) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-iat} - e^{-ibt}}{it}\psi(t)\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\left[\int_a^b e^{-itx}\,dx\right]\psi(t)\,dt = \int_a^b\left[\frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\psi(t)\,dt\right]dx$$

(the interchange of the order of integration justified by Fubini's theorem). Let $a \to -\infty$; noting that we can indeed approach $-\infty$ through continuity points, we obtain

$$F(b) = \int_{-\infty}^b\left[\frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\psi(t)\,dt\right]dx$$

at continuity points $b$. To make the proof completely rigorous from here, some measure theory is needed; but we have basically shown that the CDF can be written as the integral of the function $\frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\psi(t)\,dt$, and so $X$ must have this function as its density. It is bounded and continuous for the same reason that a cf is bounded and continuous.

For part (c), notice that $|\psi(t)|^2$ is the cf of $X - Y$, where $X, Y$ are iid with cf $\psi(t)$. But the density of $X - Y$, by the density convolution formula, is $g(x) = \int_{-\infty}^{\infty}f(y)f(x + y)\,dy$. If $|\psi(t)|^2$ is integrable, the inversion formula of part (b) applies to this cf, namely, $|\psi(t)|^2$. If we do apply the inversion formula, we get that $\frac{1}{2\pi}\int_{-\infty}^{\infty}|\psi(t)|^2\,dt = g(0)$. But $g(0) = \int_{-\infty}^{\infty}|f(x)|^2\,dx$. Therefore, we must have $\frac{1}{2\pi}\int_{-\infty}^{\infty}|\psi(t)|^2\,dt = \int_{-\infty}^{\infty}|f(x)|^2\,dx$. We leave the converse part as an easy exercise. ∎
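The inversion formula of part (b) lends itself to a direct numerical check (an illustrative sketch, with the standard normal chosen because both the cf and the density are known): recovering $\phi(x)$ from $\psi(t) = e^{-t^2/2}$.

```python
import numpy as np
from scipy.integrate import quad

# f(x) = (1/2pi) \int e^{-itx} psi(t) dt; for a real, even cf the
# integrand reduces to cos(tx) psi(t), which is convenient for quadrature.
def f_inv(x):
    return quad(lambda t: np.cos(t * x) * np.exp(-t**2 / 2),
                -np.inf, np.inf)[0] / (2 * np.pi)

for x in [0.0, 1.0, 2.0]:
    print(f_inv(x), np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
```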
The inversion theorem leads to the distribution determining property.

Theorem 8.2 (Distribution Determining Property). Let $X, Y$ have characteristic functions $\psi_X(t), \psi_Y(t)$. If $\psi_X(t) = \psi_Y(t)$ for all $t$, then $X, Y$ have the same distribution.

The proof uses part (a) of the inversion theorem, which applies to any type of random variable. Applying part (a) separately to $X$ and $Y$, one gets that the CDFs $F_X, F_Y$ must be the same at all points $b$ that are continuity points of both $F_X, F_Y$. This forces $F_X$ and $F_Y$ to be equal everywhere, because one can approach an arbitrary point from above through common continuity points of $F_X, F_Y$. ∎

Remark. It is important to note that the assumption that $\psi_X(t) = \psi_Y(t)$ for all $t$ cannot really be relaxed in this theorem, because it is actually possible for random variables $X, Y$ with different distributions to have identical characteristic functions at a lot of points, for example, over arbitrarily long intervals. In fact, we show such an example later.
We had stated in Chapter 1 that the sum of any number of independent Cauchy random variables also has a Cauchy distribution. Aided by cfs, we can now prove it.

Example 8.4 (Sum of Cauchys). Let $X_i$, $1 \le i \le n$, be independent, and suppose $X_i \sim C(\mu_i, \sigma_i^2)$. Let $S_n = X_1 + \cdots + X_n$. Then the cf of $S_n$ is

$$\psi_{S_n}(t) = \prod_{i=1}^n\psi_{X_i}(t) = \prod_{i=1}^n e^{it\mu_i - \sigma_i|t|} = e^{it\left(\sum_{i=1}^n\mu_i\right) - \left(\sum_{i=1}^n\sigma_i\right)|t|}.$$

This coincides with the cf of a $C\left(\sum_{i=1}^n\mu_i, \left(\sum_{i=1}^n\sigma_i\right)^2\right)$ distribution; therefore, by the distribution determining property of cfs, one has

$$S_n \sim C\left(\sum_{i=1}^n\mu_i, \left(\sum_{i=1}^n\sigma_i\right)^2\right) \;\Leftrightarrow\; \bar X_n = \frac{S_n}{n} \sim C\left(\frac{\sum_{i=1}^n\mu_i}{n}, \left(\frac{\sum_{i=1}^n\sigma_i}{n}\right)^2\right).$$

In particular, if $X_1, \ldots, X_n$ are iid $C(\mu, \sigma^2)$, then $\bar X_n$ is also distributed as $C(\mu, \sigma^2)$ for all $n$. No matter how large $n$ is, the distribution of the sample mean is exactly the same as the distribution of one observation. This is a remarkable stability property, and note the contrast to situations where the laws of large numbers and the central limit theorem hold.

Analogous to the inversion formula for the density case, there is a corresponding inversion formula for integer-valued random variables, and actually more generally, for lattice-valued random variables. We need two definitions for presenting those results.

Definition 8.2. A real number $x$ is called an atom of the distribution of a real-valued random variable $X$ if $P(\{x\}) = P(X = x) > 0$.

Definition 8.3. A real-valued random variable $X$ is called a lattice-valued random variable, and its distribution a lattice distribution, if there exist numbers $a$ and $h > 0$ such that $P(X = a + nh \text{ for some } n \in \mathbb{Z}) = 1$, where $\mathbb{Z}$ denotes the set of integers. In other words, $X$ can only take values $a, a \pm h, a \pm 2h, \ldots$, for some $a$ and $h > 0$. The largest number $h$ satisfying this property is called the span of the lattice.

Example 8.5. This example helps illustrate the concepts of an atom and a lattice. If $X$ has a density, then it cannot have any atoms. It also cannot be a lattice random variable, because any random variable with a density assigns zero probability to any countable set. Now consider a mixture distribution such as $p\delta_0 + (1-p)N(0,1)$, $0 < p < 1$. This distribution has one atom, namely the value $x = 0$. Consider a Poisson distribution. This has all nonnegative integers as its atoms, and it is also a lattice distribution, with $a = 0$, $h = 1$. Moreover, this distribution is purely atomic. Now take $Y \sim \mathrm{Bin}(n,p)$ and let $X = \frac Yn$. Then $X$ has the atoms $0, \frac1n, \frac2n, \ldots, 1$. This is also a lattice distribution, with $a = 0$, $h = \frac1n$. This distribution is also purely atomic. Lastly, let $Z \sim N(0,1)$ and let $X = 1 + 2\lfloor Z\rfloor$. Then the atoms of $X$ are the integers $\pm 1, \pm 3, \ldots$, and again $X$ has a lattice distribution, with $a = 1$ and $h = 2$.

We now give the inversion formulas for lattice-valued random variables.

Theorem 8.3 (Inversion for Lattice Random Variables).
(a) Let $X$ be an integer-valued random variable with cf $\psi(t)$. Then for any integer $n$,

$$P(X = n) = \frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-int}\psi(t)\,dt.$$

(b) More generally, if $X$ is a lattice random variable, then for any integer $n$,

$$P(X = a + nh) = \frac{h}{2\pi}\int_{-\pi/h}^{\pi/h}e^{-i(a+nh)t}\psi(t)\,dt.$$

(c) Given a random variable $X$ with cf $\psi(t)$, let $A$ be the (countable) set of all the atoms of $X$. Then

$$\sum_{x \in A}[P(X = x)]^2 = \lim_{T \to \infty}\frac{1}{2T}\int_{-T}^{T}|\psi(t)|^2\,dt.$$

An interesting immediate corollary of part (c) is the following.

Corollary. Suppose $X$ has a cf $\psi(t)$ that is square integrable. Then $X$ cannot have any atoms. In particular, if $\psi(t)$ is absolutely integrable, then $X$ cannot have any atoms, and in fact must have a density.
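Part (a) is also easy to verify numerically (an illustrative sketch; Poisson(3) is an arbitrary choice): recover the pmf from the cf by quadrature over $[-\pi, \pi]$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import poisson

lam = 3.0
psi = lambda t: np.exp(lam * (np.exp(1j * t) - 1))   # Poisson(3) cf

for n in [0, 2, 5]:
    # P(X = n) = (1/2pi) \int_{-pi}^{pi} e^{-int} psi(t) dt; the real part suffices
    val = quad(lambda t: (np.exp(-1j * n * t) * psi(t)).real,
               -np.pi, np.pi)[0] / (2 * np.pi)
    print(n, val, poisson.pmf(n, lam))
```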

Because there is a one-to-one correspondence between characteristic functions and CDFs, it seems natural to expect that closeness of characteristic functions should imply closeness of the CDFs, and distant characteristic functions should imply some distance between the CDFs. In other words, there should be some relationship of implication between nearby laws and nearby characteristic functions. Indeed, there are many such results available. Perhaps the most well known among them is Esseen's lemma, which is given below, together with a reverse inequality, and a third result for a pair of integer-valued random variables.

Theorem 8.4 (Esseen's Lemma). (a) Let $F, G$ be two CDFs, of which $G$ has a uniformly bounded density $g$, $g \le K < \infty$. Let $F, G$ have cfs $\psi_1, \psi_2$. Then for any $T > 0$, and $b > \frac{1}{2\pi}$, there exists a finite positive constant $C = C(b)$ such that

$$\sup_x|F(x) - G(x)| \le b\int_{-T}^{T}\frac{|\psi_1(t) - \psi_2(t)|}{|t|}\,dt + \frac{CK}{T}.$$

(b) (Reversal) Let $F, G$ be any two CDFs, with cfs $\psi_1, \psi_2$. Then

$$\sup_x|F(x) - G(x)| \ge \frac12\left|\int_{-\infty}^{\infty}[\psi_1(t) - \psi_2(t)]\phi(t)\,dt\right|,$$

where $\phi$ is the standard normal density.

(c) (Integer-Valued Case) Let $F, G$ be CDFs of two integer-valued random variables, with cfs $\psi_1, \psi_2$. Then

$$\sup_x|F(x) - G(x)| \le \frac14\int_{-\pi}^{\pi}\frac{|\psi_1(t) - \psi_2(t)|}{|t|}\,dt.$$

See Chapter 5 in Petrov (1975, pp. 142–147 and pp. 186–187) for each part of this theorem.

8.3 Taylor Expansions, Differentiability, and Moments

Unlike mgfs, characteristic functions need not be differentiable, even once. We have already seen such an example, namely the cf of the standard Cauchy distribution, which is $e^{-|t|}$, and therefore continuous but not differentiable. The tail of the Cauchy distribution is causing the lack of differentiability of its cf. If the tail tapers off rapidly, then the cf will be differentiable, and could even be infinitely differentiable. Thus, thin tails of the distribution go hand in hand with smoothness of the cf. Conversely, erratic tails of the cf go hand in hand with a CDF that is not sufficiently smooth. Inasmuch as a thin tail of the CDF is helpful for existence of moments, these three attributes are linked together, namely:

Does the CDF $F$ have a thin tail?
Is the cf differentiable enough number of times?
Does $F$ have enough number of finite moments, and if so, how does one recover them directly from the cf?

Conversely, these two attributes are linked together:

Does the cf taper off rapidly?
Is the CDF $F$ very smooth, for example, differentiable a (large) number of times?

These issues, together with practical applications in the form of Taylor expansions for the cf, are discussed next. It should be remarked that Taylor expansions for the cf and its logarithm form hugely useful tools in various problems in asymptotics. For example, the entire area of Edgeworth expansions in statistics uses these Taylor expansions as the most fundamental tool.

Theorem 8.5. Let $X$ have cf $\psi(t)$.
(a) If $E(|X|^k) < \infty$, then $\psi$ is $k$ times continuously differentiable at any point, and moreover,

$$E(X^m) = (-i)^m\psi^{(m)}(0) \quad \text{for all } m \le k.$$

(b) If for some even integer $k = 2m$, $\psi$ is $k$ times differentiable at zero, then $E(X^k) < \infty$.
(c) If for some odd integer $k = 2m + 1$, $\psi$ is $k$ times differentiable at zero, then $E(X^{k-1}) < \infty$.
(d) (Riemann–Lebesgue Lemma) If $X$ has a density, then $\psi(t) \to 0$ as $t \to \pm\infty$.
(e) If $X$ has a density and the density itself has $n$ derivatives, each of which is absolutely integrable, then $\psi(t) = o(|t|^{-n})$ as $t \to \pm\infty$.

Proof. We outline the proof of parts (a) and (e) of this theorem. See Port (1994, pp. 658–663 and p. 670) for the remaining parts. For part (a), for notational simplicity, assume that $X$ has a density $f$, and formally differentiate $\psi(t)$ $m$ times inside the integral sign. We get the expression $i^m\int_{-\infty}^{\infty}x^me^{itx}f(x)\,dx$. The absolute value of the integrand $|x^me^{itx}f(x)|$ is bounded by $|x^mf(x)|$, which is integrable, because $m \le k$ and the $k$th moment exists by hypothesis. This means that the dominated convergence theorem is applicable, and that at any $t$ the cf is $m$ times differentiable, with $\psi^{(m)}(t) = i^m\int_{-\infty}^{\infty}x^me^{itx}f(x)\,dx$. Putting $t = 0$, $\psi^{(m)}(0) = i^mE(X^m)$.

For part (e), suppose $X$ has a density $f$ and that $f'$ exists and is absolutely integrable. Then, by integration by parts, $\psi(t) = -\frac{1}{it}\int_{-\infty}^{\infty}e^{itx}f'(x)\,dx$. Now apply the Riemann–Lebesgue lemma, namely part (d), to conclude that $|\psi(t)| = o\left(\frac{1}{|t|}\right)$ as $t \to \pm\infty$. For general $n$, use this same argument by doing repeated integration by parts. ∎

The practically useful Taylor expansions for the cf and its logarithm are given next. For completeness, we recall the definition of cumulants of a random variable.

Definition 8.4. Suppose $E(|X|^n) < \infty$, and let $c_j = E(X^j)$, $1 \le j \le n$. The first cumulant is defined to be $\kappa_1 = c_1$, and for $2 \le j \le n$, $\kappa_j$ is defined by the recursion relation

$$\kappa_j = c_j - \sum_{k=1}^{j-1}\binom{j-1}{k-1}c_{j-k}\kappa_k;$$

$\kappa_j$ is the $j$th cumulant of $X$. In particular, $\kappa_2 = \mathrm{Var}(X)$, $\kappa_3 = E(X - \mu)^3$, $\kappa_4 = E(X - \mu)^4 - 3[\mathrm{Var}(X)]^2$, where $\mu = c_1 = E(X)$.

Equivalently, if $E(|X|^n) < \infty$, then for $1 \le j \le n$, $\kappa_j = \frac{1}{i^j}\left[\frac{d^j}{dt^j}\log\psi(t)\right]_{t=0}$.

Theorem 8.6 (Taylor Expansion of Characteristic Functions). Let $X$ have the cf $\psi(t)$.
(a) If $E(|X|^n) < \infty$, then $\psi(t)$ admits the Taylor expansion

$$\psi(t) = 1 + \sum_{j=1}^n\frac{(it)^j}{j!}E(X^j) + o(|t|^n),$$

as $t \to 0$.
(b) If $E(|X|^n) < \infty$, then $\log\psi(t)$ admits the Taylor expansion

$$\log\psi(t) = \sum_{j=1}^n\frac{(it)^j}{j!}\kappa_j + o(|t|^n),$$

as $t \to 0$. See Port (1994, p. 660) for a derivation of the expansion.
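The cumulant recursion is mechanical and easy to automate; as a check (an illustrative sketch, using the known fact that all cumulants of a Poisson($\lambda$) distribution equal $\lambda$, and its raw moments $c_1, \ldots, c_4$):

```python
from math import comb

lam = 2.0
# Raw moments c_1..c_4 of Poisson(lam)
c = {1: lam,
     2: lam + lam**2,
     3: lam + 3*lam**2 + lam**3,
     4: lam + 7*lam**2 + 6*lam**3 + lam**4}

kappa = {1: c[1]}
for j in range(2, 5):
    # kappa_j = c_j - sum_{k=1}^{j-1} C(j-1, k-1) c_{j-k} kappa_k
    kappa[j] = c[j] - sum(comb(j - 1, k - 1) * c[j - k] * kappa[k]
                          for k in range(1, j))
print(kappa)   # every cumulant equals lam = 2.0
```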

8.4 Continuity Theorems

One of the principal uses of characteristic functions is in establishing convergence in distribution. Suppose $Z_n$ is some sequence of random variables, and we suspect that $Z_n$ converges in distribution to some specific random variable $Z$ (e.g., standard normal). Then, a standard method of attack is to calculate the cf of $Z_n$ and show that it converges (pointwise) to the cf of the specific $Z$ we have in mind. Another case is where we are not quite sure what the limiting distribution of $Z_n$ is. But, still, if we can calculate the cf of $Z_n$ and obtain a pointwise limit for it, say $\psi(t)$, and if we can also establish that this function $\psi(t)$ is indeed a valid cf (it is not automatically guaranteed), then the limiting distribution will be whatever distribution has $\psi(t)$ as its cf. Characteristic functions thus make a particularly effective tool in asymptotic theory. In fact, we later give a proof of the CLT in the iid case, without making restrictive assumptions such as the existence of the mgf, by using characteristic functions.

Theorem 8.7. (a) Let $X_n$, $n \ge 1$, $X$ have cfs $\psi_n(t), \psi(t)$. Then $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$ if and only if $\psi_n(t) \to \psi(t)$ for all $t$.
(b) (Lévy) Let $X_n$, $n \ge 1$, have cfs $\psi_n(t)$, and suppose $\psi_n(t)$ converges pointwise to (some function) $\psi(t)$. If $\psi$ is continuous at zero, then $\psi(t)$ must be a cf, in which case $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$, where $X$ has $\psi(t)$ as its cf.

Proof. We prove only part (a). Let $F_n, F$ denote the CDFs of $X_n, X$. Let

$$B = \{b: b \text{ is a continuity point of each } F_n \text{ and } F\}.$$

Note that the complement of $B$ is at most countable. Suppose $\psi_n(t) \to \psi(t)$ for all $t$. By the inversion theorem, for any $a, b \in B$ with $a < b$,

$$F(b) - F(a) = \lim_{T \to \infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat} - e^{-ibt}}{it}\psi(t)\,dt = \lim_{T \to \infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat} - e^{-ibt}}{it}\left[\lim_n\psi_n(t)\right]dt$$
$$= \lim_{T \to \infty}\lim_n\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat} - e^{-ibt}}{it}\psi_n(t)\,dt$$

(by the dominated convergence theorem)

$$= \lim_n\lim_{T \to \infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat} - e^{-ibt}}{it}\psi_n(t)\,dt = \lim_n[F_n(b) - F_n(a)].$$

This implies that $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$.

Conversely, suppose $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$. Then, by the portmanteau theorem, for any $t$, each of $E[\cos(tX_n)]$, $E[\sin(tX_n)]$ converges to $E[\cos(tX)]$, $E[\sin(tX)]$, and so $\psi_n(t) \to \psi(t)$ for all $t$. ∎

8.5 Proof of the CLT and the WLLN

Perhaps the most major application of part (a) of the continuity theorem is in supplying a proof of the CLT in the iid case, without making any assumptions other than what the CLT says. Characteristic functions also provide a very efficient proof of the weak law of large numbers and the Cramér–Wold theorem. These three proofs are provided below.

Theorem 8.8 (CLT in the IID Case). Let $X_i, i \ge 1$ be iid variables with mean $\mu$ and variance $\sigma^2\,(<\infty)$. Let $Z_n = \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}$. Let $Z \sim N(0,1)$. Then $Z_n \xrightarrow{\mathcal{L}} Z$.

Proof. Without any loss of generality, we may assume that $\mu = 0, \sigma = 1$. Let $\phi(t)$ denote the cf of the $X_i$. Then, the cf of $Z_n$ is
$$\phi_n(t) = \left[\phi\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right]^n \quad \text{(by the Taylor expansion for cfs)}$$
$$\to e^{-\frac{t^2}{2}},$$
and hence, by part (a) of the continuity theorem, $Z_n \xrightarrow{\mathcal{L}} Z$. □
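This pointwise convergence of cfs can be watched numerically. The following is a minimal sketch (ours, not part of the text), assuming iid $\mathrm{Exp}(1)$ data, for which $\mu = \sigma = 1$ and the cf of $Z_n$ is available in closed form.

```python
# Sketch (illustrative assumptions: Exp(1) data, t = 1.5): for iid Exp(1),
# the cf of Z_n = sqrt(n)(Xbar - 1) is
#   zeta_n(t) = exp(-i t sqrt(n)) * (1 - i t / sqrt(n))^(-n),
# which should converge pointwise to exp(-t^2/2) as n grows.
import numpy as np

def cf_Zn(t, n):
    rn = np.sqrt(n)
    return np.exp(-1j * t * rn) * (1 - 1j * t / rn) ** (-n)

t = 1.5
for n in [10, 100, 1000, 10000]:
    err = abs(cf_Zn(t, n) - np.exp(-t ** 2 / 2))
    print(f"n = {n:6d}, |zeta_n(t) - exp(-t^2/2)| = {err:.5f}")
```

The printed errors shrink roughly like $1/\sqrt{n}$, in line with the Berry–Esseen rate discussed later in this chapter.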

Characteristic functions also provide a quick proof of Khintchine's weak law of large numbers.

Theorem 8.9 (WLLN). Suppose $X_i, i \ge 1$ are iid, with $E(|X_i|) < \infty$. Let $E(X_i) = \mu$. Let $S_n = X_1 + \cdots + X_n, n \ge 1$. Then $\bar{X} = \frac{S_n}{n} \xrightarrow{P} \mu$.

Proof. We may assume without loss of generality that $\mu = 0$. Because $E(|X_i|) < \infty$, the cf of $X_i$ admits the Taylor expansion $\phi(t) = 1 + o(|t|), t \to 0$. Now, the cf of $\bar{X}$ is
$$\phi_n(t) = \left[\phi\left(\frac{t}{n}\right)\right]^n = \left[1 + o\left(\frac{1}{n}\right)\right]^n \to 1\ \forall t.$$
Therefore, $\bar{X}$ converges in distribution to (the point mass at) zero, and so, also converges in probability to zero. □

A third good application of characteristic functions is writing a proof of the Cramér–Wold theorem.

Theorem 8.10 (Cramér–Wold Theorem). Let $X_n, X$ be $d$-dimensional vectors. Then $X_n \xrightarrow{\mathcal{L}} X$ if and only if $c'X_n \xrightarrow{\mathcal{L}} c'X$ for all unit vectors $c$.

Proof. The only if part is an immediate consequence of the continuous mapping theorem for convergence in distribution. For the if part, if $c'X_n \xrightarrow{\mathcal{L}} c'X$ for all unit vectors $c$, then by the continuity theorem for characteristic functions, for all real $\lambda$ and for all unit vectors $c$, $E(e^{i\lambda c'X_n}) \to E(e^{i\lambda c'X})$. But that is the same as saying $E(e^{it'X_n}) \to E(e^{it'X})$ for all $d$-dimensional vectors $t$, and so, once again by the continuity theorem for characteristic functions, $X_n \xrightarrow{\mathcal{L}} X$. □

8.6 * Producing Characteristic Functions

An interesting question is which functions can be characteristic functions of some probability distribution. We only address the case of one-dimensional random variables. It turns out that there is a classic necessary and sufficient condition. However, the condition is not practically very useful, because of the difficulty of verifying it in specific examples. Simple sufficient conditions are thus also very useful. In this section, we first describe the necessary and sufficient condition, and then give some examples and sufficient conditions.

Theorem 8.11 (Bochner's Theorem). Let $\psi$ be a complex-valued function defined on the real line. Then $\psi$ is the characteristic function of some probability distribution on $\mathbb{R}$ if and only if
(a) $\psi$ is continuous.
(b) $\psi(0) = 1$.
(c) $|\psi(t)| \le 1\ \forall t$.
(d) For all $n \ge 1$, for all reals $t_1, \ldots, t_n$, and for all $\omega_1, \ldots, \omega_n \in \mathbb{C}$,
$$\sum_{i=1}^{n}\sum_{j=1}^{n}\omega_i\bar{\omega}_j\,\psi(t_i - t_j) \ge 0.$$
See Port (1994, p. 663) for a proof of Bochner's theorem.


Example 8.6. We give some simple consequences of Bochner's theorem. Suppose $\phi_i, i \ge 1$ are all characteristic functions, and let $p_i \ge 0, \sum_{i=1}^{\infty}p_i = 1$. Then, because each $\phi_i$ satisfies the four conditions of Bochner's theorem, so does $\sum_{i=1}^{\infty}p_i\phi_i$, and therefore $\sum_{i=1}^{\infty}p_i\phi_i$ is also a characteristic function. However, we really do not need Bochner's theorem for this, because we can easily recognize $\sum_{i=1}^{\infty}p_i\phi_i$ to be the characteristic function of $\sum_{i=1}^{\infty}p_iF_i$, where $\phi_i$ is the characteristic function of $F_i$. Similarly, if $\phi$ satisfies the four conditions of Bochner's theorem, so does $\bar{\phi}$, by an easy verification of the last condition, and the first three are obvious. But again, that $\bar{\phi}$ is a characteristic function is directly clear on observing that it is the cf of $-X$ if $\phi$ is the cf of $X$.

We can now conclude more. If $\phi_1, \phi_2$ are both cfs, then so must be their product $\phi_1\phi_2$, by the convolution theorem for cfs. Applied to the present situation, this means that if $\phi(t)$ is a cf, so must be $\phi(t)\bar{\phi}(t) = |\phi(t)|^2$. In fact, this is the cf of $X_1 - X_2$, with $X_1, X_2$ being iid with the cf $\phi$. Recall that this is just the symmetrization of $X_1$.

To conclude this example, because $\bar{\phi}$ is a cf whenever $\phi$ is, the special convex combination $\frac{1}{2}\phi + \frac{1}{2}\bar{\phi}$ is also a cf. But $\frac{1}{2}\phi + \frac{1}{2}\bar{\phi}$ is simply the real part of $\phi$. Therefore, if $\phi$ is any cf, then so is its real part $\Re(\phi)$. There is a simple interpretation for it. For example, if $X$ has a density $f(x)$ and the cf $\phi$, then $\Re(\phi)$ is just the cf of the mixture density $\frac{1}{2}f(x) + \frac{1}{2}f(-x)$.

The following sufficient condition is among the most practically useful methods for constructing valid characteristic functions. We also show some examples of its applications.
Theorem 8.12 (Pólya's Criterion). Let $\psi$ be a real and even function. Suppose $\psi(0) = 1$, and suppose that $\psi(t)$ is nonincreasing and convex for $t > 0$, and converges to zero as $t \to \infty$. Then $\psi$ is the characteristic function of some distribution having a density.

See Feller (1971, p. 509) for a proof.
Example 8.7 (Stable Distributions). Let $\psi(t) = e^{-|t|^{\alpha}}, 0 < \alpha \le 1$. Then, by simple calculus, $\psi(t)$ satisfies the convexity condition in Pólya's criterion, and the other conditions are obviously satisfied. Therefore, for $0 < \alpha \le 1$, $e^{-|t|^{\alpha}}$ is a valid characteristic function. In fact, these are the characteristic functions of some very special distributions. Distributions $F$ with the property that for any $n$, if $X_1, \ldots, X_n$ are iid with CDF $F$, then their sum $S_n = X_1 + \cdots + X_n$ is distributed as $a_nX_1 + b_n$ for some sequences $a_n, b_n$, are called stable distributions. It turns out that the sequence $a_n$ must be of the form $n^{1/\alpha}$ for some $\alpha \in (0, 2]$; $\alpha$ is called the index of the stable distribution. A symmetric stable law has a cf of the form $e^{-c|t|^{\alpha}}, c > 0$. Therefore, we have arrived here at the cfs of symmetric stable laws of index $\alpha \le 1$. Pólya's criterion breaks down if $1 < \alpha \le 2$. So, although $e^{-c|t|^{\alpha}}, c > 0$ are also valid cfs for $1 < \alpha \le 2$, this cannot be proved by using Pólya's criterion. Note that the case $\alpha = 2$ corresponds to mean zero normal distributions, and $\alpha = 1$ corresponds to centered Cauchy distributions. The case $\alpha = \frac{1}{2}$ arises as the limiting distribution of $\frac{\tau_r}{r^2}$, where $\tau_r$ is the time of the $r$th return to the starting point zero of a simple symmetric random walk; see Chapter 11. It can be proved that stable distributions must be continuous, and have a density. However, except for $\alpha = \frac{1}{2}, 1$, and $2$, a stable density cannot be written in a simple form using elementary functions. Nevertheless, stable distributions are widely used in modeling variables in economics, finance, extreme event modeling, and generally, whenever densities with heavy tails are needed.
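For the boundary case $\alpha = 1$, the cf $e^{-|t|}$ can be checked by simulation; the following sketch (an illustration of ours, with an arbitrary seed and sample size) compares the empirical cf of standard Cauchy data against $e^{-|t|}$.

```python
# Sketch: the empirical cf (1/N) sum_j exp(i t X_j) of iid standard Cauchy
# data should be close to the symmetric stable cf exp(-|t|) (index alpha = 1).
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(200_000)
for t in [0.5, 1.0, 2.0]:
    emp = np.mean(np.exp(1j * t * x)).real   # imaginary part ~ 0 by symmetry
    print(f"t = {t}: empirical cf = {emp:.4f}, exp(-|t|) = {np.exp(-abs(t)):.4f}")
```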
Example 8.8 (An Example Where the Inversion Theorem Fails). The inversion theorem says that if the cf is absolutely integrable, then the density can be found by using the inversion formula. This example shows that the cf need not be absolutely integrable for the distribution to possess a density. For this, consider the function $\psi(t) = \frac{1}{1+|t|}$. Then $\psi$ clearly satisfies every condition in Pólya's criterion, and therefore is a cf corresponding to a density. However, at the same time, $\int_{-\infty}^{\infty}\frac{1}{1+|t|}\,dt = \infty$, and therefore the density cannot be found by using the inversion theorem.
Example 8.9 (Two Characteristic Functions That Coincide in an Interval). We remarked earlier that unlike mgfs, cfs of different distributions can coincide over nonempty intervals. We give such an example now. Towards this, define
$$\psi_1(t) = \left(1 - \frac{|t|}{T}\right)I_{\{|t|\le T\}};$$
$$\psi_2(t) = \left(1 - \frac{|t|}{T}\right)I_{\{|t|\le \frac{T}{2}\}} + \frac{T}{4|t|}I_{\{|t|>\frac{T}{2}\}}.$$
Each of $\psi_1, \psi_2$ is a cf by Pólya's criterion. Note that, however, $\psi_1(t) = \psi_2(t)\ \forall t \in [-\frac{T}{2}, \frac{T}{2}]$. Because $T$ is arbitrary, this shows that two different cfs can coincide on arbitrarily large intervals. Also note that at the same time this example provides a cf with a bounded support.

Perhaps a little explanation is useful: mgfs can be extended into analytic functions defined on $\mathbb{C}$, and there is a theorem in complex analysis that two analytic functions cannot coincide on any subset of $\mathbb{C}$ that has a limit point. Thus, two different mgfs cannot coincide on a nonempty real interval. However, unlike mgfs, characteristic functions are not necessarily analytic. That leaves an opening for finding nonanalytic cfs which do coincide over nonempty intervals.

8.7 Error of the Central Limit Theorem

Suppose a sequence of CDFs $F_n$ converges in distribution to $F$, for some $F$. Such a weak convergence result is usually used to approximate the true value of $F_n(x)$ at some fixed $n$ and $x$ by $F(x)$. However, the weak convergence result, by itself, says absolutely nothing about the accuracy of approximating $F_n(x)$ by $F(x)$ for that particular value of $n$. To approximate $F_n(x)$ by $F(x)$ for a given finite $n$ is a leap of faith, unless we have some idea of the error committed (i.e., of the magnitude of $|F_n(x) - F(x)|$). More specifically, for a sequence of iid random variables $X_i, i \ge 1$, with mean $\mu$ and variance $\sigma^2$, we need some quantification of the error of the normal approximation $\left|P\left(\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \le x\right) - \Phi(x)\right|$, both for a fixed $x$ and uniformly over all $x$.

The first result for the iid case in this direction is the classic Berry–Esseen theorem (Berry (1941), Esseen (1945)). Typically, these accuracy measures give bounds on the error in the central limit theorem approximation for all fixed $n$, in terms of moments of the $X_i$. Good general references on this topic are Petrov (1975), Feller (1971), Hall (1992), and Bhattacharya and Rao (1986). Other specific references are given later.

In the canonical iid case with a finite variance, the CLT says that $\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}$ converges in law to the $N(0,1)$ distribution. By Pólya's theorem, the uniform error
$$\Delta_n = \sup_{-\infty<x<\infty}\left|P\left(\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \le x\right) - \Phi(x)\right| \to 0, \text{ as } n \to \infty.$$
Bounds on $\Delta_n$ for any given $n$ are called uniform bounds. Here is the classic Berry–Esseen uniform bound.
Theorem 8.13 (Berry–Esseen Theorem). Let $X_i, i \ge 1$, be iid with $E(X_1) = \mu$, $\mathrm{Var}(X_1) = \sigma^2$, and $\beta_3 = E|X_1 - \mu|^3 < \infty$. Then there exists a universal constant $C$, not depending on $n$ or the distribution $F$ of the $X_i$, such that
$$\Delta_n \le \frac{C\beta_3}{\sigma^3\sqrt{n}}.$$

The proof of the Berry–Esseen theorem uses two technical inequalities on characteristic functions. One is Esseen's lemma, and the other a lemma that further bounds a term in Esseen's lemma itself. It is the second lemma that needs elaborate arguments involving Taylor expansions for the characteristic function of $\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}$ and its logarithm. A detailed proof of this second lemma can be seen in Petrov (1975, pp. 142–147).

Lemma. Let $X_1, X_2, \ldots, X_n$ be iid random variables with mean $\mu$, variance $\sigma^2$, and $\beta_3 = E(|X_1-\mu|^3) < \infty$. Let $L_n = \frac{\beta_3}{\sigma^3\sqrt{n}}$, and let $\zeta_n(t)$ denote the characteristic function of $Z_n = \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}$. Then,
$$\left|\zeta_n(t) - e^{-\frac{t^2}{2}}\right| \le 16L_n|t|^3e^{-\frac{t^2}{3}} \quad \forall t \text{ such that } |t| \le \frac{1}{4L_n}.$$

Sketch of Proof of Berry–Esseen Theorem. Denote the CDF of $Z_n$ by $F_n$. In Esseen's lemma, use $F = F_n$, $G = \Phi$, $b = \frac{1}{4}$, and $T = \frac{1}{4L_n}$. Then, Esseen's lemma gives
$$\sup_{-\infty<x<\infty}|F_n(x) - \Phi(x)| \le \frac{1}{\pi}\int_{\{t : |t|\le\frac{1}{4L_n}\}}\left|\frac{\zeta_n(t) - e^{-\frac{t^2}{2}}}{t}\right|dt + C_1L_n,$$
on using the simple inequality that the standard normal density $\phi(x) \le \frac{1}{\sqrt{2\pi}}\ \forall x$. Now, in the first term on the right, use the pointwise bound for $|\zeta_n(t) - e^{-\frac{t^2}{2}}|$ from the above lemma, and integrate. Some algebra then gives the Berry–Esseen theorem with a new universal constant $C$. □

Remark. The universal constant $C$ may be taken as $C = 0.8$. Fourier analytic proofs give the best available constant; direct proofs have so far not succeeded in producing good constants. The constant $C$ cannot be taken to be smaller than $\frac{3+\sqrt{10}}{6\sqrt{2\pi}} = 0.41$. Better values of the constant $C$ can be found for specific types of the underlying CDF, for example, if it is known that the samples are iid from an exponential distribution. However, no systematic studies in this direction seem to have been done. Also for some specific underlying CDFs $F$, better rates of convergence in the CLT may be possible. For example, under suitable additional conditions on $F$, one can have $\Delta_n = \frac{C(F)}{n} + o(n^{-1})$.

The main use of the Berry–Esseen theorem is that it establishes the facts that, in general, the rate of convergence in the central limit theorem for sums is $O\left(\frac{1}{\sqrt{n}}\right)$, and that the accuracy of the CLT approximation is linked to the third moment. It need not, necessarily, give accurate practical bounds on the error of the CLT in specific applications. We take an example below.

Example 8.10. Suppose $X_i, 1 \le i \le n$ are iid Bernoulli variables with parameter $p$. Take $n$ to be 100. We want to investigate if the Berry–Esseen theorem can let us conclude that the error of the CLT approximation is at most $\epsilon = .005$. In the Bernoulli case, $\beta_3 = E|X_i - \mu|^3 = pq(1 - 2pq)$, where $q = 1 - p$. Using $C = 0.8$, the uniform Berry–Esseen bound is
$$\Delta_n \le \frac{0.8\,pq(1-2pq)}{(pq)^{3/2}\sqrt{n}}.$$
This is less than or equal to the prescribed $\epsilon = 0.005$ iff $pq > 0.4784$, which does not hold for any $p, 0 < p < 1$. Even for $p = .5$, the Berry–Esseen bound is less than or equal to $\epsilon = 0.005$ only when $n > 25{,}000$, which is a very large sample size.
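The arithmetic in this example is easy to reproduce; here is a short sketch (ours, using $C = 0.8$ as above and a few illustrative values of $p$).

```python
# Sketch of Example 8.10: Berry-Esseen bound
#   0.8 * p*q*(1 - 2*p*q) / ((p*q)^{3/2} * sqrt(n))
# for iid Bernoulli(p) data with n = 100; all values are far above 0.005.
import numpy as np

n, C = 100, 0.8
for p in [0.1, 0.3, 0.5]:
    q = 1.0 - p
    beta3 = p * q * (1 - 2 * p * q)          # E|X_1 - p|^3 for Bernoulli(p)
    bound = C * beta3 / ((p * q) ** 1.5 * np.sqrt(n))
    print(f"p = {p}: bound = {bound:.4f}")
```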

Various refinements of the Berry–Esseen theorem are available. They replace the third absolute central moment $\beta_3$ by an expectation of some more general function. Petrov (1975) and van der Vaart and Wellner (1996) are good references for more general Berry–Esseen type bounds on the error of the CLT for sums. Here is an important refinement that does not assume the existence of the third moment, or that the variables are iid. The zero mean assumption in the next theorem is not a restriction, because we can make the means zero by writing $X_i - \mu_i$ in place of $X_i$.

Theorem 8.14. Let $X_1, \ldots, X_n$ be independent with $E(X_i) = 0$, $\mathrm{Var}(X_i) = \sigma_i^2$, and let $B_n = \sum_{i=1}^{n}\sigma_i^2$. Let $g : \mathbb{R} \to \mathbb{R}^+$ be such that
(a) $g$ is even;
(b) $g(x)$ and $\frac{x}{g(x)}$ are nondecreasing for $x > 0$;
(c) $E(X_i^2g(X_i)) < \infty$ for each $i = 1, 2, \ldots, n$.
Then,
$$\sup_x\left|P\left(\frac{\bar{X} - E(\bar{X})}{\sqrt{\mathrm{Var}(\bar{X})}} \le x\right) - \Phi(x)\right| \le C\,\frac{\sum_{i=1}^{n}E(X_i^2g(X_i))}{g(\sqrt{B_n})\,B_n},$$
for some universal constant $C, 0 < C < \infty$.

See Petrov (1975, p. 151) for a proof. An important corollary, obtained by using the function $g(x) = |x|^{\delta}\,(0 < \delta \le 1)$, is the following uniform bound in the iid case without assuming the existence of the third moment.

Corollary. Let $X_1, \ldots, X_n$ be iid with mean $\mu$, variance $\sigma^2$, and suppose for some $\delta, 0 < \delta \le 1$, $E(|X_1|^{2+\delta}) < \infty$. Then,
$$\Delta_n = \sup_x\left|P\left(\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \le x\right) - \Phi(x)\right| \le C\,\frac{E\left(|X_1-\mu|^{2+\delta}\right)}{\sigma^{2+\delta}\,n^{\frac{\delta}{2}}},$$
for some universal constant $C$.


Bounds on the error of the central limit theorem at a fixed x are called local bounds;
Petrov (1975), Serfling (1980), and DasGupta (2008) discuss local bounds. The
problem of deriving bounds on the error of the central limit theorem in the mul-
tidimensional case is much harder. No uniform bounds can be obtained with the
class of all possible (Borel) sets. Bounds have been derived for some special classes
of sets, such as the class of all closed balls, and the class of closed convex sets.
DasGupta (2008) gives a detailed presentation of the results.

8.8 Lindeberg–Feller Theorem for General Independent Case

The central limit theorem generalizes far beyond the iid situation. The general rule is
that if we add a large number of independent summands, none of which dominates
the rest, then the sum should still behave as does a normally distributed random
variable. There are several theorems in this regard, of which the Lindeberg–Feller
theorem is usually considered to be the last word. A weaker result, which is easier
to apply in many problems, is Lyapounov’s theorem, which we present first.
Theorem 8.15 (Lyapounov's Theorem). Suppose $\{X_n\}$ is a sequence of independent variables, with $E(X_i) = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2$. Let $s_n^2 = \sum_{i=1}^{n}\sigma_i^2$. If for some $\delta > 0$,
$$\frac{1}{s_n^{2+\delta}}\sum_{j=1}^{n}E|X_j - \mu_j|^{2+\delta} \to 0 \text{ as } n \to \infty,$$
then
$$\frac{\sum_{i=1}^{n}(X_i - \mu_i)}{s_n} \xrightarrow{\mathcal{L}} N(0,1).$$

Corollary. If $s_n \to \infty$ and $|X_j - \mu_j| \le C < \infty\ \forall j$, then
$$\frac{\sum_{i=1}^{n}(X_i - \mu_i)}{s_n} \xrightarrow{\mathcal{L}} N(0,1).$$

Sketch of Proof of Lyapounov's Theorem. We explain the idea of the proof when the condition of the theorem holds with $\delta = 1$. Here is the idea, first in nontechnical terms.

Assume without any loss of generality that $\mu_i = 0, \sigma_i^2 = 1\ \forall i \ge 1$. Denote $\frac{\sum_{i=1}^{n}X_i}{s_n}$ by $Z_n$ and its characteristic function by $\psi_n(t)$. The idea is to approximate the logarithm of $\psi_n(t)$ and to show that it is approximately equal to $-\frac{t^2}{2}$ for each fixed $t$. This means that $\psi_n(t)$ itself is approximately equal to $e^{-\frac{t^2}{2}}$, and hence by the continuity theorem for characteristic functions, $Z_n \xrightarrow{\mathcal{L}} N(0,1)$.

Let the cumulants of $Z_n$ be denoted by $\kappa_{j,n}$. Note that $\kappa_{1,n} = 0, \kappa_{2,n} = 1$. We approximate the logarithm of $\psi_n(t)$ by using our previously given Taylor expansion:
$$\log\psi_n(t) \approx \sum_{j=1}^{3}\kappa_{j,n}\frac{(it)^j}{j!} = -\frac{t^2}{2} - \kappa_{3,n}\frac{it^3}{6}.$$
Now, $\kappa_{3,n} = E(Z_n^3) = \frac{1}{s_n^3}E\left[\left(\sum_{i=1}^{n}X_i\right)^3\right]$. By an expansion of $\left(\sum_{i=1}^{n}X_i\right)^3$, and on using the independence and zero mean property of the $X_i$, and on using the triangular inequality, one gets $|\kappa_{3,n}| \le \frac{\sum_{i=1}^{n}E(|X_i|^3)}{s_n^3}$, which by the Lyapounov condition goes to zero, when $\delta = 1$. Thus, $\log\psi_n(t) \approx -\frac{t^2}{2}$, which is what we need. □
As remarked earlier, a central limit theorem for independent but not iid
summands holds under conditions weaker than the Lyapounov condition. The
Lindeberg–Feller theorem is usually considered the best possible result on this,
in the sense that the conditions of the theorem are not only sufficient, but also
necessary under some natural additional restrictions. However, the conditions of the
Lindeberg–Feller theorem involve calculations with the moments of more compli-
cated functions of the summands. Thus, in applications, using the Lindeberg–Feller
theorem gives a CLT under weaker conditions than Lyapounov’s theorem, but typi-
cally at the expense of more cumbersome calculations to verify the Lindeberg–Feller
conditions.
Theorem 8.16 (Lindeberg–Feller Theorem). With the same notation as in Lyapounov's theorem, assume that
$$\forall\,\delta > 0, \quad \frac{1}{s_n^2}\sum_{j=1}^{n}E\left[(X_j - \mu_j)^2I_{\{|X_j-\mu_j|>\delta s_n\}}\right] \to 0.$$
Then,
$$\frac{\sum_{i=1}^{n}(X_i - \mu_i)}{s_n} \xrightarrow{\mathcal{L}} N(0,1).$$
Proof. We give a characteristic function proof. Merely for notational simplicity, we assume that each $X_i$ has a density $f_i$. This assumption is not necessary, and the proof goes through verbatim in general, with integrals replaced by the expectation notation.

Denote the cf of $X_i$ by $\phi_i$, and without any loss of generality, we assume that $\mu_i = 0\ \forall i$, so that we have to show that
$$\frac{\sum_{i=1}^{n}X_i}{s_n} \xrightarrow{\mathcal{L}} N(0,1) \iff \prod_{k=1}^{n}\phi_k\left(\frac{t}{s_n}\right) \to e^{-t^2/2}.$$
The last statement is equivalent to
$$\sum_{k=1}^{n}\left[\phi_k\left(\frac{t}{s_n}\right) - 1\right] \to -\frac{t^2}{2} \iff \sum_{k=1}^{n}\left[\phi_k\left(\frac{t}{s_n}\right) - 1\right] + \frac{t^2}{2} \to 0,$$
by using a two-term Taylor expansion for the logarithm of each $\phi_k$ (see the section on characteristic functions).

Now, because each $X_i$ has mean zero, the quantity on the left in this latest expression can be rewritten as
$$\sum_{k=1}^{n}\int_{-\infty}^{\infty}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx$$
$$= \sum_{k=1}^{n}\int_{|x|\le\delta s_n}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx + \sum_{k=1}^{n}\int_{|x|>\delta s_n}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx.$$
We bound each term in this expression. First, we bound the integrand. In the first term, that is, when $|x| \le \delta s_n$, the integrand is bounded above by $\left|\frac{t}{s_n}x\right|^3 \le \delta|t|^3x^2/s_n^2$. In the second term, that is, when $|x| > \delta s_n$, the integrand is bounded above by $t^2\frac{x^2}{s_n^2}$. Therefore, the sum of the two terms is bounded in absolute value by
$$\frac{\delta|t|^3\sum_{k=1}^{n}\int_{|x|\le\delta s_n}x^2f_k(x)\,dx}{s_n^2} + \frac{t^2}{s_n^2}\sum_{k=1}^{n}\int_{|x|>\delta s_n}x^2f_k(x)\,dx.$$
Hold $\delta$ fixed and let $n \to \infty$. Then, the second term $\to 0$ by the hypothesis of the Lindeberg–Feller theorem. Now, notice that the expression we started with, namely,
$$\sum_{k=1}^{n}\int_{-\infty}^{\infty}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx,$$
has no $\delta$ in it, and so, now letting $\delta \to 0$, even the first term is handled, and we conclude that
$$\sum_{k=1}^{n}\int_{-\infty}^{\infty}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx \to 0,$$
which is what we needed. □

Example 8.11 (Linear Combination of IID Variables). Let $U_i, i \ge 1$ be iid variables, and assume that $U_i$ has mean $\mu$ and variance $\sigma^2$. Without loss of generality, we may assume that $\mu = 0, \sigma = 1$. Let also $c_n, n \ge 1$ be a sequence of constants, such that it satisfies the uniform asymptotic negligibility condition
$$r_n = \frac{\max\{|c_1|, \ldots, |c_n|\}}{\sqrt{\sum_{i=1}^{n}c_i^2}} \to 0, \text{ as } n \to \infty.$$
We want to show that $\frac{\sum_{i=1}^{n}c_iU_i}{\sqrt{\sum_{i=1}^{n}c_i^2}} \xrightarrow{\mathcal{L}} N(0,1)$. We do this by verifying the Lindeberg–Feller condition.

Denote $X_i = c_iU_i$, $s_n = \sqrt{\sum_{i=1}^{n}c_i^2}$. We need to show that for any $\delta > 0$,
$$\frac{1}{s_n^2}\sum_{j=1}^{n}E\left[X_j^2I_{\{|X_j|>\delta s_n\}}\right] \to 0.$$
Fix $j$; then,
$$E\left[X_j^2I_{\{|X_j|>\delta s_n\}}\right] = c_j^2E\left[U_j^2I_{\{|U_j|>\delta s_n/|c_j|\}}\right] \le c_j^2E\left[U_j^2I_{\{|U_j|>\delta s_n/\max\{|c_1|,\ldots,|c_n|\}\}}\right].$$
By the uniform asymptotic negligibility condition, $s_n/\max\{|c_1|, \ldots, |c_n|\} \to \infty$, and therefore $I_{\{|U_1|>\delta s_n/\max\{|c_1|,\ldots,|c_n|\}\}}$ goes almost surely to zero. Furthermore, $U_1^2I_{\{|U_1|>\delta s_n/\max\{|c_1|,\ldots,|c_n|\}\}}$ is bounded above by $U_1^2$, and we have assumed that $E(U_1^2) < \infty$. This implies by the dominated convergence theorem that
$$E\left[U_1^2I_{\{|U_1|>\delta s_n/\max\{|c_1|,\ldots,|c_n|\}\}}\right] \to 0.$$
But this implies that $\frac{1}{s_n^2}\sum_{j=1}^{n}E(X_j^2I_{\{|X_j|>\delta s_n\}})$ goes to zero too, because $\frac{c_j^2}{s_n^2} \le 1$. This completes the verification of the Lindeberg–Feller condition.
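A quick simulation illustrates the conclusion. The sketch below is our own illustration, with the specific choices $c_i = \sqrt{i}$ (which satisfies the negligibility condition, since $\max|c_i|/s_n = \sqrt{2/(n+1)} \to 0$) and $U_i$ uniform on $[-\sqrt{3}, \sqrt{3}]$ so that $\mu = 0, \sigma = 1$; these choices and the seed are assumptions made for the demonstration.

```python
# Sketch of Example 8.11: the standardized weighted sum
# sum(c_i U_i) / sqrt(sum c_i^2) should behave like N(0,1).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 20_000
c = np.sqrt(np.arange(1.0, n + 1))                       # c_i = sqrt(i)
U = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(reps, n)) # mean 0, variance 1
Z = (U * c).sum(axis=1) / np.linalg.norm(c)
print(Z.mean(), Z.var())     # ~ 0 and ~ 1
print(np.mean(Z <= 1.645))   # ~ Phi(1.645) = 0.95
```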

Example 8.12 (Failure of Lindeberg–Feller Condition). It is possible for standardized sums of independent variables to converge in distribution to $N(0,1)$ without the Lindeberg–Feller condition being satisfied. Basically, one of the variables has to have an undue influence on the sum. To keep the distributional calculation simple, take the $X_i$ to be independent normal with mean 0 and variance rapidly increasing so that the $n$th variable dominates the sum of the first $n$, for example, with $\mathrm{Var}(X_i) = 2^i$. Then, for each $n$, $\frac{\sum_{i=1}^{n}X_i}{s_n} \sim N(0,1)$, and so it is $N(0,1)$ in the limit. However, the Lindeberg–Feller condition fails, which can be shown by an analysis of the standard normal tail CDF, or by invoking the first part of the following theorem.

Theorem 8.17 (Necessity of Lindeberg–Feller Condition). (a) In order that the Lindeberg–Feller condition holds, it must be the case that $\frac{\max\{\sigma_1^2,\ldots,\sigma_n^2\}}{s_n^2} \to 0$.
(b) Conversely, if $\frac{\max\{\sigma_1^2,\ldots,\sigma_n^2\}}{s_n^2} \to 0$, then for $\frac{\sum_{i=1}^{n}(X_i-\mu_i)}{s_n}$ to converge in distribution to $N(0,1)$, the Lindeberg–Feller condition must hold.

Part (b) is attributed to Feller; see Port (1994, p. 704). It shows that if variances exist, then one cannot do away with the Lindeberg–Feller condition, except in unusual cases where the summands are not uniformly negligible relative to the sum.

8.9 * Infinite Divisibility and Stable Laws

Infinitely divisible distributions were introduced by de Finetti in 1929, and the most fundamental results were developed by Kolmogorov, Lévy, and Khintchine in the thirties. The area has undergone tremendous growth and a massive literature now exists. Stable distributions form a subclass of infinitely divisible distributions and are quite extensively used in modeling heavy tail data in various applied fields.

The origin of infinite divisibility and stable laws was in connection with characterizing possible limit distributions of centered and normed partial sums of independent random variables. If $X_1, X_2, \ldots$ is an iid sequence with mean $\mu$ and a finite variance $\sigma^2$, then we know that with $a_n = n\mu$ and $b_n = \sigma\sqrt{n}$, $\frac{S_n - a_n}{b_n} \xrightarrow{\mathcal{L}} N(0,1)$. But, if $X_1, X_2, \ldots$ are iid standard Cauchy, then with $a_n = 0$ and $b_n = n$, $\frac{S_n - a_n}{b_n} \xrightarrow{\mathcal{L}} C(0,1)$. It is natural to ask what are all the possible limit laws for suitably centered and normed partial sums of iid random variables. One can remove the iid assumption and keep just independence, and ask the same question. It turns out that stable and infinitely divisible distributions arise as the class of all possible limits in these cases. Precise statements are given in the theorems below. But, first we need to define infinite divisibility and stable laws.

Definition 8.5. A real-valued random variable $X$ with cumulative distribution function $F$ and characteristic function $\phi$ is said to be infinitely divisible if for each $n > 1$, there exist iid random variables $X_{1n}, \ldots, X_{nn}$ with cdf, say $F_n$, and characteristic function $\phi_n$, such that $X$ has the same distribution as $X_{1n} + \cdots + X_{nn}$, or, equivalently, $\phi = \phi_n^n$.

Example 8.13. Let $X$ be $N(\mu, \sigma^2)$. For any $n > 1$, let $X_{1n}, \ldots, X_{nn}$ be iid $N(\mu/n, \sigma^2/n)$. Then $X$ has the same distribution as $X_{1n} + \cdots + X_{nn}$, and so $X$ is infinitely divisible.

Example 8.14. Let $X$ have a Poisson distribution with mean $\lambda$. For a given $n$, let $X_{1n}, \ldots, X_{nn}$ be iid Poisson variables with mean $\lambda/n$. Then $X$ has the same distribution as $X_{1n} + \cdots + X_{nn}$, and so $X$ is infinitely divisible.

Example 8.15. Let $X$ have the continuous $U[0,1]$ distribution. Then $X$ is not infinitely divisible. For if it is, then for any $n$, there exist iid random variables $X_{1n}, \ldots, X_{nn}$ with some distribution $F_n$ such that $X$ has the same distribution as $X_{1n} + \cdots + X_{nn}$. Evidently, the supremum of the support of $F_n$ is at most $1/n$. This implies $\mathrm{Var}(X_{1n}) \le 1/n^2$ and hence $\mathrm{Var}(X) \le 1/n$, a contradiction.

In fact, a random variable $X$ with a bounded support cannot be infinitely divisible, and the uniform case proof applies in general.
The following important property of the class of infinitely divisible distributions describes the connection of infinite divisibility to possible weak limits of partial sums of triangular arrays of independent random variables.

Theorem 8.18. A random variable $X$ is infinitely divisible if and only if for each $n$, there is an iid sequence $Z_{1n}, \ldots, Z_{nn}$ such that $\sum_{i=1}^{n}Z_{in} \xrightarrow{\mathcal{L}} X$.

See Feller (1971, p. 303) for its proof. The result above allows triangular arrays of independent random variables, with possibly different common distributions $H_n$ for the different rows. An important special case is that of just one iid sequence $X_1, X_2, \ldots$ with a common cdf $H$. If the partial sums $S_n = \sum_{i=1}^{n}X_i$, possibly after suitable centering and norming, converge in distribution to some random variable $Z$, then $Z$ belongs to a subclass of the class of infinitely divisible distributions. This class is the so-called stable family. We first give a more direct definition of a stable distribution that better explains the reason for the name stable.

Definition 8.6. A random variable $X$, or its CDF $F$, is said to be stable if for every $n \ge 1$, there exist constants $b_n$ and $a_n$ such that $S_n = X_1 + X_2 + \cdots + X_n$ and $b_nX_1 + a_n$ have the same distribution, where $X_1, X_2, \ldots, X_n$ are iid with the common distribution $F$.

It turns out that $b_n$ has to be $n^{1/\alpha}$ for some $0 < \alpha \le 2$. The constant $\alpha$ is said to be the index of the stable distribution $F$. The case $\alpha = 2$ corresponds to the normal case, and $\alpha = 1$ corresponds to the Cauchy case. Generally, stable distributions are heavy tailed. For instance, the only stable laws with a finite variance are the normal distributions. The following theorem is often useful.

Theorem 8.19. If $X$ is stable with an index $0 < \alpha < 2$, then for any $p > 0$, $E|X|^p < \infty$ if and only if $0 < p < \alpha$.

Thus, stable laws with an index $\alpha \le 1$ cannot even have a finite mean; see Feller (1971, pp. 578–579) for a proof of the above theorem.

Stable distributions are necessarily absolutely continuous, and therefore have densities. However, except for $\alpha = \frac{1}{2}, 1, \frac{3}{2}$, and $2$, it is not possible to write a stable density analytically using elementary functions. See Chapter 11 and Chapter 18 for examples of situations where the stable law with $\alpha = \frac{1}{2}$ naturally arises. There are various infinite series and integral representations for them. The general stable distribution is parametrized by four parameters: a location parameter $\mu$, a scale parameter $\sigma$, a skewness parameter $\beta$, and the index parameter, which we have called $\alpha$. For instance, for the standard normal distribution, which is stable, $\mu = 0$, $\sigma = 1$, $\beta = 0$, and $\alpha = 2$. By varying $\mu, \sigma, \beta$, and $\alpha$, one can fit a lot of heavy-tailed data by using a stable law. However, fitting the four parameters from observed data is a very hard statistical problem. Feller (1971) is an excellent reference for more details on stable and infinitely divisible distributions. Infinite divisibility and stable laws are also treated in more detail in DasGupta (2008).

Here is the result connecting stable laws to limits of partial sums of iid random variables; see Feller (1971, p. 172).

Theorem 8.20. Let $X_1, X_2, \ldots$ be iid with the common cdf $H$. Suppose for constant sequences $\{a_n\}, \{b_n\}$, $\frac{S_n - a_n}{b_n} \xrightarrow{\mathcal{L}} Z \sim F$. Then $F$ is stable.

8.10 * Some Useful Inequalities

References to the inequalities below are given in DasGupta (2008), Chapter 34.

Bikelis Nonuniform Inequality. Given independent random variables $X_1, \ldots, X_n$ with mean zero and finite third absolute moments, and $F_n$ the CDF of $\frac{\sum_{i=1}^{n}X_i}{B_n}$,
$$|F_n(x) - \Phi(x)| \le A\,\frac{\sum_{i=1}^{n}E|X_i|^3}{B_n^3}\cdot\frac{1}{(1+|x|)^3},$$
for all real $x$, where $B_n^2 = \sum_{i=1}^{n}\mathrm{Var}(X_i)$.

Reverse Berry–Esseen Inequality of Hall and Barbour. Given independent random variables $X_1, \ldots, X_n$ with mean zero, variances $\sigma_i^2$ scaled so that $\sum_{i=1}^{n}\sigma_i^2 = 1$, and $\delta_n = \sum_{i=1}^{n}E[(X_i^3 + X_i^4)I_{\{|X_i|\le 1\}}] + \sum_{i=1}^{n}E[X_i^2I_{\{|X_i|>1\}}]$,
$$\sup_{-\infty<x<\infty}|F_n(x) - \Phi(x)| \ge \frac{1}{392}\left(\delta_n - 12\sum_{i=1}^{n}\sigma_i^4\right).$$

Kolmogorov’s Maximal Inequality. Given independent random variables


P
X1 ; : : : ; Xn with mean zero, variances i2 , Sk D kiD1 Xi , and a positive number ,
  Pn 2
i D1 i
P max jSk j  2
:
1kn

Hoeffding’s Inequality. Given independent random variables


P X1 ; : : : ; Xn with
mean zero such that ai Xi bi ; 1 D 1; 2; : : : ; n, Sn D niD1 Xi , and a positive
number t,
2n2 t 2
 Pn
.bi ai /2
P .Sn  nt/ e i D1 :
2n2 t 2
 Pn
.b ai /2
P .Sn nt/ e i D1 i :
2 2
 Pn 2n t
.b ai /2
P .jSn j  nt/ 2e i D1 i :
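As a small numerical illustration (ours, not the text's), take $X_i$ iid uniform on $[-1,1]$, so $a_i = -1, b_i = 1$ and the one-sided bound becomes $e^{-nt^2/2}$; the bound is valid but typically quite conservative.

```python
# Sketch: Hoeffding's bound exp(-2 n^2 t^2 / sum (b_i - a_i)^2) = exp(-n t^2/2)
# for X_i iid U[-1, 1], versus a simulated value of P(S_n >= n t).
import numpy as np

rng = np.random.default_rng(2)
n, t, reps = 100, 0.2, 100_000
S = rng.uniform(-1, 1, size=(reps, n)).sum(axis=1)
print("simulated P(S_n >= nt):", np.mean(S >= n * t))
print("Hoeffding bound:       ", np.exp(-n * t ** 2 / 2))
```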

Partial Sums Moment Inequality. Given $n$ random variables $X_1, \ldots, X_n$ and $p > 1$,
$$E\left|\sum_{i=1}^{n}X_i\right|^p \le n^{p-1}\sum_{k=1}^{n}E|X_k|^p.$$
Given $n$ independent random variables $X_1, \ldots, X_n$ and $p \ge 2$,
$$E\left|\sum_{i=1}^{n}X_i\right|^p \le c(p)\,n^{p/2-1}\sum_{k=1}^{n}E|X_k|^p,$$
for some finite universal constant $c(p)$.

von Bahr–Esseen Inequality. Given independent random variables $X_1, \ldots, X_n$ with mean zero, and $1 \le p \le 2$,
$$E\left|\sum_{i=1}^{n}X_i\right|^p \le \left(2 - \frac{1}{n}\right)\sum_{k=1}^{n}E|X_k|^p.$$

Rosenthal Inequality. Given independent random variables $X_1, \ldots, X_n$, $p \ge 1$, $E(X_i) = 0$, $E(|X_i|^p) < \infty\ \forall i$, there exists a finite constant $C(p)$ independent of $n$ such that
$$E\left[\left|\sum_{i=1}^{n}X_i\right|^p\right] \le C(p)\,E\left[\left(\sum_{i=1}^{n}X_i^2\right)^{p/2}\right].$$
Hence, if $p \ge 2$, and the $X_i$ are iid with $E(X_i) = 0$, $E(|X_i|^p) < \infty$, then
$$E\left[\left|\sum_{i=1}^{n}X_i\right|^p\right] \le C(p)\,n^{\frac{p}{2}}E\left(|X_1|^p\right).$$

Rosenthal Inequality II. Given independent random variables $X_1, \ldots, X_n$ and $p > 1$,
$$E\left(\left|\sum_{i=1}^{n}X_i\right|^p\right) \le 2^{p^2}\max\left\{\sum_{k=1}^{n}E|X_k|^p,\ \left(\sum_{k=1}^{n}E|X_k|\right)^p\right\}.$$

Given independent random variables $X_1, \ldots, X_n$, symmetric about zero, $2 < p < 4$, $E(|X_i|^p) < \infty\ \forall i$,
$$\left(E\left[\left|\sum_{i=1}^{n}X_i\right|^p\right]\right)^{1/p} \le \left(1 + \frac{2^{p/2}}{\sqrt{\pi}}\,\Gamma\left(\frac{p+1}{2}\right)\right)^{1/p}\max\left\{\left(E\left[\left|\sum_{i=1}^{n}X_i\right|^2\right]\right)^{1/2},\ \left(\sum_{i=1}^{n}E|X_i|^p\right)^{1/p}\right\}.$$

Exercises

Exercise 8.1 (Symmetrization of $U[0,1]$). Let $X, Y$ be iid $U[0,1]$. Calculate the characteristic function of $X - Y$. Is $X - Y$ uniformly distributed on $[-1,1]$?

Exercise 8.2 (Limit of Characteristic Functions). Let $X_n \sim U[-n,n], n \ge 1$, and let $\phi_n$ be the cf of $X_n$. Show that there is a function $\psi$ such that $\phi_n(t) \to \psi(t)$, but that $\psi$ is not a cf.

Exercise 8.3 (Limit of Characteristic Functions). Let $X_n \sim N(n, n^2), n \ge 1$, and let $\phi_n$ be the cf of $X_n$.
(a) Does $\phi_n$ have a pointwise limit?
(b) Does $X_n$ converge in distribution?
(c) Does Helly's theorem apply to the sequence $\{X_n\}$?
(d) Is there any sequence $c_n$ such that $c_nX_n$ converges in distribution? If so, identify it and also identify the limit in distribution of $c_nX_n$.

Exercise 8.4. Calculate the cf of the standard logistic distribution with the CDF $F(x) = \frac{1}{1+e^{-x}}, -\infty < x < \infty$.

Exercise 8.5 * (Normal–Cauchy Convolution). Show that for appropriate constants, $ae^{-b(|t|+ct^2)}$ is a characteristic function. Explicitly describe what $a, b, c$ can be.

Exercise 8.6 * (Decimal Expansion). Let $X_k, k \ge 1$ be iid discrete uniform on $\{0, 1, \ldots, 9\}$.
(a) Calculate the cf of $\sum_{k=1}^{n}\frac{X_k}{10^k}$.
(b) Find its limit as $n \to \infty$.
(c) Recognize the limit and make an interesting conclusion.

Exercise 8.7. For each of the $\mathrm{Bin}(n,p)$, $\mathrm{Geo}(p)$, and $\mathrm{Poi}(\lambda)$ distributions, use the cf formula given in the text to show that there exists $t \ne 0$ at which $|\phi(t)| = 1$.

Remark. This is true of any lattice distribution.

Exercise 8.8 * (Random Sums). Suppose $X_i, i \ge 1$ are iid, and $N \sim \mathrm{Poi}(\lambda)$ is independent of $\{X_1, X_2, \ldots\}$. Let $S_n = X_1 + \cdots + X_n, n \ge 1$, $S_0 = 0$. Derive a formula for the cf of $S_N$.

Exercise 8.9 * (Characteristic Function of Products).
(a) Write a general expression for the cf of $XY$ if $X, Y$ are independent.
(b) Find the cf of the product of two independent standard normal variables.

Exercise 8.10 * (Sum of Dice Rolls). Let $X_i, 1 \le i \le n$ be iid integer-valued random variables, with cf $\phi(t)$. Let $S_n = X_1 + \cdots + X_n, n \ge 1$. By using the inversion formula for integer-valued random variables, derive an expression for $P(S_n = k)$. Hence, derive the pmf of the sum of $n$ independent rolls of a fair die.

Exercise 8.11 * (A Characterization of Normal Distribution). Let $X, Y$ be iid random variables. Give a characteristic function proof that if $X + Y, X - Y$ are independent, then $X, Y$ must be normal.

Exercise 8.12 (Characteristic Function of Compact Support). Consider the example $\psi(t) = (1-|t|)I_{\{|t|\le 1\}}$ given in the text. Find the density function corresponding to this cf.

Exercise 8.13 (Practice with the Inversion Formula). For each of the following cfs, find the corresponding density function:
$$(\cosh t)^{-1};\quad (\cosh t)^{-2};\quad t(\sinh t)^{-1};\quad (-1)^nH_{2n}(t)e^{-\frac{t^2}{2}},$$
where $H_r(x)$ is the $r$th Hermite polynomial.

Exercise 8.14. Suppose $X, Y$ are independent, and that $X + Y \stackrel{\mathcal{L}}{=} X$. Show that $Y = 0$ with probability one.

Exercise 8.15 ($e^{-|t|^4}$ Is Not a Characteristic Function).
(a) Suppose a cf $\phi$ is twice differentiable at zero, and $\phi''(0) = 0$. Show that the corresponding random variable equals zero with probability one.
(b) Prove that for any $\alpha > 2$, $e^{-|t|^{\alpha}}$ cannot be a characteristic function.

Exercise 8.16 * (A Remarkable Fact). Let $X_i, 1 \le i \le 4$ be iid standard normal. Prove that $X_1X_2 - X_3X_4$ has a standard double exponential distribution.

Exercise 8.17 * (Convergence in Total Variation and Characteristic Functions). Let $X_n, n \ge 1$, and $X$ have densities $f_n, f$, and cfs $\phi_n, \phi$. Suppose $|\phi_n|$ and $|\phi|$ are all integrable. Show that if $\int_{-\infty}^{\infty}|\phi_n(t) - \phi(t)|\,dt \to 0$, then $X_n$ converges to $X$ in total variation.

Exercise 8.18. Suppose $X_i \stackrel{\text{indep.}}{\sim} U[-a_i, a_i], i \ge 1$, with $a_i \le a < \infty\ \forall i$.
(a) Give a condition on $\{a_i\}$ such that the Lindeberg–Feller condition holds.
(b) Give $\{a_i\}$ such that the Lindeberg–Feller condition does not hold.
(c) * Prove that the Lindeberg–Feller condition holds if and only if $\sum_{i=1}^{\infty}a_i^2 = \infty$.

Exercise 8.19. Let $X_i \stackrel{\text{indep.}}{\sim} \mathrm{Poi}(\lambda_i), i \ge 1$. Find a sufficient condition on $\{\lambda_i\}$ so that the Lindeberg–Feller condition holds. Next, find a sufficient condition for Lyapounov's condition to hold with $\delta = 1$.

Exercise 8.20 (Lindeberg–Feller for Independent Bernoullis). Let $X_{ni} \stackrel{\text{indep.}}{\sim} \mathrm{Bin}(1, t_{ni}), 1 \le i \le n$, and suppose $\sum_{i=1}^{n}t_{ni}(1-t_{ni}) \to \infty$. Show that
$$\frac{\sum_{i=1}^{n}X_{ni} - \sum_{i=1}^{n}t_{ni}}{\sqrt{\sum_{i=1}^{n}t_{ni}(1-t_{ni})}} \xrightarrow{\mathcal{L}} N(0,1).$$

Exercise 8.21 (Lindeberg–Feller for Independent Exponentials). Let $X_i \stackrel{\text{indep.}}{\sim} \mathrm{Exp}(\lambda_i), i \ge 1$, and suppose $\frac{\max_{1\le i\le n}\lambda_i}{\sum_{i=1}^{n}\lambda_i} \to 0$. Show that, on standardization to zero mean and unit variance, $\bar{X}$ converges in distribution to $N(0,1)$.

Exercise 8.22 (Poisson Limit in Bernoulli Case). Let $X_{ni} \stackrel{\text{indep.}}{\sim} \mathrm{Bin}(1, t_{ni}), 1 \le i \le n$, and suppose $\sum_{i=1}^{n}t_{ni} \to \lambda, 0 < \lambda < \infty$. Where does $\sum_{i=1}^{n}X_{ni}$ converge in law?

Hint: Look at characteristic functions.

Exercise 8.23. Verify, for which of the following cases, the Lindeberg–Feller condition holds:
(a) $P(X_n = \pm\frac{1}{n}) = \frac{1}{2}$.
(b) $P(X_n = \pm n) = \frac{1}{2}$.
(c) $P(X_n = \pm 2^n) = \frac{1}{2}$.
(d) $P(X_n = \pm 2^n) = 2^{-n-1}$; $P(X_n = \pm 1) = \frac{1}{2} - 2^{-n-1}$.

Exercise 8.24. Let $X \sim \mathrm{Bin}(n,p)$. Use the Berry–Esseen theorem to bound $P(X \le k)$ in terms of the standard normal CDF $\Phi$ uniformly in $k$. Is it possible to give any nontrivial bounds that are uniform in both $k$ and $p$?

Exercise 8.25. Suppose $X_1, \ldots, X_{20}$ are the scores in 20 independent rolls of a fair die. Obtain an upper and a lower bound on $P(X_1 + X_2 + \cdots + X_{20} \le 75)$ by using the Berry–Esseen inequality.

Exercise 8.26 * (Hall–Barbour Reverse Bound in Binomial Case). By using the Hall–Barbour inequality given in the text, find a lower bound on the maximum error of the CLT in the iid Bernoulli case. Plot this bound as a function of $p$ for some selected values of $n$.

Exercise 8.27 * (Bikelis Local Bound). For which values of $p$ does the Bikelis local bound imply that $\int_{-\infty}^{\infty}|F_n(x) - \Phi(x)|^p\,dx \to 0$ as $n \to \infty$?

Exercise 8.28 (Using the Generalized Berry–Esseen). Suppose $X_i \stackrel{\text{indep.}}{\sim} \mathrm{Poisson}(i), i \ge 1$.
(a) Does $\bar{X}$, on standardization, converge to $N(0,1)$?
(b) Obtain an explicit upper bound on the error of the CLT by a suitable choice of the function $g$ in the generalized Berry–Esseen inequality given in the text.

Exercise 8.29. Point out a defect of the Bikelis local bound.

Exercise 8.30 (Applying the Rosenthal Inequalities). Suppose $X_1, X_2, \ldots, X_n$ are iid $U[-1,1]$. Derive bounds on the expected value of $\left|\sum_{i=1}^{n}X_i\right|^3$ by using both Rosenthal inequalities given in the text.

References

Berry, A. (1941). The accuracy of the Gaussian approximation to the sum of independent variates, Trans. Amer. Math. Soc., 49, 122–136.
Bhattacharya, R.N. and Rao, R.R. (1986). Normal Approximation and Asymptotic Expansions, John Wiley, New York.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Esseen, C. (1945). Fourier analysis of distribution functions: A mathematical study, Acta Math., 77, 1–125.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. II, Wiley, New York.
Hall, P. (1992). The Bootstrap and the Edgeworth Expansion, Springer, New York.
Petrov, V. (1975). Limit Theorems of Probability Theory, Oxford University Press, London.
Port, S. (1994). Theoretical Probability for Applications, Wiley, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics, Springer-Verlag, New York.
Chapter 9
Asymptotics of Extremes and Order Statistics

We discussed the importance of order statistics and sample percentiles in detail in Chapter 6. The exact distribution theory of one or several order statistics was presented there. Although closed-form in principle, the expressions are complicated, except in some special cases, such as the uniform and the exponential. However, once again it turns out that just like sample means, order statistics and sample percentiles also have a very structured asymptotic distribution theory. We present a selection of the fundamental results on the asymptotic theory for order statistics and sample percentiles, including the sample extremes. Principal references for this chapter are David (1980), Galambos (1987), Serfling (1980), Reiss (1989), de Haan (2006), and DasGupta (2008); other references are given in the sections. First, we recall some notation for convenience.

Suppose $X_i, i \ge 1$, are iid random variables with CDF $F$. We denote the order statistics of $X_1, X_2, \ldots, X_n$ by $X_{1:n}, X_{2:n}, \ldots, X_{n:n}$, or sometimes, simply as $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. The empirical CDF is denoted as $F_n(x) = \frac{\#\{i : X_i \le x\}}{n}$, and $F_n^{-1}(p) = \inf\{x : F_n(x) \ge p\}$ denotes the empirical quantile function. The population quantile function is $F^{-1}(p) = \inf\{x : F(x) \ge p\}$. $F^{-1}(p)$ is also denoted as $\xi_p$.

Consider the $k$th-order statistic $X_{k:n}$, where $k = k_n$. Three distinguishing cases are:
(a) $\sqrt{n}\left(\frac{k_n}{n} - p\right) \to 0$, for some $0 < p < 1$. This is called the central case.
(b) $n - k_n \to \infty$ and $\frac{k_n}{n} \to 1$. This is called the intermediate case.
(c) $n - k_n = O(1)$. This is called the extreme case.
Different asymptotics apply to the three cases; the case of central order statistics is
considered first.

9.1 Central-Order Statistics

9.1.1 Single-Order Statistic

Example 9.1 (Uniform Case). As a simple motivational example, consider the case of iid $U[0,1]$ observations $U_1, U_2, \ldots$. Fix $n$, and $p, 0 < p < 1$, and assume for convenience that $k = np$ is an integer. The goal of this example is to show why the $k$th-order statistic of $U_1, U_2, \ldots, U_n$ is asymptotically normal. More precisely, if $q = 1 - p$ and $\sigma^2 = pq$, then we want to show that $\frac{\sqrt{n}(U_{(k)} - p)}{\sigma}$ converges in distribution to the standard normal.

This can be established in a number of ways. For example, this can be proved by using the representation of $U_{(k)}$ as $U_{(k)} = \frac{\sum_{i=1}^{k}X_i}{\sum_{i=1}^{n+1}X_i}$, where the $X_i$ are iid standard exponential (see Chapter 6). However, we find the following heuristic argument more illuminating than using the above representation. We let $F_n(u)$ denote the empirical CDF of $U_1, \ldots, U_n$, and recall that $nF_n(u) \sim \mathrm{Bin}(n, F(u))$, which is $\mathrm{Bin}(n, u)$ in the uniform case.

Denote $Z_k = \frac{\sqrt{n}(U_{(k)} - p)}{\sigma}$. Then,
$$P(Z_k \le z) = P\left(U_{(k)} \le p + \frac{\sigma z}{\sqrt{n}}\right) = P\left(nF_n\left(p + \frac{\sigma z}{\sqrt{n}}\right) \ge k\right)$$
$$\approx 1 - \Phi\left(\frac{k - np - \sigma z\sqrt{n}}{\sqrt{n\left(p + \frac{\sigma z}{\sqrt{n}}\right)\left(q - \frac{\sigma z}{\sqrt{n}}\right)}}\right) \approx 1 - \Phi\left(\frac{-\sigma z\sqrt{n}}{\sqrt{npq}}\right) = 1 - \Phi(-z) = \Phi(z).$$
The steps of this argument can be made rigorous, and this shows that in the $U[0,1]$ case, $\frac{\sqrt{n}(U_{(k)} - p)}{\sqrt{pq}}$ converges in distribution to $N(0,1)$. This is an important example, because the case of a general continuous CDF will follow from the uniform case by making the very convenient quantile transformation (see Chapter 1).
If the observations are iid according to some general continuous CDF $F$ with a density $f$, then, by the quantile transformation, we can write $X_{(k)} \stackrel{\mathcal{L}}{=} F^{-1}(U_{(k)})$. The transformation $u \to F^{-1}(u)$ is a continuously differentiable function at $u = p$ with the derivative $\frac{1}{f(F^{-1}(p))}$, provided $0 < f(F^{-1}(p)) < \infty$. By the delta theorem, it then follows that $\sqrt{n}(X_{(k)} - F^{-1}(p)) \xrightarrow{\mathcal{L}} N\left(0, \frac{pq}{[f(F^{-1}(p))]^2}\right)$. This is an important result. However, the best result is quite a bit stronger. We do not have to assume that $k = k_n$ is exactly equal to $np$ for some fixed $p$, and we do not require the CDF $F$ to have a density everywhere. In other words, $F$ need not be differentiable at all $x$; all we need is enough local smoothness at the limiting value of $\frac{k_n}{n}$. Here is the exact theorem. Note that this theorem applies if we take $k = \lfloor np\rfloor$, the integer part of $np$ for some fixed $p$.

Theorem 9.1 (Single-Order Statistic). Let $X_i, i \ge 1$, be iid with CDF $F$. Let $\sqrt{n}\left(\frac{k_n}{n} - p\right) \to 0$ for some $0 < p < 1$, as $n \to \infty$. Denote $\xi_p = F^{-1}(p)$. Then
(a)
$$\lim_{n\to\infty}P_F(\sqrt{n}(X_{k_n:n} - \xi_p) \le t) = P\left(N\left(0, \frac{p(1-p)}{(F'(\xi_p-))^2}\right) \le t\right)$$
for $t < 0$, provided the left derivative $F'(\xi_p-)$ exists and is $> 0$.
(b)
$$\lim_{n\to\infty}P_F(\sqrt{n}(X_{k_n:n} - \xi_p) \le t) = P\left(N\left(0, \frac{p(1-p)}{(F'(\xi_p+))^2}\right) \le t\right)$$
for $t > 0$, provided the right derivative $F'(\xi_p+)$ exists and is $> 0$.
(c) If $F'(\xi_p-) = F'(\xi_p+) \stackrel{\text{say}}{=} f(\xi_p) > 0$, then
$$\sqrt{n}(X_{k_n:n} - \xi_p) \xrightarrow{\mathcal{L}} N\left(0, \frac{p(1-p)}{f^2(\xi_p)}\right).$$

Remark. Part (c) of the theorem is the most ubiquitous version and the one most used in applications. The same results hold with $X_{k_n:n}$ replaced by $F_n^{-1}(p)$.
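Part (c) is easy to visualize by simulation. The following is a sketch of ours (the choice of the standard exponential, for which $\xi_p = -\log(1-p)$ and $f(\xi_p) = 1-p$, and the seed are illustrative assumptions).

```python
# Sketch of Theorem 9.1(c) for iid Exp(1): with xi_p = -log(1-p) and
# f(xi_p) = 1 - p, the quantity sqrt(n)(X_{k:n} - xi_p) f(xi_p)/sqrt(p(1-p))
# should be approximately N(0,1).
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 2000, 0.3, 5_000
k = int(np.floor(n * p))
X = np.sort(rng.exponential(size=(reps, n)), axis=1)
xi_p = -np.log1p(-p)
Z = np.sqrt(n) * (X[:, k - 1] - xi_p) * (1 - p) / np.sqrt(p * (1 - p))
print(Z.mean(), Z.var())   # both should be near 0 and 1
```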

9.1.2 Two Statistical Applications

Asymptotic distributions of central order statistics are useful in statistics in various


ways. One common use is to consider estimators of parameters based on central
order statistics and compare their performance with the performance of a more tra-
ditional estimator. Usually, such comparisons would have to be done asymptotically,
because we cannot do the fixed sample size calculations in closed-form. Another use
is to write confidence intervals for a percentile of a distribution (i.e., a population),
when only minimal assumptions have been made about the nature of the distribu-
tion. We give an illustrative example of each kind below.
Example 9.2 (Asymptotic Relative Efficiency). Suppose $X_1, X_2, \ldots$ are iid $N(\mu, 1)$. Let $M_n = X_{\lfloor\frac{n}{2}\rfloor:n}$ denote the sample median. Because the standard normal density $f(0)$ at zero equals $\frac{1}{\sqrt{2\pi}}$, it follows from the above theorem that $\sqrt{n}(M_n - \mu) \xrightarrow{\mathcal{L}} N\left(0, \frac{\pi}{2}\right)$. On the other hand, $\sqrt{n}(\bar{X} - \mu) \xrightarrow{\mathcal{L}} N(0,1)$. The ratio of the variances in the two asymptotic distributions, namely, $\frac{1}{\pi/2} = \frac{2}{\pi}$, is called the ARE (asymptotic relative efficiency) of $M_n$ relative to $\bar{X}$. The idea is that asymptotically the sample median has a larger variance than the sample mean by a factor of $\frac{\pi}{2}$, and therefore is only $\frac{2}{\pi}$ as efficient as the sample mean in the normal case. For other distributions symmetric about some $\mu$ (e.g., $C(\mu, 1)$), the preference between the mean and the median as an estimate of the point of symmetry $\mu$ can reverse.
Example 9.3. This is an example where the CDF $F$ possesses left and right derivatives at the median, but they are unequal. Suppose
$$F(x) = \begin{cases} x, & \text{for } 0 \le x \le \frac{1}{2}; \\ 2x - \frac{1}{2}, & \text{for } \frac{1}{2} \le x \le \frac{3}{4}. \end{cases}$$
By the previous theorem, $P_F\left(\sqrt{n}\left(X_{\lfloor\frac{n}{2}\rfloor:n} - \frac{1}{2}\right) \le t\right)$ can be approximated by $P\left(N\left(0, \frac{1}{4}\right) \le t\right)$ when $t < 0$ and by $P\left(N\left(0, \frac{1}{16}\right) \le t\right)$ when $t > 0$. Separate approximations are necessary because $F$ changes slope at $x = \frac{1}{2}$.
Here is an important statistical application.

Example 9.4 (Confidence Interval for a Quantile). Suppose $X_1, X_2, \ldots, X_n$ are iid observations from some CDF $F$, and to keep things simple, assume that $F$ has a density $f$. Suppose we wish to estimate $\xi_p = F^{-1}(p)$ for some fixed $p, 0 < p < 1$. Let $k = k_n = \lfloor np\rfloor$, and let $\hat{\xi}_p = X_{k:n}$. Then, from the above theorem,
$$\sqrt{n}(\hat{\xi}_p - \xi_p) \xrightarrow{\mathcal{L}} N\left(0, \frac{p(1-p)}{(f(\xi_p))^2}\right).$$
Thus, a confidence interval for $\xi_p$ can be constructed as
$$\hat{\xi}_p \pm \frac{z_{\alpha/2}\sqrt{p(1-p)}}{\sqrt{n}\,f(\hat{\xi}_p)}.$$
The interval has a simplistic appeal and is computed much more easily than an exact interval based on order statistics.
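In code, the interval takes one line once $\hat{\xi}_p$ is in hand. The sketch below is our illustration, assuming $\mathrm{Exp}(1)$ data with the density evaluated at $\hat{\xi}_p$, i.e., $f(\hat{\xi}_p) = e^{-\hat{\xi}_p}$; in practice $f$ would itself have to be estimated.

```python
# Sketch of the interval of Example 9.4 for the median (p = 0.5) of Exp(1)
# data; the true value is xi_{0.5} = log 2 ~ 0.693.
import numpy as np

rng = np.random.default_rng(4)
n, p, z = 400, 0.5, 1.96
x = np.sort(rng.exponential(size=n))
xi_hat = x[int(np.floor(n * p)) - 1]
half = z * np.sqrt(p * (1 - p)) / (np.sqrt(n) * np.exp(-xi_hat))
print(f"{xi_hat - half:.3f} <= xi_p <= {xi_hat + half:.3f}")
```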

9.1.3 Several Order Statistics

Just as a single central order statistic is asymptotically normal under very mild conditions, any fixed number of central order statistics are jointly asymptotically normal, under similar conditions. Furthermore, any two of them are positively correlated, and there is a simple explicit description of the asymptotic covariance matrix. We present that result next; a detailed proof can be found in Serfling (1980) and Reiss (1989).

Theorem 9.2 (Several Order Statistics). Let $X_{k_i:n}, 1 \le i \le r$, be $r$ specified order statistics, and suppose for some $0 < q_1 < q_2 < \cdots < q_r < 1$, $\sqrt{n}\left(\frac{k_i}{n} - q_i\right) \to 0$ as $n \to \infty$. Then
$$\sqrt{n}\left[(X_{k_1:n}, X_{k_2:n}, \ldots, X_{k_r:n}) - (\xi_{q_1}, \xi_{q_2}, \ldots, \xi_{q_r})\right] \xrightarrow{\mathcal{L}} N_r(0, \Sigma),$$
where for $i \le j$,
$$\sigma_{ij} = \frac{q_i(1-q_j)}{F'(\xi_{q_i})F'(\xi_{q_j})},$$
provided $F'(\xi_{q_i})$ exists and is $> 0$ for each $i = 1, 2, \ldots, r$.



Remark. Note that $\sigma_{ij} > 0$ for $i \ne j$ in the above theorem. However, as we proved in Chapter 6, what is true is that if $X_i, 1 \le i \le n$, are iid from any CDF $F$, then, provided the covariance exists, $\mathrm{Cov}(X_{i:n}, X_{j:n}) \ge 0$ for any $i, j$ and any $n \ge 2$.

An important consequence of the joint asymptotic normality of a finite number of order statistics is that linear combinations of a finite number of order statistics will be asymptotically univariate normal. A precise statement is as follows.

Corollary. Let $c_1, c_2, \ldots, c_r$ be $r$ fixed real numbers, and let $c' = (c_1, \ldots, c_r)$. Under the assumptions of the above theorem,
$$\sqrt{n}\left(\sum_{i=1}^{r}c_iX_{k_i:n} - \sum_{i=1}^{r}c_i\xi_{q_i}\right) \xrightarrow{\mathcal{L}} N(0, c'\Sigma c).$$

The corollary is a simple consequence of the continuous mapping theorem. Here is an important statistical application.

Example 9.5 (The Interquartile Range). The 25th and the 75th percentiles of a set of sample observations are called the first and the third quartile of the sample. The difference between them gives information about the spread in the distribution from which the sample values are coming. The difference is called the interquartile range, and we denote it as IQR. In statistics, suitable multiples of the IQR are often used as measures of spread, and are then compared to traditional measures of spread, such as the sample standard deviation $s$.

Let $k_1 = \lfloor\frac{n}{4}\rfloor, k_2 = \lfloor\frac{3n}{4}\rfloor$. Then $\mathrm{IQR} = X_{k_2:n} - X_{k_1:n}$. It follows on some calculation from the above corollary that
$$\sqrt{n}\left(\mathrm{IQR} - (\xi_{\frac{3}{4}} - \xi_{\frac{1}{4}})\right) \xrightarrow{\mathcal{L}} N\left(0, \frac{1}{16}\left[\frac{3}{f^2(\xi_{\frac{3}{4}})} + \frac{3}{f^2(\xi_{\frac{1}{4}})} - \frac{2}{f(\xi_{\frac{1}{4}})f(\xi_{\frac{3}{4}})}\right]\right).$$
Here, the notation $f$ means the derivative of $F$ at the particular point. In most cases, $f$ is simply the density of $F$.

Specializing to the case when $F$ is the CDF of $N(\mu, \sigma^2)$, on some algebra and computation, for normally distributed iid observations,
$$\sqrt{n}(\mathrm{IQR} - 1.35\sigma) \xrightarrow{\mathcal{L}} N(0, 2.48\sigma^2)$$
$$\Rightarrow \sqrt{n}\left(\frac{\mathrm{IQR}}{1.35} - \sigma\right) \xrightarrow{\mathcal{L}} N\left(0, \frac{2.48}{1.35^2}\sigma^2\right) = N(0, 1.36\sigma^2).$$
On the other hand, $\sqrt{n}(s - \sigma) \xrightarrow{\mathcal{L}} N(0, .5\sigma^2)$ (we have previously solved this problem in general for any distribution with four finite moments). The ratio of the asymptotic variances, namely, $\frac{.5}{1.36} = .37$, is the ARE of the IQR-based estimate relative to $s$. Thus, for normal data, one is better off using $s$. For populations with thicker tails, IQR-based estimates can be more efficient. DasGupta and Haff (2006) work out the general asymptotic theory for comparison between estimates based on IQR and $s$.
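The normal-case constants above are easy to corroborate by simulation; the following sketch (ours) estimates the variance of $\sqrt{n}(\mathrm{IQR} - 1.349\sigma)$ for standard normal data, using the more precise centering value $1.349 \approx \Phi^{-1}(.75) - \Phi^{-1}(.25)$ rather than the rounded $1.35$ of the text.

```python
# Sketch: for N(0,1) data, sqrt(n)(IQR - 1.34898) should have variance
# near 2.48 (the constant derived in Example 9.5).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 2000, 5_000
X = rng.standard_normal((reps, n))
iqr = np.quantile(X, 0.75, axis=1) - np.quantile(X, 0.25, axis=1)
T = np.sqrt(n) * (iqr - 1.34898)
print(T.mean(), T.var())   # mean near 0, variance near 2.48
```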

9.2 Extremes

The asymptotic theory of sample extremes for iid observations is completely different from that of central order statistics. For one thing, the limiting distributions of extremes are never normal. Sample extremes are becoming increasingly useful in various statistical problems, such as financial modeling, climate studies, multiple testing, and disaster planning. As such, the asymptotic theory of extremes is gaining in importance. General references for this section are Galambos (1987), Reiss (1989), Sen and Singer (1993), and DasGupta (2008).

We start with a familiar easy example that illustrates the different kind of asymptotics that extremes have, compared to central order statistics.

Example 9.6. Let $U_1, \ldots, U_n \stackrel{\text{iid}}{\sim} U[0,1]$. Then
$$P(n(1 - U_{n:n}) > t) = P\left(1 - U_{n:n} > \frac{t}{n}\right) = P\left(U_{n:n} < 1 - \frac{t}{n}\right) = \left(1 - \frac{t}{n}\right)^n, \text{ if } 0 < t < n.$$
So $P(n(1 - U_{n:n}) > t) \to e^{-t}$ for all $t > 0$, which implies that $n(1 - U_{n:n}) \xrightarrow{\mathcal{L}} \mathrm{Exp}(1)$. Notice two key things: the limit is nonnormal and the norming constant is $n$, not $\sqrt{n}$. The norming constant in general depends on the tail of the underlying CDF $F$.
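A two-line simulation confirms the limit; the sketch below (ours) uses the exact fact that the maximum of $n$ iid uniforms is distributed as $U^{1/n}$ for a single uniform $U$.

```python
# Sketch of Example 9.6: n(1 - max U_i) should be approximately Exp(1).
import numpy as np

rng = np.random.default_rng(6)
n, reps = 1000, 200_000
T = n * (1 - rng.uniform(size=reps) ** (1.0 / n))  # U^(1/n) =_d max of n uniforms
print(T.mean())                                    # ~ 1
print(np.mean(T > 1.0), np.exp(-1.0))              # ~ e^{-1} = 0.368
```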
It turns out that if $X_1, X_2, \ldots$ are iid from some $F$, then the limit distribution of $X_{n:n}$, if it at all exists, can be only one of three types. Characterizations are available and were obtained rigorously by Gnedenko (1943), although some of his results were previously known to Frechet, Fisher, Tippett, and von Mises.

The usual characterization result, called the convergence of types theorem, is somewhat awkward to state and can be difficult to verify. Therefore, we first present more easily verifiable sufficient conditions. We make the assumption until further notified that $F$ is continuous.

9.2.1 Easily Applicable Limit Theorems

We start with a few definitions.

Definition 9.1. A CDF $F$ on an interval with $\xi(F) = \sup\{x : F(x) < 1\} < \infty$ is said to have terminal contact of order $m$ if $F^{(j)}(\xi(F)) = 0$ for $j = 1, \ldots, m$ and $F^{(m+1)}(\xi(F)) \ne 0$.

Example 9.7. Consider the Beta density $f(x) = (m+1)(1-x)^m, 0 < x < 1$. Then the CDF $F(x) = 1 - (1-x)^{m+1}$. For this distribution, $\xi(F) = 1$, and $F^{(j)}(1) = 0$ for $j = 1, \ldots, m$, whereas $F^{(m+1)}(1) = (-1)^m(m+1)! \ne 0$. Thus, $F$ has terminal contact of order $m$.

Definition 9.2. A CDF $F$ with $\xi(F) = \infty$ is said to be of an exponential type if $F$ is absolutely continuous and infinitely differentiable, and if for each fixed $j \ge 2$,
$$\frac{F^{(j)}(x)}{F^{(j-1)}(x)} \sim -\frac{f(x)}{1-F(x)}, \quad \text{as } x \to \infty,$$
where $\sim$ means that the ratio of the two sides converges to 1.

Example 9.8. Suppose $F(x) = \Phi(x)$, the $N(0,1)$ CDF. Then
$$F^{(j)}(x) = (-1)^{j-1}H_{j-1}(x)\phi(x),$$
where $H_j(x)$ is the $j$th Hermite polynomial and is of degree $j$ (see Chapter 12). Therefore, for every $j$, $\frac{F^{(j)}(x)}{F^{(j-1)}(x)} \sim -x$, while $\frac{\phi(x)}{1-\Phi(x)} \sim x$. Thus, $F = \Phi$ is a CDF of the exponential type.

Definition 9.3. A CDF $F$ with $\xi(F) = \infty$ is said to be of a polynomial type of order $k$ if $x^k(1 - F(x)) \to c$ for some $0 < c < \infty$, as $x \to \infty$.

Example 9.9. All $t$ distributions, including therefore the Cauchy, are of polynomial type. Consider the $t$-distribution with $\alpha$ degrees of freedom and with median zero. Then, it is easily seen that $x^{\alpha}(1 - F(x))$ has a finite nonzero limit. Hence, a $t_{\alpha}$-distribution is of the polynomial type of order $\alpha$.

We now present our sufficient conditions for weak convergence of the maximum to three different types of limit distributions. The first three theorems below are proved in de Haan (2006, pp. 15–19); also see Sen and Singer (1993). The first result handles cases such as the uniform on a bounded interval.

Theorem 9.3. Suppose $X_1, X_2, \ldots$ are iid from a CDF with $m$th order terminal contact at $\xi(F) < \infty$. Then for suitable $a_n, b_n$,
$$\frac{X_{n:n} - a_n}{b_n} \xrightarrow{\mathcal{L}} G,$$
where
$$G(t) = \begin{cases} e^{-(-t)^{m+1}}, & t \le 0; \\ 1, & t > 0. \end{cases}$$
Moreover, $a_n$ can be chosen to be $\xi(F)$, and one can choose
$$b_n = \left[\frac{(-1)^m(m+1)!}{nF^{(m+1)}(\xi(F))}\right]^{\frac{1}{m+1}}.$$
The second result handles cases such as the $t$-distribution.



Theorem 9.4. Suppose $X_1, X_2, \ldots$ are iid from a CDF $F$ of a polynomial type of order $k$. Then for suitable $a_n, b_n$,
$$\frac{X_{n:n} - a_n}{b_n} \xrightarrow{\mathcal{L}} G,$$
where
$$G(t) = \begin{cases} e^{-t^{-k}}, & t \ge 0; \\ 0, & t < 0. \end{cases}$$
Moreover, $a_n$ can be chosen to be 0, and one can choose $b_n = F^{-1}\left(1 - \frac{1}{n}\right)$.

The last result handles in particular the important normal case.

Theorem 9.5. Suppose $X_1, X_2, \ldots$ are iid from a CDF $F$ of an exponential type. Then for suitable $a_n, b_n$,
$$\frac{X_{n:n} - a_n}{b_n} \xrightarrow{\mathcal{L}} G,$$
where $G(t) = e^{-e^{-t}}, -\infty < t < \infty$.
Remark. The choice of $a_n, b_n$ is discussed later.

Example 9.10. Suppose $X_1, X_2, \ldots$ are iid from a triangular density on $[0, \theta]$. So,
$$F(x) = \begin{cases} \dfrac{2x^2}{\theta^2}, & \text{if } 0 \le x \le \frac{\theta}{2}; \\[4pt] \dfrac{4x}{\theta} - \dfrac{2x^2}{\theta^2} - 1, & \text{if } \frac{\theta}{2} \le x \le \theta. \end{cases}$$
It follows that $1 - F(\theta) = 0$ and $F^{(1)}(\theta) = 0$, $F^{(2)}(\theta) = -\frac{4}{\theta^2} \ne 0$. Thus $F$ has terminal contact of order $m = 1$ at $\theta$. It follows that $\frac{\sqrt{2n}(X_{n:n} - \theta)}{\theta} \xrightarrow{\mathcal{L}} G$, where $G(t) = e^{-t^2}, t \le 0$.

Example 9.11. Suppose $X_1, X_2, \ldots$ are iid standard Cauchy. Then
$$1 - F(x) = \frac{1}{\pi}\int_x^{\infty}\frac{1}{1+t^2}\,dt = \frac{1}{2} - \frac{\arctan(x)}{\pi} \sim \frac{1}{\pi x},$$
as $x \to \infty$. Therefore, $F^{-1}\left(1 - \frac{1}{n}\right) \sim \frac{n}{\pi}$. Hence, it follows that $\frac{\pi X_{n:n}}{n} \xrightarrow{\mathcal{L}} G$, where $G(t)$ is the CDF of the reciprocal of an $\mathrm{Exp}(1)$ random variable.
The derivation of the sequences $a_n, b_n$ for the asymptotic distribution of the sample maximum in the all-important normal case is surprisingly involved. The normal case is so special that we present it as a theorem.

Theorem 9.6. Suppose $X_1, X_2, \ldots$ are iid $N(0,1)$. Then
$$\sqrt{2\log n}\left(X_{n:n} - \sqrt{2\log n} + \frac{\log\log n + \log 4\pi}{2\sqrt{2\log n}}\right) \xrightarrow{\mathcal{L}} G,$$
where $G(t) = e^{-e^{-t}}, -\infty < t < \infty$.

[Fig. 9.1 True and asymptotic density of the maximum in the $N(0,1)$ case; $n = 50, 100$.]

See de Haan (2006, pp. 11–12) or Galambos (1987) for a proof. The distribution with the CDF $e^{-e^{-t}}$ is generally known as the Gumbel distribution or the extreme value distribution.

Example 9.12 (Sample Maximum in Normal Case). The density of the Gumbel distribution is $g(t) = e^{-t}e^{-e^{-t}}, -\infty < t < \infty$. This distribution has mean $m = \gamma$ (the Euler constant) and variance $v^2 = \frac{\pi^2}{6}$. The asymptotic distribution gives us a formal approximation for the density of $X_{n:n}$:
$$\hat{f}_n(x) = \sqrt{2\log n}\;g\left(\sqrt{2\log n}\left(x - \sqrt{2\log n} + \frac{\log\log n + \log 4\pi}{2\sqrt{2\log n}}\right)\right).$$
Of course, the true density of $X_{n:n}$ is $n\phi(x)\Phi^{n-1}(x)$. The asymptotic and the true density are plotted in Fig. 9.1 for $n = 50, 100$. The asymptotic density is more peaked at the center, and although it is fairly accurate at the upper tail, it is badly inaccurate at the center and the lower tail. Its lower tail dies too quickly. Hall (1979) shows that the rate of convergence of the asymptotic distribution is extremely slow, in a uniform sense.

Formal approximations to the mean and variance of $X_{n:n}$ are also obtained from the asymptotic distribution. Uniform integrability arguments are required to make these formal approximations rigorous:
$$E(X_{n:n}) \approx \sqrt{2\log n} - \frac{\log\log n + \log 4\pi}{2\sqrt{2\log n}} + \frac{\gamma}{\sqrt{2\log n}};$$
$$\mathrm{Var}(X_{n:n}) \approx \frac{\pi^2}{12\log n}.$$
The moment approximations are not as inaccurate as the density approximation. We evaluated the exact means of $X_{n:n}$ in Chapter 6 for selected values of $n$. We reproduce those values with the approximate mean given by the above formula for comparison.
332 9 Asymptotics of Extremes and Order Statistics

n Exact E.XnWn / Approximated Value


10 1.54 1.63
30 2.04 2.11
50 2.25 2.31
100 2.51 2.56
500 3.04 3.07
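The entries of this table are straightforward to reproduce. Here is a minimal Python sketch (illustrative, not part of the text; it assumes NumPy and SciPy are available) that integrates against the exact density $n\varphi(x)\Phi^{n-1}(x)$ and evaluates the approximation above.

    # Illustrative check of the table: exact E(X_{n:n}) for N(0,1) samples
    # versus the Gumbel-based approximation.
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    def exact_mean_max(n):
        # density of the maximum of n iid N(0,1) variables: n phi(x) Phi(x)^(n-1)
        integrand = lambda x: x * n * norm.pdf(x) * norm.cdf(x) ** (n - 1)
        return quad(integrand, -np.inf, np.inf)[0]

    def approx_mean_max(n):
        b = np.sqrt(2 * np.log(n))
        # sqrt(2 log n) - (log log n + log 4 pi)/(2 sqrt(2 log n)) + gamma/sqrt(2 log n)
        return b - (np.log(np.log(n)) + np.log(4 * np.pi)) / (2 * b) + np.euler_gamma / b

    for n in (10, 30, 50, 100, 500):
        print(n, round(exact_mean_max(n), 2), round(approx_mean_max(n), 2))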
9.2.2 The Convergence of Types Theorem

We now present the famous trichotomy result of asymptotics for the sample maximum in the iid case. Either the sample maximum $X_{n:n}$ cannot be centered and normalized in any way at all to have a nondegenerate limit distribution, or, it can be, in which case the limit distribution must be one of exactly three types. Which type applies to a specific example depends on the support and the upper tail behavior of the underlying CDF. The three possible types of limit distributions are the following:
$$G_{1,r}(x) = e^{-x^{-r}}, \ x > 0; \qquad = 0, \ x \le 0;$$
$$G_{2,r}(x) = e^{-(-x)^{r}}, \ x \le 0; \qquad = 1, \ x > 0;$$
$$G_3(x) = e^{-e^{-x}}, \ -\infty < x < \infty;$$
where $r > 0$.
To identify which type applies in a specific case, we need a few definitions that are related to the upper tail behavior of the CDF $F$.

Definition 9.4. A function $g$ is called slowly varying at $\infty$ if for each $t > 0$, $\lim_{x\to\infty} \frac{g(tx)}{g(x)} = 1$, and is said to be of regular variation of index $r$ if for each $t > 0$, $\lim_{x\to\infty} \frac{g(tx)}{g(x)} = t^r$.
Definition 9.5. Suppose $X_1, X_2, \ldots$ are iid with CDF $F$. We say that $F$ is in the domain of maximal attraction of a CDF $G$, and write $F \in \mathcal{D}(G)$, if for some $a_n, b_n$,
$$\frac{X_{n:n} - a_n}{b_n} \xrightarrow{\mathcal{L}} G.$$
We now state the three main theorems that make up the trichotomy result. See Chapter 1 in de Haan (2006) for a proof of these three theorems.

Theorem 9.7. $F \in \mathcal{D}(G_{1,r})$ iff $\omega(F) = \infty$ and $1 - F$ is of regular variation (of index $-r$) at $\infty$. In this case, $a_n$ can be chosen to be zero, and $b_n$ such that $1 - F(b_n) \sim \frac{1}{n}$.
Theorem 9.8. $F \in \mathcal{D}(G_{2,r})$ iff $\omega(F) < \infty$ and $F_e \in \mathcal{D}(G_{1,r})$, where $F_e$ is the CDF $F_e(x) = F\!\left(\omega(F) - \frac{1}{x}\right), \ x > 0$. In this case, $a_n$ may be chosen to be $\omega(F)$ and $b_n$ such that $1 - F(a_n - b_n) \sim \frac{1}{n}$.
Theorem 9.9. $F \in \mathcal{D}(G_3)$ iff there is a function $u(t) > 0$ such that
$$\lim_{t \to \omega(F)} \frac{1 - F(t + x\,u(t))}{1 - F(t)} = e^{-x}, \quad \text{for all } x.$$
Remark. In this last case, $a_n, b_n$ can be chosen as $a_n = F^{-1}\!\left(1 - \frac{1}{n}\right)$ and $b_n = u(a_n)$. However, the question of choosing such a function $u(t)$ is the most complicated part of the trichotomy phenomenon. We have a limited discussion on it below. A detailed discussion is available in DasGupta (2008). We recall a definition.

Definition 9.6. Given a CDF $F$, the mean residual life is the function $L(t) = E_F(X - t \mid X > t)$.
One specific important result covering some important special cases with unbounded upper end points for the support of $F$ is the following.

Theorem 9.10. Suppose $\omega(F) = \infty$. If $L(t)$ is of regular variation at $t = \infty$, and if $\lim_{t\to\infty} \frac{L(t)}{t} = 0$, then $F \in \mathcal{D}(G_3)$.

Here are two important examples.
Example 9.13. Let $X \sim \mathrm{Exp}(\lambda)$. Then trivially, $L(t) =$ constant. Any constant function is obviously slowly varying and so is of regular variation. Also, obviously, $\lim_{t\to\infty} \frac{L(t)}{t} = 0$. Therefore, for any exponential distribution, the CDF $F \in \mathcal{D}(G_3)$.
Example 9.14. Let $X \sim N(0,1)$. Then, from Chapter 14,
$$L(t) = \frac{1}{R(t)} - t \sim \frac{1}{t},$$
where $R(t)$ is the Mills ratio. Therefore, $L(t)$ is of regular variation at $t = \infty$, and also, obviously, $\lim_{t\to\infty} \frac{L(t)}{t} = 0$. It follows that $\Phi \in \mathcal{D}(G_3)$, and as a consequence the CDF of any normal distribution also belongs to $\mathcal{D}(G_3)$.
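Theorems 9.5 and 9.6 are easy to visualize by simulation. The following Python sketch (illustrative only; the sample size $n = 100$, the replicate count, and the evaluation grid are arbitrary choices) generates maxima of $N(0,1)$ samples, normalizes them with the constants of Theorem 9.6, and compares the empirical CDF with the Gumbel CDF. Consistent with the slow convergence noted earlier, the agreement is expected to be only rough.

    # Illustrative simulation: normalized normal maxima versus the Gumbel limit.
    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 100, 50_000
    b = np.sqrt(2 * np.log(n))
    a = b - (np.log(np.log(n)) + np.log(4 * np.pi)) / (2 * b)  # centering of Theorem 9.6

    maxima = rng.standard_normal((reps, n)).max(axis=1)
    z = b * (maxima - a)                                       # normalized maxima

    t = np.linspace(-2.0, 4.0, 7)
    empirical = (z[:, None] <= t).mean(axis=0)
    gumbel = np.exp(-np.exp(-t))
    print(np.round(empirical, 3))
    print(np.round(gumbel, 3))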
9.3 Fisher–Tippett Family and Putting it Together

The three different types of distributions that can at all arise as limit distributions of centered and normalized sample extremes can be usefully unified in a single one-parameter family of distributions, called the Fisher–Tippett distributions. We do so in this section, together with some additional simplifications.
Suppose for some real sequences $a_n, b_n$, we have the convergence result that $\frac{X_{(n)} - a_n}{b_n} \xrightarrow{\mathcal{L}} G$. Then, from the definition of convergence in distribution, we have that at each continuity point $x$ of $G$,
$$P\!\left(\frac{X_{(n)} - a_n}{b_n} \le x\right) \to G(x) \Leftrightarrow P(X_{(n)} \le a_n + b_n x) \to G(x)$$
$$\Leftrightarrow F^n(a_n + b_n x) \to G(x)$$
$$\Leftrightarrow n \log F(a_n + b_n x) \to \log G(x)$$
$$\Leftrightarrow n \log[1 - (1 - F(a_n + b_n x))] \to \log G(x)$$
$$\Leftrightarrow n[1 - F(a_n + b_n x)] \to -\log G(x)$$
$$\Leftrightarrow \frac{1}{n[1 - F(a_n + b_n x)]} \to \frac{1}{-\log G(x)}.$$
Now, if we consider the case where $F$ is strictly monotone, and let $U(t)$ denote the inverse function of $\frac{1}{1 - F(x)}$, then the last convergence assertion above is equivalent to
$$\frac{U(nx) - a_n}{b_n} \to G^{-1}\!\left(e^{-\frac{1}{x}}\right), \quad x > 0.$$
We can appreciate the role of this function $U(t)$ in determining the limit distribution of the maximum in a given case. Not only that, when this is combined with the Fisher–Tippett result that the possible choices of $G$ are a very precisely defined one-parameter family, the statement takes an even more aesthetically pleasing form. The Fisher–Tippett result and a set of equivalent characterizations for a given member of the Fisher–Tippett family to be the correct limiting distribution in a specific problem are given below. See pp. 6–8 in de Haan (2006) for a proof of the Fisher–Tippett theorem.
Theorem 9.11 (Fisher–Tippett Theorem). Let $X_1, X_2, \ldots$ be iid with the common CDF $F$. Suppose for some sequences $a_n, b_n$, and some CDF $G$, $\frac{X_{(n)} - a_n}{b_n} \xrightarrow{\mathcal{L}} G$. Then $G$ must be a location-scale shift of some member of the one-parameter family
$$\left\{G_\gamma : G_\gamma(x) = e^{-(1+\gamma x)^{-1/\gamma}}, \ 1 + \gamma x > 0, \ \gamma \in \mathbb{R}\right\};$$
in the above, $G_0$ is defined to be $G_0(x) = e^{-e^{-x}}, \ -\infty < x < \infty$.
We can reconcile the Fisher–Tippett theorem with the trichotomy result we have previously described. Indeed,
(a) For $\gamma > 0$, using the particular location-scale shift $G_\gamma\!\left(\frac{x-1}{\gamma}\right)$, and denoting $\frac{1}{\gamma}$ as $\alpha$, we end up with $G_{1,\alpha}$.
(b) For $\gamma = 0$, we directly arrive at the Gumbel law $G_3$.
(c) For $\gamma < 0$, using the particular location-scale shift $G_\gamma\!\left(-\frac{x+1}{\gamma}\right)$, and denoting $-\frac{1}{\gamma}$ as $\alpha$, we end up with $G_{2,\alpha}$.
If we combine the Fisher–Tippett theorem with the convergence condition $\frac{U(nx) - a_n}{b_n} \to G^{-1}(e^{-\frac{1}{x}})$, we get the following neat and useful theorem.

Theorem 9.12. Let $X_1, X_2, \ldots$ be iid with the common CDF $F$. Then the following are all equivalent.
(a) For some real sequences $a_n, b_n$, and some real $\gamma$, $\frac{X_{(n)} - a_n}{b_n} \xrightarrow{\mathcal{L}} G_\gamma$.
(b) $F^n(a_n + b_n x) \to G_\gamma(x)$ for all real $x$.
(c) $n[1 - F(a_n + b_n x)] \to (1 + \gamma x)^{-1/\gamma}$ for all real $x$.
(d) For some positive function $a(t)$, and any $x > 0$, $\frac{U(tx) - U(t)}{a(t)} \to \frac{x^\gamma - 1}{\gamma}$ as $t \to \infty$.
Here, for $\gamma = 0$, $\frac{x^\gamma - 1}{\gamma}$ is defined to be $\log x$. Furthermore, if the condition in this part holds with a specific function $a(t)$, then in part (a), one may choose $a_n = U(n)$ and $b_n = a(n)$.
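Condition (d) is easy to examine numerically in simple cases. The following Python sketch (illustrative only, not from the text) takes $U(t)$ to be the inverse of $\frac{1}{1-F}$: for the Exp(1) CDF, $U(t) = \log t$ with $\gamma = 0$ and $a(t) \equiv 1$; for a Pareto tail $1 - F(x) = x^{-\alpha}$, $U(t) = t^{1/\alpha}$ with $\gamma = \frac{1}{\alpha}$ and $a(t) = \frac{1}{\alpha}t^{1/\alpha}$. The choices of $t$ and of the grid of $x$ values are arbitrary.

    # Illustrative numerical check of condition (d) in Theorem 9.12.
    import numpy as np

    def check(U, a, gamma, t=1e8, xs=(0.5, 2.0, 5.0)):
        for x in xs:
            lhs = (U(t * x) - U(t)) / a(t)
            rhs = np.log(x) if gamma == 0 else (x**gamma - 1) / gamma
            print(f"x={x}: {lhs:.6f} vs {rhs:.6f}")

    check(U=np.log, a=lambda t: 1.0, gamma=0.0)                 # exponential, gamma = 0
    alpha = 3.0
    check(U=lambda t: t**(1/alpha),
          a=lambda t: (1/alpha) * t**(1/alpha), gamma=1/alpha)  # Pareto, gamma = 1/alpha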
Exercises

Exercise 9.1 (Sample Maximum). Let $X_i, i \ge 1$ be an iid sequence, and $X_{(n)}$ the maximum of $X_1, \ldots, X_n$. Let $\omega(F) = \sup\{x : F(x) < 1\}$, where $F$ is the common CDF of the $X_i$. Prove that $X_{(n)} \xrightarrow{a.s.} \omega(F)$.

Exercise 9.2 * (Records). Let $X_i, i \ge 1$ be an iid sequence with a common density $f(x)$, and let $N_n$ denote the number of records observed up to time $n$. By using the results in Chapter 6, and the Borel–Cantelli lemma, prove that $N_n \xrightarrow{a.s.} \infty$.

Exercise 9.3 (Asymptotic Relative Efficiency of Median). For each of the following cases, derive the asymptotic relative efficiency of the sample median with respect to the sample mean:
(a) Double exponential$(\mu, 1)$;
(b) $U[\mu - \sigma, \mu + \sigma]$;
(c) Beta$(\alpha, \alpha)$.

Exercise 9.4 * (Percentiles with Large and Small Variance). Consider the standard normal, standard double exponential, and the standard Cauchy densities. For $0 < p < 1$, find expressions for the variance in the limiting normal distribution of $F_n^{-1}(p)$, and plot them as functions of $p$. Which percentiles are the most variable, and which the least?

Exercise 9.5 (Interquartile Range). Find the limiting distribution of the interquartile range for sampling from the normal, double exponential, and Cauchy distributions.

Exercise 9.6 (Quartile Ratio). Find the limiting distribution of the quartile ratio defined as $X_{\lfloor 3n/4 \rfloor : n} / X_{\lfloor n/4 \rfloor : n}$ for the exponential, Pareto, and uniform distributions.

Exercise 9.7 * (Best Linear Combination). Suppose $X_1, X_2, \ldots$ are iid with density $f(x - \mu)$. For each of the following cases, find the estimate of the form
$$pX_{\lfloor n\alpha_1 \rfloor : n} + pX_{n - \lfloor n\alpha_1 \rfloor : n} + (1 - 2p)X_{\lfloor n/2 \rfloor : n}$$
which has the smallest asymptotic variance:
(a) $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$;
(b) $f(x) = \frac{1}{2} e^{-|x|}$;
(c) $f(x) = \frac{1}{\pi(1 + x^2)}$.

Exercise 9.8 * (Poisson Median Oscillates). Suppose $X_i$ are iid from a Poisson distribution with mean 1. How would the sample median behave asymptotically? Specifically, does it converge in probability to some number? Does it converge in distribution on any centering and norming?

Exercise 9.9 * (Position of the Mean Among Order Statistics). Let $X_i$ be iid standard Cauchy. Let $N_n$ be defined as the number of observations among $X_1, \ldots, X_n$ that are less than or equal to $\bar X$. Let $p_n = \frac{N_n}{n}$. Show that $p_n$ converges to the $U[0,1]$ distribution.

Hint: Use the Glivenko–Cantelli theorem.

Exercise 9.10 * (Position of the Mean Among Order Statistics). Let $X_i$ be iid standard normal. Let $N_n$ be defined as the number of observations among $X_1, \ldots, X_n$ that are less than or equal to $\bar X$. Let $p_n = \frac{N_n}{n}$. Show that on suitable centering and norming, $p_n$ converges to a normal distribution, and find the variance of this limiting normal distribution.

Exercise 9.11 (Domain of Attraction for Sample Maximum). In what domain of (maximal) attraction are the following distributions? (a) $F = t_\alpha, \ \alpha > 0$; (b) $F = \chi^2_k$; (c) $F = (1 - \epsilon)N(0,1) + \epsilon C(0,1)$.

Exercise 9.12 (Maximum of the Absolute Values). Let $X_i$ be iid standard normal, and let $M_n = \max_{1 \le i \le n} |X_i|, \ n \ge 1$. Find constants $a_n, b_n$ and a CDF $G$ such that $\frac{M_n - a_n}{b_n} \xrightarrow{\mathcal{L}} G$.

Exercise 9.13 (Limiting Distribution of the Second Largest).
(a) Let $X_i$ be iid $U[0,1]$ random variables. Find constants $a_n, b_n$ and a CDF $G$ such that $\frac{X_{n-1:n} - a_n}{b_n} \xrightarrow{\mathcal{L}} G$;
(b) * Let $X_i$ be iid standard Cauchy random variables. Find constants $a_n, b_n$, and a CDF $G$ such that $\frac{X_{n-1:n} - a_n}{b_n} \xrightarrow{\mathcal{L}} G$.

Exercise 9.14. Let $X_1, X_2, \ldots$ be iid Exp(1) samples.
(a) Find a sequence $a_n$ such that $X_{1:n} - a_n \xrightarrow{a.s.} 0$.
(b) * Find a sequence $c_n$ such that $\frac{X_{1:n}}{c_n} \xrightarrow{a.s.} 1$.

Exercise 9.15 (Unusual Behavior of Sample Maximum). Suppose $X_1, X_2, \ldots, X_n$ are iid Bin$(N, p)$, where $N, p$ are considered fixed. Show, by using the Borel–Cantelli lemma, that $X_{n:n} = N$ with probability one for all large $n$.

Remark. However, typically, $E(X_{n:n}) \ll N$, and estimating $N$ is a very difficult statistical problem. The problem is treated in detail in DasGupta and Rubin (2005).

Exercise 9.16 * (Limiting Distribution of Sample Range). Let $X_i, i \ge 1$ be iid $N(0,1)$ and $W_n = X_{n:n} - X_{1:n}$. Identify constants $\alpha_n, \beta_n$, and a CDF $G$ such that $\beta_n(W_n - \alpha_n) \xrightarrow{\mathcal{L}} G$.

Hint: Use without proof that $X_{1:n}, X_{n:n}$ are asymptotically independent, and the result in the text on the limiting distribution of $X_{n:n}$ in the $N(0,1)$ case. Finally, calculate in closed form the appropriate convolution density.

Exercise 9.17 * (Limiting Distribution of Sample Range). Let $X_i, i \ge 1$ be iid $C(0,1)$ and $W_n = X_{n:n} - X_{1:n}$. Identify constants $\alpha_n, \beta_n$, and a CDF $G$ such that $\beta_n(W_n - \alpha_n) \xrightarrow{\mathcal{L}} G$.

Hint: Use the same hint as in the previous exercise.

Exercise 9.18 * (Mean and Variance of Maximum). Suppose $X_1, X_2, \ldots$ are iid Exp(1). Do the mean and the variance of $X_{n:n} - \log n$ converge to those of the Gumbel law $G_3$?

Hint: They do. Either use direct calculations or use uniform integrability arguments.

Exercise 9.19 (Uniform Integrability of Order Statistics). Suppose $X_i, i \ge 1$ are iid from some CDF $F$ and suppose $F$ has an mgf in an interval around zero. Given $k \ge 1$, find a sequence $c_n$ such that $\frac{X_{n-k:n}}{c_n}$ is uniformly integrable.

Hint: Use the moment inequalities in Chapter 6.

Exercise 9.20 * (Asymptotic Independence of Minimum and Maximum). Let $X_i, i \ge 1$ be iid random variables with a common CDF $F$. Find the weakest condition on $F$ such that
$$P(X_{(1)} > x, X_{(n)} \le y) - P(X_{(1)} > x)P(X_{(n)} \le y) \to 0$$
as $n \to \infty$, $\forall x, y, \ x < y$.

Exercise 9.21. Suppose $X_i, i \ge 1$ are iid $U[0,1]$. Does $E[\{n(1 - X_{n:n})\}^k]$ have a limit for all $k$?
References

DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
DasGupta, A. and Haff, L. (2006). Asymptotic expansions for correlations between different measures of spread, JSPI, 136, 2197–2212.
DasGupta, A. and Rubin, H. (2005). Estimation of binomial parameters when N, p are both unknown, JSPI, Felicitation Volume for Herman Chernoff, 130, 391–404.
David, H.A. (1980). Order Statistics, Wiley, New York.
de Haan, L. (2006). Extreme Value Theory: An Introduction, Springer, New York.
Galambos, J. (1987). Asymptotic Theory of Extreme Order Statistics, Wiley, New York.
Gnedenko, B.V. (1943). Sur la distribution limite du terme maximum d'une série aléatoire, Annals of Math., 44, 423–453.
Hall, P. (1979). On the rate of convergence of normal extremes, J. Appl. Prob., 16(2), 433–439.
Reiss, R. (1989). Approximate Distributions of Order Statistics, Springer, New York.
Sen, P.K. and Singer, J. (1993). Large Sample Methods in Statistics, Chapman and Hall, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.
Chapter 10
Markov Chains and Applications

In many applications, successive observations of a process, say, $X_1, X_2, \ldots$, have an inherent time component associated with them. For example, the $X_i$ could be the state of the weather at a particular location on the $i$th day, counting from some fixed day. In a simplistic model, the state of the weather could be "dry" or "wet", quantified as, say, 0 and 1. It is hard to believe that in such an example, the sequence $X_1, X_2, \ldots$ could be mutually independent. The question then arises how to model the dependence among the $X_i$. Probabilists have numerous dependency models. A particular model that has earned a very special status is called a Markov chain. In a Markov chain model, we assume that the future, given the entire past and the present state of a process, depends only on the present state. In the weather example, suppose we want to assign a probability that tomorrow, say March 10, will be dry, and suppose that we have available to us the precipitation history for each of January 1 to March 9. The Markov chain model would entail that our probability that March 10 will be dry will depend only on the state of the weather on March 9, even though the entire past precipitation history was available to us. As simple as it sounds, Markov chains are enormously useful in applications, perhaps more than any other specific dependency model. They also are independently relevant to statistical computing in very important ways. The topic has an incredibly rich and well-developed theory, with links to many other topics in probability theory. Familiarity with basic Markov chain terminology and theory is often considered essential for anyone interested in studying statistics and probability. We present an introduction to basic Markov chain theory in this chapter.

Feller (1968), Freedman (1975), and Isaacson and Madsen (1976) are classic references for Markov chains. Modern treatments are available in Bhattacharya and Waymire (2009), Brémaud (1999), Meyn and Tweedie (1993), Norris (1997), Seneta (1981), and Stirzaker (1994). Classic treatment of the problem of gambler's ruin is available in Feller (1968) and Kemperman (1950). Numerous interesting examples at more advanced levels are available in Diaconis (1988); sophisticated applications at an advanced level are also available in Bhattacharya and Waymire (2009).

10.1 Notation and Basic Definitions

Definition 10.1. A sequence of random variables $\{X_n\}, n \ge 0$, is said to be a Markov chain if for some countable set $S \subseteq \mathbb{R}$, and any $n \ge 1$, $x_{n+1}, x_n, \ldots, x_0 \in S$,
$$P(X_{n+1} = x_{n+1} \mid X_0 = x_0, \ldots, X_n = x_n) = P(X_{n+1} = x_{n+1} \mid X_n = x_n).$$
The set $S$ is called the state space of the chain. If $S$ is a finite set, the chain is called a finite state Markov chain. $X_0$ is called the initial state.

Without loss of generality, we can denote the elements of $S$ as $1, 2, \ldots$, although in some examples we may use the original labeling of the states to avoid confusion.

Definition 10.2. The distribution of the initial state $X_0$ is called the initial distribution of the chain. We denote the pmf of the initial distribution as $\pi_i = P(X_0 = i)$.

Definition 10.3. A Markov chain $\{X_n\}$ is called homogeneous or stationary if $P(X_{n+1} = y \mid X_n = x)$ is independent of $n$ for any $x, y$.

Definition 10.4. Let $\{X_n\}$ be a stationary Markov chain. Then the probabilities $p_{ij} = P(X_{n+1} = j \mid X_n = i)$ are called the one-step transition probabilities, or simply transition probabilities. The matrix $P = ((p_{ij}))$ is called the transition probability matrix.

Definition 10.5. Let $\{X_n\}$ be a stationary Markov chain. Then the probabilities $p_{ij}(n) = P(X_{n+m} = j \mid X_m = i) = P(X_n = j \mid X_0 = i)$ are called the $n$-step transition probabilities, and the matrix $P^{(n)} = ((p_{ij}(n)))$ is called the $n$-step transition probability matrix.

Remark. If the state space of the chain is finite and has, say, $t$ elements, then the transition probability matrix $P$ is a $t \times t$ matrix. Note that $\sum_{j \in S} p_{ij}$ is always one. A matrix with this property is called a stochastic matrix.

Definition 10.6. A $t \times t$ square matrix $P$ is called a stochastic matrix if for each $i$, $\sum_{j=1}^t p_{ij} = 1$. It is called doubly stochastic or bistochastic if, in addition, for every $j$, $\sum_{i=1}^t p_{ij} = 1$. Thus, a transition probability matrix is always a stochastic matrix.
10.2 Examples and Various Applications as a Model

Markov chains are widely used as models for discrete time sequences that exhibit local dependence. Part of the reason for this popularity of Markov chains as a model is that a coherent, complete, and elegant theory exists for how a chain evolves. We describe below examples from numerous fields where a Markov chain is either the correct model, or is chosen as a model.
Example 10.1 (Weather Pattern). Suppose that in some particular city, any day is either dry or wet. If it is dry on some day, it remains dry the next day with probability $\alpha$, and will be wet with the residual probability $1 - \alpha$. On the other hand, if it is wet on some day, it remains wet the next day with probability $\beta$, and becomes dry with probability $1 - \beta$. Let $X_0, X_1, \ldots$ be the sequence of states of the weather, with $X_0$ being the state on the initial day (on which observation starts). Then $\{X_n\}$ is a two-state stationary Markov chain with the transition probability matrix
$$P = \begin{pmatrix} \alpha & 1-\alpha\\ 1-\beta & \beta \end{pmatrix}.$$
Example 10.2 (Voting Preference). Suppose that in a presidential election, voters can vote for either the Labor party, or the Conservative party, or the Independent party. Someone who has voted for the Labor candidate in this election will vote Labor again with 80% probability, will switch to Conservative with 5% probability, and vote Independent with 15% probability. Someone who has voted for the Conservative candidate in this election will vote Conservative again with 90% probability, switch to Labor with 3% probability, and vote Independent with 7% probability. Someone who has voted for the Independent candidate in this election will vote Independent again with 80% probability, or switch to one of the other parties with 10% probability each. This is a three-state stationary Markov chain with state space $S = \{1, 2, 3\} = \{$Labor, Conservative, Independent$\}$ and the transition matrix
$$P = \begin{pmatrix} .80 & .05 & .15\\ .03 & .90 & .07\\ .10 & .10 & .80 \end{pmatrix}.$$
Example 10.3 (An Urn Model Example). Two balls, say $A, B$, are initially in urn 1, and two others, say $C, D$, are in urn 2. In each successive trial, one ball is chosen at random from urn 1, and one independently and also at random from urn 2, and these balls switch urns. We let $X_n$ denote the vector of locations of the four balls $A, B, C, D$, in that order of the balls, after the $n$th trial. Thus, $X_{10} = 1122$ means that after the tenth trial, $A, B$ are located in urn 1, and $C, D$ in urn 2, and so on. Note that $X_0 = 1122$. Two of the four balls are always in urn 1, and two in urn 2. Thus, the possible number of states is $\binom{4}{2} = 6$. They are $1122, 1212, 1221, 2112, 2121, 2211$. Then $\{X_n\}$ is a six-state stationary Markov chain. What are the transition probabilities?

For notational convenience, denote the above six states as $1, 2, \ldots, 6$, respectively. For the state of the chain to move from state 1 to state 2 in one trial, $B, C$ have to switch their urns. This will happen with probability $.5 \times .5 = .25$. As another example, for the state of the chain to move from state 1 to state 6, all of the four balls must switch their urns. This is not possible. Therefore, this transition probability is zero. Also, note that if the chain is in some state now, it cannot remain in that same state after the next trial. Thus, all diagonal elements in the transition probability matrix must be zero. Indeed, the transition probability matrix is
$$P = \begin{pmatrix}
0 & .25 & .25 & .25 & .25 & 0\\
.25 & 0 & .25 & .25 & 0 & .25\\
.25 & .25 & 0 & 0 & .25 & .25\\
.25 & .25 & 0 & 0 & .25 & .25\\
.25 & 0 & .25 & .25 & 0 & .25\\
0 & .25 & .25 & .25 & .25 & 0
\end{pmatrix}.$$
Notice that there are really three distinct rows in $P$, each occurring twice. It is easy to argue that that is how it should be in this particular urn experiment. Also note the very interesting fact that in each row, as well as in each column, there are two zeroes, and the nonzero entries obviously add to one. This is an example of a transition probability matrix that is doubly stochastic. Markov chains with a doubly stochastic transition probability matrix show a unified long run behavior. By definition, initially the chain is in state 1, and so $P(X_0 = 1) = 1$, $P(X_0 = i) = 0 \ \forall i \ne 1$. However, after many trials, the state of the chain would be any of the six possible states with essentially an equal probability; that is, $P(X_n = i) \approx \frac{1}{6}$ for each possible state $i$ for large $n$. This unifying long run behavior of Markov chains with a doubly stochastic transition probability matrix is a significant result of wide applications in Markov chain theory.

Example 10.4 (Urn Model II: Ehrenfest Model). This example has wide applications in the theory of heat transfers. The mathematical model is that we initially have $m$ balls, some in one urn, say urn I, and the rest in another urn, say urn II. At each subsequent time $n = 1, 2, \ldots$, one ball among the $m$ balls is selected at random. If the ball is in urn I, with probability $\alpha$ it is transferred to urn II, and with probability $1 - \alpha$ it continues to stay in urn I. If the ball is in urn II, with probability $\beta$ it is transferred to urn I, and with probability $1 - \beta$ it continues to stay in urn II.

Let $X_0$ be the number of balls initially in urn I, and $X_n$ the number of balls in urn I after time $n$. Then $\{X_n\}$ is a stationary Markov chain with state space $S = \{0, 1, \ldots, m\}$. If there are, say, $i$ balls in urn I at a particular time, then at the next instant urn I could lose one ball, gain one ball, or neither lose nor gain any ball. It loses a ball if one of its $i$ balls gets selected for possible transfer, and then the transfer actually happens. So $p_{i,i-1} = \frac{i}{m}\alpha$. Using this simple argument, we get the one-step transition probabilities as
$$p_{i,i-1} = \frac{i}{m}\alpha; \quad p_{i,i+1} = \frac{m-i}{m}\beta; \quad p_{ii} = 1 - \frac{i}{m}\alpha - \frac{m-i}{m}\beta,$$
and $p_{ij} = 0$ if $j \ne i-1, i, i+1$.

As a specific example, suppose $m = 7$, and $\alpha = \beta = \frac12$. Then the transition matrix on the state space $S = \{0, 1, \ldots, 7\}$ can be worked out by using the formulas given just above, and it is
$$P = \begin{pmatrix}
\frac12 & \frac12 & 0 & 0 & 0 & 0 & 0 & 0\\
\frac1{14} & \frac12 & \frac37 & 0 & 0 & 0 & 0 & 0\\
0 & \frac17 & \frac12 & \frac5{14} & 0 & 0 & 0 & 0\\
0 & 0 & \frac3{14} & \frac12 & \frac27 & 0 & 0 & 0\\
0 & 0 & 0 & \frac27 & \frac12 & \frac3{14} & 0 & 0\\
0 & 0 & 0 & 0 & \frac5{14} & \frac12 & \frac17 & 0\\
0 & 0 & 0 & 0 & 0 & \frac37 & \frac12 & \frac1{14}\\
0 & 0 & 0 & 0 & 0 & 0 & \frac12 & \frac12
\end{pmatrix}.$$
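The general pattern of this matrix is easy to generate from the one-step formulas. Here is a small Python sketch (illustrative, not from the text) that builds the Ehrenfest transition matrix for any $m, \alpha, \beta$ and reproduces the matrix above for $m = 7$, $\alpha = \beta = \frac12$.

    # Illustrative construction of the Ehrenfest transition matrix.
    import numpy as np

    def ehrenfest(m, alpha, beta):
        # one-step transition matrix on S = {0, 1, ..., m}
        P = np.zeros((m + 1, m + 1))
        for i in range(m + 1):
            if i > 0:
                P[i, i - 1] = (i / m) * alpha          # urn I loses a ball
            if i < m:
                P[i, i + 1] = ((m - i) / m) * beta     # urn I gains a ball
            P[i, i] = 1 - (i / m) * alpha - ((m - i) / m) * beta
        return P

    P = ehrenfest(7, 0.5, 0.5)
    assert np.allclose(P.sum(axis=1), 1)               # each row is a probability vector
    print(P)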

Example 10.5 (Machine Maintenance). Of the machines in a factory, a certain number break down or are identified to be in need of maintenance on a given day. They are sent to a maintenance shop the next morning. The maintenance shop is capable of finishing its maintenance work on some $k$ machines on any given day. We are interested in the sequence $\{X_n\}$, where $X_n$ denotes the number of machines in the maintenance shop on the $n$th day, $n \ge 0$. We may take $X_0 = 0$.

Let $Z_0$ machines break down on day zero. Then, $X_1 = Z_0$. Of these, up to $k$ machines can be fixed by the shop on that day, and these are returned. But now, on day 1, some $Z_1$ machines break down at the factory, so that $X_2 = \max\{X_1 - k, 0\} + Z_1 = \max\{Z_0 - k, 0\} + Z_1$, of which up to $k$ machines can be fixed by the shop on the second day itself and those are returned to the factory. We then have
$$X_3 = \max\{X_2 - k, 0\} + Z_2$$
$$= Z_0 + Z_1 + Z_2 - 2k, \ \text{if } Z_0 \ge k, \ Z_0 + Z_1 \ge 2k;$$
$$= Z_2, \ \text{if } Z_0 \ge k, \ Z_0 + Z_1 < 2k;$$
$$= Z_1 + Z_2 - k, \ \text{if } Z_0 < k, \ Z_1 \ge k;$$
$$= Z_2, \ \text{if } Z_0 < k, \ Z_1 < k,$$
and so on.

If $Z_i, i \ge 0$ are iid, then $\{X_n\}$ forms a stationary Markov chain. The state space of this chain is $\{0, 1, 2, \ldots\}$. What is the transition probability matrix? For simplicity, take $k = 1$. For example, $P(X_2 = 1 \mid X_1 = 0) = P(Z_1 = 1 \mid Z_0 = 0) = P(Z_1 = 1) = p_1$ (say). On the other hand, as another example, $P(X_2 = 2 \mid X_1 = 4) = 0$. If we denote the common mass function of the $Z_i$ by $P(Z_i = j) = p_j, \ j \ge 0$, then the transition probability matrix is
$$P = \begin{pmatrix}
p_0 & p_1 & p_2 & p_3 & \cdots\\
p_0 & p_1 & p_2 & p_3 & \cdots\\
0 & p_0 & p_1 & p_2 & \cdots\\
0 & 0 & p_0 & p_1 & \cdots\\
0 & 0 & 0 & p_0 & \cdots\\
 & & \vdots & &
\end{pmatrix}.$$
Example 10.6 (Hopping Mosquito). Suppose a mosquito makes movements between the forehead, the left cheek, and the right cheek of an individual, which we designate as states 1, 2, 3, according to the following rules. If at some time $n$, the mosquito is sitting on the forehead, then it will definitely move to the left cheek at the next time $n+1$; if it is sitting on the left cheek, it will stay there, or move to the right cheek with probability .5 each; and if it is on the right cheek, it will stay there, or move to the forehead with probability .5 each.

Then the sequence of locations of the mosquito forms a three-state Markov chain with the one-step transition probability matrix
$$P = \begin{pmatrix} 0 & 1 & 0\\ 0 & .5 & .5\\ .5 & 0 & .5 \end{pmatrix}.$$
Example 10.7 (An Example from Genetics). Many traits in organisms, for example, humans, are determined by genes. For example, eye color in humans is determined by a pair of genes. Genes can come in various forms or versions, which are called alleles. An offspring receives one allele from each parent. A parent contributes one of his or her alleles with equal probability to an offspring, and the parents make their contributions independently. Certain alleles dominate over others. For example, the allele for blue eye color is dominated by the allele for brown eye color. The allele for blue color would be called recessive, and the allele for brown eye color would be called dominant. If we denote these as b, B, respectively, then a person may have the pair of alleles BB, Bb, or bb. They are called the dominant, hybrid, and recessive genotypes. We denote them as d, h, r, respectively. Consider now the sequence of genotypes of descendants of an initial individual, and denote the sequence as $\{X_n\}$; for any $n$, $X_n$ must be one of d, h, r (we may call them 1, 2, 3).

Consider now a person with an unknown genotype ($X_0$) mating with a known hybrid. Suppose he has genotype d. He will necessarily contribute B to the offspring. Therefore, the offspring can only have genotype d or h, and not r. It will be d if the offspring also gets the B allele from the mother, and it will be h if the offspring gets b from the mother. The chance of each is $\frac12$. Therefore, the transition probabilities are $P(X_1 = d \mid X_0 = d) = P(X_1 = h \mid X_0 = d) = \frac12$, and $P(X_1 = r \mid X_0 = d) = 0$.

Suppose $X_0 = h$. Then the father contributes B or b with probability $\frac12$ each, and so does the mother, who was assumed to be a hybrid. So the probabilities that $X_1 = d, h, r$ are, respectively, $\frac14, \frac12, \frac14$. If $X_0 = r$, then $X_1$ can only be h or r, with probability $\frac12$ each. So, if we assume this same mating scheme over generations, then $\{X_n\}$ forms a three-state stationary Markov chain with the transition probability matrix
$$P = \begin{pmatrix} .5 & .5 & 0\\ .25 & .5 & .25\\ 0 & .5 & .5 \end{pmatrix}.$$
Example 10.8 (Simple Random Walk). Consider a particle starting at the origin at time zero, and making independent movements of one step to the right or one step to the left at each successive time instant $1, 2, \ldots$. Assume that the particle moves to the right at any particular time with probability $p$, and to the left with probability $q = 1 - p$. The mathematical formulation is that the successive movements are iid random variables $X_1, X_2, \ldots$, with common pmf $P(X_i = 1) = p, \ P(X_i = -1) = q$. The particle's location after the $n$th step has been taken is denoted as $S_n = X_0 + X_1 + \cdots + X_n = X_1 + \cdots + X_n$, assuming that $X_0 = 0$ with probability one. At each time the particle can move by just one unit; thus $\{S_n\}$ is a stationary Markov chain with state space $S = \mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$, and with the transition probabilities
$$p_{ij} = P(S_{n+1} = j \mid S_n = i) = \begin{cases} p & \text{if } j = i+1;\\ q & \text{if } j = i-1;\\ 0 & \text{if } |j - i| > 1, \end{cases} \quad i, j \in \mathbb{Z}.$$
By virtue of the importance of random walks in theory and applications of probability, this is an important example of a stationary Markov chain. Note that the chain is stationary because the individual steps $X_i$ are iid. This is also an example of a Markov chain with an infinite state space.
10.3 Chapman–Kolmogorov Equation

The Chapman–Kolmogorov equation provides a simple method for obtaining the higher-order transition probabilities of a Markov chain in terms of lower-order transition probabilities. Carried to its most convenient form, the equation describes how to calculate by a simple and explicit method all higher-order transition probabilities in terms of the one-step transition probabilities. Because we always start analyzing a chain with the one-step probabilities, it is evidently very useful to know how to calculate all higher-order transition probabilities using just the knowledge of the one-step transition probabilities.

Theorem 10.1 (Chapman–Kolmogorov Equation). Let $\{X_n\}$ be a stationary Markov chain with the state space $S$. Let $n, m \ge 1$. Then,
$$p_{ij}(m+n) = P(X_{m+n} = j \mid X_0 = i) = \sum_{k \in S} p_{ik}(m)\,p_{kj}(n).$$

Proof. A verbal proof is actually the most easily understood. In order to get to state $j$ from state $i$ in $m+n$ steps, the chain must go to some state $k \in S$ in $m$ steps, and then travel from that $k$ to the state $j$ in the next $n$ steps. By adding over all possible $k \in S$, we get the Chapman–Kolmogorov equation.

An extremely important corollary is the following result.
Corollary. Let $P^{(n)}$ denote the $n$-step transition probability matrix. Then, for all $n \ge 2$, $P^{(n)} = P^n$, where $P^n$ denotes the usual $n$th power of $P$.

Proof. From the Chapman–Kolmogorov equation, by using the definition of matrix product, for all $m, n \ge 1$, $P^{(m+n)} = P^{(m)}P^{(n)} \Rightarrow P^{(2)} = PP = P^2$. We now finish the proof by induction. Suppose $P^{(n)} = P^n \ \forall n \le k$. Then, $P^{(k+1)} = P^{(k)}P = P^kP = P^{k+1}$, which finishes the proof. $\square$

A further important consequence is that we can now write an explicit formula for the pmf of the state of the chain at a given time $n$.

Proposition. Let $\{X_n\}$ be a stationary Markov chain with the state space $S$, and one-step transition probability matrix $P$. Fix $n \ge 1$. Then, $\pi_n(i) = P(X_n = i) = \sum_{k \in S} p_{ki}(n)P(X_0 = k)$. In matrix notation, if $\pi = (\pi_1, \pi_2, \ldots)$ denotes the row vector of the initial probabilities $P(X_0 = k), \ k = 1, 2, \ldots$, and if $\pi_n$ denotes the row vector of probabilities $P(X_n = i), \ i = 1, 2, \ldots$, then $\pi_n = \pi P^n$.

This is an important formula, because it lays out how to explicitly find the distribution of $X_n$ from the initial distribution and the one-step transition matrix $P$.
Example 10.9 (Weather Pattern). Consider once again the weather pattern example with the one-step transition probability matrix
$$P = \begin{pmatrix} \alpha & 1-\alpha\\ 1-\beta & \beta \end{pmatrix}.$$
We let the states be 1, 2 (1 = dry, 2 = wet). We use the Chapman–Kolmogorov equation to answer two questions. First, suppose it is Wednesday today, and it is dry. We want to know what the probability is that Saturday will be dry. In notation, we want to find $p_{11}(3) = P(X_3 = 1 \mid X_0 = 1)$. In order to get a concrete numerical answer at the end, let us take $\alpha = \beta = .8$. Now, by direct matrix multiplication,
$$P^3 = \begin{pmatrix} .608 & .392\\ .392 & .608 \end{pmatrix}.$$
Therefore, the probability that Saturday will be dry if Wednesday is dry is $p_{11}(3) = .608$.

Next suppose that we want to know what the probability is that Saturday and Sunday will both be dry if Wednesday is dry. In notation, we now want to find
$$P(X_3 = 1, X_4 = 1 \mid X_0 = 1) = P(X_3 = 1 \mid X_0 = 1)P(X_4 = 1 \mid X_3 = 1, X_0 = 1)$$
$$= P(X_3 = 1 \mid X_0 = 1)P(X_4 = 1 \mid X_3 = 1) = .608 \times .8 = .4864.$$

Coming now to evaluating the pmf of $X_n$ itself, we know how to calculate it, namely, $P(X_n = i) = \sum_{k \in S} p_{ki}(n)P(X_0 = k)$. Denote $P(\text{the initial day was dry}) = \pi_1$, $P(\text{the initial day was wet}) = \pi_2$, $\pi_1 + \pi_2 = 1$. Let us evaluate the probabilities that it will be dry one week, two weeks, three weeks, four weeks from the initial day. This requires calculation of, respectively, $P^7, P^{14}, P^{21}, P^{28}$. For example,
$$P^7 = \begin{pmatrix} .513997 & .486003\\ .486003 & .513997 \end{pmatrix}.$$
Therefore,
$$P(\text{it will be dry one week from the initial day}) = .513997\,\pi_1 + .486003\,\pi_2 = .5 + .013997(\pi_1 - \pi_2).$$
Similarly, we can compute $P^{14}$ and show that
$$P(\text{it will be dry two weeks from the initial day}) = .500392\,\pi_1 + .499608\,\pi_2 = .5 + .000392(\pi_1 - \pi_2);$$
$$P(\text{it will be dry three weeks from the initial day}) = .5 + .000011(\pi_1 - \pi_2);$$
$$P(\text{it will be dry four weeks from the initial day}) = .5.$$

We see that convergence to .5 has occurred, regardless of $\pi_1, \pi_2$. That is, regardless of the initial distribution, eventually you will put 50–50 probability that a day far into the future will be dry or wet. Is this always the case? The answer is no. In this case, convergence to .5 occurred because the one-step transition matrix $P$ has the doubly stochastic characteristic: each row as well as each column of $P$ adds to one. We discuss more about this later.

Example 10.10 (Voting Preferences). Consider the example previously given on voting preferences. Suppose we want to know what the probabilities are that a Labor voter in this election will vote, respectively, Labor, Conservative, or Independent two elections from now. Denoting the states as 1, 2, 3, in notation, we want to find $P(X_2 = i \mid X_0 = 1), \ i = 1, 2, 3$. We can answer this by simply computing $P^2$. Because
$$P = \begin{pmatrix} .80 & .05 & .15\\ .03 & .90 & .07\\ .10 & .10 & .80 \end{pmatrix},$$
by direct computation,
$$P^2 = \begin{pmatrix} .66 & .10 & .24\\ .06 & .82 & .12\\ .16 & .18 & .66 \end{pmatrix}.$$
Hence, the probabilities that a Labor voter in this election will vote Labor, Conservative, or Independent two elections from now are 66%, 10%, and 24%. We also see from $P^2$ that a Conservative voter will vote Conservative two elections from now with 82% probability and has a chance of just 6% to switch to Labor, and so on.
Example 10.11 (Hopping Mosquito). Consider again the hopping mosquito example previously introduced. The goal of this example is to find the $n$-step transition probability matrix $P^n$ for a general $n$. We describe a general method for finding $P^n$ using a linear algebra technique, known as diagonalization of a matrix. If a square matrix $P$ (not necessarily symmetric) of order $t \times t$ has $t$ distinct eigenvalues, say $\delta_1, \ldots, \delta_t$, which are complex numbers in general, and if $u_1, \ldots, u_t$ are a set of $t$ eigenvectors of $P$ corresponding to the eigenvalues $\delta_1, \ldots, \delta_t$, then define a matrix $U$ as $U = (u_1, \ldots, u_t)$; that is, $U$ has $u_1, \ldots, u_t$ as its $t$ columns. Then, $U$ has the property that $U^{-1}PU = L$, where $L$ is the diagonal matrix with the diagonal elements $\delta_1, \ldots, \delta_t$. Now, just note that
$$U^{-1}PU = L \Rightarrow P = ULU^{-1} \Rightarrow P^n = UL^nU^{-1}, \quad \forall n \ge 2.$$
Therefore, we only need to compute the eigenvalues of $P$, and the matrix $U$ of a set of eigenvectors. As long as the eigenvalues are distinct, the $n$-step transition matrix will be provided by the unified formula $P^n = UL^nU^{-1}$.

The eigenvalues of our $P$ are
$$\delta_1 = -\frac{i}{2}, \quad \delta_2 = \frac{i}{2}, \quad \delta_3 = 1;$$
note that they are distinct. The eigenvectors (one set) turn out to be:
$$u_1 = \left(-1-i, \ \frac{i-1}{2}, \ 1\right)', \quad u_2 = \left(i-1, \ -\frac{i+1}{2}, \ 1\right)', \quad u_3 = (1, 1, 1)'.$$
Therefore,
$$U = \begin{pmatrix} -1-i & i-1 & 1\\[0.5ex] \frac{i-1}{2} & -\frac{i+1}{2} & 1\\[0.5ex] 1 & 1 & 1 \end{pmatrix}, \qquad
U^{-1} = \begin{pmatrix} \frac{3i-1}{10} & -\frac{2i+1}{5} & \frac{i+3}{10}\\[0.5ex] -\frac{3i+1}{10} & \frac{2i-1}{5} & \frac{3-i}{10}\\[0.5ex] \frac15 & \frac25 & \frac25 \end{pmatrix}.$$
This leads to
$$P^n = U\begin{pmatrix} \left(-\frac{i}{2}\right)^n & 0 & 0\\ 0 & \left(\frac{i}{2}\right)^n & 0\\ 0 & 0 & 1 \end{pmatrix}U^{-1},$$
with $U, U^{-1}$ as above. For example,
$$p_{11}(n) = (-1-i)\,\frac{3i-1}{10}\left(-\frac{i}{2}\right)^n + (i-1)\left(-\frac{3i+1}{10}\right)\left(\frac{i}{2}\right)^n + \frac15$$
$$= \frac15 + \frac{2-i}{5}\left(-\frac{i}{2}\right)^n + \frac{2+i}{5}\left(\frac{i}{2}\right)^n;$$
this is the probability that the mosquito will be back on the forehead after $n$ moves if it started at the forehead. If we take $n = 2$, we get, on doing the complex multiplication, $p_{11}(2) = 0$. We can logically verify that $p_{11}(2)$ must be zero by just looking at the one-step transition matrix $P$. However, if we take $n = 3$, then the formula will give $p_{11}(3) = \frac14 > 0$. Indeed, if we take $n = 3$, we get
$$P^3 = \begin{pmatrix} \frac14 & \frac14 & \frac12\\[0.5ex] \frac14 & \frac38 & \frac38\\[0.5ex] \frac18 & \frac12 & \frac38 \end{pmatrix}.$$
We notice that every element in $P^3$ is strictly positive. That is, no matter where the mosquito was initially seated, by the time it has made three moves, we cannot rule out any location for where it will be: it can now be anywhere. In fact, this property of a transition probability matrix is so important in Markov chain theory that it has a name. It is the first definition in our next section.
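The diagonalization recipe of this example is also how one would compute $P^n$ numerically. The following Python sketch (illustrative, not from the text) uses NumPy's eigendecomposition; NumPy normalizes the eigenvectors differently from the set displayed above, but $P^n = UL^nU^{-1}$ is unaffected by that choice.

    # Illustrative numerical diagonalization: P^n = U L^n U^{-1}.
    import numpy as np

    P = np.array([[0, 1, 0],
                  [0, .5, .5],
                  [.5, 0, .5]])

    eigvals, U = np.linalg.eig(P)            # eigenvalues 1, i/2, -i/2 (complex)
    Uinv = np.linalg.inv(U)

    def P_power(n):
        # the tiny imaginary parts left over are rounding noise
        return (U @ np.diag(eigvals ** n) @ Uinv).real

    print(np.round(P_power(2)[0, 0], 12))    # p_11(2) = 0
    print(np.round(P_power(3), 4))           # every entry of P^3 is positive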

10.4 Communicating Classes

Definition 10.7. Let $\{X_n\}$ be a stationary Markov chain with transition probability matrix $P$. It is called a regular chain if there exists a universal $n_0 > 0$ such that $p_{ij}(n_0) > 0 \ \forall i, j \in S$.

So, what we just saw in the last example is that the mosquito is engaged in movements according to a regular Markov chain. A weaker property is that of irreducibility.

Definition 10.8. Let $\{X_n\}$ be a stationary Markov chain with transition probability matrix $P$. It is called an irreducible chain if for any $i, j \in S, \ i \ne j$, there exists $n_0 > 0$, possibly depending on $i, j$, such that $p_{ij}(n_0) > 0$.

Irreducibility means that it is possible to travel from any state to any other state, however many steps it might take, depending on which two states are involved. Another terminology also commonly used is that of communicating.

Definition 10.9. Let $\{X_n\}$ be a stationary Markov chain with transition probability matrix $P$. Let $i, j$ be two specific states. We say that $i$ communicates with $j$ ($i \leftrightarrow j$)
if there exists $n_0 > 0$, possibly depending on $i, j$, such that $p_{ij}(n_0) > 0$, and also, there exists $n_1 > 0$, possibly depending on $i, j$, such that $p_{ji}(n_1) > 0$.

In words, a pair of specific states $i, j$ are communicating states if it is possible to travel back and forth between $i, j$, however many steps it might take, depending on $i, j$, and possibly even depending on the direction of the journey, that is, whether the direction is from $i$ to $j$, or from $j$ to $i$. By convention, we say that $i \leftrightarrow i$. Thus, $\leftrightarrow$ defines an equivalence relation on the state space $S$:
$$i \leftrightarrow i; \quad i \leftrightarrow j \Rightarrow j \leftrightarrow i; \quad i \leftrightarrow j, \ j \leftrightarrow k \Rightarrow i \leftrightarrow k.$$
Therefore, like all equivalence relations, $\leftrightarrow$ partitions the state space $S$ into mutually exclusive subsets of $S$, say $C_1, C_2, \ldots$. These partitioning subsets $C_1, C_2, \ldots$ are called the communicating classes of the chain. Here is an example to help illustrate the notion.
Example 10.12 (Identifying Communicating Classes). Consider the one-step transition matrix
$$P = \begin{pmatrix}
.75 & .25 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0\\
.25 & 0 & 0 & .25 & .5 & 0\\
0 & 0 & 0 & .75 & .25 & 0\\
0 & 0 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}.$$
Inspecting the transition matrix, we see that $1 \leftrightarrow 2$, because it is possible to go from 1 to 2 in just one step, and conversely, starting at 2, one will always go to 3, and it is possible to then go from 3 to 1. Likewise, $2 \leftrightarrow 3$, because if we are at 2, we will always go to 3, and conversely, if we are at 3, then we can first go to 1, and then from 1 to 2. Next, state 5 and state 6 communicate, but they clearly do not communicate with any other state, because once at 5, we can only go to 6, and once at 6, we can only go to 5. Finally, if we are at 4, then we can go to 5, and from 5 to 6, but 6 does not communicate with 4. So, the communicating classes in this example are
$$C_1 = \{1, 2, 3\}, \quad C_2 = \{4\}, \quad C_3 = \{5, 6\}.$$
Note that they are disjoint, and that $C_1 \cup C_2 \cup C_3 = \{1, 2, 3, 4, 5, 6\} = S$. As a further interesting observation, if we are in $C_3$, then we cannot make even one-way trips to any state outside of $C_3$. Such a communicating class is called closed. In this example, $C_3$ is the only closed communicating class. For example, $C_1 = \{1, 2, 3\}$ is not a closed class because it is possible to make one-way trips from 1 to 4, 5. The reader can verify trivially that $C_2 = \{4\}$ is also not a closed class.
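Identifying communicating classes by inspection becomes error-prone for larger chains; mechanically, they are the strongly connected components of the directed graph with an edge $i \to j$ whenever $p_{ij} > 0$. The following Python sketch (illustrative, not from the text) recovers the three classes above.

    # Illustrative computation of communicating classes via reachability.
    import numpy as np

    P = np.array([[.75, .25, 0,   0,   0,   0],
                  [0,   0,   1,   0,   0,   0],
                  [.25, 0,   0,   .25, .5,  0],
                  [0,   0,   0,   .75, .25, 0],
                  [0,   0,   0,   0,   0,   1],
                  [0,   0,   0,   0,   1,   0]])

    A = (P > 0).astype(int)              # adjacency matrix of the transition graph
    n = len(P)
    reach = np.eye(n, dtype=int)         # reach[i, j] = 1 iff j is reachable from i
    for _ in range(n):
        reach = ((reach + reach @ A) > 0).astype(int)
    comm = (reach > 0) & (reach.T > 0)   # i and j communicate iff each reaches the other
    classes = {tuple(np.flatnonzero(comm[i]) + 1) for i in range(n)}
    print(sorted(classes))               # [(1, 2, 3), (4,), (5, 6)]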
We can observe more interesting things about the chain from the transition matrix. Consider, for example, state 5. If you are in state 5, then your transitions would have to be $5 \to 6 \to 5 \to 6 \to \cdots$. So, starting at 5, you can return to 5 only at times $n = 2k, \ k \ge 1$. In such a case, we call the state periodic with period equal to two. Likewise, state 6 is also periodic with period equal to two. An exercise asks to show that all states within the same communicating class always have the same period. It is useful to have a formal definition, because there is an element of subtlety about the exact meaning of the period of a state.

Definition 10.10. A state $i \in S$ is said to have the period $d \ (> 1)$ if the greatest common divisor of all positive integers $n$ for which $p_{ii}(n) > 0$ is the given number $d$. If a state $i$ has no period $d > 1$, it is called an aperiodic state. If every state of a chain is aperiodic, the chain itself is called an aperiodic chain.
Example 10.13 (Computing the Period). Consider the hopping mosquito example again. First, let us look at state 2. Evidently, we can go to 2 from 2 in any number of steps, since the mosquito can simply stay on the left cheek: $p_{22}(n) > 0 \ \forall n \ge 1$. So the set of integers $n$ for which $p_{22}(n) > 0$ is $\{1, 2, 3, 4, \ldots\}$, and the gcd (greatest common divisor) of these integers is one. So 2 is an aperiodic state. Because $\{1, 2, 3\}$ is a communicating class, we then must have that 1 is also an aperiodic state. Let us see it. Note that in fact we cannot go to 1 from 1 in one step. But we can go from 1 to 2, then from 2 to 3, and then from 3 to 1. That takes three steps. But we can also go from 1 to 1 in $n$ steps for any $n > 3$, because once we go from 1 to 2, we can stay there with a positive probability for any number of times, and then go from 2 to 3, and from 3 to 1. So the set of integers $n$ for which $p_{11}(n) > 0$ is $\{3, 4, 5, 6, \ldots\}$, and we now see that 1 is an aperiodic state. Similarly, one can verify that 3 is also an aperiodic state.
Remark. It is important to note the subtle point that just because a state $i$ has period $d$, it does not mean that $p_{ii}(d) > 0$. Suppose, for example, that we can travel from $i$ back to $i$ in steps $6, 9, 12, 15, 18, \ldots$, which have gcd equal to 3, and yet $p_{ii}(3)$ is not greater than zero.

A final definition for now is that of an absorbing state. Absorption means that once you have gotten there, you will remain there forever. The formal definition is as follows.

Definition 10.11. A state $i \in S$ is called an absorbing state if $p_{ij}(n) = 0$ for all $n$ and for all $j \ne i$. Equivalently, $i \in S$ is an absorbing state if $p_{ii} = 1$; that is, the singleton set $\{i\}$ is a closed class.

Remark. Plainly, if a chain has an absorbing state, then it cannot be regular, and cannot even be irreducible. Absorption is fundamentally interesting in gambling scenarios. A gambler may decide to quit the game as soon as his net fortune becomes zero. If we let $X_n$ denote the gambler's net fortune after the $n$th play, then zero will be an absorbing state for the chain $\{X_n\}$. For chains that have absorbing states, the time taken to get absorbed is considered to be of basic interest.
10.5 Gambler's Ruin

The problem of the gambler's ruin is a classic and entertaining example in the theory of probability. It is an example of a Markov chain with absorbing states. Answers to numerous interesting questions about the problem of the gambler's ruin have been worked out; this is all very classic. We provide an introductory exposition to this interesting problem.

Imagine a gambler who goes to a casino with \$a in his pocket. He will play a game that pays him one dollar if he wins the game, or has him pay one dollar to the house if he loses the game. He will play repeatedly until he either goes broke, or his total fortune increases from the initial amount $a$ he entered with to a prespecified larger amount $b \ (b > a)$. The idea is that he is forced to quit if he goes broke, and he leaves of his own choice if he wins handsomely and is happy to quit. We can ask numerous interesting questions. But let us just ask what is the probability that he will leave because he goes broke.

This is really a simple random walk problem again. Let the gambler's initial fortune be $S_0 = a$. Then, the gambler's net fortune after $n$ plays is $S_n = S_0 + X_1 + X_2 + \cdots + X_n$, where the $X_i$ are iid with the distribution $P(X_i = 1) = p, \ P(X_i = -1) = q = 1 - p$. We make the realistic assumption that $p < q \Leftrightarrow p < \frac12$; that is, the game is favorable to the house and unfavorable to the player. Let $p_a$ denote the probability that the player will leave broke if he started with \$a as his initial fortune. In the following argument, we hold $b$ fixed, and consider $p_a$ as a function of $a$, with $a$ varying between 0 and the fixed $b$; $0 \le a \le b$. Note that $p_0 = 1$ and $p_b = 0$. Then, $p_a$ satisfies the recurrence relation
$$p_a = p\,p_{a+1} + (1-p)\,p_{a-1}, \quad 1 \le a < b.$$
The argument is that if the player wins the very first time, which would happen with probability $p$, then he can eventually go broke with probability $p_{a+1}$, because the first win increases his fortune by one dollar from $a$ to $a+1$; but, if the player loses the very first time, which would happen with probability $1-p$, then he can eventually go broke with probability $p_{a-1}$, because the first loss will decrease his fortune by one dollar from $a$ to $a-1$.

Rewrite the above equation in the form
$$p_{a+1} - p_a = \frac{1-p}{p}\,(p_a - p_{a-1}).$$
If we iterate this identity, we get
$$p_{a+1} - p_a = \left(\frac{1-p}{p}\right)^a (p_1 - 1);$$
here, we have used the fact that $p_0 = 1$.
Now use this to find an expression for $p_{a+1}$ as follows:
$$p_{a+1} - 1 = [p_{a+1} - p_a] + [p_a - p_{a-1}] + \cdots + [p_1 - p_0]$$
$$= (p_1 - 1)\left[\left(\frac{1-p}{p}\right)^a + \left(\frac{1-p}{p}\right)^{a-1} + \cdots + 1\right]$$
$$= (p_1 - 1)\,\frac{\left(\frac{1-p}{p}\right)^{a+1} - 1}{\frac{1-p}{p} - 1}$$
$$\Rightarrow \ p_{a+1} = 1 + (p_1 - 1)\,\frac{\left(\frac{1-p}{p}\right)^{a+1} - 1}{\frac{1-p}{p} - 1}.$$
However, we can find $p_1$ explicitly by using the last equation with the choice $a = b - 1$, which gives
$$0 = p_b = 1 + (p_1 - 1)\,\frac{\left(\frac{1-p}{p}\right)^{b} - 1}{\frac{1-p}{p} - 1}.$$
Substituting the expression we get for $p_1$ from here into the formula for $p_{a+1}$, we have
$$p_{a+1} = \frac{(q/p)^b - (q/p)^{a+1}}{(q/p)^b - 1}.$$
This last formula actually gives an expression for $p_x$ for a general $x \le b$; we can use it with $x = a$ in order to write the final formula
$$p_a = \frac{(q/p)^b - (q/p)^a}{(q/p)^b - 1}.$$
Note that this formula does give $p_0 = 1, \ p_b = 0$, and that $\lim_{b\to\infty} p_a = 1$, on using the important fact that $\frac{q}{p} > 1$. The practical meaning of $\lim_{b\to\infty} p_a = 1$ is that if the gambler is targeting too high, then actually he will certainly go broke before he reaches that high target.
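The ruin formula is easily checked against simulation; here is an illustrative Python sketch (not from the text; the values $a = 10, b = 20, p = .47$ are arbitrary choices).

    # Illustrative check: the ruin probability formula versus direct simulation.
    import numpy as np

    def ruin_prob(a, b, p):
        # p_a = ((q/p)^b - (q/p)^a) / ((q/p)^b - 1), with q = 1 - p
        r = (1 - p) / p
        return (r ** b - r ** a) / (r ** b - 1)

    def simulate_ruin(a, b, p, reps=20_000, seed=1):
        rng = np.random.default_rng(seed)
        broke = 0
        for _ in range(reps):
            s = a
            while 0 < s < b:
                s += 1 if rng.random() < p else -1
            broke += (s == 0)
        return broke / reps

    a, b, p = 10, 20, 0.47
    print(ruin_prob(a, b, p), simulate_ruin(a, b, p))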
To summarize, this is an example of a stationary Markov chain with two distinct
absorbing states, and we have worked out here the probability that the chain reaches
one absorbing state (the gambler going broke) before it reaches the other absorbing
state (the gambler leaving as a winner on his terms).
10.6 First Passage, Recurrence, and Transience

Recurrence, transience, and first passage times are fundamental to understanding the long run behavior of a Markov chain. Recurrence is also linked to the stationary distribution of a chain, one of the most important things to study in analyzing and using a Markov chain.

Definition 10.12. Let $\{X_n\}, n \ge 0$ be a stationary Markov chain. Let $D$ be a given subset of the state space $S$. Suppose the initial state of the chain is state $i$. The first passage time to the set $D$, denoted as $T_{iD}$, is defined to be the first time that the chain enters the set $D$; formally,
$$T_{iD} = \inf\{n > 0 : X_n \in D\},$$
with $T_{iD} = \infty$ if $X_n \in D^c$, the complement of $D$, for every $n > 0$. If $D$ is a singleton set $\{j\}$, then we denote the first passage time to $j$ as just $T_{ij}$. If $j = i$, then the first passage time $T_{ii}$ is just the first time the chain returns to its initial state $i$. We use the simpler notation $T_i$ to denote $T_{ii}$.
Example 10.14 (Simple Random Walk). Let $X_i, i \ge 1$ be iid random variables, with $P(X_i = \pm 1) = \frac12$, and let $S_n = X_0 + \sum_{i=1}^n X_i, \ n \ge 0$, with the understanding that $X_0 = 0$. Then $\{S_n\}, n \ge 0$ is a stationary Markov chain with initial state zero, and state space $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$.

A graph of the first 50 steps of a simulated random walk is given in Fig. 10.1. By carefully reading the plot, we see that the first passage to zero, the initial state, occurs at $T_0 = 4$. We can also see from the graph that the walk returns to zero a total of nine times within these first 50 steps. The first passage to $j = -5$ occurs at $T_{0,-5} = 9$. The first passage to the set $D = \{\ldots, -9, -6, -3, 3, 6, 9, \ldots\}$ occurs at $T_{0D} = 7$.

Fig. 10.1 First 50 steps of a simple symmetric random walk

The walk goes down to a minimum of $-6$ at the tenth step. So, we can say that $T_{0,-7} > 50$; in fact, we can make a stronger statement about $T_{0,-7}$ by looking at where the walk is at time $n = 50$. The reader is asked to find the best statement we can make about $T_{0,-7}$ based on the graph.

Example 10.15 (Infinite Expected First Passage Times). Consider the three-state Markov chain with state space $S = \{1, 2, 3\}$ and transition probability matrix
$$P = \begin{pmatrix} x & y & z\\ p & q & 0\\ 0 & 0 & 1 \end{pmatrix},$$
where $x + y + z = p + q = 1$.

First consider the recurrence time $T_1$. Note that for the chain to return at all to state 1, having started at 1, it cannot ever land in state 3, because 3 is an absorbing state. So, if $T_1 = t$, then the chain spends $t - 1$ time instants in state 2, and then returns to 1. In other words, $P(T_1 = 1) = x$, and for $t > 1$, $P(T_1 = t) = y\,q^{t-2}\,p$. From here, we can compute $P(T_1 < \infty)$. Indeed,
$$P(T_1 < \infty) = x + \frac{py}{q^2}\sum_{t=2}^\infty q^t = x + \frac{py}{q^2}\cdot\frac{q^2}{p} = x + y = 1 - z.$$
Therefore, $P(T_1 = \infty) = z$, and if $z > 0$, then obviously $E(T_1) = \infty$, because $T_1$ itself can be $\infty$ with a positive probability. If $z = 0$, then
$$E(T_1) = x + \frac{py}{q^2}\sum_{t=2}^\infty t\,q^t = x + \frac{py}{q^2}\cdot\frac{2q^2 - q^3}{p^2} = \frac{1 - p^2 - x(1-p)}{p(1-p)}.$$
We now define the properties of recurrence and transience of a state. At first glance, it would appear that there could be something in between recurrence and transience; but, in fact, a state is either recurrent or transient. The mathematical meanings of recurrence and transience really correspond to their dictionary meanings. A recurrent state is one that you keep coming back to over and over again with certainty; a transient state is one that you will ultimately leave behind forever with certainty. Below, we are going to use the simpler notation $P_i(A)$ to denote the conditional probability $P(A \mid X_0 = i)$, where $A$ is a generic event. Here are the formal definitions of recurrence and transience.

Definition 10.13. A state $i \in S$ is called recurrent if $P_i(X_n = i \text{ for infinitely many } n \ge 1) = 1$. The state $i \in S$ is called transient if $P_i(X_n = i \text{ for infinitely many } n \ge 1) = 0$.
Remark. Note that if a stationary chain returns to its original state $i$ (at least) once with probability one, it will then also return infinitely often with probability one. So, we could also think of recurrence and transience of a state in terms of the following questions.
(a) Is $P_i(X_n = i \text{ for some } n \ge 1) = 1$?
(b) Is $P_i(X_n = i \text{ for some } n \ge 1) < 1$?
Here is another way to think about it. Consider our previously defined recurrence time $T_i$ (still with the understanding that the initial state is $i$). We can think of recurrence in terms of whether $P_i(T_i < \infty) = 1$.

Needless to say, just because $P_i(T_i < \infty) = 1$, it does not follow that its expectation $E_i(T_i) < \infty$. It is a key question in Markov chain theory whether $E_i(T_i) < \infty$ for every state $i$. Not only is it of practical value to compute $E_i(T_i)$, the finiteness of $E_i(T_i)$ for every state $i$ crucially affects the long run behavior of the chain. If we want to predict where the chain will be after it has run for a long time, our answers will depend on these expected values $E_i(T_i)$, provided they are all finite. The relationship of $E_i(T_i)$ to the limiting value of $P(X_n = i)$ is made clear in the next section. Because of the importance of the issue of finiteness of $E_i(T_i)$, the following are important definitions.

Definition 10.14. A state $i$ is called null recurrent if $P_i(T_i < \infty) = 1$, but $E_i(T_i) = \infty$. The state $i$ is called positive recurrent if $E_i(T_i) < \infty$. The Markov chain $\{X_n\}$ is called positive recurrent if every state $i$ is positive recurrent.
Recurrence and transience can be discussed at various levels of sophistication, and the treatment and ramifications can be confusing. So a preview is going to be useful.

Preview.
(a) You can verify recurrence or transience of a given state $i$ by verifying whether $\sum_{n=0}^\infty p_{ii}(n) = \infty$ or $< \infty$.
(b) You can also try to directly verify whether $P_i(T_i < \infty) = 1$ or $< 1$.
(c) Chains with a finite state space are more easily handled as regards settling recurrence or transience issues. For finite chains, there must be at least one recurrent state; that is, not all states can be transient, if the chain has a finite state space.
(d) Recurrence is a class property; that is, states within the same communicating class have the same recurrence status. If one of them is recurrent, so are all the others.
(e) In identifying exactly which communicating classes have the recurrence property, you can identify which of the communicating classes are closed.
(f) Even if a state $i$ is recurrent, $E_i(T_i)$ can be infinite: the state $i$ can be null recurrent. However, if the state space is finite, and if the chain is regular, then you do not have to worry about it. As a matter of fact, for any set $D$, $T_{iD}$ will be finite with probability one, and even $E_i(T_{iD})$ will be finite. So, for a finite regular chain, you have a very simple recurrence story; every state is not just recurrent, but even positive recurrent.
(g) For chains with an infinite state space, it is possible that every state is transient, and it is also possible that every state is recurrent, or it could also be something in between. Whether the chain is irreducible is going to be a key factor in sorting out the exact recurrence structure.
Some of the major results on recurrence and transience are now given.

Theorem 10.2. Let $\{X_n\}$ be a stationary Markov chain. If $\sum_{n=0}^\infty p_{ii}(n) = \infty$, then $i$ is a recurrent state, and if $\sum_{n=0}^\infty p_{ii}(n) < \infty$, then $i$ is a transient state.

Proof. Introduce the variable $V_i = \sum_{n=0}^\infty I_{\{X_n = i\}}$; thus, $V_i$ is the total number of visits of the chain to state $i$. Let also $p_i = P_i(T_i < \infty)$. By using the Markov property of $\{X_n\}$, it follows that $P_i(V_i > m) = p_i^m$ for any $m \ge 0$. Suppose now $p_i < 1$. Then, by the tailsum formula for expectations,
$$E_i(V_i) = \sum_{m=0}^\infty P_i(V_i > m) = \sum_{m=0}^\infty p_i^m = \frac{1}{1 - p_i} < \infty.$$
But also,
$$E_i(V_i) = E_i\!\left[\sum_{n=0}^\infty I_{\{X_n = i\}}\right] = \sum_{n=0}^\infty E[I_{\{X_n = i\}}] = \sum_{n=0}^\infty P_i(X_n = i) = \sum_{n=0}^\infty p_{ii}(n).$$
So, if $p_i < 1$, then we must have $\sum_{n=0}^\infty p_{ii}(n) < \infty$, which is the same as saying that if $\sum_{n=0}^\infty p_{ii}(n) = \infty$, then $p_i$ must be equal to 1, and so $i$ must be a recurrent state. Suppose on the other hand that $p_i = 1$. Then, for any $m$, $P_i(V_i > m) = 1$, and so, with probability one, $V_i = \infty$. So, $E_i(V_i) = \infty$, which implies that $\sum_{n=0}^\infty p_{ii}(n) = E_i(V_i) = \infty$. So, if $p_i = 1$, then $\sum_{n=0}^\infty p_{ii}(n)$ must be $\infty$, which is the same as saying that if $\sum_{n=0}^\infty p_{ii}(n) < \infty$, then $p_i < 1$, which would mean that $i$ is a transient state. $\square$
The next theorem formalizes the intuition that if you keep coming back to some
state over and over again, and that state communicates with some other state, then
you will be visiting that state over and over again as well. That is, recurrence is a
class property, and that implies that transience is also a class property.

Theorem 10.3. Let $C$ be any communicating class of states of a stationary Markov chain $\{X_n\}$. Then, either all states in $C$ are recurrent, or all states in $C$ are transient.

Proof. The theorem is proved if we can show that if $i, j$ both belong to a common communicating class, and $i$ is transient, then $j$ must also be transient. If we can prove this, it follows that if $j$ is recurrent, then $i$ must also be recurrent, for if it were not, it would be transient, and so that would make $j$ transient, which would be a contradiction.

So, suppose $i \in C$, and assume that $i$ is transient. By virtue of the transience of $i$, we know that $\sum_{r=0}^{\infty} p_{ii}(r) < \infty$, and so $\sum_{r=R}^{\infty} p_{ii}(r) < \infty$ for any fixed $R$. This is useful to us in the proof.

Now consider another state $j \in C$. Because $C$ is a communicating class, there exist $k, n$ such that $p_{ij}(k) > 0$, $p_{ji}(n) > 0$. Take such $k, n$ and hold them fixed. Now observe that for any $m$, we have the inequality
$$p_{ii}(k + m + n) \geq p_{ij}(k)\, p_{jj}(m)\, p_{ji}(n)$$
$$\Rightarrow p_{jj}(m) \leq \frac{1}{p_{ij}(k)\, p_{ji}(n)}\, p_{ii}(k + m + n)$$
$$\Rightarrow \sum_{m=0}^{\infty} p_{jj}(m) \leq \frac{1}{p_{ij}(k)\, p_{ji}(n)} \sum_{m=0}^{\infty} p_{ii}(k + m + n) < \infty,$$
because $p_{ij}(k), p_{ji}(n)$ are two fixed positive numbers, and $\sum_{m=0}^{\infty} p_{ii}(k + m + n) = \sum_{r=k+n}^{\infty} p_{ii}(r) < \infty$. But, if $\sum_{m=0}^{\infty} p_{jj}(m) < \infty$, then we already know that $j$ must be transient, which is what we want to prove. ∎

If a particular communicating class $C$ consists of (only) recurrent states, we call $C$ a recurrent class. The following are two important consequences of the above theorem.

Theorem 10.4. (a) Let $\{X_n\}$ be a stationary irreducible Markov chain with a finite state space. Then every state of $\{X_n\}$ must be recurrent.
(b) For any stationary Markov chain with a finite state space, a communicating class is recurrent if and only if it is closed.

Example 10.16 (Various Illustrations). We revisit some of the chains in our previous examples and examine their recurrence structure.

In the weather pattern example,
$$P = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix}.$$
If $0 < \alpha < 1$ and also $0 < \beta < 1$, then clearly the chain is irreducible, and it obviously has a finite state space. And so, each of the two states is recurrent. If $\alpha = \beta = 1$, then each state is an absorbing state, and clearly, $\sum_{n=0}^{\infty} p_{ii}(n) = \infty$ for both $i = 1, 2$. So, each state is recurrent. If $\alpha = \beta = 0$, then the chain evolves either as $121212\ldots$, or $212121\ldots$. Each state is periodic and recurrent.

In the hopping mosquito example,
$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & .5 & .5 \\ .5 & 0 & .5 \end{pmatrix}.$$
In this case, some elements of $P$ are zero. However, we have previously seen that every element in $P^3$ is strictly positive. Hence, the chain is again irreducible. Once again, each of the three states is recurrent.

Next consider the chain with the transition matrix
$$P = \begin{pmatrix} .75 & .25 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ .25 & 0 & 0 & .25 & .5 & 0 \\ 0 & 0 & 0 & .75 & .25 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}.$$
We have previously proved that the communicating classes of this chain are $\{1, 2, 3\}, \{4\}, \{5, 6\}$, of which $\{5, 6\}$ is the only closed class. Therefore, 5 and 6 are the only recurrent states of this chain.

10.7 Long Run Evolution and Stationary Distributions

A natural human instinct is to want to predict the future. It is not surprising that we often want to know exactly where a Markov chain will be after it has evolved for a fairly long time. Of course, we cannot say with certainty where it will be. But perhaps we can make probabilistic statements. In notation, suppose a stationary Markov chain $\{X_n\}$ started at some initial state $i \in S$. A natural question is what can we say about $P(X_n = j \mid X_0 = i)$ for arbitrary $j \in S$, if $n$ is large. Again, a short preview might be useful.

Preview. For chains with a finite state space, the answers are concrete, extremely structured, and furthermore, convergence occurs rapidly. That is, under some reasonable conditions on the chain, regardless of what the initial state $i$ is, $P(X_n = j \mid X_0 = i)$ has a limit $\pi_j$, and $P(X_n = j \mid X_0 = i) \approx \pi_j$ for quite moderate values of $n$. In addition, the marginal probabilities $P(X_n = j)$ are also well approximated by the same $\pi_j$, and there is an explicit formula for determining the limiting probability $\pi_j$ for each $j \in S$. Somewhat different versions of these results are often presented in different texts, under different sets of conditions on the chain.

Our version balances the ease of understanding the results with the applicability of
the conditions assumed. But, first let us see two illustrative examples.
Example 10.17. Consider first the weather pattern example, and for concreteness, take the one-step transition probability matrix to be
$$P = \begin{pmatrix} .8 & .2 \\ .2 & .8 \end{pmatrix}.$$
Then, by direct computation,
$$P^{10} = \begin{pmatrix} .50302 & .49698 \\ .49698 & .50302 \end{pmatrix}; \quad P^{15} = \begin{pmatrix} .50024 & .49976 \\ .49976 & .50024 \end{pmatrix};$$
$$P^{20} = \begin{pmatrix} .50018 & .49982 \\ .49982 & .50018 \end{pmatrix}; \quad P^{25} = \begin{pmatrix} .50000 & .50000 \\ .50000 & .50000 \end{pmatrix}.$$
We notice that $P^n$ appears to converge to a limiting matrix with each row of the limiting matrix being the same, namely, $(.5, .5)$. That is, regardless of the initial state $i$, $P(X_n = j \mid X_0 = i)$ appears to converge to $\pi_j = .5$. Thus, if indeed $\alpha = \beta = .8$ in the weather pattern example, then in the long run the chances of a dry or a wet day would be just 50-50, and the effect of the weather on the initial day is going to wash out.

On the other hand, consider a chain with the one-step transition matrix
$$P = \begin{pmatrix} x & y & z \\ p & q & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Notice that this chain has an absorbing state; once you are in state 3, you can never leave from there. To be concrete, take $x = .25, y = .75, p = q = .5$ (so that $z = 0$). Then, by direct computing,
$$P^{10} = \begin{pmatrix} .400001 & .599999 & 0 \\ .4 & .6 & 0 \\ 0 & 0 & 1 \end{pmatrix}; \quad P^{20} = \begin{pmatrix} .4 & .6 & 0 \\ .4 & .6 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
This time it appears that $P^n$ converges to a limiting matrix whose first two rows are the same, but the third row is different. Specifically, the first two rows of $P^n$ seem to be converging to $(.4, .6, 0)$ and the third row is $(0, 0, 1)$, the same as the third row in $P$ itself. Thus, the limiting behavior of $P(X_n = j \mid X_0 = i)$ seems to depend on the initial state $i$.
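These powers are easy to reproduce numerically. The following is a minimal sketch, assuming Python with NumPy is available (any matrix language would do equally well):

import numpy as np

# Transition matrices from Example 10.17.
P1 = np.array([[0.8, 0.2],
               [0.2, 0.8]])
P2 = np.array([[0.25, 0.75, 0.0],   # x = .25, y = .75, z = 0
               [0.50, 0.50, 0.0],   # p = q = .5
               [0.00, 0.00, 1.0]])  # state 3 is absorbing

for n in (10, 15, 20, 25):
    print(n, np.linalg.matrix_power(P1, n).round(5))  # rows approach (.5, .5)

print(np.linalg.matrix_power(P2, 20).round(6))  # first two rows approach (.4, .6, 0)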
The difference between the two chains in this example is that the first chain is regular, whereas the second chain has an absorbing state and cannot be regular. Indeed, regularity of the chain is going to have a decisive effect on the limiting behavior of $P(X_n = j \mid X_0 = i)$. An important theorem is the following.

Theorem 10.5 (Fundamental Theorem for Finite Markov Chains). Let $\{X_n\}$ be a stationary Markov chain with a finite state space $S$, consisting of $t$ elements. Assume furthermore that $\{X_n\}$ is regular. Then, there exist $\pi_j, j = 1, 2, \ldots, t$ such that
(a) For any initial state $i$, $P(X_n = j \mid X_0 = i) \to \pi_j$, $j = 1, 2, \ldots, t$.
(b) $\pi_1, \pi_2, \ldots, \pi_t$ are the unique solutions of the system of equations $\pi_j = \sum_{i=1}^{t} \pi_i p_{ij}$, $j = 1, 2, \ldots, t$, $\sum_{j=1}^{t} \pi_j = 1$, where $p_{ij}$ denotes the $(i,j)$th element in the one-step transition matrix $P$.
Equivalently, the row vector $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$ is the unique solution of the equations $\pi P = \pi$, $\pi \mathbf{1}' = 1$, where $\mathbf{1}$ is a row vector with each coordinate equal to one.
(c) The chain $\{X_n\}$ is positive recurrent; that is, for any state $i$, the mean recurrence time $\mu_i = E_i(T_i) < \infty$, and furthermore $\mu_i = \frac{1}{\pi_i}$.
The vector $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$ is called the stationary distribution of the regular finite chain $\{X_n\}$. It is also sometimes called the equilibrium distribution or the invariant distribution of the chain. The difference in terminology can be confusing.

Suppose now that a stationary chain has a stationary distribution $\pi$. If we use this $\pi$ as the initial distribution of the chain, then we observe that
$$P(X_1 = j) = \sum_{k \in S} P(X_1 = j \mid X_0 = k)\,\pi_k = \pi_j,$$
by the fact that $\pi$ is a stationary distribution of the chain. Indeed, it now follows easily by induction that for any $n$, $P(X_n = j) = \pi_j$, $j \in S$. Thus, if a chain has a stationary distribution, and the chain starts out with that distribution, then at all subsequent times the distribution of the state of the chain remains exactly the same, namely the stationary distribution. This is why a chain that starts out with its stationary distribution is sometimes described to be in steady-state.

We now give a proof of parts (a) and (b) of the fundamental theorem of Markov
chains. For this, we use a famous result in linear algebra, which we state as a lemma;
see Seneta (1981) for a proof.

Lemma (Perron–Frobenius Theorem). Let $P$ be a real $t \times t$ square matrix with all elements $p_{ij}$ strictly positive. Then,
(a) $P$ has a positive real eigenvalue $\lambda_1$ such that for any other eigenvalue $\lambda_j$ of $P$, $|\lambda_j| < \lambda_1$, $j = 2, \ldots, t$.
(b) $\lambda_1$ satisfies
$$\min_i \sum_j p_{ij} \leq \lambda_1 \leq \max_i \sum_j p_{ij}.$$
(c) There exist left and right eigenvectors of $P$, each having only strictly positive elements, corresponding to the eigenvalue $\lambda_1$; that is, there exist vectors $\pi, \omega$, with both $\pi, \omega$ having only strictly positive elements, such that $\pi P = \lambda_1 \pi$; $P\omega = \lambda_1 \omega$.
(d) The algebraic multiplicity of $\lambda_1$ is one and the dimension of the set of left as well as right eigenvectors corresponding to $\lambda_1$ equals one.
(e) For any $i$, and any vector $(c_1, c_2, \ldots, c_t)$ with each $c_j > 0$,
$$\lim_n \frac{1}{n} \log \sum_j p_{ij}(n)\, c_j = \lim_n \frac{1}{n} \log \sum_j p_{ji}(n)\, c_j = \log \lambda_1.$$

Proof of Fundamental Theorem. Because for a transition probability matrix of a Markov chain the row sums are all equal to one, it follows immediately from the Perron–Frobenius theorem that if every element of $P$ is strictly positive, then $\lambda_1 = 1$ is an eigenvalue of $P$ and that there is a left eigenvector $\pi$ with only strictly positive elements such that $\pi P = \pi$. We can always normalize $\pi$ so that its elements add to exactly one, and so the renormalized $\pi$ is a stationary distribution for the chain, by the definition of a stationary distribution. If the chain is regular, then in general we can only assert that every element of $P^n$ is strictly positive for some $n$. Then the Perron–Frobenius theorem applies to $P^n$ and we have a left eigenvector $\pi$ satisfying $\pi P^n = \pi$. It can be proved from here that the same vector $\pi$ satisfies $\pi P = \pi$, and so the chain has a stationary distribution. The uniqueness of the stationary distribution is a consequence of part (d) of the Perron–Frobenius theorem.

Coming to part (a), note that part (a) asserts that every row of $P^n$ converges to the vector $\pi$; that is,
$$P^n \to \begin{pmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{pmatrix}.$$
We prove this by the diagonalization argument we previously used in working out a closed-form formula for $P^n$ in the hopping mosquito example. Thus, consider the case where the eigenvalues of $P$ are distinct, remembering that one eigenvalue is one, and the rest less than one in absolute value. Let $U^{-1} P U = L = \mathrm{diag}\{1, \lambda_2, \ldots, \lambda_t\}$, where
$$U = \begin{pmatrix} 1 & u_{12} & u_{13} & \cdots \\ 1 & u_{22} & u_{23} & \cdots \\ \vdots & & \vdots & \\ 1 & u_{t2} & u_{t3} & \cdots \end{pmatrix}; \qquad U^{-1} = \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_t \\ u^{21} & u^{22} & \cdots & u^{2t} \\ \vdots & & \vdots & \\ u^{t1} & u^{t2} & \cdots & u^{tt} \end{pmatrix}.$$
This implies $P = ULU^{-1} \Rightarrow P^n = UL^nU^{-1}$. Because each $\lambda_j$ for $j > 1$ satisfies $|\lambda_j| < 1$, we have $|\lambda_j|^n \to 0$ as $n \to \infty$. This fact, together with the explicit forms of $U, U^{-1}$ given immediately above, leads to the result that each row of $UL^nU^{-1}$ converges to the fixed row vector $\pi$, which is the statement in part (a). ∎

We assumed that our chain is regular for the fundamental theorem. An exercise
asks us to show that regularity is not necessary for the existence of a stationary
distribution. Regular chains are of course irreducible. But irreducibility alone is not
enough for the existence of a stationary distribution. More is said of the issue of
existence of a stationary distribution a bit later. For finite chains, irreducibility plus
aperiodicity is enough for the validity of the fundamental theorem because of the
simple reason that such chains are regular in the finite case. It is worth mentioning
this as a formal result.

Theorem 10.6. Let $\{X_n\}$ be a stationary Markov chain with a finite state space $S$. If $\{X_n\}$ is irreducible and aperiodic, then the fundamental theorem holds.

Example 10.18 (Weather Pattern). Consider the two-state Markov chain with the transition probability matrix
$$P = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix}.$$
Assume $0 < \alpha, \beta < 1$, so that the chain is regular. The stationary probabilities $\pi_1, \pi_2$ are to be found from the equation
$$(\pi_1, \pi_2) P = (\pi_1, \pi_2) \Rightarrow \alpha\pi_1 + (1-\beta)\pi_2 = \pi_1 \Rightarrow (1-\alpha)\pi_1 = (1-\beta)\pi_2 \Rightarrow \pi_2 = \frac{1-\alpha}{1-\beta}\,\pi_1.$$
Substituting this into $\pi_1 + \pi_2 = 1$ gives $\pi_1\left(1 + \frac{1-\alpha}{1-\beta}\right) = 1$, and so $\pi_1 = \frac{1-\beta}{2-\alpha-\beta}$, which then gives $\pi_2 = 1 - \pi_1 = \frac{1-\alpha}{2-\alpha-\beta}$. For example, if $\alpha = \beta = .8$, then we get $\pi_1 = \pi_2 = \frac{1-.8}{2-.8-.8} = .5$, which is the numerical limit we saw in our example by computing $P^n$ explicitly for large $n$. For general $0 < \alpha, \beta < 1$, each of the states is positive recurrent. For instance, if $\alpha = \beta = .8$, then $E_i(T_i) = \frac{1}{.5} = 2$ for each of $i = 1, 2$.

Example 10.19. With the row vector $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$ denoting the vector of stationary probabilities of a chain, $\pi$ satisfies the vector equation $\pi P = \pi$, and taking a transpose on both sides, $P'\pi' = \pi'$. That is, the column vector $\pi'$ is a right eigenvector of $P'$, the transpose of the transition matrix.

For example, consider the voting preferences example with
$$P = \begin{pmatrix} .8 & .05 & .15 \\ .03 & .9 & .07 \\ .1 & .1 & .8 \end{pmatrix}.$$
The transpose of $P$ is
$$P' = \begin{pmatrix} .8 & .03 & .1 \\ .05 & .9 & .1 \\ .15 & .07 & .8 \end{pmatrix}.$$
A set of its three eigenvectors is
$$\begin{pmatrix} .38566 \\ .74166 \\ .54883 \end{pmatrix}, \quad \begin{pmatrix} .44769 \\ -.81518 \\ .36749 \end{pmatrix}, \quad \begin{pmatrix} -.56867 \\ -.22308 \\ .79174 \end{pmatrix}.$$
Of these, the last two cannot be the eigenvector we are looking for, because they contain negative elements. The first eigenvector contains only nonnegative (actually, strictly positive) elements, and, when normalized to give elements that add to one, results in the stationary probability vector $\pi = (.2301, .4425, .3274)$. We could have also obtained it by the method of elimination as in our preceding example, but the eigenvector method is a general clean method, and is particularly convenient when the number of states $t$ is not small.
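This eigenvector computation is routine in any numerical linear algebra system. A hedged sketch, assuming Python with NumPy (note that np.linalg.eig returns right eigenvectors, so we feed it the transpose $P'$):

import numpy as np

P = np.array([[0.80, 0.05, 0.15],
              [0.03, 0.90, 0.07],
              [0.10, 0.10, 0.80]])

# Right eigenvectors of P' are left eigenvectors of P.
w, V = np.linalg.eig(P.T)
k = np.argmin(np.abs(w - 1.0))   # locate the eigenvalue 1
v = np.real(V[:, k])
pi = v / v.sum()                 # normalize the entries to add to one
print(pi.round(4))               # approximately (.2301, .4425, .3274)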

Example 10.20 (Ehrenfest Urn). Consider the symmetric version of the Ehrenfest urn model in which a certain number among $m$ balls are initially in urn I, the rest in urn II, and at each successive time one of the $m$ balls is selected completely at random and transferred to the other urn with probability $\frac{1}{2}$ (and left in the same urn with probability $\frac{1}{2}$). The one-step transition probabilities are
$$p_{i,i-1} = \frac{i}{2m}, \qquad p_{i,i+1} = \frac{m-i}{2m}, \qquad p_{ii} = \frac{1}{2}.$$
A stationary distribution $\pi$ would satisfy the equations
$$\pi_j = \pi_{j-1}\,\frac{m-j+1}{2m} + \pi_{j+1}\,\frac{j+1}{2m} + \frac{\pi_j}{2}, \quad 1 \leq j \leq m-1;$$
$$\pi_0 = \frac{\pi_0}{2} + \frac{\pi_1}{2m}; \qquad \pi_m = \frac{\pi_m}{2} + \frac{\pi_{m-1}}{2m}.$$
These are equivalent to the equations
$$\pi_0 = \frac{\pi_1}{m}; \quad \pi_m = \frac{\pi_{m-1}}{m}; \quad \pi_j = \pi_{j-1}\,\frac{m-j+1}{m} + \pi_{j+1}\,\frac{j+1}{m}, \quad 1 \leq j \leq m-1.$$
Starting with $\pi_1$, one can solve these equations by just successive substitution, leaving $\pi_0$ as an undetermined constant, to get $\pi_j = \binom{m}{j}\pi_0$. Now use the fact that $\sum_{j=0}^m \pi_j$ must equal one. This forces $\pi_0 = \frac{1}{2^m}$, and hence, $\pi_j = \frac{\binom{m}{j}}{2^m}$. We now realize that these are exactly the probabilities in a binomial distribution with parameters $m$ and $\frac{1}{2}$. That is, in the symmetric Ehrenfest urn problem, there is a stationary distribution and it is the Bin$(m, \frac{1}{2})$ distribution. In particular, after the process has evolved for a long time, we would expect close to half the balls to be in each urn. Each state is positive recurrent; that is, the chain is sure to return to its original configuration with a finite expected value for the time it takes to return to that configuration. As a specific example, suppose $m = 10$ and that initially there were five balls in each urn. Then, the stationary probability $\pi_5 = \frac{\binom{10}{5}}{2^{10}} = \frac{63}{256} = .246$, so that we can expect that after about four transfers the urns will once again have five balls each.
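The claim that the Bin$(m, \frac{1}{2})$ probabilities solve the stationary equations can also be checked numerically. A small sketch, assuming Python with NumPy and SciPy are available:

import numpy as np
from scipy.stats import binom

m = 10
P = np.zeros((m + 1, m + 1))
for i in range(m + 1):
    P[i, i] = 0.5                        # the chosen ball stays put
    if i > 0:
        P[i, i - 1] = i / (2 * m)        # a ball moves out of urn I
    if i < m:
        P[i, i + 1] = (m - i) / (2 * m)  # a ball moves into urn I

pi = binom.pmf(np.arange(m + 1), m, 0.5)  # conjectured stationary distribution
print(np.allclose(pi @ P, pi))            # True: pi P = pi
print(round(pi[5], 3))                    # 0.246, as computed above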

Example 10.21 (Asymmetric Random Walk). Consider a random walk $\{S_n\}, n \geq 0$, starting at zero, and taking independent steps of length one at each time, either to the left or to the right, with the respective probabilities depending on the current position of the walk. Formally, $S_n$ is a Markov chain with initial state zero, and with the one-step transition probabilities $p_{i,i+1} = \alpha_i$, $p_{i,i-1} = \beta_i$, $\alpha_i + \beta_i = 1$ for any $i \geq 0$. In order to restrict the state space of the chain to just the nonnegative integers $S = \{0, 1, 2, \ldots\}$, we assume that $\alpha_0 = 1$. Thus, if you ever reach zero, then you start over.

If a stationary distribution $\pi$ exists, by virtue of the matrix equation $\pi = \pi P$, it satisfies the recursion
$$\pi_j = \pi_{j-1}\alpha_{j-1} + \pi_{j+1}\beta_{j+1},$$
with the initial equation
$$\pi_0 = \pi_1 \beta_1.$$
This implies, by successive substitution,
$$\pi_1 = \frac{\pi_0}{\beta_1} = \frac{\alpha_0}{\beta_1}\,\pi_0; \qquad \pi_2 = \frac{\alpha_0\alpha_1}{\beta_1\beta_2}\,\pi_0; \ldots,$$
and for a general $j > 1$,
$$\pi_j = \frac{\alpha_0\alpha_1\cdots\alpha_{j-1}}{\beta_1\beta_2\cdots\beta_j}\,\pi_0.$$
Because each $\pi_j, j \geq 0$ is clearly nonnegative, the only issue is whether they constitute a probability distribution, that is, whether $\pi_0 + \sum_{j=1}^{\infty} \pi_j = 1$. This is equivalent to asking whether $(1 + \sum_{j=1}^{\infty} c_j)\pi_0 = 1$, where $c_j = \frac{\alpha_0\alpha_1\cdots\alpha_{j-1}}{\beta_1\beta_2\cdots\beta_j}$. In other words, the chain has a stationary distribution if and only if the infinite series $\sum_{j=1}^{\infty} c_j$ converges to some positive finite number $\delta$, in which case $\pi_0 = \frac{1}{1+\delta}$ and, for $j \geq 1$, $\pi_j = \frac{c_j}{1+\delta}$.

Consider now the special case when $\alpha_i = \beta_i = \frac{1}{2}$ for all $i \geq 1$. Then, for any $j \geq 1$, $c_j = 2$ (recall that $\alpha_0 = 1$), and hence $\sum_{j=1}^{\infty} c_j$ diverges. Therefore, the case of the symmetric random walk does not possess a stationary distribution, in the sense that no stationary distribution exists that is a valid probability distribution.
The stationary distribution of a Markov chain is not just the limit of the $n$-step transition probabilities; it also has important interpretations in terms of the marginal distribution of the state of the chain. Suppose the chain has run for a long time, and we want to know what the chances are that the chain is now in some state $j$. It turns out that the stationary probability $\pi_j$ approximates that probability too. The approximations are valid in a fairly strong sense, made precise below. Even more, $\pi_j$ is approximately equal to the fraction of the time so far that the chain has spent visiting state $j$. To describe these results precisely, we need a little notation.

Given a stationary chain $\{X_n\}$, we denote $f_n(j) = P(X_n = j)$. Also let $I_k(j) = I_{\{X_k = j\}}$, and $V_n(j) = \sum_{k=1}^{n} I_k(j)$. Thus, $V_n(j)$ counts the number of times up to time $n$ that the chain has been in state $j$, and $\delta_n(j) = \frac{V_n(j)}{n}$ measures the fraction of times up to time $n$ that the chain has been in state $j$. Then, the following results hold.
Theorem 10.7 (Weak Ergodic Theorem). Let $\{X_n\}$ be a regular Markov chain with a finite state space and the stationary distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$. Then,
(a) Whatever the initial distribution of the chain is, for any $j \in S$, $P(X_n = j) \to \pi_j$ as $n \to \infty$.
(b) For any $\epsilon > 0$, and for any $j \in S$, $P(|\delta_n(j) - \pi_j| > \epsilon) \to 0$ as $n \to \infty$.
(c) More generally, given any bounded function $g$, and any $\epsilon > 0$, $P\left(\left|\frac{1}{n}\sum_{k=1}^{n} g(X_k) - \sum_{j=1}^{t} g(j)\pi_j\right| > \epsilon\right) \to 0$ as $n \to \infty$.

Remark. See Norris (1997, p. 53) for a proof of this theorem. Also see Section 19.3.1 in this text, where an even stronger version is proved. The theorem provides a basis for estimating the stationary probabilities of a chain by following its trajectory for a long time. Part (c) of the theorem says that time averages of a general bounded function will ultimately converge to the state space average of the function with respect to the stationary distribution. In fact, a stronger convergence result than the one we state here holds, and it is commonly called the ergodic theorem for stationary Markov chains; see Brémaud (1999) or Norris (1997).
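The theorem suggests a simulation recipe: run the chain for a long time and use the occupation fractions $\delta_n(j)$ as estimates of $\pi_j$. A sketch, assuming Python with NumPy, applied to the weather pattern chain with $\alpha = \beta = .8$ (the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(7)
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])

n_steps = 100_000
state = 0
visits = np.zeros(2)
for _ in range(n_steps):
    state = rng.choice(2, p=P[state])  # one transition of the chain
    visits[state] += 1

print(visits / n_steps)  # occupation fractions; both entries near pi = (.5, .5)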

Exercises

Exercise 10.1. A particular machine is either in working order or broken on any particular day. If it is in working order on some day, it remains so the next day with probability .7, whereas if it is broken on some day, it stays broken the next day with probability .2.
(a) If it is in working order on Monday, what is the probability that it is in working order on Saturday?
(b) If it is in working order on Monday, what is the probability that it remains in working order all the way through Saturday?
Exercise 10.2. Consider the voting preferences example in the text with the transition probability matrix
$$P = \begin{pmatrix} .8 & .05 & .15 \\ .03 & .9 & .07 \\ .1 & .1 & .8 \end{pmatrix}.$$
Suppose a family consists of the two parents and a son. The three follow the same Markov chain described above in deciding their votes. Assume that the family members act independently, and that in this election the father voted Conservative, the mother voted Labor, and the son voted Independent.
(a) Find the probability that they will all vote the same parties in the next election as they did in this election.
(b) * Find the probability that as a whole, the family will split their votes among the three parties, one member for each party, in the next election.

Exercise 10.3. Suppose $\{X_n\}$ is a stationary Markov chain. Prove that for all $n$, and all $x_i, i = 0, 1, \ldots, n+2$, $P(X_{n+2} = x_{n+2}, X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_{n+2} = x_{n+2}, X_{n+1} = x_{n+1} \mid X_n = x_n)$.

Exercise 10.4 * (What the Markov Property Does Not Mean). Give an example of a stationary Markov chain with a small number of states such that $P(X_{n+1} = x_{n+1} \mid X_n \leq x_n, X_{n-1} \leq x_{n-1}, \ldots, X_0 \leq x_0) = P(X_{n+1} = x_{n+1} \mid X_n \leq x_n)$ is not true for arbitrary $x_0, x_1, \ldots, x_{n+1}$.
Exercise 10.5 (Ehrenfest Urn). Consider the Ehrenfest urn model when there are only two balls to distribute.
(a) Write the transition probability matrix $P$.
(b) Calculate $P^2, P^3$.
(c) Find general formulas for $P^{2k}, P^{2k+1}$.

Exercise 10.6 (The Cat and Mouse Chain). In one of two adjacent rooms, say room 1, there is a cat, and in the other one, room 2, there is a mouse. There is a small hole in the wall through which the mouse can travel between the rooms, and there is a larger hole through which the cat can travel between the rooms. Each minute, the cat and the mouse decide the room they want to be in by following stationary Markov chains with the transition probability matrices
$$P_1 = \begin{pmatrix} .5 & .5 \\ .5 & .5 \end{pmatrix}; \qquad P_2 = \begin{pmatrix} .1 & .9 \\ .6 & .4 \end{pmatrix}.$$
Let $X_n$ be the room in which the cat is at time $n$, and $Y_n$ the room in which the mouse is at time $n$. Assume that the chains $\{X_n\}$ and $\{Y_n\}$ are independent.
(a) Write the transition matrix for the chain $Z_n = (X_n, Y_n)$.
(b) Let $p_n = P(X_n = Y_n)$. Compute $p_n$ for $n = 1, 2, 3, 4, 5$, taking the initial time to be $n = 0$.
(c) The very first time that they end up in the same room, the cat will eat the mouse. Let $q_n$ be the probability that the cat eats the mouse at time $n$. Compute $q_n$ for $n = 1, 2, 3$.

Exercise 10.7 (Diagonalization in the Two-State Case). Consider a two-state stationary chain with the transition probability matrix
$$P = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix}.$$
(a) Find the eigenvalues of $P$. When are they distinct?
(b) Diagonalize $P$ when the eigenvalues are distinct.
(c) Hence find a general formula for $p_{11}(n)$.

Exercise 10.8. A flea is initially located on the top face of a cube, which has six
faces, top and bottom, left and right, and front and back. Every minute it moves
from its current location to one of the other five faces chosen at random.
(a) Find the probability that after four moves it is back to the top face.
(b) Find the probability that after n moves it is on the top face; on the bottom face.
(c) * Find the probability that the next five moves are distinct. This is the same as
the probability that the first six locations of the flea are the six faces of the cube,
each location chosen exactly once.

Exercise 10.9 (Subsequences of Markov Chains). Suppose $\{X_n\}$ is a stationary Markov chain. Let $Y_n = X_{2n}$. Prove or disprove that $\{Y_n\}$ is a stationary Markov chain. How about $\{X_{3n}\}$? $\{X_{kn}\}$ for a general $k$?

Exercise 10.10. Let $\{X_n\}$ be a three-state stationary Markov chain with the transition probability matrix
$$P = \begin{pmatrix} 0 & x & 1-x \\ y & 1-y & 0 \\ 1 & 0 & 0 \end{pmatrix}.$$
Define a function $g$ as $g(1) = 1, g(2) = g(3) = 2$ and let $Y_n = g(X_n)$. Is $\{Y_n\}$ a stationary Markov chain?
Give an example of a function $g$ such that $g(X_n)$ is not a Markov chain.

Exercise 10.11 (An IID Sequence). Let $X_i, i \geq 1$ be iid Poisson random variables with some common mean $\lambda$. Prove or disprove that $\{X_n\}$ is a stationary Markov chain. If it is, describe the transition probability matrix.
How important is the Poisson assumption? What happens if $X_i, i \geq 1$ are independent, but not iid?

Exercise 10.12. Let $\{X_n\}$ be a stationary Markov chain with transition matrix $P$, and $g$ a one-to-one function. Define $Y_n = g(X_n)$. Prove that $\{Y_n\}$ is a Markov chain, and characterize as well as you can the transition probability matrix of $\{Y_n\}$.

Exercise 10.13 * (Loop Chains). Suppose $\{X_n\}$ is a stationary Markov chain with state space $S$ and transition probability matrix $P$.
(a) Let $Y_n = (X_n, X_{n+1})$. Show that $Y_n$ is also a stationary Markov chain.
(b) Find the transition probability matrix of $Y_n$.
(c) How about $Y_n = (X_n, X_{n+1}, X_{n+2})$? Is this also a stationary Markov chain?
(d) How about $Y_n = (X_n, X_{n+1}, \ldots, X_{n+d})$ for a general $d \geq 1$?

Exercise 10.14 (Dice Experiments). Consider the experiment of rolling a fair die repeatedly. Define
(a) $X_n$ = number of sixes obtained up to the $n$th roll;
(b) $X_n$ = number of rolls, at time $n$, since a six was last obtained.
Prove or disprove that each $\{X_n\}$ is a Markov chain, and if they are, obtain the transition probability matrices.

Exercise 10.15. Suppose $\{X_n\}$ is a regular stationary Markov chain with transition probability matrix $P$. Prove that there exists $m \geq 1$ such that every element in $P^n$ is strictly positive for all $n \geq m$.

Exercise 10.16 (Communicating Classes). Consider a finite-state stationary Markov chain with the transition matrix
$$P = \begin{pmatrix} 0 & .5 & 0 & .5 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ .5 & 0 & 0 & 0 & .5 \\ 0 & .25 & .25 & .25 & .25 \\ .5 & 0 & 0 & 0 & .5 \end{pmatrix}.$$
(a) Identify the communicating classes of this chain.
(b) Identify those classes that are closed.

Exercise 10.17 * (Periodicity and Simple Random Walk). Consider the Markov chain corresponding to the simple random walk with general step probabilities $p, q$, $p + q = 1$.
(a) Identify the periodic states of the chain and the periods.
(b) Find the communicating classes.
(c) Are there any communicating classes that are not closed? If there are, identify them. If not, prove that there are no communicating classes that are not closed.

Exercise 10.18 * (Gambler's Ruin). Consider the Markov chain corresponding to the problem of the gambler's ruin, with initial fortune $a$, and absorbing states at 0 and $b$.
(a) Identify the periodic states of the chain and the periods.
(b) Find the communicating classes.
(c) Are there any communicating classes that are not closed? If there are, identify them.

Exercise 10.19. Prove that a stationary Markov chain with a finite state space has at least one closed communicating class.

Exercise 10.20 * (Chain with No Closed Classes). Give an explicit example of a stationary Markov chain with no closed communicating classes.
Exercise 10.21 (Skills Exercise). Consider the stationary Markov chains corresponding to the following transition probability matrices:
$$P = \begin{pmatrix} 1/3 & 2/3 & 0 \\ 0 & 1/3 & 2/3 \\ 1/3 & 0 & 2/3 \end{pmatrix}; \qquad P = \begin{pmatrix} 1/2 & 0 & 0 & 0 & 1/2 \\ 0 & 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 1/2 & 0 & 0 \\ 3/4 & 1/8 & 1/8 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 1/2 \end{pmatrix}.$$
(a) Are the chains irreducible?
(b) Are the chains regular?
(c) For each chain, find the communicating classes.
(d) Are there any periodic states? If there are, identify them.
(e) Do both chains have a stationary distribution? Is there anything special about the stationary distribution of either chain? If so, what is special?
Exercise 10.22 * (Recurrent States). Let $Z_i, i \geq 1$ be iid Poisson random variables with mean one. For each of the sequences
$$X_n = \sum_{i=1}^{n} Z_i, \qquad X_n = \max\{Z_1, \ldots, Z_n\}, \qquad X_n = \min\{Z_1, \ldots, Z_n\},$$
(a) Prove or disprove that $\{X_n\}$ is a stationary Markov chain.
(b) For those that are, write the transition probability matrix.
(c) Find the recurrent and the transient states of the chain.

Exercise 10.23 (Irreducibility and Aperiodicity). For stationary Markov chains with the following transition probability matrices, decide whether the chains are irreducible and aperiodic.
$$P = \begin{pmatrix} 0 & 1 \\ p & 1-p \end{pmatrix}; \quad P = \begin{pmatrix} 1/4 & 1/2 & 1/4 \\ 0 & 1/2 & 1/2 \\ 1 & 0 & 0 \end{pmatrix}; \quad P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ p & 1-p & 0 \end{pmatrix}.$$

Exercise 10.24 (Irreducibility of the Machine Maintenance Chain). Consider the machine maintenance example given in the text. Prove that the chain is irreducible if and only if $p_0 > 0$ and $p_0 + p_1 < 1$.
Do some numerical computing that reinforces this theoretical result.

Exercise 10.25 * (Irreducibility of Loop Chains). Let $\{X_n\}$ be a stationary Markov chain and consider the loop chain defined by $Y_n = (X_n, X_{n+1})$. Prove that if $\{X_n\}$ is irreducible, then so is $\{Y_n\}$.
Do you think this generalizes to $Y_n = (X_n, X_{n+1}, \ldots, X_{n+d})$ for general $d \geq 1$?

Exercise 10.26 * (Functions of a Markov Chain). Consider the Markov chain $\{X_n\}$ corresponding to the simple random walk with general step probabilities $p, q$, $p + q = 1$.
(a) If $f(\cdot)$ is any strictly monotone function defined on the set of integers, show that $\{f(X_n)\}$ is a stationary Markov chain.
(b) Is this true for a general chain $\{Y_n\}$? Prove it or give a counterexample.
(c) Show that $\{|X_n|\}$ is a stationary Markov chain, although $x \to |x|$ is not a strictly monotone function.
(d) Give an example of a function $f$ such that $\{f(X_n)\}$ is not a Markov chain.

Exercise 10.27 (A Nonregular Chain with a Stationary Distribution). Consider a two-state stationary Markov chain with the transition probability matrix
$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
(a) Show that the chain is not regular.
(b) Prove that nevertheless, the chain has a unique stationary distribution and identify it.

Exercise 10.28 * (Immigration–Death Model). At time $n, n \geq 1$, $U_n$ particles enter into a box. $U_1, U_2, \ldots$ are assumed to be iid with some common distribution $F$. The lifetimes of all the particles are assumed to be iid with common distribution $G$. Initially, there are no particles in the box. Let $X_n$ be the number of particles in the box just after time $n$.
(a) Take $F$ to be a Poisson distribution with mean two, and $G$ to be geometric with parameter $\frac{1}{2}$; that is, $G$ has the mass function $\frac{1}{2^x}, x = 1, 2, \ldots$. Write the transition probability matrix for $\{X_n\}$.
(b) Does $\{X_n\}$ have a stationary distribution? If it does, find it.

Exercise 10.29 * (Betting on the Basis of a Stationary Distribution). A particular stock either retains the value that it had at the close of the previous day, or gains a point, or loses a point, the respective states denoted as 1, 2, 3. Suppose $X_n$ is the state of the stock on the $n$th day; thus, $X_n$ takes the values 1, 2, or 3. Assume that $\{X_n\}$ forms a stationary Markov chain with the transition probability matrix
$$P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/3 & 1/3 & 1/3 \\ 1/2 & 3/8 & 1/8 \end{pmatrix}.$$
A friend offers you the following bet: if the stock goes up tomorrow, he pays you 15 dollars, and if it goes down you pay him 10 dollars. If it remains the same as where it closes today, a fair coin will be tossed and he will pay you 10 dollars if a head shows up, and you will pay him 15 dollars if a tail shows up. Will you accept this bet? Justify with appropriate calculations.

Exercise 10.30 * (Absent-Minded Professor). A mathematics professor has two umbrellas, both of which were originally at home. The professor walks back and forth between home and office, and if it is raining when he starts a journey, he carries an umbrella with him unless both his umbrellas are at the other location. If it is clear when he starts a journey, he does not take an umbrella with him. We assume that at the time of starting a journey, it rains with probability $p$, and that the states of weather are mutually independent.
(a) Find the limiting proportion of journeys in which the professor gets wet.
(b) What if the professor had three umbrellas to begin with, all of which were originally at home?
(c) Is the limiting proportion affected by how many were originally at home?

Exercise 10.31 * (Wheel of Fortune). A pointed arrow is set on a circular wheel marked with $m$ positions labeled as $0, 1, \ldots, m-1$. The hostess turns the wheel at each game, so that the arrow either remains where it was before the wheel was turned, or it moves to a different position. Let $X_n$ denote the position of the arrow after $n$ turns.
(a) Suppose at any turn, the arrow has an equal probability $\frac{1}{m}$ of ending up at any of the $m$ positions. Does $\{X_n\}$ have a stationary distribution? If it does, identify it.
(b) Suppose at each turn, the hostess keeps the arrow where it was, or moves it one position clockwise, or one position counterclockwise, each with an equal probability $\frac{1}{3}$. Does $\{X_n\}$ have a stationary distribution? If it does, identify it.
(c) Suppose again that at each turn, the hostess keeps the arrow where it was, or moves it one position clockwise, or one position counterclockwise, but now with respective probabilities $\alpha, \beta, \gamma$, $\alpha + \beta + \gamma = 1$. Does $\{X_n\}$ have a stationary distribution? If it does, identify it.

Exercise 10.32 (Wheel of Fortune Continued). Consider again the Markov chains corresponding to the wheel of fortune. Prove or disprove that they are irreducible and aperiodic.

Exercise 10.33 * (Stationary Distribution in Ehrenfest Model). Consider the general Ehrenfest chain defined in the text, with $m$ balls, and transfer probabilities $\alpha, \beta$, $0 < \alpha, \beta < 1$. Identify a stationary distribution, if it exists.

Exercise 10.34 * (Time Till Breaking Away). Consider a general stationary Markov chain $\{X_n\}$ and let $T = \min\{n \geq 1 : X_n \neq X_0\}$.
(a) Can $T$ be equal to $\infty$ with a positive probability?
(b) Give a simple necessary and sufficient condition for $P(T < \infty) = 1$.
(c) For each of the weather pattern, Ehrenfest urn, and the cat and mouse chains, compute $E(T \mid X_0 = i)$ for a general $i$ in the corresponding state space $S$.

Exercise 10.35 ** (Constructing Examples). Construct an example of each of the following phenomena.
(a) A Markov chain with only absorbing states.
(b) A Markov chain that is irreducible but not regular.
(c) A Markov chain that is irreducible but not aperiodic.
(d) A Markov chain on an infinite state space that is irreducible and aperiodic, but not regular.
(e) A Markov chain in which there is at least one null recurrent state.
(f) A Markov chain on an infinite state space such that every state is transient.
(g) A Markov chain such that each first passage time $T_{ij}$ has all moments finite.
(h) A Markov chain without a proper stationary distribution.
(i) Independent irreducible chains $\{X_n\}, \{Y_n\}$, such that $Z_n = (X_n, Y_n)$ is not irreducible.
(j) Markov chains $\{X_n\}, \{Y_n\}$ such that $Z_n = (X_n, Y_n)$ is not a Markov chain.

Exercise 10.36 * (Reversibility of a Chain). A stationary chain $\{X_n\}$ with transition probabilities $p_{ij}$ is called reversible if there is a function $m(x)$ such that $p_{ij}\, m(i) = p_{ji}\, m(j)$ for all $i, j \in S$. Give a simple sufficient condition in terms of the function $m$ which ensures that a reversible chain has a proper stationary distribution. Then, identify the stationary distribution.

Exercise 10.37. Give a physical interpretation for the property of reversibility of a Markov chain.

Exercise 10.38 (Reversibility). Give an example of a Markov chain that is reversible, and of one that is not.

Exercise 10.39 (Use Your Computer: Cat and Mouse). Take the cat and mouse chain and simulate it to find how long it takes for the cat and the mouse to end up in the same room. Repeat the simulation and estimate the expected time until the cat and the mouse end up in the same room. Vary the transition matrix and examine how the expected value changes.

Exercise 10.40 (Use Your Computer: Ehrenfest Urn). Take the symmetric Ehrenfest urn; that is, take $\alpha = \beta = .5$. Put all the $m$ balls in the second urn. Simulate the chain and find how long it takes for the urns to have an equal number of balls for the first time. Repeat the simulation and estimate the expected time until both urns have an equal number of balls. Take $m = 10, 20$.

Exercise 10.41 (Use Your Computer: Gambler's Ruin). Take the gambler's ruin problem with $p = .4, .49$. Simulate the chain using $a = 10, b = 25$, and find the proportion of times that the gambler goes broke by repeating the simulation. Compare your empirical proportion with the exact theoretical value of the probability that the gambler will go broke.
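As a starting point for these three simulation exercises, here is a hedged sketch for Exercise 10.41, assuming Python with NumPy; the helper name simulate_ruin is ours, and the comparison uses the classical ruin probability $\frac{(q/p)^a - (q/p)^b}{1 - (q/p)^b}$ with $q = 1 - p$. The other two exercises follow the same simulate-and-average pattern.

import numpy as np

def simulate_ruin(p, a, b, n_reps=20_000, seed=0):
    """Estimate the probability that the gambler goes broke (hits 0 before b)."""
    rng = np.random.default_rng(seed)
    broke = 0
    for _ in range(n_reps):
        fortune = a
        while 0 < fortune < b:
            fortune += 1 if rng.random() < p else -1
        broke += (fortune == 0)
    return broke / n_reps

for p in (0.4, 0.49):
    r = (1 - p) / p
    exact = (r**10 - r**25) / (1 - r**25)  # a = 10, b = 25
    print(p, simulate_ruin(p, 10, 25), round(exact, 4))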

References

Bhattacharya, R.N. and Waymire, E. (2009). Stochastic Processes with Applications, SIAM, Philadelphia.
Brémaud, P. (1999). Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Springer, New York.
Diaconis, P. (1988). Group Representations in Probability and Statistics, IMS Lecture Notes and Monographs Series, Hayward, CA.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Wiley, New York.
Freedman, D. (1975). Markov Chains, Holden-Day, San Francisco.
Isaacson, D. and Madsen, R. (1976). Markov Chains, Theory and Applications, Wiley, New York.
Kemperman, J. (1950). The General One-Dimensional Random Walk with Absorbing Barriers, Thesis, University of Amsterdam, Amsterdam.
Meyn, S. and Tweedie, R. (1993). Markov Chains and Stochastic Stability, Springer, New York.
Norris, J. (1997). Markov Chains, Cambridge University Press, Cambridge, UK.
Seneta, E. (1981). Nonnegative Matrices and Markov Chains, Springer-Verlag, New York.
Stirzaker, D. (1994). Elementary Probability, Cambridge University Press, Cambridge, UK.
Chapter 11
Random Walks

We have already encountered the simple random walk a number of times in the
previous chapters. Random walks occupy an extremely important place in prob-
ability because of their numerous applications, and because of their theoretical
connections in suitable limiting paradigms to other important random processes
in time. Random walks are used to model the value of stocks in economics, the
movement of the molecules of a particle in a liquid medium, animal movements in
ecology, diffusion of bacteria, movement of ions across cells, and numerous other
processes that manifest random movement in time in response to some external
stimuli. Random walks are indirectly of interest in various areas of statistics, such
as sequential statistical analysis and testing of hypotheses. They also help a student
of probability simply to understand randomness itself better.
We present a treatment of the theory of random walks in one or more dimensions
in this section, focusing on the asymptotic aspects that relate to the long run prob-
abilistic behavior of a particle performing a random walk. We recommend Feller
(1971), Rényi (1970), and Spitzer (2008) for classic treatment of random walks;
Spitzer (2008), in particular, provides a comprehensive coverage of the theory of
random walks with numerous examples in setups far more general than we consider
in this chapter.

11.1 Random Walk on the Cubic Lattice

Definition 11.1. A particle is said to perform a $d$-dimensional cubic lattice random walk, starting at the origin, if at each time instant $n, n \geq 1$, one of the $d$ locational coordinates of the particle is changed by either $+1$ or $-1$, the direction of movement being equally likely to be any of these $2d$ possibilities.

As an example, in three dimensions, the particle's initial position is $S_0 = (0, 0, 0)$ and the position at time $n = 1$ is $S_1 = S_0 + X_1$, where $X_1$ takes one of the values $(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)$ with an equal probability, and so on for the successive steps of the walk. In the cubic lattice random walk, at any particular time, the particle can take a unit step to the right, or left, or front,


or back, or up, or down, choosing one of these six options with an equal probability.
In d dimensions, it chooses one of the 2d options with an equal probability.
We show two simulations of 100 steps of a simple symmetric random walk on the line; this is just the cubic lattice random walk when $d = 1$. We use these two simulated plots to illustrate a number of important and interesting variables connected to a random walk. One example of such an interesting variable is the number of times the walk returns to its starting position in the first $n$ steps (in the simulated plots, $n = 100$).
We now give some notation:
$$S_n = d\text{-dimensional cubic lattice random walk}, \; n \geq 0, \; S_0 = 0;$$
$$\mathcal{S} = \text{state space of } \{S_n\} = \{(i_1, i_2, \ldots, i_d) : i_1, i_2, \ldots, i_d \in \mathbb{Z}\};$$
$$P_{d,n} = P(S_n = 0); \qquad P_n \equiv P_{1,n};$$
$$Q_{d,n} = P(S_k \neq 0 \;\forall\, 1 \leq k < n,\; S_n = 0) = P(\text{the walk returns to its starting point } 0 \text{ for the first time at time } n);$$
$$Q_d = \sum_{n=1}^{\infty} Q_{d,n} = \sum_{k=1}^{\infty} Q_{d,2k} = P(\text{the walk ever returns to its starting point } 0).$$
And, for $d = 1$,
$$\tau_i = \text{the time of the } i\text{th return of the walk to } 0;$$
$$\eta_n = \text{number of times the walk returns to } 0 \text{ in the first } n \text{ steps};$$
$$\pi_n = \sum_{k=1}^{n} I_{\{S_k > 0\}} = \text{number of times in the first } n \text{ steps that the walk takes a positive value}.$$

Example 11.1 (Two Simulated Walks). We refer to the plots in Fig. 11.1 and Fig. 11.2 of the first 100 steps of the two simulated simple random walks. First, note that in both plots, the walk does not at all stay above and below the axis about 50% of the time each. In the first simulation, the walk spends most of its time on the negative side, and in the second simulation, it does the reverse. This is borne out by theory, although at first glance it seems counter to the intuition of most people. We give a table providing the values of the various quantities defined above corresponding to the two simulations.

Walk   $\eta_n$   $\tau_i$                        $\pi_n$
1      3          2, 14, 16                       11
2      8          2, 4, 6, 8, 10, 22, 24, 34      69

Fig. 11.1 A simulated random walk

Fig. 11.2 A simulated random walk
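Simulations such as those in Figs. 11.1 and 11.2, together with the quantities $\eta_n$, $\tau_i$, and $\pi_n$ tabulated above, take only a few lines to produce. A sketch, assuming Python with NumPy (the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(11)
steps = rng.choice([-1, 1], size=100)  # 100 iid symmetric steps
S = np.cumsum(steps)                   # the walk S_1, ..., S_100

eta = int(np.sum(S == 0))              # eta_n: number of returns to 0
tau = np.flatnonzero(S == 0) + 1       # tau_i: the return times
pi_n = int(np.sum(S > 0))              # pi_n: number of positive values
print(eta, tau, pi_n)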

A matter of key interest is whether the random walk returns to its origin, and if so, how many times. More generally, we may ask how many times the random walk visits a given state $x \in \mathcal{S}$. These considerations lead to the following fundamental definition.

Definition 11.2. A state $x \in \mathcal{S}$ is said to be a recurrent state if
$$P(S_n = x \text{ infinitely often}) = 1.$$
The random walk $\{S_n\}$ is said to be recurrent if every $x \in \mathcal{S}$ is a recurrent state. $\{S_n\}$ is said to be transient if it is not recurrent.

11.1.1 Some Distribution Theory

We now show some exact and asymptotic distribution theory. Part of the asymptotic distribution theory has elements of surprise.

It is possible to write combinatorial formulas for $P_{d,n}$. These simplify to simple expressions for $d = 1, 2$. Note first that the walk cannot return to the origin at odd times $2n+1$. For the walk to return to the origin at an even time $2n$, in each of the $d$ coordinates the walk has to take an equal number of positive and negative steps. We can think of this as a multinomial trial, where the $2n$ times $1, 2, \ldots, 2n$ are thought of as $2n$ balls, and they are dropped into $2d$ cells, with the restriction that pairs of cells receive equal numbers of balls $n_1, n_1, n_2, n_2, \ldots, n_d, n_d$. Thus,
$$P_{d,2n} = \frac{1}{(2d)^{2n}} \sum_{n_1,\ldots,n_d \geq 0,\; n_1+\cdots+n_d=n} \frac{(2n)!}{(n_1!)^2 \cdots (n_d!)^2} = \frac{\binom{2n}{n}}{(2d)^{2n}} \sum_{n_1,\ldots,n_d \geq 0,\; n_1+\cdots+n_d=n} \left(\frac{n!}{n_1!\cdots n_d!}\right)^2.$$
In particular,
$$P_{1,2n} = \frac{\binom{2n}{n}}{4^n}; \qquad P_{2,2n} = \frac{\left[\binom{2n}{n}\right]^2}{16^n}; \qquad P_{3,2n} = \frac{\binom{2n}{n}}{36^n} \sum_{k,l \geq 0,\; k+l \leq n} \left(\frac{n!}{k!\,l!\,(n-k-l)!}\right)^2.$$
Apart from the fact that simply computable exact formulas are always attractive, these formulas have other very important implications, as we demonstrate shortly.
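For instance, the formulas can be evaluated exactly with integer arithmetic. A sketch, assuming Python 3.8+ (for math.comb):

from math import comb, factorial

def P1_2n(n):
    return comb(2 * n, n) / 4**n

def P2_2n(n):
    return comb(2 * n, n) ** 2 / 16**n

def P3_2n(n):
    # sum of squared trinomial coefficients over k + l <= n
    s = sum((factorial(n) // (factorial(k) * factorial(l) * factorial(n - k - l))) ** 2
            for k in range(n + 1) for l in range(n - k + 1))
    return comb(2 * n, n) * s / 36**n

for n in (1, 2, 5, 10):
    print(2 * n, P1_2n(n), P2_2n(n), P3_2n(n))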

Example 11.2. We give a plot of $P_{d,2n}$ in Fig. 11.3 for $d = 1, 2, 3$ and $n \leq 25$. There are two points worth mentioning. The return probabilities for $d = 2$ and $d = 3$ cross each other. As $n \to \infty$, $P_{3,n} \to 0$ at a faster rate; this is shown later. The second point is that the return probabilities decrease the most when the dimension jumps from one to two.

In addition, here are some actual values of the return probabilities.



Fig. 11.3 Probability that the $d$-dimensional lattice walk equals 0 at time $2n$, $d = 1, 2, 3$

n     $P_{1,n}$   $P_{2,n}$   $P_{3,n}$
2     .5          .25         .222
4     .375        .145        .116
6     .3125       .098        .080
10    .246        .061        .052
20    .176        .031        .031
50    .112        .013        .017

11.1.2 Recurrence and Transience

It is possible to make deeper conclusions about the lattice random walk by using the above exact formulas for $P_{d,n}$. Plainly, by using Stirling's approximation,
$$P_{1,2n} \sim \frac{1}{\sqrt{\pi n}}; \qquad P_{2,2n} \sim \frac{1}{\pi n},$$
where recall that the notation $\sim$ means that the ratio converges to one as $n \to \infty$. Establishing the asymptotic order of $P_{d,n}$ for $d > 2$ takes a little more effort. This can be done by other more powerful means, but we take a direct approach. We use two facts:
(a) $\frac{1}{d^n} \sum_{n_1,\ldots,n_d \geq 0,\; n_1+\cdots+n_d=n} \frac{n!}{n_1!\cdots n_d!} = 1$.
(b) The multinomial coefficient $\frac{n!}{n_1!\cdots n_d!}$ is maximized, essentially, when $n_1, \ldots, n_d$ are all equal.

Now, using the exact expression from above,
$$P_{d,2n} = \frac{\binom{2n}{n}}{(2d)^{2n}} \sum_{n_1+\cdots+n_d=n} \left(\frac{n!}{n_1!\cdots n_d!}\right)^2 = \frac{\binom{2n}{n}}{4^n} \sum_{n_1+\cdots+n_d=n} \frac{1}{d^n}\,\frac{n!}{n_1!\cdots n_d!}\;\cdot\; \frac{1}{d^n}\,\frac{n!}{n_1!\cdots n_d!}$$
$$\leq \frac{\binom{2n}{n}}{4^n}\; \frac{1}{d^n} \max_{n_1+\cdots+n_d=n} \frac{n!}{n_1!\cdots n_d!} \approx \frac{\binom{2n}{n}}{4^n}\; \frac{1}{d^n}\; \frac{n!}{\left[\Gamma\!\left(\frac{n}{d}+1\right)\right]^d} = O\!\left(\frac{1}{n^{d/2}}\right),$$
by applying Stirling's approximation to the factorials and also the Gamma function. This proves that $P_{d,2n} = O(\frac{1}{n^{d/2}})$, but it does not prove that $P_{d,2n}$ is of the exact order of $\frac{1}{n^{d/2}}$. In fact, it is. We state the following exact asymptotic result.

Theorem 11.1. For the cubic lattice random walk in $d$ dimensions, $1 \leq d < \infty$,
$$P_{d,2n} = \frac{1}{2^{d-1}} \left(\frac{d}{\pi n}\right)^{d/2} + o(n^{-d/2}).$$
See Rényi (1970, p. 503). An important consequence of this theorem is the following result.

Corollary. The $d$-dimensional cubic lattice random walk is transient for $d \geq 3$.

Proof. From the above theorem, if $d \geq 3$, then $\sum_{n=1}^{\infty} P_{d,n} = \sum_{n=1}^{\infty} P_{d,2n} < \infty$, because $\sum_{n=1}^{\infty} n^{-d/2} < \infty$ for all $d \geq 3$, and because any sequence that is $o(n^{-d/2})$ for some $d \geq 3$ is also summable. Therefore, by the Borel–Cantelli lemma, the probability that the random walk takes the zero value infinitely often is zero. Hence, for $d \geq 3$, the walk must be transient. ∎
How about the case of one and two dimensions? In those two cases, the cubic lattice random walk is in fact recurrent. To show this, note that it is enough to show that the walk returns to zero at least once with probability one (if it returns to zero at least once with certainty, it will return infinitely often with certainty). But, using our previous notation,
$$P(\text{the random walk returns to zero at least once}) = \sum_{n=1}^{\infty} Q_{d,2n}.$$
This can be proved to be one for $d = 1, 2$ in several ways. One possibility is to show, by using generating functions, that, always,
$$\sum_{n=1}^{\infty} Q_{d,2n} = \frac{\sum_{n=1}^{\infty} P_{d,2n}}{1 + \sum_{n=1}^{\infty} P_{d,2n}},$$
and therefore $\sum_{n=1}^{\infty} Q_{d,2n} = 1$ if $d = 1, 2$, because in those dimensions $\sum_{n=1}^{\infty} P_{d,2n} = \infty$. Other more sophisticated methods using more analytical means are certainly available; one of these is to use the Chung–Fuchs (1951) theorem, which is described in a subsequent section. There is also an elegant Markov chain argument for proving the needed recurrence result, and we outline that argument below. In fact, the cubic lattice random walk visits every state infinitely often with probability one when $d = 1, 2$; we record this formally.

Theorem 11.2 (Recurrence and Transience). The $d$-dimensional cubic lattice random walk is recurrent for $d = 1, 2$ and transient for $d \geq 3$.

We now outline a Markov chain argument for establishing the recurrence properties of a general Markov chain. This then automatically applies to the lattice random walk, because it is a Markov chain. For notational brevity, we treat only the case $d = 1$. But the Markov chain argument will in fact work in any number of dimensions. Here is the Markov chain theorem that we use (this is the same as Theorem 10.2).

Theorem 11.3. Let $\{X_n\}, n \geq 0$ be a Markov chain with the state space $S$ equal to the set of all integers. Assume that $X_0 = 0$. Let
$$p_{00}^{(k)} = P(X_k = 0); \quad \pi = P(\text{the chain always returns to zero}); \quad N = \text{total number of visits of the chain to zero}.$$
Then,
$$\sum_{k=1}^{\infty} p_{00}^{(k)} = \sum_{n=1}^{\infty} \pi^n = E(N).$$

The consequence we are interested in is the following.

Corollary 11.1. The Markov chain $\{X_n\}$ returns to zero infinitely often with probability one if and only if $\sum_{k=1}^{\infty} p_{00}^{(k)} = \infty$.
It is now clear why the lattice random walk is recurrent in one dimension. Indeed, we have already established that for the one-dimensional lattice random walk, $p_{00}^{(k)} = 0$ if $k$ is odd, and $p_{00}^{(2n)} = P_{1,2n} \sim \frac{1}{\sqrt{\pi n}}$. Consequently, $\sum_{k=1}^{\infty} p_{00}^{(k)} = \sum_{n=1}^{\infty} P_{1,2n} = \infty$, and therefore, by the general Markov chain theorem we stated above, the chain returns to zero infinitely often with probability one in one dimension.

Exactly the same Markov chain argument also establishes the lattice random walk's recurrence in two dimensions, and transience in dimensions higher than two. This is an elegant and shorter derivation of the recurrence structure of the lattice random walk than the alternative direct method we provided first in this section. The disadvantage of the shorter proof is that we must appeal to Markov chain theory to give the shorter proof.

11.1.3 * Pólya's Formula for the Return Probability

We just saw that in dimensions three or higher, the probability that the cubic lattice random walk returns to zero infinitely often is zero. In notation, if $d \geq 3$, $P(S_n = 0 \text{ for infinitely many } n) = 0$. But this does not mean that $P(S_n = 0 \text{ for some } n) = 0$ also. This latter probability is the probability that the walk returns to the origin at least once. We know that for $d \geq 3$ it is not 1; but it need not be zero. Indeed, it is something in between. In 1921, Pólya gave a pretty formula for the probability that the cubic lattice random walk returns to the origin at least once.
Theorem 11.4 (Pólya's Formula). Let $S_n$ be the cubic lattice random walk starting at 0. Then,
$$Q_d = P(\text{the walk returns to } 0 \text{ at least once}) = \frac{\sum_{n=1}^{\infty} P_{d,2n}}{1 + \sum_{n=1}^{\infty} P_{d,2n}},$$
where
$$\sum_{n=1}^{\infty} P_{d,2n} = \text{the expected number of returns to the origin} = \frac{d}{(2\pi)^d} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} \left(d - \sum_{k=1}^{d} \cos\lambda_k\right)^{-1} d\lambda_1 \cdots d\lambda_d \; - 1 = \int_0^{\infty} e^{-t} \left[I_0\!\left(\frac{t}{d}\right)\right]^d dt \; - 1.$$
In particular, for $d = 3$,
$$\sum_{n=1}^{\infty} P_{3,2n} = \frac{3}{(2\pi)^3} \int_{-\pi}^{\pi}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} \frac{dx\,dy\,dz}{3 - \cos x - \cos y - \cos z} \; - 1 = \frac{\sqrt{6}}{32\pi^3}\, \Gamma\!\left(\frac{1}{24}\right)\Gamma\!\left(\frac{5}{24}\right)\Gamma\!\left(\frac{7}{24}\right)\Gamma\!\left(\frac{11}{24}\right) - 1 = .5164$$
$$\Rightarrow P(\text{the three-dimensional cubic lattice random walk returns at least once to the origin}) = .3405.$$
In addition to Pólya (1921), see Finch (2003, p. 322) for these formulas.

Example 11.3 (Expected Number and Probability of Return). For computational purposes, the integral representation using the Bessel $I_0$ function is the most convenient, although careful numerical integration is necessary to get accurate values. Some values are given in the following table.

d     Expected number of returns   Probability of return
3     .5164                        .3405
4     .2394                        .1931
5     .1562                        .1351
8     .0785                        .0728
10    .0594                        .0561
15    .0371                        .0358
50    .0102                        .0101
51    .0100                        .0099

We note that it takes about 50 dimensions for the probability of a return to drop to 1%.
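A hedged numerical-integration sketch, assuming Python with SciPy; i0e is SciPy's exponentially scaled Bessel function $e^{-|x|}I_0(x)$, which lets us write the integrand $e^{-t}[I_0(t/d)]^d$ stably as $[\mathtt{i0e}(t/d)]^d$:

import numpy as np
from scipy.integrate import quad
from scipy.special import i0e

def expected_returns(d):
    # integral of e^{-t} I0(t/d)^d dt over (0, inf), minus 1
    integrand = lambda t: i0e(t / d) ** d
    value, _ = quad(integrand, 0, np.inf, limit=500)
    return value - 1.0

for d in (3, 4, 5, 10):
    m = expected_returns(d)
    print(d, round(m, 4), round(m / (1 + m), 4))  # compare with the table above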

11.2 First Passage Time and Arc Sine Law

For the simple symmetric random walk on the real line starting at zero, consider the number of steps necessary for the walk to reach a given integer $j$ for the first time:
$$T_j = \min\{n > 0 : S_n = j\}.$$
Also let $T_{j,r}, r \geq 1$ denote the successive visit times of the random walk to the integer $j$; note that the one-dimensional symmetric walk we are considering visits every integer $j$ infinitely often with probability one, and so it is sensible to talk of $T_{j,r}$.

Definition 11.3. $T_j = T_{j,1}$ is called the first passage time to $j$, and the sequence $\{T_{j,r}\}$ the recurrence times of $j$.

For the special case $j = 0$, we denote the recurrence times as $\tau_1, \tau_2, \tau_3, \ldots$, instead of using the more complicated notation $T_{0,1}, T_{0,2}, \ldots$. Our goal is to write the distribution of $\tau_1$, and from there conclude the asymptotic distribution of $\tau_r$ as $r \to \infty$. It turns out that although the random walk returns to zero infinitely often with probability one, the expected value of $\tau_1$ is infinite! This precludes Gaussian asymptotics for $\tau_r$ as $r \to \infty$. Recall also that $\eta_n$ denotes the number of times in the first $n$ steps that the random walk returns to zero.

Theorem 11.5 (Distribution of Return Times).
(a) $\tau_1, \tau_2 - \tau_1, \tau_3 - \tau_2, \ldots$ are independent and identically distributed.
(b) The generating function of $\tau_1$ is $E(t^{\tau_1}) = 1 - \sqrt{1 - t^2}$, $|t| \leq 1$.
(c) $P(\tau_1 = 2k+1) = 0 \;\forall k$; $P(\tau_1 = 2k) = \frac{\binom{2k-2}{k-1}}{k\, 2^{2k-1}}$.
(d) $E(\tau_1) = \infty$.
(e) The characteristic function of $\tau_1$ is $\psi_1(t) = 1 - \sqrt{1 - e^{2it}}$.

(f) $P\left(\frac{\tau_r}{r^2} \leq x\right) \to 2\left[1 - \Phi\!\left(\frac{1}{\sqrt{x}}\right)\right] \;\forall x > 0$, as $r \to \infty$.
(g) $P\left(\frac{\eta_n}{\sqrt{n}} \leq x\right) \to 2\Phi(x) - 1 \;\forall x > 0$, as $n \to \infty$.

Detailed Sketch of Proof. Part (a) is a consequence of the observation that each time the random walk returns to zero, it simply starts over again. A formal proof establishes that the Markov property is preserved at first passage times.

For part (b), first obtain the generating function
$$G(t) = \sum_{n=1}^{\infty} P_{1,n} t^n = \sum_{n=1}^{\infty} P_{1,2n} t^{2n} = \sum_{n=1}^{\infty} \frac{\binom{2n}{n}}{4^n}\, t^{2n} = \frac{1}{\sqrt{1-t^2}} - 1, \quad |t| < 1.$$
Now, notice that the sequences $P_{1,n}, Q_{1,n}$ satisfy the recursion relation
$$P_{1,2n} = Q_{1,2n} + \sum_{k=1}^{n-1} P_{1,2k}\, Q_{1,2n-2k}.$$
This results in the functional identity
$$E(t^{\tau_1}) = \sum_{n=1}^{\infty} Q_{1,n} t^n = \frac{G(t)}{1 + G(t)}, \quad |t| < 1,$$
which produces the expression of part (b). Part (c) now comes out of part (b) on differentiation of the generating function $E(t^{\tau_1})$. Part (d) follows on directly showing that $\sum_{k=1}^{\infty} 2k\, \frac{\binom{2k-2}{k-1}}{k\, 2^{2k-1}} = \infty$.

Parts (e) and (f) are connected. The characteristic function formula follows by a direct evaluation of the complex power series $\psi_1(t) = \sum_{n=1}^{\infty} Q_{1,n} e^{itn} = \sum_{n=1}^{\infty} Q_{1,2n} e^{2itn} = \sum_{n=1}^{\infty} e^{2itn}\, \frac{\binom{2n-2}{n-1}}{n\, 2^{2n-1}}$. Alternatively, it can also be deduced by the argument that led to the generating function formula of part (b).

For part (f), by the iid property of $\tau_1, \tau_2 - \tau_1, \ldots$, we can represent $\tau_r$ as a sum of iid variables
$$\tau_r = \tau_1 + (\tau_2 - \tau_1) + \cdots + (\tau_r - \tau_{r-1}).$$
Therefore, by virtue of part (e), we can write the characteristic function $\psi_r(t)$ of $\tau_r$, and we get
$$\lim_{r\to\infty} \psi_r\!\left(\frac{t}{r^2}\right) = \lim_{r\to\infty} \left[\psi_1\!\left(\frac{t}{r^2}\right)\right]^r = e^{-\sqrt{-2it}}.$$
We now use the inversion technique to produce a density, which, by the continuity theorem for characteristic functions, must be the density of the limiting distribution of $\frac{\tau_r}{r^2}$. The inversion formula gives
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\, e^{-\sqrt{-2it}}\, dt = \frac{e^{-\frac{1}{2x}}\, x^{-3/2}}{\sqrt{2\pi}}, \quad x > 0.$$
The CDF corresponding to this density $f$ is $2\left[1 - \Phi\!\left(\frac{1}{\sqrt{x}}\right)\right], x > 0$, which is what part (f) says.

Finally, part (g) is actually a restatement of part (f), because of the identity $P(\tau_r \leq n) = P(\eta_n \geq r)$. ∎
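The return-time probabilities in part (c) are easy to corroborate by simulation. A sketch, assuming Python with NumPy (the truncation at max_steps introduces a small bias, since $\tau_1$ has no finite mean):

import numpy as np
from math import comb

rng = np.random.default_rng(3)

def first_return(max_steps=10_000):
    s = 0
    for n in range(1, max_steps + 1):
        s += rng.choice([-1, 1])
        if s == 0:
            return n
    return None  # no return within max_steps (happens with small probability)

samples = [first_return() for _ in range(5_000)]
for k in (1, 2, 3):
    exact = comb(2 * k - 2, k - 1) / (k * 2 ** (2 * k - 1))
    estimate = np.mean([t == 2 * k for t in samples])
    print(2 * k, round(exact, 4), round(float(estimate), 4))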

Example 11.4 (Returns to Origin and Stable Law of Index $\frac{1}{2}$). There are several interesting things about the density $f(x)$ of the limiting distribution of $\frac{\tau_r}{r^2}$. First, by inspection, we see that it is the density of $\frac{1}{Z^2}$ where $Z \sim N(0,1)$. It clearly does not have a finite expectation (and neither does $\tau_r$ for any $r$). Moreover, the characteristic function of $f$ matches the form of the characteristic function of a one-sided (positive) stable law of index $\alpha = \frac{1}{2}$. This is an example of a density with tails even heavier than that of the absolute value of a Cauchy random variable. Although the recurrence times do not have a finite expectation, either for finite $r$ or asymptotically, it is interesting to know that the median of the asymptotic distribution of $\frac{\tau_r}{r^2}$ is $\left[\frac{1}{\Phi^{-1}(3/4)}\right]^2 = 2.195$. So, for large $r$, with probability about 50%, the $r$th return to zero will come within about $2.2\, r^2$ steps of the random walk.

A plot of the asymptotic density in Fig. 11.4 shows the extremely sharp peak near zero, accompanied by a very flat tail. The lack of a finite expectation is due to this tail.

Fig. 11.4 Asymptotic density of scaled return times



We next turn to the question of studying the proportion of time that a random walk spends above the horizontal axis. In notation, we are interested in the distribution of $\frac{\pi_n}{n}$, where recall that $\pi_n = \sum_{k=1}^{n} I_{\{S_k > 0\}}$. The steps each have a symmetric distribution, and they are independent; thus each $S_k$ has a symmetric distribution. Intuitively, one might expect that a trajectory of the random walk, meaning the graph of the points $(n, S_n)$, is above and below the axis close to 50% of the time. The arc sine law, which we now present, says that this is not true. It is more likely that the proportion $\frac{\pi_n}{n}$ will either be near zero or near one, than that it will be near the naively expected value $\frac{1}{2}$. We provide formulas for the exact and asymptotic distribution of $\pi_n$.

Theorem 11.6 (Arc Sine Law).
(a) For each $n \geq 1$, $P(\pi_n = k) = \frac{\binom{2k}{k}\binom{2n-2k}{n-k}}{4^n}$, $k = 0, 1, \ldots, n$, with $\binom{0}{0} = 1$.
(b) For each $x \in [0,1]$, $P\left(\frac{\pi_n}{n} \leq x\right) \to \frac{2}{\pi} \arcsin(\sqrt{x})$ as $n \to \infty$.

The proof of part (a) is a careful combinatorial exercise, and we omit it. Feller (1971) gives a very careful presentation of the argument. However, granted part (a), part (b) then follows by a straightforward application of Stirling's approximation, using the exact formula of part (a) with $k = k_n = \lfloor nx \rfloor$.

Example 11.5. The density of the CDF $\frac{2}{\pi}\arcsin(\sqrt{x})$ is the Beta density $f(x) = \frac{1}{\pi\sqrt{x(1-x)}}, 0 < x < 1$. The density is unbounded as $x \to 0, 1$, and has its minimum (rather than the maximum) at $x = \frac{1}{2}$. Consider the probability that the random walk takes a positive value between 45% and 55% of the times in the first $n$ steps. The arc sine law implies, by just integrating the Beta density, that this probability is $\approx .0637$ for large $n$. Now, consider the probability that the random walk takes a positive value either more than 95% or less than 5% of the times in the first $n$ steps. By integrating the Beta density, we find this probability to be $\approx .2871$, more than four times the $.0637$ value. We can see the tendency of a trajectory to spend most of the time either above or below the axis, rather than dividing its time on a close to 50-50 basis above and below the axis. This seems counter to intuition largely because we automatically expect a random variable to concentrate near its expectation, which in this case is $\frac{1}{2}$. The arc sine law of random walks says that this sort of intuitive expectation can land us in trouble.
We provide some values on the exact distribution of $\frac{\eta_n}{n}$ for certain selected $n$ (a computational sketch reproducing these values appears after the table).

n        P(.45 ≤ η_n/n ≤ .55)        P(η_n/n ≤ .05 or ≥ .95)
10              .0606                        .5379
25              .0500                        .4269
50              .0631                        .3518
100             .0698                        .3077
250             .0636                        .3011
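The table can be reproduced directly from the exact pmf of Theorem 11.6(a); the short Python sketch below is ours, not the author's code. One caveat: the right-hand column appears to include the boundary cell $k = \lceil .05n \rceil$ (and its mirror image); with that convention the printed values match.

```python
from math import ceil, comb

def pmf(n, k):
    # P(eta_n = k) = C(2k,k) C(2n-2k,n-k) / 4^n, Theorem 11.6(a)
    return comb(2 * k, k) * comb(2 * n - 2 * k, n - k) / 4 ** n

for n in (10, 25, 50, 100, 250):
    mid = sum(pmf(n, k) for k in range(n + 1) if 0.45 * n <= k <= 0.55 * n)
    c = ceil(0.05 * n)  # boundary cell, apparently included in the table
    tail = sum(pmf(n, k) for k in range(n + 1) if k <= c or k >= n - c)
    print(n, round(mid, 4), round(tail, 4))
```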

11.3  The Local Time

Consider the simple symmetric random walk starting at zero, with iid steps having the common distribution $P(X_i = 1) = P(X_i = -1) = \frac{1}{2}$, and fix any integer $x$. The family of random variables

$$\ell(x, n) = \#\{k : 0 < k \le n, S_k = x\}$$

is called the local time of the random walk. Note, therefore, that $\ell(0, n)$ is the same as $\rho_n$, the number of returns to zero within the first $n$ steps. The local time of the random walk answers the following interesting question: how much time does the random walk spend at $x$ up to the time $n$? It turns out that the distribution of $\ell(x, n)$ can be written in a simple closed-form, and therefore we can compute it.

Note that $\ell(0, 2n) = \ell(0, 2n+1)$, because the simple symmetric walk starting at zero can visit the origin only at even-numbered times. Therefore, in the case of $\ell(0, n)$, it is enough to know the distribution of $\ell(0, 2n)$ for a general $n$.
Theorem 11.7. For any $n \ge 1$ and $0 \le k \le n$,

$$P(\ell(0, 2n) = k) = P(\ell(0, 2n+1) = k) = \frac{\binom{2n-k}{n}}{2^{2n-k}}.$$
Proof. For the proof of this formula, we require a few auxiliary combinatorial facts, which we state together as a lemma.

Lemma. For the simple symmetric random walk, with the same notation as in the previous section,

(a) For any $k \ge 1$, $P(\tau_1 = 2k \,|\, X_1 = 1) = P(\tau_1 = 2k \,|\, X_1 = -1) = P(T_1 = 2k-1)$.
(b) For any $k \ge 1$, $T_k$ and $\tau_k - k$ have the same distribution.
(c) For any $n$, and $0 \le k \le n$, $P(M_n = k) = \frac{\binom{n}{\lfloor (n-k)/2 \rfloor}}{2^n}$.

Part (c) of the lemma is not very easy to prove, and we do not present its proof here. Part (a) is immediate from the symmetry of the distribution of $X_1$, and from the total probability formula

$$P(\tau_1 = 2k) = \frac{1}{2} P(\tau_1 = 2k \,|\, X_1 = 1) + \frac{1}{2} P(\tau_1 = 2k \,|\, X_1 = -1) = \frac{1}{2} P(T_1 = 2k-1) + \frac{1}{2} P(T_1 = 2k-1) = P(T_1 = 2k-1).$$

Therefore, $\tau_1 - 1$ and $T_1$ have the same distribution. It therefore follows that for any $k \ge 1$, $\tau_k - k$ and $T_k$ have the same distribution. As a consequence, we have the important identity

$$P(\tau_k > n + k) = P(\tau_k - k > n) = P(T_k > n) = P(M_n < k).$$



Now, observe that

$$P(\ell(0, 2n) = k) = P(\tau_k \le 2n, \tau_{k+1} > 2n) = P(M_{2n-k} \ge k, M_{2n-k-1} \le k) = P(M_{2n-k} = k) = \frac{\binom{2n-k}{n}}{2^{2n-k}},$$

by part (c) of the lemma. This proves the formula given in the theorem.
It turns out that once we have the distribution of $\ell(0, n)$, that of $\ell(x, n)$ can be found by cleverly conditioning on the value of the first time instant at which the random walk hits the number $x$. Precisely, for $x > 0$, write

$$P(\ell(x, n) = k) = \sum_{j=x}^{n-2k} P(\ell(x, n) = k \,|\, T_x = j)\, P(T_x = j) = \sum_{j=x}^{n-2k} P(\ell(0, n-j) = k-1)\, P(T_x = j),$$

and from here, on some algebra, the distribution of $\ell(x, n)$ works out to the following formula. □

Theorem 11.8. For $x, k > 0$,

$$P(\ell(x, n) = k) = \frac{\binom{n-k+1}{\frac{n+x}{2}}}{2^{n-k+1}}, \qquad \text{if } (n+x) \text{ is even};$$

$$P(\ell(x, n) = k) = \frac{\binom{n-k}{\frac{n+x-1}{2}}}{2^{n-k}}, \qquad \text{if } (n+x) \text{ is odd}.$$

Remark. In the above, combinatorial coefficients $\binom{r}{s}$ are to be taken as zero if $r < s$. It is a truly rewarding consequence of clever combinatorial arguments that in the symmetric case, the distribution of the local time can in fact be fully written down.
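Theorem 11.8 is easy to check against brute-force enumeration of all $2^n$ equally likely paths for small $n$; the following Python sketch (our own illustration, not from the text) does exactly that.

```python
from itertools import product
from math import comb

def local_time_pmf(x, n, k):
    # Theorem 11.8; coefficients C(r, s) with s > r or s < 0 count as zero
    if (n + x) % 2 == 0:
        top, bottom, power = n - k + 1, (n + x) // 2, n - k + 1
    else:
        top, bottom, power = n - k, (n + x - 1) // 2, n - k
    val = comb(top, bottom) if 0 <= bottom <= top else 0
    return val / 2 ** power

def brute_force(x, n, k):
    # enumerate every path of length n and count visits to x
    count = 0
    for steps in product((1, -1), repeat=n):
        s, visits = 0, 0
        for step in steps:
            s += step
            visits += (s == x)
        count += (visits == k)
    return count / 2 ** n

for k in range(4):
    print(k, local_time_pmf(1, 3, k), brute_force(1, 3, k))  # the two columns agree
```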

Example 11.6 (The Local Time at Zero). Consider the simple symmetric random
walk starting at zero. Because zero is the starting point of the random walk, the
local time at zero is of special interest. In this example, we want to investigate the
local time at zero in some additional detail.
First, what exactly does the distribution look like as a function of n? By using
our analytical formula

$$P(\ell(0, 2n) = k) = \frac{\binom{2n-k}{n}}{2^{2n-k}},$$

we can easily compute the pmf of the local time for (small) given n. An inspection
of the pmf will help us in understanding the distribution. Here is a small table.

             P(ℓ(0, 2n) = k)
k        n = 3        n = 6        n = 10
0        .3125        .2256        .1762
1        .3125        .2256        .1762
2        .25          .2051        .1669
3        .125         .1641        .1484
4                     .1094        .1222
5                     .0547        .0916
6                     .0156        .0611
7                                  .0349
8                                  .0161
9                                  .0054
10                                 .0010

We see from this table that the distribution of the local time at zero has a few interesting properties: it is monotone nonincreasing in $k$, and it has its maximum value at $k = 0$ and $k = 1$. Thus the random walk spends more time near its original home than at any other location. Both of these properties can be proved analytically. Also, from the table of the pmf, we can easily compute the mean of the local time at zero. For $n = 3, 6, 10$, $E[\ell(0, 2n)]$ equals 1.1875, 1.93, and 2.70. For somewhat larger $n$, say $n = 100$, $E[\ell(0, 2n)]$ equals 10.33. The mean local time grows at the rate of $\sqrt{n}$. In fact, when normalized by $\sqrt{n}$, the local time has a limiting distribution, and the limiting distribution is a very interesting one, namely, the absolute value of a standard normal. This can be proved by using Stirling's approximation in the exact formula for the pmf of the local time. Here is the formal result.

Theorem 11.9. Let $Z \sim N(0,1)$. Then, as $n \to \infty$,

$$\frac{\ell(0, n)}{\sqrt{n}} \xrightarrow{\mathcal{L}} |Z|.$$

11.4 Practically Useful Generalizations

Difficult and deep work by a number of researchers, including, in particular, Sparre-Andersen, Erdős, Kac, Spitzer, and Kesten, gives fairly sweeping generalizations of many of the results and phenomena of the previous section to random walks more general than the cubic lattice random walk. Feller (1971) and Rényi (1970) are classic and readable accounts of the theory of random walks, as it stood at that time. These generalizations are of particular interest to statisticians because of their methodological potential in fields such as testing of hypotheses. For the sake of reference, we collect two key generalizations in this section. Another major general result is postponed till the next section.

Definition 11.4. Let $X_i, i \ge 1$ be iid random variables with common CDF $F$, and let $S_n = \sum_{i=1}^{n} X_i, n \ge 1$. Let $x$ be a real number, and let $S_{n,x} = x + S_n$. Then $\{S_{n,x}\}, n \ge 1$ is called a random walk with step distribution $F$ starting at $x$.

Theorem 11.10. (a) (Sparre-Andersen). If $d = 1$, and the common distribution $F$ of the $X_i$ has a density symmetric about zero, then the distribution of $\eta_n$ is as in the case of the simple symmetric random walk; that is,

$$P(\eta_n = k) = \frac{\binom{2k}{k}\binom{2n-2k}{n-k}}{4^n}, \qquad k = 0, 1, \ldots, n.$$

(b) (Erdős–Kac). If the $X_i$ are independent, have zero mean, and satisfy the conditions of the Lindeberg–Feller theorem, but are not necessarily iid, then the arc sine law holds:

$$P\left(\frac{\eta_n}{n} \le x\right) \to \frac{2}{\pi}\arcsin(\sqrt{x}), \qquad \forall x \in [0,1].$$
Part (a) is also popularly known as Spitzer's identity. If, in part (a), we choose $k = n$, we get

$$P(\eta_n = n) = P(S_1 > 0, S_2 > 0, \ldots, S_n > 0) = P(S_1 \ge 0, S_2 \ge 0, \ldots, S_n \ge 0) = \frac{\binom{2n}{n}}{4^n} = P(S_{2n} = 0),$$

for every CDF $F$ with a symmetric density function (the final equality referring to the simple symmetric random walk). To put it another way, if $T$ marks the first time that the random walk enters negative territory, then $P(T > n)$ is completely independent of the underlying CDF, as long as it has a symmetric density. For the simple symmetric random walk, the same formula also holds, and is commonly known as the ballot theorem. A Fourier analytic proof of Spitzer's identity is given in Dym and McKean (1972, pp. 184–187).
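The distribution-free character of $P(\eta_n = n)$ asserted above is striking enough to deserve a numerical check. The Monte Carlo sketch below (ours; it assumes numpy) estimates $P(S_1 > 0, \ldots, S_n > 0)$ for Gaussian and for Cauchy steps and compares both with $\binom{2n}{n}/4^n$.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
n, reps = 8, 200_000

def prob_all_positive(sampler):
    # fraction of simulated walks whose first n partial sums are all positive
    walks = sampler((reps, n)).cumsum(axis=1)
    return (walks > 0).all(axis=1).mean()

print(comb(2 * n, n) / 4 ** n)                 # exact value, 0.1964 for n = 8
print(prob_all_positive(rng.standard_normal))  # roughly 0.196
print(prob_all_positive(rng.standard_cauchy))  # roughly 0.196 as well
```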

11.5 Wald’s Identity

Returning to the random variable $T$, namely, the first time the random walk enters negative territory, an interesting general formula for its expectation can be given for random walks more general than the simple symmetric random walk. We have to be careful about talking about $E(T)$, because it need not be finite. But if it is, then it is an interesting number to know. Consider, for example, a person gambling in a casino and repeatedly playing a specific game. We may assume that the game is (at least slightly) favorable to the house, and unfavorable to the player. So, intuitively, the player already knows that eventually she will be sunk. But how long can she continue without being sunk? If the player expects that she can hang around for a long time, it may well add to the excitement of the game.

We do in fact deal with a more general result, known as Wald’s identity. It was
proved in the context of sequential testing of hypotheses in statistics. To describe the
identity, we need a definition. We caution the reader that although the meaning of the
definition is clear, it is not truly rigorous because of our not using measure-theoretic
terminology.
Definition 11.5. Let $X_1, X_2, \ldots$ be an infinite sequence of random variables defined on a common sample space $\Omega$. A nonnegative integer-valued random variable $N$, also defined on $\Omega$, is called a stopping time if whether or not $N \le n$ for a given $n$ can be determined by only knowing $X_1, X_2, \ldots, X_n$, and if, moreover, $P(N < \infty) = 1$.
Example 11.7. Suppose a fair die is rolled repeatedly (and independently), and let the sequence of rolls be $X_1, X_2, \ldots$. Let $N$ be the first throw at which the sum of the rolls exceeds 10; in notation, $N = \min\{n : S_n = X_1 + \cdots + X_n > 10\}$. Clearly, $N$ cannot be more than 11, so $P(N < \infty) = 1$; also, whether the sum has exceeded 10 within the first $n$ rolls can be decided by knowing the values of only the first $n$ rolls. So $N$ is a valid stopping time.
Example 11.8. Suppose $X_1, X_2, \ldots$ are iid $N(0,1)$, and let $W_n = X_{n:n} - X_{1:n}$ be the range of $X_1, X_2, \ldots, X_n$. Suppose $N$ is the first time the range $W_n$ exceeds five. Because $X_{n:n} \xrightarrow{a.s.} \infty$ and $X_{1:n} \xrightarrow{a.s.} -\infty$, we have that $W_n \xrightarrow{a.s.} \infty$. Therefore, $P(N < \infty) = 1$. Also, evidently, whether $W_n > 5$ can be decided by knowing the values of only $X_1, X_2, \ldots, X_n$. So $N$ is a valid stopping time.
Theorem 11.11 (Wald's Identity). Let $X_i, i \ge 1$ be iid random variables with $E(|X_1|) < \infty$. Let $S_n = X_1 + \cdots + X_n, n \ge 1$, $S_0 = 0$, and let $N$ be a stopping time with a finite expectation. Then $E(S_N) = E(N)E(X_1)$.
Proof. The proof is not completely rigorous, and we should treat it as a sketch of the proof. First, note that

$$S_N = \sum_{n=1}^{\infty} S_n I_{\{N = n\}} = \sum_{n=1}^{\infty} S_n \left[I_{\{N > n-1\}} - I_{\{N > n\}}\right] = \sum_{n=1}^{\infty} X_n I_{\{N > n-1\}}$$

$$\Rightarrow E(S_N) = E\left[\sum_{n=1}^{\infty} X_n I_{\{N > n-1\}}\right] = \sum_{n=1}^{\infty} E[X_n I_{\{N > n-1\}}] = \sum_{n=1}^{\infty} E[X_n]\, E[I_{\{N > n-1\}}]$$

(because $N$ is assumed to be a stopping time, and so whether $N > n-1$ depends only on $X_1, X_2, \ldots, X_{n-1}$, and so $I_{\{N > n-1\}}$ is independent of $X_n$)

$$= E[X_1] \sum_{n=1}^{\infty} E[I_{\{N > n-1\}}]$$

(because the $X_i$ all have the same expectation)

$$= E[X_1] \sum_{n=1}^{\infty} P(N > n-1) = E(X_1)E(N). \qquad \square$$

Example 11.9 (Time to Enter Negative Territory). Consider a one-dimensional random walk on $\mathbb{Z}$, but not the symmetric one. The random walk is defined as follows. Suppose $X_i, i \ge 1$ are iid with the common distribution $P(X_i = 1) = p$, $P(X_i = -1) = 1 - p$, $p < \frac{1}{2}$. Let $S_0 = 0$, $S_n = X_1 + \cdots + X_n, n \ge 1$. This corresponds to a gambler betting repeatedly on something, where he wins one dollar with probability $p$ and loses one dollar to the house with probability $1 - p$. Then $S_n$ denotes the player's net fortune after the $n$th play. Now consider the first time his net fortune becomes negative. This is of interest to him if he has decided to pack it in as soon as his net fortune becomes negative. In notation, we are looking at the stopping time $T = \min\{n \ge 1 : S_n < 0\}$. The random walk moves by just one step at a time; therefore, note that $S_T = -1$. Provided that $E(T) < \infty$ (which we have not proved), by Wald's identity,

$$E(S_T) = E(-1) = -1 = E(X_1)E(T) = (2p - 1)E(T) \Rightarrow E(T) = \frac{1}{1 - 2p}.$$

This is in fact correct, as it can be shown that $E(T) < \infty$.

Suppose now the game is just slightly favorable to the house, say, $p = .49$. Then we get $E(T) = 50$. If each play takes just one minute, the gambler can expect to hang in for about an hour, with a minimal loss.
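The Wald-identity answer $E(T) = 50$ for $p = .49$ is easy to corroborate by simulation; here is a minimal Python sketch of ours (seed and replication count are arbitrary).

```python
import random

random.seed(7)
p, reps, total = 0.49, 100_000, 0
for _ in range(reps):
    s, t = 0, 0
    while s >= 0:                 # stop the first time the fortune goes negative
        s += 1 if random.random() < p else -1
        t += 1
    total += t
print(total / reps)               # close to 1 / (1 - 2p) = 50
```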

11.6  Fate of a Random Walk

Suppose $X_1, X_2, \ldots$ is a sequence of iid real random variables, and for each $n \ge 1$, let $S_n = \sum_{i=1}^{n} X_i$; set also $S_0 = 0$. This is a general one-dimensional random walk with iid steps $X_1, X_2, \ldots$. A very natural question is: what does such a general random walk do in the long run? Excluding the trivial case that the $X_i$ are degenerate at zero, the random walk $S_n, n \ge 1$ can do one of three things. It can drift off to $+\infty$, or drift off to $-\infty$, or it can oscillate. If it oscillates, what can we say about the nature of its oscillation? For example, under certain conditions on the common distribution of the $X_i$, does it come arbitrarily close to any real number over and over again? Clearly, the answer depends on the common distribution of the $X_i$. For example, if the $X_i$ can only take the values $\pm 1$, then obviously the random walk cannot come arbitrarily close to any real number over and over again. The answer in general is that either the random walk does not visit neighborhoods of any number over and over again, or it visits neighborhoods of every number over and over again, or perhaps it visits neighborhoods of certain distinguished numbers over and over again. We give two formal definitions and a theorem for the one-dimensional case first. The case of random walks in general dimensions is treated following the one-dimensional case.

Definition 11.6. Let $S_n, n \ge 1$ be a general random walk on $\mathbb{R}$. A specific real number $x$ is called a possible value for the random walk $S_n$ if for any given $\epsilon > 0$, $P(|S_n - x| \le \epsilon) > 0$ for some $n \ge 1$.

Definition 11.7. Let $S_n, n \ge 1$ be a general random walk on $\mathbb{R}$. A specific real number $x$ is called a recurrent value for the random walk $S_n$ if for any given $\epsilon > 0$,

$$P(|S_n - x| \le \epsilon \text{ infinitely often}) = 1.$$

Definition 11.8. Let $\mathcal{X} = \{x \in \mathbb{R} : x \text{ is a recurrent value of } S_n\}$. Then $\mathcal{X}$ is called the recurrent set of $S_n$.
Then, we have the following neat dichotomy result on the recurrence structure of a general nontrivial random walk; see Chung (1974) or Chung (2001, p. 279) for a proof.

Theorem 11.12. Let $S_n, n \ge 1$ be a general random walk on $\mathbb{R}$. Assume that $P(X_i = 0) \ne 1$. Then the recurrent set $\mathcal{X}$ of $S_n$ is either empty, or the entire real line, or a countable lattice set of the form $\{\pm n x_0 : n = 0, 1, 2, \ldots\}$ for some specific real number $x_0$.

Remark. The recurrent set will be empty when the random walk drifts off to one of $+\infty$ or $-\infty$. For example, an asymmetric simple random walk will do so. The simple symmetric random walk will have a countable lattice set as its recurrent set. On the other hand, as we later show, a random walk with iid standard normal steps will have the entire real line as its recurrent set.

Although we shortly present a handy and effective all-at-one-time theorem for verifying whether a particular random walk in a general dimension has a specific point, say the origin, in its recurrent set, the following intuitively plausible result in the one-dimensional case is worth knowing; see Chung (2001, pp. 279–283) for a proof.
Theorem 11.13. (a) If for some $\epsilon > 0$, $\sum_{n=1}^{\infty} P(|S_n| \le \epsilon) = \infty$, then $0 \in \mathcal{X}$.
(b) If for some $\epsilon > 0$, $\sum_{n=1}^{\infty} P(|S_n| \le \epsilon) < \infty$, then $0 \notin \mathcal{X}$.

Although part (b) of this theorem easily follows by an application of the Borel–
Cantelli lemma, the proof of part (a) is nontrivial; see Chung (1974).
A generalization of this is assigned as a chapter exercise. Let us see a quick
example of application of this theorem.
Example 11.10 (Standard Normal Random Walk). Suppose $X_1, X_2, \ldots$ are iid $N(0,1)$, and consider the random walk $S_n = \sum_{i=1}^{n} X_i, n \ge 1$. Then $S_n \sim N(0, n)$, and therefore $\frac{S_n}{\sqrt{n}} \sim N(0,1)$. Fix any $\epsilon > 0$. Then,

$$P(|S_n| \le \epsilon) = P\left(\left|\frac{S_n}{\sqrt{n}}\right| \le \frac{\epsilon}{\sqrt{n}}\right) = 2\Phi\left(\frac{\epsilon}{\sqrt{n}}\right) - 1,$$

where $\Phi$ denotes the standard normal CDF. Now,

$$\Phi\left(\frac{\epsilon}{\sqrt{n}}\right) \approx \Phi(0) + \frac{\epsilon}{\sqrt{n}}\,\phi(0),$$

and therefore $2\Phi(\frac{\epsilon}{\sqrt{n}}) - 1 \approx \frac{2\epsilon\,\phi(0)}{\sqrt{n}}$. Because $\phi(0) > 0$, and $\sum_{n=1}^{\infty} \frac{1}{\sqrt{n}}$ diverges, by the theorem above, zero is seen to be a recurrent state for $S_n$. The rough approximation argument we gave can be made rigorous easily, by using lower bounds on the standard normal CDF. In fact, any real number $x$ is a recurrent state for $S_n$ in this case, as is shown shortly.

11.7 Chung–Fuchs Theorem

We show a landmark application of characteristic functions to the problem of recurrence or transience of general $d$-dimensional random walks.

Consider $X_i, i \ge 1$, iid with the common CDF $F$, and assume that (each) $X_i$ has a distribution symmetric around zero. Still defining $S_0 = 0$, $S_n = X_1 + \cdots + X_n$, $S_n$ is called a random walk with steps driven by $F$. A question of very great interest is whether $S_n$ will revisit neighborhoods of the origin at infinitely many future time instants $n$. In a landmark article, Chung and Fuchs (1951) completely settled the problem for random walks in a general dimension $d < \infty$. We first define the necessary terminology.

Definition 11.9. Let $X_i, i \ge 1$ be iid $d$-dimensional vectors with common CDF $F$. Assume that $F$ is symmetric, in the sense that $X_i \stackrel{\mathcal{L}}{=} -X_i$. Let $S_0 = 0$, $S_n = X_1 + \cdots + X_n, n \ge 1$. Then $S_n, n \ge 0$ is called the $d$-dimensional random walk driven by $F$.

Definition 11.10. The $d$-dimensional random walk driven by $F$ is called recurrent if for every open set $C \subseteq \mathbb{R}^d$ containing the origin, $P(S_n \in C \text{ infinitely often}) = 1$. If $S_n$ is not recurrent, then it is called transient.

Some authors use the terminology neighborhood recurrent for the above definition. Here is the Chung–Fuchs result.

Theorem 11.14 (Chung–Fuchs). Let $S_n$ be the $d$-dimensional random walk driven by $F$. Then $S_n$ is recurrent if and only if

$$\int_{(-1,1)^d} \frac{1}{1 - \psi(t)}\, dt = \infty,$$

where $\psi(t)$ is the characteristic function of $F$.


The proof is somewhat technical to repeat here and we omit it. However, let us
see some interesting examples.

Example 11.11 (One-Dimensional Simple, Gaussian, and Cauchy Random Walks). Let $F$ be, respectively, the CDF of the symmetric two-point distribution on $\pm 1$, the standard normal distribution, and the standard Cauchy distribution. We show that the random walk driven by each is recurrent.

If $P(X_i = \pm 1) = \frac{1}{2}$, then the cf equals $\psi(t) = \cos t \approx 1 - \frac{t^2}{2} \Rightarrow \frac{1}{1 - \psi(t)} \approx \frac{2}{t^2}$, and therefore $\int_{-1}^{1} \frac{1}{1 - \psi(t)}\, dt = \infty$. Therefore, by the Chung–Fuchs theorem, the one-dimensional simple random walk is recurrent.

If $X_i \sim N(0,1)$, then the cf equals $\psi(t) = e^{-t^2/2} \approx 1 - \frac{t^2}{2}$, and so, again, the random walk driven by $F$ is recurrent.

If $X_i \sim C(0,1)$, then the cf equals $\psi(t) = e^{-|t|} \approx 1 - |t| \Rightarrow \frac{1}{1 - \psi(t)} \approx \frac{1}{|t|}$, and hence $\int_{-1}^{1} \frac{1}{1 - \psi(t)}\, dt \approx 2\int_{0}^{1} \frac{1}{t}\, dt = \infty$, and so the random walk driven by $F$ is recurrent.

Example 11.12 (One-Dimensional Stable Random Walk). Let $F$ be the CDF of a symmetric stable distribution with index $\alpha < 1$. We have previously seen that the cf of $F$ equals $\psi(t) = e^{-c|t|^{\alpha}}, c > 0$. Take $c = 1$ for simplicity. For

$$|t| < 1, \qquad e^{-|t|^{\alpha}} < 1 - \frac{|t|^{\alpha}}{2} \Rightarrow \frac{1}{1 - \psi(t)} < \frac{2}{|t|^{\alpha}}.$$

But for $0 < \alpha < 1$, $\int_{0}^{1} \frac{2}{t^{\alpha}}\, dt < \infty$, and therefore, by the reverse part of the Chung–Fuchs theorem, one-dimensional stable random walks with index smaller than one are transient.

Example 11.13 (d-Dimensional Gaussian Random Walk). Consider the $d$-dimensional random walk driven by the CDF $F$ of the $N_d(0, I_d)$ distribution. The cf of $F$ equals $\psi(t) = e^{-t't/2}$. Therefore, locally near $t = 0$, $1 - \psi(t) \approx \frac{t't}{2}$. Now, with $S_d$ denoting the surface area of the unit $d$-dimensional ball,

$$\int_{t : t't \le 1} \frac{1}{t't}\, dt = S_d \int_{0}^{1} r^{d-3}\, dr$$

(by transforming to the $d$-dimensional polar coordinates; see Chapter 4), which is finite if and only if $d \ge 3$. Now note that $\int_{(-1,1)^d} \frac{1}{1 - \psi(t)}\, dt < \infty$ if and only if $\int_{t : t't \le 1} \frac{1}{1 - \psi(t)}\, dt < \infty$. Therefore, the $d$-dimensional Gaussian random walk is recurrent in one and two dimensions, and transient in all dimensions higher than two.
A number of extremely important general results follow from the Chung–Fuchs
theorem. Among them, the following three stand out.

Theorem 11.15 (Three General Results). (a) Suppose $d = 1$, and that $E(|X_i|) < \infty$, with $E(X_i) = 0$. Then the random walk $S_n$ is recurrent.
(b) More generally, suppose $\frac{S_n}{n} \xrightarrow{P} 0$. Then $S_n$ is recurrent.
(c) Consider a general $d \ge 3$, and suppose $F$ is not supported on any two-dimensional subset of $\mathbb{R}^d$. Then $S_n$ must be transient.

Part (c) is the famous result that there are no recurrent random walks beyond two dimensions. Beyond two dimensions, a random walker has too many directions in which to wander off, and does not return to its original position recurrently, that is, infinitely often.

Proof. Parts (a) and (c) are in fact not very hard to derive from the Chung–Fuchs theorem. For example, for part (a), finiteness of $E(|X_i|)$ allows us to write a Taylor expansion for $\psi(t)$, the characteristic function of $F$ (as we have already discussed). Furthermore, that $E(X_i) = 0$ allows us to conclude that locally, near the origin, $1 - \psi(t)$ is $o(|t|)$. This leads to divergence of the integral $\int_{-1}^{1} \frac{1}{1 - \psi(t)}\, dt$, and hence recurrence of the random walk by the Chung–Fuchs theorem. Part (c) is also proved by a Taylor expansion, but we do not show the argument, because we have not discussed Taylor expansions of characteristic functions for $d > 1$. Part (b) is harder to prove, but it too follows from the Chung–Fuchs theorem. □

11.8  Six Important Inequalities

Inequalities are of tremendous value for proving convergence of suitable things, for
obtaining rates of convergence, and for finding concrete bounds on useful functions
and sequences. We collect a number of classic inequalities on the moments and
distributions of partial sums in this section. References to each inequality are given
in DasGupta (2008).

Kolmogorov's Maximal Inequality. For independent random variables $X_1, \ldots, X_n$ with $E(X_i) = 0$, $\mathrm{Var}(X_i) < \infty$, and any $t > 0$,

$$P\left(\max_{1 \le k \le n} |X_1 + \cdots + X_k| \ge t\right) \le \frac{\sum_{k=1}^{n} \mathrm{Var}(X_k)}{t^2}.$$

Hájek–Rényi Inequality. For independent random variables $X_1, \ldots, X_n$ with $E(X_i) = 0$, $\mathrm{Var}(X_i) = \sigma_i^2$, a nonincreasing positive sequence $c_k$, and any $\epsilon > 0$,

$$P\left(\max_{m \le k \le n} c_k |X_1 + \cdots + X_k| \ge \epsilon\right) \le \frac{1}{\epsilon^2}\left(c_m^2 \sum_{k=1}^{m} \sigma_k^2 + \sum_{k=m+1}^{n} c_k^2 \sigma_k^2\right).$$

Lévy Inequality. Given independent random variables $X_1, \ldots, X_n$, each symmetric about zero,

$$P\left(\max_{1 \le k \le n} |X_k| \ge x\right) \le 2P(|X_1 + \cdots + X_n| \ge x);$$

$$P\left(\max_{1 \le k \le n} |X_1 + \cdots + X_k| \ge x\right) \le 2P(|X_1 + \cdots + X_n| \ge x).$$

Doob–Klass Prophet Inequality. Given iid mean zero random variables $X_1, \ldots, X_n$,

$$E\left[\max_{1 \le k \le n} (X_1 + \cdots + X_k)\right] \le 3E|X_1 + \cdots + X_n|.$$

Bickel Inequality. Given independent random variables $X_1, \ldots, X_n$, each symmetric about $E(X_k) = 0$, and a nonnegative convex function $g$,

$$E\left[\max_{1 \le k \le n} g(X_1 + \cdots + X_k)\right] \le 2E[g(X_1 + \cdots + X_n)].$$

General Prophet Inequality of Bickel (Private Communication). Given independent random variables $X_1, \ldots, X_n$, each with mean zero,

$$E\left[\max_{1 \le k \le n} (X_1 + \cdots + X_k)\right] \le 4E[|X_1 + \cdots + X_n|].$$
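As a quick illustration of how such bounds behave in practice, the following Python sketch (ours, not from the text) checks Kolmogorov's maximal inequality by simulation for standard normal summands.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, t = 20, 100_000, 6.0
partial_sums = rng.standard_normal((reps, n)).cumsum(axis=1)   # Var(X_i) = 1
lhs = (np.abs(partial_sums).max(axis=1) >= t).mean()
rhs = n / t ** 2                    # (sum of the variances) / t^2
print(lhs, "<=", rhs)               # the empirical frequency sits below 20/36 = 0.56
```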

Exercises

Exercise 11.1 (Simple Random Walk). By evaluating its generating function, cal-
culate the probabilities that the second return to zero of the simple symmetric
random walk occurs at the fourth step; at the sixth step.

Exercise 11.2 (Simple Random Walk). For the simple symmetric random walk, find $\lim_{n\to\infty} \frac{E(\rho_n)}{\sqrt{n}}$, where $\rho_n$ is the number of returns to zero by step $n$.

Exercise 11.3. Let $S_n, T_n, n \ge 1$ be two independent simple symmetric random walks. Show that $S_n - T_n$ is also a random walk, and find its step distribution.

Exercise 11.4. Consider two particles starting out at specified integer values $x, y$. At each subsequent time, one of the two particles is selected at random and moved one unit to the right or one unit to the left, with probabilities $p, q$, $p + q = 1$. Calculate the probability that the two particles will eventually meet.

Exercise 11.5 (On the Local Time). For $n = 10, 20$, and 30, calculate the probability that the simple symmetric random walk spends zero time at the state $x$, for $x = -2, -1, 1, 2, 5, 10$.

Exercise 11.6 (On the Local Time). Find and plot the mass function of the local time $\ell(x, n)$ of the simple symmetric random walk for $x = 5, 10, 15$, for $n = 25$.

Exercise 11.7 (On the Local Time). Calculate the expected value of the local time $\ell(x, n)$ of the simple symmetric random walk for $x = 2, 5, 10, 15$ and $n = 25, 50, 100$. Comment on the patterns that emerge.
Exercise 11.8 (Quartiles of Number of Positive Terms). For the simple symmetric random walk, compute the quartiles of $\eta_n$, the number of positive values among $S_1, \ldots, S_n$, for $n = 5, 8, 10, 15$.
Exercise 11.9 * (Range of a Random Walk). Consider the simple symmetric random walk and let $R_n$ be the number of distinct states visited by the walk up to time $n$ (i.e., $R_n$ is the number of distinct elements in the set of numbers $\{S_0, S_1, \ldots, S_n\}$).
(a) First derive a formula for $P(S_k \ne S_j \;\forall j, 0 \le j \le k-1)$ for any given $k \ge 1$.
(b) Hence derive a formula for $E(R_n)$ for any $n$.
(c) Compute this formula for $n = 2, 5, 10, 20$.
Exercise 11.10 * (Range of a Random Walk). In the notation of the previous exercise, show that $\frac{R_n}{n} \xrightarrow{P} 0$.

Exercise 11.11 * (A Different Random Walk on $\mathbb{Z}^d$). For $d > 1$, let $X_i = (X_{i1}, \ldots, X_{id})$, where $X_{i1}, \ldots, X_{id}$ are iid and take the values $\pm 1$ with probability $\frac{1}{2}$ each. The $X_i$ are themselves independent. Let $S_0 = 0$, $S_n = X_1 + \cdots + X_n, n \ge 1$. Show that $P(S_n = 0 \text{ infinitely often}) = 0$ if $d \ge 3$.
Exercise 11.12. Show that for a general one-dimensional random walk, a recurrent
value must be a possible value.
Exercise 11.13. Consider the one-dimensional random walk with iid steps having the standard exponential distribution as the common distribution of the steps. Characterize its recurrent class $\mathcal{X}$.
Exercise 11.14. Consider the one-dimensional random walk with iid steps having the $U[-1, 1]$ distribution as the common distribution of the steps. Characterize its recurrent class $\mathcal{X}$.
Exercise 11.15. Consider the one-dimensional random walk with iid steps having the common distribution given by $P(X_i = \pm 2) = .2$, $P(X_i = \pm 1) = .3$. Characterize its recurrent class $\mathcal{X}$.
Exercise 11.16. Consider the one-dimensional random walk with iid steps having the common distribution given by $P(X_i = \pm 1) = .4995$, $P(X_i = .001) = .001$. Characterize its recurrent class $\mathcal{X}$.
Exercise 11.17 * (Two-Dimensional Cauchy Random Walk). Consider the two-dimensional random walk driven by the CDF $F$ of the two-dimensional Cauchy density $\frac{c}{(1 + \|x\|^2)^{3/2}}$, where $c$ is the normalizing constant. Verify whether the random walk driven by $F$ is recurrent.



Exercise 11.18 (The Asymmetric Cubic Lattice Random Walk). Consider the cubic lattice random walk in $d$ dimensions, but with the change that a coordinate changes by $\pm 1$ with respective probabilities $p, q$, $p + q = 1$.

Derive a formula for $P_{d,2n}$ in this general situation, and then verify whether $\sum_{n=1}^{\infty} P_{d,2n}$ converges or diverges for given $d = 1, 2, 3, \ldots$. Make a conclusion about the recurrence or transience of such an asymmetric random walk for each $d \ge 1$.

Exercise 11.19 * (A Random Walk with Poisson Steps). Let $X_i, i \ge 1$ be iid $\mathrm{Poi}(\lambda)$, and let $S_n = \sum_{i=1}^{n} X_i, n \ge 1$, $S_0 = 0$. Show that the random walk $\{S_n\}, n \ge 1$ is transient.

Exercise 11.20 * (Recurrence Time in an Asymmetric Random Walk). Consider again the simple random walk on $\mathbb{Z}$, but with the change that the steps take the values $\pm 1$ with respective probabilities $p, q$, $p + q = 1$. Let $\tau_1$ denote the first instant at which the walk returns to zero; that is, $\tau_1 = \min\{n \ge 1 : S_n = 0\}$.

Find a formula for $P(\tau_1 < \infty)$.

Exercise 11.21 * (Conditional Distribution of Recurrence Times). For the simple symmetric random walk on $\mathbb{Z}$,
(a) Find $P(\tau_2 \le k \,|\, \tau_1 = m)$, where $\tau_i$ denotes the time of the $i$th return of the random walk to zero. Compute your formula for $k = 4, 6, 8, 10$ when $m = 2$.
(b) Prove or disprove: $E(\tau_2 \,|\, \tau_1 = m) = \infty \;\forall m$.

Exercise 11.22 (Ratio of Recurrence Times). For the simple symmetric random walk on $\mathbb{Z}$,
(a) Find an explicit formula for $E\left(\frac{\tau_r}{\tau_s}\right), 1 \le r < s < \infty$, where $\tau_i$ denotes the time of the $i$th return of the random walk to zero.
(b) Find an explicit answer for $E\left(\frac{\tau_r}{\tau_s}\right), 1 \le r < s < \infty$.

Exercise 11.23 * (Number of Positive Terms Given the Last Term). Consider a one-dimensional random walk driven by $F = \Phi$, the standard normal CDF. Derive a formula for $E(\eta_n \,|\, S_n = c)$, that is, the conditional expectation of the number of positive terms given the value of the last term.

Exercise 11.24 (Expected Number of Returns). Consider the simple symmetric random walk and consider the quantities $\lambda_{m,n} = E[\#\{j : m \le j \le m+n, S_j = 0\}]$.

Prove that $\lambda_{m,n}$ is always maximized at $m = 0$, whatever $n$ is.

Exercise 11.25 (Application of Wald’s Identity). In repeated independent rolls


of a fair die, show that the expected number of throws necessary for the total to
become 10 or more is less than 4.3.

Exercise 11.26 (Application of Wald's Identity). Let $X$ have the negative binomial distribution with parameters $r$ and $p$. Prove, by using only Wald's identity, that $E(X) = \frac{r}{p}$.

Exercise 11.27 (Application of Lévy and Kolmogorov's Inequalities). Consider a random walk with step distribution $F$ starting at zero. Let $m$ be a fixed integer. For each of the following cases, find a lower bound on the probability that the random walk does not cross either of the boundaries $y = -m$, $y = +m$ within the first $n$ steps.
(a) $F = \mathrm{Ber}(.5)$.
(b) $F = N(0,1)$.
(c) $F = C(0,1)$.

Exercise 11.28 (Applying Prophet Inequalities). Suppose $X_1, X_2, \ldots, X_n$ are iid $N(0,1)$, and let $S_n = X_1 + \cdots + X_n$.
(a) Calculate $E(S_n^+)$ exactly.
(b) Find an upper bound on $E(\max_{1 \le k \le n} S_k^+)$.
(c) Repeat part (b) with $\max_{1 \le k \le n} |S_k|$.

References

Chung, K.L. (1974). A Course in Probability, Academic Press, New York.
Chung, K.L. (2001). A Course in Probability, 3rd Edition, Academic Press, New York.
Chung, K.L. and Fuchs, W. (1951). On the distribution of values of sums of random variables, Mem. Amer. Math. Soc., 6, 12.
Dym, H. and McKean, H. (1972). Fourier Series and Integrals, Academic Press, New York.
Erdős, P. and Kac, M. (1947). On the number of positive sums of independent random variables, Bull. Amer. Math. Soc., 53, 1011–1020.
Feller, W. (1971). Introduction to Probability Theory with Applications, Wiley, New York.
Finch, S. (2003). Mathematical Constants, Cambridge University Press, Cambridge, UK.
Pólya, G. (1921). Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz, Math. Annalen, 84, 149–160.
Rényi, A. (1970). Probability Theory, Nauka, Moscow.
Sparre-Andersen, E. (1949). On the number of positive sums of random variables, Skand. Aktuarietidskr., 32, 27–36.
Spitzer, F. (2008). Principles of Random Walk, Springer, New York.
Chapter 12
Brownian Motion and Gaussian Processes

We started this text with discussions of a single random variable. We then proceeded
to two and more generally, a finite number of random variables. In the last chapter,
we treated the random walk, which involved a countably infinite number of random variables, namely, the positions of the random walk $S_n$ at times $n = 0, 1, 2, 3, \ldots$.
The time parameter n for the random walks we discussed in the last chapter belongs
to the set of nonnegative integers, which is a countable set. We now look at a special
continuous time stochastic process, which corresponds to an uncountable family of
random variables, indexed by a time parameter t belonging to a suitable uncountable
time set T . The process we mainly treat in this chapter is Brownian motion, although
some other Gaussian processes are also treated briefly.
Brownian motion is one of the most important continuous-time stochastic pro-
cesses and has earned its special status because of its elegant theoretical properties,
its numerous important connections to other continuous-time stochastic processes,
and due to its real applications and its physical origin. If we look at the path of
a random walk when we run the clock much faster, and the steps of the walk are
also suitably smaller, then the random walk converges to Brownian motion. This
is an extremely important connection, and it is made precise later in this chapter.
Brownian motion arises naturally in some form or other in numerous statistical inference problems. It is also used as a working model for stock market behavior.
The process owes its name to the Scottish botanist Robert Brown, who noticed
under a microscope that pollen particles suspended in fluid engaged in a zigzag and
eccentric motion. It was, however, Albert Einstein who in 1905 gave Brownian mo-
tion a formal physical formulation. Einstein showed that Brownian motion of a large
particle visible under a microscope could be explained by assuming that the parti-
cle gets ceaselessly bombarded by invisible molecules of its surrounding medium.
The theoretical predictions made by Einstein were later experimentally verified by
various physicists, including Jean Baptiste Perrin who was awarded the Nobel prize
in physics for this work. In particular, Einstein’s work led to the determination of
Avogadro’s constant, perhaps the first major use of what statisticians call a moment
estimate. The existence and construction of Brownian motion was first explicitly
established by Norbert Wiener in 1923, which accounts for the other name Wiener
process for a Brownian motion.


There are numerous excellent references at various technical levels on the


topics of this chapter. Comprehensive and lucid mathematical treatments are avail-
able in Freedman (1983), Karlin and Taylor (1975), Breiman (1992), Resnick
(1992), Revuz and Yor (1994), Durrett (2001), Lawler (2006), and Bhattacharya
and Waymire (2009). Elegant and unorthodox treatment of Brownian motion is
given in Mörters and Peres (2010). Additional specific references are given in the
sections.

12.1 Preview of Connections to the Random Walk

We remarked in the introduction that random walks and Brownian motion are in-
terconnected in a suitable asymptotic paradigm. It would be helpful to understand
this connection in a conceptual manner before going into technical treatments of
Brownian motion.
Consider then the usual simple symmetric random walk defined by $S_0 = 0$, $S_n = X_1 + X_2 + \cdots + X_n, n \ge 1$, where the $X_i$ are iid with common distribution $P(X_i = \pm 1) = \frac{1}{2}$. Consider now a random walk that makes its steps at much smaller time intervals, but the jump sizes are also smaller. Precisely, with the $X_i, i \ge 1$ still as above, define

$$S_n(t) = \frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}, \qquad 0 \le t \le 1,$$

where $\lfloor x \rfloor$ denotes the integer part of a nonnegative real number $x$. This amounts to joining the points

$$(0, 0), \left(\frac{1}{n}, \frac{X_1}{\sqrt{n}}\right), \left(\frac{2}{n}, \frac{X_1 + X_2}{\sqrt{n}}\right), \ldots$$

by linear interpolation, thereby obtaining a curve. The simulated plot of $S_n(t)$ for $n = 1000$ in Fig. 12.1 shows the zigzag path of the scaled random walk. We can see that the plot is rather rough, and the function takes the value zero at $t = 0$; that is, $S_n(0) = 0$, and $S_n(1) \ne 0$.
It turns out that, in a suitable precise sense, the graph of $S_n(t)$ on $[0,1]$ for large $n$ should mimic the graph of a random function called Brownian motion on $[0,1]$. Brownian motion is a special stochastic process, which is a collection of infinitely many random variables, say $W(t), 0 \le t \le 1$, each $W(t)$ for a fixed $t$ being a normally distributed random variable, with other additional properties for their joint distributions. They are introduced formally and analyzed in greater detail in the next sections.
Fig. 12.1 Simulated plot of a scaled random walk

The question arises of why the connection between a random walk and Brownian motion is of any use or interest to us. A short nontechnical answer to that question is that because $S_n(t)$ acts like a realization of a Brownian motion, by using known properties of Brownian motion, we can approximately describe properties of $S_n(t)$ for large $n$. This is useful, because the stochastic process $S_n(t)$ arises in numerous problems of interest to statisticians and probabilists. By simultaneously using the connection between $S_n(t)$ and Brownian motion, and known properties of Brownian motion, we can assert useful things concerning many problems in statistics and probability that would be nearly impossible to assert in a simple direct manner. That is why the connections are not just mathematically interesting, but also tremendously useful.
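A plot such as Fig. 12.1 is easy to generate; the following Python sketch (ours, using numpy and matplotlib) draws one realization of the scaled and interpolated walk $S_n(t)$ for $n = 1000$.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 1000
steps = rng.choice([-1.0, 1.0], size=n)        # P(X_i = +1) = P(X_i = -1) = 1/2
path = np.concatenate(([0.0], steps.cumsum())) / np.sqrt(n)
t = np.arange(n + 1) / n
plt.plot(t, path)     # plotting joins the points (k/n, S_k / sqrt(n)) by lines
plt.xlabel("t")
plt.ylabel("S_n(t)")
plt.show()
```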

12.2 Basic Definitions

Our principal goal in the subsequent sections is to study Brownian motion and the Brownian bridge, due to their special importance among Gaussian processes. The Brownian bridge is closely related to Brownian motion, and shares many of the same properties as Brownian motion. They both arise in many statistical applications. It should also be understood that the Brownian motion and bridge are of enormous independent interest in the study of probability theory, regardless of their connections to problems in statistics.
We caution the reader that it is not possible to make all the statements in this
chapter mathematically rigorous without using measure theory. This is because we
are now dealing with uncountable collections of random variables, and problems of
measure zero sets can easily arise. However, the results are accurate and they can be
practically used without knowing exactly how to fix the measure theory issues.
We first give some general definitions for future use.

Definition 12.1. A stochastic process is a collection of random variables $\{X(t), t \in T\}$ taking values in some finite-dimensional Euclidean space $\mathbb{R}^d, 1 \le d < \infty$, where the indexing set $T$ is a general set.

Definition 12.2. A real-valued stochastic process $\{X(t), -\infty < t < \infty\}$ is called weakly stationary if
(a) $E(X(t)) = \mu$ is independent of $t$.
(b) $E[X(t)^2] < \infty$ for all $t$, and $\mathrm{Cov}(X(t), X(s)) = \mathrm{Cov}(X(t+h), X(s+h))$ for all $s, t, h$.

Definition 12.3. A real-valued stochastic process $\{X(t), -\infty < t < \infty\}$ is called strictly stationary if for every $n \ge 1$, $t_1, t_2, \ldots, t_n$, and every $h$, the joint distribution of $(X_{t_1}, X_{t_2}, \ldots, X_{t_n})$ is the same as the joint distribution of $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h})$.

Definition 12.4. A real-valued stochastic process $\{X(t), -\infty < t < \infty\}$ is called a Markov process if for every $n \ge 1$, $t_1 < t_2 < \cdots < t_n$,

$$P(X_{t_n} \le x_{t_n} \,|\, X_{t_1} = x_{t_1}, \ldots, X_{t_{n-1}} = x_{t_{n-1}}) = P(X_{t_n} \le x_{t_n} \,|\, X_{t_{n-1}} = x_{t_{n-1}});$$

that is, the distribution of the future values of the process given the entire past depends only on the most recent past.

Definition 12.5. A stochastic process $\{X(t), t \in T\}$ is called a Gaussian process if for every $n \ge 1$ and $t_1, t_2, \ldots, t_n$, the joint distribution of $(X_{t_1}, X_{t_2}, \ldots, X_{t_n})$ is a multivariate normal.

This is often stated as: a process is a Gaussian process if all its finite-dimensional distributions are Gaussian.
With these general definitions at hand, we now define Brownian motion and the
Brownian bridge. Brownian motion is intimately linked to the simple symmetric
random walk, and partial sums of iid random variables. The Brownian bridge is
intimately connected to the empirical process of iid random variables. We focus on
the properties of Brownian motion in this chapter, and postpone discussion of the
empirical process and the Brownian bridge to a later chapter. However, we define
both Brownian processes right now.

Definition 12.6. A stochastic process $W(t), t \in [0, \infty)$, defined on a probability space $(\Omega, \mathcal{A}, P)$, is called a standard Wiener process or the standard Brownian motion starting at zero if:
(i) $W(0) = 0$ with probability one.
(ii) For $0 \le s < t < \infty$, $W(t) - W(s) \sim N(0, t-s)$.
(iii) Given $0 \le t_0 < t_1 < \cdots < t_k < \infty$, the random variables $W(t_{j+1}) - W(t_j), 0 \le j \le k-1$, are mutually independent.
(iv) The sample paths of $W(\cdot)$ are almost all continuous; that is, except for a set of sample points of probability zero, as a function of $t$, $W(t, \omega)$ is a continuous function.

Remark. Property (iv) actually can be proved to follow from the other three proper-
ties. But it is helpful to include it in the definition to emphasize the importance of the
continuity of Brownian paths. Property (iii) is the celebrated independent increments

property and lies at the heart of numerous further properties of Brownian motion.
We often just omit the word standard when referring to standard Brownian motion.

Definition 12.7. If $W(t)$ is a standard Brownian motion, then $X(t) = x + W(t), x \in \mathbb{R}$, is called Brownian motion starting at $x$, and $Y(t) = \sigma W(t), \sigma > 0$, is called Brownian motion with scale coefficient or diffusion coefficient $\sigma$.

Definition 12.8. Let $W(t)$ be a standard Wiener process on $[0, 1]$. The process $B(t) = W(t) - tW(1)$ is called a standard Brownian bridge on $[0, 1]$.

Remark. Note that the definition implies that $B(0) = B(1) = 0$ with probability one. Thus, the Brownian bridge on $[0,1]$ starts and ends at zero; hence the name tied-down Wiener process. The Brownian bridge on $[0,1]$ can be defined in various other equivalent ways. The definition we adopt here is convenient for many calculations.

Definition 12.9. Let $1 \le d < \infty$, and let $W_i(t), 1 \le i \le d$, be independent Brownian motions on $[0, \infty)$. Then $\mathbf{W}_d(t) = (W_1(t), \ldots, W_d(t))$ is called $d$-dimensional Brownian motion.

Remark. In other words, if a particle performs independent Brownian move-


ments along d different coordinates, then we say that the particle is engaged in
d -dimensional Brownian motion. Figure 12.2 demonstrates the case d D 2. When
the dimension is not explicitly mentioned, it is understood that d D 1.

Fig. 12.2 States visited by a planar Brownian motion

Example 12.1 (Some Illustrative Processes). We take a few stochastic processes and try to understand some of their basic properties. The processes we consider are the following.

(a) $X_1(t) \equiv X$, where $X \sim N(0,1)$.
(b) $X_2(t) = tX$, where $X \sim N(0,1)$.
(c) $X_3(t) = A\cos \lambda t + B\sin \lambda t$, where $\lambda$ is a fixed positive number, $t \ge 0$, and $A, B$ are iid $N(0,1)$.
(d) $X_4(t) = \int_0^t W(u)\, du$, $t \ge 0$, where $W(u)$ is standard Brownian motion on $[0, \infty)$, starting at zero.
(e) $X_5(t) = W(t+1) - W(t)$, $t \ge 0$, where $W(t)$ is standard Brownian motion on $[0, \infty)$, starting at zero.
Œ0; 1/, starting at zero.
Each of these processes is a Gaussian process on the time domain on which it is defined. The mean function of each process is the zero function.

Coming to the covariance functions, for $s \le t$,

$$\mathrm{Cov}(X_1(s), X_1(t)) \equiv 1; \qquad \mathrm{Cov}(X_2(s), X_2(t)) = st;$$

$$\mathrm{Cov}(X_3(s), X_3(t)) = \cos \lambda s \cos \lambda t + \sin \lambda s \sin \lambda t = \cos \lambda(t - s);$$

$$\mathrm{Cov}(X_4(s), X_4(t)) = E\left[\int_0^s W(u)\, du \int_0^t W(v)\, dv\right] = \int_0^t \int_0^s E[W(u)W(v)]\, du\, dv = \int_0^t \int_0^s \min(u, v)\, du\, dv$$

$$= \int_0^s \int_0^s \min(u, v)\, du\, dv + \int_s^t \int_0^s \min(u, v)\, du\, dv$$

$$= \int_0^s \int_0^v \min(u, v)\, du\, dv + \int_0^s \int_v^s \min(u, v)\, du\, dv + \int_s^t \int_0^s \min(u, v)\, du\, dv$$

$$= \frac{s^3}{6} + \frac{s^3}{2} - \frac{s^3}{3} + (t - s)\frac{s^2}{2} = \frac{s^2 t}{2} - \frac{s^3}{6} = \frac{s^2}{6}(3t - s);$$

$$\mathrm{Cov}(X_5(s), X_5(t)) = \mathrm{Cov}(W(s+1) - W(s), W(t+1) - W(t)) = s + 1 - \min(s+1, t),$$

which equals $0$ if $t - s \ge 1$, and equals $s - t + 1$ if $t - s < 1$.

The two cases are combined into the single formula $\mathrm{Cov}(W(s+1) - W(s), W(t+1) - W(t)) = (s - t + 1)^+$ (for $s \le t$). The covariance functions of $X_1(t)$, $X_3(t)$, and $X_5(t)$ depend only on $t - s$, and these processes are stationary.
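The formula $\mathrm{Cov}(X_4(s), X_4(t)) = \frac{s^2}{6}(3t - s)$ can be checked by simulating integrated Brownian motion on a grid; a rough Monte Carlo sketch (ours, not from the text) follows. The grid size, seed, and the choice $s = .5$, $t = 1$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
reps, grid = 10_000, 500
dt = 1.0 / grid
# Brownian paths: cumulative sums of independent N(0, dt) increments
W = np.sqrt(dt) * rng.standard_normal((reps, grid)).cumsum(axis=1)
X4 = W.cumsum(axis=1) * dt                 # Riemann sums for X4(t) = int_0^t W(u) du
s_idx, t_idx = grid // 2 - 1, grid - 1     # times s = 0.5 and t = 1.0
empirical = np.cov(X4[:, s_idx], X4[:, t_idx])[0, 1]
exact = 0.5 ** 2 * (3 * 1.0 - 0.5) / 6     # s^2 (3t - s) / 6, about 0.1042
print(empirical, exact)                    # the two numbers should be close
```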

12.2.1 Condition for a Gaussian Process to be Markov

We show a simple and useful result on characterizing Gaussian processes that are Markov. It turns out that there is a simple way to tell whether a given Gaussian process is Markov by simply looking at its correlation function. Because we only need to consider finite-dimensional distributions to decide whether a stochastic process is Markov, it is only necessary for us to determine when a finite sequence of jointly normal variables has the Markov property. We start with that case.
Definition 12.10. Let $X_1, X_2, \ldots, X_n$ be $n$ jointly distributed continuous random variables. The $n$-dimensional random vector $(X_1, \ldots, X_n)$ is said to have the Markov property if for every $k \le n$, the conditional distribution of $X_k$ given $X_1, \ldots, X_{k-1}$ is the same as the conditional distribution of $X_k$ given $X_{k-1}$ alone.
Theorem 12.1. Let $(X_1, \ldots, X_n)$ have a multivariate normal distribution with zero means and correlations $\rho_{X_j, X_k} = \rho_{jk}$. Then $(X_1, \ldots, X_n)$ has the Markov property if and only if $\rho_{ik} = \rho_{ij}\rho_{jk}$ for all $1 \le i \le j \le k \le n$.

Proof. We may assume that each $X_i$ has variance one. If $(X_1, \ldots, X_n)$ has the Markov property, then for any $k$, $E(X_k \,|\, X_1, \ldots, X_{k-1}) = E(X_k \,|\, X_{k-1}) = \rho_{k-1,k} X_{k-1}$ (see Chapter 5). Therefore, $X_k - \rho_{k-1,k} X_{k-1} = X_k - E(X_k \,|\, X_1, \ldots, X_{k-1})$ is independent of the vector $(X_1, \ldots, X_{k-1})$. In particular, each covariance $\mathrm{Cov}(X_k - \rho_{k-1,k} X_{k-1}, X_i)$ must be zero for all $i \le k-1$. This leads to $\rho_{ik} = \rho_{i,k-1}\rho_{k-1,k}$, and to the claimed identity $\rho_{ik} = \rho_{ij}\rho_{jk}$ by simply applying $\rho_{ik} = \rho_{i,k-1}\rho_{k-1,k}$ repeatedly.

Conversely, suppose the identity $\rho_{ik} = \rho_{ij}\rho_{jk}$ holds for all $1 \le i \le j \le k \le n$. Then it follows from the respective formulas for $E(X_k \,|\, X_1, \ldots, X_{k-1})$ and $\mathrm{Var}(X_k \,|\, X_1, \ldots, X_{k-1})$ (see Chapter 5) that $E(X_k \,|\, X_1, \ldots, X_{k-1}) = \rho_{k-1,k} X_{k-1} = E(X_k \,|\, X_{k-1})$, and $\mathrm{Var}(X_k \,|\, X_1, \ldots, X_{k-1}) = \mathrm{Var}(X_k \,|\, X_{k-1})$. All conditional distributions for a multivariate normal distribution are also normal; therefore, it must be the case that the distribution of $X_k$ given $X_1, \ldots, X_{k-1}$ and the distribution of $X_k$ given just $X_{k-1}$ are the same. This being true for all $k$, the full vector $(X_1, \ldots, X_n)$ has the Markov property. □

Because the Markov property for a continuous-time stochastic process is defined in terms of finite-dimensional distributions, the above result gives us the following simple and useful corollary.

Corollary. A Gaussian process $X(t), t \in \mathbb{R}$, is Markov if and only if $\rho_{X(s), X(u)} = \rho_{X(s), X(t)} \cdot \rho_{X(t), X(u)}$ for all $s \le t \le u$.

12.2.2  Explicit Construction of Brownian Motion

It is not a priori obvious that an uncountable collection of random variables with the
defining properties of Brownian motion can be constructed on a common probability
space (a measure theory terminology). In other words, that Brownian motion exists
requires a proof. Various proofs of the existence of Brownian motion can be given.
We provide two explicit constructions, of which one is more classic in nature. But
the second construction is also useful.

Theorem 12.2 (Karhunen–Loève Expansion). (a) Let $Z_1, Z_2, \ldots$ be an infinite sequence of iid standard normal variables. Then, with probability one, the infinite series

$$W(t) = \sqrt{2} \sum_{m=1}^{\infty} \frac{\sin\left(\left(m - \frac{1}{2}\right)\pi t\right)}{\left(m - \frac{1}{2}\right)\pi}\, Z_m$$

converges uniformly in $t$ on $[0,1]$, and the process $W(t)$ is a Brownian motion on $[0,1]$.

The infinite series $B(t) = \sqrt{2} \sum_{m=1}^{\infty} \frac{\sin(m\pi t)}{m\pi} Z_m$ converges uniformly in $t$ on $[0,1]$, and the process $B(t)$ is a Brownian bridge on $[0,1]$.

(b) For $n \ge 0$, let $I_n$ denote the set of odd integers in $[0, 2^n]$. Let $Z_{n,k}, n \ge 0, k \in I_n$, be a double array of iid standard normal variables. Let $H_{n,k}(t), n \ge 0, k \in I_n$, be the sequence of Haar wavelets defined as

$$H_{n,k}(t) = 0 \text{ if } t \notin \left[\frac{k-1}{2^n}, \frac{k+1}{2^n}\right]; \quad H_{n,k}(t) = 2^{(n-1)/2} \text{ if } t \in \left[\frac{k-1}{2^n}, \frac{k}{2^n}\right); \quad \text{and } = -2^{(n-1)/2} \text{ if } t \in \left[\frac{k}{2^n}, \frac{k+1}{2^n}\right].$$

Let $S_{n,k}(t)$ be the sequence of Schauder functions defined as $S_{n,k}(t) = \int_0^t H_{n,k}(s)\, ds$, $0 \le t \le 1$, $n \ge 0$, $k \in I_n$.

Then the infinite series $W(t) = \sum_{n=0}^{\infty} \sum_{k \in I_n} Z_{n,k} S_{n,k}(t)$ converges uniformly in $t$ on $[0,1]$, and the process $W(t)$ is a Brownian motion on $[0,1]$.

Remark. See Bhattacharya and Waymire (2007, p. 135) for a proof. Both constructions of Brownian motion given above can be heuristically understood by using ideas of Fourier theory. If the sequence $f_0(t) \equiv 1, f_1(t), f_2(t), \ldots$ forms an orthonormal basis of $L^2[0,1]$, then we can expand a square integrable function, say $w(t)$, as an infinite series $\sum_i c_i f_i(t)$, where $c_i$ equals the inner product $\int_0^1 w(t) f_i(t)\, dt$. Thus, $c_0 = 0$ if the integral of $w(t)$ is zero. The Karhunen–Loève expansion can be heuristically explained as a random orthonormal expansion of $W(t)$. The basis functions $f_i(t)$ chosen do depend on the process $W(t)$, specifically on its covariance function. The suitably normalized inner products $\int_0^1 W(t) f_i(t)\, dt, i \ge 1$, form a sequence of iid standard normals. This is very far from a proof, but provides a heuristic context for the expansion. The second construction is based similarly on expansions using a wavelet basis instead of a Fourier basis.
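Part (a) of Theorem 12.2 doubles as a practical simulation recipe: truncating the series at a large $M$ gives an approximate Brownian path. A short Python sketch of ours (truncation point and seed are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
M = 2000                                  # truncation point of the series
t = np.linspace(0.0, 1.0, 501)
Z = rng.standard_normal(M)
m = np.arange(1, M + 1)
freq = (m - 0.5) * np.pi
# truncated Karhunen-Loeve series of Theorem 12.2(a)
W = np.sqrt(2.0) * (np.sin(np.outer(t, freq)) / freq) @ Z
plt.plot(t, W)
plt.xlabel("t")
plt.ylabel("W(t)")
plt.show()
# sanity check: Var(W(1)) = 2 * sum 1/((m - 1/2) pi)^2 -> 1, as it must
```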

12.3 Basic Distributional Properties

Distributional properties and formulas are always useful in doing further calcula-
tions and for obtaining concrete answers to questions. The most basic distributional
properties of the Brownian motion and bridge are given first.

Throughout this chapter, the notation $W(t)$ and $B(t)$ means a (standard) Brownian motion and a Brownian bridge, respectively. The phrase standard is often deleted for brevity.

Proposition. (a) $\mathrm{Cov}(W(s), W(t)) = \min(s, t)$; $\mathrm{Cov}(B(s), B(t)) = \min(s, t) - st$.
(b) (The Markov Property). For any given $n$ and $t_0 < t_1 < \cdots < t_n$, the conditional distribution of $W(t_n)$ given that $W(t_0) = x_0, W(t_1) = x_1, \ldots, W(t_{n-1}) = x_{n-1}$ is the same as the conditional distribution of $W(t_n)$ given $W(t_{n-1}) = x_{n-1}$.
(c) Given $s < t$, the conditional distribution of $W(t)$ given $W(s) = w$ is $N(w, t-s)$.
(d) Given $t_1 < t_2 < \cdots < t_n$, the joint density of $(W(t_1), W(t_2), \ldots, W(t_n))$ is given by the function

$$f(x_1, x_2, \ldots, x_n) = p(x_1, t_1)\, p(x_2 - x_1, t_2 - t_1) \cdots p(x_n - x_{n-1}, t_n - t_{n-1}),$$

where $p(x, t)$ is the density of $N(0, t)$; that is, $p(x, t) = \frac{1}{\sqrt{2\pi t}}\, e^{-\frac{x^2}{2t}}$.

Each part of this proposition follows on simple and direct calculation by using the definition of a Brownian motion and Brownian bridge. It is worth mentioning that the Markov property is extremely important and is a consequence of the independent increments property. Alternatively, one can simply use our previous characterization that a Gaussian process is Markov if and only if its correlation function satisfies $\rho_{X(s), X(u)} = \rho_{X(s), X(t)} \cdot \rho_{X(t), X(u)}$ for all $s \le t \le u$.

The Markov property can be strengthened to a very useful property, known as the strong Markov property. For instance, suppose you are waiting for the process to reach a level $b$ for the first time. The process will reach that level at some random time, say $\tau$. At this point, the process will simply start over, and $W(\tau + t) - b$ will act like a path of a standard Brownian motion from that point onwards. For the general description of this property, we need a definition.

Definition 12.11. A nonnegative random variable $\tau$ is called a stopping time for the process $W(t)$ if for any $s > 0$, whether $\tau \le s$ depends only on the values of $W(t)$ for $t \le s$.

Example 12.2. For $b > 0$, consider the first passage time $T_b = \inf\{t > 0 : W(t) = b\}$. Then $T_b > s$ if and only if $W(t) < b$ for all $t \le s$. Therefore, $T_b$ is a stopping time for the process $W(t)$.

Example 12.3. Let $X$ be a $U[0,1]$ random variable independent of the process $W(t)$. Then the nonnegative random variable $\tau = X$ is not a stopping time for the process $W(t)$.

Theorem 12.3 (Strong Markov Property). If $\tau$ is a stopping time for the process $W(t)$, then $W(\tau + t) - W(\tau)$ is also a Brownian motion on $[0, \infty)$ and is independent of $\{W(s), s \le \tau\}$.

See Bhattacharya and Waymire (2007, p. 153) for its proof.

12.3.1 Reflection Principle and Extremes

It is important in applications to be able to derive the distribution of special functionals of Brownian processes. They can be important because a Brownian process is used directly as a statistical model in some problem, or because the functional arises as the limit of some suitable sequence of statistics in a seemingly unrelated problem. Examples of the latter kind are seen in applications of the so-called invariance principle. For now, we provide formulas for the distribution of certain extremes and first passage times of a Brownian motion. The following notation is used:

$$M(t) = \sup_{0 < s \le t} W(s); \qquad T_b = \inf\{t > 0 : W(t) = b\}.$$

Theorem 12.4 (Reflection Principle). (a) For $b > 0$, $P(M(t) > b) = 2P(W(t) > b)$.
(b) For $t > 0$, $M(t) = \sup_{0 < s \le t} W(s)$ has the density

$$\sqrt{\frac{2}{\pi t}}\; e^{-x^2/(2t)}, \qquad x > 0.$$

(c) For $b > 0$, the first passage time $T_b$ has the density

$$\frac{b}{t^{3/2}}\, \phi\left(\frac{b}{\sqrt{t}}\right),$$

where $\phi$ denotes the standard normal density function.
(d) (First Arcsine Law). Let $\tau$ be the point of maxima of $W(t)$ on $[0,1]$. Then $\tau$ is almost surely unique, and $P(\tau \le t) = \frac{2}{\pi}\arcsin(\sqrt{t})$.
(e) (Reflected Brownian Motion). Let $X(t) = \sup_{0 \le s \le t} |W(s)|$. Then $X(1) = \sup_{0 \le s \le 1} |W(s)|$ has the CDF

$$G(x) = \frac{4}{\pi}\sum_{m=0}^{\infty} \frac{(-1)^m}{2m+1}\, e^{-(2m+1)^2\pi^2/(8x^2)}, \qquad x \ge 0.$$

(f) (Maximum of a Brownian Bridge). Let $B(t)$ be a Brownian bridge on $[0,1]$. Then $\sup_{0 \le t \le 1} |B(t)|$ has the CDF

$$H(x) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2}, \qquad x \ge 0.$$

(g) (Second Arcsine Law). Let $L = \sup\{t \in [0,1] : W(t) = 0\}$. Then $P(L \le t) = \frac{2}{\pi}\arcsin(\sqrt{t})$.
(h) Given $0 < s < t$, $P(W \text{ has at least one zero in the time interval } (s,t)) = \frac{2}{\pi}\arccos\left(\sqrt{\frac{s}{t}}\right)$.

Proof of the Reflection Principle: The reflection principle is of paramount importance, and we provide a proof of it. The reflection principle follows from two observations, the first of which is obvious, and the second needs a clever argument. The observations are:

$$P(T_b < t) = P(T_b < t, W(t) > b) + P(T_b < t, W(t) < b),$$

and

$$P(T_b < t, W(t) > b) = P(T_b < t, W(t) < b).$$

Because $P(T_b < t, W(t) > b) = P(W(t) > b)$ (because $W(t) > b$ implies that $T_b < t$), if we accept the second identity above, then we immediately have the desired result $P(M(t) > b) = P(T_b < t) = 2P(W(t) > b)$. Thus, only the second identity needs a proof. This is done by a clever argument.

The event $\{T_b < t, W(t) < b\}$ happens if and only if at some point $\tau < t$, the process reaches the level $b$, and then at time $t$ drops to a lower level $l, l < b$. However, once at level $b$, the process could as well have taken the path reflected along the level $b$, which would have caused the process to end up at level $b + (b - l) = 2b - l$ at time $t$. We now observe that $2b - l > b$, meaning that corresponding to every path in the event $\{T_b < t, W(t) < b\}$, there is a path in the event $\{T_b < t, W(t) > b\}$, and so $P(T_b < t, W(t) < b)$ must be equal to $P(T_b < t, W(t) > b)$.

This is the famous reflection principle for Brownian motion. An analytic proof of the identity $P(T_b < t, W(t) < b) = P(T_b < t, W(t) > b)$ can be given by using the strong Markov property of Brownian motion.

Note that both parts (b) and (c) of the theorem are simply restatements of part (a). Many of the remaining parts follow on calculations that also use the reflection principle. Detailed proofs can be seen, for example, in Karlin and Taylor (1975, pp. 345–354). □

Example 12.4 (Density of Last Zero Before T). Consider standard Brownian motion $W(t)$ on $[0, \infty)$ starting at zero, and fix a time $T > 0$. We want to find the density of the last zero of $W(t)$ before the time $T$. Formally, let $\tau = \tau_T = \sup\{t < T : W(t) = 0\}$. Then, we want to find the density of $\tau$.

By using part (g) of the previous theorem,

$$P(\tau > s) = P(\text{there is at least one zero of } W \text{ in } (s, T)) = \frac{2}{\pi}\arccos\left(\sqrt{\frac{s}{T}}\right).$$

Therefore, the density of $\tau$ is

$$f(s) = -\frac{d}{ds}\left[\frac{2}{\pi}\arccos\left(\sqrt{\frac{s}{T}}\right)\right] = \frac{1}{\pi\sqrt{s(T-s)}}, \qquad 0 < s < T.$$

Notice that the density is symmetric about $\frac{T}{2}$, and therefore $E(\tau) = \frac{T}{2}$. A calculation shows that $E(\tau^2) = \frac{3T^2}{8}$, and therefore $\mathrm{Var}(\tau) = \frac{3T^2}{8} - \frac{T^2}{4} = \frac{T^2}{8}$.

12.3.2 Path Properties and Behavior Near Zero and Infinity

A textbook example of a nowhere differentiable and yet everywhere continuous function is Weierstrass's function $f(t) = \sum_{n=0}^{\infty} 2^{-n}\cos(b^n \pi t), -\infty < t < \infty$, for $b > 2 + 3\pi$. Constructing another example of such a function is not trivial. A result of notoriety is that almost all sample paths of Brownian motion are functions of this kind; that is, as a function of $t$, $W(t)$ is continuous at every $t$, and differentiable at no $t$! The paths are extremely crooked. The Brownian bridge shares the same property. The sample paths show other evidence of extreme oscillation; for example, in any arbitrarily small interval containing the starting time $t = 0$, $W(t)$ changes its sign infinitely often. The various important path properties of Brownian motion are described and discussed below.
described and discussed below.

Theorem 12.5. Let $W(t), t \geq 0$, be a Brownian motion on $[0,\infty)$. Then,
(a) (Scaling). For $c > 0$, $X(t) = c^{-\frac{1}{2}}W(ct)$ is a Brownian motion on $[0,\infty)$.
(b) (Time Reciprocity). $X(t) = tW(\frac{1}{t})$, with the value being defined as zero at $t = 0$, is a Brownian motion on $[0,\infty)$.
(c) (Time Reversal). Given $0 < T < \infty$, $X_T(t) = W(T) - W(T-t)$ is a Brownian motion on $[0,T]$.

Proof. Only part (b) requires a proof, the others being obvious. First note that for $s \leq t$, the covariance function is
$$\mathrm{Cov}\left(sW\left(\frac{1}{s}\right), tW\left(\frac{1}{t}\right)\right) = st\,\min\left(\frac{1}{s}, \frac{1}{t}\right) = st\cdot\frac{1}{t} = s = \min\{s,t\}.$$
It is obvious that $X(t) - X(s) \sim N(0, t-s)$. Next, for $s < t < u$, $\mathrm{Cov}(X(t) - X(s), X(u) - X(t)) = t - s - t + s = 0$, and the independent increments property holds. The sample paths are continuous (including at $t = 0$) because $W(t)$ has continuous sample paths, and $X(0) = 0$. Thus, all the defining properties of a Brownian motion are satisfied, and hence $X(t)$ must be a Brownian motion. □

Part (b) leads to the following useful property.

Proposition. With probability one, $\frac{W(t)}{t} \to 0$ as $t \to \infty$.

The behavior of Brownian motion near t D 0 is quite a bit more subtle, and
we postpone its discussion till later. We next describe a series of classic results that
illustrate the extremely rough nature of the paths of a Brownian motion. The results
essentially tell us that at any instant, it is nearly impossible to predict what a particle
performing a Brownian motion will do next. Here is a simple intuitive explanation
for why the paths of a Brownian motion are so rough.

Take two time instants $s, t$, $s < t$. We then have the simple moment formula $E[(W(t) - W(s))^2] = t - s$. Writing $t = s + h$, we get
$$E[(W(s+h) - W(s))^2] = h \iff E\left[\left(\frac{W(s+h) - W(s)}{h}\right)^2\right] = \frac{1}{h}.$$
If the time instants $s, t$ are close together, then $h \approx 0$, and so $\frac{1}{h}$ is large. We can see that the difference quotient $\frac{W(s+h) - W(s)}{h}$ is blowing up in magnitude. Thus, differentiability is going to be a problem. In fact, not only is the path of a Brownian motion guaranteed to be nondifferentiable at any prespecified $t$, it is guaranteed to be nondifferentiable simultaneously at all values of $t$. This is a much stronger roughness property than lack of differentiability at a fixed $t$.
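The blow-up of the difference quotient is easy to see numerically. The short sketch below (added for illustration, with arbitrary sample sizes) uses the exact distribution $W(s+h) - W(s) \sim N(0,h)$ rather than simulated paths.

import numpy as np

rng = np.random.default_rng(1)
n_samples = 100000

for h in [0.1, 0.01, 0.001]:
    # W(s+h) - W(s) ~ N(0, h), so the difference quotient has variance 1/h.
    quotient = rng.normal(0.0, np.sqrt(h), n_samples) / h
    print(h, quotient.var(), 1 / h)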
The next theorem is regarded as one of the great classics of probability theory. We first need a few definitions.

Definition 12.12. Let $f$ be a real-valued continuous function defined on some open subset $T$ of $\mathbb{R}$. The upper and the lower Dini right derivatives of $f$ at $t \in T$ are defined as
$$D^+f(t) = \limsup_{h\downarrow 0}\frac{f(t+h) - f(t)}{h}, \qquad D_+f(t) = \liminf_{h\downarrow 0}\frac{f(t+h) - f(t)}{h}.$$

Definition 12.13. Let $f$ be a real-valued function defined on some open subset $T$ of $\mathbb{R}$. The function $f$ is said to be Hölder continuous of order $\alpha > 0$ at $t$ if for some finite constant $C$ (possibly depending on $t$), $|f(t+h) - f(t)| \leq C|h|^{\alpha}$ for all sufficiently small $h$. If $f$ is Hölder continuous of order $\alpha$ at every $t \in T$ with a universal constant $C$, it is called Hölder continuous of order $\alpha$ in $T$.
Theorem 12.6 (Crooked Paths and Unbounded Variation). (a) Given any $T > 0$, $P(\sup_{t\in[0,T]}W(t) > 0, \inf_{t\in[0,T]}W(t) < 0) = 1$. Hence, with probability one, in any nonempty interval containing zero, $W(t)$ changes sign at least once, and therefore infinitely often.
(b) (Nondifferentiability Everywhere). With probability one, $W(t)$ is (simultaneously) nondifferentiable at all $t > 0$; that is,
$$P(\text{for each } t > 0,\ W(t) \text{ is not differentiable at } t) = 1.$$
(c) (Unbounded Variation). For every $T > 0$, with probability one, $W(t)$ has an unbounded total variation as a function of $t$ on $[0,T]$.
(d) With probability one, on no nonempty time interval can $W(t)$ be monotone increasing or monotone decreasing.
(e) $P(\text{for all } t > 0,\ D^+W(t) = \infty \text{ or } D_+W(t) = -\infty \text{ or both}) = 1$.
(f) (Hölder Continuity). Given any finite $T > 0$ and $0 < \alpha < \frac{1}{2}$, with probability one, $W(t)$ is Hölder continuous on $[0,T]$ of order $\alpha$.
(g) For any $\alpha > \frac{1}{2}$, with probability one, $W(t)$ is nowhere Hölder continuous of order $\alpha$.
(h) (Uniform Continuity in Probability). Given any $\epsilon > 0$ and $0 < T < \infty$, $P(\sup_{t,s:\,0\leq t,s\leq T,\,|t-s|<h}|W(t) - W(s)| > \epsilon) \to 0$ as $h \to 0$.

Proof. Each of parts (c) and (d) would follow from part (b), because of results in real analysis that monotone functions and functions of bounded variation must be differentiable almost everywhere. Part (e) is a stronger version of the nondifferentiability result in part (b); see Karatzas and Shreve (1991, pp. 106-111) for parts (e)-(h). Part (b) itself is proved in many standard texts on stochastic processes; the proof involves quite a bit of calculation. We show here that part (a) is a consequence of the reflection principle.

Clearly, it is enough to show that for any $T > 0$, $P(\sup_{t\in[0,T]}W(t) > 0) = 1$. This will imply that $P(\inf_{t\in[0,T]}W(t) < 0) = 1$, because $-W(t)$ is a Brownian motion if $W(t)$ is, and hence it will imply all the other statements in part (a). Fix $c > 0$. Then,
$$P\left(\sup_{t\in[0,T]}W(t) > 0\right) \geq P\left(\sup_{t\in[0,T]}W(t) > c\right) = 2P(W(T) > c) \quad (\text{reflection principle})$$
$\to 1$ as $c \downarrow 0$, and therefore $P(\sup_{t\in[0,T]}W(t) > 0) = 1$. □

Remark. It should be noted that the set of points at which the path of a Brownian motion is Hölder continuous of order $\frac{1}{2}$ is not empty, although in some sense such points are rare.
The oscillation properties of the paths of a Brownian motion are further illus-
trated by the laws of the iterated logarithm for Brownian motion. The path of a
Brownian motion is a random function. Can we construct suitable deterministic
functions, say u.t/ and l.t/, such that for large t the Brownian path W .t/ will be
bounded by the envelopes l.t/; u.t/? What are the tightest such envelope functions?
Similar questions can be asked about small t. The law of the iterated logarithm an-
swers these questions precisely. However, it is important to note that in addition to
the intellectual aspect of just identifying the tightest envelopes, the iterated loga-
rithm laws have other applications.
Theorem 12.7 (LIL). Let $f(t) = \sqrt{2t\log|\log t|}$, $t > 0$. With probability one,
(a) $\limsup_{t\to\infty}\frac{W(t)}{f(t)} = 1$; $\liminf_{t\to\infty}\frac{W(t)}{f(t)} = -1$.
(b) $\limsup_{t\to 0}\frac{W(t)}{f(t)} = 1$; $\liminf_{t\to 0}\frac{W(t)}{f(t)} = -1$.

Remark on Proof: Note that the lim inf statement in part (a) follows from the lim sup statement because $-W(t)$ is also a Brownian motion if $W(t)$ is. On the other hand, the two statements in part (b) follow from the corresponding statements in part (a) by the time reciprocity property that $tW(\frac{1}{t})$ is also a Brownian motion if $W(t)$ is. For a proof of part (a), see Karatzas and Shreve (1991), or Bhattacharya and Waymire (2007, p. 143). □

12.3.3 * Fractal Nature of Level Sets

For a moment, let us consider a general question. Suppose $T$ is a subset of the real line, and $X(t), t \in T$, a real-valued stochastic process. Fix a number $u$, and ask how many times the path of $X(t)$ cuts the line drawn at level $u$; that is, consider $N_T(u) = \#\{t \in T : X(t) = u\}$. It is not a priori obvious that $N_T(u)$ is finite. Indeed, for Brownian motion, we already know that in any nonempty interval containing zero, the path hits zero infinitely often with probability one. One might guess that this lack of finiteness is related to the extreme oscillatory nature of the paths of a Brownian motion. Indeed, that is true. If the process $X(t)$ is a bit more smooth, then the number of level crossings will be finite. However, investigating the distribution of $N_T(u)$ will still be a formidable problem. For Brownian motion, it is not the number of level crossings, but the geometry of the set of times at which it crosses a given level $u$ that is of interest. In this section, we describe the fascinating properties of these level sets of the path of a Brownian motion. We also give a very brief glimpse into what we can expect for processes whose paths are smoother, to draw the distinction from the case of Brownian motion.

Given $b \in \mathbb{R}$, let
$$C_b = \{t \geq 0 : W(t) = b\}.$$
Note that Cb is a random set, in the sense that different sample paths will hit the level
b at different sets of times. We only consider the case b D 0 here, although most of
the properties of C0 extend in a completely evident way to the case of a general b.

Theorem 12.8. With probability one, C0 is an uncountable, unbounded, closed set


of Lebesgue measure zero, and has no isolated points; that is, in any neighborhood
of an element of C0 , there is at least one other element of C0 .

Proof. It follows from an application of the reflection principle that $P(\sup_{t\geq 0}W(t) = \infty, \inf_{t\geq 0}W(t) = -\infty) = 1$ (check it!). Therefore, given any $T > 0$, there must be a time instant $t > T$ such that $W(t) = 0$; for if there were a finite last time that $W(t) = 0$, then for such a sample path, the supremum and the infimum could not simultaneously be infinite. This means that the zero set $C_0$ is unbounded. It is closed because the paths of Brownian motion are continuous. We have not defined what Lebesgue measure means, therefore we cannot give a rigorous proof that $C_0$ has zero Lebesgue measure. Think of the Lebesgue measure of a set $C$ as its total length $\lambda(C)$. Then, by Fubini's theorem,
$$E[\lambda(C_0)] = E\int_{C_0}dt = E\int_{[0,\infty)}I_{\{W(t)=0\}}\,dt = \int_{[0,\infty)}P(W(t) = 0)\,dt = 0.$$
If the expected length is zero, then the length itself must be zero with probability one. That $C_0$ has no isolated points is entirely nontrivial to prove, and we omit the proof. Finally, by a result in real analysis that any closed set with no isolated points must be uncountable unless it is empty, we have that $C_0$ is an uncountable set. □
Remark. The implication is that the set of times at which Brownian motion returns
to zero is a topologically large set marked by holes, and collectively the holes are big
enough that the zero set, although uncountable, has length zero. Such sets in one di-
mension are commonly called Cantor sets. Corresponding sets in higher dimensions
often go by the name fractals.

12.4 The Dirichlet Problem and Boundary Crossing Probabilities

The Dirichlet problem on a domain in $\mathbb{R}^d$, $1 \leq d < \infty$, was formulated by Gauss in the mid-nineteenth century. It is a problem of special importance in the area of partial differential equations with boundary value constraints. The Dirichlet problem can also be interpreted as a problem in the physical theory of diffusion of heat in a $d$-dimensional domain with controlled temperature at the boundary points of the domain. According to the laws of physics, the temperature as a function of the location in the domain would have to be a harmonic function. The Dirichlet problem thus asks for finding a function $u(x)$ such that
$$u(x) = g(x)\ (\text{specified}),\ x \in \partial U; \qquad u(\cdot)\ \text{harmonic in } U,$$

where U is a specified domain in Rd . In this generality, solutions to the Dirich-


let problem need not exist. We need the boundary value function g as well as the
domain U to be sufficiently nice. The interesting and surprising thing is that so-
lutions to the Dirichlet problem have connections to the d -dimensional Brownian
motion. Solutions to the Dirichlet problem can be constructed by solving suitable
problems (which we describe below) about d -dimensional Brownian motion. Con-
versely, these problems on the Brownian motion can be solved if we can directly find
solutions to a corresponding Dirichlet problem, perhaps by inspection, or by using
standard techniques in the area of partial differential equations. Thus, we have a mutually beneficial connection between a special problem on partial differential equations and a problem on Brownian motion. It turns out that these connections are more than
intellectual curiosities. For example, these connections were elegantly exploited in
Brown (1971) to solve certain otherwise very difficult problems in the area of sta-
tistical decision theory.
We first provide the necessary definitions. We remarked before that the Dirichlet
problem is not solvable on arbitrary domains. The domain must be such that it does
not contain any irregular boundary points. These are points x 2 @U such that a
Brownian motion starting at x immediately falls back into U . A classic example
is that of a disc, from which the center has been removed. Then, the center is an
irregular boundary point of the domain. We refer the reader to Karatzas and Shreve
(1991, pp. 247–250) for the exact regularity conditions on the domain.

Definition 12.14. A set $U \subseteq \mathbb{R}^d$, $1 \leq d < \infty$, is called a domain if $U$ is connected and open.

Definition 12.15. A twice continuously differentiable real-valued function $u(x)$ defined on a domain $U \subseteq \mathbb{R}^d$ is called harmonic if its Laplacian vanishes on $U$:
$$\Delta u(x) = \sum_{i=1}^{d}\frac{\partial^2}{\partial x_i^2}u(x) = 0 \quad \text{for all } x = (x_1, x_2, \ldots, x_d) \in U.$$

Definition 12.16. Let $U$ be a bounded regular domain in $\mathbb{R}^d$, and $g$ a real-valued continuous function on $\partial U$. The Dirichlet problem on the domain $U \subseteq \mathbb{R}^d$ with boundary value constraint $g$ is to find a function $u : \bar{U} \to \mathbb{R}$ such that $u$ is harmonic in $U$, and $u(x) = g(x)$ for all $x \in \partial U$, where $\partial U$ denotes the boundary of $U$ and $\bar{U}$ the closure of $U$.

Theorem 12.9. Let $U \subseteq \mathbb{R}^d$ be a bounded regular domain. Fix $x \in U$. Consider $X_d(t), t \geq 0$, a $d$-dimensional Brownian motion starting at $x$, and let $\tau = \tau_U = \inf\{t > 0 : X_d(t) \notin U\} = \inf\{t > 0 : X_d(t) \in \partial U\}$. Define the function $u$ pointwise on $\bar{U}$ by
$$u(x) = E_x[g(X_d(\tau))],\ x \in U; \qquad u = g \text{ on } \partial U.$$
Then $u$ is continuous on $\bar{U}$ and is the unique solution, continuous on $\bar{U}$, to the Dirichlet problem on $U$ with boundary value constraint $g$.

Remark. When $X_d(t)$ exits from $U$ having started at a point inside $U$, it can exit through different points on the boundary $\partial U$. If it exits at the point $y \in \partial U$, then $g(X_d(\tau))$ equals $g(y)$. The exit point $y$ is determined by chance. If we average over $y$, then we get a function that is harmonic inside $U$ and equals $g$ on $\partial U$. We omit the proof of this theorem, and refer the reader to Karatzas and Shreve (1991, p. 244), and Körner (1986, p. 55).

Example 12.5 (Dirichlet Problem on an Annulus). Consider the Dirichlet problem on the $d$-dimensional annulus $U = \{z : r < \|z\| < R\}$, where $0 < r < R < \infty$. Specifically, suppose we want a function $u$ such that
$$u \text{ harmonic on } \{z : r < \|z\| < R\}; \quad u = 1 \text{ on } \{z : \|z\| = R\}; \quad u = 0 \text{ on } \{z : \|z\| = r\}.$$
A continuous solution to this can be found directly. The solution is
$$u(z) = \frac{|z| - r}{R - r} \quad \text{for } d = 1;$$
$$u(z) = \frac{\log\|z\| - \log r}{\log R - \log r} \quad \text{for } d = 2;$$
$$u(z) = \frac{r^{2-d} - \|z\|^{2-d}}{r^{2-d} - R^{2-d}} \quad \text{for } d > 2.$$

We now relate this solution to the Dirichlet problem on $U$ with $d$-dimensional Brownian motion. Consider then $X_d(t)$, $d$-dimensional Brownian motion started at a given point $x$ inside $U$, $r < \|x\| < R$. Because the function $g$ corresponding to the boundary value constraint in this example is $g(z) = I_{\{z : \|z\| = R\}}$, by the above theorem, $u(x)$ equals
$$u(x) = E_x[g(X_d(\tau))] = P_x(X_d(t) \text{ reaches the sphere } \|z\| = R \text{ before it reaches the sphere } \|z\| = r).$$
For now, let us consider the case $d = 1$. Fix positive numbers $r, R$ and suppose a one-dimensional Brownian motion starts at a number $x$ between $r$ and $R$, $0 < r < x < R < \infty$. Then the probability that it will hit the line $z = R$ before hitting the line $z = r$ is $\frac{x-r}{R-r}$. The closer the starting point $x$ is to $R$, the larger is the probability that it will first hit the line $z = R$. Furthermore, the probability is a very simple linear function of $x$. We revisit the case $d > 1$ when we discuss recurrence and transience of $d$-dimensional Brownian motion in the next section.
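The linear exit probability in the case $d = 1$ is easy to confirm by simulation. Below is a minimal sketch (added for illustration; the Euler step size and path count are arbitrary, and the discretization introduces a small bias at the two barriers).

import numpy as np

rng = np.random.default_rng(2)
r, R, x = 1.0, 4.0, 1.5
n_paths, dt = 20000, 1e-3
sqrt_dt = np.sqrt(dt)

z = np.full(n_paths, x)
alive = np.ones(n_paths, dtype=bool)

# Advance all not-yet-exited paths until each one leaves the interval (r, R).
while alive.any():
    z[alive] += sqrt_dt * rng.standard_normal(alive.sum())
    alive = (z > r) & (z < R)

print(np.mean(z >= R), (x - r) / (R - r))  # Monte Carlo vs. (x - r)/(R - r)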

12.4.1 Recurrence and Transience

We observed during our discussion of the lattice random walk (Chapter 11) that it
is recurrent in dimensions d D 1; 2 and transient for d > 2. That is, in one and two
dimensions the lattice random walk returns to any integer value x at least once (and
hence infinitely often) with probability one, but for d > 2, the probability that the
random walk returns at all to its starting point is less than one. For the Brownian
motion, when the dimension is more than one, the correct question is not to ask if
it returns to particular points x. The correct question to ask is if it returns to any
fixed neighborhood of a particular point, however small. The answers are similar
to the case of the lattice random walk; that is, in one dimension, Brownian motion
returns to any point x infinitely often with probability one, and in two dimensions,
Brownian motion returns to any given neighborhood of a point x infinitely often
with probability one. But when d > 2, it diverges off to infinity. We can see this by
using the connection of Brownian motion to the Dirichlet problem on discs. We first
need two definitions.
Definition 12.17. For $d > 1$, a $d$-dimensional stochastic process $X_d(t), t \geq 0$, is called neighborhood recurrent if with probability one, it returns to any given ball $B(x,\epsilon)$ infinitely often.

Definition 12.18. For any $d$, a $d$-dimensional stochastic process $X_d(t), t \geq 0$, is called transient if with probability one, $X_d(t)$ diverges to infinity.
We now show how the connection of Brownian motion to the solution of the Dirichlet problem helps us establish that Brownian motion is transient for $d > 2$. That is, if we let $B$ be the event that $\lim_{t\to\infty}\|W_d(t)\| \neq \infty$, then we show that $P(B)$ must be zero for $d > 2$. Indeed, to be specific, take $d = 3$, pick a point $x \in \mathbb{R}^3$ with $\|x\| > 1$, suppose that our Brownian motion is now sitting at the point $x$, and ask what the probability is that it will reach the unit ball $B_1$ before it reaches the sphere $\|z\| = R$, where $R > \|x\|$. We have derived this probability. The Markov property of Brownian motion gives this probability to be exactly equal to
$$\frac{\frac{1}{\|x\|} - \frac{1}{R}}{1 - \frac{1}{R}}.$$
This clearly converges to $\frac{1}{\|x\|}$ as $R \to \infty$. Imagine now that the process has evolved for a long time, say $T$, and that it is now sitting at a very distant $x$ (i.e., $\|x\|$ is large). The LIL for Brownian motion guarantees that we can pick such a large $T$ and such a distant $x$. Then, the probability of ever returning from $x$ to the unit ball would be the small number $\epsilon = \frac{1}{\|x\|}$. We can make $\epsilon$ arbitrarily small by choosing $\|x\|$ sufficiently large, and what that means is that the probability of the process returning infinitely often to the unit ball $B_1$ is zero. The same argument works for $B_k$, the ball of radius $k$, for any $k \geq 1$, and therefore $P(B) = P(\cup_{k=1}^{\infty}B_k) = 0$; that is, the process drifts off to infinity with probability one. The same argument works for any $d > 2$, not just $d = 3$. The case of $d = 1, 2$ is left as a chapter exercise. We then have the following theorem.

Theorem 12.10. Brownian motion visits every real x infinitely often with prob-
ability one if d D 1, is neighborhood recurrent if d D 2, and transient if d > 2.
Moreover, by its neighborhood recurrence for d D 2, the graph of a two-dimensional
Brownian path on Œ0; 1/ is dense in the two-dimensional plane.

12.5 * The Local Time of Brownian Motion

For the simple symmetric random walk in one dimension, we derived the distribution of the local time $\eta(x,n)$, which is the total time the random walk spends at the integer $x$ up to the time instant $n$. It would not be interesting to ask exactly the same question about Brownian motion, because the number of time points $t$ up to some time $T$ at which the Brownian motion $W(t)$ equals a given $x$ is zero or infinity. Paul Lévy gave the following definition for the local time of a Brownian motion. Fix a set $A$ in the real line and a general time instant $T, T > 0$. Now ask what is the total size of the set of times $t$ up to $T$ at which the Brownian motion has resided in the given set $A$. That is, denoting Lebesgue measure on $\mathbb{R}$ by $\lambda$, look at the following kernel:
$$H(A,T) = \lambda\{t \leq T : W(t) \in A\}.$$
Using this, Lévy formulated the local time of the Brownian motion at a given $x$ as
$$\eta(x,T) = \lim_{\epsilon\downarrow 0}\frac{H([x-\epsilon, x+\epsilon], T)}{2\epsilon},$$
where the limit is supposed to mean a pointwise almost sure limit. It is important to note that the existence of the almost sure limit is nontrivial.

Instead of the notation $T$, we eventually simply use the notation $t$, and thereby obtain a new stochastic process $\eta(x,t)$, indexed simultaneously by the two parameters $x$ and $t$. We can regard $(x,t)$ together as a vector-valued time parameter, and call $\eta(x,t)$ a random field. This is called the local time of one-dimensional Brownian motion. The local time of Brownian motion is generally regarded to be an analytically difficult process to study. We give a relatively elementary exposition of the local time of Brownian motion in this section.
Recall now the previously introduced maximum process of standard Brownian motion, namely $M(t) = \sup_{0\leq s\leq t}W(s)$. The following major theorem on the distribution of the local time of Brownian motion at zero was proved by Paul Lévy.

Theorem 12.11. Let $W(s), s \geq 0$, be standard Brownian motion starting at zero. Consider the two stochastic processes $\{\eta(0,t), t \geq 0\}$ and $\{M(t), t \geq 0\}$. These two processes have the same distribution.

In particular, for any given fixed $t$ and $y > 0$,
$$P\left(\frac{\eta(0,t)}{\sqrt{t}} \leq y\right) = \sqrt{\frac{2}{\pi}}\int_0^y e^{-z^2/2}\,dz = 2\Phi(y) - 1$$
$$\iff P(\eta(0,t) \leq y) = \sqrt{\frac{2}{\pi t}}\int_0^y e^{-z^2/(2t)}\,dz.$$
For a detailed proof of this theorem, we refer to Mörters and Peres (2010, p. 160). A sketch of the proof can be seen in Révész (2005).
For a general level $x$, the corresponding result is as follows, and it follows from the case $x = 0$ treated above.

Theorem 12.12.
$$P(\eta(x,t) \leq y) = 2\Phi\left(\frac{y + |x|}{\sqrt{t}}\right) - 1, \quad -\infty < x < \infty,\ t, y > 0.$$
It is important to note that if the level $x \neq 0$, then the local time $\eta(x,t)$ can actually be exactly equal to zero with a positive probability; this is simply the probability that Brownian motion does not reach $x$ within time $t$, and equals $2\Phi\left(\frac{|x|}{\sqrt{t}}\right) - 1$. This is not the case if the level is zero, in which case the local time $\eta(0,t)$ possesses a density function.

The theorem above also says that the local time of Brownian motion grows at the rate of $\sqrt{t}$ for any level $x$. The expected value follows easily by evaluating the integral $\int_0^{\infty}[1 - P(\eta(x,t) \leq y)]\,dy$, and one gets
$$E[\eta(x,t)] = 2\sqrt{t}\,\phi\left(\frac{|x|}{\sqrt{t}}\right) - 2|x|\left[1 - \Phi\left(\frac{|x|}{\sqrt{t}}\right)\right].$$
The limit of this as $x \to 0$ equals $\sqrt{\frac{2t}{\pi}}$, which agrees with $E[\eta(0,t)]$. The expected local time is plotted in Fig. 12.3.

Fig. 12.3 Plot of the expected local time as a function of (x, t)
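The expected local time formula is straightforward to evaluate and to cross-check by Monte Carlo, estimating $\eta(0,t)$ by the normalized occupation time of $[-\epsilon,\epsilon]$. The sketch below is illustrative only; the window $\epsilon$ and the grids are arbitrary choices.

import numpy as np
from math import erf

def Phi(a):
    return 0.5 * (1 + erf(a / np.sqrt(2)))

def phi(a):
    return np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)

# E[eta(x,t)] = 2 sqrt(t) phi(|x|/sqrt(t)) - 2|x| (1 - Phi(|x|/sqrt(t))),
# obtained by integrating the tail of the CDF in Theorem 12.12.
def expected_local_time(x, t):
    a = abs(x) / np.sqrt(t)
    return 2 * np.sqrt(t) * phi(a) - 2 * abs(x) * (1 - Phi(a))

print(expected_local_time(0.0, 1.0), np.sqrt(2 / np.pi))  # both ~ 0.7979

# Monte Carlo cross-check at x = 0, t = 1 via occupation times.
rng = np.random.default_rng(3)
n_paths, n_steps, eps = 4000, 4000, 0.02
dt = 1.0 / n_steps
W = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)
occupation = dt * (np.abs(W) < eps).sum(axis=1) / (2 * eps)
print(occupation.mean())  # should also be near sqrt(2/pi)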

12.6 Invariance Principle and Statistical Applications

We remarked in the first section of this chapter that scaled random walks mimic Brownian motion in a suitable asymptotic sense. As a matter of fact, if $X_1, X_2, \ldots$ is any iid sequence of one-dimensional random variables satisfying some relatively simple conditions, then the sequence of partial sums $S_n = \sum_{i=1}^{n}X_i, n \geq 1$, when appropriately scaled, mimics Brownian motion in a suitable asymptotic sense. Why is this useful? This is useful because in many concrete problems of probability and statistics, suitable functionals of the sequence of partial sums arise as the objects of direct importance. The invariance principle allows us to conclude that if the sequence of partial sums $S_n$ mimics $W(t)$, then any nice functional of the sequence of partial sums will also mimic the same functional of $W(t)$. So, if we can figure out how to deal with the distribution of the needed functional of the $W(t)$ process, then we can use it in practice to approximate the much more complicated distribution of the original functional of the sequence of partial sums. It is a profoundly useful fact in the asymptotic theory of probability that all of this is indeed a reality. This section treats the invariance principle for the partial sum process of one-dimensional iid random variables. We recommend Billingsley (1968), Hall and Heyde (1980), and Csörgo and Révész (1981) for detailed and technical treatments; Erdös and Kac (1946), Donsker (1951), Komlós et al. (1975, 1976), Major (1978), Whitt (1980), and Csörgo and Hall (1984) for invariance principles for the partial sum process; and Pyke (1984) and Csörgo (2002) for lucid reviews. Also, see DasGupta (2008) for references to various significant extensions, such as the multidimensional and dependent cases.

Although the invariance principle for partial sums of iid random variables is usually credited to Donsker (1951), Erdös and Kac (1946) contained the basic idea behind the invariance principle and also worked out the asymptotic distribution of a number of key and interesting functionals of the partial sum process. Donsker (1951) provided the full generalization of the Erdös-Kac technique by providing explicit embeddings of the discrete sequence $\frac{S_k}{\sqrt{n}}, k = 1, 2, \ldots, n$, into a continuous-time stochastic process $S_n(t)$ and by establishing the limiting distribution of a general continuous functional $h(S_n(t))$. In order to achieve this, it is necessary to use a continuous mapping theorem for metric spaces, as consideration of Euclidean spaces is no longer enough. It is also useful to exploit a property of Brownian motion known as the Skorohod embedding theorem. We first describe this necessary background material.
Define
$$C[0,1] = \text{the class of all continuous real-valued functions on } [0,1],$$
and
$$D[0,1] = \text{the class of all real-valued functions on } [0,1] \text{ that are right continuous and have a left limit at every point in } [0,1].$$
Given two functions $f, g$ in either $C[0,1]$ or $D[0,1]$, let $\rho(f,g) = \sup_{0\leq t\leq 1}|f(t) - g(t)|$ denote the supremum distance between $f$ and $g$. We refer to $\rho$ as the uniform metric. Both $C[0,1]$ and $D[0,1]$ are (complete) metric spaces with respect to the uniform metric $\rho$.

Suppose $X_1, X_2, \ldots$ is an iid sequence of real-valued random variables with mean zero and variance one. Two common embeddings of the discrete sequence $\frac{S_k}{\sqrt{n}}, k = 1, 2, \ldots, n$, into a continuous-time process are the following:
$$S_{n,1}(t) = \frac{1}{\sqrt{n}}\left[S_{\lfloor nt\rfloor} + \{nt\}X_{\lfloor nt\rfloor + 1}\right],$$
and
$$S_{n,2}(t) = \frac{1}{\sqrt{n}}S_{\lfloor nt\rfloor},$$
$0 \leq t \leq 1$. Here, $\lfloor\cdot\rfloor$ denotes the integer part and $\{\cdot\}$ the fractional part of a positive real.
The first embedding simply interpolates continuously between the values $\frac{S_k}{\sqrt{n}}$ by drawing straight lines, but the second one is only right continuous, with jumps at the points $t = \frac{k}{n}, k = 1, 2, \ldots, n$. For certain specific applications, the second embedding is more useful. It is because of these jump discontinuities that Donsker needed to consider weak convergence in $D[0,1]$, which led to some additional technical complexities.
The main idea from this point on is not difficult. One can produce a version of $S_n(t)$, say $\tilde{S}_n(t)$, such that $\tilde{S}_n(t)$ is close to a sequence of Wiener processes $W_n(t)$. Because $\tilde{S}_n(t) \approx W_n(t)$, if $h(\cdot)$ is a continuous functional with respect to the uniform metric, then one can expect that $h(\tilde{S}_n(t)) \approx h(W_n(t)) \stackrel{\mathcal{L}}{=} h(W(t))$ in distribution. $\tilde{S}_n(t)$ being a version of $S_n(t)$, $h(S_n(t)) = h(\tilde{S}_n(t))$ in distribution, and so $h(S_n(t))$ should be close to the fixed Brownian functional $h(W(t))$ in distribution, which is the question we wanted to answer.
The results leading to Donsker’s theorem are presented below.

Theorem 12.13 (Skorohod Embedding). Given a random variable $X$ with mean zero and a finite variance $\sigma^2$, we can construct (on the same probability space) a standard Brownian motion $W(t)$ starting at zero and a stopping time $\tau$ with respect to $W(t)$ such that $E(\tau) = \sigma^2$, and $X$ and the stopped Brownian motion $W(\tau)$ have the same distribution.

Theorem 12.14 (Convergence of the Partial Sum Process to Brownian Motion). Let $S_n(t) = S_{n,1}(t)$ or $S_{n,2}(t)$ as defined above. Then there exists a common probability space on which one can define Wiener processes $W_n(t)$ starting at zero, and a sequence of processes $\{\tilde{S}_n(t)\}, n \geq 1$, such that
(a) For each $n$, $S_n(t)$ and $\tilde{S}_n(t)$ are identically distributed as processes.
(b) $\sup_{0\leq t\leq 1}|\tilde{S}_n(t) - W_n(t)| \stackrel{P}{\to} 0.$

We prove the last theorem, assuming the Skorohod embedding theorem. A proof
of the Skorohod embedding theorem may be seen in Csörgo and Révész (1981), or
in Bhattacharya and Waymire (2007, p. 160).

Proof. We treat only the linearly interpolated process $S_{n,1}(t)$, and simply call it $S_n(t)$. To reduce notational clutter, we write as if the version $\tilde{S}_n$ of $S_n$ is $S_n$ itself; thus, the $\tilde{S}_n$ notation is dropped in the proof of the theorem. Without loss of generality, we take $E(X_1) = 0$ and $\mathrm{Var}(X_1) = 1$. First, by using the Skorohod embedding theorem, construct a stopping time $\tau_1$ with respect to the process $W(t), t \geq 0$, such that $E(\tau_1) = 1$ and such that $W(\tau_1) \stackrel{\mathcal{L}}{=} X_1$. Using the strong Markov property of Brownian motion, $W(t + \tau_1) - W(\tau_1)$ is also a Brownian motion on $[0,\infty)$, independent of $(\tau_1, W(\tau_1))$, and we can now pick a stopping time, say $\tau_2'$, with respect to this process, with the two properties $E(\tau_2') = 1$ and $W(\tau_2') \stackrel{\mathcal{L}}{=} X_2$. Therefore, if we define $\tau_2 = \tau_1 + \tau_2'$, then we have obtained a stopping time with respect to the original Brownian motion, with the properties that its expectation is 2, and $\tau_2 - \tau_1$ and $\tau_1$ are independent. Proceeding in this way, we can construct an infinite sequence of stopping times $0 = \tau_0 \leq \tau_1 \leq \tau_2 \leq \tau_3 \leq \cdots$, such that the $\tau_k - \tau_{k-1}$ are iid with mean one, and the two discrete-time processes $S_k$ and $W(\tau_k)$ have the same distribution. Moreover, by the usual SLLN,
$$\frac{\tau_n}{n} = \frac{1}{n}\sum_{k=1}^{n}[\tau_k - \tau_{k-1}] \stackrel{a.s.}{\to} 1,$$
from which it follows that
$$\frac{\max_{0\leq k\leq n}|\tau_k - k|}{n} \stackrel{P}{\to} 0.$$
Set $W_n(t) = \frac{W(nt)}{\sqrt{n}}$, $n \geq 1$. In this notation, $W(\tau_k) = \sqrt{n}\,W_n\left(\frac{\tau_k}{n}\right)$. Now fix $\epsilon > 0$ and consider the event $B_n = \{\sup_{0\leq t\leq 1}|S_n(t) - W_n(t)| > \epsilon\}$. We need to show that $P(B_n) \to 0$.

Now, because $S_n(t)$ is defined by linear interpolation, in order that $B_n$ happens, at some $t$ in $[0,1]$ we must have one of
$$\left|S_k/\sqrt{n} - W_n(t)\right| \quad \text{and} \quad \left|S_{k-1}/\sqrt{n} - W_n(t)\right|$$
larger than $\epsilon$, where $k$ is the unique $k$ such that $\frac{k-1}{n} \leq t < \frac{k}{n}$. Our goal is to show that the probability of the union of these two events is small. Now use the fact that in distribution, $S_k = W(\tau_k) = \sqrt{n}\,W_n\left(\frac{\tau_k}{n}\right)$, and so it will suffice to show that the probability of the union of the two events $\{|W_n(\frac{\tau_k}{n}) - W_n(t)| > \epsilon\}$ and $\{|W_n(\frac{\tau_{k-1}}{n}) - W_n(t)| > \epsilon\}$ is small. However, the union of these two events can only happen if either $W_n$ differs by a large amount over a small interval, or one of the two time instants $\frac{\tau_k}{n}$ and $\frac{\tau_{k-1}}{n}$ is far from $t$. The first possibility has a small probability by the uniform continuity of the paths of a Brownian motion (on any compact interval), and the second possibility has a small probability by our earlier observation that $\frac{\max_{0\leq k\leq n}|\tau_k - k|}{n} \stackrel{P}{\to} 0$. This implies that $P(B_n)$ is small for all large $n$, as we wanted to show. □

This theorem implies the following important result, by an application of the continuous mapping theorem, continuity being defined through the uniform metric on the space $C[0,1]$.

Theorem 12.15 (Donsker's Invariance Principle). Let $h$ be a continuous functional with respect to the uniform metric on $C[0,1]$, and let $S_n(t)$ be defined as either $S_{n,1}(t)$ or $S_{n,2}(t)$. Then $h(S_n(t)) \stackrel{\mathcal{L}}{\Rightarrow} h(W(t))$ as $n \to \infty$.

Example 12.6 (CLT Follows from Invariance Principle). The central limit theorem for iid random variables having a finite variance follows as a simple consequence of Donsker's invariance principle. Suppose $X_1, X_2, \ldots$ are iid random variables with mean zero and variance 1. Let $S_k = \sum_{i=1}^{k}X_i, k \geq 1$. Define the functional $h(f) = f(1)$ on $C[0,1]$. This is obviously a continuous functional on $C[0,1]$ with respect to the uniform metric $\rho(f,g) = \sup_{0\leq t\leq 1}|f(t) - g(t)|$. Therefore, with $S_n(t)$ as the linearly interpolated partial sum process, it follows from the invariance principle that
$$h(S_n) = S_n(1) = \frac{\sum_{i=1}^{n}X_i}{\sqrt{n}} \stackrel{\mathcal{L}}{\Rightarrow} h(W) = W(1) \sim N(0,1),$$
which is the central limit theorem.

Example 12.7 (Maximum of a Random Walk). We apply the Donsker invariance principle to the problem of determining the limiting distribution of a functional of a random walk. Suppose $X_1, X_2, \ldots$ are iid random variables with mean zero and variance 1. Let $S_k = \sum_{i=1}^{k}X_i, k \geq 1$. We want to derive the limiting distribution of $G_n = \frac{\max_{1\leq k\leq n}S_k}{\sqrt{n}}$. To derive its limiting distribution, define the functional $h(f) = \sup_{0\leq t\leq 1}f(t)$ on $C[0,1]$. This is a continuous functional on $C[0,1]$ with respect to the uniform metric $\rho(f,g) = \sup_{0\leq t\leq 1}|f(t) - g(t)|$. Further, notice that our statistic $G_n$ can be represented as $G_n = h(S_n)$, where $S_n$ is the linearly interpolated partial sum process. Therefore, by Donsker's invariance principle, $G_n = h(S_n) \stackrel{\mathcal{L}}{\Rightarrow} h(W) = \sup_{0\leq t\leq 1}W(t)$, where $W(t)$ is standard Brownian motion on $[0,1]$. We know its CDF explicitly, namely, for $x > 0$, $P(\sup_{0\leq t\leq 1}W(t) \leq x) = 2\Phi(x) - 1$. Thus, $P(G_n \leq x) \to 2\Phi(x) - 1$ for all $x > 0$.

Example 12.8 (Sums of Powers of Partial Sums). Consider once again iid random variables $X_1, X_2, \ldots$ with zero mean and unit variance. Fix a positive integer $m$ and consider the statistic $T_{m,n} = n^{-1-m/2}\sum_{k=1}^{n}S_k^m$. By direct integration of the polygonal curve $[S_n(t)]^m$, we find that $T_{m,n}$ equals $\int_0^1 [S_n(t)]^m\,dt$ up to an asymptotically negligible correction. This guides us to the functional $h(f) = \int_0^1 f^m(t)\,dt$. Because $[0,1]$ is a compact interval, it is easy to verify that $h$ is a continuous functional on $C[0,1]$ with respect to the uniform metric. Indeed, the continuity of $h(f)$ follows from simply the algebraic identity $|x^m - y^m| = |x-y||x^{m-1} + x^{m-2}y + \cdots + y^{m-1}|$. It therefore follows from Donsker's invariance principle that $T_{m,n} \stackrel{\mathcal{L}}{\Rightarrow} \int_0^1 W^m(t)\,dt$. At first glance it seems surprising that a nondegenerate limit distribution for partial sums of $S_k^m$ can exist with only two moments.
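For $m = 2$ the limit can be checked against the first moment: $E\int_0^1 W^2(t)\,dt = \int_0^1 t\,dt = \frac{1}{2}$, while $E[T_{2,n}] = n^{-2}\sum_{k=1}^{n}k = \frac{n+1}{2n} \to \frac{1}{2}$. A quick simulation sketch (all sizes arbitrary) is below.

import numpy as np

rng = np.random.default_rng(5)
n_reps, n, m = 20000, 2000, 2

X = rng.standard_normal((n_reps, n))
S = np.cumsum(X, axis=1)
T_mn = (S**m).sum(axis=1) / n**(1 + m / 2)   # T_{m,n} = n^{-1-m/2} sum_k S_k^m

print(T_mn.mean())  # should be near E[int_0^1 W^2 dt] = 0.5 for m = 2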

12.7 Strong Invariance Principle and the KMT Theorem

In addition to the weak invariance principle described above, there are also strong invariance principles. The first strong invariance principle for partial sums was obtained in Strassen (1964). Since then, a large literature has developed, including for the multidimensional case. Good sources for information are Strassen (1967), Komlós et al. (1976), Major (1978), Csörgo and Révész (1981), and Einmahl (1987).

It would be helpful to first understand exactly what a strong invariance principle is meant to achieve. Suppose $X_1, X_2, \ldots$ is a zero mean, unit variance iid sequence of random variables. For $n \geq 1$, let $S_n$ denote the partial sum $\sum_{i=1}^{n}X_i$, and $S_n(t)$ the interpolated partial sum process with the special values $S_n(\frac{k}{n}) = \frac{S_k}{\sqrt{n}}$ for each $n$ and $1 \leq k \leq n$. In the process of proving Donsker's invariance principle, we have shown that we can construct (on a common probability space) a process $\tilde{S}_n(t)$ (which is equivalent to the original process $S_n(t)$ in distribution) and a single Wiener process $W(t)$ such that $\sup_{0\leq t\leq 1}|\tilde{S}_n(t) - \frac{1}{\sqrt{n}}W(nt)| \stackrel{P}{\to} 0$. Therefore, writing $\tilde{S}_n = \sqrt{n}\,\tilde{S}_n(1)$,
$$\left|\tilde{S}_n(1) - \frac{1}{\sqrt{n}}W(n)\right| \stackrel{P}{\to} 0 \implies \frac{|\tilde{S}_n - W(n)|}{\sqrt{n}} \stackrel{P}{\to} 0.$$

The strong invariance principle asks if we can find suitable functions $g(n)$ such that we can make the stronger statement $\frac{|\tilde{S}_n - W(n)|}{g(n)} \stackrel{a.s.}{\to} 0$, and, as a next step, what the best possible choice for such a function $g$ is.

The exact statements of the strong invariance principle results require us to say that we can construct an equivalent process $\tilde{S}_n(t)$ and a Wiener process $W(t)$ on some probability space such that $\frac{|\tilde{S}_n - W(n)|}{g(n)} \stackrel{a.s.}{\to} 0$ for some suitable function $g$. Due to the clumsiness in repeatedly having to mention these qualifications, we drop the $\tilde{S}_n$ notation and simply say $S_n(t)$, and we also do not mention that the processes have all been constructed on some new probability space. The important thing for applications is that we can use the approximations on the original process itself, by simply adopting the equivalent process on the new probability space.

Paradoxically, the strong invariance principle does not imply the weak invariance principle (i.e., Donsker's invariance principle) in general. This is because under the assumption of just the finiteness of the variance of the $X_i$, the best possible $g(n)$ increases faster than $\sqrt{n}$. On the other hand, if the common distribution of the $X_i$ satisfies more stringent moment conditions, then we can make $g(n)$ grow a lot more slowly, even more slowly than $\sqrt{n}$. The array of results that is available is bewildering, and they are all difficult to prove. We prefer to report a few results of great importance, including in particular the KMT theorem, due to Komlós et al. (1976).

Theorem 12.16. Let $X_1, X_2, \ldots$ be an iid sequence with $E(X_i) = 0$, $\mathrm{Var}(X_i) = 1$.
(a) There exists a Wiener process $W(t), t \geq 0$, starting at zero such that
$$\frac{S_n - W(n)}{\sqrt{n\log\log n}} \stackrel{a.s.}{\to} 0.$$
(b) The $\sqrt{n\log\log n}$ rate of part (a) cannot be improved, in the sense that given any nondecreasing sequence $a_n \to \infty$ (however slowly), there exists a CDF $F$ with zero mean and unit variance such that, with probability one,
$$\limsup_n \frac{a_n(S_n - W(n))}{\sqrt{n\log\log n}} = \infty,$$
for any iid sequence $X_i$ following the CDF $F$, and any Wiener process $W(t)$.
(c) If we make the stronger assumption that $X_i$ has a finite mgf in some open neighborhood of zero, then the statement of part (a) can be improved to $|S_n - W(n)| = O(\log n)$ with probability one.
(d) (KMT Theorem) Specifically, if we make the stronger assumption that $X_i$ has a finite mgf in some open neighborhood of zero, then we can find suitable positive constants $C, K, \lambda$ such that for any real number $x$ and any given $n$,
$$P\left(\max_{1\leq k\leq n}|S_k - W(k)| \geq C\log n + x\right) \leq Ke^{-\lambda x},$$
where the constants $C, K, \lambda$ depend only on the common CDF of the $X_i$.

Remark. The KMT theorem is widely regarded as one of the major advances in the area of invariance principles and central limit problems. One should note that the inequality given in the above theorem has a qualitative nature attached to it, as we can only use the inequality with constants $C, K, \lambda$ that are known to exist, depending on the underlying $F$. Refinements of the version of the inequality given above are available. We refer to Csörgo and Révész (1981) for such refinements and a generally detailed treatment of the strong invariance principle.

12.8 Brownian Motion with Drift and Ornstein-Uhlenbeck Process

We finish with two special processes derived from standard Brownian motion. Both are important in applications.

Definition 12.19. Let $W(t), t \geq 0$, be standard Brownian motion starting at zero. Fix $\mu \in \mathbb{R}$ and $\sigma > 0$. Then the process $X(t) = \mu t + \sigma W(t), t \geq 0$, is called Brownian motion with drift $\mu$ and diffusion coefficient $\sigma$. It is clear that it inherits the major path properties of standard Brownian motion, such as nondifferentiability at all $t$ with probability one, the independent increments property, and the Markov property. Also, clearly, for fixed $t$, $X(t) \sim N(\mu t, \sigma^2 t)$.

12.8.1 Negative Drift and Density of Maximum

There are, however, also some important differences when a drift is introduced. For example, unless $\mu = 0$, the reflection principle no longer holds, and consequently one cannot derive the distribution of the running maximum $M(t) = \sup_{0\leq s\leq t}X(s)$ by using symmetry arguments. If $\mu \geq 0$, then it is not meaningful to ask for the distribution of the maximum over all $t > 0$. However, if $\mu < 0$, then the process has a tendency to drift off towards negative values, and in that case the maximum in fact does have a nontrivial distribution. We derive the distribution of the maximum when $\mu < 0$ by using a result on a particular first passage time of the process.

Theorem 12.17. Let $X(t), t \geq 0$, be Brownian motion starting at zero, with drift $\mu < 0$ and diffusion coefficient $\sigma$. Fix $a < 0 < b$, and let
$$T_{a,b} = \min[\inf\{t > 0 : X(t) = a\},\ \inf\{t > 0 : X(t) = b\}],$$
the first time $X(t)$ reaches either $a$ or $b$. Then,
$$P(X_{T_{a,b}} = b) = \frac{e^{-2\mu a/\sigma^2} - 1}{e^{-2\mu a/\sigma^2} - e^{-2\mu b/\sigma^2}}.$$
A proof of this theorem can be seen in Karlin and Taylor (1975, p. 361). By using this result, we can derive the distribution of $\sup_{t>0}X(t)$ in the case $\mu < 0$.

Theorem 12.18 (The Maximum of Brownian Motion with a Negative Drift). If $X(t), t \geq 0$, is Brownian motion starting at zero, with drift $\mu < 0$ and diffusion coefficient $\sigma$, then $\sup_{t>0}X(t)$ is distributed as an exponential with mean $\frac{\sigma^2}{2|\mu|}$.

Proof. In the theorem stated above, by letting $a \to -\infty$, we get
$$P(X(t) \text{ ever reaches the level } b > 0) = e^{2\mu b/\sigma^2} = e^{-2|\mu|b/\sigma^2}.$$
But this is the probability that an exponential variable with mean $\frac{\sigma^2}{2|\mu|}$ is larger than $b$. On the other hand, $P(X(t)$ ever reaches the level $b > 0)$ is the same as $P(\sup_{t>0}X(t) \geq b)$. Therefore, $\sup_{t>0}X(t)$ must have an exponential distribution with mean $\frac{\sigma^2}{2|\mu|}$. □

Example 12.9 (Probability That Brownian Motion Does Not Hit a Line). Consider standard Brownian motion $W(t)$ starting at zero on $[0,\infty)$, and consider a straight line $L$ with the equation $y = a + bt$, $a, b > 0$. Because $W(0) = 0$, $a > 0$, and the paths of $W(t)$ are continuous, the probability that $W(t)$ does not hit the line $L$ is the same as $P(W(t) < a + bt\ \forall t > 0)$. However, if we define a new Brownian motion (with drift) $X(t)$ as $X(t) = W(t) - bt$, then
$$P(W(t) < a + bt\ \forall t > 0) = P\left(\sup_{t>0}X(t) \leq a\right) = 1 - e^{-2ab},$$
by our theorem above on the maximum of a Brownian motion with a negative drift. We notice that the probability that $W(t)$ does not hit $L$ is monotone increasing in each of $a$ and $b$, as it should be.
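The probability $1 - e^{-2ab}$ can also be approximated by simulating paths over a long but finite horizon; both the truncation and the discrete time grid bias the estimate slightly upward. A minimal sketch, with all tuning constants arbitrary:

import numpy as np

rng = np.random.default_rng(6)
a, b = 1.0, 0.5
n_paths, n_steps, horizon = 2000, 5000, 50.0
dt = horizon / n_steps
t = np.arange(1, n_steps + 1) * dt

W = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)

# A path "survives" if it stays strictly below the line a + b t on the whole grid;
# crossings after the horizon have negligible probability for this drift.
never_hits = (W < a + b * t).all(axis=1)
print(never_hits.mean(), 1 - np.exp(-2 * a * b))   # theory: ~0.632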

12.8.2 * Transition Density and the Heat Equation

If we consider Brownian motion starting at some number $x$, with drift $\mu < 0$ and diffusion coefficient $\sigma$, then by simple calculations, the conditional distribution of $X(t)$ given that $X(0) = x$ is $N(x + \mu t, \sigma^2 t)$, which has the density
$$p_t(x,y) = \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{t}}\,e^{-\frac{(y-x-\mu t)^2}{2\sigma^2 t}}.$$
This is called the transition density of the process. The transition density satisfies a very special partial differential equation, which we now prove. By direct differentiation,
$$\frac{\partial}{\partial t}p_t(x,y) = \frac{(x-y)^2 - \mu^2t^2 - \sigma^2 t}{2\sqrt{2\pi}\,\sigma^3 t^{5/2}}\,e^{-\frac{(y-x-\mu t)^2}{2\sigma^2 t}};$$
$$\frac{\partial}{\partial y}p_t(x,y) = \frac{x - y + \mu t}{\sqrt{2\pi}\,\sigma^3 t^{3/2}}\,e^{-\frac{(y-x-\mu t)^2}{2\sigma^2 t}};$$
$$\frac{\partial^2}{\partial y^2}p_t(x,y) = \frac{(x-y+\mu t)^2 - \sigma^2 t}{\sqrt{2\pi}\,\sigma^5 t^{5/2}}\,e^{-\frac{(y-x-\mu t)^2}{2\sigma^2 t}}.$$

Using these three expressions, it follows that the transition density $p_t(x,y)$ satisfies the partial differential equation
$$\frac{\partial}{\partial t}p_t(x,y) = -\mu\frac{\partial}{\partial y}p_t(x,y) + \frac{\sigma^2}{2}\frac{\partial^2}{\partial y^2}p_t(x,y).$$
This is the drift-diffusion equation in one dimension. In the particular case that $\mu = 0$ (no drift) and $\sigma = 1$, the equation reduces to the celebrated heat equation
$$\frac{\partial}{\partial t}p_t(x,y) = \frac{1}{2}\frac{\partial^2}{\partial y^2}p_t(x,y).$$
Returning to the drift-diffusion equation for the transition density in general, if we now take a general function $f(x,y)$ that is twice continuously differentiable in $y$ and is bounded above by $Ke^{c|y|}$ for some finite $K, c > 0$, then integration by parts in the drift-diffusion equation produces the following expectation identity, which we state as a theorem.

Theorem 12.19. Let $x, \mu$ be any real numbers, and $\sigma > 0$. Suppose $Y \sim N(x + \mu t, \sigma^2 t)$, and $f(x,y)$ is twice continuously differentiable in $y$ such that for some $0 < K, c < \infty$, $|f(x,y)| \leq Ke^{c|y|}$ for all $y$. Then,
$$\frac{\partial}{\partial t}E_x[f(x,Y)] = \mu\,E_x\left[\frac{\partial}{\partial y}f(x,Y)\right] + \frac{\sigma^2}{2}E_x\left[\frac{\partial^2}{\partial y^2}f(x,Y)\right].$$
This identity and a multidimensional version of it have been used in Brown et al. (2006) to derive various results in statistical decision theory.
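The drift-diffusion equation is easy to verify numerically by finite differences on the Gaussian transition density. The following sketch (added for illustration; parameter values arbitrary) compares the two sides of the equation at a single point.

import numpy as np

mu, sig = -0.3, 1.2

def p(t, x, y):
    # Transition density of BM with drift mu, diffusion coefficient sig.
    return np.exp(-(y - x - mu * t) ** 2 / (2 * sig**2 * t)) / (sig * np.sqrt(2 * np.pi * t))

t, x, y, h = 1.3, 0.2, 0.7, 1e-4

dp_dt = (p(t + h, x, y) - p(t - h, x, y)) / (2 * h)
dp_dy = (p(t, x, y + h) - p(t, x, y - h)) / (2 * h)
d2p_dy2 = (p(t, x, y + h) - 2 * p(t, x, y) + p(t, x, y - h)) / h**2

# dp/dt should equal -mu * dp/dy + (sig^2 / 2) * d2p/dy2.
print(dp_dt, -mu * dp_dy + 0.5 * sig**2 * d2p_dy2)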

12.8.3 * The Ornstein-Uhlenbeck Process

The covariance function of standard Brownian motion $W(t)$ is $\mathrm{Cov}(W(s), W(t)) = \min(s,t)$. Therefore, if we scale by $\sqrt{t}$ and let $X(t) = \frac{W(t)}{\sqrt{t}}, t > 0$, we get that $\mathrm{Cov}(X(s), X(t)) = \sqrt{\frac{\min(s,t)}{\max(s,t)}} = \sqrt{\frac{s}{t}}$ if $s \leq t$. Therefore, the covariance is a function of only the time lag on a logarithmic time scale. This motivates the definition of the Ornstein-Uhlenbeck process as follows.

Definition 12.20. Let $W(t)$ be standard Brownian motion starting at zero, and let $\alpha > 0$ be a fixed constant. Then $X(t) = e^{-\frac{\alpha t}{2}}W(e^{\alpha t}), -\infty < t < \infty$, is called the Ornstein-Uhlenbeck process. The most general Ornstein-Uhlenbeck process is defined as
$$X(t) = \mu + \frac{\sigma}{\sqrt{\alpha}}\,e^{-\frac{\alpha t}{2}}W(e^{\alpha t}), \quad -\infty < \mu < \infty,\ \alpha, \sigma > 0.$$

In contrast to the Wiener process, the Ornstein-Uhlenbeck process has a locally time-dependent drift. If the present state of the process is larger than $\mu$, the global mean, then the drift drags the process back towards $\mu$, and if the present state of the process is smaller than $\mu$, then it does the reverse. The $\alpha$ parameter controls this tendency to return to the grand mean. The third parameter $\sigma$ controls the variability.
Theorem 12.20. Let $X(t)$ be a general Ornstein-Uhlenbeck process. Then $X(t)$ is a stationary Gaussian process with $E[X(t)] = \mu$ and $\mathrm{Cov}(X(s), X(t)) = \frac{\sigma^2}{\alpha}e^{-\frac{\alpha}{2}|s-t|}$.

Proof. It is obvious that $E[X(t)] = \mu$. By the definition of $X(t)$,
$$\mathrm{Cov}(X(s), X(t)) = \frac{\sigma^2}{\alpha}e^{-\frac{\alpha}{2}(s+t)}\min(e^{\alpha s}, e^{\alpha t}) = \frac{\sigma^2}{\alpha}e^{-\frac{\alpha}{2}(s+t)}e^{\alpha\min(s,t)} = \frac{\sigma^2}{\alpha}e^{-(\alpha/2)|s-t|},$$
and inasmuch as $\mathrm{Cov}(X(s), X(t))$ is a function of only $|s-t|$, it follows that $X(t)$ is stationary. □
Example 12.10 (Convergence of Integrated Ornstein-Uhlenbeck to Brownian Motion). Consider an Ornstein-Uhlenbeck process $X(t)$ with parameters $\mu, \alpha$, and $\sigma^2$. In a suitable asymptotic sense, the integrated Ornstein-Uhlenbeck process converges to a Brownian motion with drift $\mu$ and an appropriate diffusion coefficient; the diffusion coefficient can be adjusted to be one. Towards this, define $Y(t) = \int_0^t X(u)\,du$. This is clearly also a Gaussian process. We show in this example that if $\sigma^2, \alpha \to \infty$ in such a way that $\frac{4\sigma^2}{\alpha^2} \to c^2$, $0 < c < \infty$, then $\mathrm{Cov}(Y(s), Y(t)) \to c^2\min(s,t)$. In other words, in the asymptotic paradigm where $\sigma, \alpha \to \infty$ but are of comparable order, the integrated Ornstein-Uhlenbeck process $Y(t)$ is approximately the same as a Brownian motion with some drift and some diffusion coefficient, in the sense of distribution.

We directly calculate $\mathrm{Cov}(Y(s), Y(t))$. There is no loss of generality in taking $\mu$ to be zero. Take $0 < s \leq t < \infty$. Then
$$\mathrm{Cov}(Y(s), Y(t)) = \int_0^t\!\!\int_0^s E[X(u)X(v)]\,du\,dv = \frac{\sigma^2}{\alpha}\int_0^t\!\!\int_0^s e^{-\frac{\alpha}{2}|u-v|}\,du\,dv$$
$$= \frac{\sigma^2}{\alpha}\left[\int_0^s\!\!\int_0^v e^{-\frac{\alpha}{2}(v-u)}\,du\,dv + \int_0^s\!\!\int_v^s e^{-\frac{\alpha}{2}(u-v)}\,du\,dv + \int_s^t\!\!\int_0^s e^{-\frac{\alpha}{2}(v-u)}\,du\,dv\right]$$
$$= \frac{4\sigma^2}{\alpha^3}\left[\alpha s + e^{-\alpha s/2} + e^{-\alpha t/2} - e^{-\alpha(t-s)/2} - 1\right],$$
on doing the three integrals in the line before, and adding them.
If $\alpha, \sigma^2 \to \infty$ and $\frac{4\sigma^2}{\alpha^2}$ converges to some finite and nonzero constant $c^2$, then for any $s, t$ with $0 < s < t$, the derived expression for $\mathrm{Cov}(Y(s), Y(t))$ converges to $c^2s = c^2\min(s,t)$, which is the covariance function of Brownian motion with diffusion coefficient $c$.

The Ornstein-Uhlenbeck process enjoys another important property besides stationarity: it is also a Markov process. It is the only stationary and Markov Gaussian process with paths that are smooth. This property explains some of the popularity of the Ornstein-Uhlenbeck process in fitting models to real data.
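Because the Ornstein-Uhlenbeck process is stationary, Gaussian, and Markov, it can be simulated exactly on a grid: given $X(t)$, the next value is Gaussian around the mean $\mu$ with lag-$\Delta$ correlation $e^{-\alpha\Delta/2}$ (from the covariance function above) and stationary variance $\sigma^2/\alpha$. A minimal sketch, with arbitrary parameter values:

import numpy as np

rng = np.random.default_rng(8)
mu, alpha, sigma = 2.0, 1.5, 0.8
n_steps, dt = 100000, 0.01

rho = np.exp(-alpha * dt / 2)     # from Cov(X(s), X(t)) = (sigma^2/alpha) e^{-(alpha/2)|s-t|}
stat_var = sigma**2 / alpha       # stationary variance

x = np.empty(n_steps)
x[0] = mu + np.sqrt(stat_var) * rng.standard_normal()   # start in stationarity
for k in range(1, n_steps):
    x[k] = mu + rho * (x[k - 1] - mu) \
           + np.sqrt(stat_var * (1 - rho**2)) * rng.standard_normal()

print(x.mean(), x.var(), stat_var)   # sample mean/variance vs. mu and sigma^2/alpha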

Exercises

Exercise 12.1 (Simple Processes). Let $X_0, X_1, X_2, \ldots$ be a sequence of iid standard normal variables, and $W(t), t \geq 0$, a standard Brownian motion independent of the $X_i$ sequence, starting at zero. Determine which of the following processes are Gaussian, and which are stationary.
(a) $X(t) \equiv \frac{X_1 + X_2}{\sqrt{2}}$.
(b) $X(t) \equiv \left|\frac{X_1 + X_2}{\sqrt{2}}\right|$.
(c) $X(t) = \frac{tX_1X_2}{\sqrt{X_1^2 + X_2^2}}$.
(d) $X(t) = \sum_{j=0}^{k-1}\left[X_{2j}\cos\lambda_j t + X_{2j+1}\sin\lambda_j t\right]$, where $\lambda_0, \ldots, \lambda_{k-1}$ are fixed frequencies.
(e) $X(t) = t^2W\left(\frac{1}{t^2}\right), t > 0$, and $X(0) = 0$.
(f) $X(t) = W(t|X_0|)$.

Exercise 12.2. Let $X(t) = \sin(\lambda t)$, where $\lambda \sim U[0, 2\pi]$.
(a) Suppose the time parameter $t$ belongs to $T = \{1, 2, \ldots\}$. Is $X(t)$ stationary?
(b) Suppose the time parameter $t$ belongs to $T = [0,\infty)$. Is $X(t)$ stationary?

Exercise 12.3. Suppose $W(t), t \geq 0$, is a standard Brownian motion starting at zero, and $Y \sim N(0,1)$, independent of the $W(t)$ process. Let $X(t) = Yf(t) + W(t)$, where $f$ is a deterministic function. Is $X(t)$ stationary?

Exercise 12.4 (Increments of Brownian Motion). Suppose $W(t), t \geq 0$, is a standard Brownian motion starting at zero, and $Y$ is a positive random variable independent of the $W(t)$ process. Let $X(t) = W(t+Y) - W(t)$. Is $X(t)$ stationary?

Exercise 12.5. Suppose $W(t), t \geq 0$, is a standard Brownian motion starting at zero. Let $X(n) = W(1) + W(2) + \cdots + W(n), n \geq 1$. Find the covariance function of the process $X(n), n \geq 1$.

Exercise 12.6 (Moments of the Hitting Time). Suppose $W(t), t \geq 0$, is a standard Brownian motion starting at zero. Fix $a > 0$ and let $T_a$ be the first time $W(t)$ hits $a$. Characterize all $\alpha$ such that $E[T_a^{\alpha}] < \infty$.

Exercise 12.7 (Hitting Time of the Positive Quadrant). Suppose W .t/; t  0 is a


standard Brownian motion starting at zero. Let T D infft > 0 W W .t/ > 0g. Show
that with probability one, T D 0.

Exercise 12.8. Suppose W .t/; t  0 is standard Brownian motion starting at zero.


Fix z > 0 and let Tz be the first time W .t/ hits z. Let 0 < a < b < 1. Find
E.Tb jTa D t/.

Exercise 12.9 (Running Maximum of Brownian Motion). Let W .t/; t  0


be standard Brownian motion on Œ0; 1/ and M.t/ D sup0st W .s/. Evaluate
P .M.1/ D M.2//.

Exercise 12.10. Let W .t/; t  0 be standard Brownian motion on Œ0; 1/. Let T >
0 be a fixed finite time instant. Find the density of the first zero of W .t/ after the
time t D T . Does it have a finite mean?

Exercise 12.11 (Integrated Brownian Motion). Let $W(t), t \geq 0$, be standard Brownian motion on $[0,\infty)$. Let $X(t) = \int_0^t W(s)\,ds$. Identify explicit positive constants $K, \alpha$ such that for any $t, c > 0$, $P(|X(t)| \geq c) \leq \frac{Kt^{\alpha}}{c^2}$.

Exercise 12.12 (Integrated Brownian Motion). Let $W(t), t \geq 0$, be standard Brownian motion on $[0,\infty)$. Let $X(t) = \int_0^t W(s)\,ds$. Prove that for any fixed $t$, $X(t)$ has a finite mgf everywhere, and use it to derive the fourth moment of $X(t)$.

Exercise 12.13 (Integrated Brownian Motion). Let $W(t), t \geq 0$, be standard Brownian motion on $[0,\infty)$. Let $X(t) = \int_0^t W(s)\,ds$. Find
(a) $E(X(t)\,|\,W(t) = w)$.
(b) $E(W(t)\,|\,X(t) = x)$.
(c) The correlation between $X(t)$ and $W(t)$.
(d) $P(X(t) > 0, W(t) > 0)$ for a given $t$.

Exercise 12.14 (Application of Reflection Principle). Let $W(t), t \geq 0$, be standard Brownian motion on $[0,\infty)$ and $M(t) = \sup_{0\leq s\leq t}W(s)$. Prove that $P(W(t) \leq w, M(t) \geq x) = 1 - \Phi\left(\frac{2x-w}{\sqrt{t}}\right)$ for $x \geq w$, $x \geq 0$. Hence, derive the joint density of $W(t)$ and $M(t)$.

Exercise 12.15 (Current Value and Current Maximum). Let W .t/; t  0 be


standard Brownian motion on Œ0; 1/ and M.t/ D sup0st W .s/. Find P .W .t/ D
M.t// and find its limit as t ! 1.

Exercise 12.16 (Current Value and Current Maximum). Let W .t/; t  0


be standard Brownian motion on Œ0; 1/ and M.t/ D sup0st W .s/. Find
E.M.t/ jW .t/ D w/.

Exercise 12.17 (Predicting the Next Value). Let $W(t), t \geq 0$, be standard Brownian motion on $[0,\infty)$ and let $\bar{W}(t) = \frac{1}{t}\int_0^t W(s)\,ds$ be the current running average.
(a) Find $\hat{W}(t) = E(W(t)\,|\,\bar{W}(t) = w)$.
(b) Find the prediction error $E[|W(t) - \hat{W}(t)|]$.

Exercise 12.18 (Zero-Free Intervals). Let W .t/; t  0 be standard Brownian mo-


tion, and 0 < s < t < u < 1. Find the conditional probability that W .t/ has no
zeroes in Œs; u given that it has no zeroes in Œs; t.

Exercise 12.19 (Application of the LIL). Let $W(t), t \geq 0$, be standard Brownian motion, and let $X(t) = \frac{W(t)}{\sqrt{t}}, t > 0$. Let $K, M$ be any two positive numbers. Show that infinitely often, with probability one, $X(t) > K$, and infinitely often, with probability one, $X(t) < -M$.

Exercise 12.20. Let W .t/; t  0 be standard Brownian motion, and 0 < s < t <
u < 1. Find the conditional expectation of X.t/ given X.s/ D x; X.u/ D y.

Hint: Consider first the conditional expectation of X.t/ given X.0/ D X.1/ D 0.

Exercise 12.21 (Reflected Brownian Motion Is Markov). Let W .t/; t  0 be


standard Brownian motion starting at zero. Show that jW .t/j is a Markov process.

Exercise 12.22 (Adding a Function to Brownian Motion). Let W .t/ be standard


Brownian motion on Œ0; 1 and f a general continuous function on Œ0; 1. Show that
with probability one, X.t/ D W .t/ C f .t/ is everywhere nondifferentiable.

Exercise 12.23 (No Intervals of Monotonicity). Let W .t/; t  0 be standard


Brownian motion, and 0 < a < b < 1 two fixed positive numbers. Show, by
using the independent increments property, that with probability one, W .t/ is non-
monotone on Œa; b.

Exercise 12.24 (Two-Dimensional Brownian Motion). Show that two-dimensional


standard Brownian motion is a Markov process.

Exercise 12.25 (An Interesting Connection to Cauchy Distribution). Let


W1 .t/; W2 .t/ be two independent standard Brownian motions on Œ0; 1/ start-
ing at zero. Fix a number a > 0 and let Ta be the first time W1 .t/ hits a. Find the
distribution of W2 .Ta /.

Exercise 12.26 (Time Spent in a Nonempty Set). Let W2 .t/; t  0 be a two-


dimensional standard Brownian motion starting at zero, and let C be a nonempty
open set of R2 . Show that with probability one, the Lebesgue measure of the set of
times t at which W .t/ belongs to C is infinite.

Exercise 12.27 (Difference of Two Brownian Motions). Let W1 .t/; W2 .t/; t  0


be two independent Brownian motions, and let c1 ; c2 be two constants. Show that
X.t/ D c1 W1 .t/ C c2 W2 .t/ is another Brownian motion. Identify any drift and
diffusion parameters.

Exercise 12.28 (Intersection of Brownian Motions). Let W1 .t/; W2 .t/; t  0 be


two independent standard Brownian motions starting at zero. Let C D ft > 0 W
W1 .t/ D W2 .t/g.
(a) Is C nonempty with probability one?
(b) If C is nonempty, is it a finite set, or is it an infinite set with probability one?
(c) If C is an infinite set with probability one, is its Lebesgue measure zero or
greater than zero?
(d) Does C have accumulation points? Does it have accumulation points with prob-
ability one?

Exercise 12.29 (The $L_1$ Norm of Brownian Motion). Let $W(t), t \geq 0$, be standard Brownian motion starting at zero. Show that with probability one, $I = \int_0^{\infty}|W(t)|\,dt = \infty$.

Exercise 12.30 (Median Local Time). Find the median of the local time $\eta(x,t)$ of a standard Brownian motion on $[0,\infty)$ starting at zero.
Caution: For $x \neq 0$, the local time has a mixed distribution.

Exercise 12.31 (Monotonicity of the Mean Local Time). Give an analytical proof that the expected value of the local time $\eta(x,t)$ of a standard Brownian motion starting at zero is strictly decreasing in $|x|$.

Exercise 12.32 (Application of Invariance Principle). Let $X_1, X_2, \ldots$ be iid variables with the common distribution $P(X_i = \pm 1) = \frac{1}{2}$. Let $S_k = \sum_{i=1}^{k}X_i, k \geq 1$, and $\Pi_n = \frac{1}{n}\#\{k \leq n : S_k > 0\}$. Find the limiting distribution of $\Pi_n$ by applying Donsker's invariance principle.

Exercise 12.33 (Application of Invariance Principle). Let $X_1, X_2, \ldots$ be iid variables with zero mean and a finite variance $\sigma^2$. Let $S_k = \sum_{i=1}^{k}X_i, k \geq 1$, and $M_n = n^{-1/2}\max_{1\leq k\leq n}S_k$. Find the limiting distribution of $M_n$ by applying Donsker's invariance principle.

Exercise 12.34 (Application of Invariance Principle). Let $X_1, X_2, \ldots$ be iid variables with zero mean and a finite variance $\sigma^2$. Let $S_k = \sum_{i=1}^{k}X_i, k \geq 1$, and $A_n = n^{-1/2}\max_{1\leq k\leq n}|S_k|$. Find the limiting distribution of $A_n$ by applying Donsker's invariance principle.

Exercise 12.35 (Application of Invariance Principle). Let $X_1, X_2, \ldots$ be iid variables with zero mean and a finite variance $\sigma^2$. Let $S_k = \sum_{i=1}^{k}X_i, k \geq 1$, and $T_n = n^{-3/2}\sum_{k=1}^{n}|S_k|$. Find the limiting distribution of $T_n$ by applying Donsker's invariance principle.

Exercise 12.36 (Distributions of Some Functionals). Let $W(t), t \geq 0$, be standard Brownian motion starting at zero. Find the density of each of the following functionals of the $W(t)$ process:
(a) $\sup_{0\leq t\leq 1}W^2(t)$;
(b) $\frac{\int_0^1 W(t)\,dt}{W(\frac{1}{2})}$;
Hint: The terms in the quotient are jointly normal with zero means.
(c) $\sup_{t>0}\frac{W(t)}{a+bt}$, $a, b > 0$.

Exercise 12.37 (Ornstein–Uhlenbeck Process). Let X.t/ be a general Ornstein–


Uhlenbeck process and s < t two general times. Find the expected value of jX.t/ 
X.s/j.

Exercise 12.38. Let $X(t)$ be a general Ornstein-Uhlenbeck process and $Y(t) = \int_0^t X(u)\,du$. Find the correlation between $Y(s)$ and $Y(t)$ for $0 < s < t < \infty$, and find its limit when $\sigma, \alpha \to \infty$ in such a way that $\frac{4\sigma^2}{\alpha^2}$ converges.

Exercise 12.39. Let W .t/; t  0 be standard Brownian motion starting at zero, and
0 < s < t < 1 two general times. Find an expression for P .W .t/ > 0 jW .s/ > 0/,
and its limit when s is held fixed and t ! 1.

Exercise 12.40 (Application of the Heat Equation). Let $Y \sim N(0, \sigma^2)$ and let $f(Y)$ be a twice continuously differentiable convex function of $Y$. Show that $E[f(Y)]$ is increasing in $\sigma$, assuming that the expectation exists.

References

Bhattacharya, R.N. and Waymire, E. (2007). A Basic Course in Probability Theory, Springer, New York.
Bhattacharya, R.N. and Waymire, E. (2009). Stochastic Processes with Applications, SIAM, Philadelphia.
Billingsley, P. (1968). Convergence of Probability Measures, John Wiley, New York.
Breiman, L. (1992). Probability, Addison-Wesley, New York.
Brown, L. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems, Ann. Math. Statist., 42, 855–903.
Brown, L., DasGupta, A., Haff, L.R., and Strawderman, W.E. (2006). The heat equation and Stein's identity: Connections, applications, J. Statist. Plann. Inference, 136, 2254–2278.
Csörgo, M. (2002). A glimpse of the impact of Pal Erdös on probability and statistics, Canad. J. Statist., 30, 4, 493–556.
Csörgo, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics, Academic Press, New York.
Csörgo, S. and Hall, P. (1984). The KMT approximations and their applications, Austr. J. Statist., 26, 2, 189–218.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Donsker, M. (1951). An invariance principle for certain probability limit theorems, Mem. Amer. Math. Soc., 6.
Durrett, R. (2001). Essentials of Stochastic Processes, Springer, New York.
Einmahl, U. (1987). Strong invariance principles for partial sums of independent random vectors, Ann. Prob., 15, 4, 1419–1440.
Erdös, P. and Kac, M. (1946). On certain limit theorems of the theory of probability, Bull. Amer. Math. Soc., 52, 292–302.
Freedman, D. (1983). Brownian Motion and Diffusion, Springer, New York.
Hall, P. and Heyde, C. (1980). Martingale Limit Theory and Its Applications, Academic Press, New York.
Karatzas, I. and Shreve, S. (1991). Brownian Motion and Stochastic Calculus, Springer, New York.
Karlin, S. and Taylor, H. (1975). A First Course in Stochastic Processes, Academic Press, New York.
Komlós, J., Major, P., and Tusnady, G. (1975). An approximation of partial sums of independent rvs and the sample df: I, Zeit. für Wahr. Verw. Geb., 32, 111–131.
Komlós, J., Major, P., and Tusnady, G. (1976). An approximation of partial sums of independent rvs and the sample df: II, Zeit. für Wahr. Verw. Geb., 34, 33–58.
Körner, T. (1986). Fourier Analysis, Cambridge University Press, Cambridge, UK.
Lawler, G. (2006). Introduction to Stochastic Processes, Chapman and Hall, New York.
Major, P. (1978). On the invariance principle for sums of iid random variables, J. Mult. Anal., 8, 487–517.
Mörters, P. and Peres, Y. (2010). Brownian Motion, Cambridge University Press, Cambridge, UK.
Pyke, R. (1984). Asymptotic results for empirical and partial sum processes: A review, Canad. J. Statist., 12, 241–264.
Resnick, S. (1992). Adventures in Stochastic Processes, Birkhäuser, Boston.
Révész, P. (2005). Random Walk in Random and Nonrandom Environments, World Scientific Press, Singapore.
Revuz, D. and Yor, M. (1994). Continuous Martingales and Brownian Motion, Springer, Berlin.
Strassen, V. (1964). An invariance principle for the law of the iterated logarithm, Zeit. für Wahr. Verw. Geb., 3, 211–226.
Strassen, V. (1967). Almost sure behavior of sums of independent random variables and martingales, Proc. Fifth Berkeley Symp., 1, 315–343, University of California Press, Berkeley.
Whitt, W. (1980). Some useful functions for functional limit theorems, Math. Oper. Res., 5, 67–85.
Chapter 13
Poisson Processes and Applications

A single theme that binds together a number of important probabilistic concepts and distributions, and that is at the same time a major tool for the applied probabilist and the applied statistician, is the Poisson process. The Poisson process is a probabilistic model of situations where events occur completely at random at intermittent times, and we wish to study the number of times the particular event has occurred up to a specific time instant, or perhaps the waiting time till the next event, and so on. Some simple examples are receiving phone calls at a telephone call center, receiving an e-mail from someone, arrival of a customer at a pharmacy or some other store, catching a cold, occurrence of earthquakes, mechanical breakdown in a computer or some other machine, and so on. There is no end to how many examples we can think of, where an event happens, then nothing happens for a while, and then it happens again, and it keeps going like this, apparently at random. It is therefore not surprising that the Poisson process is such a valuable tool in the probabilist's toolbox. It is also a fascinating feature of Poisson processes that they are connected in various interesting ways to a number of special distributions, including the Poisson, exponential, Gamma, Beta, uniform, binomial, and the multinomial. These wide-ranging connections and wide applications make the Poisson process a very special topic in probability.
in probability.
The examples we mentioned above all correspond to events occurring at time
instants taking values on the real line. In all of these examples, we can think of
the time parameter as real physical time. However, Poisson processes can also be
discussed in two, three, or indeed any number of dimensions. For example, Pois-
son processes are often used to model spatial distribution of trees in a forest. This
would be an example of a Poisson process in two dimensions. Poisson processes
are used to model galaxy distributions in space. This would be an example of a
Poisson process in three dimensions. The fact is, Poisson processes can be defined
and treated in considerably more generality than just the linear Poisson process.
We start with the case of the linear Poisson process, and then consider the higher-
dimensional cases. Kingman (1993) is a classic reference on Poisson processes.
Some other recommended references are Parzen (1962), Karlin and Taylor (1975),
and Port (1994). A concise well-written treatment is given in Lawler (2006).


13.1 Notation

The Poisson process is an example of a stochastic process with pure jumps indexed by a running label $t$. We call $t$ the time parameter, and for the purpose of our discussion here, $t$ belongs to the infinite interval $[0, \infty)$. For each $t \geq 0$, there is a nonnegative random variable $X(t)$, which counts how many events have occurred up to and including time $t$. As we vary $t$, we can think of $X(t)$ as a function. It is a random function, because each $X(t)$ is a random variable. Like all functions, $X(t)$ has a graph. The graph of $X(t)$ is called a path of $X(t)$. It is helpful to look at a typical path of a Poisson process; Fig. 13.1 gives an example.

We notice that the path is a nondecreasing function of the time parameter $t$, and that it increases by jumps of size one. The time instants at which these jumps occur are called the renewal or arrival times of the process. Thus, we have an infinite sequence of arrival times, say $Y_1, Y_2, Y_3, \ldots$; the first arrival occurs exactly at time $Y_1$, the second arrival occurs at time $Y_2$, and so on. We define $Y_0$ to be zero. The gaps between the arrival times, namely $Y_1 - Y_0, Y_2 - Y_1, Y_3 - Y_2, \ldots$, are called the interarrival times. Writing $Y_n - Y_{n-1} = T_n$, we see that the interarrival times and the arrival times are related by the simple identity

$$Y_n = (Y_n - Y_{n-1}) + (Y_{n-1} - Y_{n-2}) + \cdots + (Y_2 - Y_1) + (Y_1 - Y_0) = T_1 + T_2 + \cdots + T_n.$$

A special property of a Poisson process is that these interarrival times are independent. So, for instance, if $T_3$, the time you had to wait between the second and the third event, were large, then you would have no right to believe that $T_4$ should be small, because $T_3$ and $T_4$ are actually independent for a Poisson process.

Fig. 13.1 Path of a Poisson process



13.2 Defining a Homogeneous Poisson Process

The Poisson process can be arrived at in a number of ways. All of these apparently different definitions are actually equivalent. Here are some equivalent definitions of a Poisson process.

Definition # 1. One possibility is to start with the interarrival times, and assume that they are iid exponential with some mean $\mu$. Then the number of arrivals up to and including time $t$ is a Poisson process with a constant arrival rate $\lambda = \frac{1}{\mu}$.

Definition # 2. Or, we may assume that the number of arrivals in a general time interval $[t_1, t_2]$ is a Poisson variable with mean $\lambda(t_2 - t_1)$ for some fixed positive number $\lambda$, and that the numbers of arrivals over any finite number of disjoint intervals are mutually independent Poisson variables. This is equivalent to the first definition given in the paragraph above.

Definition # 3. A third possibility is to use a neat result due to Alfred Rényi. Rényi proved that if $X(t)$ satisfies the Poisson property that the number of arrivals $X(B)$ within any set of times $B$, not necessarily a set of the form of an interval, is a Poisson variable with mean $\lambda |B|$, where $|B|$ denotes the Lebesgue measure of $B$, then $X(t)$ must be a Poisson process. Note that there is no mention of independence over disjoint intervals in this definition. Independence falls out as a consequence of Rényi's condition, and Rényi's result is also a perfectly correct definition of a Poisson process; see Kingman (1993, p. 33).

Definition # 4. From the point of view of physical motivation and its original history, it is perhaps best to define a Poisson process in terms of some characteristic physical properties of the process. In other words, if we make a certain number of specific assumptions about how the process $X(t)$ behaves, then those assumptions serve as a definition of a Poisson process. If you do not believe in one or more of these assumptions in a particular problem, then the Poisson process is not the right model for that problem. Here are the assumptions.

(a) $X(0) = 0$.
(b) The rate of arrival of the events remains constant over time, in the sense that there is a finite positive number $\lambda$ such that for any $t \geq 0$,

(i) For $h \to 0$, $P(X(t+h) = X(t) + 1) = \lambda h + o(h)$.

(c) The numbers of events over nonoverlapping time intervals are independent; that is, given disjoint time intervals $[a_i, b_i],\ i = 1, 2, \ldots, n$, the random variables $X(b_i) - X(a_i),\ i = 1, 2, \ldots, n$ are mutually independent.
(d) More than one event cannot occur at exactly the same time instant. Precisely,

(i) For $h \to 0$, $P(X(t+h) = X(t)) = 1 - \lambda h + o(h)$;
(ii) For $h \to 0$, $P(X(t+h) = X(t) + 1) = \lambda h + o(h)$;
(iii) For $h \to 0$, $P(X(t+h) > X(t) + 1) = o(h)$.

The important point is that all of these definitions are equivalent. Depending on taste, one may choose any of these as the definition of a homogeneous or stationary Poisson process on the real line.
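These definitions are easy to probe numerically. The following is a minimal simulation sketch (Python with NumPy; the helper name and the parameter values are ours, purely for illustration) that builds the process from iid exponential interarrival times as in Definition # 1, and checks that the count up to time $t$ behaves like a Poisson variable with mean $\lambda t$, as Definition # 2 demands.

```python
import numpy as np

rng = np.random.default_rng(7)

def poisson_process_count(lam, t, n_paths):
    """Simulate n_paths copies of X(t) by accumulating iid exponential
    (mean 1/lam) interarrival times until time t is exceeded."""
    counts = np.empty(n_paths, dtype=int)
    for i in range(n_paths):
        total, n = 0.0, 0
        while True:
            total += rng.exponential(scale=1.0 / lam)  # next interarrival time
            if total > t:
                break
            n += 1
        counts[i] = n
    return counts

lam, t = 2.0, 3.0
counts = poisson_process_count(lam, t, 20_000)
# Definition # 2 predicts X(t) ~ Poisson(lam*t): mean and variance both lam*t = 6
print(counts.mean(), counts.var())
```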

13.3 Important Properties and Uses as a Statistical Model

Starting with these assumptions in our Definition # 4 about the physical behavior of the process, one can use some simple differential equation methods and probability calculations to establish various important properties of a Poisson process. We go over some of the most important properties next.

Given $k \geq 0$, let $p_k(t) = P(X(t) = k)$, and let $f_k(t) = e^{\lambda t} p_k(t)$. By the total probability formula, as $h \to 0$,

$$p_k(t+h) = P(X(t+h) = k \mid X(t) = k)\, p_k(t) + P(X(t+h) = k \mid X(t) = k-1)\, p_{k-1}(t) + o(h)$$
$$\Rightarrow p_k(t+h) = (1 - \lambda h)\, p_k(t) + \lambda h\, p_{k-1}(t) + o(h)$$
$$\Rightarrow \frac{p_k(t+h) - p_k(t)}{h} = -\lambda\, [p_k(t) - p_{k-1}(t)] + o(1)$$
$$\Rightarrow p_k'(t) = -\lambda\, [p_k(t) - p_{k-1}(t)],$$

for $k \geq 1,\ t > 0$, and when $k = 0$, $p_0'(t) = -\lambda p_0(t),\ t > 0$. Because $X(0) = 0$, the last equation immediately gives $p_0(t) = e^{-\lambda t}$. For $k \geq 1$, the system of differential equations $p_k'(t) = -\lambda[p_k(t) - p_{k-1}(t)]$ is equivalent to

$$f_k'(t) = \lambda f_{k-1}(t),\quad t > 0,\quad f_k(0) = 0;$$

note that $f_k(0) = 0$ because $P(X(0) = k) = 0$ if $k \geq 1$.

This last system of differential equations $f_k'(t) = \lambda f_{k-1}(t),\ t > 0,\ f_k(0) = 0$ is easy to solve recursively, and the solutions are

$$f_k(t) = \frac{(\lambda t)^k}{k!},\quad k \geq 1,\ t > 0$$
$$\Rightarrow p_k(t) = P(X(t) = k) = e^{-\lambda t}\, \frac{(\lambda t)^k}{k!}.$$

This is true also for $k = 0$. So, we have proved the following theorem, which accounts for the name Poisson process.

Theorem 13.1. If $X(t),\ t \geq 0$ is a Poisson process with the constant arrival rate $\lambda$, then for any $t > 0$, $X(t)$ is distributed as a Poisson with mean $\lambda t$. More generally, the number of arrivals in an interval $(s, t]$ is distributed as a Poisson with mean $\lambda(t - s)$.

Example 13.1 (A Medical Example). Suppose between the months of May and October, you catch allergic rhinitis at the constant average rate of once in six weeks. Assuming that the incidences follow a Poisson process, let us answer some simple questions.

First, what is the expected total number of times that you will catch allergic rhinitis between May and October in one year? Take the start of May 1 as $t = 0$, and $X(t)$ as the number of fresh incidences up to (and including) time $t$. Note that time is being measured in some implicit unit, say weeks. Then, the arrival rate of the Poisson process for $X(t)$ is $\lambda = \frac{1}{6}$. There are 24 weeks between May and October, and $X(24)$ is distributed as a Poisson with mean $24\lambda = 4$, which is the expected total number of times that you will catch allergic rhinitis between May and October.

Next, what is the probability that you will catch allergic rhinitis at least once before the start of August and at least once after the start of August? This is the same as asking what is $P(X(12) \geq 1, X(24) - X(12) \geq 1)$. By the property of independence of $X(12)$ and $X(24) - X(12)$, this probability equals

$$P(X(12) \geq 1)\, P(X(24) - X(12) \geq 1) = [P(X(12) \geq 1)]^2 = [1 - P(X(12) = 0)]^2 = \left[1 - e^{-\frac{12}{6}}\right]^2 = .7476.$$
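A quick numerical check of this calculation (a sketch assuming SciPy is available; the numbers simply mirror the example above):

```python
from scipy.stats import poisson

# X(12) and X(24) - X(12) are independent Poisson(12/6) = Poisson(2) variables
p = 1 - poisson.pmf(0, 2.0)   # P(at least one cold in a 12-week stretch)
print(p ** 2)                 # approximately .7476
```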

A key property of the Poisson process is that the sequence of interarrival times $T_1, T_2, \ldots$ is iid exponential with mean $\frac{1}{\lambda}$. We do not rigorously prove this here, but as the simplest illustration of how the exponential density enters into the picture, we show that $T_1$ has the exponential density. Indeed, $P(T_1 > h) = P(X(h) = 0) = e^{-\lambda h}$, because $X(h)$ has a Poisson distribution with mean $\lambda h$. It follows that $T_1$ has the density $f_{T_1}(h) = \lambda e^{-\lambda h},\ h > 0$.

As a further illustration, consider the joint distribution of $T_1$ and $T_2$. Here is a very heuristic explanation for why $T_1, T_2$ are iid exponentials. Fix two positive numbers $t, u$. The event $\{T_1 > t, T_2 > u\}$ is just the event that the first arrival time $Y_1$ is at some time later than $t$, and counting from $Y_1$, no new further events occur for another time interval of length $u$. But the intervals $[0, t]$ and $(Y_1, Y_1 + u)$ are nonoverlapping if the first arrival occurs after $t$, and the probability of zero events in both of these intervals would then factor as $e^{-\lambda t} e^{-\lambda u}$. This means $P(T_1 > t) = e^{-\lambda t}$ and $P(T_2 > u) = e^{-\lambda u}$, and that $T_1, T_2$ are iid exponentials with the same density function.

Because the sum of iid exponentials has a Gamma distribution (see Chapter 4), it also follows that for a Poisson process, the $n$th arrival time $Y_n$ has a Gamma distribution. All of these are recorded in the following theorem.

Theorem 13.2. Let $X(t),\ t \geq 0$ be a Poisson process with constant arrival rate $\lambda$. Then,

(a) $T_1, T_2, \ldots$ are iid exponential with the density function $f_{T_i}(t) = \lambda e^{-\lambda t},\ t > 0$.
(b) Let $n \geq 1$. Then $Y_n$ has the Gamma density $f_{Y_n}(y) = \dfrac{\lambda^n e^{-\lambda y} y^{n-1}}{(n-1)!},\ y > 0$.

See Kingman (1993, p. 39) for a rigorous proof of this key theorem.

Example 13.2 (Geiger Counter). Geiger counters, named after Hans Geiger, are used to detect radiation-emitting particles, such as beta and alpha particles, or low-energy gamma rays. The counter does so by recording a current pulse when a radioactive particle or ray hits the counter. Poisson processes are standard models for counting particle hits on the counter.

Suppose radioactive particles hit such a Geiger counter at the constant average rate of one hit per 30 seconds. Therefore, the arrival rate of our Poisson process is $\lambda = \frac{1}{30}$. Let $Y_1, Y_2, \ldots, Y_n$ be the times of the first $n$ hits on the counter, and $T_1, T_2, \ldots, T_n$ the time gaps between the successive hits. We ask a number of questions about these arrival and interarrival times in this example.

First, by our previous theorem, $Y_n$ has the Gamma distribution with density

$$f_{Y_n}(y) = \frac{e^{-y/30}\, y^{n-1}}{30^n\, (n-1)!},\quad y > 0.$$

Suppose we want to calculate the probability that the hundredth hit on the Geiger counter occurs within the first hour. This is equal to $P(Y_{100} \leq 3600)$, there being $60 \times 60 = 3600$ seconds in an hour. We can try to integrate the Gamma density for $Y_n$ and evaluate $P(Y_{100} \leq 3600)$. This will require a computer. On the other hand, the needed probability is also equal to $P(X(3600) \geq 100)$, where $X(3600) \sim \text{Poi}(120)$, because with $t = 3600$, $\lambda t = \frac{3600}{30} = 120$. However, this calculation is also clumsy, because the Poisson mean is such a large number. We can calculate the probability approximately by using the central limit theorem. Indeed,

$$P(Y_{100} \leq 3600) \approx P\left(Z \leq \frac{3600 - 100 \times 30}{\sqrt{100 \times 900}}\right) = P(Z \leq 2) = .9772,$$

where $Z$ denotes a standard normal variable.

Next, suppose we wish to find the probability that at least once, within the first hundred hits, two hits come within a second of each other. At first glance, it might seem that this is unlikely, because the average gap between two successive hits is as large as 30 seconds. But, actually the probability is very high.

If we denote the order statistics of the first hundred interarrival times by $T_{(1)} < T_{(2)} < \cdots < T_{(100)}$, then we want to find $P(T_{(1)} \leq 1)$. Recall that the interarrival times $T_1, T_2, \ldots$ themselves are iid exponential with mean 30. Therefore, the minimum $T_{(1)}$ also has an exponential distribution, but the mean is $\frac{30}{100} = .3$. Therefore, our needed probability is $\int_0^1 \frac{1}{.3}\, e^{-x/.3}\, dx = .9643$.
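Both computations in this example are easily reproduced numerically. Here is a minimal sketch (assuming SciPy; note that SciPy's Gamma distribution uses a shape-scale parametrization):

```python
import numpy as np
from scipy.stats import gamma, norm

# P(Y_100 <= 3600): Y_100 ~ Gamma(shape = 100, scale = 30)
exact = gamma.cdf(3600, a=100, scale=30)
clt = norm.cdf((3600 - 100 * 30) / np.sqrt(100 * 900))
print(exact, clt)             # exact is about .972; the CLT gives .9772

# P(T_(1) <= 1): the minimum of 100 iid Exp(mean 30) is Exp(mean .3)
print(1 - np.exp(-100 / 30))  # .9643
```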

Example 13.3 (Poisson Process and the Beta Distribution). It was mentioned in the chapter introduction that the Poisson process has a connection to the Beta distribution. We show this connection in this example.

Suppose customers stream in to a drugstore at the constant average rate of 15 per hour. The pharmacy opens its doors at 8:00 AM and closes at 8:00 PM. Given that the hundredth customer on a particular day walked in at 2:00 PM, we want to know the probability that the fiftieth customer came before noon. Write $n$ for 100 and $m$ for 50, and let $Y_j$ be the arrival time of the $j$th customer on that day. Then, we are told that $Y_n = 6$, and we want to calculate $P(Y_m < 4 \mid Y_n = 6)$.

We can attack this problem more generally by drawing a connection of a Poisson process to the Beta distribution. For this, we recall the result from Chapter 4 that if $X, Y$ are independent positive random variables, with densities

$$f_X(x) = \frac{e^{-\lambda x}\, \lambda^{\alpha}\, x^{\alpha - 1}}{\Gamma(\alpha)},\quad f_Y(y) = \frac{e^{-\lambda y}\, \lambda^{\beta}\, y^{\beta - 1}}{\Gamma(\beta)},\quad \alpha, \beta, \lambda > 0,$$

then $U = \frac{X}{X+Y}$ and $V = X + Y$ are independent, and $U$ has the Beta density

$$f_U(u) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha - 1} (1 - u)^{\beta - 1},\quad 0 < u < 1.$$

Returning now to our problem, $Y_m = T_1 + T_2 + \cdots + T_m$, and $Y_n = Y_m + T_{m+1} + \cdots + T_n$, where $T_1, T_2, \ldots, T_n$ are iid exponentials. We can write the ratio $\frac{Y_m}{Y_n}$ as $\frac{Y_m}{Y_m + (Y_n - Y_m)}$. Because $Y_m$ and $Y_n - Y_m$ are independent Gamma variables, it follows from the above paragraph that $U = \frac{Y_m}{Y_n}$ has the Beta density

$$f_U(u) = \frac{\Gamma(n)}{\Gamma(m)\Gamma(n - m)}\, u^{m-1} (1 - u)^{n - m - 1},\quad 0 < u < 1,$$

and furthermore, $U$ and $Y_n$ are independent. This is useful to us. Indeed,

$$P(Y_m < 4 \mid Y_n = 6) = P\left(\frac{Y_m}{Y_n} < \frac{4}{6}\,\Big|\, Y_n = 6\right) = P\left(\frac{Y_m}{Y_n} < \frac{4}{6}\right)$$

(inasmuch as $U = \frac{Y_m}{Y_n}$ and $Y_n$ are independent)

$$= \int_0^{4/6} f_U(u)\, du = \frac{\Gamma(100)}{[\Gamma(50)]^2} \int_0^{4/6} u^{49} (1 - u)^{49}\, du = .9997.$$
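A sketch of the same computation in code (assuming SciPy), using the fact just derived that $Y_m/Y_n \sim \text{Beta}(m, n - m)$ independently of $Y_n$:

```python
from scipy.stats import beta

# P(Y_50 < 4 | Y_100 = 6) = P(Beta(50, 50) < 4/6)
print(beta.cdf(4 / 6, 50, 50))   # approximately .9997
```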

Example 13.4 (Filtered Poisson Process). Suppose the webpage of a certain text is hit by a visitor according to a Poisson process $X(t)$ with an average rate of 1.5 per day. However, 30% of the time, the person visiting the page does not purchase the book. We assume that customers make their purchase decisions independently, and independently of the $X(t)$ process. Let $Y(t)$ denote the number of copies of the book sold up to and including the $t$th day. We assume, by virtue of the Poisson process assumption, that more than one hit is not made at exactly the same time, and that a visitor does not purchase more than one book. What kind of a process is the process $Y(t),\ t \geq 0$?

We can imagine a sequence of iid Bernoulli variables $U_1, U_2, \ldots$, where $U_i = 1$ if the $i$th visitor to the webpage actually purchases the book. Each $U_i$ is a Bernoulli with parameter $p = .7$. Also let $X(t)$ denote the number of hits made on the page up to time $t$. Then, $Y(t) = \sum_{i=1}^{X(t)} U_i$, where the sum over an empty set is defined, as usual, to be zero. Then,

$$P(Y(t) = k) = \sum_{x=k}^{\infty} P(Y(t) = k \mid X(t) = x)\, P(X(t) = x)$$
$$= \sum_{x=k}^{\infty} \binom{x}{k} p^k (1-p)^{x-k}\, \frac{e^{-\lambda t} (\lambda t)^x}{x!} = e^{-\lambda t}\, \frac{\left(\frac{p}{1-p}\right)^k}{k!} \sum_{x=k}^{\infty} \frac{(\lambda t (1-p))^x}{(x-k)!}$$
$$= e^{-\lambda t}\, \frac{\left(\frac{p}{1-p}\right)^k}{k!}\, (\lambda t (1-p))^k\, e^{\lambda t (1-p)} = e^{-p \lambda t}\, \frac{(p \lambda t)^k}{k!}.$$

Therefore, for each $t$, $Y(t)$ is also a Poisson random variable, but the mean has changed to $p \lambda t$.

This is not enough to prove that $Y(t)$ is also a Poisson process. One needs to show, in addition, the time homogeneity property, namely, regardless of $s$, $Y(t+s) - Y(s)$ is a Poisson variable with mean $p \lambda t$, and the independent increments property; that is, over disjoint intervals $(a_i, b_i]$, $Y(b_i) - Y(a_i)$ are independent Poisson variables. Verification of these is just a calculation, and is left as an exercise.

To summarize, a filtered Poisson process is also a Poisson process.
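The thinning property asserted here is easy to see in simulation. Below is a minimal sketch (our own code, with the parameters of this example) that thins the hit counts with retention probability $p$ and compares the resulting counts with the Poisson($p\lambda t$) prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, t, n_paths = 1.5, 0.7, 10.0, 100_000

hits = rng.poisson(lam * t, size=n_paths)   # visitors in [0, t]
sales = rng.binomial(hits, p)               # each visitor buys with prob p

# Y(t) should be Poisson with mean p*lam*t = 10.5
print(sales.mean(), sales.var())            # both close to 10.5
```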
Example 13.5 (Inspection Paradox). Suppose that buses arrive at a certain bus stop according to a Poisson process with some constant average rate $\lambda$, say once in 30 minutes. Thus, the average gap between any two arrivals of the bus is 30 minutes. Suppose now that out of habit, you always arrive at the bus stop at some fixed time $t$, say 5:00 PM. The term inspection paradox refers to the mathematical fact that the average time gap between the last arrival of a bus before 5:00 PM and the first arrival of a bus after 5:00 PM is larger than 30 minutes. It is as if by simply showing up at your bus stop at a fixed time, you can make the buses tardier than they are on an average! We derive this peculiar mathematical fact in this example.

We need some notation. Given a fixed time $t$, let $\delta_t$ denote the time that has elapsed since the last event; in symbols, $\delta_t = t - Y_{X(t)}$. Also let $\gamma_t$ denote the time until the next event; in symbols, $\gamma_t = Y_{X(t)+1} - t$. The functions $\delta_t$ and $\gamma_t$ are commonly known as current life and residual life in applications. We then have

$$P(\delta_t > h) = P(\text{no events between } t - h \text{ and } t) = e^{-\lambda h},\quad 0 \leq h < t.$$

It is important that we note that $P(\delta_t = t) = e^{-\lambda t}$, and that for $h \geq t$, $P(\delta_t > h) = 0$. Thus, the function $P(\delta_t > h)$ has a discontinuity at $h = t$. Likewise,

$$P(\gamma_t > h) = P(\text{no events between } t \text{ and } t + h) = e^{-\lambda h},\quad h \geq 0.$$

Therefore,

$$E[\delta_t + \gamma_t] = \int_0^t h\, (\lambda e^{-\lambda h})\, dh + t e^{-\lambda t} + \frac{1}{\lambda} = \frac{2}{\lambda} - \frac{e^{-\lambda t}}{\lambda},$$

on actually doing the integration.

Now, clearly, $\frac{2}{\lambda} - \frac{e^{-\lambda t}}{\lambda} > \frac{1}{\lambda}$, because $t > 0$, and so, we have the apparent conundrum that the average gap between the two events just prior to and just succeeding any fixed time $t$ is larger than the average gap between all events in the process.
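The formula $E[\delta_t + \gamma_t] = \frac{2}{\lambda} - \frac{e^{-\lambda t}}{\lambda}$ is easily confirmed by simulation; here is a minimal sketch (illustrative parameters, with time in minutes and $Y_0 = 0$ as in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t, n_paths = 1 / 30, 300.0, 50_000   # one bus per 30 minutes on average

gaps = np.empty(n_paths)
for i in range(n_paths):
    arrivals = np.cumsum(rng.exponential(scale=1 / lam, size=100))
    before = arrivals[arrivals <= t]
    last_before = before.max() if before.size > 0 else 0.0
    first_after = arrivals[arrivals > t][0]
    gaps[i] = first_after - last_before    # delta_t + gamma_t

print(gaps.mean())                         # simulated E[delta_t + gamma_t]
print(2 / lam - np.exp(-lam * t) / lam)    # theory: about 60, not 30
```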
Example 13.6 (Compound Poisson Process). In some applications, a cluster of individuals of a random size arrives at the arrival times of an independent Poisson process. Here is a simple example. Tourist buses arrive at a particular location according to a Poisson process, and each arriving bus brings with it a random number of tourists. We want to study the total number of arriving tourists in some given time interval, for example, from 0 to $t$.

Then let $X(t)$ be the underlying Poisson process with average constant rate $\lambda$ for the arrival of the buses, and assume that the numbers of tourists arriving in the buses form an iid sequence $W_1, W_2, \ldots$ with the mass function $P(W_i = k) = p_k,\ k \geq 0$. The sequence $W_1, W_2, \ldots$ is assumed to be independent of the entire $X(t)$ process. Let $Y(t)$ denote the total number of arriving tourists up to time $t$. Then, we have $Y(t) = \sum_{i=1}^{X(t)} W_i$. Such a process $Y(t)$ is called a compound Poisson process.

We work out the generating function of $Y(t)$ for a general $t$; see Chapter 1 for the basic facts about the generating function of a nonnegative integer-valued random variable. We recall that for any nonnegative integer-valued random variable $Z$, the generating function is defined as $G_Z(s) = E(s^Z)$, and for any $k \geq 0$, $P(Z = k) = \frac{G_Z^{(k)}(0)}{k!}$, and also, provided that the $k$th moment of $Z$ exists, $E[Z(Z-1)\cdots(Z-k+1)] = G_Z^{(k)}(1)$.

The generating function of $Y(t)$ then equals

$$G_Y(s) = E\left[s^{Y(t)}\right] = E\left[s^{\sum_{i=1}^{X(t)} W_i}\right] = \sum_{x=0}^{\infty} E\left[s^{\sum_{i=1}^{X(t)} W_i}\,\Big|\, X(t) = x\right] e^{-\lambda t}\, \frac{(\lambda t)^x}{x!} = e^{-\lambda t} \sum_{x=0}^{\infty} E\left[s^{\sum_{i=1}^{x} W_i}\right] \frac{(\lambda t)^x}{x!}$$

(because it has been assumed that $W_1, W_2, \ldots$ are independent of the $X(t)$ process)

$$= e^{-\lambda t} \sum_{x=0}^{\infty} \left(\prod_{i=1}^{x} E\left[s^{W_i}\right]\right) \frac{(\lambda t)^x}{x!} = e^{-\lambda t} \sum_{x=0}^{\infty} [G_W(s)]^x\, \frac{(\lambda t)^x}{x!} = e^{-\lambda t}\, e^{\lambda t G_W(s)},$$

where, in the last two lines, we have used our assumption that $W_1, W_2, \ldots$ are iid. We have thus derived a fully closed-form formula for the generating function of $Y(t)$. We can, in principle, derive $P(Y(t) = k)$ for any $k$ from this formula. We can also derive the mean and the variance, for which we need the first two derivatives of this generating function. The first two derivatives are

$$G_Y'(s) = \lambda t\, e^{-\lambda t}\, e^{\lambda t G_W(s)}\, G_W'(s),$$

and

$$G_Y''(s) = e^{-\lambda t}\, e^{\lambda t G_W(s)}\left([\lambda t\, G_W'(s)]^2 + \lambda t\, G_W''(s)\right).$$

If the $W_i$ have a finite mean and variance, then we know from Chapter 1 that $E(W) = G_W'(1)$ and $E[W(W-1)] = G_W''(1)$. Plugging these into our latest expressions for $G_Y'(s)$ and $G_Y''(s)$, and some algebra, we find

$$E[Y(t)] = \lambda t\, E(W),\quad \text{and}\quad E[Y(t)^2] = [\lambda t\, E(W)]^2 + \lambda t\, E(W^2).$$

Combining, we get $\mathrm{Var}(Y(t)) = \lambda t\, E(W^2)$.
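The mean and variance formulas just derived are easily checked by simulation. A minimal sketch (the choice of Poisson-distributed cluster sizes $W_i$ is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t, mu_w, n_paths = 2.0, 5.0, 4.0, 50_000

n_buses = rng.poisson(lam * t, size=n_paths)
# total tourists: W_1 + ... + W_{X(t)} with W_i iid Poisson(mu_w)
y = np.array([rng.poisson(mu_w, size=n).sum() for n in n_buses])

print(y.mean(), lam * t * mu_w)             # E[Y(t)] = lam*t*E(W) = 40
print(y.var(), lam * t * (mu_w + mu_w**2))  # Var(Y(t)) = lam*t*E(W^2) = 200
```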


A final interesting property of a Poisson process that we present is a beautiful connection to the uniform distribution. It is because of this property that the Poisson process is equated in probabilistic folklore with complete randomness of pattern. The result says that if we are told that $n$ events in a Poisson process have occurred up to the time instant $t$ for some specific $t$, then the actual arrival times of those $n$ events are just uniformly scattered in the time interval $(0, t]$. Here is a precise statement of the result and its proof.

Theorem 13.3 (Complete Randomness Property). Let $X(t),\ t \geq 0$ be a Poisson process with a constant arrival rate, say $\lambda$. Let $t > 0$ be fixed. Then the joint conditional density of $Y_1, Y_2, \ldots, Y_n$ given that $X(t) = n$ is the same as the joint density of the $n$ order statistics of an iid sample from the $U[0, t]$ distribution.

Proof. Recall the notation that the arrival times are denoted as $Y_1, Y_2, \ldots$, and the interarrival times by $T_1, T_2, \ldots$, so that $Y_1 = T_1, Y_2 = T_1 + T_2, Y_3 = T_1 + T_2 + T_3, \ldots$. Fix $0 < u_1 < u_2 < \cdots < u_n < t$. We show that

$$\frac{\partial^n}{\partial u_1 \partial u_2 \cdots \partial u_n}\, P(Y_1 \leq u_1, Y_2 \leq u_2, \ldots, Y_n \leq u_n \mid X(t) = n) = \frac{n!}{t^n},$$

$0 < u_1 < u_2 < \cdots < u_n < t$. This completes our proof because the mixed partial derivative of the joint CDF gives the joint density in general (see Chapter 3), and the function of $n$ variables $u_1, u_2, \ldots, u_n$ given by $\frac{n!}{t^n}$ on $0 < u_1 < u_2 < \cdots < u_n < t$ indeed is the joint density of the order statistics of $n$ iid $U[0, t]$ variables.

For ease of presentation, we show the proof for $n = 2$; the proof for a general $n$ is exactly the same. Assume then that $n = 2$. The key fact to use is that the interarrival times $T_1, T_2, T_3$ are iid exponential. Therefore, the joint density of $T_1, T_2, T_3$ is $\lambda^3 e^{-\lambda(t_1 + t_2 + t_3)},\ t_1, t_2, t_3 > 0$. Now make the linear transformation $Y_1 = T_1, Y_2 = T_1 + T_2, Y_3 = T_1 + T_2 + T_3$. This is a one-to-one transformation with a Jacobian equal to one. Therefore, by the Jacobian method (see Chapter 4), the joint density of $Y_1, Y_2, Y_3$ is

$$f_{Y_1, Y_2, Y_3}(y_1, y_2, y_3) = \lambda^3 e^{-\lambda y_3},$$

$0 < y_1 < y_2 < y_3$. Consequently,

$$P(Y_1 \leq u_1, Y_2 \leq u_2 \mid X(t) = 2) = \frac{P(Y_1 \leq u_1, Y_2 \leq u_2, Y_3 > t)}{P(X(t) = 2)} = \frac{\int_0^{u_1} \int_{y_1}^{u_2} \int_t^{\infty} \lambda^3 e^{-\lambda y_3}\, dy_3\, dy_2\, dy_1}{e^{-\lambda t}\, \frac{(\lambda t)^2}{2!}} = \frac{\lambda^2 e^{-\lambda t} \left(u_1 u_2 - \frac{u_1^2}{2}\right)}{e^{-\lambda t}\, \frac{(\lambda t)^2}{2}}$$

(by just doing the required integration in the numerator)

$$= \frac{2}{t^2}\left(u_1 u_2 - \frac{u_1^2}{2}\right).$$

Therefore,

$$\frac{\partial^2}{\partial u_1 \partial u_2}\, P(Y_1 \leq u_1, Y_2 \leq u_2 \mid X(t) = 2) = \frac{2}{t^2},$$

$0 < u_1 < u_2 < t$, which is what we wanted to prove. ∎
Example 13.7 (E-Mails). Suppose that between 9:00 AM and 5:00 PM, you receive e-mails at the average constant rate of one per 10 minutes. You left for lunch at 12 noon, and when you returned at 1:00 PM, you found nine new e-mails waiting for you. We answer a few questions based on the complete randomness property, assuming that your e-mails arrive according to a Poisson process with rate $\lambda = 6$, using an hour as the unit of time.

First, what is the probability that the fifth e-mail arrived before 12:30? From the complete randomness property, given that $X(t) = 9$, with $t = 1$, the arrival times of the nine e-mails are jointly distributed as the order statistics of $n = 9$ iid $U[0, 1]$ variables. Hence, given that $X(t) = 9$, $Y_5$, the time of the fifth arrival, has the Beta(5, 5) density. From the symmetry of the Beta(5, 5) density, it immediately follows that $P(Y_5 \leq .5 \mid X(t) = 9) = .5$. So, there is a 50% probability that the fifth e-mail arrived before 12:30.

Next, what is the probability that at least three e-mails arrived after 12:45? This probability is the same as $P(Y_7 > .75 \mid X(t) = 9)$. Once again, this follows from a Beta distribution calculation, because given that $X(t) = 9$, $Y_7$ has the Beta(7, 3) density. Hence, the required probability is

$$\frac{\Gamma(10)}{\Gamma(7)\Gamma(3)} \int_{.75}^{1} x^6 (1-x)^2\, dx = .3993.$$

Finally, what is the expected gap between the times that the third and the seventh e-mail arrived? That is, what is $E(Y_7 - Y_3 \mid X(t) = 9)$? Because $E(Y_7 \mid X(t) = 9) = \frac{7}{10} = .7$, and $E(Y_3 \mid X(t) = 9) = \frac{3}{10} = .3$, we have that $E(Y_7 - Y_3 \mid X(t) = 9) = .7 - .3 = .4$. Hence the expected gap between the arrival of the third and the seventh e-mail is 24 minutes.
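All three answers in this example reduce to routine Beta computations; a sketch (assuming SciPy):

```python
from scipy.stats import beta

print(beta.cdf(0.5, 5, 5))                 # P(Y_5 <= 1/2 | X(1) = 9) = .5
print(beta.sf(0.75, 7, 3))                 # P(Y_7 > 3/4 | X(1) = 9) = .3993
print(beta.mean(7, 3) - beta.mean(3, 7))   # E(Y_7 - Y_3 | X(1) = 9) = .4
```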

13.4 * Linear Poisson Process and Brownian Motion: A Connection

The model in the compound Poisson process example in the previous section can be regarded as a model for displacement of a particle in a medium subject to collisions with other particles or molecules of the surrounding medium. Recall that this was essentially Einstein's derivation of Brownian motion for the movement of a physical particle in a fluid or gaseous medium.

How does the model of the compound Poisson process example become useful in such a context, and what exactly is the connection to Brownian motion? Suppose that a particle immersed in a medium experiences random collisions with other particles or the medium's molecules at random times, which are the event times of a homogeneous Poisson process $X(t)$ with constant rate $\lambda$. Assume moreover that at each collision, our particle moves a distance of $a$ units linearly to the right, or $a$ units linearly to the left, with an equal probability. In other words, the sequence of displacements caused by the successive collisions forms an iid sequence of random variables $W_1, W_2, \ldots$, with $P(W_i = \pm a) = \frac{1}{2}$. Then, the total displacement of the particle up to a given time $t$ equals $W(t) = \sum_{i=1}^{X(t)} W_i$, with the empty sum $W(0)$ being zero. If we increase the rate $\lambda$ of collisions and simultaneously decrease the displacement $a$ caused by a single collision in just the right way, then the $W(t)$ process is approximately a one-dimensional Brownian motion. That is the connection to Brownian motion in this physical model for how a particle moves when immersed in a medium.

We provide an explanation for the emergence of Brownian motion in such a model. First, recall from our calculations for the compound Poisson example that the mean of $W(t)$ as defined above is zero for all $t$, and $\mathrm{Var}(W(t)) = E[W(t)^2] = \lambda a^2 t$. Therefore, to have any chance of approximating the $W(t)$ process by a Brownian motion, we should let $\lambda a^2$ converge to some finite constant $\sigma^2$. We now look at the characteristic function of $W(t)$ for any fixed $t$, and show that if $a \to 0,\ \lambda \to \infty$ in such a way that $\lambda a^2 \to \sigma^2$, then the characteristic function of $W(t)$ itself converges to the characteristic function of the $N(0, \sigma^2 t)$ distribution. This is one step towards showing that the $W(t)$ process mimics a Brownian motion with zero drift and diffusion coefficient $\sigma$ if $\lambda a^2 \to \sigma^2$.

The characteristic function calculation is exactly similar to the generating function calculation that we did in our compound Poisson process example. Indeed, the characteristic function of $W(t)$ equals $\varphi_{W(t)}(s) = e^{\lambda t[\varphi_W(s) - 1]}$, where $\varphi_W(s)$ denotes the common characteristic function of $W_1, W_2, \ldots$, the sequence of displacements of the particle. Therefore, by a Taylor expansion of $\varphi_W(s)$ (see Chapter 8),

$$\varphi_{W(t)}(s) = e^{\lambda t\left[1 + isE(W_1) - \frac{s^2}{2}E(W_1^2) - 1 + \theta \frac{|s|^3}{6} E(|W_1|^3)\right]}$$

(for some $\theta$ with $|\theta| \leq 1$)

$$= e^{\lambda t\left[-\frac{s^2}{2}a^2 + \theta \frac{|s|^3}{6} a^3\right]} = e^{-\frac{s^2 \sigma^2 t}{2}}\, (1 + o(1)),$$

if $a \to 0,\ \lambda \to \infty$, and $\lambda a^2 \to \sigma^2$. Therefore, for any $s$, the characteristic function of $W(t)$ converges to that of the $N(0, \sigma^2 t)$ distribution in this specific asymptotic paradigm, namely when $a \to 0,\ \lambda \to \infty$, and $\lambda a^2 \to \sigma^2$. This is clearly not enough to assert that the process $W(t)$ itself is approximately a Brownian motion. We need to establish that the $W(t)$ process has independent increments, that the increments $W(t) - W(s)$ have distribution depending only on the difference $t - s$, that for each $t$, $E(W(t)) = 0$, and $W(0) = 0$. Of these, the last two do not need a proof, as they are obvious. The independent increments property follows from the independent increments property of the Poisson process $X(t)$ and the independence of the $W_i$ sequence. The stationarity of the increments follows by a straight calculation of the characteristic function of $W(t) - W(s)$, and by explicitly exhibiting that it is a function of $t - s$. Indeed, writing $D = X(t) - X(s)$,

$$E\left[e^{iu[W(t) - W(s)]}\right] = \sum_{d=0}^{\infty} E\left[e^{iu[W(t) - W(s)]}\,\Big|\, D = d\right] \frac{e^{-\lambda(t-s)}\, (\lambda(t-s))^d}{d!} = e^{\lambda(t-s)[\varphi_W(u) - 1]},$$

on using the facts that conditional on $D = d$, $W(t) - W(s)$ is the sum of $d$ iid variables, which are jointly independent of the $X(t)$ process, and then on some algebra. This shows that the increments $W(t) - W(s)$ are stationary.
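The limit is easy to visualize by simulation: with $\lambda$ large and $a = \sigma/\sqrt{\lambda}$, the displacement $W(t)$ at a fixed time is close to $N(0, \sigma^2 t)$. A minimal sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, t, lam, n_paths = 1.0, 2.0, 10_000.0, 50_000
a = sigma / np.sqrt(lam)        # chosen so that lam * a**2 = sigma**2

n_coll = rng.poisson(lam * t, size=n_paths)      # collisions up to time t
# net displacement: a * (#right - #left), with #right ~ Bin(n, 1/2)
w_t = np.array([a * (2 * rng.binomial(n, 0.5) - n) for n in n_coll])

print(w_t.mean(), w_t.var())    # close to 0 and sigma**2 * t = 2.0
```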

13.5 Higher-Dimensional Poisson Point Processes

We remarked in the introduction that Poisson processes are also quite commonly used to model a random scatter of points in a planar area or in space, such as trees in a forest, or galaxies in the universe. In the case of the one-dimensional Poisson process, there was a random sequence of points too, say $\Pi$, and these were just the arrival times of the events. We then considered how many elements of $\Pi$ belonged to a test set, such as an interval $[s, t]$. The number of elements in any particular interval was Poisson with mean depending on the length of the interval, and the numbers of elements in disjoint intervals were independent. In higher dimensions, say $\mathbb{R}^d$, we similarly have a random countable set of points $\Pi$ of $\mathbb{R}^d$, and we consider how many elements of $\Pi$ belong to a suitable test set, for example, a $d$-dimensional rectangle $A = \prod_{i=1}^{d} [a_i, b_i]$. Poisson processes would now have the properties that the counts over disjoint test sets would be independent, and the count of any particular test set $A$, namely $N(A) = \#\{x \in \Pi : x \in A\}$, will have a Poisson distribution with some appropriate mean depending on the size of $A$.

Definition 13.1. A random countable set $\Pi \subseteq \mathbb{R}^d$ is called a $d$-dimensional Poisson point process with intensity or mean measure $\mu$ if

(a) For a general set $A \subseteq \mathbb{R}^d$, $N(A) \sim \text{Poi}(\mu(A))$, where the set function $\mu(A) = \int_A \lambda(x)\,dx$ for some fixed nonnegative function $\lambda(x)$ on $\mathbb{R}^d$, with the property that $\mu(A) < \infty$ for any bounded set $A$.
(b) For any $n \geq 2$, and any $n$ disjoint rectangles $A_j = \prod_{i=1}^{d} [a_{i,j}, b_{i,j}]$, the random variables $N(A_1), N(A_2), \ldots, N(A_n)$ are mutually independent.

Remark. If $\lambda(x) \equiv \lambda > 0$, then $\mu(A) = \lambda\,[\mathrm{vol}(A)]$, and in that case the Poisson process is called stationary or homogeneous with the constant intensity $\lambda$. Notice that this coincides with the definition of a homogeneous Poisson process in one dimension that was given in the previous sections, because in one dimension volume would simply correspond to length.

According to the definition, given a set $A \subseteq \mathbb{R}^d$, $N(A) \sim \text{Poi}(\mu(A))$, and therefore, the probability that a given set $A$ contains no points of the random countable set $\Pi$ is $P(N(A) = 0) = e^{-\mu(A)}$. It is clear, therefore, that if we know just these void probabilities $P(N(A) = 0)$, then the intensity measure and the distribution of the full process are completely determined. Indeed, one could use this as a definition of a $d$-dimensional Poisson process.

Definition 13.2. A random countable set $\Pi$ in $\mathbb{R}^d$ is a Poisson process with intensity measure $\mu$ if for each bounded $A \subseteq \mathbb{R}^d$, $P(\Pi \cap A = \emptyset) = e^{-\mu(A)}$.

Remark. This definition is saying that if the scientist understands the void probabilities, then she understands the entire Poisson process.

It is not obvious that a Poisson process with a given intensity measure exists. This requires a careful proof, within the rigorous measure-theoretic paradigm of probability. We only state the existence theorem here. A proof may be seen in many places, for instance, Kingman (1993). We do need some restrictions on the intensity measure for an existence result. The restrictions we impose are not the weakest possible, but we choose simplicity of the existence theorem over the greatest generality.

Definition 13.3. Suppose $\lambda(x)$ is a nonnegative function on $\mathbb{R}^d$ with the property that for any bounded set $A$ in $\mathbb{R}^d$, $\int_A \lambda(x)\,dx < \infty$. Then a Poisson process with the intensity measure $\mu$ defined by $\mu(A) = \int_A \lambda(x)\,dx$ exists.

Example 13.8 (Distance to the Nearest Event Site). Consider a Poisson process $\Pi$ in $\mathbb{R}^d$ with intensity measure $\mu$, and fix a point $\tilde{x} \in \mathbb{R}^d$. We are interested in $D = D(\tilde{x}, \Pi)$, the Euclidean distance from $\tilde{x}$ to the nearest element of the Poisson process $\Pi$. The survival function of $D$ is easily calculated. Indeed, if $B(\tilde{x}, r)$ is the ball of radius $r$ centered at $\tilde{x}$, then

$$P(D > r) = P(N(B(\tilde{x}, r)) = 0) = e^{-\mu(B(\tilde{x}, r))}.$$

In particular, if $\Pi$ is a homogeneous Poisson process with constant intensity $\lambda > 0$, then

$$P(D > r) = e^{-\frac{\lambda \pi^{d/2} r^d}{\Gamma(\frac{d}{2} + 1)}}.$$

Differentiating, the density of $D$ is

$$f_D(r) = \frac{d\, \lambda\, \pi^{d/2}\, r^{d-1}}{\Gamma(\frac{d}{2} + 1)}\, e^{-\frac{\lambda \pi^{d/2} r^d}{\Gamma(\frac{d}{2} + 1)}}.$$

Or, equivalently, $\frac{\lambda \pi^{d/2}}{\Gamma(\frac{d}{2} + 1)}\, D^d$ has a standard exponential density. In the special case of two dimensions, this means that $\lambda \pi D^2$ has a standard exponential distribution. This result is sometimes used to statistically test whether a random countable set in some specific application is a homogeneous Poisson process.
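In two dimensions, the fact that $\lambda \pi D^2$ is standard exponential is easy to check by simulating a homogeneous Poisson process on a large square (a sketch; the window must be large enough that the boundary does not matter at the test point):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, L, n_reps = 5.0, 10.0, 20_000   # intensity, side of square, replications

vals = np.empty(n_reps)
center = np.array([L / 2, L / 2])
for i in range(n_reps):
    n = rng.poisson(lam * L * L)                 # number of points in [0, L]^2
    pts = rng.uniform(0, L, size=(n, 2))         # given n, points are uniform
    d = np.sqrt(((pts - center) ** 2).sum(axis=1)).min()
    vals[i] = lam * np.pi * d ** 2

print(vals.mean(), vals.var())   # both close to 1 (standard exponential)
```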

Example 13.9 (Multinomial Distribution and Poisson Process). Suppose the intensity measure $\mu$ of a Poisson process is finite; that is, $\mu(\mathbb{R}^d) < \infty$. In that case, the cardinality of $\Pi$ itself, that is, $N(\mathbb{R}^d)$, is finite with probability one, because $N(\mathbb{R}^d)$ has a Poisson distribution with the finite mean $M = \mu(\mathbb{R}^d)$. Suppose in a particular realization, the total number of events $N(\mathbb{R}^d) = n$ for some finite $n$. We want to know how these $n$ events are distributed among the members of a partition $A_1, A_2, \ldots, A_k$ of $\mathbb{R}^d$. It turns out that the joint distribution of $N(A_1), N(A_2), \ldots, N(A_k)$ is a multinomial distribution.

Indeed, given $n_1, n_2, \ldots, n_k \geq 0$ such that $\sum_{i=1}^{k} n_i = n$,

$$P(N(A_1) = n_1, N(A_2) = n_2, \ldots, N(A_k) = n_k \mid N(\mathbb{R}^d) = n) = \frac{P(N(A_1) = n_1, N(A_2) = n_2, \ldots, N(A_k) = n_k)}{P(N(\mathbb{R}^d) = n)}$$
$$= \frac{\prod_{i=1}^{k} \frac{e^{-\mu(A_i)}\, (\mu(A_i))^{n_i}}{n_i!}}{\frac{e^{-M} M^n}{n!}} = \frac{n!}{\prod_{i=1}^{k} n_i!} \prod_{i=1}^{k} \left(\frac{\mu(A_i)}{M}\right)^{n_i},$$

which shows that conditional on $N(\mathbb{R}^d) = n$, $(N(A_1), N(A_2), \ldots, N(A_k))$ is jointly distributed as a multinomial with parameters $(n; p_1, p_2, \ldots, p_k)$, where $p_i = \frac{\mu(A_i)}{M}$.

13.5.1 The Mapping Theorem

If a random set $\Pi$ is a Poisson process in $\mathbb{R}^d$ with some intensity measure, and if the points are mapped into some other space by a transformation $f$, then the mapped points will often form another Poisson process in the new space with a suitable new intensity measure. This is useful, because we are often interested in a scatter induced by an original scatter, and we can view the induced scatter as a mapping of a Poisson process.

If $\Pi$ is a Poisson process with intensity measure $\mu$, and $f : \mathbb{R}^d \to \mathbb{R}^k$ is a mapping, denote the image of $\Pi$ under $f$ by $\Pi^*$; that is, $\Pi^* = f(\Pi)$. Let $f^{-1}(A) = \{x \in \mathbb{R}^d : f(x) \in A\}$. Then,

$$N^*(A) = \#\{y \in \Pi^* : y \in A\} = \#\{x \in \Pi : x \in f^{-1}(A)\} = N(f^{-1}(A)).$$

Therefore, $N^*(A) \sim \text{Poi}(\mu(f^{-1}(A)))$. If we denote $\mu^*(A) = \mu(f^{-1}(A))$, then $\mu^*(A)$ acts as the intensity measure of the $\Pi^*$ process. However, in order for $\Pi^*$ to be a Poisson process, we also need to show the independence property for disjoint sets $A_1, A_2, \ldots, A_n$, for any $n \geq 2$, and we need to ensure that singleton sets do not carry positive weight under the new intensity measure $\mu^*$. The independence over disjoint sets is inherited from the independence over disjoint sets for the original Poisson process $\Pi$; the requirement that $\mu^*(\{y\})$ should be zero for any singleton set $\{y\}$ is usually verifiable on a case-by-case basis in applications. The exact statement of the mapping theorem then says the following.

Theorem 13.4. Let $d, k \geq 1$, and let $f : \mathbb{R}^d \to \mathbb{R}^k$ be a transformation. Suppose $\Pi \subseteq \mathbb{R}^d$ is a Poisson process with the intensity measure $\mu$, and let $\mu^*$ be defined by the relation $\mu^*(A) = \mu(f^{-1}(A)),\ A \subseteq \mathbb{R}^k$. Assume that $\mu^*(\{y\}) = 0$ for all $y \in \mathbb{R}^k$. Then the transformed set $\Pi^* \subseteq \mathbb{R}^k$ is a Poisson process with the intensity measure $\mu^*$.
An important immediate consequence of the mapping theorem is that lower-dimensional projections of a Poisson process are Poisson processes in the corresponding lower dimensions.

Corollary. Let $\Pi$ be a Poisson process in $\mathbb{R}^d$ with the intensity measure $\mu$ given by $\mu(A) = \int_A \lambda(x_1, x_2, \ldots, x_d)\,dx_1 dx_2 \cdots dx_d$. Then, for any $p < d$, the projection of $\Pi$ to $\mathbb{R}^p$ is a Poisson process with the intensity measure $\mu_p(B) = \int_B \lambda_p(x_1, \ldots, x_p)\,dx_1 \cdots dx_p$, where the intensity function $\lambda_p(x_1, \ldots, x_p)$ is defined by $\lambda_p(x_1, \ldots, x_p) = \int_{\mathbb{R}^{d-p}} \lambda(x_1, x_2, \ldots, x_d)\,dx_{p+1} \cdots dx_d$.

In particular, we have the useful result that all projections of a homogeneous Poisson process are homogeneous Poisson processes in the corresponding lower dimensions.

13.6 One-Dimensional Nonhomogeneous Processes

Nonhomogeneous Poisson processes are clearly important from the viewpoint of applications. But an additional fact of mathematical convenience is that the mapping theorem implies that a nonhomogeneous Poisson process in one dimension can be transformed into a homogeneous Poisson process. As a consequence, various formulas and results for the nonhomogeneous case can be derived rather painlessly, by simply applying the mapping technique suitably.

Suppose $X(t),\ t \geq 0$ is a Poisson process on the real line with intensity function $\lambda(x)$, and intensity measure $\mu$, so that for any given $t$, $\mu([0, t]) = \int_0^t \lambda(x)\,dx$. We denote $\mu([0, t])$ as $\Lambda(t)$. Thus, $\Lambda(t)$ is a nonnegative, continuous, and nondecreasing function for $t \geq 0$, with $\Lambda(0) = 0$.

We show how this general Poisson process can be transformed to a homogeneous one by simply changing the clock. For this, think of the Poisson process as a random countable set and simply apply the mapping theorem. Using the same notation as in the mapping theorem, consider the particular mapping $f(x) = \Lambda(x)$. Then, by the mapping theorem, $\Pi^* = \Lambda(\Pi)$ is another Poisson process with the intensity measure $\mu^*([0, t]) = \mu(f^{-1}([0, t])) = \Lambda(\Lambda^{-1}(t)) = t$. In other words, mapping the original Poisson process by using the transformation $f(x) = \Lambda(x)$ converts the general Poisson process with the intensity measure $\mu$ into a homogeneous Poisson process with constant intensity equal to one. An equivalent way to state this is to say that if we define a new process $Z(t)$ as $Z(t) = X(\Lambda^{-1}(t)),\ t \geq 0$, then $Z(t)$ is a homogeneous Poisson process with constant rate (intensity) equal to one. This amounts to simply measuring time according to a new clock.
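Conversely, the same clock change gives a practical recipe for simulating a nonhomogeneous process: generate unit-rate arrival times $S_1 < S_2 < \cdots$ and set $Y_n = \Lambda^{-1}(S_n)$. A minimal sketch (the choice $\lambda(x) = x$, so that $\Lambda(t) = t^2/2$ and $\Lambda^{-1}(s) = \sqrt{2s}$, is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def nonhomogeneous_arrivals(Lambda_inv, t_max, n_unit_arrivals):
    """Arrival times on [0, t_max] via Y_n = Lambda_inv(S_n), where the S_n
    are arrival times of a rate-one homogeneous Poisson process."""
    s = np.cumsum(rng.exponential(size=n_unit_arrivals))
    y = Lambda_inv(s)
    return y[y <= t_max]

arr = nonhomogeneous_arrivals(lambda s: np.sqrt(2 * s), t_max=10.0,
                              n_unit_arrivals=200)
print(len(arr))   # Poisson with mean Lambda(10) = 50
```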
This transformation result is useful in deriving important distributional properties for a general nonhomogeneous Poisson process on the real line. We first need some notation. Let $T_{i,X},\ i = 1, 2, \ldots$ denote the sequence of interarrival times for our Poisson process $X(t)$ with intensity function $\lambda(x)$, and let $T_{i,Z}$ denote the sequence of interarrival times for the transformed process $Z(t) = X(\Lambda^{-1}(t))$. Let also $Y_{n,X}$ be the $n$th arrival time in the $X(t)$ process, and $Y_{n,Z}$ the $n$th arrival time in the $Z(t)$ process.

Consider the simplest distributional question, namely what is the distribution of $T_{1,X}$. By applying the mapping theorem,

$$P(T_{1,X} > t) = P(T_{1,Z} > \Lambda(t)) = e^{-\Lambda(t)}.$$

Therefore, by differentiation, $T_{1,X}$ has the density

$$f_{T_{1,X}}(t) = \lambda(t)\, e^{-\Lambda(t)},\quad t > 0.$$

Similarly, once again by applying the mapping theorem,

$$P(T_{2,X} > t \mid T_{1,X} = s) = P(\Lambda^{-1}(Y_{2,Z}) > s + t \mid \Lambda^{-1}(Y_{1,Z}) = s)$$
$$= P(Y_{2,Z} > \Lambda(t+s) \mid T_{1,Z} = \Lambda(s)) = P(T_{2,Z} > \Lambda(t+s) - \Lambda(s) \mid T_{1,Z} = \Lambda(s))$$
$$= P(T_{2,Z} > \Lambda(t+s) - \Lambda(s)) = e^{-[\Lambda(s+t) - \Lambda(s)]}.$$

By differentiating, the conditional density of $T_{2,X}$ given $T_{1,X} = s$ is

$$f_{T_{2,X} \mid T_{1,X}}(t \mid s) = \lambda(s+t)\, e^{-[\Lambda(s+t) - \Lambda(s)]}.$$

The mapping theorem similarly leads to the main distributional results for the general nonhomogeneous Poisson process on the real line, which are collected in the theorem below.

Theorem 13.5. Let $X(t),\ t \geq 0$ be a Poisson process with the intensity function $\lambda(x)$. Let $\Lambda(t) = \int_0^t \lambda(x)\,dx$. Let the interarrival times be $T_i,\ i \geq 1$, and the arrival times $Y_i,\ i \geq 1$. Then,

(a) For $n \geq 1$, $Y_n$ has the density
$$f_{Y_n}(t) = \frac{\lambda(t)\, e^{-\Lambda(t)}\, (\Lambda(t))^{n-1}}{(n-1)!},\quad t > 0.$$
(b) For $n \geq 2$, the conditional density of $T_n$ given $Y_{n-1} = w$ is
$$f_{T_n \mid Y_{n-1}}(t \mid w) = \lambda(t+w)\, e^{-[\Lambda(t+w) - \Lambda(w)]},\quad t, w > 0.$$
(c) $T_1$ has the density
$$f_{T_1}(t) = \lambda(t)\, e^{-\Lambda(t)},\quad t > 0.$$
(d) For $n \geq 2$, $T_n$ has the marginal density
$$f_{T_n}(t) = \frac{1}{(n-2)!} \int_0^{\infty} \lambda(w)\, \lambda(t+w)\, e^{-\Lambda(t+w)}\, (\Lambda(w))^{n-2}\, dw.$$

Example 13.10 (Piecewise Linear Intensity). Suppose printing jobs arrive at a computer printer according to the piecewise linear periodic intensity function

$$\lambda(x) = x,\ \text{if } 0 \leq x \leq .1;\quad = .1,\ \text{if } .1 \leq x \leq .5;\quad = .2(1 - x),\ \text{if } .5 \leq x \leq 1.$$

We assume that the unit of time is a day, so that for the first few hours in the morning the arrival rate increases steadily, and then for a few hours it remains constant. As the day winds up, the arrival rate decreases steadily. The intensity function is periodic with a period equal to one.

By direct integration, in the interval $[0, 1]$,

$$\Lambda(t) = \frac{t^2}{2},\ \text{if } t \leq .1;\quad = .005 + .1(t - .1),\ \text{if } .1 \leq t \leq .5;$$
$$= .045 + .2\left(t - \frac{t^2}{2} - .375\right) = .2\left(t - \frac{t^2}{2}\right) - .03,\ \text{if } .5 \leq t \leq 1.$$

The intensity function $\lambda(x)$ and the mean function $\Lambda(t)$ are plotted in Figs. 13.2 and 13.3. Note that $\Lambda(t)$ would have been a linear function if the process were homogeneous.

Fig. 13.2 Plot of the intensity function $\lambda(x)$

Fig. 13.3 Plot of the mean function $\Lambda(t)$



Suppose as an example that we want to know the probability that at least one printing job arrives at the printer in the first ten hours of a particular day. Because $\Lambda\left(\frac{10}{24}\right) = \Lambda(.4167) = .0367$, the number of jobs to arrive at the printer in the first ten hours has a Poisson distribution with mean $.0367$, and so the probability we seek equals $1 - e^{-.0367} = .0360$.
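The arithmetic above is easy to automate; here is a sketch that encodes the intensity and integrates it numerically (assuming SciPy for the quadrature):

```python
import numpy as np
from scipy.integrate import quad

def intensity(x):
    x = x % 1.0                  # periodic with period one (one day)
    if x <= 0.1:
        return x
    if x <= 0.5:
        return 0.1
    return 0.2 * (1 - x)

Lam, _ = quad(intensity, 0, 10 / 24)   # mean number of jobs in ten hours
print(Lam, 1 - np.exp(-Lam))           # about .0367 and .0360
```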

13.7 * Campbell's Theorem and Shot Noise

Suppose $f(y_1, y_2, \ldots, y_d)$ is a real-valued function of $d$ variables. Campbell's theorem tells us how to find the distribution and summary measures, such as the mean and the variance, of sums of the form $\sum_{Y \in \Pi} f(Y)$, where the countable set $\Pi$ is a Poisson process in $\mathbb{R}^d$. A more general form is a sum $\sum_{n=1}^{\infty} X_n f(Y_n)$, where $Y_1, Y_2, \ldots$ are the points of the Poisson process $\Pi$, and $X_1, X_2, \ldots$ is some suitable auxiliary sequence of real-valued random variables. Sums of these types occur in important problems in various applied sciences. An important example of this type is Nobel Laureate S. Chandrasekhar's description of the gravitational field in a Poisson stellar ensemble.

Other applications include the so-called shot effects in signal detection problems. Imagine radioactive rays hitting a counter at the event times of a Poisson process, $Y_1, Y_2, \ldots$. A ray hitting the counter at time $Y_i$ produces a lingering shot effect in the form of an electrical impulse, say $\omega(t - Y_i)$, the effect being a function of the time lag between the fixed instant $t$ and the event time $Y_i$. Then the total impulse produced is the sum $\sum_{i=1}^{\infty} \omega(t - Y_i)$, which is a function of the type described above.

We first need some notation and definitions. We follow Kingman (1993) closely. Varadhan (2007) also has an excellent treatment of the topic in this section. Given a function $f : \mathbb{R}^d \to \mathbb{R}$, and a Poisson process $\Pi$ in $\mathbb{R}^d$, let

$$\Sigma_f = \sum_{Y \in \Pi} f(Y);\quad F(y) = e^{-f(y)}.$$

For $f \geq 0$, let
$$\Phi(f) = E\left(e^{-\Sigma_f}\right).$$

Definition 13.4. Let $\Pi$ be a Poisson process in $\mathbb{R}^d$, and $f$ a nonnegative function on $\mathbb{R}^d$. Then $\Phi(f)$ is called the characteristic functional of $\Pi$.

The convergence of the sum $\Sigma_f$ is clearly an issue of importance. A necessary and sufficient condition for the convergence of $\Sigma_f$ is given in the theorem below. The trick is first to prove it for functions that take only a finite number of values and are of compact support (i.e., they are zero outside of a compact set), and then to approximate a general nonnegative function by functions of this type. See Kingman (1993) for the details; these proofs and the outline of our arguments below involve notions of integrals with respect to general measures. This is unavoidable, and some readers may want to skip directly to the statement of the theorem below.

The mean of $\Sigma_f$ can be calculated by essentially applying Fubini's theorem for measures. More specifically, if we define a counting function (measure) $N(A)$ as the number of points of $\Pi$ in (a general) set $A$, then the sum $\Sigma_f$ can be viewed as the integral of the function $f$ with respect to this random measure $N$. On the other hand, the mean of the random measure $N$ is the intensity measure $\mu$ of the Poisson process $\Pi$, in the sense $E[N(A)] = \mu(A)$ for any $A$. Then, heuristically, by interchanging the order of integration, $E(\Sigma_f) = E(\int f\,dN) = \int f\,d\mu$. This is a correct formula.

In fact, a similar approximation argument leads to an expression for the characteristic function of $\Sigma_f$, or more generally, for $E(e^{\theta \Sigma_f})$ for a general complex number $\theta$, when this expectation exists. Suppose our function $f$ takes only $k$ distinct values $f_1, f_2, \ldots, f_k$ on the sets $A_1, A_2, \ldots, A_k$. Then,

$$E\left(e^{\theta \Sigma_f}\right) = E\left(e^{\theta \sum_{j=1}^{k} f_j N(A_j)}\right) = \prod_{j=1}^{k} E\left[e^{\theta f_j N(A_j)}\right] = \prod_{j=1}^{k} e^{(e^{\theta f_j} - 1)\mu(A_j)}$$

(because the $N(A_j)$ are independent Poisson random variables with means $\mu(A_j)$),

$$= e^{\sum_{j=1}^{k} (e^{\theta f_j} - 1)\mu(A_j)} = e^{\sum_{j=1}^{k} \int_{A_j} (e^{\theta f(x)} - 1)\,d\mu(x)} = e^{\int_{\mathbb{R}^d} (e^{\theta f(x)} - 1)\,d\mu(x)}.$$

One now approximates a general nonnegative function $f$ by a sequence of such functions taking just a finite number of distinct values, and it turns out that the above formula for $E(e^{\theta \Sigma_f})$ holds generally, when this expectation is finite. This is the cream of what is known as Campbell's theorem.

If we take this formula, namely $E(e^{\theta \Sigma_f}) = e^{\int_{\mathbb{R}^d} (e^{\theta f(x)} - 1)\,d\mu(x)}$, and differentiate it twice with respect to $\theta$ (or simply expand the expression in powers of $\theta$ and look at the coefficient of $\theta^2$), then we get the second moment, and hence the variance of $\Sigma_f$, and it is seen to be $\mathrm{Var}(\Sigma_f) = \int_{\mathbb{R}^d} f^2(x)\,d\mu(x)$.

Theorem 13.6 (Campbell's Theorem). Let $\Pi$ be a Poisson process in $\mathbb{R}^d$ with intensity function $\lambda(x)$, and $f : \mathbb{R}^d \to \mathbb{R}$ a nonnegative function. Then,

(a) $\Sigma_f < \infty$ if and only if $\int_{\mathbb{R}^d} \min(1, f(x))\, \lambda(x)\,dx < \infty$.
(b) $E(\Sigma_f) = \int_{\mathbb{R}^d} f(x)\, \lambda(x)\,dx$, provided the integral exists.
(c) If $\int_{\mathbb{R}^d} f(x)\, \lambda(x)\,dx$ and $\int_{\mathbb{R}^d} f^2(x)\, \lambda(x)\,dx$ are both finite, then $\mathrm{Var}(\Sigma_f) = \int_{\mathbb{R}^d} f^2(x)\, \lambda(x)\,dx$.
(d) Let $\theta$ be any complex number for which $I(\theta) = \int_{\mathbb{R}^d} (1 - e^{\theta f(x)})\, \lambda(x)\,dx < \infty$. Then, $E(e^{\theta \Sigma_f}) = e^{-I(\theta)}$.
(e) The intensity function of $\Pi$ is completely determined by the characteristic functional $\Phi(f)$ on the class of functions $f$ that take a finite number of distinct values on $\mathbb{R}^d$.
(f) $E\left[\prod_{Y \in \Pi} F(Y)\right] = e^{-\int_{\mathbb{R}^d} [1 - F(x)]\, \lambda(x)\,dx}$.

For a formal proof of this theorem, see Kingman (1993, p. 28–31).
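Parts (b) and (c) are easily illustrated by Monte Carlo on a bounded window. A sketch for a homogeneous process of intensity $\lambda$ on $[0, 1]$ with $f(x) = x^2$, so that Campbell's theorem gives $E(\Sigma_f) = \lambda/3$ and $\mathrm{Var}(\Sigma_f) = \lambda/5$:

```python
import numpy as np

rng = np.random.default_rng(6)
lam, n_reps = 10.0, 100_000

sums = np.empty(n_reps)
for i in range(n_reps):
    n = rng.poisson(lam)          # number of points of Pi in [0, 1]
    x = rng.uniform(size=n)       # given n, the points are iid uniform
    sums[i] = (x ** 2).sum()      # Sigma_f with f(x) = x^2

print(sums.mean(), lam / 3)       # Campbell mean: about 3.333
print(sums.var(), lam / 5)        # Campbell variance: about 2.0
```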

13.7.1 Poisson Process and Stable Laws

Stable distributions, which we have previously encountered in Chapters 8 and 11, arise as the distributions of $\sum_{Y \in \Pi} f(Y)$ for suitable functions $f(y)$ when $\Pi$ is a one-dimensional homogeneous Poisson process with some constant intensity $\lambda$. We demonstrate this interesting connection below.

Example 13.11 (Shot Noise and Stable Laws). Consider the particular function $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) = c|x|^{-\beta}\, \mathrm{sgn}(x),\ 0 < \beta < 1$. We eventually take $\beta > \frac{1}{2}$. The integral $\int_{\mathbb{R}} \min(1, f(x))\,dx$ exists as a principal value, by taking the limit through $\int_{-M}^{M} \min(1, f(x))\,dx$. We evaluate the characteristic function of $\Sigma_f$, that is, we evaluate $E(e^{\theta \Sigma_f})$ for $\theta = iu$, where $u$ is real. We evaluate the characteristic function by using Campbell's theorem.

For this, we first evaluate $I(\theta)$ with $\theta = iu$. Because of the skew-symmetry of the function $f(x)$, only the cosine part in the characteristic function remains. In other words, if $\theta = iu,\ u > 0$, then

$$I(\theta) = \lambda \int_{-\infty}^{\infty} [1 - \cos(u f(x))]\,dx = 2\lambda \int_0^{\infty} [1 - \cos(u f(x))]\,dx$$
$$= 2\lambda \int_0^{\infty} [1 - \cos(u c x^{-\beta})]\,dx = \frac{2\lambda (uc)^{\frac{1}{\beta}}}{\beta} \int_0^{\infty} [1 - \cos(y)]\, y^{-\frac{1}{\beta} - 1}\,dy.$$

The integral $\int_0^{\infty} [1 - \cos(y)]\, y^{-\frac{1}{\beta} - 1}\,dy$ converges if and only if $\beta > \frac{1}{2}$, in which case denote $\int_0^{\infty} [1 - \cos(y)]\, y^{-\frac{1}{\beta} - 1}\,dy = k(\beta)$. Thus, for $\beta > \frac{1}{2},\ u > 0$, $I(\theta) = \gamma(\beta)\, u^{\frac{1}{\beta}}$ for some constant $\gamma(\beta)$. As a function of $u$, $I(\theta)$ is symmetric in $u$, and so ultimately we get that if $\beta > \frac{1}{2}$, then the characteristic function of $\Sigma_f$ is

$$E\left(e^{iu\Sigma_f}\right) = e^{-\gamma |u|^{\frac{1}{\beta}}},$$

for some positive constant $\gamma$. This is the characteristic function of a stable law with exponent $\alpha = \frac{1}{\beta} < 2$ (and, of course, $\alpha > 0$). We thus find here an interesting manner in which stable distributions arise from consideration of shot noises and Campbell's theorem.

Exercises

Exercise 13.1 (Poisson Process for Catching a Cold). Suppose that you catch a cold according to a Poisson process at the constant rate of once every three months.

(a) Find the probability that between the months of July and October, you will catch at least four colds.
(b) Find the probability that between the months of May and July, and also between the months of July and October, you will catch at least four colds.
(c) * Find the probability that you will catch more colds between the months of July and October than between the months of May and July.

Exercise 13.2 (Events up to a Random Time). Jen has two phones on her desk.
On one number, she receives internal calls according to a Poisson process at the rate
of one in two hours. On the other number, she receives external calls according to
a Poisson process at the rate of one per hour. Assume that the two processes run
independently.
(a) What is the expected number of external calls by the time the second internal
call arrives?
(b) * What is the distribution of the number of external calls by the time the second
internal call arrives?

Exercise 13.3 (An Example Due to Emanuel Parzen). Certain machine parts in a
factory fail according to a Poisson process at the rate of one in every six weeks. Two
such parts are in the factory’s inventory. The next supply will come in two months
(eight weeks). What is the probability that production will be stopped for a week or
more due to the lack of this particular machine part?

Exercise 13.4 (An Example Due to Emanuel Parzen). Customers arrive at a newspaper stall according to a Poisson process at the rate of one customer per two minutes.
(a) Find the probability that five minutes have passed since the last customer ar-
rived.
(b) Find the probability that five minutes have passed since the next to the last cus-
tomer arrived.

Exercise 13.5 (Spatial Poisson Process). Stars are distributed in a certain part of the sky according to a three-dimensional Poisson process with constant intensity $\lambda$. Find the mean and the variance of the separation between a particular star and the star nearest to it.

Exercise 13.6 (Compound Poisson Process). E-mails arrive from a particular friend according to a Poisson process at the constant rate of twice per week. When she writes, the number of lines in her e-mail is exponentially distributed with a mean of 10 lines. Find the mean, variance, and the generating function of $X(t)$, the total number of lines of correspondence received in $t$ weeks.

Exercise 13.7 (An Example Due to Emanuel Parzen: Cry Baby). A baby cries
according to a Poisson process at the constant rate of three times per 30 minutes.
The parents respond only to every third cry. Find the probability that 20 minutes
will elapse between two successive responses of the parents; that 60 minutes will
elapse between two successive responses of the parents.

Exercise 13.8 (Nonhomogeneous Poisson Process). Let $X(t)$ be a one-dimensional Poisson process with an intensity function $\lambda(x)$. Give a formula for $P(X(t) = k)$ for a general $k \geq 0$. Here $X(t)$ counts the number of arrivals in the time interval $[0, t]$.
Exercise 13.9 (Nonhomogeneous Poisson Process). A one-dimensional nonhomogeneous Poisson process has an intensity function $\lambda(x) = cx$ for some $c > 0$. Find the density of the $n$th interarrival time, and the $n$th arrival time.
Exercise 13.10 (Nonhomogeneous Poisson Process). A one-dimensional Poisson process has an intensity function $\lambda(x)$. Give a necessary and sufficient condition that the interarrival times are mutually independent.
Exercise 13.11 (Nonhomogeneous Poisson Process). A one-dimensional Poisson process has an intensity function $\lambda(x)$. Give a necessary and sufficient condition that $E(T_n)$, the mean of the $n$th interarrival time, remains bounded as $n \to \infty$.
Exercise 13.12 (Conditional Distribution of Interarrival Times). A one-dimensional Poisson process $X(t)$ has an intensity function $\lambda(x)$. Show that the conditional distribution of the $n$ arrival times given that $X(t) = n$ is still the same as the joint distribution of $n$ order statistics from a suitable density, say $g(x)$. Identify this $g$.
Exercise 13.13 (Projection of a Poisson Process). Suppose $\Pi$ is a Poisson point process in $\mathbb{R}^d$, with the intensity function $\lambda(x_1, x_2, \ldots, x_d) = e^{-\sum_{i=1}^d |x_i|}$.
(a) Are the projections of $\Pi$ to the individual dimensions also Poisson processes?
(b) If they are, find the intensity functions.
Exercise 13.14 (Polar Coordinates of a Poisson Point Process). Let $\Pi$ be a two-dimensional Poisson point process with a constant intensity function $\lambda(x_1, x_2) = \lambda$. Prove that the set of polar coordinates of the points of $\Pi$ form another Poisson point process, and identify its intensity function.
Exercise 13.15 (Distances from Origin of a Poisson Point Process). Let $\Pi$ be a two-dimensional Poisson point process with a constant intensity function $\lambda(x_1, x_2) = \lambda$. Consider the set of distances of the points of $\Pi$ from the origin. Is this another Poisson process? If it is, identify its intensity function.
Exercise 13.16 (Deletion of SPAM). Suppose that you receive e-mails according
to a Poisson process with a constant arrival rate of one per 10 minutes. Mutually
independently, each mail has a 20% probability of being deleted by your spam fil-
ter. Find the probability that the tenth mail after you first log in shows up in your
mailbox within two hours.
Exercise 13.17 (Production on Demand). Orders for a book arrive at a publisher’s
office according to a Poisson process with a constant rate of 4 per week. Production
for a received order is assumed to start immediately, and the actual time to produce
a copy is uniformly distributed between 5 and 10 days. Let X.t/ be the number of
orders in actual production at a given time t. Find the mean and the variance of X.t/.

Exercise 13.18 (Compound Poisson Process). An auto insurance agent receives


claims at the average constant rate of one per day. The claim amounts are iid, with an equal probability of being 250, 500, 1000, or 1500 dollars.
(a) Find the generating function of the total claim made in a 30-day period.
(b) Hence find the mean and the variance of the total claim made in a 30-day period.

Exercise 13.19 (Connection of a Poisson Process to Binomial Distribution).


Suppose $X(t), t \geq 0$ is a Poisson process with constant average rate $\lambda$. Given that $X(t) = n$, show that the number of events up to the time $u$, where $u < t$, has a binomial distribution. Identify the parameters of this binomial distribution.

Exercise 13.20 * (Inspection Paradox). The city bus arrives at a certain bus stop
at the constant average rate of once per 30 minutes. Suppose that you arrive at the
bus stop at a fixed time t. Find the probability that the time till the next bus is larger
than the time since the last bus.

Exercise 13.21 (Correlation in a Poisson Process). Suppose X.t/ is a Poisson


process with average constant arrival rate $\lambda$. Let $0 < s < t < \infty$. Find the correlation between $X(s)$ and $X(t)$.

Exercise 13.22 (A Strong Law). Suppose $X(t)$ is a Poisson process with constant arrival rate $\lambda$. Show that $\frac{X(t)}{t}$ converges almost surely to $\lambda$ as $t \to \infty$.

Exercise 13.23 (Two Poisson Processes). Suppose $X(t), Y(t), t \geq 0$ are two Poisson processes with rates $\lambda_1, \lambda_2$. Assume that the processes run independently.
(a) Prove or disprove: $X(t) + Y(t)$ is also a Poisson process.
(b) * Prove or disprove: $|X(t) - Y(t)|$ is also a Poisson process.

Exercise 13.24 (Two Poisson Processes; Continued). Suppose $X(t), Y(t), t \geq 0$ are two Poisson processes with rates $\lambda_1, \lambda_2$. Assume that the processes run independently.
(a) Find the probability that the first event in the X process occurs before the first
event in the Y process.
(b) * Find the density function, mean, and variance of the minimum of the first
arrival time in the X process and the first arrival time in the Y process.
(c) Find the distribution of the number of events in the Y process in the time interval
between the first and the second event in the X process.

Exercise 13.25 (Displaced Poisson Process). Let $\Pi$ be a Poisson point process on $\mathbb{R}^d$ with intensity function $\lambda(x)$ constantly equal to $\lambda$. Suppose the points $Y$ of $\Pi$ are randomly shifted to new points $Y + W$, where the shifts $W$ are iid, and are independent of $\Pi$. Let $\Pi^*$ denote the set of points $Y + W$.
(a) Is $\Pi^*$ also a Poisson point process?
(b) If it is, what is its intensity function?
462 13 Poisson Processes and Applications

Exercise 13.26 (The Midpoint Process). Let $\Pi$ be a one-dimensional Poisson point process. Consider the point process $\Pi^*$ of the midpoints of the points in $\Pi$.
(a) Is $\Pi^*$ also a Poisson point process?
(b) If it is, what is its intensity function?

Exercise 13.27 (kth Nearest Neighbor). Let $\Pi$ be a Poisson point process on $\mathbb{R}^d$ with intensity function $\lambda(x)$ constantly equal to $\lambda$. For $k \geq 1$, let $D_k$ be the distance from the origin of the $k$th nearest point of $\Pi$ to the origin. Find the density, mean, and variance of $D_k$.

Exercise 13.28 (Generalized Campbell's Theorem). Let $X(t)$ be a homogeneous Poisson process on the real line with constant arrival rate $\lambda$, and let $Y_i, i \geq 1$ be the arrival times in the $X(t)$ process. Let $S(t) = \sum_{i=1}^{\infty} Z_i\, \omega(t - Y_i), t \geq 0$, where $\omega(s) = e^{-cs} I_{s \geq 0}$ for some positive constant $c$, and $Z_i, i \geq 1$ is a positive iid sequence, which you may assume to be independent of the $X(t)$ process.
(a) Find the characteristic function of $S(t)$ for a fixed $t$.
(b) Hence, find the mean and variance of $S(t)$, when they exist.

Exercise 13.29 (Poisson Processes and Geometric Distribution). Suppose $X_i(t), i = 1, 2$ are two homogeneous Poisson processes, independent of each other, and with respective arrival rates $\lambda, \mu$. Show that the distribution of the number of arrivals $N$ in the $X_2(t)$ process between two successive arrivals in the $X_1(t)$ process is given by $P(N = k) = p(1-p)^k, k = 0, 1, 2, \ldots$ for some $0 < p < 1$. Identify $p$.

Exercise 13.30 (Nonhomogeneous Shot Noise). Let $X(t)$ be a one-dimensional Poisson process with the intensity function $\lambda(x)$. Let $S(t) = \sum_{i=1}^{\infty} \omega(t - Y_i)$, where $Y_i, i \geq 1$ are the arrival times of the $X(t)$ process, and $\omega(s) = 0$ for $s < 0$. Derive formulas for the characteristic function, mean, and variance of $S(t)$ for a given $t$.

References

Karlin, S. and Taylor, H.M. (1975). A First Course in Stochastic Processes, Academic Press,
New York.
Kingman, J.F.C. (1993). Poisson Processes, Oxford University Press, Oxford, UK.
Lawler, G. (2006). Introduction to Stochastic Processes, Chapman and Hall, New York.
Parzen, E. (1962). Stochastic Processes, Holden-Day, San Francisco.
Port, S. (1994). Theoretical Probability for Applications, Wiley, New York.
Chapter 14
Discrete Time Martingales and Concentration
Inequalities

For an independent sequence of random variables X1 ; X2 ; : : : ; the conditional


expectation of the present term of the sequence given the past terms is the same
as its unconditional expectation. Martingales let the conditional expectation depend
on the past terms, but in a special way. Thus, similar to Markov chains, martingales
act as natural models for incorporating dependence into a sequence of observed
data. But the value of the theory of martingales is much more than simply its modeling value. Martingales arise as natural byproducts of the mathematical analysis in an amazing variety of problems in probability and statistics. Therefore, results
from martingale theory can be immediately applied to all these situations in order
to make deep and useful conclusions about numerous problems in probability and
statistics. A particular modern set of applications of martingale methods is in the
area of concentration inequalities, which place explicit bounds on probabilities of
large deviations of functions of a set of variables from their mean values. This chap-
ter gives a glimpse into some important concentration inequalities, and explains
how martingale theory enters there. Martingales form a nearly indispensable tool
for probabilists and statisticians alike.
Martingales were introduced into the probability literature by Paul Lévy, who
was interested in finding situations beyond the iid case where the strong law of
large numbers holds. But its principal theoretical studies were done by Joseph Doob.
Two extremely lucid expositions on martingales are Doob (1971) and Heyde (1972).
Some other excellent references for this chapter are Karlin and Taylor (1975), Chung
(1974), Hall and Heyde (1980), Williams (1991), Karatzas and Shreve (1991),
Fristedt and Gray (1997), and Chow and Teicher (2003). Other references are pro-
vided in the sections.

14.1 Illustrative Examples and Applications in Statistics

We start with a simple example, which nevertheless captures the spirit of the idea of
a martingale sequence of random variables.


Example 14.1 (Gambler's Fortune). Consider a gambler repeatedly playing a fair game in a casino. Thus, a fair coin is tossed. If heads show, the player wins $1; if it is tails, the house wins $1. He plays repeatedly. Let $X_1, X_2, \ldots$ be the player's sequence of wins. Thus, the $X_i$ are iid with the common distribution $P(X_i = \pm 1) = \frac{1}{2}$. The player's fortune after $n$ plays is $S_n = S_0 + \sum_{i=1}^n X_i, n \geq 1$. If we take the player's initial fortune $S_0$ to be just zero, then $S_n = \sum_{i=1}^n X_i$. Suppose now the player has finished playing $n$ times, and he is looking ahead to what his fortune will be after he plays the next time. In other words, he wants to find $E(S_{n+1} \mid S_1, \ldots, S_n)$. But,

$$E(S_{n+1} \mid S_1, \ldots, S_n) = E(S_n + X_{n+1} \mid S_1, \ldots, S_n) = S_n + E(X_{n+1} \mid S_1, \ldots, S_n) = S_n + E(X_{n+1}) = S_n + 0 = S_n.$$

In the above, $E(X_{n+1} \mid S_1, \ldots, S_n)$ equals the unconditional expectation of $X_{n+1}$ because $X_{n+1}$ is independent of $(X_1, X_2, \ldots, X_n)$, and hence independent of $(S_1, \ldots, S_n)$.

Notice that the sequence of fortunes $S_1, S_2, \ldots$ is not an independent sequence. There is information in the past sequence of fortunes for predicting the current fortune. But the player's forecast for what his fortune will be after the next round of play is simply what his fortune is right now, no more and no less. This is basically what the martingale property means, and is the reason for equating martingales with fair games.
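A quick simulation makes this concrete. The Python sketch below (an illustration only, with arbitrary sample sizes) checks that, conditionally on the current fortune $S_n = s$, the average of $S_{n+1}$ is again $s$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate many independent sequences of fair +/-1 plays.
reps, n = 200000, 10
X = rng.choice([-1, 1], size=(reps, n + 1))
S = X.cumsum(axis=1)              # S[:, k] is the fortune after k+1 plays

# Empirically check E(S_{n+1} | S_n = s) = s for each attained value s.
Sn, Snext = S[:, n - 1], S[:, n]
for s in np.unique(Sn):
    mask = Sn == s
    print(s, round(Snext[mask].mean(), 3))   # should be close to s
```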
Here is the definition. Rigorous treatment of martingales requires use of measure theory. For the most part, our treatment avoids measure-theoretic terminology.

Definition 14.1. Let $X_n, n \geq 1$ be a sequence of random variables defined on a common sample space $\Omega$ such that $E(|X_n|) < \infty$ for all $n \geq 1$. The sequence $\{X_n\}$ is called a martingale adapted to itself if for each $n \geq 1$, $E(X_{n+1} \mid X_1, X_2, \ldots, X_n) = X_n$ with probability one.

The sequence $\{X_n\}$ is called a supermartingale if for each $n \geq 1$, $E(X_{n+1} \mid X_1, X_2, \ldots, X_n) \leq X_n$ with probability one. The sequence $\{X_n\}$ is called a submartingale if for each $n \geq 1$, $E(X_{n+1} \mid X_1, X_2, \ldots, X_n) \geq X_n$ with probability one.

Remark. We generally do not mention the adapted to itself qualification when that is indeed the case. It is sometimes useful to talk about the martingale property with respect to a different sequence of random variables. This concept is defined below, and Example 14.8 is an example of such a martingale sequence.

Note that $X_n$ is a submartingale if and only if $-X_n$ is a supermartingale, and that it is a martingale if and only if it is both a supermartingale and a submartingale. Also notice that for a martingale sequence $X_n$, $E(X_{n+m}) = E(X_n)$ for all $n, m \geq 1$; in other words, $E(X_n) = E(X_1)$ for all $n$.

Definition 14.2. Let $X_n, n \geq 1$ and $Y_n, n \geq 1$ be sequences of random variables defined on a common sample space $\Omega$ such that $E(|X_n|) < \infty$ for all $n \geq 1$. The sequence $\{X_n\}$ is called a martingale adapted to the sequence $\{Y_n\}$ if for each $n \geq 1$, $X_n$ is a function of $Y_1, \ldots, Y_n$, and $E(X_{n+1} \mid Y_1, Y_2, \ldots, Y_n) = X_n$ with probability one.

Some elementary examples are given first.

Example 14.2 (Partial Sums). Let $Z_1, Z_2, \ldots$ be independent zero mean random variables, and let $S_n$ denote the partial sum $\sum_{i=1}^n Z_i$. Then, clearly, $E(S_{n+1} \mid S_1, \ldots, S_n) = S_n + E(Z_{n+1} \mid S_1, \ldots, S_n) = S_n + E(Z_{n+1}) = S_n$, and so $\{S_n\}$ forms a martingale. More generally, if the common mean of the $Z_i$ is some number $\mu$, then $S_n - n\mu$ is a martingale.

Example 14.3 (Sums of Squares). Let $Z_1, Z_2, \ldots$ be iid $N(0,1)$ random variables, and let $X_n = (Z_1 + \cdots + Z_n)^2 - n = S_n^2 - n$, where $S_n = Z_1 + \cdots + Z_n$. Then,

$$E(X_{n+1} \mid X_1, \ldots, X_n) = E[(Z_1 + \cdots + Z_n)^2 + 2Z_{n+1}(Z_1 + \cdots + Z_n) + Z_{n+1}^2 \mid X_1, \ldots, X_n] - (n+1)$$
$$= X_n + n + 2(Z_1 + \cdots + Z_n)E(Z_{n+1} \mid X_1, \ldots, X_n) + E(Z_{n+1}^2 \mid X_1, \ldots, X_n) - (n+1) = X_n + n + 0 + 1 - (n+1) = X_n,$$

and so $\{X_n\}$ forms a martingale sequence.

Actually, we did not use the normality of the $Z_i$ at all, and the martingale property holds without the normality assumption. That is, if $Z_1, Z_2, \ldots$ are iid with mean zero and variance $\sigma^2$, then $S_n^2 - n\sigma^2$ is a martingale.
Example 14.4. Suppose $X_1, X_2, \ldots$ are iid $N(0,1)$ variables and $S_n = \sum_{i=1}^n X_i$. Because $S_n \sim N(0,n)$, its mgf is $E(e^{tS_n}) = e^{nt^2/2}$. Now let $Z_n = e^{tS_n - nt^2/2}$, where $t$ is a fixed real number. Then, $E(Z_{n+1} \mid Z_1, \ldots, Z_n) = e^{-(n+1)t^2/2} E(e^{tS_n} e^{tX_{n+1}} \mid S_n) = e^{-(n+1)t^2/2} e^{tS_n} e^{t^2/2} = Z_n$. Therefore, for any real $t$, the sequence $e^{tS_n - nt^2/2}$ forms a martingale.

Once again, a generalization beyond the normal case is possible; see the chapter exercises for a general result.

Example 14.5 (Matching Problem). Consider the matching problem. For example, suppose $N$ people, each wearing a hat, have gathered in a party, and at the end of the party the $N$ hats are returned to them at random. Those that get their own hats back then leave the room. The remaining hats are distributed among the remaining guests at random, and so on. The process continues until all the hats have been given away. Let $X_n$ denote the number of guests still present after the $n$th round of this hat returning process.

At each round, we expect one person to get his own hat back and leave the room. In other words, $E(X_n - X_{n+1}) = 1$ for all $n$. In fact, with a little calculation, we even have

$$E(X_{n+1} \mid X_1, \ldots, X_n) = E(X_{n+1} - X_n + X_n \mid X_1, \ldots, X_n) = E(X_{n+1} - X_n \mid X_1, \ldots, X_n) + X_n = -1 + X_n.$$

This immediately implies that $E(X_{n+1} + (n+1) \mid X_1, \ldots, X_n) = -1 + (n+1) + X_n = X_n + n$. Hence the sequence $\{X_n + n\}$ is a martingale.

Example 14.6 (Pólya's Urn). The Pólya urn scheme is defined as follows. Initially, an urn contains $a$ white and $b$ black balls, a total of $a+b$ balls. One ball is drawn at random from among all the balls in the urn. It, together with $c$ more balls of its color, is returned to the urn, so that after the first draw the urn has $a+b+c$ balls. This process is repeated.

Suppose $X_i$ is the indicator of the event $A_i$ that a white ball is drawn at the $i$th trial, and for given $n \geq 1$, $S_n = X_1 + \cdots + X_n$, which is the total number of times that a white ball has been drawn in the first $n$ trials. For the sake of notational simplicity, we take $c = 1$. Then, the proportion of white balls in the urn just after the $n$th trial has been completed is $R_n = \frac{a + S_n}{a + b + n}$.

Elementary arguments show that

$$P(X_{n+1} = 1 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{a + x_1 + \cdots + x_n}{a + b + n}.$$

Thus,

$$E(S_{n+1} \mid S_1, \ldots, S_n) = E(S_{n+1} \mid S_n) = S_n + \frac{a + S_n}{a + b + n}$$
$$\Rightarrow E(R_{n+1} \mid R_1, \ldots, R_n) = \frac{a}{a+b+n+1} + \frac{1}{a+b+n+1}\left[(a + b + n)R_n - a + R_n\right] = R_n.$$

We therefore have the interesting result that in the Pólya urn scheme, the sequence of proportions of white balls in the urn forms a martingale.
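This martingale property is easy to see empirically. The following Python sketch (with the illustrative values $a = 2$, $b = 3$, $c = 1$) verifies that $E(R_n)$ stays at $a/(a+b)$ for every $n$, even though individual urn paths wander:

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_proportions(a, b, n_draws):
    """One run of the Polya urn (c = 1): returns the proportion of
    white balls R_n after each of n_draws draws."""
    white, black = a, b
    props = []
    for _ in range(n_draws):
        if rng.random() < white / (white + black):
            white += 1
        else:
            black += 1
        props.append(white / (white + black))
    return props

a, b, n = 2, 3, 50
runs = np.array([polya_proportions(a, b, n) for _ in range(100000)])
# The martingale property forces E(R_n) = a/(a+b) for every n.
print(a / (a + b), runs[:, 0].mean(), runs[:, 24].mean(), runs[:, -1].mean())
```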

Example 14.7 (The Wright–Fisher Markov Chain). Consider the stationary Markov chain $\{X_n\}$ on the state space $\{0, 1, 2, \ldots, N\}$ with the one-step transition probabilities

$$p_{ij} = \binom{N}{j} \left(\frac{i}{N}\right)^j \left(1 - \frac{i}{N}\right)^{N-j}.$$

This is the Wright–Fisher chain in population genetics (see Chapter 10). We show that $X_n$ is a martingale adapted to itself. Indeed, by direct calculation,

$$E(X_{n+1} \mid X_1, \ldots, X_n) = E(X_{n+1} \mid X_n) = \sum_{j=0}^{N} j \binom{N}{j} \left(\frac{X_n}{N}\right)^j \left(1 - \frac{X_n}{N}\right)^{N-j} = N \cdot \frac{X_n}{N} = X_n.$$

Example 14.8 (Likelihood Ratios). Suppose $X_1, X_2, \ldots, X_n$ are iid with a common density function $f$, which is one of $f_0$ and $f_1$, two different density functions. The statistician is supposed to choose from the two densities $f_0, f_1$ the one that is truly generating the observed data $x_1, x_2, \ldots, x_n$. One therefore has the null hypothesis $H_0$ that $f = f_0$, and the alternative hypothesis that $f = f_1$. The statistician's decision is commonly based on the likelihood ratio

$$\Lambda_n = \prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}.$$

If $\Lambda_n$ is large for the observed data, then one concludes that the data values come from a high-density region of $f_1$ and a low-density region of $f_0$, and therefore concludes that the true $f$ generating the observed data is $f_1$.

Suppose now the null hypothesis is actually true; that is, truly, $X_1, X_2, \ldots$ are iid with the common density $f_0$. Now,

$$E_{f_0}[\Lambda_{n+1} \mid \Lambda_1, \ldots, \Lambda_n] = E_{f_0}\left[\Lambda_n \frac{f_1(X_{n+1})}{f_0(X_{n+1})} \,\Big|\, \Lambda_1, \ldots, \Lambda_n\right] = \Lambda_n E_{f_0}\left[\frac{f_1(X_{n+1})}{f_0(X_{n+1})} \,\Big|\, \Lambda_1, \ldots, \Lambda_n\right] = \Lambda_n E_{f_0}\left[\frac{f_1(X_{n+1})}{f_0(X_{n+1})}\right]$$

(because the sequence $X_1, X_2, \ldots$ is independent)

$$= \Lambda_n \int_{\mathbb{R}} \frac{f_1(x)}{f_0(x)} f_0(x)\,dx = \Lambda_n \int_{\mathbb{R}} f_1(x)\,dx = \Lambda_n \cdot 1 = \Lambda_n.$$

Therefore, the sequence of likelihood ratios forms a martingale under the null hypothesis (i.e., if the true $f$ is $f_0$).
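As a sanity check, the martingale property forces $E_{f_0}(\Lambda_n) = 1$ for every $n$. The Python sketch below (with the illustrative choice $f_0 = N(0,1)$, $f_1 = N(1,1)$, so that $\Lambda_n = e^{\sum_{i \leq n}(X_i - 1/2)}$) verifies this numerically; note that the Monte Carlo error grows quickly with $n$, since $\operatorname{Var}_{f_0}(\Lambda_n) = e^n - 1$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data generated under H0: f0 = N(0,1).  With f1 = N(1,1), the
# likelihood ratio is Lambda_n = exp(sum_{i<=n} (X_i - 1/2)).
reps, n = 500000, 8
X = rng.normal(0.0, 1.0, size=(reps, n))
Lambda = np.exp(np.cumsum(X - 0.5, axis=1))

# Martingale property under H0: E(Lambda_n) = E(Lambda_1) = 1 for all n.
print([round(Lambda[:, k].mean(), 2) for k in (0, 1, 3, 7)])
```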
Example 14.9 (Bayes Estimates). Suppose random variables $Y, X_1, X_2, \ldots$ are defined on a common sample space $\Omega$. For given $n \geq 1$, $(X_1, X_2, \ldots, X_n)$ has the joint conditional distribution $P_{\theta,n}$ given that $Y = \theta$. From a statistical point of view, $Y$ is supposed to stand for an unknown parameter, which is formally treated as a random variable, and $X^{(n)} = (X_1, X_2, \ldots, X_n)$ for some specific $n$, namely the actual sample size, is the data that the statistician has available to estimate the unknown parameter. The Bayes estimate of the unknown parameter is the posterior mean $E(Y \mid X^{(n)})$ (see Chapter 3).

Denote for each $n \geq 1$, $E(Y \mid X^{(n)}) = Z_n$. We show that $Z_n$ forms a martingale sequence with respect to the sequence $X^{(n)}$; that is, $E(Z_{n+1} \mid X^{(n)}) = Z_n$. However, this follows on simply observing that by the iterated expectation formula,

$$Z_n = E\left(Y \mid X^{(n)}\right) = E_{X_{n+1} \mid X^{(n)}}\left[E\left(Y \mid X^{(n)}, X_{n+1}\right)\right] = E\left(Z_{n+1} \mid X^{(n)}\right).$$
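For a concrete check, take a Beta(2,2) prior for $Y$ and iid Bernoulli($Y$) data (an illustrative prior-likelihood pair, not part of the general development), so that $Z_n = (2 + \sum_{i \leq n} X_i)/(4+n)$. The martingale property then forces $E(Z_n) = E(Y) = 1/2$ for every $n$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Beta(2,2) prior on Y = success probability; X_i | Y iid Bernoulli(Y).
# Posterior mean after n observations: Z_n = (2 + sum X_i) / (4 + n).
reps, n = 200000, 30
Y = rng.beta(2, 2, size=reps)
X = rng.random((reps, n)) < Y[:, None]
S = X.cumsum(axis=1)
Z = (2 + S) / (4 + np.arange(1, n + 1))

# E(Z_n) = E(Y) = 0.5 for every n; individual paths wander,
# but their mean stays put.
print([round(Z[:, k].mean(), 4) for k in (0, 9, 29)])
```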

Example 14.10 (Square of a Martingale). Suppose $X_n$, defined on some sample space $\Omega$, is a positive submartingale sequence. For simplicity, let us consider the case when it is adapted to itself. Thus, for any $n \geq 1$, $E(X_{n+1} \mid X_1, \ldots, X_n) \geq X_n$ (with probability one). Therefore, for any $n \geq 1$,

$$E(X_{n+1}^2 \mid X_1, \ldots, X_n) \geq [E(X_{n+1} \mid X_1, \ldots, X_n)]^2 \geq X_n^2.$$

Therefore, if we let $Z_n = X_n^2$, then $Z_n$ is a submartingale sequence.

If we inspect this example carefully, then we realize that we have only used a very special case of Jensen's inequality to establish the needed submartingale property for the $Z_n$ sequence. Furthermore, if the original $\{X_n\}$ sequence is a martingale, rather than a submartingale, then the positivity restriction on the $X_n$ is no longer necessary. Thus, by simply following the steps of this example, we in fact have the following simple but widely useful general result.

Theorem 14.1 (Convex Function Theorem). Let $X_n, n \geq 1$ be defined on a common sample space $\Omega$. Let $f$ be a convex function on $\mathbb{R}$, and let $Z_n = f(X_n)$.
(a) Suppose $\{X_n\}$ is a martingale adapted to some sequence $\{Y_n\}$. Then $\{Z_n\}$ is a submartingale adapted to $\{Y_n\}$.
(b) Suppose $\{X_n\}$ is a submartingale adapted to some sequence $\{Y_n\}$. Assume that $f$ is in addition nondecreasing. Then $\{Z_n\}$ is a submartingale adapted to $\{Y_n\}$.

14.2 Stopping Times and Optional Stopping

The optional stopping theorem is one of the most useful results in martingale theory.
It can be explained in gambling terms. Consider a gambler playing a fair game
repeatedly, so that her sequence of fortunes forms a martingale. One might think that
by gaining experience as the game proceeds, and by quitting at a cleverly chosen
opportune time based on the gambler’s experience, a fair game could be turned
into a favorable game. The optional stopping theorem says that this is in fact not
possible, if the gambler does not have unlimited time on her hands and the house

has limits on how much she can put up on the table. Mathematical formulation of the
optional stopping theorem requires use of stopping times, which were introduced in
Chapter 11 in the context of random walks. We redefine stopping times and give
additional examples below before introducing optional stopping.

14.2.1 Stopping Times

Definition 14.3. Let $X_1, X_2, \ldots$ be a sequence of random variables, all defined on a common sample space $\Omega$. Let $\tau$ be a nonnegative integer-valued random variable, also defined on $\Omega$. We call $\tau$ a stopping time adapted to the sequence $\{X_n\}$ if $P(\tau < \infty) = 1$, and if for each $n \geq 1$, $I_{\{\tau \leq n\}}$ is a function of only $X_1, X_2, \ldots, X_n$.

In other words, $\tau$ is a stopping time adapted to $\{X_n\}$ if for any $n \geq 1$, whether or not $\tau \leq n$ can be determined by only knowing $X_1, X_2, \ldots, X_n$, and provided that $\tau$ cannot be infinite with a positive probability.
We have seen some examples of stopping times in Chapter 11. We start with a
few more illustrative examples.
Example 14.11 (Sequential Tests in Statistics). Suppose to start with we have an infinite sequence of random variables $X_1, X_2, \ldots$ on a common sample space $\Omega$, and let $S_n$ denote the $n$th partial sum, $S_n = \sum_{i=1}^n X_i, n \geq 1$. The $X_n$ need not be independent. Fix numbers $-\infty < l < u < \infty$. Then $\tau$ defined as

$$\tau = \inf\{n : S_n < l \text{ or } S_n > u\},$$

and $\tau = \infty$ if $l \leq S_n \leq u$ for all $n \geq 1$, is a stopping time adapted to the sequence $\{S_n\}$.

A particular case of this arises in sequential testing of hypotheses in statistics. Suppose an original sequence $Z_1, Z_2, \ldots$ is iid from some density $f$, which equals either $f_0$ or $f_1$. Then, as we have seen above, the likelihood ratio is

$$\Lambda_n = \frac{\prod_{i=1}^n f_1(Z_i)}{\prod_{i=1}^n f_0(Z_i)}.$$

The Wald sequential probability ratio test (SPRT) continues sampling as long as $\Lambda_n$ remains between two specified numbers $a$ and $b$, $a < b$, and stops and decides in favor of $f_1$ or $f_0$ the first time $\Lambda_n > b$ or $< a$. If we denote $l = \log a, u = \log b$, then Wald's test waits till the first time $\log \Lambda_n = \sum_{i=1}^n \log \frac{f_1(Z_i)}{f_0(Z_i)} = \sum_{i=1}^n X_i$ (say) goes above $u$ or below $l$, and thus the sampling number of Wald's SPRT is a stopping time.
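The sketch below shows the SPRT sampling number as a stopping time in one concrete (purely illustrative) setting: testing $f_0 = N(0,1)$ against $f_1 = N(1,1)$, where the log likelihood ratio increments are $Z_i - 1/2$; the boundaries $a = 1/19, b = 19$ are arbitrary choices:

```python
import numpy as np
from math import log

rng = np.random.default_rng(5)

# SPRT sketch for f0 = N(0,1) vs f1 = N(1,1); the log likelihood
# ratio increments are X_i = log f1(Z_i) - log f0(Z_i) = Z_i - 1/2.
def sprt(a=1 / 19, b=19, theta_true=0.0, max_n=10000):
    l, u = log(a), log(b)
    s = 0.0
    for n in range(1, max_n + 1):
        z = rng.normal(theta_true, 1.0)
        s += z - 0.5
        if s >= u:
            return n, "decide f1"
        if s <= l:
            return n, "decide f0"
    return max_n, "no decision"

# The sample size is random: the sampling number is a stopping time.
print([sprt(theta_true=0.0) for _ in range(5)])
```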
Example 14.12 (Combining Stopping Times). This example shows a few ways that we can make new stopping times out of given ones. Suppose $\tau$ is a stopping time (adapted to some sequence $\{X_n\}$) and $n$ is a prespecified positive integer. Then $\tau_n = \min(\tau, n)$ is a stopping time (adapted to the same sequence). This is because

$$\{\tau_n \leq k\} = \{\tau \leq k\} \cup \{n \leq k\},$$

and therefore, $\tau$ being a stopping time adapted to $\{X_n\}$, for any given $k$, deciding whether $\tau_n \leq k$ requires the knowledge of only $X_1, \ldots, X_k$.

Suppose $\tau_1, \tau_2$ are both stopping times, adapted to some sequence $\{X_n\}$. Then $\tau_1 + \tau_2$ is also a stopping time adapted to the same sequence. To prove this, note that

$$\{\tau_1 + \tau_2 \leq k\} = \cup_{i=0}^{k} \cup_{j=0}^{i} \{\tau_1 = j, \tau_2 = i - j\} = \cup_{i=0}^{k} \cup_{j=0}^{i} A_{ij},$$

and whether any $A_{ij}$ occurs depends only on $X_1, \ldots, X_k$.


For the sake of reference, we collect a set of such facts about stopping times in
the next result. They are all easy to prove.

Theorem 14.2. (a) Let $\tau$ be a stopping time adapted to some sequence $\{X_n\}$. Then, for any given $n \geq 1$, $\min(\tau, n)$ is also a stopping time adapted to $\{X_n\}$.
(b) Let $\tau_1, \tau_2$ be stopping times adapted to $\{X_n\}$. Then each of $\tau_1 + \tau_2$, $\min(\tau_1, \tau_2)$, $\max(\tau_1, \tau_2)$ is a stopping time adapted to $\{X_n\}$.
(c) Let $\{\tau_k, k \geq 1\}$ be a countable family of stopping times, each adapted to $\{X_n\}$. Let

$$\tau = \inf_k \tau_k; \qquad \sigma = \sup_k \tau_k; \qquad \lambda = \lim_{k \to \infty} \tau_k,$$

where $\tau$, $\sigma$, and $\lambda$ are defined pointwise, and it is assumed that the limit $\lambda$ exists almost surely. Then each of $\tau$, $\sigma$, and $\lambda$ is a stopping time adapted to $\{X_n\}$.

14.2.2 Optional Stopping

The most significant payoff of introducing the concept of stopping times is the
optional stopping theorem. At the expense of using some potentially hard to verify
conditions, stronger versions of our statement of the optional stopping theorem can
be stated. We choose to opt for simplicity of the statement over greater generality,
and refer to more general versions (which are useful!). The main message of the op-
tional stopping theorem is that a gambler cannot convert a fair game into a favorable
one by using clever quitting strategies.

Theorem 14.3 (Optional Stopping Theorem). Let $\{X_n, n \geq 0\}$ be a submartingale adapted to some sequence $\{Y_n\}$, and $\tau$ a stopping time adapted to the same sequence. For $n \geq 0$, let $\tau_n = \min(\tau, n)$. Then $\{X_{\tau_n}\}$ is also a submartingale adapted to $\{Y_n\}$, and for each $n \geq 0$,

$$E(X_0) \leq E(X_{\tau_n}) \leq E(X_n).$$

In particular, if

$$\{X_n\} \text{ is a martingale}, \quad E(|X_\tau|) < \infty, \quad \text{and} \quad \lim_{n \to \infty} E(X_{\tau_n}) = E(X_\tau),$$

then
$$E(X_\tau) = E(X_0).$$

Remark. It is of course unsatisfactory to simply demand that $E(|X_\tau|) < \infty$ and $\lim_{n\to\infty} E(X_{\tau_n}) = E(X_\tau)$. What we need are simple sufficient conditions that a user can verify relatively easily. This is addressed following the proof of the above theorem.

Proof of Theorem. For simplicity, we give the proof only for the case when $\{X_n\}$ is adapted to itself. The main step involved is to notice the identity

$$W_n = X_{\tau_n} = \sum_{i=0}^{n-1} X_i I_{\{\tau = i\}} + X_n I_{\{\tau \geq n\}}, \qquad (*)$$

for all $n \geq 0$. It follows from this identity and the submartingale property of the $\{X_n\}$ sequence that

$$E(W_{n+1} \mid X_0, \ldots, X_n) = \sum_{i=0}^{n} E(X_i I_{\{\tau=i\}} \mid X_0, \ldots, X_n) + E(X_{n+1} I_{\{\tau > n\}} \mid X_0, \ldots, X_n)$$
$$= \sum_{i=0}^{n} X_i I_{\{\tau=i\}} + I_{\{\tau > n\}} E(X_{n+1} \mid X_0, \ldots, X_n) \geq \sum_{i=0}^{n} X_i I_{\{\tau=i\}} + X_n I_{\{\tau > n\}} = X_{\tau_n} = W_n.$$

Thus, as claimed, $W_n = X_{\tau_n}$ is a submartingale adapted to the original $\{X_n\}$ sequence. It follows that

$$E(X_{\tau_n}) = E(W_n) \geq E(W_0) = E(X_0).$$

To complete the proof of the theorem, we need the reverse inequality $E(W_n) \leq E(X_n)$. This too follows from the same identity $(*)$ given at the beginning of the proof of this theorem, and on using the additional inequality

$$E(X_n I_{\{\tau=i\}} \mid X_0, \ldots, X_i) = I_{\{\tau=i\}} E(X_n \mid X_0, \ldots, X_i) \geq I_{\{\tau=i\}} X_i,$$

because $\{X_n\}$ is a submartingale. If this bound on $X_i I_{\{\tau=i\}}$ is plugged into our basic identity $(*)$ above, the reverse inequality follows.

The remaining claim, when $\{X_n\}$ is in fact a martingale, follows immediately from the two inequalities $E(X_0) \leq E(W_n) \leq E(X_n)$. □

14.2.3 Sufficient Conditions for Optional Stopping Theorem

Easy examples show that the assertion $E(X_\tau) = E(X_0)$ for a martingale sequence $\{X_n\}$ cannot hold without some control on the stopping time $\tau$. We first provide a simple example where the assertion of the optional stopping theorem fails. In looking for such counterexamples, it is useful to construct the stopping time in a way that when we stop, the value of the stopped martingale is a constant; that is, $X_\tau$ is a constant.

Example 14.13 (An Example Where the Optional Stopping Theorem Fails). Consider again the gambling example, or what really is the simple symmetric random walk, with $X_i$ iid having the common distribution $P(X_i = \pm 1) = \frac{1}{2}$, and $S_n = \sum_{i=1}^n X_i, n \geq 1$. We define $S_0 = 0$. We know $S_n$ to be a martingale. Consider now the stopping time

$$\tau = \inf\{n > 0 : S_n = 1\}.$$

We know from Chapter 11 that the one-dimensional simple symmetric random walk is recurrent; thus, $P(\tau < \infty) = 1$. Note that $S_\tau = 1$, and so $E(S_\tau) = 1$. However, $E(S_0) = E(S_n) = 0$. So the assertion of the optional stopping theorem does not hold.

What is going on in this example is that we do not have enough control on the stopping time $\tau$. Although the random walk visits all its states (infinitely often) with probability one, the recurrence times are infinite on the average. Thus, $\tau$ can be uncontrollably large. Indeed, the assumption

$$\lim_{n\to\infty} E(S_{\min(\tau,n)}) = E(S_\tau)\,(= 1)$$

does not hold. Roughly speaking, $P(\tau > n)$ goes to zero at the rate $\frac{1}{\sqrt{n}}$, and if the random walk still has not reached positive territory by time $n$, then it has traveled to some distance roughly of the order of $-\sqrt{n}$. These two now exactly balance out, so that $E(S_{\min(\tau,n)} I_{\{\tau > n\}})$ does not go to zero. This causes the assumption $\lim_{n\to\infty} E(S_{\min(\tau,n)}) = E(S_\tau) = 1$ to fail.
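One can watch this failure numerically. In the Python sketch below (purely illustrative sample sizes), $E(S_{\min(\tau,n)})$ stays at $0 = E(S_0)$ for every finite truncation point $n$, even though $S_\tau = 1$ with probability one:

```python
import numpy as np

rng = np.random.default_rng(11)

# Simple symmetric random walk stopped at tau = first hitting time of 1,
# truncated at n: E(S_min(tau, n)) stays 0 for every n, yet S_tau = 1.
def stopped_value(n):
    s = 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        if s == 1:
            return 1          # stopped: S_tau = 1
    return s                  # not stopped by time n

for n in (10, 100, 1000):
    vals = [stopped_value(n) for _ in range(20000)]
    print(n, round(np.mean(vals), 3))   # close to 0 = E(S_0), not 1
```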
Thus, our search for sufficient conditions in the optional stopping theorem should
be directed at finding nice enough conditions that ensure that the stopping time $\tau$
cannot get too large with a high probability. The next two theorems provide such a
set of aesthetically attractive sufficient conditions. It is not hard to prove these two
theorems. We refer the reader to Fristedt and Gray (1997, Chapter 24) for proofs of
these two theorems.

Theorem 14.4. Suppose $\{X_n, n \geq 0\}$ is a martingale, adapted to itself, and $\tau$ a stopping time adapted to the same sequence. Suppose any one of the following conditions holds.
(a) For some $n < \infty$, $P(\tau \leq n) = 1$.
(b) For some nonnegative random variable $Z$ with $E(Z) < \infty$, the martingale sequence $\{X_n\}$ satisfies $|X_n| \leq Z$ for all $n \geq 0$.
(c) For some positive and finite $c$, $|X_{n+1} - X_n| \leq c$ for all $n \geq 0$, $E(|X_0|) < \infty$, and $E(\tau) < \infty$.
(d) For some finite constant $c$, $E(X_n^2) \leq c$ for all $n \geq 0$.
Then $E(X_\tau) = E(X_0)$.

Remark. It is important to keep in mind that none of these four conditions is necessary for the equality $E(X_\tau) = E(X_0)$ to hold. We recall from our discussion of uniform integrability in Chapter 7 that conditions (b) and (d) in Theorem 14.4 each imply that the sequence $\{X_n\}$ is uniformly integrable. In fact, it may be shown that under the weaker condition that our martingale sequence $\{X_n\}$ is uniformly integrable, the equality $E(X_\tau) = E(X_0)$ holds. The important role played by uniform integrability in martingale theory reappears when we discuss convergence of martingales.

An important case where the equality holds with essentially the minimum requirements is the special case of a random walk. We precisely describe this immediately below. The point is that the four sufficient conditions are all-purpose conditions. But if the martingale has a special structure, then the conditions can sometimes be weakened. Here is such a result for a special martingale, namely the random walk.

Theorem 14.5. Let $Z_1, Z_2, \ldots$ be an iid sequence such that $E(|Z_1|) < \infty$. Let $S_n = \sum_{i=1}^n Z_i, n \geq 1$. Let $\tau$ be any stopping time adapted to $\{S_n\}$ such that $E(\tau) < \infty$. Consider the martingale sequence $X_n = S_n - n\mu, n \geq 1$, where $\mu = E(Z_1)$. Then the equality $E(X_\tau) = E(X_1) = 0$ holds.

Remark. The special structure of the random walk martingale allows us to conclude the assertion of the optional stopping theorem, without requiring the bounded increments condition $|X_{n+1} - X_n| \leq c$, which was included in the all-purpose sufficient condition in Theorem 14.4.

Example 14.14 (Weighted Rademacher Series). Let $X_1, X_2, \ldots$ be a sequence of iid Rademacher variables with common distribution $P(X_i = \pm 1) = \frac{1}{2}$. For $n \geq 1$, let $S_n = \sum_{i=1}^n \frac{X_i}{i^\alpha}$, where $\alpha > \frac{1}{2}$. Because the $X_i$ are independent and $E\left(\frac{X_i}{i^\alpha}\right) = 0$ for all $i$, $S_n$ forms a martingale sequence (see Example 14.2). On the other hand,

$$E(S_n^2) = \operatorname{Var}(S_n) = \sum_{i=1}^n \frac{\operatorname{Var}(X_i)}{i^{2\alpha}} = \sum_{i=1}^n \frac{1}{i^{2\alpha}} \leq \sum_{i=1}^{\infty} \frac{1}{i^{2\alpha}} = \zeta(2\alpha) < \infty,$$

where $\zeta(z)$ is the Riemann zeta function $\zeta(z) = \sum_{n=1}^{\infty} \frac{1}{n^z}, z > 1$. Therefore, if $\alpha > \frac{1}{2}$, $E(S_n^2) \leq c = \zeta(2\alpha)$ for all $n$, and hence by our theorem above, $E(S_\tau) = 0$ holds for any stopping time $\tau$ adapted to $\{S_n\}$.
holds for any stopping time  adapted to fSn g.



Example 14.15 (The Simple Random Walk). Consider the one-dimensional random walk with iid steps $X_i$, having the common distribution $P(X_i = 1) = p, P(X_i = -1) = q, 0 < p < 1, p + q = 1$. Then $E(X_i) = p - q = \mu$ (say), and $S_n - n\mu$, where $S_n = \sum_{i=1}^n X_i$, is a martingale. We also have, for any $n$,

$$|S_{n+1} - (n+1)\mu - (S_n - n\mu)| = |X_{n+1} - \mu| \leq 2.$$

Furthermore, $E(|S_1 - \mu|)$ is clearly finite. Therefore, for any stopping time $\tau$ with a finite expectation, by using our theorem above, the equality $E(S_\tau - \mu\tau) = 0$, or equivalently, $E(S_\tau) = \mu E(\tau)$, holds. Recall from Chapter 11 that this is a special case of Wald's identity. Wald's identity is revisited in the next section.

14.2.4 Applications of Optional Stopping

We provide a few applications of the optional stopping theorem. The optional stop-
ping theorem also has important applications to martingale inequalities, which is
our topic in the next section.
Perhaps the two best general applications of the optional stopping theorem are
two identities, known as Wald identities. Of these, the first Wald identity is already
known to us; see Chapter 11. We connect that identity to martingale theory and
present a second identity, which was not presented in Chapter 11.

Theorem 14.6 (Wald's First and Second Identity). Let $X_1, X_2, \ldots$ be a sequence of iid random variables, defined on a common sample space $\Omega$. Let $S_n = \sum_{i=1}^n X_i, n \geq 1$. Let $\tau$ be a stopping time adapted to the sequence $\{S_n\}$ and suppose that $E(\tau) < \infty$.
(a) Suppose $E(|X_1|) < \infty$ and $E(X_1) = \mu$ (which need not be zero). Then $E(S_\tau) = \mu E(\tau)$.
(b) Suppose $E(X_1) = 0, E(X_1^2) = \sigma^2 < \infty$. Then $\operatorname{Var}(S_\tau) = \sigma^2 E(\tau)$.

Proof. Both parts of this theorem follow from Theorem 14.5. For part (a), apply Theorem 14.5 to the martingale sequence $S_n - n\mu$ to conclude that $E(S_\tau - \mu\tau) = 0 \Rightarrow E(S_\tau) = \mu E(\tau)$. For part (b), because $\mu = E(X_1)$ has now been assumed to be zero, by applying part (a) of this theorem,

$$\operatorname{Var}(S_\tau) = E(S_\tau - E[S_\tau])^2 = E(S_\tau - 0)^2 = E(S_\tau^2).$$

Next note that because the $X_i$ are independent,

$$\operatorname{Var}(S_{n+1} \mid S_1, \ldots, S_n) = \operatorname{Var}(X_{n+1}) = \sigma^2$$
$$\Rightarrow E(S_{n+1}^2 \mid S_1, \ldots, S_n) = S_n^2 + \sigma^2$$
$$\Rightarrow E(S_{n+1}^2 - (n+1)\sigma^2 \mid S_1, \ldots, S_n) = S_n^2 - n\sigma^2;$$

that is, $S_n^2 - n\sigma^2$ is a martingale sequence adapted to the $S_n$ sequence. From here, it follows that $E(S_\tau^2 - \tau\sigma^2) = E(S_1^2 - \sigma^2) = 0$, which means

$$\operatorname{Var}(S_\tau) = E(S_\tau^2) = \sigma^2 E(\tau),$$

which is what part (b) says. □
Example 14.16 (Expected Hitting Times for a Random Walk). The Wald identity may be used to evaluate the expected hitting time of a given level by a random walk. Specifically, let $S_n$ be the one-dimensional simple symmetric random walk with the iid steps having the common distribution $P(X_i = \pm 1) = \frac{1}{2}$. Let $x$ be any given positive integer and consider the first passage time

$$\tau_x = \inf\{n > 0 : S_n = x\}.$$

We know from general random walk theory (Chapter 11) that $P(\tau_x < \infty) = 1$. Also, obviously $E(|X_1|) = 1 < \infty$, and $\mu = E(X_1) = 0$. Therefore, if $E(\tau_x)$ is finite, Wald's identity $E(S_{\tau_x}) = \mu E(\tau_x)$ will hold. However, $S_{\tau_x} = x$ with probability one, and hence $E(S_{\tau_x}) = x$. It follows that the equality $x = 0 \cdot E(\tau_x)$ cannot hold for any finite value of $E(\tau_x)$. In other words, for any positive $x$, the expected hitting time of $x$ must be infinite for the simple symmetric random walk. The same argument also works for negative $x$.
Example 14.17 (Gambler's Ruin). Now let us revisit the so-called gambler's ruin problem, wherein the gambler quits when he either goes broke or attains a prespecified amount of fortune (see Chapter 10). In other words, the gambler waits for the random walk $S_n$ to hit one of the two integers $0, b$, $b > 0$. Suppose $a < b$ is the amount of money with which the gambler walked in, so that the gambler's sequence of fortunes is the random walk $S_n = \sum_{i=1}^n X_i + S_0$, where $S_0 = a$, and the steps are still iid with $P(X_i = \pm 1) = \frac{1}{2}$. Formally, let

$$\tau = \tau_{\{a,b\}} = \inf\{n > 0 : S_n \in \{0, b\}\}.$$

By applying the optional stopping theorem,

$$E(S_\tau) = 0 \cdot P(S_\tau = 0) + b[1 - P(S_\tau = 0)] = E(S_0) = a;$$

note that we have implicitly assumed the validity of the optional stopping theorem in the last step (which is true in this example; why?). Rearranging terms, we deduce that $P(S_\tau = 0) = \frac{b-a}{b}$, or equivalently, $P(S_\tau = b) = \frac{a}{b}$.
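A short simulation agrees with the ruin probability just derived (the values $a = 3, b = 10$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

# Fair-coin gambler starting at a, absorbed at 0 or b;
# optional stopping predicts P(S_tau = b) = a / b.
def hits_top(a, b):
    s = a
    while 0 < s < b:
        s += rng.choice((-1, 1))
    return s == b

a, b, reps = 3, 10, 50000
print(a / b, np.mean([hits_top(a, b) for _ in range(reps)]))
```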
Example 14.18 (Generalized Wald Identity). The two identities of Wald given above assume only the existence of the first and the second moment of $X_i$, respectively. If we make the stronger assumption that the $X_i$ have a finite mgf, then a more embracing martingale identity can be proved, from which the two Wald identities given above fall out as special cases. This generalized Wald identity is presented in this example.

The basic idea is the same as before, which is to think of a suitable martingale, and apply the optional stopping theorem to it. Suppose then that $X_1, X_2, \ldots$ is an iid sequence, with the mgf $\psi(t) = E(e^{tX_i})$, which we assume to be finite in some nonempty interval containing zero. The martingale that works for our purpose in this example is

$$Z_n = [\psi(t)]^{-n} e^{tS_n}, \quad n \geq 0,$$

where, as usual, $S_n = \sum_{i=1}^n X_i$, and we take $S_0 = 0$. The number $t$ is fixed, and is often cleverly chosen in specific applications.

The special normal case of this martingale was seen in Example 14.4. Exactly the same proof works in order to show that $Z_n$ as defined above is a martingale in general, not just in the normal case. Formally, therefore, whenever we have a stopping time $\tau$ such that the optional stopping theorem is valid for this martingale sequence $Z_n$, we have the identity

$$E(Z_\tau) = E\left[(\psi(t))^{-\tau} e^{tS_\tau}\right] = E(Z_0) = 1.$$

Once we have this general identity, we can manipulate it for special stopping times $\tau$ to make useful conclusions in specific applications.

Example 14.19 (Error Probabilities of Wald's SPRT). As a specific application of historical importance in statistics, consider again the example of Wald's SPRT (Example 14.11). The setup is that we are acquiring iid observations $Z_1, Z_2, \ldots$ from a parametric family of densities $f(x \mid \theta)$, and we have to decide between the two hypotheses $H_0: \theta = \theta_0$ (the null hypothesis) and $H_1: \theta = \theta_1$ (the alternative hypothesis). As was explained in Example 14.11, we continue sampling as long as $l < S_n < u$ for some $l, u$, $l < u$, and stop and decide in favor of $H_1$ or $H_0$ when for the first time $S_n \geq u$ or $S_n \leq l$; here, $S_n$ is the log likelihood ratio

$$S_n = \log \Lambda_n = \log \frac{\prod_{i=1}^n f(Z_i \mid \theta_1)}{\prod_{i=1}^n f(Z_i \mid \theta_0)} = \sum_{i=1}^n \log \frac{f(Z_i \mid \theta_1)}{f(Z_i \mid \theta_0)} = \sum_{i=1}^n X_i, \text{ say}.$$

Therefore, in this particular case, the relevant stopping time is

$$\tau = \inf\{n > 0 : S_n \notin (l, u)\}.$$

The type I error probability of our test is the probability that the test would reject $H_0$ if $H_0$ happened to be true. Denoting the type I error probability as $\alpha$, we have $\alpha = P_{\theta = \theta_0}(S_\tau \geq u)$. We use Wald's generalized identity to approximate $\alpha$. Exact calculation of $\alpha$ is practically impossible except in stray cases.

To proceed with this approximation, suppose there is a number $t \neq 0$ such that $E_{\theta = \theta_0}(e^{tX_i}) = 1$. In our notation for the generalized Wald identity, this makes $\psi(t) = 1$ for this judiciously chosen $t$. If we now make the assumption (of some faith) that when $S_n$ leaves the interval $(l, u)$, it does not overshoot the limits $l, u$ by too much, we should have

$$S_\tau \approx u I_{\{S_\tau \geq u\}} + l I_{\{S_\tau \leq l\}}.$$

Therefore, by applying Wald's generalized identity,

$$1 = E_{\theta = \theta_0}(e^{tS_\tau}) \approx e^{tu} \alpha + e^{tl}(1 - \alpha) \;\Rightarrow\; \alpha \approx \frac{1 - e^{tl}}{e^{tu} - e^{tl}}.$$

This is the classic Wald approximation to the type I error probability of the SPRT (sequential probability ratio test). A similar approximation exists for the type II error probability of the SPRT, which is the probability that the test will accept $H_0$ if $H_0$ happens to be false.
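Here is a small numerical illustration of the Wald approximation, again in the illustrative setting $f_0 = N(0,1)$ versus $f_1 = N(1,1)$, where the increments $X_i = Z_i - \frac{1}{2}$ under $H_0$ satisfy $E_{\theta_0}(e^{tX_i}) = 1$ at $t = 1$, so that $\alpha \approx (1 - e^l)/(e^u - e^l)$:

```python
import numpy as np
from math import log, exp

rng = np.random.default_rng(13)

# SPRT for f0 = N(0,1) vs f1 = N(1,1); under H0 the log-LR increments
# are X_i = Z_i - 1/2, and E_{H0}(e^{t X_i}) = 1 at t = 1.
l, u = log(1 / 19), log(19)
alpha_wald = (1 - exp(l)) / (exp(u) - exp(l))   # Wald's approximation

def rejects_h0():
    s = 0.0
    while l < s < u:
        s += rng.normal() - 0.5                 # sample under H0
    return s >= u                               # type I error event

reps = 20000
alpha_mc = np.mean([rejects_h0() for _ in range(reps)])
print(alpha_wald, alpha_mc)   # approximation vs. simulated truth
```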

14.3 Martingale and Concentration Inequalities

The optional stopping theorem is also the main tool in proving a collection of important inequalities involving martingales. To provide a little context for such inequalities, consider the special martingale of a random walk, namely $S_n = \sum_{i=1}^n X_i, n \geq 1$, where we assume the $X_i$ to be iid mean zero random variables with a finite variance $\sigma^2$. If we take any fixed $n$, and any fixed $\lambda > 0$, then simply by Chebyshev's inequality, $P(|S_n| \geq \lambda) \leq \frac{E(S_n^2)}{\lambda^2}$. Kolmogorov's inequality (see Chapter 8) makes the stronger assertion $P(\max_{1 \leq k \leq n} |S_k| \geq \lambda) \leq \frac{E(S_n^2)}{\lambda^2}$. A fundamental inequality in martingale theory says that such an inequality holds for more general martingales, and not just the special martingale of a random walk.

14.3.1 Maximal Inequality

Theorem 14.7 (Martingale Maximal Inequality).
(a) Let $\{X_n, n \geq 0\}$ be a nonnegative submartingale adapted to some sequence $\{Y_n\}$, and $\lambda$ any fixed positive number. Then, for any $n \geq 0$,

$$P\left(\max_{0 \leq k \leq n} X_k \geq \lambda\right) \leq \frac{E(X_n)}{\lambda}.$$

(b) Let $\{X_n, n \geq 0\}$ be a martingale adapted to some sequence $\{Y_n\}$, and $\lambda$ any fixed positive number. Suppose $p \geq 1$ is such that $E(|X_k|^p) < \infty$ for any $k \geq 0$. Then, for any $n \geq 0$,

$$P\left(\max_{0 \leq k \leq n} |X_k| \geq \lambda\right) \leq \frac{E\left(|X_n|^p I_{\{\max_{0 \leq k \leq n} |X_k| \geq \lambda\}}\right)}{\lambda^p} \leq \frac{E(|X_n|^p)}{\lambda^p}.$$

Proof. Note that the final inequality in part (b) follows from part (a) by use of Theorem 14.1, because $f(z) = |z|^p$ is a nonnegative convex function, and therefore if $\{X_n\}$ is a martingale adapted to some sequence $\{Y_n\}$, then for $p \geq 1$, $\{|X_n|^p\}$ is a nonnegative submartingale (adapted to that same sequence $\{Y_n\}$). The first inequality in part (b) is proved by partitioning the event $\{\max_{0 \leq k \leq n} |X_k| \geq \lambda\}$ into disjoint events of the form $\{|X_0| < \lambda, \ldots, |X_i| < \lambda, |X_{i+1}| \geq \lambda\}$, and then using simple bounds on each of these partitioning sets. This is left as an exercise.

For proving part (a) of this theorem, define the stopping time

$$\tau = \inf\{k \geq 0 : X_k \geq \lambda\},$$

and $\tau_n = \min(\tau, n)$. Then, by the optional stopping theorem,

$$E(X_n) \geq E(X_{\tau_n}) = E\left(X_{\tau_n} I_{\{\max_{0 \leq k \leq n} X_k \geq \lambda\}}\right) + E\left(X_{\tau_n} I_{\{\max_{0 \leq k \leq n} X_k < \lambda\}}\right) \geq E\left(X_{\tau_n} I_{\{\max_{0 \leq k \leq n} X_k \geq \lambda\}}\right)$$

(since the $\{X_n\}$ sequence has been assumed to be nonnegative)

$$\geq E\left(\lambda I_{\{\max_{0 \leq k \leq n} X_k \geq \lambda\}}\right) = \lambda P\left(\max_{0 \leq k \leq n} X_k \geq \lambda\right),$$

which is what part (a) of this theorem says. □

Part (a) of the theorem above assumes the submartingale $\{X_n\}$ to be nonnegative. This assumption is in fact not needed. In addition, the inequality itself can be somewhat strengthened. The following improved version of the maximal inequality can be proved by minor modifications of the argument given above; we record the stronger version, which is important for applications.

Theorem 14.8 (A Better Maximal Inequality). Let $\{X_n, n \geq 0\}$ be a submartingale adapted to some sequence $\{Y_n\}$, and $\lambda$ any fixed positive number. Then, for any $n \geq 0$,

$$P\left(\max_{0 \leq k \leq n} X_k \geq \lambda\right) \leq \frac{E(X_n^+)}{\lambda} \leq \frac{E(|X_n|)}{\lambda},$$

where for any real number $x$, $x^+ = \max(x, 0) \leq |x|$.

Example 14.20 (Sharper Bounds Near Zero). The bounds in Theorem 14.7 and Theorem 14.8 are not useful unless $\lambda$ is large, because the upper bounds blow up as $\lambda \to 0$. However, if we work a little harder, then useful bounds can be derived, at least in some cases, even when $\lambda$ is near zero. This example illustrates such a calculation.

Let $\{X_n\}$ be a zero mean martingale, and suppose $\sigma_k^2 = \operatorname{Var}(X_k) < \infty$ for all $k$. For $n \geq 0$, denote $M_n = \max_{0 \leq k \leq n} X_k$. Fix a constant $c > 0$; the constant $c$ is chosen later suitably. By Theorem 14.1, $\{(X_k + c)^2\}$ is a submartingale, and therefore, by Theorem 14.8,

$$P(M_n \geq \lambda) = P(M_n + c \geq \lambda + c) = P\left(\max_{0 \leq k \leq n}(X_k + c) \geq \lambda + c\right) \leq \frac{E(X_n + c)^2}{(\lambda + c)^2} = \frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2}.$$

Therefore,

$$P(M_n \geq \lambda) \leq \inf_{c > 0} \frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2}.$$

The function $\frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2}$ is uniquely minimized at the root of the derivative equation

$$\frac{c}{c^2 + \sigma_n^2} - \frac{c + \lambda}{c^2 + 2c\lambda + \lambda^2} = 0 \iff c^2\lambda + c(\lambda^2 - \sigma_n^2) - \lambda\sigma_n^2 = 0 \iff c = \frac{\sigma_n^2}{\lambda}.$$

Plugging in this value of $c$, we get

$$P(M_n \geq \lambda) \leq \inf_{c > 0} \frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2} = \frac{\sigma_n^2}{\lambda^2 + \sigma_n^2},$$

for any $\lambda > 0$. Clearly, this bound is strictly smaller than one for any $\lambda > 0$.

Example 14.21 (Bounds on the Moments of the Maximum). Here is a clever application of Theorem 14.7 to bounding the moments of $M_n = \max_{0 \leq k \leq n} |X_k|$ in terms of the same moment of $|X_n|$ for a martingale sequence $\{X_n\}$. The example is a very nice illustration of the art of putting simple things together to get a pretty end result.

Suppose that $\{X_n, n \geq 0\}$ is a martingale sequence, and $p > 1$ is such that $E(|X_k|^p) < \infty$ for every $k$. The proof of the result in this example makes use of Hölder's inequality $E(|XY|) \leq (E|X|^\alpha)^{1/\alpha} (E|Y|^\beta)^{1/\beta}$, where $\alpha, \beta > 1$ and $\frac{1}{\alpha} + \frac{1}{\beta} = 1$ (see Chapter 1).

Proceeding,

$$E(M_n^p) = \int_0^\infty p\lambda^{p-1} P(M_n > \lambda)\,d\lambda \leq \int_0^\infty p\lambda^{p-2} E\left(|X_n| I_{\{M_n \geq \lambda\}}\right)\,d\lambda$$

(by using part (b) of Theorem 14.7)

$$= E\left(p|X_n| \int_0^{M_n} \lambda^{p-2}\,d\lambda\right)$$

(by Fubini's theorem)

$$= \frac{p}{p-1} E\left(|X_n| M_n^{p-1}\right) \leq \frac{p}{p-1} [E|X_n|^p]^{1/p} [E(M_n^p)]^{(p-1)/p}$$

(by using Hölder's inequality with $\alpha = p$, $\beta = \frac{p}{p-1}$).

Transferring $[E(M_n^p)]^{(p-1)/p}$ to the left side,

$$[E(M_n^p)]^{1/p} \leq \frac{p}{p-1} [E|X_n|^p]^{1/p}.$$

In particular, for a square integrable martingale, by using $p = 2$ in the inequality we just derived,

$$[E(M_n^2)]^{1/2} \leq 2[E(X_n^2)]^{1/2} \Rightarrow E(M_n^2) \leq 4E(X_n^2),$$

a very pretty and useful inequality.
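The factor 4 in this $L_2$ bound is easy to probe by simulation; the sketch below uses the simple symmetric random walk as an illustrative martingale:

```python
import numpy as np

rng = np.random.default_rng(21)

# Doob's L2 bound: E(max_{k<=n} S_k^2) <= 4 E(S_n^2) for a martingale.
reps, n = 100000, 50
X = rng.choice([-1.0, 1.0], size=(reps, n))
S = X.cumsum(axis=1)
M = np.abs(S).max(axis=1)

print(np.mean(M ** 2), 4 * np.mean(S[:, -1] ** 2))  # LHS <= RHS
```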

14.3.2 * Inequalities of Burkholder, Davis, and Gundy

The previous two examples indicated applications of various versions of the maximal inequality to obtaining bounds on the moments of the maximum $M_n = \max_{0 \leq k \leq n} |X_k|$ for a martingale sequence $\{X_n\}$. The maximal inequality tells us how to obtain bounds on the moments from bounds on the tail probability. In particular, if the martingale is square integrable, that is, if $E(X_k^2) < \infty$ for any $k$, then the maximal inequality leads to a bound on the second moment of $M_n$ in terms of the second moment of the last term, namely $E(X_n^2)$.

There is a useful connection between $E(X_n^2)$ and $E(D_n^2)$ for a general square integrable martingale $\{X_n\}$, where $D_n^2 = \sum_{i=1}^n (X_i - X_{i-1})^2$. The connection, which we prove below, is the neat identity $E(X_n^2) - E(X_0^2) = E(D_n^2)$, so that if $X_0 = 0$, then $E(X_n^2)$ and $E(D_n^2)$ are equal. Therefore, we can think of the maximal inequality and the implied moment bounds in terms of $E(D_n^2)$, because $E(D_n^2)$ and $E(X_n^2)$ are, after all, equal. It was shown in Burkholder (1973), Davis (1970), and Burkholder, Davis, and Gundy (1972) that one can bound expectations of far more general functions of $M_n$ in terms of expectations of the same functions of $D_n$; in particular, one can bound the $p$th moment of $M_n$ from both directions by multiples of the $p$th moment of $D_n$ for general $p \geq 1$. In some sense, the moments of $M_n$ and the moments of $D_n$ grow in the same order; if one can control the increments of the martingale sequence, then one can control the maximum. Three such important bounds are presented in this section for reference and completeness. But first, we demonstrate the promised connection between $E(X_n^2)$ and $E(D_n^2)$, an interesting result in its own right.

Proposition. Suppose $\{X_n, n \geq 0\}$ is a martingale. Let $V_i = X_i - X_{i-1}, i \geq 1$, and $D_n^2 = \sum_{i=1}^n V_i^2$. Suppose $E(X_k^2) < \infty$ for each $k \geq 0$. Then, for any $n \geq 1$,

$$E(D_n^2) = E(X_n^2) - E(X_0^2).$$

Proof.

$$E(D_n^2) = \sum_{i=1}^n E[(X_i - X_{i-1})^2] = \sum_{i=1}^n E[X_i(X_i - X_{i-1})] - \sum_{i=1}^n E[X_{i-1}(X_i - X_{i-1})]$$
$$= \sum_{i=1}^n E\left(E[X_i(X_i - X_{i-1}) \mid X_0, \ldots, X_{i-1}]\right) - \sum_{i=1}^n E\left(E[X_{i-1}(X_i - X_{i-1}) \mid X_0, \ldots, X_{i-1}]\right)$$
$$= \sum_{i=1}^n \left\{E\left(E[X_i^2 \mid X_0, \ldots, X_{i-1}]\right) - E\left(X_{i-1} E[X_i \mid X_0, \ldots, X_{i-1}]\right)\right\} - \sum_{i=1}^n E\left(X_{i-1} E[X_i \mid X_0, \ldots, X_{i-1}] - X_{i-1}^2\right)$$
$$= \sum_{i=1}^n \left\{E(X_i^2) - E(X_{i-1}^2)\right\} - \sum_{i=1}^n E(X_{i-1}^2 - X_{i-1}^2) = E(X_n^2) - E(X_0^2). \qquad \square$$

Remark. In view of this result, we can restate part (b) of Theorem 14.7 for the case $p = 2$ in the following manner.

Theorem 14.9. Let $\{X_n, n \geq 0\}$ be a martingale such that $X_0 = 0$ and $E(X_k^2) < \infty$ for all $k \geq 1$. Let $\lambda$ be any fixed positive number, and for any $n \geq 1$, $M_n = \max_{0 \leq k \leq n} |X_k|$. Then,

$$P(M_n \geq \lambda) \leq \frac{E(D_n^2)}{\lambda^2}.$$

The inequalities of Burkholder, Davis, and Gundy show how to establish bounds on moments of $M_n$ in terms of the same moments of $D_n$. To describe some of these bounds, we first need a little notation.

Given a real-valued random variable $X$ and a positive number $p$, the $L_p$ norm of $X$ is defined as $||X||_p = [E(|X|^p)]^{1/p}$, assuming that $E(|X|^p) < \infty$. Obviously, if $X$ is already a nonnegative random variable, then $||X||_p = [E(X^p)]^{1/p}$. Here are two specific bounds on the $L_p$ norms of $M_n$ in terms of the $L_p$ norms of $D_n$. Of these, the case $p > 1$ was considered in works of Donald Burkholder (e.g., Burkholder (1973)); the case $p = 1$ needed a separate treatment, and was dealt with in Davis (1970).

Theorem 14.10. (a) Suppose $\{X_n, n \geq 0\}$ is a martingale, with $X_0 = 0$. Suppose for some given $p > 1$, $||X_k||_p < \infty$ for all $k \geq 1$. Then, for any $n \geq 1$,

$$\frac{p-1}{18p^{3/2}}\, ||D_n||_p \leq ||M_n||_p \leq \frac{18p^{3/2}}{(p-1)^{3/2}}\, ||D_n||_p.$$

(b) There exist universal positive constants $c_1, c_2$ such that

$$c_1 ||D_n||_1 \leq ||M_n||_1 \leq c_2 ||D_n||_1.$$

Moreover, the constant $c_2$ may be taken to be $\sqrt{3}$.
For $p \geq 1$, the functions $x \to |x|^p$ are convex. It was shown in Burkholder, Davis, and Gundy (1972) that bounds of the same nature as in the theorem above hold for general convex functions. The exact result says the following.

Theorem 14.11. Suppose $\{X_n, n \geq 0\}$ is a martingale with $X_0 = 0$ and $\phi: \mathbb{R} \to \mathbb{R}$ a convex function. Then there exist universal positive constants $c_\phi, C_\phi$, depending only on the function $\phi$, such that for any $n \geq 1$,

$$c_\phi E(\phi(D_n)) \leq E(\phi(M_n)) \leq C_\phi E(\phi(D_n)).$$

Remark. Note that apart from the explicit constants, both parts of Theorem 14.10 follow as special cases of this theorem. To our knowledge, no explicit choices of $c_\phi, C_\phi$ are known.

14.3.3 Inequalities of Hoeffding and Azuma

The classical inequality of Hoeffding (Hoeffding (1963); see Chapter 8) gives bounds on the probability of a large deviation of a partial sum of bounded iid random variables from its mean value. The message of that inequality is that if the iid summands can be controlled, then the deviations of the sum from its mean can be controlled. Inequalities on probabilities of the form $P(|f(X_1, X_2, \ldots, X_n) - E(f(X_1, X_2, \ldots, X_n))| > t)$ are called concentration inequalities. An equally classic concentration inequality of K. Azuma (Azuma (1967)) shows that a Hoeffding type inequality holds for a martingale sequence, provided that the increments $X_k - X_{k-1}$ vary in bounded intervals. The analogy between the iid case and the martingale case is then clear. In the iid case, we can control $S_n = \sum_{i=1}^n X_i$ if we can control the summands $X_i$; in the martingale case, we can control $X_n - X_0 = \sum_{i=1}^n (X_i - X_{i-1})$ if we can control the summands $X_i - X_{i-1}$. Here is Azuma's inequality in its classic form; a more general form is given afterwards.

Theorem 14.12 (Azuma's Inequality). Suppose $\{X_n, n \geq 0\}$ is a martingale such that $V_i = |X_i - X_{i-1}| \leq c_i$, where the $c_i$ are positive constants. Then, for any positive number $t$ and any $n \geq 1$,

(a) $P(X_n - X_0 \geq t) \leq e^{-\frac{t^2}{2\sum_{i=1}^n c_i^2}}$.
(b) $P(X_n - X_0 \leq -t) \leq e^{-\frac{t^2}{2\sum_{i=1}^n c_i^2}}$.
(c) $P(|X_n - X_0| \geq t) \leq 2e^{-\frac{t^2}{2\sum_{i=1}^n c_i^2}}$.

The proof of part (b) is exactly the same as that of part (a), and part (c) is an immediate consequence of parts (a) and (b). So only part (a) requires a proof. For this, we need a classic convexity lemma, originally used in Hoeffding (1963), and then a generalized version of it. Here is the first lemma.

(Hoeffding's Lemma). Let $X$ be a zero mean random variable such that $P(a \leq X \leq b) = 1$, where $a, b$ are finite constants. Then, for any $s > 0$,

$$E(e^{sX}) \leq e^{\frac{s^2(b-a)^2}{8}}.$$

Remark. It is important to note that the bound in this lemma depends only on $b - a$ and the mean zero assumption, but not on the individual values of $a, b$.

Proof of Hoeffding's Lemma. The proof uses convexity of the function $x \to e^{sx}$, and a calculus inequality on the function $\rho(u) = -pu + \log(1 - p + pe^u), u \geq 0$, where $p$ is a fixed number in $(0,1)$.
First, by the convexity of $x \to e^{sx}$, for $a \leq x \leq b$,

$$e^{sx} \leq \frac{x-a}{b-a}\, e^{sb} + \frac{b-x}{b-a}\, e^{sa}.$$

Taking an expectation,

$$E(e^{sX}) \leq pe^{sb} + (1-p)e^{sa}, \qquad (*)$$

where $p = \frac{-a}{b-a}$; note that $p$ belongs to $[0,1]$. It now remains to show that $pe^{sb} + (1-p)e^{sa} \leq e^{\frac{s^2(b-a)^2}{8}}$. Towards this, write

$$pe^{sb} + (1-p)e^{sa} = e^{sa}\left[1 - p + pe^{s(b-a)}\right] = e^{-sp(b-a)}\left[1 - p + pe^{s(b-a)}\right] = e^{-sp(b-a) + \log(1-p+pe^{s(b-a)})} = e^{-pu + \log(1-p+pe^u)},$$

writing $u$ for $s(b-a)$.

A relatively simple calculus argument shows that the function $\rho(u) = -pu + \log(1-p+pe^u)$ is bounded above by $\frac{u^2}{8}$ for all $u > 0$. Plugging this bound into $(*)$ results in the bound in the lemma. □

(Generalized Hoeffding Lemma). Let $V, Z$ be two random variables such that

$$E(V \mid Z) = 0, \quad \text{and} \quad P(f(Z) \leq V \leq f(Z) + c) = 1$$

for some function $f(Z)$ of $Z$ and some positive constant $c$. Then, for any $s > 0$,

$$E(e^{sV} \mid Z) \leq e^{\frac{s^2c^2}{8}}.$$

The generalized Hoeffding lemma has the same proof as Hoeffding's lemma itself. Refer to the remark that we made just before the proof of Hoeffding's lemma. It is the generalized Hoeffding lemma that gives us Azuma's inequality.

Proof of Azuma's Inequality. Still using the notation $V_i = X_i - X_{i-1}$, with $s > 0$,

$$P(X_n - X_0 \geq t) = P\left(e^{s(X_n - X_0)} \geq e^{st}\right) \leq e^{-st} E\left(e^{s(X_n - X_0)}\right) = e^{-st} E\left(e^{s\sum_{i=1}^n V_i}\right) = e^{-st} E\left(e^{s\sum_{i=1}^{n-1} V_i + sV_n}\right)$$
$$= e^{-st} E\left(e^{s\sum_{i=1}^{n-1} V_i}\, E\left[e^{sV_n} \mid X_0, \ldots, X_{n-1}\right]\right) \leq e^{-st} E\left(e^{s\sum_{i=1}^{n-1} V_i}\right) e^{\frac{s^2(2c_n)^2}{8}}$$

(because $E(V_n \mid X_0, \ldots, X_{n-1}) = 0$ by the martingale property of $\{X_n\}$, and then by applying the generalized Hoeffding lemma)

$$= e^{-st} e^{\frac{s^2 c_n^2}{2}} E\left(e^{s\sum_{i=1}^{n-1} V_i}\right) \leq e^{-st} e^{\frac{s^2 \sum_{i=1}^n c_i^2}{2}},$$

by repeating the same argument.

This latest inequality is true for any $s > 0$. Therefore, by minimizing the bound over $s > 0$,

$$P(X_n - X_0 \geq t) \leq \inf_{s>0} e^{-st} e^{\frac{s^2 \sum_{i=1}^n c_i^2}{2}} = e^{-\frac{t^2}{2\sum_{i=1}^n c_i^2}},$$

where the infimum over $s$ is easily established by a simple calculus argument. This proves Azuma's inequality. □
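A quick numerical check of part (a): the Python sketch below uses iid Uniform$(-1,1)$ increments, giving a martingale with $X_0 = 0$ and $c_i = 1$, and compares the empirical tail with the Azuma bound $e^{-t^2/(2n)}$ (the bound is valid, though typically far from tight):

```python
import numpy as np

rng = np.random.default_rng(17)

# Bounded-increment martingale: S_n = sum of iid uniform(-1,1) steps,
# so c_i = 1 and Azuma gives P(S_n >= t) <= exp(-t^2 / (2n)).
reps, n, t = 200000, 100, 25.0
S = rng.uniform(-1.0, 1.0, size=(reps, n)).sum(axis=1)

print(np.mean(S >= t), np.exp(-t * t / (2 * n)))  # empirical <= bound
```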

14.3.4 * Inequalities of McDiarmid and Devroye

McDiarmid (1989) and Devroye (1991) use novel martingale techniques to derive concentration inequalities and variance bounds for potentially complicated functions of independent random variables. The only requirement is that the function should not change by arbitrarily large amounts if all but one of the coordinates remain fixed. The first result below says that functions of certain types are concentrated near their mean value with a high probability.

Theorem 14.13. Suppose $X_1, \ldots, X_n$ are independent random variables, and $f(x_1, \ldots, x_n)$ is a function such that for each $i, 1 \leq i \leq n$, there exists a finite constant $c_i = c_{i,n}$ such that

$$|f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)| \leq c_i$$

for all $x_1, \ldots, x_i, x_i', \ldots, x_n$. Let $t$ be any positive number. Then,

(a) $P(f - E(f) \geq t) \leq e^{-\frac{2t^2}{\sum_{i=1}^n c_i^2}}$.
(b) $P(f - E(f) \leq -t) \leq e^{-\frac{2t^2}{\sum_{i=1}^n c_i^2}}$.
(c) $P(|f - E(f)| \geq t) \leq 2e^{-\frac{2t^2}{\sum_{i=1}^n c_i^2}}$.

Proof. Once again, only part (a) is proved, because (b) is proved exactly analogously, and (c) follows by adding the inequalities in (a) and (b). For notational convenience, we take $E(f)$ to be zero; this allows us to write $f$ in place of $f - E(f)$ below.

The trick is to decompose $f$ as $f = \sum_{k=1}^n V_k$, where $\{V_k\}$ is a martingale difference sequence such that it can be bounded in both directions, $Z_k \leq V_k \leq W_k$, in a manner so that $W_k - Z_k \leq c_k, k = 1, 2, \ldots, n$. Then Azuma's inequality applies and the inequality of this theorem falls out. Construct the random variables $V_k, Z_k, W_k$ as follows.

Define

$$\eta(x_1, \ldots, x_k) = E[f(X_1, \ldots, X_n) \mid X_1 = x_1, \ldots, X_k = x_k];$$
$$V_k = \eta(X_1, \ldots, X_k) - \eta(X_1, \ldots, X_{k-1}) \text{ for } k \geq 2, \text{ and } V_1 = \eta(X_1);$$
$$Z_k = \inf_{x_k} \eta(X_1, \ldots, X_{k-1}, x_k) - \eta(X_1, \ldots, X_{k-1}) \text{ for } k \geq 2, \text{ and } Z_1 = \inf_{x_1} \eta(x_1);$$
$$W_k = \sup_{x_k} \eta(X_1, \ldots, X_{k-1}, x_k) - \eta(X_1, \ldots, X_{k-1}) \text{ for } k \geq 2, \text{ and } W_1 = \sup_{x_1} \eta(x_1).$$

Now observe the following facts.

(a) By construction, $Z_k \leq V_k \leq W_k$ for each $k$.
(b) By hypothesis, $W_k - Z_k \leq c_k$ for each $k$.
(c) $f(X_1, \ldots, X_n) = \sum_{k=1}^n V_k$.
(d) $\{V_k\}_1^n$ forms a martingale difference sequence.

Therefore, we can once again apply the generalized Hoeffding lemma and simply repeat the proof of Azuma's inequality to obtain the inequality in part (a) of this theorem. □
u

An interesting feature of McDiarmid's inequality is that martingale methods were used to derive a probability inequality involving independent random variables. It turns out that martingale methods may also be used to derive variance bounds for functions of independent random variables. The following variance bound is taken from Devroye (1991).

Theorem 14.14. Suppose $X_1, \ldots, X_n$ are independent random variables and $f(x_1,\ldots,x_n)$ is a function that satisfies the conditions of Theorem 14.13. Then,
$$\mathrm{Var}(f(X_1,\ldots,X_n)) \le \frac{\sum_{i=1}^n c_i^2}{4}.$$
Proof. We use the same notation as in the proof of Theorem 14.13. The proof consists of showing $\mathrm{Var}(f) = E\left(\sum_{i=1}^n V_i^2\right)$ and $E(V_i^2) \le c_i^2/4$.
To prove the first fact, we use the martingale decomposition as in Theorem 14.13 to get
$$\mathrm{Var}(f) = \mathrm{Var}\left(\sum_{i=1}^n V_i\right) = E\left[\left(\sum_{i=1}^n V_i\right)^2\right]
= \sum_{i=1}^n E[V_i^2] + 2\sum\sum_{i<j} E[V_iV_j]$$
$$= \sum_{i=1}^n E[V_i^2] + 2\sum\sum_{i<j} E\left(V_i\,E[V_j\,|\,X_1,\ldots,X_{j-1}]\right)
= \sum_{i=1}^n E[V_i^2] + 2\sum\sum_{i<j} E(V_i \cdot 0) = \sum_{i=1}^n E[V_i^2].$$

To prove the second fact, we use an extremal property of two-point distributions, namely that the two-point distribution placing probability $\frac12$ at each of $a, b$ maximizes the variance among all distributions supported on $[a,b]$, and that this two-point distribution has variance $\frac{(b-a)^2}{4}$. From the proof of Theorem 14.13, $Z_i \le V_i \le W_i \le Z_i + c_i$. Therefore, the conditional variance of $V_i$ given $X_1,\ldots,X_{i-1}$ is at most $\frac{c_i^2}{4}$, and the conditional mean is zero. Putting these two facts together, we get our desired bound $E(V_i^2) \le \frac{c_i^2}{4}$, which gives the variance bound stated in this theorem. ∎

The two theorems in this section give useful probability and variance bounds
in many complicated problems in which direct evaluation would be essentially
impossible.

Example 14.22 (The Kolmogorov–Smirnov Statistic). Suppose $X_1, X_2, \ldots, X_n$ are iid observations from some continuous CDF $F(x)$ on the real line. It is sometimes of interest in statistics to test the hypothesis that $F = F_0$, some specific CDF on the real line. By the Glivenko–Cantelli theorem (see Chapter 7), the empirical CDF $F_n$ converges uniformly to the true CDF with probability one. So a measure of discrepancy of the observed data from the postulated CDF $F_0$ is $\Delta_n = \sup_x |F_n(x) - F_0(x)|$. The Kolmogorov–Smirnov statistic is $D_n = \sqrt n\,\Delta_n$. Exact calculations with $D_n$ are very cumbersome, because of the complicated nature of its distribution for given $n$. The purpose of this example is to use the inequalities of McDiarmid and Devroye to get useful bounds on its tail probabilities and the variance.

The function $f$ to which we would apply the inequalities of McDiarmid and Devroye is $f(X_1,\ldots,X_n) = \sup_x|F_n(x) - F_0(x)|$. We need to show that if just one data value changes, then the function $f$ cannot change by too large an amount. Indeed, consider two datasets, $\{X_1,\ldots,X_i,\ldots,X_n\}$ and $\{X_1,\ldots,X_i',\ldots,X_n\}$, where in the second set the $X_i'$ value is different from $X_i$. Let the corresponding empirical CDFs be $F_n, F_n'$. Fix an $x$. The number of observations $\le x$ in the two datasets can differ by at most one, and therefore $|F_n(x) - F_n'(x)| \le \frac1n$. This holds for any $x$. By the triangular inequality,
$$\Big|\sup_x|F_n(x)-F_0(x)| - \sup_x|F_n'(x)-F_0(x)|\Big| \le \sup_x|F_n(x)-F_n'(x)| \le 1/n.$$

Thus, we may use $c_i = c_{i,n} = \frac1n$ in the inequalities of McDiarmid and Devroye. First, by simply plugging $c_i = \frac1n$ in Theorem 14.13, we get
$$P(|\Delta_n - E(\Delta_n)| \ge t) \le 2e^{-2nt^2} \ \Rightarrow\ P(|D_n - E(D_n)| \ge t) \le 2e^{-2t^2}.$$
This concentration inequality holds for every fixed $n$ and $t > 0$, and we do not need to deal with the exact distribution of $D_n$ to arrive at this inequality.
Again plugging $c_i = \frac1n$ in Theorem 14.14, we get
$$\mathrm{Var}(\Delta_n) \le \frac{1}{4n} \ \Rightarrow\ \mathrm{Var}(D_n) \le \frac14,$$
for all $n \ge 1$. Once again, this is an attractive variance bound that is valid for every $n$, and we do not need to work with the exact distribution of $D_n$ to arrive at this bound.
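The following Python sketch (the sample size, number of replications, and threshold are our illustrative choices) simulates $D_n$ for $U[0,1]$ data with $F_0$ the $U[0,1]$ CDF, and compares the empirical variance and tail with the two bounds just derived:

```python
# Simulation sketch: McDiarmid/Devroye bounds for the KS statistic D_n.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000

def D_n(sample):
    # sqrt(n) * sup_x |F_n(x) - x| for U[0,1] data, via order statistics
    x = np.sort(sample)
    i = np.arange(1, n + 1)
    delta = np.maximum(i / n - x, x - (i - 1) / n).max()
    return np.sqrt(n) * delta

vals = np.array([D_n(rng.uniform(size=n)) for _ in range(reps)])
print(vals.var())                                   # should be <= 1/4
t = 0.75
print(np.mean(np.abs(vals - vals.mean()) >= t),     # empirical concentration
      2 * np.exp(-2 * t**2))                        # McDiarmid bound
```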

14.3.5 The Upcrossing Inequality

A final key inequality in martingale theory that we present is Doob’s upcrossing


inequality. The inequality is independently useful for studying fluctuations in the
trajectory of a martingale (submartingale) sequence. It is also the result we need in
the next section for establishing the fundamental convergence theorem for martin-
gales (submartingales).
Given the discrete time process $\{X_n, n \ge 0\}$, fix an integer $N > 0$, and two numbers $a, b$, $a < b$. We now track the time instants at which this process crosses $b$ from below, or $a$ from above. Formally, let $T_0 = \inf\{k \ge 0 : X_k \le a\}$. If $X_0 > a$, then this is the first downcrossing of $a$. If $X_0 \le a$, then $T_0 = 0$. Now we wait for the first upcrossing of $b$ after the time $T_0$. Formally, $T_1 = \inf\{k > T_0 : X_k \ge b\}$. We continue tracking the down and the upcrossings of the two levels $a, b$ in this fashion. Here then is the formal definition for the entire sequence of stopping times $T_n$:
$$T_0 = \inf\{k \ge 0 : X_k \le a\};$$
$$T_{2n+1} = \inf\{k > T_{2n} : X_k \ge b\},\ n \ge 0;$$
$$T_{2n+2} = \inf\{k > T_{2n+1} : X_k \le a\},\ n \ge 0.$$

The times $T_1, T_3, \ldots$ are then the instants of upcrossing, and the times $T_0, T_2, \ldots$ are the instants of downcrossing. The upcrossing inequality places a bound on the expected value of $U_{a,b,N}$, the number of upcrossings up to the time $N$. Note that this is simply the number of odd labels $2n+1$ for which $T_{2n+1} \le N$.

Theorem 14.15. Let $\{X_n, n \ge 0\}$ be a submartingale. Then for any $a, b, N$ $(a < b)$,
$$E[U_{a,b,N}] \le \frac{E[(X_N - a)^+] - E[(X_0 - a)^+]}{b-a} \le \frac{E(|X_N|) + |a|}{b-a}.$$
Proof. The second inequality follows from the first inequality by the pointwise inequality $(x-a)^+ \le x^+ + |a| \le |x| + |a|$, and so, we prove only the first inequality.
First make the following reduction. Define a new nonnegative submartingale as $Y_n = (X_n - a)^+$, $n \ge 0$. This shifting by $a$ is going to result in a useful reduction. There is a functional identity between the upcrossing variable that we are interested in, namely $U_{a,b,N}$, and the number of upcrossings $V_{0,b-a,N}$ of this new process $\{Y_n\}_0^N$ of the two new levels $0$ and $b-a$. Indeed, $U_{a,b,N} = V_{0,b-a,N}$. So we need to show that $E[V_{0,b-a,N}] \le \frac{E(Y_N - Y_0)}{b-a}$.

The key to proving this inequality is to write a clever decomposition
$$Y_N - Y_0 = \sum_{i=0}^{N}\left(Y_{\tau_i} - Y_{\tau_{i-1}}\right)$$
(with the convention $\tau_{-1} = 0$), such that three things happen:

(i) The $\tau_i$ are increasing stopping times, so that the submartingale property is inherited by the $Y_{\tau_i}$ sequence.
(ii) The sum over the odd labels in this decomposition satisfies the pointwise inequality
$$\sum_{i\,:\,0\le i\le N,\ i\ \mathrm{odd}}\left(Y_{\tau_i} - Y_{\tau_{i-1}}\right) \ge (b-a)\,V_{0,b-a,N}.$$
(iii) The sum over the even labels satisfies the inequality
$$E\left[\sum_{i\,:\,0\le i\le N,\ i\ \mathrm{even}}\left(Y_{\tau_i} - Y_{\tau_{i-1}}\right)\right] \ge 0.$$

If we put (ii) and (iii) together, we immediately get
$$E(Y_N - Y_0) \ge (b-a)\,E[V_{0,b-a,N}],$$
which is the needed result.

What are these stopping times $\tau_i$, and why are (ii) and (iii) true? The stopping times $\tau_0 \le \tau_1 \le \cdots$ are defined in the following way. Analogous to the downcrossing and upcrossing times $T_0, T_1, \ldots$ of $(a,b)$ for the original $\{X_n\}$ process, let $T_0', T_1', \ldots$ be the downcrossing and upcrossing times of $(0, b-a)$ for the new $\{Y_n\}$ process. Now define $\tau_i = \min(T_i', N)$. The $\tau_i$ are increasing, that is, $\tau_0 \le \tau_1 \le \cdots$, because the $T_i'$ are. Note that these $\tau_i$ are stopping times adapted to $\{Y_n\}$.

Now look at the sum over the odd labels, namely $(Y_{\tau_1} - Y_{\tau_0}) + (Y_{\tau_3} - Y_{\tau_2}) + \cdots$. Break this sum further into two subsets of labels, $i \le V = V_{0,b-a,N}$, and $i > V$. For each label $i$ in the first subset, $Y_{\tau_{2i+1}} - Y_{\tau_{2i}} \ge b-a$, because $Y_{\tau_{2i+1}} \ge b-a$ and $Y_{\tau_{2i}} = 0$. Adding over these labels, of which there are $V$ many, we get the sum to be $\ge (b-a)V$. The labels in the other subset can be seen to give a sum $\ge 0$ (just think of what $V$ means, and a little thinking shows that the rest of the labels produce a sum $\ge 0$). So, now adding over the two subsets of labels, we get our claimed inequality in (ii) above.

The claim in (iii) is automatic by the optional stopping theorem, because for each individual $i$, we will have $E(Y_{\tau_{i-1}}) \le E(Y_{\tau_i})$ (actually, this is a slightly stronger demand than what the optional stopping theorem says; but it is true).

As was explained above, this completes the argument for the upcrossing inequality. ∎
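The upcrossing count in the theorem is easy to compute on simulated paths. The Python sketch below (our construction; $|S_n|$ for a simple symmetric random walk $S_n$ is used as a convenient nonnegative submartingale) compares the average number of upcrossings of $(a,b)$ with the bound $\frac{E(|X_N|)+|a|}{b-a}$:

```python
# Simulation sketch: counting upcrossings of (a, b) and Doob's bound.
import numpy as np

rng = np.random.default_rng(2)
a, b, N, reps = 1.0, 4.0, 200, 4000

def upcrossings(path, a, b):
    count, below = 0, path[0] <= a
    for x in path[1:]:
        if below and x >= b:          # an upcrossing is completed
            count, below = count + 1, False
        elif (not below) and x <= a:  # a downcrossing; wait for next upcrossing
            below = True
    return count

U, absXN = [], []
for _ in range(reps):
    S = np.abs(np.cumsum(rng.choice([-1, 1], size=N)))  # |random walk|
    path = np.concatenate(([0.0], S))
    U.append(upcrossings(path, a, b))
    absXN.append(abs(path[-1]))

print(np.mean(U), (np.mean(absXN) + abs(a)) / (b - a))  # mean count <= bound
```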

14.4 Convergence of Martingales

14.4.1 The Basic Convergence Theorem

Paul Lévy initiated his study of martingales in his search for laws of large numbers
beyond the case of means in the iid case. It turns out that martingales often con-
verge to a limiting random variable, and even convergence of the means or higher
moments can be arranged, provided that our martingale sequence is not allowed to
fluctuate or grow out of control. To see why some such conditions would be needed, consider the case of the simple symmetric random walk $S_n = \sum_{i=1}^n X_i$, where the $X_i$ are iid taking the values $\pm1$ with probability $\frac12$ each. We know that the simple symmetric random walk is recurrent, and therefore it comes back infinitely often to every integer value $x$ with probability one. So $S_n$, although a martingale, does not converge to some $S_\infty$. The expected value of $|S_n|$ in the simple symmetric random walk case is of the order of $c\sqrt n$ for some constant $c$, and $c\sqrt n$ diverges as $n \to \infty$. A famous result in martingale theory says that if we can keep $E(|X_n|)$ in control (i.e., bounded away from $\infty$), then a martingale sequence $\{X_n\}$ will in fact converge to some suitable $X_\infty$. Furthermore, some such condition is also essentially necessary for the martingale to converge. We start with an example.
Example 14.23 (Convergence of the Likelihood Ratio). Consider again the likelihood ratio $\Lambda_n = \prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}$, where $f_0, f_1$ are two densities and the sequence $X_1, X_2, \ldots$ is iid from the density $f_0$. We have seen that $\Lambda_n$ is a martingale (see Example 14.8).

The likelihood ratio $\Lambda_n$ gives a measure of the support in the first $n$ data values for the density $f_1$. We know $f_0$ to be the true density from which the data values are coming, therefore we would like the support for $f_1$ to diminish as more data are accumulated. Mathematically, we would like $\Lambda_n$ to converge to zero as $n \to \infty$. We recognize that this is therefore a question about convergence of a martingale sequence, because $\Lambda_n$, after all, is a martingale if the true density is $f_0$.

Does $\Lambda_n$ indeed converge (almost surely) to zero? Indeed, it does, and we can verify it directly, without using any martingale convergence theorems that we have not yet encountered. Here is why we can verify the convergence directly.

Assume that $f_0, f_1$ are strictly positive for the same set of $x$ values; that is, $\{x : f_1(x) > 0\} = \{x : f_0(x) > 0\}$. Since $u \mapsto \log u$ is a strictly concave function on $(0,\infty)$, by Jensen's inequality,
$$m = E_{f_0}\left[\log\frac{f_1(X)}{f_0(X)}\right] < \log E_{f_0}\left[\frac{f_1(X)}{f_0(X)}\right] = \log 1 = 0.$$

Because $Z_i = \log\frac{f_1(X_i)}{f_0(X_i)}$ are iid with mean $m$, by the usual SLLN for iid random variables,
$$\frac1n\log\Lambda_n = \frac1n\sum_{i=1}^n Z_i \stackrel{a.s.}{\longrightarrow} m < 0 \ \Rightarrow\ \log\Lambda_n \stackrel{a.s.}{\longrightarrow} -\infty \ \Rightarrow\ \Lambda_n \stackrel{a.s.}{\longrightarrow} 0.$$

So, in this example, the martingale $\Lambda_n$ does converge with probability one to a limiting random variable $\Lambda_\infty$, and $\Lambda_\infty$ happens to be a constant random variable, equal to zero. We remark that the martingale $\Lambda_n$ satisfies $E(|\Lambda_n|) = E(\Lambda_n) = 1$, and so, a fortiori, $\sup_n E(|\Lambda_n|) < \infty$. This has something to do with the fact that $\Lambda_n$ converges in this example, although the random walk, also a martingale, failed to converge. This is borne out by the next theorem, a famous result in martingale theory. The proof of this next theorem requires the use of two basic facts in measure theory, which we state below.
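A short Python sketch (our illustrative choice of $f_0 = N(0,1)$ and $f_1 = N(1,1)$) makes the almost sure convergence visible; for these two densities, $\log\frac{f_1(x)}{f_0(x)} = x - \frac12$, so $\log\Lambda_n$ is a random walk with negative drift:

```python
# Simulation sketch: the likelihood ratio martingale drifts to 0 under f_0.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=2000)        # data from f_0 = N(0,1)

# log f_1(x) - log f_0(x) for f_1 = N(1,1) simplifies to x - 1/2
log_Lambda = np.cumsum(X - 0.5)
print(log_Lambda[[9, 99, 999, 1999]])      # drifts to -infinity, so Lambda_n -> 0
```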
Theorem 14.16 (Fatou's Lemma). Let $X_n, n \ge 1$, and $X$ be random variables defined on a common sample space $\Omega$. Suppose each $X_n$ is nonnegative with probability one, and suppose $X_n \stackrel{a.s.}{\longrightarrow} X$. Then, $\liminf_n E(X_n) \ge E(X)$.

Theorem 14.17 (Monotone Convergence Theorem). Let $X_n, n \ge 1$, and $X$ be random variables defined on a common sample space $\Omega$. Suppose each $X_n$ is nonnegative with probability one, that $X_1 \le X_2 \le X_3 \le \cdots$, and $X_n \stackrel{a.s.}{\longrightarrow} X$. Then $E(X_n) \uparrow E(X)$.

Theorem 14.18 (Submartingale Convergence Theorem). (a) Let $\{X_n\}$ be a submartingale such that $\sup_n E(X_n^+) = c < \infty$. Then there exists a random variable $X = X_\infty$, almost surely finite, such that $X_n \stackrel{a.s.}{\longrightarrow} X$.
(b) Let $\{X_n\}$ be a nonnegative supermartingale, or a nonpositive submartingale. Then there exists a random variable $X = X_\infty$, almost surely finite, such that $X_n \stackrel{a.s.}{\longrightarrow} X$.

Proof. The proof uses the upcrossing inequality, the monotone convergence theorem, and Fatou's lemma. The key idea is first to show that under the hypothesis of the theorem, the process $\{X_n\}$ cannot fluctuate indefinitely between two given numbers $a, b$, $a < b$. Then a standard analytical technique of approximation by rationals, and use of the monotone convergence theorem and Fatou's lemma, produces the submartingale convergence theorem. Here are the steps of the proof. Define
$$U_{a,b,N} = \text{number of upcrossings of } (a,b) \text{ by } X_0, X_1, \ldots, X_N;$$
$$U_{a,b} = \text{number of upcrossings of } (a,b) \text{ by } X_0, X_1, \ldots;$$
$$\Theta_{a,b} = \{\omega \in \Omega : \liminf_n X_n \le a < b \le \limsup_n X_n\};$$
$$\Theta = \{\omega \in \Omega : \liminf_n X_n < \limsup_n X_n\}.$$

First, by the monotone convergence theorem, $E[U_{a,b,N}] \to E[U_{a,b}]$ as $N \to \infty$, because $U_{a,b,N}$ converges monotonically to $U_{a,b}$ as $N \to \infty$. Therefore, by the upcrossing inequality,
$$E[U_{a,b,N}] \le \frac{E(|X_N|)+|a|}{b-a} \ \Rightarrow\ E[U_{a,b}] = \lim_N E[U_{a,b,N}] \le \frac{\limsup_N E(|X_N|)+|a|}{b-a} < \infty.$$
This means that $U_{a,b}$ must be finite with probability one (i.e., it cannot equal $\infty$ with a positive probability).

Next, note that $\Theta \subseteq \cup_{\{a<b,\ a,b\ \text{rational}\}}\Theta_{a,b}$, and because we now have that $P(\Theta_{a,b}) = 0$ for any specific pair $a, b$, $P(\cup_{\{a<b,\ a,b\ \text{rational}\}}\Theta_{a,b})$ must also be zero. This then implies that $P(\Theta) = 0$, which establishes the existence of an almost sure limit for the sequence $X_n$.

However, a subtle point still remains. The limit, $X$, could be $\infty$ or $-\infty$ with a positive probability. We use Fatou's lemma to rule out that possibility. Indeed, by Fatou's lemma,
$$E(|X|) \le \liminf_n E(|X_n|) \le \sup_n E(|X_n|) < \infty,$$
and so $X$ must be finite with probability one. This finishes the proof of part (a) of the submartingale convergence theorem.

Part (b) is an easy consequence of part (a). For example, if $\{X_n\}$ is a nonpositive submartingale, then
$$\sup_n E(|X_n|) = \sup_n\left[-E(X_n)\right] = -\inf_n E(X_n) = -E(X_1) < \infty,$$
and so convergence of $X_n$ to an almost surely finite $X$ follows from part (a). ∎

14.4.2 Convergence in L1 and L2

The basic convergence theorem that we just proved says that an $L_1$-bounded submartingale converges to some random variable $X$. It is a bit disappointing that the apparently strong hypothesis that the submartingale is $L_1$-bounded is not strong enough to ensure convergence of the expectations: $E(X_n)$ need not converge to $E(X)$ in spite of the $L_1$-boundedness assumption. A slightly stronger control on the growth of the submartingale sequence is needed to ensure convergence of expectations, in addition to the convergence of the submartingale itself. For example, $\sup_n E(|X_n|^p) < \infty$ for some $p > 1$ will suffice. A condition of this sort immediately reminds us of uniform integrability. Indeed, if $\sup_n E(|X_n|^p) < \infty$ for some $p > 1$, then $\{X_n\}$ will be uniformly integrable. It turns out that uniform integrability will be enough to assure us of convergence of the expectations in the basic convergence theorem, and it is almost the minimum that we can get away with. Statisticians are often interested in convergence of variances also. That is a stronger demand, and requires a stronger hypothesis. The next theorem records the conclusions on these issues. For reasons of space, this next theorem is not proved. One can see a proof in Fristedt and Gray (1997, p. 480).

Theorem 14.19. Let $\{X_n, n \ge 0\}$ be a submartingale.

(a) Suppose $\{X_n\}$ is uniformly integrable. Then there exists an $X$ such that $X_n \stackrel{a.s.}{\longrightarrow} X$, and $E(|X_n - X|) \to 0$ as $n \to \infty$.
(b) Conversely, suppose there exists an $X$ such that $E(|X_n - X|) \to 0$ as $n \to \infty$. Then $\{X_n\}$ must be uniformly integrable, and moreover, $X_n$ necessarily converges almost surely to this $X$.
(c) If $\{X_n\}$ is a martingale, and is $L_2$-bounded (i.e., $\sup_n E(X_n^2) < \infty$), then there exists an $X$ such that $X_n \stackrel{a.s.}{\longrightarrow} X$, and $E(|X_n - X|^2) \to 0$ as $n \to \infty$.

Example 14.24 (Pólya's Urn). We previously saw that the proportion of white balls in Pólya's urn, namely $R_n = \frac{a+S_n}{a+b+n}$, forms a martingale (see Example 14.6). This is an example in which the various convergences that we may want come easily. Because $R_n$ is obviously a uniformly bounded sequence, by the theorem stated above, $R_n$ converges almost surely and in $L_2$ (and therefore, in $L_1$) to a limiting random variable $R$, taking values in $[0,1]$.

Neither the basic (sub)martingale convergence theorem nor the theorem in this section helps us in any way to identify the distribution of $R$. In fact, in this case, $R$ has a nondegenerate distribution, which is a Beta distribution with parameters $a$ and $b$. As a consequence of this, $E(R_n) \to \frac{a}{a+b}$ and $\mathrm{Var}(R_n) \to \frac{ab}{(a+b)^2(a+b+1)}$ as $n \to \infty$. A proof that $R$ has a Beta distribution with parameters $a, b$ is available in DasGupta (2010).
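A simulation sketch in Python (the parameters and run lengths are ours) illustrates the limit: run the urn for a long time, record the final proportion, and compare the first two moments with those of the Beta distribution with parameters $a, b$:

```python
# Simulation sketch: Polya's urn proportion converges to a Beta(a, b) limit.
import numpy as np

rng = np.random.default_rng(4)
a, b, n, reps = 2, 3, 1000, 2000

limits = []
for _ in range(reps):
    white, total = a, a + b
    for _ in range(n):
        if rng.random() < white / total:  # draw a white ball
            white += 1                    # replace it along with one more white
        total += 1
    limits.append(white / total)

limits = np.array(limits)
print(limits.mean(), a / (a + b))                        # ~ 0.4
print(limits.var(), a * b / ((a + b)**2 * (a + b + 1)))  # ~ Beta(2,3) variance
```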

Example 14.25 (Bayes Estimates). We saw in Example 14.9 that the sequence of Bayes estimates (namely, the mean of the posterior distribution of the parameter) is a martingale adapted to the sequence of data values $\{X_n\}$. Continuing with the same notation as in Example 14.9, $Z_n = E(Y\,|\,X^{(n)})$ is our martingale sequence. Assume that the prior distribution for the parameter has a finite variance; that is, $E(Y^2) < \infty$. Then, by using Jensen's inequality for conditional expectations,
$$E(Z_n^2) = E\left[\left(E(Y\,|\,X^{(n)})\right)^2\right] \le E\left[E(Y^2\,|\,X^{(n)})\right] = E(Y^2).$$
Hence, by the theorem above in this section, the sequence of Bayes estimates $Z_n$ converges to some $Z$ almost surely, and moreover the mean and the variance of $Z_n$ converge to the mean and the variance of $Z$.

A natural followup question is what exactly is this limiting random variable $Z$. We can only give partial answers in general. For example, for each $n$, $E(Z\,|\,X^{(n)}) = Z_n$ with probability one. It is tempting to conclude from here that $Z$ is the same as $Y$ with probability one. This will be the case if knowledge of the entire infinite data sequence $X_1, X_2, \ldots$ pins down $Y$ completely, that is, if it is the case that someone who knows the infinite data sequence also knows $Y$ with probability one.
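In the Beta–Bernoulli case this limit can be watched directly. The Python sketch below (our construction; compare Exercise 14.5, where $Z_n = \frac{S_n+1}{n+2}$ is shown to be a martingale) draws $Y = p$ from a $U[0,1]$ prior and tracks the Bayes estimates, which here converge to the realized $Y$ itself:

```python
# Simulation sketch: Bayes estimates form a convergent martingale.
import numpy as np

rng = np.random.default_rng(5)
p = rng.uniform()                          # Y, drawn from the U[0,1] prior
X = rng.random(size=5000) < p              # Bernoulli(p) data, given Y = p
S = np.cumsum(X)
n = np.arange(1, len(X) + 1)
Z = (S + 1) / (n + 2)                      # posterior mean under the U[0,1] prior
print(p, Z[[99, 999, 4999]])               # Z_n approaches the realized Y
```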

14.5  Reverse Martingales and Proof of SLLN

Partial sums of iid random variables are of basic interest in many problems in proba-
bility, such as the study of random walks, and as we know, the sequence of centered
partial sums forms a martingale. On the other hand, the sequence of sample means is
of fundamental interest in statistics; but the sequence of means does not form a mar-
tingale. Interestingly, if we measure time backwards, then the sequence of means
does form a martingale, and then the rich martingale theory once again comes into
play. This motivates the concept of a reverse martingale.

Definition 14.4. A sequence of random variables $\{X_n, n \ge 0\}$ defined on a common sample space $\Omega$ is called a reverse submartingale adapted to the sequence $\{Y_n, n \ge 0\}$, defined on the same sample space $\Omega$, if $E(|X_n|) < \infty$ for all $n$ and $E(X_n\,|\,Y_{n+1}, Y_{n+2}, \ldots) \ge X_{n+1}$ for each $n \ge 0$. The sequence $\{X_n\}$ is called a reverse supermartingale if $E(X_n\,|\,Y_{n+1}, Y_{n+2}, \ldots) \le X_{n+1}$ for each $n$.
The sequence $\{X_n\}$ is called a reverse martingale if it is both a reverse submartingale and a reverse supermartingale with respect to the same sequence $\{Y_n\}$, that is, if $E(X_n\,|\,Y_{n+1}, Y_{n+2}, \ldots) = X_{n+1}$ for each $n$.

Example 14.26 (Sample Means). Let $X_1, X_2, \ldots$ be an infinite exchangeable sequence of random variables: for any $n \ge 2$ and any permutation $\pi_n$ of $(1,2,\ldots,n)$, $(X_1, X_2, \ldots, X_n)$ and $(X_{\pi_n(1)}, X_{\pi_n(2)}, \ldots, X_{\pi_n(n)})$ have the same joint distribution. For $n \ge 1$, let $\bar X_n = \frac{X_1+\cdots+X_n}{n} = \frac{S_n}{n}$ be the sequence of sample means.

Then, by the exchangeability property of the $\{X_n\}$ sequence, for any given $n$, and any $k$, $1 \le k \le n$,
$$\bar X_n = E(\bar X_n\,|\,S_n, S_{n+1}, \ldots) = \frac1n\sum_{i=1}^n E(X_i\,|\,S_n, S_{n+1}, \ldots)
= \frac1n\, nE(X_k\,|\,S_n, S_{n+1}, \ldots) = E(X_k\,|\,S_n, S_{n+1}, \ldots).$$
Consequently,
$$E(\bar X_{n-1}\,|\,S_n, S_{n+1}, \ldots) = \frac1{n-1}\sum_{k=1}^{n-1}E(X_k\,|\,S_n, S_{n+1}, \ldots)
= \frac1{n-1}(n-1)\bar X_n = \bar X_n,$$
which shows that the sequence of sample means is a reverse martingale (adapted to the sequence of partial sums).
There is a useful convex function theorem for reverse martingales as well, which is straightforward to prove.

Theorem 14.20 (Second Convex Function Theorem). Let $\{X_n\}$ be a sequence of random variables defined on some sample space $\Omega$, and $f$ a convex function. Let $Z_n = f(X_n)$.
(a) If $\{X_n\}$ is a reverse martingale, then $\{Z_n\}$ is a reverse submartingale.
(b) If $\{X_n\}$ is a reverse submartingale, and $f$ is also nondecreasing, then $\{Z_n\}$ is a reverse submartingale.
(c) If $\{X_{n,m}\}, m = 1, 2, \ldots$ is a countable family of reverse submartingales, defined on the same space $\Omega$ and all adapted to the same sequence, then $\{\sup_m X_{n,m}\}$ is also a reverse submartingale, adapted to the same sequence.
Example 14.27 (A Paradoxical Statistical Consequence). Suppose $Y$ is some real-valued random variable with mean $\mu$, and that we do not know the true value of $\mu$. Thus, we would like to estimate $\mu$. But, suppose that we cannot take any observations on the variable $Y$ (for whatever reason). We can, however, take observations on a completely unrelated random variable $X$, where $E(|X|) < \infty$. Suppose we do take $n$ iid observations on $X$. Call them $X_1, X_2, \ldots, X_n$ and let $\bar X_n$ be their mean. Then, by part (a) of the second convex function theorem, $|\bar X_n - \mu|$ forms a reverse submartingale, and hence $E(|\bar X_n - \mu|)$ is monotone nonincreasing in $n$. In other words, $E(|\bar X_{n+1} - \mu|) \le E(|\bar X_n - \mu|)$ for all $n$, and so taking more observations on the useless variable $X$ is going to be beneficial for estimating the mean of $Y$, a comical conclusion.

Note that there is really nothing special about using the absolute difference $|\bar X_n - \mu|$ as the criterion for the accuracy of estimation of $\mu$. The standard terminology in statistics for the criterion to be used is a loss function, and loss functions $L(\bar X_n, \mu)$ with a convexity property with respect to $\bar X_n$ for any fixed $\mu$ will result in the same paradoxical conclusion. One needs to make sure that $E[L(\bar X_n, \mu)]$ is finite.

Reverse martingales possess a universal special property that is convenient in applications. The property is that a reverse martingale always converges almost surely to some finite random variable. The convergence property also holds for reverse submartingales, but the limiting random variable may equal $+\infty$ or $-\infty$ with a positive probability. An important and interesting consequence of this universal convergence property is a proof of the SLLN in the iid case by using martingale techniques. This is shown below as an example. The convergence property of reverse martingales is stated below.

Theorem 14.21 (Reverse Martingale Convergence Theorem). (a) Let $\{X_n\}$ be a reverse martingale adapted to some sequence. Then it is necessarily uniformly integrable, and there exists a random variable $X$, almost surely finite, such that $X_n \stackrel{a.s.}{\longrightarrow} X$, and $E(|X_n - X|) \to 0$ as $n \to \infty$.
(b) Let $\{X_n\}$ be a reverse submartingale adapted to some sequence. Then there exists a random variable $X$ taking values in $[-\infty, \infty]$ such that $X_n \stackrel{a.s.}{\longrightarrow} X$.

See Fristedt and Gray (1997, pp. 483–484) for a proof using uniform integrability techniques. Here is an important application of this theorem.

Example 14.28 (Proof of Kolmogorov's SLLN). Let $X_1, X_2, \ldots$ be iid random variables, with $E(|X_1|) < \infty$, and let $E(X_1) = \mu$. The goal of this example is to show that the sequence of sample means, $\bar X_n$, converges almost surely to $\mu$.

We use the reverse martingale convergence theorem to obtain a proof. Because we have already shown that $\{\bar X_n\}$ forms a reverse martingale sequence, by the reverse martingale convergence theorem we are assured of a finite random variable $Y$ such that $\bar X_n$ converges almost surely to $Y$, and we are also assured that $E(Y) = \mu$. The only task that remains is to show that $Y$ equals $\mu$ with probability one.

This is achieved by establishing that $P(Y \le y) = [P(Y \le y)]^2$ for all real $y$ (i.e., $P(Y \le y)$ is $0$ or $1$ for any $y$), which would force $Y$ to be degenerate and therefore degenerate at $\mu$. To prove that $P(Y \le y) = [P(Y \le y)]^2$ for all real $y$, define the double sequence
$$Y_{m,n} = \frac{X_{m+1} + X_{m+2} + \cdots + X_{m+n}}{n},$$
$m, n \ge 1$. Note that $\bar X_k$ and $Y_{m,n}$ are independent for any $m$, $k \le m$, and any $n$, and that, furthermore, for any fixed $m$, $Y_{m,n}$ converges almost surely to $Y$ (the same $Y$ as above) as $n \to \infty$. These two facts together imply
$$P\left(Y \le y,\ \max_{1\le k\le m}\bar X_k \le y\right) = P(Y \le y)\,P\left(\max_{1\le k\le m}\bar X_k \le y\right)$$
$$\Rightarrow\ P(Y \le y) = P(Y \le y)\,P(Y \le y) = [P(Y \le y)]^2,$$
which is what we needed to complete the proof.



14.6 Martingale Central Limit Theorem

For an iid mean zero sequence of random variables $Z_1, Z_2, \ldots$ with variance one, the central limit theorem says that for large $n$, $\frac{Z_1+\cdots+Z_n}{\sqrt n}$ is approximately standard normal. Suppose now that we consider a mean zero martingale (adapted to some sequence $\{Y_n\}$) $\{X_n, n \ge 0\}$ with $X_0 = 0$ and write $Z_i = X_i - X_{i-1}$, $i \ge 1$. Then, obviously we can write
$$X_n = X_n - X_0 = \sum_{i=1}^n (X_i - X_{i-1}) = \sum_{i=1}^n Z_i.$$
The summands $Z_i$ are certainly no longer independent; however, they are uncorrelated (see the chapter exercises). The martingale central limit theorem says that under certain conditions on the growth of the conditional variances $\mathrm{Var}(Z_n\,|\,Y_0,\ldots,Y_{n-1})$, $\frac{X_n}{\sqrt n}$ will still be approximately normally distributed for large $n$.
The area of martingale central limit theorems is a bit confusing due to an over-
whelming variety of central limit theorems, each known as a martingale central
limit theorem. In particular, the normalization of Xn can be deterministic or ran-
dom. Also, there can be a double array of martingales and central limit theorems
for them, analogous to Lyapounov’s central limit theorem for the independent case.
The best source and exposition of martingale central limit theorems is the classic
book by Hall and Heyde (1980). We present two specific martingale central limit
theorems in this section.
First, we need some notation. Let $\{X_n, n \ge 0\}$ be a zero mean martingale adapted to some sequence $\{Y_n\}$, with $X_0 = 0$. Let
$$Z_i = X_i - X_{i-1},\ i \ge 1;\qquad \sigma_j^2 = \mathrm{Var}(Z_j\,|\,Y_0,\ldots,Y_{j-1}) = E(Z_j^2\,|\,Y_0,\ldots,Y_{j-1});$$
$$V_n^2 = \sum_{j=1}^n \sigma_j^2;\qquad s_n^2 = E(V_n^2) = E(X_n^2) = \mathrm{Var}(X_n)$$
(see Section 14.3.2 for the fact that $E(V_n^2)$ and $E(X_n^2)$ are equal if $X_0 = 0$).
The desired result is that $\frac{X_n}{s_n}$ converges in distribution to $N(0,1)$. The question is under what conditions can one prove such an asymptotic normality result. The conditions that we use are very similar to the corresponding Lindeberg–Lévy conditions in the independent case. Here are the two conditions we assume.

(A) Concentration Condition
$$\frac{V_n^2}{s_n^2} = \frac{V_n^2}{E(V_n^2)} \stackrel{P}{\longrightarrow} 1.$$

(B) Martingale Lindeberg Condition
$$\text{For any } \epsilon > 0,\qquad \frac{\sum_{j=1}^n E\left(Z_j^2 I_{\{|Z_j| \ge \epsilon s_n\}}\right)}{s_n^2} \stackrel{P}{\longrightarrow} 0.$$

Under condition (A), the Lindeberg condition (B) is nearly equivalent to the uniform asymptotic negligibility condition that $\frac{\max_{1\le j\le n}\sigma_j^2}{s_n^2} \stackrel{P}{\longrightarrow} 0$. We commonly see such uniform asymptotic negligibility conditions in the independent case central limit theorems. See Hall and Heyde (1980) and Brown (1971) for much additional discussion on the exact role of the Lindeberg condition in martingale central limit theorems. Here is our basic martingale CLT.

Theorem 14.22 (Basic Martingale CLT). Suppose conditions (A) and (B) hold. Then $\frac{X_n}{s_n} \stackrel{\mathcal L}{\Longrightarrow} Z$, where $Z \sim N(0,1)$.

The proof of the Lindeberg–Lévy theorem for the independent case has to be suitably adapted to the martingale structure in order to prove this theorem. The two references above can be consulted for a proof. The Lindeberg condition can be difficult to verify. The following simpler version of martingale central limit theorems suffices for many applications. For this, we need the additional notation
$$\tau_t = \inf\left\{n > 0 : \sum_{j=1}^n \sigma_j^2 \ge t\right\}.$$
Here is our simpler version of the martingale CLT.

Theorem 14.23. Assume that
$$|Z_i| \le K < \infty \ \text{ for all } i \text{ and some } K;$$
$$\sum_{j=1}^\infty \sigma_j^2 = \infty \ \text{ almost surely};$$
$$\frac{\tau_t}{t} \stackrel{a.s.}{\longrightarrow} \frac{1}{\sigma^2} \ \text{ for some finite and positive constant } \sigma^2.$$
Then $\frac{X_n}{\sqrt n} \stackrel{\mathcal L}{\Longrightarrow} W$, where $W \sim N(0, \sigma^2)$.
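A Python sketch (entirely our construction) illustrates the flavor of Theorem 14.23: take fair coin flips $\epsilon_i = \pm1$ and increments $Z_i = \epsilon_i h(X_{i-1})$ with a bounded $h$, so that $|Z_i|$ is bounded and $\sum_j \sigma_j^2 = \infty$; that the time-averaged conditional variance stabilizes, as the third condition demands, is an assumption of this illustration rather than something we verify.

```python
# Simulation sketch: bounded, past-dependent martingale differences, and the
# approximate normality of X_n / sqrt(n).
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 2000

def one_path():
    eps = rng.choice([-1.0, 1.0], size=n)      # fair coin flips
    x = 0.0
    for e in eps:
        x += e * (1.0 + 0.5 * np.cos(x))       # |Z_i| <= 1.5, depends on the past
    return x

vals = np.array([one_path() for _ in range(reps)]) / np.sqrt(n)
print(vals.mean(), vals.var())                 # mean near 0; variance ~ sigma^2
print(np.mean(vals <= vals.std()), 0.8413)     # compare with Phi(1): roughly normal
```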

Exercises

Exercise 14.1. Suppose $\{X_n, n \ge 1\}$ is a martingale adapted to some sequence $\{Y_n\}$. Show that $E(X_{n+m}\,|\,Y_1,\ldots,Y_n) = X_n$ for all $m, n \ge 1$.

Exercise 14.2. Suppose $\{X_n, n \ge 1\}$ is a martingale adapted to some sequence $\{Y_n\}$. Fix $m \ge 1$ and define $Z_n = X_n - X_m$, $n \ge m+1$. Is it true that $\{Z_n\}$ is also a martingale?

Exercise 14.3 (Product Martingale). Let $X_1, X_2, \ldots$ be iid nonnegative random variables with a finite positive mean $\mu$. Identify a sequence of constants $c_n$ such that $Z_n = c_n\left(\prod_{i=1}^n X_i\right)$, $n \ge 1$, forms a martingale.

Exercise 14.4. Let $\{U_n\}, \{V_n\}$ be martingales, adapted to the same sequence $\{Y_n\}$. Identify, with proof, which of the following are also submartingales, and for those that are not necessarily submartingales, give a counterexample.
(a) $|U_n - V_n|$.
(b) $U_n^2 + V_n^2$.
(c) $U_n - V_n$.
(d) $\min(U_n, V_n)$.

Exercise 14.5 (Bayes Problem). Suppose given $p$, $X_1, X_2, \ldots$ are iid Bernoulli variables with a parameter $p$, and the marginal distribution of $p$ is $U[0,1]$. Let $S_n = X_1 + \cdots + X_n$, $n \ge 1$, and $Z_n = \frac{S_n+1}{n+2}$. Show that $\{Z_n\}$ is a martingale with respect to the sequence $\{X_n\}$.

Exercise 14.6 (Bayes Problem). Suppose given $\lambda$, $X_1, X_2, \ldots$ are iid Poisson variables with some mean $\lambda$, and the marginal density of $\lambda$ is $\frac{\beta^\alpha e^{-\beta\lambda}\lambda^{\alpha-1}}{\Gamma(\alpha)}$, where $\alpha, \beta > 0$ are constants. Let $S_n = X_1 + \cdots + X_n$, $n \ge 1$, and $Z_n = \frac{S_n+\alpha}{n+\beta}$. Show that $\{Z_n\}$ is a martingale with respect to the sequence $\{X_n\}$.

Exercise 14.7 (Bayes Problem). Suppose given $\mu$, $X_1, X_2, \ldots$ are iid $N(\mu, 1)$ variables, and that the marginal distribution of $\mu$ is standard normal. Let $S_n = X_1 + \cdots + X_n$, $n \ge 1$, and $Z_n = \frac{S_n}{n+1}$. Show that $\{Z_n\}$ is a martingale with respect to the sequence $\{X_n\}$.

Exercise 14.8. Suppose $\{X_n\}$ is known to be a submartingale with respect to some sequence $\{Y_n\}$. Show that $\{X_n\}$ is also a martingale if and only if $E(X_n) = E(X_m)$ for all $m, n$.

Exercise 14.9. Let $X_1, X_2, \ldots$ be a sequence of iid random variables such that $E(|X_1|) < \infty$. For $n \ge 1$, let $X_{n:n} = \max(X_1, \ldots, X_n)$. Show that $\{X_{n:n}\}$ is a submartingale adapted to itself.

Exercise 14.10 (Random Walk). Consider a simple asymmetric random walk with iid steps distributed as $P(X_i = 1) = p$, $P(X_i = -1) = 1-p$, $p < \frac12$. Let $S_n = X_1 + \cdots + X_n$, $n \ge 1$. Show that
(a) $V_n = \left(\frac{1-p}{p}\right)^{S_n}$ is a martingale.
(b) Show that with probability one, $\sup_n S_n < \infty$.
Exercise 14.11 (Branching Process). Let $\{Z_{ij}\}$ be a double array of iid random variables with mean $\mu$ and variance $\sigma^2 < \infty$. Let $X_0 = 1$ and $X_{n+1} = \sum_{j=1}^{X_n} Z_{nj}$. Show that
(a) $W_n = \frac{X_n}{\mu^n}$ is a martingale.
(b) $\sup_n E(W_n) < \infty$.
(c) Is $\{W_n\}$ uniformly integrable? Prove or disprove it.
Remark. The process $W_n$ is commonly called a branching process and is important in population studies.

Exercise 14.12 (A Time Series Model). Let $Z_0, Z_1, \ldots$ be iid standard normal variables. Let $X_0 = Z_0$, and for $n \ge 1$, $X_n = X_{n-1} + Z_n h_n(X_0, \ldots, X_{n-1})$, where for each $n$, $h_n(x_0, \ldots, x_{n-1})$ is an absolutely bounded function.
Show that $\{X_n\}$ is a martingale adapted to some sequence $\{Y_n\}$, and explicitly identify such a sequence $\{Y_n\}$.

Exercise 14.13 (Another Time Series Model). Let $Z_0, Z_1, \ldots$ be a sequence of random variables such that $E(Z_{n+1}\,|\,Z_0,\ldots,Z_n) = cZ_n + (1-c)Z_{n-1}$, $n \ge 1$, where $0 < c < 1$. Let $X_0 = Z_0$, $X_n = \alpha Z_n + Z_{n-1}$, $n \ge 1$. Show that $\alpha$ may be chosen to make $\{X_n, n \ge 0\}$ a martingale with respect to $\{Z_n\}$.

Exercise 14.14 (Conditional Centering of a General Sequence). Let $Z_0, Z_1, \ldots$ be a general sequence of random variables, not necessarily independent, such that $E(|Z_k|) < \infty$ for all $k$. Let $V_n = \sum_{i=1}^n \left[Z_i - E(Z_i\,|\,Z_0,\ldots,Z_{i-1})\right]$, $n \ge 1$. Show that $\{V_n\}$ is a martingale with respect to the sequence $\{Z_n\}$.

Exercise 14.15 (The Cross-Product Martingale). Let $X_1, X_2, \ldots$ be independent random variables, with $E(|X_i|) < \infty$ and $E(X_i) = 0$ for all $i$. For a fixed $k \ge 1$, let $V_{k,n} = \sum_{1\le i_1 < i_2 < \cdots < i_k \le n} X_{i_1}\cdots X_{i_k}$, $n \ge k$. Show that $\{V_{k,n}\}$ is a martingale with respect to $\{X_n\}$.

Exercise 14.16 (The Wright–Fisher Markov Chain). Consider the Wright–Fisher Markov chain of Example 14.7. Let
$$V_n = \frac{X_n(N - X_n)}{\left(1 - \frac1N\right)^n},\quad n \ge 0.$$
Show that $\{V_n, n \ge 0\}$ is a martingale.

Exercise 14.17 (An Example of Samuel Karlin). Let $f$ be a continuous function defined on $[0,1]$ and $U \sim U[0,1]$. Let $X_n = \frac{\lfloor 2^nU\rfloor}{2^n}$, and $V_n = \frac{f(X_n + 2^{-n}) - f(X_n)}{2^{-n}}$. Show that $\{V_n\}$ is a martingale with respect to the sequence $\{X_n\}$.

Exercise 14.18. Let $X_1, X_2, \ldots$ be iid symmetric random variables with mean zero, and let $S_n = \sum_{i=1}^n X_i$, $n \ge 1$, and $S_0 = 0$. Let $\psi(t)$ be the characteristic function of $X_1$, and $V_n = [\psi(t)]^{-n}e^{itS_n}$, $n \ge 0$. Show that the real part as well as the imaginary part of $\{V_n\}$ is a martingale.
Exercise 14.19 (Stopping Times). Consider the simple symmetric random walk $S_n$ with $S_0 = 0$. Identify, with proof, which of the following are stopping times, and which among them have a finite expectation.
(a) $\inf\{n > 0 : |S_n| > 5\}$.
(b) $\inf\{n \ge 0 : S_n < S_{n+1}\}$.
(c) $\inf\{n > 0 : |S_n| = 1\}$.
(d) $\inf\{n > 0 : |S_n| > 1\}$.

Exercise 14.20. Let $\tau$ be a nonnegative integer-valued random variable, and $\{X_n, n \ge 0\}$ a sequence of random variables, all defined on a common sample space $\Omega$. Prove or disprove that $\tau$ is a stopping time adapted to $\{X_n\}$ if and only if for every $n \ge 0$, $I_{\{\tau=n\}}$ is a function of only $X_0, \ldots, X_n$.

Exercise 14.21. Suppose $\tau_1, \tau_2$ are both stopping times with respect to some sequence $\{X_n\}$. Is $|\tau_1 - \tau_2|$ necessarily a stopping time with respect to $\{X_n\}$?

Exercise 14.22 (Condition for Optional Stopping Theorem). Suppose $\{X_n, n \ge 0\}$ is a martingale, and $\tau$ a stopping time, both adapted to a common sequence $\{Y_n\}$. Show that the equality $E(X_\tau) = E(X_0)$ holds if $E(|X_\tau|) < \infty$, and $E(X_{\min(\tau,n)}I_{\{\tau>n\}}) \to 0$ as $n \to \infty$.

Exercise 14.23 (The Random Walk). Consider the asymmetric random walk $S_n = \sum_{i=1}^n X_i$, where $P(X_i = 1) = p$, $P(X_i = -1) = q = 1-p$, $p > \frac12$, and $S_0 = 0$. Let $x$ be a fixed positive integer, and $\tau = \inf\{n > 0 : S_n = x\}$. Show that for $0 < s < 1$, $E(s^\tau) = \left(\frac{1 - \sqrt{1-4pqs^2}}{2qs}\right)^x$.

Exercise 14.24 (The Random Walk; continued). For the stopping time $\tau$ of the previous exercise, show that
$$E(\tau) = \frac{x}{p-q} \quad\text{and}\quad \mathrm{Var}(\tau) = \frac{x\left[1-(p-q)^2\right]}{(p-q)^3}.$$

Exercise 14.25 (Gambler's Ruin). Consider the general random walk $S_n = \sum_{i=1}^n X_i$, where $P(X_i = 1) = p \ne \frac12$, $P(X_i = -1) = q = 1-p$, and $S_0 = 0$. Let $a, b$ be fixed positive integers, and $\tau = \inf\{n > 0 : S_n = b \text{ or } S_n = -a\}$. Show that
$$E(\tau) = \frac{b}{p-q} - \frac{a+b}{p-q}\cdot\frac{1 - \left(\frac qp\right)^b}{1 - \left(\frac qp\right)^{a+b}},$$
and that by an application of L'Hospital's rule, this gives the correct formula for $E(\tau)$ even when $p = \frac12$.

Exercise 14.26 (Martingales for Patterns). Consider the following martingale approach to a geometric distribution problem. Let $X_1, X_2, \ldots$ be iid Bernoulli variables, with $P(X_i = 1) = p$, $P(X_i = 0) = q = 1-p$. Let $\tau = \min\{k \ge 1 : X_k = 0\}$, and $\tau_n = \min(\tau, n)$, $n \ge 1$.
Define $V_n = \frac1q\sum_{i=1}^n I_{\{X_i = 0\}}$, $n \ge 1$.
(a) Show that $\{V_n - n\}$ is a martingale with respect to the sequence $\{X_n\}$.
(b) Show that $E(V_{\tau_n}) = E(\tau_n)$ for all $n$.
(c) Hence, show that $E(\tau) = E(V_\tau) = \frac1q$.
Exercise 14.27 (Martingales for Patterns). Let $X_1, X_2, \ldots$ be iid Bernoulli variables, with $P(X_i = 1) = p$, $P(X_i = 0) = q = 1-p$. Let $\tau$ be the first $k$ such that $X_{k-2}, X_{k-1}, X_k$ are each equal to one (e.g., the number of tosses of a coin necessary to first obtain three consecutive heads), and $\tau_n = \min(\tau, n)$, $n \ge 3$.
Define
$$V_n = \frac{1}{p^3}\sum_{i=1}^{n-2} I_{\{X_i = X_{i+1} = X_{i+2} = 1\}} + \frac{1}{p^2}I_{\{X_{n-1} = X_n = 1\}} + \frac1p I_{\{X_n = 1\}},\quad n \ge 3.$$
(a) Show that $\{V_n - n\}$ is a martingale with respect to the sequence $\{X_n\}$.
(b) Show that $E(V_{\tau_n}) = E(\tau_n)$ for all $n$.
(c) Hence, show that
$$E(\tau) = E(V_\tau) = \frac1p + \frac1{p^2} + \frac1{p^3}.$$
(d) Generalize to the case of the expected waiting time for obtaining $r$ consecutive 1s.

Exercise 14.28. Let $\{X_n, n \ge 0\}$ be a martingale.
(a) Show that $\lim_{n\to\infty} E(|X_n|)$ exists.
(b) Show that for any stopping time $\tau$, $E(|X_\tau|) \le \lim_{n\to\infty} E(|X_n|)$.
(c) Show that if $\sup_n E(|X_n|) < \infty$, then $E(|X_\tau|) < \infty$ for any stopping time $\tau$.

Exercise 14.29 (Inequality for Stopped Martingales). Let $\{X_n, n \ge 0\}$ be a martingale, and $\tau$ a stopping time adapted to $\{X_n\}$. Show that $E(|X_\tau|) \le 2\sup_n E(X_n^+) - E(X_1) \le 3\sup_n E(|X_n|)$.

Exercise 14.30. Let $X_1, X_2, \ldots$ be iid random variables such that $E(|X_1|) < \infty$. Consider the random walk $S_n = \sum_{i=1}^n X_i$, $n \ge 1$, and $S_0 = 0$. Let $\tau$ be a stopping time adapted to $\{S_n\}$. Show that if $E(|S_\tau|) = \infty$, then $E(\tau)$ must also be infinite.

Exercise 14.31. Let $\{X_n, n \ge 0\}$ be a martingale, with $X_0 = 0$. Let $V_i = X_i - X_{i-1}$, $i \ge 1$. Show that for any $i \ne j$, $V_i$ and $V_j$ are uncorrelated.

Exercise 14.32. Let $\{X_n, n \ge 1\}$ be some sequence of random variables. Suppose $S_n = \sum_{i=1}^n X_i$, $n \ge 1$, and that $\{S_n, n \ge 1\}$ forms a martingale. Show that for any $i \ne j$, $E(X_iX_j) = 0$.

Exercise 14.33. Let $\{X_n, n \ge 0\}$ and $\{Y_n, n \ge 0\}$ both be square integrable martingales, adapted to some common sequence. Let $X_0 = Y_0 = 0$. Show that $E(X_nY_n) = \sum_{i=1}^n E[(X_i - X_{i-1})(Y_i - Y_{i-1})]$ for any $n \ge 1$.
Exercise 14.34. Give an example of a submartingale $\{X_n\}$ and a convex function $f$ such that $\{f(X_n)\}$ is not a submartingale.
Remark. Such a function $f$ cannot be increasing.

Exercise 14.35 (Characterization of Uniformly Integrable Martingales). Let $\{X_n\}$ be uniformly integrable and a martingale with respect to some sequence $\{Y_n\}$. Show that there exists a random variable $Z$ such that $E(|Z|) < \infty$ and such that for each $n$, $E(Z\,|\,Y_1,\ldots,Y_n) = X_n$ with probability one.

Exercise 14.36 ($L_p$-Convergence of a Martingale). Let $\{X_n, n \ge 0\}$ be a martingale, or a nonnegative submartingale. Suppose for some $p > 1$, $\sup_n E(|X_n|^p) < \infty$. Show that there exists a random variable $X$, almost surely finite, such that $E(|X_n - X|^p) \to 0$ and $X_n \stackrel{a.s.}{\longrightarrow} X$ as $n \to \infty$.

Exercise 14.37. Let $\{X_n\}$ be a nonnegative martingale. Suppose $E(X_n) \to 0$ as $n \to \infty$. Show that $X_n \stackrel{a.s.}{\longrightarrow} 0$.

Exercise 14.38. Let $X_1, X_2, \ldots$ be iid normal variables with mean zero and variance $\sigma^2$. Show that $\sum_{n=1}^\infty \frac{\sin(nx)}{n}X_n$ converges with probability one for any given real number $x$.

Exercise 14.39 (Generalization of Maximal Inequality). Let $\{X_n, n \ge 0\}$ be a nonnegative submartingale, and $\{b_n, n \ge 0\}$ a nonnegative nonincreasing sequence of constants such that $b_n \to 0$ as $n \to \infty$, and $\sum_{n=0}^\infty [b_n - b_{n+1}]E(X_n)$ converges.
(a) Show that for any $x > 0$,
$$P\left(\sup_{n\ge0} b_nX_n \ge x\right) \le \frac1x\sum_{n=0}^\infty [b_n - b_{n+1}]E(X_n).$$
(b) Derive the Kolmogorov maximal inequality for nonnegative submartingales as a corollary to part (a).

Exercise 14.40 (Decomposition of an $L_1$-Bounded Martingale). Let $\{X_n\}$ be an $L_1$-bounded martingale adapted to some sequence $\{Y_n\}$, that is, $\sup_n E(|X_n|) < \infty$.
(a) Define $Z_{m,n} = E[|X_{m+1}|\,|\,Y_1,\ldots,Y_n]$. Show that $Z_{m,n}$ is nondecreasing in $m$.
(b) Show that for fixed $n$, $Z_{m,n}$ converges almost surely.
(c) Let $U_n = \lim_m Z_{m,n}$. Show that $\{U_n\}$ is an $L_1$-bounded martingale.
(d) Show that $X_n$ admits the decomposition $X_n = U_n - V_n$, where both $U_n, V_n$ are nonnegative $L_1$-bounded martingales.

References

Azuma, K. (1967). Weighted sums of certain dependent random variables, Tohoku Math. J., 19,
357–367.
Brown, B.M. (1971). Martingale central limit theorems, Ann. Math. Statist., 42, 59–66.
Burkholder, D.L. (1973). Distribution function inequalities for martingales, Ann. Prob., 1, 19–42.

Burkholder, D.L., Davis, B., and Gundy, R. F. (1972). Integral inequalities for convex functions
of operators on martingales, Proc. Sixth Berkeley Symp. Math. Statist. Prob., Vol. II, 223–240,
University of California Press, Berkeley.
Chow, Y.S. and Teicher, H. (2003). Probability Theory: Independence, Interchangeability, Martin-
gales, Springer, New York.
Chung, K.L. (1974). A Course in Probability, Academic Press, New York.
DasGupta, A. (2010). Fundamentals of Probability: A First Course, Springer, New York.
Davis, B. (1970). On the integrability of the martingale square function, Israel J. Math., 8,
187–190.
Devroye, L. (1991). Exponential Inequalities in Nonparametric Estimation, Nonparametric Func-
tional Estimation and Related Topics, 31–44, Kluwer Acad. Publ., Dordrecht.
Doob, J.L. (1971). What is a martingale?, Amer. Math. Monthly, 78, 451–463.
Fristedt, B. and Gray, L. (1997). A Modern Approach to Probability Theory, Birkhäuser, Boston.
Hall, P. and Heyde, C. (1980). Martingale Limit Theory and Its Applications, Academic Press,
New York.
Heyde, C. (1972). Martingales: A case for a place in a statistician’s repertoire, Austr. J. Statist., 14,
1–9.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc., 58, 13–30.
Karatzas, I. and Shreve, S. (1991). Brownian Motion and Stochastic Calculus, Springer, New York.
Karlin, S. and Taylor, H.M. (1975). A First Course in Stochastic Processes, Academic Press, New
York.
McDiarmid, C. (1989). On the Method of Bounded Differences, Surveys in Combinatorics, London
Math. Soc. Lecture Notes, 141, 148–188, Cambridge University Press, Cambridge, UK.
Williams, D. (1991). Probability with Martingales, Cambridge University Press, Cambridge, UK.
Chapter 15
Probability Metrics

The central limit theorem is a very good example of approximating a potentially


complicated exact distribution by a simpler and easily computable approximate dis-
tribution. In mathematics, whenever we do an approximation, we like to quantify the
error of the approximation. Common sense tells us that an error should be measured
by some notion of distance between the exact and the approximate. Therefore, when
we approximate one probability distribution (measure) by another, we need a notion
of distances between probability measures. Fortunately, we have an abundant supply
of distances between probability measures. Some of them are for probability mea-
sures on the real line, and others for probability measures on a general Euclidean
space. Still others work in more general spaces. These distances on probability mea-
sures have other independent uses besides quantifying the error of an approximation.
We provide a basic treatment of some common distances on probability measures
in this chapter. Some of the distances have the so-called metric property, and they
are called probability metrics, whereas some others satisfy only the weaker notion
of being a distance. Our choice of which metrics and distances to include was nec-
essarily subjective.
The main references for this chapter are Rachev (1991), Reiss (1989), Zolotarev
(1983), Liese and Vajda (1987), Dudley (2002), DasGupta (2008), Rao (1987), and
Gibbs and Su (2002). Diaconis and Saloff-Coste (2006) illustrate some concrete
uses of probability metrics. Additional references are given in the sections.

15.1 Standard Probability Metrics Useful in Statistics

As we said above, there are numerous metrics and distances on probability mea-
sures. The choice of the metric depends on the need in a specific situation. No single
metric or distance is the best or the most preferable. There is also the very important
issue of analytic tractability and ease of computing. Some of the metrics are more
easily bounded, and some less so. Some of them are hard to compute. Our choice of
metrics and distances to cover in this chapter is guided by all these factors, and also
personal preferences. The definitions of the metrics and distances are given below.
However, we must first precisely draw the distinction between metrics and distances.


Definition 15.1. Let $\mathcal M$ be a class of probability measures on a sample space $\Omega$. A function $d : \mathcal M \times \mathcal M \to \mathcal R$ is called a distance on $\mathcal M$ if
(a) $d(P,Q) \ge 0\ \forall\ P, Q \in \mathcal M$, and $d(P,Q) = 0 \Leftrightarrow P = Q$;
(b) $d(P_1,P_3) \le d(P_1,P_2) + d(P_2,P_3)\ \forall\ P_1, P_2, P_3 \in \mathcal M$ (triangular inequality).
$d$ is called a metric on $\mathcal M$ if, moreover,
(c) $d(P,Q) = d(Q,P)\ \forall\ P, Q \in \mathcal M$ (symmetry).
Here now are the probability metrics and distances that we mention in this chapter.
Definition 15.2 (Kolmogorov Metric). Let $P, Q$ be probability measures on $\mathcal R^d$, $d \ge 1$, with corresponding CDFs $F, G$. The Kolmogorov metric is defined as
$$d(P,Q) = \sup_{x\in\mathcal R^d}|F(x) - G(x)|.$$

Wasserstein Metric. Let $P, Q$ be probability measures on $\mathcal R$ with corresponding CDFs $F, G$. The Wasserstein metric is defined as
$$W(P,Q) = \int_{-\infty}^\infty |F(x) - G(x)|\,dx.$$

Total Variation Metric. Let $P, Q$ be absolutely continuous probability measures on $\mathcal R^d$, $d \ge 1$, with corresponding densities $f, g$. The total variation metric is defined as
$$\Delta(P,Q) = \Delta(f,g) = \frac12\int |f(x) - g(x)|\,dx.$$
If $P, Q$ are discrete, with corresponding mass functions $p, q$ on the set of values $\{x_1, x_2, \ldots\}$, then the total variation metric is defined as
$$\Delta(P,Q) = \Delta(p,q) = \frac12\sum_i |p(i) - q(i)|,$$
where $p(i), q(i)$ are the probabilities at $x_i$ under $P$ and $Q$, respectively.

Separation Distance. Let $P, Q$ be discrete, with corresponding mass functions $p, q$. The separation distance is defined as
$$D(P,Q) = \sup_i\left(1 - \frac{p(i)}{q(i)}\right).$$
Note that the order of $P, Q$ matters in defining $D(P,Q)$.

Hellinger Metric. Let $P, Q$ be absolutely continuous probability measures on $\mathcal R^d$, $d \ge 1$, with corresponding densities $f, g$. The Hellinger metric is defined as
$$H(P,Q) = \left[\int\left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2dx\right]^{1/2}.$$
If $P, Q$ are discrete, with corresponding mass functions $p, q$, the Hellinger metric is defined as
$$H(P,Q) = \left[\sum_i\left(\sqrt{p_i} - \sqrt{q_i}\right)^2\right]^{1/2}.$$

Kullback–Leibler Distance. Let $P, Q$ be absolutely continuous probability measures on $\mathcal R^d$, $d \ge 1$, with corresponding densities $f, g$. The Kullback–Leibler distance is defined as
$$K(P,Q) = K(f,g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx.$$
If $P, Q$ are discrete, with corresponding mass functions $p, q$, then the Kullback–Leibler distance is defined as
$$K(P,Q) = K(p,q) = \sum_i p(i)\log\frac{p(i)}{q(i)}.$$
Note that the order of $P, Q$ matters in defining $K(P,Q)$.

Lévy–Prokhorov Metric. Let $P, Q$ be probability measures on $\mathcal R^d$, $d \ge 1$. The Lévy–Prokhorov metric is defined as
$$L(P,Q) = \inf\{\epsilon > 0 : \forall\ B\ \text{Borel},\ P(B) \le Q(B^\epsilon) + \epsilon\},$$
where $B^\epsilon$ is the outer $\epsilon$-parallel body of $B$; that is,
$$B^\epsilon = \{x \in \mathcal R^d : \inf_{y\in B}||x - y|| \le \epsilon\}.$$
If $d = 1$, then $L(P,Q)$ equals
$$L(P,Q) = \inf\{\epsilon > 0 : \forall\ x,\ F(x) \le G(x+\epsilon) + \epsilon\},$$
where $F, G$ are the CDFs of $P, Q$.

$f$-Divergences. Let $P, Q$ be absolutely continuous probability measures on $\mathcal R^d$, $d \ge 1$, with densities $p, q$, and $f$ any real-valued convex function on $\mathcal R^+$, with $f(1) = 0$. The $f$-divergence between $P, Q$ is defined as
$$d_f(P,Q) = \int q(x)f\left(\frac{p(x)}{q(x)}\right)dx.$$
If $P, Q$ are discrete, with corresponding mass functions $p, q$, then the $f$-divergence is defined as
$$d_f(P,Q) = \sum_i q(i)f\left(\frac{p(i)}{q(i)}\right).$$
$f$-divergences have the finite partition property that $d_f(P,Q) = \sup_{\{A_j\}}\sum_j Q(A_j)f\left(\frac{P(A_j)}{Q(A_j)}\right)$, where the supremum is taken over all possible finite partitions $\{A_j\}$ of $\mathcal R^d$.

15.2 Basic Properties of the Metrics

Basic properties and interrelationships of these distances and metrics are now
studied.
Theorem 15.1. (a) Let $P_n, P$ be probability measures on $\mathcal R$ such that $d(P_n, P) \to 0$ as $n \to \infty$. Then $P_n$ converges weakly (i.e., in distribution) to $P$. If the CDF of $P$ is continuous, then the converse is also true.
(b) $\Delta(P,Q) = \sup_B|P(B) - Q(B)|$, where the supremum is taken over all (Borel) sets $B$ in $\mathcal R^d$.
(c) Given probability measures $P_1, P_2$, $\Delta(P_1, P_2)$ satisfies the coupling identity
$$\Delta(P_1,P_2) = \inf\{P(X \ne Y) : \text{all jointly distributed } (X,Y) \text{ with } X \sim P_1,\ Y \sim P_2\}.$$
Furthermore,
$$\Delta(P_1,P_2) = \frac12\sup_{h:|h|\le1}|E_{P_1}(h) - E_{P_2}(h)|.$$
(d) Let $\mathcal H$ denote the family of all functions $h : \mathcal R \to \mathcal R$ with the Lipschitz norm bounded by one; that is,
$$\mathcal H = \left\{h : \sup_{\{x,y\}}\frac{|h(x) - h(y)|}{|x-y|} \le 1\right\}.$$
Then,
$$W(P,Q) = \sup\{|E_P(h) - E_Q(h)| : h \in \mathcal H\}.$$
(e) Let $P_n, P$ be probability measures on $\mathcal R^d$. Then $P_n$ converges weakly to $P$ if and only if $L(P_n, P) \to 0$.
(f) If any of $\Delta(P_n,P), W(P_n,P), D(P_n,P), H(P_n,P), K(P_n,P) \to 0$, then $P_n$ converges weakly to $P$.
(g) The following converses of part (f) are true.
(i) $P_n$ converges weakly to $P$ $\Rightarrow$ $\Delta(P_n,P) \to 0$ if $P_n, P$ are absolutely continuous and unimodal probability measures on $\mathcal R$, or if $P_n, P$ are discrete with mass functions $p_n, p$.
(ii) $P_n$ converges weakly to $P$ $\Rightarrow$ $W(P_n,P) \to 0$ if $P_n, P$ are all supported on a bounded set $\Omega$ in $\mathcal R$.
(iii) $P_n$ converges weakly to $P$ $\Rightarrow$ $H(P_n,P) \to 0$ if $P_n, P$ are absolutely continuous and unimodal probability measures on $\mathcal R$, or if $P_n, P$ are discrete with mass functions $p_n, p$.
Proof. Due to the long nature of the proofs, we only refer to the proofs of parts (c)–(g). Parts (a) and (b) are proved below. We refer to Gibbs and Su (2002) for parts (c), (e), and (f). Part (d) is proved in Dudley (2002). Part (i) in (g) is a result in Ibragimov (1956). Part (ii) in (g) is a consequence of Theorem 2 in Gibbs and Su (2002), and part (e) in this theorem. The first statement in part (iii) in (g) is a consequence of inequality (8) in Gibbs and Su (2002) (which is a result in LeCam (1969)) and part (i) in (g) of this theorem. The second statement in part (iii) in (g) is a consequence of Scheffé's theorem (see Chapter 7), and once again inequality (8) in Gibbs and Su (2002).

As regards part (a), that $P_n$ converges weakly to $P$ if $d(P_n,P) \to 0$ is a consequence of the definition of convergence in distribution. That the converse is also true when $P$ has a continuous CDF is a consequence of Pólya's theorem (see Chapter 7).

The proof of part (b) proceeds as follows. Consider the case when $P, Q$ have densities $f, g$. Fix any (Borel) set $B$, and define $B_0 = \{x : f(x) \ge g(x)\}$. Then,
$$P(B) - Q(B) = \int_B [f(x) - g(x)]dx \le \int_{B_0}[f(x) - g(x)]dx.$$
On the other hand,
$$\int|f(x) - g(x)|dx = \int_{B_0}[f(x) - g(x)]dx + \int_{B_0^c}[g(x) - f(x)]dx = 2\int_{B_0}[f(x) - g(x)]dx.$$
Combining these two facts,
$$\frac12\int|f(x) - g(x)|dx = \int_{B_0}[f(x) - g(x)]dx = P(B_0) - Q(B_0)
\le \sup_B(P(B) - Q(B)) \le \int_{B_0}[f(x) - g(x)]dx$$
$$\Rightarrow\ \sup_B(P(B) - Q(B)) = \frac12\int|f(x) - g(x)|dx.$$
We can also switch $P, Q$ in the above argument to show that $\sup_B(Q(B) - P(B)) = \frac12\int|f(x) - g(x)|dx$, and therefore the statement in part (b) follows. ∎

Example 15.1 (Distances of Joint and Marginal Distributions). Suppose $f_1, g_1$ and $f_2, g_2$ are two pairs of univariate densities such that according to some metric (or distance), $f_1$ is close to $g_1$, and $f_2$ is close to $g_2$. Now make two bivariate random vectors, say $(X_1, X_2)$ and $(Y_1, Y_2)$, with $X_1 \sim f_1$, $X_2 \sim f_2$, $Y_1 \sim g_1$, $Y_2 \sim g_2$. Then we may expect that the joint distribution of $(X_1, X_2)$ is close to the joint distribution of $(Y_1, Y_2)$ in that same metric. Indeed, such concrete results hold if we assume that $X_1, X_2$ are independent, and that $Y_1, Y_2$ are independent. This example shows a selection of such results.

Consider first the total variation metric, with the independence assumptions stated in the above paragraph. Denote the distribution of $X_i$ by $P_i$, $i = 1, 2$, and denote the distribution of $Y_i$ by $Q_i$, $i = 1, 2$. Then, the total variation distance between the joint distribution $P$ of $(X_1, X_2)$ and the joint distribution $Q$ of $(Y_1, Y_2)$ is
$$2\Delta(P,Q) = \int\int|f_1(x_1)f_2(x_2) - g_1(x_1)g_2(x_2)|\,dx_1dx_2$$
$$= \int\int|f_1(x_1)f_2(x_2) - f_1(x_1)g_2(x_2) + f_1(x_1)g_2(x_2) - g_1(x_1)g_2(x_2)|\,dx_1dx_2$$
$$\le \int\int f_1(x_1)|f_2(x_2) - g_2(x_2)|\,dx_1dx_2 + \int\int g_2(x_2)|f_1(x_1) - g_1(x_1)|\,dx_1dx_2$$
$$= \int|f_2(x_2) - g_2(x_2)|\,dx_2 + \int|f_1(x_1) - g_1(x_1)|\,dx_1 = 2\Delta(P_2,Q_2) + 2\Delta(P_1,Q_1)$$
$$\Rightarrow\ \Delta(P,Q) \le \Delta(P_1,Q_1) + \Delta(P_2,Q_2).$$
In fact, quite evidently, the inequality generalizes to the case of $k$-variate joint distributions, $\Delta(P,Q) \le \sum_{i=1}^k \Delta(P_i,Q_i)$, as long as we make the assumption of independence of the coordinates in the $k$-variate distributions.

Next consider the Kullback–Leibler distance between joint distributions, under the same assumption of independence of the coordinates. Consider the bivariate case for ease. Then,
$$K(P,Q) = \int\int f_1(x_1)f_2(x_2)\log\left(\frac{f_1(x_1)f_2(x_2)}{g_1(x_1)g_2(x_2)}\right)dx_1dx_2$$
$$= \int\int f_1(x_1)f_2(x_2)\left[\log f_1(x_1) + \log f_2(x_2) - \log g_1(x_1) - \log g_2(x_2)\right]dx_1dx_2$$
$$= \int f_1(x_1)\log f_1(x_1)dx_1 + \int f_2(x_2)\log f_2(x_2)dx_2 - \int f_1(x_1)\log g_1(x_1)dx_1 - \int f_2(x_2)\log g_2(x_2)dx_2$$
$$= \int f_1(x_1)\log\frac{f_1(x_1)}{g_1(x_1)}dx_1 + \int f_2(x_2)\log\frac{f_2(x_2)}{g_2(x_2)}dx_2 = K(P_1,Q_1) + K(P_2,Q_2).$$
Once again, the result generalizes to the case of a general $k$; that is, under the assumption of independence of the coordinates of the $k$-variate joint distribution, $K(P,Q) = \sum_{i=1}^k K(P_i,Q_i)$.
A formula connecting $H(P,Q)$, the Hellinger distance, to $H(P_i,Q_i)$ is also possible, and is a chapter exercise.
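Both facts are easy to check numerically in the discrete case. In the Python sketch below (the two pairs of mass functions are our arbitrary choices), the joint mass functions under independence are outer products:

```python
# Numerical sketch: total variation is subadditive across independent
# coordinates, while Kullback-Leibler is exactly additive.
import numpy as np

p1, q1 = np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])
p2, q2 = np.array([0.7, 0.3]), np.array([0.5, 0.5])

P = np.outer(p1, p2).ravel()   # joint mass functions under independence
Q = np.outer(q1, q2).ravel()

tv = lambda u, v: 0.5 * np.abs(u - v).sum()
kl = lambda u, v: (u * np.log(u / v)).sum()

print(tv(P, Q) <= tv(p1, q1) + tv(p2, q2))             # True
print(np.isclose(kl(P, Q), kl(p1, q1) + kl(p2, q2)))   # True
```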
Example 15.2 (Hellinger Distance Between Two Normal Distributions). The total variation distance between two general univariate normal distributions was worked out in Example 7.36. It was, in fact, somewhat involved. In comparison, the Hellinger and the Kullback–Leibler distance between two normal distributions is easy to find. We work out a formula for the Hellinger distance in this example. The Kullback–Leibler case is a chapter exercise.

Let $P$ be the measure corresponding to the $d$-variate $N_d(\mu_1, \Sigma)$ distribution, and $Q$ the measure corresponding to the $N_d(\mu_2, \Sigma)$ distribution. We may assume without any loss of generality that $\mu_1 = 0$, and write simply $\mu$ for $\mu_2$. Denote the corresponding densities by $f, g$. Then,
$$\int_{\mathcal R^d}\left(\sqrt f - \sqrt g\right)^2dx = 2\left[1 - \int_{\mathcal R^d}\sqrt{fg}\,dx\right]$$
$$= 2\left[1 - \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\int_{\mathcal R^d}e^{-\frac{x'\Sigma^{-1}x}{4} - \frac{(x-\mu)'\Sigma^{-1}(x-\mu)}{4}}dx\right].$$
Now, writing $\Sigma^{-\frac12}\mu = \eta$,
$$\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\int_{\mathcal R^d}e^{-\frac{x'\Sigma^{-1}x}{4} - \frac{(x-\mu)'\Sigma^{-1}(x-\mu)}{4}}dx
= \frac{1}{(2\pi)^{d/2}}\int_{\mathcal R^d}e^{-\frac{x'x}{2} + \frac{\eta'x}{2}}dx\ e^{-\frac{\eta'\eta}{4}}$$
$$= \frac{1}{(2\pi)^{d/2}}\int_{\mathcal R^d}e^{-\frac12\left(x-\frac\eta2\right)'\left(x-\frac\eta2\right)}dx\ e^{-\frac{\eta'\eta}{8}} = e^{-\frac{\eta'\eta}{8}}.$$
This gives $H(P,Q) = \sqrt{2\left[1 - e^{-\frac{\mu'\Sigma^{-1}\mu}{8}}\right]}$. For the general case, when $P$ corresponds to $N_d(\mu_1,\Sigma)$ and $Q$ corresponds to $N_d(\mu_2,\Sigma)$, this implies that
$$H(P,Q) = \sqrt{2\left[1 - e^{-\frac{(\mu_2-\mu_1)'\Sigma^{-1}(\mu_2-\mu_1)}{8}}\right]}.$$
We recall from Chapter 7 that in the univariate case when the variances are equal, the total variation distance between $N(\mu_1,\sigma^2)$ and $N(\mu_2,\sigma^2)$ has the simple formula $2\Phi\left(\frac{|\mu_2-\mu_1|}{2\sigma}\right) - 1$. The Hellinger distance and the total variation distance between $N(0,1)$ and $N(\mu,1)$ distributions in one dimension are plotted in Fig. 15.1 for visual comparison; the similarity of the two shapes is interesting.
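The closed form is easy to check numerically. The Python sketch below (our choice of $\mu$; it assumes SciPy's quadrature routine is available) compares a direct integration of $(\sqrt f - \sqrt g)^2$ in the case $d = 1$, $\Sigma = 1$ with the formula $\sqrt{2(1-e^{-\mu^2/8})}$:

```python
# Numerical sketch: H(N(0,1), N(mu,1)) versus the closed-form expression.
import numpy as np
from scipy.integrate import quad

mu = 1.7
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
g = lambda x: np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi)

H2, _ = quad(lambda x: (np.sqrt(f(x)) - np.sqrt(g(x)))**2, -20, 20)
print(np.sqrt(H2), np.sqrt(2 * (1 - np.exp(-mu**2 / 8))))  # the two agree
```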

Example 15.3 (Best Normal Approximation to a Cauchy Distribution). Consider the one-dimensional standard Cauchy distribution and denote it as $P$. Consider a general one-dimensional normal distribution $N(\mu,\sigma^2)$, and denote it as $Q$. We want to find the particular $Q$ that minimizes some suitable distance between $P$ and $Q$. We use the Kolmogorov metric $d(P,Q)$ in this example.
[Fig. 15.1 Plot of Hellinger and variation distance between $N(0,1)$ and $N(\mu,1)$]

Because d.P; Q/ D supx jF .x/  G.x/j, where F; G are the CDFs of P; Q,


respectively, it follows that there exists an x0 such that d.P; Q/ D jF .x0 /G.x0 /j,
and moreover, that at such an x0 , the two respective density functions f; g must cut.
That is, we must have f .x0 / D g.x0 / at such an x0 .
Now,
r
   .x0 /2 2
f .x0 / D g.x0 / , 1 C x0 e 2 2 D 
2

 
  .x0  /2 1 2
, log 1 C x02  2
D c D log  C log :
2 2 
The derivative with respect to x0 of the lhs of the above equation equals

 C .2 2  1/x0 C x02  x03


:
 2 .1 C x02 /

The numerator of this expression is a cubic, thus it follows that it can be zero at at most three values of $x_0$, and therefore the mean value theorem implies that the number of roots of the equation $f(x_0) = g(x_0)$ can be at most four. Let $-\infty < x_{1,\mu,\sigma} \le x_{2,\mu,\sigma} \le x_{3,\mu,\sigma} \le x_{4,\mu,\sigma} < \infty$ be the roots of $f(x_0) = g(x_0)$. Then,
\[
d(P,Q) = \max_{1 \le i \le 4} |F(x_{i,\mu,\sigma}) - G(x_{i,\mu,\sigma})| = d_{\mu,\sigma},\ \text{say},
\]

and the best normal approximation to the standard Cauchy distribution is found by minimizing $d_{\mu,\sigma}$ over $(\mu, \sigma)$. This cannot be done in a closed-form analytical manner.
Numerical work gives that the minimum Kolmogorov distance is attained when $\mu = \pm .4749$ and $\sigma = 3.10$, resulting in the best normal approximation $N(\pm .4749,\ 9.61)$, and the corresponding minimum Kolmogorov distance of .0373.
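The numerical minimization reported above can be reproduced along the following lines. This is a hypothetical re-implementation (the author's actual computation is not shown), using a dense grid for the supremum and a coarse search over $(\mu, \sigma)$ near the reported minimizer.

```python
# Sketch: Kolmogorov distance between standard Cauchy and N(mu, sigma^2),
# minimized over a small grid of (mu, sigma) values.
import numpy as np
from scipy.stats import cauchy, norm

x = np.linspace(-100, 100, 200001)   # dense grid; the sup occurs at a density crossing
Fc = cauchy.cdf(x)

def kolmogorov_dist(mu, sigma):
    return np.abs(Fc - norm.cdf(x, loc=mu, scale=sigma)).max()

# coarse search near the reported minimizer; refine as desired
best = min((kolmogorov_dist(m, s), m, s)
           for m in np.linspace(0.40, 0.55, 16)
           for s in np.linspace(2.9, 3.3, 17))
print(best)   # roughly (0.0373, 0.47..., 3.1...)
```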
[Fig. 15.2 Standard Cauchy and the best Kolmogorov normal approximation]
The best normal approximation is not symmetric around zero, although the standard Cauchy is. Moreover, the best normal approximation does not approximate the density well, although the Kolmogorov distance is only .0373. Both are plotted in Fig. 15.2 for a visual comparison.
Example 15.4 (f-Divergences). f-Divergences have a certain unifying character with respect to measuring distances between probability measures. A number of leading metrics and distances that we have defined above are f-divergences with special choices of the convex function f.
For example, if $f(x) = \frac{|x-1|}{2}$, then
\[
d_f(P,Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx = \frac{1}{2}\int q(x)\left|\frac{p(x)}{q(x)} - 1\right| dx = \frac{1}{2}\int |p(x) - q(x)|\, dx = \rho(P,Q).
\]
If we let $f(x) = -\log x$, then
\[
d_f(P,Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx = \int q(x)\left(-\log\frac{p(x)}{q(x)}\right) dx = \int q(x)\log\frac{q(x)}{p(x)}\, dx = K(Q,P).
\]
On the other hand, if we choose $f(x) = x\log x$, then
\[
d_f(P,Q) = \int q(x)\,\frac{p(x)}{q(x)}\log\frac{p(x)}{q(x)}\, dx = \int p(x)\log\frac{p(x)}{q(x)}\, dx = K(P,Q).
\]
Note that two different choices of f were needed to produce $K(Q,P)$ and $K(P,Q)$; this is because the Kullback–Leibler distance is not symmetric between P and Q.
Next, if $f(x) = 2(1 - \sqrt{x})$, then
\[
d_f(P,Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx = 2\int q(x)\left[1 - \sqrt{\frac{p(x)}{q(x)}}\right] dx = 2\left[1 - \int\sqrt{p(x)q(x)}\, dx\right] = H^2(P,Q).
\]
Notice that we ended up with $H^2(P,Q)$ rather than $H(P,Q)$ itself. Interestingly, if we choose $f(x) = (1 - \sqrt{x})^2$, then
\[
d_f(P,Q) = \int q(x)\left(1 - \sqrt{\frac{p}{q}}\right)^2 dx = \int q(x)\left(1 - 2\sqrt{\frac{p}{q}} + \frac{p}{q}\right) dx = \int\left(q - 2\sqrt{pq} + p\right) dx
\]
\[
= 1 - 2\int\sqrt{pq}\, dx + 1 = 2\left[1 - \int\sqrt{pq}\, dx\right] = H^2(P,Q).
\]
So, two different choices of the function f result in the f-divergence being the square of the Hellinger metric.
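These unifications are easy to verify numerically. The sketch below (an illustration, not from the text) evaluates $d_f(P,Q)$ for the choices of f above on two simple discrete distributions and checks the answers against direct formulas for the total variation distance, $H^2$, and $K(P,Q)$.

```python
# Numerical illustration (a sketch) of the f-divergence identities above.
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

def f_divergence(f, p, q):
    # d_f(P,Q) = sum_x q(x) f(p(x)/q(x)); assumes q > 0 everywhere
    return np.sum(q * f(p / q))

tv = 0.5 * np.abs(p - q).sum()
h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
kl = np.sum(p * np.log(p / q))

print(f_divergence(lambda x: np.abs(x - 1) / 2, p, q), tv)       # total variation
print(f_divergence(lambda x: 2 * (1 - np.sqrt(x)), p, q), h2)    # H^2
print(f_divergence(lambda x: (1 - np.sqrt(x)) ** 2, p, q), h2)   # H^2 again
print(f_divergence(lambda x: x * np.log(x), p, q), kl)           # K(P, Q)
```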

Example 15.5 (Distribution of Uniform Maximum). Suppose $X_1, X_2, \ldots$ are iid $U[0,1]$ variables, and for $n \ge 1$, let $X_{(n)}$ be the maximum of $X_1, \ldots, X_n$. Then, for any given $x > 0$, and $n \ge x$,
\[
P(n(1 - X_{(n)}) > x) = P\left(X_{(n)} < 1 - \frac{x}{n}\right) = \left(1 - \frac{x}{n}\right)^n \to e^{-x}
\]
as $n \to \infty$. Thus, $n(1 - X_{(n)})$ converges in distribution to a standard exponential random variable, and moreover, the density of $n(1 - X_{(n)})$ at any fixed x, say $f_n(x)$, converges to the standard exponential density $f(x) = e^{-x}$. Therefore, by Scheffé's theorem (Chapter 7), $\rho(P_n, P) \to 0$, where $P_n, P$ stand for the exact distribution of $n(1 - X_{(n)})$ and the standard exponential distribution, respectively. We obtain an approximation for $\rho(P_n, P)$ in this example.
Because $f_n(x) = \left(1 - \frac{x}{n}\right)^{n-1} I_{\{0 < x < n\}}$, we have
\[
2\rho(P_n, P) = \int_0^n \left|\left(1 - \frac{x}{n}\right)^{n-1} - e^{-x}\right| dx + \int_n^\infty e^{-x}\, dx = \int_0^n \left|\left(1 - \frac{x}{n}\right)^{n-1} - e^{-x}\right| dx + e^{-n}.
\]
Now observe that the second derivative of $\log\left(1 - \frac{x}{n}\right)^{n-1} - \log(e^{-x})$ is $-\frac{n-1}{n^2\left(1 - \frac{x}{n}\right)^2} < 0$ for all x in $(-\infty, n)$. Therefore, $\log\left(1 - \frac{x}{n}\right)^{n-1}$ can cut $\log(e^{-x})$ at most twice on $(-\infty, n)$, which means that $\left(1 - \frac{x}{n}\right)^{n-1}$ can cut $e^{-x}$ at most twice on $(-\infty, n)$. Because there is obviously one cut at $x = 0$, there can be at most one more cut on $(-\infty, n)$, and hence at most one more cut on $(0, n)$. In fact, there is such a cut, as can be seen by observing that $\left(1 - \frac{x}{n}\right)^{n-1} > e^{-x}$ for small positive x, whereas $\left(1 - \frac{x}{n}\right)^{n-1} < e^{-x}$ at $x = n$. Denote this unique point of cut by $x_n$. Then,
\[
2\rho(P_n, P) = \int_0^{x_n}\left[\left(1 - \frac{x}{n}\right)^{n-1} - e^{-x}\right] dx + \int_{x_n}^n\left[e^{-x} - \left(1 - \frac{x}{n}\right)^{n-1}\right] dx + e^{-n}.
\]
The point of cut $x_n$ can be analytically approximated as $x_n \approx 2$. This results, on doing the necessary integrations in the above line, in
\[
2\rho(P_n, P) \approx \int_0^2\left[\left(1 - \frac{x}{n}\right)^{n-1} - e^{-x}\right] dx + \int_2^n\left[e^{-x} - \left(1 - \frac{x}{n}\right)^{n-1}\right] dx + e^{-n}
= 2\left[e^{-2} - \left(1 - \frac{2}{n}\right)^n\right] \approx \frac{4e^{-2}}{n}.
\]
Thus, the total variation distance between the exact and the limiting distribution of $n(1 - X_{(n)})$ goes to zero at the rate $n^{-1}$ and is asymptotic to $\frac{2e^{-2}}{n}$. Even for $n = 20$, the approximation $\frac{2e^{-2}}{n}$ is extremely accurate. The exact value for $n = 20$ is $\rho(P_n, P) = .01376$ and the approximation is $\frac{2e^{-2}}{n} = .01353$, the error being .00023.
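The n = 20 numbers quoted above are easily checked by numerical integration; the following sketch (ours, not from the text) computes $\rho(P_n, P)$ from $2\rho(P_n, P) = \int_0^n |f_n - f|\,dx + e^{-n}$ and compares it with $2e^{-2}/n$.

```python
# Check (a sketch) of rho(P_20, P) ~ .01376 versus 2 e^{-2} / 20 ~ .01353.
import numpy as np

n = 20
x = np.linspace(0, n, 2_000_001)
fn = (1 - x / n) ** (n - 1)           # density of n(1 - X_(n)) on (0, n)
f = np.exp(-x)                        # standard exponential density
dx = x[1] - x[0]
rho = 0.5 * (np.sum(np.abs(fn - f)) * dx + np.exp(-n))
print(rho, 2 * np.exp(-2) / n)        # ~0.01376 vs ~0.01353
```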

15.3 Metric Inequalities

Due to an abundance of metrics and distances on probability measures, the question of which metric or distance to choose and what would happen if another metric or distance were chosen is very important. The interrelations between the various metrics are well illustrated by known inequalities that they share among themselves. These inequalities are also useful for the reason that some of the metrics are inherently easier to compute and some others are harder to compute. So, when an inequality is available, the computationally hard metric can be usefully approximated in terms of the computationally easier metric. The inequalities are also fundamentally interesting because of their theoretical elegance. A selection of metric inequalities is presented in this section.
Theorem 15.2.
(a) $d(P,Q) \le \rho(P,Q) \le H(P,Q) \le \sqrt{K(P,Q)}$.
(b) $\rho(P,Q)$ satisfies the lower bound
\[
\rho(P,Q) \ge \max\left\{L(P,Q),\ \frac{H^2(P,Q)}{2},\ \frac{W(P,Q)}{\mathrm{diam}(S)}\right\},
\]
where S is the smallest set satisfying $P(S) = Q(S) = 1$, and $\mathrm{diam}(S) = \sup\{\|x - y\| : x, y \in S\}$.
(c) If P, Q are discrete,
\[
\rho(P,Q) \le \min\left\{D(P,Q),\ \sqrt{\frac{K(P,Q)}{2}}\right\}.
\]
(d) $[L(P,Q)]^2 \le W(P,Q) \le (1 + \mathrm{diam}(S))\, L(P,Q)$.
(e) If P, Q are supported on the set of integers $\{0, \pm 1, \pm 2, \ldots\}$, then
\[
W(P,Q) \ge \rho(P,Q).
\]
Proof. We prove a few of these inequalities here; see Gibbs and Su (2002) and Reiss (1989) for the remaining parts.
Because $d(P,Q) = \sup_x |P((-\infty,x]) - Q((-\infty,x])|$, and $\rho(P,Q) = \sup_B |P(B) - Q(B)|$, it is obvious that $d(P,Q) \le \rho(P,Q)$, because $\rho(P,Q)$ is a supremum over a larger collection of sets.
To prove that $\rho(P,Q) \le H(P,Q)$, we use
\[
(p - q) = (\sqrt{p} - \sqrt{q})(\sqrt{p} + \sqrt{q}),
\]
and therefore, by the Cauchy–Schwarz inequality,
\[
\left(\int |p - q|\right)^2 \le \left(\int(\sqrt{p} - \sqrt{q})^2\right)\left(\int(\sqrt{p} + \sqrt{q})^2\right) = H^2(P,Q)\left(\int(\sqrt{p} + \sqrt{q})^2\right) \le H^2(P,Q)\int 2(p + q) = 4H^2(P,Q)
\]
\[
\Rightarrow \rho^2(P,Q) = \frac{1}{4}\left(\int |p - q|\right)^2 \le H^2(P,Q),
\]
giving the inequality $\rho(P,Q) \le H(P,Q)$.
We now prove $H(P,Q) \le \sqrt{K(P,Q)}$, which is the same as $H^2(P,Q) \le K(P,Q)$. For this, recall that $H^2(P,Q) = 2\left[1 - \int\sqrt{pq}\right]$. We now obtain a suitable lower bound on $\int\sqrt{pq}$, which leads to the desired upper bound on $H^2(P,Q)$. The lower bound on $\int\sqrt{pq}$ is
\[
\int\sqrt{pq} = \int p\sqrt{\frac{q}{p}} = E_P\left(\sqrt{\frac{q}{p}}\right) = E_P\left(e^{\frac{1}{2}\log\frac{q}{p}}\right) \ge e^{\frac{1}{2}E_P\left(\log\frac{q}{p}\right)}
\]
(by Jensen's inequality applied to the convex function $u \to e^u$)
\[
= e^{\frac{1}{2}\int p\log\frac{q}{p}} \ge 1 + \frac{1}{2}\int p\log\frac{q}{p}
\]
(by using the inequality $e^u \ge 1 + u$)
\[
= 1 - \frac{1}{2}\int p\log\frac{p}{q} = 1 - \frac{1}{2}K(P,Q)
\]
\[
\Rightarrow 1 - \int\sqrt{pq} \le \frac{1}{2}K(P,Q) \Rightarrow H^2(P,Q) \le K(P,Q).
\]
We now sketch the proof of the particular inequality in part (c) that in the discrete case $\rho(P,Q) \le \sqrt{K(P,Q)/2}$. For this, we show that $2\rho^2(P,Q) \le K(P,Q)$. This bound is proved by reducing the general case to the case of P, Q being two-point distributions, supported on the same two points. Suppose then at first that P, Q are distributions given by $P(0) = 1 - P(1) = p$, $Q(0) = 1 - Q(1) = r$. Then,
\[
2\rho^2(P,Q) = 2(p - r)^2 \le p\log\frac{p}{r} + (1 - p)\log\frac{1 - p}{1 - r},
\]
which is an entropy inequality. Thus, for such two-point distributions supported on the same two points, $2\rho^2(P,Q) \le K(P,Q)$.
Now take general P, Q, and consider the set $B_0 = \{i : p(i) \ge q(i)\}$. Also, consider the special two-point distributions $P^*, Q^*$ with supports on $\{0, 1\}$ and defined by $p = P^*(0) = P(B_0)$, and $Q^*(0) = Q(B_0)$. Then, from the proof of part (b) of Theorem 15.1, $\rho(P,Q) = \rho(P^*, Q^*)$. On the other hand, by the finite partition property of general f-divergences (see Section 15.1), $K(P,Q) \ge K(P^*, Q^*)$. Therefore,
\[
2\rho^2(P,Q) = 2\rho^2(P^*, Q^*) \le K(P^*, Q^*) \le K(P,Q).
\]
We finally give a proof of part (e) in a special case. The special case we consider is when the two mass functions $\{p_i\}$ and $\{r_i\}$ corresponding to the two distributions P, Q have one cut; that is, there exists $i_0$ such that $p_i \ge r_i$ for $i \le i_0$ and $p_i \le r_i$ for $i > i_0$. Then,
\[
W(P,Q) = \int |F(x) - G(x)|\, dx = \sum_k \int_k^{k+1} |F(x) - G(x)|\, dx = \sum_k \left|\sum_{i \le k}(p_i - r_i)\right| \ge \left|\sum_{i \le i_0}(p_i - r_i)\right| = \sum_{i \le i_0}(p_i - r_i) = \rho(P,Q). \qquad \square
\]
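Before looking at the next example, it may be reassuring to verify parts (a) and (c) of the theorem numerically; the sketch below (an illustration, with arbitrarily chosen discrete P and Q) does exactly this.

```python
# A numerical sanity check (sketch) of Theorem 15.2 (a) and (c)
# for two random discrete distributions on {0, 1, ..., 9}.
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()

d = np.abs(np.cumsum(p) - np.cumsum(q)).max()          # Kolmogorov distance
rho = 0.5 * np.abs(p - q).sum()                        # total variation
H = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))    # Hellinger
K = np.sum(p * np.log(p / q))                          # Kullback-Leibler

assert d <= rho <= H <= np.sqrt(K)                     # part (a)
assert rho <= np.sqrt(K / 2)                           # the bound in part (c)
print(d, rho, H, np.sqrt(K))
```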

Example 15.6 (Total Variation versus Hellinger). The inequalities in parts (a) and (b) of the above theorem say that the planar point $(H(P,Q), \rho(P,Q))$ for any P and Q falls in the convex region bounded by the straight line $y \le x$, the parabolic curve $y \ge \frac{x^2}{2}$, and the rectangle $[0, \sqrt{2}] \times [0, 1]$. In Fig. 15.3, we first provide a plot of the curve of the set of points $(H(P,Q), \rho(P,Q))$ as P, Q run through the family
[Fig. 15.3 Plot of total variation (rho) versus Hellinger (H) for P, Q in the Poisson family]
[Fig. 15.4 Plot of total variation (rho) versus Hellinger (H) for P, Q in the univariate normal family]
of Poisson distributions. This curve is the dark curve inside the region bounded by $y = x$, $y = \frac{x^2}{2}$, $x \ge 0$, $x \le \sqrt{2}$, which are also plotted to give a perspective for where the curve for the Poisson family lies within the admissible region. The curve for the Poisson family was obtained by computing $(H(P,Q), \rho(P,Q))$ at 1,000 randomly chosen pairs of Poisson distributions with means between 0 and 10. Then, the same plot is provided when P, Q run through the family of univariate normal distributions in Fig. 15.4. The curves for the Poisson and the normal case look very similar.
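A plot of this kind is straightforward to generate; the following sketch (a hypothetical reconstruction of the computation behind Fig. 15.3, not the author's code) computes $(H, \rho)$ at 1,000 random Poisson pairs and checks that every point obeys $\frac{H^2}{2} \le \rho \le H$. A scatter plot of the resulting points against the curves $y = x$ and $y = x^2/2$ reproduces the figure.

```python
# Sketch: (H, rho) at random pairs of Poisson distributions with means in (0, 10).
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(7)
ks = np.arange(0, 200)        # mass beyond k = 200 is negligible for means <= 10
pts = []
for _ in range(1000):
    l1, l2 = rng.uniform(0, 10, size=2)
    p, q = poisson.pmf(ks, l1), poisson.pmf(ks, l2)
    rho = 0.5 * np.abs(p - q).sum()
    H = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    pts.append((H, rho))
pts = np.array(pts)
# every point lies in the admissible region H^2/2 <= rho <= H
assert np.all(pts[:, 1] <= pts[:, 0] + 1e-10)
assert np.all(pts[:, 1] >= pts[:, 0] ** 2 / 2 - 1e-10)
```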
15.4 * Differential Metrics for Parametric Families
If the probability measures P, Q both belong to a common parametric family of distributions parametrized by some finite-dimensional parameter $\theta$, then some of the distances and metrics discussed by us can be used to produce new simple distance measures between two distributions in that parametric family. The new distances are produced by consideration of geometric ideas and have some direct relevance to statistical estimation of unknown finite-dimensional Euclidean parameters. The approach was initiated in Rao (1945). Two subsequent references are Kass and Vos (1997) and Rao (1987). We begin with an illustrative example.
Example 15.7 (Curvature of the Kullback–Leibler Distance). Let $\{f(x|\theta)\}$, $\theta \in \Theta \subseteq \mathbb{R}$, be a family of densities indexed by a real parameter $\theta$. Certain formal operations are done in this example, whose justifications require a set of regularity conditions on the densities $f(x|\theta)$. Let $P_\theta, P_\eta$ be two distributions in this parametric family. Then, the Kullback–Leibler distance between them is $K(P_\theta, P_\eta) = \int f(x|\theta)\log\frac{f(x|\theta)}{f(x|\eta)}\, dx = J(\eta)$ (say). Formally expanding this as a function of $\eta$ around the point $\theta$, we have
\[
J(\eta) = J(\theta) + (\eta - \theta)\,\frac{d}{d\eta}J(\eta)\Big|_{\eta=\theta} + \frac{(\eta - \theta)^2}{2}\,\frac{d^2}{d\eta^2}J(\eta)\Big|_{\eta=\theta} + \cdots
\]
\[
= 0 + (\eta - \theta)\int f(x|\theta)\,\frac{d}{d\eta}\big(-\log f(x|\eta)\big)\, dx\Big|_{\eta=\theta} + \frac{(\eta - \theta)^2}{2}\int f(x|\theta)\,\frac{d^2}{d\eta^2}\big(-\log f(x|\eta)\big)\, dx\Big|_{\eta=\theta} + \cdots
\]
\[
= -(\eta - \theta)\int f(x|\theta)\,\frac{\frac{d}{d\theta}f(x|\theta)}{f(x|\theta)}\, dx + \frac{(\eta - \theta)^2}{2}\int f(x|\theta)\left[-\frac{\frac{d^2}{d\theta^2}f(x|\theta)}{f(x|\theta)} + \frac{\left(\frac{d}{d\theta}f(x|\theta)\right)^2}{f^2(x|\theta)}\right] dx + \cdots
\]
\[
= -(\eta - \theta)\,\frac{d}{d\theta}\int f(x|\theta)\, dx - \frac{(\eta - \theta)^2}{2}\,\frac{d^2}{d\theta^2}\int f(x|\theta)\, dx + \frac{(\eta - \theta)^2}{2}\int \frac{\left(\frac{d}{d\theta}f(x|\theta)\right)^2}{f(x|\theta)}\, dx + \cdots
\]
\[
= 0 + 0 + \frac{(\eta - \theta)^2}{2}\int \frac{\left(\frac{d}{d\theta}f(x|\theta)\right)^2}{f(x|\theta)}\, dx = \frac{(\eta - \theta)^2}{2}\int \frac{\left(\frac{d}{d\theta}f(x|\theta)\right)^2}{f(x|\theta)}\, dx + \cdots.
\]
The quantity
\[
\int \frac{\left(\frac{d}{d\theta}f(x|\theta)\right)^2}{f(x|\theta)}\, dx = E_\theta\left(\frac{d}{d\theta}\log f(x|\theta)\right)^2
\]
is called the Fisher information function for the family $\{f(x|\theta)\}$ and is usually denoted as $I_f(\theta)$. Thus, subject to the validity of the formal calculations done in the above lines, for $\eta \approx \theta$, $K(P_\theta, P_\eta) \approx \frac{(\eta - \theta)^2}{2} I_f(\theta)$. In other words, the Fisher information function $I_f(\theta)$ measures the curvature of the Kullback–Leibler distance at the particular distribution $P_\theta$, and can be used as a measure of how quickly the measure $P_\theta$ is changing if we change $\theta$ slightly.
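The curvature interpretation can be checked on any family with a closed-form Kullback–Leibler distance. For the exponential densities $f(x|\theta) = \theta e^{-\theta x}$, one has $K(P_\theta, P_\eta) = \log\frac{\theta}{\eta} + \frac{\eta}{\theta} - 1$ and $I_f(\theta) = \frac{1}{\theta^2}$; the sketch below (ours, not from the text) verifies that the ratio of K to $\frac{(\eta - \theta)^2}{2}I_f(\theta)$ tends to one.

```python
# Numerical check (a sketch) of K(P_theta, P_eta) ~ (eta - theta)^2 I(theta) / 2
# for the exponential family f(x|theta) = theta exp(-theta x).
import numpy as np

theta = 2.0
for eps in [0.1, 0.01, 0.001]:
    eta = theta + eps
    K = np.log(theta / eta) + eta / theta - 1      # exact KL for this family
    approx = eps ** 2 / (2 * theta ** 2)           # (eta-theta)^2 I(theta) / 2
    print(eps, K, approx, K / approx)              # ratio -> 1 as eps -> 0
```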

15.4.1 * Fisher Information and Differential Metrics
The calculations of this example can be generalized to the case of multivariate densities more general than the normal, and to the case of vector-valued parameters. Furthermore, the basic divergence to start with can be more general than the Kullback–Leibler distance. Such a general treatment is sketched below; we follow Rao (1987).
Let $F : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$ satisfy the following properties.
(a) F is three times continuously partially differentiable with respect to each coordinate.
(b) $F(x, x) = 0\ \forall x \in \mathbb{R}^+$.
(c) $F(x, y)$ is strictly convex in y for each given $x \in \mathbb{R}^+$.
(d) $\frac{\partial F}{\partial y}(x, y)\big|_{y = x} = 0\ \forall x \in \mathbb{R}^+$.
Let $\{f(x|\theta)\}$ be a family of densities (or mass functions) with common support $S \subseteq \mathbb{R}^k$ for some k, $1 \le k < \infty$, and $\theta \in \Theta \subseteq \mathbb{R}^p$ for some p, $1 \le p < \infty$. The functions $f(x|\theta)$ are assumed to satisfy the following properties.
(i) $\frac{\partial^3}{\partial\theta_i^3} f(x|\theta)$ exists for each $i = 1, 2, \ldots, p$ and for each x.
(ii) For each fixed $\theta$, the divergence functional
\[
J(\theta, \eta) = \int_S F(f(x|\theta), f(x|\eta))\, dx
\]
(the integral to be replaced by a sum if the distributions are discrete) can be partially differentiated twice with respect to each $\eta_i$, $i = 1, 2, \ldots, p$, inside the
integral. Fix $\theta$ and consider another $\eta = \theta + d\theta$. Then the change in the divergence functional $J(\theta, \eta)$ from $0 = J(\theta, \theta)$ due to changing $\theta$ to $\eta$ is
\[
\nabla J \approx \sum_{i=1}^p \frac{\partial}{\partial\eta_i} J(\theta, \eta)\Big|_{\eta=\theta} d\theta_i + \sum_{i,j=1}^p \frac{\partial^2}{\partial\eta_i\partial\eta_j} J(\theta, \eta)\Big|_{\eta=\theta} d\theta_i\, d\theta_j = \sum_{i,j=1}^p \frac{\partial^2}{\partial\eta_i\partial\eta_j} J(\theta, \eta)\Big|_{\eta=\theta}\, d\theta_i\, d\theta_j,
\]
as the first-order partial derivatives will vanish under the assumptions made on F and f (in the above, when $i = j$, the iterated partial derivatives denote the second-order partial derivative). Now, by a direct differentiation under the integral sign, we get
\[
\frac{\partial^2}{\partial\eta_i\partial\eta_j} J(\theta, \eta)\Big|_{\eta=\theta} = \int F_{yy}(f(x|\theta), f(x|\theta))\left(\frac{\partial}{\partial\theta_i} f(x|\theta)\right)\left(\frac{\partial}{\partial\theta_j} f(x|\theta)\right) dx = g^F_{i,j}(\theta)\ \text{(say)},
\]
where the notation $F_{yy}$ means the second partial derivative of $F(x, y)$ with respect to y. The quantity $\sum_{i,j=1}^p g^F_{i,j}(\theta)\, d\theta_i\, d\theta_j$ is called the differential metric on $\Theta$ induced by F.
Suppose now we specialize to the case of f-divergences, so that our functional $F(x, y)$ is now $y\,\phi\!\left(\frac{x}{y}\right)$. The notation $\phi$ is used so as not to cause confusion with the underlying densities, which have been denoted as f. In this case, $g^F_{i,j}(\theta)$ takes the form
\[
g_{i,j}(\theta) = \phi''(1)\int \frac{1}{f(x|\theta)}\left(\frac{\partial f}{\partial\theta_i}\right)\left(\frac{\partial f}{\partial\theta_j}\right) dx = \phi''(1)\int f(x|\theta)\left(\frac{\partial\log f}{\partial\theta_i}\right)\left(\frac{\partial\log f}{\partial\theta_j}\right) dx.
\]
The integral
\[
\int f(x|\theta)\left(\frac{\partial\log f}{\partial\theta_i}\right)\left(\frac{\partial\log f}{\partial\theta_j}\right) dx = E_\theta\left[\left(\frac{\partial\log f}{\partial\theta_i}\right)\left(\frac{\partial\log f}{\partial\theta_j}\right)\right]
\]
is the $(i,j)$th element in the Fisher information matrix corresponding to the regular family of densities $\{f(x|\theta)\}$. Thus, apart from the constant multiplier $\phi''(1)$, the differential metric arising from all of these divergences is the same. This is an interesting unifying phenomenon, and at the same time shows that the Fisher information function has a special role.
15.4.2  Rao’s Geodesic Distances on Distributions

The differential metric corresponding to the Fisher information function can be used to produce distances between two members $P_\theta, P_\eta$ of a parametric family of distributions where the parameter belongs to a manifold in some Euclidean space. The distance between $P_\theta$ and $P_\eta$ is simply the geodesic distance between $\theta$ and $\eta$ arising from the differential metric on $\Theta$, the parameter space. The geodesic distance is basically the distance between $\theta$ and $\eta$ along the shortest curve joining the two points along the manifold $\Theta$. This approach was initiated in Rao (1945), and in special parametric families that we encounter in applications, the geodesic distance works out to neat and interesting distances. Specifically, the geodesic distance often has a connection to the form of a variance-stabilizing transformation (see Chapter 7). The geodesic distance is defined below.
Definition 15.3. Let $\Theta$ be a manifold in a Euclidean space $\mathbb{R}^p$ for some p, $1 \le p < \infty$, and d a metric on it. Let $\theta, \eta \in \Theta$, and let $\mathcal{C}$ be the family of curves $\gamma(t)$ on $[0,1]$ defined as
\[
\mathcal{C} = \{\gamma(t) : \gamma(0) = \theta,\ \gamma(1) = \eta,\ \gamma(t)\ \text{is piecewise continuously differentiable on}\ [0,1]\}.
\]
The geodesic distance between $\theta$ and $\eta$ is the infimum of the lengths of the curves $\gamma(t) \in \mathcal{C}$, with length of a curve being defined with respect to the metric d.
Geodesic curves are usually hard to compute, and need not, in general, be unique. Using calculus of variations methods, it can be shown that a geodesic curve is a solution to the following boundary value problem. Find a curve $\gamma(t)$ such that
\[
\gamma(0) = \theta; \quad \gamma(1) = \eta;
\]
\[
\frac{d^2\gamma_k}{dt^2} + \sum_{i,j}\Gamma^k_{ij}\,\frac{d\gamma_i}{dt}\,\frac{d\gamma_j}{dt} = 0,
\]
where
\[
\Gamma^k_{ij} = \frac{1}{2}\left[\frac{\partial}{\partial\theta_i} g_{jk}(\theta) + \frac{\partial}{\partial\theta_j} g_{ki}(\theta) - \frac{\partial}{\partial\theta_k} g_{ij}(\theta)\right].
\]
The geodesic distances have been calculated in the literature for several well-known parametric families of distributions. They are always nontrivial to calculate. We provide a selection of these formulas; see Rao (1987) for their derivations.
Distribution | Density | Geodesic Distance

Binomial | $\binom{n}{x}p^x(1-p)^{n-x}$ | $2\sqrt{n}\,\big|\arcsin\sqrt{p_1} - \arcsin\sqrt{p_2}\big|$

Poisson | $\frac{e^{-\lambda}\lambda^x}{x!}$ | $2\,\big|\sqrt{\lambda_1} - \sqrt{\lambda_2}\big|$

Geometric | $(1-p)p^x$ | $2\log\dfrac{1 - \sqrt{p_1p_2} + \big|\sqrt{p_1} - \sqrt{p_2}\big|}{\sqrt{(1-p_1)(1-p_2)}}$

Gamma ($\alpha$ fixed) | $\frac{e^{-\lambda x}\lambda^{\alpha}x^{\alpha-1}}{\Gamma(\alpha)}$ | $\sqrt{\alpha}\,\big|\log\lambda_1 - \log\lambda_2\big|$

Normal (fixed variance) | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\frac{|\mu_1 - \mu_2|}{\sigma}$

Normal (fixed mean) | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\sqrt{2}\,\big|\log\sigma_1 - \log\sigma_2\big|$

General Normal | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $2\sqrt{2}\,\tanh^{-1}\sqrt{\dfrac{(\mu_1-\mu_2)^2 + 2(\sigma_1-\sigma_2)^2}{(\mu_1-\mu_2)^2 + 2(\sigma_1+\sigma_2)^2}}$

p-Variate Normal ($\Sigma$ fixed) | $\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}$ | $\sqrt{(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)}$

p-Variate Normal ($\mu$ fixed) | $\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}$ | $\frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^p \log^2\lambda_i}$ (here, $\{\lambda_i\}$ are the roots of $|\Sigma_2 - \lambda\Sigma_1| = 0$)

Multinomial | $\frac{n!}{\prod_{i=1}^k n_i!}\prod_{i=1}^k p_i^{n_i}$ | $2\arccos\big(\sum_{i=1}^k \sqrt{p_i q_i}\big)$
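When the parameter is one-dimensional, the table entries can be checked by integrating $\sqrt{I_f(\theta)}$ along the parameter segment; the sketch below (ours, not from the text) does this for the Poisson row, where $I_f(\lambda) = \frac{1}{\lambda}$.

```python
# Sketch verifying the Poisson row: integrating sqrt(I(lambda)) = 1/sqrt(lambda)
# from lambda_1 to lambda_2 recovers 2 |sqrt(l1) - sqrt(l2)|.
import numpy as np

l1, l2 = 1.5, 7.0
lam = np.linspace(l1, l2, 1_000_001)
mid = 0.5 * (lam[1:] + lam[:-1])
dist_numeric = np.sum(1.0 / np.sqrt(mid)) * (lam[1] - lam[0])   # midpoint rule
print(dist_numeric, 2 * abs(np.sqrt(l1) - np.sqrt(l2)))          # both ~2.8420
```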

Exercises

Exercise 15.1 (Skills Exercise). Compute the Hellinger and the Kullback–Leibler distance between P and Q when (a) $P = U[0,1]$, $Q = \mathrm{Beta}(2,2)$; (b) $P = U[-a,a]$, $Q = N(0,1)$; (c) $P = \mathrm{Exp}(\lambda)$, $Q = \mathrm{Gamma}(\alpha, \lambda)$.
Exercise 15.2 (A Useful Formula). Compute the Kullback–Leibler distance between two general d-dimensional normal distributions.
Exercise 15.3. Suppose $X_1 \sim P_1$, $X_2 \sim P_2$, $Y_1 \sim Q_1$, $Y_2 \sim Q_2$, and that $X_1, X_2$ are independent and $Y_1, Y_2$ are independent. Let P, Q denote the joint distributions of $(X_1, X_2)$ and $(Y_1, Y_2)$. Show that $1 - \frac{1}{2}H^2(P,Q) = \left(1 - \frac{1}{2}H^2(P_1, Q_1)\right)\left(1 - \frac{1}{2}H^2(P_2, Q_2)\right)$.
Exercise 15.4. Suppose $X_1, X_2, \ldots, X_n$ are iid $U[0,1]$. Compute the Kullback–Leibler distance between $P = P_n$ and Q, where P is the exact distribution of $n(1 - X_{(n)})$ and Q stands for an exponential distribution with mean one. Hence, find a sequence $c_n \to 0$ such that $\frac{K(P_n, Q)}{c_n} \to 1$ as $n \to \infty$.
Exercise 15.5 * (How Large is the Class of t Distributions). Consider the class $\mathcal{F}$ of all one-dimensional t densities symmetric about zero; that is,
\[
\mathcal{F} = \left\{f : f(x) = \frac{\Gamma\left(\frac{\alpha+1}{2}\right)}{\sqrt{\pi\alpha}\,\Gamma\left(\frac{\alpha}{2}\right)\left(1 + \frac{x^2}{\alpha}\right)^{(\alpha+1)/2}},\ \alpha > 0\right\}.
\]
Show that $\sup_{f,g \in \mathcal{F}} \rho(f,g) = 1$.
Exercise 15.6. Calculate the Kullback–Leibler distance between P and Q, where P, Q are the $\mathrm{Bin}(n, p)$ and $\mathrm{Bin}(n, \theta)$ distributions.
Exercise 15.7 * (Binomial and Poisson). Let $P_n$ be the $\mathrm{Bin}(n, \frac{\lambda}{n})$ distribution and Q a Poisson distribution with mean $\lambda$; here $0 < \lambda < \infty$ is a fixed number independent of n. Prove that the total variation distance between $P_n$ and Q converges to zero as $n \to \infty$.
Exercise 15.8 * (Convergence of Discrete Uniforms). Let $U_n$ have a discrete uniform distribution on $\{1, 2, \ldots, n\}$ and let $X_n = \frac{U_n}{n}$. Let $P_n$ denote the distribution of $X_n$. Identify a distribution Q such that the Kolmogorov distance between $P_n$ and Q goes to zero, and then compute this Kolmogorov distance.
Does there exist any distribution R such that the total variation distance between $P_n$ and R converges to zero? Rigorously justify your answer.
Exercise 15.9 * (Generalized Hellinger Distances). For $\alpha > 0$, define
\[
H_\alpha(P,Q) = \left(\int \left|f^\alpha(x) - g^\alpha(x)\right|^{\frac{1}{\alpha}}\, dx\right)^{\alpha}.
\]
For what values of $\alpha$ is $H_\alpha$ a distance?
Exercise 15.10 * (An Interesting Plot). Use the formula in the text (Example 7.36) for the total variation distance between two general univariate normal distributions to plot the following set:
\[
S_\epsilon = \{(\mu, \sigma) : \rho(N(0,1), N(\mu, \sigma^2)) \le \epsilon\}.
\]
Use $\epsilon = .01, .05, .1$.
Exercise 15.11 (A Bayes Problem). Suppose $X_1, X_2, \ldots, X_n$ are iid $N(\theta, 1)$ and that $\theta$ has a $N(0, \tau^2)$ prior distribution. Compute the total variation distance between the posterior distribution of $\sqrt{n}(\theta - \bar{X})$ and the $N(0,1)$ distribution, and show that it goes to zero as $n \to \infty$.
Exercise 15.12 * (Variation Distance Between Multivariate Normals). Let P be the $N_p(\mu_1, \Sigma)$ and Q the $N_p(\mu_2, \Sigma)$ distribution. Prove that the total variation distance between P and Q admits the formula
\[
\rho(P,Q) = 2\Phi\left(\frac{\sqrt{(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)}}{2}\right) - 1,
\]
where $\Phi$ denotes the standard normal CDF.
Exercise 15.13 (Metric Inequalities). Directly verify the series of inequalities $d(P,Q) \le \rho(P,Q) \le H(P,Q) \le \sqrt{K(P,Q)}$ when $P = N(0,1)$ and $Q = N(\mu, \sigma^2)$. Are these inequalities all sharp in this special case?
Exercise 15.14 * (Converse of Scheffé's Theorem Is False). Give an example of a sequence of densities $f_n$ and another density f such that $f_n$ converges to f in total variation, but $f_n(x)$ does not converge pointwise to $f(x)$.
Exercise 15.15 (Bhattacharya Affinity). Given two densities f and g on a general Euclidean space $\mathbb{R}^k$, let $A(f,g) = 1 - \frac{2}{\pi}\arccos\left(\int_{\mathbb{R}^k}\sqrt{fg}\right)$. Express the Hellinger distance between f and g in terms of $A(f,g)$.
Exercise 15.16 (Chi-Square Distance). If P, Q are probability measures on a general Euclidean space with densities f, g with a common support S, define $\chi^2(P,Q) = \sqrt{\int_S \frac{(f-g)^2}{g}}$.
(a) Show that $\chi^2(P,Q)$ is a distance, but not a metric.
(b) Show that $\rho(P,Q) \le \frac{1}{2}\chi^2(P,Q)$.
(c) Show that $H(P,Q) \le \chi^2(P,Q)$.
Exercise 15.17 (Alternative Formula for Wasserstein Distance). Let P, Q be distributions on the real line, with CDFs F, G. Show that the Wasserstein distance between P and Q satisfies $W(P,Q) = \int_0^1 |F^{-1}(\alpha) - G^{-1}(\alpha)|\, d\alpha$.
Exercise 15.18 *. Prove that almost sure convergence is not metrized by any metric on probability measures.
Exercise 15.19 *. Given probability measures P, Q on $\mathbb{R}$ with finite means, suppose $\mathcal{M}$ is the set of all joint distributions with marginals equal to P and Q. Show that the Wasserstein distance has the alternative representation $W(P,Q) = \inf\{E(|X - Y|) : (X,Y) \sim M,\ M \in \mathcal{M}\}$.
Exercise 15.20 * (Generalized Wasserstein Distance). Let $p \ge 1$. Given probability measures P, Q on a general Euclidean space $\mathbb{R}^k$ with finite pth moment, let $\mathcal{M}$ be the set of all joint distributions with marginals equal to P and Q. Consider $W_p(P,Q) = \inf\{[E(\|X - Y\|^p)]^{1/p} : (X,Y) \sim M,\ M \in \mathcal{M}\}$.
(a) Show that $W_p$ is a metric.
(b) Suppose $W_p(P_n, P) \to 0$. For which values of p does this imply that $P_n \xrightarrow{\mathcal{L}} P$?
References

DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Diaconis, P. and Saloff-Coste, L. (2006). Separation cut-offs for birth and death chains, Ann. Appl.
Prob., 16, 2098–2122.
Dudley, R. (2002). Real Analysis and Probability, Cambridge University Press, Cambridge, UK.
Gibbs, A. and Su, F. (2002). On choosing and bounding probability metrics, Internat. Statist. Rev.,
70, 419–435.
Ibragimov, I.A. (1956). On the composition of unimodal distributions, Theor. Prob. Appl., 1,
283–288.
526 15 Probability Metrics

Kass, R. and Vos, P. (1997). Geometrical Foundations of Asymptotic Inference, Wiley, New York.
LeCam, L. (1969). Théorie Asymptotique de la Décision Statistique, Les Presses de l'Université de Montréal, Montréal.
Liese, F. and Vajda, I. (1987). Convex Statistical Distances, Teubner, Leipzig.
Rachev, S.T. (1991). Probability Metrics and the Stability of Stochastic Models, Wiley, New York.
Rao, C.R. (1945). Information and accuracy available in the estimation of statistical parameters,
Bull. Calcutta Math. Soc., 37, 81–91.
Rao, C.R. (1987). Differential metrics in probability spaces, in Differential Geometry in Statistical
Inference, S.-I. Amari et al. Eds., IMS Lecture Notes and Monographs Series, Hayward, CA.
Reiss, R. (1989). Approximate Distributions of Order Statistics, Springer-Verlag, New York.
Zolotarev, V.M. (1983). Probability metrics, Theor. Prob. Appl., 28, 2, 264–287.
Chapter 16
Empirical Processes and VC Theory

Like martingales, empirical processes also unify an incredibly large variety of


problems in probability and statistics. Results in empirical processes theory are ap-
plicable to numerous classic and modern problems in probability and statistics; a
few examples of applications are the study of central limit theorems in more gen-
eral spaces than Euclidean spaces, the bootstrap, goodness of fit, density estimation,
and machine learning. Familiarity with the basic theory of empirical processes is
extremely useful across fields in probability and statistics.
Empirical process theory has seen a major revolution since the early 1970s with
the emergence of the powerful Vapnik–Chervonenkis theory. The classic theory
of empirical processes relied primarily on tools such as invariance principles and
the Wiener process. With the advent of the VC (Vapnik–Chervonenkis) theory,
combinatorics and geometry have increasingly become major tools in studies of
empirical processes. In some sense, the theory of empirical processes has two
quite different faces: the classic aspect and the new aspect. Both are useful. We
provide an introduction to both aspects of the empirical process theory in this
chapter. Among the enormous literature on empirical process theory, we recommend Shorack and Wellner (2009), Csörgő and Révész (1981), Dudley (1984), van der Vaart and Wellner (1996), and Kosorok (2008) for comprehensive treatments, and Pollard (1989), Giné (1996), Csörgő (2002), and Del Barrio et al. (2007) for excellent reviews and overviews. A concise treatment is also available in
DasGupta (2008). Other specific references are provided in the sections.

16.1 Basic Notation and Definitions

Given n real-valued random variables $X_1, \ldots, X_n$, the empirical CDF of $X_1, \ldots, X_n$ is defined as
\[
F_n(t) = \frac{\#\{X_i : X_i \le t\}}{n} = \frac{1}{n}\sum_{i=1}^n I_{\{X_i \le t\}}, \quad t \in \mathbb{R}.
\]
If $X_1, \ldots, X_n$ happen to be iid, with the common CDF F, then $nF_n(t) \sim \mathrm{Bin}(n, F(t))$ for any fixed t. The mean of $F_n(t)$ is $F(t)$, and the variance is $\frac{F(t)(1 - F(t))}{n}$.
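In code, $F_n$ is a one-line object; the following minimal Python sketch (an illustration, not from the text) evaluates the empirical CDF of a simulated sample.

```python
# A minimal sketch of the empirical CDF F_n for simulated N(0,1) data.
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.normal(size=100)

def F_n(t, data=x_data):
    # F_n(t) = #{X_i <= t} / n
    return np.mean(data <= t)

print(F_n(0.0))   # close to F(0) = 0.5 for N(0,1) data
```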

So, a natural centering for $F_n(t)$ is $F(t)$, and a natural scaling is $\sqrt{n}$. Indeed, by the CLT for iid random variables with a finite variance, for any fixed t, $\sqrt{n}[F_n(t) - F(t)] \xrightarrow{\mathcal{L}} N(0, F(t)(1 - F(t)))$. We can think of t as a running time parameter, and define a process $\beta_n(t) = \sqrt{n}[F_n(t) - F(t)]$, $t \in \mathbb{R}$. This is the one-dimensional (normalized) empirical process. We can talk about it in more generality than the iid case; but we only consider the iid case in this chapter.
Note that the empirical CDF is a finitely supported CDF; it has the data values as its jump points. Any CDF gives rise to a probability distribution (measure), and so does the empirical CDF $F_n$. The empirical measure $P_n$ is defined by $P_n(A) = \frac{\#\{X_i : X_i \in A\}}{n}$, for a general set $A \subseteq \mathbb{R}$ (we do not mention it later, but we need A to be a Borel set).
There is no reason why we have to restrict such a definition to the case of real-valued random variables. We can talk about an empirical measure in far more general spaces. In particular, given d-dimensional random vectors $X_1, \ldots, X_n$ from some distribution P in $\mathbb{R}^d$, $1 \le d < \infty$, we can define the empirical measure in exactly the same way, namely, $P_n(A) = \frac{\#\{X_i : X_i \in A\}}{n}$, $A \subseteq \mathbb{R}^d$.
Why is a study of the empirical process (measure) so useful? A simple expla-
nation is that all the information about the underlying P in our data is captured by
the empirical measure. So, if we know how to manipulate the empirical measure
properly, we should be able to analyze and understand the probabilistic aspects of a
problem, and also analyze and understand particular methods to infer any unknown
aspects of the underlying P .
A useful reduction for many problems in the theory of empirical processes is that we can often pretend as if the true F in the one-dimensional case is the CDF of the $U[0,1]$ distribution. For this reason, we need a name for the empirical process in the special $U[0,1]$ case. If $U_1, \ldots, U_n$ are iid $U[0,1]$ variables, then the normalized uniform empirical process is $\alpha_n(t) = \sqrt{n}[G_n(t) - t]$, $t \in [0,1]$, where $G_n(t) = \frac{\#\{U_i : U_i \le t\}}{n}$.
Just as the empirical CDF mimics the true underlying CDF well in large samples, the empirical percentiles mimic the true population percentiles well in large samples. Analogous to the empirical process, we can define a quantile process. Suppose $X_1, X_2, \ldots, X_n$ are iid with a common CDF F, which we assume to be continuous. The quantile function of F is defined in the usual way, namely,
\[
Q(y) = F^{-1}(y) = \inf\{t : F(t) \ge y\}, \quad 0 < y \le 1.
\]
The empirical quantile function is defined in terms of the order statistics of the sample values $X_1, X_2, \ldots, X_n$, namely,
\[
Q_n(y) = F_n^{-1}(y) = X_{k:n}, \quad \frac{k-1}{n} < y \le \frac{k}{n},
\]
$1 \le k \le n$, where $X_{1:n}, X_{2:n}, \ldots, X_{n:n}$ are the order statistics of $X_1, X_2, \ldots, X_n$. The normalized quantile process is simply $q_n(y) = \sqrt{n}[Q_n(y) - Q(y)]$, $0 < y \le 1$. We often do not mention the normalized term, and just call it

the quantile process. In the special case of U Œ0; 1 variables, we of course have
Q.y/ D y. We use the notation Un .y/ for the quantile function Qn .y/ in
p special U Œ0; 1 case, which gives us the uniform quantile process un .y/ D
the
nŒUn .y/  y; 0 < y 1.
Given a real-valued function f on some interval $[a,b] \subseteq \bar{\mathbb{R}}$, let $\|f\|_\infty = \sup_{a \le x \le b}|f(x)|$. We often call it the uniform norm or the $L_\infty$ norm of f. The paths of the empirical process are only right continuous; indeed, $F_n(t)$ jumps at the data values $X_1, \ldots, X_n$. However, at any point t, $F_n(t)$ has a left limit. We need a name for functions of these types. Let $[a,b]$ be an interval with $-\infty \le a < b \le \infty$. Then, $C[a,b]$ denotes the class of all continuous real-valued functions on $[a,b]$, and $\ell^\infty[a,b]$ denotes the class of all real-valued functions f with $\|f\|_\infty < \infty$. The class of all real-valued functions that are right continuous on $[a,b]$ and have left limits everywhere is denoted by $D[a,b]$; they are commonly known as cadlag functions. An important inclusion property is $C[a,b] \subseteq D[a,b] \subseteq \ell^\infty[a,b]$.
As an estimate of the true CDF F, the empirical CDF $F_n$ is uniformly accurate in large samples. Indeed, the Glivenko–Cantelli theorem says that (in the iid case) $\|F_n - F\|_\infty \xrightarrow{a.s.} 0$ as $n \to \infty$ (see Chapter 7). A common test statistic for goodness of fit in statistics is the Kolmogorov–Smirnov statistic $D_n = \sqrt{n}\|F_n - F\|_\infty$ (see Chapter 14). Empirical process theory is going to help us in pinning down finer properties of $\|F_n - F\|_\infty$ and the asymptotic distribution of $D_n$.
Note that $\|F_n - F\|_\infty$ is just the Kolmogorov distance (see Chapter 15) between $F_n$ and F. Hence, the Glivenko–Cantelli theorem may be rephrased as $d(F_n, F) \xrightarrow{a.s.} 0$. A simple point worthy of mention is that $F_n$ need not be close to the true CDF F according to other common notions of distance. For example, consider the empirical measure $P_n$ and the true underlying distribution (measure) P that corresponds to the CDF F. The total variation distance between $P_n$ and P is $\rho(P_n, P) = \sup_A |P_n(A) - P(A)|$, the supremum being taken over arbitrary (Borel) sets of $\mathbb{R}$. Then, clearly, $\rho(P_n, P)$, the total variation distance between $P_n$ and P, cannot converge to zero as $n \to \infty$ in general, because $P_n$ is always supported on a finite set, and P may even give zero probability to all finite sets (e.g., if P had a density). In such a case, $\rho(P_n, P)$ would actually be equal to 1 for any n, and therefore would not converge to zero. This argument has nothing to do with the variables being real-valued. The same argument works in $\mathbb{R}^d$ for any $d < \infty$. Thus, the empirical measure need not estimate the true P accurately even in large samples, if our notion of accuracy is too strong. Empirical process theory is about the nature of deviation of $P_n$ from P, especially in large samples, and to quantify the deviation in very precise terms, using highly powerful mathematical ideas and tools. Of course, the theory has numerous applications.
16.2 Classic Asymptotic Properties of the Empirical Process

In studying the classic results on empirical processes, it is helpful first to understand at a heuristic level that the one-dimensional empirical process and the standard Brownian bridge on $[0,1]$ have some connections. The connections have exactly the

same flavor as those between the partial sum process and Brownian motion, which
was extensively discussed in Chapter 12. The connection between the empirical pro-
cess and the Brownian bridge is the driving force behind many of the most valuable
results on the asymptotic behavior of the one-dimensional empirical process, and it
is useful to have a preview of it.
Preview. Consider first the uniform empirical process $\alpha_n(t)$, $t \in [0,1]$. As we remarked in the previous section, for any fixed t, $\alpha_n(t) \xrightarrow{\mathcal{L}} Z_t$, where $Z_t$ is distributed as $N(0, t(1-t))$. This is an immediate consequence of the one-dimensional central limit theorem in the iid case. We can quickly generalize this to the case of any finite number of times $0 \le t_1 < t_2 < \cdots < t_k \le 1$. First note that by a simple and direct calculation, for any fixed n, and a pair of times $t_i, t_j$,
\[
\mathrm{Cov}(\alpha_n(t_i), \alpha_n(t_j)) = P(U_1 \le t_i, U_1 \le t_j) - P(U_1 \le t_i)P(U_1 \le t_j) = \min(t_i, t_j) - t_i t_j.
\]
Therefore, by the multivariate central limit theorem for the iid case (see Chapter 7), we have the convergence result that
\[
(\alpha_n(t_1), \alpha_n(t_2), \ldots, \alpha_n(t_k)) \xrightarrow{\mathcal{L}} (Z_{t_1}, Z_{t_2}, \ldots, Z_{t_k}),
\]
where $(Z_{t_1}, Z_{t_2}, \ldots, Z_{t_k})$ has a k-dimensional normal distribution with means zero, and the covariance matrix with elements $\mathrm{Cov}(Z_{t_i}, Z_{t_j}) = \min(t_i, t_j) - t_i t_j$. Notice that the covariances $\min(t_i, t_j) - t_i t_j$ exactly coincide with the formula for the covariance between $B(t_i)$ and $B(t_j)$, where $B(t)$ is a Brownian bridge on $[0,1]$ (see Chapter 12). In other words, the finite-dimensional distributions of the process $\alpha_n(t)$ converge to the corresponding finite-dimensional distributions of $B(t)$, a Brownian bridge on $[0,1]$.
This would lead one to hope that perhaps $\alpha_n(t)$ converges to $B(t)$ as a process. If true, this would be stronger than convergence of just the finite-dimensional distributions, and would lead to better applications. It was a triumph of probability theory that in 1952, Monroe Donsker proved that indeed $\alpha_n(t)$ converges to $B(t)$ as a process (Donsker (1952)). There were certain technical problems of measurability in Donsker's proof. The problems were initially overcome by Anatoliy Skorohod and Andrei Kolmogorov. In order to avoid bringing in that part of the theory, we state Donsker's theorem in the form obtained in Dudley (1999).
The approximation of the one-dimensional uniform empirical process by a Brow-
nian bridge easily carries over to the general one-dimensional empirical process,
by essentially using the quantile transformation for real-valued random variables
with a continuous CDF. This is seen below. However, the approximation through
a Brownian bridge is actually much stronger than what convergence in distribution
would imply. Starting in the 1970s, a much stronger form of approximation of the
empirical process by a sequence of Brownian bridges was established. This is the
analogue of what we called the strong invariance principle for the partial sum pro-
cess in Chapter 12. These strong approximations for the one-dimensional empirical

process allow one to derive finer asymptotic properties of the empirical process, and
in some problems (but not all) lead to quick solutions of apparently involved weak
convergence problems.
One problematic feature of empirical process theory, even in one dimension, is
that the proofs of the most key results are almost always long and involved. We often
refer to a source for a proof because of this reason.

16.2.1 Invariance Principle and Statistical Applications


Theorem 16.1.
(a) (Glivenko–Cantelli Property) $\|F_n - F\|_\infty \xrightarrow{a.s.} 0$ as $n \to \infty$.
(b) On a suitable probability space, it is possible to construct an iid sequence $U_i$ of $U[0,1]$ random variables, and a sequence of Brownian bridges $B_n(t)$, such that $\|\alpha_n - B_n\|_\infty \xrightarrow{a.s.} 0$ as $n \to \infty$.
(c) (Donsker's Theorem) The uniform empirical process $\alpha_n(t)$ converges in distribution to a Brownian bridge $B(t)$, as random elements of the class of functions $D[0,1]$.
(d) For a general continuous CDF F on the real line, if $X_1, X_2, \ldots$ are iid with the common CDF F, then the empirical process $\beta_n(t)$ converges in distribution to a Gaussian process $B_F(t)$ with $E[B_F(t)] = 0$ and $\mathrm{Cov}(B_F(s), B_F(t)) = F(\min(s,t)) - F(s)F(t)$, as random elements of the class of functions $D(-\infty, \infty)$.
(e) (Weak Invariance Principle) Given any real-valued functional h defined on $D(-\infty, \infty)$ that is continuous with respect to the uniform norm, $h(\beta_n(t)) \xrightarrow{\mathcal{L}} h(B_F(t))$.
Due to the length and technical nature of the proof of this theorem, we refer to Billingsley (1968) and Dudley (1999) for its proof. Part (a) of this theorem was previously proved in Theorem 7.7. Part (b) is a special result in strong approximations of the empirical process. We treat the topic of strong approximations in more detail later. It is part (e) of the theorem that is of the greatest use in applications, as we demonstrate shortly in some examples. Part (e) is commonly known as the invariance principle. The spectacular aspect of the result in part (e) is that regardless of what the functional h is, as long as it is continuous with respect to the uniform norm, the distribution of $h(\beta_n(t))$ will be close to the distribution of $h(B_F(t))$ for large n. Therefore, if we know how to find the distribution of the functional $h(B_F(t))$ for our limiting Gaussian process $B_F(t)$, then we have bypassed the usually insurmountable problem of finding the finite sample distribution of $h(\beta_n(t))$, and we have done so all at one time for any continuous functional h. In applications, we can therefore pick and choose a suitable h, and simply apply the invariance principle. Here are some examples of the above theorem.
Example 16.1 (The Kolmogorov–Smirnov Statistic). The Kolmogorov–Smirnov statistic is commonly used for testing the hypothesis $F = F_0$, where F is an unknown true CDF on the real line from which we have n iid observations

X1 ; X2 ; : : : ; Xn , and
p F0 is a specified p
continuous CDF (see Example 14.22). It
is defined as Dn D njjFn  F0 jj1 D n sup1<t <1 jFn .t/  F0 .t/j. The exact
distribution of Dn is difficult to find except for quite small n; some calculations
were done in Kolmogorov (1933). Usually, in applications of the Kolmogorov–
Smirnov test, the exact distribution is replaced by its asymptotic distribution, and
this can be obtained elegantly by using the invariance principle.
First note that under any continuous CDF $F_0$ on the real line, the quantile transform shows that for any n, the distribution of $D_n$ is the same for all $F_0$. Thus, one may assume $F_0$ to be the $U[0,1]$ distribution. Consider now the functional $h(f) = \|f\|_\infty$ on $D[0,1]$. It is obviously continuous with respect to the uniform norm, and therefore, by the invariance principle,
\[
D_n = \sqrt{n}\sup_{0 < t < 1}|F_n(t) - t| = \sup_{0 < t < 1}|\sqrt{n}(F_n(t) - t)| = \sup_{0 < t < 1}|\alpha_n(t)| \xrightarrow{\mathcal{L}} \sup_{0 < t < 1}|B(t)|,
\]
where $B(t)$ is a Brownian bridge on $[0,1]$. From Chapter 12 (see Theorem 12.4), $\sup_{0 < t < 1}|B(t)|$ has the CDF $H(x) = 1 - 2\sum_{k=1}^\infty (-1)^{k-1} e^{-2k^2x^2}$, $x \ge 0$. Therefore, we have the result that for any $x > 0$, $P(D_n > x) \to 2\sum_{k=1}^\infty (-1)^{k-1} e^{-2k^2x^2}$ as $n \to \infty$. This is of tremendous practical utility in statistics.
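The limiting series converges very fast, and the approximation is already excellent for moderate n. The sketch below (ours, not from the text) compares the series with a direct simulation of $D_n$ for uniform data, using the standard fact that for the uniform CDF the supremum is attained at the order statistics.

```python
# Sketch: limiting K-S tail 2 sum (-1)^{k-1} e^{-2 k^2 x^2} vs simulated P(D_n > x).
import numpy as np

def ks_limit_tail(x, terms=100):
    k = np.arange(1, terms + 1)
    return 2 * np.sum((-1.0) ** (k - 1) * np.exp(-2 * k ** 2 * x ** 2))

def simulate_Dn(n, reps=20000, seed=3):
    rng = np.random.default_rng(seed)
    u = np.sort(rng.random((reps, n)), axis=1)
    i = np.arange(1, n + 1)
    # for F0 = U[0,1], sup_t |F_n(t) - t| is attained at the order statistics
    d = np.maximum(i / n - u, u - (i - 1) / n).max(axis=1)
    return np.sqrt(n) * d

x = 1.0
print(ks_limit_tail(x), np.mean(simulate_Dn(200) > x))   # both close to 0.27
```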
Example 16.2 (Cramér–von Mises Statistic). It is conceivable that the empirical CDF $F_n$ is a moderate distance away from the postulated CDF $F_0$ over large parts of the real line, although it is never too far away. In such a case, the Kolmogorov–Smirnov statistic may fail to detect the falsity of the postulated null hypothesis, but a statistic that measures an average deviation of $F_n$ from $F_0$ may succeed. The Cramér–von Mises statistic $C_n^2 = n\int_{-\infty}^\infty (F_n(t) - F_0(t))^2\, dF_0(t)$ is such a statistic, and is frequently used as an alternative to the Kolmogorov–Smirnov statistic, or as a complementary statistic.
Once again, the exact distribution of $C_n^2$ is difficult to find, but an application of the invariance principle leads to the asymptotic distribution, which is used as an approximation to the exact distribution. As long as $F_0$ is continuous, the distribution of $C_n^2$ is independent of $F_0$ for any n, and so once again, as in our previous example, we may take $F_0$ to be the $U[0,1]$ CDF. Then, we have $C_n^2 = \int_0^1 [\sqrt{n}(F_n(t) - t)]^2\, dt = \int_0^1 \alpha_n^2(t)\, dt$. The functional $h(f) = \int_0^1 f^2(t)\, dt$ is continuous on $D[0,1]$ with respect to the uniform norm. To see this, use the sequence of inequalities
\[
\left|\int_0^1 f^2(t)\, dt - \int_0^1 g^2(t)\, dt\right| \le \int_0^1 |f^2(t) - g^2(t)|\, dt = \int_0^1 |f(t) - g(t)||f(t) + g(t)|\, dt
\]
\[
\le \|f - g\|_\infty \int_0^1 (|f(t)| + |g(t)|)\, dt \le \|f - g\|_\infty (\|f\|_\infty + \|g\|_\infty).
\]
Therefore, by the invariance principle, $C_n^2 \xrightarrow{\mathcal{L}} \int_0^1 B^2(t)\, dt$. It remains to characterize the distribution of $\int_0^1 B^2(t)\, dt$.
For this, we use the Karhunen–Loève expansion of $B(t)$ (see Theorem 12.2) given by $B(t) = \sqrt{2}\sum_{m=1}^\infty \frac{\sin(m\pi t)}{m\pi} Z_m$, where $Z_1, Z_2, \ldots$ is an iid $N(0,1)$ sequence. By the orthogonality of the sequence of functions $\sin(m\pi t)$, $m \ge 1$, we get
\[
\int_0^1 B^2(t)\, dt = \int_0^1 \left[\sqrt{2}\sum_{m=1}^\infty \frac{\sin(m\pi t)}{m\pi} Z_m\right]^2 dt = 2\sum_{m=1}^\infty \frac{Z_m^2}{m^2\pi^2}\int_0^1 \sin^2(m\pi t)\, dt = \frac{2}{\pi^2}\sum_{m=1}^\infty \frac{Z_m^2}{2m^2} = \frac{1}{\pi^2}\sum_{m=1}^\infty \frac{Z_m^2}{m^2} = Y\ \text{(say)}.
\]
That is, $\int_0^1 B^2(t)\, dt$ is distributed as an infinite linear combination of iid chi-square random variables of one degree of freedom. At this point, the strategy is to find the cf (characteristic function) of Y, and to invert it to find a density for Y (see Theorem 8.1). The cf of a single chi-square random variable with one degree of freedom is $(1 - 2it)^{-1/2}$. It follows that the cf of Y is
\[
\psi_Y(t) = \prod_{m=1}^\infty \left(1 - 2i\frac{t}{\pi^2m^2}\right)^{-1/2} = \left[\prod_{m=1}^\infty \left(1 - 2i\frac{t}{\pi^2m^2}\right)\right]^{-1/2}.
\]
We now use the identity that for a complex number z, $\prod_{m=1}^\infty\left(1 - \frac{z}{m^2}\right) = \frac{\sin(\pi\sqrt{z})}{\pi\sqrt{z}}$. Using $z = \frac{2it}{\pi^2}$, we get, with a little algebra, that
\[
\psi_Y(t) = \left[\frac{\sin(\sqrt{2it})}{\sqrt{2it}}\right]^{-1/2} = \sqrt{\frac{\sqrt{2it}}{\sin(\sqrt{2it})}}.
\]
It is possible to invert this to write the CDF of Y as a convergent infinite series (see part (a) of Theorem 8.1). The CDF has the formula
\[
F_Y(y) = 1 - \frac{1}{\pi}\sum_{j=0}^\infty (-1)^j \int_{(2j+1)^2\pi^2}^{(2j+2)^2\pi^2} \frac{1}{z}\sqrt{\frac{\sqrt{z}}{-\sin(\sqrt{z})}}\; e^{-\frac{yz}{2}}\, dz.
\]
This is certainly complicated, even if partially closed-form. In practice, one would


probably calculate a tail probability for this statistic by carefully simulating the
null distribution of Cn2 (which is independent of F0 ). Nevertheless, the example
demonstrates the power of the invariance principle and characteristic functions to
solve important problems.
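For instance, such a simulation can proceed as in the following sketch (ours, not from the text), which uses the standard computing formula $C_n^2 = \frac{1}{12n} + \sum_{i=1}^n \left(U_{(i)} - \frac{2i-1}{2n}\right)^2$ for uniform data and compares its tail with that of a truncated version of the limiting series Y.

```python
# Sketch: simulated C_n^2 under the null versus the truncated limit law
# Y = (1/pi^2) sum_m Z_m^2 / m^2.
import numpy as np

rng = np.random.default_rng(5)
n, reps, terms = 200, 5000, 1000

# finite-sample statistic via the computing formula
u = np.sort(rng.random((reps, n)), axis=1)
i = np.arange(1, n + 1)
Cn2 = 1 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2, axis=1)

# truncated limit law
m = np.arange(1, terms + 1)
Y = (rng.standard_normal((reps, terms)) ** 2 / m ** 2).sum(axis=1) / np.pi ** 2

for x in [0.1, 0.3, 0.6]:
    print(x, np.mean(Cn2 > x), np.mean(Y > x))   # the two tails agree closely
```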
16.2.2 * Weighted Empirical Process
It is sometimes useful to consider normalized or otherwise weighted versions of the empirical process in order to accentuate its behavior at the tails, that is, for t near zero or one. Thus, given a nonnegative weight function w on $(0,1)$, we may want to consider the weighted empirical process
\[
\beta_{n,w}(t) = \frac{\sqrt{n}(F_n(t) - F(t))}{w(F(t))}.
\]
Possible uses of such a weighted empirical process would include statistical tests of a null hypothesis $H_0 : F = F_0$ by using test statistics such as
\[
D_{n,w} = \sup_{-\infty < t < \infty}\left|\frac{\sqrt{n}(F_n(t) - F_0(t))}{w(F_0(t))}\right|, \quad \text{or} \quad C_{n,w}^2 = n\int_{-\infty}^\infty \frac{[F_n(t) - F_0(t)]^2}{w^2(F_0(t))}\, dF_0(t).
\]
The limiting behavior of such statistics is not necessarily what one might expect intuitively. For instance, $D_{n,w}$ does not necessarily converge in law to the supremum of $\frac{|B(t)|}{w(t)}$. The tails of the function w must be such that $\frac{B(t)}{w(t)}$ does not blow up near $t = 0$ or 1. The specification of the weighting function w such that no such disasters occur is a very subtle and nontrivial problem. The following result (Chibisov (1964), O'Reilly (1974)) completely describes the properties required of w so that the weighted empirical process behaves well at the tails.
Theorem 16.2 (Chibisov–O'Reilly). (a) Suppose the function w is nondecreasing in a neighborhood of zero and nonincreasing in a neighborhood of 1. The statistic $D_{n,w}$ has a nontrivial limiting distribution if and only if
\[
\int_0^1 \frac{1}{t(1-t)}\, e^{-\epsilon\frac{w^2(t)}{t(1-t)}}\, dt < \infty \quad \text{for some}\ \epsilon > 0,
\]
in which case $D_{n,w} \xrightarrow{\mathcal{L}} \sup_{0 < t < 1}\frac{|B(t)|}{w(t)}$.
(b) The statistic $C_{n,w}^2$ has a nontrivial limiting distribution if and only if $\int_0^1 \frac{t(1-t)}{w^2(t)}\, dt < \infty$, in which case $C_{n,w}^2 \xrightarrow{\mathcal{L}} \int_0^1 \left(\frac{B(t)}{w(t)}\right)^2 dt$.
Example 16.3. Consider the Anderson–Darling statistic
\[
A_n^2 = n\int_{-\infty}^\infty \frac{(F_n(t) - F_0(t))^2}{F_0(t)(1 - F_0(t))}\, dF_0(t),
\]
and the weighted Kolmogorov–Smirnov statistic
\[
D_n^* = \sqrt{n}\sup_{-\infty < t < \infty}\frac{|F_n(t) - F_0(t)|}{\sqrt{F_0(t)(1 - F_0(t))}}.
\]
Note that $A_n^2$ and $D_n^*$ have some formal similarity; $D_n^*$ is the supremum ($L_\infty$ norm) and $A_n$ the $L_2$ norm of the weighted empirical process
\[
\frac{\sqrt{n}(F_n(t) - F_0(t))}{\sqrt{F_0(t)(1 - F_0(t))}}.
\]
However, the theorem of Chibisov and O'Reilly implies that $D_n^*$ does not have a nontrivial limiting distribution, whereas $A_n^2$ does, and converges in distribution to $\int_0^1 \frac{B^2(t)}{t(1-t)}\, dt$. It is a chapter exercise to find the distribution of $\int_0^1 \frac{B^2(t)}{t(1-t)}\, dt$. The problem with $D_n^*$ is that $\frac{1}{\sqrt{t(1-t)}}$ diverges to infinity at 0 and 1 too rapidly to balance the behavior of $B(t)$ near zero and one. If we weight the empirical process by a more modest weight function, the problems disappear and the weighted Kolmogorov–Smirnov statistic has a nontrivial limiting distribution.
As we just saw, the supremum of the standardized empirical process, namely $D_n^*$, does not have a nontrivial limiting distribution by itself. However, it can be centered and normed suitably to make it have a nontrivial limiting distribution. Very roughly speaking, $D_n^*$ does not have a nontrivial limiting distribution because it blows up when $n \to \infty$, and it grows at the rate of $\sqrt{2\log\log n}$. So, by centering it (more or less) at $\sqrt{2\log\log n}$, and then norming it, we can obtain a limiting distribution. This can still be used to calculate large sample approximations to tail probabilities for $D_n^*$ (i.e., P-values). The following results are due to Jaeschke (1979) and Eicker (1979), Csáki (1980), and Einmahl and Mason (1985).
Theorem 16.3.
(a) $\frac{D_n^*}{\sqrt{2\log\log n}} \xrightarrow{P} 1$.
(b) $\forall x$, $P\left(\sqrt{2\log\log n}\left[D_n^* - \frac{2\log\log n + \frac{1}{2}\log\log\log n + \frac{1}{2}\log\pi}{\sqrt{2\log\log n}}\right] \le x\right) \to e^{-2e^{-x}}$.
(c) For any $\epsilon > 0$, with probability 1, for all large n, $D_n^* \le (\log n)^{\frac{1}{2} + \epsilon}$.
Notice that part (a) of this theorem follows from part (b). It is to be noted that part (a) cannot be strengthened to almost sure convergence.
The preceding discussion and the results show that the weight function $\frac{1}{\sqrt{t(1-t)}}$ grows too rapidly near $t = 0$ and 1 for the supremum of the weighted empirical process to settle down. A particular special case of part (a) of Theorem 16.2 is sometimes useful in applications, and is given below.
Theorem 16.4. For each $0 < \delta < \frac{1}{2}$,
(a) Almost surely, $\displaystyle\sup_{0 < t < 1}\frac{|B(t)|}{[t(1-t)]^\delta} < \infty$.
(b) $\displaystyle\sup_{-\infty < t < \infty}\frac{|\sqrt{n}(F_n(t) - F_0(t))|}{[F_0(t)(1 - F_0(t))]^\delta} \xrightarrow{\mathcal{L}} \sup_{0 < t < 1}\frac{|B(t)|}{[t(1-t)]^\delta}$.
16.2.3 The Quantile Process

Just as the empirical CDF $F_n$ approximates a true CDF F on the real line, the empirical quantile function $Q_n(y) = F_n^{-1}(y)$ approximates the true quantile function $Q(y) = F^{-1}(y)$. However, we can intuitively see that it is difficult to estimate quantiles of F beyond the range of the data, namely, $X_1, X_2, \ldots, X_n$. So, we have to be careful about the nature of uniform approximability of $Q(y)$ by $Q_n(y)$. In addition, to get convergence results to Brownian bridges akin to the case of the empirical process, we have to properly normalize the quantile process, so that the covariance structure matches that of a Brownian bridge. The theorem below collects the three most important asymptotic properties of the normalized quantile process; a proof for them can be seen in Csörgő (1983). Quantile processes are useful in a number of statistical problems, such as change point problems, goodness of fit, reliability, and survival analysis.
Theorem 16.5. Suppose F is an absolutely continuous CDF on the real line, with a density f. Let $q_n(y) = \sqrt{n}[Q_n(y) - Q(y)]$, $0 < y < 1$, $n \ge 1$, be the quantile process, and $\rho_n(y) = \sqrt{n}f(Q(y))[Q_n(y) - Q(y)]$, $0 < y < 1$, $n \ge 1$, be the standardized quantile process.
(a) (Restricted Glivenko–Cantelli Property) Suppose f has a bounded support, that f is differentiable, is bounded away from zero on its support, and that $|f'|$ is bounded on the support of f. Then
\[
\sup_{0 < y < 1}|Q_n(y) - Q(y)| \xrightarrow{a.s.} 0.
\]
(b) If F has an unbounded support, then with probability one,
\[
\lim_{n\to\infty}\sup_{0 < y < 1}|Q_n(y) - Q(y)| = \infty.
\]
(c) Suppose F has a general support (not necessarily bounded) with a density f. Assume that f is strictly positive on its support, that f is differentiable, and that $F(x)(1 - F(x))\frac{|f'(x)|}{f^2(x)}$ is uniformly bounded on the support of f. Then, on a suitable probability space, one can construct an iid sequence $X_1, X_2, \ldots$ with the CDF F, and a sequence of Brownian bridges $B_n(y)$, $0 \le y \le 1$, such that
\[
\sup_{\frac{1}{n+1} \le y \le \frac{n}{n+1}}|\rho_n(y) - B_n(y)| \xrightarrow{P} 0.
\]
(d) (Weak Invariance Principle) Let h be a real-valued functional on $D[0,1]$, continuous with respect to the uniform norm. Then, under the conditions of part (c), for any c, $0 < c < 1$, $h(\rho_n(y))|_{c \le y \le 1-c} \xrightarrow{\mathcal{L}} h(B(y))|_{c \le y \le 1-c}$, where $B(y)$ is a Brownian bridge on $[0,1]$. In particular, $\sup_{c \le y \le 1-c}|\rho_n(y)| \xrightarrow{\mathcal{L}} \sup_{c \le y \le 1-c}|B(y)|$.
16.2.4 Strong Approximations of the Empirical Process

We proved in Chapter 12 that the partial sum process for an iid sequence with a
finite variance converges in distribution to a Brownian motion. This is the invariance
principle for the partial sum process (see Section 12.6). In Section 12.7, we showed
that in fact for each fixed n, we can construct a Wiener process such that the partial
sum process is uniformly close to the Wiener process with probability one. This was
called the strong invariance principle for the partial sum process. A parallel strong
approximation theory exists for the empirical process of an iid sequence. The strong
approximation provides a handy tool in many situations for solving a particular
problem. In addition, the error statements in the strong approximation results give a
very precise idea about the accuracy of the Brownian bridge approximation of the
empirical process. The results pin down what can and cannot be done. We remark
in passing that part (a) in Theorem 16.1 and part (c) in Theorem 16.5 are in fact
instances of strong approximations.

Theorem 16.6. On a suitable probability space, one can construct an iid sequence $X_1, X_2, \ldots$ with the common CDF F, and a sequence of Brownian bridges $B_n(t)$, such that the empirical process $\beta_n(t) = \sqrt{n}(F_n(t) - F(t))$, $-\infty < t < \infty$, satisfies
(a) Almost surely, $\displaystyle\sup_{-\infty < t < \infty}|\beta_n(t) - B_n(F(t))| = O\left(\frac{\log n}{\sqrt{n}}\right)$.
(b) $\forall n \ge 1$, and $\forall x \in \mathbb{R}$, $\displaystyle P\left(\sup_{-\infty < t < \infty}|\beta_n(t) - B_n(F(t))| > \frac{12\log n + x}{\sqrt{n}}\right) \le 2e^{-\frac{x}{6}}$.
(c) $\forall c > 0$, $\displaystyle\sup_{F^{-1}(\frac{c}{n}) \le t \le F^{-1}(1 - \frac{c}{n})}\frac{|\beta_n(t) - B_n(F(t))|}{\sqrt{F(t)(1 - F(t))}} = O_P(1)$.
(d) $\forall c > 0$, and $\forall\ 0 < \delta < \frac{1}{2}$, $\displaystyle\sup_{F^{-1}(\frac{c}{n}) \le t \le F^{-1}(1 - \frac{c}{n})}\frac{|\beta_n(t) - B_n(F(t))|}{[F(t)(1 - F(t))]^\delta} = O_P\left(n^{\delta - \frac{1}{2}}\right)$.

Part (a) of this theorem actually follows from part (b) by clever choices of x, and then an application of the Borel–Cantelli lemma. The choice of x is indicated in a chapter exercise. It is important to note that the inequality is valid for all n and all real x, and so in a specific application, n and x can be chosen to suit one's need. In particular, x can depend on n. Part (b) is called the Komlós–Major–Tusnády (KMT) theorem. See Komlós et al. (1975a, b), and Mason and van Zwet (1987) for a more detailed proof. Part (c) is an $O_P(1)$ result, rather than an $o_P(1)$ one, and reinforces our discussion in the previous section that if we weight the empirical process by $\sqrt{F(1-F)}$, then the invariance principle will fail. But if we weight it by a smaller power of $F(1-F)$, then not only can we recover the invariance principle, but even a strong approximation holds, as in part (d). The KMT rate $\frac{\log n}{\sqrt{n}}$ in part (a) cannot be improved.
16.3 Vapnik–Chervonenkis Theory

As we have seen, an important consequence of Donsker's invariance principle is the derivation of the limiting distribution of $D_n = \sqrt{n}\sup_{-\infty < t < \infty}|F_n(t) - F(t)|$, for a continuous one-dimensional CDF F. If we let $\mathcal{F} = \{I_{(-\infty,t]} : t \in \mathbb{R}\}$, then the Kolmogorov–Smirnov result says that $\sqrt{n}\sup_{f\in\mathcal{F}}|E_{P_n}f - E_Pf| \xrightarrow{\mathcal{L}} \sup_{0 \le t \le 1}|B(t)|$, $P_n, P$ being the probability measures corresponding to $F_n, F$, and $B(t)$ being a Brownian bridge. Extensions of this involve study of the asymptotic behavior of $\sup_{f\in\mathcal{F}}|E_{P_n}f - E_Pf|$ for much more general classes of functions $\mathcal{F}$ and the range space of the random variables $X_i$; the range space need not be $\mathbb{R}$, or $\mathbb{R}^d$ for some finite d. It can be a much more general set $\mathcal{S}$. Examples of asymptotic behavior include derivation of laws of large numbers and central limit theorems. Because of the involvement of the supremum over $f \in \mathcal{F}$, this is much more difficult than consideration of $E_{P_n}f - E_Pf$ for a single or a finite number of functions f.
There are numerous applications of these extensions to classic statistical problems. The more modern applications are in areas of statistical classification of objects involving many variables, and other problems in machine learning. To give a simple example, suppose $X_1, X_2, \ldots, X_n$ are d-dimensional iid random vectors from some P and we want to test the null hypothesis that $P = P_0$ (specified). Then, a natural statistic to assess the truth of the hypothesis is $T_n = \sup_{C\in\mathcal{C}}|P_n(C) - P_0(C)|$ for a suitable class of sets $\mathcal{C}$. Now, if $\mathcal{C}$ is too rich, for example, if it is the class of all measurable sets, then clearly there cannot be any meaningful asymptotics if $P_0$ is absolutely continuous. On the other hand, if $\mathcal{C}$ is too small, then the statistic cannot be sensitive enough for detecting departures from the null hypothesis. So these extensions study the question of what kinds of families $\mathcal{C}$ or function classes $\mathcal{F}$ allow meaningful asymptotics and also result in good and common sense tests. Vapnik–Chervonenkis theory pins down exactly how rich such a class $\mathcal{C}$ can be, and even more, supplies the user with readymade classes that will work. It is important to understand that applications of the VC theory are far more wide ranging than testing problems in statistics.
The technical tools required for such generalizations are extremely sophisti-
cated, and have led to striking new discoveries and mathematical advances in the
theory of empirical processes. Along with these advances, have come numerous
new and useful statistical and probabilistic applications. The literature is huge; we
strongly recommend Wellner (1992), Giné (1996), Pollard (1989), and Giné and
Zinn (1984) for comprehensive reviews, and sources for major theorems and addi-
tional references; specific references to some results are given later. We present a
basic description of certain key results and tools in VC theory below.

16.3.1 Basic Theory

We first discuss the plausibility of strong laws more general than the well-known Glivenko–Cantelli theorem, which asserts that in the one-dimensional iid case $\sup_t|F_n(t) - F(t)| \xrightarrow{a.s.} 0$. We need a concept of combinatorial richness of a class of sets $\mathcal{C}$ that will allow us to make statements such as $\sup_{C\in\mathcal{C}}|P_n(C) - P(C)| \xrightarrow{a.s.} 0$. A class of sets for which this property holds is called a Glivenko–Cantelli class. A useful such concept of combinatorial richness is that of the Vapnik–Chervonenkis dimension of a class of sets. Meaningful asymptotics will exist for classes of sets that have a finite Vapnik–Chervonenkis dimension. It is therefore critical to know what it means and what are good examples of classes of sets with a finite Vapnik–Chervonenkis dimension. A basic treatment of this is given next.
We often use the notation below. Given a specific set of n elements $x_1, x_2, \ldots, x_n$ of a general set $\mathcal{S}$, which need not be distinct, and a specific class $\mathcal{C}$ of subsets of $\mathcal{S}$, we let
\[
\Delta_{\mathcal{C}}(x_1, x_2, \ldots, x_n) = \mathrm{Card}(\{x_1, x_2, \ldots, x_n\} \cap C : C \in \mathcal{C}),
\]
where Card denotes cardinality. In words, $\Delta_{\mathcal{C}}(x_1, x_2, \ldots, x_n)$ counts how many distinct subsets of $\{x_1, x_2, \ldots, x_n\}$ can be generated by intersecting $\{x_1, x_2, \ldots, x_n\}$ with the members of $\mathcal{C}$.
Definition 16.1. Let $A \subseteq S$ be a fixed set, and $\mathcal{C}$ a class of subsets of $S$. $A$ is said to be shattered by $\mathcal{C}$ if every subset $U$ of $A$ is the intersection of $A$ with some member $C$ of $\mathcal{C}$, that is, $\{A \cap C : C \in \mathcal{C}\} = \mathcal{P}(A)$, where $\mathcal{P}(A)$ denotes the power set of $A$.
Sometimes the phenomenon is colloquially described as every subset of $A$ being picked up by some member of $\mathcal{C}$.

Definition 16.2. The Vapnik–Chervonenkis (VC) dimension of $\mathcal{C}$ is the size of the largest set $A$ that can be shattered by $\mathcal{C}$.
Although this is already fine as a definition, a more formal definition is given by using the concept of shattering coefficients.

Definition 16.3. For $n \geq 1$, the $n$th shattering coefficient of $\mathcal{C}$ is defined to be
\[ S(n, \mathcal{C}) = \max_{x_1, x_2, \ldots, x_n \in S} \Delta_{\mathcal{C}}(x_1, x_2, \ldots, x_n). \]
That is, $S(n, \mathcal{C})$ is the largest possible number of subsets of some (wisely chosen) set of $n$ points that can be formed by intersecting the set with members of $\mathcal{C}$. Clearly, for any $n$, $S(n, \mathcal{C}) \leq 2^n$.

Here is an algebraic definition of the VC dimension of a class of sets.

Definition 16.4. The VC dimension of $\mathcal{C}$ equals $VC(\mathcal{C}) = \min\{n : S(n, \mathcal{C}) < 2^n\} - 1 = \max\{n : S(n, \mathcal{C}) = 2^n\}$.

Definition 16.5. $\mathcal{C}$ is called a Vapnik–Chervonenkis (VC) class if $VC(\mathcal{C}) < \infty$.

The following remarkable result is known as Sauer's lemma (Sauer (1972)).

Proposition. Either $S(n, \mathcal{C}) = 2^n$ for all $n$, or for all $n$, $S(n, \mathcal{C}) \leq \sum_{i=0}^{VC(\mathcal{C})} \binom{n}{i}$.

Remark. Sauer’s lemma says that either a class of sets has infinite VC dimension,
or its shattering coefficients grow polynomially. A few other important and useful
properties of the shattering coefficients are listed below; most of them are derived
easily. These properties are useful for generating new classes of VC sets from known
ones by using various Boolean operations.

Theorem 16.7. The shattering coefficients $S(n, \mathcal{C})$ of a class of sets $\mathcal{C}$ satisfy

(a) $S(m, \mathcal{C}) < 2^m$ for some $m \Rightarrow S(n, \mathcal{C}) < 2^n$ for all $n > m$.
(b) $S(n, \mathcal{C}) \leq (n+1)^{VC(\mathcal{C})}$ for all $n \geq 1$.
(c) $S(n, \mathcal{C}^c) = S(n, \mathcal{C})$, where $\mathcal{C}^c$ is the class of complements of members of $\mathcal{C}$.
(d) $S(n, \mathcal{B} \cap \mathcal{C}) \leq S(n, \mathcal{B})S(n, \mathcal{C})$, where the $\cap$ notation means the class of sets formed by intersecting members of $\mathcal{B}$ with those of $\mathcal{C}$.
(e) $S(n, \mathcal{B} \otimes \mathcal{C}) \leq S(n, \mathcal{B})S(n, \mathcal{C})$, where the $\otimes$ notation means the class of sets formed by taking Cartesian products of members of $\mathcal{B}$ and those of $\mathcal{C}$.
(f) $S(m + n, \mathcal{C}) \leq S(m, \mathcal{C})S(n, \mathcal{C})$.
See Vapnik and Chervonenkis (1971) and Sauer (1972) for many of the parts in
this theorem. Now we give examples for practical use.

16.3.2 Concrete Examples

Example 16.4. Let $\mathcal{C}$ be the class of all left unbounded closed intervals on the real line; that is, $\mathcal{C} = \{(-\infty, x] : x \in \mathbb{R}\}$. To illustrate the general formula, suppose $n = 2$; what is $S(n, \mathcal{C})$? Clearly, if we pick up the larger one among $x_1, x_2$, we will pick up the smaller one too. Or, we may pick up none of them, or just the smaller one. So we can pick up three distinct subsets from the power set of $\{x_1, x_2\}$. The same argument shows that the general formula for the shattering coefficients is $S(n, \mathcal{C}) = n + 1$. Consequently, this is a VC class with VC dimension one.
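For small classes, the shattering coefficient can be verified by brute force. The sketch below (ours, for illustration; the candidate cutoffs are chosen by hand) counts the distinct subsets of $\{x_1, \ldots, x_n\}$ picked up by the class $\{(-\infty, x] : x \in \mathbb{R}\}$, confirming $S(n, \mathcal{C}) = n + 1$.

    import numpy as np

    def shatter_count(points, cutoffs):
        # Delta_C(x_1, ..., x_n) for C = {(-inf, c] : c in cutoffs}:
        # each cutoff c picks up the subset {x_i : x_i <= c}
        picked = {tuple(p <= c for p in points) for c in cutoffs}
        return len(picked)

    n = 5
    points = np.linspace(0.0, 1.0, n)
    # cutoffs below, at, and above every point realize all
    # achievable subsets for this class
    cutoffs = np.concatenate(([-1.0], points, [2.0]))
    print(shatter_count(points, cutoffs))  # prints n + 1 = 6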

Example 16.5. Although topologically there are just as many left unbounded intervals on the real line as there are arbitrary intervals, in the VC index they act differently. This is interesting. Thus, let $\mathcal{C} = \{(a, b) : a \leq b, \, a, b \in \mathbb{R}\}$. Then it is easy to establish the formula $S(n, \mathcal{C}) = 1 + \binom{n+1}{2}$. For $n = 2$, this is equal to four, which is also $2^2$. For $n = 3$, $1 + \binom{4}{2} = 7 < 2^n$. Consequently, this is a VC class with VC dimension two.

Example 16.6. The previous example says that on $\mathbb{R}$, the class of all convex sets is a VC class. However, this is far from being true even in two dimensions. Indeed, if we let $\mathcal{C}$ be just the class of convex polygons in the plane, it is clear geometrically that for any $n$, $\mathcal{C}$ can shatter $n$ suitably placed points (for example, $n$ points on a circle). So, convex polygons in $\mathbb{R}^2$ have an infinite VC dimension.
More examples of exact values of VC dimensions are given in the chapter ex-
ercises. For actual applications of these ideas to concrete extensions of Donsker’s
principles, it is extremely useful to know what other natural classes of sets in various

spaces are VC classes. The various parts of the following result are available in Vapnik and Chervonenkis (1971) and Dudley (1978, 1979).

Theorem 16.8. Each of the following classes of sets is a VC class.
(a) The class of all southwest quadrants of $\mathbb{R}^d$; that is, the class of all sets of the form $\prod_{i=1}^d (-\infty, x_i]$.
(b) The class of all closed half-spaces of $\mathbb{R}^d$.
(c) The class of all closed balls of $\mathbb{R}^d$.
(d) The class of all closed rectangles of $\mathbb{R}^d$.
(e) $\mathcal{C} = \{\{x \in \mathbb{R}^d : g(x) \geq 0\} : g \in \mathcal{G}\}$, where $\mathcal{G}$ is a finite-dimensional vector space of real-valued functions defined on $\mathbb{R}^d$.
(f) Projections of a class of the form in part (e) onto a smaller number of coordinates.
There are practically useful ways to generate new VC classes from known ones. By using the various parts of Theorem 16.7, one can obtain the following useful results.

Theorem 16.9. (a) $\mathcal{C}$ is VC implies $\mathcal{C}^c = \{C^c : C \in \mathcal{C}\}$ is VC.
(b) $\mathcal{C}$ is VC implies $\phi(\mathcal{C})$ is VC for any one-to-one function $\phi$, where $\phi(\mathcal{C})$ is the class of images of sets in $\mathcal{C}$ under $\phi$.
(c) $\mathcal{C}, \mathcal{D}$ are VC implies $\mathcal{C} \cap \mathcal{D}$ and $\mathcal{C} \cup \mathcal{D}$ are VC, where the intersection and union classes are defined as the classes of intersections $C \cap D$ and unions $C \cup D$, $C \in \mathcal{C}$, $D \in \mathcal{D}$.
(d) $\mathcal{C}$ is VC in some space $S_1$ and $\mathcal{D}$ is VC in some space $S_2$ implies that the Cartesian product $\mathcal{C} \otimes \mathcal{D}$ is VC in the product space $S_1 \otimes S_2$.
We can now state a general version of the familiar Glivenko–Cantelli theorem. It says that a VC class is a Glivenko–Cantelli class. The following famous theorem of Vapnik and Chervonenkis (1971) on Euclidean spaces is considered to be a definitive result for the problem.
Theorem 16.10. Let $X_1, X_2, \ldots \stackrel{iid}{\sim} P$, a probability measure on $\mathbb{R}^d$ for some finite $d$. Given any class of (measurable) sets $\mathcal{C}$, for $n \geq 1$, $\epsilon > 0$,
\[ P\Big(\sup_{C \in \mathcal{C}} |P_n(C) - P(C)| > \epsilon\Big) \leq 8E[\Delta_{\mathcal{C}}(X_1, \ldots, X_n)]e^{-n\epsilon^2/32} \leq 8S(n, \mathcal{C})e^{-n\epsilon^2/32}. \]
Remark. This theorem implies that for classes of sets that are of the right complexity as measured by the VC dimension, the empirical measure converges to the true measure at an essentially exponential rate. This is a sophisticated generalization of the one-dimensional DKW inequality. The first bound of this theorem is harder to implement because it involves computation of a hard expectation, namely $E[\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)]$. It would usually not be possible to find this expectation in closed form, although simulating the quantity $\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)$ would be an interesting exercise.
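As a numerical aside (our sketch; the choices of $n$, $\epsilon$, and $P$ are arbitrary), one can compare the bound $8S(n, \mathcal{C})e^{-n\epsilon^2/32}$ with a Monte Carlo estimate for the class of left unbounded intervals, where $S(n, \mathcal{C}) = n + 1$ and the supremum is the Kolmogorov–Smirnov statistic.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n, eps, reps = 500, 0.1, 2000
    exceed = 0
    for _ in range(reps):
        x = np.sort(rng.standard_normal(n))
        F = norm.cdf(x)
        i = np.arange(1, n + 1)
        d_n = np.maximum(i / n - F, F - (i - 1) / n).max()
        exceed += d_n > eps
    print("Monte Carlo estimate:", exceed / reps)
    print("VC bound            :", 8 * (n + 1) * np.exp(-n * eps**2 / 32))

At these sample sizes the bound exceeds one and is thus trivial; the content of the theorem is the exponential rate in $n$, not sharp constants.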
The general theorem is given next; see Giné (1996).

Theorem 16.11. Let $P$ be a probability measure on a general measurable space $S$ and let $X_1, X_2, \ldots \stackrel{iid}{\sim} P$. Let $P_n$ denote the sequence of empirical measures and let $\mathcal{C}$ be a class of sets in $S$. Then, under suitable measurability conditions,

(a) $\sup_{C \in \mathcal{C}} |P_n(C) - P(C)| \xrightarrow{a.s.} 0$ iff $\sup_{C \in \mathcal{C}} |P_n(C) - P(C)| \xrightarrow{P} 0$.
(b) $\sup_{C \in \mathcal{C}} |P_n(C) - P(C)| \xrightarrow{a.s.} 0$ iff $\frac{\log \Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)}{n} \xrightarrow{P} 0$.
(c) Suppose $\mathcal{C}$ is a VC class of sets. Then $\sup_{C \in \mathcal{C}} |P_n(C) - P(C)| \xrightarrow{a.s.} 0$.
It is not easy to find $\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)$ for the classes of sets $\mathcal{C}$ that we will meet in practice. Note, however, that we always have the upper bound $\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n) \leq S(n, \mathcal{C})$, and so, if we know the shattering coefficient $S(n, \mathcal{C})$, then application of the sufficiency part of this theorem becomes more practicable. The point is that one way or another, we need to find a bound on $\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)$ rather than working with it directly. Here is a very neat application of this result from Giné (1996).

Example 16.7 (Uniform Convergence of Relative Frequencies). This example could be characterized as the reason that statistics exists as a subject. Suppose we have a multinomial distribution with cell probabilities $p_1, p_2, \ldots$, and in a sample of $n$ individuals, $X_i$ are found to be of type $i$. Thus, the relative frequencies are $\frac{X_i}{n}, i = 1, 2, \ldots$. Of course, if the number of types (cells) is finite, then by just the SLLN, $\max_i |\frac{X_i}{n} - p_i|$ converges almost surely to zero. Suppose that the number of cells is infinite. Even then, the relative frequencies are uniformly close to the true cell probabilities in large samples; statistical sampling will unveil the complete truth if large samples are taken. We give a proof of this by using part (b) of our theorem above.

To formulate the problem, suppose $P$ is supported on a countable set, which without loss of generality we take to be $\mathbb{Z}^+ = \{1, 2, 3, \ldots\}$, and suppose $X_1, X_2, \ldots, X_n$ is an iid sample from $P$. Let $P_n$ be the empirical measure. We want to prove that $\|P_n - P\|_{\mathcal{C}}$ converges almost surely to zero even if $\mathcal{C}$ is the power set of $\mathbb{Z}^+$. To derive this result, we use the binomial distribution tail inequality that if $W \sim \mathrm{Bin}(n, p)$, then for any $k$, $P(W \geq k) \leq \left(\frac{enp}{k}\right)^k$ (see, e.g., Giné (1996)). To verify the condition in part (b) of the above theorem, fix $\epsilon > 0$, and observe that
\[ P\left(\frac{\log \Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)}{n} > \epsilon \log 2\right) = P\left(\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n) > 2^{n\epsilon}\right) \]
\[ \leq P\left(\text{Number of distinct values among } X_1, X_2, \ldots, X_n \geq \lfloor n\epsilon \rfloor\right) \]
\[ \leq P\left(\text{At least } \tfrac{1}{2}\lfloor n\epsilon \rfloor \text{ of } X_1, X_2, \ldots, X_n \text{ are } \geq \tfrac{1}{2}\lfloor n\epsilon \rfloor\right) \]
\[ \leq \left(\frac{enP\left(X_1 \geq \frac{1}{2}\lfloor n\epsilon \rfloor\right)}{\frac{1}{2}\lfloor n\epsilon \rfloor}\right)^{\frac{1}{2}\lfloor n\epsilon \rfloor} \]
(this is the binomial distribution inequality referred to above)
\[ \to 0, \]
because $\epsilon$ is a fixed number and $P(X_1 \geq \frac{1}{2}\lfloor n\epsilon \rfloor) \to 0$, making the quotient
\[ \frac{enP\left(X_1 \geq \frac{1}{2}\lfloor n\epsilon \rfloor\right)}{\frac{1}{2}\lfloor n\epsilon \rfloor} \]
eventually bounded below one. This proves that
\[ \frac{\log \Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)}{n} \xrightarrow{P} 0, \]
as was desired.
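A direct simulation of this example is easy (our sketch; a geometric distribution stands in for $P$, an arbitrary choice). On a countable sample space, the supremum over all subsets is the total variation distance $\frac{1}{2}\sum_i |P_n(i) - P(i)|$, which the code computes.

    import numpy as np

    rng = np.random.default_rng(2)
    p = 0.3  # cell probabilities p(1-p)^(k-1), k = 1, 2, ...
    for n in [100, 1000, 10000, 100000]:
        x = rng.geometric(p, size=n)
        kmax = x.max()
        pn = np.bincount(x, minlength=kmax + 1)[1:] / n
        k = np.arange(1, kmax + 1)
        pk = p * (1 - p) ** (k - 1)
        # sup over ALL subsets of Z+ = total variation distance;
        # unobserved cells beyond kmax contribute their mass (1-p)^kmax
        tv = 0.5 * (np.abs(pn - pk).sum() + (1 - p) ** kmax)
        print(n, tv)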

16.4 CLTs for Empirical Measures and Applications

Theorem 16.8 and Theorem 16.9 give us hope for establishing CLTs for suitably normalized versions of $\sup_{C \in \mathcal{C}} |P_n(C) - P(C)|$ in general spaces and with general VC classes of sets. It is useful to think of this as an analogue of the one-dimensional Kolmogorov–Smirnov statistic for real-valued random variables, namely, $\sup_x |F_n(x) - F(x)|$. Invariance principles allowed us to conclude that the limiting distribution is related to a Brownian bridge, with real numbers in $[0, 1]$ as the time parameter. Now, however, the setup is much more abstract. The space is not a Euclidean space, and the time parameter is a set or a function. So the formulation and description of the appropriate CLTs is more involved, and although suitable Gaussian processes still emerge as the relevant processes that determine the asymptotics, they are not Brownian bridges, and they even depend on the underlying $P$ from which we are sampling. Some of the most profound advances in the theory of statistics and probability in the twentieth century have taken place around this problem, resulting along the way in deep mathematical developments

16.4.1 Notation and Formulation

First some notation and definitions. We recall that the notation $(P_n - P)(f)$ means $\int f \, dP_n - \int f \, dP$. Here, $f$ is supposed to belong to some suitable class of functions $\mathcal{F}$. For example, $\mathcal{F}$ could be the class of indicator functions of the members $C$ of a class of sets $\mathcal{C}$. In that case, $(P_n - P)(f)$ would simply mean $P_n(C) - P(C)$; we have just talked about strong laws for their supremums as $C$ varies over $\mathcal{C}$. That is a uniformity result. Likewise, we now need certain uniformity assumptions on the class of functions $\mathcal{F}$. We assume that

(a) $\sup_{f \in \mathcal{F}} |f(s)| := F(s) < \infty$ for all $s \in S$
(measurability of $F$ is clearly not obvious, but is being ignored here);
(b) $F \in L_2(P)$.

The function $F$ is called the envelope of $\mathcal{F}$. In the case of real-valued random variables and for the problem of convergence of the process $F_n(t) - F(t)$, the corresponding functions, as we just noted, are indicator functions of $(-\infty, t]$, which are uniformly bounded functions. Now the time parameter has become a function itself, and we need to talk about uniformly bounded functionals of functions; we use the notation
\[ l^\infty(\mathcal{F}) = \{h : \mathcal{F} \to \mathbb{R} : \sup_{f \in \mathcal{F}} |h(f)| < \infty\}. \]
Furthermore, we refer to $\sup_{f \in \mathcal{F}} |h(f)|$ as the uniform norm and denote it as $\|h\|_{\infty, \mathcal{F}}$.

The other two notions we need to formalize are those of convergence of the process $(P_n - P)(f)$ (on normalization) and of a limiting Gaussian process that plays the role of a Brownian bridge in these general circumstances.

The Gaussian process, which we denote as $B_P(f)$, continues to have continuous sample paths, as was the case for the ordinary Brownian bridge, but now the time parameter is a function, and continuity is with respect to $\rho_P(f, g) = \sqrt{E_P(f(X) - g(X))^2}$. $B_P$ has mean zero, and the covariance kernel $\mathrm{Cov}(B_P(f), B_P(g)) = P(fg) - P(f)P(g) := E_P(f(X)g(X)) - E_P(f(X))E_P(g(X))$. Note that due to our assumption that $F \in L_2(P)$, the covariance kernel is well defined. Trajectories of our Gaussian process $B_P$ are therefore members of $l^\infty(\mathcal{F})$, and are also (uniformly) continuous with respect to the pseudometric $\rho_P$ we have defined above.

Finally, as in the Portmanteau theorem in Chapter 7, convergence of the process $\sqrt{n}(P_n - P)(f)$ to $B_P(f)$ means that the expectation of any functional $H$ of $\sqrt{n}(P_n - P)(f)$ will converge to the expectation of $H(B_P(f))$, $H$ being a bounded and continuous functional, defined on $l^\infty(\mathcal{F})$, and taking values in $\mathbb{R}$. We remind our reader that continuity on $l^\infty(\mathcal{F})$ is with respect to the uniform norm we have already defined there. A class of functions $\mathcal{F}$ for which this central limit property holds is called a $P$-Donsker class; if the property holds for every probability measure $P$ on $S$, it is called a universal Donsker class.

16.4.2 Entropy Bounds and Specific CLTs

We now discuss what sorts of assumptions on our class of functions $\mathcal{F}$ will ensure that weak convergence occurs (i.e., a CLT holds), and also what are some good applications of such CLTs. There are multiple sets of assumptions on the class of functions $\mathcal{F}$ that ensure a CLT. Here, we describe only two, one of which relates to the concept of VC classes, and the second to metric entropy and bracketing numbers. Inasmuch as we are already familiar with the concept of VC classes, we first state a CLT based on a VC assumption on a suitable class of sets.

Definition 16.6. A family $\mathcal{F}$ of functions $f$ on a (measurable) space $S$ is called VC-subgraph if the class of subgraphs of $f \in \mathcal{F}$ is a VC class of sets, where the subgraph of $f$ is defined to be $C_f = \{(x, y), x \in S, y \in \mathbb{R} : 0 \leq y \leq f(x) \text{ or } f(x) \leq y \leq 0\}$.

Theorem 16.12. Given $X_1, X_2, \ldots \stackrel{iid}{\sim} P$, a probability measure on a measurable space $S$, and a family of functions $\mathcal{F}$ on $S$ such that $F(s) := \sup_{f \in \mathcal{F}} |f(s)| \in L_2(P)$, $\sqrt{n}(P_n - P)(f) \xrightarrow{\mathcal{L}} B_P(f)$ if $\mathcal{F}$ is a VC-subgraph family of functions.
An important application of this theorem is the following result.

Corollary 16.1. Under the other assumptions made in the above theorem, $\sqrt{n}(P_n - P)(f) \xrightarrow{\mathcal{L}} B_P(f)$ if $\mathcal{F}$ is a finite-dimensional space of functions or if $\mathcal{F} = \{I_C : C \in \mathcal{C}\}$, where $\mathcal{C}$ is any VC class of sets.

This theorem beautifully connects the scope of a Glivenko–Cantelli theorem to that of a CLT via the same VC concept, modulo the extra qualification that $F \in L_2(P)$. One can see more about this key theorem in Alexander (1987) and Giné (1996).
A pretty and useful statistical application of the above result is the following example on an extension (due to Beran and Millar (1986)) of the familiar Kolmogorov–Smirnov test for goodness of fit to general spaces.

Example 16.8. Let $X_1, X_2, \ldots$ be iid observations from $P$ on some space $S$ and consider testing the null hypothesis $H_0 : P = P_0$ (specified). The natural Kolmogorov–Smirnov type test statistic for this problem is $T_n = \sqrt{n}\sup_{C \in \mathcal{C}} |P_n(C) - P_0(C)|$, for a judiciously chosen family of (measurable) sets $\mathcal{C}$. The above theorem implies that $T_n$ converges under the null in distribution to the supremum of the absolute value of the Gaussian process $B_{P_0}(f)$, the sup being taken over all $f = I_C, C \in \mathcal{C}$, a VC class of subsets of $S$. In principle, therefore, the null hypothesis can be tested by using this Kolmogorov–Smirnov type statistic. Note, however, that the limiting Gaussian process depends on $P_0$. Evaluation of the critical points of the limiting distribution of $T_n$ under the null needs some type of numerical work. In particular, a powerful statistical tool known as the bootstrap can help in this specific situation; see Giné (1996) for more discussion and references on this computational issue.
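A minimal sketch of such a test (ours, not from the text; it uses the VC class of southwest quadrants in $\mathbb{R}^2$ from Theorem 16.8, approximates the supremum on the grid of sample coordinates, and simulates the null distribution of $T_n$ by resampling from $P_0$, a parametric-bootstrap-style device):

    import numpy as np

    rng = np.random.default_rng(3)

    def T_n(sample):
        # sqrt(n) * sup over quadrants (-inf,a] x (-inf,b] of
        # |P_n(C) - P_0(C)|, with P_0 = uniform on the unit square;
        # the sup is approximated over the grid of sample coordinates
        n = len(sample)
        a = np.sort(sample[:, 0])
        b = np.sort(sample[:, 1])
        A, B = np.meshgrid(a, b, indexing="ij")
        Pn = ((sample[:, 0][None, None, :] <= A[:, :, None])
              & (sample[:, 1][None, None, :] <= B[:, :, None])).mean(axis=2)
        return np.sqrt(n) * np.abs(Pn - A * B).max()

    n = 100
    observed = T_n(rng.uniform(size=(n, 2)))
    null = np.array([T_n(rng.uniform(size=(n, 2))) for _ in range(300)])
    print("T_n =", observed, "approximate p-value =", (null >= observed).mean())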
The second CLT we present requires the concepts of metric entropy and bracketing numbers, which we introduce next.

Definition 16.7. Let $\mathcal{F}^*$ be a space of real-valued functions defined on some space $S$, and suppose $\mathcal{F}^*$ is equipped with a norm $\|\cdot\|$. Let $\mathcal{F}$ be a specific subcollection of $\mathcal{F}^*$. The covering number of $\mathcal{F}$ is defined to be the smallest number of balls $B(g, \epsilon) = \{h : \|h - g\| < \epsilon\}$ needed to cover $\mathcal{F}$, where $\epsilon > 0$ is arbitrary but fixed, $g \in \mathcal{F}^*$, and $\|g\| < \infty$.

The covering number of $\mathcal{F}$ is denoted as $N(\epsilon, \mathcal{F}, \|\cdot\|)$; $\log N(\epsilon, \mathcal{F}, \|\cdot\|)$ is called the entropy without bracketing of $\mathcal{F}$.

Definition 16.8. In the same setup as the previous definition, a bracket is the set of functions sandwiched between two given functions $l, u$; that is, a bracket is the set $\{f : l(s) \leq f(s) \leq u(s) \ \forall s \in S\}$. It is denoted as $[l, u]$.

Definition 16.9. The bracketing number of $\mathcal{F}$ is defined to be the smallest number of brackets $[l, u]$ needed to cover $\mathcal{F}$ under the restriction $\|l - u\| < \epsilon$, $\epsilon > 0$ an arbitrary but fixed number.

The bracketing number of $\mathcal{F}$ is denoted as $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$; $\log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is called the entropy with bracketing of $\mathcal{F}$.

Discussion: Clearly, the smaller the radius of the balls or the width of the brackets, the greater is the number of balls or brackets necessary to cover the function class $\mathcal{F}$. The important thing is to pin down, qualitatively, the rate at which the entropy (with or without bracketing) goes to $\infty$ for a given $\mathcal{F}$, as $\epsilon \to 0$. It turns out, as we show, that for many interesting and useful classes of functions $\mathcal{F}$, this rate would be of the order of $(-\log \epsilon)$, and this will, by virtue of some theorems to be given below, ensure that the class $\mathcal{F}$ is $P$-Donsker.

Theorem 16.13. Assume that $F \in L_2(P)$. Then, $\mathcal{F}$ is $P$-Donsker if either
\[ \text{(a)} \quad \int_0^1 \sqrt{\log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\| = L_2(P))}\, d\epsilon < \infty, \]
or
\[ \text{(b)} \quad \int_0^1 \sup_Q \sqrt{\log N\left(\epsilon\|F\|_{2,Q}, \mathcal{F}, \|\cdot\| = L_2(Q)\right)}\, d\epsilon < \infty, \]
where $Q$ denotes a general probability measure on $S$.

See van der Vaart and Wellner (1996, pp. 127–132) for a proof of this theorem, under the two separate sufficient conditions.

We have previously seen that if $\mathcal{F}$ is a VC-subgraph, then it is $P$-Donsker. It turns out that this result follows from our above theorem on the integrability of $\sup_Q \sqrt{\log N}$. What one needs is the following upper bound on the entropy without bracketing of a VC-subgraph class. See van der Vaart and Wellner (1996, p. 141) for its proof.

Proposition. Given a VC-subgraph class $\mathcal{F}$, for any probability measure $Q$, any $r \geq 1$, and all $0 < \epsilon < 1$, $N(\epsilon\|F\|_{r,Q}, \mathcal{F}, \|\cdot\| = L_r(Q)) \leq C\left(\frac{1}{\epsilon}\right)^{rVC(\mathcal{C})}$, where the constant $C$ depends only on $VC(\mathcal{C})$, $\mathcal{C}$ being the subgraph class of $\mathcal{F}$.

16.4.3 Concrete Examples

Here are some additional good applications of the entropy results.

Example 16.9. As mentioned above, the key to the applicability of the entropy theorems is a good upper bound on the rate of growth of the entropy numbers of the class. Such bounds have been worked out for many intuitively interesting classes. The bounds are sometimes sharp in the sense that lower bounds can also be obtained that grow at the same rate as the upper bounds. In nearly every case mentioned in this example, the derivation of the upper bound is completely nontrivial. A very good reference is van der Vaart and Wellner (1996), particularly Chapter 2.7 there.

Uniformly Bounded Monotone Functions on $\mathbb{R}$. For this function class $\mathcal{F}$, $\log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\| = L_2(P)) \leq \frac{K}{\epsilon}$, where $K$ is a universal constant independent of $P$, and so by part (a) of Theorem 16.13, this class is in fact a universal Donsker class.

Uniformly Bounded Lipschitz Functions on Bounded Intervals in $\mathbb{R}$. Let $\mathcal{F}$ be the class of real-valued functions on a bounded interval $I$ in $\mathbb{R}$ that are uniformly bounded by a universal constant and are uniformly Lipschitz of some order $\alpha > \frac{1}{2}$; that is, $|f(x) - f(y)| \leq M|x - y|^\alpha$, uniformly in $x, y$ and for some finite universal constant $M$. For this class, $\log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\| = L_2(P)) \leq K\left(\frac{1}{\epsilon}\right)^{1/\alpha}$, where $K$ depends only on the length of $I$, $M$, and $\alpha$, and so this class is also a universal Donsker class.

Compact Convex Subsets of a Fixed Compact Set in $\mathbb{R}^d$. Suppose $S$ is a compact set in $\mathbb{R}^d$ for some finite $d$, and let $\mathcal{C}$ be the class of all compact convex subsets of $S$. For any absolutely continuous $P$, this class satisfies $\log N_{[\,]}(\epsilon, \mathcal{C}, \|\cdot\| = L_2(P)) \leq K\left(\frac{1}{\epsilon}\right)^{d-1}$, where $K$ depends on $S, P$, and $d$. Here it is meant that the function class is the set of indicators of the members of $\mathcal{C}$. Thus, for $d = 2$, $\mathcal{F}$ is $P$-Donsker for any absolutely continuous $P$.

A common implication of all of these applications of the entropy theorems is that in each of these cases, asymptotic goodness-of-fit tests can be constructed by using these function classes.

16.5 Maximal Inequalities and Symmetrization

Consider a general stochastic process $X(t), t \in T$, for some time-indexing set $T$. Inequalities on the tail probabilities or moments of the maxima of $X(t)$ are called maximal inequalities for the $X(t)$ process. More precisely, a maximal inequality places an upper bound on probabilities of the form $P(\sup_{t \in T} |X(t)| > \lambda)$ or on $E[(\sup_{t \in T} |X(t)|)^p]$, often with $p = 2$. Constructing useful maximal inequalities is always a hard problem. Perhaps the maximum amount of concrete theory exists for suitable Gaussian processes (e.g., continuity of the sample paths is often required). In the context of developing maximal inequalities for $\sqrt{n}(P_n - P)(f), f \in \mathcal{F}$, this turns out to be a rather happy coincidence, because empirical processes are, after all, roughly Gaussian. A highly ingenious technique that has been widely used to turn this near Gaussianity into useful maximal inequalities for the empirical process is the symmetrization method. Ultimately, the inequalities on tail probabilities are aimed at establishing some sort of an exponential decay for the tail, and the moment inequalities often end up having ties to covering and entropy numbers of the class $\mathcal{F}$. In this way, maximal inequalities for Gaussian processes, the technique of symmetrization, and covering numbers come together very elegantly to produce a collection of sophisticated results on deviations of the empirical measure from the true one. The treatment below gives a flavor of this part of the theory. We follow Pollard (1989) and Giné (1996).
Preview. It is helpful to have a summary of the main ideas behind the technical development of the ultimate inequalities. The path to the inequalities is roughly the following.

(a) Use existing Gaussian process theory to write down inequalities for the moments of the maxima of a Gaussian process in terms of its covariance function. This has connections to covering numbers.
(b) Invent a sequence of suitable new random variables $W_1, W_2, \ldots$, and from these, a sequence of new processes $Z_n, n \geq 1$, such that conditional on the $X_i$, $Z_n$ is a Gaussian process, and such that the tails of $Z_n$ are heavier than the tails of the empirical process. This is the symmetrization part of the story.
(c) Apply the Gaussian process moment inequalities to the $Z_n$ process conditional on the $X_i$, and then uncondition them. Because $Z_n$ has heavier tails than the empirical process, these unconditioned moment bounds will also be valid for the empirical process.
(d) Use the moment bounds to bound in turn tail probabilities for the empirical process.

These steps are sketched below. We need a new definition and a Gaussian process inequality.

Definition 16.10. Let $T$ be an indexing set equipped with a pseudometric $\rho$. For a given $\epsilon > 0$, the packing number $D(\epsilon, T, \rho)$ is defined to be the largest possible $n$ such that there exist $t_1, t_2, \ldots, t_n \in T$ with $\rho(t_i, t_j) > \epsilon$ for $1 \leq i \neq j \leq n$; $\log D(\epsilon, T, \rho)$ is called the $\epsilon$-capacity of $T$.
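Packing numbers of simple sets are easy to bound numerically: any greedy collection of points with pairwise distances exceeding $\epsilon$ is a valid packing and so lower-bounds $D(\epsilon, T, \rho)$. A sketch (ours; unit square, Euclidean metric, candidates from a finite grid):

    import numpy as np

    def greedy_packing(candidates, eps):
        # keep a point if it is farther than eps from all kept points;
        # the size of the result is a lower bound on D(eps, T, rho)
        chosen = []
        for c in candidates:
            if all(np.linalg.norm(c - p) > eps for p in chosen):
                chosen.append(c)
        return len(chosen)

    g = np.linspace(0.0, 1.0, 41)
    candidates = np.array([(u, v) for u in g for v in g])
    for eps in [0.5, 0.25, 0.1]:
        print(eps, greedy_packing(candidates, eps))

The counts grow like $\epsilon^{-2}$, the $d = 2$ case of the general $\epsilon^{-d}$ behavior (compare Exercises 16.33 and 16.35).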
The Gaussian process inequality that we need uses the $\epsilon$-capacity of its time indexing set $T$. It is stated below.

Lemma. Suppose $Z(t), t \in T$ is a Gaussian process whose paths are continuous with respect to some specific pseudometric $\rho$ on $T$. Assume also that $Z(t)$ satisfies $E[Z(t) - Z(s)]^2 \leq \rho^2(t, s)$ for all $s, t \in T$. Fix $t_0 \in T$. Then, there exists a finite universal constant $K$ such that
\[ E\Big[\sup_{t \in T} |Z(t)|\Big] \leq \left(E[Z^2(t_0)]\right)^{\frac{1}{2}} + K\int_0^\delta \left(\log D(x, T, \rho)\right)^{\frac{1}{2}} dx, \]
where $\delta = \sup_{t \in T} \rho(t, t_0)$.

There are similar inequalities for other moments of $\sup_{t \in T} |Z(t)|$, and those too can be used to bound the tail probabilities.

Theorem 16.14. Let $\mathcal{F}$ be a given class of functions and let, for $n \geq 1$, $\sup_{f \in \mathcal{F}} |\sqrt{n}(P_n - P)(f)| = \Delta_n$. Suppose the family $\mathcal{F}$ has an envelope function $F$ bounded by $M < \infty$. Then, for a suitable universal finite constant $K$,
\[ E[\Delta_n^2] \leq 2\pi K^2\, E\left[\left(\int_0^M \sqrt{\log D(x, \mathcal{F}, P_n)}\, dx\right)^2\right], \]
where $D(x, \mathcal{F}, P_n)$ is the largest number $N$ for which we can find functions $f_1, f_2, \ldots, f_N \in \mathcal{F}$ such that $\sqrt{P_n(f_i - f_j)^2} \geq x$ for any $i, j, i \neq j$.

Proof. We give a sketch of the proof. For this proof, we use the notation $E_{\{X\}}$ to denote conditional expectation given $X_1, X_2, \ldots, X_n$. The bulk of the proof consists of showing the following.

(a) Construct a new process $Z_n(X, f), f \in \mathcal{F}$, based on a sequence of new random variables described below, such that for any $X_1, X_2, \ldots, X_n$,
\[ \left(E_{\{X\}} \sup_{f \in \mathcal{F}} |Z_n(X, f)|^2\right)^{1/2} \leq K\int_0^M \sqrt{\log D(x, \mathcal{F}, P_n)}\, dx. \]
(b) Show that $E[\Delta_n^2] \leq 2\pi E[\sup_{f \in \mathcal{F}} |Z_n(X, f)|^2]$; note that here $E$ means unconditional expectation.
(c) Square the bound in part (a) on the conditional expectation, and take another expectation to get a bound on the unconditional expectation. Then combine it with the bound in part (b), and the theorem falls out. Thus, part (a) and part (b) easily imply the result of the theorem.

To prove parts (a) and (b), we construct a duplicate iid sequence $X_1', X_2', \ldots, X_n'$ from our basic measure $P$, and iid $N(0, 1)$ variables $W_1, W_2, \ldots, W_n$, such that the entire collection of variables $\{X_1, X_2, \ldots, X_n\}, \{X_1', X_2', \ldots, X_n'\}, \{W_1, W_2, \ldots, W_n\}$ is mutually independent. Let $\epsilon_i = \frac{W_i}{|W_i|}, 1 \leq i \leq n$; note that the $\epsilon_i$ are symmetric $\pm 1$ valued random variables, and $\epsilon_i$ is independent of $|W_i|$. We now define our new process $Z_n$:
\[ Z_n(X, f) = n^{-1/2}\sum_{i=1}^n W_i f(X_i) = n^{-1/2}\sum_{i=1}^n \epsilon_i |W_i| f(X_i). \]
Note that conditionally, given the $X_i$, this is a Gaussian process, and the Gaussian process lemma stated above will apply to it. The duplicate sequence
$\{X_1', X_2', \ldots, X_n'\}$ has not so far been used, but is used very cleverly in the steps below:
\[ E[\Delta_n^2] = E\left[\sup_f \left|\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \int f\, dP\right)\right|^2\right] \]
\[ = E\left[\sup_f \left|\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \frac{1}{n}\sum_{i=1}^n Ef(X_i')\right)\right|^2\right] \]
\[ = \frac{1}{n}E\left[\sup_f \left|\sum_{i=1}^n [f(X_i) - E_{\{X\}}f(X_i')]\right|^2\right] \]
(because the $\{X_i'\}$ are independent of the $\{X_i\}$)
\[ = \frac{1}{n}E\left[\sup_f \left|E_{\{X\}}\sum_{i=1}^n [f(X_i) - f(X_i')]\right|^2\right] \]
(for the same reason)
\[ \leq \frac{1}{n}E\left[\sup_f E_{\{X\}}\left|\sum_{i=1}^n [f(X_i) - f(X_i')]\right|^2\right] \]
(inside the supremum, use the fact that for any $U$, $|E(U)|^2 \leq E(U^2)$)
\[ \leq \frac{1}{n}E\left[E_{\{X\}}\sup_f \left|\sum_{i=1}^n [f(X_i) - f(X_i')]\right|^2\right] \]
(use the fact that the supremum of $E_{\{X\}}$ of a collection is at most the $E_{\{X\}}$ of the supremum of the collection)
\[ = \frac{1}{n}E\left[\sup_f \left|\sum_{i=1}^n [f(X_i) - f(X_i')]\right|^2\right] \]
(because the expectation of a conditional expectation is the marginal expectation)
\[ = \frac{1}{n}E\left[\sup_f \left|\sum_{i=1}^n \epsilon_i[f(X_i) - f(X_i')]\right|^2\right] \]
(because the three sequences $\{\epsilon_i\}, \{X_i\}, \{X_i'\}$ are mutually independent, and the expectation is the same for each of the $2^n$ combinations of the vector $(\epsilon_1, \epsilon_2, \ldots, \epsilon_n)$)
\[ \leq \frac{4}{n}E\sup_f \left|\sum_{i=1}^n \epsilon_i f(X_i)\right|^2 \]
(use the triangular inequality $|\sum_{i=1}^n \epsilon_i[f(X_i) - f(X_i')]| \leq |\sum_{i=1}^n \epsilon_i f(X_i)| + |\sum_{i=1}^n \epsilon_i f(X_i')|$, and then the fact that $(|a| + |b|)^2 \leq 2(a^2 + b^2)$).

This implies, by the definition of $Z_n(X, f) = n^{-1/2}\sum_{i=1}^n \epsilon_i|W_i|f(X_i)$, and the fact that $E(|W_i|) = \sqrt{\frac{2}{\pi}}$,
\[ E[\Delta_n^2] \leq 4 \cdot \frac{\pi}{2}\, E\sup_f |Z_n(X, f)|^2 = 2\pi E\sup_f |Z_n(X, f)|^2. \]
Finally, now, evaluate $E\sup_f |Z_n(X, f)|^2$ by first conditioning on $\{X_1, X_2, \ldots, X_n\}$, so that the Gaussian process lemma applies, and we get
\[ E_{\{X\}}\sup_f |Z_n(X, f)|^2 \leq K^2\left(\int_0^\delta \sqrt{\log D(x, \mathcal{F}, P_n)}\, dx\right)^2 \leq K^2\left(\int_0^M \sqrt{\log D(x, \mathcal{F}, P_n)}\, dx\right)^2. \]
This proves both part (a) and part (b), and that leads to the theorem.

16.6 Connection to the Poisson Process

In describing the complete randomness property of a homogeneous Poisson process on $\mathbb{R}^+$, we noted in Theorem 13.3 that the conditional distribution of the arrival times in the Poisson process, given that there have been $n$ events in $[0, 1]$, is the same as the joint distribution of the order statistics in a sample of size $n$ from a $U[0, 1]$ distribution. Essentially the same proof leads to the result that for a sample of size $n$ from a $U[0, 1]$ distribution, the counting process $\{nG_n(t), 0 \leq t \leq 1\}$, which counts the number of sample values that are $\leq t$, acts as a homogeneous Poisson process on $[0, 1]$ given that there have been $n$ events in the time interval $[0, 1]$. This is useful, because the Poisson process has special properties that can be used to manipulate the empirical process. In particular, this Poisson process connection leads to useful inequalities on the exceedance probabilities of the empirical process, that is, probabilities of the form $P(\sup_{0 \leq t \leq T} |\alpha_n(t)| \geq c)$ for $T < 1, c > 0$.

We first give the exact Poisson process connection, and then describe an exceedance inequality.

Theorem 16.15. For given $n \geq 1$, let $U_1, U_2, \ldots, U_n$ be iid $U[0, 1]$ variables, and let $G_n(t) = \frac{\#\{i : U_i \leq t\}}{n}, 0 \leq t \leq 1$. Let $X(t)$ be a homogeneous Poisson process with constant arrival rate $n$. Then,
\[ \{nG_n(t), 0 \leq t \leq 1\} \stackrel{\mathcal{L}}{=} \{X(t), 0 \leq t \leq 1 \mid X(1) = n\}. \]

This distributional equality allows us to essentially conclude that the probability that the path of a uniform empirical process has some property is comparable to the probability that the path of a certain homogeneous Poisson process has that same property. Once we have this, the machinery of Poisson processes can be brought to bear.
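The distributional equality can be checked by simulation (our sketch; it generates genuine rate-$n$ Poisson paths via exponential interarrival times and conditions on $X(1) = n$ by acceptance–rejection, then compares moments of $X(t)$ with those of $nG_n(t)$):

    import numpy as np

    rng = np.random.default_rng(4)
    n, t, reps = 50, 0.3, 2000

    def conditioned_poisson_counts(n, t, reps, rng):
        out = []
        while len(out) < reps:
            # rate-n Poisson process on [0,1] from exponential gaps
            arrivals = np.cumsum(rng.exponential(1.0 / n, size=3 * n))
            arrivals = arrivals[arrivals <= 1.0]
            if len(arrivals) == n:     # accept only paths with X(1) = n
                out.append((arrivals <= t).sum())
        return np.array(out)

    left = (rng.uniform(size=(reps, n)) <= t).sum(axis=1)   # n G_n(t)
    right = conditioned_poisson_counts(n, t, reps, rng)
    print(left.mean(), right.mean())   # both approximately n t
    print(left.var(), right.var())     # both approximately n t (1 - t)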

Theorem 16.16. Let $E$ be any set of nonnegative, nondecreasing, right-continuous functions on $[0, T], T < 1$. Fix $n \geq 1$. Then, there exists a finite constant $C = C(T)$ such that
\[ P\big(\{nG_n(t), 0 \leq t \leq T\} \in E\big) \leq C\,P\big(\{X(t), 0 \leq t \leq T\} \in E\big), \]
where, as above, $X(t)$ is a homogeneous Poisson process with constant arrival rate $n$.

Proof. Denote
\[ M = X(T), \quad \bar{M} = X(1) - X(T), \]
\[ A_1 = \text{the event that } \{nG_n(t), 0 \leq t \leq T\} \in E; \]
\[ A_2 = \text{the event that } \{X(t), 0 \leq t \leq T\} \in E. \]
Because $X(t)$ is a Poisson process, $(A_2, M)$ is (jointly) independent of $\bar{M}$. Using this, and the distributional equality fact in Theorem 16.15,
\[ P(A_1) = P(A_2 \mid X(1) = n) = \frac{P(A_2 \cap \{M + \bar{M} = n\})}{P(M + \bar{M} = n)} \]
\[ = \frac{1}{P(M + \bar{M} = n)}\sum_{k=0}^n P(A_2 \cap \{M = n - k\})P(\bar{M} = k) \]
\[ \leq \frac{\sup_{k \geq 0} P(\bar{M} = k)}{P(X(1) = n)}\sum_{k=0}^n P(A_2 \cap \{M = n - k\}) \]
\[ \leq \frac{\sup_{k \geq 0} P(\bar{M} = k)}{P(X(1) = n)}\,P(A_2). \]
Now we use the fact that $\lfloor \lambda \rfloor$ is a mode of a Poisson distribution with mean $\lambda$ (see Theorem 1.25). This allows us to conclude
\[ \sup_{k \geq 0} P(\bar{M} = k) \leq e^{-n(1-T)}\frac{[n(1-T)]^{\lfloor n(1-T) \rfloor}}{\lfloor n(1-T) \rfloor!}. \]
Plugging this into the previous line,
\[ P(A_1) \leq \frac{e^n n!}{n^n}\, e^{-n(1-T)}\frac{[n(1-T)]^{\lfloor n(1-T) \rfloor}}{\lfloor n(1-T) \rfloor!}\, P(A_2). \]
Now, using Stirling's approximation for the factorials, it follows that
\[ \frac{e^n n!}{n^n}\, e^{-n(1-T)}\frac{[n(1-T)]^{\lfloor n(1-T) \rfloor}}{\lfloor n(1-T) \rfloor!} \]
has a finite supremum over $n$, and we may take the constant $C$ of the theorem to be this supremum. This completes the proof of this theorem.

This theorem, together with the following Poisson process inequality, ultimately leads to an exceedance probability for the empirical process. The function $h$ in the theorem below is $h(x) = x\log x - x + 1, x > 0$.

Theorem 16.17. Let $X(t), t \geq 0$ be a homogeneous Poisson process with constant arrival rate equal to 1. Fix $T, \alpha \geq 0$. Then,
\[ P\Big(\sup_{0 \leq t \leq T} |X(t) - t| \geq \alpha T\Big) \leq 2e^{-Th(1+\alpha)}. \]
See del Barrio et al. (2007, p. 126) for this Poisson process inequality.
Here is our final exceedance probability for the uniform empirical process. By the usual means of a quantile transform, this can be translated into an exceedance probability for $\beta_n(t)$, the general one-dimensional empirical process. The theorem below establishes exponential bounds on the tail of the supremum of the empirical process. In spirit, this is similar to the DKW inequality (see Theorem 6.13). Processes with such an exponential rate of decay for the tail are called sub-Gaussian. Thus, the theorem below and the DKW inequality establish the sub-Gaussian nature of the one-dimensional empirical process.

Theorem 16.18 (Exceedance Probability for the Empirical Process). Let $0 < T < 1$, and $\alpha > 0$. The uniform empirical process $\alpha_n(t), 0 \leq t \leq 1$ satisfies the inequality
\[ P\Big(\sup_{0 \leq t \leq T} |\alpha_n(t)| \geq \alpha\sqrt{T}\Big) \leq Ce^{-nTh\left(1 + \frac{\alpha}{\sqrt{nT}}\right)} \leq Ce^{-\frac{\alpha^2}{2} + \frac{\alpha^3}{6\sqrt{nT}}}, \]
for some constant $C$ depending only on $T$.

We refer to del Barrio, Deheuvels, and van de Geer (2007, p. 147) for a detailed proof, using Theorem 16.16 and Theorem 16.17.
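The two exponents in the display are easy to compare numerically. A sketch (ours; the values of $n$, $T$, and $\alpha$ are arbitrary) evaluates $nTh(1 + \alpha/\sqrt{nT})$ against its two-term expansion $\alpha^2/2 - \alpha^3/(6\sqrt{nT})$, which is where the second, sub-Gaussian-looking bound comes from:

    import numpy as np

    def h(x):
        # h(x) = x log x - x + 1, the function in Theorem 16.17
        return x * np.log(x) - x + 1.0

    n, T = 1000, 0.5
    for a in [1.0, 2.0, 3.0]:
        exact = n * T * h(1.0 + a / np.sqrt(n * T))
        expansion = a**2 / 2.0 - a**3 / (6.0 * np.sqrt(n * T))
        print(a, exact, expansion)

Since $h(1 + x) = x^2/2 - x^3/6 + \cdots$, the exact exponent dominates the expansion, so the second inequality of the theorem indeed follows from the first.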

Exercises

Exercise 16.1. Simulate iid $U[0, 1]$ observations and plot the uniform empirical process $\alpha_n(t)$ for $n = 10, 25, 50$.
Exercise 16.2. Give examples of three functions that are members of $D[0, 1]$, but not of $C[0, 1]$.
Exercise 16.3. For a given $n$, what is the set of possible values of the Cramér–von Mises statistic $C_n^2$?
Exercise 16.4. Find an expression for $E(C_n^2)$.
Exercise 16.5. By using the Karhunen–Loève expansion for a Brownian bridge, give a proof of the familiar calculus formula $\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}$.

Exercise 16.6 * (Anderson–Darling Statistic). Find the characteristic function of the limiting distribution of the Anderson–Darling statistic $A_n^2$.
Exercise 16.7. Prove that $\frac{D_n}{\sqrt{2\log\log n}} \xrightarrow{P} 1$.

Exercise 16.8 * (Deviations from the Three Sigma Rule). Let $X_1, X_2, \ldots$ be iid $N(\mu, \sigma^2)$, and let $\bar{X}_n, s_n$ be the mean and the standard deviation of the first $n$ observations. On suitable centering and normalization, find the limiting distribution of the number of observations among the first $n$ that are between $\bar{X}_n \pm ks_n$, where $k$ is a fixed general positive number.
Exercise 16.9 (Rate in Strong Approximation). Prove part (a) of Theorem 16.5 by using the KMT inequality in part (b), taking $x = x_n = c\log n$, and then invoking the Borel–Cantelli lemma.
Exercise 16.10 (Strong Approximation of Quantile Process). Show that the
strong approximation result in part (c) of Theorem 16.5 holds for any normal and
any Cauchy distribution.
Exercise 16.11 * (Strong Approximation of Quantile Process). Give an example
of a distribution for which the conditions of part (c) of Theorem 16.5 do not hold.
Hint: Look at distributions with very slow tails.
Exercise 16.12 (Nearest Neighbors Are Close). Let $X_1, X_2, \ldots$ be iid from a distribution $P$ supported on a compact set $C$ in $\mathbb{R}^d$. Let $K$ be a fixed compact set in $\mathbb{R}^d$. Prove that $\sup_{x \in K \cap C} \|X_{n,NN,x} - x\| \xrightarrow{a.s.} 0$, where for a given $x$, $X_{n,NN,x}$ is a nearest neighbor of $x$ among $X_1, X_2, \ldots, X_n$.
Is this result true without the assumption of compact support for $P$?
Exercise 16.13. Consider the weight function $w_\alpha(t) = t^\alpha$ for $0 \leq t \leq \frac{1}{2}$, and $w_\alpha(t) = (1-t)^\alpha$ for $\frac{1}{2} \leq t \leq 1$. For what values of $\alpha, \alpha \geq 0$, does $w_\alpha(t)$ satisfy part (a) of the Chibisov–O'Reilly theorem? Part (b) of the Chibisov–O'Reilly theorem?

Exercise 16.14. Let $B(t)$ be a Brownian bridge on $[0, 1]$. What can you say about $\limsup_{t \to 0} \frac{|B(t)|}{\sqrt{t(1-t)}}$?

Exercise 16.15 (Deviation of Sample Mean from Population Mean). Suppose $X_1, X_2, \ldots$ are iid from a distribution $P$ on the real line with a finite variance $\sigma^2$. Does $\frac{|\bar{X} - \mu|}{E(|\bar{X} - \mu|)}$ converge almost surely to one? Prove your answer.

Exercise 16.16 (Evaluating Shattering Coefficients). Suppose $\mathcal{C}$ is the class of all closed rectangles in $\mathbb{R}$. Show that $S(n, \mathcal{C}) = 1 + \binom{n}{1} + \binom{n}{2}$, and hence show that the VC dimension of $\mathcal{C}$ is 2.

Exercise 16.17 * (Evaluating VC Dimensions). Find the VC dimension of the following classes of sets:
(a) Southwest quadrants of $\mathbb{R}^d$
(b) Closed half-spaces of $\mathbb{R}^d$
(c) Closed balls of $\mathbb{R}^d$
(d) Closed rectangles of $\mathbb{R}^d$

Exercise 16.18. Give examples of three nontrivial classes of sets in $\mathbb{R}^d$ that are not VC classes.

Exercise 16.19 * (Evaluating VC Dimensions). Find the VC dimension of the class of all polygons in the plane with four vertices.

Exercise 16.20 *. Is the VC dimension of the class of all ellipsoids of $\mathbb{R}^d$ the same as that of the class of all closed balls of $\mathbb{R}^d$?

Exercise 16.21. Find the VC dimension of the class of all simplices in $\mathbb{R}^d$, where a simplex is a set of the form $C = \{x : x = \sum_{i=1}^{d+1} p_i x_i, \, p_i \geq 0, \, \sum_{i=1}^{d+1} p_i = 1\}$.

Exercise 16.22 (VC Dimension Under Translation). Suppose $\mathcal{F}$ is a given class of real-valued functions, and $g$ is some other fixed function. Let $\mathcal{G}$ be the family of functions $f + g$ with $f \in \mathcal{F}$. Show that the class of subgraphs of the functions in $\mathcal{F}$ and the class of subgraphs of the functions in $\mathcal{G}$ have the same VC dimension.

Exercise 16.23 * (Huge Classes with Low VC Dimension). Show that it is possi-
ble for a class of sets of integers to have uncountably many sets, and yet to have a
VC dimension of just one.

Exercise 16.24 * (A Concept Similar to VC Dimension). Let $\mathcal{C}$ be a class of sets and define $D(\mathcal{C}) = \inf\{\alpha > 0 : \sup_{n \geq 1} \frac{S(n, \mathcal{C})}{n^\alpha} < \infty\}$.
Show that
(a) $D(\mathcal{C}) \leq VC(\mathcal{C})$.
(b) $VC(\mathcal{C}) < \infty$ if and only if $D(\mathcal{C}) < \infty$.

Exercise 16.25. Give a proof of the one-dimensional Glivenko–Cantelli theorem by using part (b) of Theorem 16.11.

Exercise 16.26 *. Design a test for testing that sample observations in $\mathbb{R}^2$ are iid from a uniform distribution in the unit square by using suitable VC classes and applying the CLT for empirical measures.
Exercise 16.27. Let $P$ be a distribution on $\mathbb{R}^d$, and $P_n$ the empirical measure. Let $\mathcal{C}$ be a VC class. Show that $E[\sup_{C \in \mathcal{C}} |P_n(C) - P(C)|] = O\left(\sqrt{\frac{\log n}{n}}\right)$.

Hint: Look at Theorem 16.10.
Hint: Look at Theorem 16.8.


Exercise 16.28 * (Distinct Values in a Sample). Suppose $P$ is a countably supported distribution on the real line and $X_1, X_2, \ldots$ are iid with common distribution $P$. Show that
\[ \frac{E[\mathrm{Card}(X_1, X_2, \ldots, X_n)]}{n} \to 0, \]
where, as usual, $\mathrm{Card}(X_1, X_2, \ldots, X_n)$ denotes the number of distinct values among $X_1, X_2, \ldots, X_n$.
Exercise 16.29 * (Farthest Nearest Neighbor). Suppose $X_1, X_2, \ldots$ are iid random variables from a continuous CDF $F$ on the real line. For any $j, 1 \leq j \leq n$, let $X_{j,n,NN}$ denote the nearest neighbor of $X_j$ among the rest of the observations $X_1, X_2, \ldots, X_n$. Let $d_{j,n} = |X_j - X_{j,n,NN}|$ and $M_n = \max_j d_{j,n}$. Show that $nM_n \xrightarrow{a.s.} \infty$.
Exercise 16.30 * (Orlicz Norms). For a nondecreasing convex function $\psi$ on $[0, \infty)$ with $\psi(0) = 0$, the Orlicz norm of a real-valued random variable $X$ is $\|X\|_\psi = \inf\{c > 0 : E[\psi(\frac{|X|}{c})] \leq 1\}$, and $\|X\|_\psi = \infty$ if no $c$ exists for which $E[\psi(\frac{|X|}{c})] \leq 1$.

(a) Show that the $L_p$ norm of $X$, namely $\|X\|_p = (E[|X|^p])^{1/p}$, is an Orlicz norm for all $p \geq 1$.
(b) Consider $\psi_p(x) = e^{x^p} - 1$ for $p \geq 1$. Show that $\|X\|_p \leq p!\,\|X\|_{\psi_1}$.
(c) Show that $\|X\|_2 \leq \|X\|_{\psi_2}$.
(d) Obtain the tail bound $P(|X| \geq x) \leq \frac{1}{\psi(x/\|X\|_\psi)}$ for a general such $\psi$.
(e) Show that for any constant $\alpha \geq 1$, $\|X\|_{\alpha\psi} \leq \alpha\|X\|_\psi$.
(f) Show that the infimum in the definition of an Orlicz norm is attained when the Orlicz norm is finite.
Exercise 16.31 * (Orlicz Norms and Growth of the Maximum). Let $X_1, X_2, \ldots, X_n$ be any $n$ random variables and $\psi$ an increasing convex function with $\psi(0) = 0$. Suppose also that $\limsup_{x,y \to \infty} \frac{\psi(x)\psi(y)}{\psi(cxy)} < \infty$ for some finite positive $c$.
(a) Then there exists a universal constant $K$ depending only on $\psi$ such that
\[ \Big\|\max_{1 \leq i \leq n} X_i\Big\|_\psi \leq K\psi^{-1}(n)\max_{1 \leq i \leq n} \|X_i\|_\psi. \]
(b) Use this inequality with the functions $\psi_p(x) = e^{x^p} - 1, p \geq 1$, for iid $N(0, 1)$ random variables $X_1, X_2, \ldots, X_n$.

Hint: See van der Vaart and Wellner (1996, p. 96).

Exercise 16.32 (Maximum of Correlated Normals). Suppose $X_1, X_2, \ldots, X_n$ are jointly normally distributed, with zero means, and $\mathrm{Var}(X_i) = \sigma_i^2$. Show that
\[ E\max_{1 \leq i \leq n} |X_i| \leq 2\sqrt{2\log n}\,\max_{1 \leq i \leq n} \sigma_i. \]

Exercise 16.33 (Metric Entropy and Packing Numbers). Suppose $T$ is a bounded set in a Euclidean space $\mathbb{R}^d$. Show that the metric entropy and the $\epsilon$-capacity satisfy the inequality $D(2\epsilon, T, \|\cdot\|_2) \leq N(\epsilon, T, \|\cdot\|_2) \leq D(\epsilon, T, \|\cdot\|_2)$.

Exercise 16.34 (Metric Entropy). Suppose $T$ is a compact subset of $[0, 1]$. Show that
(a) If $T$ has positive Lebesgue measure, then $N(\epsilon, T, \|\cdot\|)$ is of the exact order $\frac{1}{\epsilon}$, where the norm $\|\cdot\|$ is the usual distance between two reals.
(b) If $T$ has zero Lebesgue measure, then $N(\epsilon, T, \|\cdot\|) = o\left(\frac{1}{\epsilon}\right)$.

Exercise 16.35 * (Packing Number of Spheres). Let $B$ be a ball of radius $r$ in $\mathbb{R}^d$. Show that $D(\epsilon, B, \|\cdot\|_2) \leq \left(\frac{3r}{\epsilon}\right)^d$.

Hint: See p. 94 in van der Vaart and Wellner (1996).

Exercise 16.36. Prove the following generalization of the binomial tail inequality in Example 16.7: suppose $W_1, W_2, \ldots, W_n$ are independent Bernoulli variables with parameters $p_1, p_2, \ldots, p_n$. Then, for any $k$, $P\left(\sum_{i=1}^n W_i \geq k\right) \leq \left(\frac{en\bar{p}}{k}\right)^k$, where $\bar{p} = \frac{p_1 + p_2 + \cdots + p_n}{n}$.

Exercise 16.37 (Exceedance Probability for Poisson Processes). Suppose $X(t), t \geq 0$ is a general Poisson process with intensity function $\lambda(t)$ and mean measure $\mu$. By using Theorem 16.17, derive an upper bound on $P(\sup_{0 \leq t \leq T} |X(t) - \mu([0, t])| \geq \alpha T)$, where $T, \alpha$ are positive constants. State your assumptions on the intensity function $\lambda(t)$.

Exercise 16.38. By using Theorem 16.18, find an upper bound on $E[\sup_{0 \leq t \leq T} |\alpha_n(t)|]$ that is valid for every $n$ and each $T < 1$.

References

Alexander, K. (1987). The central limit theorem for empirical processes on Vapnik-Chervonenkis
classes, Ann. Prob., 15, 178–203.
Beran, R. and Millar, P. (1986). Confidence sets for a multinomial distribution, Ann. Statist., 14,
431–443.

Billingsley, P. (1968). Convergence of Probability Measures, John Wiley, New York.


Chibisov, D. (1964). Some theorems on the limiting behavior of an empirical distribution function,
Theor. Prob. Appl. 71, 104–112.
Csáki, E. (1980). On the standardized empirical distribution function, Nonparametric Statist. Infer.
I, Colloq. Math. Soc. János Bolyai, 32, North-Holland, Amsterdam. 123–138.
Csörgő, M. (1983). Quantile Processes with Statistical Applications, CBMS-NSF Regional Conf. Ser., 42, SIAM, Philadelphia.
Csörgő, M. (2002). A glimpse of the impact of Pál Erdős on probability and statistics, Canad. J. Statist., 30, 4, 493–556.
Csörgő, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics, Academic Press, New York.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
del Barrio, E., Deheuvels, P., and van de Geer, S. (2007). Lectures on Empirical Processes, Euro-
pean Mathematical Society, Zurich.
Donsker, M. (1952). Justification and extension to Doob’s heuristic approach to Kolmogorov-
Smirnov theorems, Ann. Math. Statist., 23, 277–281.
Dudley, R. (1978). Central limit theorems for empirical measures, Ann. Prob., 6, 899–929.
Dudley, R. (1979). Central limit theorems for empirical measures, Ann. Prob., 7, 5, 909–911.
Dudley, R. (1984). A Course on Empirical Processes, Springer-Verlag, New York.
Dudley, R. (1999). Uniform Central Limit Theorems, Cambridge University Press, Cambridge,UK.
Eicker, F. (1979). The asymptotic distribution of the supremum of the standardized empirical pro-
cess, Ann. Statist., 7, 116-138.
Einmahl, J. and Mason, D. M. (1985). Bounds for weighted multivariate empirical distribution
functions, Z. Wahr. Verw. Gebiete, 70, 563–571.
Giné, E. (1996). Empirical processes and applications: An overview, Bernoulli, 2, 1, 1–28.
Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes: With discussion, Ann.
Prob., 12, 4, 929–998.
Jaeschke, D. (1979). The asymptotic distribution of the supremum of the standardized empirical
distribution function over subintervals, Ann. Statist., 7, 108–115.
Komlós, J., Major, P., and Tusnády, G. (1975a). An approximation of partial sums of independent rvs and the sample df: I, Zeit. für Wahr. Verw. Geb., 32, 111–131.
Komlós, J., Major, P., and Tusnády, G. (1975b). An approximation of partial sums of independent rvs and the sample df: II, Zeit. für Wahr. Verw. Geb., 34, 33–58.
Kosorok, M. (2008). Introduction to Empirical Processes and Semiparametric Inference, Springer,
New York.
Mason, D. and van Zwet, W.R. (1987). A refinement of the KMT inequality for the uniform em-
pirical process, Ann. Prob., 15, 871–884.
O’Reilly, N. (1974). On the weak convergence of empirical processes in sup-norm metrics, Ann.
Prob., 2, 642–651.
Pollard, D. (1989). Asymptotics via empirical processes, Statist. Sci., 4, 4, 341–366.
Sauer, N. (1972). On the density of families of sets, J. Comb. Theory, Ser A, 13, 145–147.
Shorack, G. and Wellner, J. (2009). Empirical Processes, with Applications to Statistics, SIAM,
Philadelphia.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes, Springer-
Verlag, New York.
Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of
events to their probabilities, Theory Prob. Appl., 16, 264–280.
Wellner, J. (1992). Empirical processes in action: A review, Int. Statist. Rev., 60, 3, 247–269.
Chapter 17
Large Deviations

The mean $\mu$ of a random variable $X$ is arguably the most common one-number summary of the distribution of $X$. Although averaging is a primitive concept with some natural appeal, the mean $\mu$ is a useful summary only when the random variable $X$ is concentrated around the mean $\mu$, that is, probabilities of large deviations from the mean are small. The most basic large deviation inequality is Chebyshev's inequality, which says that if $X$ has a finite variance $\sigma^2$, then $P(|X - \mu| > k\sigma) \leq \frac{1}{k^2}$. But, usually, this inequality is not strong enough in specific applications, in the sense that the assurance we seek is much stronger than what Chebyshev's inequality will give us. The theory of large deviations is a massive and powerful mathematical enterprise that gives bounds, usually of an exponential nature, on probabilities of the form $P(f(X_1, X_2, \ldots, X_n) - E[f(X_1, X_2, \ldots, X_n)] > t)$, where $X_1, X_2, \ldots, X_n$ is some set of $n$ random variables, not necessarily independent, $f(X_1, X_2, \ldots, X_n)$ is a suitable function of them, and $t$ is a given positive number. We expect that the probability $P(f(X_1, X_2, \ldots, X_n) - E[f(X_1, X_2, \ldots, X_n)] > t)$ is small for large $n$ if we know from some result that $f(X_1, X_2, \ldots, X_n) - E[f(X_1, X_2, \ldots, X_n)] \xrightarrow{P} 0$; the theory of large deviations attempts to quantify how small this probability is.

The basic Chernoff–Bernstein inequality (see Theorem 1.34) is generally regarded to be the first and most fundamental large deviation inequality. The Chernoff–Bernstein inequality is for $f(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n X_i$ in the one-dimensional iid case, assuming the existence of an mgf for the common distribution of the $X_i$. Since then, the theory of large deviations has grown by leaps and bounds in every possible manner. The theory covers dependent sequences, higher and even infinite dimensions, nonlinear functions $f$, much more abstract spaces than Euclidean spaces, and in certain cases the existence of an mgf is no longer assumed. Numerous excellent book-length treatments of the theory of large deviations are now available. We can only give a flavor of this elegant theory in a few selected cases. It should be emphasized that large deviations is one area of probability where beautiful mathematics has smoothly merged with outstanding concrete applications over a diverse set of problems and areas. It is now seeing applications in emerging areas of contemporary statistics, such as multiple testing in high-dimensional situations. The importance of large deviations is likely to increase even more than what it already is.


Among numerous excellent books on large deviations, we recommend Stroock


(1984), Varadhan (1984), Dembo and Zeitouni (1993), den Hollander (2000), and
Bucklew (2004). There is considerable overlap between the modern theory of con-
centration inequalities and the area of large deviations. Relevant references include
Devroye et al. (1996), McDiarmid (1998), Lugosi (2004), Ledoux (2004), and
Dubhashi and Panconesi (2009). Shao (1997), Hall and Wang (2004), and Dembo
and Shao (2006), among many others, treat the modern problem of large devia-
tions using self-normalization, applicable when an mgf or even sufficiently many
moments fail to exist.

17.1 Large Deviations for Sample Means

Although large deviation theory has been worked out for statistics that are far more
general than sample means, and without requiring that the underlying sequence
X1 ; X2 ; : : : be iid or even independent, for historical importance we start with the
case of sample means of iid random variables in one dimension.

17.1.1 The Cramér–Chernoff Theorem in R

We start with the basic Cramér–Chernoff theorem. This result may be regarded as
the starting point of large deviation theory.
Theorem 17.1. Suppose $X_1, X_2, \ldots$ are iid zero mean random variables with an mgf $\psi(z) = E(e^{zX_1})$, assumed to exist for all real $z$. Let $k(z) = \log \psi(z)$ be the cumulant generating function of $X_1$. Then, for fixed $t > 0$,
\[ \lim_n \frac{1}{n}\log P(\bar{X} > t) = -I(t) = \inf_{z \in \mathbb{R}}(k(z) - tz) = -\sup_{z \in \mathbb{R}}(tz - k(z)). \]

The function I.t/ is called the rate function corresponding to F , the common
distribution of the Xi . Because we assume the existence of a mean (in fact, the
existence of an mgf), by the WLLN X converges in probability to zero. Therefore,
we already know that P .X > t/ is small for large enough n. According to the
Cramér–Chernoff theorem, limn n1 log P .X > t/ D I.t/, and so for large n; n1 log
P .X > t/ I.t/. Therefore, as a first approximation, P .X > t/ e nI.t / . In
other words, assuming the existence of an mgf, P .X > t/ converges to zero at an
exponential rate, and I.t/ exactly characterizes that exponential rate. This justifies
the name rate function for I.t/.
Actually, it would be natural to consider P .X > t/ itself, rather than its log-
arithm. But the quantity P .XN > t/ is a sequence of the form cn .t/e nI.t / , for
some suitable sequence cn .t/, which does not converge to zero at an exponential
rate. Pinning down the exact asymptotics of the cn sequence is a difficult problem;
17.1 Large Deviations for Sample Means 561

see Exercise 17.13. If we instead look at n1 log P .X > t/, then I.t/ becomes the
dominant term, and analysis of the sequence cn can be avoided.
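Since $I(t) = \sup_z(tz - k(z))$ is a one-dimensional concave maximization, it is straightforward to compute numerically for any cumulant generating function. A sketch (ours, for illustration; Python with SciPy, using the $N(0, 1)$ case where $I(t) = t^2/2$ is known in closed form):

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    def rate(t, cgf):
        # I(t) = sup_z [t z - k(z)] = - inf_z [k(z) - t z]
        return -minimize_scalar(lambda z: cgf(z) - t * z).fun

    k_normal = lambda z: z**2 / 2.0  # k(z) for N(0,1)
    for t in [0.5, 1.0, 2.0]:
        print(t, rate(t, k_normal), t**2 / 2.0)  # numeric vs closed form

    # first approximation e^{-n I(t)} versus the exact probability
    n, t = 50, 0.5
    print(norm.sf(t * np.sqrt(n)), np.exp(-n * rate(t, k_normal)))

The last line illustrates the point just made: $e^{-nI(t)}$ captures the exponential rate but not the subexponential factor $c_n(t)$, so the two numbers agree only on a logarithmic scale.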
Proof of the Cramér–Chernoff theorem. For $n \geq 1$, let $S_n = \sum_{i=1}^n X_i$. First note that we may, by a simple translation argument, take $t$ to be zero, and correspondingly, the mean of $X_1$ to be $< 0$. This reduces the theorem to showing that $\lim_n \frac{1}{n}\log P(\bar{X} \geq 0) = -I(0)$. Because $\bar{X} \geq 0$ if and only if $S_n \geq 0$, we need to show that $\lim_n \frac{1}{n}\log P(S_n \geq 0) = -I(0)$.

If the common CDF $F$ of the $X_i$ is supported on $(-\infty, 0]$, that is, if $P(X_1 \leq 0) = 1$, then the theorem falls out easily. This case is left as an exercise. We therefore consider the case where $P(X_1 < 0) > 0$, $P(X_1 > 0) > 0$, $\mu = E(X_1) < 0$.

We now observe a few important facts that are used in the rest of the proof.

(a) $\psi(0) = 1$; $\psi'(0) = \mu < 0$; $\psi''(z) > 0$ for all $z \in \mathbb{R}$,

implying that $\psi$ is strictly convex. Let $\tau$ be the unique minimizer of $\psi$ and let $\rho = \psi(\tau) = \inf_{z \in \mathbb{R}} \psi(z)$. Note that $\tau > 0$, and $\psi'(\tau) = \int xe^{\tau x}dF(x) = 0$.

(b) $-I(0) = \inf_{z \in \mathbb{R}} \log \psi(z) = \log \inf_{z \in \mathbb{R}} \psi(z) = \log \rho \Rightarrow I(0) = -\log \rho$.

Therefore, we have to prove that
\[ \lim_n \frac{1}{n}\log P(S_n \geq 0) = \log \rho. \]
This consists of showing that $\limsup_n \frac{1}{n}\log P(S_n \geq 0) \leq \log \rho$ and $\liminf_n \frac{1}{n}\log P(S_n \geq 0) \geq \log \rho$. Of these, the first inequality is nearly immediate. Indeed, for any $z > 0$,
\[ P(S_n \geq 0) = P(zS_n \geq 0) = P(e^{zS_n} \geq 1) \leq E[e^{zS_n}] = [\psi(z)]^n \]
(by Markov's inequality)
\[ \Rightarrow \frac{1}{n}\log P(S_n \geq 0) \leq \inf_{z > 0} \log \psi(z) = \log \inf_{z > 0} \psi(z) = \log \inf_{z \in \mathbb{R}} \psi(z) = \log \rho, \]
giving the first inequality $\limsup_n \frac{1}{n}\log P(S_n \geq 0) \leq \log \rho$.

The second inequality is more involved, and uses a common technique in large deviation theory known as exponential tilting. Starting from the CDF $F$ of the $X_i$, define a new CDF
\[ \tilde{F}(x) = \frac{\int_{(-\infty, x]} e^{\tau y}dF(y)}{\rho}. \]
Note that $d\tilde{F}(x) = \frac{e^{\tau x}}{\rho}dF(x) \Rightarrow dF(x) = \rho e^{-\tau x}d\tilde{F}(x)$. Let $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$ be an iid sample from $\tilde{F}$. We have
\[ E(\tilde{X}_i) = \int x\, d\tilde{F}(x) = \frac{\int xe^{\tau x}dF(x)}{\rho} = \frac{\psi'(\tau)}{\rho} = 0. \]
Also, the $\tilde{X}_i$ have a finite variance. With all these, for any given $n \geq 1$,
\[ P(S_n \geq 0) = \int_{(x_1, x_2, \ldots, x_n):\, x_1 + x_2 + \cdots + x_n \geq 0} dF(x_1)dF(x_2)\cdots dF(x_n) \]
\[ = \rho^n \int_{(x_1, x_2, \ldots, x_n):\, x_1 + x_2 + \cdots + x_n \geq 0} e^{-\tau(x_1 + x_2 + \cdots + x_n)}\, d\tilde{F}(x_1)d\tilde{F}(x_2)\cdots d\tilde{F}(x_n) \]
\[ = \rho^n E\Big[e^{-\tau\tilde{S}_n}I_{\tilde{S}_n \geq 0}\Big], \]
where $\tilde{S}_n = \tilde{X}_1 + \tilde{X}_2 + \cdots + \tilde{X}_n$. Therefore,
\[ \frac{1}{n}\log P(S_n \geq 0) = \log \rho + \frac{1}{n}\log E\Big[e^{-\tau\tilde{S}_n}I_{\tilde{S}_n \geq 0}\Big]. \]
By the CLT, $\frac{\tilde{S}_n}{\sqrt{n}}$ is approximately a mean zero normal random variable. This implies (with just a little manipulation) that
\[ E\Big[e^{-\tau\tilde{S}_n}I_{\tilde{S}_n \geq 0}\Big] \]
is bounded below by $ce^{-a\sqrt{n}}$ (with a lot of room to spare) for suitable $a, c > 0$, and so, by taking logarithms,
\[ \liminf_n \frac{1}{n}\log E\Big[e^{-\tau\tilde{S}_n}I_{\tilde{S}_n \geq 0}\Big] \geq \liminf_n \frac{\log c - a\sqrt{n}}{n} = 0, \]
and this gives us the second inequality $\liminf_n \frac{1}{n}\log P(S_n \geq 0) \geq \log \rho$.

What is the practical value of a large deviation result? The practical value is that, in the absence of a large deviation result, we approximate a probability of the type $P(\bar{X} > t)$ by a CLT approximation, as that would be almost the only other approximation that we can think of. Indeed, it is common practice to use the CLT approximation in applied work. However, for fixed $t$, the CLT approximation is not going to give an accurate approximation to the true value of $P(\bar{X} > t)$. The CLT is supposed to be applied for those $t$ that are typical values for $\bar{X}$, that is, $t$ of the order of $\frac{1}{\sqrt{n}}$, but not for fixed $t$. The exponential tilting technique brilliantly reduces the problem to the case of a typical value, but for a new sequence $\tilde{X}_1, \tilde{X}_2, \ldots$. Thus, in comparison, an application of the Cramér–Chernoff theorem is going to produce a more accurate approximation than a straight CLT approximation. Whether it really is more accurate in a given case depends on the value of $t$. See Groeneboom and Oosterhoff (1977, 1980, 1981) for extensive finite sample numerics on the comparative accuracy of the CLT and the large deviation approximations to $P(\bar{X} > t)$.

We now work out some examples of the rate function $I(t)$ for some common choices of $F$.
Example 17.1 (Rate Function for Normal). Let $X_1, X_2, \ldots$ be iid $N(0, \sigma^2)$. Then $\psi(z) = e^{z^2\sigma^2/2}$ and $k(z) = \log \psi(z) = z^2\sigma^2/2$. Therefore, $k(z) - tz = z^2\sigma^2/2 - tz$ is a strictly convex quadratic in $z$ with a unique minimum at $z_0 = \frac{t}{\sigma^2}$, and a minimum value of $z_0^2\sigma^2/2 - tz_0 = -\frac{t^2}{2\sigma^2}$. Therefore, in this case, the rate function equals $I(t) = \frac{t^2}{2\sigma^2}$. This can be used to form the first approximation $P(\bar{X} > t) \approx e^{-nt^2/(2\sigma^2)}$, or, equivalently, $P\left(\frac{\bar{X}}{\sigma} > t\right) \approx e^{-nt^2/2}$.

Note that in this example, the rate function $I(t)$ turns out to be convex and strictly positive unless $t = 0$. It is also smooth. These turn out to be true in general, as we later show.

Example 17.2 (Rate Function in Bernoulli Case). Here is an example where the rate function $I(t)$ is related to the Kullback–Leibler distance between two suitable Bernoulli distributions (see Chapter 15 for the definition and properties of the Kullback–Leibler distance). Suppose $X_1, X_2, \ldots$ is an iid sequence of Bernoulli variables with parameter $p$. Then the mgf $\psi(z) = pe^z + q$, where $q = 1-p$. In the Bernoulli case, the question of deriving the rate function is interesting only for $0 < t < 1$, because $P(\bar{X} > t) = 1$ if $t < 0$ and $P(\bar{X} > t) = 0$ if $t \ge 1$. Also, if $t = 0$, then, trivially, $P(\bar{X} > t) = 1 - (1-p)^n$.
For $0 < t < 1$,
\[
\frac{d}{dz}\left[tz - \log\psi(z)\right] = t - \frac{pe^z}{pe^z + q} = \frac{qt - pe^z(1-t)}{pe^z + q},
\]
and,
\[
\frac{d^2}{dz^2}\left[tz - \log\psi(z)\right] = -\frac{pqe^z}{(pe^z + q)^2} < 0 \quad \forall\, z.
\]
Therefore, $tz - \log\psi(z)$ has a unique maximum at $z$ given by
\[
qt = pe^z(1-t) \Longleftrightarrow z = \log\frac{1-p}{p} + \log\frac{t}{1-t},
\]
and the rate function equals
\[
I(t) = \sup_{z\in\mathbb{R}}\left[tz - \log\psi(z)\right] = t\left[\log\frac{1-p}{p} + \log\frac{t}{1-t}\right] - \log\left(\frac{qt}{1-t} + q\right)
\]
\[
= t\log\frac{t}{p} - t\log\frac{1-t}{1-p} + \log\frac{1-t}{1-p} = t\log\frac{t}{p} + (1-t)\log\frac{1-t}{1-p}.
\]
Interestingly, this exactly equals the Kullback–Leibler distance $K(P, Q)$ between the distributions $P, Q$ with $P$ as a Bernoulli distribution with parameter $t$ and $Q$ a Bernoulli distribution with parameter $p$. Indeed,
\[
K(P, Q) = \sum_{x=0}^{1} t^x(1-t)^{1-x}\log\frac{t^x(1-t)^{1-x}}{p^x(1-p)^{1-x}} = E_P\left[X\log\frac{t}{p} + (1-X)\log\frac{1-t}{1-p}\right]
\]
\[
= t\log\frac{t}{p} + (1-t)\log\frac{1-t}{1-p}.
\]
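The comparative accuracy discussed above can be seen directly in this case, because the exact tail probability is a binomial sum. The following short sketch (ours, not from the text; it assumes Python with SciPy) compares the exact value of $P(\bar{X} \ge t)$ with the straight CLT value and the first-order large deviation approximation $e^{-nI(t)}$.

    import numpy as np
    from scipy.stats import binom, norm

    p, t = 0.3, 0.5
    I_t = t * np.log(t / p) + (1 - t) * np.log((1 - t) / (1 - p))
    for n in [20, 100, 500]:
        exact = binom.sf(int(np.ceil(n * t)) - 1, n, p)   # P(sum X_i >= n*t)
        clt = norm.sf((t - p) * np.sqrt(n / (p * (1 - p))))
        print(n, exact, clt, np.exp(-n * I_t), -np.log(exact) / n, I_t)

The last two columns show $-\frac{1}{n}\log P(\bar{X}\ge t)$ settling down to $I(t)$, while the relative error of the CLT value grows rapidly with $n$, in line with the Groeneboom–Oosterhoff numerics cited above.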
Example 17.3 (The Cauchy Case). The Cramér–Chernoff theorem requires the existence of the mgf for the underlying $F$. Suppose $X_1, X_2, \ldots$ are iid $C(0,1)$. In this case, $\bar{X}$ is also distributed as $C(0,1)$ for all $n$ (see Chapter 8), and therefore, $\lim_n \frac{1}{n}\log P(\bar{X} > t) = \lim_n \frac{1}{n}\log P(X_1 > t) = 0$. As regards the mgf $\psi(z)$, $\psi(0) = 1$ and at any $z \ne 0$, $\psi(z) = \infty$. Therefore, formally, $\sup_z[tz - \log\psi(z)] = 0$. That is, formally, the rate function $I(t) = 0$, which gives the correct answer for $\lim_n \frac{1}{n}\log P(\bar{X} > t)$.

17.1.2 Properties of the Rate Function

The rate function $I(t)$ satisfies a number of general shape and smoothness properties. We need a few definitions and some notation to describe these properties. Given a CDF $F$, we let $\mu$ be its mean, and let
\[
D_\psi = \{z \in \mathbb{R} : \psi(z) = E_F(e^{zX}) < \infty\}; \qquad D_I = \{t \in \mathbb{R} : I(t) < \infty\}.
\]

Definition 17.1. A level set of a real-valued function $f$ is a set of the form $\{t \in \mathbb{R} : f(t) \le c\}$, for some constant $c$.

Definition 17.2. A real-valued function $f$ is called lower semicontinuous if all its level sets are closed, or equivalently, for any $t$, and any sequence $t_n$ converging to $t$, $\liminf_n f(t_n) \ge f(t)$.

Definition 17.3. A real-valued function $f$ is called real analytic on an open set $C$ if it is infinitely differentiable on $C$ and can be expanded around any $t_0 \in C$ as a convergent Taylor series $f(t) = \sum_{n=0}^{\infty} a_n(t - t_0)^n$ for all $t$ in a neighborhood of $t_0$.

Theorem 17.2. The rate function $I(t)$ satisfies the following properties:
(a) $I(t) \ge 0$ for all $t$ and equals zero only when $t = \mu$.
(b) $I(t)$ is convex on $\mathbb{R}$. Moreover, it is strictly convex on the interior of $D_I$.
(c) $I(t)$ is lower semicontinuous on $\mathbb{R}$.
(d) $I(t)$ is a real analytic function on the interior of $D_I$.
(e) If $z$ solves $k'(z) = t$, then $I(t) = tz - k(z)$.

Proof. Part (a), the convexity part of part (b), and part (c) follow from simple facts in real analysis (such as the supremum of a collection of continuous functions is lower semicontinuous, and the supremum of a collection of linear functions is convex). Part (e) follows from the properties of strict convexity and differentiability, by simply setting the first derivative of $tz - k(z)$ to be zero. Part (d) is a deep property. See p. 121 in den Hollander (2000). $\square$

Example 17.4 (Error Probabilities of Likelihood Ratio Tests). Suppose $X_1, X_2, \ldots$ are iid observations from a distribution $P$ on some space $\mathcal{X}$ and suppose that we want to test a null hypothesis $H_0: P = P_0$ against an alternative hypothesis $H_1: P = P_1$. We assume, for simplicity, that $P_0, P_1$ have densities $f_0, f_1$, respectively. The likelihood ratio based on $X_1, X_2, \ldots, X_n$ is defined to be
\[
\Lambda_n = \frac{\prod_{i=1}^n f_1(X_i)}{\prod_{i=1}^n f_0(X_i)} = \prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}.
\]
A likelihood ratio test is a test that rejects $H_0$ when $\Lambda_n > \lambda_n$ for some sequence of numbers $\lambda_n$. The type I error probability of such a test is defined to be the probability of false rejection of $H_0$, namely,
\[
\alpha_n = P_0(\Lambda_n > \lambda_n),
\]
and the type II error probability is defined to be the probability of false acceptance of $H_0$, namely,
\[
\beta_n = P_1(\Lambda_n \le \lambda_n).
\]
Again, for simplicity, we take $\lambda_n = \lambda^n$, where $\lambda$ is a constant, and denote $\Delta = \log\lambda$. The purpose of this example is to show that the two error probabilities of the likelihood ratio test converge to zero at an exponential rate, when the number of observations $n \to \infty$, and that the rates of these exponential convergences can be paraphrased in large deviation terms. We denote $\log\frac{f_1(X_i)}{f_0(X_i)} = Y_i$ below. Note that the mgf of $Y_1$ under $H_0$ is
\[
\psi_0(z) = E_{P_0}\left[e^{zY_1}\right] = \int e^{z\log\frac{f_1(x)}{f_0(x)}}f_0(x)\,dx = \int f_1^z f_0^{1-z}\,dx.
\]
Then,
\[
\alpha_n = P_0(\Lambda_n > \lambda^n) = P_0(\log\Lambda_n > \log\lambda^n) = P_0\left(\sum_{i=1}^n \log\frac{f_1(X_i)}{f_0(X_i)} > n\Delta\right) = P_0\left(\sum_{i=1}^n Y_i > n\Delta\right) = P_0(\bar{Y} > \Delta).
\]
It is already clear from this that by virtue of the Cramér–Chernoff theorem, $\alpha_n$ converges to zero exponentially. More specifically, suppose that we choose $\Delta = K(P_1, P_0)$, the Kullback–Leibler distance between $P_1$ and $P_0$. The motivation for choosing $\Delta = K(P_1, P_0)$ would be that $E_{P_1}(\bar{Y}) = E_{P_1}(Y_1) = \int \log\frac{f_1(x)}{f_0(x)}f_1(x)\,dx$, which is $K(P_1, P_0)$; that is, we accept $P_1$ to be the true $P$ if $\bar{Y}$ exceeds its expected value under $P_1$. In that case,
\[
z\Delta - \log\psi_0(z) = zK(P_1, P_0) - \log\int f_1^z f_0^{1-z}\,dx.
\]
By direct differentiation,
\[
\frac{d}{dz}\left[zK(P_1, P_0) - \log\int f_1^z f_0^{1-z}\,dx\right] = K(P_1, P_0) - \frac{\int \log\frac{f_1}{f_0}\,f_1^z f_0^{1-z}\,dx}{\int f_1^z f_0^{1-z}\,dx},
\]
which equals zero at $z = 1$, which is therefore the maximum of $z\Delta - \log\psi_0(z)$, and the maximum value is $\Delta - \log\psi_0(1) = \Delta - \log 1 = \Delta$, giving eventually,
\[
\frac{1}{n}\log\alpha_n \to -\Delta = -K(P_1, P_0).
\]
So, as a first approximation, we may write $\alpha_n \approx e^{-nK(P_1, P_0)}$; the larger the distance between $P_0$ and $P_1$, the smaller is the chance that the test will make a type I error. The same analysis can also be done for the type II error probability $\beta_n$.
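For a concrete feel for this exponential decay, consider testing $H_0: N(0,1)$ against $H_1: N(\theta, 1)$, where $K(P_1, P_0) = \theta^2/2$ and $\bar{Y}$ is exactly normal under $H_0$, so $\alpha_n$ has a closed form. The sketch below (our own illustration, not from the text; it assumes Python with SciPy) shows $-\frac{1}{n}\log\alpha_n$ approaching $K(P_1, P_0)$.

    import numpy as np
    from scipy.stats import norm

    theta = 1.0
    K = theta**2 / 2      # K(P1, P0) for N(theta, 1) versus N(0, 1)
    for n in [10, 100, 1000, 10000]:
        # Under H0, Ybar ~ N(-theta^2/2, theta^2/n), so
        # alpha_n = P0(Ybar > K) = P(Z > theta * sqrt(n)).
        log_alpha_n = norm.logsf(theta * np.sqrt(n))
        print(n, -log_alpha_n / n)       # tends to K = 0.5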

17.1.3 Cramér's Theorem for General Sets

The basic Cramér–Chernoff theorem in $\mathbb{R}$ deals with probabilities of the form $P(\bar{X} \in C)$, where $C$ is a set of the form $(t, \infty)$. The rate of exponential convergence of this probability to zero is determined by $I(t)$. In other words, the rate of exponential convergence is determined solely by the value of the rate function $I(\cdot)$ at the boundary point $t$ of our set $C = (t, \infty)$. The following generalization of the basic Cramér–Chernoff inequality says that this fundamental phenomenon is true in far greater generality. A large deviation probability is a probability of a collection of rare events. The generalization says that to the first order, the magnitude of the large deviation probability is determined by the least rare of the collection of rare events.

Theorem 17.3. Suppose $X_1, X_2, \ldots$ are iid zero mean random variables, with an mgf $\psi(z) = E[e^{zX_1}]$, assumed to exist for all real $z$. Let $F, C$ be general (measurable) closed and open sets in $\mathbb{R}$, respectively, and let $I(t) = \sup_{z\in\mathbb{R}}[tz - k(z)]$, where $k(z) = \log\psi(z)$. For any given set $S$, denote $I(S) = \inf_{t\in S} I(t)$. Then,
(a) $\limsup_n \frac{1}{n}\log P(\bar{X} \in F) \le -I(F)$;
(b) $\liminf_n \frac{1}{n}\log P(\bar{X} \in C) \ge -I(C)$.

See Dembo and Zeitouni (1998, p. 27) for a proof of this theorem. When we compare the basic Cramér–Chernoff theorem with this more general theorem, we see that in this greater generality, we can no longer assert the existence of a limit for $\frac{1}{n}\log P(\bar{X} \in F)$ (or for $\frac{1}{n}\log P(\bar{X} \in C)$). We can assert the existence of a limit for $\frac{1}{n}\log P(\bar{X} \in S)$ for a general set $S$, provided $\inf_{t\in S^0} I(t) = \inf_{t\in\bar{S}} I(t)$, where $S^0, \bar{S}$ denote the interior and the closure of $S$. If this holds, then $\frac{1}{n}\log P(\bar{X} \in S)$ has a limit, and
\[
\lim_n \frac{1}{n}\log P(\bar{X} \in S) = -\inf_{t\in S^0} I(t) = -\inf_{t\in\bar{S}} I(t) = -\inf_{t\in S} I(t) = -I(S).
\]
This property is aesthetically desirable, but we can ensure it only when $I(t)$ does not have discontinuities on the boundary of the set $S$.
For example, consider singleton sets $S = \{t\}$, for fixed $t \in \mathbb{R}$. Now $P(\bar{X} \in S) = P(\bar{X} = t)$ is actually zero if the summands $X_i$ have a continuous distribution (i.e., if the $X_i$ have a density). If we insist on the equality of $\lim_n \frac{1}{n}\log P(\bar{X} \in S)$ and $-I(S)$, this would force $I(t)$ to be identically equal to $\infty$. But if $I(t) \equiv \infty$, then the equality $\lim_n \frac{1}{n}\log P(\bar{X} \in S) = -I(S)$ will fail at other subsets $S$ of $\mathbb{R}$. This explains why the general Cramér–Chernoff theorem is in the form of lower and upper bounds, rather than an exact equality.

17.2 The Gärtner–Ellis Theorem and Markov Chain Large Deviations

For sample means of iid random variables, the mgf is determined from the mgf of the underlying distribution itself. And that underlying mgf determines the large deviation rate function $I(t)$. When we give up the iid assumption, there is no longer one underlying mgf that determines the final rate function, even if there are meaningful large deviation asymptotics in the new problem. Rather, one has a sequence of mgfs, one for each $n$, corresponding to whatever statistics $T_n$ we want to consider; for example, $T_n$ could still be a sequence of sample means, but when the $X_i$ have some dependence among themselves. Or, $T_n$ could be a more complicated statistic with some nonlinear structure.
It turns out that despite these complexities, under some conditions a large deviation rate can be established without imposing the restriction of an iid setup, or requiring that $T_n$ be the sequence of sample means. This greatly expands the scope of applications of large deviation theory, but in exchange for considerably more subtlety in exactly what assumptions are needed for which result to hold. The Gärtner–Ellis theorem, a special case of which we present below, is regarded as a major advance in large deviation theory, due to its wide-reaching applications. Although the assumptions (namely, (a)–(d) below) can fail in simple-looking problems, the Gärtner–Ellis theorem has also been successfully used to find large deviation rates in important non-iid setups, for example, for functionals of the form $T_n = \sum_{i=1}^n f(X_i)$, where $X_1, X_2, \ldots$ forms a Markov chain; see the examples below.

Theorem 17.4. Let $\{T_n\}$ be a sequence of random variables taking values in a Euclidean space $\mathbb{R}^d$. Let $M_n(z) = E[e^{z'T_n}]$ and $k_n(z) = \log M_n(z)$, defined for $z$ at which $M_n(z)$ exists. Let $\Gamma(z) = \lim_{n\to\infty}\frac{1}{n}k_n(nz)$, assuming that the limit exists as an extended real-valued function.
Assume the following conditions on the function $\Gamma$.
(a) $D_\Gamma = \{z : \Gamma(z) < \infty\}$ has a nonempty interior.
(b) $\Gamma$ is differentiable in the interior of $D_\Gamma$.
(c) If $\eta_n$ is a sequence in the interior of $D_\Gamma$ approaching a point $\eta$ on the boundary of $D_\Gamma$, then the length of the gradient vector of $\Gamma$ at $\eta_n$ converges to $\infty$; that is, $\|\nabla\Gamma(\eta_n)\| \to \infty$ for any sequence $\eta_n \to \eta$, a point on the boundary of $D_\Gamma$.
(d) The origin belongs to the interior of $D_\Gamma$.
Let $I(t) = \sup_{z\in\mathbb{R}^d}[t'z - \Gamma(z)]$. Then, for general closed and open sets $F, C$ in $\mathbb{R}^d$, respectively,
(a) $\limsup_n \frac{1}{n}\log P(T_n \in F) \le -\inf_{t\in F} I(t)$;
(b) $\liminf_n \frac{1}{n}\log P(T_n \in C) \ge -\inf_{t\in C} I(t)$.
Without requiring any of the assumptions (a)–(d) above, for compact sets $S$,
(c) $\limsup_n \frac{1}{n}\log P(T_n \in S) \le -\inf_{t\in S} I(t)$.
A simple sufficient condition for both parts (a) and (b) of the theorem to hold is that $\Gamma(z) = \lim_{n\to\infty}\frac{1}{n}k_n(nz)$ exists, is finite, and is also differentiable at every $z \in \mathbb{R}^d$. We refer to Dembo and Zeitouni (1998, pp. 44–51) for a proof.

Example 17.5 (Multivariate Normal Mean). Suppose $X_1, X_2, \ldots$ are iid $N_d(0, \Sigma)$ random vectors, and assume that $\Sigma$ is positive definite. Let $T_n$ be the sample mean vector $\frac{1}{n}\sum_{i=1}^n X_i$. Thus, $T_n$ still has the sample mean structure in an iid setup, except it is not a real-valued random variable, but a vector. We use the Gärtner–Ellis theorem to derive the rate function for $T_n$.
Using the formula for the mgf of a multivariate normal distribution (see Chapter 5),
\[
M_n(z) = E\left[e^{z'\bar{X}}\right] = e^{\frac{z'\Sigma z}{2n}} \Rightarrow k_n(z) = \frac{z'\Sigma z}{2n} \Rightarrow \frac{1}{n}k_n(nz) = \frac{z'\Sigma z}{2},
\]
and therefore $\Gamma(z) = \lim_n \frac{1}{n}k_n(nz) = \frac{z'\Sigma z}{2}$. Therefore, $I(t) = \sup_{z\in\mathbb{R}^d}\left[t'z - \frac{z'\Sigma z}{2}\right] = \frac{t'\Sigma^{-1}t}{2}$, because $t'z - \frac{z'\Sigma z}{2}$ is a strictly concave function of $z$ due to the positive definiteness of $\Sigma$, and the unique maximum occurs at $z = \Sigma^{-1}t$. Note that $I(t)$ is smooth everywhere in $\mathbb{R}^d$.
Suppose now $S$ is a set separated from the origin. Then, by the Gärtner–Ellis theorem, $-\frac{1}{n}\log P(\bar{X} \in S)$ is going to be approximately equal to the minimum of $I(t) = \frac{t'\Sigma^{-1}t}{2}$ as $t$ ranges over $S$. Now the contours of the function $\frac{t'\Sigma^{-1}t}{2}$ are ellipsoids centered at the origin. So, to get the limiting value of $\frac{1}{n}\log P(\bar{X} \in S)$, we keep drawing ellipsoids centered at the origin and with orientation determined by $\Sigma$, until for the first time the ellipsoid is just large enough to touch the set $S$. The point where the ellipsoid touches $S$ will determine the large deviation rate. This is a very elegant geometric connection to the probability problem at hand.
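The "first touching ellipsoid" is easy to compute numerically for simple sets. The sketch below is our own illustration, with an assumed covariance matrix and the half-plane $S = \{t : t_1 \ge 1\}$; it assumes Python with SciPy, and minimizes $I(t) = t'\Sigma^{-1}t/2$ over $S$.

    import numpy as np
    from scipy.optimize import minimize

    Sigma = np.array([[1.0, 0.8],
                      [0.8, 2.0]])
    Sinv = np.linalg.inv(Sigma)
    I = lambda t: 0.5 * t @ Sinv @ t            # the rate function
    res = minimize(I, x0=np.array([1.0, 0.0]),  # minimize over S = {t1 >= 1}
                   constraints=[{"type": "ineq", "fun": lambda t: t[0] - 1.0}])
    print(res.x, res.fun)   # the touching point and the rate inf_S I(t)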

Example 17.6 (Sample Mean of a Markov Chain). Let $X_1, X_2, \ldots$ be a stationary Markov chain with the finite state space $S = \{1, 2, \ldots, t\}$. We assume that the chain is irreducible (see Chapter 10), and denote the initial state to be $i$. Also let $p_{ij}$ denote the one-step transition probabilities $p_{ij} = P(X_{n+1} = j\,|\,X_n = i)$. Although just irreducibility is enough, we assume in this example that $p_{ij}$ is strictly positive for all $i, j$. We apply the Gärtner–Ellis theorem to derive the large deviation rate for the sequence of sample means $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
Given a real $z$, define a matrix $\Pi_z$ with entries $\Pi_{z,i,j} = p_{ij}e^{zj}$. Note that the entries of $\Pi_z$ are all strictly positive. We now calculate $k_n(nz)$. We have,
\[
k_n(nz) = \log E\left[e^{z\sum_{k=1}^n X_k}\right] = \log\sum_{x_1, x_2, \ldots, x_n} e^{z(x_1 + x_2 + \cdots + x_n)}P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)
\]
\[
= \log\sum_{x_1, x_2, \ldots, x_n} e^{zx_1}p_{i,x_1}e^{zx_2}p_{x_1,x_2}\cdots e^{zx_n}p_{x_{n-1},x_n}
\]
(recall that the notation $i$ stands for the initial state of the chain)
\[
= \log\sum_{x_n}\left[\Pi_z^n\right]_{i,x_n},
\]
where $\Pi_z^n$ denotes the $n$th power of the matrix $\Pi_z$ and $[\Pi_z^n]_{i,x_n}$ means the $(i, x_n)$ element of it. This formula for $k_n(nz)$ leads to the large deviation rate for the sample mean of our Markov chain.
By part (e) of the Perron–Frobenius theorem (see Chapter 10), we get
\[
\lim_n (1/n)k_n(nz) = \lim_n \frac{1}{n}\log\sum_{x_n}\left[\Pi_z^n\right]_{i,x_n} = \log\lambda_1(z),
\]
where $\lambda_1(z)$ is the largest eigenvalue of $\Pi_z$ (see part (a) of the Perron–Frobenius theorem). Therefore, $\Gamma(z) = \lim_n \frac{1}{n}k_n(nz)$ exists for every $z$, and it is differentiable by the implicit function theorem (i.e., basically because $\lambda_1(z)$ is the maximum root of the determinant of $\Pi_z - \lambda I$, and the determinant itself is a smooth function of $z$).
It now follows from the Gärtner–Ellis theorem that the conclusions of both parts (a) and (b) of the theorem hold, with the large deviation rate function being $I(t) = \sup_z[tz - \log\lambda_1(z)]$.
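Everything in this example is computable: the Perron root of the tilted matrix $\Pi_z$ is a routine eigenvalue calculation, and the supremum over $z$ can be taken numerically. The sketch below is our own illustration for an assumed two-state chain (it assumes Python with NumPy and SciPy) and evaluates $I(t) = \sup_z[tz - \log\lambda_1(z)]$.

    import numpy as np
    from scipy.optimize import minimize_scalar

    P = np.array([[0.7, 0.3],
                  [0.4, 0.6]])                  # assumed chain on states {1, 2}
    states = np.array([1.0, 2.0])

    def log_perron(z):
        Pi_z = P * np.exp(z * states)           # Pi_z[i, j] = p_ij * e^{z j}
        return np.log(np.max(np.abs(np.linalg.eigvals(Pi_z))))

    def rate(t):
        res = minimize_scalar(lambda z: -(t * z - log_perron(z)),
                              bounds=(-30, 30), method="bounded")
        return -res.fun

    for t in [1.2, 1.5, 1.8]:
        print(t, rate(t))    # vanishes only at the stationary mean 10/7

For this chain the stationary distribution is $(4/7, 3/7)$, so the rate function is zero exactly at $t = 10/7$ and positive elsewhere, as the printed values suggest.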

17.3 The t-Statistic

Given $n$ random variables $X_1, X_2, \ldots, X_n$, the t-statistic is defined as $T_n = \frac{\sqrt{n}\,\bar{X}}{s}$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$ is the sample variance. If the sequence $X_1, X_2, \ldots$ is iid with zero mean and a finite variance, then $T_n \stackrel{\mathcal{L}}{\Longrightarrow} N(0,1)$ (see Chapter 7). The t-statistic is widely used in statistics in situations involving tests of hypotheses about a population mean. The effect of nonnormality on the t-statistic has been studied by several authors, among them, Efron (1969), Cressie (1980), Hall (1987), and Basu and DasGupta (1991). Logan et al. (1973) showed that if $X_1, X_2, \ldots$ are iid standard Cauchy, then $T_n$ still does converge in distribution, but to a bimodal and unbounded nonstandard density. It was conjectured there that the t-statistic converges in distribution if and only if the underlying CDF, say $F$, is in the domain of attraction of a stable law. This conjecture has turned out to be correct. Even more, it is now known that the t-statistic converges in distribution to a normal distribution if and only if the underlying CDF is in the domain of attraction of the normal; see Giné et al. (1997). For example, the t-statistic converges in distribution to $N(0,1)$ if $F$ itself is a t-distribution with two degrees of freedom, which does not have a finite variance. Hall and Wang (2004) characterize the speed of convergence of the t-statistic to the standard normal under this weakest possible assumption.
Large deviations for the t-statistic are of interest on a few grounds. One interesting fact is that the large deviation rate function for the t-statistic is not the same as the rate function for the sample mean, even if we have a conventional finite mgf situation. Second, in certain multiple testing problems of recent interest in statistics, large deviations of the t-statistic have become important. The results on the large deviations of the t-statistic are complex, and the proofs are long. We refer to Shao (1997) for the derivation of the main result on the rate function of the t-statistic. The most striking part of this result is that large deviation rates are derived without making any moment conditions at all, in contrast to the Cramér–Chernoff theorem for the sample mean, which assumes the existence of the mgf itself. The multivariate case, usually known as Hotelling's $T^2$ (see Chapter 5), is treated in Dembo and Shao (2006).
We use the following notation. Given an iid sequence of random variables $X_1, X_2, \ldots$ from a CDF $F$ on $\mathbb{R}$, we let, for $n \ge 2$,
\[
S_n = \sum_{i=1}^n X_i,\quad V_n^2 = \sum_{i=1}^n X_i^2,\quad s_n^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2,
\]
\[
Z_n = \frac{S_n}{V_n},\quad T_n = \frac{\sqrt{n}\,\bar{X}}{s_n}.
\]
Then, the self-normalized statistic $Z_n$ and the t-statistic $T_n$ are related by the identity
\[
T_n = Z_n\left(\frac{n-1}{n - Z_n^2}\right)^{1/2}.
\]
The function $z \mapsto \frac{z}{\sqrt{n - z^2}}$ is strictly increasing, as can be seen from its first derivative $\frac{n}{(n - z^2)^{3/2}}$, which is strictly positive on $(-\sqrt{n}, \sqrt{n})$. As a result,
\[
P(T_n > t) = P\left(\frac{Z_n}{\sqrt{n - Z_n^2}} > \frac{t}{\sqrt{n-1}}\right) = P\left(Z_n > \frac{t\sqrt{n}}{\sqrt{n - 1 + t^2}}\right).
\]
Therefore, the large deviation rate function of the t-statistic can be figured out from the large deviation rate function of the formally simpler statistic $Z_n$. This is the approach taken in Shao (1997); this convenient technique was also adopted in Efron (1969), and is generally a useful trick to remember in dealing with t-statistics.

Theorem 17.5. Let $X_1, X_2, \ldots$ be an iid sequence of real-valued random variables.
(a) Suppose $E(X_1)$ exists and equals zero. Then, for $x > 0$,
\[
\lim_n \frac{1}{n}\log P\left(\frac{T_n}{\sqrt{n}} \ge x\right) = \sup_{c\ge0}\inf_{t\ge0}\log E\left[e^{ctX_1 - \frac{xt(c^2 + X_1^2)}{2\sqrt{1+x^2}}}\right].
\]
(b) If $E(X_1)$ does not exist, then the result of part (a) still holds for all $x > 0$.
Part (a) actually holds with a minor modification if $E(X_1)$ exists and is nonnegative, but we do not show the modification here; see Shao (1997).

Example 17.7 (The Normal Case). Suppose $X_1, X_2, \ldots$ are iid $N(0,1)$. In this case, $E\left[e^{ctX_1 - \frac{xt(c^2 + X_1^2)}{2\sqrt{1+x^2}}}\right]$ exists for all $t \ge 0$, and on calculation and algebra, we get
\[
E\left[e^{ctX_1 - \frac{xt(c^2 + X_1^2)}{2\sqrt{1+x^2}}}\right] = \frac{(1+x^2)^{1/4}\exp\left(\frac{c^2t\left(t - x\sqrt{1+x^2}\right)}{2(1+x^2) + 2tx\sqrt{1+x^2}}\right)}{\sqrt{tx + \sqrt{1+x^2}}} = g(x, c, t)\ \text{(say)}.
\]
To find the limit of $\frac{1}{n}\log P\left(\frac{T_n}{\sqrt{n}} \ge x\right)$, we need to find $\sup_{c\ge0}\inf_{t\ge0}g(x, c, t)$. By taking the logarithm of $g(x, c, t)$, and differentiating it with respect to $t$, we find that the infimum over $t \ge 0$ occurs at the unique positive root of the following function of $t$:
\[
-(1+c^2)x - 2(1+c^2)x^3 - (1+c^2)x^5 + \sqrt{1+x^2}\left(2c^2 + (c^2-2)x^2 - (c^2+2)x^4\right)t
\]
\[
+ \left(3c^2x + (3c^2-1)x^3 - x^5\right)t^2 + c^2x^2\sqrt{1+x^2}\,t^3 = 0.
\]
It turns out that the required root is $t = x\sqrt{1+x^2}$. Plugging this back into the expression for $g(x, c, t)$, we find that $\sup_{c\ge0}\inf_{t\ge0}g(x, c, t) = \frac{1}{\sqrt{1+x^2}}$. That is,
\[
\lim_n \frac{1}{n}\log P\left(\frac{T_n}{\sqrt{n}} \ge x\right) = -\frac{1}{2}\log(1 + x^2).
\]
We notice that the rate function $\frac{1}{2}\log(1 + x^2)$ is smaller than $\frac{x^2}{2}$, which was the rate function for the sample mean in the standard normal case. This is due to the fact that the t-statistic is more dispersed than the mean, and the rate functions of the t-statistic and the mean are demonstrating that.
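Since $T_n$ has an exact t-distribution with $n - 1$ degrees of freedom under normality, this rate can be checked directly. The sketch below (ours, not from the text; it assumes Python with SciPy) compares $-\frac{1}{n}\log P\left(\frac{T_n}{\sqrt{n}} \ge x\right)$, computed from the exact t tail, with $\frac{1}{2}\log(1 + x^2)$.

    import numpy as np
    from scipy.stats import t as t_dist

    for n in [30, 100, 400]:
        for x in [0.25, 0.5]:
            exact = t_dist.sf(x * np.sqrt(n), df=n - 1)  # P(T_n/sqrt(n) >= x)
            print(n, x, -np.log(exact) / n, 0.5 * np.log(1 + x**2))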

17.4 Lipschitz Functions and Talagrand's Inequality

Suppose $Z \sim N(0,1)$ and $f: \mathbb{R} \to \mathbb{R}$ is a function with a uniformly bounded first derivative, namely $|f'(z)| \le M < \infty$ for all $z$. Then, by a first-order Taylor expansion,
\[
f(Z) \approx f(0) + Zf'(0) \Rightarrow |f(Z) - f(0)| \approx |Z||f'(0)| \le M|Z|.
\]
Therefore, given $\epsilon > 0$,
\[
P(|f(Z) - E[f(Z)]| > \epsilon) \approx P(|f(Z) - f(0)| > \epsilon) \le P(M|Z| > \epsilon) = P\left(|Z| > \frac{\epsilon}{M}\right) = 2\left[1 - \Phi\left(\frac{\epsilon}{M}\right)\right] \le e^{-\frac{\epsilon^2}{2M^2}},
\]
because $1 - \Phi(z) \le \frac{1}{2}e^{-z^2/2}$ for $z \ge 0$. These heuristics suggest that smooth functions of a normal random variable should be quite tightly concentrated around their mean, as long as they do not change by too much over small intervals. We present a major result to this effect. The result was originally proved in Borell (1975) and Sudakov and Cirelson (1974). The proof below is coined from Talagrand (1995), and uses the following lemma. We need the notion of outer parallel body for the proof, which is defined first.

Definition 17.4. Let $B \subseteq \mathbb{R}^d$, and let $\epsilon > 0$. Then the $\epsilon$-outer parallel body of $B$ is the set $B_\epsilon = \{x \in \mathbb{R}^d : \inf_{y\in B}\|x - y\| \le \epsilon\}$.

Also, let $\mathcal{H}$ denote the collection of all half-spaces in $\mathbb{R}^d$. The idea of the proof of the result in this section is that if a set $B$ has probability at least .5 under a standard $d$-dimensional normal distribution, then with a large probability the normal random vector will lie within the $\epsilon$-outer parallel body of $B$, and that the worst offenders to this rule are half-spaces. Here, the requirement that $B$ have a probability at least .5 is indispensable.
Lemma. Let $Z \sim N_d(0, I)$ and let $0 < \alpha < 1$. Then,
\[
\sup_{B: P(Z\in B) = \alpha} P(Z \notin B_\epsilon) = \sup_{H\in\mathcal{H}: P(Z\in H) = \alpha} P(Z \notin H_\epsilon).
\]

Theorem 17.6. Let $Z \sim N_d(0, I)$, and let $f(z)$ be a function with Lipschitz norm $M < \infty$; that is,
\[
\sup_{x,y}\frac{|f(x) - f(y)|}{\|x - y\|} = M < \infty.
\]
Let $\mu$ be either $E[f(Z)]$ or the median of $f(Z)$. Then, for any $\epsilon > 0$,
\[
P(|f(Z) - \mu| > \epsilon) \le e^{-\frac{\epsilon^2}{2M^2}}.
\]

Proof. We prove the theorem only for the case when $\mu$ is a median of $f(Z)$. Let $B = \{z : f(z) \le \mu\}$. Fix $\epsilon > 0$, and consider $y \in B_\epsilon$. Pick $z \in B$ such that $\|y - z\| \le \epsilon$. Then,
\[
f(y) = f(y) - f(z) + f(z) \le M\|y - z\| + f(z) \le M\epsilon + \mu
\]
\[
\Rightarrow P(Z \in B_\epsilon) \le P(f(Z) \le M\epsilon + \mu) \Rightarrow P(f(Z) > M\epsilon + \mu) \le P(Z \notin B_\epsilon)
\]
\[
\le \sup_{A: P(A) \ge \frac12} P(Z \notin A_\epsilon) = \sup_{H\in\mathcal{H}: P(Z\in H) \ge \frac12} P(Z \notin H_\epsilon)
\]
(by the lemma). This latest supremum can be proved to be equal to $1 - \Phi(\epsilon)$, leading to, for $x > 0$,
\[
P(f(Z) > M\epsilon + \mu) \le 1 - \Phi(\epsilon) \Rightarrow P(f(Z) - \mu > x) \le 1 - \Phi\left(\frac{x}{M}\right).
\]
Similarly,
\[
P(f(Z) - \mu < -x) \le 1 - \Phi\left(\frac{x}{M}\right).
\]
Adding the two inequalities together,
\[
P(|f(Z) - \mu| > x) \le 2\left[1 - \Phi\left(\frac{x}{M}\right)\right] \le e^{-\frac{x^2}{2M^2}}. \qquad\square
\]

Example 17.8 (Maximum of Correlated Normals). We apply the above theorem to find a concentration inequality for the maximum of $d$ jointly distributed normal random variables. Thus, let $X \sim N(0, \Sigma)$, where $\Sigma$ is positive definite, and let $\sigma_i^2 = \mathrm{Var}(X_i), i = 1, 2, \ldots, d$. Let $\Sigma = AA'$, where $A$ is nonsingular.
Consider now the function $f: \mathbb{R}^d \to \mathbb{R}$ defined by $f(u) = \max\{(Au)_1, (Au)_2, \ldots, (Au)_d\}$, where $(Au)_1$ means the first coordinate of the vector $Au$, and so on. Denote $M = \max_i \sigma_i$. We claim that $f$ is a Lipschitz function with Lipschitz norm bounded by $M$. To see this, for notational simplicity, consider the case when $A$ is diagonal; the general case can be similarly handled. Consider two vectors $u, w$. Then,
\[
(Au)_1 = a_{11}u_1 = a_{11}w_1 + a_{11}(u_1 - w_1) \le a_{11}w_1 + M\|u - w\| \le \max\{(Aw)_1, (Aw)_2, \ldots, (Aw)_d\} + M\|u - w\|.
\]
By the same argument, for each $i$, $(Au)_i \le \max\{(Aw)_1, (Aw)_2, \ldots, (Aw)_d\} + M\|u - w\|$, and therefore,
\[
f(u) = \max\{(Au)_1, (Au)_2, \ldots, (Au)_d\} \le \max\{(Aw)_1, (Aw)_2, \ldots, (Aw)_d\} + M\|u - w\| = f(w) + M\|u - w\|.
\]
By switching the roles of $u$ and $w$, $f(w) \le f(u) + M\|u - w\|$, and hence, $|f(u) - f(w)| \le M\|u - w\|$.
Next observe that
\[
\max\{X_1, X_2, \ldots, X_d\} \stackrel{\mathcal{L}}{=} \max\{(AZ)_1, (AZ)_2, \ldots, (AZ)_d\} = f(Z),
\]
where $Z \sim N(0, I)$. Therefore, by our above theorem,
\[
P(|\max\{X_1, X_2, \ldots, X_d\} - E[\max\{X_1, X_2, \ldots, X_d\}]| > x) = P(|f(Z) - E[f(Z)]| > x) \le e^{-\frac{x^2}{2M^2}},
\]
where $M = \max_i \sigma_i$. The striking parts of this are that although $X_1, X_2, \ldots, X_d$ are not assumed to be independent, we can prove an exponential concentration inequality by using only the coordinatewise variances, and that the inequality is valid for all $d$.
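The inequality is easy to probe by simulation. The sketch below is our own illustration for an assumed equicorrelated covariance matrix (it assumes Python with NumPy); it estimates the left side by Monte Carlo and compares it with $e^{-x^2/(2M^2)}$.

    import numpy as np

    rng = np.random.default_rng(0)
    d, rho, reps = 50, 0.5, 100_000
    Sigma = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)  # sigma_i = 1, so M = 1
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=reps)
    m = X.max(axis=1)
    for x in [0.5, 1.0, 1.5, 2.0]:
        print(x, np.mean(np.abs(m - m.mean()) > x), np.exp(-x**2 / 2))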

17.5 * Large Deviations in Continuous Time

Let $X(t), t \in T$, be a stochastic process in continuous time, where $T$ is some indexing set. Random variables of the type $\sup_{t\in T}X(t)$, or $\sup_{t\in T}|X(t)|$, and so on, are collectively known as extremes. Extreme statistics are widely used in assessing rarity of an event, or for disaster planning, and have become particularly important in recent years in genetics, finance, climate studies, environmental planning, and astronomy. Extremes for a discrete-time stochastic sequence were previously discussed in Chapters 6 and 9. The case of continuous time is treated in this section. For extreme statistics, we often want to know how rare was an observed value of the extreme of some stochastic sequence or process. We are then content with good upper bounds on probabilities of the form $P(\sup_{t\in T}|X(t)| > x)$. This has a resemblance to a large deviation probability. A basic treatment of this topic when $X(t)$ is a Gaussian process is provided in this section. As usual, we first need some definitions and notation.
Throughout we take $X(t)$ to be a zero mean one-dimensional Gaussian process; it is referred to as a centered Gaussian process. We also take the indexing set $T$ to be a set in a Euclidean space, usually the real line. The covariance kernel of $X(t)$ is denoted as $\sigma(s, t)$, and when $X(t)$ is stationary, we drop one of the arguments, and simply use the notation $\sigma(t)$. The covariance kernel induces a new metric on $T$, which we define below.

Definition 17.5. The canonical metric of $T$ induced by the process $X(t)$ is defined as $\triangle(s, t) = \sqrt{E(X(t) - X(s))^2},\ s, t \in T$.

It is meaningless to talk about $P(\sup_{t\in T}|X(t)| > x)$ unless $\sup_{t\in T}|X(t)|$ is finite with probability one. Thus, boundedness of $X(t)$ over $T$ is going to be a key consideration for this section. On the other hand, boundedness of $X(t)$ over $T$ will have close relations to continuity of the sample paths. These two properties are defined below.

Definition 17.6. Let $X(t), t \in T$ be a real-valued stochastic process. Let $t_1, t_2, \ldots, t_n \in T$. Then the joint distribution of $(X_{t_1}, X_{t_2}, \ldots, X_{t_n})$ is called a finite-dimensional distribution (fdd) of the process $X(t)$.

Definition 17.7. Let $X(t), Y(t), t \in T$ be two stochastic processes defined on $T$. They are called versions of each other if all of their finite-dimensional distributions are the same.

For the purpose of defining continuity of a process, we need to give a meaning to closeness of two different times $s, t$. For Gaussian processes, we can achieve this by simply using the canonical metric $\triangle(s, t)$. For more general processes, we achieve this by simply assuming that $T$ is a subset of a Euclidean space. For defining the notion of boundedness of a process, we do not require a definition of the closeness of times $s, t$. Thus, in defining boundedness, we can let the time set $T$ be more general.

Definition 17.8. Let $X(t), t \in T$ be a stochastic process, where $T$ is a subset of a Euclidean space $\mathbb{R}^d$. $X(t)$ is called continuous if there is a version of it, say $Y(t), t \in T$, such that with probability one, $Y(t)$ is continuous on $T$ as a function of $t$.

Definition 17.9. Let $X(t), t \in T$ be a stochastic process. $X(t)$ is called bounded if there is a version of it, say $Y(t), t \in T$, such that with probability one, $\sup_{t\in T}|Y(t)| < \infty$.

In order to avoid mentioning the availability of a version that is continuous or bounded, we simply say that $X(t)$ is continuous, or bounded, on some relevant set $T$. This is not mentioned again.

17.5.1 * Continuity of a Gaussian Process

Continuity of $X(t)$ is a key factor in our final goal of writing bounds on probabilities of the form $P(\sup_{t\in T}X(t) > x)$ or $P(\sup_{t\in T}|X(t)| > x)$. This is because continuity and boundedness are connected, and also because a smooth process does not jump around too much, which helps in keeping the supremum of the process in control. We limit our treatment to Gaussian processes with the time set $T$ a subset of a Euclidean space, and often just $[0,1]$. Conditions for the continuity of a Gaussian process have evolved from simple sufficient conditions to more complex necessary and sufficient conditions. Obviously, if the time set $T$ is compact, then continuity of the process implies its boundedness. If $T$ is not compact, this need not be true. A simple example is that of the Wiener process $X(t)$ on $T = [0, \infty)$. We know from Chapter 12 that $X(t)$ is continuous. However, $P(\sup_{t\ge0}X(t) = \infty, \inf_{t\ge0}X(t) = -\infty) = 1$, so that $X(t)$ is not bounded, and is in fact almost surely unbounded. We first give a theorem with a set of classic sufficient conditions for the continuity of a Gaussian process. See Adler (1987) or Fernique (1974) for a proof.

Theorem 17.7. (a) Suppose $X(t), t \in T = [0,1]$ is a centered Gaussian process satisfying $\triangle(s, t) \le \phi(|t - s|)$ for some function $\phi$ such that $\int_0^1 \phi(e^{-x^2})\,dx < \infty$. Then $X(t)$ is continuous on $T$.
(b) Suppose $X(t), t \in T$ is a centered Gaussian process on a compact set $T$ in a Euclidean space. Suppose for some $C, \alpha$, and $h > 0$,
\[
\triangle^2(s, t) = E[X(t) - X(s)]^2 \le \frac{C}{|\log\|s - t\|\,|^{1+\alpha}} \quad \forall\ s, t,\ \|s - t\| \le h.
\]
Then $X(t)$ is almost surely continuous on $T$.
(c) Suppose we additionally assume that $X(t)$ is stationary on $T$. Suppose for some $A, \beta, \delta > 0$,
\[
\sigma(0) - \sigma(t) \ge \frac{A}{|\log\|t\|\,|^{1-\beta}} \quad \forall\ t,\ \|t\| \le \delta.
\]
Then $X(t)$ is almost surely discontinuous on $T$.

Part (c) of the theorem says that the sufficient condition in part (b) is at the borderline of necessary and sufficient if the Gaussian process is stationary.

Example 17.9 (The Wiener Process). We prove by using the theorem above that paths of a Wiener process are continuous. If we can show that a Wiener process is continuous on $[0,1]$, it follows that it is continuous on all of $[0,\infty)$. For the Wiener process, $\triangle^2(s, t) = E[X(t) - X(s)]^2 = t + s - 2\min(s, t) = |t - s|$. Choose $\phi(u) = \sqrt{u}$. Then, we have $\triangle(s, t) \le \phi(|t - s|)$. Furthermore, $\int_0^1 \phi(e^{-x^2})\,dx = \int_0^1 e^{-x^2/2}\,dx < \infty$. Therefore, by part (a) of the above theorem, the Wiener process is continuous on $[0,\infty)$. Next, as regards part (b) of the theorem, because $x(\log x)^2 \to 0$ as $x \downarrow 0$, the Wiener process also satisfies the sufficient condition in part (b) with, for example, $\alpha = 1, C = 1$, and $h = \frac12$, and part (b) again shows that the Wiener process is continuous on $[0,1]$, and hence, on $[0,\infty)$.

Example 17.10 (Logarithmic Tail of the Maxima and the Landau–Shepp Theorem). This example illustrates a famous theorem of Landau and Shepp (1970). The Landau–Shepp theorem says that if $X(t)$ is a centered and almost surely bounded Gaussian process, then the tail of its supremum acts like the tail of a suitable single normal random variable. Precisely, let $X(t)$ be a centered and bounded Gaussian process on some set $T$. For instance, if $T$ is a compact interval in $\mathbb{R}$ and if $X(t)$ satisfies one of the two sufficient conditions in our theorem above, then the Landau–Shepp theorem applies. Let $\sigma_T^2 = \sup_{t\in T}\mathrm{Var}(X_t)$. The Landau–Shepp theorem says that $\lim_{u\to\infty}\frac{1}{u^2}\log P(M_T > u) = -\frac{1}{2\sigma_T^2}$, where $M_T = \sup_{t\in T}X(t)$.
Let us put this result in context. Take a single univariate normal random variable $X$ with mean zero and variance $\sigma^2$. Then, the distribution of $X$ satisfies the two-sided bounds
\[
\left(\frac{\sigma}{\sqrt{2\pi}\,u} - \frac{\sigma^3}{\sqrt{2\pi}\,u^3}\right)e^{-\frac{u^2}{2\sigma^2}} \le P(X > u) \le \frac{\sigma}{\sqrt{2\pi}\,u}e^{-\frac{u^2}{2\sigma^2}},
\]
for all $u > 0$. Taking logarithms, it follows on elementary manipulation that $\lim_{u\to\infty}\frac{1}{u^2}\log P(X > u) = -\frac{1}{2\sigma^2}$.
Suppose now that there exists $t_0 \in T$ such that $\mathrm{Var}(X_{t_0}) = \sigma_T^2$; that is, the maximum variance value is attained. Then, by using the Landau–Shepp theorem, we see that $M_T$ satisfies the same logarithmic tail rate as the single normal random variable $X_{t_0}$. That is, on a logarithmic scale, the tail of $M_T$ is the same as that of a normal variable with the maximal variance over the set $T$.
To complete this example, we recall from Chapter 12 that for the Wiener process on $[0, T]$ for any finite $T$, the exact distribution of $M_T = \sup_{0\le t\le T}X(t)$ is actually known, and $P(M_T > u) = 2\left[1 - \Phi\left(\frac{u}{\sqrt{T}}\right)\right], u > 0$. Therefore, for the special case of the Wiener process, the Landau–Shepp theorem follows directly from the exact distribution of $M_T$.
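Since the exact distribution of $M_T$ is available for the Wiener process, the Landau–Shepp limit can be watched numerically. The sketch below (ours, assuming Python with SciPy) computes $\frac{1}{u^2}\log P(M_T > u)$ from the exact formula and compares it with $-\frac{1}{2T}$.

    import numpy as np
    from scipy.stats import norm

    T = 1.0
    for u in [2.0, 5.0, 10.0, 20.0]:
        log_tail = np.log(2.0) + norm.logsf(u / np.sqrt(T))  # log P(M_T > u)
        print(u, log_tail / u**2, -1 / (2 * T))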

17.5.2 * Metric Entropy of T and Tail of the Supremum

The classic sufficient conditions for the continuity of a Gaussian process have evolved into conditions on the size of the time set $T$ that control continuity of the process and magnitude of the supremum at the same time. These conditions involve the metric entropy of the set $T$ with respect to the canonical metric of $T$. The smoother the covariance kernel of the process is near zero, the smaller will be the metric entropy of $T$, and the easier it is to control the supremum of the process over $T$. Roughly, if the covering numbers $N(\epsilon, T, \triangle)$ of $T$ grow at the rate of $\epsilon^{-\alpha}$ for some positive $\alpha$, then the Gaussian process will be continuous, and various results on the tail of the supremum of the process can be derived. These metric entropy conditions and their ties to the magnitude of the supremum of the process, as in the theorem below, are due to Dudley (1967) and Talagrand (1996). The reader should compare the results below with the main theorem in Section 16.5, where the $L^2$ norm of the supremum of a continuous Gaussian process is bounded by the $\epsilon$-capacity of $T$. We remark that stationarity of the process is not assumed in the theorem below.

Theorem 17.8. Let $X(t), t \in T$ be a real-valued centered Gaussian process. Let $N(\epsilon, T, \triangle)$ be the covering number of $T$ with respect to the canonical metric $\triangle$ of $T$. Assume:
(a) For each $\epsilon > 0$, $N(\epsilon, T, \triangle) < \infty$.
(b) The diameter of $T$ with respect to $\triangle$, that is, $L = \sup_{s,t\in T}\sqrt{E[X(s) - X(t)]^2}$, is finite.
(c) $\int_0^L \sqrt{\log N(\epsilon, T, \triangle)}\,d\epsilon < \infty$.
Then $X(t)$ is (almost surely) continuous and bounded on $T$ and
\[
E\left[\sup_{t\in T}X(t)\right] \le 12\int_0^\infty \sqrt{\log N(\epsilon, T, \triangle)}\,d\epsilon = 12\int_0^L \sqrt{\log N(\epsilon, T, \triangle)}\,d\epsilon.
\]
If $N(\epsilon, T, \triangle)$ satisfies $N(\epsilon, T, \triangle) \le K\epsilon^{-\alpha}$ for some finite $\alpha, K > 0$, then there exists a finite constant $C$ such that for any $\delta > 0$,
\[
P\left(\sup_{t\in T}X(t) > u\right) \le Cu^{\alpha+\delta}e^{-\frac{u^2}{2\sigma_T^2}},
\]
for all large $u$, where $\sigma_T^2 = \sup_{t\in T}\mathrm{Var}(X(t))$.


Example 17.11 (The Wiener Process). We examine various assumptions and con-
clusions of this theorem in the context of a Wiener process X.t/ on Œ0; 1. Note that
we know the exact distribution and the expectation of M D sup0t 1 X.t/, namely,
q
P .M > u/ D 2Œ1  ˆ.u/; u > 0, and E.M / D 2 D :799.
p
The canonical metric in this case is 4.s; t/ D js  tj. Thus, the diameter of
p time set T D Œ0; 1 with respect to 4 is L D 1. Next, because
the 4.s; t/ D
js  tj, one ball of 4-radius will cover an Euclidean length of 2 2 . Therefore,
if we consider the particular cover Œ0; 2 2 ; Œ2 2 ; 4 2 ; : : :, then we require at most
1
2 2
C 1 balls to cover the time set T D Œ0; 1. We therefore have

1
N. ; T; 4/ 1C
2 2
1
) log N. ; T; 4/ log 2 C 2 2 2 log C 2 2  log 2:
2
R :5 p
We can evaluate the integral 0 2 log C 2 2  log 2d . We can easily show
analytically that the integral is finite. Our theorem above then implies that X.t/ is
Exercises 579

continuous and bounded on Œ0; 1 (something that we already knew). Moreover, the
theorem also says that
Z :5 p Z :5 p
E.M / 12 log N. ; T; 4/d 12 2 log C 2 2  log 2d
0 0
D 12  :811 D 9:732;

whereas the exact value is E.M / D :799. So the bounds are not yet practically very
useful, but describe exactly how the covering numbers must behave for the tail of
the supremum to go down at a Gaussian rate.
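The numerical value quoted above is easy to reproduce. The sketch below (ours, assuming Python with SciPy) evaluates the entropy integral $\int_0^{.5}\sqrt{-2\log\epsilon + 2\epsilon^2 - \log 2}\,d\epsilon$; the integrand has only a mild, integrable singularity at zero, which quad handles.

    import numpy as np
    from scipy.integrate import quad

    integrand = lambda e: np.sqrt(-2 * np.log(e) + 2 * e**2 - np.log(2))
    val, err = quad(integrand, 0, 0.5)
    print(val, 12 * val)    # approximately .811 and 9.732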

Exercises

Exercise 17.1. Derive the rate function $I(t)$ for the sample mean in the exponential case.

Exercise 17.2. Derive the rate function $I(t)$ for the sample mean in the Poisson case.

Exercise 17.3. Derive the rate function $I(t)$ for the sample mean for the three-point distribution giving probability $\frac13$ to each of $0, \pm1$.
Note: You will need a formula for the derivative of the hyperbolic cosine function.

Exercise 17.4 (SLLN from Cramér–Chernoff Theorem). Prove, by using the Cramér–Chernoff theorem, the SLLN for the mean of an iid sequence under the conditions of the Cramér–Chernoff theorem.

Exercise 17.5 * (Rate Function for Uncommon Distributions). For each of the following densities, derive the rate function $I(t)$ for the sample mean.
(a) $f(x) = \frac{1}{\mathrm{Beta}(\lambda, \alpha)}e^{-\lambda x}(1 - e^{-x})^{\alpha-1},\ x, \lambda, \alpha > 0$.
(b) $f(x) = \frac{x^{\lambda-1}}{\zeta(\lambda)\Gamma(\lambda)(e^x - 1)},\ x > 0,\ \lambda > 1$, where $\zeta(\lambda) = \sum_{n=1}^{\infty}n^{-\lambda}$.
(c) $f(x) = \frac{e^{-\sigma^2x^2/2}}{2\pi(1 - \Phi(\sigma))e^{\sigma^2/2}(1 + x^2)},\ -\infty < x < \infty,\ \sigma > 0$.

Exercise 17.6 * (Multivariate Normal). Characterize the rate function $I(t)$ for the statistic $T_n = \|\bar{X}\|$, when $X_1, X_2, \ldots$ are iid $N_d(0, I)$.

Exercise 17.7 * (Uniform Distribution in a Ball). Characterize the rate function $I(t)$ for the statistic $T_n = \|\bar{X}\|$, when $X_1, X_2, \ldots$ are iid uniform in the $d$-dimensional unit ball.

Exercise 17.8 (Numerical Accuracy of Large Deviation Approximation). Let $W_n \sim \chi^2_n$. Do a straight CLT approximation for $P(W_n > (1+t)n)$, and do an approximation using the Cramér–Chernoff theorem.
Compare the numerical accuracy of these two approximations for $t = .5, 1, 2$ when $n = 30$.

Exercise 17.9. Prove that the rate function $I(t)$ for the sample mean of an iid sequence in the one-dimensional case cannot be zero at $t \ne \mu$, when the mgf exists for all $z$.

Exercise 17.10. Prove that the rate function $I(t)$ for the sample mean of an iid sequence satisfies $I''(\mu) = \frac{1}{\sigma^2}$.

Exercise 17.11 * (Type II Error Rate of the Likelihood Ratio Test). Find $\lim_n \frac{1}{n}\log\beta_n$ for the likelihood ratio test of Example 17.4.

Exercise 17.12 (Discontinuous Rate Function). Give an example of a distribution on $\mathbb{R}$ for which the rate function for the sample mean is not continuous.

Exercise 17.13 * (Second-Order Term in the Cramér–Chernoff Theorem). Suppose we consider $P(\bar{X} > t)$ itself, instead of its logarithm, in the one-dimensional iid case. Identify $c_n(t)$ in the representation $P(\bar{X} > t) = c_n(t)e^{-nI(t)}[1 + o(1)]$ when the $X_i$ are iid $N(0,1)$.

Exercise 17.14 * (Rate Function When the Population Is t). Suppose $X_1, X_2, \ldots$ are iid from a t-distribution with degrees of freedom $\alpha > 0$. Prove that whatever $\alpha$ is, $\lim_n \frac{1}{n}\log P(\bar{X} > t) = 0$ for $t > 0$.

Exercise 17.15 * (Adjusted Large Deviation Rate When the Population Is t). Suppose $X_1, X_2, \ldots$ are iid from a t-distribution with two degrees of freedom. In this case, $\bar{X}$ converges in probability to zero by the WLLN, but $P(\bar{X} > t)$ does not converge to zero exponentially. Find the exact rate at which $P(\bar{X} > t)$ converges to zero.

Exercise 17.16. Suppose $X_n \sim \mathrm{Bin}(n, p)$. Show that for any $a > 0$, $P(X_n < np - a) \le e^{-\frac{a^2}{2np}}$.
Hint: Use the technique to obtain the upper bound part in the Cramér–Chernoff theorem.

Exercise 17.17. Suppose $X \sim \mathrm{Poi}(\lambda)$. Show that for any $\epsilon > 0$, $P(X < (1 - \epsilon)\lambda) \le e^{-\frac{\epsilon^2\lambda}{2}}$.
Hint: Approximate a Poisson by a suitable binomial. Then use the binomial distribution inequality in the exercise above. Take a limit.

Exercise 17.18 * (Verification of Conditions in the Gärtner–Ellis Theorem). Suppose $X_i$ are iid $N(\mu, 1)$ and $Y$ is an independent standard exponential random variable. Consider the statistic $T_n = \bar{X} + \frac{Y}{\sqrt{n}}$.
(a) Is $\Gamma(z)$ finite and differentiable at every $z$?
(b) Which of the conditions (a)–(d) in the Gärtner–Ellis theorem hold?

Exercise 17.19 (Gärtner–Ellis Theorem). For the two-state Markov chain of Example 10.1, derive the rate function for the sample mean.

Exercise 17.20 * (Gärtner–Ellis Theorem). For the transition matrix
\[
P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ p & 1-p & 0 \end{pmatrix},
\]
(a) Is the Gärtner–Ellis theorem applicable?
(b) If it is, derive the rate function for the sample mean.
Hint: Verify if the chain is irreducible.

Exercise 17.21 (The t-statistic). Suppose $X_1, X_2, \ldots$ are iid $N(0,1)$. Approximate $P\left(\frac{T_n}{\sqrt{n}} > x\right)$ by using the normal approximation, and the large deviation approximation (see the rate function worked out in Example 17.7). Compare them to the exact value of $P\left(\frac{T_n}{\sqrt{n}} > x\right)$, which is obtainable by using a t table. Use $n = 30$ and $x = 0.1, 0.25, 0.5$.

Exercise 17.22 (Mean Absolute Deviation of a Lipschitz Function). Suppose $Z \sim N_d(0, I)$ and $f(Z)$ is a Lipschitz function. Use Theorem 17.6 to find an upper bound on $E[|f(Z) - E(f(Z))|]$ in terms of the Lipschitz norm of $f$.

Exercise 17.23 * (The Canonical Metric and the Euclidean Metric). Suppose $X(t)$ is a centered Gaussian process on a compact interval $T$ on $\mathbb{R}$. Show that $X(t)$ is continuous with respect to the canonical metric of $T$ if and only if it is continuous with respect to the Euclidean metric on $T$.

Exercise 17.24 (Increments of a Wiener Process). Let $W(t)$ be a Wiener process on $[0, \infty)$, and let $X(t) = W(t + 1) - W(t)$.
(a) Calculate the canonical metric on $T$.
(b) Without using the fact that $W(t)$ is continuous, prove the continuity of $X(t)$ by using the Fernique sufficient conditions in Theorem 17.7.
Hint: See Example 12.1.

Exercise 17.25 (Ornstein–Uhlenbeck Process). Consider the Ornstein–Uhlenbeck process $X(t) = \frac{\sigma}{\sqrt{\alpha}}e^{-\frac{\alpha t}{2}}W(e^{\alpha t}),\ t \ge 0$, where $W(t)$ is a Wiener process. Verify the Fernique conditions for continuity of the $X(t)$ process.

Exercise 17.26 * (Metric Entropy of the Ornstein–Uhlenbeck Process). Consider the Ornstein–Uhlenbeck process of the above exercise.
(a) Calculate the canonical metric for this process.
(b) Show that the metric entropy condition in part (c) of Theorem 17.8 holds for $0 \le t \le T$ for any $T < \infty$.
(c) Find a bound on $E[\sup_{0\le t\le T}X(t)]$ by using part (c) of Theorem 17.8. Evaluate the bound when $\sigma = \alpha = T = 1$.
References

Adler, R.J. (1987). An Introduction to Continuity, Extrema, and Related Topics, IMS Lecture Notes and Monograph Series, Hayward, CA.
Basu, S. and DasGupta, A. (1991). Robustness of standard confidence intervals for location parameters under departure from normality, Ann. Stat., 23, 1433–1442.
Borell, C. (1975). Convex set functions in d-space, Period. Math. Hungar., 6, 111–136.
Bucklew, J. (2004). Introduction to Rare Event Simulation, Springer, New York.
Cressie, N. (1980). Relaxing assumptions in the one sample t-test, Austr. J. Statist., 22, 143–153.
Dembo, A. and Shao, Q.M. (2006). Large and moderate deviations for Hotelling's $T^2$ statistic, Electron. Comm. Prob., 11, 149–159.
Dembo, A. and Zeitouni, O. (1998). Large Deviations, Techniques and Applications, Jones and Bartlett, Boston.
den Hollander, F. (2000). Large Deviations, AMS, Providence, RI.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, Springer, New York.
Dubhashi, D. and Panconesi, A. (2009). Concentration of Measure for the Analysis of Randomized Algorithms, Cambridge University Press, Cambridge, UK.
Dudley, R.M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes, J. Funct. Anal., 1, 290–330.
Efron, B. (1969). Student's t-test under symmetry conditions, J. Amer. Statist. Assoc., 64, 1278–1302.
Fernique, X. (1974). Des résultats nouveaux sur les processus gaussiens, C. R. Acad. Sci. Paris, Sér. A, 278, 363–365.
Giné, E., Götze, F., and Mason, D.M. (1997). When is the Student's t-statistic asymptotically standard normal?, Ann. Prob., 25, 1514–1531.
Groeneboom, P. and Oosterhoff, J. (1977). Bahadur efficiency and probabilities of large deviations, Statist. Neerlandica, 31, 1, 1–24.
Groeneboom, P. and Oosterhoff, J. (1980). Bahadur Efficiency and Small Sample Efficiency: A Numerical Study, Mathematisch Centrum, Amsterdam.
Groeneboom, P. and Oosterhoff, J. (1981). Bahadur efficiency and small sample efficiency, Internat. Statist. Rev., 49, 2, 127–141.
Hall, P. (1987). Edgeworth expansions for Student's t-statistic under minimal moment conditions, Ann. Prob., 15, 920–931.
Hall, P. and Wang, Q. (2004). Exact convergence rate and leading term in central limit theorem for Student's t-statistic, Ann. Prob., 32, 1419–1437.
Landau, H.J. and Shepp, L. (1970). On the supremum of a Gaussian process, Sankhyā, Ser. A, 32, 369–378.
Ledoux, M. (2004). Spectral gap, logarithmic Sobolev constant, and geometric bounds, Surveys in Differential Geometry, IX, 219–240, International Press, Somerville, MA.
Logan, B.F., Mallows, C.L., Rice, S.O., and Shepp, L. (1973). Limit distributions of self-normalized sums, Ann. Prob., 1, 788–809.
Lugosi, G. (2004). Concentration Inequalities, Preprint.
McDiarmid, C. (1998). Concentration, Prob. Methods for Algorithmic Discrete Math., 195–248, Algorithm. Combin., 16, Springer, Berlin.
Shao, Q.M. (1997). Self-normalized large deviations, Ann. Prob., 25, 285–328.
Stroock, D. (1984). An Introduction to the Theory of Large Deviations, Springer, New York.
Sudakov, V.N. and Cirelson, B.S. (1974). Extremal properties of half spaces for spherically invariant measures, Zap. Nauchn. Sem. Leningrad Otdel. Mat. Inst. Steklov, 41, 14–24.
Talagrand, M. (1995). Concentration of measures and isoperimetric inequalities, Inst. Hautes Études Sci. Publ. Math., 81, 73–205.
Talagrand, M. (1996). Majorizing measures: the generic chaining, Ann. Prob., 24, 1049–1103.
Varadhan, S.R.S. (1984). Large Deviations and Applications, SIAM, Philadelphia.
Chapter 18
The Exponential Family and Statistical Applications

The exponential family is a practically convenient and widely used unified family of distributions on finite-dimensional Euclidean spaces parametrized by a finite-dimensional parameter vector. Specialized to the case of the real line, the exponential family contains as special cases most of the standard discrete and continuous distributions that we use for practical modeling, such as the normal, Poisson, binomial, exponential, Gamma, multivariate normal, and so on. The reason for the special status of the exponential family is that a number of important and useful calculations in statistics can be done all at one stroke within the framework of the exponential family. This generality contributes to both convenience and larger-scale understanding. The exponential family is the usual testing ground for the large spectrum of results in parametric statistical theory that require notions of regularity or Cramér–Rao regularity. In addition, the unified calculations in the exponential family have an element of mathematical neatness. Distributions in the exponential family have been used in classical statistics for decades. However, the family has recently obtained additional importance due to its use and appeal in the machine learning community. A fundamental treatment of the general exponential family is provided in this chapter. Classic expositions are available in Barndorff-Nielsen (1978), Brown (1986), and Lehmann and Casella (1998). An excellent recent treatment is available in Bickel and Doksum (2006).

18.1 One-Parameter Exponential Family

Exponential families can have any finite number of parameters. For instance, as we show below, a normal distribution with a known mean is in the one-parameter exponential family, whereas a normal distribution with both parameters unknown is in the two-parameter exponential family. A bivariate normal distribution with all parameters unknown is in the five-parameter exponential family. As another example, if we take a normal distribution in which the mean and the variance are functionally related (e.g., the $N(\mu, \mu^2)$ distribution), then the distribution is neither in the one-parameter nor in the two-parameter exponential family, but in a family called a curved exponential family. We start with the one-parameter regular exponential family.


18.1.1 Definition and First Examples

We start with an illustrative example that brings out some of the most important properties of distributions in an exponential family.

Example 18.1 (Normal Distribution with a Known Mean). Suppose $X \sim N(0, \sigma^2)$. Then the density of $X$ is
\[
f(x\,|\,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{x^2}{2\sigma^2}}I_{x\in\mathbb{R}}.
\]
This density is parametrized by a single parameter $\sigma$. Writing
\[
\eta(\sigma) = -\frac{1}{2\sigma^2},\quad T(x) = x^2,\quad \psi(\sigma) = \log\sigma,\quad h(x) = \frac{1}{\sqrt{2\pi}}I_{x\in\mathbb{R}},
\]
we can represent the density in the form
\[
f(x\,|\,\sigma) = e^{\eta(\sigma)T(x) - \psi(\sigma)}h(x),
\]
for any $\sigma \in \mathbb{R}^+$.
Next, suppose that we have an iid sample $X_1, X_2, \ldots, X_n \sim N(0, \sigma^2)$. Then the joint density of $X_1, X_2, \ldots, X_n$ is
\[
f(x_1, x_2, \ldots, x_n\,|\,\sigma) = \frac{1}{\sigma^n(2\pi)^{n/2}}e^{-\frac{\sum_{i=1}^n x_i^2}{2\sigma^2}}I_{x_1, x_2, \ldots, x_n\in\mathbb{R}}.
\]
Now writing
\[
\eta(\sigma) = -\frac{1}{2\sigma^2},\quad T(x_1, x_2, \ldots, x_n) = \sum_{i=1}^n x_i^2,\quad \psi(\sigma) = n\log\sigma,
\]
and
\[
h(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}}I_{x_1, x_2, \ldots, x_n\in\mathbb{R}},
\]
once again we can represent the joint density in the same general form
\[
f(x_1, x_2, \ldots, x_n\,|\,\sigma) = e^{\eta(\sigma)T(x_1, x_2, \ldots, x_n) - \psi(\sigma)}h(x_1, x_2, \ldots, x_n).
\]
We notice that in this representation of the joint density $f(x_1, x_2, \ldots, x_n\,|\,\sigma)$, the statistic $T(X_1, X_2, \ldots, X_n)$ is still a one-dimensional statistic, namely, $T(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n X_i^2$. Using the fact that the sum of squares of $n$ independent standard normal variables is a chi-square variable with $n$ degrees of freedom, we have that the density of $T(X_1, X_2, \ldots, X_n)$ is
\[
f_T(t\,|\,\sigma) = \frac{e^{-\frac{t}{2\sigma^2}}t^{\frac{n}{2}-1}}{\sigma^n 2^{n/2}\Gamma(\frac{n}{2})}I_{t>0}.
\]
This time, writing
\[
\eta(\sigma) = -\frac{1}{2\sigma^2},\quad S(t) = t,\quad \psi(\sigma) = n\log\sigma,\quad h(t) = \frac{t^{\frac{n}{2}-1}}{2^{n/2}\Gamma(\frac{n}{2})}I_{t>0},
\]
once again we are able to write even the density of $T(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n X_i^2$ in that same general form
\[
f_T(t\,|\,\sigma) = e^{\eta(\sigma)S(t) - \psi(\sigma)}h(t).
\]
Clearly, something very interesting is going on. We started with a basic density in a specific form, namely, $f(x\,|\,\sigma) = e^{\eta(\sigma)T(x) - \psi(\sigma)}h(x)$, and then we found that the joint density and the density of the relevant one-dimensional statistic $\sum_{i=1}^n X_i^2$ in that joint density, are once again densities of exactly that same general form. It turns out that all of these phenomena are true of the entire family of densities that can be written in that general form, which is the one-parameter exponential family. Let us formally define it, and we then extend the definition to distributions with more than one parameter.

Definition 18.1. Let $X = (X_1, \ldots, X_d)$ be a $d$-dimensional random vector with a distribution $P_\theta, \theta \in \Theta \subseteq \mathbb{R}$.
Suppose $X_1, \ldots, X_d$ are jointly continuous. The family of distributions $\{P_\theta, \theta \in \Theta\}$ is said to belong to the one-parameter exponential family if the density of $X = (X_1, \ldots, X_d)$ may be represented in the form
\[
f(x\,|\,\theta) = e^{\eta(\theta)T(x) - \psi(\theta)}h(x),
\]
for some real-valued functions $T(x), \psi(\theta)$ and $h(x) \ge 0$.
If $X_1, \ldots, X_d$ are jointly discrete, then $\{P_\theta, \theta \in \Theta\}$ is said to belong to the one-parameter exponential family if the joint pmf $p(x\,|\,\theta) = P_\theta(X_1 = x_1, \ldots, X_d = x_d)$ may be written in the form
\[
p(x\,|\,\theta) = e^{\eta(\theta)T(x) - \psi(\theta)}h(x),
\]
for some real-valued functions $T(x), \psi(\theta)$ and $h(x) \ge 0$. Note that the functions $\eta, T$, and $h$ are not unique. For example, in the product $\eta T$, we can multiply $T$ by some constant $c$ and divide $\eta$ by it. Similarly, we can play with constants in the function $h$.

Definition 18.2. Suppose $X = (X_1, \ldots, X_d)$ has a distribution $P_\theta, \theta \in \Theta$, belonging to the one-parameter exponential family. Then the statistic $T(X)$ is called the natural sufficient statistic for the family $\{P_\theta\}$.

The notion of a sufficient statistic is a fundamental one in statistical theory and its applications. Sufficiency was introduced into the statistical literature by Sir Ronald A. Fisher (Fisher (1922)). Sufficiency attempts to formalize the notion of no loss of information. A sufficient statistic is supposed to contain by itself all of the information about the unknown parameters of the underlying distribution that the entire sample could have provided. In that sense, there is nothing to lose by restricting attention to just a sufficient statistic in one's inference process. However, the form of a sufficient statistic is very much dependent on the choice of a particular distribution $P_\theta$ for modeling the observable $X$. Still, reduction to sufficiency in widely used models usually makes just simple common sense. We come back to the issue of sufficiency once again later in this chapter.
We now show examples of a few more common distributions that belong to the one-parameter exponential family.

Example 18.2 (Binomial Distribution). Let $X \sim \mathrm{Bin}(n, p)$, with $n \ge 1$ considered as known, and $0 < p < 1$ a parameter. We represent the pmf of $X$ in the one-parameter exponential family form.
\[
f(x\,|\,p) = \binom{n}{x}p^x(1 - p)^{n-x}I_{\{x\in\{0,1,\ldots,n\}\}} = \binom{n}{x}\left(\frac{p}{1-p}\right)^x(1 - p)^n I_{\{x\in\{0,1,\ldots,n\}\}}
\]
\[
= \binom{n}{x}e^{x\log\frac{p}{1-p} + n\log(1-p)}I_{\{x\in\{0,1,\ldots,n\}\}}.
\]
Writing $\eta(p) = \log\frac{p}{1-p}$, $T(x) = x$, $\psi(p) = -n\log(1 - p)$, and $h(x) = \binom{n}{x}I_{\{x\in\{0,1,\ldots,n\}\}}$, we have represented the pmf $f(x\,|\,p)$ in the one-parameter exponential family form, as long as $p \in (0, 1)$. For $p = 0$ or $1$, the distribution becomes a one-point distribution. Consequently, the family of distributions $\{f(x\,|\,p), 0 < p < 1\}$ forms a one-parameter exponential family, but if either of the boundary values $p = 0, 1$ is included, the family is not in the exponential family.

Example 18.3 (Normal Distribution with a Known Variance). Suppose $X \sim N(\mu, \sigma^2)$, where $\sigma$ is considered known, and $\mu \in \mathbb{R}$ a parameter; for notational simplicity, we take $\sigma = 1$. Then,
\[
f(x\,|\,\mu) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2} - \frac{\mu^2}{2} + \mu x}I_{x\in\mathbb{R}},
\]
which can be written in the one-parameter exponential family form by writing $\eta(\mu) = \mu$, $T(x) = x$, $\psi(\mu) = \frac{\mu^2}{2}$, and $h(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}I_{x\in\mathbb{R}}$. So, the family of distributions $\{f(x\,|\,\mu), \mu \in \mathbb{R}\}$ forms a one-parameter exponential family.

Example 18.4 (Errors in Variables). Suppose $U, V, W$ are independent normal variables, with $U$ and $V$ being $N(\mu, 1)$ and $W$ being $N(0, 1)$. Let $X_1 = U + W$ and $X_2 = V + W$. In other words, a common error of measurement $W$ contaminates both $U$ and $V$.
Let $X = (X_1, X_2)$. Then $X$ has a bivariate normal distribution with means $\mu, \mu$, variances $2, 2$, and a correlation parameter $\rho = \frac12$. Thus, the density of $X$ is
\[
f(x\,|\,\mu) = \frac{1}{2\pi\sqrt{3}}e^{-\frac{(x_1-\mu)^2 + (x_2-\mu)^2 - (x_1-\mu)(x_2-\mu)}{3}}I_{x_1, x_2\in\mathbb{R}}
\]
\[
= \frac{1}{2\pi\sqrt{3}}e^{\left[\frac{\mu}{3}(x_1+x_2) - \frac{\mu^2}{3}\right]}e^{-\frac{x_1^2 + x_2^2 - x_1x_2}{3}}I_{x_1, x_2\in\mathbb{R}}.
\]
This is in the form of a one-parameter exponential family with the natural sufficient statistic $T(X) = T(X_1, X_2) = X_1 + X_2$.
Example 18.5 (Gamma Distribution). Suppose $X$ has the Gamma density
\[
\frac{e^{-\frac{x}{\lambda}}x^{\alpha-1}}{\lambda^\alpha\Gamma(\alpha)}I_{x>0}.
\]
As such, it has two parameters $\lambda, \alpha$. If we assume that $\alpha$ is known, then we may write the density in the one-parameter exponential family form:
\[
f(x\,|\,\lambda) = e^{-\frac{x}{\lambda} - \alpha\log\lambda}\frac{x^{\alpha-1}}{\Gamma(\alpha)}I_{x>0},
\]
and recognize it as a density in the exponential family with $\eta(\lambda) = -\frac{1}{\lambda}$, $T(x) = x$, $\psi(\lambda) = \alpha\log\lambda$, $h(x) = \frac{x^{\alpha-1}}{\Gamma(\alpha)}I_{x>0}$.
If we assume that $\lambda$ is known, once again, by writing the density as
\[
f(x\,|\,\alpha) = e^{\alpha\log x - \alpha(\log\lambda) - \log\Gamma(\alpha)}\frac{e^{-\frac{x}{\lambda}}}{x}I_{x>0},
\]
we recognize it as a density in the exponential family with $\eta(\alpha) = \alpha$, $T(x) = \log x$, $\psi(\alpha) = \alpha(\log\lambda) + \log\Gamma(\alpha)$, $h(x) = \frac{e^{-\frac{x}{\lambda}}}{x}I_{x>0}$.

Example 18.6 (An Unusual Gamma Distribution). Suppose we have a Gamma density in which the mean is known, say, $E(X) = 1$. This means that $\alpha\lambda = 1 \Rightarrow \lambda = 1/\alpha$. Parametrizing the density with $\alpha$, we have
\[
f(x\,|\,\alpha) = \frac{\alpha^\alpha}{\Gamma(\alpha)}e^{-\alpha x + \alpha\log x}\frac{1}{x}I_{x>0} = e^{\alpha[\log x - x] - [\log\Gamma(\alpha) - \alpha\log\alpha]}\frac{1}{x}I_{x>0},
\]
which is once again in the one-parameter exponential family form with $\eta(\alpha) = \alpha$, $T(x) = \log x - x$, $\psi(\alpha) = \log\Gamma(\alpha) - \alpha\log\alpha$, $h(x) = \frac{1}{x}I_{x>0}$.

Example 18.7 (A Normal Distribution Truncated to a Set). Suppose a certain random variable $W$ has a normal distribution with mean $\mu$ and variance one. We saw in Example 18.3 that this is in the one-parameter exponential family. Suppose now that the variable $W$ can be physically observed only when its value is inside some set $A$. For instance, if $W > 2$, then our measuring instruments cannot tell what the value of $W$ is. In such a case, the variable $X$ that is truly observed has a normal distribution truncated to the set $A$. For simplicity, take $A$ to be $A = [a, b]$, an interval. Then, the density of $X$ is
\[
f(x\,|\,\mu) = \frac{e^{-\frac{(x-\mu)^2}{2}}}{\sqrt{2\pi}\left[\Phi(b - \mu) - \Phi(a - \mu)\right]}I_{a\le x\le b}.
\]
This can be written as
\[
f(x\,|\,\mu) = \frac{1}{\sqrt{2\pi}}e^{\mu x - \frac{\mu^2}{2} - \log[\Phi(b-\mu) - \Phi(a-\mu)]}e^{-\frac{x^2}{2}}I_{a\le x\le b},
\]
and we recognize this to be in the exponential family form with $\eta(\mu) = \mu$, $T(x) = x$, $\psi(\mu) = \frac{\mu^2}{2} + \log[\Phi(b - \mu) - \Phi(a - \mu)]$, and $h(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}I_{a\le x\le b}$. Thus, the distribution of $W$ truncated to $A = [a, b]$ is still in the one-parameter exponential family. This phenomenon is in fact more general.

Example 18.8 (Some Distributions Not in the Exponential Family). It is clear from the definition of a one-parameter exponential family that if a certain family of distributions $\{P_\theta, \theta \in \Theta\}$ belongs to the one-parameter exponential family, then each $P_\theta$ has exactly the same support. Precisely, for any fixed $\theta$, $P_\theta(A) > 0$ if and only if $\int_A h(x)\,dx > 0$, and in the discrete case, $P_\theta(A) > 0$ if and only if $A \cap \mathcal{X} \ne \emptyset$, where $\mathcal{X}$ is the countable set $\mathcal{X} = \{x : h(x) > 0\}$. As a consequence of this common support fact, the so-called irregular distributions whose support depends on the parameter cannot be members of the exponential family. Examples would be the family of $U[0, \theta], U[-\theta, \theta]$ distributions, and so on. Likewise, the shifted exponential density $f(x\,|\,\theta) = e^{\theta - x}I_{x>\theta}$ cannot be in the exponential family.
Some other common distributions are also not in the exponential family, but for other reasons. An important example is the family of Cauchy distributions given by the location parameter form $f(x\,|\,\theta) = \frac{1}{\pi[1 + (x - \theta)^2]}I_{x\in\mathbb{R}}$. Suppose that it is. Then, we can find functions $\eta(\theta), T(x)$ such that for all $x, \theta$,
\[
e^{\eta(\theta)T(x)} = \frac{1}{1 + (x - \theta)^2} \Rightarrow \eta(\theta)T(x) = -\log(1 + (x - \theta)^2)
\]
\[
\Rightarrow \eta(0)T(x) = -\log(1 + x^2) \Rightarrow T(x) = c\log(1 + x^2)
\]
for some constant $c$.
Plugging this back, we get, for all $x, \theta$,
\[
c\,\eta(\theta)\log(1 + x^2) = -\log(1 + (x - \theta)^2) \Rightarrow \eta(\theta) = -\frac{\log(1 + (x - \theta)^2)}{c\log(1 + x^2)}.
\]
This means that $\frac{\log(1 + (x - \theta)^2)}{\log(1 + x^2)}$ must be a constant function of $x$, which is a contradiction. The choice of $\theta = 0$ as the special value of $\theta$ is not important.
diction. The choice of  D 0 as the special value of  is not important.

18.2 The Canonical Form and Basic Properties

Suppose $\{P_\theta,\ \theta\in\Theta\}$ is a family belonging to the one-parameter exponential family,
with density (or pmf) of the form $f(x\,|\,\theta) = e^{\eta(\theta)T(x) - \psi(\theta)}\, h(x)$. If $\eta(\theta)$ is a one-to-one function of $\theta$, then we can drop $\theta$ altogether, and parametrize the distribution
in terms of $\eta$ itself. If we do that, we get a reparametrized density $g$ in the form
$e^{\eta T(x) - \psi^*(\eta)}\, h(x)$. By a slight abuse of notation, we again use the notation $f$ for $g$,
and $\psi$ for $\psi^*$.

Definition 18.3. Let $X = (X_1,\ldots,X_d)$ have a distribution $P_\eta,\ \eta\in\mathcal{T}\subseteq\mathbb{R}$. The
family of distributions $\{P_\eta,\ \eta\in\mathcal{T}\}$ is said to belong to the canonical one-parameter
exponential family if the density (pmf) of $P_\eta$ may be written in the form

$$f(x\,|\,\eta) = e^{\eta T(x) - \psi(\eta)}\, h(x),$$

where

$$\mathcal{T} = \left\{\eta : e^{\psi(\eta)} = \int_{\mathbb{R}^d} e^{\eta T(x)}\, h(x)\,dx < \infty\right\}$$

in the continuous case, and

$$\mathcal{T} = \left\{\eta : e^{\psi(\eta)} = \sum_{x\in\mathcal{X}} e^{\eta T(x)}\, h(x) < \infty\right\}$$

in the discrete case, with $\mathcal{X}$ being the countable set on which $h(x) > 0$.

For a distribution in the canonical one-parameter exponential family, the parameter $\eta$ is called the natural parameter, and $\mathcal{T}$ is called the natural parameter space.
Note that $\mathcal{T}$ describes the largest set of values of $\eta$ for which the density (pmf) can
be defined. In a particular application, we may have extraneous knowledge that $\eta$
belongs to some proper subset of $\mathcal{T}$. Thus, $\{P_\eta\}$ with $\eta\in\mathcal{T}$ is called the full canonical one-parameter exponential family. We generally refer to the full family, unless
otherwise stated.

The canonical exponential family is called regular if $\mathcal{T}$ is an open set in $\mathbb{R}$, and
it is called nonsingular if $\mathrm{Var}_\eta(T(X)) > 0$ for all $\eta\in\mathcal{T}^0$, the interior of the natural
parameter space $\mathcal{T}$.

It is analytically convenient to work with an exponential family distribution in its
canonical form. Once a result has been derived for the canonical form, if desired we
can rewrite the answer in terms of the original parameter $\theta$. Doing this retransformation at the end is algebraically and notationally simpler than carrying the original
function $\eta(\theta)$ and often its higher derivatives with us throughout a calculation. Most
of our formulas and theorems below are given for the canonical form.

Example 18.9 (Binomial Distribution in Canonical Form). Let $X \sim \mathrm{Bin}(n,p)$ with
the pmf $\binom{n}{x} p^x (1-p)^{n-x}\, I_{x\in\{0,1,\ldots,n\}}$. In Example 18.2, we represented this pmf in
the exponential family form

$$f(x\,|\,p) = \binom{n}{x}\, e^{x\log\frac{p}{1-p} + n\log(1-p)}\, I_{x\in\{0,1,\ldots,n\}}.$$

If we write $\log\frac{p}{1-p} = \eta$, then $\frac{p}{1-p} = e^{\eta}$, and hence $p = \frac{e^{\eta}}{1+e^{\eta}}$ and $1-p = \frac{1}{1+e^{\eta}}$.
Therefore, the canonical exponential family form of the binomial distribution is

$$f(x\,|\,\eta) = \binom{n}{x}\, e^{x\eta - n\log(1+e^{\eta})}\, I_{x\in\{0,1,\ldots,n\}},$$

and the natural parameter space is $\mathcal{T} = \mathbb{R}$.
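As a quick numerical sanity check of this canonical form, the short Python sketch below (our own illustration; the function names are ours, not from the text) verifies that the canonical expression reproduces the usual $\mathrm{Bin}(n,p)$ pmf with $p = e^{\eta}/(1+e^{\eta})$.

```python
import math

def binom_pmf(n, p, x):
    # standard Bin(n, p) pmf
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binom_canonical_pmf(n, eta, x):
    # canonical form: C(n, x) * exp(x*eta - n*log(1 + e^eta))
    return math.comb(n, x) * math.exp(x * eta - n * math.log(1 + math.exp(eta)))

n = 10
for eta in (-2.0, 0.0, 1.5):
    p = math.exp(eta) / (1 + math.exp(eta))
    assert all(abs(binom_pmf(n, p, x) - binom_canonical_pmf(n, eta, x)) < 1e-12
               for x in range(n + 1))
print("canonical form matches the Bin(n, p) pmf")
```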

18.2.1 Convexity Properties

Written in its canonical form, a density (pmf) in an exponential family has some
convexity properties. These convexity properties are useful in manipulating with
moments and other functionals of $T(X)$, the natural sufficient statistic appearing in
the expression for the density of the distribution.

Theorem 18.1. The natural parameter space $\mathcal{T}$ is convex, and $\psi(\eta)$ is a convex
function on $\mathcal{T}$.

Proof. We consider the continuous case only, as the discrete case admits basically
the same proof. Let $\eta_1, \eta_2$ be two members of $\mathcal{T}$, and let $0 < \alpha < 1$. We need to
show that $\alpha\eta_1 + (1-\alpha)\eta_2$ belongs to $\mathcal{T}$; that is,

$$\int_{\mathbb{R}^d} e^{(\alpha\eta_1 + (1-\alpha)\eta_2)T(x)}\, h(x)\,dx < \infty.$$

But

$$\int_{\mathbb{R}^d} e^{(\alpha\eta_1 + (1-\alpha)\eta_2)T(x)}\, h(x)\,dx = \int_{\mathbb{R}^d} e^{\alpha\eta_1 T(x)}\cdot e^{(1-\alpha)\eta_2 T(x)}\, h(x)\,dx$$
$$= \int_{\mathbb{R}^d} \left(e^{\eta_1 T(x)}\right)^{\alpha} \left(e^{\eta_2 T(x)}\right)^{1-\alpha} h(x)\,dx$$
$$\le \left(\int_{\mathbb{R}^d} e^{\eta_1 T(x)}\, h(x)\,dx\right)^{\alpha} \left(\int_{\mathbb{R}^d} e^{\eta_2 T(x)}\, h(x)\,dx\right)^{1-\alpha}$$

(by Hölder's inequality)

$$< \infty,$$

because, by hypothesis, $\eta_1, \eta_2 \in \mathcal{T}$, and hence $\int_{\mathbb{R}^d} e^{\eta_1 T(x)} h(x)\,dx$ and $\int_{\mathbb{R}^d} e^{\eta_2 T(x)} h(x)\,dx$ are both finite.

Note that in this argument, we have actually proved the inequality

$$e^{\psi(\alpha\eta_1 + (1-\alpha)\eta_2)} \le e^{\alpha\psi(\eta_1) + (1-\alpha)\psi(\eta_2)}.$$

But this is the same as saying

$$\psi(\alpha\eta_1 + (1-\alpha)\eta_2) \le \alpha\psi(\eta_1) + (1-\alpha)\psi(\eta_2);$$

that is, $\psi(\eta)$ is a convex function on $\mathcal{T}$. $\square$

18.2.2 Moments and Moment Generating Function

The next result is a very special fact about the canonical exponential family, and is
the source of a large number of closed-form formulas valid for the entire canonical
exponential family. The fact itself is actually a fact in mathematical analysis. Due
to the special form of exponential family densities, the fact in analysis translates to
results for the exponential family, an instance of interplay between mathematics and
statistics and probability.

Theorem 18.2. (a) The function $e^{\psi(\eta)}$ is infinitely differentiable at every $\eta\in\mathcal{T}^0$.
Furthermore, in the continuous case, $e^{\psi(\eta)} = \int_{\mathbb{R}^d} e^{\eta T(x)} h(x)\,dx$ can be differentiated any number of times inside the integral sign, and in the discrete case,
$e^{\psi(\eta)} = \sum_{x\in\mathcal{X}} e^{\eta T(x)} h(x)$ can be differentiated any number of times inside the
sum.
(b) In the continuous case, for any $k \ge 1$,

$$\frac{d^k}{d\eta^k}\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^k\, e^{\eta T(x)}\, h(x)\,dx,$$

and in the discrete case,

$$\frac{d^k}{d\eta^k}\, e^{\psi(\eta)} = \sum_{x\in\mathcal{X}} [T(x)]^k\, e^{\eta T(x)}\, h(x).$$

Proof. Take $k = 1$. Then, by the definition of the derivative of a function, $\frac{d}{d\eta} e^{\psi(\eta)}$
exists if and only if $\lim_{\delta\to 0} \frac{e^{\psi(\eta+\delta)} - e^{\psi(\eta)}}{\delta}$ exists. But

$$\frac{e^{\psi(\eta+\delta)} - e^{\psi(\eta)}}{\delta} = \int_{\mathbb{R}^d} \frac{e^{(\eta+\delta)T(x)} - e^{\eta T(x)}}{\delta}\, h(x)\,dx,$$

and by an application of the dominated convergence theorem (see Chapter 7),

$$\lim_{\delta\to 0} \int_{\mathbb{R}^d} \frac{e^{(\eta+\delta)T(x)} - e^{\eta T(x)}}{\delta}\, h(x)\,dx$$

exists, and the limit can be carried inside the integral, to give

$$\lim_{\delta\to 0} \int_{\mathbb{R}^d} \frac{e^{(\eta+\delta)T(x)} - e^{\eta T(x)}}{\delta}\, h(x)\,dx = \int_{\mathbb{R}^d} \lim_{\delta\to 0} \frac{e^{(\eta+\delta)T(x)} - e^{\eta T(x)}}{\delta}\, h(x)\,dx$$
$$= \int_{\mathbb{R}^d} \frac{d}{d\eta}\, e^{\eta T(x)}\, h(x)\,dx = \int_{\mathbb{R}^d} T(x)\, e^{\eta T(x)}\, h(x)\,dx.$$

Now use induction on $k$ by using the dominated convergence theorem again. $\square$

This compact formula for an arbitrary derivative of $e^{\psi(\eta)}$ leads to the following
important moment formulas.

Theorem 18.3. At any $\eta\in\mathcal{T}^0$,

(a) $E_\eta[T(X)] = \psi'(\eta)$; $\mathrm{Var}_\eta[T(X)] = \psi''(\eta)$;
(b) The coefficients of skewness and kurtosis of $T(X)$ equal

$$\beta(\eta) = \frac{\psi^{(3)}(\eta)}{[\psi''(\eta)]^{3/2}} \quad\text{and}\quad \gamma(\eta) = \frac{\psi^{(4)}(\eta)}{[\psi''(\eta)]^2};$$

(c) At any $t$ such that $\eta + t \in \mathcal{T}$, the mgf of $T(X)$ exists and equals

$$M_\eta(t) = e^{\psi(\eta+t) - \psi(\eta)}.$$

Proof. Again, we take just the continuous case. Consider the result of the previous
theorem that for any $k \ge 1$, $\frac{d^k}{d\eta^k} e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^k e^{\eta T(x)} h(x)\,dx$. Using this for
$k = 1$, we get

$$\psi'(\eta)\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} T(x)\, e^{\eta T(x)}\, h(x)\,dx \;\Rightarrow\; \int_{\mathbb{R}^d} T(x)\, e^{\eta T(x) - \psi(\eta)}\, h(x)\,dx = \psi'(\eta),$$

which gives the result $E_\eta[T(X)] = \psi'(\eta)$.

Similarly,

$$\frac{d^2}{d\eta^2}\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^2\, e^{\eta T(x)}\, h(x)\,dx \;\Rightarrow\; \left[\psi''(\eta) + \{\psi'(\eta)\}^2\right] e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^2\, e^{\eta T(x)}\, h(x)\,dx$$
$$\Rightarrow\; \psi''(\eta) + \{\psi'(\eta)\}^2 = \int_{\mathbb{R}^d} [T(x)]^2\, e^{\eta T(x) - \psi(\eta)}\, h(x)\,dx,$$

which gives $E_\eta[T(X)^2] = \psi''(\eta) + \{\psi'(\eta)\}^2$. Combine this with the already obtained result that $E_\eta[T(X)] = \psi'(\eta)$, and we get $\mathrm{Var}_\eta[T(X)] = E_\eta[T(X)^2] - (E_\eta[T(X)])^2 = \psi''(\eta)$.

The coefficient of skewness is defined as $\beta = \frac{E[T(X) - ET(X)]^3}{(\mathrm{Var}\,T(X))^{3/2}}$. To obtain
$E[T(X) - ET(X)]^3 = E[T(X)]^3 - 3E[T(X)]^2 E[T(X)] + 2[ET(X)]^3$, use the
identity $\frac{d^3}{d\eta^3} e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^3 e^{\eta T(x)} h(x)\,dx$. Then use the fact that the third
derivative of $e^{\psi(\eta)}$ is $e^{\psi(\eta)}[\psi^{(3)}(\eta) + 3\psi'(\eta)\psi''(\eta) + \{\psi'(\eta)\}^3]$. As we did in our
proofs for the mean and the variance above, transfer $e^{\psi(\eta)}$ into the integral on the
right-hand side and then simplify. This gives $E[T(X) - ET(X)]^3 = \psi^{(3)}(\eta)$, and
the skewness formula follows. The formula for kurtosis is proved by the same argument, using $k = 4$ in the derivative identity $\frac{d^k}{d\eta^k} e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^k e^{\eta T(x)} h(x)\,dx$.

Finally, for the mgf formula,

$$M_\eta(t) = E_\eta[e^{tT(X)}] = \int_{\mathbb{R}^d} e^{tT(x)}\, e^{\eta T(x) - \psi(\eta)}\, h(x)\,dx$$
$$= e^{-\psi(\eta)} \int_{\mathbb{R}^d} e^{(t+\eta)T(x)}\, h(x)\,dx = e^{-\psi(\eta)}\, e^{\psi(t+\eta)} \int_{\mathbb{R}^d} e^{(t+\eta)T(x) - \psi(t+\eta)}\, h(x)\,dx$$
$$= e^{\psi(t+\eta) - \psi(\eta)},$$

because the last integral equals one. $\square$

An important consequence of the mean and the variance formulas is the following
monotonicity result.

Corollary 18.1. For a nonsingular canonical exponential family, $E_\eta[T(X)]$ is
strictly increasing in $\eta$ on $\mathcal{T}^0$.

Proof. From part (a) of Theorem 18.3, the variance of $T(X)$ is the derivative of the
expectation of $T(X)$, and by nonsingularity, the variance is strictly positive. This
implies that the expectation is strictly increasing. $\square$

As a consequence of this strict monotonicity of the mean of $T(X)$ in the natural
parameter, nonsingular canonical exponential families may be reparametrized by
using the mean of $T$ itself as the parameter. This is useful for some purposes.
Example 18.10 (Binomial Distribution). From Example 18.9, in the canonical
representation of the binomial distribution, $\psi(\eta) = n\log(1+e^{\eta})$. By direct
differentiation,

$$\psi'(\eta) = \frac{ne^{\eta}}{1+e^{\eta}}; \qquad \psi''(\eta) = \frac{ne^{\eta}}{(1+e^{\eta})^2};$$
$$\psi^{(3)}(\eta) = \frac{ne^{\eta}(1-e^{\eta})}{(1+e^{\eta})^3}; \qquad \psi^{(4)}(\eta) = \frac{ne^{\eta}(e^{2\eta} - 4e^{\eta} + 1)}{(1+e^{\eta})^4}.$$

Now recall from Example 18.9 that the success probability $p$ and the natural parameter $\eta$ are related as $p = \frac{e^{\eta}}{1+e^{\eta}}$. Using this, and our general formulas from Theorem
18.3, we can rewrite the mean, variance, skewness, and kurtosis of $X$ as

$$E(X) = np; \quad \mathrm{Var}(X) = np(1-p); \quad \beta_p = \frac{1-2p}{\sqrt{np(1-p)}}; \quad \gamma_p = \frac{1 - 6p(1-p)}{np(1-p)}.$$
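The moment formulas of Theorem 18.3 are easy to check numerically on this example. The following sketch (our own illustration, with an arbitrarily chosen $\eta$) differentiates $\psi(\eta) = n\log(1+e^{\eta})$ by central differences and compares the results with the closed-form binomial moments above.

```python
import math

n = 10
psi = lambda eta: n * math.log(1 + math.exp(eta))   # psi for Bin(n, p)

def deriv(f, x, k, h=1e-3):
    # k-th derivative by iterated central differences (adequate for k <= 3)
    if k == 0:
        return f(x)
    return (deriv(f, x + h, k - 1, h) - deriv(f, x - h, k - 1, h)) / (2 * h)

eta = 0.7                                  # an arbitrary interior point
p = math.exp(eta) / (1 + math.exp(eta))
mean, var = deriv(psi, eta, 1), deriv(psi, eta, 2)
skew = deriv(psi, eta, 3) / var**1.5
print(mean, n * p)                                     # psi'  vs  np
print(var, n * p * (1 - p))                            # psi'' vs  np(1-p)
print(skew, (1 - 2 * p) / math.sqrt(n * p * (1 - p)))  # two skewness formulas
```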

For completeness, it is useful to have the mean and the variance formula in an
original parametrization, and they are stated below. The proof follows from an
application of Theorem 18.3 and the chain rule.
Theorem 18.4. Let $\{P_\theta,\ \theta\in\Theta\}$ be a family of distributions in the one-parameter
exponential family with density (pmf)

$$f(x\,|\,\theta) = e^{\eta(\theta)T(x) - \psi(\theta)}\, h(x).$$

Then, at any $\theta$ at which $\eta'(\theta) \ne 0$,

$$E_\theta[T(X)] = \frac{\psi'(\theta)}{\eta'(\theta)}; \qquad \mathrm{Var}_\theta(T(X)) = \frac{\psi''(\theta)}{[\eta'(\theta)]^2} - \frac{\psi'(\theta)\,\eta''(\theta)}{[\eta'(\theta)]^3}.$$

18.2.3 Closure Properties

The exponential family satisfies a number of important closure properties. For instance, if a $d$-dimensional random vector $X = (X_1,\ldots,X_d)$ has a distribution in
the exponential family, then the conditional distribution of any subvector given the
rest is also in the exponential family. There are a number of such closure properties,
of which we discuss only four.

First, if $X = (X_1,\ldots,X_d)$ has a distribution in the exponential family, then the
natural sufficient statistic $T(X)$ also has a distribution in the exponential family.
Verification of this in the greatest generality cannot be done without using measure
theory. However, we can easily demonstrate this in some particular cases. Consider
the continuous case with $d = 1$ and suppose $T(X)$ is a differentiable one-to-one
function of $X$. Then, by the Jacobian formula (see Chapter 1), $T(X)$ has the density

$$f_T(t\,|\,\eta) = e^{\eta t - \psi(\eta)}\, \frac{h(T^{-1}(t))}{|T'(T^{-1}(t))|}.$$

This is once again in the one-parameter exponential family form, with the natural
sufficient statistic as $T$ itself, and the $\psi$ function unchanged. The $h$ function has
changed to a new function $h^*(t) = \frac{h(T^{-1}(t))}{|T'(T^{-1}(t))|}$.

Similarly, in the discrete case, the pmf of $T(X)$ is given by

$$P_\eta(T(X) = t) = \sum_{x:\, T(x)=t} e^{\eta T(x) - \psi(\eta)}\, h(x) = e^{\eta t - \psi(\eta)}\, h^*(t),$$

where $h^*(t) = \sum_{x:\, T(x)=t} h(x)$.
Next, suppose $X = (X_1,\ldots,X_d)$ has a density (pmf) $f(x\,|\,\eta)$ in the exponential family and $Y_1, Y_2, \ldots, Y_n$ are $n$ iid observations from this density $f(x\,|\,\eta)$.
Note that each individual $Y_i$ is a $d$-dimensional vector. The joint density of $Y = (Y_1, Y_2, \ldots, Y_n)$ is

$$f(y\,|\,\eta) = \prod_{i=1}^n f(y_i\,|\,\eta) = \prod_{i=1}^n e^{\eta T(y_i) - \psi(\eta)}\, h(y_i)
= e^{\eta\sum_{i=1}^n T(y_i) - n\psi(\eta)}\, \prod_{i=1}^n h(y_i).$$

We recognize this to be in the one-parameter exponential family form again, with
the natural sufficient statistic as $\sum_{i=1}^n T(Y_i)$, the new $\psi$ function as $n\psi$, and the
new $h$ function as $\prod_{i=1}^n h(y_i)$. The joint density $\prod_{i=1}^n f(y_i\,|\,\eta)$ is known as the
likelihood function in statistics (see Chapter 3). So, likelihood functions obtained
from an iid sample from a distribution in the one-parameter exponential family are
also members of the one-parameter exponential family.
The closure properties outlined above are formally stated in the next theorem.

Theorem 18.5. Suppose $X = (X_1,\ldots,X_d)$ has a distribution belonging to the
one-parameter exponential family with the natural sufficient statistic $T(X)$.
(a) $T = T(X)$ also has a distribution belonging to the one-parameter exponential
family.
(b) Let $Y = AX + u$ be a nonsingular linear transformation of $X$. Then $Y$ also
has a distribution belonging to the one-parameter exponential family.
(c) Let $I_0$ be any proper subset of $I = \{1,2,\ldots,d\}$. Then the joint conditional distribution of $X_i,\ i\in I_0$ given $X_j,\ j\in I - I_0$ also belongs to the one-parameter
exponential family.
(d) For given $n \ge 1$, suppose $Y_1,\ldots,Y_n$ are iid with the same distribution as $X$.
Then the joint distribution of $(Y_1,\ldots,Y_n)$ also belongs to the one-parameter
exponential family.

18.3 Multiparameter Exponential Family

Similar to the case of distributions with only one parameter, several common
distributions with multiple parameters also belong to a general multiparameter exponential family. An example is the normal distribution on $\mathbb{R}$ with both parameters
unknown. Another example is a multivariate normal distribution. Analytic techniques and properties of multiparameter exponential families are very similar to
those of the one-parameter exponential family. For that reason, most of our presentation in this section dwells on examples.

Definition 18.4. Let $X = (X_1,\ldots,X_d)$ have a distribution $P_\theta,\ \theta\in\Theta\subseteq\mathbb{R}^k$. The
family of distributions $\{P_\theta,\ \theta\in\Theta\}$ is said to belong to the $k$-parameter exponential
family if its density (pmf) may be represented in the form

$$f(x\,|\,\theta) = e^{\sum_{i=1}^k \eta_i(\theta)T_i(x) - \psi(\theta)}\, h(x).$$

Again, obviously, the choice of the relevant functions $\eta_i, T_i, h$ is not unique. As
in the one-parameter case, the vector of statistics $(T_1,\ldots,T_k)$ is called the natural
sufficient statistic, and if we reparametrize by using $\eta_i = \eta_i(\theta),\ i = 1,2,\ldots,k$,
the family is called the $k$-parameter canonical exponential family.

There is an implicit assumption in this definition that the number of freely varying
$\theta$s is the same as the number of freely varying $\eta$s, and that these are both equal to the
specific $k$ in the context. The formal way to say this is to assume the following.

Assumption. The dimension of $\Theta$ as well as the dimension of the image of $\Theta$ under
the map $(\theta_1,\theta_2,\ldots,\theta_k) \to (\eta_1(\theta_1,\theta_2,\ldots,\theta_k), \eta_2(\theta_1,\theta_2,\ldots,\theta_k), \ldots, \eta_k(\theta_1,\theta_2,\ldots,\theta_k))$ is equal to $k$.

There are some important examples where this assumption does not hold. They
are not counted as members of a $k$-parameter exponential family. The name curved
exponential family is commonly used for them, and this is discussed in the last
section.

The terms canonical form, natural parameter, and natural parameter space
mean the same things as in the one-parameter case. Thus, if we parametrize
the distributions by using $\eta_1, \eta_2, \ldots, \eta_k$ as the $k$ parameters, then the vector
$\eta = (\eta_1,\eta_2,\ldots,\eta_k)$ is called the natural parameter vector, the parametrization
$f(x\,|\,\eta) = e^{\sum_{i=1}^k \eta_i T_i(x) - \psi(\eta)}\, h(x)$ is called the canonical form, and the set of all
vectors $\eta$ for which $f(x\,|\,\eta)$ is a valid density (pmf) is called the natural parameter
space. The main theorems for the case $k = 1$ hold for a general $k$.
Theorem 18.6. The results of Theorems 18.1 and 18.5 hold for the $k$-parameter exponential family.

The proofs are almost verbatim the same. The moment formulas differ somewhat
due to the presence of more than one parameter in the current context.

Theorem 18.7. Suppose $X = (X_1,\ldots,X_d)$ has a distribution $P_\eta,\ \eta\in\mathcal{T}$, belonging to the canonical $k$-parameter exponential family, with a density (pmf)

$$f(x\,|\,\eta) = e^{\sum_{i=1}^k \eta_i T_i(x) - \psi(\eta)}\, h(x),$$

where

$$\mathcal{T} = \left\{\eta\in\mathbb{R}^k : \int_{\mathbb{R}^d} e^{\sum_{i=1}^k \eta_i T_i(x)}\, h(x)\,dx < \infty\right\}$$

(the integral being replaced by a sum in the discrete case).

(a) At any $\eta\in\mathcal{T}^0$,

$$e^{\psi(\eta)} = \int_{\mathbb{R}^d} e^{\sum_{i=1}^k \eta_i T_i(x)}\, h(x)\,dx$$

is infinitely partially differentiable with respect to each $\eta_i$, and the partial
derivatives of any order can be obtained by differentiating inside the integral
sign.
(b) $E_\eta[T_i(X)] = \frac{\partial}{\partial\eta_i}\psi(\eta)$; $\mathrm{Cov}_\eta(T_i(X), T_j(X)) = \frac{\partial^2}{\partial\eta_i\partial\eta_j}\psi(\eta)$, $1\le i,j\le k$.
(c) If $\eta, t$ are such that $\eta,\ \eta+t\in\mathcal{T}$, then the joint mgf of $(T_1(X),\ldots,T_k(X))$
exists and equals

$$M_\eta(t) = e^{\psi(\eta+t) - \psi(\eta)}.$$

An important new terminology is that of full rank.

Definition 18.5. A family of distributions $\{P_\eta,\ \eta\in\mathcal{T}\}$ belonging to the canonical
$k$-parameter exponential family is called full rank if at every $\eta\in\mathcal{T}^0$, the $k\times k$
covariance matrix $\left(\frac{\partial^2}{\partial\eta_i\partial\eta_j}\psi(\eta)\right)$ is nonsingular.

Definition 18.6 (Fisher Information Matrix). Suppose a family of distributions in
the canonical $k$-parameter exponential family is nonsingular. Then, for $\eta\in\mathcal{T}^0$, the
matrix $I(\eta) = \left(\frac{\partial^2}{\partial\eta_i\partial\eta_j}\psi(\eta)\right)$ is called the Fisher information matrix (at $\eta$).

The Fisher information matrix is of paramount importance in parametric statistical theory and lies at the heart of finite and large sample optimality theory in
statistical inference problems for general regular parametric families.

We now show some examples of distributions in $k$-parameter exponential families where $k > 1$.
Example 18.11 (Two-Parameter Normal Distribution). Suppose $X \sim N(\mu,\sigma^2)$,
and we consider both $\mu, \sigma$ to be parameters. If we denote $(\mu,\sigma) = (\theta_1,\theta_2) = \theta$,
then, parametrized by $\theta$, the density of $X$ is

$$f(x\,|\,\theta) = \frac{1}{\sqrt{2\pi}\,\theta_2}\, e^{-\frac{(x-\theta_1)^2}{2\theta_2^2}}\, I_{x\in\mathbb{R}}
= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2\theta_2^2} + \frac{\theta_1 x}{\theta_2^2} - \frac{\theta_1^2}{2\theta_2^2} - \log\theta_2}\, I_{x\in\mathbb{R}}.$$

This is in the two-parameter exponential family with

$$\eta_1(\theta) = -\frac{1}{2\theta_2^2}, \quad \eta_2(\theta) = \frac{\theta_1}{\theta_2^2}, \quad T_1(x) = x^2, \quad T_2(x) = x,$$
$$\psi(\theta) = \frac{\theta_1^2}{2\theta_2^2} + \log\theta_2, \quad h(x) = \frac{1}{\sqrt{2\pi}}\, I_{x\in\mathbb{R}}.$$

The parameter space in the $\theta$ parametrization is

$$\Theta = (-\infty,\infty)\otimes(0,\infty).$$

If we want the canonical form, we let

$$\eta_1 = -\frac{1}{2\theta_2^2}, \quad \eta_2 = \frac{\theta_1}{\theta_2^2}, \quad\text{and}\quad \psi(\eta) = -\frac{\eta_2^2}{4\eta_1} - \frac{1}{2}\log(-2\eta_1).$$

The natural parameter space for $(\eta_1,\eta_2)$ is $(-\infty,0)\otimes(-\infty,\infty)$.
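By Theorem 18.7(b), the Hessian of $\psi(\eta)$ in this canonical parametrization should equal the covariance matrix of the natural sufficient statistic $(T_1,T_2) = (X^2, X)$. The simulation sketch below (our own illustration; the parameter values are arbitrary) checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0                          # arbitrary illustrative values
eta1, eta2 = -1 / (2 * sigma**2), mu / sigma**2

def psi(e1, e2):
    # psi(eta) = -eta2^2/(4*eta1) - (1/2)*log(-2*eta1)
    return -e2**2 / (4 * e1) - 0.5 * np.log(-2 * e1)

def d2psi(i, j, h=1e-4):
    # central-difference second partial of psi w.r.t. (eta_i, eta_j)
    e = np.zeros((2, 2)); e[0, 0] = h; e[1, 1] = h
    p = np.array([eta1, eta2])
    return (psi(*(p + e[i] + e[j])) - psi(*(p + e[i] - e[j]))
            - psi(*(p - e[i] + e[j])) + psi(*(p - e[i] - e[j]))) / (4 * h**2)

H = np.array([[d2psi(0, 0), d2psi(0, 1)], [d2psi(1, 0), d2psi(1, 1)]])
x = rng.normal(mu, sigma, size=1_000_000)
S = np.cov(np.vstack([x**2, x]))              # sample covariance of (X^2, X)
print(H)   # should match S up to Monte Carlo error
print(S)
```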

Example 18.12 (Two-Parameter Gamma). It was seen in Example 18.5 that if we
fix one of the two parameters of a Gamma distribution, then it becomes a member
of the one-parameter exponential family. We show in this example that the general
Gamma distribution is a member of the two-parameter exponential family. To show
this, just observe that with $\theta = (\alpha,\lambda) = (\theta_1,\theta_2)$,

$$f(x\,|\,\theta) = e^{-\frac{x}{\theta_2} + \theta_1\log x - \theta_1\log\theta_2 - \log\Gamma(\theta_1)}\, \frac{1}{x}\, I_{x>0}.$$

This is in the two-parameter exponential family with $\eta_1(\theta) = -\frac{1}{\theta_2},\ \eta_2(\theta) = \theta_1,\ T_1(x) = x,\ T_2(x) = \log x,\ \psi(\theta) = \theta_1\log\theta_2 + \log\Gamma(\theta_1)$, and $h(x) = \frac{1}{x}\, I_{x>0}$.
The parameter space in the $\theta$-parametrization is $(0,\infty)\otimes(0,\infty)$. For the canonical form, use $\eta_1 = -\frac{1}{\theta_2},\ \eta_2 = \theta_1$, and so the natural parameter space is
$(-\infty,0)\otimes(0,\infty)$. The natural sufficient statistic is $(X, \log X)$.
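Theorem 18.7(b) applies here as well: substituting $\lambda = -1/\eta_1$ and $\alpha = \eta_2$ into $\psi(\theta)$, one finds $\psi(\eta) = -\eta_2\log(-\eta_1) + \log\Gamma(\eta_2)$, whose partial derivatives give $E[X] = \alpha\lambda$ and $E[\log X] = \log\lambda + \psi_{\mathrm{dg}}(\alpha)$, where $\psi_{\mathrm{dg}}$ denotes the digamma function. A small simulation sketch (ours; the parameter values are arbitrary, and the digamma is approximated by differencing $\log\Gamma$):

```python
import math
import random

alpha, lam = 3.0, 2.0                      # arbitrary illustrative values
rng = random.Random(1)
xs = [rng.gammavariate(alpha, lam) for _ in range(200_000)]  # scale = lam

def digamma(a, h=1e-5):
    # crude digamma via differencing log-gamma; fine for this check
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

print(sum(xs) / len(xs), alpha * lam)                      # E[X] = alpha*lam
print(sum(math.log(x) for x in xs) / len(xs),
      math.log(lam) + digamma(alpha))                      # E[log X]
```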

Example 18.13 (The General Multivariate Normal Distribution). Suppose $X \sim N_d(\mu,\Sigma)$, where $\mu$ is arbitrary and $\Sigma$ is positive definite (and, of course, symmetric). Writing $\theta = (\mu,\Sigma)$, we can think of $\Theta$ as a subset of a Euclidean space of
dimension

$$k = d + d + \frac{d^2 - d}{2} = d + \frac{d(d+1)}{2} = \frac{d(d+3)}{2}.$$

The density of $X$ is

$$f(x\,|\,\theta) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}\, I_{x\in\mathbb{R}^d}$$
$$= \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}x'\Sigma^{-1}x + \mu'\Sigma^{-1}x - \frac{1}{2}\mu'\Sigma^{-1}\mu}\, I_{x\in\mathbb{R}^d}$$
$$= \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}\sum_i\sum_j \sigma^{ij}x_ix_j + \sum_i(\sum_k \sigma^{ki}\mu_k)x_i - \frac{1}{2}\mu'\Sigma^{-1}\mu}\, I_{x\in\mathbb{R}^d}$$
$$= \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}\sum_i \sigma^{ii}x_i^2 - \sum_{i<j}\sigma^{ij}x_ix_j + \sum_i(\sum_k \sigma^{ki}\mu_k)x_i - \frac{1}{2}\mu'\Sigma^{-1}\mu}\, I_{x\in\mathbb{R}^d},$$

where $\sigma^{ij}$ denotes the $(i,j)$th element of $\Sigma^{-1}$.

We have thus represented the density of $X$ in the $k$-parameter exponential family
form with the $k$-dimensional natural sufficient statistic

$$T(X) = (X_1,\ldots,X_d,\ X_1^2,\ldots,X_d^2,\ X_1X_2,\ldots,X_{d-1}X_d),$$

and the natural parameters defined by

$$\left(\sum_k \sigma^{k1}\mu_k, \ldots, \sum_k \sigma^{kd}\mu_k,\ -\frac{1}{2}\sigma^{11}, \ldots, -\frac{1}{2}\sigma^{dd},\ -\sigma^{12}, \ldots, -\sigma^{d-1,d}\right).$$

Example 18.14 (Multinomial Distribution). Consider the $(k+1)$-cell multinomial
distribution with cell probabilities $p_1, p_2, \ldots, p_k,\ p_{k+1} = 1 - \sum_{i=1}^k p_i$. Writing
$\theta = (p_1,p_2,\ldots,p_k)$, the joint pmf of $X = (X_1,X_2,\ldots,X_k)$, the cell frequencies
of the first $k$ cells, is

$$f(x\,|\,\theta) = \frac{n!}{(\prod_{i=1}^k x_i!)(n-\sum_{i=1}^k x_i)!}\, \prod_{i=1}^k p_i^{x_i} \left(1-\sum_{i=1}^k p_i\right)^{n-\sum_{i=1}^k x_i} I_{x_1,\ldots,x_k\ge 0,\ \sum_{i=1}^k x_i\le n}$$

$$= \frac{n!}{(\prod_{i=1}^k x_i!)(n-\sum_{i=1}^k x_i)!}\, e^{\sum_{i=1}^k (\log p_i)x_i - \log(1-\sum_{i=1}^k p_i)(\sum_{i=1}^k x_i) + n\log(1-\sum_{i=1}^k p_i)}\, I_{x_1,\ldots,x_k\ge 0,\ \sum_{i=1}^k x_i\le n}$$

$$= \frac{n!}{(\prod_{i=1}^k x_i!)(n-\sum_{i=1}^k x_i)!}\, e^{\sum_{i=1}^k \left(\log\frac{p_i}{1-\sum_{j=1}^k p_j}\right)x_i + n\log\left(1-\sum_{i=1}^k p_i\right)}\, I_{x_1,\ldots,x_k\ge 0,\ \sum_{i=1}^k x_i\le n}.$$

This is in the $k$-parameter exponential family form with the natural sufficient statistic and natural parameters

$$T(X) = (X_1,X_2,\ldots,X_k), \qquad \eta_i = \log\frac{p_i}{1-\sum_{j=1}^k p_j},\ 1\le i\le k.$$

Example 18.15 (Two-Parameter Inverse Gaussian Distribution). It was shown in
Theorem 11.5 that for the simple symmetric random walk on $\mathbb{Z}$, the time of the $r$th
return to zero $\tau_r$ satisfies the weak convergence result

$$P\left(\frac{\tau_r}{r^2} \le x\right) \to 2\left[1 - \Phi\left(\frac{1}{\sqrt{x}}\right)\right], \quad x > 0,$$

as $r\to\infty$. The density of this limiting CDF is $f(x) = \frac{e^{-\frac{1}{2x}}}{\sqrt{2\pi}}\, x^{-3/2}\, I_{x>0}$. This is a
special inverse Gaussian distribution. The general inverse Gaussian distribution has
the density

$$f(x\,|\,\theta_1,\theta_2) = \left(\frac{\theta_2}{\pi}\right)^{1/2} e^{-\theta_1 x - \frac{\theta_2}{x} + 2\sqrt{\theta_1\theta_2}}\, x^{-3/2}\, I_{x>0};$$

the parameter space for $\theta = (\theta_1,\theta_2)$ is $[0,\infty)\otimes(0,\infty)$. Note that the special
inverse Gaussian density described above corresponds to $\theta_1 = 0,\ \theta_2 = \frac{1}{2}$.
The general inverse Gaussian density $f(x\,|\,\theta_1,\theta_2)$ is the density of the first time
that a Wiener process (starting at zero) hits the straight line with the equation $y = \sqrt{2\theta_2} - \sqrt{2\theta_1}\,t,\ t > 0$.

It is clear from the formula for $f(x\,|\,\theta_1,\theta_2)$ that it is a member of the two-parameter exponential family with the natural sufficient statistic $T(X) = (X, \frac{1}{X})$
and the natural parameter space $\mathcal{T} = (-\infty,0]\otimes(-\infty,0)$. Note that the natural
parameter space is not open.

18.4 * Sufficiency and Completeness

Exponential families under mild conditions on the parameter space $\Theta$ have the property that if a function $g(T)$ of the natural sufficient statistic $T = T(X)$ has zero
expected value under each $\theta\in\Theta$, then $g(T)$ itself must be essentially identically
equal to zero. A family of distributions that has this property is called a complete
family. The completeness property, particularly in conjunction with the property of
sufficiency, has had a historically important role in statistical inference. Lehmann
(1959), Lehmann and Casella (1998), and Brown (1986) give many applications.
However, our motivation for studying the completeness of a full rank exponential
family is primarily for presenting a well-known theorem in statistics, Basu's theorem (Basu (1955)), which is also a very effective and efficient tool for probabilists
in minimizing clumsy distributional calculations. Completeness is required in order
to state Basu's theorem.

Definition 18.7. A family of distributions $\{P_\theta,\ \theta\in\Theta\}$ on some sample space $\mathcal{X}$ is
called complete if $E_{P_\theta}[g(X)] = 0$ for all $\theta\in\Theta$ implies that $P_\theta(g(X) = 0) = 1$
for all $\theta\in\Theta$.

It is useful to first see an example of a family that is not complete.
Example 18.16. Suppose $X \sim \mathrm{Bin}(2,p)$, and the parameter $p$ is $\frac{1}{4}$ or $\frac{3}{4}$. In the notation of the definition of completeness, $\Theta$ is the two-point set $\{\frac{1}{4}, \frac{3}{4}\}$. Consider the
function $g$ defined by

$$g(0) = g(2) = 3, \qquad g(1) = -5.$$

Then,

$$E_p[g(X)] = g(0)(1-p)^2 + 2g(1)p(1-p) + g(2)p^2 = 16p^2 - 16p + 3 = 0, \quad\text{if } p = \frac{1}{4} \text{ or } \frac{3}{4}.$$

Therefore, we have exhibited a nonzero function $g$ that violates the condition for completeness of this family of distributions.
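The arithmetic is easy to confirm; a few lines of Python (ours) evaluate $E_p[g(X)]$ and show that it vanishes at $p = \frac{1}{4}$ and $p = \frac{3}{4}$ but not, for example, at $p = \frac{1}{2}$.

```python
# E_p[g(X)] for X ~ Bin(2, p) with g(0) = g(2) = 3, g(1) = -5;
# the quadratic 16p^2 - 16p + 3 vanishes exactly at p = 1/4 and p = 3/4
g = {0: 3, 1: -5, 2: 3}

def expect_g(p):
    return g[0] * (1 - p)**2 + 2 * g[1] * p * (1 - p) + g[2] * p**2

for p in (0.25, 0.75, 0.5):
    print(p, expect_g(p))   # zero, zero, and a nonzero value at p = 0.5
```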
Thus, completeness of a family of distributions is not universally true. The problem with the two-point parameter set in the above example is that it is too small. If
the parameter space is richer, the family of binomial distributions for any fixed $n$
is in fact complete. In fact, the general $k$-parameter exponential family as a whole
is a complete family, provided the set of parameter values is not too thin. Here is a
general theorem.

Theorem 18.8. Suppose a family of distributions $\mathcal{F} = \{P_\theta,\ \theta\in\Theta\}$ belongs to a $k$-parameter exponential family, and that the set $\Theta$ to which the parameter $\theta$ is known
to belong has a nonempty interior. Then the family $\mathcal{F}$ is complete.

The proof of this requires the use of properties of functions that are analytic on a
domain in $\mathbb{C}^k$, where $\mathbb{C}$ is the complex plane. We do not prove the theorem here; see
Brown (1986, p. 43) for a proof. The nonempty interior assumption is protecting us
from the set $\Theta$ being too small.
Example 18.17. Suppose $X \sim \mathrm{Bin}(n,p)$, where $n$ is fixed, and the set of possible
values for $p$ contains an interval (however small). Then, in the terminology of the
theorem above, $\Theta$ has a nonempty interior. Therefore, such a family of binomial
distributions is indeed complete. The only function $g(X)$ that satisfies $E_p[g(X)] = 0$ for all $p$ in a set $\Theta$ that contains in it an interval is the zero function $g(x) = 0$
for all $x = 0,1,\ldots,n$. Contrast this with Example 18.16.
We require one more definition before we can state Basu's theorem.

Definition 18.8. Suppose $X$ has a distribution $P_\theta$ belonging to a family $\mathcal{F} = \{P_\theta,\ \theta\in\Theta\}$. A statistic $S(X)$ is called $\mathcal{F}$-ancillary (or, simply, ancillary) if for
any set $A$, $P_\theta(S(X)\in A)$ does not depend on $\theta\in\Theta$, that is, if $S(X)$ has the same
distribution under each $P_\theta\in\mathcal{F}$.

Example 18.18. Suppose $X_1, X_2$ are iid $N(\mu,1)$, and $\mu$ belongs to some subset
$\Theta$ of the real line. Let $S(X_1,X_2) = X_1 - X_2$. Then, under any $P_\mu$, $S(X_1,X_2) \sim N(0,2)$, a fixed distribution that does not depend on $\mu$. Thus, $S(X_1,X_2) = X_1 - X_2$
is ancillary, whatever the set of values of $\mu$ is.

Example 18.19. Suppose $X_1, X_2$ are iid $U[0,\theta]$, and $\theta$ belongs to some subset $\Theta$
of $(0,\infty)$. Let $S(X_1,X_2) = \frac{X_1}{X_2}$. We can write $S(X_1,X_2)$ as

$$S(X_1,X_2) \stackrel{\mathcal{L}}{=} \frac{\theta U_1}{\theta U_2} = \frac{U_1}{U_2},$$

where $U_1, U_2$ are iid $U[0,1]$. Thus, under any $P_\theta$, $S(X_1,X_2)$ is distributed as the
ratio of two independent $U[0,1]$ variables. This is a fixed distribution that does not
depend on $\theta$. Thus, $S(X_1,X_2) = \frac{X_1}{X_2}$ is ancillary, whatever the set of values of $\theta$ is.
Example 18.20. Suppose $X_1, X_2, \ldots, X_n$ are iid $N(\mu,1)$, and $\mu$ belongs to some
subset $\Theta$ of the real line. Let $S(X_1,\ldots,X_n) = \sum_{i=1}^n (X_i - \bar{X})^2$. We can write
$S(X_1,\ldots,X_n)$ as

$$S(X_1,\ldots,X_n) \stackrel{\mathcal{L}}{=} \sum_{i=1}^n (\mu + Z_i - [\mu + \bar{Z}])^2 = \sum_{i=1}^n (Z_i - \bar{Z})^2,$$

where $Z_1,\ldots,Z_n$ are iid $N(0,1)$. Thus, under any $P_\mu$, $S(X_1,\ldots,X_n)$ has a fixed
distribution, namely the distribution of $\sum_{i=1}^n (Z_i - \bar{Z})^2$ (actually, this is a $\chi^2_{n-1}$
distribution). Thus, $S(X_1,\ldots,X_n) = \sum_{i=1}^n (X_i - \bar{X})^2$ is ancillary, whatever the
set of values of $\mu$ is.
Theorem 18.9 (Basu’s Theorem for the Exponential Family). In any k-
parameter exponential family F , with a parameter space ‚ that has a nonempty
interior, the natural sufficient statistic of the family T .X / and any F -ancillary
statistic S.X / are independently distributed under each  2 ‚.
We show applications of this result following the next section.

18.4.1 * Neyman–Fisher Factorization and Basu's Theorem

There is a more general version of Basu's theorem that applies to arbitrary parametric families of distributions. The intuition is the same as it was in the case of an
exponential family, namely, a sufficient statistic, which contains all the information,
and an ancillary statistic, which contains no information, must be independent. For
this, we need to define what a sufficient statistic means for a general parametric
family. Here is Fisher's original definition (Fisher (1922)).

Definition 18.9. Let $n \ge 1$ be given, and suppose $X = (X_1,\ldots,X_n)$ has a joint
distribution $P_{\theta,n}$ belonging to some family

$$\mathcal{F}_n = \{P_{\theta,n} : \theta\in\Theta\}.$$

A statistic $T(X) = T(X_1,\ldots,X_n)$ taking values in some Euclidean space is called
sufficient for the family $\mathcal{F}_n$ if the joint conditional distribution of $X_1,\ldots,X_n$ given
$T(X_1,\ldots,X_n)$ is the same under each $\theta\in\Theta$.

Thus, we can interpret the sufficient statistic $T(X_1,\ldots,X_n)$ in the following
way: once we know the value of $T$, the set of individual data values $X_1,\ldots,X_n$ has
nothing more to convey about $\theta$. We can think of sufficiency as data reduction at no
cost; we can save only $T$ and discard the individual data values without losing any
information. However, what is sufficient depends, often crucially, on the functional
form of the distributions $P_{\theta,n}$. Thus, sufficiency is useful for data reduction subject
to loyalty to the chosen functional form of $P_{\theta,n}$.

Fortunately, there is an easily applicable universal recipe for automatically identifying a sufficient statistic for a given family $\mathcal{F}_n$. This is the factorization theorem.

Theorem 18.10 (Neyman–Fisher Factorization Theorem). Let $f(x_1,\ldots,x_n\,|\,\theta)$
be the joint density function (joint pmf) corresponding to the distribution $P_{\theta,n}$. Then,
a statistic $T = T(X_1,\ldots,X_n)$ is sufficient for the family $\mathcal{F}_n$ if and only if for any
$\theta\in\Theta$, $f(x_1,\ldots,x_n\,|\,\theta)$ can be factorized in the form

$$f(x_1,\ldots,x_n\,|\,\theta) = g(\theta, T(x_1,\ldots,x_n))\, h(x_1,\ldots,x_n).$$

See Bickel and Doksum (2006) for a proof.

The intuition of the factorization theorem is that the only way that the parameter
$\theta$ is tied to the data values $x_1,\ldots,x_n$ in the likelihood function $f(x_1,\ldots,x_n\,|\,\theta)$ is
via the statistic $T(x_1,\ldots,x_n)$, because there is no $\theta$ in the function $h(x_1,\ldots,x_n)$.
Therefore, we should only care to know what $T$ is, but not the individual values
$X_1,\ldots,X_n$.

Here is one example on using the factorization theorem.

Example 18.21 (Sufficient Statistic for a Uniform Distribution). Suppose $X_1,\ldots,X_n$ are iid and distributed as $U[0,\theta]$ for some $\theta > 0$. Then, the likelihood function is

$$f(x_1,\ldots,x_n\,|\,\theta) = \prod_{i=1}^n \frac{1}{\theta}\, I_{x_i\le\theta} = \left(\frac{1}{\theta}\right)^n \prod_{i=1}^n I_{x_i\le\theta} = \left(\frac{1}{\theta}\right)^n I_{x_{(n)}\le\theta},$$

where $x_{(n)} = \max(x_1,\ldots,x_n)$. If we let

$$T(X_1,\ldots,X_n) = X_{(n)}, \qquad g(\theta,t) = \left(\frac{1}{\theta}\right)^n I_{t\le\theta}, \qquad h(x_1,\ldots,x_n) \equiv 1,$$

then, by the factorization theorem, the sample maximum $X_{(n)}$ is sufficient for the
$U[0,\theta]$ family. The result does make some intuitive sense.

Here is now the general version of Basu's theorem.

Theorem 18.11 (General Basu Theorem). Let $\mathcal{F}_n = \{P_{\theta,n} : \theta\in\Theta\}$ be a family
of distributions. Suppose $T(X_1,\ldots,X_n)$ is sufficient for $\mathcal{F}_n$, and $S(X_1,\ldots,X_n)$
is ancillary under $\mathcal{F}_n$. Then $T$ and $S$ are independently distributed under each
$P_{\theta,n}\in\mathcal{F}_n$.

See Basu (1955) for a proof.

18.4.2 * Applications of Basu's Theorem to Probability

We had previously commented that the sufficient statistic by itself captures all of the
information about $\theta$ that the full knowledge of $X$ could have provided. On the other
hand, an ancillary statistic cannot provide any information about $\theta$, because its distribution does not even involve $\theta$. Basu's theorem says that a statistic which provides
all the information, and another that provides no information, must be independent,
provided the additional nonempty interior condition holds, in order to ensure completeness of the family $\mathcal{F}$. Thus, the concepts of information, sufficiency, ancillarity,
completeness, and independence come together in Basu's theorem. However, our
main interest is simply to use Basu's theorem as a convenient tool to arrive quickly
at some results that are purely results in the domain of probability. Here are a few
such examples.
Example 18.22 (Independence of Mean and Variance for a Normal Sample). Suppose $X_1, X_2, \ldots, X_n$ are iid $N(\mu,\sigma^2)$ for some $\mu, \sigma$. It was stated in Chapter 4 that
the sample mean $\bar{X}$ and the sample variance $s^2$ are independently distributed for
any $n$, and whatever $\mu$ and $\sigma$ are. We now prove it. For this, first we establish the
claim that if the result holds for $\mu = 0,\ \sigma = 1$, then it holds for all $\mu, \sigma$. Indeed, fix
any $\mu, \sigma$, and write $X_i = \mu + \sigma Z_i,\ 1\le i\le n$, where $Z_1,\ldots,Z_n$ are iid $N(0,1)$.
Now,

$$\left(\bar{X},\ \sum_{i=1}^n (X_i - \bar{X})^2\right) \stackrel{\mathcal{L}}{=} \left(\mu + \sigma\bar{Z},\ \sigma^2\sum_{i=1}^n (Z_i - \bar{Z})^2\right).$$

Therefore, $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ are independently distributed under $(\mu,\sigma)$ if and
only if $\bar{Z}$ and $\sum_{i=1}^n (Z_i - \bar{Z})^2$ are independently distributed. This is a step in getting
rid of the parameters $\mu, \sigma$ from consideration.

But, now, we import a parameter! Embed the $N(0,1)$ distribution into a larger
family of $\{N(\mu,1),\ \mu\in\mathbb{R}\}$ distributions. Consider now a fictitious sample
$Y_1, Y_2, \ldots, Y_n$ from $P_\mu = N(\mu,1)$. The joint density of $Y = (Y_1, Y_2, \ldots, Y_n)$
is a one-parameter exponential family density with the natural sufficient statistic
$T(Y) = \sum_{i=1}^n Y_i$. By Example 18.20, $\sum_{i=1}^n (Y_i - \bar{Y})^2$ is ancillary. The parameter
space for $\mu$ obviously has a nonempty interior, thus all the conditions of Basu's
theorem are satisfied, and therefore, under each $\mu$, $\sum_{i=1}^n Y_i$ and $\sum_{i=1}^n (Y_i - \bar{Y})^2$ are
independently distributed. In particular, they are independently distributed under
$\mu = 0$, that is, when the samples are iid $N(0,1)$, which is what we needed to prove.

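A simulation makes the conclusion of Example 18.22 tangible. The sketch below (our own illustration; uncorrelatedness is of course weaker than independence, but it is a quick visible symptom of it) draws many normal samples at an arbitrary $(\mu,\sigma)$ and shows that the empirical correlation between $\bar{X}$ and $s^2$ is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 2.0, 3.0, 5, 100_000   # arbitrary choices
data = rng.normal(mu, sigma, size=(reps, n))
xbar = data.mean(axis=1)
s2 = data.var(axis=1, ddof=1)               # the usual sample variance
print(np.corrcoef(xbar, s2)[0, 1])          # essentially zero
```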
Example 18.23 (An Exponential Distribution Result). Suppose $X_1, X_2, \ldots, X_n$ are
iid exponential random variables with mean $\lambda$. Then, by transforming $(X_1, X_2, \ldots, X_n)$ to

$$\left(\frac{X_1}{X_1+\cdots+X_n}, \ldots, \frac{X_{n-1}}{X_1+\cdots+X_n},\ X_1+\cdots+X_n\right),$$

one can show by carrying out the necessary Jacobian calculation (see Chapter 4)
that

$$\left(\frac{X_1}{X_1+\cdots+X_n}, \ldots, \frac{X_{n-1}}{X_1+\cdots+X_n}\right)$$

is independent of $X_1+\cdots+X_n$. We can show this without doing any calculations
by using Basu's theorem.

For this, once again, by writing $X_i = \lambda Z_i,\ 1\le i\le n$, where the $Z_i$ are iid
standard exponentials, first observe that

$$\left(\frac{X_1}{X_1+\cdots+X_n}, \ldots, \frac{X_{n-1}}{X_1+\cdots+X_n}\right)$$

is a (vector) ancillary statistic. Next observe that the joint density of $X = (X_1, X_2, \ldots, X_n)$ is a one-parameter exponential family, with the natural sufficient statistic $T(X) = X_1+\cdots+X_n$. Because the parameter space $(0,\infty)$ obviously contains
a nonempty interior, by Basu's theorem, under each $\lambda$,

$$\left(\frac{X_1}{X_1+\cdots+X_n}, \ldots, \frac{X_{n-1}}{X_1+\cdots+X_n}\right) \quad\text{and}\quad X_1+\cdots+X_n$$

are independently distributed.
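As a quick empirical illustration (ours), the following sketch checks that one of the normalized ratios is uncorrelated with the total, consistent with the independence just established.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 4, 100_000
x = rng.exponential(scale=2.0, size=(reps, n))  # arbitrary mean 2
total = x.sum(axis=1)
ratio = x[:, 0] / total                          # first normalized ratio
print(np.corrcoef(ratio, total)[0, 1])           # essentially zero
```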
Example 18.24 (A Covariance Calculation). Suppose $X_1,\ldots,X_n$ are iid $N(0,1)$,
and let $\bar{X}$ and $M_n$ denote the mean and the median of the sample set $X_1,\ldots,X_n$.
By using our old trick of importing a mean parameter $\mu$, we first observe that the
difference statistic $\bar{X} - M_n$ is ancillary. On the other hand, the joint density of
$X = (X_1,\ldots,X_n)$ is of course a one-parameter exponential family with the natural
sufficient statistic $T(X) = X_1+\cdots+X_n$. By Basu's theorem, $X_1+\cdots+X_n$ and
$\bar{X} - M_n$ are independent under each $\mu$, which implies

$$\mathrm{Cov}(X_1+\cdots+X_n,\ \bar{X}-M_n) = 0 \;\Rightarrow\; \mathrm{Cov}(\bar{X},\ \bar{X}-M_n) = 0$$
$$\Rightarrow\; \mathrm{Cov}(\bar{X}, M_n) = \mathrm{Cov}(\bar{X},\bar{X}) = \mathrm{Var}(\bar{X}) = \frac{1}{n}.$$

We have achieved this result without doing any calculations at all. A direct attack
on this problem requires handling the joint distribution of $(\bar{X}, M_n)$.
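The identity $\mathrm{Cov}(\bar{X}, M_n) = \frac{1}{n}$ is easy to check by simulation; the following sketch (ours, with $n = 9$ chosen arbitrarily) does so.

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 9, 200_000                     # n chosen arbitrarily
x = rng.standard_normal((reps, n))
xbar, med = x.mean(axis=1), np.median(x, axis=1)
print(np.cov(xbar, med)[0, 1], 1 / n)    # the two numbers should be close
```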
Example 18.25 (An Expectation Calculation). Suppose $X_1,\ldots,X_n$ are iid $U[0,1]$,
and let $X_{(1)}, X_{(n)}$ denote the smallest and the largest order statistics of $X_1,\ldots,X_n$.
Import a parameter $\theta > 0$, and consider the family of $U[0,\theta]$ distributions. We have
shown that the largest order statistic $X_{(n)}$ is sufficient; it is also complete. On the
other hand, the quotient $\frac{X_{(1)}}{X_{(n)}}$ is ancillary. To see this, again, write $(X_1,\ldots,X_n) \stackrel{\mathcal{L}}{=} (\theta U_1,\ldots,\theta U_n)$, where $U_1,\ldots,U_n$ are iid $U[0,1]$. As a consequence, $\frac{X_{(1)}}{X_{(n)}} \stackrel{\mathcal{L}}{=} \frac{U_{(1)}}{U_{(n)}}$.
So, $\frac{X_{(1)}}{X_{(n)}}$ is ancillary. By the general version of Basu's theorem, which works for any
family of distributions (not just an exponential family), it follows that $\frac{X_{(1)}}{X_{(n)}}$ and $X_{(n)}$
are independently distributed under each $\theta$. Hence,

$$E[X_{(1)}] = E\left[\frac{X_{(1)}}{X_{(n)}}\, X_{(n)}\right] = E\left[\frac{X_{(1)}}{X_{(n)}}\right] E[X_{(n)}]$$
$$\Rightarrow\; E\left[\frac{X_{(1)}}{X_{(n)}}\right] = \frac{E[X_{(1)}]}{E[X_{(n)}]} = \frac{\frac{1}{n+1}}{\frac{n}{n+1}} = \frac{1}{n}.$$

Once again, we can get this result by using Basu's theorem without doing any integrations or calculations at all.
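Again, a two-line simulation (ours) confirms $E[X_{(1)}/X_{(n)}] = \frac{1}{n}$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 6, 200_000
u = rng.random((reps, n))
print((u.min(axis=1) / u.max(axis=1)).mean(), 1 / n)  # both near 1/6
```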

Example 18.26 (A Weak Convergence Result Using Basu's Theorem). Suppose
$X_1, X_2, \ldots$ are iid random vectors distributed uniformly in the $d$-dimensional
unit ball. For $n\ge 1$, let $d_n = \min_{1\le i\le n} \|X_i\|$ and $D_n = \max_{1\le i\le n} \|X_i\|$. Thus,
$d_n$ measures the distance to the closest data point from the center of the ball, and
$D_n$ measures the distance to the farthest data point. We find the limiting distribution
of $\Delta_n = d_n/D_n$. Although this can be done by using other means, we do so by an
application of Basu's theorem.

Toward this, note that for $0\le u\le 1$,

$$P(d_n > u) = (1-u^d)^n; \qquad P(D_n > u) = 1 - u^{nd}.$$

As a consequence, for any $k\ge 1$,

$$E[D_n^k] = \int_0^1 ku^{k-1}(1-u^{nd})\,du = \frac{nd}{nd+k},$$

and

$$E[d_n^k] = \int_0^1 ku^{k-1}(1-u^d)^n\,du = \frac{n!\,\Gamma(\frac{k}{d}+1)}{\Gamma(n+\frac{k}{d}+1)}.$$

Now, embed the uniform distribution in the unit ball into the family of uniform
distributions in balls of radius $\theta$ and centered at the origin. Then, $D_n$ is complete
and sufficient (akin to Example 18.25), and $\Delta_n$ is ancillary. Therefore, once again,
by the general version of Basu's theorem, $D_n$ and $\Delta_n$ are independently distributed
under each $\theta > 0$, and so, in particular, under $\theta = 1$. Thus, for any $k\ge 1$,

$$E[d_n^k] = E[D_n^k \Delta_n^k] = E[D_n^k]\, E[\Delta_n^k]$$
$$\Rightarrow\; E[\Delta_n^k] = \frac{E[d_n^k]}{E[D_n^k]} = \frac{n!\,\Gamma(\frac{k}{d}+1)}{\Gamma(n+\frac{k}{d}+1)}\cdot\frac{nd+k}{nd}$$
$$\approx \frac{\Gamma(\frac{k}{d}+1)\, e^{-n}\, n^{n+1/2}}{e^{-n-k/d}\, (n+\frac{k}{d})^{n+\frac{k}{d}+1/2}}$$

(by using Stirling's approximation)

$$\approx \frac{\Gamma(\frac{k}{d}+1)}{n^{k/d}}.$$

Thus, for each $k\ge 1$,

$$E\left[\left(n^{1/d}\Delta_n\right)^k\right] \to \Gamma\left(\frac{k}{d}+1\right) = E[V^{k/d}] = E[(V^{1/d})^k],$$

where $V$ is a standard exponential random variable. This implies, because $V^{1/d}$ is
uniquely determined by its moment sequence, that

$$n^{1/d}\Delta_n \stackrel{\mathcal{L}}{\Rightarrow} V^{1/d},$$

as $n\to\infty$.
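This limit can be visualized by simulation. Since the norm of a uniform point in the $d$-dimensional unit ball is distributed as $U^{1/d}$ with $U \sim U[0,1]$, we can simulate the norms directly; the sketch below (ours, with arbitrary choices of $d$ and $n$) compares empirical quantiles of $n^{1/d}\Delta_n$ with those of $V^{1/d}$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, reps = 3, 500, 20_000
norms = rng.random((reps, n)) ** (1 / d)        # ||X_i|| ~ U^(1/d)
stat = n ** (1 / d) * norms.min(axis=1) / norms.max(axis=1)
v = rng.exponential(size=reps) ** (1 / d)       # the claimed limit V^(1/d)
for q in (0.25, 0.5, 0.75):                     # compare a few quantiles
    print(q, np.quantile(stat, q), np.quantile(v, q))
```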

18.5 Curved Exponential Family

There are some important examples in which the density (pmf) has the basic exponential family form $f(x\,|\,\theta) = e^{\sum_{i=1}^k \eta_i(\theta)T_i(x) - \psi(\theta)}\, h(x)$, but the assumption that
the dimension of $\Theta$ and that of the range space of $(\eta_1(\theta),\ldots,\eta_k(\theta))$ are the same
is violated; more precisely, the dimension of $\Theta$ is some positive integer $q$ strictly
less than $k$. Let us start with an example.

Example 18.27. Suppose $X \sim N(\mu,\mu^2),\ \mu\ne 0$. Writing $\theta = \mu$, the density of
$X$ is

$$f(x\,|\,\theta) = \frac{1}{\sqrt{2\pi}\,|\theta|}\, e^{-\frac{(x-\theta)^2}{2\theta^2}}\, I_{x\in\mathbb{R}}
= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2\theta^2} + \frac{x}{\theta} - \frac{1}{2} - \log|\theta|}\, I_{x\in\mathbb{R}}.$$

Writing $\eta_1(\theta) = -\frac{1}{2\theta^2},\ \eta_2(\theta) = \frac{1}{\theta},\ T_1(x) = x^2,\ T_2(x) = x,\ \psi(\theta) = \frac{1}{2} + \log|\theta|$, and
$h(x) = \frac{1}{\sqrt{2\pi}}\, I_{x\in\mathbb{R}}$, this is in the form $f(x\,|\,\theta) = e^{\sum_{i=1}^k \eta_i(\theta)T_i(x) - \psi(\theta)}\, h(x)$,
with $k = 2$, although $\theta\in\mathbb{R}$, which is only one-dimensional. The two functions
$\eta_1(\theta) = -\frac{1}{2\theta^2}$ and $\eta_2(\theta) = \frac{1}{\theta}$ are related to each other by the identity $\eta_1 = -\frac{\eta_2^2}{2}$,
so that a plot of $(\eta_1,\eta_2)$ in the plane would be a curve, not a straight line. Distributions of this kind go by the name of curved exponential family. The dimension of
the natural sufficient statistic is more than the dimension of $\Theta$ for such distributions.
Definition 18.10. Let $X = (X_1,\ldots,X_d)$ have a distribution $P_\theta,\ \theta\in\Theta\subseteq\mathbb{R}^q$.
Suppose $P_\theta$ has a density (pmf) of the form

$$f(x\,|\,\theta) = e^{\sum_{i=1}^k \eta_i(\theta)T_i(x) - \psi(\theta)}\, h(x),$$

where $k > q$. Then, the family $\{P_\theta,\ \theta\in\Theta\}$ is called a curved exponential family.
Example 18.28 (A Specific Bivariate Normal). Suppose $X = (X_1,X_2)$ has a bivariate normal distribution with zero means, standard deviations equal to one, and a
correlation parameter $\rho,\ -1 < \rho < 1$. The density of $X$ is

$$f(x\,|\,\rho) = \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{1}{2(1-\rho^2)}[x_1^2 + x_2^2 - 2\rho x_1x_2]}\, I_{x_1,x_2\in\mathbb{R}}$$
$$= \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{x_1^2+x_2^2}{2(1-\rho^2)} + \frac{\rho}{1-\rho^2}x_1x_2}\, I_{x_1,x_2\in\mathbb{R}}.$$

Therefore, here we have a curved exponential family with $q = 1,\ k = 2,\ \eta_1(\rho) = -\frac{1}{2(1-\rho^2)},\ \eta_2(\rho) = \frac{\rho}{1-\rho^2},\ T_1(x) = x_1^2 + x_2^2,\ T_2(x) = x_1x_2,\ \psi(\rho) = \frac{1}{2}\log(1-\rho^2)$,
and $h(x) = \frac{1}{2\pi}\, I_{x_1,x_2\in\mathbb{R}}$.
Example 18.29 (Poissons with Random Covariates). Suppose, given $Z_i = z_i,\ i = 1,2,\ldots,n$, $X_i$ are independent $\mathrm{Poi}(\lambda z_i)$ variables, and $Z_1, Z_2, \ldots, Z_n$ have some
joint pmf $p(z_1,z_2,\ldots,z_n)$. It is implicitly assumed that each $Z_i > 0$ with probability one. Then, the joint pmf of $(X_1,X_2,\ldots,X_n, Z_1,Z_2,\ldots,Z_n)$ is

$$f(x_1,\ldots,x_n, z_1,\ldots,z_n\,|\,\lambda) = \prod_{i=1}^n \frac{e^{-\lambda z_i}(\lambda z_i)^{x_i}}{x_i!}\, p(z_1,z_2,\ldots,z_n)\, I_{x_1,\ldots,x_n\in\mathcal{N}_0}\, I_{z_1,z_2,\ldots,z_n\in\mathcal{N}_1}$$
$$= e^{-\lambda\sum_{i=1}^n z_i + (\sum_{i=1}^n x_i)\log\lambda}\, \prod_{i=1}^n \frac{z_i^{x_i}}{x_i!}\, p(z_1,z_2,\ldots,z_n)\, I_{x_1,\ldots,x_n\in\mathcal{N}_0}\, I_{z_1,z_2,\ldots,z_n\in\mathcal{N}_1},$$

where $\mathcal{N}_0$ is the set of nonnegative integers, and $\mathcal{N}_1$ is the set of positive integers.
This is in the curved exponential family with

$$q = 1,\quad k = 2,\quad \eta_1(\lambda) = -\lambda,\quad \eta_2(\lambda) = \log\lambda,\quad T_1(x,z) = \sum_{i=1}^n z_i,\quad T_2(x,z) = \sum_{i=1}^n x_i,$$

and

$$h(x,z) = \prod_{i=1}^n \frac{z_i^{x_i}}{x_i!}\, p(z_1,z_2,\ldots,z_n)\, I_{x_1,\ldots,x_n\in\mathcal{N}_0}\, I_{z_1,z_2,\ldots,z_n\in\mathcal{N}_1}.$$

If we consider the covariates as fixed, the joint distribution of $(X_1,X_2,\ldots,X_n)$
becomes a regular one-parameter exponential family.

Exercises

Exercise 18.1. Show that the geometric distribution belongs to the one-parameter
exponential family if 0 < p < 1, and write it in the canonical form and by using the
mean parametrization.

Exercise 18.2 (Poisson Distribution). Show that the Poisson distribution belongs
to the one-parameter exponential family if $\lambda > 0$. Write it in the canonical form and
by using the mean parametrization.

Exercise 18.3 (Negative Binomial Distribution). Show that the negative binomial
distribution with parameters r and p belongs to the one-parameter exponential fam-
ily if r is considered fixed and 0 < p < 1. Write it in the canonical form and by
using the mean parametrization.

Exercise 18.4 * (Generalized Negative Binomial Distribution). Show that the
generalized negative binomial distribution with the pmf $f(x\,|\,p) = \frac{\Gamma(\alpha+x)}{\Gamma(\alpha)\,x!}\, p^{\alpha}(1-p)^x,\ x = 0,1,2,\ldots$ belongs to the one-parameter exponential family if $\alpha > 0$ is
considered fixed and $0 < p < 1$.

Show that the two-parameter generalized negative binomial distribution with the
pmf $f(x\,|\,\alpha,p) = \frac{\Gamma(\alpha+x)}{\Gamma(\alpha)\,x!}\, p^{\alpha}(1-p)^x,\ x = 0,1,2,\ldots$ does not belong to the two-parameter exponential family.

Exercise 18.5 * (Normal with Equal Mean and Variance). Show that the
$N(\mu,\mu)$ distribution belongs to the one-parameter exponential family if $\mu > 0$.
Write it in the canonical form and by using the mean parametrization.

Exercise 18.6 * (Hardy–Weinberg Law). Suppose genotypes at a single locus
with two alleles are present in a population according to the relative frequencies
$p^2$, $2pq$, and $q^2$, where $q = 1-p$, and $p$ is the relative frequency of the dominant
allele. Show that the joint distribution of the frequencies of the three genotypes in
a random sample of $n$ individuals from this population belongs to a one-parameter
exponential family if $0 < p < 1$. Write it in the canonical form and by using the
mean parametrization.

Exercise 18.7 (Beta Distribution). Show that the two-parameter Beta distribution
belongs to the two-parameter exponential family if the parameters $\alpha, \beta > 0$. Write
it in the canonical form and by using the mean parametrization.

Show that symmetric Beta distributions belong to the one-parameter exponential
family if the single parameter $\alpha > 0$.

Exercise 18.8 * (Poisson Skewness and Kurtosis). Find the skewness and
kurtosis of a Poisson distribution by using Theorem 18.3.

Exercise 18.9 * (Gamma Skewness and Kurtosis). Find the skewness and
kurtosis of a Gamma distribution, considering $\alpha$ as fixed, by using Theorem 18.3.

Exercise 18.10 * (Distributions with Zero Skewness). Show that the only distributions in a canonical one-parameter exponential family such that the natural
sufficient statistic has a zero skewness are the normal distributions with a fixed
variance.

Exercise 18.11 * (Identifiability of the Distribution). Show that distributions in
the nonsingular canonical one-parameter exponential family are identifiable; that is,
$P_{\eta_1} = P_{\eta_2}$ only if $\eta_1 = \eta_2$.

Exercise 18.12 * (Infinite Differentiability of Mean Functionals). Suppose $\{P_\theta,\ \theta\in\Theta\}$ is a one-parameter exponential family and $g(x)$ is a general function. Show
that at any $\theta\in\Theta^0$ at which $E_\theta[|g(X)|] < \infty$, $m(\theta) = E_\theta[g(X)]$ is infinitely
differentiable, and can be differentiated any number of times inside the integral
(sum).

Exercise 18.13 * (Normalizing Constant Determines the Distribution). Consider a canonical one-parameter exponential family density (pmf) $f(x\,|\,\eta) = e^{\eta x - \psi(\eta)}\, h(x)$. Assume that the natural parameter space $\mathcal{T}$ has a nonempty interior. Show that
$\psi(\cdot)$ determines $h(x)$.

Exercise 18.14. Calculate the mgf of a $(k+1)$-cell multinomial distribution by
using Theorem 18.7.

Exercise 18.15 * (Multinomial Covariances). Calculate the covariances in a
multinomial distribution by using Theorem 18.7.

Exercise 18.16 * (Dirichlet Distribution). Show that the Dirichlet distribution
defined in Chapter 4, with parameter vector $\alpha = (\alpha_1,\ldots,\alpha_{n+1}),\ \alpha_i > 0$ for all
$i$, is an $(n+1)$-parameter exponential family.

Exercise 18.17 * (Normal Linear Model). Suppose, given an $n\times p$ nonrandom
matrix $X$, a parameter vector $\beta\in\mathbb{R}^p$, and a variance parameter $\sigma^2 > 0$, $Y = (Y_1,Y_2,\ldots,Y_n) \sim N_n(X\beta,\ \sigma^2 I_n)$, where $I_n$ is the $n\times n$ identity matrix. Show that
the distribution of $Y$ belongs to a full rank multiparameter exponential family.

Exercise 18.18 (Fisher Information Matrix). For each of the following
distributions, calculate the Fisher information matrix:
(a) Two-parameter Beta distribution.
(b) Two-parameter Gamma distribution.
(c) Two-parameter inverse Gaussian distribution.
(d) Two-parameter normal distribution.
Exercise 18.19 * (Normal with an Integer Mean). Suppose $X \sim N(\mu,1)$,
where $\mu\in\{1,2,3,\ldots\}$. Is this a regular one-parameter exponential family?

Exercise 18.20 * (Normal with an Irrational Mean). Suppose $X \sim N(\mu,1)$,
where $\mu$ is known to be an irrational number. Is this a regular one-parameter exponential family?

Exercise 18.21 * (Normal with an Integer Mean). Suppose $X \sim N(\mu,1)$,
where $\mu\in\{1,2,3,\ldots\}$. Exhibit a function $g(X) \not\equiv 0$ such that $E_\mu[g(X)] = 0$ for
all $\mu$.
Exercise 18.22 (Application of Basu's Theorem). Suppose $X_1,\ldots,X_n$ is an iid
sample from a standard normal distribution, suppose $X_{(1)}, X_{(n)}$ are the smallest
and the largest order statistics of $X_1,\ldots,X_n$, and $s^2$ is the sample variance. Prove,
by applying Basu's theorem to a suitable two-parameter exponential family, that

$$E\left[\frac{X_{(n)} - X_{(1)}}{s}\right] = 2\,\frac{E[X_{(n)}]}{E(s)}.$$

Exercise 18.23 (Mahalanobis's $D^2$ and Basu's Theorem). Suppose $X_1,\ldots,X_n$
is an iid sample from a $d$-dimensional normal distribution $N_d(0,\Sigma)$, where $\Sigma$ is
positive definite. Suppose $S$ is the sample covariance matrix (see Chapter 5) and $\bar{X}$
the sample mean vector. The statistic $D_n^2 = n\bar{X}'S^{-1}\bar{X}$ is called the Mahalanobis
$D^2$-statistic. Find $E(D_n^2)$ by using Basu's theorem.
Hint: Look at Example 18.13, and Theorem 5.10.
Exercise 18.24 (Application of Basu's Theorem). Suppose $X_i,\ 1\le i\le n$ are iid
$N(\mu_1,\sigma_1^2)$, and $Y_i,\ 1\le i\le n$ are iid $N(\mu_2,\sigma_2^2)$, where $\mu_1,\mu_2\in\mathbb{R}$, and $\sigma_1^2,\sigma_2^2 > 0$. Let $\bar{X}, s_1^2$ denote the mean and the variance of $X_1,\ldots,X_n$, and $\bar{Y}, s_2^2$ denote
the mean and the variance of $Y_1,\ldots,Y_n$. Also let $r$ denote the sample correlation
coefficient based on the pairs $(X_i,Y_i),\ 1\le i\le n$. Prove that $\bar{X}, \bar{Y}, s_1^2, s_2^2, r$ are
mutually independent under all $\mu_1, \mu_2, \sigma_1, \sigma_2$.
Exercise 18.25 (Mixtures of Normal). Show that the mixture distribution
$0.5\,N(\mu,1) + 0.5\,N(\mu,2)$ does not belong to the one-parameter exponential family. Generalize this result to more general mixtures of normal distributions.
Exercise 18.26 (Double Exponential Distribution). (a) Show that the double exponential distribution with a known $\sigma$ value and an unknown mean does not
belong to the one-parameter exponential family, but the double exponential distribution with a known mean and an unknown $\sigma$ belongs to the one-parameter
exponential family.
(b) Show that the two-parameter double exponential distribution does not belong to
the two-parameter exponential family.

Exercise 18.27 * (A Curved Exponential Family). Suppose $X \sim \mathrm{Bin}(n,p)$, $Y \sim \mathrm{Bin}(m,p^2)$, and that $X, Y$ are independent. Show that the distribution of $(X,Y)$ is
a curved exponential family.
Exercise 18.28 (Equicorrelation Multivariate Normal). Suppose $(X_1,X_2,\ldots,X_n)$ are jointly multivariate normal with general means $\mu_i$, variances all one, and a
common pairwise correlation $\rho$. Show that the distribution of $(X_1,X_2,\ldots,X_n)$ is a
curved exponential family.
Exercise 18.29 (Poissons with Covariates). Suppose $X_1,X_2,\ldots,X_n$ are independent Poissons with $E(X_i) = \lambda e^{\beta z_i},\ \lambda > 0,\ -\infty < \beta < \infty$. The covariates
$z_1,z_2,\ldots,z_n$ are considered fixed. Show that the distribution of $(X_1,X_2,\ldots,X_n)$
is a curved exponential family.
Exercise 18.30 (Incomplete Sufficient Statistic). Suppose $X_1,\ldots,X_n$ are iid
$N(\mu,\mu^2),\ \mu\ne 0$. Let $T(X_1,\ldots,X_n) = (\sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2)$. Find a function
$g(T)$ such that $E_\mu[g(T)] = 0$ for all $\mu$, but $P_\mu(g(T) = 0) < 1$ for any $\mu$.
Exercise 18.31 * (Quadratic Exponential Family). Suppose the natural sufficient statistic $T(X)$ in some canonical one-parameter exponential family is $X$ itself.
By using the formula in Theorem 18.3 for the mean and the variance of the natural
sufficient statistic in a canonical one-parameter exponential family, characterize all
the functions $\psi(\eta)$ for which the variance of $T(X) = X$ is a quadratic function
of the mean of $T(X)$, that is, $\mathrm{Var}_\eta(X) \equiv a[E_\eta(X)]^2 + bE_\eta(X) + c$ for some
constants $a, b, c$.
Exercise 18.32 (Quadratic Exponential Family). Exhibit explicit examples of
canonical one-parameter exponential families which are quadratic exponential
families.
Hint: There are six of them, and some of them are common distributions, but not all.
See Morris (1982), Brown (1986).

References

Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory, Wiley, New York.
Basu, D. (1955). On statistics independent of a complete sufficient statistic, Sankhyā, 15, 377–380.
Bickel, P.J. and Doksum, K. (2006). Mathematical Statistics, Basic Ideas and Selected Topics, Vol. I, Prentice Hall, Upper Saddle River, NJ.
Brown, L.D. (1986). Fundamentals of Statistical Exponential Families, IMS Lecture Notes and Monographs Series, Hayward, CA.
Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics, Philos. Trans. Royal Soc. London, Ser. A, 222, 309–368.
Lehmann, E.L. (1959). Testing Statistical Hypotheses, Wiley, New York.
Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation, Springer, New York.
Morris, C. (1982). Natural exponential families with quadratic variance functions, Ann. Statist., 10, 65–80.
Chapter 19
Simulation and Markov Chain Monte Carlo

Simulation is a computer-based exploratory exercise that aids in understanding how


the behavior of a random or even a deterministic process changes in response to
changes in input or the environment. It is essentially the only option left when exact
mathematical calculations are impossible, or require an amount of effort that the user
is not willing to invest. Even when the mathematical calculations are quite doable, a
preliminary simulation can be very helpful in guiding the researcher to theorems that
were not a priori obvious or conjectured, and also to identify the more productive
corners of a particular problem. Although simulation in itself is a machine-based
exercise, credible simulation must be based on appropriate theory. A simulation
algorithm must be theoretically justified before we use it. This chapter gives a fairly
broad introduction to the classic theory and techniques of probabilistic simulation,
and also to some of the modern advents in simulation, particularly Markov chain
Monte Carlo (MCMC) methods based on ergodic Markov chain theory.
The classic theory of simulation includes such time-tested methods as the original
Monte Carlo, and textbook techniques of simulation from standard distributions in
common use. They involve a varied degree of sophistication. Markov chain Monte
Carlo is the name for a collection of simulation algorithms for simulating from
the distribution of very general types of random variables taking values in quite
general spaces. MCMC methods have truly revolutionized simulation because of an
inherent simplicity in applying them, the generality of their scopes, and the diversity
of applied problems in which some suitable form of MCMC has helped in making
useful practical advances. MCMC methods are the most useful when conventional
Monte Carlo is difficult or impossible to use. This happens to be the case when
one has a complicated distribution in a high-dimensional space, or the state space
of the underlying random variable is exotic or huge, or the distribution to simulate
from, called the target distribution, is known only up to a normalizing constant, and
this normalizing constant cannot be computed. The normalizing constant issue is
common when the target distribution is the posterior distribution of an unknown
parameter when the prior is not a simple one. The problem of huge state spaces
arises in numerous problems in image processing and statistical physics. Indeed,
MCMC methods originated in statistical physics, and only gradually made their
way into probability and statistics.


The principle underlying all MCMC methods is extremely simple. Given a target
distribution  on some state space S from which one wants to simulate, one simply
constructs a Markov chain Xn ; n D 0; 1; 2; : : : ; such that Xn has a unique stationary
distribution, and that stationary distribution is . So, if we simply generate succes-
sive states X0 ; X1 ; : : : ; XB ; XBC1 ; : : : ; Xn for some suitable large values of B; n,
then we can act as if XBC1 ; : : : ; Xn is a dependent sample of size n  B from . We
can then use these n  B values to approximate probabilities, to approximate expec-
tations, or for whatever reason we want to use them. The practical part is to devise
a Markov chain that has  as its stationary distribution, and it must be reasonably
easy to run this chain as a matter of convenient implementation.
There are a number of popular MCMC methods in use. These include the
Hastings algorithm, the Metropolis algorithm, and the Gibbs sampler. Implement-
ing these algorithms is usually a simple matter, and computer codes for many of
them are now publicly available. The difficult part in justifying MCMC methods is
answering the question of how long to let the chain run so as to allow it to get close
to stationarity. This is critical, because the stationary distribution of the chain is the
target distribution. In principle, the speed of convergence of the chain to stationar-
ity is exponential under mild conditions on the chain. General Markov chain theory
implies such exponential convergence. However, concrete statements on the exact
exponential speed, or asserting theoretical bounds on the closeness of the chain to
stationarity can usually be done only on a case-by-case basis, and typically each
new case needs a new creative technique.
We start with a treatment of conventional Monte Carlo and textbook simulation
techniques, and then provide the most popular MCMC algorithms with a selection
of applications. We then provide some of the available general theory on the ques-
tion of speed of convergence of a particular chain to stationarity. This is linked to
choosing the number B, the burn-in period, and the total run length n. These con-
vergence theorems form the most appealing theoretical aspect of MCMC methods.
They are also practically useful, because without a theorem ensuring that the chain
has come very close to its equilibrium state, the user can never be confident that the
simulation output is reliable.
For conventional Monte Carlo and techniques of simulation, some excellent
references are Fishman (1995), Ripley (1987), Robert and Casella (2004), and
Ross (2006). Markov Chain Monte Carlo started off with the two path-breaking
articles: Metropolis et al. (1953), and Hastings (1970). Geman and Geman (1984),
Tanner and Wong (1987), Smith and Roberts (1993), and Gelfand and Smith (1987)
are among the pioneering articles in the statistical literature. Excellent book-length
treatments of MCMC include Robert and Casella (2004), Gamerman (1997), and
Liu (2008). Geyer (1992), Gilks et al. (1995), and Gelman et al. (2003) are ex-
cellent readings on MCMC with an applied focus. Diaconis (2009) is an up-to-
date review. Literature on convergence of an MCMC algorithm has been growing
steadily. Brémaud (1999) is one of the best sources to read about general the-
ory of convergence of MCMC algorithms. Diaconis et al. (2008) is a tour de
force on convergence of the Gibbs sampler, and Diaconis and Saloff-Coste (1998)
on the Metropolis algorithm. Other important references on the difficult issue of
convergence are Diaconis and Stroock (1991), Tierney (1994), Athreya, Doss and
Sethuraman (1996), Propp and Wilson (1998), and Rosenthal (2002); Dimakos
(2001) is a useful survey. Various useful modifications of the basic MCMC have
been suggested to address specific important applications. Notable among them
are Green (1995), Besag and Clifford (1991), Fill (1998), Diaconis and Sturmfels
(1998), Higdon (1998), and Kendall and Thönnes (1999), among many others. Var-
ious other specific references are given in the sections.
19.1 The Ordinary Monte Carlo
The ordinary Monte Carlo is a simulation technique for approximating the expectation of a function $\psi(X)$ for a general random variable $X$, when the exact expectation cannot be found analytically, or by other numerical means, such as quadrature. The idea of Monte Carlo simulation originated around 1940 in connection with secret nuclear weapon projects in which physicists wanted to understand how the physical properties of neutrons would be affected by various possible scenarios following a collision with a nucleus. The name Monte Carlo was coined by Stanislaw Ulam and John von Neumann during that time.
19.1.1 Basic Theory and Examples
The basis for the ordinary Monte Carlo is Kolmogorov's SLLN (see Chapter 7), which says that if $X_1, X_2, \ldots$ are iid copies of $X$, the basic underlying random variable, then $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\psi(X_i)$ converges almost surely to $\mu = E[\psi(X)]$, as long as we know that the expectation exists. This is because, if $X_1, X_2, \ldots$ are iid copies of $X$, then $Z_i = \psi(X_i), i = 1, 2, \ldots$ are iid copies of $Z = \psi(X)$, and therefore, by the canonical SLLN, $\frac{1}{n}\sum_{i=1}^{n} Z_i$ converges almost surely to $E[Z]$. Therefore, provided that we can actually do the simulation in practice, we can simply simulate a large number of iid copies of $X$ and approximate the true value of $E[\psi(X)]$ by the sample mean $\frac{1}{n}\sum_{i=1}^{n}\psi(X_i)$. Of course, there will be a Monte Carlo error in this approximation, and if we run the simulation again, we get a different approximated value for $E[\psi(X)]$. Thus, some reliability measure for the Monte Carlo estimate is necessary. A quick one at hand is the variance of the Monte Carlo estimate

$$\mathrm{Var}(\hat{\mu}) = \frac{\mathrm{Var}[\psi(X)]}{n} = \frac{E[\psi^2(X)] - (E[\psi(X)])^2}{n}.$$

However, this involves quantities that we do not know, in particular $E[\psi(X)]$, the very quantity we set out to approximate! However, we can fall back on the sample variance

$$s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{Z})^2$$

to estimate this uncomputable exact variance; that is, a reliability measure could be

$$\widehat{\mathrm{Var}}(\hat{\mu}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}(Z_i - \bar{Z})^2,$$

where $Z_i = \psi(X_i), i = 1, 2, \ldots, n$. We can also compute confidence intervals for the true value $\mu$ of $E[\psi(X)]$. By the central limit theorem, for a large Monte Carlo sample size $n$, $\hat{\mu} \approx N(\mu, \frac{\sigma^2}{n})$, where $\sigma^2 = \mathrm{Var}(\psi(X))$. As we commented above, in a practical problem we would not know the true value of $\sigma^2$, and therefore the usual confidence interval for $\mu$, namely $\bar{Z} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, would not be usable. The standard proxy in such a case is the $t$ confidence interval $\bar{Z} \pm t_{\alpha/2,\,n-1}\frac{s_z}{\sqrt{n}}$, where $t_{\alpha/2,\,n-1}$ stands for the $100(1 - \frac{\alpha}{2})$th percentile of a $t$-distribution with $n - 1$ degrees of freedom (see Chapter 4).
In the special case that we want to find a Monte Carlo estimate of a probability $P(X \in A)$, the function of interest becomes an indicator function $\psi(X) = I_{X \in A}$, and $Z_i = I_{X_i \in A}, i = 1, 2, \ldots, n$. Then, the Monte Carlo estimate of $p = P(X \in A)$ is

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} Z_i = \frac{\#\{i : X_i \in A\}}{n}.$$

We can also construct confidence intervals on $p$ as in Chapter 1. The score confidence interval based on the Monte Carlo samples is

$$\frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n}}{1 + \frac{z_{\alpha/2}^2}{n}} \;\pm\; \frac{z_{\alpha/2}\sqrt{n}}{n + z_{\alpha/2}^2}\sqrt{\hat{p}(1 - \hat{p}) + \frac{z_{\alpha/2}^2}{4n}}.$$
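For readers who wish to experiment, here is a minimal Python sketch (an illustrative addition assuming NumPy and SciPy; the helper name mc_probability is hypothetical, not from the text) of the Monte Carlo estimate of a probability together with the score interval above.

```python
import numpy as np
from scipy import stats

def mc_probability(sample_in_A, alpha=0.05):
    """Monte Carlo estimate of p = P(X in A) with the score (Wilson)
    confidence interval, given a boolean array indicating X_i in A."""
    n = sample_in_A.size
    p_hat = sample_in_A.mean()
    z = stats.norm.ppf(1 - alpha / 2)
    center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z * np.sqrt(n) / (n + z**2)) * np.sqrt(p_hat * (1 - p_hat) + z**2 / (4 * n))
    return p_hat, (center - half, center + half)

# Illustration: estimate P(|Z| > 2) for Z ~ N(0, 1); true value ~ 0.0455.
rng = np.random.default_rng(7)
x = rng.standard_normal(10_000)
print(mc_probability(np.abs(x) > 2))
```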

In principle, this gives an acceptable numerical solution to our problem; we can give an approximate value for $E[\psi(X)]$, an accuracy measure, and also a confidence interval. It is important to understand that although the Monte Carlo principle is applicable very widely, and in any number of dimensions, the Monte Carlo sample size $n$ must be quite large to provide reliable estimates for the true value of the expectation. This is evidenced in some of the examples below.

Example 19.1 (A Cauchy Distribution Expectation). Suppose we want to evaluate


 D E.log jX j/, where X C.0; 1/, the standard Cauchy distribution. Actually,
the value of  can be analytically calculated, and  D 0 (this is a chapter exercise).
We use the Monte Carlo simulation method to approximate , and then we investi-
gate its accuracy. For this, we calculate the Monte Carlo estimate for  itself, and
then a 95% t confidence interval for . We use four different values of the Monte
Carlo sample size n; n D 100; 250; 500; 1000, to inspect the increase in accuracy
obtainable with an increase in the Monte Carlo sample size.
n      Monte Carlo Estimate of $\mu = 0$    95% Confidence Interval
100    .0714                                $.0714 \pm .2921$
250    $-.0232$                             $-.0232 \pm .2185$
500    .0712                                $.0712 \pm .1435$
1000   .0116                                $.0116 \pm .1005$

The Monte Carlo estimate itself oscillates. But the confidence interval gets tighter as the Monte Carlo sample size $n$ increases. Only for $n = 1000$ do the results of the Monte Carlo simulation approach even barely acceptable accuracy. It is common practice to use $n$ in several thousands when applying ordinary Monte Carlo for estimating one single $\mu$. If we have to estimate $E[\psi(X)]$ for several different choices of $\psi$, and if the functions $\psi$ have awkward behavior at regions of low density of $X$, then the Monte Carlo sample size has to be increased. Formal recommendations can be given by using a pilot or guessed value of $\sigma^2$, the true variance of $\psi(X)$.
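The experiment of this example is easy to reproduce; the following is a minimal Python sketch (an illustrative addition assuming NumPy and SciPy) that computes the Monte Carlo estimate of $\mu$ and the 95% $t$ confidence interval for several values of $n$.

```python
import numpy as np
from scipy import stats

# Monte Carlo estimate of mu = E(log|X|) for X ~ C(0, 1), with a 95% t
# confidence interval, as in Example 19.1; the true value is mu = 0.
rng = np.random.default_rng(1)
for n in (100, 250, 500, 1000):
    z = np.log(np.abs(rng.standard_cauchy(n)))   # Z_i = psi(X_i)
    mu_hat = z.mean()
    half = stats.t.ppf(0.975, n - 1) * z.std(ddof=1) / np.sqrt(n)
    print(n, round(mu_hat, 4), "+/-", round(half, 4))
```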
Example 19.2 (Estimating the Volume of a Set). Let $S$ be a set in a Euclidean space $\mathbb{R}^d$, and suppose we want to find the volume of $S$. Unless $S$ has a very specialized shape, an exact volume formula will be difficult or impossible to write (and especially so when $d$ is large). However, Monte Carlo can assist us in this difficult problem. We have to make some assumptions. We assume that $S$ is a bounded set, and that there is an explicit $d$-dimensional rectangle $R$ such that $S \subseteq R$. Without loss of generality, we may take $R$ to be $[0,1]^d$. Denote the volume of $S$ by $\gamma = \mathrm{Vol}(S)$. Note that $\gamma = \int_S dx$.

There is nothing probabilistic in the problem so far. It appears to be a purely mathematical problem. But we can think of $\gamma$ probabilistically by writing it as $\gamma = \int_S f(x)\,dx$, where $f(x) = I_{x \in R}$ is the density of the $d$-dimensional uniform distribution on $R = [0,1]^d$. Therefore, $\gamma = P(X \in S)$, where $X$ is distributed uniformly in $R$. We now realize the potential of Monte Carlo in estimating $\gamma$.

Indeed, let $X_1, X_2, \ldots, X_n$ be independent uniformly distributed points in $R$. Then, from our general discussion above, a Monte Carlo estimate of $\gamma$ is $\hat{\gamma} = \frac{\#\{i : X_i \in S\}}{n}$. We can also construct confidence intervals for $\gamma$ by following the score interval's formula given above. Thus, a potentially very difficult mathematical problem is reduced to simply simulating $n$ uniform random vectors from the rectangle $R$, which is the same as simulating $nd$ iid $U[0,1]$ random variables, an extremely simple task. Of course, it is necessary to remember that Monte Carlo is not going to give us the exact value of the volume of $S$. With luck, it will give a good approximation, which is useful.
Example 19.3 (Monte Carlo Estimate of the Volume of a Cone). This is a specific example of the application of the Monte Carlo method to estimate the volume of a set. Take the case of a right circular cone $S$ of base radius $r = 1$ and height $h = 1$. The true volume of $S$ is $\gamma = \frac{1}{3}\pi r^2 h = \frac{\pi}{3} = 1.047$. The apex aperture of $S$ is given by $\theta = 2\arctan(\frac{r}{h}) = 2 \cdot \frac{\pi}{4} = \frac{\pi}{2}$. Therefore, $S$ is described in Cartesian coordinates as

$$S = \{(x, y, z) : 0 \le z \le 1, \; x^2 + y^2 \le z^2\}.$$
$S$ is contained in the rectangle $R = [-1, 1] \otimes [-1, 1] \otimes [0, 1]$. To approximate the volume of $S$, we simulate $n$ random vectors $X_1, X_2, \ldots, X_n$ from $R$ and form the equation

$$\frac{\hat{\gamma}}{\mathrm{Vol}(R)} = \frac{\#\{i : X_i \in S\}}{n},$$

which gives the estimate of $\gamma$

$$\hat{\gamma} = \mathrm{Vol}(R)\,\frac{\#\{i : X_i \in S\}}{n} = 4\,\frac{\#\{i : X_i \in S\}}{n}.$$

The table below reports Monte Carlo estimates and 95% score confidence intervals for $\gamma$ for a range of values of $n$.

n        Monte Carlo Estimate of $\gamma = 1.047$    95% Confidence Interval
100      1.08                                        $1.08 \pm .343$
250      1.168                                       $1.168 \pm .224$
500      1.016                                       $1.016 \pm .152$
1000     1.100                                       $1.100 \pm .1105$
10,000   1.064                                       $1.064 \pm .035$

Once again, as in Example 19.1, we see that the Monte Carlo estimate itself oscillates, and the Monte Carlo error is not monotone decreasing in $n$. However, the width of the confidence interval consistently decreases as the Monte Carlo sample size $n$ increases. To get a really good estimate and a tight confidence interval, we appear to need a Monte Carlo sample size of about 10,000.
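A minimal Python sketch of this hit-or-miss volume estimate (an illustrative addition assuming NumPy) follows.

```python
import numpy as np

# Hit-or-miss Monte Carlo estimate of the volume of the cone
# S = {0 <= z <= 1, x^2 + y^2 <= z^2} inside R = [-1,1] x [-1,1] x [0,1],
# which has Vol(R) = 4; the true volume is pi/3 ~ 1.047.
rng = np.random.default_rng(3)
n = 10_000
x = rng.uniform(-1, 1, n)
y = rng.uniform(-1, 1, n)
z = rng.uniform(0, 1, n)
hits = (x**2 + y**2 <= z**2)
print("estimate:", 4 * hits.mean())
```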

Example 19.4 (A Bayesian Example). Monte Carlo methods can be used not only to approximate expectations, but also to approximate percentiles of a distribution. Precisely, if $X_1, X_2, \ldots$ are iid sample observations from a continuous distribution $F$ with a strictly positive density $f$, then for any $\alpha, 0 < \alpha < 1$, the sample quantile $F_n^{-1}(\alpha)$ converges almost surely to the corresponding quantile $F^{-1}(\alpha)$ of $F$. So, as in the case of Monte Carlo estimates for expectations, we can generate a Monte Carlo sample from $F$ and estimate $F^{-1}(\alpha)$ by $F_n^{-1}(\alpha)$.

Suppose $X \sim \mathrm{Bin}(m, p)$, and the unknown parameter $p$ is assigned a beta prior density with parameters $\alpha, \beta$. Then the posterior density of $p$ is another Beta, namely $\mathrm{Be}(x + \alpha, m - x + \beta)$ (see Chapter 3). To give a specific example, suppose $m = 100, x = 45, \alpha = \beta = 1$ (i.e., $p$ has a $U[0,1]$ prior). The percentiles of a Beta density do not have closed-form formulas. So, one has to resort to numerical methods to evaluate them.

We first estimate the posterior median by using a Monte Carlo sample for various values of the Monte Carlo sample size $n$. For comparison, a true value for the posterior median in this case is reported by Mathematica as 0.4507.

n     Monte Carlo Estimate of Posterior Median = 0.4507
50    .4584
100   .4463
250   .4486
500   .4553
1000  .4498

The estimates oscillate slightly. But even for a Monte Carlo sample size as small as $n = 50$, the estimate is impressively accurate. But change the problem to Monte Carlo estimation of an extreme quantile, say the 99.9th percentile of the posterior distribution of $p$. The true value is reported by Mathematica to be 0.6028. The Monte Carlo estimates for various $n$ are reported below. We can see that the quality of the estimation has deteriorated.

n       Monte Carlo Estimate of 99.9th Posterior Percentile = 0.6028
50      .5662
100     .5618
250     .5714
500     .5735
1000    .5889
5000    .5956
10,000  .5987

The need for a much larger Monte Carlo sample size is clearly seen from the table. Although we could estimate the posterior median accurately with a Monte Carlo sample size of 100, for the extreme quantile, we need five to ten thousand Monte Carlo samples.
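A minimal Python sketch of this Monte Carlo quantile estimation (an illustrative addition assuming NumPy) follows; it simulates from the $\mathrm{Be}(46, 56)$ posterior and reads off the sample median and the 99.9th sample percentile.

```python
import numpy as np

# Monte Carlo estimates of posterior quantiles when X ~ Bin(100, p),
# x = 45, and p has a U[0,1] prior, so the posterior is Beta(46, 56).
rng = np.random.default_rng(4)
for n in (50, 100, 250, 500, 1000, 5000, 10_000):
    p = rng.beta(46, 56, n)
    print(n, round(np.quantile(p, 0.5), 4), round(np.quantile(p, 0.999), 4))
```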

Example 19.5 (Computing a Nasty Integral). Monte Carlo methods can be extremely useful in computing the value of a complicated integral that cannot be evaluated in closed form. Monte Carlo is especially useful for this purpose in high dimensions, because methods of numerical integration are hard to implement and quite unreliable when the number of dimensions is even moderately high (as few as four). The basic idea is very simple. Suppose we wish to know the value of the definite integral $I = \int_{\mathbb{R}^d} f(x)\,dx$ for some (possibly complicated) function $f$. We are of course assuming we know that the integral exists. This must be verified mathematically before we start on the Monte Carlo journey. Monte Carlo cannot verify the existence of the integral.

The idea now is to use a suitable density function $g(x)$ on $\mathbb{R}^d$ and write $I$ as

$$I = \int_{\mathbb{R}^d} f(x)\,dx = \int_{\mathbb{R}^d} \frac{f(x)}{g(x)}\,g(x)\,dx = E_g\left[\frac{f(X)}{g(X)}\right].$$

Now the Monte Carlo method is usable. We simulate iid random vectors $X_1, \ldots, X_n$ from $g$, and approximate $I$ by $\hat{I} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(X_i)}{g(X_i)}$. The choice of $g$ is obviously not unique. It is usually chosen such that it is easy to simulate from $g$, and such that $\frac{f(x)}{g(x)}$ can be reliably computed. In other words, the function $\frac{f(x)}{g(x)}$ preferably should not have singularities, or cusps, or too many local maxima or minima. Some preliminary investigation on the choice of $g$ is needed, and in fact, this choice issue is related to a topic known as importance sampling, which we discuss later.

As a concrete example, suppose we want to find the value of

$$I = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{e^{-x^2 - y^2 - z^2 - w^2}}{\sqrt{1 + x^2 + y^2 + z^2 + w^2}}\,dx\,dy\,dz\,dw.$$

Suppose now we let

$$g(x, y, z, w) = \frac{1}{\pi^2}\,e^{-x^2 - y^2 - z^2 - w^2},$$

and

$$f(x, y, z, w) = \frac{e^{-x^2 - y^2 - z^2 - w^2}}{\sqrt{1 + x^2 + y^2 + z^2 + w^2}}.$$

Then,

$$I = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y, z, w)\,dx\,dy\,dz\,dw = E_g\left[\frac{f(X)}{g(X)}\right] = \pi^2\,E_g\left[\frac{1}{\sqrt{1 + X^2 + Y^2 + Z^2 + W^2}}\right].$$

Therefore, we can find a Monte Carlo estimate of $I$ by simulating $(X_i, Y_i, Z_i, W_i), i = 1, \ldots, n$, where $X_i, Y_i, Z_i, W_i$ are all iid $N(0, \frac{1}{2})$, and by using the estimate

$$\hat{I} = \pi^2\,\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{1 + X_i^2 + Y_i^2 + Z_i^2 + W_i^2}}.$$

The table below reports Monte Carlo estimates of $I$ for various values of $n$. The exact value of $I$ is useful for comparison and can be obtained after transformation to polar coordinates (see Chapter 4) to be

$$I = \pi^2\left[1 + \sqrt{\pi}\,e\,\big(\Phi(\sqrt{2}\,) - 1\big)\right] = 6.1297.$$
n       Monte Carlo Estimate of $I = 6.1297$
50      6.3503
100     6.2117
250     6.2264
500     6.1854
1000    6.1946
5000    6.0806
10,000  6.1044

By the time the Monte Carlo sample size is 10,000, we get fairly accurate estimates for the value of $I$.
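A minimal Python sketch of this calculation (an illustrative addition assuming NumPy) follows.

```python
import numpy as np

# Monte Carlo evaluation of the 4-d integral I by writing it as
# pi^2 * E[1/sqrt(1 + X^2 + Y^2 + Z^2 + W^2)] with X, Y, Z, W iid
# N(0, 1/2); the exact value is 6.1297.
rng = np.random.default_rng(5)
n = 10_000
q = rng.normal(0.0, np.sqrt(0.5), size=(n, 4))   # sd = sqrt(1/2)
est = np.pi**2 * np.mean(1.0 / np.sqrt(1.0 + (q**2).sum(axis=1)))
print(round(est, 4))
```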

Example 19.6 (Monte Carlo Evaluation of $\pi$). In numerous probability examples, the number $\pi$ arises in the formula for the probability of some suitable event $A$ in some random experiment, say $p = P(A) = h(\pi)$. Then, we can form a Monte Carlo estimate for $p$, say $\hat{p}$, and estimate $\pi$ as $h^{-1}(\hat{p})$, assuming that the function $h$ is one to one. This is not a very effective method, as it turns out. But the idea has an inherent appeal and we describe such a method in this example.

Suppose we fix a positive integer $N$, and let $X, Y$ be independent discrete uniforms on $\{1, 2, \ldots, N\}$. Then, it is known that

$$\lim_{N \to \infty} P(X, Y \text{ are coprime}) = \frac{6}{\pi^2}.$$

Here, coprime means that $X, Y$ do not have any common factors $> 1$. So, in principle, we may choose a large $N$, choose $n$ pairs $(X_i, Y_i)$ independently at random from the discrete set $\{1, 2, \ldots, N\}$, and find a Monte Carlo estimate $\hat{p}$ for $p = P(X, Y \text{ are coprime})$, and invert it to form an estimate for the value of $\pi$ as

$$\hat{\pi} = \sqrt{\frac{6}{\hat{p}}}.$$
The table below reports the results of such a Monte Carlo experiment.

N       n      Monte Carlo Estimate of $\pi = 3.14159$
500     100    3.0252
1000    100    3.0817
1000    250    3.1308
5000    250    3.2225
5000    500    3.2233
10,000  1000   3.1629
10,000  5000   3.1395

This is an interesting example of the application of Monte Carlo where two indices $N, n$ have to take large values simultaneously. Only when $N = 10{,}000$ and $n = 5{,}000$ do we come close to matching the second digit after the decimal.
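A minimal Python sketch of this experiment (an illustrative addition assuming NumPy) follows.

```python
import numpy as np
from math import gcd

# Estimate pi from the probability that two random integers in
# {1, ..., N} are coprime, which tends to 6/pi^2 as N grows.
rng = np.random.default_rng(6)
N, n = 10_000, 5_000
x = rng.integers(1, N + 1, n)
y = rng.integers(1, N + 1, n)
p_hat = np.mean([gcd(a, b) == 1 for a, b in zip(x, y)])
print(np.sqrt(6 / p_hat))
```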

19.1.2 Monte Carlo P-Values

Suppose we have a statistical hypothesis-testing problem in which we wish to test a particular null hypothesis $H_0$ against a particular alternative hypothesis $H_1$. We have at our disposal a data vector $X^{(m)} = (X_1, \ldots, X_m)$. Often, the testing is done by a judicious choice of a statistic $T_m = T_m(X_1, \ldots, X_m)$ such that large values of $T_m$ cast doubt on the truth of the null hypothesis $H_0$. Suppose that for the dataset $(x_1, \ldots, x_m)$ that is actually observed, the statistic $T_m$ takes the value $T_m(x_1, \ldots, x_m) = t_m$. In that case, it is common statistical practice to calculate the $P$-value defined as

$$p_m = p_m(X_1, \ldots, X_m) = P_{H_0}(T_m > t_m),$$

and reject the null hypothesis if $p_m$ is small (say, smaller than 0.01). The program assumes that $p_m = P_{H_0}(T_m > t_m)$ can be computed (and that it does not depend on any unknown parameters not specified by $H_0$). However, except in some special cases, we do not know the exact null distribution of our test statistic $T_m$, and so, computing the $P$-value will require some suitable approximation method. The Monte Carlo $P$-value method simply simulates a lot of datasets $(X_1, \ldots, X_m)$ under the null, computes $T_m(X_1, \ldots, X_m)$ for each generated dataset, and then computes the percentile rank of our value $t_m$ among these synthetic sets of values of $T_m$. For example, if we generate $n = 99$ additional datasets, and we find our value $t_m$ to be the 98th order statistic among these 100 values of $T_m$, we declare the Monte Carlo $P$-value to be 0.02. If $T_m(X_1, \ldots, X_m)$ has a continuous distribution under $H_0$, then the rank of $t_m$ among the $n + 1$ values of $T_m$, say $R_{m,n}$, is simply a discrete uniform random variable on the set $\{1, 2, \ldots, n + 1\}$, and therefore, we can estimate the true value of $p_m$ by using $\frac{n + 1 - R_{m,n}}{n + 1}$. This lets us avoid a potentially impossible distributional calculation for the exact evaluation of $p_m$.
The idea of a Monte Carlo $P$-value is attractive. But it should be noted that if we repeat the simulation, we get a different $P$-value by this method for the same original dataset. Second, one can work out at best a case-by-case Monte Carlo sample size necessary for accurate approximation of the true value of $p_m$. Such calculations are very likely to need the same difficult calculations that we are trying to avoid. Also, if the test statistic $T_m$ is hard to compute, then evaluation of a Monte Carlo $P$-value may require a prohibitive amount of computing. There will often be theoretically grounded alternative methods for approximating the true value of $p_m$, such as central limit theorems or Edgeworth expansions. So, use of Monte Carlo $P$-values need not be the only method available to us, or the best method available to us in a given problem. Monte Carlo $P$-values were originally suggested in Barnard (1963) and Besag and Clifford (1989). They have become quite popular in certain applied sciences, notably biology. For deeper theoretical studies of Monte Carlo $P$-values, one can see Hall and Titterington (1989).

Example 19.7 (Testing for the Center of Cauchy). This is an example where the calculation of the exact $P$-value using any reasonable test statistic is not easy. Suppose $X_1, \ldots, X_m$ are iid $C(\theta, 1)$ and suppose we wish to test $H_0 : \theta = 0$ against $\theta > 0$. As a test statistic, the sample mean $\bar{X}$ is an extremely poor choice in this case. The sample median is reasonable, and the unique maximum likelihood estimate of $\theta$ is asymptotically the best, but it is not easy to compute the maximum likelihood estimate in this case. Although the sample median and the maximum likelihood estimate are both asymptotically normal, neither of them has a tractable CDF for given $m > 2$. Therefore, an exact calculation of the $P$-value $p_m = P_{\theta = 0}(T_m > t_m)$ is essentially impossible using either of these two statistics.

With $m = 25$ and various values of the Monte Carlo sample size $n$, the following table reports the Monte Carlo $P$-values with $T_m$ as the sample median. An approximation to the $P$-value obtained from the asymptotic normality result $\sqrt{m}(T_m - \theta) \stackrel{\mathcal{L}}{\Rightarrow} N(0, \frac{\pi^2}{4})$ is $p_m \approx 0.386$. The original dataset of size $m = 25$ was also generated under the null, that is, from $C(0, 1)$.

n      Monte Carlo $P$-Value
50     0.32
100    0.38
250    0.388
500    0.374
1000   0.371

The Monte Carlo $P$-values stabilize when the Monte Carlo sample size $n$ is about 100. They closely match the normal approximation $p_m \approx 0.386$. However, it is impossible to say which one is more accurate, the Monte Carlo $P$-value or the normal approximation.
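A minimal Python sketch of the Monte Carlo $P$-value computation of this example (an illustrative addition assuming NumPy; it uses the proportion of simulated statistics exceeding the observed value, which is essentially the rank formula above) follows.

```python
import numpy as np

# Monte Carlo P-value for testing theta = 0 in C(theta, 1), with the
# sample median as the test statistic; the dataset is generated under H0.
rng = np.random.default_rng(8)
m, n = 25, 1000
data = rng.standard_cauchy(m)        # observed dataset
t_obs = np.median(data)
t_null = np.median(rng.standard_cauchy((n, m)), axis=1)
p_value = np.mean(t_null > t_obs)    # proportion of null statistics above t_obs
print(p_value)
```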

19.1.3 Rao–Blackwellization

Let $X, Y$ be any two random variables such that $\mathrm{Var}(X)$ and $\mathrm{Var}(X \mid Y)$ exist. Then, the iterated variance formula says that $\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E(X \mid Y))$ (see Chapter 2). As a consequence, $\mathrm{Var}(E(X \mid Y)) \le \mathrm{Var}(X)$. Therefore, if we define $h(Y) = E(X \mid Y)$, then $E(h(Y)) = E[E(X \mid Y)] = E(X)$, and $\mathrm{Var}(h(Y)) \le \mathrm{Var}(X)$. This says that as an estimate of $\mu = E(X)$, the conditional expectation $h(Y) = E(X \mid Y)$ is at least as good an unbiased estimate as $X$ itself. The technique is similar to the well-known technique in statistics of conditioning with respect to a sufficient statistic, which is due to David Blackwell and C. R. Rao. So, $h(Y)$ is called a Rao–Blackwellized Monte Carlo estimate for $\mu = E(X)$. For the method to be applicable, the function $h(y) = E(X \mid Y = y)$ must be explicitly computable for any $y$. To apply this method, we need to choose a random variable $Y$, simulate iid values $Y_1, \ldots, Y_n$ from the distribution of $Y$, calculate $h(Y_1), \ldots, h(Y_n)$, and estimate $\mu$ by using $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} h(Y_i)$.
Example 19.8. Suppose $X, Y$ are iid standard normal variables and that we wish to estimate $\mu = E(e^{tXY})$. The ordinary Monte Carlo estimate of it requires a simulation $(X_i, Y_i), i = 1, \ldots, n$, all mutually independent standard normal variables, and estimates $\mu$ as $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} e^{tX_iY_i}$. For the Rao–Blackwellized estimate, we condition on $Y$, so that we have

$$h(y) = E(e^{tXY} \mid Y = y) = E(e^{tyX} \mid Y = y) = E(e^{tyX}) = e^{\frac{t^2y^2}{2}}.$$

The Rao–Blackwellized Monte Carlo estimate is $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} h(Y_i) = \frac{1}{n}\sum_{i=1}^{n} e^{\frac{t^2Y_i^2}{2}}$. Note that this requires simulation of only $Y_1, \ldots, Y_n$, and not the pairs $(X_i, Y_i), i = 1, \ldots, n$.
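A minimal Python sketch comparing the two estimates of this example (an illustrative addition assuming NumPy) follows; here the true value is $E(e^{tXY}) = (1 - t^2)^{-1/2}$ for $|t| < 1$.

```python
import numpy as np

# Ordinary vs. Rao-Blackwellized Monte Carlo estimates of
# mu = E(exp(t X Y)) for iid standard normal X, Y.
rng = np.random.default_rng(9)
t, n = 0.5, 10_000
x, y = rng.standard_normal(n), rng.standard_normal(n)
ordinary = np.mean(np.exp(t * x * y))
rao_blackwell = np.mean(np.exp(t**2 * y**2 / 2))   # h(Y) = E(e^{tXY} | Y)
print(ordinary, rao_blackwell)   # true value: 1/sqrt(1 - t^2) ~ 1.1547
```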

19.2 Textbook Simulation Techniques

As was commented before, the entire Monte Carlo method is based on the assumption that we can in fact simulate the Monte Carlo sample observations $X_1, \ldots, X_n$ from whatever distribution is the relevant one for the given problem. Widely available commercial software exists for simulating from nearly every common distribution in one dimension, and many common distributions in higher dimensions. With the modern high-speed computers that most researchers now generally use, the efficiency issue of these commercial algorithms has become less important than before. Still, the fact is that often one has to customize one's own algorithm to a given problem, either because the problem is unusual and commercial software is not available, or because commercial software is unacceptably slow. We do not intend to delve into customized simulation algorithms for special problems in this text. We give a basic description of a few widely used simulation methods, and some easily applied methods for 25 special distributions, for the purpose of quick reference. Textbook and more detailed treatments of simulation from standard distributions are available in Fishman (1995), Robert and Casella (2004), and Ross (2006), among others. Schmeiser (1994, 2001) provides lucidly written summary accounts.

19.2.1 Quantile Transformation and Accept–Reject

Quantile Transformation. We are actually already familiar with this method (see Chapter 1 and Chapter 6). Suppose $F$ is a continuous CDF on the real line with the quantile function $F^{-1}$. Suppose $X \sim F$. Then $U = F(X) \sim U[0,1]$. Therefore, to simulate a value of $X \sim F$, we can simulate a value of $U \sim U[0,1]$ and use $X = F^{-1}(U)$. As long as the quantile function $F^{-1}$ has a formula, this method will work for any one-dimensional random variable $X$ with a continuous CDF.

Example 19.9. Suppose we want to simulate a value for $X \sim \mathrm{Exp}(1)$. The quantile function of the standard exponential distribution is $F^{-1}(p) = -\log(1 - p), 0 < p < 1$. Therefore, to simulate $X$, we can generate $U \sim U[0,1]$, and use $X = -\log(1 - U)$ ($-\log U$ will work too).

As another example, suppose we want to simulate $X \sim \mathrm{Be}(\frac{1}{2}, \frac{1}{2})$, the Beta distribution with parameters $\frac{1}{2}$ each. The density of the $\mathrm{Be}(\frac{1}{2}, \frac{1}{2})$ distribution is $\frac{1}{\pi\sqrt{x(1 - x)}}, 0 < x < 1$. By direct integration, we get that the CDF is $F(x) = \frac{2}{\pi}\arcsin(\sqrt{x})$. Therefore, the quantile function is $F^{-1}(p) = \sin^2(\frac{\pi}{2}p)$, and so, to simulate $X \sim \mathrm{Be}(\frac{1}{2}, \frac{1}{2})$, we generate $U \sim U[0,1]$, and use $X = \sin^2(\frac{\pi}{2}U)$.

The quantile function $F^{-1}$ does not have a closed-form formula if $F$ is a normal, or a Beta, or a Gamma, and so on. In such cases, the use of the quantile transform method by numerically evaluating $F^{-1}(U)$ will cause a slight error. The error may be practically negligible. But, for these distributions, simulated values are usually obtained in practice by using special techniques, rather than the quantile transform technique. For example, simulations for a standard normal variable are obtainable by using the Box–Muller method (see the chapter exercises in Chapter 4).
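A minimal Python sketch of the quantile transformation method for these two examples (an illustrative addition assuming NumPy) follows.

```python
import numpy as np

# Quantile transformation: simulate from Exp(1) and from Be(1/2, 1/2)
# by transforming U[0,1] draws through the respective quantile functions.
rng = np.random.default_rng(10)
u = rng.uniform(size=100_000)
exp_sample = -np.log(1.0 - u)                 # F^{-1}(u) for Exp(1)
beta_sample = np.sin(np.pi * u / 2.0) ** 2    # F^{-1}(u) for Be(1/2, 1/2)
print(exp_sample.mean(), beta_sample.mean())  # should be near 1 and 0.5
```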

Accept–Reject. The accept–reject method is useful when it is difficult to directly simulate from a target density $f(x)$ on the real line, but we can construct another density $g(x)$ such that $\frac{f(x)}{g(x)}$ is uniformly bounded, and it is much easier to simulate from $g$. Then we do simulate $X$ from $g$, and retain it or toss it according to some specific rule. The set of $X$ values that are retained may be treated as independent simulations from the original target density $f$. Because an $X$ value is either retained or discarded, depending on whether it passes the admission rule, the method is called the accept–reject method. The density $g$ is called the envelope density.

The method proceeds as follows.

(a) Find a density function $g$ and a finite constant $c$ such that $\frac{f(x)}{g(x)} \le c$ for all $x$.
(b) Generate $X \sim g$.
(c) Generate $U \sim U[0,1]$, independently of $X$.
(d) Retain this generated $X$ value if $U \le \frac{f(X)}{cg(X)}$.
(e) Repeat the steps until the required number of $n$ values of $X$ has been obtained.

The following result justifies this indirect algorithm for generating values from the actual target density $f$.

Theorem 19.1. Let $X \sim g$, and $U$, independent of $X$, be distributed as $U[0,1]$. Then the conditional density of $X$ given that $U \le \frac{f(X)}{cg(X)}$ is $f$.
Proof. Denote the CDF corresponding to $f$ by $F$. Then,

$$P\left(X \le x \;\Big|\; U \le \frac{f(X)}{cg(X)}\right) = \frac{P\left(X \le x,\, U \le \frac{f(X)}{cg(X)}\right)}{P\left(U \le \frac{f(X)}{cg(X)}\right)} = \frac{\int_{-\infty}^{x}\int_{0}^{\frac{f(t)}{cg(t)}} g(t)\,du\,dt}{\int_{-\infty}^{\infty}\int_{0}^{\frac{f(t)}{cg(t)}} g(t)\,du\,dt}$$

$$= \frac{\int_{-\infty}^{x}\frac{f(t)}{cg(t)}\,g(t)\,dt}{\int_{-\infty}^{\infty}\frac{f(t)}{cg(t)}\,g(t)\,dt} = \frac{\int_{-\infty}^{x} f(t)\,dt}{\int_{-\infty}^{\infty} f(t)\,dt} = \frac{F(x)}{1} = F(x). \qquad \square$$

Example 19.10 (Generating from Normal via Accept–Reject). Suppose we want to generate $X \sim N(0,1)$. Thus our target density $f$ is just the standard normal density. Since there is no formula for the quantile function of the standard normal, the quantile transform method is usually not used to generate from a standard normal distribution. We can, however, use the accept–reject method to generate standard normal values. For this, we need an envelope density $g$ such that $\frac{f(x)}{g(x)}$ is uniformly bounded, and furthermore, it should be easier to sample from this $g$.

One possibility is to use the standard double exponential density $g(x) = \frac{1}{2}e^{-|x|}$. Then,

$$\frac{f(x)}{g(x)} = \frac{\frac{1}{\sqrt{2\pi}}e^{-x^2/2}}{\frac{1}{2}e^{-|x|}} = \sqrt{\frac{2}{\pi}}\,e^{|x| - x^2/2} \le \sqrt{\frac{2e}{\pi}}$$

for all real $x$, by elementary differential calculus.

We take $c = \sqrt{\frac{2e}{\pi}}$ in the accept–reject scheme and, of course, $g$ as the standard double exponential density. The scheme works out to the following: generate $U \sim U[0,1]$ and a double exponential value $X$, and retain $X$ if $U \le e^{|X| - X^2/2 - \frac{1}{2}}$. Note that a double exponential value can be generated easily by several means:

(a) Generate a standard exponential value $Y$ and assign it a random sign ($+$ or $-$ with equal probability).
(b) Generate two independent standard exponential values $Y_1, Y_2$ and set $X = Y_1 - Y_2$.
(c) Use the quantile transform method, as there is a closed-form formula for the quantile function of the standard double exponential.

It is helpful to understand this example by means of a plot. The generated $X$ value is retained if and only if the pair $(X, U)$ falls below the graph of the function in Fig. 19.1, namely the function $u = e^{|x| - x^2/2 - \frac{1}{2}}$. We can see that, of the two generated values shown in the figure, one will be accepted and the other rejected.
Fig. 19.1 Accept–reject plot for generating $N(0,1)$ values
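A minimal Python sketch of this accept–reject scheme (an illustrative addition assuming NumPy) follows; the observed acceptance rate should be close to $\sqrt{\pi/(2e)} \approx .76$, as computed in Example 19.12 below.

```python
import numpy as np

# Accept-reject generation of N(0,1) values using the standard double
# exponential as the envelope density.
rng = np.random.default_rng(11)
n = 100_000
x = rng.exponential(size=n) * rng.choice([-1.0, 1.0], size=n)  # double exponential
u = rng.uniform(size=n)
accepted = x[u <= np.exp(np.abs(x) - x**2 / 2 - 0.5)]
print(accepted.size / n, accepted.mean(), accepted.std())  # ~0.76, ~0, ~1
```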

Example 19.11 (Generating a Beta). Special-purpose customized algorithms are the most efficient for generating simulations from general Beta distributions. For values of the parameters $\alpha, \beta$ of the Beta distribution in various ranges, separate treatments are necessary for producing the most efficient schemes. If $\alpha$ and $\beta$ are both larger than one, then a Beta density is strictly unimodal with an interior mode at $\frac{\alpha - 1}{\alpha + \beta - 2}$. As a result, for $\alpha, \beta > 1$, a Beta density is uniformly bounded, and hence the $U[0,1]$ density can serve as an envelope density for generating from such a Beta distribution by using the accept–reject scheme. Precisely, generate $U, X \sim U[0,1]$ (independently), and retain the $X$ value if $U \le \frac{f(X)}{\sup_x f(x)}$, where

$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha - 1}(1 - x)^{\beta - 1}, \quad 0 < x < 1.$$

Because

$$\sup_x f(x) = f\left(\frac{\alpha - 1}{\alpha + \beta - 2}\right) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{(\alpha - 1)^{\alpha - 1}(\beta - 1)^{\beta - 1}}{(\alpha + \beta - 2)^{\alpha + \beta - 2}},$$

the scheme finally works out to the following.

Generate independent $U, X \sim U[0,1]$ and retain the $X$ value if

$$U \le \frac{X^{\alpha - 1}(1 - X)^{\beta - 1}(\alpha + \beta - 2)^{\alpha + \beta - 2}}{(\alpha - 1)^{\alpha - 1}(\beta - 1)^{\beta - 1}}.$$
It is to be noted that this accept–reject scheme with $g$ as the $U[0,1]$ density would not be very efficient if $\alpha, \beta$ are large. If $\alpha, \beta$ are large, the Beta density tapers off very rapidly from its mode, and the uniform envelope density would be a poor choice for $g$. For $\alpha, \beta$ not too far from 1, the scheme of this example would be reasonably efficient.

An important practical issue about an accept–reject scheme is the acceptance percentage. One must strive to make this as large as possible in order to increase the efficiency of the method. This is achieved by choosing $c$ to be the smallest possible number that one can, as the following result shows.

Proposition. For an accept–reject scheme, the probability that an $X \sim g$ is accepted is $\frac{1}{c}$, and is maximized when $c$ is chosen to be $c = \sup_x \frac{f(x)}{g(x)}$.

We have essentially already proved it, because the acceptance probability is

$$P\left(U \le \frac{f(X)}{cg(X)}\right) = \int_{-\infty}^{\infty}\int_{0}^{\frac{f(t)}{cg(t)}} g(t)\,du\,dt = \int_{-\infty}^{\infty}\frac{f(t)}{cg(t)}\,g(t)\,dt = \int_{-\infty}^{\infty}\frac{f(t)}{c}\,dt = \frac{1}{c}.$$

Because any $c$ that can be chosen must be at least as large as $\sup_x \frac{f(x)}{g(x)}$, obviously $\frac{1}{c}$ is maximized by choosing $c = \sup_x \frac{f(x)}{g(x)}$.
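A minimal Python sketch of the accept–reject scheme of Example 19.11 (an illustrative addition assuming NumPy) follows; it also prints the observed acceptance rate, which approximates $\frac{1}{c}$.

```python
import numpy as np

# Accept-reject generation of Be(alpha, beta) values for alpha, beta > 1,
# with the U[0,1] envelope density of Example 19.11.
rng = np.random.default_rng(12)
a, b, n = 2.0, 3.0, 100_000
x = rng.uniform(size=n)
u = rng.uniform(size=n)
ratio = (x**(a - 1) * (1 - x)**(b - 1) * (a + b - 2)**(a + b - 2)
         / ((a - 1)**(a - 1) * (b - 1)**(b - 1)))
sample = x[u <= ratio]
print(sample.size / n, sample.mean())  # acceptance rate; mean ~ a/(a+b) = 0.4
```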
Example 19.12 (Efficiency of Accept–Reject Scheme). In Example 19.10, we used an accept–reject scheme with $g$ as the standard double exponential density to simulate from the standard normal distribution. Because

$$\sup_x \frac{f(x)}{g(x)} = \sqrt{\frac{2e}{\pi}},$$

the acceptance rate, by the previous proposition, would be $\sqrt{\frac{\pi}{2e}} = .7602$. If we generate 100 $X$-values from $g$, we can expect that about 75 of them would be retained, and the others discarded.

Suppose now we use $g(x) = \frac{1}{\pi(1 + x^2)}$, the standard Cauchy density. This density also satisfies the requirement that $\sup_x \frac{f(x)}{g(x)} < \infty$, and in fact with this choice of $g$, $\sup_x \frac{f(x)}{g(x)} = \sqrt{\frac{2\pi}{e}}$. Therefore, with $g$ as the standard Cauchy density, the acceptance rate would be $\sqrt{\frac{e}{2\pi}} = .6577$, which is lower than the acceptance rate when $g$ is the standard double exponential. In general, when using an accept–reject scheme, one should choose the envelope density $g$ to be a density that matches the target density $f$ as closely as possible, while being easier to simulate from. The standard Cauchy density does not match the standard normal density well, and so we get a lower acceptance rate. Rubin (1976) gives some ideas on improving the efficiency of acceptance–rejection schemes.
19.2.2 Importance Sampling and Its Asymptotic Properties

There are two different ways to think about importance sampling. The more traditional one is to go back to the primary problem that Monte Carlo wants to solve, namely to approximate the value of an expectation $\mu = \int \psi_0(x)\,dF_0(x)$ for some function $\psi_0$ and some CDF $F_0$. However, $(\psi_0, F_0)$ is not the only pair $(\psi, F)$ for which $\int \psi(x)\,dF(x)$ equals the specific number $\mu$. Indeed, given any other CDF $F_1$,

$$\mu = \int \psi_0(x)\,dF_0(x) = \int \psi_0(x)\,\frac{dF_0}{dF_1}(x)\,dF_1(x) = \int \ell(x)\psi_0(x)\,dF_1(x),$$

where $\ell(x) = \frac{dF_0}{dF_1}(x)$. If $F_0, F_1$ have densities $f_0, f_1$, then $\ell(x) = \frac{f_0(x)}{f_1(x)}$; if $F_0, F_1$ have respective pmfs $f_0, f_1$, then also $\ell(x) = \frac{f_0(x)}{f_1(x)}$ (if one is continuous and the other discrete, $\frac{dF_0}{dF_1}(x)$ need not be defined; the general treatment needs the use of Radon–Nikodym derivatives).

This raises the interesting possibility that we can sample from a general $F_1$, and subsequently use the usual Monte Carlo estimate

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\ell(X_i)\psi_0(X_i),$$

where $X_1, X_2, \ldots, X_n$ is a Monte Carlo sample from $F_1$. Importance sampling poses the problem of finding an optimal choice of $F_1$ from which to sample, so that $\hat{\mu}$ has the smallest possible variance. The distribution $F_1$ that ultimately gets chosen is called the importance sampling distribution.

A more contemporary view of importance sampling is that we do not approach importance sampling as an optimization problem, but because the circumstances force us to consider different sampling distributions $F$, and possibly even different functions $\psi$, within the boundaries of the same general problem that we are trying to solve. For example, in studies of Bayesian robustness, it is imperative that one consider different prior distributions on the parameter of the problem, which would force us to look at different posterior distributions for the parameter. If we want to compute the mean and the variance of all these posterior distributions, then we automatically have multiple pairs $(\psi, F)$ to consider simultaneously. The traditional view of importance sampling as a technique to decrease Monte Carlo variance was perhaps more relevant when simulation was not as cheap as it is today. The modern view that we have to consider importance sampling out of necessity is arguably the more prevalent viewpoint now.
Coming back to the question of the choice of the importance sampling distribution, we present the proposed solution in a format that allows us to confront a recurring problem in Bayesian calculations, which is that the basic underlying CDF (density) $F_0$ as well as a candidate importance sampling distribution $F_1$ are known only up to uncomputable normalizing constants (Section 19.3 discusses this dilemma in greater detail). We also assume that $F_0, F_1$ both have densities, say $f_0, f_1$. The presentation given below carries over with only notational change if $F_0, F_1$ are both discrete with the same support. Suppose then $f_i(x) = \frac{h_i(x)}{c_i}, i = 0, 1$, where the assumption is that $h_0, h_1$ are completely known and also computable, but $c_0, c_1$ are unknown and are not even computable. Then, as we showed above, for any function $\psi$ for which the expectation $E_{F_0}[\psi(X)]$ exists,

$$\mu = E_{F_0}[\psi(X)] = \int \psi(x)\ell(x)f_1(x)\,dx = \frac{c_1}{c_0}\int \frac{\psi(x)h_0(x)}{h_1(x)}\,f_1(x)\,dx = \frac{c_1}{c_0}\,E_{F_1}\left[\frac{\psi(X)h_0(X)}{h_1(X)}\right].$$

This is a useful reduction, but we still have to deal with the fact that the ratio $\frac{c_1}{c_0}$ is not known to us. Fortunately, if we use the special function $\psi(x) \equiv 1$, the same representation above gives us

$$1 = \frac{c_1}{c_0}\,E_{F_1}\left[\frac{h_0(X)}{h_1(X)}\right] \;\Rightarrow\; \frac{c_1}{c_0} = \frac{1}{E_{F_1}\left[\frac{h_0(X)}{h_1(X)}\right]},$$

and because $h_0, h_1$ are explicitly known to us, we have a way to get rid of the quotient $\frac{c_1}{c_0}$ and write the final importance sampling identity

$$E_{F_0}[\psi(X)] = \frac{E_{F_1}\left[\frac{\psi(X)h_0(X)}{h_1(X)}\right]}{E_{F_1}\left[\frac{h_0(X)}{h_1(X)}\right]}.$$

We can now use an available Monte Carlo sample $X_1, \ldots, X_n$ from $F_1$ to find Monte Carlo estimates for $\mu = E_{F_0}[\psi(X)]$.

The basic plug-in estimate for $\mu$ is the so-called ratio estimate

$$\hat{\mu} = \frac{\sum_{i=1}^{n}\frac{\psi(X_i)h_0(X_i)}{h_1(X_i)}}{\sum_{i=1}^{n}\frac{h_0(X_i)}{h_1(X_i)}}.$$

If the Monte Carlo sample size is small, this estimate will probably have quite a bit of bias, and some bias correction would be desirable. In any case, we have at hand at least a first-order approximation for $E_{F_0}[\psi(X)]$ based on a Monte Carlo sample from a general candidate importance sampling distribution $F_1$. The issue of which $F_1$ to actually choose has not been addressed yet. All we have done so far is some algebraic manipulation and then a proposal for using the ratio estimate of survey theory. The choice issue is addressed below. But first let us see an example.
Example 19.13 (Binomial Bayes Problem with an Atypical Prior). Suppose $X \sim \mathrm{Bin}(m, p)$ for some fixed $m$, and $p$ has the prior density $c\sin^2(\pi p)$, where $c$ is a normalizing constant. Throughout the example, $c$ denotes a generic constant, and is not intended to mean the same constant at every use.

The posterior density of $p$ given $X = x$ is

$$\pi(p \mid X = x) = c\,p^x(1 - p)^{m - x}\sin^2(\pi p), \quad 0 < p < 1.$$

The problem is to find the posterior mean

$$\mu = c\int_0^1 p\,[p^x(1 - p)^{m - x}\sin^2(\pi p)]\,dp.$$

We use importance sampling to approximate the value of $\mu$. Towards this, choose

$$\psi(p) = p, \quad h_0(p) = p^x(1 - p)^{m - x}\sin^2(\pi p), \quad h_1(p) = p^x(1 - p)^{m - x},$$

so that if $p_1, p_2, \ldots, p_n$ are samples from $F_1$ (i.e., a Beta distribution with parameters $x + 1$ and $m - x + 1$), then the importance sampling estimate of the posterior mean $\mu$ is

$$\hat{\mu} = \frac{\sum_{i=1}^{n}\frac{\psi(p_i)h_0(p_i)}{h_1(p_i)}}{\sum_{i=1}^{n}\frac{h_0(p_i)}{h_1(p_i)}} = \frac{\sum_{i=1}^{n} p_i\sin^2(\pi p_i)}{\sum_{i=1}^{n}\sin^2(\pi p_i)}.$$

Note that we did not need to calculate the normalizing constant in the posterior density. We take $m = 100, x = 45$ for specificity. The following table gives values of this importance sampling estimate and also the value of $\mu$ computed by using a numerical integration routine, so that we can assess the accuracy of the importance sampling estimate.

n     Importance Sampling Estimate of $\mu = .4532$
20    .4444
50    .4476
100   .4558
250   .4537
500   .4529

Even with an importance sampling size of $n = 20$, the estimation error is less than 2%. This has partly to do with the choice of the importance sampling distribution. In this example, the importance sampling distribution was chosen to match the shape of the posterior distribution well. This generally enhances the accuracy of importance sampling.
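A minimal Python sketch of the ratio estimate of this example (an illustrative addition assuming NumPy) follows.

```python
import numpy as np

# Importance sampling estimate of the posterior mean of p when
# X ~ Bin(100, p), x = 45, and the prior is proportional to sin^2(pi p);
# samples are drawn from F_1 = Beta(46, 56).
rng = np.random.default_rng(13)
n = 500
p = rng.beta(46, 56, n)            # draws from F_1
w = np.sin(np.pi * p) ** 2         # h_0 / h_1
print(np.sum(p * w) / np.sum(w))   # ratio estimate, ~ 0.4532
```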
Let us now study two basic theoretical properties of the importance sampling estimate $\hat{\mu}$ in general. The first issue is whether asymptotically it estimates $\mu$ correctly, and the second issue is what we can say about the amount of the error $\hat{\mu} - \mu$ in general. Fortunately, we can handle both questions by using our asymptotic toolbox from Chapter 7.
Theorem 19.2. Suppose

$$\mathrm{Var}_{F_1}\left(\psi(X)\frac{h_0(X)}{h_1(X)}\right) \quad \text{and} \quad \mathrm{Var}_{F_1}\left(\frac{h_0(X)}{h_1(X)}\right)$$

both exist. Denote

$$\mu_1 = E_{F_1}\left[\psi(X)\frac{h_0(X)}{h_1(X)}\right], \quad \mu_2 = E_{F_1}\left[\frac{h_0(X)}{h_1(X)}\right],$$
$$\sigma_1^2 = \mathrm{Var}_{F_1}\left(\psi(X)\frac{h_0(X)}{h_1(X)}\right), \quad \sigma_2^2 = \mathrm{Var}_{F_1}\left(\frac{h_0(X)}{h_1(X)}\right),$$
$$\sigma_{12} = \mathrm{Cov}_{F_1}\left(\psi(X)\frac{h_0(X)}{h_1(X)},\, \frac{h_0(X)}{h_1(X)}\right).$$

Then, as $n \to \infty$,

(a) $\hat{\mu} \stackrel{a.s.}{\to} \mu = \frac{\mu_1}{\mu_2}$.

(b) $\sqrt{n}(\hat{\mu} - \mu) \stackrel{\mathcal{L}}{\Rightarrow} N(0, \tau^2)$, where

$$\tau^2 = \frac{\mu_1^2\sigma_2^2}{\mu_2^4} + \frac{\sigma_1^2}{\mu_2^2} - \frac{2\mu_1\sigma_{12}}{\mu_2^3}.$$
Proof. For part (a), note that by Kolmogorov's SLLN (see Chapter 7), $\frac{1}{n}\sum_{i=1}^{n}\frac{\psi(X_i)h_0(X_i)}{h_1(X_i)}$ converges almost surely to $\mu_1$, and $\frac{1}{n}\sum_{i=1}^{n}\frac{h_0(X_i)}{h_1(X_i)}$ converges almost surely to $\mu_2 > 0$. Therefore,

$$\hat{\mu} = \frac{\frac{1}{n}\sum_{i=1}^{n}\frac{\psi(X_i)h_0(X_i)}{h_1(X_i)}}{\frac{1}{n}\sum_{i=1}^{n}\frac{h_0(X_i)}{h_1(X_i)}}$$

converges almost surely to $\frac{\mu_1}{\mu_2} = \mu$.

For part (b), denote

$$U_i = \frac{\psi(X_i)h_0(X_i)}{h_1(X_i)}, \qquad V_i = \frac{h_0(X_i)}{h_1(X_i)},$$

and note that by the multivariate central limit theorem (see Chapter 7),

$$\sqrt{n}(\bar{U} - \mu_1,\, \bar{V} - \mu_2) \stackrel{\mathcal{L}}{\Rightarrow} N_2(0, \Sigma),$$

where

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.$$

Now define the transformation $g(u, v) = \frac{u}{v}$, so that $\hat{\mu} = g(\bar{U}, \bar{V})$. By the delta theorem (see Chapter 7), one has

$$\sqrt{n}\big(g(\bar{U}, \bar{V}) - g(\mu_1, \mu_2)\big) \stackrel{\mathcal{L}}{\Rightarrow} N\big(0,\, [\nabla g(\mu_1, \mu_2)]'\,\Sigma\,[\nabla g(\mu_1, \mu_2)]\big).$$

But $g(\mu_1, \mu_2) = \mu$, and $[\nabla g(\mu_1, \mu_2)]'\,\Sigma\,[\nabla g(\mu_1, \mu_2)] = \tau^2$, on algebra, proving part (b) of the theorem. $\square$

19.2.3 Optimal Importance Sampling Distribution

We now address the question of the optimal choice of the importance sampling distribution. There is no unique way to define what an optimal choice means. We formulate one definition of optimality and provide an optimal importance sampling distribution. The optimal choice would not be practically usable, as we show below. However, the solution still gives useful insight.

Theorem 19.3. Consider the importance sampling estimator $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\psi(X_i)\ell(X_i)$ for $\mu = \int \psi(x)f_0(x)\,dx$, where $\ell(x) = \frac{f_0(x)}{f_1(x)}$, and $X_1, \ldots, X_n$ are iid observations from $F_1$. Assume that $\psi(x) \ge 0$, and $\mu > 0$. Then, $\mathrm{Var}_{F_1}(\hat{\mu})$ is minimized when $f_1(x) = \frac{\psi(x)f_0(x)}{\mu}$.

Proof. Because $X_1, \ldots, X_n$ are iid, so are $\psi(X_1)\ell(X_1), \ldots, \psi(X_n)\ell(X_n)$, and hence,

$$\mathrm{Var}_{F_1}(\hat{\mu}) = \frac{1}{n}\,\mathrm{Var}_{F_1}(\psi(X_1)\ell(X_1)).$$

Clearly, this is minimized when, with probability one under $F_1$, $\psi(X_1)\ell(X_1)$ is a constant, say $k$. The constant $k$ must be equal to the mean of $\psi(X_1)\ell(X_1)$, that is,

$$k = \int \psi(x)\ell(x)f_1(x)\,dx = \int \frac{\psi(x)f_0(x)}{f_1(x)}\,f_1(x)\,dx = \int \psi(x)f_0(x)\,dx = \mu.$$

Therefore, the optimal importance sampling density satisfies $\psi(x)\ell(x) = \mu \Rightarrow f_1(x) = \frac{\psi(x)f_0(x)}{\mu}$. $\square$

This is not usable in practice, because it involves $\mu$, which is precisely the unknown number we want to approximate. However, the theoretically optimal solution suggests that the importance sampling density should follow key properties of the unnormalized function $\psi(x)f_0(x)$. For example, $f_1$ should have the same shape and tail behavior as $\psi(x)f_0(x)$. Do and Hall (1989) show the advantages of using importance sampling and choosing the correct shape in distribution estimation problems that arise in the bootstrap. This reduces the variance of $\hat{\mu}$ and increases its accuracy.

19.2.4 Algorithms for Simulating from Common Distributions

Standard software for simulating from common univariate and multivariate distributions is now widely available. Mathematica permits simulation from essentially all common distributions. However, for the sake of quick simulations when efficiency is not of primary concern, a few simple rules for simulating from 25 common distributions are listed below. Their justification comes from various well-known results in standard distribution theory, many of which have been previously derived in this text itself.
Standard Exponential. To generate $X \sim \mathrm{Exp}(1)$, generate $U \sim U[0,1]$ and use $X = -\log U$.

Gamma with Parameters $n$ and $\lambda$. To generate $X \sim G(n, \lambda)$, generate $n$ independent values $X_1, \ldots, X_n$ from a standard exponential, and use $X = \lambda(X_1 + \cdots + X_n)$.

Beta with Integer Parameters $m, n$. To generate $X \sim \mathrm{Be}(m, n)$, generate $U \sim G(m, 1)$, $V \sim G(n, 1)$ independently, and use $X = \frac{U}{U + V}$.

Weibull with General Parameters. To generate $X$ from a Weibull distribution with parameters $\beta, \lambda$, generate $Y \sim \mathrm{Exp}(1)$ and use $X = \lambda Y^{\frac{1}{\beta}}$.

Standard Double Exponential. To generate $X$ from a standard double exponential, generate $X_1, X_2 \sim \mathrm{Exp}(1)$ independently, and use $X = X_1 - X_2$.

Standard Normal. To generate $X \sim N(0,1)$, use either of the following methods.

(i) Generate $U, V \sim U[0,1]$ independently, and use $X = \sqrt{-2\log U}\cos(2\pi V)$, $Y = \sqrt{-2\log U}\sin(2\pi V)$. Then $X, Y$ are independent $N(0,1)$.
(ii) Use the accept–reject method with $g(x) = \frac{1}{2}e^{-|x|}$ and $c = \sqrt{\frac{2e}{\pi}}$.

Standard Cauchy. To generate $X \sim C(0,1)$, generate $U \sim U[0,1]$ and use $X = \tan[\pi(U - \frac{1}{2})]$.

$t$ with Integer Degree of Freedom. To generate $X \sim t(n)$, generate $Z_1, Z_2, \ldots, Z_{n+1} \sim N(0,1)$ independently, and use

$$X = \frac{Z_1}{\sqrt{\frac{Z_2^2 + \cdots + Z_{n+1}^2}{n}}}.$$

Lognormal with General Parameters. To generate $X$ from a lognormal distribution with parameters $\mu, \sigma^2$, generate $Z \sim N(0,1)$ and use $X = e^{\mu + \sigma Z}$.
Pareto with General Parameters. To generate $X$ from a Pareto distribution with parameters $\alpha, \theta$, generate $U \sim U[0,1]$ and use $X = \theta(1 - U)^{-\frac{1}{\alpha}}$.

Gumbel with General Parameters. To generate $X$ from a Gumbel law with parameters $\mu, \sigma$, generate $U \sim U[0,1]$ and use $X = \mu - \sigma\log\log\frac{1}{U}$.

Multivariate Normal with General Parameters. To generate $X \sim N_d(\mu, \Sigma)$,

(i) Find the square root of $\Sigma$, that is, a $d \times d$ matrix $A$ such that $\Sigma = AA'$.
(ii) Generate independent standard normals $Z_1, \ldots, Z_d$.
(iii) Use $X = \mu + AZ$, where $Z' = (Z_1, \ldots, Z_d)$.

Wishart with General Parameters. To generate $S \sim W_p(n, \Sigma)$, generate $X_1, X_2, \ldots, X_n \sim N_p(0, \Sigma)$ independently, and use $S = \sum_{i=1}^{n} X_iX_i'$.

Dirichlet with Integer Parameters. To generate $(p_1, \ldots, p_n)$ from a Dirichlet distribution with integer parameters $m_1, \ldots, m_{n+1}$, generate $X_i \sim G(m_i, 1), i = 1, 2, \ldots, n + 1$ independently, and set

$$p_i = \frac{X_i}{\sum_{j=1}^{n+1} X_j}, \; i = 1, 2, \ldots, n, \quad \text{and} \quad p_{n+1} = 1 - \sum_{j=1}^{n} p_j.$$

Uniform on the Boundary of the Unit Ball. To generate $X$ from the uniform distribution on the boundary of the $d$-dimensional unit ball, generate $Z_1, \ldots, Z_d \sim N(0,1)$ independently, and use $X_i = \frac{Z_i}{\sqrt{Z_1^2 + \cdots + Z_d^2}}, i = 1, \ldots, d$.

Uniform Inside the Unit Ball. To generate $X$ from the uniform distribution in the $d$-dimensional unit ball, use either of the following methods.

(i) Generate $Z$ according to a uniform distribution on the boundary of the $d$-dimensional unit ball, and independently generate $U \sim U[0,1]$, and use $X = U^{\frac{1}{d}}Z$.
(ii) Generate $U_1, \ldots, U_d \sim U[-1, 1]$ independently, and use $X = (U_1, \ldots, U_d)'$ if $\sum_{i=1}^{d} U_i^2 \le 1$. Otherwise, discard it and repeat.

Bernoulli with a General Parameter. To generate $X \sim \mathrm{Ber}(p)$, generate $U \sim U[0,1]$ and set $X = I_{U > 1 - p}$.

Binomial with General Parameters. To generate $X \sim \mathrm{Bin}(n, p)$, generate $X_1, \ldots, X_n \sim \mathrm{Ber}(p)$ independently, and use $X = \sum_{i=1}^{n} X_i$.
Geometric with a General Parameter. To generate $X \sim \mathrm{Geo}(p)$, generate $Y \sim \mathrm{Exp}(1)$ and use $X = 1 + \left\lfloor \frac{Y}{-\log(1 - p)} \right\rfloor$, where $\lfloor\;\rfloor$ denotes the integer part.

Negative Binomial with General Parameters. To generate $X \sim \mathrm{NB}(k, p)$, generate $X_1, \ldots, X_k \sim \mathrm{Geo}(p)$ independently, and use $X = \sum_{i=1}^{k} X_i$.

Poisson with a General Parameter. To generate $X \sim \mathrm{Poi}(\lambda)$, generate $Y \sim \mathrm{Bin}(n, \frac{\lambda}{n})$ with $n > 100\lambda$, and use $X = Y$.
Multinomial with General Parameters. To generate $(X_1, X_2, \ldots, X_k)$ from a multinomial distribution with general parameters $n, p_1, p_2, \ldots, p_k$,

(i) Generate $X_1 \sim \mathrm{Bin}(n, p_1)$.
(ii) Generate $X_2 \sim \mathrm{Bin}\left(n - X_1, \frac{p_2}{p_2 + \cdots + p_k}\right)$.
(iii) Generate $X_3 \sim \mathrm{Bin}\left(n - X_1 - X_2, \frac{p_3}{p_3 + \cdots + p_k}\right)$, and so on.

Random Permutation. To generate $(\pi(1), \ldots, \pi(n))$ according to a uniform distribution on the set of $n!$ permutations of $(1, 2, \ldots, n)$,

(i) Choose one of the numbers $1, 2, \ldots, n$ at random and set it equal to $\pi(1)$.
(ii) Choose one of the remaining $n - 1$ numbers at random and set it equal to $\pi(2)$, and so on.

Brownian Motion on $[0,1]$. To generate a path of a standard Brownian motion $W(t)$ on $[0,1]$,

(i) Generate Bernoulli random variables $B_1, B_2, \ldots, B_n$ with parameter $\frac{1}{2}$, for $n \ge 100{,}000$.
(ii) Set $X_i = 2B_i - 1, i = 1, 2, \ldots, n$.
(iii) With $S_0 = 0, S_k = \sum_{i=1}^{k} X_i, k \ge 1$, set $W(t) = \frac{S_k}{\sqrt{n}}$ for $k \le nt < k + 1, k = 0, 1, \ldots, n$.

Homogeneous Poisson Process. To generate a path of a homogeneous Poisson process $X(t)$ with rate $\lambda$ for $t \ge 0$,

(i) Generate standard exponential random variables $X_1, X_2, \ldots$.
(ii) Set $T_i = \frac{X_i}{\lambda}, i = 1, 2, \ldots$.
(iii) Set $X(t) = k$ if $T_1 + \cdots + T_k \le t < T_1 + \cdots + T_{k+1}$.
Example 19.14 (Random Points on the Surface of a Circle). We generate 20 points according to a uniform distribution on the surface of the unit circle in the two-dimensional plane by using the algorithm that was presented above (for a general dimension). To generate one point, we first generate iid standard normals $Z_1, Z_2$, and then the uniform random point $X$ on the surface of the circle is taken to be

$$X = \left(\frac{Z_1}{\sqrt{Z_1^2 + Z_2^2}},\; \frac{Z_2}{\sqrt{Z_1^2 + Z_2^2}}\right)'.$$

This is then repeated 20 times.

Two of the four quadrants have 7 of the 20 points each, and the other two have 3 points each. The expected number per quadrant is of course 5. The observed discrepancy from the expected number is not at all unusual. One can calculate the $P$-value by using the method of Section 6.7. The second picture shows 200 random points inside the unit circle. Although the points are chosen according to the uniform distribution inside the circle, visually there are signs of some clustering and some void regions in the picture. Once again, this is not unusual.
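A minimal Python sketch of these two simulations (an illustrative addition assuming NumPy) follows.

```python
import numpy as np

# Uniform points on the surface of the unit circle (normalized bivariate
# normals) and uniform points inside the circle (accept-reject from the
# enclosing square, as in the recipe above).
rng = np.random.default_rng(14)
z = rng.standard_normal((20, 2))
on_circle = z / np.linalg.norm(z, axis=1, keepdims=True)
u = rng.uniform(-1, 1, (400, 2))
inside = u[(u**2).sum(axis=1) <= 1][:200]
print(on_circle[:3], inside.shape)
```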
19.3 Markov Chain Monte Carlo

The standard simulation techniques are difficult to apply, or even do not apply, when the target distribution is an unconventional one, or even worse, it is known only up to a normalizing constant: that is, $f(x) = \frac{h(x)}{c}$ for some explicit function $h$, but only an implicit normalizing constant $c$, because $c$ cannot be computed exactly, or even to a high degree of accuracy. This problem often arises in simulating from posterior densities of a parameter (perhaps a vector)

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{m(x)},$$

where $f(x \mid \theta)$ is the likelihood function, $\pi(\theta)$ is the prior density, and $m(x)$ is the marginal density of the observable $X$ induced by $(f, \pi)$ (see Chapter 3). Thus, $m(x) = \int_{\Theta} f(x \mid \theta)\pi(\theta)\,d\theta$, and it serves as the normalizing constant to the function $h(\theta) = f(x \mid \theta)\pi(\theta)$. But, if the parameter $\theta$ is high-dimensional, and the prior density $\pi(\theta)$ is not a very conveniently chosen one, then $m(x)$ usually cannot be calculated in closed-form, or even to a high degree of numerical approximation. All the simulation methods discussed in the previous section are useless in such a situation.

Fig. 19.2 Twenty random points on the unit circle

Fig. 19.3 Two hundred random points inside the unit circle

It is a remarkable marriage of mathematical theory and dire practical need that Markov chain Monte Carlo methods allow one to simulate from the target density (distribution) $\pi$ in such a situation. The simulated values can then be used to approximate probabilities $P_\pi(X \in A)$, or to approximate expectations $E_\pi[\psi(X)]$, etc. Strikingly, the actual algorithms are remarkably simple and very general. MCMC methods have certainly made statistical computing a whole lot easier, and the main bulk of the development in statistics took place in a short span of about 20 years. However, important questions on the speed of convergence of MCMC algorithms await additional concrete results, and these are difficult to obtain. MCMC schemes need not work in every problem; if the target distribution is multimodal, or has gaps in its support, then MCMC methods usually give misleading results and estimates. We present the basic MCMC principle, some popular special algorithms, and fundamental convergence theory for discrete state spaces alone in this section. The case of more general state spaces (e.g., $S = \mathbb{R}$) is treated in a separate section. We chose to present the discrete state space first, because the ideas are easier to assimilate in that case.

19.3.1 Reversible Markov Chains

As was mentioned in the introduction, MCMC methods follow this route.

(a) Identify the target distribution (or density) $\pi$ and the state space $S$ (i.e., the support of $\pi$) from which one wants to simulate.
(b) Construct a stationary Markov chain $X_n, n \ge 0$ on $S$ with suitable additional properties such that $X_n$ has a unique stationary distribution and the stationary distribution coincides with the target distribution $\pi$.
(c) Run the chain $X_n$ for a sufficiently long time, and use the values of the chain $X_{B+1}, \ldots, X_n$ as correlated samples from $\pi$ for some suitably large $B$.
(d) Approximate $E_\pi[\psi(X)]$ by $\frac{1}{n - B}\sum_{k=B+1}^{n}\psi(X_k)$.
(e) Form confidence intervals for $E_\pi[\psi(X)]$ based on $X_{B+1}, \ldots, X_n$. This is much more complicated than the previous steps.
The second step is the main algorithmic step, and the choice of $B$ is the main theoretical step. The construction of the chain is not unique. Given a target distribution $\pi$ on the state space $S$, there are many stationary Markov chains with $\pi$ as the stationary distribution. We have to answer two questions.

Question 1. How do we isolate an initial class of chains that definitely have $\pi$ as the unique stationary distribution?

Question 2. How do we choose a specific chain from this class for implementation in a given problem?

Reversibility of a Markov chain plays an important role in answering the first question. The choice issue has been answered in classic research on MCMC, and general easy-to-apply recipes for choosing particular chains to run have been worked out. A few popular recipes are the Metropolis–Hastings algorithm, the Barker algorithm, the independent Metropolis algorithm, and the Gibbs sampler. We start with a description of reversible Markov chains and an explanation of why reversibility is a convenient way to address our Question 1.

Definition 19.1. A stationary Markov chain $X_n, n \ge 0$ on a countable state space $S$ with the transition matrix $P = ((p_{ij})), i, j \in S$, is called reversible if there exists a nonnegative function $\pi(x)$ on $S$ such that the detailed balance equation $p_{ij}\pi(i) = p_{ji}\pi(j)$ holds for all $i, j \in S, i \ne j$.

The justification for the name reversible stems from the following simple calculation. Suppose $X_n, n \ge 0$ is reversible, and that the nonnegative function $\pi(x)$ is a probability function. Assume also that the initial distribution of the chain is $\pi$ (i.e., $P(X_0 = k) = \pi(k), k \in S$). Then the marginal distribution of the state of the chain at all subsequent times remains equal to this initial distribution; that is, for all $n \ge 1, P(X_n = k) = \pi(k), k \in S$.
Now fix $n$. Then, from the defining equation for reversibility and Bayes' theorem,

$$P(X_n = j \mid X_{n+1} = i) = \frac{P(X_{n+1} = i \mid X_n = j)\,P(X_n = j)}{P(X_{n+1} = i)} = \frac{p_{ji}\pi(j)}{\pi(i)} = \frac{p_{ij}\pi(i)}{\pi(i)} = p_{ij} = P(X_{n+1} = j \mid X_n = i).$$

In other words, in the statement, "The probability that the next state is $j$ given that the current state is $i$ is $p_{ij}$," we have the liberty to take either $n$ as the current time and $n + 1$ as the next, or $n + 1$ as the current time and $n$ as the next. That is, if we run our clock backwards, then it would seem to us that the chain is evolving the same way as it did when the clock ran forward; hence the name a reversible chain.
But we get even more. Assume that the chain is regular (see Chapter 10) and that the state space $S$ is finite. By summing the defining equation for reversibility over $j$, we get, for every $i \in S$,

$$\sum_{j \in S} p_{ji}\pi(j) = \sum_{j \in S} p_{ij}\pi(i) = \pi(i)\sum_{j \in S} p_{ij} = \pi(i).$$

Therefore, by Theorem 10.5, $\pi$ must be the unique stationary distribution of the Markov chain $X_n, n \ge 0$. In fact, we do not require the finiteness of the state space or the assumption of regularity of the chain. The fact that a reversible chain has $\pi$ as its stationary distribution holds under weaker conditions. Here is a standard version of that result.

Theorem 19.4. Suppose $X_n, n \ge 0$ is an irreducible and aperiodic stationary Markov chain on a discrete state space $S$ such that for some probability distribution $\pi$ on $S$ with $\pi(k) > 0$ for all $k$, the reversibility equation $p_{ij}\pi(i) = p_{ji}\pi(j)$ holds for all $i, j \in S$. Then $\pi$ is the unique stationary distribution for the chain $X_n, n \ge 0$.

As a consequence, in order to devise an MCMC algorithm to draw simulations from a target distribution $\pi$ on a countable set $S$, we only have to invent a stationary Markov chain (or equivalently, a transition matrix $P$) which is irreducible and aperiodic and is such that the reversibility equation $p_{ij}\pi(i) = p_{ji}\pi(j)$ holds. There are infinitely many such transition matrices $P$ for a given target $\pi$. But a few special recipes for choosing $P$ have earned a special status over the years. These algorithms go by the collective name of Metropolis algorithms, and are discussed in the next section.

On the justifiability of approximating the true average $E_\pi[\psi(X)]$ by the sample average $\frac{1}{n}\sum_{k=1}^{n}\psi(X_k)$ (or $\frac{1}{n - B}\sum_{k=B+1}^{n}\psi(X_k)$ after throwing out an initial segment), we have the Markov chain SLLN, generally known as the strong ergodic theorem. Compare it with the weak ergodic theorem, Theorem 10.7.

Theorem 19.5. Suppose $X_n, n \ge 0$ is an irreducible stationary Markov chain on a discrete state space $S$. Suppose $X_n$ is known to possess a stationary distribution $\pi$. Let $\psi : S \to \mathbb{R}$ be such that $E_\pi[\psi(X)]$ exists, that is, $\sum_{i \in S} |\psi(i)|\pi(i) < \infty$. Then, for any initial distribution of the chain, $\frac{1}{n}\sum_{k=1}^{n}\psi(X_k) \stackrel{a.s.}{\to} E_\pi[\psi(X)]$; that is,

$$P\left(\frac{1}{n}\sum_{k=1}^{n}\psi(X_k) \to E_\pi[\psi(X)]\right) = 1.$$

Proof. The proof uses the fact that for a positive recurrent Markov chain on a discrete state space, the proportion of times that the chain is at a particular state $i$ converges almost surely to the stationary probability of that state. That is, given $n \ge 1$ and $i \in S$, let $V_i(n) = \sum_{k=1}^{n} I_{X_k = i}$. Then $\frac{V_i(n)}{n} \stackrel{a.s.}{\to} \pi(i)$. We can now see intuitively why the strong ergodic theorem is true. We have:

$$\frac{1}{n}\sum_{k=1}^{n}\psi(X_k) = \frac{1}{n}\sum_{i \in S}\;\sum_{k :\, X_k = i}\psi(X_k) = \frac{1}{n}\sum_{i \in S}\psi(i)\sum_{k :\, X_k = i} 1 = \sum_{i \in S}\psi(i)\frac{V_i(n)}{n} \approx \sum_{i \in S}\psi(i)\pi(i) = E_\pi[\psi(X)].$$

Formally, fix an $\epsilon > 0$, and find a finite subset of states $E \subseteq S$ such that $\pi(S - E) < \epsilon$. Let $T$ denote the number of elements in the set $E$, and suppose $n$ is large enough that for each $i \in E$, $\left|\frac{V_i(n)}{n} - \pi(i)\right| < \epsilon$. Such an $n$ can be chosen because $E$ is a finite set, and because we have the almost sure convergence property that $\frac{V_i(n)}{n}$ converges to $\pi(i)$ for any fixed $i$, as was mentioned above. Also, note for the sake of the proof below that $\sum_{i \in S} V_i(n) = n \Rightarrow \sum_{i \in S}\left[\frac{V_i(n)}{n} - \pi(i)\right] = 0$. We prove the theorem for functions $\psi$ that are bounded, and in that case we may assume that $|\psi| \le 1$. Hence,

$$\left|\frac{1}{n}\sum_{k=1}^{n}\psi(X_k) - \sum_{i \in S}\psi(i)\pi(i)\right| = \left|\sum_{i \in S}\psi(i)\left(\frac{V_i(n)}{n} - \pi(i)\right)\right| \le \sum_{i \in S}|\psi(i)|\left|\frac{V_i(n)}{n} - \pi(i)\right|$$

$$\le \sum_{i \in E}\left|\frac{V_i(n)}{n} - \pi(i)\right| + \sum_{i \in S - E}\left(\frac{V_i(n)}{n} + \pi(i)\right).$$

By the identity noted above, $\sum_{i \in S - E}\left(\frac{V_i(n)}{n} - \pi(i)\right) = -\sum_{i \in E}\left(\frac{V_i(n)}{n} - \pi(i)\right)$, so that

$$\sum_{i \in S - E}\left(\frac{V_i(n)}{n} + \pi(i)\right) = \sum_{i \in S - E}\left(\frac{V_i(n)}{n} - \pi(i)\right) + 2\sum_{i \in S - E}\pi(i) \le \sum_{i \in E}\left|\frac{V_i(n)}{n} - \pi(i)\right| + 2\epsilon.$$

Combining the two displays,

$$\left|\frac{1}{n}\sum_{k=1}^{n}\psi(X_k) - \sum_{i \in S}\psi(i)\pi(i)\right| \le 2\sum_{i \in E}\left|\frac{V_i(n)}{n} - \pi(i)\right| + 2\epsilon \le 2T\epsilon + 2\epsilon = 2(T + 1)\epsilon.$$

Because $\epsilon$ is arbitrary, the theorem follows. $\square$

19.3.2 Metropolis Algorithms

The general principle of any Metropolis algorithm is the following. Suppose the
chain is at some state $i \in S$ at the current instant. Then, as a first step, one of the
states, say $j$, from the states in $S$ is picked according to some probability distribu-
tion, for possibly moving to that state $j$. The state $j$ is commonly called a candidate
state. The distribution used to pick this candidate state is called a proposal distri-
bution. Then, as a second step, if $j$ happens to be different from $i$, then we either
move to the candidate state $j$ with some designated probability, or we decline to
move, and therefore remain at the current state $i$. Thus, the entries of the overall
transition matrix $P$ have the multiplicative structure

$$p_{ij} = \rho_{ij}\,a_{ij}, \quad i, j \in S,\ j \ne i; \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij},$$

where

$$\rho_{ij} = P(\text{state } j \text{ is picked as the candidate state}); \qquad a_{ij} = P(\text{the chain actually moves to the candidate state } j).$$

The matrix $((\rho_{ij}))$ is chosen to be irreducible, in order that the ultimate transition
matrix $P = ((p_{ij}))$ is irreducible. Note that the Metropolis algorithm has a formal
similarity to the accept–reject scheme: a candidate state $j$ is picked according to
the proposal probabilities $\rho_{ij}$, and once picked, is accepted according to the accep-
tance probabilities $a_{ij}$. Special choices for $\rho_{ij}, a_{ij}$ lead to the following well-known
algorithms.

Independent Sampling. Choose

$$\rho_{ij} = \pi(j)\ \forall\, i, \qquad a_{ij} \equiv 1.$$


Metropolis–Hastings Algorithm. Choose

$$\rho_{ij} = c = \text{constant}, \qquad a_{ij} = \min\left\{1,\ \frac{\pi(j)}{\pi(i)}\right\}.$$

Barker's Algorithm. Choose

$$\rho_{ij} = c = \text{constant}, \qquad a_{ij} = \frac{\pi(j)}{\pi(i) + \pi(j)}.$$

Independent Metropolis Algorithm. For all $i$, choose

$$\rho_{ij} = p_j, \qquad a_{ij} = \min\left\{1,\ \frac{\pi(j)/p_j}{\pi(i)/p_i}\right\}.$$

It is implicitly assumed that the state space $S$ is finite for the Metropolis–Hastings
and the Barker algorithms to apply. Note that the Metropolis–Hastings, independent
Metropolis, and Barker algorithms only require the specification of the target dis-
tribution $\pi$ up to a normalizing constant. That is, if we only knew that $\pi(k) = \frac{h(k)}{c}$,
where $h(k)$ is explicit, but the normalizing constant $c$ is not, we can still execute
all of these algorithms. It is also worth noting that in the Metropolis–Hastings al-
gorithm, if a state $j$ gets picked as the candidate state, and $j$ happens to be more
likely than the current state $i$, then the chain surely moves to $j$. This is not the case
for Barker's algorithm.

A fourth MCMC algorithm that is especially popular in statistics, and particularly
in Bayesian statistics, is the Gibbs sampler, which is treated separately. Note that for
each proposed algorithm, one needs to verify two things: that the chain is irreducible
and aperiodic, and that the time-reversibility equation indeed holds.
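To make the recipe concrete, here is a minimal sketch in Python of the Metropolis–Hastings chain on a finite state space; the function name and the use of numpy are our own illustrative choices, and the target is specified only up to its normalizing constant through a weight function h, exactly as discussed above.

import numpy as np

def metropolis_hastings_finite(h, t, n_steps, start=0, rng=None):
    # Metropolis-Hastings on S = {0, 1, ..., t-1} with the uniform proposal
    # rho_ij = 1/t; h(k) is proportional to the target pi(k).
    rng = np.random.default_rng() if rng is None else rng
    i = start
    chain = np.empty(n_steps, dtype=int)
    for step in range(n_steps):
        j = int(rng.integers(t))                      # candidate state
        if j != i and rng.random() < min(1.0, h(j) / h(i)):
            i = j                                     # move; otherwise stay at i
        chain[step] = i
    return chain

For instance, metropolis_hastings_finite(lambda k: (k + 1.0) ** 2, 10, 5000) targets the distribution proportional to $(k+1)^2$ on $\{0, 1, \ldots, 9\}$; only the ratios $h(j)/h(i)$ ever enter the computation.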

Example 19.15 (Verifying the Assumptions). First consider the case of purely in-
dependent sampling. Then, $p_{ij} = \pi(j)$ and $p_{ji} = \pi(i)$. So irreducibility holds,
because $P$ itself has all entries strictly positive. Also, $\pi(i)p_{ij} = \pi(i)\pi(j) =
p_{ji}\pi(j)$, and so time-reversibility also holds.

Next, consider the Metropolis–Hastings algorithm. Fix $i, j$. Consider the case
that $\pi(j) \ge \pi(i)$. Then, $p_{ij} = c$, and $p_{ji} = c\,\frac{\pi(i)}{\pi(j)}$. Therefore, every en-
try in $P$ is again strictly positive, and so the chain is irreducible. Furthermore,
$p_{ji}\pi(j) = c\,\pi(i) = p_{ij}\pi(i)$, and therefore, time-reversibility holds. For the other
case, namely $\pi(i) \ge \pi(j)$, the proof of reversibility is the same. The verification of
the assumptions for the Barker algorithm and the independent Metropolis sampler
is a chapter exercise.

Example 19.16 (Simulation from a Beta–Binomial Distribution). If $X$ given $p$ is
distributed as $\text{Bin}(n, p)$ for some fixed $n$, and $p$ has a Beta distribution with parame-
ters $\alpha$ and $\beta$, then the marginal distribution of $X$ on the state space $S = \{0, 1, \ldots, n\}$
is called a Beta–Binomial distribution. It has the pmf

$$m(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\binom{n}{x}\int_0^1 p^{x+\alpha-1}(1-p)^{n-x+\beta-1}\,dp
= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\binom{n}{x}\frac{\Gamma(x+\alpha)\Gamma(n-x+\beta)}{\Gamma(n+\alpha+\beta)}, \quad x = 0, 1, \ldots, n.$$

To simulate from the Beta–Binomial distribution by using the Metropolis–Hastings
algorithm, if we are currently at some state $i$, $0 \le i \le n$, then one of the states $j$ is
chosen at random as a candidate state for the chain to move to, and the move is actually
made with a probability of $\min\{1, \frac{m(j)}{m(i)}\}$. By using the formula for $m(x)$ from the
above,

$$\frac{m(j)}{m(i)} = \frac{i!\,(n-i)!\,\Gamma(j+\alpha)\Gamma(n-j+\beta)}{j!\,(n-j)!\,\Gamma(i+\alpha)\Gamma(n-i+\beta)}.$$

Thus, the overall transition matrix of the Metropolis–Hastings chain is

$$p_{ij} = \frac{1}{n+1}\min\left\{1,\ \frac{i!\,(n-i)!\,\Gamma(j+\alpha)\Gamma(n-j+\beta)}{j!\,(n-j)!\,\Gamma(i+\alpha)\Gamma(n-i+\beta)}\right\}, \quad j \ne i; \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$

Note that the chain is in fact regular. Interestingly, if $\alpha = \beta = 1$, so that $p \sim
U[0, 1]$, then this works out to $p_{ij} \equiv \frac{1}{n+1}$. That is, if the chain moves, it moves to
one of the noncurrent states with an equal probability.
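A short sketch of this chain in Python follows; computing $\log m(x)$ through math.lgamma (rather than $m(x)$ itself) is our own numerical safeguard, not something the example requires.

import math
import numpy as np

def beta_binomial_mh(n, alpha, beta, n_steps, rng=None):
    # Metropolis-Hastings chain of Example 19.16: a candidate state is drawn
    # uniformly from {0, ..., n} and accepted with probability min{1, m(j)/m(i)}.
    rng = np.random.default_rng() if rng is None else rng
    def log_m(x):  # log of the Beta-Binomial pmf, up to an additive constant
        return (math.lgamma(x + alpha) + math.lgamma(n - x + beta)
                - math.lgamma(x + 1) - math.lgamma(n - x + 1))
    i = int(rng.integers(n + 1))
    chain = np.empty(n_steps, dtype=int)
    for step in range(n_steps):
        j = int(rng.integers(n + 1))
        if math.log(rng.random()) < log_m(j) - log_m(i):
            i = j
        chain[step] = i
    return chain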

Example 19.17 (Simulation from a Truncated Geometric Distribution). Suppose
$Y \sim \text{Geo}(p)$, and we want to simulate from the conditional distribution of $Y$ given
that $Y \le n$, where $n > 1$ is some specified integer. The pmf of this conditional
distribution is

$$\pi(x) = \frac{P(Y = x)}{P(Y \le n)}\,I_{1 \le x \le n}
= \frac{p(1-p)^{x-1}}{1 - \sum_{x=n+1}^{\infty} p(1-p)^{x-1}}\,I_{1 \le x \le n}
= \frac{p(1-p)^{x-1}}{1 - (1-p)^n}\,I_{1 \le x \le n}.$$

Therefore, for any $i, j \le n$,

$$\frac{\pi(j)}{\pi(i)} = (1-p)^{j-i} \le 1 \iff j \ge i.$$

Thus, the overall transition matrix of the Metropolis–Hastings chain would be

$$p_{ij} = \frac{1}{n} \ \text{if } j < i; \qquad p_{ij} = \frac{1}{n}(1-p)^{j-i} \ \text{if } j > i; \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$
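In code, the chain might be run as in the following minimal Python sketch (the function name is our own): a candidate is drawn uniformly from $\{1, \ldots, n\}$, every downward move is accepted, and an upward move from $i$ to $j$ is accepted with probability $(1-p)^{j-i}$.

import numpy as np

def truncated_geometric_mh(p, n, n_steps, rng=None):
    # Metropolis-Hastings chain of Example 19.17 targeting Geo(p) given Y <= n.
    rng = np.random.default_rng() if rng is None else rng
    i = 1
    chain = np.empty(n_steps, dtype=int)
    for step in range(n_steps):
        j = int(rng.integers(1, n + 1))          # uniform candidate on {1, ..., n}
        if j <= i or rng.random() < (1.0 - p) ** (j - i):
            i = j                                # downward moves always accepted
        chain[step] = i
    return chain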

The general Metropolis algorithm uses more general forms for the probabilities
$\rho_{ij}, a_{ij}$, and we provide this general form next.

Definition 19.2. The general Metropolis algorithm corresponds to the transition
probabilities $p_{ij} = \rho_{ij}\,a_{ij}$, where $((\rho_{ij}))$ is a general irreducible transition prob-
ability matrix, and the $a_{ij}$ are defined as $a_{ij} = \min\left\{1,\ \frac{\pi(j)\rho_{ji}}{\pi(i)\rho_{ij}}\right\}$.

If the proposal distribution is symmetric (i.e., if $\rho_{ji} = \rho_{ij}$ for all pairs $i, j$), then
the general Metropolis algorithm reduces to the previously described basic Metropo-
lis algorithm. Thus, more generality is attained only by choosing the matrix $((\rho_{ij}))$
to be nonsymmetric. The general Metropolis algorithm has the desired convergence
property, and this is a consequence of the general convergence theorem for irreducible
reversible chains (Theorem 19.4).

Theorem 19.6. The general Metropolis algorithm has $\pi$ as its stationary
distribution.

19.4 The Gibbs Sampler

The Metropolis algorithms of the previous section can be difficult to apply when the
dimension of the state space is high, and the likelihood ratios $\frac{\pi(y)}{\pi(x)}$, $x, y \in S$, depend
on all the coordinates of $x, y$. The generation of the chain becomes too much of a
multidimensional problem, and becomes at the least unwieldy, and possibly undoable.

A collection of special Metropolis algorithms very cleverly reduces the mul-
tidimensional nature of the Markov chain generation problem to a sequence
of one-dimensional problems. To explain it more precisely, suppose a state $x$
in the state space $S$ is a vector in some $m$-dimensional space with coordinates
$(x_1, x_2, \ldots, x_m)$. Suppose we are currently in state $x$. To make a transition to a new
state $y \in S$, we change coordinates one at a time, as in $(x_1, x_2, \ldots, x_m) \to
(y_1, x_2, \ldots, x_m) \to (y_1, y_2, x_3, \ldots, x_m) \to \cdots \to (y_1, y_2, \ldots, y_m)$, and each co-
ordinate change is made by using the conditional distribution of that coordinate
given the rest of the coordinates. For example, the transition $(x_1, x_2, \ldots, x_m) \to
(y_1, x_2, \ldots, x_m)$ is made by simulating from the distribution $\pi(x_1 \mid x_2, \ldots, x_m)$.
These conditional distributions of one coordinate given all the rest are called full
conditionals. Therefore, as long as we can calculate and also simulate from all
the full conditionals, a complicated multidimensional problem is broken into $m$
one-dimensional problems, which is a productive simplification in many applica-
tions. It can be shown that the sequence of observations thus generated forms a
stationary Markov chain, and under suitable conditions the chain has the target
distribution $\pi$ as its stationary distribution. It should be noted that in some appli-
cations, the full conditionals do not depend on all the $m - 1$ coordinates that one
conditions on, but only on a smaller number of them. This happy coincidence makes
the algebraic aspect of simulating runs of the Markov chain even simpler.
This special Metropolis algorithm, which uses only the full conditionals to form the
proposal probabilities and uses acceptance probabilities equal to one, is called the
Gibbs sampler. It arose in mathematical physics in the works of Glauber (1963) in a
more general setup than the finite-dimensional Euclidean setup described here. The
Gibbs sampler is a hugely popular tool in several areas of statistics, notably simu-
lation of posterior distributions in Bayesian statistics, image processing, generation
of random contingency tables, random walks on gigantic state spaces, such as those
that arise in models of random graphs and Bayesian networks, molecular biology,
and many others.
The Gibbs sampler, by definition, makes its transitions by changing one coordi-
nate of the m-dimensional vector at a time. At any particular step of this transition,
the coordinate to be changed may be chosen at random, or according to some de-
terministic order. If the selection is random, the corresponding algorithm is called
a random scan Gibbs sampler, and if the selection is deterministic, the algorithm is
called a systematic scan Gibbs sampler. There are some important optimality issues
regarding which type of scan is preferable, and we discuss them briefly later in this
section.
Suppose, for specificity, that we use a random scan Gibbs sampler. Denote the
full conditionals by using the notation $\pi(x_i \mid x_{-i})$. Denote the current state by
$x = (x_1, \ldots, x_m)$. Pick the coordinate to be changed at random from the $m$ available
coordinates. If the coordinate picked is $i$, then make a transition from $x$ to a new vec-
tor $y = (x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_m)$ by using the full conditional $\pi(y_i \mid x_{-i})$.
Then, the transition probability matrix of the chain is $P = \frac{1}{m}(P_1 + \cdots + P_m)$, where
$P_i$ has entries given by

$$p_{i,xy} = \pi(y_i \mid x_{-i})\,I_{y_j = x_j\ \forall\, j \ne i}.$$

Each $P_i$ is reversible, and hence so is $P$. To show that a given $P_i$ is reversible, we
verify directly:

$$\begin{aligned}
\pi(x)p_{i,xy} &= \pi(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_m)\,
\frac{\pi(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_m)}{\sum_{y_i}\pi(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_m)}\,I_{y_j = x_j\ \forall\, j \ne i} \\
&= \frac{\pi(y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_m)\,\pi(y_1, \ldots, y_{i-1}, y_i, y_{i+1}, \ldots, y_m)}{\sum_{y_i}\pi(y_1, \ldots, y_{i-1}, y_i, y_{i+1}, \ldots, y_m)}\,I_{y_j = x_j\ \forall\, j \ne i} \\
&= \pi(y_1, \ldots, y_{i-1}, y_i, y_{i+1}, \ldots, y_m)\,
\frac{\pi(y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_m)}{\sum_{y_i}\pi(y_1, \ldots, y_{i-1}, y_i, y_{i+1}, \ldots, y_m)}\,I_{y_j = x_j\ \forall\, j \ne i} \\
&= \pi(y)p_{i,yx}.
\end{aligned}$$

Likewise, if the $P_i$ are irreducible, so is $P$. A simple sufficient condition for
irreducibility is that $\pi(x) > 0$ on the state space; weaker conditions for irreducibility
are also available.
For the systematic scan Gibbs sampler, we change the coordinates in a pre-
determined sequence. For example, if we change the coordinates in the natural
sequence $1, 2, \ldots, m$, then the transition probability matrix $P$ of the chain is $P =
P_1 P_2 \cdots P_m$. So, provided that the $P_i$ are irreducible, $P$ is still irreducible. How-
ever, reversibility will typically be lost. Reversibility is not absolutely essential,
because the chain may have $\pi$ as the unique stationary distribution even if reversibil-
ity fails. But it is useful to have the reversibility property, because then Theorem 19.4
guarantees the desired convergence property, without requiring further verification.
To ensure reversibility, we may use a hybrid of the systematic scan and the random
scan Gibbs samplers, by first choosing the ordering of the change of coordinates at
random from the $m!$ possible orderings of $1, 2, \ldots, m$. We call this the order ran-
domized systematic scan Gibbs sampler. The transition probability matrix $P$ now is

$$P = \frac{1}{m!}\sum_{S(m)} P_{i_1} P_{i_2} \cdots P_{i_m},$$

where $\sum_{S(m)}$ denotes the sum over all $m!$ permutations of $1, 2, \ldots, m$. Once we
have both reversibility and irreducibility, we get the desired convergence result
(Theorem 19.4). We state this formally.

Theorem 19.7. Suppose $\pi(x) > 0$ for all $x \in S$. Then the random scan and the or-
der randomized systematic scan Gibbs samplers both have $\pi$ as the unique stationary
distribution of the chain.

Example 19.18 (The Beta–Binomial Pair). To continue with Example 19.14,
suppose that $X$ given $p$ is distributed as $\text{Bin}(m, p)$ and $p$ has a Beta prior dis-
tribution with parameters $\alpha$ and $\beta$. Then, the posterior distribution of $p$ given
$X = x$ is $\text{Be}(x + \alpha, m - x + \beta)$. The systematic scan Gibbs sampler runs the
bivariate Markov chain $(X_k, p_k)$, $k = 0, 1, 2, \ldots$ in the following steps:

(a) Choose an initial value $p_0 \sim \text{Be}(\alpha, \beta)$.
(b) Choose an initial value $X_0 = x_0$ from $\text{Bin}(m, p_0)$, $0 \le x_0 \le m$.
(c) Generate $p_1$ from $\text{Be}(x_0 + \alpha, m - x_0 + \beta)$.
(d) Generate $X_1 = x_1$ from $\text{Bin}(m, p_1)$.
(e) Repeat.

This produces a bivariate chain that has the bivariate distribution

$$g(x, p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\binom{m}{x}\,p^{x+\alpha-1}(1-p)^{m-x+\beta-1}, \quad x = 0, 1, \ldots, m,\ 0 < p < 1,$$

as its stationary distribution. We can reverse the order of the systematic scan, that is,
start with a fixed $x_0$, get $p_0$ from $\text{Be}(x_0 + \alpha, m - x_0 + \beta)$, then $x_1$ from $\text{Bin}(m, p_0)$,
and so on, and the convergence result will still hold.
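Steps (a)-(e) translate almost line by line into code; the following minimal Python sketch (function name our own) is one way to run the bivariate chain.

import numpy as np

def beta_binomial_gibbs(m, alpha, beta, n_sweeps, rng=None):
    # Systematic scan Gibbs sampler of Example 19.18 for the (X, p) pair.
    rng = np.random.default_rng() if rng is None else rng
    p = rng.beta(alpha, beta)                  # step (a): initial p_0
    xs = np.empty(n_sweeps, dtype=int)
    ps = np.empty(n_sweeps)
    for k in range(n_sweeps):
        x = rng.binomial(m, p)                 # X_k | p ~ Bin(m, p)
        p = rng.beta(x + alpha, m - x + beta)  # p | X_k = x ~ Be(x+alpha, m-x+beta)
        xs[k], ps[k] = x, p
    return xs, ps

The output xs then approximates the Beta-Binomial marginal of $g(x, p)$, and ps approximates the $\text{Be}(\alpha, \beta)$ marginal.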

Example 19.19 (A Gaussian Bayes Example). Posterior densities were introduced
in this text in Chapter 3 as conditional distributions of the parameter when there
is a bona fide joint distribution on the data value and the parameter, induced by the
likelihood function and a prior probability distribution on the parameter. Sometimes,
a nonnegative function $\pi(\theta)$ of $\theta$ is used as a formal prior on $\theta$, although $\pi(\theta)$ is
not a valid probability density; that is, $\int_\Theta \pi(\theta)\,d\theta$ is actually infinite. Such formal
priors are called improper priors. It can happen, however, that if we forget the fact
that $\pi(\theta)$ is not a probability density, but plug it into the formula for the conditional
density of $\theta$ given $X = x$, that is, into the formula

$$\pi(\theta \mid x) = c\,f(x \mid \theta)\,\pi(\theta),$$

then $\pi(\theta \mid x)$ is indeed a probability density with a suitable finite normalizing con-
stant $c$. In that case, one sometimes proceeds as if $\pi(\theta \mid x)$ is a bona fide posterior
density and does inference with it. This is such an example of the application of the
Gibbs sampler to a posterior density arising from an improper prior.

Suppose given $\theta = (\mu, \sigma^2)$, $X_1, X_2, \ldots, X_n$ are iid $N(\mu, \sigma^2)$, and that for $\theta =
(\mu, \sigma^2)$ we use the nonnegative function $\pi(\theta)\,d\mu\,d\sigma^2 = \frac{1}{\sigma^2}\,d\mu\,d\sigma^2$ as an improper prior. It
is important to note that $\sigma^2$ is being treated as the second coordinate of $\theta$, not $\sigma$. One
may informally interpret such a prior as treating $\mu$ and $\sigma^2$ to be independent, and
then putting, respectively, the improper prior densities $\pi_1(\mu) \equiv 1$ and $\pi_2(\sigma^2) =
\frac{1}{\sigma^2}\ (\sigma^2 > 0)$ on them.
Our final goal is to simulate observations from the posterior distribution of
$\theta = (\mu, \sigma^2)$ given the data $X = (X_1, \ldots, X_n)$. To apply the Gibbs sampler, we
need the full conditionals. Because $\theta$ is two-dimensional, there are only two full
conditionals to find, namely, the posterior distribution of $\mu$ given $\sigma^2$, and the pos-
terior distribution of $\sigma^2$ given $\mu$. Note that when we consider posteriors, the data
values are considered as fixed. So $X$ does not enter into this Gibbs picture as an-
other variable.

It turns out that even though the prior is an improper prior, the formal posterior
is a bona fide probability density on $\theta = (\mu, \sigma^2)$ for any $n > 1$. Furthermore, these
two full conditionals are very easy to find. In the following lines, $c$ denotes a generic
normalizing constant, and is not intended to be the same number at each occurrence.
Consider the conditional of $\mu$ given $\sigma^2$. This is formed as
$$\pi(\mu \mid \sigma^2, X) = c\,\frac{1}{\sigma^n}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}
= c\,\frac{1}{\sigma^n}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2}\,e^{-\frac{n}{2\sigma^2}(\bar{X} - \mu)^2}
= c\,e^{-\frac{n}{2\sigma^2}(\bar{X} - \mu)^2},$$

and therefore, $\pi(\mu \mid \sigma^2, X)$ is the density of a $N\!\left(\bar{X}, \frac{\sigma^2}{n}\right)$ distribution.

Next, the conditional of $\sigma^2$ given $\mu$ is

$$\pi(\sigma^2 \mid \mu, X) = c\,\frac{1}{\sigma^n}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}\,\frac{1}{\sigma^2}
= c\,\frac{1}{(\sigma^2)^{\frac{n}{2}+1}}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}.$$

Therefore, by transforming to $w = \frac{1}{\sigma^2}$, from the Jacobian formula, the conditional
of $w$ given $\mu$ is

$$\pi(w \mid \mu, X) = c\,w^{\frac{n}{2}-1}\,e^{-\frac{w}{2}\sum_{i=1}^n (X_i - \mu)^2}.$$

Making one final transformation to $v = w\sum_{i=1}^n (X_i - \mu)^2$, the conditional of $v$
given $\mu$ is

$$\pi(v \mid \mu, X) = c\,v^{\frac{n}{2}-1}\,e^{-\frac{v}{2}}.$$

We recognize this to be the density of a chi-square distribution with $n$ degrees of
freedom. To summarize the steps of the systematic scan Gibbs sampler for this
problem:

Step 1. Start with an initial value $\sigma_0^2$.
Step 2. Generate $\mu_0$ from $N\!\left(\bar{X}, \frac{\sigma_0^2}{n}\right)$.
Step 3. Generate $\sigma_1^2$ by first simulating $v_1$ from $\chi_n^2$, the chi-square distribution with
$n$ degrees of freedom, and then set $\sigma_1^2 = \frac{\sum_{i=1}^n (X_i - \mu_0)^2}{v_1}$.
Step 4. Generate $\mu_1$ from $N\!\left(\bar{X}, \frac{\sigma_1^2}{n}\right)$, and so on, and repeat.
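These four steps can be coded directly; here is a minimal Python sketch under the same improper prior (the function name and the starting value convention are our own choices).

import numpy as np

def gaussian_gibbs(data, n_sweeps, sigma2_init=1.0, rng=None):
    # Systematic scan Gibbs sampler of Example 19.19 for (mu, sigma^2)
    # under the improper prior pi(mu, sigma^2) = 1/sigma^2.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(data, dtype=float)
    n, xbar = len(x), float(np.mean(x))
    sigma2 = sigma2_init
    draws = np.empty((n_sweeps, 2))
    for k in range(n_sweeps):
        mu = rng.normal(xbar, np.sqrt(sigma2 / n))   # mu | sigma^2 ~ N(xbar, sigma^2/n)
        v = rng.chisquare(n)                         # v ~ chi-square, n d.f.
        sigma2 = np.sum((x - mu) ** 2) / v           # sigma^2 = sum (x_i - mu)^2 / v
        draws[k] = mu, sigma2
    return draws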

Example 19.20 (Gibbs Sampling from Dirichlet Distributions). The success of the
Gibbs sampler hinges on one's ability to write down and easily simulate from the one-
dimensional full conditionals. One important case where this works out very nicely
is that of the $n$-dimensional Dirichlet distribution (see Chapter 4). The Dirich-
let distribution arises naturally as a model for data on a simplex. Dirichlet distributions also arise
prominently in Bayesian statistics in numerous situations, for example, as priors
for the parameters of a multinomial distribution, and also in dealing with infinite-
dimensional Bayesian problems.

A canonical representation of a Dirichlet random vector with parameter $\alpha =
(\alpha_1, \alpha_2, \ldots, \alpha_{n+1})$ is given by

$$(p_1, p_2, \ldots, p_n) \stackrel{\mathcal{L}}{=} \left(\frac{X_1}{X_1 + X_2 + \cdots + X_{n+1}},\ \frac{X_2}{X_1 + X_2 + \cdots + X_{n+1}},\ \ldots,\ \frac{X_n}{X_1 + X_2 + \cdots + X_{n+1}}\right),$$

where the $X_i$ are independent Gamma variables with parameters $\alpha_i$ and 1. Therefore, it
is possible to simulate a Dirichlet random vector by simulating $n + 1$ independent
Gamma variables. However, the Gibbs sampler is an alternative and relatively sim-
ple method for this case, because by Theorem 4.9, the full conditionals are simply
Beta distributions. Precisely,

$$\frac{p_i}{1 - \sum_{j \ne i} p_j}\ \Big|\ \{p_j,\ j \ne i\} \sim \text{Be}(\alpha_i, \alpha_{n+1}).$$

So, to simulate from a Dirichlet distribution by using the systematic scan Gibbs
sampler, we can proceed as follows (a code sketch follows the steps).

Step 1. Fix initial values $p_{20}, \ldots, p_{n0}$ for $p_2, \ldots, p_n$.
Step 2. Simulate a Beta variable $Y_1$ from $\text{Be}(\alpha_1, \alpha_{n+1})$ and set $p_{10} = \left(1 - \sum_{j=2}^n p_{j0}\right)Y_1$.
Step 3. Simulate a Beta variable $Y_2$ from $\text{Be}(\alpha_2, \alpha_{n+1})$ and set $p_{21} = \left(1 - \sum_{j \ne 2} p_{j0}\right)Y_2$, and so on, by updating the coordinates one at a time. The result-
ing chain will not be reversible, but reversibility can be retrieved by using a
somewhat more involved updating scheme; see the general principles pre-
sented at the beginning of this section.
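A minimal Python sketch of this coordinate-by-coordinate update follows; the function name and the equal-coordinate starting values are our own choices.

import numpy as np

def dirichlet_gibbs(alpha, n_sweeps, rng=None):
    # Systematic scan Gibbs sampler of Example 19.20 for the Dirichlet
    # distribution with parameter vector alpha = (alpha_1, ..., alpha_{n+1}).
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.asarray(alpha, dtype=float)
    n = len(alpha) - 1
    p = np.full(n, 1.0 / (n + 1))       # initial point inside the simplex
    draws = np.empty((n_sweeps, n))
    for sweep in range(n_sweeps):
        for i in range(n):
            rest = p.sum() - p[i]       # sum of the other n-1 coordinates
            y = rng.beta(alpha[i], alpha[n])
            p[i] = (1.0 - rest) * y     # full conditional is a scaled Beta
        draws[sweep] = p
    return draws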

Example 19.21 (Gibbs Sampling in a Change Point Problem). Suppose for some
given $n \ge 2$, and $1 \le k \le n$, we have $X_1, \ldots, X_k \sim \text{Poi}(\lambda)$, and $X_{k+1}, \ldots, X_n \sim
\text{Poi}(\mu)$, where all $n$ observations are mutually independent, and $k, \lambda, \mu$ are unknown
parameters. This is an example of a change point problem, where an underlying se-
quence of Poisson distributed random variables has a level change at some unknown
time within the interval of the study. This unknown change point is the parameter $k$.
The problem is unusual in that one parameter is integer-valued, and the others con-
tinuous. Change point problems are rather difficult from a theoretical perspective.

Suppose the problem is attacked by assuming a prior distribution on $(k, \lambda, \mu)$,
with

$$k \sim \text{Unif}\{1, 2, \ldots, n\}; \qquad \lambda \sim G(\alpha, \theta_1), \quad \mu \sim G(\beta, \theta_2),$$

where we assume that $k, \lambda, \mu$ are independent.

To implement the systematic scan Gibbs sampler in the order of updating $k \to
\lambda \to \mu$, we only need to know what the full conditionals are. By direct calculations,

$$\pi(k \mid \lambda, \mu) = \frac{e^{-(\lambda - \mu)k}\,(\lambda/\mu)^{\sum_{i=1}^k x_i}}{\sum_{k=1}^n e^{-(\lambda - \mu)k}\,(\lambda/\mu)^{\sum_{i=1}^k x_i}}\,I_{1 \le k \le n};$$

$$\lambda \mid (k, \mu) \sim G\!\left(\alpha + \sum_{i=1}^k x_i,\ \left(\frac{1}{\theta_1} + k\right)^{-1}\right);$$

$$\mu \mid (k, \lambda) \sim G\!\left(\beta + \sum_{i=k+1}^n x_i,\ \left(\frac{1}{\theta_2} + n - k\right)^{-1}\right).$$

Each of these full conditionals is a standard distribution. The full conditional corre-
sponding to $k$ is just a multinomial distribution, and the other two full conditionals
are Gamma distributions. It is easy to simulate from each of them, and so the Gibbs
chain can be generated easily. Special interest lies in estimating the change point
$k$. Once one has the three-component Gibbs chain, the $k$-component of it can be
used to approximate the posterior mean of $k$; there are also methods to estimate the
posterior distribution itself from the output of the Gibbs chain.
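A minimal Python sketch of this three-component Gibbs chain follows; the function name, the starting values, and the log-scale stabilization of the full conditional of $k$ are our own choices, and the Gamma priors are parametrized by shape and scale as above.

import numpy as np

def changepoint_gibbs(x, alpha, beta, theta1, theta2, n_sweeps, rng=None):
    # Systematic scan Gibbs sampler of Example 19.21, updating k -> lambda -> mu.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    S = np.cumsum(x)                        # S[k-1] = x_1 + ... + x_k
    lam = mu = float(np.mean(x)) + 0.5      # crude positive starting values
    ks = np.empty(n_sweeps, dtype=int)
    k_grid = np.arange(1, n + 1)
    for sweep in range(n_sweeps):
        logw = -(lam - mu) * k_grid + S * np.log(lam / mu)
        w = np.exp(logw - logw.max())       # full conditional of k, stabilized
        k = int(rng.choice(k_grid, p=w / w.sum()))
        lam = rng.gamma(alpha + S[k - 1], 1.0 / (1.0 / theta1 + k))
        tail = S[-1] - S[k - 1]             # x_{k+1} + ... + x_n
        mu = rng.gamma(beta + tail, 1.0 / (1.0 / theta2 + (n - k)))
        ks[sweep] = k
    return ks

The empirical distribution of the returned ks then estimates the posterior distribution of the change point.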

19.5 Convergence of MCMC and Bounds on Errors

Theorem 19.4 does ensure that a stationary Markov chain with suitable additional
properties has the target distribution $\pi$ as its unique stationary distribution. In partic-
ular, whatever the initial distribution $\mu$ is, for any state $i \in S$, $P_\mu(X_n = i) \to \pi(i)$
as $n \to \infty$. However, in practice we can only run the chain for a finite amount
of time, and hence only for finitely many steps. So, a question of obvious practi-
cal importance is how large the value of $n$ should be in a given application for the
true distribution of $X_n$ to closely approximate the target distribution $\pi$, according to
some specified measure of distance between distributions. Several probability met-
rics introduced in Chapter 15 now become useful. A related question is what we can
say about the distance between the true distribution of $X_n$ and the target distribution
$\pi$ for a given $n$.
Without any doubt, a study of these questions forms the most challenging and
the most sophisticated part of the MCMC theme. A broad general rule is that the
second largest eigenvalue in modulus (SLEM) of our transition probability matrix
P is going to determine the rapidity of the convergence of the chain to stationarity.
The smaller the SLEM is, the faster will be the convergence. However, it is usually
extremely difficult to go further and find the SLEM explicitly. We can instead try
to give concrete bounds on the SLEM, or directly on the distance between the true
distribution of $X_n$ and the target distribution, and each new problem usually requires
ingenious new methods. The area is still flourishing, and a fascinating interplay of
powerful tools from pure mathematics and probability is producing increasingly
sophisticated results.
There are some general theorems on the SLEM and hence speed of conver-
gence, and these are useful to know. They are mostly for reversible chains on
finite-state spaces, although there are some exceptions. Some of these are presented
with illustrative examples in this section. These results are taken primarily from
Dobrushin (1956), Diaconis and Stroock (1991), Diaconis and Saloff-Coste (1996),
and Brémaud (1999). References on convergence of the Gibbs sampler are given
later.
First, we need to choose the specific distances that we use to determine the ra-
pidity of convergence of the chain. Use of the total variation distance is the most
common; the separation and the chi-square distances are also used. We redefine
these distances here for easy reference; they were introduced and treated in detail in
Chapter 15.

Suppose $P^n = ((p_{ij}(n)))_{t \times t}$ is the $n$-step transition probability matrix for our
stationary chain on the finite state space $S = \{1, 2, \ldots, t\}$. Suppose that enough
conditions on the chain have been assumed so that we know that a unique stationary
distribution $\pi$ exists. If the initial distribution of the chain is $\mu$, then the distribution
of $X_n$ for fixed $n$ is

$$P_\mu(X_n = i) = \sum_{k \in S} P(X_0 = k)\,p_{ki}(n) = \sum_{k \in S} \mu_k\,p_{ki}(n),$$

which is the $i$th element in the vector $\mu P^n$. If the initial distribution $\mu$ is a one-point
distribution at some $x \in S$ (i.e., if $P(X_0 = x) = 1$), then we follow the notation
$P^n(x, \cdot)$ to denote the probabilities

$$P^n(x, A) = P(X_n \in A \mid X_0 = x).$$

The total variation distance between the true distribution of $X_n$ and the stationary
distribution $\pi$ under a general initial distribution $\mu$ is

$$d_{TV}(\mu P^n, \pi) = \sup_A \left|P_{\mu P^n}(A) - P_\pi(A)\right| = \sup_A \left|\sum_{i \in A}\sum_{k \in S} \mu_k\,p_{ki}(n) - \sum_{i \in A}\pi(i)\right|.$$

If we know that for a particular $n_0$, $d_{TV}(\mu P^{n_0}, \pi)$ is small, and if we also know that
$d_{TV}(\mu P^n, \pi)$ is monotone decreasing in $n$, then we can be assured that MCMC output
after step $n_0$ will produce sufficiently accurate estimates of probabilities of arbitrary
sets under the target distribution $\pi$. This is the motivation for wanting a small value
for $d_{TV}(\mu P^n, \pi)$.
Two other distances that are also used in this context are the separation and the
chi-square distances. The separation distance is defined as

$$D(\mu P^n, \pi) = \sup_{i \in S}\left[1 - \frac{P_\mu(X_n = i)}{\pi(i)}\right],$$

and the chi-square distance is defined as

$$\chi^2(\mu P^n, \pi) = \sum_{i \in S} \frac{\left(P_\mu(X_n = i) - \pi(i)\right)^2}{\pi(i)}.$$

Here are the principal notions of convergence that are used.

Definition 19.3. Suppose $\{X_n;\, n \ge 0\}$ is a stationary Markov chain on a discrete
state space $S$ with a unique stationary distribution $\pi$. Assume that $\pi(x) > 0$ for all
$x \in S$. The chain is said to be ergodic in variation norm if for any $x \in S$,

$$d_{TV}(P^n(x, \cdot), \pi) = \sup_A |P(X_n \in A \mid X_0 = x) - \pi(A)| \to 0 \ \text{as } n \to \infty.$$

The chain is said to be geometrically ergodic if for any $x \in S$, there exist a finite
constant $M(x)$ and $0 < r < 1$ such that

$$d_{TV}(P^n(x, \cdot), \pi) \le M(x)\,r^n \ \text{for all } n \ge 1.$$

The chain is said to be uniformly ergodic if there exist a fixed finite constant $M$ and
$0 < r < 1$ such that for any $x \in S$,

$$d_{TV}(P^n(x, \cdot), \pi) \le M\,r^n \ \text{for all } n \ge 1.$$

If the chain has a finite state space and is geometrically ergodic, then it is obvi-
ously uniformly ergodic. Otherwise, and especially so for MCMC chains, uniform
ergodicity usually does not hold, although geometric ergodicity does.

19.5.1 Spectral Bounds

In the case of an irreducible and aperiodic chain with a finite state space, conver-
gence to stationarity occurs geometrically fast. That is,

$$P^n = \begin{pmatrix} \pi \\ \vdots \\ \pi \end{pmatrix} + O\!\left(n^{\alpha_2 - 1} c^n\right),$$

where $c$ is the second largest value among the moduli of the eigenvalues of $P$
(called the SLEM), and $\alpha_2$ is the algebraic multiplicity of that particular eigenvalue.
This follows from the general spectral decomposition of stochastic matrices, and is also
known as the Perron–Frobenius theorem; see Chapter 6 in Brémaud (1999).

The important point is that irreducibility and aperiodicity guarantee that all
eigenvalues except one are strictly less than one in modulus (although they can be
complex), and so, in particular, $c < 1$, which causes each row of $P^n$ to converge to
the stationary vector $\pi$ at a geometric (i.e., exponentially fast) rate, and therefore,
the chain is geometrically ergodic. If, in addition, the chain is reversible, then the
eigenvalues are real, say

$$1 = \lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_t > -1,$$

and so in that case, $c = \max\{\lambda_2, |\lambda_t|\}$.

Example 19.22 (Two-State Chain). This is the simplest possible example of a
stationary Markov chain. Consider the two-state chain with the transition proba-
bility matrix

$$P = \begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \beta & \beta \end{pmatrix},$$

where $0 < \alpha, \beta < 1$. The eigenvalues of $P$ can be analytically computed, and they
are

$$\lambda_1 = 1, \qquad \lambda_2 = \alpha + \beta - 1.$$

Thus, the SLEM equals $c = |\alpha + \beta - 1|$.
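A quick numerical check of this SLEM is easy with numpy; the following short sketch uses illustrative values of $\alpha, \beta$.

import numpy as np

def slem(P):
    # Second largest eigenvalue modulus of a transition matrix P.
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return moduli[1]

alpha, beta = 0.3, 0.8
P = np.array([[alpha, 1 - alpha], [1 - beta, beta]])
print(slem(P), abs(alpha + beta - 1))   # both are 0.1, up to rounding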

Example 19.23. Consider the three-state stationary Markov chain with the transi-
tion probability matrix

$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & .5 & .5 \\ .5 & 0 & .5 \end{pmatrix}.$$

Note that the chain is irreducible and aperiodic (see Example 10.13). But it is not
reversible. Thus, there can be complex eigenvalues. Indeed, the eigenvalues are

$$\lambda_1 = 1; \qquad \lambda_2 = \frac{i}{2}; \qquad \lambda_3 = -\frac{i}{2}.$$

The moduli of the eigenvalues are therefore $1, \frac{1}{2}$, and $\frac{1}{2}$, and so, the SLEM equals
$c = \frac{1}{2}$.
The practical difficulty is that it is usually difficult or impossible to derive expres-
sions for $\lambda_2$ and $\lambda_t$ in closed form, especially so when the number of possible
states $t$ is large. Hence, much of the research effort in this field has been directed to-
wards finding effective and explicit upper bounds on $c$. The following theorem gives
some of the most significant bounds currently available on the distance between
$P^n$ and $\pi$ for some of the distances that we defined above. The bounds in the
theorem below all depend on our ability to evaluate or bound the SLEM $c$, and they
are all for the reversible finite-state case.
Theorem 19.8. Let $\{X_n;\, n \ge 0\}$ be an irreducible, stationary, reversible Markov
chain on the finite state space $S = \{1, 2, \ldots, t\}$. Let $\pi$ be the stationary distri-
bution of the chain, and $c$ the SLEM of its transition probability matrix $P$. Then,

(a) For all $n \ge 1$ and any $i \in S$,

$$\sup_A |P^n(i, A) - \pi(A)| \le \sqrt{\frac{1 - \pi(i)}{\pi(i)}}\,\frac{c^n}{2}.$$

(b) For all $n \ge 1$ and any $i \in S$,

$$\sup_A |P^n(i, A) - \pi(A)| \le \sqrt{\frac{p_{ii}(2)}{\pi(i)}}\;c^{n-1},$$

where $p_{ii}(2)$ is the $i$th diagonal element in $P^2$.

(c) For all $n \ge 1$ and any initial distribution $\mu$,

$$\chi^2(\mu P^n, \pi) \le c^{2n}\,\chi^2(\mu, \pi).$$

(d) For all $n \ge 1$ and any initial distribution $\mu$,

$$\sup_A |P_\mu(X_n \in A) - \pi(A)| \le \frac{c^n}{2}\sqrt{\chi^2(\mu, \pi)}.$$

Proof. See pp. 208–210 in Brémaud (1999).

Example 19.24 (Two-State Chain Revisited). Consider again the chain of Example
19.22. We have the eigenvalues of $P$ as $\lambda_1 = 1$, $\lambda_2 = \alpha + \beta - 1$. We can calculate
$P^n$ exactly by spectral decomposition, and we get

$$P^n = \frac{1}{2 - \alpha - \beta}\begin{pmatrix} 1 - \beta & 1 - \alpha \\ 1 - \beta & 1 - \alpha \end{pmatrix}
+ \frac{(\alpha + \beta - 1)^n}{2 - \alpha - \beta}\begin{pmatrix} 1 - \alpha & \alpha - 1 \\ \beta - 1 & 1 - \beta \end{pmatrix}.$$

The first matrix in this formula represents the matrix $\begin{pmatrix} \pi \\ \pi \end{pmatrix}$ (the stationary dis-
tribution $\pi$ was derived in Example 10.18). Therefore, we can also calculate
$\sup_A |P^n(i, A) - \pi(A)|$ exactly. Indeed, if we take the initial state to be $i = 1$, then
we have

$$\sup_A |P(X_n \in A \mid X_0 = i) - \pi(A)| = \frac{1}{2}\sum_{j=1}^2 |p_{1j}(n) - \pi(j)|$$

(this is a standard formula for the total variation distance; see Section 15.1)

$$= \frac{|\alpha + \beta - 1|^n}{2(2 - \alpha - \beta)}\left(|1 - \alpha| + |\alpha - 1|\right) = \frac{1 - \alpha}{2 - \alpha - \beta}\,|\alpha + \beta - 1|^n.$$

On the other hand, the first bound in part (a) of Theorem 19.8 equals

$$\sqrt{\frac{1 - \pi(1)}{\pi(1)}}\,\frac{c^n}{2} = \sqrt{\frac{\pi(2)}{\pi(1)}}\,\frac{c^n}{2} = \sqrt{\frac{1 - \alpha}{1 - \beta}}\,\frac{|\alpha + \beta - 1|^n}{2}.$$

Therefore, the relative error of the upper bound is

$$\frac{\sqrt{\frac{1-\alpha}{1-\beta}}\cdot\frac{1}{2}}{\frac{1-\alpha}{2-\alpha-\beta}} - 1 = \frac{\frac{(1-\alpha)+(1-\beta)}{2}}{\sqrt{(1-\alpha)(1-\beta)}} - 1.$$

That is, we get the very interesting conclusion that the relative error of the bound in
part (a) of Theorem 19.8 is the quotient of the arithmetic and the geometric mean of
$1 - \alpha$ and $1 - \beta$, minus one. If $\alpha = \beta$, then the error is zero, and the bound becomes
exact. If $\alpha$ and $\beta$ are very different, the bound becomes inefficient.

Example 19.25 (SLEM for Metropolis–Hastings Algorithm). As a rule, the eigen-
values of $P$ are hard to evaluate in symbolic form, except when the size of the state
space is very small. An important example where the eigenvalues can, in fact, be
found in closed form is the Metropolis–Hastings algorithm. The availability of gen-
eral formulas for the eigenvalues makes the SLEM bounds applicable when using
the Metropolis–Hastings algorithm.

In this case, the transition probabilities are given by

$$p_{ij} = \frac{1}{t}\min\left\{1,\ \frac{\pi(j)}{\pi(i)}\right\}, \quad j \ne i; \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$

We label the states in decreasing order of the values of their stationary probabili-
ties. That is, we label the states such that $\pi(1) \ge \pi(2) \ge \cdots \ge \pi(t)$. The first
eigenvalue, as always, is $\lambda_1 = 1$, and the remaining eigenvalues are

$$\lambda_k = \frac{1}{t}\left[\sum_{j=k-1}^{t}\frac{\pi(k-1) - \pi(j)}{\pi(k-1)}\right], \quad k \ge 2.$$
For example, take $t = 3$. Then, denoting $\pi(1) = a \ge \pi(2) = b \ge \pi(3) = c =
1 - a - b$, the transition probability matrix works out to

$$P = \frac{1}{3}\begin{pmatrix} 4 - \frac{1}{a} & \frac{b}{a} & \frac{c}{a} \\ 1 & 2 - \frac{c}{b} & \frac{c}{b} \\ 1 & 1 & 1 \end{pmatrix}.$$

By direct verification, the eigenvalues of $P$ are

$$1, \qquad 1 - \frac{1}{3a}, \qquad \frac{2}{3} - \frac{1-a}{3b}.$$

To see that these correspond to the eigenvalue formulas,

$$\frac{1}{3}\left[\sum_{j=1}^{3}\frac{\pi(1) - \pi(j)}{\pi(1)}\right]
= \frac{1}{3}\left[3 - \frac{1}{\pi(1)}\sum_{j=1}^{3}\pi(j)\right] = 1 - \frac{1}{3\pi(1)} = 1 - \frac{1}{3a},$$

because $\sum_{j=1}^{3}\pi(j) = 1$. Likewise,

$$\frac{1}{3}\left[\sum_{j=2}^{3}\frac{\pi(2) - \pi(j)}{\pi(2)}\right]
= \frac{1}{3}\left[2 - \frac{1 - \pi(1)}{\pi(2)}\right] = \frac{2}{3} - \frac{1-a}{3b}.$$

The general case is proved by direct verification in Liu (1995).

19.5.2 * Dobrushin's Inequality and Diaconis–Fill–Stroock Bound

The bounds in the preceding section are only for the reversible case. There are
some examples in which convergence occurs, although the chain is not reversible.
There are two principal ways to deal with the nonreversible situation. A classic
inequality in Dobrushin (1956) provides an upper bound on the SLEM of the tran-
sition probability matrix P by using a very simple computable index, known as
Dobrushin’s coefficient, even if the chain is not reversible. A second method intro-
duced in Diaconis and Stroock (1991) and Fill (1991) uses a technique of reversal
of a nonreversible transition probability matrix, and derives the needed bounds in
terms of the SLEM of a reversible matrix M , constructed from the original transition
probability matrix P .
First, we need some notation and definitions. Given a $t \times t$ transition probability
matrix $P$ (not necessarily reversible), let

$$\vec{p}_i = (p_{i1}, \ldots, p_{it}); \qquad \tilde{p}_{ij} = \frac{p_{ji}\,\pi(j)}{\pi(i)}; \qquad \tilde{P} = \left(\left(\tilde{p}_{ij}\right)\right); \qquad M = P\tilde{P},$$

where $\pi$ is the stationary distribution corresponding to $P$. Note that $M$ is reversible.

Definition 19.4. Let $P$ be the transition probability matrix of a stationary Markov
chain on a finite state space $S = \{1, 2, \ldots, t\}$. The Dobrushin coefficient of $P$ is
defined as

$$\delta(P) = \max_{i,j \in S} d_{TV}(\vec{p}_i, \vec{p}_j) = \max_{i,j \in S}\,\frac{1}{2}\sum_{k \in S} |p_{ik} - p_{jk}|.$$

In the above, recall that we are using the notation $d_{TV}$ to denote the total variation distance.
Note that $0 \le \delta(P) \le 1$, and usually, $0 < \delta(P) < 1$.

Here is our main theorem on handling rates of convergence to stationarity when
the chain is not reversible. These are not the only bounds available, but the specific
bounds in the theorem below have some simplicity about them.

Theorem 19.9. Let $P$ be the transition probability matrix of a stationary Markov
chain on a finite state space, with unique stationary distribution $\pi$. Let $c$ be the
SLEM of $P$ and $\lambda$ the SLEM of $M$. Then,

(a) $c \le \delta(P)$.
(b) For any two initial distributions $\mu, \nu$, and for all $n \ge 1$,

$$d_{TV}(\mu P^n, \nu P^n) \le d_{TV}(\mu, \nu)\,[\delta(P)]^n.$$

(c) For all $n \ge 1$ and any $i \in S$,

$$\left(\sup_A |P^n(i, A) - \pi(A)|\right)^2 \le \frac{\lambda^n}{4\pi(i)}.$$

See Brémaud (1999, pp. 237–238) for a proof of this theorem.

Example 19.26 (Two-State Chain). Consider our two-state case with the transition
probability matrix $P$

$$P = \begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \beta & \beta \end{pmatrix}.$$

Then, the Dobrushin coefficient equals

$$\delta(P) = \frac{1}{2}\left[|\alpha - (1 - \beta)| + |(1 - \alpha) - \beta|\right] = |\alpha + \beta - 1|.$$

We have previously seen that the eigenvalues of $P$ are 1 and $\alpha + \beta - 1$ (see Example
19.22), so that the SLEM equals $c = |\alpha + \beta - 1|$. Hence, in this case, $c = \delta(P)$.
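The Dobrushin coefficient is easy to compute numerically for any finite transition matrix; a small Python sketch (function name our own):

import numpy as np

def dobrushin(P):
    # Maximum total variation distance between any two rows of P.
    t = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[j]).sum()
               for i in range(t) for j in range(t))

alpha, beta = 0.3, 0.8
P = np.array([[alpha, 1 - alpha], [1 - beta, beta]])
print(dobrushin(P))   # equals |alpha + beta - 1| = 0.1 here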
Example 19.27 (Metropolis–Hastings for Truncated Geometric). It was shown in
Example 19.17 that the Metropolis–Hastings algorithm for simulating from the dis-
tribution of $Y$ given $Y \le m$, where $Y \sim \text{Geo}(p)$, has the transition matrix with
entries

$$p_{ij} = \frac{1}{m} \ \text{if } j < i; \qquad p_{ij} = \frac{1}{m}(1-p)^{j-i} \ \text{if } j > i; \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$

The eigenvalues of $P$ are remarkably structured, and are

$$\lambda_1 = 1, \qquad \lambda_{i+1} = \frac{i - \sum_{j=1}^{i} q^j}{m}, \quad i = 1, \ldots, m-1,$$

where $q = 1 - p$. The eigenvalues are all in the unit interval $[0, 1]$, and therefore
the second largest eigenvalue is also the SLEM. The second largest eigenvalue is the
last one, namely,

$$\lambda_m = \frac{m - 1 - \sum_{j=1}^{m-1} q^j}{m} = 1 - \frac{1 - q^m}{m(1-q)} = 1 - \frac{1 - q^m}{mp}.$$

There is no such general closed-form formula for the Dobrushin coefficient in this
case, but for any given $m$ and $p$, it is easily computable.
Example 19.28 (Dobrushin Bound May Not Be Useful). Consider again the transi-
tion matrix of Example 19.23:

$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & .5 & .5 \\ .5 & 0 & .5 \end{pmatrix}.$$

In this case, $d_{TV}(\vec{p}_1, \vec{p}_2) = .5$, $d_{TV}(\vec{p}_2, \vec{p}_3) = .5$, and $d_{TV}(\vec{p}_1, \vec{p}_3) = 1$. Therefore,
$\delta(P) = 1$, and so the results of Theorem 19.9 involving the Dobrushin coeffi-
cient are not useful in this example. A closer examination reveals that the Dobrushin
coefficient is rendered equal to one because $\vec{p}_1$ and $\vec{p}_3$ are orthogonal. In any such
case, the bounds involving $\delta(P)$ would not be informative.

19.5.3 * Drift and Minorization Methods

The bounds of this section on the distance between the exact distribution of the
chain at some fixed time $n$ and the stationary distribution $\pi$ have the appealing
feature that they apply to even nonreversible chains. As we have seen, certain Gibbs
chains are not reversible. However, in principle, the drift and minorization methods
of this section can apply to them, and geometric convergence to stationarity may be
established by using the methods below.

We consider the case of discrete state spaces below. For more general state spaces
(e.g., if $S = \mathbb{R}_+$), the methods of this section need an added assurance of a form
of recurrence, known as Harris recurrence. For the special case of a Gibbs chain
on a discrete state space, the Harris recurrence condition will automatically hold
under the conditions that we assume. However, it should be emphasized that Harris
recurrence must be verified before using drift methods if the underlying chain has a
more general state space. We first need the appropriate definitions.

Definition 19.5. Let $\{X_n;\, n \ge 0\}$ be an irreducible and aperiodic stationary Markov
chain on a discrete state space $S$. The chain is said to satisfy the drift condition with
the energy function $V: S \to \mathbb{R}_+$ if there exist $\gamma > 0$, a real number $b$, and a finite
subset of states $R \subseteq S$ such that

$$E[V(X_{k+1}) \mid X_k = x] = \sum_{y \in S} V(y)\,p_{xy} \le (1 - \gamma)V(x) + b$$

for all $x \notin R$.

Definition 19.6. Let $\{X_n;\, n \ge 0\}$ be an irreducible and aperiodic stationary Markov
chain on a discrete state space $S$. Suppose also that the chain satisfies the drift
condition with the energy function $V$, and the associated constants $\gamma, b$. Then, the
chain is said to satisfy the minorization condition with respect to $q$, a probability
distribution on $S$, if there exist $\delta > 0$ and $\alpha > 2b$ such that

$$p_{xy} \ge \delta\,q(y) \quad \forall\ y \in S,\ x \in C,$$

where $C = \{x \in S : V(x) \le \alpha\}$.

Remark. The set $C$ in the definition of the minorization condition is generally called
a small set. The drift condition says that if the chain is sitting at some $x \notin R$,
then it will tend to drift towards some state with a smaller energy than $x$. However,
the energy function is bounded from below, and so the energy cannot keep decreasing
at a steady rate ad infinitum. So, eventually, the chain will seek shelter in $R$, and stay there.
Convergence will occur geometrically fast; that is, the total variation distance will
satisfy a bound of the form $\sup_A |P^n(x, A) - \pi(A)| \le K\rho^n$ for some $\rho \in (0, 1)$ and
some $K$. If we have the minorization condition, we can furthermore place concrete
bounds on $\rho$, typically of the form $\rho \le 1 - \delta$.

It is an unfortunate aspect of the approach via drift and minorization conditions
that it is usually very difficult to manufacture the energy function $V$ and the prob-
ability distribution $q$, even in simple problems. In addition, if an energy function $V$
satisfying the drift condition exists, it is not unique, and different choices produce
different constants $K, \rho$. However, the approach has met with some success, and it is a
main theoretical approach in the convergence area. In particular, the drift approach
has been successfully used for some Gibbs chains for simulating from the posterior
distribution of a vector of means in so-called hierarchical Bayes linear models.
At first glance, the drift and the minorization conditions appear to be rather ob-
scure, and it is not clear how they lead to bounds on the total variation distance
between the exact distribution of the chain and the stationary distribution. For the
case that the minorization condition $p_{xy} \ge \delta q(y)$ holds for all $x$ and $y$ in the state
space $S$, we show the proof of the following theorem.

Theorem 19.10. Let $\{X_n;\, n \ge 0\}$ be an irreducible and aperiodic stationary Markov
chain on a discrete state space $S$. Suppose also that the chain satisfies the minoriza-
tion condition $p_{xy} \ge \delta q(y)$ for all $x, y \in S$. Then, for all $n \ge 1$, and any $i \in S$,

$$\sup_A |P^n(i, A) - \pi(A)| \le (1 - \delta)^n.$$

Proof. The proof uses a famous technique in probability theory, known as coupling.
We manufacture (only for the purpose of the proof) two new chains, $\{Z_n;\, n \ge 0\}$
and $\{Y_n;\, n \ge 0\}$, such that the two chains will eventually couple. That is, from some
random time $T = t$ onwards, we will have $Z_n = Y_n$. It turns out that the chains
are so constructed that this coupling time $T \sim \text{Geo}(\delta)$, and it also turns out that
$|P(X_n \in A \mid X_0 = i) - \pi(A)| \le P(T > n)$ for any $A, i$, and $n$, where $X_n$ is our
original chain (with $\pi$ as the stationary distribution). Because $P(T > n) = (1-\delta)^n$,
the theorem follows.

Here is how the chains $Z_n, Y_n,\ n \ge 0$ are constructed. Define the probability
density $r_x(y)$ on the state space $S$ by the defining equation

$$p_{xy} = \delta\,q(y) + (1 - \delta)\,r_x(y);$$

we can define such a density because the minorization condition has been assumed
to hold for all $x, y \in S$. Set $Z_0 = i \in S$, choose $Y_0 \sim \pi$, and choose a Bernoulli variable
$B_0 \sim \text{Ber}(\delta)$. If $B_0 = 0$, choose $Y_1 \sim r_{y_0}(y)$ and choose $Z_1 \sim r_{z_0}(y)$, indepen-
dently of each other. If $B_0 = 1$, choose a single $z \sim q(y)$ and set $Z_1 = Y_1 = z$.
This is one sweep. Repeat this sweep, mutually independently, until the $Z$-chain
and the $Y$-chain couple, and call the coupling time $T$. Once the chains couple, at all
subsequent steps, the two chains remain identical. Because the sweeps are indepen-
dent and at each sweep the same Bernoulli experiment with success probability $\delta$ is
performed, we have $T \sim \text{Geo}(\delta)$.

To complete the proof, we now show that $|P(X_n \in A \mid X_0 = i) - \pi(A)| \le
P(T > n)$ for any $A, i$, and $n$. For this, note the important fact that because $Y_0$
was chosen from the stationary distribution $\pi$, and because the transition probabil-
ities satisfy the mixture representation $p_{xy} = \delta q(y) + (1 - \delta)r_x(y)$, the marginal
distribution of $Y_n$ coincides with $\pi$ for every $n$. Furthermore, because the $Z$-chain
started at the state $i$, its marginal distribution coincides with the distribution of $X_n$
given that $X_0 = i$. Hence,

$$\begin{aligned}
|P(X_n \in A \mid X_0 = i) - \pi(A)|
&= |P(X_n \in A \mid X_0 = i) - P(Y_n \in A)| \\
&= |P(Z_n \in A) - P(Y_n \in A)| \\
&= |P(Z_n \in A, Z_n = Y_n) + P(Z_n \in A, Z_n \ne Y_n) \\
&\qquad - P(Y_n \in A, Y_n = Z_n) - P(Y_n \in A, Y_n \ne Z_n)| \\
&= |P(Z_n \in A, Z_n \ne Y_n) - P(Y_n \in A, Y_n \ne Z_n)| \\
&\le \max\left\{P(Z_n \in A, Z_n \ne Y_n),\ P(Y_n \in A, Z_n \ne Y_n)\right\} \\
&\le P(Z_n \ne Y_n) = P(T > n). \qquad \square
\end{aligned}$$

19.6 MCMC on General Spaces

Markov chain Monte Carlo methods are not limited to target distributions $\pi$ that
have a discrete set $S$ as their support. They are routinely used for simulating from
continuous distributions with general supports. The basic definitions, results, and
phenomena in these general cases closely match the corresponding results in the
discrete case. For example, we still have Metropolis schemes and the Gibbs sampler
for these general cases, and convergence theory and convergence conditions are
very similar to those for the discrete case. We describe the continuous versions of
the MCMC algorithms and the basic associated theory with examples in this section.
First, we set up the necessary notation and provide the necessary definitions.

19.6.1 General Theory and Metropolis Schemes

Definition 19.7. A discrete-time stochastic process $\{X_n;\, n \ge 0\}$ is called a stationary
or a homogeneous Markov chain on a general state space $S \subseteq \mathbb{R}^d$ if there exists a
Markov transition kernel $P(x, \cdot)$, which is a function of two arguments, $x \in S$ and
$A \subseteq S$, such that for fixed $x$, $P(x, \cdot)$ is a probability distribution (measure) on $S$,
and for any $n \ge 0$ and any $x \in S$,

$$P(X_{n+1} \in A \mid X_n = x) = P(x, A).$$

The Markov transition kernel $P(x, A)$ is the direct generalization of the probability
$P(X_{n+1} \in A \mid X_n = i) = \sum_{j \in A} p_{ij}$ in the discrete state space case. In the discrete
case, the diagonal elements $p_{ii}$ can be strictly positive. Likewise, in the general
state space case, $P(x, \{x\})$ can be strictly positive. Therefore, in general, a Markov
transition kernel $P(x, A)$ need not be a continuous distribution; that is, $P(x, \cdot)$
need not have a density. If it does, the density corresponding to the measure $P(x, \cdot)$
is called a transition density, and we denote it by $p(x, y)$. So, a transition density,
if it exists, satisfies the two usual properties:

$$p(x, y) \ge 0; \qquad \int_S p(x, y)\,dy = 1.$$

Caution. In the next definition, a probability measure and its density function have
been denoted by using the same notation $\pi$. This is an abuse of notation, but is very
common in the literature in the present context.

Definition 19.8. Suppose $\{X_n;\, n \ge 0\}$ is a stationary Markov chain with Markov
transition kernel $P(x, \cdot)$. A probability measure $\pi$ with density $\pi$ on $S$ is called a
stationary or an invariant measure for the chain if for all $A \subseteq S$,

$$\pi(A) = \int_S P(x, A)\,\pi(x)\,dx.$$

This is a direct extension of the corresponding equation $\pi(i) = \sum_{j \in S} p_{ji}\pi(j)$ in
the discrete case. The distribution of the chain at time $n$, having started from state $x$
at time zero, is still denoted as

$$P^n(x, A) = P(X_n \in A \mid X_0 = x).$$
As in the discrete case, we must remember that

(a) A stationary distribution $\pi$ need not exist.
(b) If it exists, it need not be unique.
(c) Even if a certain distribution $\pi$ is a stationary distribution for a chain $X_n$,
we need not have the convergence property $\sup_A |P^n(x, A) - \pi(A)| \to 0$ as
$n \to \infty$.

And, as in the discrete case, the three conditions that help us eliminate these obsta-
cles are reversibility, irreducibility, and aperiodicity. We need to redefine them in
the new notation and the new general state space setup.

Definition 19.9. A stationary Markov chain on a general state space $S$ with a
transition density $p(x, y)$ is called reversible with respect to $\pi$ if the detailed bal-
ance equation $\pi(x)p(x, y) = \pi(y)p(y, x)$ holds for all $x, y \in S$, $y \ne x$.

Definition 19.10. A stationary Markov chain $\{X_n;\, n \ge 0\}$ on a general state space $S$
is called irreducible with respect to a given $\pi$ if for any $x \in S$ and any $A \subseteq S$ such
that $\pi(A) > 0$, $P(\exists\ n \text{ such that } X_n \in A \mid X_0 = x) > 0$.

Definition 19.11. A stationary Markov chain $\{X_n;\, n \ge 0\}$ on a general state space
$S$ is called periodic of period $d > 1$ if there exists a partition $A_1, \ldots, A_d$ of the
state space $S$ such that the one-step transitions of the chain must be of the form
$A_1 \to A_2 \to A_3 \to \cdots \to A_d \to A_1$. The chain is called aperiodic if there is no
$d > 1$ such that the chain is periodic of period $d$.
Metropolis chains on general state spaces will usually (but not always) have these
three helpful properties of reversibility with respect to the target distribution $\pi$ that
we have in mind, irreducibility with respect to that same $\pi$, and aperiodicity. This
enables us to conclude that the chain indeed has the targeted $\pi$ as its unique station-
ary distribution, and that convergence takes place as well. Those issues are picked
up later in this section.

The Metropolis chain in this general case is defined in essentially the same man-
ner as in the discrete case. If the chain is currently at state $x$, a candidate state $y$ is
generated according to a proposal distribution, and once a candidate state has been
picked, it is accepted or declined according to an acceptance distribution. We need
notation for these two things. We assume below that the proposal distribution is
continuous; that is, it has a density, and this proposal density is denoted as $\rho(x, y)$.
So, the properties of a proposal density are $\rho(x, y) \ge 0$; $\int_S \rho(x, y)\,dy = 1$. The
acceptance probabilities are denoted by $a(x, y)$; so, for each fixed $x$ and each fixed
$y$, $0 \le a(x, y) \le 1$. The overall transition kernel of our Metropolis chain is then

$$p(x, y) = \rho(x, y)\,a(x, y), \quad x, y \in S,\ y \ne x; \qquad
P(x, \{x\}) = \int_S \rho(x, y)\left[1 - a(x, y)\right]dy.$$

Throughout, we make the assumption that the stationary density $\pi(x)$ and the pro-
posal densities $\rho(x, y)$ are strictly positive for all $x, y$. This is not essential, but it
saves us from a lot of tedious fixing if we do not allow them to take the value zero.
Now we proceed to define the special MCMC schemes exactly as we did in the
discrete case.

Independent Sampling. Choose

$$\rho(x, y) = \pi(y)\ \forall\ y; \qquad a(x, y) \equiv 1.$$

General Metropolis Scheme. Choose a general $\rho(x, y)$ and

$$a(x, y) = \min\left\{\frac{\pi(y)\rho(y, x)}{\pi(x)\rho(x, y)},\ 1\right\}.$$

Random Walk Metropolis Scheme. In the general Metropolis scheme, choose
$\rho(x, y) = \rho(y, x) = q(|y - x|)$ for some fixed density $q$, so that

$$a(x, y) = \min\left\{\frac{\pi(y)}{\pi(x)},\ 1\right\}.$$

Independent Metropolis Scheme. Choose $\rho(x, y) = p(y)$, that is, a fixed strictly
positive density $p(y)$ independent of $x$, and

$$a(x, y) = \min\left\{1,\ \frac{\pi(y)/p(y)}{\pi(x)/p(x)}\right\}.$$

Example 19.29 (Simulating from a t-Distribution). Suppose the target distribution
is a t-distribution with $\alpha$ degrees of freedom, where we take $\alpha > 1$. Thus,

$$\pi(x) = \frac{c}{\left(1 + \frac{x^2}{\alpha}\right)^{\frac{\alpha+1}{2}}}, \quad -\infty < x < \infty.$$

The normalizing constant $c$ is unimportant for the purpose of implementing an
MCMC scheme. Note that in this example, the state space is $S = \mathbb{R}$. We illus-
trate the use of a random walk Metropolis scheme with a proposal distribution that
is Cauchy. Precisely,

$$\rho(x, y) = \frac{c}{1 + (y - x)^2}, \quad -\infty < y < \infty;$$

once again, $c$ is a normalizing constant (not the same one as in the t density), and is
not important for our purpose. Therefore,

$$\frac{\pi(y)\rho(y, x)}{\pi(x)\rho(x, y)}
= \frac{\left(1 + \frac{x^2}{\alpha}\right)^{\frac{\alpha+1}{2}}\left(1 + (y-x)^2\right)}{\left(1 + \frac{y^2}{\alpha}\right)^{\frac{\alpha+1}{2}}\left(1 + (x-y)^2\right)}
= \left(\frac{\alpha + x^2}{\alpha + y^2}\right)^{\frac{\alpha+1}{2}}
\ge 1 \iff x^2 \ge y^2 \iff |x| \ge |y|.$$

Hence the acceptance probabilities of the Metropolis scheme are given by

$$a(x, y) = 1 \ \text{if } |x| \ge |y|, \qquad \text{and} \qquad a(x, y) = \left(\frac{\alpha + x^2}{\alpha + y^2}\right)^{\frac{\alpha+1}{2}} \ \text{if } |x| < |y|.$$

So, finally, the Metropolis scheme for simulating from the t-distribution proceeds
as follows.

Step 1. If the current value of the chain is $x$, generate a candidate state $y \sim C(x, 1)$.
Step 2. If $|x| \ge |y|$, accept the candidate value $y$; if $|x| < |y|$, perform a Bernoulli
experiment with success probability $\left(\frac{\alpha + x^2}{\alpha + y^2}\right)^{\frac{\alpha+1}{2}}$, and accept the candidate
value $y$ if the Bernoulli experiment results in a success.
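These two steps translate into a few lines of Python; the following is a minimal sketch (function name our own), using the fact that the single formula $\min\{1, ((\alpha + x^2)/(\alpha + y^2))^{(\alpha+1)/2}\}$ covers both cases of Step 2.

import numpy as np

def t_random_walk_metropolis(alpha, n_steps, x0=0.0, rng=None):
    # Random walk Metropolis chain of Example 19.29: Cauchy C(x, 1) proposals
    # targeting the t-distribution with alpha degrees of freedom.
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    chain = np.empty(n_steps)
    for step in range(n_steps):
        y = x + rng.standard_cauchy()        # Step 1: candidate from C(x, 1)
        a_xy = min(1.0, ((alpha + x * x) / (alpha + y * y)) ** ((alpha + 1) / 2))
        if rng.random() < a_xy:              # Step 2: accept with probability a(x, y)
            x = y
        chain[step] = x
    return chain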
Example 19.30 (Simulating from a Nonconventional Multivariate Distribution).
Suppose we want to simulate from a trivariate distribution with the density

$$\pi(x_1, x_2, x_3) = \frac{c}{(1 + x_1 + x_2 + x_3)^4}, \quad x_1, x_2, x_3 > 0.$$

Thus, in this example, the state space is $S = \mathbb{R}_+^3$. We use a suitable independent
Metropolis scheme to simulate from $\pi$. Towards this, choose the proposal density

$$p(y_1, y_2, y_3) = \frac{c}{(1 + y_1^2 + y_2^2 + y_3^2)^2}, \quad y_1, y_2, y_3 > 0.$$

Note that the proposal density is a trivariate spherically symmetric Cauchy density,
restricted to the first octant. We can simulate quite easily from this proposal den-
sity. For instance, we can first simulate a value for $r = \sqrt{y_1^2 + y_2^2 + y_3^2}$ according
to the density

$$g(r) = \frac{4}{\pi}\,\frac{r^2}{(1 + r^2)^2}, \quad 0 < r < \infty,$$

and then generate $(y_1, y_2, y_3)$ as $(y_1, y_2, y_3) = r(u_1, u_2, u_3)$, where $(u_1, u_2, u_3)$ is
a point on the boundary of the unit sphere (see Section 19.2.4 on how to do these
simulations). When the point $(u_1, u_2, u_3)$ does not fall in the first octant, we have
to fix the signs.

We only have to work out the acceptance probabilities of our Metropolis scheme
now. We have

$$\frac{\pi(y)/p(y)}{\pi(x)/p(x)} = \frac{(1 + x_1 + x_2 + x_3)^4\,(1 + y_1^2 + y_2^2 + y_3^2)^2}{(1 + y_1 + y_2 + y_3)^4\,(1 + x_1^2 + x_2^2 + x_3^2)^2} = \alpha(x, y), \ \text{say}.$$

Then, the acceptance probability is $a(x, y) = \min\{1, \alpha(x, y)\}$. For example, if the
current state is $(x_1, x_2, x_3) = (1, 1, 1)$, and the candidate state is $(y_1, y_2, y_3) =
(2, 2, 2)$, then $\alpha(x, y) = 1.126$, and therefore $a(x, y) = \min\{1, 1.126\} = 1$; that is,
the candidate state is definitely accepted.
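A minimal Python sketch of this scheme follows. Instead of the radial method described above, the sketch draws the proposal via the normal/chi-square representation of the spherically symmetric Cauchy (a trivariate t with one degree of freedom), folding signs into the first octant; this substitution of the simulation method, like the function names, is our own choice.

import numpy as np

def trivariate_indep_metropolis(n_steps, rng=None):
    # Independent Metropolis scheme of Example 19.30: the proposal is a
    # spherically symmetric Cauchy density folded into the first octant.
    rng = np.random.default_rng() if rng is None else rng
    def log_ratio(v):   # log of pi(v)/p(v), up to an additive constant
        return 2.0 * np.log1p(np.sum(v ** 2)) - 4.0 * np.log1p(np.sum(v))
    def propose():
        z = rng.standard_normal(3)
        w = rng.chisquare(1)
        return np.abs(z / np.sqrt(w))    # fold the signs into the first octant
    x = np.ones(3)
    chain = np.empty((n_steps, 3))
    for step in range(n_steps):
        y = propose()
        if np.log(rng.random()) < log_ratio(y) - log_ratio(x):
            x = y
        chain[step] = x
    return chain

At $x = (1, 1, 1)$ and $y = (2, 2, 2)$, this acceptance rule reproduces $\alpha(x, y) = 1.126$ from the example.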

19.6.2 Convergence

The convergence issues, conditions needed, proofs of the convergence theorems,


and verification of the conditions needed for convergence of Metropolis schemes on
general state spaces are all substantially more complicated than in the discrete state
space case. The questions of importance are the following.
Question 1. Does a particular Metropolis scheme in a specific problem have a
stationary distribution at all; if so, is it our target distribution $\pi$; and if so, is
$\pi$ the unique stationary distribution?
Question 2. For a particular Metropolis scheme in a specific problem, does
$\sup_A |P^n(x, A) - \pi(A)| \to 0$?
Question 3. If it does, is the chain geometrically ergodic?
Question 4. Is the chain uniformly ergodic?

Of these, the answer to the first question is usually straightforward. By construc-
tion, our Metropolis chain will usually be reversible with respect to the given $\pi$ that
we have in mind, and exactly as in the discrete state space case, this ensures that
$\pi$ is a stationary distribution. Uniqueness follows if we also have irreducibility and
aperiodicity of our Metropolis chain. So, what we strive for is the construction of a
Metropolis chain that is reversible, irreducible, and aperiodic. The nice thing
about Metropolis chains is that we can usually (but not always) achieve all three
without imposing stringent conditions on $\pi$ or the chain.
An added complexity in the case of general state spaces, which does not arise in
the discrete case, is that one may have $\sup_A |P^n(x, A) - \pi(A)| \to 0$ for almost all
initial states $x$, but not for all $x$. Because we do not know for exactly which initial
states $x$ the convergence $\sup_A |P^n(x, A) - \pi(A)| \to 0$ holds, we cannot be sure
that convergence occurs for the particular initial state with which we started.
The best theorems on convergence and the speed of convergence for the case of
general state spaces are generally awkward, either in the statement of the condi-
tions, or in their verification. We present results and techniques that are simpler to
state, understand, and verify, rather than focus on the weakest possible conditions.
The principal references for the results that we present here are Tierney (1994),
Mengersen and Tweedie (1996), and Roberts and Rosenthal (2004). Some of the
best theorems on convergence of general stationary Markov chains on general state
spaces under essentially the weakest conditions can be seen in Athreya, Doss and
Sethuraman (1996).
We start with the most basic question of whether Metropolis chains on general
state spaces converge at all to the desired target distribution $\pi$, and if so, under what
conditions. To answer this question, we need to define Harris recurrence, which we
do next.

Definition 19.12. A stationary Markov chain $\{X_n;\, n \ge 0\}$ on a general state space $S$
is called Harris recurrent with respect to $\pi$ if given any initial state $x \in S$, and any
subset $B \subseteq S$ such that $\pi(B) > 0$,

$$P(\exists\ n \text{ such that } X_n \in B \mid X_0 = x) = 1.$$

This is the same as saying that regardless of where the chain started, any set $B$ with
$\pi(B) > 0$ will be visited infinitely often by the chain with probability one. On
comparing this with the definition of irreducibility, we find that Harris recurrence is
stronger than irreducibility.
Harris recurrence is a difficult condition to verify in general. However, for
Metropolis chains, if we make the proposal density nice enough, and if the tar-
get density is also nice enough, then Harris recurrence obtains. So, under some
general conditions on the Metropolis chain, one does not have to verify Harris
recurrence on a case-by-case basis. Once we have Harris recurrence and aperiodicity,
we get the desired convergence property, although rates of convergence still need a
separate study. Here is a very concrete theorem from Tierney (1994).

Theorem 19.11. Suppose the state space of the chain is $S \subseteq \mathbb{R}^d$.

(a) The random walk Metropolis chain is aperiodic and Harris recurrent if either
of the following holds.
(A) $q$ is positive on all of $\mathbb{R}^d$.
(B) $q$ is positive in a neighborhood of zero, and $\Omega = \{x : \pi(x) > 0\}$ is open
and connected.
(b) The independent Metropolis chain is aperiodic and Harris recurrent if $\pi(x) >
0 \Rightarrow p(x) > 0$.
(c) Provided that the sufficient conditions stated above hold, the random walk
Metropolis chain and the independent Metropolis chain satisfy the convergence
property

For any $x \in S$, $\sup_A |P^n(x, A) - \pi(A)| \to 0$ as $n \to \infty$.

(d) In addition, under the sufficient conditions stated above, the random walk
Metropolis chain and the independent Metropolis chain are strongly er-
godic; that is, for any $\phi: S \to \mathbb{R}$ such that $\int_S |\phi(x)|\pi(x)\,dx < \infty$,
$\frac{1}{n}\sum_{k=1}^n \phi(X_k) \stackrel{a.s.}{\to} \int_S \phi(x)\pi(x)\,dx$, for any initial distribution $\mu$.

Example 19.31 (Simulating from a t-Distribution, Continued). Consider again the setup of Example 19.26, with $\pi(x)$ as the density of a t-distribution with $\alpha$ degrees of freedom and $\pi(x, y)$ as the density of $C(x, 1)$. Thus, in our notation, $q(x) = \frac{c}{1 + x^2} I_{-\infty < x < \infty}$. Because $q$ satisfies condition (A) in part (a) of Theorem 19.11, the desired convergence property $\sup_A |P^n(x, A) - \pi(A)| \to 0$ is assured.

Suppose we change the proposal density to $\pi(x, y) = \frac{1}{2} I_{x-1 \leq y \leq x+1}$. Then $q(x) = \frac{1}{2} I_{-1 \leq x \leq 1}$. So, $q$ does not satisfy condition (A) in part (a), but it still satisfies condition (B). Furthermore, $\pi$ being a t-density, $\Omega = \mathbb{R}$, which is open and connected. Hence, we are again assured of $\sup_A |P^n(x, A) - \pi(A)| \to 0$.

Suppose, to force quick jumps, we choose $\pi(x, y) = \frac{1}{2} I_{x-2 \leq y \leq x-1} + \frac{1}{2} I_{x+1 \leq y \leq x+2}$. Then $q(x) = \frac{1}{2} I_{-2 \leq x \leq -1} + \frac{1}{2} I_{1 \leq x \leq 2}$. Now, $q$ is not positive in any neighborhood of zero, and so it satisfies neither condition (A) nor condition (B) of part (a) in Theorem 19.11, and the theorem fails to guarantee $\sup_A |P^n(x, A) - \pi(A)| \to 0$.
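The chains of this example are easy to experiment with. Below is a minimal sketch in Python (our illustration, not the book's; it assumes only numpy, and the seed, run length, and the choice $\alpha = 3$ are arbitrary) of the random walk Metropolis chain with standard Cauchy increments, for which $q$ is positive on all of $\mathbb{R}$ and condition (A) holds.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_t_density(x, alpha):
    # log of the t density with alpha degrees of freedom, up to an additive constant
    return -(alpha + 1) / 2 * np.log1p(x * x / alpha)

def rw_metropolis(n, alpha=3.0, x0=0.0):
    x, chain = x0, np.empty(n)
    for k in range(n):
        y = x + rng.standard_cauchy()          # increment density q > 0 on all of R
        if np.log(rng.uniform()) < log_t_density(y, alpha) - log_t_density(x, alpha):
            x = y                              # accept with prob min(1, pi(y)/pi(x))
        chain[k] = x
    return chain

chain = rw_metropolis(50000)
print(chain.mean(), chain.var())               # compare with 0 and alpha/(alpha-2) = 3
```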
The above theorem suggests that if all we care about is simple convergence, then Metropolis chains will usually deliver it. However, for practical use, what we really need are concrete bounds on $\sup_A |P^n(x, A) - \pi(A)|$ for given $n$. For the discrete state space case, we obtained such bounds by spectral methods, drift and minorization, and by using Dobrushin's coefficient. In the general state space case, we provide a result based on drift and minorization, and another based on the tails of the target and the proposal density. We need another definition before the next theorem.
Definition 19.13. A density $\pi(x)$ on the real line is called logconcave in the tails with the associated exponential constant $\alpha > 0$ if for some $x_0$,

$$\frac{\log \pi(x) - \log \pi(y)}{|x - y|} \geq \alpha$$

for $y > x \geq x_0$ and $y < x \leq -x_0$.
Here is a rather explicit theorem on convergence and bounds on approximation error due to Mengersen and Tweedie (1996).

Theorem 19.12. (a) The independent Metropolis sampler is geometrically and uniformly ergodic if the proposal density $p(y)$ satisfies the condition that for some finite constant $c$, $\frac{\pi(y)}{p(y)} \leq c$ for all $y$. Furthermore, in this case, for any $x \in S$, and any $n \geq 1$,

$$\rho(P^n(x, \cdot), \pi) = \sup_A |P^n(x, A) - \pi(A)| \leq \left(1 - \frac{1}{c}\right)^n.$$

(b) Suppose the state space $S \subseteq \mathbb{R}$. Assume the following conditions on $\pi$ and the proposal densities $\pi(x, y)$:
    (i) $\pi(x)$ is logconcave in the tails with the associated exponential constant $\alpha$.
    (ii) $\pi(x, y) = q(|y - x|)$ for some fixed density function $q$, which satisfies the fat tail condition $q(x) \leq K e^{-\alpha |x|}$ with $\alpha$ as in (i) above.

Then, with $V(x)$ defined as $V(x) = e^{a|x|}$ with $a < \alpha$, the random walk Metropolis chain with the proposal density $\pi(x, y)$ is geometrically, but not uniformly, ergodic in the following sense:

$$\sup_{f: |f(y)| \leq V(y)} \left| E(f(X_n) \mid X_0 = x) - \int_S f(y)\pi(y)\,dy \right| \leq M V(x) r^n$$

for some finite constant $M$, and $0 < r < 1$. This holds for any initial state $x$, and any $n \geq 1$.
Example 19.32 (A Geometrically Ergodic Metropolis Chain). This is an example where we can verify the conditions in part (b) of the above theorem easily. Suppose the target density is $\pi(x) = c e^{-|x|}, -\infty < x < \infty$. We use a Metropolis sampling scheme with the proposal density $\pi(x, y) = q(|y - x|)$, where $q$ is the uniform density $q(y) = \frac{1}{2b} I_{|y| \leq b}, 0 < b < \infty$. Then, $\pi(x)$ is certainly logconcave in the tails. Indeed, taking $x_0$ to be any positive number, we have for $y > x > x_0$, or for $y < x < -x_0$,

$$\frac{\log \pi(x) - \log \pi(y)}{|x - y|} = 1,$$

and hence we can choose $\alpha = 1$ in the tail logconcavity property. Also, clearly, $q$ satisfies the fat tail condition $q(x) \leq K e^{-|x|}$ for all $x$ (note that $q = 0$ outside $[-b, b]$). Therefore, by part (b) of the above theorem, the Metropolis sampler of this example is geometrically ergodic. It is, however, not uniformly ergodic.
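For concreteness, here is a minimal Python sketch (again our illustration; numpy only, with $b = 1$ and the run length chosen arbitrarily) of this geometrically ergodic chain; the Metropolis ratio simplifies to $e^{|x| - |y|}$.

```python
import numpy as np

rng = np.random.default_rng(1)

def rw_metropolis_laplace(n, b=1.0, x0=0.0):
    x, out = x0, np.empty(n)
    for k in range(n):
        y = x + rng.uniform(-b, b)             # proposal q(y) = (1/(2b)) I(|y| <= b)
        # for pi(x) = c e^{-|x|}, min(1, pi(y)/pi(x)) = min(1, e^{|x| - |y|})
        if np.log(rng.uniform()) < abs(x) - abs(y):
            x = y
        out[k] = x
    return out

sample = rw_metropolis_laplace(50000)
print(sample.mean(), np.mean(np.abs(sample)))  # compare with 0 and E|X| = 1
```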

Example 19.33 (Proposal Densities Should Be Heavier Tailed Than Target). Suppose the target density is a standard exponential, and suppose we consider an independent Metropolis sampler with a proposal density that is exponential with mean $\lambda$. Thus, for $y > 0$,

$$\frac{\pi(y)}{p(y)} = \frac{e^{-y}}{\frac{1}{\lambda} e^{-y/\lambda}} = \lambda e^{-y\left(1 - \frac{1}{\lambda}\right)}.$$

This is uniformly bounded by a finite constant $c$ if $\lambda \geq 1$, and therefore, from part (a) of Theorem 19.12, we can conclude that the independent Metropolis chain is geometrically ergodic.

For $\lambda < 1$, $\frac{\pi(y)}{p(y)}$ is unbounded. How does the independent Metropolis chain behave in that case? Indeed, the chain behaves poorly when $\lambda < 1$. For $\lambda < 1$, the acceptance probabilities are

$$\alpha(x, y) = \min\left\{1, \frac{\pi(y)/p(y)}{\pi(x)/p(x)}\right\} = \min\left\{1, e^{-\left(1 - \frac{1}{\lambda}\right)(y - x)}\right\}.$$

Suppose now that the chain is started at $X_0 = x_0 = 1$. Then the candidate value $y = y_0$, it being from an exponential distribution with a small mean, will likely be smaller than $x_0$, and therefore, from our expression above, the probability $\alpha(x_0, y_0)$ that the candidate value will be accepted is small. This cycle will persist, and the chain will mix very slowly. When a candidate value does get accepted, it will tend to be still lower than the initial value $x_0$. Even after many steps, averages such as $\frac{1}{n}\sum_{k=1}^n \varphi(X_k)$ will produce poor estimates of $E_\pi(\varphi(X))$. This example illustrates that it is important that the proposal density not be thinner tailed than the target density when using an independent Metropolis sampler.
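The poor behavior for $\lambda < 1$ shows up immediately in the observed acceptance rates. The following minimal Python sketch (our illustration; numpy only, with arbitrary seed and run length) prints the empirical acceptance rate of the independent Metropolis chain for a few values of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(2)

def acceptance_rate(n, lam, x0=1.0):
    x, accepted = x0, 0
    for _ in range(n):
        y = rng.exponential(lam)               # independent proposal with mean lam
        # alpha(x, y) = min{1, exp(-(1 - 1/lam)(y - x))}
        if np.log(rng.uniform()) < -(1.0 - 1.0 / lam) * (y - x):
            x, accepted = y, accepted + 1
    return accepted / n

for lam in (2.0, 1.0, 0.5, 0.1):               # rates collapse as lam drops below 1
    print(lam, acceptance_rate(20000, lam))
```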

19.6.3 Convergence of the Gibbs Sampler

It is important to understand that the Gibbs sampler is not guaranteed to always converge; it can fail. Indeed, it can fail in extremely simple-looking problems. When it fails, it is usually due to a lack of communicability. Here is a well-known example.
Example 19.34 (Failure of the Gibbs Sampler to Converge). Suppose $U, V$ are iid $N(0, 1)$ (normality is not important for this example), and suppose that we wish to simulate from the joint distribution of $(X, Y)$ where $Y = U$, $X = |V|\,\mathrm{sgn}(U) = |V|\,\mathrm{sgn}(Y)$. Therefore, the full conditionals, that of $X \mid Y$ and $Y \mid X$, are such that $P(Y > 0 \mid X > 0) = P(X > 0 \mid Y > 0) = 1$. Suppose the starting values in a Gibbs chain are $x_0 = 1, y_0 = 1$. Then, with probability one, the Gibbs update produces $x_1, y_1$ such that first $x_1 > 0$ (because $x_1$ is simulated from $f(x \mid Y = y_0)$ and so $x_1$ has the same sign as $y_0$), and then $y_1 > 0$. Then, with probability one, the Gibbs update produces $x_2, y_2$ such that $x_2, y_2 > 0$, and so on. Thus, if we let $A$ be the event that $X > 0, Y > 0$, the estimate of $P(A)$ from a Gibbs sampler, regardless of the length of the chain, is 1. But the true value of $P(A)$ is $P(A) = P(X > 0, Y > 0) = \frac{1}{2}$.

The reason that the Gibbs sampler fails to converge to the correct joint distribution of $(X, Y)$ in the above example is that the support of the joint distribution is confined to two disjoint subsets, the first quadrant and the third quadrant. Whenever the target distribution has such a topologically disconnected support, the full conditionals of course preserve it, and so the Markov chain corresponding to Gibbs sampling fails to be irreducible. Without the irreducibility property, the chain fails to mix, and gets lost.

Convergence issues for the Gibbs sampler are usually treated separately for the cases of discrete and continuous state spaces. The Gibbs sampler is a special Metropolis scheme, in which the acceptance probabilities $\alpha(x, y)$ are equal to one. So, any general theorem on the convergence of a Metropolis scheme, and any general bound on the error of the approximation, works in principle for the Gibbs sampler. For example, in the finite-state space case, if a particular Gibbs chain is reversible (recall that Gibbs chains are not necessarily reversible and the exact updating scheme matters; see Section 19.4), then all the results in Theorem 19.8 apply. However, their practical use depends on whether the constant $c$, that is, the SLEM, can be evaluated or at least bounded for that specific Gibbs chain. Unfortunately, this has to be done on a case-by-case basis. Diaconis et al. (2008) have succeeded in some exponential family problems, where $\pi$ is the posterior distribution of a scalar parameter, and the prior has a conjugate character (see Chapter 3 for an explanation of conjugacy).
The case of a continuous state space is more complicated. There are some general theorems with conditions on $\pi$ that at least guarantee convergence. However, if we seek geometric ergodicity over and above just convergence, then typically one has to fall back on drift and minorization methods, and for Gibbs chains this approach is essentially problem specific, and not easy. Construction of the energy function $V(x)$ corresponding to the drift condition (see Section 19.5.3, and also Theorem 19.12 part (b)) is often too difficult for Gibbs chains. Rosenthal (1995, 1996), Chan (1993), and Jones and Hobert (2001) are some specific success stories. Uniform ergodicity usually does not hold for Gibbs chains unless the state space is bounded.
No very general satisfactory error bounds are available in the Gibbs context. The
following concrete theorem is based on Athreya et al. (1996) and Tierney (1994).
Theorem 19.13. (a) Suppose the target distribution has the joint density $\pi(x_1, x_2, \ldots, x_m)$ and that it satisfies the condition

$$(A) \quad \pi(x_1, x_2, \ldots, x_i \mid x_{i+1}, \ldots, x_m) > 0 \quad \forall\, i < m \text{ and } \forall\, (x_1, x_2, \ldots, x_m).$$

Then the systematic scan Gibbs sampler that updates the coordinates in the order $1 \to 2 \to \cdots \to m$ has the convergence property

$$\sup_A |P^n(x, A) - \pi(A)| \to 0 \text{ as } n \to \infty,$$

for almost all starting values $x$ (with respect to the target distribution $\pi$).

(b) Suppose that the target distribution has the joint density $\pi(x_1, x_2, \ldots, x_m)$ and that $\Omega = \{x : \pi(x) > 0\}$ is an open connected set in a Euclidean space. Then the random scan Gibbs sampler is reversible, Harris recurrent, has $\pi$ as its unique stationary distribution, and has the convergence property

$$\sup_A |P^n(x, A) - \pi(A)| \to 0 \text{ as } n \to \infty$$

for all initial states $x$.

Example 19.35 (The Drift Method in a Bayes Example). Here is a simple enough ex-
iid
ample where the drift method can be made to work with ease. Suppose Xi j .; /
N.; /; i D 1; 2; : : : ; m, and .; / are given the joint improper formal prior den-
sity p1 ; 1 <  < 1; > 0. Then the formal joint posterior density is


Pm
 mC1 1
 2 i D1 .xi  /
2
.; j x1 ; : : : ; xm / D c 2 e :

Even though the formal prior is not a probability density, the formal posterior is
integrable for m  3. We assume that m  5 in order that the drift method fully
works out. The full conditionals are derived easily from the joint posterior, and
they are:
 
1 m1 2
j .; x1 ; : : : ; xm / G ; Pm I
2 .xi  /2
  i D1
 j . ; x1 ; : : : ; xm / N x; :
m

Consider now the systematic scan Gibbs sampler that updates in the order . ; / !
. 0 ; / ! . 0 ;  0 /. The transition density k ..; /; . 0 ; 0 // of this Gibbs chain is
 
k .; /; . 0 ; 0
/ D . 0
j ; x1 ; : : : ; xm /. 0 j 0
; x1 ; : : : ; xm /;

and these two full conditionals have been described above.


19.7 Practical Convergence Diagnostics 673

Now notice, from this, that given ; . 0 ; 0 / is conditionally independent of , and


given 0 ;  0 is conditionally independent of . Therefore, by iterated expectation, for
any function V .; /, as long as the expectations below exist,

EŒV . 0 ; 0
/ j .; / D EŒV . 0 ; 0 / j 
 
D E0 j  EŒV . 0 ; 0
/ j .; 0
/ :

Now choose the energy function to be V .; / D .  x/2 . Then, from the above,
0
EŒV . 0 ; 0
/ j .; / D E0 j  . /
Pm m
.xi  /2
D i D1 :
m.m  3/
Pm Pm Pm
Now decompose i D1 .xi  /2 as i D1 .xi  /2 D m.  x/2 C i D1 .xi  x/2 .
Therefore,
P Pm
mV .; / C mi D1 .xi  x/
2
V .; / .xi  x/2
EŒV . ;0 0
/ j .; / D D C i D1
m.m  3/ m3 m.m  3/
.1  /V .; / C b;
Pm
.x x/2
where 1  m31
D m4
m3 , and b 
i D1 i
m.m3/ . This establishes the drift
condition in this example (see Definition 19.5).
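The two full conditionals also make this Gibbs sampler trivial to implement. Here is a minimal Python sketch (our illustration, run on hypothetical simulated data; numpy only) of the systematic scan $(\omega, \mu) \to (\omega', \mu) \to (\omega', \mu')$ in the notation of the example.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.5, size=20)     # hypothetical data; m = 20 satisfies m >= 5
m, xbar = len(x), x.mean()

mu, omega, draws = xbar, x.var(), []
for _ in range(5000):
    # omega' | mu: 1/omega' ~ Gamma(shape = (m-1)/2, scale = 2 / sum (x_i - mu)^2)
    omega = 1.0 / rng.gamma((m - 1) / 2, 2.0 / np.sum((x - mu) ** 2))
    # mu' | omega': N(xbar, omega'/m)
    mu = rng.normal(xbar, np.sqrt(omega / m))
    draws.append((mu, omega))
print(np.mean(draws, axis=0))         # Monte Carlo posterior means of (mu, omega)
```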

19.7 Practical Convergence Diagnostics

The spectral methods as well as the drift methods for studying convergence rates of MCMC algorithms may sometimes give very conservative answers to the all-important practical question: how long should the chain be run to make $\rho(P^n, \pi) \leq \epsilon$? The spectral bounds can be difficult to apply if the eigenvalues of $P$ are intractable, and the drift methods are usually difficult to apply in any case. In addition, the spectral and the drift methods require each new problem to be treated separately. Therefore, effective and simple omnibus convergence diagnostics are useful. They can be graphical or numerical. Fueled by this practical need, a bewildering array of MCMC convergence diagnostics have been proposed over the last 25 years. A major review of methods up to that time is Cowles and Carlin (1996). A more recent literature summary is Mengersen et al. (2004).
These convergence diagnostics have often been received with skepticism by actual users, because none of them has been found to be generally superior to the others, and more importantly, because in a given application they often give contradictory answers. One diagnostic may lead us to conclude that approximate convergence has already been achieved, while another may suggest that we are still very far from convergence. However, used in conjunction with spectral or drift methods, and collectively, they have some practical utility. We provide a brief overview of a few of the diagnostics that have become popular in the area.
Gelman-Rubin Multiple Chain Method. This is based on Gelman and Rubin (1992). In this method, $m$ parallel chains, say $X_{kj}, k = 1, 2, \ldots, 2n; j = 1, 2, \ldots, m$, are generated by using an MCMC algorithm. The method applies to any MCMC algorithm for which convergence to a stationary distribution is theoretically known.

Suppose the target distribution $\pi$ is $d$-dimensional for some $d \geq 1$, and let $g: \mathbb{R}^d \to \mathbb{R}$ be a function of interest. Let

$$g_{kj} = g(X_{kj}), \quad \bar{g}_j = \frac{1}{n} \sum_{k=n+1}^{2n} g_{kj}, \quad \bar{g} = \frac{1}{mn} \sum_{j=1}^m \sum_{k=n+1}^{2n} g_{kj}, \quad s_j^2 = \frac{1}{n} \sum_{k=n+1}^{2n} (g_{kj} - \bar{g}_j)^2.$$

Thus, $g_{kj}$ is the value of $g$ based on the $k$th sweep of the $j$th run, $\bar{g}_j$ is their mean, $s_j^2$ is their variance, and $\bar{g}$ is the overall mean. Now define the between and within variances by

$$B = \frac{1}{m} \sum_{j=1}^m (\bar{g}_j - \bar{g})^2, \qquad W = \frac{1}{m} \sum_{j=1}^m s_j^2.$$

The method also computes an estimate $\hat{\pi}_n$ of the target distribution $\pi$ by pooling the last $n$ values from the $m$ parallel chains. Convergence is monitored by calculating

$$GR = GR_{g,n} = \frac{n-1}{n} + c_n \frac{m+n}{mn} \frac{B}{W},$$

where $c_n$ is a certain concrete parameter of the estimating distribution $\hat{\pi}_n$. $GR$ has the interpretation of the additional possible gain in efficiency for estimating $E_\pi(g(X))$ if the run size is increased indefinitely. Thus, convergence is concluded when $GR \approx 1$. If there are several different functions $g$ of interest, then each corresponding $GR$ has to be monitored, and sampling continues until each is close to 1.
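A minimal Python sketch of the $GR$ computation is given below (our simplification; numpy only, and the distribution-specific constant $c_n$ is simply set to 1 here, which is a rough choice of ours rather than anything prescribed by the method).

```python
import numpy as np

def gelman_rubin(chains, c_n=1.0):
    # chains: array of shape (m, 2n); the first n sweeps of each run are discarded
    m, two_n = chains.shape
    n = two_n // 2
    g = chains[:, n:]
    gbar_j = g.mean(axis=1)                     # per-chain means
    s2_j = g.var(axis=1)                        # per-chain variances (1/n convention)
    B = np.mean((gbar_j - gbar_j.mean()) ** 2)  # between-chain variance
    W = s2_j.mean()                             # within-chain variance
    return (n - 1) / n + c_n * (m + n) / (m * n) * B / W

rng = np.random.default_rng(5)
print(gelman_rubin(rng.standard_normal((4, 2000))))   # iid "chains": value near 1
```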
Yu–Mykland Single Chain Graphic. This is based on Yu and Mykland (1994). In this method, a single chain $X_j, j = 0, 1, \ldots, n$, is generated using any convergent MCMC chain. Once again, let $g: \mathbb{R}^d \to \mathbb{R}$ be a function of interest, and let $\bar{g} = \frac{1}{n-B} \sum_{j=B+1}^n g(X_j)$ be the mean of the values of $g$ after discarding an initial burn-in of $B$ values; $B$ has to be chosen by the user. Define the cumulative sum values $S_t = \sum_{j=B+1}^t [g(X_j) - \bar{g}], t = B, B+1, \ldots, n$. The method plots the pairs $(t, S_t), t = B, B+1, \ldots, n$. The plot is called a CUSUM plot. This plot starts and ends at $S_t = 0$.

A necessary condition for accurate estimation of means and other functionals of the target distribution is that the MCMC chain has had enough time to traverse the entire state space. This is called good mixing. Chains that mix slowly are likely to produce smooth CUSUM plots, with large segments of monotone curves. Chains that mix rapidly are likely to produce oscillatory plots. The CUSUM plot is visually examined for its smoothness, and if it appears nonsmooth, then rapid mixing is concluded. Rapid mixing is indicative of rapid convergence. So the plot assesses convergence indirectly. The CUSUM plot could be nonsmooth for some $g$, and smooth for another $g$, however. Apart from the need to select the burn-in period $B$, the method has a simplistic appeal.
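The following minimal Python sketch (our illustration; numpy only, with an autoregressive series standing in for a slowly mixing chain) computes the CUSUM values and a crude smoothness summary of our own, the number of slope sign changes; a small count indicates a smooth plot and hence slow mixing.

```python
import numpy as np

def cusum(chain, B, g=lambda x: x):
    vals = g(np.asarray(chain, dtype=float)[B:])
    # S_t for t = B, B+1, ..., n; starts and ends at 0
    return np.concatenate(([0.0], np.cumsum(vals - vals.mean())))

rng = np.random.default_rng(6)
fast = rng.standard_normal(2000)          # nearly independent draws: rapid mixing
slow = np.empty(2000)                     # AR(1) with strong memory: slow mixing
slow[0] = 0.0
for t in range(1, 2000):
    slow[t] = 0.995 * slow[t - 1] + rng.standard_normal()

for name, ch in (("fast", fast), ("slow", slow)):
    s = cusum(ch, B=200)
    turns = (np.diff(np.sign(np.diff(s))) != 0).sum()   # slope sign changes
    print(name, turns)                    # many turns suggest rapid mixing
```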
Garren–Smith Multiple Chain Gibbs Method. This applies only to reversible Gibbs chains $X_n, n \geq 0$, because of the theoretical need that the eigenvalues of the chain's transition matrix be such that the second largest eigenvalue is unique. Here is the precise meaning of that. Suppose $P$ is reversible, and let its eigenvalues (which are real) be arranged in the order of decreasing absolute values, $1 > |\lambda_2| \geq |\lambda_3| \geq \cdots$. Let $A$ be any subset of the state space $S$. Provided that $|\lambda_2| > |\lambda_3|$ (strictly greater), one gets

$$P(X_k \in A \mid X_0 = i) = \pi(A) + c \lambda_2^k + o(|\lambda_2|^k).$$

Suppose that $m$ parallel chains $X_{kj}, k = 1, 2, \ldots, n; j = 1, 2, \ldots, m$, all with the same starting value, say $i$, are generated, and select a value for the burn-in period $B$. Denote $\pi(A) = p$, and form estimates of $p$ by pooling simulated values from the multiple chains:

$$\hat{p}_k = \frac{1}{m} \sum_{j=1}^m I_{X_{kj} \in A}, \quad k = B+1, \ldots, n.$$

Denote $\theta = (p, c, \lambda_2)$, and estimate $\theta$ as

$$\hat{\theta} = \hat{\theta}_B = \mathrm{argmin} \sum_{k=B+1}^n (\hat{p}_k - p - c\lambda_2^k)^2.$$

As a preliminary diagnostic, one plots each coordinate of $\hat{\theta}_B$ against $B$, as $B$ ranges over $1, 2, \ldots$ (or some smaller subset of it). Initially, the plot would look unstable. The burn-in period is chosen as that $B$ where the plot clearly begins to stabilize, if such a value of $B$ can be identified. This may be difficult and ambiguous. As regards convergence, if the plot stabilizes at some $B \ll n$, then convergence may be suspected. Garren and Smith also give more complex second stage diagnostics in their article (Garren and Smith (1993)).
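A minimal Python sketch of the first-stage fit is given below (our simplification; numpy only, with a crude grid search over $\lambda_2$ and linear least squares for $(p, c)$, applied to synthetic $\hat{p}_k$ values rather than real Gibbs output).

```python
import numpy as np

def fit_theta(p_hat, B):
    # least squares fit of p_hat_k = p + c * lam2**k over k = B+1, ..., n
    k = np.arange(B + 1, len(p_hat) + 1)
    y = p_hat[B:]
    best = None
    for lam2 in np.linspace(-0.99, 0.99, 199):          # grid over lambda_2
        X = np.column_stack([np.ones_like(y), lam2 ** k])
        coef = np.linalg.lstsq(X, y, rcond=None)[0]     # solve for (p, c)
        sse = np.sum((y - X @ coef) ** 2)
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], lam2)
    return best[1:]                                     # estimates of (p, c, lambda_2)

rng = np.random.default_rng(7)
k = np.arange(1, 201)                                   # synthetic p_hat_k, true p = 0.3
p_hat = 0.3 + 0.4 * 0.9 ** k + rng.normal(0, 0.01, 200)
for B in (0, 20, 50):
    print(B, fit_theta(p_hat, B))
```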
Exercises
Exercise 19.1. Find a Monte Carlo estimate for the value of $\theta = \int_0^{10} e^{\sqrt{x}}\,dx$, and construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size $n$. Compare the Monte Carlo estimate to the true value of $\theta$.

Exercise 19.2. * Find a Monte Carlo estimate for the value of $\theta = \int_{-\infty}^{\infty} e^{-x^2} \sin^5(x)\,dx$, and construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size $n$. Compare the Monte Carlo estimate to the true value of $\theta$.

Exercise 19.3 * (Monte Carlo Estimate of a Three-Dimensional Integral). Find a Monte Carlo estimate for the value of

$$\theta = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{e^{-(1/2)(x^2 + y^2 + z^2)}}{1 + |x| + |y| + |z|}\,dx\,dy\,dz,$$

and construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size $n$.

Exercise 19.4. With reference to Example 19.1, give a rigorous proof that $E(\log |X|) = 0$ if $X \sim C(0, 1)$.

Exercise 19.5 * (Monte Carlo Estimate of Area of a Triangle). Find a Monte Carlo estimate for the area $\theta$ of a triangle with sides $a = 3, b = 2, c = 2$, and construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size $n$. Compare the Monte Carlo estimate to the true value of $\theta$.

Exercise 19.6 * (Monte Carlo Estimate of Volume of a Cylinder). Find a Monte Carlo estimate for the volume $\theta$ of a right circular cylinder with height $h = 2$ and radius $r = 1$. Construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size $n$. Compare the Monte Carlo estimate to the true value of $\theta$.

Exercise 19.7 * (Monte Carlo Estimate of Volume of a Pyramid). Find a Monte Carlo estimate for the volume $\theta$ of a square-based pyramid with base sides equal to $b = 1$ and height $h = 5$. Construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size $n$. Compare the Monte Carlo estimate to the true value of $\theta$.
Exercise 19.8 * (Monte Carlo Estimate of Surface Area). Devise a Monte Carlo scheme for finding the surface area of the $d$-dimensional unit ball. Use it to find estimates for the surface area when $d = 2, 3, 5, 10, 50$. Use a selection of values for the Monte Carlo sample size $n$. Compare the Monte Carlo estimate to the true value of the surface area of the unit ball (which is $\frac{d \pi^{d/2}}{\Gamma(\frac{d}{2} + 1)}$).

Exercise 19.9 (Using the Monte Carlo on a Divergent Integral). The integral $\theta = \int_0^\infty \frac{\cos x}{x}\,dx$ diverges. Investigate what happens if you try to find a Monte Carlo estimate for the integral anyway. Use a selection of values for the Monte Carlo sample size $n$.

Exercise 19.10 * (Using the Monte Carlo on an Infinite Expectation). For any $m \geq 2$, if $X_1, \ldots, X_m$ are iid standard Cauchy, then $E(X_{(m)} - X_{(m-1)}) = \infty$. Investigate what happens if you try to find a Monte Carlo estimate for this expectation anyway; use $m = 2, 5, 20$, and a selection of values for the Monte Carlo sample size $n$.
Exercise 19.11. Suppose $X_i \sim \mathrm{Exp}(i), i = 1, 2, 3, 4$, and suppose that they are independent. Find a Monte Carlo estimate for $P(X_1 < X_2 < X_3 < X_4)$. Use a selection of values for the Monte Carlo sample size $n$.

Exercise 19.12. Suppose $X_i \stackrel{iid}{\sim} N(0, 1), i = 1, 2, \ldots, m$. Find a Monte Carlo estimate for $E(X_{(m)})$, the expectation of the maximum of $X_1, \ldots, X_m$. Use $m = 2, 5, 10, 25, 50$, and a selection of values for the Monte Carlo sample size $n$. Compare the Monte Carlo estimate with the true value of $E(X_{(m)})$ (see Chapter 6).

Exercise 19.13 * (Monte Carlo in a Bayes Problem). Suppose $X \sim \mathrm{Bin}(100, p)$ and $p$ has the prior density $c\,p^p(1-p)^{1-p}, 0 < p < 1$, where $c$ is a normalizing constant. Find a Monte Carlo estimate for the posterior mean of $p$ if $X = 45$. Use a selection of values for the Monte Carlo sample size $n$.

Exercise 19.14 * (Monte Carlo in a Bayes Problem). Suppose $X \sim \mathrm{Poi}(\lambda)$ and $\lambda$ has the discrete prior $\pi(n) = \frac{c}{n^2}, n = 1, 2, \ldots$, where $c$ is a normalizing constant. Find a Monte Carlo estimate for the posterior mean of $\lambda$ if $X = 1$. Use a selection of values for the Monte Carlo sample size $n$.

Exercise 19.15. * Suppose $X \sim N_3(\mu, I)$, and that the mean vector $\mu$ has a prior density

$$\pi(\mu) = \frac{c}{1 + (|\mu_1| + |\mu_2| + |\mu_3|)^{3.5}}.$$

Find a Monte Carlo estimate for the posterior mean of $\mu_i, i = 1, 2, 3$, if $X = (1, 0, 1)$. Use a selection of values for the Monte Carlo sample size $n$.

Exercise 19.16 (Monte Carlo Estimate of e). Find a Monte Carlo estimate for $e$ by using the identity $P(X > 1) = \frac{1}{e}$ if $X \sim \mathrm{Exp}(1)$. Use a selection of values for the Monte Carlo sample size $n$, and plot the Monte Carlo estimates against $n$.

Exercise 19.17 (Monte Carlo P-value). With reference to Example 19.7, calculate the Monte Carlo P-value for the median-based test for the center of a Cauchy distribution when the original dataset is generated under the alternative $C(\mu, 1)$ with $\mu = 2.5$. Compare with the P-value approximated by using the normal approximation, as in Example 19.7.
Exercise 19.18 * (Monte Carlo P-value). Calculate the Monte Carlo P-value for the Kolmogorov–Smirnov statistic for testing $H_0: F = N(0, 1)$, when the original dataset of size $m = 50$ is generated under the null; under $F = U[-2, 2]$. Compare with the P-value approximated by using the Brownian bridge approximation, as in Chapter 16, Section 16.2.1.

Exercise 19.19 * (Outlier Detection and Monte Carlo P-value). Suppose $X_1, \ldots, X_m$ are iid observations from a continuous CDF $F_0$ on the real line. In outlier detection studies, one often declares the largest observation, i.e., the $m$th order statistic, an outlier if

$$\frac{X_{(m)} - X_{(m-1)}}{X_{(m)} - M}$$

is large, where $M$ is the median of $X_1, \ldots, X_m$, $X_{(m)}$ is the $m$th order statistic, and $X_{(m-1)}$ the $(m-1)$th order statistic.

Calculate the Monte Carlo P-value for this test when the original dataset is generated from $F_0 = C(0, 1)$; from $F_0 = N(0, 1)$. Use $m = 50$, and a selection of Monte Carlo sample sizes $n$.

Exercise 19.20 (Testing for Randomness). Suppose a computer has produced $n$ numbers $U_1, \ldots, U_n$ that are supposed to be iid $U[0, 1]$. Devise a formal test of it by using the Kolmogorov–Smirnov statistic (see Chapter 16). Answer the following question. Does this method test whether there are possible dependencies between the obtained values $U_1, \ldots, U_n$?

Exercise 19.21 * (Testing for Randomness). Suppose a computer has produced $n$ numbers $U_1, \ldots, U_n$ that are supposed to be iid $U[0, 1]$. Let $W_0 = U_{(1)}, W_i = U_{(i+1)} - U_{(i)}, 1 \leq i \leq n-1$, where $U_{(1)}, \ldots, U_{(n)}$ are the order statistics of $U_1, \ldots, U_n$. Devise formal tests of whether $U_1, \ldots, U_n$ are indeed independent uniforms by using the following facts from Chapter 6 (Theorem 6.6):
(i) For any $r$, $U_{(r)} \sim \mathrm{Be}(r, n - r + 1)$.
(ii) Jointly, $(W_0, W_1, \ldots, W_{n-1})$ is uniformly distributed in the $n$-dimensional simplex.

Exercise 19.22 (Accept–Reject for a General Beta). Devise an accept–reject algorithm to simulate from a $\mathrm{Be}(\alpha, \beta)$ distribution, by using an envelope density $g(x) = c\,x^{\alpha-1}(1 - x^{\alpha})^{\frac{\beta}{\alpha} - 1}$. Find the acceptance probability of your accept–reject algorithm.

Exercise 19.23 (Accept–Reject for a Truncated Exponential). Devise an accept–reject algorithm to simulate from the density $f(x) = e^{-(x - \mu)}, x \geq \mu$, by using an $\mathrm{Exp}(\lambda)$ envelope density. Identify the value of $\lambda$ for which the accept–reject scheme is the most efficient.

Exercise 19.24. Devise an accept–reject scheme to simulate from the density $f(x) = \frac{16}{3}x, 0 \leq x \leq \frac{1}{4}$; $f(x) = \frac{4}{3}, \frac{1}{4} \leq x \leq \frac{3}{4}$; $f(x) = \frac{16}{3}(1 - x), \frac{3}{4} \leq x \leq 1$. Find the efficiency value of your accept–reject scheme.
Exercise 19.25 * (Accept–Reject for Multidimensional Truncated Normal). Devise an accept–reject scheme to simulate from the density $f(x) = c\,e^{-\sum_{j=1}^d x_j^2}, \sum_{j=1}^d x_j^2 \leq 1$, where $c$ is a normalizing constant, and find the efficiency value of your accept–reject scheme.

Exercise 19.26 (Accept–Reject for Truncated Gamma). Devise an accept–reject scheme with exponential envelope densities to simulate from the density $f(x) = c\,x^n e^{-x}, 0 \leq x \leq 1$, where $n \geq 1$, and $c = c_n$ is a normalizing constant. Find the best envelope density from the exponential class.

Exercise 19.27 * (Asymptotic Zero Efficiency of Accept–Reject). Consider the scheme of generating a uniformly distributed random vector $X$ in the $d$-dimensional unit ball by generating iid variables $U_i \sim U[-1, 1], i = 1, 2, \ldots, d$, and by setting $X = (U_1, \ldots, U_d)$ if $\sum_{i=1}^d U_i^2 \leq 1$. Show that the acceptance rate of this scheme converges to zero as $d \to \infty$. Explicitly evaluate the acceptance rate for $2 \leq d \leq 6$.

Exercise 19.28 * (Generating Points in the Unit Circle). Generate $n$ points uniformly in the unit circle and visually identify the radius of the largest empty circle. Use $n = 50, 100, 500, 1000$.

Exercise 19.29 * (Generating Cauchy from Two Uniforms).
(a) Suppose $U, V$ are iid $U[-1, 1]$. Show that the conditional distribution of $\frac{U}{V}$ given that $U^2 + V^2 \leq 1$ is standard Cauchy.
(b) Hence, devise an algorithm for generating a standard Cauchy random variable.

Exercise 19.30 * (General Ratio of Uniforms Method). Suppose $f(x)$ is a density function on the real line, and suppose $(U, V)$ is uniformly distributed in the set in the plane defined as $C = \{(u, v) : 0 \leq u \leq \sqrt{f(\frac{v}{u})}\}$. Show that the marginal density of $\frac{V}{U}$ is $f$.
Use this result to generate a standard Cauchy random variable.

Exercise 19.31 (Quantile Transform Method). Show how to simulate from the density

$$f(x) = c\,\frac{x^{\alpha - 1}}{\sqrt{1 - x^{2\alpha}}}, \quad 0 < x < 1,$$

by using the quantile transform method; here, $c$ is a normalizing constant.

Exercise 19.32 (Quantile Transform Method). Show how to simulate a standard double exponential random variable by using the quantile transform method.

Exercise 19.33 * (Quantile Transform Method to Simulate from t). Suppose $X$ has a t-distribution with two degrees of freedom (you can write the CDF in a simple closed-form). Show how to simulate $X$ by using the quantile transform method.

Exercise 19.34 (Quantile Transform Method to Simulate from Pareto). Suppose $X \sim Pa(\alpha, \theta)$. Show how to simulate $X$ by using the quantile transform method.
Exercise 19.35 (Quantile Transform Method to Simulate from the Gumbel Distribution). Show how to simulate a standard Gumbel random variable by using the quantile transform method.

Exercise 19.36 (Importance Sampling in a Bayes Problem). Suppose $X \sim \mathrm{Poi}(\lambda)$ and that $\lambda$ has the prior density $c\,\frac{1}{1 + \lambda^2}, \lambda > 0$, where $c$ is a normalizing constant. By following the methods of Example 19.12, approximate the posterior mean of $\lambda$ when $X = 5$. Use a selection of values of the importance sample size $n$.

Exercise 19.37 * (Asymptotic Bias of the Importance Sampling Estimate). Derive an expression of the form $E(\hat{\theta}) = \theta + \frac{\alpha(h_0, h_1)}{n} + o(n^{-1})$ for the importance sampling estimate $\hat{\theta}$, in the notation of Section 19.2.2.

Exercise 19.38 (Importance Sampling Estimate of Normal Tail Probability). Suppose $Z \sim N(0, 1)$. Find an importance sampling estimate of $P(Z > z)$ by using a truncated exponential importance sampling density $g(x) = e^{z - x}, x > z$. Use it for $z = 1, 2, 3, 4$. Compare with a standard normal table.

Exercise 19.39 * (Simulating from the Tail of a Normal Density). Suppose $Z \sim N(0, 1)$. Use the family of truncated exponential densities $\frac{1}{\sigma} e^{-\frac{x - z}{\sigma}}, x > z$, to devise an importance sampling scheme for simulating from the distribution of $Z$ given $Z > z$.
Use your scheme to generate ten observations from the distribution of $Z$ given $Z > 3.5$.

Exercise 19.40 (Optimal Importance Sampling Density). Suppose $X \sim \mathrm{Exp}(1)$ and that we want to find Monte Carlo estimates for $\theta = E[\psi(X)]$.
(a) Find the optimal importance sampling density from the family of all possible truncated exponential densities $g(x \mid \mu, \sigma) = \frac{1}{\sigma} e^{-\frac{x - \mu}{\sigma}}, x \geq \mu$, if $\psi(x) = I_{x > a}$.
(b) Find the optimal importance sampling density from the same family if $\psi(x) = x^k$.

Exercise 19.41 * (Importance Sampling and Delta Theorem). Suppose $X \sim N(0, 1)$ and you want to approximate $\theta = E(|X|)$ by using importance sampling with a standard double exponential importance sampling density. Find an expression for the asymptotic variance of the importance sampling estimate of $\theta$, by using Theorem 19.2.

Exercise 19.42 * (Error of the Importance Sampling Estimate). Suppose $X \sim \mathrm{Be}(5, 5)$ and that you want to estimate $\theta = E(X^6)$ by using importance sampling with a $U[0, 1]$ importance sampling density. Find an approximation to the probability $P(|\hat{\theta} - \theta| > .02)$ by using the asymptotic normality result of Theorem 19.2. Evaluate this approximation when the importance sample size is $n = 100, 250$.
Exercise 19.43 * (Hit-or-Miss Method). Let $0 \leq \psi(x) \leq M < \infty, 0 \leq x \leq c$. Consider iid bivariate simulations $(X_i, Y_i), i = 1, 2, \ldots, n$, from the bivariate uniform distribution on $[0, c] \otimes [0, M]$. Consider the following two estimates for the definite integral $\theta = \int_0^c \psi(x)\,dx$:

$$\hat{\theta}_1 = \frac{c}{n} \sum_{i=1}^n \psi(X_i); \qquad \hat{\theta}_2 = \frac{Mc}{n} \sum_{i=1}^n I_{Y_i \leq \psi(X_i)}.$$

(a) Show that each estimate is unbiased; that is, $E[\hat{\theta}_i] = \theta, i = 1, 2$.
(b) One of the two estimates always has a smaller variance than the other one. Identify it with a proof.

Exercise 19.44 (Rao–Blackwellization). A chicken lays a Poisson number $N$ of eggs per week. The eggs hatch, mutually independently, with probability $p$. Suppose $X$ is the number of eggs actually hatched in a week. We wish to estimate $E(X)$.
(a) Describe an ordinary Monte Carlo estimate for $E(X)$.
(b) Describe a Rao–Blackwellized estimate, by inventing an appropriate variable on which you condition.
(c) How much is the percentage reduction in variance achieved by your Rao–Blackwellized estimate?

Exercise 19.45 (Rao–Blackwellization). Suppose $X, Y$ are iid continuous random variables with CDF $F$. We wish to estimate $P(X < 3Y - 3Y^2 + Y^3)$.
(a) Describe an ordinary Monte Carlo estimate.
(b) Describe a Rao–Blackwellized estimate, by inventing an appropriate variable on which you condition.

Exercise 19.46 * (Quantile Transform for Discrete Distributions). Suppose $X$ takes values $x_1 < x_2 < \cdots$ with probabilities $p_i = P(X = x_i), i = 1, 2, \ldots$. Let $U \sim U[0, 1]$ and set $Y = x_i$ if $p_1 + \cdots + p_{i-1} < U \leq p_1 + \cdots + p_i$. Show that $Y$ has the distribution of $X$.

Exercise 19.47 (Quantile Transform for Poisson). Use the general quantile transform method of the above exercise to simulate $n$ independent values from $\mathrm{Poi}(\lambda)$; use a selection of values of $\lambda$ and $n$.

Exercise 19.48 (Quantile Transform for Geometric). By using the fact that $P(X > i)$ has a closed-form formula for any geometric random variable, simulate from a $\mathrm{Geo}(p)$ distribution by using the general quantile transform method of Exercise 19.46.

Exercise 19.49 * (Generating Poisson from Exponential). Suppose $X_1, X_2, \ldots$ are iid exponential random variables with mean $\frac{1}{\lambda}$. Define $Y = i$ if $X_1 + \cdots + X_i \leq 1 < X_1 + \cdots + X_{i+1}$. Show that $Y \sim \mathrm{Poi}(\lambda)$.
Exercise 19.50 (Generating Exponential-Order Statistics). By using Rényi's representation (see Chapter 6), devise an algorithm for simulating the set of order statistics of $n$ iid standard exponentials.

Exercise 19.51 * (Generating Random Permutations). Suppose $U_1, \ldots, U_n$ are iid $U[0, 1]$, and let $U_{(1)} < U_{(2)} < \cdots < U_{(n)}$ be their order statistics. Set $\pi(i) = j$ if $U_i = U_{(j)}$. Show that $(\pi(1), \pi(2), \ldots, \pi(n))$ is uniformly distributed on the set of $n!$ permutations of $(1, 2, \ldots, n)$.

Exercise 19.52 (Generating Points from a Homogeneous Poisson Process). Use Theorem 13.3 to generate points from a homogeneous Poisson process with constant rate 1 on the interval $[0, 10]$, and then plot the path of the process.

Exercise 19.53 * (Generating Points from a Nonhomogeneous Poisson Process). Use the transformation method of Section 13.6 to generate points from a Poisson process with the intensity function $\lambda(x) = \min(x, 5)$ on $[0, 10]$, and then plot the path of the process.

Exercise 19.54 (Metropolis–Hastings Chain). For the special case that $n = 4$ and $\alpha = \beta = 2$, find explicitly the transition probabilities of the Metropolis–Hastings chain for the Beta–Binomial case (Example 19.15), and verify that the chain is irreducible and aperiodic.

Exercise 19.55 (Barker's Algorithm). Suppose $Y \sim \mathrm{Geo}(p)$. Find explicitly the transition probabilities for Barker's algorithm for simulating from the conditional distribution of $Y$ given that $Y \leq n$, where $n > 1$ is a fixed integer.

Exercise 19.56. Write the following scheme formally as a Metropolis chain: generate a candidate state $j$ according to the target distribution $\pi$, regardless of the current state $i$, and accept the candidate state only if the target distribution assigns at least as much probability to it as to the current state.

Exercise 19.57 * (Autoregressive Metropolis Chain). Suppose the state space $S = \mathbb{R}^d$ for some $d \geq 1$. Formally write the proposal probabilities and the acceptance probabilities of the following Metropolis scheme. If we are currently at state $x$, the candidate state $y$ is generated as $y = \mu + A(x - \mu) + z$, where $\mu$ is a fixed element of the state space, $A$ is a fixed matrix, and $z$ is a random element generated from a density $q(z)$.

Exercise 19.58 (Practical MCMC). Use the independent Metropolis sampler when the target distribution is $N(0, 1)$ and the proposal density is a standard double exponential. Run the chain to length $n = 2500$. Try to determine what is a good burn-in value $B$ by using some graphical method (e.g., you may use the Yu–Mykland CUSUM plot).
Now repeat it when the proposal density is $N(0, 4)$. For which proposal density did you settle for a smaller burn-in time? Is that what you would have expected? Why?
Exercise 19.59 (Practical MCMC). Generate a Metropolis–Hastings chain of length $n = 50$ when the target distribution is $\mathrm{Bin}(20, .5)$.

Exercise 19.60 (Practical MCMC). Generate a Metropolis–Hastings chain of length $n = 50$ when the target distribution is a Poisson with mean 1, truncated to $\{0, 1, \ldots, 5\}$.

Exercise 19.61 (Practical MCMC). Generate an independent Metropolis chain of length $n = 50$ when the target distribution is a Poisson with mean 1, and the proposal distribution is a failure geometric with the pmf $p_j = (\frac{1}{2})^{j+1}, j = 0, 1, 2, \ldots$.

Exercise 19.62 (Visual Comparison of MCMC Chains). For a target distribution of Poisson with mean 1, truncated at 10, generate a Metropolis–Hastings chain of length $n = 50$, plot a histogram, and compare it visually to the histogram of the independent Metropolis chain of the previous exercise.

Exercise 19.63 (Practical MCMC in a Bayes Problem). Suppose $X \sim \mathrm{Bin}(m, p)$ and $p \sim \mathrm{Beta}(\alpha, \beta)$. Use the systematic scan Gibbs sampler to simulate from the joint distribution of $(X, p)$, and take the marginal $p$-output to estimate the mean and the variance of the posterior. Compare these answers to the known exact values (see Chapter 3). Use $m = 25, \alpha = \beta = 2$, and $x = 10$.

Exercise 19.64 (Practical MCMC in a Bayes Problem). Suppose $X \sim \mathrm{Poi}(\lambda), \lambda \sim G(\alpha, \beta)$. Use the systematic scan Gibbs sampler to simulate from the joint distribution of $(X, \lambda)$ and take the $\lambda$-output to estimate the mean and the variance of the posterior. Compare these answers to the known exact values (see Chapter 3). Use $\alpha = 4, \beta = 1, x = 8$.

Exercise 19.65 * (Where to Start). Suppose you wish to use some MCMC scheme for simulating from a posterior distribution of a parameter. What are reasonable starting values?

Exercise 19.66 (Gibbs Chain in a Normal Problem). Suppose $X_1, \ldots, X_n$ are iid $N(\mu, \omega)$, and $(\mu, \omega)$ has the improper prior density $\pi(\mu, \omega) = \frac{1}{\sqrt{\omega}}$. Find the full conditionals $\pi(\mu \mid \omega)$ and $\pi(\omega \mid \mu)$.

Exercise 19.67 * (Gibbs Chain for Bivariate Poisson). Suppose $X, Y, Z$ are independent Poissons with means $\lambda, \mu, \nu$. Let $U = X + Z, V = Y + Z$. The joint distribution of $(U, V)$ is called a bivariate Poisson. Find the full conditionals $\pi(u \mid v)$ and (by symmetry) $\pi(v \mid u)$.

Exercise 19.68 * (Gibbs Chain for a Bimodal Joint Density). Suppose $(X, Y)$ has the joint density

$$\pi(x, y) = c\,e^{-\frac{1}{2}[(x-4)^2 + (y-4)^2 + x^2 y^2]}, \quad -\infty < x, y < \infty,$$

where $c$ is a normalizing constant. Devise a Gibbs sampler to simulate from this target density $\pi$.
Exercise 19.69 * (Monotonicity of Gibbs Error). Show that the systematic scan Gibbs chain for simulating an $m$-dimensional random vector has the property that the total variation distance $\sup_A |P^n(x, A) - \pi(A)|$ is monotonically nonincreasing in $n$.

Exercise 19.70 (A Nonreversible Chain). Consider a stationary Markov chain with the transition probability matrix

$$P = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \end{pmatrix}.$$

Show that the chain has a stationary distribution $\pi$, but that the chain is not reversible.

Exercise 19.71 * (Nonreversibility of Gibbs Chains). Give a counterexample to show that the systematic scan Gibbs sampler in the fixed update order $1 \to 2 \to \cdots \to m$ need not be reversible.

Exercise 19.72 (Reversibility of Gibbs Chains). Show that the Gibbs chain with the transition matrix $P = \frac{1}{m!} \sum_{S(m)} P_{i_1} P_{i_2} \cdots P_{i_m}$ introduced in Section 19.4 is reversible.

Exercise 19.73 (Necessary and Sufficient Condition for Reversibility). Given discrete vectors $x, y$, and a probability distribution $\pi$, suppose $\langle x, y \rangle = \sum_{i \in S} x_i y_i \pi(i)$. Suppose $P$ is the transition probability matrix of a stationary Markov chain on the state space $S$. Prove that $(P, \pi)$ satisfy the equation of detailed balance if and only if $\langle Px, y \rangle = \langle x, Py \rangle$ for all vectors $x, y$ such that $\sum_{i \in S} x_i^2 \pi(i) < \infty$ and $\sum_{i \in S} y_i^2 \pi(i) < \infty$.

Exercise 19.74 * (Necessary and Sufficient Condition for Reversibility). Suppose $P$ is the transition probability matrix of a stationary Markov chain on the finite-state space $S$, and $\pi$ is a probability distribution on $S$. Let $\Pi$ be the diagonal matrix with diagonal elements $\pi(i), i = 1, 2, \ldots, t$. Prove that $(P, \pi)$ satisfy the equation of detailed balance if and only if the matrix $\Pi^{\frac{1}{2}} P \Pi^{-\frac{1}{2}}$ is symmetric.

Exercise 19.75. Generate a Gibbs chain of length $n = 50$ using the random scan Gibbs chain, when the target distribution is a bivariate normal with means 0, standard deviations 1, and correlation 0.5.

Exercise 19.76 * (Gibbs Chain for Poissons with Covariates). Suppose $X_i \mid \lambda_i \sim \mathrm{Poi}(x_i \lambda_i), i = 1, 2, \ldots, m$, and that $X_1, \ldots, X_m$ are independent. Next, $\lambda_i \mid \beta \sim G(\alpha_i, y_i \beta)$, where $\lambda_1, \ldots, \lambda_m$ are independent, and finally $\beta^{-1} \sim G(\gamma, \delta)$. The numbers $x_i, y_i, \alpha_i, \gamma, \delta$ are taken to be given constants. Find the full conditionals $\pi(\lambda_1, \ldots, \lambda_m \mid \beta, X_1, \ldots, X_m)$ and $\pi(\beta \mid \lambda_1, \ldots, \lambda_m, X_1, \ldots, X_m)$. How are these useful in simulating from the posterior distribution of the first-stage parameters $(\lambda_1, \ldots, \lambda_m)$?
Exercise 19.77 * (Gibbs Chain for the ZIP Model). Suppose $X_i, B_i, i = 1, 2, \ldots, m$, are mutually independent, with $X_i \mid \lambda \sim \mathrm{Poi}(\lambda)$, $B_i \mid p \sim \mathrm{Ber}(p)$, $\lambda \sim G(\alpha, \beta)$, $p \sim U[0, 1]$, and that $\lambda, p$ are independent. Find the full conditionals $\pi(\lambda \mid p)$ and $\pi(p \mid \lambda)$.

Exercise 19.78. Find the eigenvalues in analytical form for the three-state stationary Markov chain with the transition matrix

$$P = \begin{pmatrix} \alpha & 1-\alpha & 0 \\ 0 & \beta & 1-\beta \\ 1-\gamma & 0 & \gamma \end{pmatrix}.$$

Hence find an expression for the SLEM. Is this an irreducible chain?

Exercise 19.79 * (Nonconvergent Gibbs Sampler). Give an example, different from the one in the text, for which the systematic scan Gibbs chain does not converge to the target distribution.

Exercise 19.80. Give a proof that for a reversible chain, the eigenvalues of the transition matrix are all in the interval $[-1, 1]$.

Exercise 19.81 * (Ehrenfest Chain). Consider the symmetric Ehrenfest chain of Examples 10.20 and 10.4. For $m = 7$, calculate the SLEM of the transition probability matrix $P$. Plug it into the bounds in parts (a) and (b) of Theorem 19.8, and find the best answer to the following question.
How large an $n$ is needed to make

$$\sup_i \sup_A |P^n(i, A) - \pi(A)| \leq .01?$$

Note that the stationary distribution of the Ehrenfest chain was worked out in Example 10.20.

Exercise 19.82 * (SLEM Calculation for Metropolis–Hastings Chain). Use Example 19.24 to evaluate the SLEM of the Metropolis–Hastings chain when $t = 4$ and the target distribution is $\pi(1) = \frac{1}{3}, \pi(2) = \pi(3) = \frac{1}{4}, \pi(4) = \frac{1}{6}$.

Exercise 19.83 (Extreme Values of the Dobrushin Coefficient). Give necessary and sufficient conditions for $\delta(P)$ to take the values 0, 1.

Exercise 19.84. Calculate the Dobrushin coefficient as well as the SLEM for the nonreversible transition matrix of Example 19.22, and verify that $c \leq \delta(P)$.

Exercise 19.85. * Construct an example in which the SLEM and the Dobrushin coefficient coincide.

Exercise 19.86 * (SLEM versus Dobrushin Coefficients). Prove the better inequality $c \leq \min_n \left( \delta(P^n) \right)^{\frac{1}{n}}$.

Exercise 19.87 * (A Way to Numerically Approximate the SLEM). Show that $\log c = \lim_{n \to \infty} \frac{1}{n} \log \delta(P^n)$.
References

Athreya, K., Doss, H., and Sethuraman, J. (1996). On the convergence of the Markov chain simu-
lation method, Ann. Statist., 24, 89–100.
Barnard, G. (1963). Discussion of paper by M.S. Bartlett, JRSS Ser. B, 25, 294.
Besag, J. and Clifford, P. (1989). Generalized Monte Carlo significance tests, Biometrika, 76,
633–642.
Besag, J. and Clifford, P. (1991). Sequential Monte Carlo p-values, Biometrika, 78, 301–304.
Brémaud, P. (1999). Markov Chains, Springer, New York.
Chan, K. (1993). Asymptotic behavior of the Gibbs sampler, J. Amer. Statist. Assoc., 88, 320–326.
Cowles, M. and Carlin, B. (1996). Markov chain Monte Carlo convergence diagnostics: A com-
parative review, J. Amer. Statist. Assoc., 91, 883–904.
Diaconis, P. (2009). The MCMC revolution, Bull. Amer. Math. Soc., 46, 179–205.
Diaconis, P. and Saloff-Coste, L. (1996). Logarithmic Sobolev inequalities for finite Markov
chains, Ann. Appl. Prob., 6, 695–750.
Diaconis, P. and Saloff-Coste, L. (1998). What do we know about the Metropolis algorithm,
J. Comput. System Sci., 57, 20–36.
Diaconis, P. and Stroock, D. (1991). Geometric bounds for eigenvalues of Markov chains, Ann.
Appl. Prob., 1, 36–61.
Diaconis, P. and Sturmfels, B. (1998). Algebraic algorithms for sampling from conditional distri-
butions, Ann. Statist., 26, 363–398.
Diaconis, P., Khare, K., and Saloff-Coste, L. (2008). Gibbs sampling, exponential families, and
orthogonal polynomials, with discussion, Statist. Sci., 23, 2, 151–200.
Dimakos, X.K. (2001). A guide to exact simulation, Internat. Statist. Rev., 69, 27–48.
Do, K.-A. and Hall, P. (1991). On importance resampling for the bootstrap, Biometrika, 78,
161–167.
Dobrushin, R.L. (1956). Central limit theorems for non-stationary Markov chains II, Theor. Prob.
Appl., 1, 329–383.
Fill, J. (1991). Eigenvalue bounds on convergence to stationarity for non-reversible Markov chains,
with an application to the exclusion process, Ann. Appl. Prob., 1, 62–87.
Fill, J. (1998). An interruptible algorithm for perfect sampling via Markov chains, Ann. App. Prob.,
8, 131–162.
Fishman, G. S. (1995). Monte Carlo, Concepts, Algorithms, and Applications, Springer, New York.
Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference,
Chapman and Hall, London.
Garren, S. and Smith, R.L. (1993). Convergence diagnostics for Markov chain samplers,
Manuscript.
Gelfand, A. and Smith, A.F.M. (1990). Sampling-based approaches to calculating marginal densi-
ties, J. Amer. Statist. Assoc., 85, 398–409.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences,
with discussion, Statist. Sci., 7, 457–511.
Gelman, A., Carlin, B., Stern, H., and Rubin, D. (2003). Bayesian Data Analysis, Chapman and
Hall/CRC, Boca Raton.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6, 721–741.
Geyer, C. (1992). Practical Markov chain Monte Carlo, with discussion, Statist. Sci., 7, 473–511.
Gilks, W., Richardson, S., and Spiegelhalter, D. (Eds.), (1995). Markov Chain Monte Carlo in
Practice, Chapman and Hall, London.
Glauber, R. (1963). Time dependent statistics of the Ising Model, J. Math. Phys., 4, 294–307.
Green, P.J. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model
determination, Biometrika, 82, 711–732.
Hall, P. and Titterington, D.M. (1989). The effect of simulation order on level accuracy and power
of Monte Carlo tests, JRSS Ser. B, 51, 459–467.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications,
Biometrika, 57, 97–109.
Higdon, D. (1998). Auxiliary variables methods for Markov chain Monte Carlo applications,
J. Amer. Statist. Assoc., 93, 585–595.
Jones, G. and Hobert, J. (2001). Honest exploration of intractable probability distributions via
Markov Chain Monte Carlo, Statist. Sci., 16, 312–334.
Kendall, W. and Thönnes, E. (1999). Perfect simulation in stochastic geometry, Patt. Recogn., 32,
1569–1586.
Liu, J. (1995). Eigenanalysis for a Metropolis sampling scheme with comparisons to rejection
sampling and importance sampling, Manuscript.
Liu, J. (2008). Monte Carlo Strategies in Scientific Computing, Springer, New York.
Mengersen, K. and Tweedie, R. (1996). Rates of convergence of Hastings and Metropolis algo-
rithms, Ann. Statist., 24, 101–121.
Mengersen, K., Knight, S., and Robert, C. (2004). MCMC: How do we know when to stop?,
Manuscript.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equations of state
calculations by fast computing machines, J. Chem. Phys., 21, 1087–1092.
Propp, J. and Wilson, D. (1998). How to get a perfectly random sample from a generic Markov
chain and generate a random spanning tree of a directed graph, J. Alg., 27, 170–217.
Ripley, B. D. (1987). Stochastic Simulation, Wiley, New York.
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods, Springer, New York.
Roberts, G. and Rosenthal, J.S. (2004). General state space Markov chains and MCMC algorithms,
Prob. Surveys, 1, 20–71.
Rosenthal, J. (1995). Minorization conditions and convergence rates for Markov chain Monte
Carlo, J. Amer. Statist. Assoc., 90, 558–566.
Rosenthal, J. (1996). Analysis of the Gibbs sampler for a model related to James–Stein estima-
tors, Statist. Comput., 6, 269–275.
Rosenthal, J. (2002). Quantitative convergence rates of Markov chains: A simple account, Electr.
Comm. Prob., 7, 123–128.
Ross, S. (2006). Simulation, Academic Press, New York.
Rubin, H. (1976). Some fast methods of generating random variables with pre-assigned distribu-
tions: General acceptance-rejection procedures, Manuscript.
Schmeiser, B. (1994). Modern simulation environments: Statistical issues, Proceedings of the First
IE Research Conference, 139–144.
Schmeiser, B. (2001). Some myths and common errors in simulation experiments, B. Peters et al.
Eds., Proceedings of the Winter Simulation Conference, 39–46.
Smith, A.F.M. and Roberts, G. (1993). Bayesian computation via the Gibbs sampler, with discus-
sion, JRSS Ser. B, 55, 3–23.
Tanner, M. and Wong, W. (1987). The calculation of posterior distributions, with discussions,
J. Amer. Statist. Assoc., 82, 528–550.
Tierney, L. (1994). Markov chains for exploring posterior distributions, with discussion, Ann.
Statist., 22, 1701–1762.
Yu, B. and Mykland, P. (1994). Looking at Markov samplers through CUSUM path plots: A simple
diagnostic idea, Manuscript.
Chapter 20
Useful Tools for Statistics and Machine Learning

As much as we would like to have analytical solutions to important problems, it is a fact that many of them are simply too difficult to admit closed-form solutions. Common examples of this phenomenon are finding exact distributions of estimators and statistics, computing the value of an exact optimum procedure, such as a maximum likelihood estimate, and numerous combinatorial algorithms of importance in computer science and applied probability. Unprecedented advances in computing power and availability have inspired creative new methods and algorithms for solving old problems; often, these new methods are better than what we had in our toolbox before. This chapter provides a glimpse into a few selected computing tools and algorithms that have had a significant impact on the practice of probability and statistics, specifically, the bootstrap, the EM algorithm, and the use of kernels for smoothing and modern statistical classification. The treatment is meant to be introductory, with references to more advanced parts of the literature.

20.1 The Bootstrap

The bootstrap is a resampling mechanism designed to provide approximations to the sampling distribution of a functional $T(X_1, X_2, \ldots, X_n, F)$, where $F$ is a CDF, typically on some Euclidean space, and $X_1, X_2, \ldots, X_n$ are independent sample observations from $F$. For example, $F$ could be some continuous CDF on the real line, and $T(X_1, X_2, \ldots, X_n, F)$ could be $\sqrt{n}(\bar{X} - \mu)$, where $\mu = E_F(X_1)$. The problem of approximating the distribution is important, because even when the statistic $T(X_1, X_2, \ldots, X_n, F)$ is a simple one, such as the sample mean, we usually cannot find the distribution of $T(X_1, X_2, \ldots, X_n, F)$ exactly for given $n$. Sometimes, there may be a suitable asymptotic normality result known about the statistic $T$, which may be used to form an approximation to the distribution of $T$. A remarkable fact about the bootstrap is that even if such an asymptotic normality result is available, the bootstrap often provides a better approximation to the true distribution of $T$ than does the normal approximation.
The bootstrap is not limited to the iid situation. It has been studied for various kinds of dependent data and highly complex situations. In fact, this versatility of the bootstrap is the principal reason for its huge popularity and the impact that it has had on practice. We recommend Hall (1992) and Shao and Tu (1995) for detailed theoretical developments of the bootstrap, and Efron and Tibshirani (1993) for an application-oriented readable exposition. Modern reviews include Hall (2003), Bickel (2003), Efron (2003), and Lahiri (2006). Lahiri (2003) is a rigorous treatment of the bootstrap for various kinds of dependent data, including problems that arise in time series and spatial statistics.
Suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} F$, and $T(X_1, X_2, \ldots, X_n, F)$ is a functional, for example, $T(X_1, X_2, \ldots, X_n, F) = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}$, where $\mu = \mu(F) = E_F(X_1)$, and $\sigma^2 = \sigma^2(F) = \mathrm{Var}_F(X_1)$, assumed to be finite. In statistical problems, we frequently need to know something about the sampling distribution of $T$, for example, $P_F(T(X_1, X_2, \ldots, X_n, F) \leq t)$. If we had replicated samples from the population, resulting in a series of values for the statistic $T$, then we could form estimates of $P_F(T \leq t)$ by counting how many of the $T_i$s are $\leq t$. But statistical sampling is not done that way. Usually, we do not obtain replicated samples; we obtain just one set of data values of some size $n$. The intuition of the canonical bootstrap is that by the Glivenko–Cantelli theorem (see Chapter 7), the empirical CDF $F_n$ should be very close to the true underlying CDF $F$, and so, sampling from $F_n$, which amounts to simply resampling $n$ values with replacement from the already available data $(X_1, X_2, \ldots, X_n)$, should produce new sets of values that act like samples from $F$ itself. So, although we did not have replicated datasets to start with, it is as if by resampling from the available dataset we now have the desired replications. There is a certain element of faith in this idea, unless we have demonstrable proofs that this simple idea will in fact work, that is, that these resamples lead us to accurate approximations to the true distribution of $T$. It turns out that such theorems are available, and they have led to the credibility and popularity of the bootstrap as a distribution approximation tool. To implement the bootstrap, we only need to be able to generate enough resamples from the original dataset. So, in a sense, the bootstrap replaces a hard mathematical calculation in probability theory by an omnibus and almost automated computing exercise. It is the automatic nature of the bootstrap that makes it so appealing. However, it is also frequently misused in situations where it should not be used, because it is theoretically unjustifiable in those problems, and will in fact give incorrect and inaccurate answers.
Suppose for some number $B$, we draw $B$ resamples of size $n$ from the original sample. Denoting the resamples from the original sample as

$$(X_{11}^*, X_{12}^*, \ldots, X_{1n}^*), (X_{21}^*, X_{22}^*, \ldots, X_{2n}^*), \ldots, (X_{B1}^*, X_{B2}^*, \ldots, X_{Bn}^*),$$

with corresponding values $T_1^*, T_2^*, \ldots, T_B^*$ for the functional $T$, one can use simple frequency-based estimates such as $\frac{\#\{j : T_j^* \leq t\}}{B}$ to estimate $P_F(T \leq t)$. This is the basic idea of the bootstrap.
The formal definition of the bootstrap distribution of a functional is as follows.

Definition 20.1. Suppose $X_1, \ldots, X_n$ are iid observations from a CDF $F$ on some space $\mathcal{X}$, and $T(X_1, X_2, \ldots, X_n, F)$ is a given real-valued functional. The ordinary bootstrap distribution of $T$ is defined as

$$H_{\mathrm{boot}}(x) = P_{F_n}(T(X_1^*, \ldots, X_n^*, F_n) \leq x),$$

where $(X_1^*, \ldots, X_n^*)$ refers to an iid sample of size $n$ from the empirical CDF $F_n$.
It is common to use the notation $P_*$ to denote probabilities under the bootstrap distribution. $P_{F_n}(\cdot)$ corresponds to probability statements over all the $n^n$ possible with-replacement resamples from the original sample $X_1, \ldots, X_n$. Recalculating $T$ from all $n^n$ resamples is basically impossible unless $n$ is very small; therefore one uses a smaller number of $B$ resamples and recalculates $T$ only $B$ times. Thus $H_{\mathrm{boot}}(x)$ is itself estimated by a Monte Carlo, known as bootstrap Monte Carlo. So the final estimate for $P_F(T(X_1, X_2, \ldots, X_n, F) \leq x)$ absorbs errors from two sources: (i) pretending that $(X_{i1}^*, X_{i2}^*, \ldots, X_{in}^*)$ are bona fide samples from $F$; (ii) estimating the true $H_{\mathrm{boot}}(x)$ by a Monte Carlo. By choosing $B$ adequately large, the issue of the Monte Carlo error is generally ignored. The choice of which $B$ would let one ignore the Monte Carlo error is a hard mathematical problem; Hall (1986, 1989) are two key references. It is customary to choose $B$ between 500 and 1000 for variance estimation, and a somewhat larger value for estimating quantiles. It is hard to give any general reliable prescriptions for $B$.
At first glance, the bootstrap idea of resampling from the original sample appears to be a bit too simple to actually work. One has to have a definition for what one means by the bootstrap working in a given situation. For estimating the CDF of a statistic, one should want $H_{\mathrm{boot}}(x)$ to be numerically close to the true CDF, say $H_n(x)$, of $T$. This would require consideration of metrics on CDFs, a topic we covered in Chapter 15. For a general metric $\rho$, the definition of the bootstrap working in a given problem is the following.
Definition 20.2. Let $\rho(F, G)$ be a metric on the space of CDFs on $\mathcal{X}$. For a given functional $T(X_1, X_2, \ldots, X_n; F)$, let

$$H_n(x) = P_F\left(T(X_1, X_2, \ldots, X_n; F) \le x\right),$$
$$H_{boot}(x) = P_*\left(T(X_1^*, X_2^*, \ldots, X_n^*; F_n) \le x\right).$$

We say that the bootstrap is weakly consistent under $\rho$ for $T$ if $\rho(H_n, H_{boot}) \stackrel{P}{\to} 0$ as $n \to \infty$. We say that the bootstrap is strongly consistent under $\rho$ for $T$ if $\rho(H_n, H_{boot}) \stackrel{a.s.}{\to} 0$.
Note that the need for mentioning convergence to zero in probability or a.s. in this definition is due to the fact that the bootstrap distribution $H_{boot}$ is a random CDF. It is a random CDF because, as a function, it depends on the original sample $(X_1, X_2, \ldots, X_n)$. Thus, the bootstrap uses a random CDF to approximate a deterministic but unknown CDF, namely the true CDF $H_n$ of the functional $T$. In principle, a sequence of random CDFs could very well converge to another random CDF, or not converge at all! It is remarkable that under certain minimal conditions, those disasters do not happen, and $H_{boot}$ and $H_n$ get close as $n \to \infty$.
Example 20.1 (Applying the Bootstrap). How does one apply the bootstrap in practice? Suppose, for example, $T(X_1, X_2, \ldots, X_n; F) = \frac{\sqrt{n}(\bar X - \mu)}{\sigma}$. In the canonical bootstrap scheme, we take iid samples from $F_n$. By a simple calculation, the mean and the variance of the empirical distribution $F_n$ are $\bar X$ and $s^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ (note the $n$ rather than $n - 1$ in the denominator). The bootstrap is a device for estimating

$$P_F\left(\frac{\sqrt{n}\,(\bar X - \mu(F))}{\sigma} \le x\right) \quad \text{by} \quad P_{F_n}\left(\frac{\sqrt{n}\,(\bar{X^*} - \bar X)}{s} \le x\right).$$

We further approximate $P_{F_n}\left(\frac{\sqrt{n}(\bar{X^*} - \bar X)}{s} \le x\right)$ by resampling only $B$ times from the original sample set $\{X_1, X_2, \ldots, X_n\}$. In other words, we finally report as our estimate for $P_F\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le x\right)$ the number

$$\frac{\#\left\{j : \frac{\sqrt{n}\,(\bar X^*_{n,j} - \bar X)}{s} \le x\right\}}{B}.$$

This number depends on the original sample set $\{X_1, X_2, \ldots, X_n\}$, the particular resampled sets $(X_{i1}^*, X_{i2}^*, \ldots, X_{in}^*)$, and the bootstrap Monte Carlo sample size $B$. If the bootstrap Monte Carlo is repeated, then for the same $B$, and of course the same original sample set $\{X_1, X_2, \ldots, X_n\}$, the bootstrap estimate will be a different number. We would like the bootstrap estimate to be close to the true value of $P_F\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le x\right)$; consistency of the bootstrap is about our ability to guarantee that for large $n$, with an implicit unspoken assumption of a large $B$.
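As a concrete illustration, here is a minimal Python sketch (using numpy) of exactly this computation; the function name, the seed, the choice of $B$, and the trial data at the end are all arbitrary choices made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(7)  # arbitrary seed

    def boot_cdf_standardized_mean(x, data, B=1000):
        # Monte Carlo estimate of P_{F_n}( sqrt(n)(Xbar* - Xbar)/s <= x );
        # s is the standard deviation of the empirical CDF (divisor n).
        n = len(data)
        xbar, s = data.mean(), data.std()          # np.std divides by n
        means = rng.choice(data, size=(B, n)).mean(axis=1)
        t_star = np.sqrt(n) * (means - xbar) / s
        return float(np.mean(t_star <= x))

    # e.g., the bootstrap estimate at x = 1 for a simulated Exp(1) sample:
    print(boot_cdf_standardized_mean(1.0, rng.exponential(1.0, 25)))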

20.1.1 Consistency of the Bootstrap

We start with the case of the sample mean of iid random variables. If $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} F$ and if $\mathrm{Var}_F(X_i) < \infty$, then $\sqrt{n}(\bar X - \mu)$ has a limiting normal distribution, by the CLT. So a probability such as $P_F(\sqrt{n}(\bar X - \mu) \le x)$ could be approximated by, for example, $\Phi(\frac{x}{s})$, where $s$ is the sample standard deviation. An interesting property of the bootstrap approximation is that even when the CLT approximation $\Phi(\frac{x}{s})$ is available, the bootstrap approximation may be more accurate. Such results generally go by the name of higher-order accuracy of the bootstrap.
But first we present two consistency results corresponding to two specific metrics
that have earned a special status in this literature. The two metrics are
(i) Kolmogorov metric:

$$K(F, G) = \sup_{-\infty < x < \infty} |F(x) - G(x)|;$$

(ii) Mallows–Wasserstein metric:

$$\ell_2(F, G) = \inf_{\Gamma_{2,F,G}} \left( E|Y - X|^2 \right)^{1/2},$$

where $X \sim F$, $Y \sim G$, and $\Gamma_{2,F,G}$ is the class of all joint distributions of $(X, Y)$ with marginals $F$ and $G$, each with a finite second moment. See Chapter 15 for a detailed treatment of these two metrics. We recall from Chapter 15 that the Kolmogorov metric is universally regarded as a natural one. The metric $\ell_2$ is a natural metric for many statistical problems because of its interesting property that $\ell_2(F_n, F) \to 0$ iff $F_n \stackrel{\mathcal{L}}{\Rightarrow} F$ and $E_{F_n}(X^i) \to E_F(X^i)$ for $i = 1, 2$. One might want to use the bootstrap primarily for estimating the CDF, and the mean and the variance of a statistic; thus consistency in $\ell_2$ is just the right result for that purpose.
Theorem 20.1. Suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} F$ and suppose $E_F(X_1^2) < \infty$. Let $T(X_1, X_2, \ldots, X_n; F) = \sqrt{n}(\bar X - \mu)$. Then $K(H_n, H_{boot}) \stackrel{a.s.}{\to} 0$ and $\ell_2(H_n, H_{boot}) \stackrel{a.s.}{\to} 0$ as $n \to \infty$.

Strong consistency in $K$ is proved in Singh (1981), and that for $\ell_2$ is proved in Bickel and Freedman (1981). Notice that $E_F(X_1^2) < \infty$ guarantees that $\sqrt{n}(\bar X - \mu)$ admits a CLT. And the theorem above says that the bootstrap is strongly consistent (with respect to $K$ and $\ell_2$) under that very assumption. This is in fact a very good rule of thumb: if a functional $T(X_1, X_2, \ldots, X_n; F)$ admits a CLT, then the bootstrap would be at least weakly consistent for $T$. Strong consistency might require more assumptions.

Proof. We sketch a proof of the strong consistency in $K$. The proof requires use of the Berry–Esseen inequality, Polya's theorem (see Chapters 7 and 8), and a strong law known as the Zygmund–Marcinkiewicz strong law, which we state below without proof.

Proposition (Zygmund–Marcinkiewicz SLLN). Let $Y_1, Y_2, \ldots$ be iid random variables with CDF $F$ and suppose, for some $0 < \delta < 1$, $E_F|Y_1|^\delta < \infty$. Then $n^{-1/\delta} \sum_{i=1}^n Y_i \stackrel{a.s.}{\to} 0$.

We are now ready to sketch the proof of strong consistency of $H_{boot}$ under the Kolmogorov metric $K$.

Proof of Theorem 20.1. Write $T_n = \sqrt{n}(\bar X - \mu)$ and $T_n^* = \sqrt{n}(\bar{X^*} - \bar X)$. Using the definition of $K$, we can write

$$K(H_n, H_{boot}) = \sup_x \left| P_F\{T_n \le x\} - P_*\{T_n^* \le x\} \right|$$
$$= \sup_x \left| P_F\left\{\frac{T_n}{\sigma} \le \frac{x}{\sigma}\right\} - P_*\left\{\frac{T_n^*}{s} \le \frac{x}{s}\right\} \right|$$
$$= \sup_x \left| P_F\left\{\frac{T_n}{\sigma} \le \frac{x}{\sigma}\right\} - \Phi\left(\frac{x}{\sigma}\right) + \Phi\left(\frac{x}{\sigma}\right) - \Phi\left(\frac{x}{s}\right) + \Phi\left(\frac{x}{s}\right) - P_*\left\{\frac{T_n^*}{s} \le \frac{x}{s}\right\} \right|$$
$$\le \sup_x \left| P_F\left\{\frac{T_n}{\sigma} \le \frac{x}{\sigma}\right\} - \Phi\left(\frac{x}{\sigma}\right) \right| + \sup_x \left| \Phi\left(\frac{x}{\sigma}\right) - \Phi\left(\frac{x}{s}\right) \right| + \sup_x \left| \Phi\left(\frac{x}{s}\right) - P_*\left\{\frac{T_n^*}{s} \le \frac{x}{s}\right\} \right|$$
$$= A_n + B_n + C_n, \text{ say.}$$

That $A_n \to 0$ is a direct consequence of Polya's theorem (see Chapter 7). Also, $s^2$ converges almost surely to $\sigma^2$, and so, by the continuous mapping theorem, $s$ converges almost surely to $\sigma$. Then $B_n \to 0$ almost surely by the fact that $\Phi(\cdot)$ is a uniformly continuous function. Finally, we can apply the Berry–Esseen theorem (see Chapter 8) to show that $C_n$ goes to zero:

$$C_n \le \frac{4}{5} \cdot \frac{E_{F_n}|X_1^* - \bar X|^3}{\sqrt{n}\,[\mathrm{Var}_{F_n}(X_1^*)]^{3/2}} = \frac{4}{5\sqrt{n}} \cdot \frac{\sum_{i=1}^n |X_i - \bar X|^3}{n s^3}$$
$$\le \frac{4 \cdot 2^3}{5 n^{3/2} s^3} \left[ \sum_{i=1}^n |X_i - \mu|^3 + n|\mu - \bar X|^3 \right] = \frac{M}{s^3} \left[ \frac{1}{n^{3/2}} \sum_{i=1}^n |X_i - \mu|^3 + \frac{|\bar X - \mu|^3}{\sqrt{n}} \right],$$

where $M = \frac{32}{5}$.

Because $s \to \sigma > 0$ and $\bar X \to \mu$ almost surely, it is clear that $|\bar X - \mu|^3/(\sqrt{n}\, s^3) \to 0$ almost surely. As regards the first term, let $Y_i = |X_i - \mu|^3$ and $\delta = 2/3$. Then the $\{Y_i\}$ are iid and

$$E|Y_i|^\delta = E_F|X_i - \mu|^{3 \cdot 2/3} = \mathrm{Var}_F(X_1) < \infty.$$

It now follows from the Zygmund–Marcinkiewicz SLLN that

$$\frac{1}{n^{3/2}} \sum_{i=1}^n |X_i - \mu|^3 = n^{-1/\delta} \sum_{i=1}^n Y_i \to 0, \; a.s., \; \text{as } n \to \infty.$$

Thus, $A_n + B_n + C_n \to 0$ almost surely, and hence $K(H_n, H_{boot}) \to 0$.
It is natural to ask if the bootstrap is consistent for $\sqrt{n}(\bar X - \mu)$ even when $E_F(X_1^2) = \infty$. If we insist on strong consistency, then the answer is negative. The point is that the sequence of bootstrap distributions is a sequence of random CDFs, and so it can converge to a random CDF, depending on the particular realization $X_1, X_2, \ldots$, if $E_F(X_1^2) = \infty$. See Athreya (1987), Giné and Zinn (1989), and Hall (1990) for proofs and additional detail.

Example 20.2 (Practical Accuracy of Bootstrap). How does the bootstrap compare with the CLT approximation in actual applications? The question can only be answered by case-by-case simulation. The results are mixed in the following numerical table. The $X_i$ are iid Exp(1) in this example and $T = \sqrt{n}(\bar X - 1)$, with $n = 20$. For the bootstrap approximation, $B = 250$ was used.

    t     H_n(t)    CLT Approximation    H_boot(t)
    -2    0.0098    0.0228               0.0080
    -1    0.1563    0.1587               0.1160
     0    0.5297    0.5000               0.4840
     1    0.8431    0.8413               0.8760
     2    0.9667    0.9772               0.9700

The bootstrap approximation is more accurate than the normal approximation in


the tails in this specific example. This greater accuracy is related to the higher-order
accuracy of the bootstrap; see Section 20.1.3.
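The comparison in the table is easy to replicate; a minimal Python sketch under the same setup ($n = 20$, $B = 250$, Exp(1) data) follows. Because $H_{boot}$ depends on the particular sample and resamples drawn, the output will vary from run to run and will not reproduce the table exactly.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(0)                 # arbitrary seed
    n, B = 20, 250
    x = rng.exponential(1.0, n)                    # the "observed" sample

    # bootstrap distribution of T* = sqrt(n)(Xbar* - Xbar)
    t_star = sqrt(n) * (rng.choice(x, (B, n)).mean(axis=1) - x.mean())

    Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # N(0,1) CDF; sigma = 1 for Exp(1)
    for t in (-2, -1, 0, 1, 2):
        print(t, round(Phi(t), 4), round(float(np.mean(t_star <= t)), 4))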
The ordinary bootstrap, which resamples with replacement from the empirical
CDF Fn , is consistent for many other natural statistics in addition to the sample
mean and even higher-order accurate for some, but under additional conditions.
Examples of such statistics are sample percentiles, and smooth functions of a sam-
ple mean vector. The consistency of the bootstrap for the sample mean under finite
second moments is also true for the multivariate case. The basic theorems are stated
below; see DasGupta (2008) for further details.
Theorem 20.2. Let $\vec X_1, \ldots, \vec X_n, \ldots$ be iid $d$-dimensional vectors with common CDF $F$, with $\mathrm{Cov}_F(\vec X_1) = \Sigma$, $\Sigma$ finite. Let $T(\vec X_1, \vec X_2, \ldots, \vec X_n; F) = \sqrt{n}(\bar{\vec X} - \vec\mu)$. Then $K(H_{boot}, H_n) \stackrel{a.s.}{\to} 0$ as $n \to \infty$.
We know from the ordinary delta theorem (see Chapter 7) that if a sequence of statistics $T_n$ admits a CLT and if $g(\cdot)$ is a smooth transformation, then $g(T_n)$ also admits a CLT. If we were to believe in our rule of thumb, then this would suggest that the bootstrap should be consistent also for $g(T_n)$ if it is already consistent for $T_n$. For the case when $T_n$ is a sample mean vector, the following result holds. This theorem has numerous applications to practically important statistics whose exact distributions are very difficult to find, and so the bootstrap is a very natural and effective tool for approximating their distributions. The mathematical assurance of consistency supplied by the theorem below gives us some confidence that the bootstrap approximation will be reasonable.

Theorem 20.3. Let $\vec X_1, \vec X_2, \ldots, \vec X_n \stackrel{iid}{\sim} F$, and let $\Sigma_{d \times d} = \mathrm{Cov}_F(\vec X_1)$ be finite. Let $T(\vec X_1, \vec X_2, \ldots, \vec X_n; F) = \sqrt{n}(\bar{\vec X} - \vec\mu)$, and for some $m \ge 1$, let $g : \mathbb{R}^d \to \mathbb{R}^m$. If $\nabla g(\cdot)$ exists in a neighborhood of $\vec\mu$, $\nabla g(\vec\mu) \ne 0$, and if $\nabla g(\cdot)$ is continuous at $\vec\mu$, then the bootstrap is strongly consistent with respect to the Kolmogorov metric $K$ for $\sqrt{n}\left[g(\bar{\vec X}) - g(\vec\mu)\right]$. See Shao and Tu (1995, p. 80) for a sketch of a proof of this theorem.

20.1.2 Further Examples

The bootstrap is used in practice for a variety of purposes. It is used to estimate a CDF, or a percentile, or the bias or variance of a statistic $T_n$. For example, if $T_n$ is an estimate for some parameter $\theta$, and if $E_F(T_n - \theta)$ is the bias of $T_n$, the bootstrap estimate $E_{F_n}(T_n^* - T_n)$ can be used to estimate the bias. Likewise, variance estimates can be formed by estimating $\mathrm{Var}_F(T_n)$ by $\mathrm{Var}_{F_n}(T_n^*)$. In other words, to estimate $\mathrm{Var}_F(T_n)$, we sample $B$ sets of samples of size $n$ with replacement from the original sample set, say

$$(X_{11}^*, \ldots, X_{1n}^*), \; \ldots, \; (X_{B1}^*, \ldots, X_{Bn}^*).$$

We compute $T_i^* = T(X_{i1}^*, \ldots, X_{in}^*)$, $i = 1, 2, \ldots, B$, and their mean $\bar{T^*}$, and estimate $\mathrm{Var}_F(T_n)$ by $\frac{1}{B} \sum_{i=1}^B (T_i^* - \bar{T^*})^2$. This is the basic bootstrap variance estimate.
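In code, these two estimates are a few lines; the following Python sketch is a minimal version, with an arbitrary statistic (the sample median) and simulated data used only to show the call.

    import numpy as np

    rng = np.random.default_rng(3)                 # arbitrary seed

    def boot_bias_var(stat, data, B=500):
        # Basic bootstrap estimates of the bias and the variance of stat(data).
        n = len(data)
        t_star = np.array([stat(rng.choice(data, n)) for _ in range(B)])
        return t_star.mean() - stat(data), t_star.var()

    data = rng.exponential(1.0, 50)
    bias_hat, var_hat = boot_bias_var(np.median, data, B=1000)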
One wants to know how accurate the bootstrap-based estimates are in reality. This can only be answered on the basis of case-by-case investigation. Some overall qualitative phenomena have emerged from these investigations. For instance,
(a) The bootstrap distribution estimate captures information about skewness that
the CLT will miss.
(b) But the bootstrap tends to underestimate the variance of a statistic T .
Here are a few more illustrative examples.
Example 20.3 (Bootstrapping the Sample Variance). Let $X_1, X_2, \ldots$ be iid one-dimensional random variables with the CDF $F$, and suppose $E_F(X_1^4) < \infty$. Let $\vec Y_i = \binom{X_i}{X_i^2}$. Then, with $d = 2$, $\vec Y_1, \vec Y_2, \ldots, \vec Y_n$ are iid $d$-dimensional vectors with $\mathrm{Cov}(\vec Y_1)$ finite. Note that

$$\bar{\vec Y} = \begin{pmatrix} \bar X \\ \frac{1}{n} \sum_{i=1}^n X_i^2 \end{pmatrix}.$$

Consider the transformation $g : \mathbb{R}^2 \to \mathbb{R}^1$ defined as $g(u, v) = v - u^2$. Then

$$\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar X^2 = g(\bar{\vec Y}).$$

If we let $\vec\mu = E(\vec Y_1)$, then $g(\vec\mu) = \sigma^2 = \mathrm{Var}(X_1)$. Because $g(\cdot)$ satisfies the conditions of Theorem 20.3, it follows that the bootstrap is strongly consistent with respect to the Kolmogorov metric $K$ for

$$\sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2 - \sigma^2 \right).$$

Example 20.4 (Bootstrapping the Sample Skewness Coefficient). Suppose $X_1, X_2, \ldots$ are iid observations from a CDF $F$ on the real line with $E_F(X_1^6) < \infty$. Then it follows from the delta theorem (see Chapter 7 and Exercise 7.36) that the standardized sample skewness $\sqrt{n}\,[b_1 - \beta]$ is asymptotically normally distributed with zero mean and a variance that will depend on $F$. Here,

$$\beta = \frac{E_F(X_1 - \mu)^3}{\sigma^3} \quad \text{and} \quad b_1 = \frac{\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^3}{s^3}.$$

In fixed-size samples, when $n$ is not too large, the true distribution of $\sqrt{n}\,[b_1 - \beta]$ will of course not be normal, and will depart from normality in various ways, depending on $F$. We sampled $n = 30$ observations from a standard normal distribution, and then bootstrapped the functional $\sqrt{n}\,[b_1 - \beta]$ using a bootstrap Monte Carlo size of $B = 600$. A histogram of the bootstrapped values in Fig. 20.1 shows the long right tail, which the normal approximation would not have shown. The histogram also shows a heavier central part in the distribution of $\sqrt{n}\,[b_1 - \beta]$ than a normal approximation would have shown. The true distribution of $\sqrt{n}\,[b_1 - \beta]$ would be impossible to find, and so this is fertile ground for use of the bootstrap.
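A minimal Python sketch of this experiment (the seed is arbitrary; drawing the histogram itself is left to any plotting library):

    import numpy as np

    rng = np.random.default_rng(11)                # arbitrary seed
    n, B = 30, 600
    x = rng.standard_normal(n)                     # the observed sample

    def b1(a):
        # sample skewness: third central moment over s^3 (divisor-n moments)
        return np.mean((a - a.mean())**3) / a.std()**3

    t_star = np.array([np.sqrt(n) * (b1(rng.choice(x, n)) - b1(x))
                       for _ in range(B)])
    # a histogram of t_star shows the long right tail seen in Fig. 20.1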

Example 20.5 (Bootstrapping the Correlation Coefficient). Suppose $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, are iid $BVN(0, 0, 1, 1, \rho)$ and let $r$ be the sample correlation coefficient. Let $T_n = \sqrt{n}(r - \rho)$. We know that $T_n \stackrel{\mathcal{L}}{\Rightarrow} N(0, (1 - \rho^2)^2)$; see Chapter 7. Convergence to normality is very slow. There is also an exact formula for the density of $r$. For $n \ge 4$, the exact density is

$$f(r \mid \rho) = \frac{2^{n-3}\,(1 - \rho^2)^{(n-1)/2}}{\pi\,(n - 3)!}\,(1 - r^2)^{(n-4)/2} \sum_{k=0}^{\infty} \Gamma^2\left(\frac{n + k - 1}{2}\right) \frac{(2\rho r)^k}{k!};$$

see Tong (1990).

[Fig. 20.1: Histogram of bootstrapped values]



In the table below, we give simulation averages of the estimated standard deviation of $r$ by using the bootstrap. We used $n = 20$ and $B = 200$. The bootstrap estimate was calculated for 1000 independent simulations; the table reports the average of the standard deviation estimates over the 1000 simulations.

    n     True rho    True s.d. of r    CLT Estimate    Bootstrap Estimate
    20    0.0         0.230             0.232           0.217
          0.5         0.182             0.175           0.160
          0.9         0.053             0.046           0.046

Except when $\rho$ is large, the bootstrap underestimates the variance, and the CLT estimate is better.

Example 20.6 (The t-Statistic for Poisson Data). Suppose $X_1, \ldots, X_n$ are iid Poi($\lambda$) and let $T_n$ be the t-statistic $T_n = \sqrt{n}(\bar X - \lambda)/s$. In this example $n = 20$ and $B = 200$, and for the actual data, $\lambda$ was chosen to be 1. Apart from the bias and the variance of $T_n$, in this example we also report percentile estimates for $T_n$. The bootstrap percentile estimates are found by calculating $T_n^*$ for the $B$ resamples and calculating the corresponding percentile value of the $B$ values of $T_n^*$. The bias and the variance are estimated to be $-0.18$ and 1.614, respectively. The estimated percentiles are reported in the table. Note that the 5th and the 95th percentiles are not equal in absolute value in this table; neither are the 10th and the 90th. Such potential strong skewness would have remained undetected if we had simply used a normal approximation.
    alpha    Estimated 100*alpha Percentile
    0.05     -2.45
    0.10     -1.73
    0.25     -0.76
    0.50     -0.17
    0.75      0.49
    0.90      1.25
    0.95      1.58
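A minimal Python sketch of this computation (the seed, and hence the data and the estimates, are arbitrary; a resample with all values equal would make $s^* = 0$, but with these data that is essentially never an issue):

    import numpy as np

    rng = np.random.default_rng(5)                 # arbitrary seed
    n, B = 20, 200
    x = rng.poisson(1.0, n)                        # observed data, lambda = 1
    xbar = x.mean()

    t_star = np.empty(B)
    for j in range(B):
        xs = rng.choice(x, n)                      # with-replacement resample
        t_star[j] = np.sqrt(n) * (xs.mean() - xbar) / xs.std(ddof=1)

    print("bias:", t_star.mean(), " variance:", t_star.var())
    print(np.percentile(t_star, [5, 10, 25, 50, 75, 90, 95]))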

Example 20.7 (Bootstrap Failure). In spite of the many consistency theorems in the previous section, there are instances where the ordinary bootstrap with replacement sampling from $F_n$ actually does not work. Typically, these are instances where the functional $T_n$ fails to admit a CLT. Here is a simple example where the ordinary bootstrap fails to consistently estimate the true distribution of a statistic.

Let $X_1, X_2, \ldots$ be iid $U(0, \theta)$ and let $T_n = n(\theta - X_{(n)})$, $T_n^* = n(X_{(n)} - X_{(n)}^*)$. The ordinary bootstrap will fail in this example in the sense that the conditional distribution of $T_n^*$ given $X_{(n)}$ does not converge to the Exp($\theta$) distribution a.s. Let us take, for notational simplicity, $\theta = 1$. Then, for $t \ge 0$,

$$P_{F_n}(T_n^* \le t) \ge P_{F_n}(T_n^* = 0) = P_{F_n}\left(X_{(n)}^* = X_{(n)}\right) = 1 - P_{F_n}\left(X_{(n)}^* < X_{(n)}\right) = 1 - \left(\frac{n-1}{n}\right)^n \stackrel{n \to \infty}{\longrightarrow} 1 - e^{-1}.$$

For example, take $t = 0.0001$; then $\liminf_n P_{F_n}(T_n^* \le t) \ge 1 - e^{-1}$, whereas $\lim_n P_F(T_n \le t) = 1 - e^{-0.0001} \approx 0$. So $P_{F_n}(T_n^* \le t) \not\to P_F(T_n \le t)$. The phenomenon of this example can be generalized essentially to any CDF $F$ with a compact support $[\underline{\omega}(F), \overline{\omega}(F)]$, with some conditions on $F$, such as existence of a smooth and positive density. There are variants of the ordinary bootstrap that correct the bootstrap failure in this example; see DasGupta (2008).
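The failure is easy to see numerically. In the Python sketch below (sizes and seed arbitrary), the bootstrap probability of the event $\{T_n^* = 0\}$ stays near $1 - e^{-1} \approx 0.632$ as $n$ grows, instead of going to zero as the limiting exponential distribution would require.

    import numpy as np

    rng = np.random.default_rng(2)                 # arbitrary seed
    B = 2000
    for n in (100, 1000, 10000):
        x = rng.uniform(0, 1, n)
        hits = sum(rng.choice(x, n).max() == x.max() for _ in range(B))
        print(n, hits / B)                         # hovers near 1 - 1/e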

20.1.3 * Higher-Order Accuracy of the Bootstrap

One question about the use of the bootstrap is whether the bootstrap has any advantages at all when a CLT is already available. To be specific, suppose $T(X_1, \ldots, X_n; F) = \sqrt{n}(\bar X - \mu)$. If $\sigma^2 = \mathrm{Var}_F(X) < \infty$, then $\sqrt{n}(\bar X - \mu) \stackrel{\mathcal{L}}{\Rightarrow} N(0, \sigma^2)$ and $K(H_{boot}, H_n) \stackrel{a.s.}{\to} 0$. So two competitive approximations to $P_F(T(X_1, \ldots, X_n; F) \le x)$ are $\Phi(\frac{x}{\hat\sigma})$ and $P_{F_n}(\sqrt{n}(\bar{X^*} - \bar X) \le x)$. It turns out that for certain types of statistics, the bootstrap approximation is (theoretically) more accurate than the approximation provided by the CLT. The CLT, because any normal distribution is symmetric, cannot capture information about the skewness in the finite sample distribution of $T$. The bootstrap approximation does so. So the bootstrap succeeds in correcting for skewness, just as an Edgeworth expansion would do (see Chapter 1). This is called Edgeworth correction by the bootstrap, and the property is called second-order accuracy of the bootstrap. It is important to remember that second-order accuracy is not automatic; it holds for certain types of $T$ but not for others. It is also important to understand that practical accuracy and theoretical higher-order accuracy can be different things. The following heuristic calculation illustrates when second-order accuracy can be anticipated. The first result on higher-order accuracy of the bootstrap is due to Singh (1981). In addition to the references we provided in the beginning, Lehmann (1999) gives a very readable treatment of higher-order accuracy of the bootstrap.

Suppose $X_1, \ldots, X_n \stackrel{iid}{\sim} F$ and $T(X_1, \ldots, X_n; F) = \frac{\sqrt{n}(\bar X - \mu)}{\sigma}$; here $\sigma^2 = \mathrm{Var}_F(X_1) < \infty$. We know that $T$ admits the Edgeworth expansion (see Chapter 1):

$$P_F(T \le x) = \Phi(x) + \frac{p_1(x \mid F)}{\sqrt{n}}\,\phi(x) + \frac{p_2(x \mid F)}{n}\,\phi(x) + \text{smaller order terms},$$

and likewise,

$$P_*(T^* \le x) = \Phi(x) + \frac{p_1(x \mid F_n)}{\sqrt{n}}\,\phi(x) + \frac{p_2(x \mid F_n)}{n}\,\phi(x) + \text{smaller order terms},$$

so that

$$H_n(x) - H_{boot}(x) = \frac{p_1(x \mid F) - p_1(x \mid F_n)}{\sqrt{n}}\,\phi(x) + \frac{p_2(x \mid F) - p_2(x \mid F_n)}{n}\,\phi(x) + \text{smaller order terms}.$$

Recall now that the polynomials $p_1, p_2$ are given as

$$p_1(x \mid F) = \frac{\gamma}{6}\,(1 - x^2),$$
$$p_2(x \mid F) = \frac{\kappa - 3}{24}\,x(3 - x^2) - \frac{\gamma^2}{72}\,x(x^4 - 10x^2 + 15),$$

where $\gamma = \frac{E_F(X_1 - \mu)^3}{\sigma^3}$ and $\kappa = \frac{E_F(X_1 - \mu)^4}{\sigma^4}$. Because $\gamma_{F_n} - \gamma_F = O_p(\frac{1}{\sqrt{n}})$ and $\kappa_{F_n} - \kappa_F = O_p(\frac{1}{\sqrt{n}})$, just from the CLT for $\gamma_{F_n}$ and $\kappa_{F_n}$ (under finiteness of four moments), one obtains $H_n(x) - H_{boot}(x) = O_p(\frac{1}{n})$. If we contrast this to the CLT approximation: in general, the error in the CLT is $O(\frac{1}{\sqrt{n}})$, as is known from the Berry–Esseen theorem (Chapter 8). The $\frac{1}{\sqrt{n}}$ rate of the error of the CLT cannot be improved in general even if there are four moments. Thus, by looking at the standardized statistic $\frac{\sqrt{n}(\bar X - \mu)}{\sigma}$, we have succeeded in making the bootstrap one order more accurate than the CLT. This is called second-order accuracy of the bootstrap.
If one does not standardize, then

$$P_F\left(\sqrt{n}(\bar X - \mu) \le x\right) = P_F\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le \frac{x}{\sigma}\right) \to \Phi\left(\frac{x}{\sigma}\right),$$

and the leading term in the bootstrap approximation in this unstandardized case would be $\Phi(\frac{x}{s})$. So in the unstandardized case, the bootstrap approximates the true CDF $H_n(x)$ also at the rate $\frac{1}{\sqrt{n}}$; that is, if one does not standardize, then $H_n(x) - H_{boot}(x) = O_p(\frac{1}{\sqrt{n}})$. We have now lost the second-order accuracy. The following second rule of thumb often applies.
Rule of Thumb. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} F$ and let $T(X_1, \ldots, X_n; F)$ be a functional. If $T(X_1, \ldots, X_n; F) \stackrel{\mathcal{L}}{\Rightarrow} N(0, \tau^2)$, where $\tau$ is independent of $F$, then second-order accuracy is likely. Proving it depends on the availability of an Edgeworth expansion for $T$. If $\tau$ depends on $F$ (i.e., $\tau = \tau(F)$), then the bootstrap should be just first-order accurate.
Thus, as we now show, the canonical bootstrap is second-order accurate for the standardized mean $\frac{\sqrt{n}(\bar X - \mu)}{\sigma}$. From an inferential point of view, it is not particularly useful to have an accurate approximation to the distribution of $\frac{\sqrt{n}(\bar X - \mu)}{\sigma}$, because $\sigma$ would usually be unknown, and the accurate approximation could not really be used to construct a confidence interval for $\mu$. Still, the second-order accuracy result is theoretically insightful.
We state a specific result below for the case of standardized and nonstandardized sample means. Let

$$H_n(x) = P_F\left(\sqrt{n}(\bar X - \mu) \le x\right), \qquad H_{n,0}(x) = P_F\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le x\right),$$
$$H_{boot}(x) = P_*\left(\sqrt{n}(\bar{X^*} - \bar X) \le x\right), \qquad H_{boot,0}(x) = P_{F_n}\left(\frac{\sqrt{n}(\bar{X^*} - \bar X)}{s} \le x\right).$$

Theorem 20.4. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} F$.
(a) If $E_F|X_1|^3 < \infty$ and $F$ is nonlattice, then $K(H_{n,0}, H_{boot,0}) = o_p(\frac{1}{\sqrt{n}})$.
(b) If $E_F|X_1|^3 < \infty$ and $F$ is lattice, then $\sqrt{n}\,K(H_{n,0}, H_{boot,0}) \stackrel{P}{\to} c$, $0 < c < \infty$.

See Shao and Tu (1995, pp. 92–94) for a proof. The constant $c$ in the lattice case equals $\frac{h}{\sqrt{2\pi}\,\sigma}$, where $h$ is the span of the lattice $\{a + kh,\, k = 0, \pm 1, \pm 2, \ldots\}$ on which the $X_i$ are supported. Note also that part (a) says that higher-order accuracy in the standardized case obtains with three moments; Hall (1988) showed that finiteness of three absolute moments is in fact necessary and sufficient for higher-order accuracy of the bootstrap in the standardized case.

20.1.4 Bootstrap for Dependent Data

The ordinary bootstrap that resamples observations with replacement from the orig-
inal dataset does not work when the sample observations are dependent. This was
already pointed out in Singh (1981). It took some time before consistent bootstrap
schemes were offered for dependent data. There are consistent schemes that are
meant for specific dependence structures (e.g., stationary autoregression of a known
order) and there are also general bootstrap schemes that work for large classes of
stationary time series without requiring any particular dependence structure. The
model-based schemes are better for the specific models, but can completely fail if
some assumption about the specific model does not hold. Block bootstrap methods
are regarded as the bread and butter of resampling for dependent sequences. These
are general and mostly all-purpose resampling schemes that provide at least consis-
tency for a wide selection of dependent data models.
The basic idea of the block bootstrap method is that if the underlying series is a
stationary process with short-range dependence, then blocks of observations of suit-
able lengths should be approximately independent. Also, the joint distribution of
the variables in different blocks would be about the same, due to stationarity.

So, if we resample blocks of observations, rather than observations one at a


time, then that should bring us back to the nearly iid situation, a situation in
which the bootstrap is known to succeed. Block bootstrap was first suggested in
Carlstein (1986) and Künsch (1989). Various block bootstrap schemes are now
available. We only present three such schemes, for which the block length is non-
random. A small problem with some of the blocking schemes is that the “starred”
time series is not stationary, although the original series is, by hypothesis, stationary.
A version of the block bootstrap that resamples blocks of random length allows the
“starred” series to be provably stationary. This is called the stationary bootstrap,
proposed in Politis and Romano (1994) and Politis et al. (1999). However, later
theoretical studies have established that the auxiliary randomization to determine
the block lengths can make the stationary bootstrap less accurate. For this reason,
we only discuss three blocking methods with nonrandom block lengths.
(a) (Nonoverlapping Block Bootstrap (NBB)). In this scheme, one splits the observed series $\{y_1, \ldots, y_n\}$ into nonoverlapping blocks
$$B_1 = \{y_1, \ldots, y_h\}, \; B_2 = \{y_{h+1}, \ldots, y_{2h}\}, \; \ldots, \; B_m = \{y_{(m-1)h+1}, \ldots, y_{mh}\},$$
where it is assumed that $n = mh$. The common block length is $h$. One then resamples $B_1^*, B_2^*, \ldots, B_m^*$ at random, with replacement, from $\{B_1, \ldots, B_m\}$. Finally, the $B_i^*$s are pasted together to obtain the "starred" series $y_1^*, \ldots, y_n^*$.
(b) (Moving Block Bootstrap (MBB)). In this scheme, the blocks are
$$B_1 = \{y_1, \ldots, y_h\}, \; B_2 = \{y_2, \ldots, y_{h+1}\}, \; \ldots, \; B_N = \{y_{n-h+1}, \ldots, y_n\},$$
where $N = n - h + 1$. One then resamples $B_1^*, \ldots, B_m^*$ from $B_1, \ldots, B_N$, where still $n = mh$.
(c) (Circular Block Bootstrap (CBB)). In this scheme, one periodically extends the observed series as $y_1, y_2, \ldots, y_n, y_1, y_2, \ldots, y_n, \ldots$. Suppose we let $z_i$ be the members of this new series, $i = 1, 2, \ldots$. The blocks are defined as
$$B_1 = \{z_1, \ldots, z_h\}, \; B_2 = \{z_2, \ldots, z_{h+1}\}, \; \ldots, \; B_n = \{z_n, \ldots, z_{n+h-1}\}.$$
One then resamples $B_1^*, \ldots, B_m^*$ from $B_1, \ldots, B_n$.
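A minimal Python sketch of the MBB and CBB resamplers follows; the function names are ours, the AR(1) test series is an arbitrary choice, and $n$ is chosen so that $n = mh$ exactly.

    import numpy as np

    rng = np.random.default_rng(4)                 # arbitrary seed

    def mbb_resample(y, h):
        # Moving block bootstrap: paste together m = n//h randomly chosen
        # overlapping blocks of length h.
        n = len(y)
        starts = rng.integers(0, n - h + 1, size=n // h)
        return np.concatenate([y[s:s + h] for s in starts])

    def cbb_resample(y, h):
        # Circular block bootstrap: blocks may wrap around the end.
        n = len(y)
        starts = rng.integers(0, n, size=n // h)
        idx = (starts[:, None] + np.arange(h)) % n
        return y[idx].ravel()

    # block bootstrap estimate of Var(ybar) for a simulated AR(1) series:
    n, B, h = 512, 500, 8                          # h is of the order n^(1/3)
    y = np.empty(n)
    y[0] = rng.standard_normal()
    for t in range(1, n):
        y[t] = 0.6 * y[t - 1] + rng.standard_normal()
    var_hat = np.var([mbb_resample(y, h).mean() for _ in range(B)])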
We now give some theoretical properties of the three block bootstrap methods
described above. The results below are due to Lahiri (1999). We need a definition
for the result below.

Definition 20.3. Let $Y_n$, $n = 0, \pm 1, \pm 2, \ldots$, be a stationary time series with covariance function $\gamma(k) = \mathrm{Cov}(Y_t, Y_{t+k})$, $k = 0, \pm 1, \pm 2, \ldots$. The spectral density of the series is the function

$$f(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma(k)\, e^{-ik\omega}, \quad -\pi < \omega \le \pi,$$

where $i = \sqrt{-1}$.

Just as the covariance function characterizes second-order properties of the stationary time series in the time domain, the spectral density does the same working with the frequency domain. Both approaches are useful, and they complement each other.

Suppose $\{y_i : -\infty < i < \infty\}$ is a $d$-dimensional stationary process with a finite mean $\mu$ and spectral density $f$. Let $h : \mathbb{R}^d \to \mathbb{R}$ be a sufficiently smooth function. Let $\theta = h(\mu)$ and $\hat\theta_n = h(\bar y_n)$, where $\bar y_n$ is the mean of the realized series. We propose to use the block bootstrap schemes to estimate the bias and variance of $\hat\theta_n$. Precisely, let $b_n = E(\hat\theta_n - \theta)$ be the bias and let $\sigma_n^2 = \mathrm{Var}(\hat\theta_n)$ be the variance. We use the block bootstrap based estimates of $b_n$ and $\sigma_n^2$, denoted by $\hat b_n$ and $\hat\sigma_n^2$, respectively.

Next, let $T_n = \hat\theta_n - \theta = h(\bar y_n) - h(\mu)$, and let $T_n^* = h(\bar y_n^*) - h(E_* \bar y_n^*)$. The estimates $\hat b_n$ and $\hat\sigma_n^2$ are defined as $\hat b_n = E_* T_n^*$ and $\hat\sigma_n^2 = \mathrm{Var}_*(T_n^*)$. Then the following asymptotic expansions hold; see Lahiri (1999).
Theorem 20.5. Let $h : \mathbb{R}^d \to \mathbb{R}$ be a sufficiently smooth function.
(a) For each of the NBB, MBB, and CBB, there exists $c_1 = c_1(f)$ such that
$$E\hat b_n = b_n + \frac{c_1}{nh} + o((nh)^{-1}), \quad n \to \infty.$$
(b) For the NBB, there exists $c_2 = c_2(f)$ such that
$$\mathrm{Var}(\hat b_n) = \frac{2\pi^2 c_2 h}{n^3} + o(hn^{-3}), \quad n \to \infty,$$
and for the MBB and CBB,
$$\mathrm{Var}(\hat b_n) = \frac{4\pi^2 c_2 h}{3n^3} + o(hn^{-3}), \quad n \to \infty.$$
(c) For each of NBB, MBB, and CBB, there exists $c_3 = c_3(f)$ such that $E(\hat\sigma_n^2) = \sigma_n^2 + \frac{c_3}{nh} + o((nh)^{-1})$, $n \to \infty$.
(d) For NBB, there exists $c_4 = c_4(f)$ such that
$$\mathrm{Var}(\hat\sigma_n^2) = \frac{2\pi^2 c_4 h}{n^3} + o(hn^{-3}), \quad n \to \infty,$$
and for the MBB and CBB,
$$\mathrm{Var}(\hat\sigma_n^2) = \frac{4\pi^2 c_4 h}{3n^3} + o(hn^{-3}), \quad n \to \infty.$$
We now use these expansions to derive optimal block sizes. The asymptotic expansions for the bias and the variance are combined to derive mean-squared error optimal block sizes. For example, for estimating $b_n$ by $\hat b_n$, the leading term in the expansion for the mean-squared error is

$$m(h) = \frac{4\pi^2 c_2 h}{3n^3} + \frac{c_1^2}{n^2 h^2}.$$

To minimize $m(\cdot)$, we solve $m'(h) = 0$ to get

$$h_{opt} = \left( \frac{3c_1^2}{2\pi^2 c_2} \right)^{1/3} n^{1/3}.$$

Similarly, an optimal block length can be derived for estimating $\sigma_n^2$ by $\hat\sigma_n^2$. We state the following optimal block length result of Lahiri (1999) below.

Theorem 20.6. For the MBB and the CBB, the optimal block length for estimating $b_n$ by $\hat b_n$ satisfies
$$h_{opt} = \left( \frac{3c_1^2}{2\pi^2 c_2} \right)^{1/3} n^{1/3} (1 + o(1)),$$
and the optimal block length for estimating $\sigma_n^2$ by $\hat\sigma_n^2$ satisfies
$$h_{opt} = \left( \frac{3c_3^2}{2\pi^2 c_4} \right)^{1/3} n^{1/3} (1 + o(1)).$$

The constants $c_i$ depend on the spectral density $f$ of the process, which would be unknown in a statistical context. So the optimal block lengths cannot be directly used. Plug-in estimates for the $c_i$ may be substituted. Or, the formulas can be used to try block lengths proportional to $n^{1/3}$, with flexible proportionality constants. There are also other methods in the literature on selection of block lengths; see Hall et al. (1995) and Politis and White (2004).

20.2 The EM Algorithm

Maximum likelihood is a mainstay of parametric statistical inference. The idea of maximizing the likelihood function globally over the parameter space $\Theta$ has an intuitive appeal. In addition, well-known theorems exist that show the asymptotic optimality of the MLE for finite-dimensional and suitably smooth parametric problems. Lehmann and Casella (1998), Bickel and Doksum (2006), Le Cam and Yang (1990), and DasGupta (2008) are a few sources where the maximum likelihood estimate is carefully studied. A long history of success, the intrinsic intuition, and the theoretical support have all led to the mostly deserved reputation of maximum likelihood estimates as the estimate to be preferred in problems that do not have too many parameters.

However, closed-form formulas for maximum likelihood estimates are rare out-
side of the exponential family structure. In such cases, one must understand the
shape and boundedness properties of the likelihood function, and carefully compute
the maximum likelihood estimate numerically for the observed data. Driven by this
need, Fisher gave the well-known scoring method, the first iterative method for nu-
merical calculation of the maximum likelihood estimate. In problems with a small
number of parameters, the scoring method is known to work well, under some con-
ditions. It is awkward to use when the number of parameters is even moderately
large.
The EM algorithm, formally introduced in Dempster et al. (1977) as a general-
purpose iterative numerical method for approximating the maximum likelihood
estimate can be applicable, and even successful, when the scoring method is difficult
to apply. The EM algorithm has become a mainstay of the numerical approximation
of the maximum likelihood estimate, with widespread applications, quite like max-
imum likelihood itself is the mainstay of the estimation paradigm in parametric
inference. The reputation of one seems to fuel the popularity of the other, although
one of them is a principle, and the other a numerical scheme. The standard reference
on the EM algorithm, its various mutations, and practical applications and properties
is McLachlan and Krishnan (2008). Algorithms very similar to the EM algorithm
were previously described in several places, notably Sundberg (1974) for the case of
exponential families. The basic general algorithm is presented in this section with a
description of some of its known properties and known weaknesses.
Underlying each application of the EM algorithm, there is an implicit element of missing data, say $Z$, and some observed data, say $Y$. If the missing data $Z$ did become available, one would have the complete data $X = (Y, Z)$. Truly, the likelihood function, say $l(\theta; Y)$, depends only on the data $Y$ that we have, and not the data that we might have had. However, the EM algorithm effectively fills in the missing data $Z$ using the observed data $Y$ and a current value for the parameter $\theta$, thereby producing a fictitious complete data likelihood $l(\theta; X)$. One finds the projection of this fictitious complete data likelihood $l(\theta; X)$ onto the class of functions that depend only on the actual observed data $Y$, which is then maximized over $\theta \in \Theta$ to produce a candidate maximum likelihood estimate $\hat\theta$. This $\hat\theta$ is used as the next current value for $\theta$, and the process is let run until convergence within tolerable fluctuation appears to have been achieved. The filling-in part corresponds to the E-part of the algorithm, and the maximization corresponds to the M-part of the algorithm. Because statistical models are often such that the logarithm of the likelihood function is a more manipulable function than the likelihood itself, the algorithm works with the log-likelihood, for which we use the notation $L = \log l$ below.
It is important to note that the so-called missing data Z may be really physically
missing in some problems, whereas in other problems the missing data are imag-
inary, deviously thought of so that the complete data likelihood l.; X / becomes
particularly pleasant and receptive to easy global maximization. In those problems
where the missing data are an artificial construct, there would be a choice as to how
to embed the problem into a missing data structure, and part of the art of the method
is to pick a wise embedding.

20.2.1 The Algorithm and Examples

The EM algorithm runs as follows.

(a) Start with an initial value $\theta_0$.
(b) At the $k$th stage of the iterative algorithm, find the best guess for what the idealized complete data likelihood would have been, by finding the conditional expectation
$$\hat L_k(\theta; Y) = E_{\theta_k}[L(\theta; X) \mid Y].$$
(c) Maximize this predicted complete data log-likelihood $\hat L_k(\theta; Y)$ over $\theta$.
(d) Set the next stage current value of $\theta$ to be
$$\theta_{k+1} = \mathrm{argmax}_{\theta \in \Theta}\, \hat L_k(\theta; Y).$$
In general, $\mathrm{argmax}_{\theta \in \Theta}\, \hat L_k(\theta; Y)$ may be a set, rather than a single point. In that case, choose any member of that set to be the next stage current value.
(e) Iterate until convergence to satisfaction appears to have been achieved.
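The E- and M-steps are problem specific, so in code the reusable part is only the outer iteration. The following minimal Python driver (our own construction, not a standard library routine) takes a hypothetical `update` function performing one combined E- and M-step; the concrete examples below each supply such an update.

    import numpy as np

    def em(theta0, update, tol=1e-10, max_iter=1000):
        # Generic EM driver: update maps the current iterate theta_k
        # to theta_{k+1} (one E-step followed by one M-step).
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            theta_new = np.asarray(update(theta), dtype=float)
            if np.max(np.abs(theta_new - theta)) < tol:  # apparent convergence
                return theta_new
            theta = theta_new
        return theta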
Thus, implementation of the EM algorithm is substantially more straightforward when the calculation of the conditional expectation $E_{\theta_k}[L(\theta; X) \mid Y]$ and the maximization of $\hat L_k(\theta; Y)$ can be done in closed-form. These two closed-form calculations may not be possible unless one has an exponential family structure in the complete data $X$. There have been newer versions of the EM algorithm that try to bypass closed-form calculations of these two quantities; for example, the self-evident idea of using Monte Carlo to calculate $E_{\theta_k}[L(\theta; X) \mid Y]$ when it cannot be done in closed-form is one of the newer versions of the EM algorithm, and is often called the Monte Carlo EM algorithm. We now give some illustrative examples.
Example 20.8 (Poisson with Missing Values). Suppose for some fixed $n \ge 1$, complete data $X_1, \ldots, X_n$ are iid Poi($\lambda$), but the data value is actually reported only if $X_i \ge 2$. This sort of missing data can occur if, for example, the $X_i$ are supposed to be counts of minor accidents per week in $n$ locations, but the values do not get reported if there are too few incidents. If the number of recorded values is $m \le n$, then denoting the recorded values as $Y_1, \ldots, Y_m$, the number of unreported zero values as $Z_0$, and the number of unreported values that equal one as $Z_1$, the complete data $X$ can be represented as $(Y_1, \ldots, Y_m, m, Z_0, Z_1)$; the reported values $Y_1, \ldots, Y_m$ are iid from the conditional distribution of a Poisson variable with mean $\lambda$ given that the variable is larger than 1. Therefore, writing $S_y = \sum_{i=1}^m y_i$, the likelihood based on the complete data is

$$l(\lambda; X) = \frac{e^{-m\lambda}\,\lambda^{S_y}}{\prod_{i=1}^m \left[ y_i!\,(1 - e^{-\lambda} - \lambda e^{-\lambda}) \right]}\,\left(1 - e^{-\lambda} - \lambda e^{-\lambda}\right)^m \left(e^{-\lambda}\right)^{z_0} \left(\lambda e^{-\lambda}\right)^{z_1}$$
$$= \frac{e^{-n\lambda}\,\lambda^{S_y + z_1}}{\prod_{i=1}^m y_i!}$$
$$\Rightarrow L(\lambda; X) = \log l(\lambda; X) = -n\lambda + (S_y + z_1)\log\lambda - \sum_{i=1}^m \log y_i!.$$

Therefore,

$$E_{\lambda_k}\left[L(\lambda; X) \mid (Y_1, \ldots, Y_m, m)\right]$$
$$= -n\lambda + S_y \log\lambda + \log\lambda \; E_{\lambda_k}\left[Z_1 \mid (Y_1, \ldots, Y_m, m)\right] - \sum_{i=1}^m \log y_i!$$
$$= -n\lambda + S_y \log\lambda + (\log\lambda)(n - m)\frac{\lambda_k}{1 + \lambda_k} - \sum_{i=1}^m \log y_i!,$$

because $Z_1 \mid (Y_1, \ldots, Y_m, m) \sim \mathrm{Bin}\left(n - m, \frac{\lambda_k}{1 + \lambda_k}\right)$. This is the E-step of the problem. For the M-step, we have to maximize over $\lambda > 0$ the function

$$-n\lambda + \log\lambda \left[ S_y + (n - m)\frac{\lambda_k}{1 + \lambda_k} \right].$$

By an easy calculus argument, this is maximized at

$$\lambda = \frac{S_y + (n - m)\frac{\lambda_k}{1 + \lambda_k}}{n}.$$

This takes the position of $\lambda_{k+1}$, and the process is iterated until apparent convergence.
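The whole iteration is thus one line of update, which plugs directly into the generic `em` driver sketched earlier; the recorded counts and the total $n$ below are hypothetical, used only to show the call.

    # lambda_{k+1} = (S_y + (n - m) * lam_k / (1 + lam_k)) / n
    y = [3, 2, 5, 2, 4]            # hypothetical recorded counts (each >= 2)
    n = 12                         # hypothetical total number of locations
    m, Sy = len(y), sum(y)
    lam_hat = float(em(1.0, lambda lam: (Sy + (n - m) * lam / (1 + lam)) / n))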
Example 20.9 (Bivariate Normal with Missing Coordinates). Suppose for some $n \ge 1$, complete data are iid bivariate normal vectors $(X_{1j}, X_{2j})$, $j = 1, 2, \ldots, n$, $\sim N_2(\mu, \Sigma)$. However, for $n_1$ of the $n$ units, only the $X_1$ coordinate is available, and for another $n_2$ distinct units, only the $X_2$ coordinate is available. For the rest of the $m = n - n_1 - n_2$ units, the data on both coordinates are available. We can therefore write the complete data in the canonical form

$$X = \left( Y_1, \ldots, Y_{n_1}, Y_{n_1+1}, \ldots, Y_{n_1+n_2}, (Y_{11}, Y_{21}), \ldots, (Y_{1m}, Y_{2m}), Z_1, \ldots, Z_{n_1}, Z_{n_1+1}, \ldots, Z_{n_1+n_2} \right),$$

where

$$(Y_i, Z_i),\; 1 \le i \le n_1; \quad (Z_i, Y_i),\; n_1 + 1 \le i \le n_1 + n_2; \quad (Y_{1j}, Y_{2j}),\; 1 \le j \le m \;\; \stackrel{iid}{\sim} N_2(\mu, \Sigma).$$

As usual, the notation $Z$ is supposed to stand for the missing data. The parameter vector $\theta$ is $\theta = (\mu, \Sigma) = (\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22})$. The corresponding values in the $k$th iteration of the algorithm are denoted as $\theta_k = (\mu_1^{(k)}, \mu_2^{(k)}, \sigma_{11}^{(k)}, \sigma_{12}^{(k)}, \sigma_{22}^{(k)})$. We also use the following notation in the rest of this example, because these quantities naturally arise in the calculations of the E-step:

$$\Sigma^{-1} = R = \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix};$$

$$\rho^{(k)} = \frac{\sigma_{12}^{(k)}}{\sqrt{\sigma_{11}^{(k)}\,\sigma_{22}^{(k)}}}; \quad \alpha_i^{(k)} = \mu_2^{(k)} + \frac{\sigma_{12}^{(k)}}{\sigma_{11}^{(k)}}\left(y_i - \mu_1^{(k)}\right); \quad \beta_i^{(k)} = \mu_1^{(k)} + \frac{\sigma_{12}^{(k)}}{\sigma_{22}^{(k)}}\left(y_i - \mu_2^{(k)}\right);$$
$$v_1^{(k)} = \sigma_{11}^{(k)}\left(1 - \left[\rho^{(k)}\right]^2\right); \quad v_2^{(k)} = \sigma_{22}^{(k)}\left(1 - \left[\rho^{(k)}\right]^2\right).$$

The complete data likelihood function is

$$l(\theta; X) = \frac{1}{|\Sigma|^{n/2}} \exp\Bigg[ -\frac{1}{2} \sum_{i=1}^{n_1} (y_i - \mu_1,\, z_i - \mu_2)\,\Sigma^{-1}\binom{y_i - \mu_1}{z_i - \mu_2} - \frac{1}{2} \sum_{i=n_1+1}^{n_1+n_2} (z_i - \mu_1,\, y_i - \mu_2)\,\Sigma^{-1}\binom{z_i - \mu_1}{y_i - \mu_2} - \frac{1}{2} \sum_{j=1}^{m} (y_{1j} - \mu_1,\, y_{2j} - \mu_2)\,\Sigma^{-1}\binom{y_{1j} - \mu_1}{y_{2j} - \mu_2} \Bigg].$$

Therefore, $L(\theta; X) = \log l(\theta; X)$

$$= \frac{n}{2}\log|R| - \frac{1}{2}\Bigg[ \sum_{i=1}^{n_1} \left\{ (y_i - \mu_1)^2 r_{11} + (z_i - \mu_2)^2 r_{22} + 2(y_i - \mu_1)(z_i - \mu_2) r_{12} \right\}$$
$$+ \sum_{i=n_1+1}^{n_1+n_2} \left\{ (z_i - \mu_1)^2 r_{11} + (y_i - \mu_2)^2 r_{22} + 2(z_i - \mu_1)(y_i - \mu_2) r_{12} \right\}$$
$$+ \sum_{j=1}^{m} \left\{ (y_{1j} - \mu_1)^2 r_{11} + (y_{2j} - \mu_2)^2 r_{22} + 2(y_{1j} - \mu_1)(y_{2j} - \mu_2) r_{12} \right\} \Bigg].$$

To complete the E-step, we now need to evaluate the following conditional expectations:

$$E_{\theta_k}\left[(Z_i - \mu_2)^2 \mid y_i\right], \; E_{\theta_k}\left[Z_i - \mu_2 \mid y_i\right], \quad 1 \le i \le n_1;$$
$$E_{\theta_k}\left[(Z_i - \mu_1)^2 \mid y_i\right], \; E_{\theta_k}\left[Z_i - \mu_1 \mid y_i\right], \quad n_1 + 1 \le i \le n_1 + n_2.$$

These follow from standard bivariate normal conditional expectation formulas (see Chapter 3). Indeed, in our notation introduced above, for $1 \le i \le n_1$,

$$E_{\theta_k}\left[Z_i - \mu_2 \mid y_i\right] = \alpha_i^{(k)} - \mu_2, \qquad E_{\theta_k}\left[(Z_i - \mu_2)^2 \mid y_i\right] = \left(\alpha_i^{(k)} - \mu_2\right)^2 + v_2^{(k)};$$

and for $n_1 + 1 \le i \le n_1 + n_2$,

$$E_{\theta_k}\left[Z_i - \mu_1 \mid y_i\right] = \beta_i^{(k)} - \mu_1, \qquad E_{\theta_k}\left[(Z_i - \mu_1)^2 \mid y_i\right] = \left(\beta_i^{(k)} - \mu_1\right)^2 + v_1^{(k)}.$$

Plugging these in, $E_{\theta_k}[L(\theta; X) \mid Y]$
$$= \frac{n}{2}\log|R| - \frac{1}{2}\Bigg[ \sum_{i=1}^{n_1} \left\{ (y_i - \mu_1)^2 r_{11} + \left[\left(\alpha_i^{(k)} - \mu_2\right)^2 + v_2^{(k)}\right] r_{22} + 2(y_i - \mu_1)\left(\alpha_i^{(k)} - \mu_2\right) r_{12} \right\}$$
$$+ \sum_{i=n_1+1}^{n_1+n_2} \left\{ \left[\left(\beta_i^{(k)} - \mu_1\right)^2 + v_1^{(k)}\right] r_{11} + (y_i - \mu_2)^2 r_{22} + 2(y_i - \mu_2)\left(\beta_i^{(k)} - \mu_1\right) r_{12} \right\}$$
$$+ \sum_{j=1}^{m} \left\{ (y_{1j} - \mu_1)^2 r_{11} + (y_{2j} - \mu_2)^2 r_{22} + 2(y_{1j} - \mu_1)(y_{2j} - \mu_2) r_{12} \right\} \Bigg].$$

So the E-step can indeed be done in closed-form. The M-step now requires maximization of this expression over $(\mu_1, \mu_2, r_{11}, r_{12}, r_{22})$. Although the expression above for $E_{\theta_k}[L(\theta; X) \mid Y]$ is long, on inspection we can see that it has the same structure as the logarithm of the density of a general bivariate normal distribution. Therefore, even the M-step can be done in closed-form by using standard formulas for maximum likelihood estimates of the mean vector and the covariance matrix in the general multivariate normal case. Alternatively, the M-step can be done from first principles by simply taking the partial derivatives of the expression above with respect to $\mu_1, \mu_2, r_{11}, r_{12}, r_{22}$ and solving the five equations obtained from setting these partial derivatives equal to zero. We do not show that calculation here.

Example 20.10 (EM in Estimating ABO Allele Proportions). The ABO blood clas-
sification system is perhaps the most clinically important blood typing system for
humans. All humans can be classified into one of the four phenotypes A, B, AB,
and O. ABO blood typing is essential before blood transfusions, because infusion
of an incompatible blood type has fatal consequences. In fact, it was these observed
fatalities during blood transfusions that led to the discovery of the ABO blood types.
The specific blood type is governed by a single gene with three alleles, which
are also usually denoted as A, B, and O. Each individual receives one of these three
alleles from the father, and one from the mother. Alleles A and B dominate over
allele O. Thus, individuals who have one A allele and one O allele will show as phe-
notype A, and so on, although the true genotype is AO. Because A and B dominate
over allele O, an individual can have phenotype O only if she or he receives an O
allele from each parent.
EM is a natural tool for estimating the allele frequencies, that is, the respective
proportions of A, B, and O alleles among all individuals in a sampling population.
We think of the EM algorithm naturally, because although there are only four phe-
notypes A, B, AB, and O, there are six genotypes, AA, AO, BB, BO, AB, and OO.
Because of the dominance property of the A and the B alleles, we cannot phenotyp-
ically distinguish between AA and AO, or between BB and BO. So, we have some
missing data, and EM fits in very naturally.

The mathematical formulation needs some notation. The parameter vector is simply the vector of the three allele proportions in the population, namely $\theta = (p_A, p_B, p_O)$. The complete data would correspond to the six genotypical frequencies in a sample of $n$ individuals, namely $X_{AA}, X_{AO}, \ldots, X_{OO}$. To reduce clutter, we call them $X_1, X_2, \ldots, X_6$. The observed data consist of the phenotypical frequencies $Y_A, Y_B, Y_{AB}, Y_O$; once again, to reduce clutter, we call them $Y_1, Y_2, Y_3, Y_4$. Note that $p_A + p_B + p_O = 1$, and also $\sum_{i=1}^6 X_i = \sum_{i=1}^4 Y_i = n$. Also note the important relationships $X_1 + X_2 = Y_1$, $X_3 + X_4 = Y_2$, $X_5 = Y_3$, and $X_6 = Y_4$. From the usual formula for a multinomial pmf, the complete data likelihood function is

$$l(\theta; X) = c\,(p_A)^{2X_1} (2p_A p_O)^{X_2} (p_B)^{2X_3} (2p_B p_O)^{X_4} (2p_A p_B)^{X_5} (p_O)^{2X_6},$$

where $c$ is a constant not involving $\theta$. Therefore,

$$L(\theta; X) = \log l(\theta; X) = \text{constant} + (2X_1 + X_2 + X_5)\log p_A + (2X_3 + X_4 + X_5)\log p_B + (2X_6 + X_2 + X_4)\log p_O$$
$$= \text{constant} + X_1(\log p_A - \log p_O) + X_3(\log p_B - \log p_O) + Y_1(\log p_A + \log p_O) + Y_2(\log p_B + \log p_O) + Y_3(\log p_A + \log p_B) + 2Y_4 \log p_O.$$

Because

$$\hat X_{1,k} := E_{\theta_k}(X_1 \mid Y) = Y_1\,\frac{p_{A,k}^2}{p_{A,k}^2 + 2p_{A,k}\,p_{O,k}}, \qquad \hat X_{3,k} := E_{\theta_k}(X_3 \mid Y) = Y_2\,\frac{p_{B,k}^2}{p_{B,k}^2 + 2p_{B,k}\,p_{O,k}},$$

we get

$$E_{\theta_k}[L(\theta; X) \mid Y] = \hat X_{1,k}(\log p_A - \log p_O) + \hat X_{3,k}(\log p_B - \log p_O) + Y_1(\log p_A + \log p_O) + Y_2(\log p_B + \log p_O) + Y_3(\log p_A + \log p_B) + 2Y_4 \log p_O.$$

This finishes the E-step, and once again we are fortunate that we can do it in closed-form.

For the M-step, we have to maximize this with respect to $(p_A, p_B, p_O)$ over the simplex

$$S = \{(p_A, p_B, p_O) : p_A, p_B, p_O \ge 0,\; p_A + p_B + p_O = 1\}.$$

Standard calculus methods using Lagrange multipliers lead to the closed-form maximizer

$$p_A = \frac{Y_1 + Y_3 + \hat X_{1,k}}{2n}; \qquad p_B = \frac{Y_2 + Y_3 + \hat X_{3,k}}{2n}; \qquad p_O = \frac{Y_1 + Y_2 + 2Y_4 - \hat X_{1,k} - \hat X_{3,k}}{2n}.$$

These serve as the values of $\theta_{k+1}$, the next iterate.
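A minimal Python sketch of the full ABO iteration (the phenotype counts are hypothetical and the uniform starting point is an arbitrary choice):

    def abo_em(yA, yB, yAB, yO, iters=200):
        # EM for the ABO allele proportions; returns (pA, pB, pO).
        n = yA + yB + yAB + yO
        pA = pB = pO = 1 / 3                          # arbitrary starting point
        for _ in range(iters):
            x1 = yA * pA**2 / (pA**2 + 2 * pA * pO)   # E[X1 | Y]
            x3 = yB * pB**2 / (pB**2 + 2 * pB * pO)   # E[X3 | Y]
            pA = (yA + yAB + x1) / (2 * n)
            pB = (yB + yAB + x3) / (2 * n)
            pO = (yA + yB + 2 * yO - x1 - x3) / (2 * n)
        return pA, pB, pO

    print(abo_em(yA=186, yB=38, yAB=13, yO=284))      # hypothetical counts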

20.2.2 Monotone Ascent and Convergence of EM

In its search for the global maximum of the likelihood function, the EM algorithm
has some positive properties and some murky properties. The EM algorithm does
not behave erratically. In fact, in each iteration the EM algorithm produces a value of
the likelihood function that is at least as large as the value at the previous iteration.
This is known as monotonicity of the EM algorithm. This property is mathemati-
cally demonstrable, and is proved below. The ideal goal of the EM algorithm is to
ultimately arrive at or very very close to the global maximum value of the likeli-
hood function, and the MLE. In this, the EM algorithm has mixed success. There
are no all-at-one-time theorems which show that iterates of the EM algorithm are
guaranteed to lead eventually to the correct global maximum. Indeed, there cannot
be such a theorem, because there are widely available counterexamples to it. What
is true is that under frequently satisfied conditions, iterates of the EM algorithm will
converge to a point of stationarity of the likelihood function. The starting value $\theta_0$
determines to which stationary point the EM iterates converge. If we are willing
to assume quite a bit more structure, such as that of a multiparameter exponential
family, or a strongly unimodal likelihood function, then convergence to a global
maximum can be assured. However, the EM algorithm has the reputation of con-
verging very slowly to the global maximum, if it does at all. Although it has the
monotonicity property, the ascent to the peak value can be slow. The main reference
for this topic is Wu (1983). The work is nicely summarized in several other places,
specifically McLachlan and Krishnan (2008). The main facts are described with a
classic example below.
Theorem 20.7. Let $l(\theta; y)$ denote the likelihood function on the basis of the observed data $Y = y$, and let $\theta_k$, $k \ge 0$, be the sequence of EM iterates. Then $l(\theta_{k+1}; y) \ge l(\theta_k; y)$ for all $k \ge 0$.

Proof. We recall the notation. The complete data $X = (Y, Z)$, where $Y$ are the actually observed data. The joint density of $(Y, Z)$ under $\theta$ is $f_\theta(y, z)$, and the marginal density of $Y$ under $\theta$ is $g_\theta(y)$. Thus, $l(\theta; y) = g_\theta(y)$ and $L(\theta; y) = \log l(\theta; y) = \log g_\theta(y)$. We also need the Kullback–Leibler distance inequality $E_p(\log p) \ge E_p(\log q)$, where $p, q$ are two densities on some common space (see Chapter 15).

The key to the theorem is to show that $\hat L_k(\theta; y) - L(\theta; y)$ is maximized at $\theta = \theta_k$. For, if we can show this, then we will have

$$\hat L_k(\theta_{k+1}; y) - L(\theta_{k+1}; y) \le \hat L_k(\theta_k; y) - L(\theta_k; y) \le \hat L_k(\theta_{k+1}; y) - L(\theta_k; y)$$

(because the function $\hat L_k(\theta; y)$ is maximized at $\theta = \theta_{k+1}$)

$$\Rightarrow L(\theta_{k+1}; y) \ge L(\theta_k; y),$$

which is the claim of the theorem. To prove that $\hat L_k(\theta; y) - L(\theta; y)$ is maximized at $\theta = \theta_k$, we observe that

$$\hat L_k(\theta; y) - L(\theta; y) = E_{\theta_k}\left(\log f_\theta(Y, Z) \mid Y = y\right) - \log g_\theta(y)$$
$$= E_{Z \mid Y = y,\, \theta_k}\left(\log f_\theta(y, Z)\right) - \log g_\theta(y)$$
$$= E_{Z \mid Y = y,\, \theta_k}\left( \log \frac{f_\theta(y, Z)}{g_\theta(y)} \right)$$
$$\le E_{Z \mid Y = y,\, \theta_k}\left( \log \frac{f_{\theta_k}(y, Z)}{g_{\theta_k}(y)} \right)$$

(because if we identify $\frac{f_\theta(y, z)}{g_\theta(y)}$ as $q$, and $\frac{f_{\theta_k}(y, z)}{g_{\theta_k}(y)}$ as $p$, these being the conditional densities of $Z$ given $Y = y$ under $\theta$ and $\theta_k$, then the Kullback–Leibler distance inequality $E_p(\log p) \ge E_p(\log q)$ is exactly the inequality in the last line)

$$= \hat L_k(\theta_k; y) - L(\theta_k; y).$$

But that precisely means that $\hat L_k(\theta; y) - L(\theta; y)$ is maximized at $\theta = \theta_k$, and so this proves the theorem.
The above theorem implies that if $L(\theta; y) = \log g_\theta(y)$ is bounded in $\theta$ for each fixed $y$, then the sequence of iterates $L(\theta_k; y)$ has a limit, say $L^*$. So, under the boundedness condition, the chain of EM values of the log-likelihood function ultimately converges to something. The question is: to what? We would like $L^*$ to be the global maximum of $L(\theta; y)$, if such a global maximum exists. Unfortunately, this cannot be assured in general. The following example is taken from Wu (1983), who cites Murray (1977).
Example 20.11 (Failure of the EM Algorithm). Suppose the complete data consist of $n = 12$ iid observations from a bivariate normal distribution with means known to be zero, and the other three parameters $\sigma_1^2, \sigma_2^2, \rho$ unknown. The observed data correspond to our Example 20.9: for some sampling units, data on one coordinate are missing. The observed data are:

$$(1, 1),\, (1, -1),\, (-1, 1),\, (-1, -1),\, (2, *),\, (2, *),\, (-2, *),\, (-2, *),\, (*, 2),\, (*, 2),\, (*, -2),\, (*, -2).$$

The data have been deliberately so constructed as to produce a log-likelihood function $L(\theta; y)$ with several local maxima, and two global maxima. The global maxima are $\sigma_1^2 = \sigma_2^2 = \frac{8}{3}$, $\rho = \pm\frac{1}{2}$. There are other stationary points of the log-likelihood function $L(\theta; y)$, one of which has $\sigma_1^2 = \sigma_2^2 = \frac{5}{2}$, $\rho = 0$. Actually, this point is moreover a saddle point of $L(\theta; y)$; that is, the Hessian matrix of the function $L(\theta; y)$ is indefinite at this particular stationary point. So the point is not a local maximum, or a local minimum. Coming to what the EM algorithm does in this case, if the initial choice $\theta_0$ has $\rho_0 = 0$, then for all $k \ge 1$, $\rho_k = 0$, and the sequence of EM iterates converges to the saddle point given above. The problem with the application of the EM algorithm in this example is exactly the fact that once the EM reaches $\rho = 0$ at any iteration, it fails to move out of there at any subsequent iteration. It is indefinitely trapped at $\rho = 0$, and can only maximize $L(\theta; y)$ over the submanifold $\{(\sigma_1^2, \sigma_2^2, \rho) : \rho = 0\}$. On that submanifold, the saddle point is the unique maximum; but it is not a global, or even a local, maximum.
Wu (1983) gives the following theorem on the convergence of the EM algorithm, and this theorem is essentially the best possible that can be said.

Theorem 20.8. Define the map

$$\hat L(\theta; \eta) = E_\eta\left(L(\theta; X) \mid Y\right).$$

Assume that
(a) $\Theta$ is a subset of some Euclidean space $\mathbb{R}^d$.
(b) For any $y$, $L(\theta; y)$ is continuous on $\Theta$ and once partially differentiable with respect to each coordinate of $\theta$ in the interior of $\Theta$.
(c) $L(\theta_0; y) > -\infty$.
(d) The sets $\{\theta : L(\theta; y) \ge c\}$ are compact for all real $c$.
(e) $\hat L(\theta; \eta)$ is jointly continuous on $\Theta \otimes \Theta$.

Then,
(i) The sequence of iterates $\{L(\theta_k; y)\}$ converges to $L(\theta^*; y)$ for some stationary point $\theta^*$ of $L(\theta; y)$.
(ii) Any convergent subsequence of the iterates $\{\theta_k\}$ converges to some stationary point of $L(\theta; y)$.

Suppose, in addition, we assume that
(f) $L(\theta; y)$ is concave on $\Theta$ with a unique stationary point $\hat\theta$.
(g) The gradient vector $\nabla \hat L(\theta; \eta)$ is jointly continuous on $\Theta \otimes \Theta$.

Then the sequence of iterates $\{\theta_k\}$ has only one limit point, and the limit coincides with $\hat\theta$, which is the unique MLE of $\theta$.

Wu (1983) uses general theorems on limit points of iterated point-to-set maps for proving the above theorem.

20.2.3 * Modifications of EM

The basic EM algorithm may not be exactly applicable, or may be inefficient due to slow convergence, in some important problems. The basic EM algorithm rests on two assumptions: that the E-step can be done in closed-form, and that the M-step can also be done in closed-form. Of these, failure of the second assumption may not be
computationally too damaging in low dimensions, because numerical maximization
algorithms on the complete data likelihood may be easily implementable. For ex-
ample, numerical maximizations that use Newton–Raphson methods in the M-step
are known as the EM gradient algorithm; see McLachlan and Krishnan (2008) and
Lange (1999).
Failure to perform the E-step in closed-form almost certainly necessitates Monte Carlo evaluation of the expectation $E_{\theta_k}(L(\theta; X) \mid Y)$. This is called the Monte Carlo EM algorithm. However, this has to be done for a fine grid of $\theta$ values, because the subsequent M-step requires the full function $\hat L_k(\theta; y)$. This may be time consuming, depending on $d$, the dimension of $\theta$. It is also very important to note that
substitution of Monte Carlo for analytic calculations in the E-step calls for simula-
tion from the conditional distribution of Z (the missing data) given Y (the observed
data). This can be cumbersome, and even very difficult. In such a case, the Gibbs
sampler may be useful, because the Gibbs sampler is specially designed for this (see
Chapter 19). An unfortunate consequence of using Monte Carlo to accomplish the
E-step is that the monotone ascent property germane to the basic EM algorithm is
now usually lost. These ideas are described in McLachlan and Krishnan (2008), Wei
and Tanner (1990), Chan and Ledolter (1995), and Levine and Casella (2001).
The EM is an optimization scheme for finding the maximum value of a function. As such, the idea can also be applied, verbatim, to approximate the maximum of a posterior density, that is, to approximate a posterior mode. This is known as Bayesian EM. The posterior density of the parameter based on the complete data likelihood is $\pi(\theta \mid X) = c\,l(\theta; X)\pi(\theta)$, $\pi(\cdot)$ being the prior density. Consequently, the E-step has only the trivial modification that we now have the extra term $\log \pi(\theta)$ added to the usual term $\hat L_k(\theta; y)$. The M-step should not be much more complex, unless the prior density is multimodal or somehow badly behaved. To an error of $O(n^{-1})$, the approximation to the mode will also provide an approximation to the mean of the posterior under enough regularity conditions; see DasGupta (2008) and Bickel and Doksum (2006) for such Bayesian asymptotia.
Accumulated user experience shows that the monotone ascent of the EM algo-
rithm is often very slow. Certain modifications of the basic EM algorithm have been
suggested to enhance the speed of practical convergence. Typically, these blend
time-tested purely numerical analysis tools, such as one- or two-term Taylor expansions, with the EM algorithm itself. Collectively, these schemes are known as
accelerated EM algorithms. Once again, accelerated EM algorithms do not have
the monotone ascent property, and are also more difficult to code. The methods are
described in detail with many references in McLachlan and Krishnan (2008).

20.3 Kernels and Classification

Smoothing noisy data or a rough function in order to extract the main features out of the noisy data is a long-standing and time-tested principle in quantitative science. For example, consider the CDF of a Poisson random variable $Y \sim \mathrm{Poi}(5)$. The CDF of $Y$ of course is not smooth; it is not even continuous. Suppose now we add to $Y$ a small independent Gaussian random variable $Z \sim N(0, .01^2)$. The sum $X = Y + Z$ has a continuous distribution with a density, and the CDF of $X$ is not only continuous, but even infinitely differentiable. In this example, we used convolution to smooth a nonsmooth CDF. Convolution is a special case of kernel smoothing, a particular type of smoothing that has found wide applications in statistics, machine learning, and approximation theory. This section provides a basic treatment of the theory and applications of kernels, with examples.

Definition 20.4. A function $K : \mathbb{R}^d \to \mathbb{R}$ is called a kernel if $\int_{\mathbb{R}^d} |K(x)|\,dx < \infty$ and $\int_{\mathbb{R}^d} K(x)\,dx = 1$.

In applications, we often take $K \ge 0$, in which case a kernel is just a probability density function on $\mathbb{R}^d$. Moreover, we also often take $K$ to be symmetric, in the sense that $K(-x) = K(x)$ for all $x \in \mathbb{R}^d$.

20.3.1 Smoothing by Kernels

Kernels are often used as a smoothing device via the operation of convolution, as defined below.

Definition 20.5. Let $f : \mathbb{R}^d \to \mathbb{R}$ be an $L_1$ function: $\int_{\mathbb{R}^d} |f(x)|\,dx < \infty$. Let $K$ be any kernel on $\mathbb{R}^d$. The convolution of $f$ and $K$ is defined as

$$(f * K)(x) = \int_{\mathbb{R}^d} K(x - y)\,f(y)\,dy.$$

Convolutions with kernels generally have two important properties that lead to the wide acceptance of the principle of kernel smoothing:
(a) The convolution $f * K$ will generally have some extra smoothness in comparison to $f$. The exact nature of the extra smoothness will depend on both $f$ and $K$.
(b) If we smooth $f$ by a sequence of increasingly spiky kernels $K_n$, then the sequence of convolutions $f * K_n$ will converge in some meaningful sense, and often in a very strong sense, to $f$.

These two properties together give us what we want: close approximation of a noisy function by a nicer smooth function. The following basic theorem gives a smoothing and an approximation property of convolutions with kernels. For the rest of this chapter we use the following standard notation for $L_p$ norms and $L_p$ spaces of functions:

$$\|f\|_p = \left( \int_{\mathbb{R}^d} |f|^p\,dx \right)^{1/p}, \; 0 < p < \infty; \qquad \|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|;$$
$$L_p(\mathbb{R}^d) = \{ f : \|f\|_p < \infty \}, \quad 0 < p \le \infty.$$

We remark that in fact strictly rigorous definitions of Lp norms and Lp spaces


require measure-theoretic considerations because of difficulties presented by null
sets. We do not mention these again in the subsequent development.
Theorem 20.9. Given a kernel $K$, define $K_n(x) = n^d K(nx)$, $n \ge 1$.
(a) For any $L_1$ function $f$, $f * K_n \in L_1$ for all $n \ge 1$.
(b) $\int_{\mathbb{R}^d} |f * K_n - f|\,dx \to 0$ as $n \to \infty$.
(c) If $K$ is $r$ times differentiable with bounded partial derivatives $\frac{\partial^j}{\partial x_i^j} K$, $i = 1, 2, \ldots, d$, $j = 1, 2, \ldots, r$, then for any $f \in L_1$, $f * K_n$ is $r$ times differentiable for all $n \ge 1$.
(d) If $K$ is simply uniformly bounded, then $f * K_n$ is bounded and uniformly continuous for all $n \ge 1$.
Proof. For part (a), simply note that

$$\int_{\mathbb{R}^d} |(f * K)(x)|\,dx = \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}^d} K(x - y) f(y)\,dy \right| dx \le \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |K(x - y) f(y)|\,dy\,dx$$
$$= \int_{\mathbb{R}^d} \left[ \int_{\mathbb{R}^d} |K(x - y)|\,dx \right] |f(y)|\,dy = \|K\|_1 \|f\|_1 < \infty.$$

For part (b), the key step is to use the fact that $K_n$ is a kernel for any $n \ge 1$, and hence each $K_n$ integrates to 1 on $\mathbb{R}^d$. Therefore, writing $g(z) = \int_{\mathbb{R}^d} |f(x - z) - f(x)|\,dx$, we get

$$\int_{\mathbb{R}^d} |f * K_n - f|\,dx = \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}^d} K_n(x - y)\,[f(y) - f(x)]\,dy \right| dx$$
$$= \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}^d} K_n(z)\,[f(x - z) - f(x)]\,dz \right| dx \le \int_{\mathbb{R}^d} |K_n(z)|\, g(z)\,dz.$$

Break the integral in the last line into two sets, a ball $B(0, \delta)$ of a suitable radius $\delta$ and the complement $\mathbb{R}^d - B(0, \delta)$. The first integral $\int_{B(0,\delta)} |K_n(z)| g(z)\,dz$ becomes small for large $n$, because $g$ is continuous at $z = 0$ and $g(0) = 0$, and because $K_n$ is integrable and indeed $\|K_n\|_1$ is just a fixed constant. The second integral $\int_{\mathbb{R}^d - B(0,\delta)}$ can also be made small for large $n$, because $f$ is integrable, which forces $g$ to be integrable, and because as $n$ gets large, the integrals $\int_{\mathbb{R}^d - B(0,\delta)} |K_n|$ become small. Putting these together, we get the convergence result of part (b).

Part (c) is an easy consequence of the dominated convergence theorem. For example, by the assumed boundedness of the partial derivatives $\frac{\partial}{\partial x_i} K$,

$$\int_{\mathbb{R}^d} \|\nabla K(x - y)\|\,|f(y)|\,dy \le c \int_{\mathbb{R}^d} |f(y)|\,dy < \infty,$$

and therefore, by the dominated convergence theorem, at any $x \in \mathbb{R}^d$, $(f * K)(x)$ is partially differentiable with respect to each coordinate of $x$. The same argument works for any $n \ge 1$, and for the higher-order partial derivatives.

Part (d) is a sophisticated result, and requires use of a deep theorem in analysis called Lusin's theorem. We do not prove part (d) here.

With added smoothness conditions on $f$, we can get convergence of $f * K_n$ to $f$ in some suitable uniform sense. Here is such a result; see pp. 149 and 156 in Cheney and Light (2000).

Theorem 20.10. (a) Suppose $f$ is bounded and continuous on $\mathbb{R}^d$. Then $f * K_n$ converges to $f$ uniformly on any given compact set $C \subseteq \mathbb{R}^d$.
(b) Suppose $f$ is uniformly continuous on $\mathbb{R}^d$ and $K$ has compact support. Then $f * K_n$ converges to $f$ uniformly on all of $\mathbb{R}^d$. In particular, if $f$ is Lipschitz of order $\alpha$ for some $\alpha > 0$, then $f * K_n$ converges to $f$ uniformly on $\mathbb{R}^d$, whenever $K$ has compact support.
20.3.2 Some Common Kernels in Use
The only mathematical requirement of a kernel is that it be integrable, and that $\int_{\mathbb{R}^d} K(x)\,dx$ should equal 1. In practice, we choose our kernels to have several additional properties from the following list.
(a) Nonnegativity. $K(x) \geq 0$ for all $x$.
(b) Isotropic. $K(x) = h(||x||)$ for some function $h: \mathbb{R}^+ \to \mathbb{R}$.
(c) Fourier or Positive Definiteness Property. $K$ is isotropic, and $h(t)$ is the characteristic function of some symmetric random variable $X$; that is, $h(t)$ has the representation $h(t) = \int_{\mathbb{R}} e^{itx}\,dF(x) = \int_{\mathbb{R}} \cos(tx)\,dF(x)$ for some CDF $F$ that has the symmetry property $P_F(X \leq x) = P_F(X \geq -x)$ for all $x$.
(d) Rapid Decay. $K(x)$ converges rapidly to zero as $||x|| \to \infty$. One type of rapid decay is that $\int_{\mathbb{R}^d} \left(|x_1|^{\alpha_1} \cdots |x_d|^{\alpha_d}\right) K(x_1, \ldots, x_d)\,dx_1 \cdots dx_d$ should be finite for suitably large $\alpha_1, \ldots, \alpha_d > 0$.
(e) Compact Support. For some compact set $C$, $K(x) = 0$ if $x \notin C$.
(f) Smoothness. $K(x)$ has sufficient smoothness, for example, that it be continuous, or uniformly continuous, or have some derivatives, or that it belong to a suitable Sobolev space.
The choice of the kernel depends on the nature of the problem for which it will be
used. Kernel methods are widely used in statistical density estimation, in classifi-
cation, in simply smoothing an erratic function, and in various other approximate
reconstruction problems. Good kernels for density estimation need not be good for
classification. The following table lists some common kernels in use in statistics,
computer science, and machine learning.
Kernel Name and Unnormalized $K(x)$:

Uniform: $I_{||x|| \leq a}$
Weierstrass: $(1 - ||x||^2)^k\, I_{||x|| \leq 1},\ k \geq 1$
Epanechnikov: $(1 - ||x||^2)\, I_{||x|| \leq 1}$
Gaussian: $e^{-a^2||x||^2}$
Cauchy: $\dfrac{1}{(1 + a^2||x||^2)^{(d+1)/2}}$
Exponential: $e^{-a||x||}$
Laplace: $e^{-a||x||_1}$
Spherical: $\left(1 - \frac{3}{2}||x|| + \frac{1}{2}||x||^3\right) I_{||x|| \leq 1}$
Polynomial: $\left(\sum_{i=1}^m c_i ||x||^{k_i}\right) I_{||x|| \leq 1}$
Hermite: $(1 - ||x||^2)\, e^{-a^2||x||^2}$
Wave: $\dfrac{\sin(c||x||)}{c||x||}$
Product: $\prod_{i=1}^d (1 - x_i^2)\, I_{|x_i| \leq 1}$
Product: $\prod_{i=1}^d \dfrac{1}{1 + x_i^2}$
Fejér: $\dfrac{1}{k}\left(\dfrac{\sin\frac{kx}{2}}{\sin\frac{x}{2}}\right)^2 \quad (d = 1)$
de la Vallée-Poussin: $\left(\dfrac{\sin x}{x}\right)^2 \quad (d = 1)$
We plot some of these kernels in one and two dimensions in Fig. 20.2. Note how some are quite flat, others spiky, some unimodal, and the Fejér kernel wavy. The choice would depend on exactly what one wants to achieve in a particular problem. Generally, a flatter kernel would lead to more smoothing.
[Fig. 20.2: Plots of selected kernels. Panels: Gaussian and Laplace kernels with $a = 1$; spherical and Hermite kernels with $a = 1$; de la Vallée-Poussin and Fejér kernels with $k = 5$; Weierstrass kernels with $k = 1, 8$ when $d = 2$.]
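As a quick companion to the table, the following sketch (hypothetical code, for illustration only) implements a few of the isotropic kernels in $d = 1$ and computes their normalizing constants numerically, since the table lists the kernels in unnormalized form.

    import numpy as np

    # A few unnormalized kernels from the table, in d = 1 (so ||x|| = |x|).
    def epanechnikov(x):
        return (1 - x**2) * (np.abs(x) <= 1)

    def weierstrass(x, k=8):
        return (1 - x**2)**k * (np.abs(x) <= 1)

    def cauchy(x, a=1.0, d=1):
        return (1 + a**2 * x**2) ** (-(d + 1) / 2)

    grid = np.linspace(-50, 50, 200001)
    dx = grid[1] - grid[0]
    for name, K in [("Epanechnikov", epanechnikov),
                    ("Weierstrass", weierstrass),
                    ("Cauchy", cauchy)]:
        c = np.sum(K(grid)) * dx   # numerical normalizing constant
        print(f"{name}: integral of unnormalized kernel = {c:.4f}")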
20.3.3 Kernel Density Estimation
The basic problem of density estimation is the following. Suppose we have $n$ iid observations $X_1, \ldots, X_n$ from a density $f(x)$ on an Euclidean space $\mathbb{R}^d$. We do not know the density function $f(x)$ and we want to estimate it. In some problems,
a parametric assumption will be fine; for example, when $d = 1$, we may assume that $f$ is a normal density with some mean $\mu$ and some variance $\sigma^2$. Then, we may use the plug-in estimate $N(\bar{X}, s^2)$ as our estimate for $f$. In some other problems, a specific parametric form will be too restrictive. Then, we may dispose of any specific parametric form, and estimate $f$ nonparametrically. Here, nonparametric is supposed to simply mean lack of a parametric functional form. Nonparametric density estimation is one of the most researched and still active areas in statistical theory, and the techniques and theory are highly sophisticated and elegant. A lot of development in statistics has taken place around the themes, methods, and mathematics of density estimation. For example, research in nonparametric regression and nonparametric function estimation has been heavily influenced by the density estimation literature.
Several standard types of density estimators are briefly discussed below.
(a) Plug-In Estimates. Assume $f \in \mathcal{F}_\Theta = \{f_\theta : \theta \in \Theta \subseteq \mathbb{R}^k, k < \infty\}$. Let $\hat{\theta}$ be any reasonable estimator of $\theta$. Then the plug-in estimate of $f$ is $\hat{f} = f_{\hat{\theta}}$.
(b) Histograms. Let the support of $f$ be $[a, b]$ and consider the partition $a = t_0 < t_1 < \cdots < t_m = b$. Let $c_0, \ldots, c_{m-1}$ be constants. The histogram density estimate is
\[
\hat{f}(x) = \begin{cases} c_i, & t_i \leq x < t_{i+1} \\ c_{m-1}, & x = b \\ 0, & x \notin [a, b]. \end{cases}
\]
The restriction $c_i \geq 0$ and $\sum_i c_i (t_{i+1} - t_i) = 1$ makes $\hat{f}$ a true density; that is, $\hat{f} \geq 0$ and $\int \hat{f}(x)\,dx = 1$. The canonical histogram estimator is
\[
\hat{f}_0(x) = \frac{n_i}{n(t_{i+1} - t_i)},\quad t_i \leq x < t_{i+1},
\]
where $n_i = \#\{i : t_i \leq X_i < t_{i+1}\}$. Then $\hat{f}_0$ is, in fact, the nonparametric maximum likelihood estimator of $f$. For a small interval $[t_i, t_{i+1})$ containing a given point $x$,
\[
E_f(n_i) = n P_f(t_i \leq X < t_{i+1}) \approx n(t_{i+1} - t_i) f(x).
\]
Then it follows that
\[
E_f[\hat{f}_0(x)] = E_f\left[\frac{n_i}{n(t_{i+1} - t_i)}\right] \approx f(x).
\]
(c) Series Estimates. Suppose that $f \in L_2(\mathbb{R})$; that is, $\int_{\mathbb{R}} f^2(x)\,dx < \infty$. Let $\{\phi_k : k \geq 0\}$ be a collection of orthonormal basis functions (see Section 20.3.4 for details on orthonormal bases). Then $f$ admits a Fourier expansion
\[
f(x) = \sum_{k=0}^{\infty} c_k \phi_k(x), \quad \text{where } c_k = \int \phi_k(x) f(x)\,dx.
\]
Because $c_k = E_f[\phi_k(X)]$, as an estimate one may use
\[
\hat{f}(x) = \sum_{k=0}^{l_n} \left[\frac{1}{n} \sum_{i=1}^n \phi_k(X_i)\right] \phi_k(x),
\]
for a suitably large cutoff $l_n$. This is the orthonormal series estimator.
(d) Kernel Estimates. This is a natural and massive extension of the idea of histograms. Let $K(z) \geq 0$ be a general nonnegative function, integrating to 1, and let
\[
\hat{f}_K(x) = \frac{1}{n h_n} \sum_{i=1}^n K\left(\frac{x - X_i}{h_n}\right).
\]
The function $K(\cdot)$ is a kernel and $\hat{f}_K$ is called the kernel density estimator. The scaling factor $h = h_n$ is called the bandwidth; it must be suitably small so the density estimate does not suffer from large biases. Suppose the kernel function is a Gaussian kernel. In such a case, $\hat{f}_K$ is a mixture of $N(X_i, h_n^2)$ densities. Thus, the kernel density estimate will take normal densities with small widths centered at the data values and then blend them together. This fundamental and seminal idea was initiated in Rosenblatt (1956). Kernel density estimates are by far the most popular nonparametric density estimates. They generally provide the best rates of convergence, as well as a great deal of flexibility through the choice of the kernel. However, in an asymptotic sense, the choice of the kernel is of less importance than the choice of the bandwidth.
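The estimator is simple to implement; here is a minimal sketch (hypothetical code, not from the text) of $\hat{f}_K$ with a Gaussian kernel, where the sample and the bandwidths are illustrative choices.

    import numpy as np

    def kde(x, data, h):
        # f_K(x) = (1/(n h)) sum_i K((x - X_i)/h), K the standard Gaussian kernel
        z = (x[:, None] - data[None, :]) / h
        K = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
        return K.sum(axis=1) / (len(data) * h)

    rng = np.random.default_rng(0)
    data = rng.normal(size=100)                 # illustrative N(0,1) sample
    grid = np.linspace(-4, 4, 401)
    for h in [0.1, 0.3, 1.0]:                   # small h: spiky; large h: oversmoothed
        fhat = kde(grid, data, h)
        print(f"h = {h}: estimate at 0 is {fhat[200]:.3f}")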
As in most statistical inference problems, there are two issues: systematic error (in the form of bias) and random error (in the form of variance). To include both of these aspects, we consider the mean squared error (MSE). If $\hat{f}_n$ is some estimate of the unknown density $f$, we want to consider
\[
\text{MSE}[\hat{f}_n] \equiv E[\hat{f}_n(x) - f(x)]^2 = \text{Var}[\hat{f}_n(x)] + \left(E[\hat{f}_n(x)] - f(x)\right)^2.
\]
As usual, there is a bias–variance trade-off. The standard density estimates require specification of a suitable tuning parameter. For instance, for kernel density estimates, the tuning parameter would be the bandwidth $h$. The tuning parameter is optimally chosen after weighing in the bias–variance trade-off.
To study the performance of kernel estimates, we consider both the variance and the bias, starting with the bias. Under various sets of assumptions on $f$ and $K$, asymptotic unbiasedness indeed holds. One such result is given below; see Rosenblatt (1956).
Theorem 20.11. Assume that $K$ is nonnegative and integrable. Also assume that $f$ is uniformly bounded by some $M < \infty$. Suppose $h = h_n \to 0$ as $n \to \infty$. Then, for any $x$ that is a continuity point of $f$,
\[
E[\hat{f}_n(x)] \to f(x) \int K(z)\,dz, \quad \text{as } n \to \infty.
\]
Proof. Since the $X_i$ are iid and $\hat{f}_n(x) = \frac{1}{nh} \sum K\left(\frac{x - X_i}{h}\right)$, we get
\[
E[\hat{f}_n(x)] = \frac{1}{h} \int_{-\infty}^{\infty} K\left(\frac{x - z}{h}\right) f(z)\,dz = \int_{-\infty}^{\infty} K(z) f(x - hz)\,dz.
\]
But $f$ is continuous at $x$ and uniformly bounded; so $K(z) f(x - hz) \leq M K(z)$ and $f(x - hz) \to f(x)$ as $h \to 0$. Because $K(\cdot) \in L_1$ (i.e., $\int |K(z)|\,dz < \infty$), we can apply the dominated convergence theorem to interchange the limit and the integral. That is, if we let $h = h_n$ and $h_n \to 0$ as $n \to \infty$, then
\[
\lim_{n \to \infty} E[\hat{f}_n(x)] = \lim_{n \to \infty} \int K(z) f(x - hz)\,dz = \int K(z) f(x)\,dz = f(x) \int K(z)\,dz.
\]
In particular, if $\int K(z)\,dz = 1$, then $\hat{f}_n(x)$ is asymptotically unbiased at all continuity points of $f$, provided that $h = h_n \to 0$.
Next, we consider the variance of $\hat{f}_n$. Consistency of the kernel estimate does not follow from asymptotic unbiasedness alone; we need something more. To get a stronger result, we need to assume more than simply $h_n \to 0$. Loosely stated, we want to drive the variance of $\hat{f}_n$ to zero, and for this we must have more than $h_n \to 0$. Obviously, because $\hat{f}_n(x)$ is essentially a sample mean,
\[
\text{Var}[\hat{f}_n(x)] = \frac{1}{n} \text{Var}\left[\frac{1}{h} K\left(\frac{x - X}{h}\right)\right],
\]
which implies
\[
nh\, \text{Var}[\hat{f}_n(x)] = h\, \text{Var}\left[\frac{1}{h} K\left(\frac{x - X}{h}\right)\right].
\]
By an application of the dominated convergence theorem, as in the proof above, $h\, E\left[\left(\frac{1}{h} K\left(\frac{x - X}{h}\right)\right)^2\right]$ converges, at continuity points $x$, to $f(x) ||K||_2^2$, which is finite if $K \in L_2$. We already know that for all continuity points $x$, $E[\hat{f}_n(x)]$ converges to $f(x) ||K||_1$; so it follows that
\[
h \left(E[\hat{f}_n(x)]\right)^2 \to 0, \quad h \to 0.
\]
Combining these results, we get, for continuity points $x$,
\[
nh\, \text{Var}[\hat{f}_n(x)] = h\, \text{Var}\left[\frac{1}{h} K\left(\frac{x - X}{h}\right)\right] \to f(x) ||K||_2^2 < \infty,
\]
provided $K \in L_2$. Consequently, if $h \to 0$, $nh \to \infty$, $f \leq M$ and $K \in L_2$, then, at continuity points $x$,
\[
\text{Var}[\hat{f}_n(x)] \to 0.
\]
We summarize the above derivation in the following theorem.
Theorem 20.12. Suppose $f$ is uniformly bounded by $M < \infty$, that $K \in L_2$, and $||K||_1 = 1$. At any continuity point $x$ of $f$,
\[
\hat{f}_n(x) \stackrel{P}{\longrightarrow} f(x), \quad \text{provided } h \to 0 \text{ and } nh \to \infty.
\]
Let $k_r = \frac{1}{r!} \int z^r K(z)\,dz$. By decomposing the MSE into variance and the squared bias, we get
\[
E[\hat{f}_n(x) - f(x)]^2 \approx \frac{f(x)}{nh} ||K||_2^2 + h^{2r} |k_r f^{(r)}(x)|^2,
\]
provided $K \in L_2$ and $f$ has $(r+1)$ continuous derivatives at $x$. Minimizing the right-hand side of the above expansion with respect to $h$, we have the asymptotically optimal local bandwidth as
\[
h_{\text{loc,opt}} = \left[\frac{f(x) ||K||_2^2}{2nr |k_r f^{(r)}(x)|^2}\right]^{\frac{1}{2r+1}},
\]
and on plugging this in, the convergence rate of the MSE is
\[
\text{MSE} \asymp n^{-2r/(2r+1)}.
\]
We often assume that $r = 2$, in which case the pointwise convergence rate is $\text{MSE} = O(n^{-4/5})$, slower than the parametric $n^{-1}$ rate. Note that it is expected that in a nonparametric setup, we cannot get the parametric $n^{-1/2}$ convergence rate. See DasGupta (2008) for the proofs.
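The bias–variance trade-off behind these formulas can be seen in a small simulation (hypothetical code, for illustration only): for an $N(0, 1)$ target, the empirical MSE of $\hat{f}_n(0)$ is smallest at an intermediate bandwidth, with very small and very large $h$ both doing worse.

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps, x0 = 200, 500, 0.0
    true_f = 1 / np.sqrt(2 * np.pi)          # N(0,1) density at x0 = 0

    def kde_at(x0, data, h):
        z = (x0 - data) / h
        return np.mean(np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)) / h

    for h in [0.05, 0.2, 0.5, 1.0]:
        est = np.array([kde_at(x0, rng.normal(size=n), h) for _ in range(reps)])
        mse = np.mean((est - true_f) ** 2)   # variance plus squared bias
        print(f"h = {h:4.2f}: empirical MSE at 0 = {mse:.5f}")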
20.3.4 Kernels for Statistical Classification
One of the main applications of kernels is in the problem of classification. Here, based on training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where the $x_i$ are observed values of a $d$-dimensional relevant variable $X$ (often called covariates), and $y_i$ denotes the group membership of the $i$th sampled unit, one needs to classify a future individual with a known $x$ value, but an unknown group membership. That is, the future individual has to be classified as belonging to one of $p$ possible groups, based on the past data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, and the present value $X = x$. This is done by designing a classification function or classification rule $\hat{y} = \delta_n(x_1, \ldots, x_n, y_1, \ldots, y_n, x)$, with $\delta_n$ taking values in the set $\{1, 2, \ldots, p\}$. For example, when there are only two possible groups so that $p = 2$, an intuitive rule is the majority rule, which takes a suitably small neighborhood $U$ of the present value $X = x$, counts how many of the training values $x_1, \ldots, x_n$ that fall inside $U$ correspond to members of group 1, and how many correspond to members of group 2, and assigns the new $x$ value to whichever group has the majority. For example, if we choose our neighborhood $U$ to be a ball of radius $h$ centered at $x$, then the mathematical definition of the majority rule works out to
\[
\delta_n(x_1, \ldots, x_n, y_1, \ldots, y_n, x) = 1 \iff \sum_{i=1}^n I_{||x_i - x|| \leq h,\, y_i = 1} \geq \sum_{i=1}^n I_{||x_i - x|| \leq h,\, y_i = 2}.
\]

A natural extension is a kernel classification rule:


X
n x  x
i
n;K .x1 ; : : : ; xn ; y1 ; : : : ; yn ; x/ D 1 iff Iyi D1 K
h
i D1
X
n x  x
i
 Iyi D2 K ;
h
i D1

where K W Rd ! R is a kernel. We usually choose K to be nonnegative, and


perhaps to have some other properties, as described in our list of properties in the
previous section. The kernel K essentially quantifies the similarity of a new point x
to a data point xi . The more similar x and xi are, the larger will be the numerical
value of K. xihx /. The scaling constant h is usually called a bandwidth, and should
be chosen to be appropriately small. Thus, kernels arise very naturally in classifica-
tion problems. We want to address the question of how good are kernel classification
rules and under what conditions.
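A direct implementation of $\delta_{n,K}$ is only a few lines; the sketch below (hypothetical code, with a Gaussian kernel and simulated two-group training data) is purely for illustration.

    import numpy as np

    def kernel_classify(x, X, y, h):
        # Two-group kernel rule: assign group 1 iff the kernel-weighted
        # vote for group 1 is at least that for group 2 (Gaussian kernel).
        w = np.exp(-np.sum((X - x)**2, axis=1) / (2 * h**2))
        return 1 if w[y == 1].sum() >= w[y == 2].sum() else 2

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
    y = np.repeat([1, 2], 50)
    print(kernel_classify(np.array([-1.0, -1.0]), X, y, h=0.5))  # likely group 1
    print(kernel_classify(np.array([+1.0, +1.0]), X, y, h=0.5))  # likely group 2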
Suppose $F$ denotes the joint distribution of $(X, Y)$, namely the covariate vector and the group membership variable. If we knew $F$, we could try to build classification rules based on our knowledge of $F$. Precisely, suppose $g: \mathbb{R}^d \to \{1, 2\}$ is a classification function that classifies an individual by using just the $X$ value of that individual, and suppose $\alpha(F, g)$ is its error probability
\[
\alpha(F, g) = P_F(g(X) \neq Y).
\]
The oracular error probability is
\[
\alpha(F) = \inf_g \alpha(F, g),
\]
where the infimum is taken over all possible functions $g: \mathbb{R}^d \to \{1, 2\}$. The word oracular is supposed to convey the concept that only a person with oracular access to knowledge of $F$ can find the rule $g_0$ that makes $\alpha(F, g_0) = \alpha(F)$.
On the other hand, we have our real field classification rules $\delta_n(x_1, \ldots, x_n, y_1, \ldots, y_n, x)$, and they have their conditional error probabilities
\[
\alpha_n(F, \delta_n) = P_F\left(\delta_n \neq Y \mid (x_1, \ldots, x_n, y_1, \ldots, y_n)\right).
\]
Note that $\alpha_n(F, \delta_n)$ is a sequence of random variables. We would like $\alpha_n(F, \delta_n)$ to be close to $\alpha(F)$ for large $n$; i.e., with a lot of training data, we would like to perform at the oracular level.
The following theorem (see pp. 150–159 in Devroye, Györfi, and Lugosi (1996)) shows such an oracle inequality.
Theorem 20.13. Suppose $K$ is a nonnegative, bounded, and uniformly continuous kernel. Suppose
\[
h = h_n \to 0; \quad nh^d \to \infty.
\]
Then for any joint distribution $F$ of $(X, Y)$, the kernel classification rule $\delta_{n,K}$ satisfies
\[
P_F\left(\alpha_n(F, \delta_n) - \alpha(F) > \epsilon\right) \leq 2 e^{-cn\epsilon^2},
\]
where $c > 0$ depends only on the kernel $K$ and the dimension $d$, but not on $F$.
An easy consequence (see the Chapter exercises) of this result is that $\alpha_n(F, \delta_n)$ converges almost surely to $\alpha(F)$ whatever be $F$. In the classification literature, this is called strong universal consistency. The result illustrates the advantages of using kernels that have a certain amount of smoothness, and do not take negative values. The Gaussian kernel and several other kernels in our illustrative list satisfy the conditions of the above theorem, and assure oracular performance universally for all $F$.
20.3.4.1 Reproducing Kernel Hilbert Spaces
Reproducing kernels allow us to convert a potentially very difficult optimization problem on some suitable infinite-dimensional function space into an ordinary finite-dimensional optimization problem on an Euclidean space. For example, suppose a set of $n$ data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$ are given to us, and we want to fit a suitable function $f(x)$ to these points. We can find exact fits by using rough or oscillatory functions $f$, such as polynomials of high degrees, or broken line segments. If the fit is an exact fit, then the prediction error $\sum_{i=1}^n (y_i - f(x_i))^2$ will be zero. But the zero error is achieved by using an erratic curve $f(x)$. We could trade an inexact, but reasonable, fit for some smoothness in the form of the function $f(x)$. A mathematical formulation of such a constrained minimization problem is to minimize
\[
\sum_{i=1}^n L(y_i, f(x_i)) + \lambda \Theta(f)
\]
over some specified function space $\mathcal{F}$, where $L(y, f(x))$ is a loss function that measures the goodness of our fit, $\Theta(f)$ is a real-valued functional that measures the roughness of $f$, and $\lambda$ is a tuning constant that reflects the importance that we place on using a smooth function. For example, if we use $\lambda = 0$, then that means that all we care for is a good fit, and smoothness is of no importance to us. The loss function $L$ is typically a function such as $(y - f(x))^2$ or $|y - f(x)|$, but could be more general. The roughness penalty functional $\Theta(f)$ is often something like $\int (f'(x))^2\,dx$, although it too can be more general.
On the face of it, this is an infinite-dimensional optimization problem, because the function space $\mathcal{F}$ would usually be infinite dimensional, unless we make the choice of functions too restrictive. Reproducing kernels allow us to transform such an infinite-dimensional problem into finding a function of the form $\sum_{i=1}^n c_i K(x_i, x)$, where $K(x, x')$ is a kernel, associated in a unique way with the function space $\mathcal{F}$. So, as long as we can identify what this reproducing kernel $K(x, x')$ is, all we have to do to solve our original infinite-dimensional optimization problem is to find the $n$ optimal constants $c_1, \ldots, c_n$. Such a kernel $K(x, x')$ uniquely associated with the function space $\mathcal{F}$ can be found, as long as $\mathcal{F}$ has a nice amount of structure. The structure needed is that of a special kind of Hilbert space. Aronszajn (1950) is the original reference on the theory of reproducing kernels. We first provide a basic treatment of Hilbert spaces themselves; this is essential for studying reproducing kernels. Rudin (1986) is an excellent first exposition on Hilbert spaces.
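To preview how this reduction is used in practice (a sketch under assumptions, not a derivation from the text): when $L$ is squared-error loss and $\Theta(f)$ is taken to be the squared RKHS norm, the minimizer has the form $\sum_{i=1}^n c_i K(x_i, x)$ and the coefficients $c_i$ solve a single $n \times n$ linear system; this is the familiar kernel ridge regression estimator. The code below is hypothetical and illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.uniform(-2, 2, size=30)
    y = np.sin(2 * X) + rng.normal(scale=0.2, size=30)  # noisy illustrative data

    def K(u, v, scale=0.5):                  # Gaussian reproducing kernel
        return np.exp(-(u[:, None] - v[None, :])**2 / (2 * scale**2))

    lam = 0.1                                # tuning constant (lambda)
    G = K(X, X)                              # Gram matrix K(x_i, x_j)
    c = np.linalg.solve(G + lam * np.eye(len(X)), y)   # coefficients c_1..c_n

    x_new = np.linspace(-2, 2, 5)
    f_hat = K(x_new, X) @ c                  # f(x) = sum_i c_i K(x_i, x)
    print(np.round(f_hat, 3))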
Definition 20.6. A real vector space $\mathcal{H}$ is called an inner product space if there is a function $(x, y): \mathcal{H} \otimes \mathcal{H} \to \mathbb{R}$ such that
(a) $(x, x) \geq 0$ for all $x \in \mathcal{H}$, with $(x, x) = 0$ if and only if $x = 0$, the null element of $\mathcal{H}$.
(b) $(a_1 x_1 + a_2 x_2, y) = a_1 (x_1, y) + a_2 (x_2, y)$ for all $x_1, x_2, y \in \mathcal{H}$, and all real numbers $a_1, a_2$.
The function $(x, y)$ is called the inner product of $x$ and $y$, and $\sqrt{(x, x)} = ||x||_{\mathcal{H}}$ the norm of $x$. The function $d(x, y) = ||x - y||_{\mathcal{H}}$ is called the distance between $x$ and $y$.
Inner product spaces are the most geometrically natural generalizations of Euclidean spaces, because we can talk about length, distance, and angle on inner product spaces by considering $||x||_{\mathcal{H}}$, $(x, y)$, and $d(x, y)$.
Definition 20.7. Let $\mathcal{H}$ be an inner product space. The angle between $x$ and $y$ is defined as $\theta = \arccos \frac{(x, y)}{||x||_{\mathcal{H}} ||y||_{\mathcal{H}}}$. $x, y$ are called orthogonal if $(x, y) = 0$, so that $\theta = \frac{\pi}{2}$. For $n$ given vectors $x_1, \ldots, x_n$, the linear span of $x_1, \ldots, x_n$ is defined to be the set of all $z \in \mathcal{H}$ of the form $z = \sum_{i=1}^n c_i x_i$, where $c_1, \ldots, c_n$ are arbitrary real constants. The projection of a given $y$ onto the linear span of $x_1, \ldots, x_n$ is defined to be $P_{x_1, \ldots, x_n} y = \sum_{i=1}^n c_i^* x_i$, where $(c_1^*, \ldots, c_n^*) = \operatorname{argmin}_{c_1, \ldots, c_n} d\left(y, \sum_{i=1}^n c_i x_i\right)$.
Example 20.12. Some elementary examples of inner product spaces are
1. The $n$-dimensional Euclidean space $\mathbb{R}^n$ with $(x, y)$ defined as $(x, y) = x'y = \sum_{i=1}^n x_i y_i$.
2. A more general inner product on $\mathbb{R}^n$ is $(x, y) = x'Ay$, where $A$ is an $n \times n$ symmetric positive definite matrix.
3. Let $\mathcal{H} = L_2[a, b]$ be the class of square integrable functions on a bounded interval $[a, b]$,
\[
L_2[a, b] = \left\{f: [a, b] \to \mathbb{R},\ \int_a^b f^2(x)\,dx < \infty\right\}.
\]
Then $\mathcal{H}$ is an inner product space, with the inner product
\[
(f, g) = \int_a^b f(x) g(x)\,dx.
\]
In this, we identify any two functions $f, g$ such that $f = g$ almost everywhere on $[a, b]$ as being equivalent. Thus, any function $f$ such that $f = 0$ almost everywhere on $[a, b]$ will be called a zero function.
4. Consider again a bounded interval $[a, b]$, but consider $\mathcal{H} = C[a, b]$, the class of all continuous functions on $[a, b]$. This is an inner product space with the same inner product $(f, g) = \int_a^b f(x) g(x)\,dx$.
Some elementary facts about inner product spaces are summarized for reference in the next result.

Proposition. Let $\mathcal{H}$ be an inner product space. Then,
(a) (Joint Continuity of Inner Products) Suppose $x_n, y_n, x, y \in \mathcal{H}$, and $d(x_n, x)$, $d(y_n, y)$ converge to 0 as $n \to \infty$. Then, $(x_n, y_n) \to (x, y)$ as $n \to \infty$.
(b) (Cauchy–Schwarz Inequality) $|(x, y)| \leq ||x||_{\mathcal{H}} ||y||_{\mathcal{H}}$ for all $x, y \in \mathcal{H}$, with equality if and only if $y = cx$ for some constant $c$.
(c) (Triangular Inequality) $||x + y||_{\mathcal{H}} \leq ||x||_{\mathcal{H}} + ||y||_{\mathcal{H}}$ for all $x, y \in \mathcal{H}$.
(d) $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in \mathcal{H}$.
(e) (Pythagorean Identity) If $x, y$ are orthogonal, then $||x + y||_{\mathcal{H}}^2 = ||x||_{\mathcal{H}}^2 + ||y||_{\mathcal{H}}^2$.
(f) (Parallelogram Law) For all $x, y \in \mathcal{H}$, $||x + y||_{\mathcal{H}}^2 + ||x - y||_{\mathcal{H}}^2 = 2\left(||x||_{\mathcal{H}}^2 + ||y||_{\mathcal{H}}^2\right)$.
(g) The projection of $y \in \mathcal{H}$ onto the linear span of some fixed $x \neq 0 \in \mathcal{H}$ equals $P_x y = \frac{(x, y)}{||x||_{\mathcal{H}}^2}\, x$.
A generalization of part (g) is given in Exercise 20.27. Inner product spaces that have the property of completeness are Hilbert spaces. The following example demonstrates what completeness is all about, and why it need not hold for arbitrary inner product spaces.
Example 20.13 (An Incomplete Inner Product Space). Consider $\mathcal{H} = C[0, 1]$ equipped with the inner product $(f, g) = \int_0^1 f(x) g(x)\,dx$. Consider the sequence of functions $f_n \in \mathcal{H}$ defined as follows:
\[
f_n(x) = 0 \text{ for } x \in \left[0, \frac{1}{2}\right],\quad f_n(x) = 1 \text{ for } x \in \left[\frac{1}{2} + \frac{1}{n}, 1\right],
\]
\[
\text{and } f_n(x) = n\left(x - \frac{1}{2}\right) \text{ for } x \in \left(\frac{1}{2}, \frac{1}{2} + \frac{1}{n}\right).
\]
Now choose $m, n \to \infty$, and suppose $m < n$. Then, it is clear that the graphs of $f_m$ and $f_n$ coincide except on $\left[\frac{1}{2}, \frac{1}{2} + \frac{1}{m}\right]$, and it follows easily that $d(f_m, f_n) = \left(\int_0^1 (f_m - f_n)^2\right)^{1/2} \to 0$. This means that the sequence $f_n \in \mathcal{H}$ is a Cauchy sequence. However, the sequence does not have a continuous limit; that is, there is no function $f$, continuous on $[0, 1]$, such that $\int_0^1 (f_n - f)^2 \to 0$. So, here we have an inner product space which is not complete, in the sense that we can have sequences which are Cauchy with respect to the distance induced by the inner product, which nevertheless do not have a limit within the same inner product space. Spaces on which such Cauchy sequences cannot be found form Hilbert spaces.
Definition 20.8. Let $\mathcal{H}$ be an inner product space such that every Cauchy sequence $x_n \in \mathcal{H}$ converges to some $x \in \mathcal{H}$. Then $\mathcal{H}$ is called a Hilbert space with the Hilbert norm $||x||_{\mathcal{H}} = \sqrt{(x, x)}$.
Among the $L_p$ spaces, $L_2[a, b]$ is a Hilbert space; it has an inner product, and the inner product is complete. Besides $L_2$, the other $L_p$ spaces are not Hilbert spaces, because their norms $||f||_{\mathcal{H}} = ||f||_p = \left(\int |f|^p\right)^{1/p}$ are not induced by an inner product. We already saw that $C[a, b]$ is not a Hilbert space, because the inner product is not complete. The real line with the usual inner product, however, is a Hilbert space, because a standard theorem in real analysis says that all Cauchy sequences must converge to some real number. In fact, essentially the same proof shows that all finite-dimensional Euclidean spaces are Hilbert spaces.
In the finite-dimensional Euclidean space $\mathbb{R}^n$, the standard unit vectors $e_k = (0, \ldots, 0, 1, 0, \ldots, 0)$, $k = 1, 2, \ldots, n$ form an orthonormal basis, in the sense that the set $\{e_k\}$ is an orthonormal set of $n$-vectors, and any $x \in \mathbb{R}^n$ may be represented as $x = \sum_{k=1}^n c_k e_k$, where the $c_k$ are real constants. Furthermore, $||x||^2 = \sum_{k=1}^n c_k^2$. Hilbert spaces, in general, are infinite-dimensional, and an orthonormal basis would not be a finite set in general. In spite of this, a representation akin to the finite-dimensional $\mathbb{R}^n$ case exists, but there are a few subtle elements of difference. The exact result is the following.
Proposition (Orthonormal Bases and Parseval's Identity). Let $\mathcal{H}$ be a Hilbert space with the inner product $(x, y)$. Then,
(a) $\mathcal{H}$ has an orthonormal basis $\mathcal{B}$, that is, a set of vectors $\{e_\alpha\}$ of $\mathcal{H}$ such that $||e_\alpha||_{\mathcal{H}} = 1$, $(e_\alpha, e_\beta) = 0$ for all $\alpha, \beta$, $\alpha \neq \beta$, and the linear span of the vectors in $\mathcal{B}$ is dense in $\mathcal{H}$ with respect to the Hilbert norm on $\mathcal{H}$.
(b) Given any $x \in \mathcal{H}$, at most countably many among $(x, e_\alpha)$ are not equal to zero, and $x$ may be represented in the form $x = \sum_\alpha (x, e_\alpha) e_\alpha$.
(c) $||x||_{\mathcal{H}}^2 = \sum_\alpha |(x, e_\alpha)|^2$.
See Rudin (1986) for a proof. It may be shown that all orthonormal bases of a Hilbert space $\mathcal{H}$ have the same cardinality, which is called the dimension of $\mathcal{H}$.
With these preliminaries, we can now proceed to the topic of reproducing kernels. We need a key theorem about Hilbert spaces that plays a central role in the entire concept of a reproducing kernel Hilbert space. This theorem, a classic in analysis, gives a representation of continuous linear functionals on a Hilbert space. For completeness, we first define what is meant by a continuous linear functional.
Definition 20.9. Let $\mathcal{H}$ be a Hilbert space and $\delta: \mathcal{H} \to \mathbb{R}$ a real-valued linear functional or operator on $\mathcal{H}$; that is, $\delta(ax + by) = a\delta(x) + b\delta(y)$ for all $x, y \in \mathcal{H}$ and all real constants $a, b$. The norm or operator norm of $\delta$ is defined to be $||\delta|| = \sup_{x \in \mathcal{H}} \frac{|\delta(x)|}{||x||_{\mathcal{H}}}$.
Definition 20.10. Let $\mathcal{H}$ be a Hilbert space and $\delta: \mathcal{H} \to \mathbb{R}$ a linear functional on $\mathcal{H}$. The functional $\delta$ is called continuous if $x_n, x \in \mathcal{H}$, $d(x_n, x) \to 0 \Rightarrow \delta(x_n) \to \delta(x)$.
In general, linear functionals need not be continuous. But they are if they have a finite operator norm. The following extremely important result says that these two properties of continuity and boundedness are really the same.
Theorem 20.14. Let $\mathcal{H}$ be a Hilbert space and $\delta: \mathcal{H} \to \mathbb{R}$ a linear functional. Then $\delta$ is continuous if and only if it is bounded; that is, $||\delta|| < \infty$. Equivalently, a linear operator is continuous if and only if there exists a finite real constant $c$ such that $|\delta(x)| \leq c||x||_{\mathcal{H}}$ for all $x \in \mathcal{H}$.
Proof. For the "if" part, suppose $\delta$ is a bounded operator. Take $x_n \in \mathcal{H} \to 0$ (the null element). Then, by definition of operator norm,
\[
|\delta(x_n)| \leq ||\delta||\, ||x_n||_{\mathcal{H}} \to 0,
\]
because $||\delta|| < \infty$. This proves that $\delta$ is continuous at 0, and therefore continuous everywhere by linearity. Conversely, if $||\delta|| = \infty$, find a sequence $x_n \in \mathcal{H}$ such that $||x_n||_{\mathcal{H}} \leq 1$, but $|\delta(x_n)| \to \infty$. This is possible, because $||\delta||$ is easily shown to be equal to $\sup\{|\delta(x)| : ||x||_{\mathcal{H}} \leq 1\}$. Now define $z_n = \frac{x_n}{|\delta(x_n)|}$, $n \geq 1$. Then $z_n \to 0$, but $\delta(z_n)$ does not go to zero, because $|\delta(z_n)|$ is equal to 1 for all $n$. This proves the "only if" part of the theorem.
Here is the classic representation theorem for continuous linear functionals on a Hilbert space that we promised. The theorem says that any continuous linear functional on a Hilbert space $\mathcal{H}$ can be recovered as an inner product with a fixed element of $\mathcal{H}$, associated in a one-to-one way with the particular continuous functional.

Theorem 20.15 (Riesz Representation Theorem). Let $\mathcal{H}$ be a Hilbert space and $\delta: \mathcal{H} \to \mathbb{R}$ a continuous linear functional on $\mathcal{H}$. Then there exists a unique $v \in \mathcal{H}$ such that $\delta(u) = (u, v)$ for all $u \in \mathcal{H}$.
See Rudin (1986) for a proof.
We use a special name for Hilbert spaces whose elements are functions on some space. This terminology is useful when discussing reproducing kernel Hilbert spaces below.
Definition 20.11. Let $\mathcal{X}$ be a set and $\mathcal{H}$ a class of real-valued functions $f$ on $\mathcal{X}$. If $\mathcal{H}$ is a Hilbert space, it is called a Hilbert function space.
Here is the result that leads to the entire topic of reproducing kernel Hilbert spaces.
Theorem 20.16. Let $\mathcal{H}$ be a Hilbert function space. Consider the point evaluation operators defined by
\[
\delta_x(f) = f(x), \quad f \in \mathcal{H},\ x \in \mathcal{X}.
\]
If the operators $\delta_x(f): \mathcal{H} \to \mathbb{R}$ are all continuous, then for each $x \in \mathcal{X}$, there is a unique element $K_x$ of $\mathcal{H}$ such that $f(x) = \delta_x(f) = (f(\cdot), K_x(\cdot))$.
The proof is a direct consequence of the Riesz representation theorem, because the operators $\delta_x$ are linear operators. We can colloquially characterize this theorem as saying that the original function values $f(x)$ can be recovered by taking the inner product of $f$ itself with a unique kernel function $K_x$. For example, if the inner product on our relevant Hilbert space $\mathcal{H}$ was an integral, namely, $(f, g) = \int_{\mathcal{X}} f(y) g(y)\,dy$, then our theorem above says that we can recover each function $f$ in our function space $\mathcal{H}$ in the very special form
\[
f(x) = \int_{\mathcal{X}} f(y) K_x(y)\,dy.
\]
Writing $K_x(y) = K(x, y)$, we get the more conventional notation
\[
f(x) = \int_{\mathcal{X}} f(y) K(x, y)\,dy.
\]
It is common to call $K(x, y)$ the reproducing kernel of the function space $\mathcal{H}$. Note that, in a deviation from how kernels were defined in the context of smoothing, the reproducing kernel is a function on $\mathcal{X} \otimes \mathcal{X}$. It is also important to understand that not all Hilbert function spaces possess a reproducing kernel. The point evaluation operators must be continuous for the function space to possess a reproducing kernel. In what follows, more is said of this, and about which kernels can at all be a reproducing kernel of some Hilbert function space.
Some basic properties of a reproducing kernel are given below.
Proposition. Let $K(x, y)$ be the reproducing kernel of a Hilbert function space $\mathcal{H}$. Then,
(a) $K(x, y) = (K_x, K_y)$ for all $x, y \in \mathcal{X}$.
(b) $K$ is symmetric; that is, $K(x, y) = K(y, x)$ for all $x, y \in \mathcal{X}$.
(c) $K(x, x) \geq 0$ for all $x \in \mathcal{X}$.
(d) $(K(x, y))^2 \leq K(x, x) K(y, y)$ for all $x, y \in \mathcal{X}$.
The proof of each part of this result is simple. Part (a) follows from the fact that $K_x \in \mathcal{H}$, and hence $K_x(y) = K(x, y) = (K_x, K_y)$ by definition of a reproducing kernel. Part (b) follows from part (a) because (real) inner products are symmetric. Part (c) follows on noting that $K(x, x) = (K_x, K_x) = ||K_x||_{\mathcal{H}}^2 \geq 0$, and part (d) follows from the Cauchy–Schwarz inequality for inner product spaces.
It now turns out that the question of which real-valued functions $K(x, y)$ on $\mathcal{X} \otimes \mathcal{X}$ can act as a reproducing kernel of some Hilbert function space has an intimate connection to the question of which functions can be the covariance kernel of a one-dimensional Gaussian process $X(t)$, where the time parameter $t$ runs through the set $\mathcal{X}$. In the special case where $\mathcal{X}$ is a subset of some Euclidean space $\mathbb{R}^n$, and the function $K(x, y)$ in question is of the form $K(x, y) = \psi(||x - y||)$, it moreover turns out that characterizing which functions can be reproducing kernels is intimately connected to characterizing which functions $\psi(t)$ on $\mathbb{R}$ can be the characteristic function (see Chapter 8) of a probability distribution on $\mathbb{R}$. We are now beginning to see that a question purely in the domain of analysis is connected to classic questions in probability theory. To describe the main characterization theorem, we need a definition.

Definition 20.12. A symmetric real-valued function $K: \mathcal{X} \otimes \mathcal{X} \to \mathbb{R}$ is called positive-definite if for each $n \geq 1$, $x_1, \ldots, x_n \in \mathcal{X}$, and real constants $a_1, \ldots, a_n$, we have $\sum_{i=1}^n \sum_{j=1}^n a_i a_j K(x_i, x_j) \geq 0$. The function $K$ is called strictly positive-definite if $\sum_{i=1}^n \sum_{j=1}^n a_i a_j K(x_i, x_j) > 0$ unless $a_i = 0$ for all $i$ or $x_1, \ldots, x_n$ are identical.
So, $K$ is positive-definite if matrices of the form
\[
\left(K(x_i, x_j)\right)\Big|_{i,j=1}^n
\]
are nonnegative-definite. Before giving examples of positive-definite functions, we point out what exactly the connection is between positive-definite functions and reproducing kernels of Hilbert function spaces.

Theorem 20.17. Let $\mathcal{X}$ be a set and $K(x, y)$ a positive-definite function on $\mathcal{X} \otimes \mathcal{X}$. Then $K(x, y)$ is the unique reproducing kernel of some Hilbert function space $\mathcal{H}$ on $\mathcal{X}$. Conversely, suppose $\mathcal{H}$ is a Hilbert function space on a set $\mathcal{X}$ that has some kernel $K(x, y)$ as its reproducing kernel. Then $K(x, y)$ must be a positive-definite function on $\mathcal{X} \otimes \mathcal{X}$.
A proof can be seen in Cheney and Light (2000, pp. 233–234), or in Berlinet and Thomas-Agnan (2004, p. 22).
20.3.5 Mercer's Theorem and Feature Maps

Theorem 20.17 says that there is a one-to-one correspondence between positive-definite functions on a set $\mathcal{X}$ and Hilbert function spaces on $\mathcal{X}$ that possess a reproducing kernel. The question arises as to how one finds positive-definite functions on some set $\mathcal{X}$, or verifies that a given function is indeed positive-definite. The definition of a positive-definite function is not the most efficient or practical method for verifying positive-definiteness of a function. Here is a clean and practical result.
Theorem 20.18. A symmetric function $K(x, y)$ on $\mathcal{X} \otimes \mathcal{X}$ is the reproducing kernel of a Hilbert function space if and only if there is a family of maps $\phi(x)$, $x \in \mathcal{X}$, with range space $\mathcal{F}$, where $\mathcal{F}$ is an inner product space, such that $K(x, y) = (\phi(x), \phi(y))_{\mathcal{F}}$, where the notation $(u, v)_{\mathcal{F}}$ denotes the inner product of the inner product space $\mathcal{F}$.
The "if" part of the theorem is trivial, simply using the characterization of reproducing kernels as positive-definite functions. The "only if" part is nontrivial; see Aizerman, Braverman and Rozonoer (1964). A version of this result especially suited for $L_2$ spaces of functions on an Euclidean space is known as Mercer's theorem. It gives a constructive method for finding these maps $\phi(x)$ that is akin to finding the Gramian matrix of a finite-dimensional nonnegative-definite matrix. The maps $\phi(x)$ corresponding to a given kernel are called the feature maps, and the space $\mathcal{F}$ is called the feature space. The feature space $\mathcal{F}$ may, in applications, turn out to be much higher-dimensional than $\mathcal{X}$, in the case that $\mathcal{X}$ is a finite-dimensional space, for example, some Euclidean space. Still, this theorem gives us the flexibility to play with various feature maps and thereby choose a suitable kernel function, which would then correspond to a suitable reproducing kernel Hilbert space.
We now present two illustrative examples.
Example 20.14. Suppose $\mathcal{X} = [0, 1]$, $\phi_x(t) = \cos(xt)$, and $\mathcal{F} = L_2[-1, 1]$. Then we get the kernel $K(x, y)$ on $\mathcal{X} \otimes \mathcal{X} = [0, 1] \otimes [0, 1]$ given by
\[
K(x, y) = (\phi_x, \phi_y) = \int_{-1}^1 \cos(xt) \cos(yt)\,dt = \frac{\sin(x - y)}{x - y} + \frac{\sin(x + y)}{x + y},
\]
where $\frac{\sin 0}{0}$ is interpreted as the limit $\lim_{z \to 0} \frac{\sin z}{z} = 1$. Thus, $K(x, x) = 1 + \frac{\sin(2x)}{2x}$ if $x \neq 0$, and $K(0, 0) = 2$. By the characterization theorem, Theorem 20.17, this is a reproducing kernel of a Hilbert space of functions on $\mathcal{X} = [0, 1]$.
Example 20.15. Suppose $\mathcal{X} = \mathbb{R}^n$ for some $n \geq 1$. Consider the kernel $K(x, y) = (x'y)^2$, where $x'y$ denotes the usual Euclidean inner product $x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$. We find the feature maps corresponding to the positive-definite kernel $K$. Define the maps
\[
\phi(x) = \left(x_1^2, x_1 x_2, \ldots, x_1 x_n, \ldots, x_n x_1, x_n x_2, \ldots, x_n^2\right),
\]
that is, the $n^2$-dimensional Euclidean vector with coordinates $x_i x_j$, $1 \leq i, j \leq n$. We are going to look at $\phi(x)$ as an element of $\mathcal{F} = \mathbb{R}^{n^2}$. Then,
\[
(\phi(x), \phi(y))_{\mathcal{F}} = \sum_{k=1}^{n^2} (\phi(x))_k (\phi(y))_k = \sum_{i=1}^n \sum_{j=1}^n (x_i x_j)(y_i y_j) = \left(\sum_{i=1}^n x_i y_i\right)\left(\sum_{j=1}^n x_j y_j\right) = (x'y)^2.
\]
Therefore, the maps $\phi(x)$ are the feature maps corresponding to $K$.
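This identity is easy to check numerically (hypothetical code, illustration only):

    import numpy as np

    rng = np.random.default_rng(4)
    x, y = rng.normal(size=3), rng.normal(size=3)
    phi = lambda v: np.outer(v, v).ravel()          # feature map: all products v_i v_j
    print(np.dot(x, y)**2, np.dot(phi(x), phi(y)))  # the two numbers agree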
We end this section with a statement of Mercer's theorem (Mercer (1909)). Extensions of Mercer's theorem to much more abstract spaces are now available; see Berlinet and Thomas-Agnan (2004).
Theorem 20.19. Suppose $\mathcal{X}$ is a closed subset of a finite-dimensional Euclidean space, and $\mu$ a $\sigma$-finite measure on $\mathcal{X}$. Let $K(x, y)$ be a symmetric positive-definite function (i.e., a reproducing kernel) on $\mathcal{X} \otimes \mathcal{X}$, and suppose that $K$ is square integrable in the sense $\int_{\mathcal{X}} \int_{\mathcal{X}} K^2(x, y)\,d\mu(x)\,d\mu(y) < \infty$. Define the linear operator $A_K(f): L_2(\mathcal{X}, \mu) \to L_2(\mathcal{X}, \mu)$,
\[
A_K(f)(y) = \int_{\mathcal{X}} K(x, y) f(x)\,d\mu(x), \quad y \in \mathcal{X}.
\]
Then,
(a) The operator $A_K$ has a countable number of nonnegative eigenvalues $\lambda_i$, $i \geq 1$, and a corresponding sequence of mutually orthonormal eigenfunctions $\psi_i$, $i \geq 1$, satisfying $A_K(\psi_i) = \lambda_i \psi_i$, $i \geq 1$.
(b) The kernel $K$ admits the representation
\[
K(x, y) = \sum_{i=1}^{\infty} \phi_i(x) \phi_i(y),
\]
where $\phi_i(x) = \sqrt{\lambda_i}\, \psi_i(x)$, $i \geq 1$.
The theorem covers the two most practically important cases of $\mathcal{X}$ being a rectangle (possibly unbounded) in a finite-dimensional Euclidean space with $\mu$ as Lebesgue measure, and $\mathcal{X}$ being a finite set in a finite-dimensional Euclidean space with $\mu$ as the counting measure. The success of Mercer's theorem in explicitly identifying the feature map
\[
\phi(x) = \left(\phi_i(x)\right)_{i=1}^{\infty}
\]
depends on our ability to find the eigenfunctions and the eigenvalues of the linear operator $A_K$. In some cases, we can find them explicitly, and in some cases, we are
out of luck. It is worth noting that the linear operator $A_K$ has certain additional properties (compactness in particular), which allow the representation as in Mercer's theorem to hold; see, for example, Theorem 1 on p. 93 in Cheney (2001) for conditions needed for the spectral decomposition of an operator on a Hilbert space. Minh, Niyogi and Yao (2006) give some very nice examples of the calculation of $\lambda_i$ and $\psi_i$ in Mercer's theorem when the input space is the surface of a sphere in some finite-dimensional Euclidean space, or a discrete set in a finite-dimensional Euclidean space.
Theorem 20.17 says that symmetric positive-definite functions and reproducing kernel Hilbert spaces are in a one-to-one relationship. If we produce a symmetric positive-definite function, it will correspond to a suitable Hilbert space of functions with an inner product, and an induced norm. For the sake of applications, it is useful to know this correspondence for some special kernels. A list of these correspondences is given below for practical use.
In each case below we list the domain $\mathcal{X}$, the kernel $K(x, y)$, the Hilbert function space $\mathcal{H}$, and the squared norm $||u||_{\mathcal{H}}^2$:

1. $\mathcal{X} = [a, b]$; $K(x, y) = e^{-\alpha|x - y|}$; $\mathcal{H} = \{u : u' \in L_2[a, b]\}$;
$||u||_{\mathcal{H}}^2 = \frac{1}{2}\left[u^2(a) + u^2(b) + \frac{1}{\alpha}\int_a^b \left((u')^2 + \alpha^2 u^2\right)\right]$.

2. $\mathcal{X} = [0, b]$; $K(x, y) = e^{-x}\sinh y$ (for $y \leq x$); $\mathcal{H} = \{u : u(0) = 0,\ u' \in L_2[0, b]\}$;
$||u||_{\mathcal{H}}^2 = \int_0^b (u' + u)^2$.

3. $\mathcal{X} = [0, 1]$; $K(x, y) = (1 - x)(1 - y) + xy + \frac{(x - y)_+^3}{6} - \frac{x}{6}(1 - y)(x^2 - 2y + y^2)$; $\mathcal{H} = \{u : u'' \in L_2[0, 1]\}$;
$||u||_{\mathcal{H}}^2 = u^2(0) + u^2(1) + \int_0^1 (u'')^2$.

4. $\mathcal{X} = \mathbb{R}$; $K(x, y) = \frac{\sin(M(x - y))}{M(x - y)}$; $\mathcal{H}$ = the functions $u \in L_2(\mathbb{R})$ with $\int_{\mathbb{R}} e^{itx} u(x)\,dx = 0$ for $|t| > M$; the usual $L_2$ norm.

5. $\mathcal{X} = [0, 1]$; $K(x, y) = 1 + \frac{(-1)^{m-1}}{(2m)!} B_{2m}(|x - y|)$; $\mathcal{H} = \{u : u^{(m)} \in L_2[0, 1],\ u^{(j)}(0) = u^{(j)}(1) = 0\ \forall\, j < m\}$;
$||u||_{\mathcal{H}}^2 = \left(\int_0^1 u\right)^2 + \int_0^1 (u^{(m)})^2$.

Note: Above, $B_j$ denotes the $j$th Bernoulli polynomial. For example, $B_2(x) = x^2 - x + \frac{1}{6}$, $B_4(x) = x^4 - 2x^3 + x^2 - \frac{1}{30}$.
20.3.5.1 Support Vector Machines
Let us now return to the two-group statistical classification problem. Suppose the covariate vector $X$ is a $d$-dimensional multivariate normal under each group, namely, $X \mid Y = 1 \sim N_d(\mu_1, \Sigma)$, $X \mid Y = 2 \sim N_d(\mu_2, \Sigma)$. Suppose also that $P(Y = 1) = p$, $P(Y = 2) = 1 - p$. The marginal distribution of $Y$ and the conditional distribution of $X$ given $Y$ determine the joint distribution $F$ of $(X, Y)$. A classic result is that the misclassification probability $P_F(g(X) \neq Y)$ is minimized by a linear classification rule that classifies a given $X$ value into group 1 (i.e., sets $g(X) = 1$) if $c'x \geq b$ for a suitable vector $c$ and a suitable real constant $b$. In the case that $\mu_1, \mu_2$ and $\Sigma$ are known to the user, the vector $c$ has the formula $c = (\mu_1 - \mu_2)' \Sigma^{-1}$. Usually, these mean vectors and the covariance matrix are unknown to the user, in which case $c$ is estimated by using training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$.

[Fig. 20.3: Linearly separable data cloud.]
This is the historically famous Fisher linear classification rule. Geometrically, the Fisher classification rule takes a suitable hyperplane in $\mathbb{R}^d$, and classifies $X$ values on one side of the hyperplane to the first group and $X$ values on the other side of the hyperplane to the second group. For Gaussian data with identical or nearly identical covariance structures, this idea of linear separation works quite well. See Fig. 20.3. However, linear separability is too optimistic for many kinds of data; for example, even in the Gaussian case itself, if the covariance structure of $X$ is different under the two groups, then linear separation is not going to work well. On the other hand, linear separation has some advantages:
(a) A linear classification rule is easier to compute.
(b) A linear rule has geometric appeal.
(c) It may be easier to study operating characteristics, such as misclassification probabilities, of linear rules.
An appealing idea is to map the original input vector $X$ into a feature space, say some Euclidean space $\mathbb{R}^D$, by using a feature map $\Phi(X)$, and use a linear rule in the feature space. That is, use classification rules that classify an $X$ value by using a classification function of the form $\sum_{i=1}^n c_i (\Phi(x_i))'(\Phi(X))$. As we remarked, the dimension of the feature space may be much higher, and in fact the feature space may even be infinite dimensional, in which case computing these inner products $(\Phi(x_i))'(\Phi(X))$ is going to be time consuming. However, now, our previous discussion of the theory of reproducing kernel Hilbert spaces is going to help us in avoiding computation of these very high-dimensional inner products. As we saw in Theorem 20.18, every symmetric positive-definite function $K(x, y)$ on the product space $\mathcal{X} \otimes \mathcal{X}$ arises as such an inner product, and vice versa. Hence, we can just directly choose a kernel function $K(x, y)$, and use a classification rule of the form $\sum_{i=1}^n c_i K(x, x_i)$. The feature maps are not directly used, but provide the necessary motivation and intuition for using a rule of the form $\sum_{i=1}^n c_i K(x, x_i)$. To choose the particular kernel, we can use a known collection of kernels, such as $K(x, y) = K_0(x - y)$, where $K_0$ is as in our table of kernels in Section 20.3.2, or generate new kernels.
The kernel function should act as a similarity measure. If two points $x, y$ in the input space are similar, then $K(x, y)$ should be relatively large. The more dissimilar $x, y$ are, the smaller $K(x, y)$ should be. For example, if $\mathcal{X}$ is an Euclidean space, and we select a kernel of the form $K(x, y) = K_0(||x - y||)$, then $K_0(t)$ should be a nonincreasing function of $t$. A few common kernels in the classification and machine learning literature are the following.
(a) Gaussian Radial Kernel. $K(x, y) = e^{-c^2 ||x - y||^2}$.
(b) Polynomial Kernel. $K(x, y) = (x'y + c)^d$.
(c) Exponential Kernel. $K(x, y) = e^{-\alpha ||x - y||}$, $\alpha > 0$.
(d) Sigmoid Kernel. $K(x, y) = \tanh(cx'y + d)$.
(e) Inverse Quadratic Kernel. $K(x, y) = \frac{1}{\alpha^2 ||x - y||^2 + 1}$.

We close with a remark on why support vector classification appears to be more


promising than other existing methods. Strictly linear separation on the input space
X itself is clearly untenable from practical experience. But, figuring out an appro-
priate nonlinear classifier from the data is first of all difficult, secondly ad hoc, and
thirdly will have small generalizability. Support vector machines, indirectly, map
the data into another space, and then select an optimal linear classifier in that space.
This is a well-posed problem. Furthermore, the support vector approach takes away
the overriding emphasis on good fitting for the obtained data to place some empha-
sis on generalizability. More precisely, one couldP pose the classification problem
in terms of minimizing the empirical risk n1 niD1 L.yi ; g.xi // in some class G
of classifiers g. However, this tends to put the boundary of the selected classifier
too close to one of the two groups, and on a future data set, the good risk perfor-
mance would not generalize. Support vector machines, in contrast, also place some
emphasis on placing the boundary at a good geographical margin from the data
cloud in each group. The risk, therefore, has an element of empirical risk minimiza-
tion, and also an element of regularization. The mathematics of the support vector
machines approach also shows that the optimal classifier is ultimately chosen by
a few influential data points, which are the so-called support vectors. Vapnik and
Chervonenkis (1964), Vapnik (1995), and Cristianini and Shawe-Taylor (2000) are
major references on the support vector machine approach to learning from data for
classification.
Exercises
Exercise 20.1 (Simple Practical Bootstrap). For $n = 10, 30, 50$, take a random sample from an $N(0, 1)$ distribution, and bootstrap the sample mean $\bar{X}$ using a bootstrap Monte Carlo size $B = 500$. Construct a histogram and superimpose on it the exact density of $\bar{X}$. Compare.

Exercise 20.2. For $n = 5, 20, 50$, take a random sample from an $\text{Exp}(1)$ density, and bootstrap the sample mean $\bar{X}$ using a bootstrap Monte Carlo size $B = 500$. Construct the corresponding histogram and superimpose it on the exact density. Compare.

Exercise 20.3 * (Bootstrapping a Complicated Statistic). For $n = 15, 30, 60$, take a random sample from an $N(0, 1)$ distribution, and bootstrap the sample kurtosis coefficient using a bootstrap Monte Carlo size $B = 500$. Next, find the approximate normal distribution obtained from the delta theorem, and superimpose the bootstrap histogram on this approximate normal density. Compare.

Exercise 20.4. For $n = 20, 40, 75$, take a random sample from the standard Cauchy distribution, and bootstrap the sample median using a bootstrap Monte Carlo size $B = 500$. Next, find the approximate normal distribution obtained from Theorem 9.1 and superimpose the bootstrap histogram on this approximate normal density. Compare.

Exercise 20.5 * (Bootstrapping in an Unusual Situation). For $n = 20, 40, 75$, take a random sample from the standard Cauchy distribution, and bootstrap the $t$-statistic $\frac{\sqrt{n}\bar{X}}{s}$ using a bootstrap Monte Carlo size $B = 500$. Plot the bootstrap histogram. What special features do you notice in this histogram? In particular, comment on whether the true density appears to be unimodal, or bounded.

Exercise 20.6 * (Bootstrap Variance Estimate). For $n = 15, 30, 50$, find a bootstrap estimate of the variance of the sample median for a sample from a $\text{Beta}(3, 3)$ density. Use a bootstrap Monte Carlo size $B = 500$.

Exercise 20.7. For $n = 10, 20, 40$, find a bootstrap estimate of the variance of the sample mean for a sample from a $\text{Beta}(3, 3)$ density. Use a bootstrap Monte Carlo size $B = 500$. Compare with the known exact value of the variance of the sample mean.

Exercise 20.8 (Comparing Bootstrap with an Exact Answer). For $n = 15, 30, 50$, find a bootstrap estimate of the probability $P\left(\frac{\sqrt{n}\bar{X}}{s} \leq 1\right)$ for samples from a standard normal distribution. Use a bootstrap Monte Carlo size $B = 500$. Compare the bootstrap estimate with the exact value of this probability. (Why is the exact value easily computable?)

Exercise 20.9. * Prove that under appropriate moment conditions, the bootstrap is consistent for the sample correlation coefficient $r$ between two jointly distributed variables $X, Y$.
Exercise 20.10 * (Conceptual). Give an example of
(a) A density such that the bootstrap is not consistent for the mean.
(b) A density such that the bootstrap is consistent, but not second-order accurate for the mean.

Exercise 20.11 * (Conceptual). In which of the following cases can you use the canonical bootstrap justifiably?
(a) Approximating the distribution of the largest order statistic of a sample from a Beta distribution;
(b) Approximating the distribution of the median of a sample from a Beta distribution;
(c) Approximating the distribution of the maximum likelihood estimate of $P(X_1 \leq t)$ for an exponential density with mean $\mu$;
(d) Approximating the distribution of $\frac{1}{n} \sum_{i=1}^n \frac{(X_i - M_X)(Y_i - M_Y)}{s_X s_Y}$, where $(X_i, Y_i)$ are independent samples from a bivariate normal distribution, $M_X$ and $s_X$ are the median and the standard deviation of the $X_i$ values, and $M_Y$ and $s_Y$ are the median and the standard deviation of the $Y_i$ values;
(e) Approximating the distribution of the sample mean for a sample from a $t$-distribution with two degrees of freedom.

Exercise 20.12. * Suppose $\bar{X}_n$ is the sample mean of an iid sample from a CDF $F$ with a finite variance, and $\bar{X}_n^*$ is the mean of a bootstrap sample. Consistency of the bootstrap is a statement about the bootstrap distribution, conditional on the observed data. What can you say about the unconditional limit distribution of $\sqrt{n}(\bar{X}_n^* - \mu)$, where $\mu$ is the mean of $F$?

Exercise 20.13 (EM for Truncated Geometric). Suppose $X_1, \ldots, X_n$ are iid $\text{Geo}(p)$, but the value of $X_i$ is reported only if it is $\leq 4$. Explicitly derive the E-step of the EM algorithm for finding the MLE of $p$.

Exercise 20.14 (EM in a Background plus Signal Model).
(a) Suppose $Y = U + V$, where $U, V$ are independent Poisson variables with means $\lambda$ and $c$, respectively, where $c$ is known and $\lambda$ is unknown. Only $Y$ is observed. Design an EM algorithm for estimating $\lambda$, and describe the E-step and the M-step.
(b) Generalize part (a) to the case of independent replications $Y_i = U_i + V_i$, $1 \leq i \leq n$, the $U_i$ with common mean $\lambda$ and the $V_i$ with common mean $c$.

Exercise 20.15 (EM with Data). The following are observations from a bivariate normal distribution, with $*$ indicating a missing value. Use the derivations in Example 20.9 to find the first six EM iterates for the MLE of the vector of five parameters. Comment on how close to convergence you are.
Data: (0, .1), (0, 1), (1, *), (2, *), (.5, .75), (*, 3), (*, 2), (.2, .2).
Exercise 20.16 (Conceptual). Consider again the bivariate normal problem with missing values, except now there are no complete observations:
Data: (0, *), (2, *), (*, 3), (*, 2), (.5, *), (.6, *), (*, 0), (*, 1).
Is the EM algorithm useful now? Explain your answer.

Exercise 20.17 (EM for t-Distributions).
(a) Suppose $Y \sim t(\mu, m)$ with the density
\[
\frac{c_m}{\left(1 + \frac{(x - \mu)^2}{m}\right)^{\frac{m+1}{2}}},
\]
where $c_m$ is the normalizing constant. Show that $Y$ can be represented as
\[
Y \mid Z = z \sim N\left(\mu, \frac{m}{z}\right), \quad Z \sim \chi^2(m).
\]
(See Chapter 4.)
(b) Show that under a general $\mu_0$, $Z \mid Y = y \sim \frac{mW}{m + (y - \mu_0)^2}$, where $W \sim \chi^2(m+1)$.
(c) Use this to derive an expression for $E_{\mu_0}(Z \mid Y = y)$ under a general $\mu_0$.
(d) Design an EM algorithm for estimating $\mu$ by writing the complete data as $X_i = (Y_i, Z_i)$, $i = 1, \ldots, n$.
(e) Write the complete data likelihood, and derive the E-step explicitly, by using the results in parts (a) and (b).

Exercise 20.18 (EM in a Genetics Problem). Consider the ABO blood group problem worked out in Example 20.10. For the data values $Y_A = 182$, $Y_B = 60$, $Y_{AB} = 17$, $Y_O = 176$ (McLachlan and Krishnan (2008)), find the first four EM iterates, using the starting values $(p_A, p_B, p_O) = (.264, .093, .643)$.

Exercise 20.19 (EM in a Mixture Problem). Suppose for some given $n \geq 1$, $X_i = (Y_i, Z_i)$, where $Z_i$ takes values $1, 2, \ldots, p$ with probabilities $\pi_1, \pi_2, \ldots, \pi_p$. Conditional on $Z_i = j$, $Y_i \sim f_j(y \mid \theta)$. As usual, $X_1, \ldots, X_n$ are assumed to be independent. We only get to observe $Y_1, \ldots, Y_n$, but not the group memberships $Z_1, \ldots, Z_n$.
(a) Write with justification the complete data likelihood in the form $\prod_{j=1}^p \left[\pi_j^{n_j} \prod_{i: Z_i = j} f_j(y_i \mid \theta)\right]$, where $n_j$ is the number of $Z_i$ equal to $j$.
(b) By using Bayes' theorem, derive an expression for $P_{\theta_0}(Z_i = j \mid y_i)$, for a general $\theta_0$.
(c) Use this to find $\hat{L}_k(\theta, y)$, where $y = (y_1, \ldots, y_n)$.
(d) Hence, complete the E-step.
(e) Suppose that the component densities $f_j$ are $N(\mu_j, \sigma_j^2)$. Show how to complete the M-step, first for $\pi_1, \ldots, \pi_p$, and then for $\mu_j, \sigma_j^2$, $j = 1, 2, \ldots, p$.
Exercise 20.20 (EM in a Censoring Situation). Let $X_i$, $i = 1, 2, \ldots, n$ be iid from a density $f(x \mid \theta)$. Let $T$ be a fixed censoring time, and let $U_i = \min(X_i, T)$, $V_i = I_{X_i \leq T}$, $Y_i = (U_i, V_i)$. Suppose that we only get to observe the $Y_i$, but not the $X_i$.
(a) Write a general expression for the conditional distribution of $X_i$ given $Y_i$ under a general $\theta_0$.
(b) Now suppose that $f(x \mid \theta)$ is the exponential density with mean $\theta$. Use your representation of part (a) to derive an expression for $E_{\theta_0}(X_i \mid U_i, V_i = v)$, where $v = 0, 1$.
(c) Hence complete the E-step of an EM algorithm for estimating $\theta$.

Exercise 20.21 (Plug-In Density Estimate). Suppose the true model is $N(\mu, \sigma^2)$ and a parametric plug-in estimate using MLEs of the parameters is used. Derive an expression for the global error index $E_f \int [f(x) - \hat{f}(x)]^2\,dx$. At what rate does this converge to zero?

Exercise 20.22 (Choosing the Wrong Model). Suppose the true model is a double exponential location-parameter density, but you thought it was $N(\mu, 1)$ and used a parametric plug-in estimate with an MLE. Does $E_f \int |f(x) - \hat{f}(x)|\,dx$ converge to zero?

Exercise 20.23 (Trying out Density Estimates). Consider the $.5N(-2, 1) + .5N(2, 1)$ density, which is bimodal. Simulate a sample of size $n = 40$ from this density, compute and plot each of the following density estimates, and write a comparative report:
(a) a histogram that uses $m = 11$ equiwidth cells;
(b) a kernel estimate that uses a Gaussian kernel and a bandwidth equal to .05;
(c) a kernel estimate that uses a Gaussian kernel and a bandwidth equal to .005;
(d) a kernel estimate that uses a Gaussian kernel and a bandwidth equal to .15.

Exercise 20.24 (Practical Effect of the Kernel). For the simulated data in the previous exercise, repeat part (b) with the Epanechnikov and exponential kernels. Plot the three density estimates, and write a report.

Exercise 20.25. Suppose $\mathcal{H}$ is an inner product space. Prove the parallelogram law.

Exercise 20.26. Suppose $\mathcal{H}$ is an inner product space, and let $x, y \in \mathcal{H}$. Show that $x = y$ if and only if $(x, z) = (y, z)$ for all $z \in \mathcal{H}$.

Exercise 20.27. Suppose $\mathcal{H}$ is an inner product space, and let $x \in \mathcal{H}$. Show that $||x||_{\mathcal{H}} = \sup\{(x, v) : ||v||_{\mathcal{H}} = 1\}$.

Exercise 20.28 (Sufficient Condition for Convergence). Suppose $\mathcal{H}$ is an inner product space, and $x_n, y \in \mathcal{H}$, $n \geq 1$. Show that $d(x_n, y) \to 0$ if $||x_n||_{\mathcal{H}} \to ||y||_{\mathcal{H}}$ and $(x_n, y) \to ||y||_{\mathcal{H}}^2$.
Exercise 20.29 * (Normalization Helps). Suppose $\mathcal{H}$ is an inner product space, and $x, y \in \mathcal{H}$. Suppose that $||x||_{\mathcal{H}} = 1$, but $||y||_{\mathcal{H}} > 1$. Show that $\left|\left|x - \frac{y}{||y||_{\mathcal{H}}}\right|\right|_{\mathcal{H}} \leq ||x - y||_{\mathcal{H}}$. Is this inequality always a strict inequality?

Exercise 20.30 (An Orthonormal System). Consider $\mathcal{H} = L_2[-\pi, \pi]$ with the usual inner product $(f, g) = \int_{-\pi}^{\pi} f(t) g(t)\,dt$. Define the sequence of functions
\[
f_0(t) = \frac{1}{\sqrt{2\pi}}, \quad f_n(t) = \frac{1}{\sqrt{\pi}} \cos(nt), \quad g_n(t) = \frac{1}{\sqrt{\pi}} \sin(nt), \quad n = 1, 2, 3, \ldots
\]
Show that these functions form an orthonormal set on $\mathcal{H}$; that is, each has $\mathcal{H}$-norm 1 and any two distinct functions in the collection are orthogonal.

Exercise 20.31 (Legendre Polynomials). Show that the polynomials $P_n(t) = \frac{d^n}{dt^n}(t^2 - 1)^n$, $n = 0, 1, 2, \ldots$ form an orthogonal set on $L_2[-1, 1]$.
Find, explicitly, $P_i(t)$, $i = 0, 1, 2, 3, 4$.

Exercise 20.32 (Hermite Polynomials). Show that the polynomials
\[
H_n(t) = (-1)^n e^{\frac{t^2}{2}} \frac{d^n}{dt^n}\left(e^{-\frac{t^2}{2}}\right), \quad n = 0, 1, 2, \ldots
\]
form an orthogonal set on $L_2(-\infty, \infty)$ with the inner product $(f, g) = \int_{-\infty}^{\infty} f(t) g(t) e^{-\frac{t^2}{2}}\,dt$.
Find, explicitly, $H_i(t)$, $i = 0, 1, 2, 3, 4$.

Exercise 20.33 (Laguerre Polynomials). Show that for $\alpha > -1$, the polynomials
\[
L_{n,\alpha}(t) = \frac{e^t t^{-\alpha}}{n!} \frac{d^n}{dt^n}\left(e^{-t} t^{n+\alpha}\right), \quad n = 0, 1, 2, \ldots
\]
form an orthogonal set on $L_2(0, \infty)$ with the inner product $(f, g) = \int_0^{\infty} f(t) g(t) e^{-t} t^{\alpha}\,dt$.
Find, explicitly, $L_{i,\alpha}(t)$, $i = 0, 1, 2, 3, 4$.

Exercise 20.34 (Jacobi Polynomials). Show that for $\alpha, \beta > -1$, the polynomials
\[
P_n^{(\alpha,\beta)}(t) = \frac{(-1)^n}{2^n n!} (1 - t)^{-\alpha} (1 + t)^{-\beta} \frac{d^n}{dt^n}\left[(1 - t)^{\alpha+n} (1 + t)^{\beta+n}\right], \quad n = 0, 1, 2, \ldots
\]
form an orthogonal set on $L_2(-1, 1)$ with the inner product $(f, g) = \int_{-1}^1 f(t) g(t) (1 - t)^{\alpha} (1 + t)^{\beta}\,dt$.
Find, explicitly, $P_i^{(\alpha,\beta)}(t)$, $i = 0, 1, 2, 3, 4$.

Exercise 20.35 (Gegenbauer Polynomials). Find, explicitly, the first five Gegenbauer polynomials, defined as $P_i^{(\alpha,\beta)}(t)$, $i = 0, 1, 2, 3, 4$, when $\alpha = \beta$.
742 20 Useful Tools for Statistics and Machine Learning

Exercise 20.36 (Chebyshev Polynomials). Find, explicitly, the first five Cheby-
shev polynomials Tn .x/; n D 0; 1; 2; 3; 4, which are the Jacobi polynomials in the
special case ’ D ˇ D  12 . The polynomials Tn ; n D 0; 1; 2; : : : form an orthogonal
R1 1
set on L2 Œ1; 1 with the inner product .f; g/ D 1 f .t/g.t/.1  t 2 / 2 dt.

Exercise 20.37 (Projection Formula). Suppose fx1 ; x2 ; : : : ; xn g is an orthonormal


system in an inner product space H. ShowPthat the projection of an x 2 H to the
linear span of fx1 ; x2 ; : : : ; xn g is given by njD1 .x; xj /xj .

Exercise 20.38 (Bessel's Inequality). Use the formula of the previous exercise to show that for a general countable orthonormal set $B = \{x_1, x_2, \ldots\}$ in an inner product space H, and any $x \in H$, one has $\|x\|_H^2 \ge \sum_{i=1}^{\infty} |(x, x_i)|^2$.
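A quick numerical illustration of the last two exercises, in $\mathbb{R}^5$ with the usual Euclidean inner product; the orthonormal system and the vector x below are arbitrary choices made purely for the check.

import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # columns: orthonormal x_1, x_2
x = rng.standard_normal(5)

proj = Q @ (Q.T @ x)                               # sum_j (x, x_j) x_j
assert np.allclose(Q.T @ (x - proj), 0)            # residual orthogonal to the span
assert x @ x >= np.sum((Q.T @ x)**2) - 1e-12       # Bessel's inequality (finite case)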

Exercise 20.39 * (When Is a Normed Space an Inner Product Space). Suppose H is a normed space, that is, H has associated with it a norm $\|x\|_H$. Show that this norm is induced by an inner product $(x, y)$ on H if and only if the norm satisfies the parallelogram law $\|x+y\|_H^2 + \|x-y\|_H^2 = 2\left(\|x\|_H^2 + \|y\|_H^2\right)$ for all $x, y \in H$.

Exercise 20.40 * (Fact About $L^p$ Spaces). Let $\mathcal{X}$ be any subset of a Euclidean space. Show that for any $p \ge 1$, $L^p(\mathcal{X})$ is not an inner product space if $p \ne 2$.

Exercise 20.41. Let I be an interval in the real line. Show that the family of all real-valued continuous functions on I, with the norm $\|f\|_H = \sup_{x \in I} |f(x)|$, is not an inner product space.

Exercise 20.42. Suppose K is the Epanechnikov kernel in $\mathbb{R}^d$, and f is an isotropic function $f(x) = g(\|x\|)$, where g is Lipschitz of some order $\alpha > 0$.
(a) Show that f is Lipschitz on $\mathbb{R}^d$.
(b) Show that $f * K_n$ converges uniformly to f, where $K_n$ means $K_n(x) = n^d K(nx)$.
(c) Prove this directly when $f(x) = \|x\|$.

Exercise 20.43 (Kernel Plots). Plot the Cauchy and the exponential kernel, as defined in Section 20.3.2, for some selected values of a. What is the effect of increasing a?

Exercise 20.44 * (Kernel Proof of Weierstrass's Theorem). Use Weierstrass's kernels and Theorem 20.10 to show that every continuous function on a closed bounded interval in $\mathbb{R}$ can be uniformly well approximated by a polynomial of a suitable degree.

Exercise 20.45. Suppose $f, g \in L^1(\mathbb{R}^d)$. Prove that $\|\,|f| * |g|\,\|_1 = \|f\|_1\,\|g\|_1$.

Exercise 20.46. Show that, pointwise, $|f * (gh)| \le \|g\|_{\infty}\,(|f| * |h|)$.

Exercise 20.47. Consider the linear operator $\delta: L^2[a, b] \to L^2[a, b]$, where $a, b$ are finite real constants, defined as $\delta(f)(x) = \int_a^x f(y)\,dy$. Show that $\|\delta\| \le b - a$.

Exercise 20.48. Suppose K(x, y) is the reproducing kernel of a Hilbert function space H. Suppose for some given $x_0 \in \mathcal{X}$, $K(x_0, x_0) = 0$. Show that $f(x_0) = 0$ for all $f \in H$.
Exercise 20.49 * (Reproducing Kernel of an Annihilator). Suppose K(x, y) is the reproducing kernel of a Hilbert function space H on the domain set $\mathcal{X}$. Fix $z \in \mathcal{X}$, and define $H_z = \{f \in H : f(z) = 0\}$.
(a) Show that $H_z$ is a reproducing kernel Hilbert space.
(b) Show that the reproducing kernel of $H_z$ is $K_z(x, y) = K(x, y) - \frac{K(x, z)K(z, y)}{K(z, z)}$.
Exercise 20.50 * (Riesz Representation for an RKHS). Consider a function space H that is known to be a reproducing kernel Hilbert space, and let $\delta$ be a continuous linear functional on it. Show that the unique $v \in H$ that satisfies $\delta(u) = (u, v)$ for all $u \in H$ is given by the function $v(x) = \delta(K(x, \cdot))$.
Exercise 20.51 (Generating New Reproducing Kernels). Let $K_1(x, y), K_2(x, y)$ be reproducing kernels on $\mathcal{X} \otimes \mathcal{X}$. Show that $K = K_1 + K_2$ is also a reproducing kernel. Can you generalize this result?
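Since reproducing kernels produce positive semidefinite Gram matrices, the closure under addition in Exercise 20.51 can be sanity-checked numerically; a minimal sketch on random points, using the Gaussian kernel and the min(x, y) kernel of Exercise 20.53.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 20)
K1 = np.exp(-(x[:, None] - x[None, :])**2)       # Gaussian kernel Gram matrix
K2 = np.minimum(x[:, None], x[None, :])          # min(x, y) kernel, Exercise 20.53
for K in (K1, K2, K1 + K2):
    assert np.linalg.eigvalsh(K).min() > -1e-10  # all three Gram matrices are PSD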
Exercise 20.52. Show that for any $p \le 2$, $\alpha > 0$, $e^{-\alpha\|x-y\|^p}$, $x, y \in \mathbb{R}^d$, is a reproducing kernel.
Exercise 20.53. Show that $\min(x, y)$, $x, y \in [0, 1]$, is a reproducing kernel.
Exercise 20.54. Show that under the conditions of Theorem 20.11, $\alpha_n(F, \pi_n) \xrightarrow{a.s.} \alpha(F)$ for any F.
Exercise 20.55 * (Classification Using Kernels). Suppose K(x) is a kernel in $\mathbb{R}^d$ in the usual sense, that is, as defined in Definition 20.4. Suppose K has a verifiable isotropic upper bound $K(x) \le g(\|x\|)$ for some g. Give sufficient conditions on g so that the kernel classification rule based on K is strongly universally consistent in the sense of Theorem 20.11.
Exercise 20.56 * (Consistency of Kernel Classification). For which of the Gaussian, Cauchy, exponential, Laplace, spherical, and polynomial kernels, defined as in Section 20.3.2, does strong universal consistency hold? Use Theorem 20.11.
Exercise 20.57 * (Consistency of Product Kernels). Suppose $K_j(x_j)$, $j = 1, 2, \ldots, d$, are kernels on $\mathbb{R}$. Consider the classification rule that classifies a new x value into group 1 if
$$\sum_{i=1}^{n} I_{y_i = 1} \prod_{j=1}^{d} K_j\!\left(\frac{x_j - x_{i,j}}{h_j}\right) \;\ge\; \sum_{i=1}^{n} I_{y_i = 2} \prod_{j=1}^{d} K_j\!\left(\frac{x_j - x_{i,j}}{h_j}\right),$$
where $h_j = h_{j,n}$ are the coordinatewise bandwidths, $j = 1, 2, \ldots, d$, and $x_j, x_{i,j}$ stand for the jth coordinate of x and of the ith data value $x_i$.
Show that universal strong consistency of this rule holds if each $K_j$ satisfies the conditions of Theorem 20.11, and if $h_{j,n} \to 0$ for all j and $nh_{1,n}h_{2,n}\cdots h_{d,n} \to \infty$ as $n \to \infty$.

Exercise 20.58 * (Classification with Fisher's Iris Data). For Fisher's Iris dataset (e.g., wikipedia.org), form a kernel classification rule for the pairwise cases, setosa versus versicolor, setosa versus virginica, and versicolor versus virginica, by using
(a) Fisher's linear classification rule;
(b) a kernel classification rule with an exponential kernel, where the parameter of the kernel is to be chosen by you;
(c) a kernel classification rule with the inverse quadratic kernel, where the parameter of the kernel is to be chosen by you.
Find the empirical error rate for each rule, and write a report.

Exercise 20.59 * (Classification with Simulated Data).
(a) Simulate n = 50 observations from a d-dimensional normal distribution, $N_d(0, I)$; do this for d = 3, 5, 10.
(b) Simulate n = 50 observations from a d-dimensional radial (i.e., spherically symmetric) Cauchy distribution, which has the density
$$\frac{c}{(1 + \|x\|^2)^{(d+1)/2}},$$
c being the normalizing constant. Do this for d = 3, 5, 10.
(c) Form a kernel classification rule by using the exponential kernel and the inverse quadratic kernel, where the parameters of the kernel are to be chosen by you. Find the empirical error rate for each rule, and write a report.
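A sketch of one possible implementation, with d = 3 and the exponential kernel applied coordinatewise as in Exercise 20.57; the bandwidth h = 0.5 is an illustrative choice, not one prescribed by the text, the inverse quadratic kernel is handled analogously, and the radial Cauchy vectors are generated as Z/|W| with Z ~ N_d(0, I) and W ~ N(0, 1) independent.

import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 50
X1 = rng.standard_normal((n, d))                 # group 1: N_d(0, I)
Z = rng.standard_normal((n, d))
X2 = Z / np.abs(rng.standard_normal((n, 1)))     # group 2: radial Cauchy, Z/|W|
X = np.vstack([X1, X2])
y = np.repeat([1, 2], n)

def classify(x, X, y, h=0.5):
    # product exponential kernel K(u) = e^{-|u|}/2, coordinatewise
    K = np.prod(0.5 * np.exp(-np.abs((x - X) / h)), axis=1)
    return 1 if K[y == 1].sum() >= K[y == 2].sum() else 2

pred = np.array([classify(x, X, y) for x in X])
print("empirical error rate:", np.mean(pred != y))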

References

Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical foundations of the potential
function method in pattern recognition learning, Autom. Remote Control, 25, 821–837.
Aronszajn, N. (1950). Theory of reproducing kernels, Trans. Amer. Math. Soc., 68, 337–404.
Athreya, K. (1987). Bootstrap of the mean in the infinite variance case, Ann. Statist., 15, 724–731.
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and
Statistics, Kluwer, Boston.
Bickel, P.J. (2003). Unorthodox bootstraps, Invited paper, J. Korean Statist. Soc., 32, 213–224.
Bickel, P.J. and Doksum, K. (2006). Mathematical Statistics, Basic Ideas and Selected Topics, Prentice Hall, Upper Saddle River, NJ.
Bickel, P.J. and Freedman, D. (1981). Some asymptotic theory for the bootstrap, Ann. Statist., 9, 1196–1217.
Carlstein, E. (1986). The use of subseries values for estimating the variance of a general statistic
from a stationary sequence, Ann. Statist., 14, 1171–1179.
Chan, K. and Ledolter, J. (1995). Monte Carlo estimation for time series models involving counts,
J. Amer. Statist. Assoc., 90, 242–252.
Cheney, W. (2001). Analysis for Applied Mathematics, Springer, New York.
Cheney, W. and Light, W. (2000). A Course in Approximation Theory, Brooks/Cole, Pacific Grove, CA.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and other
Kernel Based Learning Methods, Cambridge Univ. Press, Cambridge, UK.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the
EM algorithm, JRSS, Ser. B, 39, 1–38.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, Springer, New York.
Efron, B. (2003). Second thoughts on the bootstrap, Statist. Sci., 18, 135–140.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, Chapman and Hall, London.
Giné, E. and Zinn, J. (1989). Necessary conditions for bootstrap of the mean, Ann. Statist., 17, 684–691.
Hall, P. (1986). On the number of bootstrap simulations required to construct a confidence interval,
Ann. Statist., 14, 1453–1462.
Hall, P. (1988). Rate of convergence in bootstrap approximations, Ann. Prob., 16, 1665–1684.
Hall, P. (1989). On efficient bootstrap simulation, Biometrika, 76, 613–617.
Hall, P. (1990). Asymptotic properties of the bootstrap for heavy-tailed distributions, Ann. Prob.,
18, 1342–1360.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer, New York.
Hall, P., Horowitz, J. and Jing, B. (1995). On blocking rules for the bootstrap with dependent data,
Biometrika, 82, 561–574.
Hall, P. (2003). A short prehistory of the bootstrap, Statist. Sci., 18, 158–167.
Künsch, H.R. (1989). The jackknife and the bootstrap for general stationary observations, Ann. Statist., 17, 1217–1241.
Lahiri, S.N. (1999). Theoretical comparisons of block bootstrap methods, Ann. Statist., 27,
386–404.
Lahiri, S.N. (2003). Resampling Methods for Dependent Data, Springer-Verlag, New York.
Lahiri, S.N. (2006). Bootstrap methods, a review, in Frontiers in Statistics, J. Fan and H. Koul Eds.,
231–256, Imperial College Press, London.
Lange, K. (1999). Numerical Analysis for Statisticians, Springer, New York.
Le Cam, L. and Yang, G. (1990). Asymptotics in Statistics, Some Basic Concepts, Springer, New
York.
Lehmann, E.L. (1999). Elements of Large Sample Theory, Springer, New York.
Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation, Springer, New York.
Levine, R. and Casella, G. (2001). Implementation of the Monte Carlo EM algorithm, J. Comput.
Graph. Statist., 10, 422–439.
McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions, Wiley, New York.
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. Royal Soc. London, A, 209, 415–446.
Minh, H., Niyogi, P., and Yao, Y. (2006). Mercer’s theorem, feature maps, and smoothing, Proc.
Comput. Learning Theory, COLT, 154–168.
Murray, G.D. (1977). Discussion of paper by Dempster, Laird, and Rubin (1977), JRSS Ser. B, 39,
27–28.
Politis, D. and Romano, J. (1994). The stationary bootstrap, JASA, 89, 1303–1313.
Politis, D. and White, A. (2004). Automatic block length selection for the dependent bootstrap,
Econ. Rev., 23, 53–70.
Politis, D., Romano, J. and Wolf, M. (1999). Subsampling, Springer, New York.
Rudin, W. (1986). Real and Complex Analysis, 3rd edition, McGraw-Hill, Columbus, OH.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function, Ann. Math. Statist., 27, 832–837.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap, Springer, New York.
Singh, K. (1981). On the asymptotic accuracy of Efron’s bootstrap, Ann. Statist., 9, 1187–1195.
Sundberg, R. (1974). Maximum likelihood theory for incomplete data from exponential family,
Scand. J. Statist., 1, 49–58.
Tong, Y. (1990). The Multivariate Normal Distribution, Springer, New York.

Vapnik, V. and Chervonenkis, A. (1964). A note on one class of perceptrons, Autom. Remote Control, 25.
Vapnik, V. (1995). The Nature of Statistical Learning Theory, Springer, New York.
Wei, G. and Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm, J. Amer.
Statist. Assoc., 85, 699–704.
Wu, C.F.J. (1983). On the convergence properties of the EM algorithm, Ann. Statist., 11, 95–103.
Appendix A
Symbols, Useful Formulas, and Normal Table

A.1 Glossary of Symbols

Ω                         sample space
P(B | A)                  conditional probability
{π(1), ..., π(n)}         permutation of {1, ..., n}
F                         CDF
F̄                         1 − F
F^{-1}, Q                 quantile function
iid, IID                  independent and identically distributed
p(x, y); f(x_1, ..., x_n) joint pmf; joint density
F(x_1, ..., x_n)          joint CDF
f(y | x); E(Y | X = x)    conditional density and expectation
var, Var                  variance
Var(Y | X = x)            conditional variance
Cov                       covariance
ρ_{X,Y}                   correlation
G(s); ψ(t)                generating function; mgf or characteristic function
ψ(t_1, ..., t_n)          joint mgf
β, γ                      skewness and kurtosis
μ_k                       E(X − μ)^k
m_k                       sample kth central moment
κ_r                       rth cumulant
ρ_r                       rth standardized cumulant; correlation of lag r
r, θ                      polar coordinates
J                         Jacobian
F_n                       empirical CDF
F_n^{-1}                  sample quantile function
X^{(n)}                   sample observation vector (X_1, ..., X_n)
M_n                       sample median
X_{(k)}, X_{k:n}          kth-order statistic


W_n                       sample range
IQR                       interquartile range
⇒_P, →_P                  convergence in probability
o_p(1)                    convergence in probability to zero
O_p(1)                    bounded in probability
a_n ≍ b_n                 0 < lim inf a_n/b_n ≤ lim sup a_n/b_n < ∞
a_n ∼ b_n, a_n ≈ b_n      lim a_n/b_n = 1
⇒_{a.s.}, →_{a.s.}        almost sure convergence
w.p. 1                    with probability 1
a.e.                      almost everywhere
i.o.                      infinitely often
⇒_L, →_L                  convergence in distribution
⇒_r, →_r                  convergence in rth mean
u.i.                      uniformly integrable
LIL                       law of iterated logarithm
VST                       variance stabilizing transformation
δ_x                       point mass at x
P({x})                    probability of the point x
λ                         Lebesgue measure
∗                         convolution
≪                         absolutely continuous
dP/dμ                     Radon–Nikodym derivative
⊗                         product measure; Kronecker product
I(θ)                      Fisher information function or matrix
T                         natural parameter space
f(θ | x); π(θ | X^{(n)})  posterior density of θ
LRT                       likelihood ratio test
Λ_n                       likelihood ratio
S                         sample covariance matrix
T^2                       Hotelling's T^2 statistic
MLE                       maximum likelihood estimate
l(θ, X)                   complete data likelihood in EM
L(θ, X)                   log l(θ, X)
l(θ, Y)                   likelihood for observed data
L(θ, Y)                   log l(θ, Y)
L̂_k(θ, Y)                 function to be maximized in M-step
H_{Boot}                  bootstrap distribution of a statistic
P_∗                       bootstrap measure
p_{ij}                    transition probabilities in a Markov chain

p_{ij}(n)                 n-step transition probabilities
T_i                       first passage time
π                         stationary distribution of a Markov chain
S_n                       random walk; partial sums
L(x, n); L(x, T)          local time
W(t); B(t)                Brownian motion and Brownian bridge
W_d(t)                    d-dimensional Brownian motion
C(s, t)                   covariance kernel
X(t); N(t)                Poisson process
Π                         Poisson point process
α_n(t); u_n(y)            uniform empirical and quantile process
F_n(t); β_n(t)            general empirical process
P_n                       empirical measure
D_n                       Kolmogorov–Smirnov statistic
MCMC                      Markov chain Monte Carlo
q_{ij}                    proposal probabilities
α_{ij}                    acceptance probabilities
S(n, C)                   shattering coefficient
SLEM                      second largest eigenvalue in modulus
δ(P)                      Dobrushin's coefficient
V(x)                      energy or drift function
VC(C)                     VC dimension
N(ε, F, ‖·‖)              covering number
N_{[]}(ε, F, ‖·‖)         bracketing number
D(x, ε, P)                packing number
λ(x)                      intensity function of a Poisson process
R                         real line
R^d                       d-dimensional Euclidean space
C(X)                      real-valued continuous functions on X
C_k(R)                    k times continuously differentiable functions
C_0(R)                    real continuous functions f on R such that f(x) → 0 as |x| → ∞
F                         family of functions
∇, Δ                      gradient vector and Laplacian
f^{(m)}                   mth derivative
D^k f(x_1, ..., x_n)      Σ_{m_1,...,m_n ≥ 0, m_1+···+m_n = k} ∂^k f / (∂x_1^{m_1} ··· ∂x_n^{m_n})
D_+, D^+                  Dini derivatives
‖·‖                       Euclidean norm
‖·‖_∞                     supnorm
tr                        trace of a matrix
|A|                       determinant of a matrix
K(x); K(x, y)             kernels

‖x‖_1; ‖x‖                L^1 and L^2 norm in R^n
H                         Hilbert space
‖x‖_H                     Hilbert norm
(x, y)                    inner product
g(X); g_n                 classification rules
φ_x; φ(x)                 Mercer's feature maps
B(x, r)                   sphere with center at x and radius r
U; U°; Ū                  domain, interior, and closure
I_A                       indicator function of A
I(t)                      large deviation rate function
{·}                       fractional part
⌊·⌋                       integer part
sgn, sign                 signum function
x_+; x^+                  max{x, 0}
max, min                  maximum, minimum
sup, inf                  supremum, infimum
K_ν, I_ν, J_ν             Bessel functions
L_p(μ); L^p(μ)            set of functions such that ∫ |f|^p dμ < ∞
d; L; Δ                   Kolmogorov, Lévy, and total variation distance
H; K                      Hellinger and Kullback–Leibler distance
W; ℓ_2                    Wasserstein distance
D                         separation distance
d_f                       f-divergence
N(μ, σ^2)                 normal distribution
φ, Φ                      standard normal density and CDF
N_p(μ, Σ); MVN(μ, Σ)      multivariate normal distribution
BVN                       bivariate normal distribution
MN(n, p_1, ..., p_k)      multinomial distribution with these parameters
t_n; t(n)                 t-distribution with n degrees of freedom
Ber(p), Bin(n, p)         Bernoulli and binomial distribution
Poi(λ)                    Poisson distribution
Geo(p)                    geometric distribution
NB(r, p)                  negative binomial distribution
Exp(λ)                    exponential distribution with mean λ
Gamma(α, λ)               Gamma density with shape parameter α and scale parameter λ
χ^2_n; χ^2(n)             chi-square distribution with n degrees of freedom
C(μ, σ)                   Cauchy distribution
D_n(α)                    Dirichlet distribution
W_p(k, Σ)                 Wishart distribution
DoubleExp(μ, λ)           double exponential with parameters μ, λ
A.2 Moments and MGFs of Common Distributions

Discrete Distributions

Distribution | p(x) | Mean | Variance | Skewness | Kurtosis | MGF
Uniform | $\frac{1}{n},\ x = 1, \ldots, n$ | $\frac{n+1}{2}$ | $\frac{n^2-1}{12}$ | $0$ | $-\frac{6(n^2+1)}{5(n^2-1)}$ | $\frac{e^{(n+1)t}-e^t}{n(e^t-1)}$
Binomial | $\binom{n}{x}p^x(1-p)^{n-x},\ x = 0, \ldots, n$ | $np$ | $np(1-p)$ | $\frac{1-2p}{\sqrt{np(1-p)}}$ | $\frac{1-6p(1-p)}{np(1-p)}$ | $(pe^t+1-p)^n$
Poisson | $\frac{e^{-\lambda}\lambda^x}{x!},\ x = 0, 1, \ldots$ | $\lambda$ | $\lambda$ | $\frac{1}{\sqrt{\lambda}}$ | $\frac{1}{\lambda}$ | $e^{\lambda(e^t-1)}$
Geometric | $p(1-p)^{x-1},\ x = 1, 2, \ldots$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $\frac{2-p}{\sqrt{1-p}}$ | $6+\frac{p^2}{1-p}$ | $\frac{pe^t}{1-(1-p)e^t}$
Neg. Bin. | $\binom{x-1}{r-1}p^r(1-p)^{x-r},\ x \ge r$ | $\frac{r}{p}$ | $\frac{r(1-p)}{p^2}$ | $\frac{2-p}{\sqrt{r(1-p)}}$ | $\frac{6}{r}+\frac{p^2}{r(1-p)}$ | $\left(\frac{pe^t}{1-(1-p)e^t}\right)^r$
Hypergeom. | $\frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}$ | $\frac{nD}{N}$ | $n\frac{D}{N}\left(1-\frac{D}{N}\right)\frac{N-n}{N-1}$ | Complex | Complex | Complex
Benford | $\log_{10}\!\left(1+\frac{1}{x}\right),\ x = 1, \ldots, 9$ | $3.44$ | $6.057$ | $.796$ | $2.45$ | $\sum_{x=1}^{9} e^{tx}p(x)$
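The numerical Benford entries can be reproduced directly from the pmf; a short check in Python (the skewness and kurtosis printed here are the third and fourth standardized moments, which match the tabulated values).

import numpy as np

x = np.arange(1, 10)
p = np.log10(1 + 1 / x)                   # Benford pmf
mu = (x * p).sum()
var = ((x - mu)**2 * p).sum()
skew = ((x - mu)**3 * p).sum() / var**1.5
kurt = ((x - mu)**4 * p).sum() / var**2
print(mu, var, skew, kurt)                # approx. 3.44, 6.057, .796, 2.45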

Continuous Distributions

Distribution | f(x) | Mean | Variance | Skewness | Kurtosis
Uniform | $\frac{1}{b-a},\ a \le x \le b$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $0$ | $-\frac{6}{5}$
Exponential | $\frac{e^{-x/\lambda}}{\lambda},\ x \ge 0$ | $\lambda$ | $\lambda^2$ | $2$ | $6$
Gamma | $\frac{e^{-x/\lambda}x^{\alpha-1}}{\lambda^{\alpha}\Gamma(\alpha)},\ x \ge 0$ | $\alpha\lambda$ | $\alpha\lambda^2$ | $\frac{2}{\sqrt{\alpha}}$ | $\frac{6}{\alpha}$
$\chi^2_m$ | $\frac{e^{-x/2}x^{m/2-1}}{2^{m/2}\Gamma(m/2)},\ x \ge 0$ | $m$ | $2m$ | $\sqrt{\frac{8}{m}}$ | $\frac{12}{m}$
Weibull | $\frac{\beta}{\lambda}\left(\frac{x}{\lambda}\right)^{\beta-1}e^{-(x/\lambda)^{\beta}},\ x > 0$ | $\lambda\Gamma(1+\frac{1}{\beta})$ | $\lambda^2\Gamma(1+\frac{2}{\beta})-\mu^2$ | $\frac{\lambda^3\Gamma(1+\frac{3}{\beta})-3\mu\sigma^2-\mu^3}{\sigma^3}$ | Complex
Beta | $\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\ 0 \le x \le 1$ | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | $\frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}$ | Complex
Normal | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)},\ x \in \mathbb{R}$ | $\mu$ | $\sigma^2$ | $0$ | $0$
Lognormal | $\frac{1}{x\sigma\sqrt{2\pi}}e^{-\frac{(\log x-\mu)^2}{2\sigma^2}},\ x > 0$ | $e^{\mu+\sigma^2/2}$ | $(e^{\sigma^2}-1)e^{2\mu+\sigma^2}$ | $(e^{\sigma^2}+2)\sqrt{e^{\sigma^2}-1}$ | Complex
Cauchy | $\frac{1}{\pi\sigma(1+(x-\mu)^2/\sigma^2)},\ x \in \mathbb{R}$ | None | None | None | None
$t_m$ | $\frac{\Gamma(\frac{m+1}{2})}{\sqrt{m\pi}\,\Gamma(\frac{m}{2})}\frac{1}{(1+x^2/m)^{(m+1)/2}},\ x \in \mathbb{R}$ | $0\ (m>1)$ | $\frac{m}{m-2}\ (m>2)$ | $0\ (m>3)$ | $\frac{6}{m-4}\ (m>4)$
F | $\frac{(\beta/\alpha)^{\beta}x^{\alpha-1}}{B(\alpha,\beta)(x+\beta/\alpha)^{\alpha+\beta}},\ x > 0$ | $\frac{\beta}{\beta-1}\ (\beta>1)$ | $\frac{\beta^2(\alpha+\beta-1)}{\alpha(\beta-2)(\beta-1)^2}\ (\beta>2)$ | Complex | Complex
Double Exp. | $\frac{e^{-|x-\mu|/\lambda}}{2\lambda},\ x \in \mathbb{R}$ | $\mu$ | $2\lambda^2$ | $0$ | $3$
Pareto | $\frac{\alpha\sigma^{\alpha}}{x^{\alpha+1}},\ x \ge \sigma > 0$ | $\frac{\alpha\sigma}{\alpha-1}\ (\alpha>1)$ | $\frac{\alpha\sigma^2}{(\alpha-1)^2(\alpha-2)}\ (\alpha>2)$ | $\frac{2(\alpha+1)}{\alpha-3}\sqrt{\frac{\alpha-2}{\alpha}}\ (\alpha>3)$ | Complex
Gumbel | $\frac{1}{\sigma}e^{-\frac{x-\mu}{\sigma}}e^{-e^{-(x-\mu)/\sigma}},\ x \in \mathbb{R}$ | $\mu+\gamma\sigma$ | $\frac{\sigma^2\pi^2}{6}$ | $\frac{12\sqrt{6}\,\zeta(3)}{\pi^3}$ | $\frac{12}{5}$

Note: For the Gumbel distribution, $\gamma \approx .577216$ is the Euler constant, and $\zeta(3)$ is Riemann's zeta function, $\zeta(3) = \sum_{n=1}^{\infty} \frac{1}{n^3} \approx 1.20206$.
Table of MGFs of Continuous Distributions

Distribution | f(x) | MGF
Uniform | $\frac{1}{b-a},\ a \le x \le b$ | $\frac{e^{bt}-e^{at}}{(b-a)t}$
Exponential | $\frac{e^{-x/\lambda}}{\lambda},\ x \ge 0$ | $(1-\lambda t)^{-1}\ (t < 1/\lambda)$
Gamma | $\frac{e^{-x/\lambda}x^{\alpha-1}}{\lambda^{\alpha}\Gamma(\alpha)},\ x \ge 0$ | $(1-\lambda t)^{-\alpha}\ (t < 1/\lambda)$
$\chi^2_m$ | $\frac{e^{-x/2}x^{m/2-1}}{2^{m/2}\Gamma(m/2)},\ x \ge 0$ | $(1-2t)^{-m/2}\ (t < \frac{1}{2})$
Weibull | $\frac{\beta}{\lambda}\left(\frac{x}{\lambda}\right)^{\beta-1}e^{-(x/\lambda)^{\beta}},\ x > 0$ | $\sum_{n=0}^{\infty}\frac{(t\lambda)^n}{n!}\Gamma(1+\frac{n}{\beta})$
Beta | $\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\ 0 \le x \le 1$ | $_1F_1(\alpha; \alpha+\beta; t)$
Normal | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)},\ x \in \mathbb{R}$ | $e^{\mu t+t^2\sigma^2/2}$
Lognormal | $\frac{1}{x\sigma\sqrt{2\pi}}e^{-\frac{(\log x-\mu)^2}{2\sigma^2}},\ x > 0$ | None
Cauchy | $\frac{1}{\pi\sigma(1+(x-\mu)^2/\sigma^2)},\ x \in \mathbb{R}$ | None
$t_m$ | $\frac{\Gamma(\frac{m+1}{2})}{\sqrt{m\pi}\,\Gamma(\frac{m}{2})}\frac{1}{(1+x^2/m)^{(m+1)/2}},\ x \in \mathbb{R}$ | None
F | $\frac{(\beta/\alpha)^{\beta}x^{\alpha-1}}{B(\alpha,\beta)(x+\beta/\alpha)^{\alpha+\beta}},\ x > 0$ | None
Double Exp. | $\frac{e^{-|x-\mu|/\lambda}}{2\lambda},\ x \in \mathbb{R}$ | $\frac{e^{\mu t}}{1-\lambda^2t^2}\ (|t| < 1/\lambda)$
Pareto | $\frac{\alpha\sigma^{\alpha}}{x^{\alpha+1}},\ x \ge \sigma > 0$ | None
Gumbel | $\frac{1}{\sigma}e^{-\frac{x-\mu}{\sigma}}e^{-e^{-(x-\mu)/\sigma}},\ x \in \mathbb{R}$ | $e^{\mu t}\Gamma(1-\sigma t)\ (t < 1/\sigma)$
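Entries in this table are easy to cross-check by simulation; a minimal Monte Carlo sketch for the Gamma row, with arbitrary illustrative parameter values (and t chosen small enough that the averaged quantity has finite variance).

import numpy as np

rng = np.random.default_rng(4)
alpha, lam, t = 2.5, 1.3, 0.3           # need t < 1/lam; here 2t < 1/lam as well
X = rng.gamma(shape=alpha, scale=lam, size=10**6)
print(np.mean(np.exp(t * X)))            # Monte Carlo estimate of E[e^{tX}]
print((1 - lam * t)**(-alpha))           # tabulated closed form, approx. 3.44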

A.3 Normal Table

Standard Normal Probabilities P(Z ≤ t) and Standard Normal Percentiles


Quantity tabulated in the next page is $\Phi(t) = P(Z \le t)$ for given $t \ge 0$, where $Z \sim N(0, 1)$. For example, from the table, $P(Z \le 1.52) = .9357$.
For any positive t, $P(-t \le Z \le t) = 2\Phi(t) - 1$, and $P(Z > t) = P(Z < -t) = 1 - \Phi(t)$.
Selected standard normal percentiles $z_\alpha$ are given below. Here, the meaning of $z_\alpha$ is $P(Z > z_\alpha) = \alpha$.

α        z_α
.25      .675
.2       .84
.1      1.28
.05     1.645
.025    1.96
.02     2.055
.01     2.33
.005    2.575
.001    3.08
.0001   3.72
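The tabulated values can also be reproduced with a short computation; a sketch using only the Python standard library, with Φ obtained from math.erf and the percentiles by bisection.

from math import erf, sqrt

def Phi(t):
    # Phi(t) = (1/2)(1 + erf(t / sqrt(2)))
    return 0.5 * (1 + erf(t / sqrt(2)))

print(round(Phi(1.52), 4))            # 0.9357, as in the table

def z(alpha, lo=0.0, hi=10.0):
    # solves P(Z > z_alpha) = alpha by bisection
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if 1 - Phi(mid) > alpha else (lo, mid)
    return lo

print(round(z(0.025), 3))             # 1.96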

t    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.8 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
4.0 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
Author Index

A Bucklew, J., 560


Adler, R.J., 576 Burkholder, D.L., 481, 482
Aitchison, J., 188
Aizerman, M., 732
Alexander, K., 545 C
Alon, N., 20 Cai, T., 76
Aronszajn, N., 726 Carlin, B., 614, 673
Ash, R., 1, 249 Carlstein, E., 702
Athreya, K., 614, 667, 671, 694 Casella, G., 583, 600, 614, 704, 714
Azuma, K., 483 Chan, K., 671, 714
Cheney, W., 731, 734
Chernoff, H., 51, 52, 68–70, 89
Chervonenkis, A., 540, 541, 736
B Chibisov, D., 534
Balakrishnan, N., 54 Chow, Y.S., 249, 259, 264, 267, 276, 463
Barbour, A., 34 Chung, K.L., 1, 381, 393, 394, 463
Barnard, G., 623 Cirelson, B.S., 572
Barndorff-Nielsen, O., 583 Clifford, P., 615, 623
Basu, D., 188, 208, 600, 604 Coles, S., 238
Basu, S., 570 Cowles, M., 673
Beran, R., 545 Cox, D., 157
Berlinet, A., 731, 733 Cramér, H., 249
Bernstein, S., 51, 52, 89 Cressie, N., 570
Berry, A., 308 Cristianini, N., 736
Besag, J., 615, 623 Csáki, E., 535
Bhattacharya, R.N., 1, 71, 76, 82, 249, 308, Csörgo, M., 421, 423, 425, 427, 527, 536
339, 402, 408, 409, 414, 423 Csörgo, S., 421
Bickel, P.J., 230, 249, 282, 583, 603, 690, 693,
704, 714
Billingsley, P., 1, 421, 531 D
Blackwell, D., 188 DasGupta, A., 1, 3, 71, 76, 215, 238, 243, 249,
Borell, C., 572 257, 311, 317, 323, 327, 328, 333, 337,
Bose, R., 208 421, 429, 505, 527, 570, 695, 699, 704,
Braverman, E., 732 714, 723
Breiman, L., 1, 249, 402 Dasgupta, S., 208
Brémaud, P., 339, 366, 614, 652, 654, 655, 658 David, H.A., 221, 228, 234, 236, 238, 323
Brown, B.M., 498 Davis, B., 481, 482
Brown, L.D., 76, 416, 429, 583, 600, 601, 612 de Haan, L., 323, 329, 331, 332, 334
Brown, R., 401 Deheuvels, P., 527, 553


del Barrio, E., 527, 553 Gelman, A., 614, 674


Dembo, A., 560, 570 Geman, D., 614
Dempster, A., 705 Geman, S., 614
den Hollander, F., 560 Genz, A., 157
Devroye, L., 485–488, 560, 725 Geyer, C., 614
Diaconis, P., 67, 339, 505, 614, 615, 652, 657, Ghosh, M., 208
671 Gibbs, A., 505, 508, 509, 516
Dimakos, X.K., 615 Gilks, W., 614
Do, K.-A., 634 Giné, E., 527, 538, 541, 542, 545, 548, 570,
Dobrushin, R.L., 652, 657 694
Doksum, K., 249, 282, 583, 603, 704, 714 Glauber, R., 646
Donsker, M., 421, 422, 530 Gnedenko, B.V., 328
Doob, J.L., 463 Götze, F., 570
Doss, H., 614, 667, 671 Gray, L., 463, 472, 493, 496
Dubhashi, D., 560 Green, P.J., 615
Dudley, R.M., 1, 505, 508, 527, 530, 531, 541, Groeneboom, P., 563
578 Gundy, R.F., 481, 482
Durrett, R., 402 Györfi, L., 560, 725
Dvoretzky, A., 242
Dym, H., 390
H
Haff, L.R., 208, 327, 429
E Hall, P., 34, 71, 224, 231, 249, 308, 331, 421,
Eaton, M., 208 463, 497, 498, 560, 570, 623, 634, 690,
Efron, B., 570, 571, 690 691, 694, 701, 704
Eicker, F., 535 Hastings, W., 614
Einmahl, J., 535 Heyde, C., 249, 421, 463, 497, 498
Einmahl, U., 425 Higdon, D., 615
Embrechts, P., 238 Hinkley, D., 174
Erdös, P., 421, 422 Hobert, J., 671
Esseen, C., 308 Hoeffding, W., 483, 570
Ethier, S., 243 Horowitz, J., 704
Everitt, B., 54 Hotelling, H., 210
Hüsler, J., 238
F
Falk, M., 238
Feller, W., 1, 62, 71, 76, 77, 249, 254, 258, I
277, 285, 307, 308, 316, 317, 339, 375, Ibragimov, I.A., 508
386, 389 Isaacson, D., 339
Ferguson, T., 188, 249
Fernique, X., 576
Fill, J., 615, 657 J
Finch, S., 382 Jaeschke, D., 535
Fisher, R.A., 25, 212, 586, 602 Jing, B., 704
Fishman, G.S., 614, 624 Johnson, N., 54
Freedman, D., 339, 402, 693 Jones, G., 671
Fristedt, B., 463, 472, 493, 496
Fuchs, W., 381, 394
K
Kac, M., 421, 422
G Kagan, A., 69, 157
Galambos, J., 221, 238, 323, 328, 331 Kamat, A., 156
Gamerman, D., 614 Karatzas, I., 414, 416, 417, 463
Garren, S., 674 Karlin, S., 402, 411, 427, 437, 463

Kass, R., 519 Metropolis, N., 614


Kemperman, J., 339 Meyn, S., 339
Kendall, M.G., 54, 62, 65 Mikosch, T., 238
Kendall, W., 615 Millar, P., 545
Kesten, H., 254, 257 Minh, H., 734
Khare, K., 614, 671 Morris, C., 612
Kiefer, J., 242 Mörters, P., 402, 420
Kingman, J.F.C., 437, 439, 442, 451, Murray, G.D., 712
456, 457 Mykland, P., 674
Klüppelberg, C., 238
Knight, S., 673
Komlós, J., 421, 425, 426, 537 N
Körner, T., 417 Niyogi, P., 734
Kosorok, M., 527 Norris, J., 339, 366
Kotz, S., 54
Krishnan, T., 705, 711, 714, 739
Künsch, H.R., 702 O
O’Reilly, N., 534
Olkin, I., 208
L Oosterhoff, J., 563
Lahiri, S.N., 690, 702–704 Owen, D., 157
Laird, N., 705
Landau, H.J., 577
Lange, K., 714 P
Lawler, G., 402, 437 Paley, R.E., 20
Le Cam, L., 34, 71, 509, 704 Panconesi, A., 560
Leadbetter, M., 221 Parzen, E., 437
Ledolter, J., 714 Patel, J., 157
Ledoux, M., 560 Peres, Y., 402, 420
Lehmann, E.L., 238, 249, 583, 600, Perlman, M., 208
699, 704 Petrov, V., 62, 249, 301, 308–311
Leise, F., 505 Pitman, J., 1
Levine, R., 714 Plackett, R., 157
Light, W., 731 Politis, D., 702, 704
Lindgren, G., 221 Pollard, D., 527, 538, 548
Linnik, Y., 69, 157 Pólya, G., 382
Liu, J., 614, 657 Port, S., 239, 264, 265, 302, 303, 306,
Logan, B.F., 570 315, 437
Lugosi, G., 560, 725 Propp, J., 615
Pyke, R., 421

M
Madsen, R., 339 R
Mahalanobis, P., 208 Rachev, S.T., 505
Major, P., 421, 425, 426, 537 Rao, C.R., 20, 62, 69, 157, 210, 505, 519, 520,
Mallows, C.L., 570 522
Martynov, G., 238 Rao, R.R., 71, 76, 82, 249, 308
Mason, D.M., 535, 537, 570 Read, C., 157
Massart, P., 242 Reiss, R., 221, 231, 234, 236, 238, 285, 323,
McDiarmid, C., 485, 487, 488, 560 326, 328, 505, 516
McKean, H., 390 Rényi, A., 375, 380, 389
McLachlan, G., 705, 711, 714, 739 Resnick, S., 221, 402
Mee, R., 157 Révész, P., 254, 420, 421, 423, 425,
Mengersen, K., 667, 669, 673 427, 527

Revuz, D., 402 Sudakov, V.N., 572


Rice, S.O., 570 Sundberg, R., 705
Richardson, S., 614
Ripley, B.D., 614
Robert, C., 614, 624, 673 T
Roberts, G., 614, 667 Talagrand, M., 572, 578
Romano, J., 702 Tanner, M., 614, 714
Rootzén, H., 221 Taylor, H.M., 402, 411, 427, 437, 463
Rosenblatt, M., 721 Teicher, H., 249, 259, 264, 267, 276, 463
Rosenbluth, A., 614 Teller, A., 614
Rosenbluth, M., 614 Teller, E., 614
Rosenthal, J.S., 615, 667, 671 Thomas-Agnan, C., 731, 733
Ross, S., 1, 614, 624 Thönnes, E., 615
Roy, S., 208 Tibshirani, R., 690
Rozonoer, L., 732 Tierney, L., 614, 667, 668, 671
Rubin, D., 614, 674, 705 Titterington, D.M., 623
Rubin, H., 337, 628 Tiwari, R., 188
Rudin, W., 726, 730 Tong, Y., 138, 154, 201, 203, 204, 208, 210,
211, 215, 698
Tu, D., 690, 695, 701
S Tusnady, G., 421, 425, 426, 537
Saloff-Coste, L., 505, 614, 652, 671 Tweedie, R., 339, 667, 669
Sauer, N., 539, 540
Schmeiser, B., 624
Sen, P.K., 328, 329
V
Seneta, E., 339, 361
Serfling, R., 249, 311, 323, 326 Vajda, I., 505
Sethuraman, J., 614, 667, 671 van de Geer, S., 527, 553
Shao, J., 690, 695, 701 van der Vaart, A., 249, 310, 527, 546,
Shao, Q.M., 560, 570, 571 547, 557
Shawe-Taylor, J., 736 van Zwet, W.R., 537
Shepp, L., 570, 577 Vapnik, V., 540, 541, 736
Shorack, G., 238, 527 Varadhan, S.R.S., 456, 560
Shreve, S., 414, 416, 417, 463 Vos, P., 519
Singer, J., 328, 329
Singh, K., 693, 699, 701
Sinha, B., 208 W
Smith, A.F.M., 614 Wang, Q., 560
Smith, R.L., 674 Wasserman, L., 67
Sparre-Andersen, E., 389 Waymire, E., 1, 339, 402, 408, 409,
Spencer, J., 20 414, 423
Spiegelhalter, D., 614 Wei, G., 714
Spitzer, F., 375, 389, 390 Wellner, J., 238, 310, 527, 538, 546,
Steele, J.M., 34 547, 557
Stein, C., 67 Wermuth, N., 157
Stern, H., 614 White, A., 704
Stigler, S., 71 Whitt, W., 421
Stirzaker, D., 1, 339 Widder, D., 53
Strassen, V., 425 Williams, D., 463
Strawderman, W.E., 429 Wilson, B., 615
Stroock, D., 560, 614 Wolf, M., 702
Stuart, A., 54, 62, 65 Wolfowitz, J., 242
Sturmfels, B., 615 Wong, W., 614
Su, F., 505, 508, 509, 516 Wu, C.F.J., 711–713

Y Z
Yang, G., 704 Zabell, S., 67
Zeitouni, O., 560
Yao, Y., 734
Zinn, J., 538, 694
Yor, M., 402 Zolotarev, V.M., 505
Yu, B., 674 Zygmund, A., 20
Subject Index

A median, 335
ABO allele, 709–711 variance ratio, 325
Acceptance distribution, 664 Asymptotics
Accept-reject method convergence, distribution
beta generation, 627–628 CDF, 262
described, 625 CLT, 266
generation, standard normal values, Cramér–Wold theorem, 263–264
626–627 Helly’s theorem, 264
scheme efficiency, 628 LIL, 267
Almost surely, 252, 253, 256, 259, 261–262, multivariate CLT, 267
267, 287, 288, 314, 410, 461, 470, Pólya’s theorem, 265
491–494, 496, 498, 503, 535, 537, Portmanteau theorem, 265–266
542, 555, 576, 577, 578, 615, 618, densities and Scheffé’s theorem, 282–286
632, 641, 694, 725 laws, large numbers
Ancillary Borel-Cantelli Lemma, 254–256
definition, 601 Glivenko-Cantelli theorem, 258–259
statistic, 602, 604, 605 strong, 256–257
Anderson’s inequality, 214, 218 weak, 256–258
Annulus, Dirichlet problem, 417–418 moments, convergence (see Convergence
Aperiodic state, 351 of moments)
notation and convergence, 250–254
Approximation of moments
preservation, convergence
first-order, 278, 279
continuous mapping, 260–261
scalar function, 282
Delta theorem, 269–272
second-order, 279, 280
multidimension, 260
variance, 281 sample correlation, 261–262
Arc sine law, 386 sample variance, 261
ARE. See Asymptotic relative efficiency Slutsky’s theorem, 268–269
Arrival rate, 552, 553 transformations, 259
intensity function, 455 variance stabilizing transformations,
Poisson process, 439, 441, 442, 446, 272–274
460–462 Asymptotics, extremes and order statistics
Arrival time application, 325–326
definition, 438 several, 326–327
independent Poisson process, 445 single, 323–325
interarrival times, 438 convergence, types theorem
Asymptotic independence, 337 limit distributions, types, 332
Asymptotic relative efficiency (ARE) Mills ratio, 333
IQR-based estimation, 327 distribution theory, 323


Asymptotics, extremes and order Binomial distribution


statistics (cont.) hypergeometric distribution problems, 30
easy applicable limit theorem negative, 28–29
Gumbel distribution, 331 Bivariate Cauchy, 195
Hermite polynomial, 329 Bivariate normal
Fisher–Tippet family definition, 136
definition, 333 distribution, 212
theorem, 334–335 five-parameter distribution, 138–139
Avogadro’s constant, 401 formula, 138
Azuma’s inequality, 483–486 joint distribution, 140
mean and variance independence, 139–140
missing coordinates, 707–709
B property, 202
Ballot theorem, 390 simulation, 136, 137, 200–201
Barker’s algorithm, 643 Bivariate normal conditional distributions
Basu’s theorem conditional expectation, 154
applications, probability Galton’s observation, 155
convergence result, 606–607 Bivariate normal formulas, 138, 156
covariance calculation, 605 Bivariate Poisson, 120
expectation calculation, 606 Gibbs chain, 683
exponential distribution result, 605 MGF, 121
mean and variance independence, Bivariate uniform, 125–126, 133
604–605 Block bootstrap, 701, 702, 703
exponential family, 602 Bochner’s theorem, 306–307
and Mahalanobis’s D 2 , 611 Bonferroni bounds, 5
and Neyman–Fisher factorization Bootstrap
definition, 602–603 consistency
factorization, 603–604 correlation coefficient, 697–698
general, 604 Kolmogorov metric, 692–693
Bayes estimate, 147–152, 467–468, 493–494 Mallows-Wasserstein metric, 693
Bayes theorem sample variance, 696
conditional densities, 141, 147, 149 Skewness Coefficient, 697
use, 147 t -statistic, 698
Berry–Esseen bound distribution, 690, 691, 694, 696, 738
definition, 76 failure, 698–699
normal approximation, 76–79 higher-order accuracy
Bessel’s inequality, 742 CBB, 702
Best linear predictor, 110–111, 154 MBB, 702
Best predictor, 142, 154 optimal block length, 704
Beta-Binomial distribution second-order accuracy, 699–700
Gibbs sampler, 965–966 smooth function, 703–704
simulation, 643–644 thumb rule, 700–701
Beta density Monte Carlo, 694, 697, 737
defintion, 59–60 ordinary bootstrap distribution, 690–691
mean, 150 resampling, 689–690
mode, 89 variance estimate, 737
percentiles, 618 Borel–Cantelli lemma
Bhattacharya affinity, 525 almost sure convergence, binomial
Bikelis local bound, 322 proportion, 255–256
Binomial confidence interval pairwise independence, 254–255
normal approximation, 74 Borell inequality, 215
score confidence, 76 Borel’s paradox
Wald confidence, 74–76 Jacobian transformation, 159

marginal and conditional density, 159–160 covariance function, 429


mean residual life, 160–161 Gaussian process, 430
Boundary crossing probabilities random walks
annulus, 417–418 scaled, simulated plot, 403
domain and harmonic, 417 state visited, planar, 406
harmonic function, 416 stochastic process
irregular boundary points, 416 definition, 403
recurrence and transience, 418–419 Gaussian process, 404
Bounded in probability, 251–252 real-valued, 404
Box–Mueller transformation, 194 standard Wiener process, 404–405
Bracket, 546 strong invariance principle and KMT
Bracketing number, 546 theorem, 425–427
Branching process, 500 transition density and heat equation,
Brownian bridge 428–429
Brownian motion, 403
empirical process, iid random variables, C
404 Campbell’s theorem
Karhunen–Loeve expansion, 554 characteristic functional …, 456
maximum, 410 Poisson random variables, 457
standard, 405 shot effects, 456
Brownian motion stable laws, 458
covariance functions, 406 Canonical exponential family
d -dimensional, 405 description, 590
Dirichlet problem and boundary crossing form and properties
probabilities binomial distribution, 590
annulus, 417–418 closure, 594–596
domain and harmonic, 417 convexity, 590–591
harmonic function, 416 moment and generating function,
irregular boundary points, 416 591–594
recurrence and transience, 418–419 one parameter, 589–590
distributional properties k-parameter, 597
fractal nature, level sets, 415–416 Canonical metric
path properties and behavior, 412–414 definition, 575
reflection principle, 410–411 Capture-recapture, 32–33
explicit construction Cauchy
Karhunen–Loéve expansion, 408 density, 44
Gaussian process, Markov distribution, 44, 48
continuous random variables, 407 Cauchy order statistics, 232
correlation function, 406 Cauchy random walk, 395
invariance principle and statistics Cauchy–Schwarz inequality, 21, 109, 259,
convergence, partial sum process, 516, 731
423–424 CDF. See Cumulative distribution function
Donsker’s, 424–425 Central limit theorem (CLT)
partial sums, 421 approximation, 83
Skorohod embedding theorem, 422, binomial confidence interval, 74–76
423 binomial probabilities., 72–73
uniform metric, 422 continuous distributions, 71
local time de Moivre–Laplace, 72
Lebesgue measure, 419 discrete distribution, 72
one-dimensional, 420 empirical measures, 543–547
zero, 420–421 error, 76–79
negative drift and density, 427–428 iid case, 304
Ornstein–Uhlenbeck process martingale, 498
convergence, 430–431 multivariate, 267

Central limit theorem (CLT) (cont.) Chernoff’s variance inequality


random walks, 73 equality holding, 68–69
and WLLN, 305–306 normal distribution, 68
Change of variable, 13 Chibisov–O’Reilly theorem, 534–535
Change point problem, 650–651 Chi square density
Chapman–Kolmogorov equation, 256, 315, degree of freedom, 45
318 Gamma densities, 58
transition probabilities, 345–346 Chung–Fuchs theorem, 381, 394–396
Characteristic function, 171, 263, 383–385, Circular Block Bootstrap (CBB), 702
394–396, 449, 456, 458, 462, 500, Classification rule, 724, 725, 734, 735, 736,
533 743, 744
Bochner’s theorem, 306–307 Communicating classes
CLT error equivalence relation, 350
Berry–Esseen theorem, 309–310 identification, 350–351
CDF sequence, 308–309 irreducibility, 349
theorem, 310–311 period computation, 351
continuity theorems, 303–304 Completeness
Cramér–Wold theorem, 263 applications, probability
Euler’s formula, 293 covariance calculation, 605
inequalities expectation calculation, 606
Bikelis nonuniform, 317 exponential distribution result, 605
Hoeffding’s, 318
mean and variance independence,
Kolmogorov’s maximal, 318
604
partial sums moment and von
weak convergence result, 606–607
Bahr–Esseen, 318
definition, 601
Rosenthal, 318–319
Neyman–Fisher factorization and Basu’s
infinite divisibility and stable laws
theorem
distributions, 317
definition, 602–603
triangular arrays, 316
Complete randomness
inversion and uniqueness
homogeneous Poisson process, 551
distribution determining property,
299–300 property, 446–448
Esseen’s Lemma, 301 Compound Poisson Process, 445, 448, 449,
lattice random variables, 300–301 459, 461
theorems, 298–299 Concentration inequalities
Lindeberg–Feller theorem, 311–315 Azuma’s, 483
Polýa’s criterion, 307–308 Burkholder, Davis and Gundy
proof general square integrable martingale,
CLT, 305 481–482
Cramér–Wold theorem, 306 martingale sequence, 480
WLLN, 305 Cirel’son and Borell, 215
random sums, 320 generalized Hoeffding lemma, 484–485
standard distributions Hoeffding’s lemma, 483–484
binomial, normal and Poisson, 295–296 maximal
exponential, double exponential, and martingale, 477–478
Cauchy, 296 moment bounds, 479–480
n-dimensional unit ball, 296–297 sharper bounds near zero, 479
Taylor expansions, differentiability and submartingale, 478
moments McDiarmid and Devroye
CDF, 302 Kolmogorov–Smirnov statistic,
Riemann–Lebesgue lemma, 302–303 487–488
Chebyshev polynomials, 742 martingale decomposition and
Chebyshev’s inequality two-point distribution, 486–487
bound, 52 optional stopping theorem, 477
large deviation probabilities, 51 upcrossing, 488–490

Conditional density Portmanteau theorem, 265–266


Bayes theorem, 141 Slutsky’s theorem, 268–269
best predictor, 142 Convergence in mean, 253
definition, 140–141 Convergence in probability
two stage experiment, 143–144 continuous mapping theorem, 270
Conditional distribution, 202–205 sure convergence, 252, 260
binomial, 119 Convergence of Gibbs sampler
definition, 100–101 discrete and continuous state spaces, 671
interarrival times, 460 drift method, 672–673
and marginal, 189–190 failure, 670–671
and Markov property, 235–238 joint density, 672
Poisson, 104 Convergence of martingales
recurrence times, 399 L1 and L2
Conditional expectation, 141, 150, 154 basic convergence theorem, 493
Jensen’s inequality, 494 Bayes estimates, 493–494
order statistics, 247 Pólya’s urn, 493
Conditional probability theorem
definition, 5 Fatou’s lemma and monotone, 491
prior to posterior belief, 8 Convergence of MCMC
Conditional variance, 143, 487, 497 Dobrushin’s inequality and
definition, 103, 141 Diaconis–Fill–Stroock bound,
Confidence band, continuous CDF, 242–243 657–659
Confidence interval drift and minorization methods, 659–662
binomial, 74–75 geometric and uniform ergodic, 653
central limit theorem, 61–62
separation and chi-square distance, 653
normal approximation theorem, 81
SLEM, 651–652
Poisson distribution, 80
spectral bounds, 653–657
sample size calculation., 66
stationary Markov chain, 651
Confidence interval, quantile, 246, 326
total variation distance, 652
Conjugate priors, 151–152
Convergence of medians, 287
Consistency bootstrap, 693–699
Convergence of moments
Consistency of kernel classification, 743
approximation, 278–282
Continuity of Gaussian process
logarithm tail, maxima and Landau–Shepp distribution, 277–278
theorem, 577 uniform integrability
Wiener process, 576–577 conditions, 276–277
Continuous mapping, 260–262, 268, 270, 275, dominated convergence theorem,
306, 327, 694 275–276
Convergence diagnostics Convergence of types, 328, 332
Garren–Smith multiple chain Gibbs Convex function theorem, 468, 495–496
method, 675 Convolution, 715
Gelman–Rubin multiple chain method, 674 definition, 167–168
spectral and drift methods, 673–674 density, 168
Yu–Mykland single chain method, double exponential, 195
674–675 n-fold, 168–169
Convergence in distribution random variable symmetrization, 170–172
CLT, 266 uniform and exponential, 194
Cramér–Wold theorem, 263 Correlation, 137, 195, 201, 204, 271–272
Delta theorem, 269–272 coefficient, 212–213, 697–698
description, 262 convergence, 261–262
Helly’s theorem, 264 definition, 108
LIL, 267 exponential order statistics, 234
multivariate CLT, 267 inequality, 120
Pólya’s theorem, 265 order statistics, 244, 245

Correlation (cont.) moments and tail, 49–50


Poisson process, 461 nonnegative integer-valued random
properties, 108–109 variables, 16
Correlation coefficient distribution PDF and median, 38
bivariate normal, 212–213 Pólya’s theorem, 265
bootstrapping, 697–698 quantile transformation, 44
Countable additivity range, 226
definition, 2–3 standard normal density, 41–42, 62
Coupon collection, 288 Curse of dimensionality, 131–132, 184
Covariance, 544, 574, 735 Curved exponential family
calculation, 107, 605 application, 612
definition, 107 definition, 583, 608
function, 412, 429 density, 607–608
inequality, 120, 215 Poissons, random covariates, 608–609
matrix, 207, 216 specific bivariate normal, 608
multinomial, 610
properties, 108–109
Covariance matrix, 136–137, 200, 203, 205, D
207, 210, 216, 217, 267, 272, 280, Delta theorem, continuous mapping theorem
326, 530, 597, 611, 709 Cramér, 269–270
Cox–Wermuth approximation random vectors, 269
approximations test, 157–158 sample correlation, 271–272
Cramér–Chernoff theorem sample variance and standard deviation,
Cauchy case, 564 271
exponential tilting technique, 563 Density
inequality, 561 midrange, 245
large deviation, 562 range, 226
rate function Density function
Bernoulli case, 563–564 continuous random variable, 38
definition, 560 description, 36
normal, 563 location scale parameter, 39
Cramér–Levy theorem, 293 normal densities, 40
Cramér–von Mises statistic, 532–533 standard normal density, 42
Cramér–Wold theorem, 263, 266, 305, 306 Detailed balance equation, 639–640
Cubic lattice random walks Diaconis–Fill–Stroock bound, 657–658
distribution theory, 378–379 Differential metrics
Pólya’s formula, 382–383 Fisher information
recurrence and transience, 379–381 direct differentiation, 521
recurrent state, 377 multivariate densities, 520
three dimensions, 375 Kullback–Leibler distance curvature,
two simulated, 376, 377 519–520
Cumulants Rao’s Geodesic distances, distributions
definition, 25–26 geodesic curves, 522
recursion relations, 26 variance-stabilizing transformation, 522
Cumulative distribution function (CDF) Diffusion coefficient
application, 87–88, 90, 245 Brownian motion, 427, 428
continuous random variable, 36–37 Dirichlet cross moment, 196
definition, 9 Dirichlet distribution
empirical, 241–243 density, 189
exchangeable normals, 206–207 Jacobian density theorem, 188–189
and independence, 9–12 marginal and conditional, 189–190
joint, 177, 193 and normal, 190
jump function, 10 Poincaré’s Lemma, 191
Markov property, 236 subvector sum, 190

Dirichlet problem irregular boundary point, 416


annulus, 417–418 maximal attraction, 332, 336
domain and harmonic, 417 Dominated convergence theorem, 275, 285
harmonic function, 416 Donsker class, 544
irregular boundary points, 416 Donsker’s theorem, 423, 531
recurrence and transience, 418–419 Double exponential density, 39–40
Discrete random variable Doubly stochastic
CDF and independence, 9–12 bistochastic, 340
definition, 8 one-step transition matrix, 347
expectation and moments, 13–19 transition probability matrix, 342
inequalities, 19–22 Drift
moment-generating functions, 22–26 Bayes example, 672–673
Discrete uniform distribution, 2, 25, 277 drift-diffusion equation, 429
Distribution and minorization methods, 659–662
binomial negative, Brownian motion, 427–428
hypergeometric distribution problems, time-dependent, 429
30 Drift-diffusion equation., 429
Cauchy, 44, 48
CDF
definition, 9 E
Chebyshev’s inequality, 20 Edgeworth expansion, 82–83, 302, 622, 699,
conditional, 203–205 700
continuous, 71 Ehrenfest model, 342, 373
correlation coefficient, 212–213 EM algorithm, 689, 704–744
discrete, 72 ABO allele, 709–711
discrete uniform, 2, 25, 277 background plus signal model, 738
lognormal, 65 censoring, 740
noncentral, 213–214 Fisher scoring method, 705
Poisson, 80 genetics problem, 739
quadratic forms, Hotelling’s T2 mixture problem, 739
Fisher Cochran theorem, 211 monotone ascent and convergence, 711
sampling, statistics Poisson, missing values, 706–707
correlation coefficient, 212–213 truncated geometric, 738
Hotelling’s T2 and quadratic forms, Empirical CDF
209–212 confidence band, continuous, 242, 243
Wishart, 207–208 definition, 527
Wishart identities, 208–209 DKW inequality, 241–242
DKW inequality, 241–242, 541, 553 Empirical measure
Dobrushin coefficient CLTs
convergence of Metropolis chains, entropy bounds, 544–546
668–669 notation and formulation, 543–544
defined, 657–658 definition, 222, 528
Metropolis–Hastings, truncated geometric, Empirical process
659 asymptotic properties
two-state chain, 658–659 approximation, 530–531
Dobrushin’s inequality Brownian bridge, 529–530
coefficient, 657–658 invariance principle and statistical
Metropolis–Hastings, truncated geometric, applications, 531–533
659 multivariate central limit theorem, 530
stationary Markov chain, 658 quantile process, 536
two-state chain, 658–659 strong approximations, 537
Domain weighted empirical process, 534–535
definition, 417 CLTs, measures and applications
Dirichlet problem, 416, 417 compact convex subsets, 547

Empirical process (cont.) construction, 205–206


entropy bounds, 544–546 definition, 205
invariance principles, 543 Expectation
monotone and Lipschitz functions, 547 continuous random variables, 45–46
notation and formulation, 543–544 definitions, 46–47
maximal inequalities and symmetrization gamma function, 47
Gaussian processes, 547–548 Experiment
notation and definitions countable additivity, 2–3
cadlag functions and Glivenko–Cantelli countably infinite set, 2
theorem, 529 definition, 2
Poisson process inclusion–exclusion formula, 5
distributional equality, 552 Exponential density
exceedance probability, 553 basic properties, 55
randomness property, 551 definition, 38
Stirling’s approximation, 553 densities, 55
Vapnik–Chervonenkis (VC) theory mean and variance, 47
dimensions and classes, 540–541 standard double, 40–41
Glivenko–Cantelli class, 538–539 Exponential family
measurability conditions, 542 canonical form and properties
shattering coefficient and Sauer’s closure, 594–596
lemma, 539–540 convexity, 590–591
uniform convergence, relative definition, 589–590
frequencies, 542–543 moments and moment generating
Empirical risk, 736 function, 591–594
Entropy bounds curved (see Curved exponential family)
bracket and bracketing number, 546 multiparameter, 596–600
covering number, 545–546 one-parameter
Kolmogorov–Smirnov type statistic, 545 binomial distribution, 586
P -Donsker, 546 definition, 585–586
VC-subgraph, 545 gamma distribution, 587
Equally likely irregular distribution, 588–589
concept, 3 normal distribution, mean and variance,
finite sample space, 3 584–586
sample points, 3 unusual gamma distribution, 587
Ergodic theorem, 366, 640, 641 variables errors, 586–587
Error function, 90 sufficiency and completeness
Error of CLT applications, probability, 604–607
Berry–Esseen bound, 77 definition, 601–602
Error probability description, 600
likelihood ratio test, 565, 580 Neyman–Fisher factorization and
type I and II, 565 Basu’s theorem, 602–604
Wald’s SPRT, 476–477 Exponential kernel, 736, 740, 742, 744
Esseen’s lemma, 301, 309 Exponential tilting
Estimation technique, 563
density estimator, 721 Extremes. See also Asymptotics, extremes and
histogram estimate, 683, 720–721, 737, order statistics
740 definition, 221
optimal local bandwidth, 723 discrete-time stochastic sequence, 574–575
plug-in estimate, 212, 630, 704, 720, 740 finite sample theory, 221–247
series estimation, 720–721
Exceedance probability, 553
Exchangeable normal F
variables Factorial moment, 22
CDF, 206–207 Factorization theorem, 602–604

Failure of EM, kernel, 712–713 Fisher linear, 735


Fatou’s lemma, 491, 492 Fisher’s Iris data, 744
F-distribution, 174–175, 218 Fisher–Tippet family
f -divergence definition, 333
finite partition property, 507, 517 theorem, 334–335
Feature maps, 732–744 Formula
Filtered Poisson process, 443–444 change of variable, 13
Finite additivity, 3 inclusion-exclusion, 4
Finite dimensional distributions, 575 Tailsum, 16
Finite sample theory Full conditionals, 645–646, 648–651, 671, 672
advanced distribution theory Function theorem
density function, 226 convex, 468, 495–496
density, standard normals, 229 implicit, 569
exponential order statistics, 227–228
joint density, 225–226
moment formulas, uniform case, 227 G
normal order statistics, 228 Gambler’s ruin, 351–353, 370, 392, 464,
basic distribution theory 468–470, 475, 501
empirical CDF, 222 Gamma density
joint density, 222–223 distributions, 59
minimum, median and maximum exponential density, 56
variables, 225 skewness, 82
order statistics, 221 Gamma function
quantile function, 222 definition, 47
uniform order statistics, 223–224 Garren–Smith multiple chain Gibbs method,
distribution, multinomial maximum 675
equiprobable case, 243 Gartner–Ellis theorem
Poissonization technique, 243 conditions, 580
empirical CDF large deviation rate function, 567
confidence band, continuous CDF, multivariate normal mean, 568–569
242–243 non-iid setups, 567–568
DKW inequality, 241–242 Gaussian factorization, 176–177
existence of moments, 230–232 Gaussian process
quantile transformation theorem, 229–230 definition, 404
records Markov property, 406–407
density, values and times, 240–241 stationary, 430
interarrival times, 238 Gaussian radial kernel, 736
spacings Gegenbauer polynomials, 741
description, 233 Gelman–Rubin multiple chain method, 674
exponential and Réyni’s representation, Generalized chi-square distance, 525, 652, 653
233–234 Generalized Hoeffding lemma, 484–485
uniform, 234–235 Generalized negative binomial distribution,
First passage, 354–359, 383–386, 409, 410, 28–29, 609
427, 475 Generalized Wald identity, 475–476
Fisher Cochran theorem, 211 Generalized Wasserstein distance, 525
Fisher information Generating function
and differential metrics, 520–521 central moment, 25–26
matrix, 597–600, 610–611 Chernoff–Bernstein inequality, 51–52
Fisher information matrix definition, 22
application, 597–610 distribution-determining property, 24–25
definition, 597 inversion, 53
and differential, 520–521 Jensen’s inequality, 52–53
multiparameter exponential family, Poisson distribution, 23–24
597–600 standard exponential, 51

Geometrically ergodic Hypergeometric distribution


convergence of MCMC, 653 definition, 29
Metropolis chain, 669–670 ingenious use, 32
spectral bounds, 654
Geometric distribution
definition, 28 I
exponential densities, 55 IID. See Independent and identically
lack of memory, 32 distributed
Gibbs sampler Importance sampling
beta-binomial pair, 647–648 Bayesian calculations, 629–630
change point problem, 650–651 binomial Bayes problem, 631
convergence, 670–673 described, 619–620
definition, 646 distribution, 629, 633–634
Dirichlet distributions, 649–650 Radon–Nikodym derivatives, 629
full conditionals, 645–646 theoretical properties, 632–633
Gaussian Bayes Importance sampling distribution, 633–634
formal and improper priors, 648 Inclusion-exclusion formula, 4, 5
Markov chain generation problem, 645 Incomplete inner product space, 728
random scan, 646–647 Independence
systematic scan, 647 definition, 5, 9
Glivenko–Cantelli class, 539, 541 Independence of mean and variance, 139–140,
Glivenko–Cantelli theorem, 258–259, 336, 185–187, 211, 604–605
487, 529, 538, 541, 545, 555, 690 Independent and identically distributed (IID)
Gumbel density observations, 140
definition, 61 random variables, 20
standard, 61 Independent increments, 404–405, 409, 412,
427, 433, 449
Independent metropolis algorithm, 643
H Indicator variable
Hájek–Rényi inequality, 396 Bernoulli variable, 11
Hardy–Weinberg law, 609 binomial random variable, 34
Harmonic distribution, 11
definition, 417 mathematical calculations, 10
function, 416 Inequality
Harris recurrence Anderson’s, 214
convergence property, 672 Borell concentration, 215
drift and minorization methods, 660 Cauchy–Schwarz, 21
Metropolis chains, 667–668 central moment, 25
Heat equation, 429, 435 Chen, 215
Hellinger metric, 506–507, 514 Chernoff–Bernstein inequality, 51–52
Helly’s theorem, 264 Cirel’son, 215
Helmert’s transformation, 186–187 covariance, 215–216
Hermite polynomials, 83, 320, 329, 741 distribution, 19
High dimensional formulas, 191–192 Jensen’s inequality, 52–53
Hilbert norm, 729 monotonicity, 214–215, 593
Hilbert space, 725–732, 734, 735, 743 positive dependence, 215
Histogram estimate, 683, 720–721, 737, 740 probability, 20–21
Hoeffding’s inequality, 318, 483–484 Sidak, 215
Holder continuity, 413 Slepian’s I and II, 214
Holder’s inequality, 21 Infinite divisibility and stable laws
Hotelling’s T2 description, 315
Fisher Cochran theorem, 211 distributions, 317
quadratic form, 210 triangular arrays, 316

Initial distribution, 369, 641, 651, 652, 655, 658, 668
  definition, 340
  Markov chain, 361
  weak ergodic theorem, 366
Inner product space, 726, 727, 728, 740–742
Inspection paradox, 444
Integrated Brownian motion, 432
Intensity, 557, 682
  function, 454, 455, 460
  piecewise linear, 454–456
Interarrival time, 238, 240, 241
  conditional distribution, 453
  Poisson process, 438
  transformed process, 453
Interquartile range (IQR)
  definition, 327
  limit distribution, 335
Invariance principle
  application, 434
  convergence, partial sum process, 423–424
  Donsker's, 424–425, 538
  partial sums, 421
  Skorohod embedding theorem, 422, 423
  and statistics
    Cramér–von Mises statistic, 532–533
    Kolmogorov–Smirnov statistic, 531–532
  strong
    KMT theorem, 425–427
    partial sum process, 530–531, 537
  uniform metric, 422
  weak, 536
Invariant measure, 663
Inverse Gamma density, 59
Inverse quadratic kernel, 736, 744
Inversion of mgf
  moment-generating function, 53
Inversion theorem
  CDF, 298–299
  failure, 307–308
  Plancherel's identity, 298
IQR. See Interquartile range
Irreducible, 569, 581, 640–643, 645, 647, 653–655, 660, 661, 663, 667, 671, 685
  definition, 349
  loop chains, 371
  regular chain, 363
Isolated points, 415, 416
Isotropic, 717, 742, 743
Iterated expectation, 105, 119, 146, 162, 172
  applications, 105–106
  formula, 105, 146, 167, 468
  higher-order, 107
Iterated variance, 119, 623
  applications, 105–106
  formula, 105

J
Jacobian formula
  CDF, 42
  technique, 45
Jacobi polynomials, 741, 742
Jensen's inequality, 52–53, 280, 468, 491, 494, 516
Joint cumulative distribution function, 97, 112, 124, 177, 193, 261, 263, 447
Joint density
  bivariate uniform, 125–126
  continuous random vector, 125
  defined, 123–124
  dimensionality curse, 131–132
  nonuniform, uniform marginals, 128–129
Joint moment-generating function (mgf), 112–114, 121, 156, 597
Joint probability mass function (pmf), 121, 148, 152, 585, 599, 608
  definition, 96, 112
  function expectation, 100

K
Karhunen–Loève expansion, 408, 533, 554
Kernel classification rule, 724–731
Kernels and classification
  annihilator, 743
  definition, 715
  density, 719–723
  estimation
    density estimator, 721
    histograms, 720
    optimal local bandwidth, 723
    plug-in, 720
    series estimation, 720–721
  Fourier, 717
  kernel plots, 742
  product kernels, 743
  smoothing
    definition, 715–716
    density estimation, 718
  statistical classification
    exponential kernel, 736
    Fisher linear, 735
    Gaussian radial kernel, 736
    Hilbert norm, 729
    Hilbert space, 725–727
    inner product space, 728
    linearly separable data, 735
    linear span, 726–727, 729, 742
    Mercer's theorem, 732–744
    operator norm, 729
    parallelogram law, 727
    polynomial kernel, 736
    positive definiteness, 732
    Pythagorean identity, 727
    reproducing kernel, 725
    Riesz representation theorem, 730
    rule, 724–725
    sigmoid kernel, 736
    support vector machines, 734–736
    Weierstrass kernels, 719
Kolmogorov metric, 506, 511, 692, 693, 695, 696
Kolmogorov–Smirnov statistic
  invariance principle, 531–532
  McDiarmid and Devroye inequalities, 487–488
  null hypothesis testing, 545
  weighted, 534–535
Komlós, Major, Tusnády theorem
  rate, 537
  strong invariance principle, 425–427
Kullback–Leibler distance
  definition, 507
  inequality, 711, 712
Kurtosis
  coefficient, 82
  defined, 18
  skewness, 82

L
Lack of memory
  exponential densities, 55
  exponential distribution, 55–56
  geometric distribution, 32
Laguerre polynomials, 741
Landau–Shepp theorem, 577
Laplacian, 417
Large deviation, 51, 463, 483
  continuous time
    extreme statistics, 574
    finite-dimensional distribution (FDD), 575
    Gaussian process, 576–577
    metric entropy, supremum, 577–579
  Cramér–Chernoff theorem, 560–564
  Cramér's theorem, general set, 566–567
  Gärtner–Ellis theorem and Markov chain, 567–570
  Lipschitz functions and Talagrand's inequality
    correlated normals, 573–574
    outer parallel body, 572–573
    Taylor expansion, 572
  multiple testing, 559
  rate function, properties
    error probabilities, likelihood ratio test, 565–566
    shape and smoothness, 564
  self-normalization, 560
  t-statistic
    definition, 570
    normal case, 571–572
    rate function, 570
Law of iterated logarithm (LIL)
  application, 433
  Brownian motion, 419
  use, 267
Legendre polynomials, 741
Level sets
  fractal nature, 415–416
Lévy inequality, 397, 400
Likelihood function
  complete data, 708, 710
  definition, 148, 565
  log, 712
  shape, 153
Likelihood ratios
  convergence, 490–491
  error probabilities, 565–566
  Gibbs sampler, 645
  martingale formation, 467
  sequential tests, statistics, 469
Lindeberg–Feller theorem
  failure condition, 315
  IID variables, linear combination, 314–315
  Lyapounov's theorem, 311–312
Linear functional, 729, 730, 743
Linear span, 726, 727, 729, 742
Local limit theorem, 82
Local time
  Brownian motion
    Lebesgue measure, 419
    one-dimensional, 420
    zero, 420–421
Lognormal density
  finite mgf, 65
  skewness, 65
Loop chains, 369, 371
Lower semi-continuous
  definition, 564
Lyapounov inequality, 21–22, 52–53

M
Mapping theorem, 261, 262, 275, 327, 422, 424, 694
  continuous, 269–272
  intensity measure, 452, 453
  lower-dimensional projections, 452
Marginal density
  bivariate uniform, 125–126
  function, 125
Marginal probability mass function (pmf), 117
  definition, 98
  E(X) calculation, 110
Margin of error, 66, 91
Markov chain, 381, 463, 466, 500, 567–570, 581, 613–686
  Chapman–Kolmogorov equation
    transition probabilities, 345–346
  communicating classes
    equivalence relation, 350
    identification, 350–351
    irreducibility, 349
    period computation, 351
  gambler's ruin, 352–353
  long run evolution and stationary distribution
    asymmetric random walk, 365–366
    Ehrenfest Urn, 364–365
    finite Markov chain, 361
    one-step transition probability matrix, 360
    weak ergodic theorem, 366
  recurrence, transience and first passage times
    definition, 354
    infinite expectation, 355–358
    simple random walk, 354–355
Markov chain large deviation, 567–570
Markov chain Monte Carlo (MCMC) methods
  algorithms, 638
  convergence and bounds, 651–662
  general spaces
    Gibbs sampler, convergence, 670–673
    independent Metropolis scheme, 665
    independent sampling, 664
    Markov transition kernel, 662–663
    Metropolis chains, 664
    Metropolis schemes, convergence, 666–670
    nonconventional multivariate distribution, 666
    practical convergence diagnostics, 673–675
    random walk Metropolis scheme, 664
    simulation, t-distribution, 665
    stationary Markov chain, 663–664
    transition density, 663
  Gibbs sampler, 645–651
  Metropolis algorithm
    Barker and independent, 643
    beta–binomial distribution, 643–644
    independent sampling, 642–643
    Metropolis–Hastings algorithm, 643
    proposal distribution, 645
    transition matrix, 642
    truncated geometric distribution, 644–645
  Monte Carlo
    conventional, 614
    ordinary, 615–624
    principle, 614
  reversible Markov chains
    discrete state space, 640–641
    ergodic theorem, 641–642
    irreducible and aperiodic stationary distribution, 640
    stationary distribution, 639–640
  simulation
    described, 613
    textbook techniques, 614–615, 624–636
  target distribution, 613, 637–638
Markov process
  definition, 404
  Ornstein–Uhlenbeck process, 431
  two-dimensional Brownian motion, 433
Markov's inequality
  description, 68
Markov transition kernel, 662–663
Martingale
  applications
    adapted to sequence, 464–465
    Bayes estimates, 467–468
    convex function theorem, 468
    gambler's fortune, 463–464
    Jensen's inequality, 468
    likelihood ratios, 467
    matching problem, 465–466
    partial sums, 465
    Pólya urn scheme, 466
    sums of squares, 465
    supermartingale, 464
    Wright–Fisher Markov chain, 466–467
  central limit theorem, 497–498
  concentration inequalities, 477–490
  convergence, 490–494
  Kolmogorov's SLLN proof, 496
  optional stopping theorem, 468–477
  reverse, 494–496
Martingale central limit theorem
  CLT, 498
  conditions
    concentration, 497
    Martingale Lindeberg, 498
Martingale Lindeberg condition, 498
Martingale maximal inequality, 477–478
Matching problem, 84, 465–466
Maximal inequality
  Kolmogorov's, 318, 396
  martingale and concentration, 477–480
  and symmetrization, 547–551
Maximum likelihood estimate (MLE)
  closed-form formulas, 705
  definition, 152–153
  endpoint, uniform, 153–154
  exponential mean, 153
  normal mean and variance, 153
McDiarmid's inequality
  Kolmogorov–Smirnov statistic, 487–488
  martingale decomposition and two-point distribution, 486–487
Mean absolute deviation
  definition, 17
  and mode, 30
Mean measure, 450, 557
Mean residual life, 160–161
Median
  CDF to PDF, 38
  definition, 9
  distribution, 55
  exponential, 55
Mercer's theorem, 732–744
Metric entropy, 545, 577–579
Metric inequalities
  Cauchy–Schwarz, 516
  variation vs. Hellinger, 517–518
Metropolis–Hastings algorithm
  beta–binomial distribution, 644
  described, 643
  SLEM, 656–657
  truncated geometric, 659
Midpoint process, 462
Midrange, 245
Mills ratio, 160, 333
Minkowski's inequality, 21
Minorization methods
  Bernoulli experiment, 661–662
  coupling, 661
  drift condition, 660
  Harris recurrence, 660
  hierarchical Bayes linear models, 660–661
  nonreversible chains, 659–660
Mixed distribution, 88, 434
Mixtures, 39, 214, 300, 307, 611, 661, 721, 739
MLE. See Maximum likelihood estimate
Mode, 30–31, 89, 91, 232, 252, 282, 552, 627, 628, 714
Moment generating function (MGF)
  central moment, 25–26
  Chernoff–Bernstein inequality, 51–52
  definition, 22
  distribution-determining property, 24–25
  inversion, 53
  Jensen's inequality, 52–53
  Poisson distribution, 23–24
  standard exponential, 51
Moment generating function (mgf), 116, 121
Moment problem, 277–278
Moments
  continuous random variables, 45–46
  definitions, 46–47
  standard normal, 48–49
Monotone convergence theorem, 491, 492
Monte Carlo
  bootstrap, 691–692
  conventional, 614–615
  EM algorithm, 706
  Markov chain, 637–645
  ordinary
    Bayesian, 618–619
    computation, confidence intervals, 616
    P-values, 622–623
    Rao–Blackwellization, 623–624
Monte Carlo P-values
  computation and application, 622–623
  statistical hypothesis-testing problem, 622
Moving Block Bootstrap (MBB), 702
Multidimensional densities
  bivariate normal
    conditional distributions, 154–155
    five-parameter, density, 136, 138–139
    formulas, 138, 155–158
    mean and variance, 139–140
    normal marginals, 140
    singular, 137
    variance–covariance matrix, 136–137
  Borel's paradox, 158–161
  conditional density and expectations, 140–147
  posterior densities, likelihood functions and Bayes estimates, 147–152
Multinomial distribution, 243, 542, 599–600
  MGF, 116
  pmf, 114
  and Poisson process, 451–452
Multinomial expansion, 116
Multinomial maximum distribution
  maximum cell frequency, 243
Multiparameter exponential family
  assumption, 596–597
  definition, 596, 597
  Fisher information matrix definition, 597
  general multivariate normal distribution, 598–599
  multinomial distribution, 599–600
  two-parameter
    gamma, 598
    inverse Gaussian distribution, 600
    normal distribution, 597–598
Multivariate Cauchy, 196
Multivariate CLT, 267
Multivariate Jacobian formula, 177–178
Multivariate normal
  conditional distributions
    bivariate normal case, 202
    independent, 203
    quadrant probability, 204–205
  definition and properties
    bivariate normal simulation, 200, 201
    characterization, 201–202
    density, 200
    linear transformations density, 200–201
  exchangeable normal variables, 205–207
  inequalities, 214–216
  noncentral distributions
    chi-square, 214
    t-distribution, 213

N
Natural parameter
  definition, 589, 599
  space, 589, 590, 596, 598
  and sufficient statistic, 600
Natural sufficient statistic
  definition, 585, 596
  one-parameter exponential family, 595–596
Nearest neighbor, 462, 554, 556
Negative
  binomial distribution, 28–29, 609
  drift, 427–428
Neighborhood recurrent, 394, 418, 419
Noncentral chi square distribution, 214, 218
Noncentral F distribution, 210
Noncentral t-distribution, 213, 218
Non-lattice, 701
Nonreversible chain, 659–660
Normal approximation
  binomial
    confidence interval, 74–76
Normal density
  CDF, 41–42
  definition, 40
Normalizing
  density function, 39–40
Normal order statistics, 228
Null recurrent
  definition, 356
  Markov chain, 373

O
Operator norm, 729
Optimal local bandwidth, 723
Optional stopping theorem
  applications
    error probabilities, Wald's SPRT, 476–477
    gambler's ruin, 475
    generalized Wald identity, 475–476
    hitting times, random walk, 475
    Wald identities, 474–475
  stopping times
    defined, 469
    sequential tests and Wald's SPRT, 469
Order statistics
  basic distribution theory, 221–222
  Cauchy, 232
  conditional distributions, 235–236
  density, 225
  description, 221
  existence of moments, 230–232
  exponential, 227–228
  joint density, 222–223
  moments, uniform, 227
  normal, 228–229
  quantile transformation, 229–230
  and range, 226–227
  spacings, 233–235
  uniform, 223–224
Orlicz norms, 556–557
Ornstein–Uhlenbeck process
  convergence, 430–431
  covariance function, 429
  Gaussian process, 430
Orthonormal bases, Parseval's identity, 728–729
Orthonormal system, 741
P
Packing number, 548
Parallelogram law, 727
Pareto density, 60
Partial correlation, 204
Partial sum process
  convergence, Brownian motion, 423–424
  interpolated, 425
  strong invariance principle, 537
Pattern problems
  discrete probability, 26
  recursion relation, 27
  variance formula, 27
Period
  burn-in, 675
  circular block bootstrap (CBB), 702
  computation, 351
  intensity function, 454, 455
Perron–Frobenius theorem, 361–363, 569, 654
Plancherel's identity, 298–299
Plug-in estimate, 212, 630, 704, 720, 740
Poincaré's lemma, 191
Poisson approximation
  applications, 35
  binomial random variable, 34
  confidence intervals, 80–82
Poisson distribution
  characteristic function, 295–296
  confidence interval, 80–81
  moment-generating function, 23–24
Poissonization, 116–118, 121, 243, 244
Poisson point process
  higher-dimensional
    intensity/mean measure, 450
    Mapping theorem, 452–453
    multinomial distribution, 451–452
    Nearest Event site, 451
    Nearest Neighbor, 462
    polar coordinates, 460
Poisson process, 551–553, 557, 636, 682
  Campbell's theorem and shot noise
    characteristic functional, 456–457
    shot effects, 456
    stable laws, 458
  1-D nonhomogeneous processes
    intensity and mean function, plots, 455
    mapping theorem, 453–454
    piecewise linear intensity, 454–456
  higher-dimensional Poisson point processes
    distance, nearest event site, 451
    mapping theorem, 452–453
  stationary/homogeneous, 450
Polar coordinates
  spherical calculations
    definition, 182–183
    dimensionality curse, 184
    joint density, 183–184
    spherically symmetric facts, 185
  two dimensions
    n uniforms product, 181–182
    polar transformation usefulness, 181
    transformation, 180–181
  use, 134–135
Polar transformation
  spherical calculations, 182–185
  use, 181
Pólya's criterion
  characteristic functions, 308
  inversion theorem failure, 307–308
  stable distributions, 307
Pólya's formula, 382–383
Pólya's theorem
  CDF, 265, 509
  return probability, 382–383
Pólya's urn, 466, 493
Polynomial kernel, 736
Portmanteau theorem
  definition, 265
Positive definiteness, 568, 717, 732
Positive recurrent, 356–357, 361, 363, 365, 651
Posterior density
  definition, 147–148
  exponential mean, 148–149
  normal mean, 151–152
  Poisson mean, 150
Posterior mean
  approximation, 651
  binomial, 149–150
Prior density
  definition, 148
  uses, 151
Probability metrics
  differential metrics, 519–522
  metric inequalities, 515–518
  properties
    coupling identity, 508
    f-divergences, 513–514
    Hellinger distance, 510–511
    joint and marginal distribution distances, 509–510
  standard probability, statistics
    f-divergences, 507
    Hellinger metric, 506–507
    Kolmogorov metric, 506
    Kullback–Leibler distance, 507
    Lévy–Prokhorov metric, 507
    separation distance, 506
    total variation metric, 506
    Wasserstein metric, 506
Product kernels, 743
Product martingale, 499
Projection formula, 742
Prophet inequality
  application, 400
  Bickel, 397
  Doob–Klass, 397
Proportions, convergence of EM, 282, 711–713
Proposal distribution, 645, 664, 665
Pythagorean identity, 727

Q
Quadrant probability, 204
Quadratic exponential family, 612
Quadratic forms distribution
  Fisher Cochran theorem, 211
  Hotelling's T² statistic, 209–210
  independence, 211–212
  spectral decomposition theorem, 210
Quantile process
  normalized, 528–529
  restricted Glivenko–Cantelli property, 536
  uniform, 529
  weak invariance principle, 536
Quantile transformation
  accept–reject method, 625
  application, 242
  Cauchy distribution, 44
  closed-form formulas, 625
  definition, 44
  theorem, 229–230
Quartile ratio, 335

R
Random permutation, 636
Random scan Gibbs sampler, 646, 647, 672
Random walks
  arc sine law, 386
  Brownian motion, 402, 403
  Chung–Fuchs theorem, 394–396
  cubic lattice
    distribution theory, 378–379
    Pólya's formula, 382–383
    recurrence and transience, 379–381
    recurrent state, 377
    three dimensions, 375
    two simulated, 376, 377
  Erdös–Kac generalization, 390
  first passage time
    definition, 383
    distribution, return times, 383–386
  inequalities
    Bickel, 397
    Hájek–Rényi, 396
    Kolmogorov's maximal, 396
    Lévy and Doob–Klass Prophet, 397
  Sparre–Andersen generalization, 389, 390
  Wald's identity
    stopping time, 391, 392
Rao–Blackwellization
  Monte Carlo estimate, 623–624
  variance formula, 623
Rao's geodesic distance, 522
Rate function, 52
Ratio estimate, 630
Real analytic, 564–565
Records
  density, values and times, 240–241
  interarrival times, 238
Recurrence, 354–359, 379–381, 383, 385, 393, 396, 399, 418–419, 472, 660, 667, 668
Recurrent state, 355, 357–359, 370, 373, 377, 394
Recurrent value, 393
Reflection principle, 410–411, 432
Regular chain
  communicating classes, 349
  irreducible, 363
Regular variation, 332, 333
Reproducing kernel, 725–731, 733, 735, 743
Resampling, 543, 689, 690, 691, 692, 701
Restricted Glivenko–Cantelli property, 536
Return probability, 382–383
Return times
  distribution, 383
  scaled, asymptotic density, 385
Reverse martingale
  convergence theorem, 496
  defined, 494
  sample means, 494–495
  second convex function theorem, 495–496
Reverse submartingale, 494–496
Reversible chain, 373, 640, 645, 652, 685
Rényi's representation, 233–234
Riemann–Lebesgue lemma, 302, 303
Riesz representation theorem, 730, 743
RKHS, 743
Rosenthal inequality, 318–319, 322
S
Sample maximum, 224, 235, 274, 276, 287, 337, 603–604
  asymptotic distribution, 330
  Cauchy, mode, 232
  domain of attraction, 336
Sample points
  concept, 3
Sample range, 337
Sample space
  countable additivity, 2–3
  countably infinite set, 2
  definition, 2
  inclusion–exclusion formula, 5
Sauer's lemma, 539–540
Schauder functions, 408
Scheffé's theorem, 284–285
Score confidence interval, 74, 76
Second arc sine law, 410
Second largest eigenvalue in modulus (SLEM)
  approximation, 686
  calculation, 685
  convergence, Gibbs sampler, 671
  vs. Dobrushin coefficients, 686
  Metropolis–Hastings algorithm, 656–657
  transition probability matrix, 651–652
Second order accuracy, 699–701
Separation distance, 506, 653
Sequential probability ratio test (SPRT), 469, 476–477
Sequential tests, 391, 469
Series estimate, 720–721
Shattering coefficient, 539, 540, 542
Shot noise, 456–458
Sidak inequality, 215
Sigmoid kernel, 736
Simple random walk, 354, 375, 393, 395, 397, 399, 474
  mathematical formulation, 345
  periodicity and, 369
  simulated, 376, 377
Simulation
  algorithms, common distributions, 634–636
  beta-binomial distribution, 643–644
  bivariate normal, 137, 201
  textbook techniques
    accept–reject method, 625–628
    quantile transformation, 624–625
  truncated geometric distribution, 644–645
Skewness
  defined, 18
  normal approximation, 82
Skorohod embedding, 422, 423
SLEM. See Second largest eigenvalue in modulus
Slepian's inequality, 214
SLLN. See Strong law of large numbers
Slutsky's theorem
  application, 269, 270
Spacings
  description, 233
  exponential and Rényi's representation, 233–234
  uniform, 234–235
Spatial Poisson process, 459
Spectral bounds
  decomposition, 655–656
  Perron–Frobenius theorem, 653–654
  SLEM, Metropolis–Hastings algorithm, 656–657
Spherically symmetric, 135, 180, 182–186, 191
SPRT. See Sequential probability ratio test
Stable laws
  index 1/2, 385–386
  infinite divisibility
    description, 315
    distributions, 317
    triangular arrays, 316
  and Poisson process, 458
Stable random walk, 395
Standard discrete distributions
  binomial, 28
  geometric, 28
  hypergeometric, 29
  negative binomial, 28–29
  Poisson, 29–34
Stationary, 152, 153, 163, 404, 406, 430, 431, 440, 449, 450, 466, 569, 575, 576, 614, 639, 640, 641, 645–648, 651–656, 658–664, 667, 672, 674, 684, 685, 701–703, 711, 713
Stationary distribution, 614, 639, 640, 646–648, 651, 653, 658, 659, 661, 663, 667, 672, 674, 684, 685
Statistical classification, kernels
  exponential kernel, 736
  Fisher linear, 735
  Gaussian radial kernel, 736
  Hilbert norm, 729
  inner product space, 728
  linear functional, 729, 730
  linearly separable data, 735
  linear span, 726–727, 729, 742
  Mercer's theorem, 732–744
  orthonormal bases, 728–729
  parallelogram law, 727
  polynomial kernel, 736
  Pythagorean identity, 727
  Riesz representation theorem, 730
  sigmoid kernel, 736
  support vector machines, 734–736
Statistics and machine learning tools
  bootstrap
    consistency, 692–699
    dependent data, 701–704
    higher-order accuracy, 699–701
  EM algorithm
    ABO allele, 709–711
    modifications, 714
    monotone ascent and convergence, 711–713
    Poisson, missing values, 706–707
  kernels
    compact support, 718
    Fourier, 717
    Mercer's theorem, 722–734
    nonnegativity, 717
    rapid decay, 718
    smoothing, 715–717
Stein's lemma
  characterization, 70
  normal distribution, 66
  principal applications, 67–68
Stochastic matrix, 340, 654
Stochastic process
  continuous-time, 422
  d-dimensional, 418
  definition, 403
  standard Wiener process, 404
Stopped martingale, 472
Stopping time
  definition, 391, 478
  optional stopping theorem, 470–471
  sequential tests, 469
  Wald's SPRT, 469
Strong approximations, 530–531, 537
Strong ergodic theorem, 641
Strong invariance principle
  and KMT theorem, 425–427
  partial sum process, 530–531, 537
Strong law of large numbers (SLLN)
  application, 261
  Cramér–Chernoff theorem, 579
  Kolmogorov's, 258, 496, 615, 632
  Markov chain, 640
  proof, 496
  Zygmund–Marcinkiewicz, 693–695
Strong Markov property, 409, 411, 423
Submartingale
  convergence theorem, 491–492
  convex function theorem, 468
  nonnegative, 477–478
  optional stopping theorem, 470–471
  reverse, 494–496
  upcrossing inequality, 489
Sufficiency
  Neyman–Fisher factorization and Basu's theorem
    definition, 602
    general, 604
Sum of uniforms
  distribution, 77
  normal density, 78, 79
Supermartingale, 464, 491, 494
Support vector machine, 734–736
Symmetrization, 170–172, 319, 547–551
Systematic scan Gibbs sampler
  bivariate distribution, 647–648
  change point problem, 650–651
  convergence property, 672
  defined, 646
  Dirichlet distribution, 650

T
Tailsum formula, 16–19, 86, 357
Target distribution
  convergence, MCMC, 651–652
  Metropolis chains, 664
Taylor expansion, characteristic function, 303
t confidence interval, 187–188, 290–291
t-distribution
  noncentral, 213, 218
  simulation, 665, 668
  student, 175–176, 187
Time reversal, 412
Total probability, 5, 6, 117
Total variation distance
  closed-form formula, 283
  definition, 282
  and densities, 282–283
  normals, 283–284
Total variation metric, 506, 509
Transformations
  arctanh, 274
  Box–Mueller, 194
  Helmert's, 186–187
  linear, 200–201
  multivariate Jacobian formula, 177–178
  n-dimensional polar, 191
  nonmonotone, 45
  polar coordinates, 180–185
  quantile, 44–45, 229–232, 624–628
  simple linear, 42–43
  variance stabilizing, 272–274, 291
Transience, 354–359, 379–381, 394, 399, 418–419
Transition density
  definition, 428, 662–663
  drift-diffusion equation, 429
  Gibbs chain, 672
Transition probability
  Markov chain, 341
  matrix, 340
  nonreversible chain, 684
  n-step, 346
  one-step, 344
Triangular arrays, 316
Triangular density
  definition, 39–40
  piecewise linear polynomial, 77

U
Uniform density
  basic properties, 54
Uniform empirical process, 528, 530, 531, 552, 553
Uniform integrability
  conditions, 276–277
  dominated convergence theorem, 275–276
Uniformly ergodic, 653, 669–670
Uniform metric, 422, 424, 425
Uniform order statistics, 223–225
  joint density, 235
  joint distribution, 230
  moments, 227
Uniform spacings, 234–235
Unimodal
  properties, 44
Unimodality, order statistics, 39, 246
Universal Donsker class, 544
Upcrossing, 488–490
Upcrossing inequality
  discrete time process, 488
  optional stopping theorem, 490
  stopping times, 488–489
  submartingale and decomposition, 489

V
Vapnik–Chervonenkis (VC)
  dimension, 539–541
  subgraph, 545, 546
Variance
  definition, 17, 46
  linear function, 68–69
  mean and, 19
Variance stabilizing transformations (VST)
  binomial case, 273–274
  confidence intervals, 272
  Fisher's, 274
  unusual, 274
Void probabilities, 450
VST. See Variance stabilizing transformations

W
Wald confidence interval, 74
Wald's identity
  application, 399
  first and second, 474–477
  stopping time, 391, 392
Wasserstein metric, 506, 693
Weak law of large numbers (WLLN), 20, 256, 257, 269, 270, 305
Weakly stationary, 404
Weibull density, 56
Weierstrass's theorem, 265–266, 268, 742
Weighted empirical process
  Chibisov–O'Reilly theorem, 534–535
  test statistics, 534
Wiener process
  continuous, 576, 577
  increments, 581
  standard, 404, 405
  tied down, 405
Wishart distribution, 207–208
Wishart identities, 208–209
WLLN. See Weak law of large numbers
Wright–Fisher Markov chain, 466–467

Y
Yu–Mykland single chain method, 674–675

Z
Zygmund–Marcinkiewicz SLLN, 693–694