Statistical Inference
Second Edition
George Casella
Roger L. Berger
Statistical Inference
Second Edition
George Casella
University of Florida
Roger L. Berger
North Carolina State University
DUXBURY
THOMSON LEARNING
Australia • Canada • Mexico • Singapore • Spain • United Kingdom • United States
DUXBURY
THOMSON LEARNING

Sponsoring Editor: Carolyn Crockett
Marketing Representative: Tom Ziolkowski
Editorial Assistant: Jennifer Jenkins
Production Editor: Tom Novack
Assistant Editor: Ann Day
Manuscript Editor: Carol Reitz
Permissions Editor: Sue Ewing
Cover Design: Jennifer Mackres
Interior Illustration: Lori Heckelman
Print Buyer: Vena Dyer
Typesetting: Integre Technical Publishing Co.
Cover Printing: R. R. Donnelley & Sons Co., Crawfordsville
Printing and Binding: R. R. Donnelley & Sons Co., Crawfordsville
All products used herein are used for identification purposes only and may be trademarks
or registered trademarks of their respective owners.
COPYRIGHT © 2002 the Wadsworth Group. Duxbury is an imprint of the Wadsworth
Group, a division of Thomson Learning Inc.
Thomson Learning™ is a trademark used herein under license.
For more information about this or any other Duxbury products, contact:
DUXBURY
511 Forest Lodge Road
Pacific Grove, CA 93950 USA
www.duxbury.com
1-800-423-0563 (Thomson Learning Academic Resource Center)
All rights reserved. No part of this work may be reproduced, transcribed or used in any
form or by any means—graphic, electronic, or mechanical, including photocopying,
recording, taping, Web distribution, or information storage and/or retrieval
systems—without the prior written permission of the publisher.
For permission to use material from this text, contact us by:
www.thomsonrights.com
fax: 1-800-730-2215
phone: 1-800-730-2214
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Library of Congress Cataloging-in-Publication Data
Casella, George.
Statistical inference / George Casella, Roger L. Berger —2nd ed.
p. cm.
Includes bibliographical references and indexes.
ISBN 0-534-24312-6
1. Mathematical statistics. 2. Probabilities. I. Berger, Roger L.
II. Title.
QA276.C37 2001
519.5—dc21
2001025704

To Anne and Vicki

Duxbury titles of related interest
Daniel, Applied Nonparametric Statistics 2nd
Derr, Statistical Consulting: A Guide to Effective Communication
Durrett, Probability: Theory and Examples 2nd
Graybill, Theory and Application of the Linear Model
Johnson, Applied Multivariate Methods for Data Analysts
Kuehl, Design of Experiments: Statistical Principles of Research Design and Analysis 2nd
Larsen, Marx, & Cooil, Statistics for Applied Problem Solving and Decision Making
Lohr, Sampling: Design and Analysis
Lunneborg, Data Analysis by Resampling: Concepts and Applications
Minh, Applied Probability Models
Minitab Inc., MINITAB™ Student Version 12 for Windows
Myers, Classical and Modern Regression with Applications 2nd
Newton & Harvill, StatConcepts: A Visual Tour of Statistical Ideas
Ramsey & Schafer, The Statistical Sleuth 2nd
SAS Institute Inc., JMP-IN: Statistical Discovery Software
Savage, INSIGHT: Business Analysis Software for Microsoft® Excel
Scheaffer, Mendenhall, & Ott, Elementary Survey Sampling 5th
Shapiro, Modeling the Supply Chain
Winston, Simulation Modeling Using @RISK
To order copies contact your local bookstore or call 1-800-354-9706. For more
information contact Duxbury Press at 511 Forest Lodge Road, Pacific Grove, CA 93950,
or go to: www.duxbury.com

Preface to the Second Edition
Although Sir Arthur Conan Doyle is responsible for most of the quotes in this book,
perhaps the best description of the life of this book can be attributed to the Grateful
Dead sentiment, “What a long, strange trip it’s been.”
Plans for the second edition started about six years ago, and for a long time we
struggled with questions about what to add and what to delete. Thankfully, as time
passed, the answers became clearer as the flow of the discipline of statistics became
clearer. We see the trend moving away from elegant proofs of special cases to algo-
rithmic solutions of more complex and practical cases. This does not undermine the
importance of mathematics and rigor; indeed, we have found that these have become
more important. But the manner in which they are applied is changing.
For those familiar with the first edition, we can summarize the changes succinctly
as follows. Discussion of asymptotic methods has been greatly expanded into its own
chapter. There is more emphasis on computing and simulation (see Section 5.5 and
the computer algebra Appendix); coverage of the more applicable techniques has
been expanded or added (for example, bootstrapping, the EM algorithm, p-values,
logistic and robust regression); and there are many new Miscellanea and Exercises.
We have de-emphasized the more specialized theoretical topics, such as equivariance
and decision theory, and have restructured some material in Chapters 3-11 for clarity.
There are two things that we want to note. First, with respect to computer algebra
programs, although we believe that they are becoming increasingly valuable tools,
we did not want to force them on the instructor who does not share that belief.
Thus, the treatment is “unobtrusive” in that it appears only in an appendix, with
some hints throughout the book where it may be useful. Second, we have changed
the numbering system to one that facilitates finding things. Now theorems, lemmas,
examples, and definitions are numbered together; for example, Definition 7.2.4 is
followed by Example 7.2.5 and Theorem 10.1.3 precedes Example 10.1.4.
The first four chapters have received only minor changes. We reordered some ma-
terial (in particular, the inequalities and identities have been split), added some new
examples and exercises, and did some general updating. Chapter 5 has also been re-
ordered, with the convergence section being moved further back, and a new section on
generating random variables added. The previous coverage of invariance, which was
in Chapters 7-9 of the first edition, has been greatly reduced and incorporated into
Chapter 6, which otherwise has received only minor editing (mostly the addition of
new exercises). Chapter 7 has been expanded and updated, and includes a new section
on the EM algorithm. Chapter 8 has also received minor editing and updating, and
a new section on p-values. In Chapter 9 we now put more emphasis on pivoting
(having realized that “guaranteeing an interval” was merely “pivoting the cdf”). Also,
the material that was in Chapter 10 of the first edition (decision theory) has been
reduced, and small sections on loss function optimality of point estimation, hypothesis
testing, and interval estimation have been added to the appropriate chapters.
Chapter 10 is entirely new and attempts to lay out the fundamentals of large sample
inference, including the delta method, consistency and asymptotic normality,
bootstrapping, robust estimators, score tests, etc. Chapter 11 is classic oneway ANOVA
and linear regression (which was covered in two different chapters in the first edi-
tion). Unfortunately, coverage of randomized block designs has been eliminated for
space reasons. Chapter 12 covers regression with errors-in-variables and contains new
material on robust and logistic regression.
After teaching from the first edition for a number of years, we know (approximately)
what can be covered in a one-year course. From the second edition, it should be
possible to cover the following in one year:
Chapter 1: Sections 1-7 Chapter 6: Sections 1-3
Chapter 2: Sections 1-3 Chapter 7: Sections 1-3
Chapter 3: Sections 1-6 Chapter 8: Sections 1-3
Chapter 4: Sections 1-7 Chapter 9: Sections 1-3
Chapter 5: Sections 1-6 Chapter 10: Sections 1, 3, 4
Classes that begin the course with some probability background can cover more ma-
terial from the later chapters.
Finally, it is almost impossible to thank all of the people who have contributed in
some way to making the second edition a reality (and helped us correct the mistakes in
the first edition). To all of our students, friends, and colleagues who took the time to
send us a note or an e-mail, we thank you. A number of people made key suggestions
that led to substantial changes in presentation. Sometimes these suggestions were just
short notes or comments, and some were longer reviews. Some were so long ago that
their authors may have forgotten, but we haven't. So thanks to Arthur Cohen, Sir
David Cox, Steve Samuels, Rob Strawderman and Tom Wehrly. We also owe much to
Jay Beder, who has sent us numerous comments and suggestions over the years and
possibly knows the first edition better than we do, and to Michael Perlman and his
class, who are sending comments and corrections even as we write this.
This book has seen a number of editors. We thank Alex Kugashev, who in the
mid-1990s first suggested doing a second edition, and our editor, Carolyn Crockett,
who constantly encouraged us. Perhaps the one person (other than us) who is most
responsible for this book is our first editor, John Kimmel, who encouraged, published,
and marketed the first edition. Thanks, John.
George Casella
Roger L. Berger

Preface to the First Edition
When someone discovers that you are writing a textbook, one (or both) of two ques-
tions will be asked. The first is “Why are you writing a book?” and the second is
“How is your book different from what’s out there?” The first question is fairly easy
to answer. You are writing a book because you are not entirely satisfied with the
available texts. The second question is harder to answer. The answer can’t be put
in a few sentences so, in order not to bore your audience (who may be asking the
question only out of politeness), you try to say something quick and witty. It usually
doesn’t work.
The purpose of this book is to build theoretical statistics (as different from mathe-
matical statistics) from the first principles of probability theory. Logical development,
proofs, ideas, themes, etc., evolve through statistical arguments. Thus, starting from
the basics of probability, we develop the theory of statistical inference using tech-
niques, definitions, and concepts that are statistical and are natural extensions and
consequences of previous concepts. When this endeavor was started, we were not sure
how well it would work. The final judgment of our success is, of course, left to the
reader.
The book is intended for first-year graduate students majoring in statistics or in
a field where a statistics concentration is desirable. The prerequisite is one year of
calculus. (Some familiarity with matrix manipulations would be useful, but is not
essential.) The book can be used for a two-semester, or three-quarter, introductory
course in statistics.
The first four chapters cover basics of probability theory and introduce many fun-
damentals that are later necessary. Chapters 5 and 6 are the first statistical chapters.
Chapter 5 is transitional (between probability and statistics) and can be the starting
point for a course in statistical theory for students with some probability background.
Chapter 6 is somewhat unique, detailing three statistical principles (sufficiency, like-
lihood, and invariance) and showing how these principles are important in modeling
data. Not all instructors will cover this chapter in detail, although we strongly recom-
mend spending some time here. In particular, the likelihood and invariance principles
are treated in detail. Along with the sufficiency principle, these principles, and the
thinking behind them, are fundamental to total statistical understanding.
Chapters 7-9 represent the central core of statistical inference, estimation (point
and interval) and hypothesis testing. A major feature of these chapters is the division
into methods of finding appropriate statistical techniques and methods of evaluating
these techniques. Finding and evaluating are of interest to both the theorist and the
practitioner, but we feel that it is important to separate these endeavors. Different
concerns are important, and different rules are invoked. Of further interest may be
the sections of these chapters titled Other Considerations. Here, we indicate how the
rules of statistical inference may be relaxed (as is done every day) and still produce
meaningful inferences. Many of the techniques covered in these sections are ones that
are used in consulting and are helpful in analyzing and inferring from actual problems.
The final three chapters can be thought of as special topics, although we feel that
some familiarity with the material is important in anyone’s statistical education.
Chapter 10 is a thorough introduction to decision theory and contains the most mod-
ern material we could include. Chapter 11 deals with the analysis of variance (oneway
and randomized block), building the theory of the complete analysis from the more
simple theory of treatment contrasts. Our experience has been that experimenters are
most interested in inferences from contrasts, and using principles developed earlier,
most tests and intervals can be derived from contrasts. Finally, Chapter 12 treats
the theory of regression, dealing first with simple linear regression and then covering
regression with “errors in variables.” This latter topic is quite important, not only to
show its own usefulness and inherent difficulties, but also to illustrate the limitations
of inferences from ordinary regression.
As more concrete guidelines for basing a one-year course on this book, we offer the
following suggestions. There can be two distinct types of courses taught from this
book. One kind we might label “more mathematical,” being a course appropriate for
students majoring in statistics and having a solid mathematics background (at least
1½ years of calculus, some matrix algebra, and perhaps a real analysis course). For
such students we recommend covering Chapters 1-9 in their entirety (which should
take approximately 22 weeks) and spending the remaining time customizing the course
with selected topics from Chapters 10-12. Once the first nine chapters are covered,
the material in each of the last three chapters is self-contained, and can be covered
in any order.
Another type of course is “more practical.” Such a course may also be a first course
for mathematically sophisticated students, but is aimed at students with one year of
calculus who may not be majoring in statistics. It stresses the more practical uses of
statistical theory, being more concerned with understanding basic statistical concepts
and deriving reasonable statistical procedures for a variety of situations, and less
concerned with formal optimality investigations. Such a course will necessarily omit
a certain amount of material, but the following list of sections can be covered in a
one-year course:
Chapter   Sections
1         All
2         2.1, 2.2, 2.3
3         3.1, 3.2
4         4.1, 4.2, 4.3, 4.5
5         5.1, 5.2, 5.3.1, 5.4
6         6.1.1, 6.2.1
7         7.1, 7.2.1, 7.2.2, 7.2.3, 7.3.1, 7.3.3, 7.4
8         8.1, 8.2.1, 8.2.3, 8.2.4, 8.3.1, 8.3.2, 8.4
9         9.1, 9.2.1, 9.2.2, 9.2.4, 9.3.1, 9.4
11        11.1, 11.2
12        12.1, 12.2
If time permits, there can be some discussion (with little emphasis on details) of the
material in Sections 4.4, 5.5, and 6.1.2, 6.1.3, 6.1.4. The material in Sections 11.3 and
12.3 may also be considered.
The exercises have been gathered from many sources and are quite plentiful. We
feel that, perhaps, the only way to master this material is through practice, and thus
we have included much opportunity to do so. The exercises are as varied as we could
make them, and many of them illustrate points that are either new or complementary
to the material in the text. Some exercises are even taken from research papers. (It
makes you feel old when you can include exercises based on papers that were new
research during your own student days!) Although the exercises are not subdivided
like the chapters, their ordering roughly follows that of the chapter. (Subdivisions
often give too many hints.) Furthermore, the exercises become (again, roughly) more
challenging as their numbers become higher.
As this is an introductory book with a relatively broad scope, the topics are not
covered in great depth. However, we felt some obligation to guide the reader one
step further in the topics that may be of interest. Thus, we have included many
references, pointing to the path to deeper understanding of any particular topic. (The
Encyclopedia of Statistical Sciences, edited by Kotz, Johnson, and Read, provides a
fine introduction to many topics.)
‘To write this book, we have drawn on both our past teachings and current work. We
have also drawn on many people, to whom we are extremely grateful. We thank our
colleagues at Cornell, North Carolina State, and Purdue—in particular, Jim Berger,
Larry Brown, Sir David Cox, Ziding Feng, Janet Johnson, Leon Gleser, Costas Goutis,
Dave Lansky, George McCabe, Chuck McCulloch, Myra Samuels, Steve Schwager,
and Shayle Searle, who have given their time and expertise in reading parts of this
manuscript, offered assistance, and taken part in many conversations leading to con-
structive suggestions. We also thank Shanti Gupta for his hospitality, and the
library at Purdue, which was essential. We are grateful for the detailed reading and
helpful suggestions of Shayle Searle and of our reviewers, both anonymous and non-
anonymous (Jim Albert, Dan Coster, and Tom Wehrly). We also thank David Moore
and George McCabe for allowing us to use their tables, and Steve Hirdt for supplying
us with data. Since this book was written by two people who, for most of the time,
were at least 600 miles apart, we lastly thank Bitnet for making this entire thing
possible.
George Casella
Roger L. Berger

“We have got to the deductions and the inferences,” said Lestrade, winking at me.
“I find it hard enough to tackle facts, Holmes, without flying away
after theories and fancies.”
Inspector Lestrade to Sherlock Holmes
The Boscombe Valley Mystery

Contents
1 Probability Theory 1
1.1 Set Theory 1
1.2 Basics of Probability Theory 5
1.2.1 Axiomatic Foundations 5
1.2.2 The Calculus of Probabilities 9
1.2.3 Counting 13
1.2.4 Enumerating Outcomes 16
1.3 Conditional Probability and Independence 20
1.4 Random Variables 27
1.5 Distribution Functions 29
1.6 Density and Mass Functions 34
1.7 Exercises 37
1.8 Miscellanea 44

2 Transformations and Expectations 47
2.1 Distributions of Functions of a Random Variable 47
2.2 Expected Values 55
2.3 Moments and Moment Generating Functions 59
2.4 Differentiating Under an Integral Sign 68
2.5 Exercises 76
2.6 Miscellanea 82

3 Common Families of Distributions 85
3.1 Introduction 85
3.2 Discrete Distributions 85
3.3 Continuous Distributions 98
3.4 Exponential Families 111
3.5 Location and Scale Families 116
3.6 Inequalities and Identities 121
3.6.1 Probability Inequalities 122
3.6.2 Identities 123
3.7 Exercises 127
3.8 Miscellanea 135

4 Multiple Random Variables 139
4.1 Joint and Marginal Distributions 139
4.2 Conditional Distributions and Independence 147
4.3 Bivariate Transformations 156
4.4 Hierarchical Models and Mixture Distributions 162
4.5 Covariance and Correlation 169
4.6 Multivariate Distributions 177
4.7 Inequalities 186
4.7.1 Numerical Inequalities 186
4.7.2 Functional Inequalities 189
4.8 Exercises 192
4.9 Miscellanea 203

5 Properties of a Random Sample 207
5.1 Basic Concepts of Random Samples 207
5.2 Sums of Random Variables from a Random Sample 211
5.3 Sampling from the Normal Distribution 218
5.3.1 Properties of the Sample Mean and Variance 218
5.3.2 The Derived Distributions: Student's t and Snedecor's F 222
5.4 Order Statistics 226
5.5 Convergence Concepts 232
5.5.1 Convergence in Probability 232
5.5.2 Almost Sure Convergence 234
5.5.3 Convergence in Distribution 235
5.5.4 The Delta Method 240
5.6 Generating a Random Sample 245
5.6.1 Direct Methods 247
5.6.2 Indirect Methods 251
5.6.3 The Accept/Reject Algorithm 253
5.7 Exercises 255
5.8 Miscellanea 267

6 Principles of Data Reduction 271
6.1 Introduction 271
6.2 The Sufficiency Principle 272
6.2.1 Sufficient Statistics 272
6.2.2 Minimal Sufficient Statistics 279
6.2.3 Ancillary Statistics 282
6.2.4 Sufficient, Ancillary, and Complete Statistics 284
6.3 The Likelihood Principle 290
6.3.1 The Likelihood Function 290
6.3.2 The Formal Likelihood Principle 292
6.4 The Equivariance Principle 296
6.5 Exercises 300
6.6 Miscellanea 307

7 Point Estimation 311
7.1 Introduction 311
7.2 Methods of Finding Estimators 312
7.2.1 Method of Moments 312
7.2.2 Maximum Likelihood Estimators 315
7.2.3 Bayes Estimators 324
7.2.4 The EM Algorithm 326
7.3 Methods of Evaluating Estimators 330
7.3.1 Mean Squared Error 330
7.3.2 Best Unbiased Estimators 334
7.3.3 Sufficiency and Unbiasedness 342
7.3.4 Loss Function Optimality 348
7.4 Exercises 355
7.5 Miscellanea 367

8 Hypothesis Testing 373
8.1 Introduction 373
8.2 Methods of Finding Tests 374
8.2.1 Likelihood Ratio Tests 374
8.2.2 Bayesian Tests 379
8.2.3 Union-Intersection and Intersection-Union Tests 380
8.3 Methods of Evaluating Tests 382
8.3.1 Error Probabilities and the Power Function 382
8.3.2 Most Powerful Tests 387
8.3.3 Sizes of Union-Intersection and Intersection-Union Tests 394
8.3.4 p-Values 397
8.3.5 Loss Function Optimality 400
8.4 Exercises 402
8.5 Miscellanea 413

9 Interval Estimation 417
9.1 Introduction 417
9.2 Methods of Finding Interval Estimators 420
9.2.1 Inverting a Test Statistic 420
9.2.2 Pivotal Quantities 427
9.2.3 Pivoting the CDF 430
9.2.4 Bayesian Intervals 435
9.3 Methods of Evaluating Interval Estimators 440
9.3.1 Size and Coverage Probability 440
9.3.2 Test-Related Optimality 444
9.3.3 Bayesian Optimality 447
9.3.4 Loss Function Optimality 449
9.4 Exercises 451
9.5 Miscellanea 463

10 Asymptotic Evaluations 467
10.1 Point Estimation 467
10.1.1 Consistency 467
10.1.2 Efficiency 470
10.1.3 Calculations and Comparisons 473
10.1.4 Bootstrap Standard Errors 478
10.2 Robustness 481
10.2.1 The Mean and the Median 482
10.2.2 M-Estimators 484
10.3 Hypothesis Testing 488
10.3.1 Asymptotic Distribution of LRTs 488
10.3.2 Other Large-Sample Tests 492
10.4 Interval Estimation 496
10.4.1 Approximate Maximum Likelihood Intervals 496
10.4.2 Other Large-Sample Intervals 499
10.5 Exercises 504
10.6 Miscellanea 515

11 Analysis of Variance and Regression 521
11.1 Introduction 521
11.2 Oneway Analysis of Variance 522
11.2.1 Model and Distribution Assumptions 524
11.2.2 The Classic ANOVA Hypothesis 525
11.2.3 Inferences Regarding Linear Combinations of Means 527
11.2.4 The ANOVA F Test 530
11.2.5 Simultaneous Estimation of Contrasts 534
11.2.6 Partitioning Sums of Squares 536
11.3 Simple Linear Regression 539
11.3.1 Least Squares: A Mathematical Solution 542
11.3.2 Best Linear Unbiased Estimators: A Statistical Solution 544
11.3.3 Models and Distribution Assumptions 548
11.3.4 Estimation and Testing with Normal Errors 550
11.3.5 Estimation and Prediction at a Specified x = x₀ 557
11.3.6 Simultaneous Estimation and Confidence Bands 559
11.4 Exercises 563
11.5 Miscellanea 572

12 Regression Models 577
12.1 Introduction 577
12.2 Regression with Errors in Variables 577
12.2.1 Functional and Structural Relationships 579
12.2.2 A Least Squares Solution 581
12.2.3 Maximum Likelihood Estimation 583
12.2.4 Confidence Sets 588
12.3 Logistic Regression 591
12.3.1 The Model 591
12.3.2 Estimation 593
12.4 Robust Regression 597
12.5 Exercises 602
12.6 Miscellanea 608

Appendix: Computer Algebra 613
Table of Common Distributions 621
References 629
Author Index 645
Subject Index 649

List of Tables
1.2.4 Number of arrangements 16
4.1.1 Values of the joint pmf f(x, y) 141
7.3.1 Three estimators for a binomial p 354
Counts of leukemia cases 360
8.3.1 Two types of errors in hypothesis testing 383
9.2.1 Location-scale pivots 427
9.2.2 Sterne's acceptance region and confidence set 431
Three 90% normal confidence intervals 441
10.1.1 Bootstrap and Delta Method variances 480
10.2.1 Median/mean asymptotic relative efficiencies 484
Huber estimators 485
Huber estimator asymptotic relative efficiencies, k = 1.5 487
10.3.1 Poisson LRT statistic 490
Power of robust tests 497
10.4.1 Confidence coefficient for a pivotal interval 500
10.4.2 Confidence coefficients for intervals based on Huber's M-estimator 504
11.2.1 ANOVA table for oneway classification 538
11.3.1 Data pictured in Figure 11.3.1 542
11.3.2 ANOVA table for simple linear regression 556
12.3.1 Challenger data 594
Potoroo data 598
12.4.1 Regression M-estimator asymptotic relative efficiencies 601

List of Figures
Dart board for Example 1.2.7 19
Histogram of averages 30
Cdf of Example 1.5.2 32
Geometric cdf, p = .3
Area under logistic curve 36
Transformation of Example 2.1.2 49
Increasing and nondecreasing cdfs 54
Exponential densities 60
Two pdfs with the same moments 65
Poisson approximation to the binomial 68
Standard normal density 105
Normal approximation to the binomial 106
Beta densities 107
Symmetric beta densities 108
Standard normal density and Cauchy density 109
Lognormal and gamma pdfs 110
Location densities 117
Exponential location densities 118
Members of the same scale family 119
Location-scale families 120
Regions for Example 4.1.12 147
Regions for Examples 4.5.4 and 4.5.8 170
Regions for Example 4.5.9 175
Convex function 189
Jensen's Inequality 190
Region on which f_{R,V}(r, v) > 0 for Example 5.4.7 232
Histogram of exponential pdf 248
Histogram of Poisson sample variances 251
Beta distribution in accept/reject sampling 252
Binomial MSE comparison 333
Risk functions for variance estimators 351
LRT statistic 377
Power functions for Example 8.3.2 384
Power functions for Example 8.3.3 384
Power functions for three tests in Example 8.3.19 394
Risk function for test in Example 8.3.31 401
Confidence interval-acceptance region relationship 421
Acceptance region and confidence interval for Example 9.2.3 423
Credible and confidence intervals from Example 9.2.16 437
Credible probabilities of the intervals from Example 9.2.16 438
Coverage probabilities of the intervals from Example 9.2.16 439
Three interval estimators from Example 9.2.16 449
Asymptotic relative efficiency for gamma mean estimators 478
Poisson LRT histogram 490
LRT intervals for a binomial proportion 502
Coverage probabilities for nominal .9 binomial confidence procedures 503
Vertical distances that are measured by RSS 542
Geometric description of the BLUE 547
Scheffé bands, t interval, and Bonferroni intervals 562
Distance minimized by orthogonal least squares 581
Three regression lines 583
Creasy-Williams F statistic 590
Challenger data logistic curve 595
Least squares, LAD, and M-estimate fits 599

List of Examples
Event operations
Sigma algebra-I
Sigma algebra-II
Defining probabilities-I
Defining probabilities-II
Bonferroni's Inequality
Lottery-I
Tournament
Lottery-II
Poker
Sampling with replacement
Calculating an average
Four aces
Continuation of Example 1.3.1
‘Three prisoners
Coding
Chevalier de Meré
Tossing two dice
Letters
Three coin tosses—I
Random variables
‘Three coin tosses-II
Distribution of a random variable
Tossing three coins
Tossing for a head
Continuous cdf
Caf with jumps
Identically distributed random variables
Geometric probabilities
Logistic probabilities
Binomial transformation
Uniform transformation
Uniform-exponential relationship-I
Inverted gamma pdf
Noo
13
13
4
16
17
18
20
20
2
23
24
25
26
27
28
28
29
30
31
32
33
33
34
36
48
49
51
51247
219
2.2.2
2.2.3
2.24
226
2.2.7
2.3.3
2.3.5
2.3.8
2.3.13
24.5
2.4.6
2.4.7
24.9
3.2.1
3.2.3
3.2.4
3.2.5
3.2.6
3.2.7
3.3.1
3.3.2
3.4.1
3.4.3
3.44
3.4.6
3.4.8
3.49
3.5.3
3.6.2
3.6.3
3.6.6
3.6.9
4.1.2
414
415
4a7
418
419
41.11
LIST OF EXAMPLES
‘Square transformation
Normal-chi squared relationship
Exponential mean
Binomial mean
Cauchy mean
Minimizing distance
Uniform-exponential relationship-I
Exponential variance
Binomial variance
Gamma mgf
Binomial mgf
Nonunique moments
Poisson approximation
Interchanging integration and differentiation-I
Interchanging integration and differentiation-II
Interchanging summation and differentiation
Continuation of Example 2.4.7
Acceptance sampling
Dice probabilities
Waiting time
Poisson approximation
Inverse binomial sampling
Failure times
Gamma-Poisson relationship
Normal approximation
Binomial exponentia) family
Binomial mean and variance
Normal exponential family
Continuation of Example 3.4.4
‘A curved exponential family
Normal approximations
Exponential location family
Mlustrating Chebychev
A normal probability inequality
Higher-order normal moments
Higher-order Poisson moments
Sample space for dice
Continuation of Example 4.1.2
Joint pmf for dice
Marginal pmé for dice
Dice probabilities
Same marginals, different joint pmf
Calculating joint probabilities
52
53
55
56
56
58
BB
59
61
63
64
64
66
1
72
73
4
88
91
93
94
96
98
100
105
111
112
113
44
115
115
118
122
123
125
127
140
141
142
143
144
144
4541.12
4.2.2
4.2.4
4.2.6
4.2.8
4.2.9
4.2.11
4.2.13
43.1
4.3.3
43.4
43.6
441
44.2
44.5
44.6
44.8
454
45.8
45.9
46.1
4.6.3
46.8
46.13
47A
47.8
5.1.2
5.1.3
5.2.8
5.2.10
5.2.12
5.3.5
5.3.7
5.4.5
5.4.7
5.5.3
5.5.5
5.5.7
5.5.8
5.5.11
5.5.16
5.5.18
5.5.19
5.5.22
LIST OF EXAMPLES
Calculating joint probabilities-TT
Calculating conditional probabilities
Calculating conditional pdfs
Checking independence-I
Checking independence-IT
Joint probability model
Expectations of independent variables
‘Megf of a sum of normal variables
Distribution of the sum of Poisson variables
Distribution of the product of beta varizbles
Sum and difference of normal variables
Distribution of the ratio of normal variables
Binomial-Poisson hierarchy
Continuation of Example 4.4.1
Generalization of Example 4.4.1
Beta-binomial hierarchy
Continuation of Example 4.4.6
Correlation-I
Correlation-II
Correlation-IIT
Multivariate pdfs
Multivariate pmf
Mef of a sum of gamma variables
Multivariate change of variables
Covariance inequality
‘An inequality for means
Sample pdf-exponential
Finite population model
Distribution of the mean
Sum of Cauchy random variables
Sum of Bemoulli random variables
Variance ratio distribution
Continuation of Example 5.3.5
Uniform order statistic pdf
Distribution of the midrange and range
Consistency of S?
Consistency of
Almost sure convergence
Convergence in probability, not almost surely
Maximum of uniforms
Normal approximation to the negative binomial
Normal approximation with estimated variance
Estimating the odds
Continuation of Example 5.5.19
wea
146
148
150
152
153
154
155
156
157
158
159
162
163
163
165
167
168
170
173
174
178
181
183
185
188
191
208
210
215
216
217
224
225
230
231
233
233
234
234
235
239
240
240
242rodv,
LIST OF EXAMPLES
5.5.23 Approximate mean and variance
5.5.25 Continuation of Example 5.5.23
5.5.27 Moments of a ratio estimator
5.6.1 Exponential lifetime
5.6.2 Continuation of Example 5.6.1
5.6.3 Probability Integral Transform
5.6.4 Box-Muller algorithm
5.6.5 Binomial random variable generation
5.6.6 Distribution of the Poisson variance
5.6.7 Beta random variable generation-I
5.6.9 Beta random variable generation-II
Binomial sufficient statistic
Normal sufficient statistic
Sufficient order statistics
Continuation of Example 6.2.4
Uniform sufficient statistic
Normal sufficient statistic, both parameters unknown
Two normal sufficient statistics
Normal minimal sufficient statistic
Uniform minimal sufficient statistic
Uniform ancillary statistic
Location family ancillary statistic
Scale family ancillary statistic
Ancillary precision
Binomial complete sufficient statistic
Uniform complete sufficient statistic
Using Basu's Theorem-I
Using Basu's Theorem-II
Negative binomial likelihood
Normal fiducial distribution
Evidence function
Binomial/negative binomial experiment
Continuation of Example 6.3.5
Binomial equivariance
Continuation of Example 6.4.1
Conclusion of Example 6.4.1
Normal location invariance
Normal method of moments
Binomial method of moments
Satterthwaite approximation
Normal likelihood
Continuation of Example 7.2.5
Bernoulli MLE
Restricted range MLE
Binomial MLE, unknown number of trials
Normal MLEs, μ and σ unknown
Continuation of Example 7.2.11
Continuation of Example 7.2.2
Binomial Bayes estimation
Normal Bayes estimators
Multiple Poisson rates
Continuation of Example 7.2.17
Conclusion of Example 7.2.17
Normal MSE
Continuation of Example 7.3.3
MSE of binomial Bayes estimator
MSE of equivariant estimators
Poisson unbiased estimation
Conclusion of Example 7.3.8
Unbiased estimator for the scale uniform
Normal variance bound
Continuation of Example 7.3.14
Conditioning on an insufficient statistic
Unbiased estimators of zero
Continuation of Example 7.3.13
Binomial best unbiased estimation
Binomial risk functions
Risk of normal variance
Variance estimation using Stein's loss
Two Bayes rules
Normal Bayes estimates
Binomial Bayes estimates
Normal LRT
Exponential LRT
LRT and sufficiency
Normal LRT with unknown variance
Normal Bayesian test
Normal union-intersection test
Acceptance sampling
Binomial power function
Normal power function
Continuation of Example 8.3.3
Size of LRT
Size of union-intersection test
Conclusion of Example 8.3.3
UMP binomial test
UMP normal test
8.3.18 Continuation of Example 8.3.15
8.3.19 Nonexistence of UMP test
8.3.20 Unbiased test
8.3.22 An equivalence
8.3.25 Intersection-union test
8.3.28 Two-sided normal p-value
8.3.29 One-sided normal p-value
8.3.30 Fisher's Exact Test
8.3.31 Risk of UMP test
9.1.2 Interval estimator
9.1.3 Continuation of Example 9.1.2
9.1.6 Scale uniform interval estimator
9.2.1 Inverting a normal test
9.2.3 Inverting an LRT
9.2.4 Normal one-sided confidence bound
9.2.5 Binomial one-sided confidence bound
9.2.7 Location-scale pivots
9.2.8 Gamma pivot
9.2.9 Continuation of Example 9.2.8
9.2.10 Normal pivotal interval
9.2.11 Shortest length binomial set
9.2.13 Location exponential interval
9.2.15 Poisson interval estimator
9.2.16 Poisson credible set
9.2.17 Poisson credible and coverage probabilities
9.2.18 Coverage of a normal credible set
9.3.1 Optimizing length
9.3.3 Optimizing expected length
9.3.4 Shortest pivotal interval
9.3.6 UMA confidence bound
9.3.8 Continuation of Example 9.3.6
9.3.11 Poisson HPD region
9.3.12 Normal HPD region
9.3.13 Normal interval estimator
10.1.2 Consistency of X̄
10.1.4 Continuation of Example 10.1.2
10.1.8 Limiting variances
10.1.10 Large-sample mixture variances
10.1.13 Asymptotic normality and consistency
10.1.14 Approximate binomial variance
10.1.15 Continuation of Example 10.1.14
10.1.17 AREs of Poisson estimators
10.1.18 Estimating a gamma mean
10.1.19 Bootstrapping a variance
10.1.20 Bootstrapping a binomial variance
10.1.21 Conclusion of Example 10.1.20
10.1.22 Parametric bootstrap
10.2.1 Robustness of the sample mean
10.2.3 Asymptotic normality of the median
10.2.4 AREs of the median to the mean
10.2.5 Huber estimator
10.2.6 Limit distribution of the Huber estimator
10.2.7 ARE of the Huber estimator
10.3.2 Poisson LRT
10.3.4 Multinomial LRT
10.3.5 Large-sample binomial tests
Binomial score test
Tests based on the Huber estimator
Continuation of Example 10.1.14
Binomial score interval
Binomial LRT interval
Approximate interval
Approximate Poisson interval
More on the binomial score interval
Comparison of binomial intervals
Intervals based on the Huber estimator
Negative binomial interval
Influence functions of the mean and median
Oneway ANOVA
The ANOVA hypothesis
ANOVA contrasts
Pairwise differences
Continuation of Example 11.2.1
Predicting grape crops
Continuation of Example 11.3.1
Estimating atmospheric pressure
Challenger data
Challenger data continued
Robustness of least squares estimates
Catastrophic observations
Asymptotic normality of the LAD estimator
Regression M-estimator
Simulation of regression AREs
Unordered sampling
Univariate transformation
Bivariate transformations
A.0.4 Normal probability
A.0.5 Density of a sum
A.0.6 Fourth moment of sum of uniforms
A.0.7 ARE for a gamma mean
A.0.8 Limit of chi squared mgfs

Chapter 1
Probability Theory
“You can, for example, never foretell what any one man will do, but you can
say with precision what an average number will be up to. Individuals vary, but
percentages remain constant. So says the statistician.”
Sherlock Holmes
The Sign of Four
The subject of probability theory is the foundation upon which all of statistics is
built, providing a means for modeling populations, experiments, or almost anything
else that could be considered a random phenomenon. Through these models, statisti-
cians are able to draw inferences about populations, inferences based on examination
of only a part of the whole.
The theory of probability has a long and rich history, dating back at least to the
seventeenth century when, at the request of their friend, the Chevalier de Méré, Pascal
and Fermat developed a mathematical formulation of gambling odds.
The aim of this chapter is not to give a thorough introduction to probability theory;
such an attempt would be foolhardy in so short a space. Rather, we attempt to outline
some of the basic ideas of probability theory that are fundamental to the study of
statistics.
Just as statistics builds upon the foundation of probability theory, probability the-
ory in turn builds upon set theory, which is where we begin.
1.1 Set Theory
One of the main objectives of a statistician is to draw conclusions about a population
of objects by conducting an experiment. The first step in this endeavor is to identify
the possible outcomes or, in statistical terminology, the sample space.
Definition 1.1.1 The set, S, of all possible outcomes of a particular experiment is
called the sample space for the experiment.
If the experiment consists of tossing a coin, the sample space contains two outcomes,
heads and tails; thus,
S = {H, T}.
If, on the other hand, the experiment consists of observing the reported SAT scores
of randomly selected students at a certain university, the sample space would be the
set of all scores between 200 and 800 that are multiples of ten—that is,
S = {200, 210, 220, ..., 780, 790, 800}. Finally, consider an experiment where the
observation is reaction time to a certain stimulus. Here, the sample space would
consist of all positive numbers, that is, S = (0, ∞).
We can classify sample spaces into two types according to the number of elements
they contain. Sample spaces can be either countable or uncountable; if the elements of
a sample space can be put into 1-1 correspondence with a subset of the integers, the
sample space is countable. Of course, if the sample space contains only a finite number
of elements, it is countable. Thus, the coin-toss and SAT score sample spaces are both
countable (in fact, finite), whereas the reaction time sample space is uncountable, since
the positive real numbers cannot be put into 1-1 correspondence with the integers.
If, however, we measured reaction time to the nearest second, then the sample space
would be (in seconds) S = {0, 1, 2, 3, ...}, which is then countable.
This distinction between countable and uncountable sample spaces is important
only in that it dictates the way in which probabilities can be assigned. For the most
part, this causes no problems, although the mathematical treatment of the situations
is different. On a philosophical level, it might be argued that there can only be count-
able sample spaces, since measurements cannot be made with infinite accuracy. (A
sample space consisting of, say, all ten-digit numbers is a countable sample space.)
While in practice this is true, probabilistic and statistical methods associated with
uncountable sample spaces are, in general, less cumbersome than those for countable
sample spaces, and provide a close approximation to the true (countable) situation.
Once the sample space has been defined, we are in a position to consider collections
of possible outcomes of an experiment.
Definition 1.1.2 An event is any collection of possible outcomes of an experiment,
that is, any subset of S (including S itself).
Let A be an event, a subset of S. We say the event A occurs if the outcome of the
experiment is in the set A. When speaking of probabilities, we generally speak of the
probability of an event, rather than a set, but we may use the terms interchangeably.
We first need to define formally the following two relationships, which allow us to
order and equate sets:
A ⊂ B ⟺ x ∈ A ⟹ x ∈ B; (containment)
A = B ⟺ A ⊂ B and B ⊂ A. (equality)
Given any two events (or sets) A and B, we have the following elementary set
operations:
Union: The union of A and B, written AU B, is the set of elements that belong to
either A or B or both:
A ∪ B = {x : x ∈ A or x ∈ B}.
Intersection: The intersection of A and B, written A ∩ B, is the set of elements that
belong to both A and B:
A ∩ B = {x : x ∈ A and x ∈ B}.
Complementation: The complement of A, written Aᶜ, is the set of all elements
that are not in A:
Aᶜ = {x : x ∉ A}.
Example 1.1.3 (Event operations) Consider the experiment of selecting a card
at random from a standard deck and noting its suit: clubs (C), diamonds (D), hearts
(H), or spades (S). The sample space is
S = {C, D, H, S},
and some possible events are
A = {C, D} and B = {D, H, S}.
From these events we can form
A ∪ B = {C, D, H, S}, A ∩ B = {D}, and Aᶜ = {H, S}.
Furthermore, notice that A ∪ B = S (the event S) and (A ∪ B)ᶜ = ∅, where ∅ denotes
the empty set (the set consisting of no elements). ‖
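The event operations in Example 1.1.3 map directly onto Python's built-in set type; the following sketch (illustrative code, not part of the text) verifies the calculations above:

```python
# Example 1.1.3 in code: suits of a randomly drawn card.
S = {"C", "D", "H", "S"}   # sample space
A = {"C", "D"}             # event A
B = {"D", "H", "S"}        # event B

assert A | B == {"C", "D", "H", "S"}        # union A ∪ B
assert A & B == {"D"}                       # intersection A ∩ B
assert S - A == {"H", "S"}                  # complement A^c, relative to S
assert A | B == S and S - (A | B) == set()  # (A ∪ B)^c = ∅
```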
The elementary set operations can be combined, somewhat akin to the way addition
and multiplication can be combined. As long as we are careful, we can treat sets as if
they were numbers. We can now state the following useful properties of set operations.
Theorem 1.1.4 For any three events, A, B, and C, defined on a sample space S,
a. Commutativity A ∪ B = B ∪ A,
                 A ∩ B = B ∩ A;
b. Associativity A ∪ (B ∪ C) = (A ∪ B) ∪ C,
                 A ∩ (B ∩ C) = (A ∩ B) ∩ C;
c. Distributive Laws A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C),
                     A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
d. DeMorgan's Laws (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ,
                   (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ.
Proof: The proof of much of this theorem is left as Exercise 1.3. Also, Exercises 1.9
and 1.10 generalize the theorem. To illustrate the technique, however, we will prove
the Distributive Law:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).
(You might be familiar with the use of Venn diagrams to "prove" theorems in set
theory. We caution that although Venn diagrams are sometimes helpful in visualizing
a situation, they do not constitute a formal proof.) To prove that two sets are equal,
it must be demonstrated that each set contains the other. Formally, then
A ∩ (B ∪ C) = {x ∈ S : x ∈ A and x ∈ (B ∪ C)};
(A ∩ B) ∪ (A ∩ C) = {x ∈ S : x ∈ (A ∩ B) or x ∈ (A ∩ C)}.
We first show that A ∩ (B ∪ C) ⊂ (A ∩ B) ∪ (A ∩ C). Let x ∈ (A ∩ (B ∪ C)). By the
definition of intersection, it must be that x ∈ (B ∪ C), that is, either x ∈ B or x ∈ C.
Since x also must be in A, we have that either x ∈ (A ∩ B) or x ∈ (A ∩ C); therefore,
x ∈ ((A ∩ B) ∪ (A ∩ C)),
and the containment is established.
Now assume x ∈ ((A ∩ B) ∪ (A ∩ C)). This implies that x ∈ (A ∩ B) or x ∈ (A ∩ C).
If x ∈ (A ∩ B), then x is in both A and B. Since x ∈ B, x ∈ (B ∪ C) and thus
x ∈ (A ∩ (B ∪ C)). If, on the other hand, x ∈ (A ∩ C), the argument is similar, and we
again conclude that x ∈ (A ∩ (B ∪ C)). Thus, we have established (A ∩ B) ∪ (A ∩ C) ⊂
A ∩ (B ∪ C), showing containment in the other direction and, hence, proving the
Distributive Law. □
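Identities like the Distributive Law and DeMorgan's Laws are easy to spot-check numerically on a small finite sample space. A Python sketch (illustrative, not from the text) that tests them on randomly generated events:

```python
import random

random.seed(0)
S = set(range(10))  # a small finite sample space

def random_event():
    """A random subset of S, each point included with probability 1/2."""
    return {x for x in S if random.random() < 0.5}

for _ in range(1000):
    A, B, C = random_event(), random_event(), random_event()
    # Distributive Laws
    assert A & (B | C) == (A & B) | (A & C)
    assert A | (B & C) == (A | B) & (A | C)
    # DeMorgan's Laws (complements taken relative to S)
    assert S - (A | B) == (S - A) & (S - B)
    assert S - (A & B) == (S - A) | (S - B)
```

A passing run is not a proof, of course; the formal argument above is what establishes the identities.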
The operations of union and intersection can be extended to infinite collections of
sets as well. If A₁, A₂, A₃, ... is a collection of sets, all defined on a sample space S,
then
⋃_{i=1}^∞ Aᵢ = {x ∈ S : x ∈ Aᵢ for some i},
⋂_{i=1}^∞ Aᵢ = {x ∈ S : x ∈ Aᵢ for all i}.
For example, let S = (0, 1] and define Aᵢ = [(1/i), 1]. Then
⋃_{i=1}^∞ Aᵢ = ⋃_{i=1}^∞ [(1/i), 1] = {x ∈ (0, 1] : x ∈ [(1/i), 1] for some i}
             = {x ∈ (0, 1] : x > 0} = (0, 1];
⋂_{i=1}^∞ Aᵢ = ⋂_{i=1}^∞ [(1/i), 1] = {x ∈ (0, 1] : x ∈ [(1/i), 1] for all i}
             = {1}. (the point 1)
It is also possible to define unions and intersections over uncountable collections of
sets. If Γ is an index set (a set of elements to be used as indices), then
⋃_{α∈Γ} A_α = {x ∈ S : x ∈ A_α for some α},
⋂_{α∈Γ} A_α = {x ∈ S : x ∈ A_α for all α}.
If, for example, we take Γ = {all positive real numbers} and A_a = (0, a], then
⋃_{a∈Γ} A_a = (0, ∞) is an uncountable union. While uncountable unions and intersec-
tions do not play a major role in statistics, they sometimes provide a useful mechanism
for obtaining an answer (see Section 8.2.3).
Finally, we discuss the idea of a partition of the sample space.
Definition 1.1.5 Two events A and B are disjoint (or mutually exclusive) if A ∩ B =
∅. The events A₁, A₂, ... are pairwise disjoint (or mutually exclusive) if Aᵢ ∩ Aⱼ = ∅
for all i ≠ j.
Disjoint sets are sets with no points in common. If we draw a Venn diagram for
two disjoint sets, the sets do not overlap. The collection
Aᵢ = [i, i + 1), i = 0, 1, 2, ...,
consists of pairwise disjoint sets. Note further that ⋃_{i=0}^∞ Aᵢ = [0, ∞).
Definition 1.1.6 If A₁, A₂, ... are pairwise disjoint and ⋃_{i=1}^∞ Aᵢ = S, then the
collection A₁, A₂, ... forms a partition of S.
The sets Aᵢ = [i, i + 1) form a partition of [0, ∞). In general, partitions are very
useful, allowing us to divide the sample space into small, nonoverlapping pieces.
1.2 Basics of Probability Theory
When an experiment is performed, the realization of the experiment is an outcome in
the sample space. If the experiment is performed a number of times, different outcomes
may occur each time or some outcomes may repeat. This “frequency of occurrence” of
an outcome can be thought of as a probability. More probable outcomes occur more
frequently. If the outcomes of an experiment can be described probabilistically, we
are on our way to analyzing the experiment statistically.
In this section we describe some of the basics of probability theory. We do not define
probabilities in terms of frequencies but instead take the mathematically simpler
axiomatic approach. As will be seen, the axiomatic approach is not concerned with
the interpretations of probabilities, but is concerned only that the probabilities are
defined by a function satisfying the axioms. Interpretations of the probabilities are
quite another matter. The “frequency of occurrence” of an event is one example of a
particular interpretation of probability. Another possible interpretation is a subjective
one, where rather than thinking of probability as frequency, we can think of it as
belief in the chance of an event occurring.
1.2.1 Axiomatic Foundations
For each event A in the sample space S we want to associate with A a number
between zero and one that will be called the probability of A, denoted by P(A). It
would seem natural to define the domain of P (the set where the arguments of the
function P(-) are defined) as all subsets of S; that is, for each A C S we define P(A)
as the probability that A occurs. Unfortunately, matters are not that simple. There
are some technical difficulties to overcome. We will not dwell on these technicalities;
although they are of importance, they are usually of more interest to probabilists
than to statisticians. However, a firm understanding of statistics requires at least a
passing familiarity with the following.
Definition 1.2.1 A collection of subsets of S is called a sigma algebra (or Borel
field), denoted by B, if it satisfies the following three properties:
a. ∅ ∈ B (the empty set is an element of B).
b. If A ∈ B, then Aᶜ ∈ B (B is closed under complementation).
c. If A₁, A₂, ... ∈ B, then ⋃_{i=1}^∞ Aᵢ ∈ B (B is closed under countable unions).
The empty set ∅ is a subset of any set. Thus, ∅ ⊂ S. Property (a) states that this
subset is always in a sigma algebra. Since S = ∅ᶜ, properties (a) and (b) imply that
S is always in B also. In addition, from DeMorgan's Laws it follows that B is closed
under countable intersections. If A₁, A₂, ... ∈ B, then A₁ᶜ, A₂ᶜ, ... ∈ B by property (b),
and therefore ⋃_{i=1}^∞ Aᵢᶜ ∈ B. However, using DeMorgan's Law (as in Exercise 1.9), we
have
(1.2.1) (⋃_{i=1}^∞ Aᵢᶜ)ᶜ = ⋂_{i=1}^∞ Aᵢ.
Thus, again by property (b), ⋂_{i=1}^∞ Aᵢ ∈ B.
Associated with sample space S we can have many different sigma algebras. For
example, the collection of the two sets {0, S} is a sigma algebra, usually called the
trivial sigma algebra. The only sigma algebra we will be concerned with is the smallest
one that contains all of the open sets in a given sample space S.
Example 1.2.2 (Sigma algebra-I) If S is finite or countable, then these techni-
calities really do not arise, for we define for a given sample space S,
B = {all subsets of S, including S itself}
If S has n elements, there are 2ⁿ sets in B (see Exercise 1.14). For example, if S =
{1, 2, 3}, then B is the following collection of 2³ = 8 sets:
{1}    {1, 2}    {1, 2, 3}
{2}    {1, 3}    ∅
{3}    {2, 3}
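The claim that a finite S with n elements yields 2ⁿ sets in B can be checked by enumerating the power set; a Python sketch (illustrative, following Example 1.2.2):

```python
from itertools import combinations

def sigma_algebra(S):
    """For finite S, B = all subsets of S (the convention of Example 1.2.2)."""
    elems = sorted(S)
    return [set(c) for r in range(len(elems) + 1)
            for c in combinations(elems, r)]

S = {1, 2, 3}
B = sigma_algebra(S)
assert len(B) == 2 ** len(S)         # 2^3 = 8 sets
assert set() in B and S in B         # contains both ∅ and S
assert all((S - A) in B for A in B)  # closed under complementation
```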
In general, if S is uncountable, it is not an easy task to describe B. However, B is
chosen to contain any set of interest.
Example 1.2.3 (Sigma algebra-II) Let S = (−∞, ∞), the real line. Then B is
chosen to contain all sets of the form
[a, b], (a, b], (a, b), and [a, b)
for all real numbers a and b. Also, from the properties of B, it follows that B con-
tains all sets that can be formed by taking (possibly countably infinite) unions and
intersections of sets of the above varieties. ‖
We are now in a position to define a probability function.
Definition 1.2.4 Given a sample space S and an associated sigma algebra B, a
probability function is a function P with domain B that satisfies
1. P(A) ≥ 0 for all A ∈ B.
2. P(S) = 1.
3. If A₁, A₂, ... ∈ B are pairwise disjoint, then P(⋃_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ P(Aᵢ).
The three properties given in Definition 1.2.4 are usually referred to as the Axioms
of Probability (or the Kolmogorov Axioms, after A. Kolmogorov, one of the fathers of
probability theory). Any function P that satisfies the Axioms of Probability is called
a probability function. The axiomatic definition makes no attempt to tell what partic-
ular function P to choose; it merely requires P to satisfy the axioms. For any sample
space many different probability functions can be defined. Which one(s) reflects what
is likely to be observed in a particular experiment is still to be discussed.
Example 1.2.5 (Defining probabilities-I) Consider the simple experiment of
tossing a fair coin, so S = {H, T}. By a "fair" coin we mean a balanced coin that is
equally as likely to land heads up as tails up, and hence the reasonable probability
function is the one that assigns equal probabilities to heads and tails, that is,
(1.2.2) P({H}) = P({T}).
Note that (1.2.2) does not follow from the Axioms of Probability but rather is out-
side of the axioms. We have used a symmetry interpretation of probability (or just
intuition) to impose the requirement that heads and tails be equally probable. Since
S = {H} ∪ {T}, we have, from Axiom 1, P({H} ∪ {T}) = 1. Also, {H} and {T} are
disjoint, so P({H} ∪ {T}) = P({H}) + P({T}) and
(1.2.3) P({H}) + P({T}) = 1.
Simultaneously solving (1.2.2) and (1.2.3) shows that P({H}) = P({T}) = 1/2.
Since (1.2.2) is based on our knowledge of the particular experiment, not on the axioms,
any nonnegative values for P({H}) and P({T}) that satisfy (1.2.3) define a legitimate
probability function. For example, we might choose P({H}) = 1/9 and P({T}) = 8/9. ‖
We need general methods of defining probability functions that we know will always
satisfy Kolmogorov's Axioms. We do not want to have to check the Axioms for each
new probability function, like we did in Example 1.2.5. The following gives a common
method of defining a legitimate probability function.
Theorem 1.2.6 Let S = {s₁, ..., sₙ} be a finite set. Let B be any sigma algebra of
subsets of S. Let p₁, ..., pₙ be nonnegative numbers that sum to 1. For any A ∈ B,
define P(A) by
P(A) = Σ_{{i: sᵢ∈A}} pᵢ.
(The sum over an empty set is defined to be 0.) Then P is a probability function on
B. This remains true if S = {s₁, s₂, ...} is a countable set.
Proof: We will give the proof for finite S. For any A ∈ B, P(A) = Σ_{{i: sᵢ∈A}} pᵢ ≥ 0,
because every pᵢ ≥ 0. Thus, Axiom 1 is true. Now,
P(S) = Σ_{{i: sᵢ∈S}} pᵢ = Σ_{i=1}^n pᵢ = 1.
Thus, Axiom 2 is true. Let A₁, ..., Aₖ denote pairwise disjoint events. (B contains
only a finite number of sets, so we need consider only finite disjoint unions.) Then,
P(⋃_{i=1}^k Aᵢ) = Σ_{{j: sⱼ∈⋃Aᵢ}} pⱼ = Σ_{i=1}^k Σ_{{j: sⱼ∈Aᵢ}} pⱼ = Σ_{i=1}^k P(Aᵢ).
The first and third equalities are true by the definition of P(A). The disjointedness of
the Aᵢs ensures that the second equality is true, because the same pⱼs appear exactly
once on each side of the equality. Thus, Axiom 3 is true and Kolmogorov's Axioms
are satisfied. □
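Theorem 1.2.6 is concrete enough to code directly. The sketch below (illustrative, not from the text; exact arithmetic via fractions avoids rounding issues) builds P from nonnegative pᵢ summing to 1 and checks the axioms on a small finite S:

```python
from fractions import Fraction
from itertools import combinations

S = ["s1", "s2", "s3", "s4"]
p = {"s1": Fraction(1, 10), "s2": Fraction(2, 10),
     "s3": Fraction(3, 10), "s4": Fraction(4, 10)}  # nonnegative, sum to 1

def P(A):
    """P(A) = sum of p_i over the s_i in A (Theorem 1.2.6)."""
    return sum(p[s] for s in A)

# Axiom 1: P(A) >= 0 for every subset A of S.
assert all(P(set(c)) >= 0 for r in range(5) for c in combinations(S, r))
# Axiom 2: P(S) = 1.
assert P(S) == 1
# Additivity on disjoint events (finite case of Axiom 3).
A1, A2 = {"s1", "s3"}, {"s2"}
assert P(A1 | A2) == P(A1) + P(A2)
```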
The physical reality of the experiment might dictate the probability assignment, as
the next example illustrates.
Example 1.2.7 (Defining probabilities~II) The game of darts is played by
throwing a dart at a board and receiving a score corresponding to the number assigned
to the region in which the dart lands. For a novice player, it seems reasonable to
assume that the probability of the dart hitting a particular region is proportional to
the area of the region. Thus, a bigger region has a higher probability of being hit.
Referring to Figure 1.2.1, we see that the dart board has radius r and the distance
between rings is r/5. If we make the assumption that the board is always hit (see
Exercise 1.7 for a variation on this), then we have
P(scoring i points) = (Area of region i) / (Area of dart board).
For example,
P(scoring 1 point) = (πr² − π(4r/5)²) / (πr²).
It is easy to derive the general formula, and we find that
P(scoring i points) = ((6 − i)² − (5 − i)²) / 5²,  i = 1, ..., 5,
independent of π and r. The sum of the areas of the disjoint regions equals the area of
the dart board. Thus, the probabilities that have been assigned to the five outcomes
sum to 1, and, by Theorem 1.2.6, this is a probability function (see Exercise 1.8). ‖
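The dart board probabilities can be computed from the ring areas and checked against the closed form; a Python sketch (illustrative, not from the text):

```python
import math

r = 1.0  # board radius (the answer is independent of r)

def p_score(i):
    """P(scoring i points): area of ring i over area of the board."""
    outer = (6 - i) * r / 5  # outer radius of the region scoring i points
    inner = (5 - i) * r / 5  # inner radius (boundary of the next region in)
    return (math.pi * outer**2 - math.pi * inner**2) / (math.pi * r**2)

probs = [p_score(i) for i in range(1, 6)]
for i, pr in zip(range(1, 6), probs):
    closed_form = ((6 - i)**2 - (5 - i)**2) / 25  # the formula in the text
    assert abs(pr - closed_form) < 1e-12
assert abs(sum(probs) - 1.0) < 1e-12  # the five probabilities sum to 1
```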
Figure 1.2.1. Dart board for Example 1.2.7
Before we leave the axiomatic development of probability, there is one further point
to consider. Axiom 3 of Definition 1.2.4, which is commonly known as the Axiom of
Countable Additivity, is not universally accepted among statisticians. Indeed, it can
be argued that axioms should be simple, self-evident statements. Comparing Axiom 3
to the other axioms, which are simple and self-evident, may lead us to doubt whether
it is reasonable to assume the truth of Axiom 3.
The Axiom of Countable Additivity is rejected by a school of statisticians led
by deFinetti (1972), who chooses to replace this axiom with the Axiom of Finite
Additivity.
Axiom of Finite Additivity: If A ∈ B and B ∈ B are disjoint, then
P(A ∪ B) = P(A) + P(B).
While this axiom may not be entirely self-evident, it is certainly simpler than the
Axiom of Countable Additivity (and is implied by it — see Exercise 1.12).
Assuming only finite additivity, while perhaps more plausible, can lead to unexpected
complications in statistical theory — complications that, at this level, do not
necessarily enhance understanding of the subject. We therefore proceed under the
assumption that the Axiom of Countable Additivity holds.
1.2.2 The Calculus of Probabilities
From the Axioms of Probability we can build up many properties of the probability
function, properties that are quite helpful in the calculation of more complicated
probabilities. Some of these manipulations will be discussed in detail in this section;
others will be left as exercises.
We start with some (fairly self-evident) properties of the probability function when
applied to a single event.
Theorem 1.2.8 If P is a probability function and A is any set in B, then
a. P(∅) = 0, where ∅ is the empty set;
b. P(A) ≤ 1;
c. P(Aᶜ) = 1 − P(A).
Proof: It is easiest to prove (c) first. The sets A and Aᶜ form a partition of the
sample space, that is, S = A ∪ Aᶜ. Therefore,
(1.2.4) P(A ∪ Aᶜ) = P(S) = 1
by the second axiom. Also, A and Aᶜ are disjoint, so by the third axiom,
(1.2.5) P(A ∪ Aᶜ) = P(A) + P(Aᶜ).
Combining (1.2.4) and (1.2.5) gives (c).
Since P(Aᶜ) ≥ 0, (b) is immediately implied by (c). To prove (a), we use a similar
argument on S = S ∪ ∅. (Recall that both S and ∅ are always in B.) Since S and ∅
are disjoint, we have
1 = P(S) = P(S ∪ ∅) = P(S) + P(∅),
and thus P(∅) = 0. □
Theorem 1.2.8 contains properties that are so basic that they also have the fla-
vor of axioms, although we have formally proved them using only the original three
Kolmogorov Axioms. The next theorem, which is similar in spirit to Theorem 1.2.8,
contains statements that are not so self-evident.
Theorem 1.2.9 If P is a probability function and A and B are any sets in B, then
a. P(B ∩ Aᶜ) = P(B) − P(A ∩ B);
b. P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
c. If A ⊂ B, then P(A) ≤ P(B).
Proof: To establish (a) note that for any sets A and B we have
B = {B ∩ A} ∪ {B ∩ Aᶜ},
and therefore
(1.2.6) P(B) = P({B ∩ A} ∪ {B ∩ Aᶜ}) = P(B ∩ A) + P(B ∩ Aᶜ),
where the last equality in (1.2.6) follows from the fact that B ∩ A and B ∩ Aᶜ are
disjoint. Rearranging (1.2.6) gives (a).
To establish (b), we use the identity
(1.2.7) A ∪ B = A ∪ {B ∩ Aᶜ}.
A Venn diagram will show why (1.2.7) holds, although a formal proof is not difficult
(see Exercise 1.2). Using (1.2.7) and the fact that A and B ∩ Aᶜ are disjoint (since A
and Aᶜ are), we have
(1.2.8) P(A ∪ B) = P(A) + P(B ∩ Aᶜ) = P(A) + P(B) − P(A ∩ B)
from (a).
If A ⊂ B, then A ∩ B = A. Therefore, using (a) we have
0 ≤ P(B ∩ Aᶜ) = P(B) − P(A),
establishing (c). □
Formula (b) of Theorem 1.2.9 gives a useful inequality for the probability of an
intersection. Since P(A ∪ B) ≤ 1, we have from (b), after some rearranging,
(1.2.9) P(A ∩ B) ≥ P(A) + P(B) − 1.
This inequality is a special case of what is known as Bonferroni's Inequality (Miller
1981 is a good reference). Bonferroni's Inequality allows us to bound the probability of
a simultaneous event (the intersection) in terms of the probabilities of the individual
events.
Example 1.2.10 (Bonferroni's Inequality) Bonferroni's Inequality is partic-
ularly useful when it is difficult (or even impossible) to calculate the intersection
probability, but some idea of the size of this probability is desired. Suppose A and
B are two events and each has probability .95. Then the probability that both will
occur is bounded below by
P(AN B) > P(A) + P(B) —1=.95+.95-1=.90.
Note that unless the probabilities of the individual events are sufficiently large, the
Bonferroni bound is a useless (but correct!) negative number. ‖
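The .90 bound in Example 1.2.10 is attained by an explicit model; the following Python sketch (an illustration constructed here, not from the text) realizes two events of probability .95 on a 100-point equally likely sample space:

```python
# Equally likely sample space of 100 points.
S = set(range(100))
A = set(range(95))      # P(A) = .95
B = set(range(5, 100))  # P(B) = .95

def P(E):
    return len(E) / len(S)

# Bonferroni: P(A ∩ B) >= P(A) + P(B) - 1 = .90
assert P(A & B) >= P(A) + P(B) - 1
assert abs(P(A & B) - 0.90) < 1e-12  # here the bound is attained exactly
```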
We close this section with a theorem that gives some useful results for dealing with
a collection of sets.
Theorem 1.2.11 If P is a probability function, then
a. P(A) = Σ_{i=1}^∞ P(A ∩ Cᵢ) for any partition C₁, C₂, ...;
b. P(⋃_{i=1}^∞ Aᵢ) ≤ Σ_{i=1}^∞ P(Aᵢ) for any sets A₁, A₂, .... (Boole's Inequality)
Proof: Since C₁, C₂, ... form a partition, we have that Cᵢ ∩ Cⱼ = ∅ for all i ≠ j, and
S = ⋃_{i=1}^∞ Cᵢ. Hence,
A = A ∩ S = A ∩ (⋃_{i=1}^∞ Cᵢ) = ⋃_{i=1}^∞ (A ∩ Cᵢ),
where the last equality follows from the Distributive Law (Theorem 1.1.4). We there-
fore have
P(A) = P(⋃_{i=1}^∞ (A ∩ Cᵢ)).
Now, since the Cᵢ are disjoint, the sets A ∩ Cᵢ are also disjoint, and from the properties
of a probability function we have
P(⋃_{i=1}^∞ (A ∩ Cᵢ)) = Σ_{i=1}^∞ P(A ∩ Cᵢ),
establishing (a).
To establish (b) we first construct a disjoint collection A₁*, A₂*, ..., with the property
that ⋃_{i=1}^∞ Aᵢ* = ⋃_{i=1}^∞ Aᵢ. We define Aᵢ* by
A₁* = A₁,  Aᵢ* = Aᵢ \ (⋃_{j=1}^{i−1} Aⱼ),  i = 2, 3, ...,
where the notation A\B denotes the part of A that does not intersect with B. In more
familiar symbols, A\B = A ∩ Bᶜ. It should be easy to see that ⋃_{i=1}^∞ Aᵢ* = ⋃_{i=1}^∞ Aᵢ, and
we therefore have
P(⋃_{i=1}^∞ Aᵢ) = P(⋃_{i=1}^∞ Aᵢ*) = Σ_{i=1}^∞ P(Aᵢ*),
where the last equality follows since the Aᵢ* are disjoint. To see this, we write
Aᵢ* ∩ Aₖ* = {Aᵢ \ (⋃_{j=1}^{i−1} Aⱼ)} ∩ {Aₖ \ (⋃_{j=1}^{k−1} Aⱼ)}  (definition of Aᵢ*)
          = {Aᵢ ∩ (⋃_{j=1}^{i−1} Aⱼ)ᶜ} ∩ {Aₖ ∩ (⋃_{j=1}^{k−1} Aⱼ)ᶜ}  (definition of "\")
          = {Aᵢ ∩ (⋂_{j=1}^{i−1} Aⱼᶜ)} ∩ {Aₖ ∩ (⋂_{j=1}^{k−1} Aⱼᶜ)}.  (DeMorgan's Laws)
Now if i > k, the first intersection above will be contained in the set Aₖᶜ, which will
have an empty intersection with Aₖ. If k > i, the argument is similar. Further, by
construction Aᵢ* ⊂ Aᵢ, so P(Aᵢ*) ≤ P(Aᵢ) and we have
Σ_{i=1}^∞ P(Aᵢ*) ≤ Σ_{i=1}^∞ P(Aᵢ),
establishing (b). □
There is a similarity between Boole's Inequality and Bonferroni's Inequality. In
fact, they are essentially the same thing. We could have used Boole's Inequality to
derive (1.2.9). If we apply Boole's Inequality to Aᶜ, we have
P(⋃_{i=1}^n Aᵢᶜ) ≤ Σ_{i=1}^n P(Aᵢᶜ),
and using the facts that ⋃Aᵢᶜ = (⋂Aᵢ)ᶜ and P(Aᵢᶜ) = 1 − P(Aᵢ), we obtain
1 − P(⋂_{i=1}^n Aᵢ) ≤ n − Σ_{i=1}^n P(Aᵢ).
This becomes, on rearranging terms,
P(⋂_{i=1}^n Aᵢ) ≥ Σ_{i=1}^n P(Aᵢ) − (n − 1),
which is a more general version of the Bonferroni Inequality of (1.2.9).
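Both inequalities can be spot-checked on randomly generated events; a Python sketch (illustrative, not from the text):

```python
import random

random.seed(1)
S = list(range(50))
P = lambda E: len(E) / len(S)

for _ in range(500):
    # n = 4 random events, each point included with probability .7
    events = [{x for x in S if random.random() < 0.7} for _ in range(4)]
    union = set().union(*events)
    inter = set(S).intersection(*events)
    total = sum(P(E) for E in events)
    assert P(union) <= total + 1e-12                      # Boole's Inequality
    assert P(inter) >= total - (len(events) - 1) - 1e-12  # Bonferroni's Inequality
```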
1.2.3 Counting
The elementary process of counting can become quite sophisticated when placed in
the hands of a statistician. Most often, methods of counting are used in order to
construct probability assignments on finite sample spaces, although they can be used
to answer other questions also.
Example 1.2.12 (Lottery-I) For a number of years the New York state lottery
operated according to the following scheme. From the numbers 1, 2, ...,44, a person
may pick any six for her ticket. The winning number is then decided by randomly
selecting six numbers from the forty-four. To be able to calculate the probability of
winning we first must count how many different groups of six numbers can be chosen
from the forty-four. ‖
Example 1.2.13 (Tournament) In a single-elimination tournament, such as the
U.S. Open tennis tournament, players advance only if they win (in contrast to double-
elimination or round-robin tournaments). If we have 16 entrants, we might be inter-
ested in the number of paths a particular player can take to victory, where a path is
taken to mean a sequence of opponents. ‖
Counting problems, in general, sound complicated, and often we must do our count-
ing subject to many restrictions. The way to solve such problems is to break them
down into a series of simple tasks that are easy to count, and employ known rules
of combining tasks. The following theorem is a first step in such a process and is
sometimes known as the Fundamental Theorem of Counting.
Theorem 1.2.14 If a job consists of k separate tasks, the ith of which can be done
in nᵢ ways, i = 1, ..., k, then the entire job can be done in n₁ × n₂ × ⋯ × nₖ ways.
Proof: It suffices to prove the theorem for k = 2 (see Exercise 1.15). The proof is
just a matter of careful counting. The first task can be done in n₁ ways, and for each
of these ways we have n₂ choices for the second task. Thus, we can do the job in
(1 × n₂) + (1 × n₂) + ⋯ + (1 × n₂) = n₁ × n₂
(n₁ terms)
ways, establishing the theorem for k = 2. □
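The Fundamental Theorem of Counting is exactly what `itertools.product` enumerates; a quick Python sketch (illustrative):

```python
from itertools import product

task1 = ["a", "b", "c"]  # n1 = 3 ways to do the first task
task2 = [1, 2, 3, 4]     # n2 = 4 ways to do the second task

# Every (first, second) pairing is one way to do the whole job.
jobs = list(product(task1, task2))
assert len(jobs) == len(task1) * len(task2)  # 3 × 4 = 12
```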
Example 1.2.15 (Lottery-II) Although the Fundamental Theorem of Counting
is a reasonable place to start, in applications there are usually more aspects of a
problem to consider. For example, in the New York state lottery the first number
can be chosen in 44 ways, and the second number in 43 ways, making a total of
44 × 43 = 1,892 ways of choosing the first two numbers. However, if a person is
allowed to choose the same number twice, then the first two numbers can be chosen
in 44 × 44 = 1,936 ways. ‖
The distinction being made in Example 1.2.15 is between counting with replacement
and counting without replacement. There is a second crucial element in any counting
problem, whether or not the ordering of the tasks is important. To illustrate with the
lottery example, suppose the winning numbers are selected in the order 12, 37, 35, 9,
13, 22. Does a person who selected 9, 12, 13, 22, 35, 37 qualify as a winner? In other
words, does the order in which the task is performed actually matter? Taking all of
these considerations into account, we can construct a 2 x 2 table of possibilities:
Possible methods of counting
            Without replacement    With replacement
Ordered
Unordered
Before we begin to count, the following definition gives us some extremely helpful
notation.
Definition 1.2.16 For a positive integer n, n! (read n factorial) is the product of
all of the positive integers less than or equal to n. That is,
    n! = n × (n − 1) × (n − 2) × ··· × 3 × 2 × 1.

Furthermore, we define 0! = 1.
Let us now consider counting all of the possible lottery tickets under each of these
four cases.
1. Ordered, without replacement From the Fundamental Theorem of Counting, the
first number can be selected in 44 ways, the second in 43 ways, etc. So there are
    44 × 43 × 42 × 41 × 40 × 39 = 44!/38! = 5,082,517,440

possible tickets.
2. Ordered, with replacement Since each number can now be selected in 44 ways
(because the chosen number is replaced), there are
    44 × 44 × 44 × 44 × 44 × 44 = 44^6 = 7,256,313,856
possible tickets.
3. Unordered, without replacement We know the number of possible tickets when the
ordering must be accounted for, so what we must do is divide out the redundant
orderings. Again from the Fundamental Theorem, six numbers can be arranged in
6x5 x4x3x 2x1 ways, so the total number of unordered tickets is
    (44 × 43 × 42 × 41 × 40 × 39)/(6 × 5 × 4 × 3 × 2 × 1) = 44!/(6! 38!) = 7,059,052.
This form of counting plays a central role in much of statistics—so much, in fact,
that it has earned its own notation.
Definition 1.2.17 For nonnegative integers n and r, where n > r, we define the
symbol (n choose r), read "n choose r," as

    (n choose r) = n!/(r! (n − r)!).
In our lottery example, the number of possible tickets (unordered, without replace-
ment) is (44 choose 6). These numbers are also referred to as binomial coefficients, for reasons
that will become clear in Chapter 3.
4. Unordered, with replacement This is the most difficult case to count. You might first guess that the answer is 44^6/(6 × 5 × 4 × 3 × 2 × 1), but this is not correct (it
is too small).
To count in this case, it is easiest to think of placing 6 markers on the 44 numbers.
In fact, we can think of the 44 numbers defining bins in which we can place the six
markers, M, as shown, for example, in this figure.
[Figure: the 44 numbers laid out as bins, with the six markers M placed in some of the bins.]
The number of possible tickets is then equal to the number of ways that we can
put the 6 markers into the 44 bins. But this can be further reduced by noting that
all we need to keep track of is the arrangement of the markers and the walls of the bins. Note further that the two outermost walls play no part. Thus, we have to count all of the arrangements of 43 walls (44 bins yield 45 walls, but we disregard the two end walls) and 6 markers. We therefore have 43 + 6 = 49 objects, which
can be arranged in 49! ways. However, to eliminate the redundant orderings we
must divide by both 6! and 43!, so the total number of arrangements is
    49!/(6! 43!) = 13,983,816.
Although all of the preceding derivations were done in terms of an example, it
should be easy to see that they hold in general. For completeness, we can summarize
these situations in Table 1.2.1.
Table 1.2.1. Number of possible arrangements of size r from n objects

                Without          With
                replacement      replacement
    Ordered     n!/(n − r)!      n^r
    Unordered   (n choose r)     (n + r − 1 choose r)
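The four counts derived above for the lottery example can be reproduced with Python's standard combinatoric functions; this sketch is our illustration and assumes n = 44 and r = 6 as in the text.

```python
# Sketch of Table 1.2.1 for the lottery example (n = 44 numbers, r = 6 picks):
# the four counts for ordered/unordered, with/without replacement.
from math import comb, perm

n, r = 44, 6
ordered_without = perm(n, r)          # n!/(n-r)!   = 5,082,517,440
ordered_with = n ** r                 # n^r         = 7,256,313,856
unordered_without = comb(n, r)        # C(n, r)     = 7,059,052
unordered_with = comb(n + r - 1, r)   # C(n+r-1, r) = 13,983,816

print(ordered_without, ordered_with, unordered_without, unordered_with)
```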
1.2.4 Enumerating Outcomes
The counting techniques of the previous section are useful when the sample space
S is a finite set and all the outcomes in S are equally likely. Then probabilities of events can be calculated by simply counting the number of outcomes in the event. To see this, suppose that S = {s_1, ..., s_N} is a finite sample space. Saying that all the outcomes are equally likely means that P({s_i}) = 1/N for every outcome s_i. Then,
using Axiom 3 from Definition 1.2.4, we have, for any event A,
    P(A) = Σ_{s_i ∈ A} P({s_i}) = Σ_{s_i ∈ A} 1/N = (# of elements in A)/(# of elements in S).
For large sample spaces, the counting techniques might be used to calculate both
the numerator and denominator of this expression.
Example 1.2.18 (Poker) Consider choosing a five-card poker hand from a stan-
dard deck of 52 playing cards. Obviously, we are sampling without replacement from the deck. But to specify the possible outcomes (possible hands), we must decide
whether we think of the hand as being dealt sequentially (ordered) or all at once
(unordered). If we wish to calculate probabilities for events that depend on the or-
der, such as the probability of an ace in the first two cards, then we must use the
ordered outcomes. But if our events do not depend on the order, we can use the
unordered outcomes. For this example we use the unordered outcomes, so the sample
space consists of all the five-card hands that can be chosen from the 52-card deck.
There are (52 choose 5) = 2,598,960 possible hands. If the deck is well shuffled and the cards
are randomly dealt, it is reasonable to assign probability 1/2,598,960 to each possible
hand.
We now calculate some probabilities by counting outcomes in events. What is the
probability of having four aces? How many different hands are there with four aces? If
we specify that four of the cards are aces, then there are 48 different ways of specifying
the fifth card. Thus,

    P(four aces) = 48/2,598,960,

less than 1 chance in 50,000. Only slightly more complicated counting, using Theorem 1.2.14, allows us to calculate the probability of having four of a kind. There are 13
ways to specify which denomination there will be four of. After we specify these four
cards, there are 48 ways of specifying the fifth. Thus, the total number of hands with
four of a kind is (13)(48) and

    P(four of a kind) = (13)(48)/2,598,960 = 624/2,598,960.
To calculate the probability of exactly one pair (not two pairs, no three of a kind,
etc.) we combine some of the counting techniques. The number of hands with exactly
one pair is
(1.2.11)    13 × (4 choose 2) × (12 choose 3) × 4^3 = 1,098,240.
Expression (1.2.11) comes from Theorem 1.2.14 because
    13 = # of ways to specify the denomination for the pair,
    (4 choose 2) = # of ways to specify the two cards from that denomination,
    (12 choose 3) = # of ways of specifying the other three denominations,
    4^3 = # of ways of specifying the other three cards from those denominations.
Thus,

    P(exactly one pair) = 1,098,240/2,598,960.   ||
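The poker counts above can be verified directly; a minimal Python sketch (ours, not part of the text) using `math.comb`:

```python
# Recomputing the poker-hand counts of Example 1.2.18.
from math import comb

total = comb(52, 5)                                  # all five-card hands
four_aces = 48                                       # 48 choices of fifth card
four_kind = 13 * 48                                  # 13 denominations x 48
one_pair = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3    # expression (1.2.11)

print(total, four_aces, four_kind, one_pair)  # 2598960 48 624 1098240
```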
When sampling without replacement, as in Example 1.2.18, if we want to calculate
the probability of an event that does not depend on the order, we can use either
the ordered or unordered sample space. Each outcome in the unordered sample space
corresponds to r! outcomes in the ordered sample space. So, when counting outcomes
in the ordered sample space, we use a factor of r! in both the numerator and denom-
inator that will cancel to give the same probability as if we counted in the unordered
sample space.
The situation is different if we sample with replacement. Each outcome in the
unordered sample space corresponds to some outcomes in the ordered sample space,
but the number of outcomes differs.
Example 1.2.19 (Sampling with replacement) Consider sampling r = 2 items
from n = 3 items, with replacement. The outcomes in the ordered and unordered
sample spaces are these,
    Unordered    {1,1}  {2,2}  {3,3}  {1,2}        {1,3}        {2,3}
    Ordered      (1,1)  (2,2)  (3,3)  (1,2),(2,1)  (1,3),(3,1)  (2,3),(3,2)
    Probability  1/9    1/9    1/9    2/9          2/9          2/9
The probabilities come from considering the nine outcomes in the ordered sample
space to be equally likely. This corresponds to the common interpretation of “sampling
with replacement”; namely, one of the three items is chosen, each with probability 1/3;
the item is noted and replaced; the items are mixed and again one of the three items
is chosen, each with probability 1/3. It is seen that the six outcomes in the unordered
sample space are not equally likely under this kind of sampling. The formula for the
number of outcomes in the unordered sample space is useful for enumerating the
outcomes, but ordered outcomes must be counted to correctly calculate probabilities.
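The enumeration of Example 1.2.19 can be reproduced by brute force; the following Python sketch (our illustration) groups the nine equally likely ordered samples into unordered outcomes and recovers their probabilities.

```python
# Sampling r = 2 from n = 3 with replacement: group the 9 equally likely
# ordered samples by their unordered (sorted) form.
from itertools import product
from collections import Counter
from fractions import Fraction

ordered = list(product([1, 2, 3], repeat=2))            # 9 ordered samples
counts = Counter(tuple(sorted(p)) for p in ordered)     # 6 unordered outcomes
probs = {k: Fraction(v, len(ordered)) for k, v in counts.items()}

print(probs)  # {1,1},{2,2},{3,3} get 1/9; {1,2},{1,3},{2,3} get 2/9
```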
Some authors argue that it is appropriate to assign equal probabilities to the un-
ordered outcomes when “randomly distributing r indistinguishable balls into n dis-
tinguishable urns.” That is, an urn is chosen at random and a ball placed in it, and
this is repeated r times. The order in which the balls are placed is not recorded so,
in the end, an outcome such as {1,3} means one ball is in urn 1 and one ball is in
urn 3.
But here is the problem with this interpretation. Suppose two people observe this
process, and Observer 1 records the order in which the balls are placed but Observer 2
does not. Observer 1 will assign probability 2/9 to the event {1,3}. Observer 2,
who is observing exactly the same process, should also assign probability 2/9 to this
event. But if the six unordered outcomes are written on identical pieces of paper and
one is randomly chosen to determine the placement of the balls, then the unordered
outcomes each have probability 1/6. So Observer 2 will assign probability 1/6 to the
event {1,3}.
The confusion arises because the phrase "with replacement" will typically be inter-
preted with the sequential kind of sampling we described above, leading to assigning
a probability 2/9 to the event {1,3}. This is the correct way to proceed, as proba-
bilities should be determined by the sampling mechanism, not whether the balls are
distinguishable or indistinguishable.
Example 1.2.20 (Calculating an average) As an illustration of the distinguishable/indistinguishable approach, suppose that we are going to calculate all possible
averages of four numbers selected from
2,4,9,12
where we draw the numbers with replacement. For example, possible draws are
{2,4,4,9} with average 4.75 and {4,4,9,9} with average 6.5. If we are only interested in the average of the sampled numbers, the ordering is unimportant, and thus
the total number of distinct samples is obtained by counting according to unordered,
with-replacement sampling.
The total number of distinct samples is (n + r − 1 choose r) = (7 choose 4) = 35. But now, to calculate the probability distribution of the sampled averages, we must count the different ways that a particular average can occur.
The value 4.75 can occur only if the sample contains one 2, two 4s, and one 9.
The number of possible samples that have this configuration is given in the following
table:

[Figure 1.2.2. Histogram of averages of samples with replacement from the four numbers {2, 4, 9, 12}]
    Unordered     Ordered
    {2,4,4,9}     (2,4,4,9), (2,4,9,4), (2,9,4,4), (4,2,4,9),
                  (4,2,9,4), (4,4,2,9), (4,4,9,2), (4,9,2,4),
                  (4,9,4,2), (9,2,4,4), (9,4,2,4), (9,4,4,2)
The total number of ordered samples is n^r = 4^4 = 256, so the probability of drawing the unordered sample {2,4,4,9} is 12/256. Compare this to the probability that we would have obtained if we regarded the unordered samples as equally likely: we would have assigned probability 1/(n + r − 1 choose r) = 1/(7 choose 4) = 1/35 to {2,4,4,9} and to every other
unordered sample.
To count the number of ordered samples that would result in {2,4,4,9}, we argue as follows. We need to enumerate the possible orders of the four numbers {2,4,4,9}, so we are essentially using counting method 1 of Section 1.2.3. We can order the sample in 4 × 3 × 2 × 1 = 24 ways. But there is a bit of double counting here, since we cannot count distinct arrangements of the two 4s. For example, the 24 ways would count (9,4,2,4) twice (which would be OK if the 4s were different). To correct this, we divide by 2! (there are 2! ways to arrange the two 4s) and obtain 24/2 = 12 ordered samples. In general, if there are k places and we have m different numbers repeated k_1, k_2, ..., k_m times, then the number of ordered samples is k!/(k_1! k_2! ··· k_m!). This type of counting is related to the multinomial distribution, which we will see in Section 4.6. Figure 1.2.2 is a histogram of the probability distribution of the sample averages, reflecting the multinomial counting of the samples.
There is also one further refinement that is reflected in Figure 1.2.2. It is possible
that two different unordered samples will result in the same mean. For example, the
unordered samples {4, 4, 12,12} and {2,9, 9,12} both result in an average value of 8.
The first sample has probability 3/128 and the second has probability 3/64, giving the
value 8 a probability of 9/128 = .07. See Example A.0.1 in Appendix A for details on
constructing such a histogram. The calculation that we have done in this example is
an elementary version of a very important statistical technique known as the bootstrap
(Efron and Tibshirani 1993). We will return to the bootstrap in Section 10.1.4. ||
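The counts in this example can be checked by enumerating all 256 ordered samples; the following brute-force Python sketch is our illustration, not the text's bootstrap machinery.

```python
# Brute-force distribution of sample averages for Example 1.2.20: all 4^4
# ordered samples from {2, 4, 9, 12}, drawn with replacement, equally likely.
from itertools import product
from collections import Counter
from fractions import Fraction

values = [2, 4, 9, 12]
samples = list(product(values, repeat=4))       # 256 ordered samples
avg_counts = Counter(sum(s) / 4 for s in samples)

total = len(samples)
p_475 = Fraction(avg_counts[4.75], total)   # only {2,4,4,9} averages 4.75
p_8 = Fraction(avg_counts[8.0], total)      # {4,4,12,12} and {2,9,9,12}
print(p_475, p_8)  # 3/64 9/128
```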
1.3 Conditional Probability and Independence
All of the probabilities that we have dealt with thus far have been unconditional
probabilities. A sample space was defined and all probabilities were calculated with
respect to that sample space. In many instances, however, we are in a position to
update the sample space based on new information. In such cases, we want to be able
to update probability calculations or to calculate conditional probabilities.
Example 1.3.1 (Four aces) Four cards are dealt from the top of a well-shuffled
deck. What is the probability that they are the four aces? We can calculate this
probability by the methods of the previous section. The number of distinct groups of
four cards is

    (52 choose 4) = 270,725.

Only one of these groups consists of the four aces and every group is equally likely, so the probability of being dealt all four aces is 1/270,725.
We can also calculate this probability by an “updating” argument, as follows. The
probability that the first card is an ace is 4/52. Given that the first card is an ace,
the probability that the second card is an ace is 3/51 (there are 3 aces and 51 cards
left). Continuing this argument, we get the desired probability as
    (4/52) × (3/51) × (2/50) × (1/49) = 1/270,725.   ||
In our second method of solving the problem, we updated the sample space after
each draw of a card; we calculated conditional probabilities.
Definition 1.3.2 If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is

(1.3.1)    P(A|B) = P(A ∩ B)/P(B).
Note that what happens in the conditional probability calculation is that B becomes the sample space: P(B|B) = 1. The intuition is that our original sample space, S, has been updated to B. All further occurrences are then calibrated with respect to their relation to B. In particular, note what happens to conditional probabilities of disjoint sets. Suppose A and B are disjoint, so P(A ∩ B) = 0. It then follows that P(A|B) = P(B|A) = 0.
Example 1.3.3 (Continuation of Example 1.3.1) Although the probability of
getting all four aces is quite small, let us see how the conditional probabilities change
given that some aces have already been drawn. Four cards will again be dealt from a
well-shuffled deck, and we now calculate
    P(4 aces in 4 cards | i aces in i cards),   i = 1, 2, 3.
The event {4 aces in 4 cards} is a subset of the event {i aces in i cards}. Thus, from the definition of conditional probability, (1.3.1), we know that

    P(4 aces in 4 cards | i aces in i cards)
        = P({4 aces in 4 cards} ∩ {i aces in i cards})/P(i aces in i cards)
        = P(4 aces in 4 cards)/P(i aces in i cards).

The numerator has already been calculated, and the denominator can be calculated with a similar argument. The number of distinct groups of i cards is (52 choose i), and

    P(i aces in i cards) = (4 choose i)/(52 choose i).

Therefore, the conditional probability is given by

    P(4 aces in 4 cards | i aces in i cards) = (52 choose i)/((4 choose i)(52 choose 4)) = (4 − i)! 48!/(52 − i)!.

For i = 1, 2, and 3, the conditional probabilities are .00005, .00082, and .02041, respectively.   ||
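The three conditional probabilities can be checked numerically; a short Python sketch (our illustration):

```python
# P(4 aces in 4 cards | i aces in i cards) = P(4 aces) / P(i aces in i cards),
# as in Example 1.3.3.
from math import comb

p_four_aces = 1 / comb(52, 4)
cond = {}
for i in (1, 2, 3):
    p_i_aces = comb(4, i) / comb(52, i)   # (4 choose i)/(52 choose i)
    cond[i] = p_four_aces / p_i_aces

print({i: round(p, 5) for i, p in cond.items()})  # approx .00005 .00082 .02041
```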
For any B for which P(B) > 0, it is straightforward to verify that the probability
function P(·|B) satisfies Kolmogorov's Axioms (see Exercise 1.35). You may suspect
that requiring P(B) > 0 is redundant. Who would want to condition on an event of
probability 0? Interestingly, sometimes this is a particularly useful way of thinking of
things. However, we will defer these considerations until Chapter 4.
Conditional probabilities can be particularly slippery entities and sometimes require
careful thought. Consider the following often-told tale.
Example 1.3.4 (Three prisoners) Three prisoners, A, B, and C, are on death
row. The governor decides to pardon one of the three and chooses at random the
prisoner to pardon. He informs the warden of his choice but requests that the name
be kept secret for a few days.
The next day, A tries to get the warden to tell him who had been pardoned. The
warden refuses. A then asks which of B or C will be executed. The warden thinks for
a while, then tells A that B is to be executed.
Warden's reasoning: Each prisoner has a 1/3 chance of being pardoned. Clearly, either B or C must be executed, so I have given A no information about whether A will be pardoned.
A's reasoning: Given that B will be executed, then either A or C will be pardoned. My chance of being pardoned has risen to 1/2.
It should be clear that the warden’s reasoning is correct, but let us see why. Let
A, B, and C denote the events that A, B, or C is pardoned, respectively. We know that P(A) = P(B) = P(C) = 1/3. Let W denote the event that the warden says B will die. Using (1.3.1), A can update his probability of being pardoned to

    P(A|W) = P(A ∩ W)/P(W).
What is happening can be summarized in this table:

    Prisoner pardoned    Warden tells A
    A                    B dies  }  each with equal
    A                    C dies  }  probability
    B                    C dies
    C                    B dies
Using this table, we can calculate
    P(W) = P(warden says B dies)
         = P(warden says B dies and A pardoned)
           + P(warden says B dies and C pardoned)
           + P(warden says B dies and B pardoned)
         = 1/6 + 1/3 + 0 = 1/2.
Thus, using the warden's reasoning, we have

(1.3.2)    P(A|W) = P(A ∩ W)/P(W)
                  = P(warden says B dies and A pardoned)/P(warden says B dies)
                  = (1/6)/(1/2) = 1/3.
However, A falsely interprets the event W as equal to the event B^c and calculates

    P(A|B^c) = P(A ∩ B^c)/P(B^c) = (1/3)/(2/3) = 1/2.
We see that conditional probabilities can be quite slippery and require careful
interpretation. For some other variations of this problem, see Exercise 1.37. ll
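The warden's calculation can be reproduced by listing the four (pardoned, statement) outcomes with the probabilities from the table above; a Python sketch (our illustration):

```python
# The three-prisoners sample space of Example 1.3.4: outcomes are
# (pardoned, warden_says) pairs, with table probabilities.
from fractions import Fraction

outcomes = {
    ("A", "B dies"): Fraction(1, 6),
    ("A", "C dies"): Fraction(1, 6),
    ("B", "C dies"): Fraction(1, 3),
    ("C", "B dies"): Fraction(1, 3),
}

p_w = sum(p for (_, says), p in outcomes.items() if says == "B dies")
p_a_and_w = outcomes[("A", "B dies")]
print(p_w, p_a_and_w / p_w)  # 1/2 1/3 -- the warden is right
```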
Re-expressing (1.3.1) gives a useful form for calculating intersection probabilities,

(1.3.3)    P(A ∩ B) = P(A|B)P(B),

which is essentially the formula that was used in Example 1.3.1. We can take advantage of the symmetry of (1.3.3) and also write

(1.3.4)    P(A ∩ B) = P(B|A)P(A).
When faced with seemingly difficult calculations, we can break up our calculations
according to (1.3.3) or (1.3.4), whichever is easier. Furthermore, we can equate the
right-hand sides of these equations to obtain (after rearrangement)
(1.3.5)    P(A|B) = P(B|A) P(A)/P(B),
which gives us a formula for “turning around” conditional probabilities. Equation
(1.3.5) is often called Bayes’ Rule for its discoverer, Sir Thomas Bayes (although see
Stigler 1983).
Bayes’ Rule has a more general form than (1.3.5), one that applies to partitions of
a sample space. We therefore take the following as the definition of Bayes’ Rule.
Theorem 1.3.5 (Bayes' Rule) Let A_1, A_2, ... be a partition of the sample space, and let B be any set. Then, for each i = 1, 2, ...,

    P(A_i|B) = P(B|A_i)P(A_i) / Σ_j P(B|A_j)P(A_j).
Example 1.3.6 (Coding) When coded messages are sent, there are sometimes errors in transmission. In particular, Morse code uses "dots" and "dashes," which are known to occur in the proportion of 3:4. This means that for any given symbol,

    P(dot sent) = 3/7   and   P(dash sent) = 4/7.
Suppose there is interference on the transmission line, and with probability 1/8 a dot is mistakenly received as a dash, and vice versa. If we receive a dot, can we be sure
that a dot was sent? Using Bayes’ Rule, we can write
    P(dot sent | dot received) = P(dot received | dot sent) P(dot sent)/P(dot received).

Now, from the information given, we know that P(dot sent) = 3/7 and P(dot received | dot sent) = 7/8. Furthermore, we can also write
    P(dot received) = P(dot received ∩ dot sent) + P(dot received ∩ dash sent)
                    = P(dot received | dot sent)P(dot sent)
                      + P(dot received | dash sent)P(dash sent)
                    = (7/8)(3/7) + (1/8)(4/7) = 25/56.

Combining these results, we have that the probability of correctly receiving a dot is

    P(dot sent | dot received) = ((7/8)(3/7))/(25/56) = 21/25 = .84.   ||
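The same Bayes' Rule computation can be done in exact rational arithmetic; an illustrative sketch (ours, not part of the text):

```python
# Example 1.3.6 via Bayes' Rule, using exact fractions.
from fractions import Fraction

p_dot_sent = Fraction(3, 7)
p_dash_sent = Fraction(4, 7)
p_flip = Fraction(1, 8)          # probability a symbol is mis-received

p_dot_received = (1 - p_flip) * p_dot_sent + p_flip * p_dash_sent
p_dot_sent_given_received = (1 - p_flip) * p_dot_sent / p_dot_received
print(p_dot_received, p_dot_sent_given_received)  # 25/56 21/25
```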
In some cases it may happen that the occurrence of a particular event, B, has no
effect on the probability of another event, A. Symbolically, we are saying that
(1.3.6) P(A|B) = P(A).
If this holds, then by Bayes' Rule (1.3.5) and using (1.3.6) we have

(1.3.7)    P(B|A) = P(A|B) P(B)/P(A) = P(A) P(B)/P(A) = P(B),

so the occurrence of A has no effect on B. Moreover, since P(B|A)P(A) = P(A ∩ B), it then follows that

    P(A ∩ B) = P(A)P(B),

which we take as the definition of statistical independence.
Definition 1.3.7 Two events, A and B, are statistically independent if
(1.3.8)    P(A ∩ B) = P(A)P(B).
Note that independence could have been equivalently defined by either (1.3.6) or
(1.3.7) (as long as either P(A) > 0 or P(B) > 0). The advantage of (1.3.8) is that
it treats the events symmetrically and will be easier to generalize to more than two
events.
Many gambling games provide models of independent events. The spins of a roulette
wheel and the tosses of a pair of dice are both series of independent events.
Example 1.3.8 (Chevalier de Meré) The gambler introduced at the start of the
chapter, the Chevalier de Meré, was particularly interested in the event that he could
throw at least 1 six in 4 rolls of a die. We have
    P(at least 1 six in 4 rolls) = 1 − P(no six in 4 rolls)
                                 = 1 − ∏_{i=1}^{4} P(no six on roll i),

where the last equality follows by independence of the rolls. On any roll, the probability of not rolling a six is 5/6, so

    P(at least 1 six in 4 rolls) = 1 − (5/6)^4 = .518.
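As a one-line numerical check of de Méré's probability (our illustration):

```python
# The Chevalier de Mere's bet: at least one six in four rolls of a fair die.
p_at_least_one_six = 1 - (5 / 6) ** 4
print(round(p_at_least_one_six, 3))  # 0.518
```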
Independence of A and B implies independence of the complements also. In fact,
we have the following theorem.Section 1.3 CONDITIONAL PROBABILITY AND INDEPENDENCE, 28
Theorem 1.3.9 If A and B are independent events, then the following pairs are
also independent:
a. A and B^c,
b. A^c and B,
c. A^c and B^c.
Proof: We will prove only (a), leaving the rest as Exercise 1.40. To prove (a) we must show that P(A ∩ B^c) = P(A)P(B^c). From Theorem 1.2.9 we have

    P(A ∩ B^c) = P(A) − P(A ∩ B)
               = P(A) − P(A)P(B)    (A and B are independent)
               = P(A)(1 − P(B))
               = P(A)P(B^c).   □
Independence of more than two events can be defined in a manner similar to (1.3.8),
but we must be careful. For example, we might think that we could say A, B, and C
are independent if P(A ∩ B ∩ C) = P(A)P(B)P(C). However, this is not the correct condition.
Example 1.3.10 (Tossing two dice) Let an experiment consist of tossing two
dice. For this experiment the sample space is
    S = {(1,1), (1,2), ..., (1,6), (2,1), ..., (2,6), ..., (6,1), ..., (6,6)};
that is, S consists of the 36 ordered pairs formed from the numbers 1 to 6. Define the
following events:
    A = {doubles appear} = {(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)},
B = {the sum is between 7 and 10},
C = {the sum is 2 or 7 or 8}.
The probabilities can be calculated by counting among the 36 possible outcomes. We
have

    P(A) = 6/36 = 1/6,   P(B) = 18/36 = 1/2,   P(C) = 12/36 = 1/3.

Furthermore,

    P(A ∩ B ∩ C) = P(the sum is 8, composed of double 4s) = 1/36
                 = (1/6)(1/2)(1/3) = P(A)P(B)P(C).

However,

    P(B ∩ C) = P(sum equals 7 or 8) = 11/36 ≠ 1/6 = P(B)P(C).

Similarly, it can be shown that P(A ∩ B) ≠ P(A)P(B); therefore, the requirement P(A ∩ B ∩ C) = P(A)P(B)P(C) is not a strong enough condition to guarantee pairwise independence.   ||
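Both facts in this example can be confirmed by brute force over the 36 outcomes; a Python sketch (our illustration):

```python
# Example 1.3.10: the triple-product condition holds even though pairwise
# independence fails. Enumerate the 36 equally likely ordered pairs.
from itertools import product
from fractions import Fraction

S = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] == s[1]                 # doubles
B = lambda s: 7 <= s[0] + s[1] <= 10       # sum between 7 and 10
C = lambda s: s[0] + s[1] in (2, 7, 8)     # sum is 2, 7, or 8

abc = prob(lambda s: A(s) and B(s) and C(s))
triple_ok = abc == prob(A) * prob(B) * prob(C)
pair_ok = prob(lambda s: B(s) and C(s)) == prob(B) * prob(C)
print(triple_ok, pair_ok)  # True False
```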
A second attempt at a general definition of independence, in light of the previous example, might be to define A, B, and C to be independent if all the pairs are independent. Alas, this condition also fails.

Example 1.3.11 (Letters) Let the sample space S consist of the 3! permutations of the letters a, b, and c along with the three triples of each letter. Thus,

    S = {aaa, bbb, ccc, abc, bca, cba, acb, bac, cab}.
Furthermore, let each element of S have probability 1/9. Define

    A_i = {ith place in the triple is occupied by a}.

It is then easy to count that

    P(A_i) = 1/3,   i = 1, 2, 3,

and

    P(A_1 ∩ A_2) = P(A_1 ∩ A_3) = P(A_2 ∩ A_3) = 1/9,

so the A_is are pairwise independent. But

    P(A_1 ∩ A_2 ∩ A_3) = 1/9 ≠ P(A_1)P(A_2)P(A_3),

so the A_is do not satisfy the probability requirement.   ||
The preceding two examples show that simultaneous (or mutual) independence of a collection of events requires an extremely strong definition. The following definition works.

Definition 1.3.12 A collection of events A_1, ..., A_n are mutually independent if for any subcollection A_{i_1}, ..., A_{i_k}, we have

    P(∩_{j=1}^{k} A_{i_j}) = ∏_{j=1}^{k} P(A_{i_j}).
Example 1.3.13 (Three coin tosses-I) Consider the experiment of tossing a
coin three times. A sample point for this experiment must indicate the result of each
toss. For example, HHT could indicate that two heads and then a tail were observed.
The sample space for this experiment has eight points, namely,
{HHH, HET, HTH, THH, TTH, THT, HTT, TTT}.
Let H_i, i = 1, 2, 3, denote the event that the ith toss is a head. For example,

(1.3.9)    H_1 = {HHH, HHT, HTH, HTT}.
If we assign probability 1/8 to each sample point, then using enumerations such as (1.3.9), we see that P(H_1) = P(H_2) = P(H_3) = 1/2. This says that the coin is fair and has an equal probability of landing heads or tails on each toss.

Under this probability model, the events H_1, H_2, and H_3 are also mutually independent. To verify this we note that

    P(H_1 ∩ H_2 ∩ H_3) = P({HHH}) = 1/8 = (1/2)(1/2)(1/2) = P(H_1)P(H_2)P(H_3).

To verify the condition in Definition 1.3.12, we also must check each pair. For example,

    P(H_1 ∩ H_2) = P({HHH, HHT}) = 2/8 = (1/2)(1/2) = P(H_1)P(H_2).
The equality is also true for the other two pairs. Thus, Hi, Hz, and Hy are mutually
independent. That is, the occurrence of a head on any toss has no effect on any of
the other tosses.
It can be verified that the assignment of probability 1/8 to each sample point is the only probability model that has P(H_1) = P(H_2) = P(H_3) = 1/2 and H_1, H_2, and H_3 mutually independent.   ||
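Definition 1.3.12 can be verified exhaustively for this experiment; the following Python sketch (our illustration) checks every subcollection of size 2 and 3.

```python
# Example 1.3.13: verify mutual independence of H1, H2, H3 over the 8
# equally likely points of the three-toss sample space.
from itertools import combinations, product
from fractions import Fraction

S = list(product("HT", repeat=3))
H = [lambda s, i=i: s[i] == "H" for i in range(3)]  # Hi: ith toss is a head

def prob(events):
    hits = sum(1 for s in S if all(e(s) for e in events))
    return Fraction(hits, len(S))

ok = all(
    prob(sub) == Fraction(1, 2) ** k
    for k in (2, 3)
    for sub in combinations(H, k)
)
print(ok)  # True
```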
1.4 Random Variables
In many experiments it is easier to deal with a summary variable than with the
original probability structure. For example, in an opinion poll, we might decide to
ask 50 people whether they agree or disagree with a certain issue. If we record a “1”
for agree and "0" for disagree, the sample space for this experiment has 2^50 elements,
each an ordered string of 1s and 0s of length 50. We should be able to reduce this to
a reasonable size! It may be that the only quantity of interest is the number of people
who agree (equivalently, disagree) out of 50 and, if we define a variable X = number
of 1s recorded out of 50, we have captured the essence of the problem. Note that the
sample space for X is the set of integers {0,1,2,...,50} and is much easier to deal
with than the original sample space.
In defining the quantity X, we have defined a mapping (a function) from the original
sample space to a new sample space, usually a set of real numbers. In general, we
have the following definition.
Definition 1.4.1 A random variable is a function from a sample space S into the real numbers.
Example 1.4.2 (Random variables) In some experiments random variables are
implicitly used; some examples are these.
Examples of random variables
Experiment Random variable
Toss two dice X = sum of the numbers
Toss a coin 25 times X = number of heads in 25 tosses
Apply different amounts of
fertilizer to corn plants    X = yield/acre   ||
In defining a random variable, we have also defined a new sample space (the range
of the random variable). We must now check formally that our probability function,
which is defined on the original sample space, can be used for the random variable.
Suppose we have a sample space
    S = {s_1, ..., s_n}

with a probability function P and we define a random variable X with range 𝒳 = {x_1, ..., x_m}. We can define a probability function P_X on 𝒳 in the following way. We will observe X = x_i if and only if the outcome of the random experiment is an s_j ∈ S such that X(s_j) = x_i. Thus,

(1.4.1)    P_X(X = x_i) = P({s_j ∈ S : X(s_j) = x_i}).
Note that the left-hand side of (1.4.1), the function P_X, is an induced probability function on 𝒳, defined in terms of the original function P. Equation (1.4.1) formally
defines a probability function, Px, for the random variable X. Of course, we have
to verify that Px satisfies the Kolmogorov Axioms, but that is not a very difficult
job (see Exercise 1.45). Because of the equivalence in (1.4.1), we will simply write P(X = x_i) rather than P_X(X = x_i).
A note on notation: Random variables will always be denoted with uppercase letters
and the realized values of the variable (or its range) will be denoted by the corre-
sponding lowercase letters. Thus, the random variable X can take the value x.
Example 1.4.3 (Three coin tosses-II) Consider again the experiment of tossing
a fair coin three times from Example 1.3.13. Define the random variable X to be the
number of heads obtained in the three tosses. A complete enumeration of the value
of X for each point in the sample space is
    s       HHH  HHT  HTH  THH  TTH  THT  HTT  TTT
    X(s)     3    2    2    2    1    1    1    0

The range for the random variable X is 𝒳 = {0, 1, 2, 3}. Assuming that all eight points in S have probability 1/8, by simply counting in the above display we see that the induced probability function on 𝒳 is given by

    x            0     1     2     3
    P_X(X = x)  1/8   3/8   3/8   1/8

For example, P_X(X = 1) = P({HTT, THT, TTH}) = 3/8.   ||
Example 1.4.4 (Distribution of a random variable) It may be possible to
determine P_X even if a complete listing, as in Example 1.4.3, is not possible. Let S be the 2^50 strings of 50 0s and 1s, X = number of 1s, and 𝒳 = {0, 1, 2, ..., 50}, as mentioned at the beginning of this section. Suppose that each of the 2^50 strings is
equally likely. The probability that X = 27 can be obtained by counting all of the strings with 27 1s in the original sample space. Since each string is equally likely, it follows that

    P_X(X = 27) = (# strings with 27 1s)/(# strings) = (50 choose 27)/2^50.

In general, for any i ∈ 𝒳,

    P_X(X = i) = (50 choose i)/2^50.
The previous illustrations had both a finite S and finite 𝒳, and the definition of P_X was straightforward. Such is also the case if 𝒳 is countable. If 𝒳 is uncountable, we define the induced probability function, P_X, in a manner similar to (1.4.1). For any set A ⊂ 𝒳,

(1.4.2)    P_X(X ∈ A) = P({s ∈ S : X(s) ∈ A}).

This does define a legitimate probability function for which the Kolmogorov Axioms can be verified. (To be precise, we use (1.4.2) to define probabilities only for a certain sigma algebra of subsets of 𝒳. But we will not concern ourselves with these technicalities.)
1.5 Distribution Functions
With every random variable X, we associate a function called the cumulative distri-
bution function of X.
Definition 1.5.1 The cumulative distribution function or cdf of a random variable X, denoted by F_X(x), is defined by

    F_X(x) = P_X(X ≤ x),   for all x.
Figure 1.5.1. Cdf of Example 1.5.2
Example 1.5.2 (Tossing three coins) Consider the experiment of tossing three
fair coins, and let X = number of heads observed. The cdf of X is
(1.5.1)    F_X(x) = 0     if −∞ < x < 0
                    1/8   if 0 ≤ x < 1
                    1/2   if 1 ≤ x < 2
                    7/8   if 2 ≤ x < 3
                    1     if 3 ≤ x < ∞.

Note that F_X(x) = 1 for all x ≥ 3 since X is certain to be less than or equal to such a value.   ||
As is apparent from Figure 1.5.1, F_X can be discontinuous, with jumps at certain values of x. By the way in which F_X is defined, however, at the jump points F_X takes the value at the top of the jump. (Note the different inequalities in (1.5.1).) This is known as right-continuity: the function is continuous when a point is approached from the right. The property of right-continuity is a consequence of the definition of the cdf. In contrast, if we had defined F_X(x) = P_X(X < x) (note the strict inequality), F_X would then be left-continuous. The size of the jump at any point x is equal to P(X = x).

Every cdf satisfies certain properties, some of which are obvious when we think of the definition of F_X(x) in terms of probabilities.
Theorem 1.5.3  The function F(x) is a cdf if and only if the following three con-
ditions hold:
a. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.
b. F(x) is a nondecreasing function of x.
c. F(x) is right-continuous; that is, for every number x_0, lim_{x↓x_0} F(x) = F(x_0).
Outline of proof: To prove necessity, the three properties can be verified by writing
F in terms of the probability function (see Exercise 1.48). To prove sufficiency, that
if a function F satisfies the three conditions of the theorem then it is the cdf for some
random variable, is much harder. It must be established that there exists a sample
space S, a probability function P on S, and a random variable X defined on S such
that F is the cdf of X.  □
Example 1.5.4 (Tossing for a head)  Suppose we do an experiment that consists
of tossing a coin until a head appears. Let p = probability of a head on any given toss,
and define a random variable X = number of tosses required to get a head. Then, for
any x = 1, 2, ...,

(1.5.2)  P(X = x) = (1 − p)^{x−1} p,

since we must get x − 1 tails followed by a head for the event to occur and all trials
are independent. From (1.5.2) we calculate, for any positive integer x,

(1.5.3)  P(X ≤ x) = Σ_{i=1}^{x} P(X = i) = Σ_{i=1}^{x} (1 − p)^{i−1} p.

The partial sum of the geometric series is

(1.5.4)  Σ_{k=1}^{n} t^{k−1} = (1 − t^n)/(1 − t),  t ≠ 1,

a fact that can be established by induction (see Exercise 1.50). Applying (1.5.4) to
our probability, we find that the cdf of the random variable X is

F_X(x) = P(X ≤ x)
       = [(1 − (1 − p)^x)/(1 − (1 − p))] p
       = 1 − (1 − p)^x,  x = 1, 2, ....
The cdf F_X(x) is flat between the nonnegative integers, as in Example 1.5.2.
It is easy to show that if 0 < p < 1, then F_X(x) satisfies the conditions of Theorem
1.5.3. First,

lim_{x→−∞} F_X(x) = 0
Figure 1.5.2. Geometric cdf, p = .3
since F_X(x) = 0 for all x < 0, and

lim_{x→∞} F_X(x) = lim_{x→∞} [1 − (1 − p)^x] = 1,

where x goes through only integer values when this limit is taken. To verify property
(b), we simply note that the sum in (1.5.3) contains more positive terms as x increases.
Finally, to verify (c), note that, for any x, F_X(x + ε) = F_X(x) if ε > 0 is sufficiently
small. Hence,

lim_{ε↓0} F_X(x + ε) = F_X(x),

so F_X(x) is right-continuous. F_X(x) is the cdf of a distribution called the geometric
distribution (after the series) and is pictured in Figure 1.5.2.  ‖
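The closed form obtained from the geometric partial sum can be checked against the summed pmf directly; the following sketch (our own, with p = .3 as in Figure 1.5.2) does so:

```python
p = 0.3

def pmf(x, p):
    """P(X = x) = (1 - p)^(x-1) * p for x = 1, 2, ... (equation (1.5.2))."""
    return (1 - p) ** (x - 1) * p

for x in range(1, 11):
    summed = sum(pmf(i, p) for i in range(1, x + 1))   # the sum in (1.5.3)
    closed = 1 - (1 - p) ** x                          # the closed form cdf
    assert abs(summed - closed) < 1e-12
```

The agreement for every x = 1, ..., 10 is exactly the partial-sum identity (1.5.4) applied with t = 1 − p.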
Example 1.5.5 (Continuous cdf)  An example of a continuous cdf is the function

(1.5.5)  F_X(x) = 1/(1 + e^{−x}),

which satisfies the conditions of Theorem 1.5.3. For example,

lim_{x→−∞} F_X(x) = 0  since  lim_{x→−∞} e^{−x} = ∞,

and

lim_{x→∞} F_X(x) = 1  since  lim_{x→∞} e^{−x} = 0.

Differentiating F_X(x) gives

(d/dx) F_X(x) = e^{−x}/(1 + e^{−x})^2 > 0,

showing that F_X(x) is increasing. F_X is not only right-continuous, but also continuous.
This is a special case of the logistic distribution.  ‖
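A small numerical sketch (ours, not from the text) of the three conditions of Theorem 1.5.3 for this logistic cdf:

```python
import math

def F(x):
    """Logistic cdf of (1.5.5): F_X(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# (a) limiting values: F tends to 0 and 1 in the tails
assert F(-50) < 1e-12 and F(50) > 1 - 1e-12
# (b) nondecreasing, checked on a grid of points
xs = [x / 10 for x in range(-100, 101)]
assert all(F(a) <= F(b) for a, b in zip(xs, xs[1:]))
# the derivative e^(-x) / (1 + e^(-x))^2 is strictly positive,
# so F is in fact increasing (and continuous, hence right-continuous)
assert all(math.exp(-x) / (1 + math.exp(-x)) ** 2 > 0 for x in xs)
```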
Example 1.5.6 (Cdf with jumps)  If F_X is not a continuous function of x, it is
possible for it to be a mixture of continuous pieces and jumps. For example, if we
modify F_X(x) of (1.5.5) to be, for some ε, 1 > ε > 0,

(1.5.6)  F_Y(y) = (1 − ε)/(1 + e^{−y})       if y < 0,
         F_Y(y) = ε + (1 − ε)/(1 + e^{−y})   if y ≥ 0,

then F_Y(y) is the cdf of a random variable Y (see Exercise 1.47). The function F_Y
has a jump of height ε at y = 0 and otherwise is continuous. This model might
be appropriate if we were observing the reading from a gauge, a reading that could
(theoretically) be anywhere between −∞ and ∞. This particular gauge, however,
sometimes sticks at 0. We could then model our observations with F_Y, where ε is the
probability that the gauge sticks.  ‖
Whether a cdf is continuous or has jumps corresponds to the associated random
variable being continuous or not. In fact, the association is such that it is convenient
to define continuous random variables in this way.

Definition 1.5.7  A random variable X is continuous if F_X(x) is a continuous
function of x. A random variable X is discrete if F_X(x) is a step function of x.
We close this section with a theorem formally stating that F_X completely deter-
mines the probability distribution of a random variable X. This is true if P(X ∈ A) is
defined only for events A in B^1, the smallest sigma algebra containing all the intervals
of real numbers of the form (a, b), [a, b), (a, b], and [a, b]. If probabilities are defined
for a larger class of events, it is possible for two random variables to have the same
distribution function but not the same probability for every event (see Chung 1974,
page 27). In this book, as in most statistical applications, we are concerned only with
events that are intervals, countable unions or intersections of intervals, etc. So we do
not consider such pathological cases. We first need the notion of two random variables
being identically distributed.

Definition 1.5.8  The random variables X and Y are identically distributed if, for
every set A ∈ B^1, P(X ∈ A) = P(Y ∈ A).
Note that two random variables that are identically distributed are not necessarily
equal. That is, Definition 1.5.8 does not say that X = Y.
Example 1.5.9 (Identically distributed random variables)  Consider the ex-
periment of tossing a fair coin three times as in Example 1.4.3. Define the random
variables X and Y by

X = number of heads observed  and  Y = number of tails observed.

The distribution of X is given in Example 1.4.3, and it is easily verified that the
distribution of Y is exactly the same. That is, for each k = 0, 1, 2, 3, we have P(X =
k) = P(Y = k). So X and Y are identically distributed. However, for no sample
points do we have X(s) = Y(s).  ‖
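Both claims in this example can be confirmed by enumeration; in this sketch (ours), the eight sample points are listed explicitly:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))  # 8 equally likely sample points

def X(s):
    return s.count("H")  # number of heads observed

def Y(s):
    return s.count("T")  # number of tails observed

def prob(rv, k):
    """P(rv = k) under the equally likely assignment."""
    return Fraction(sum(1 for s in outcomes if rv(s) == k), len(outcomes))

# Identically distributed: P(X = k) = P(Y = k) for each k = 0, 1, 2, 3 ...
assert all(prob(X, k) == prob(Y, k) for k in range(4))
# ... yet X(s) = Y(s) for no sample point, since X(s) + Y(s) = 3 is odd.
assert all(X(s) != Y(s) for s in outcomes)
```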
Theorem 1.5.10  The following two statements are equivalent:
a. The random variables X and Y are identically distributed.
b. F_X(x) = F_Y(x) for every x.

Proof: To show equivalence we must show that each statement implies the other.
We first show that (a) ⇒ (b).
Because X and Y are identically distributed, for any set A ∈ B^1, P(X ∈ A) =
P(Y ∈ A). In particular, for every x, the set (−∞, x] is in B^1, and

F_X(x) = P(X ∈ (−∞, x]) = P(Y ∈ (−∞, x]) = F_Y(x).

The converse implication, that (b) ⇒ (a), is much more difficult to prove. The
above argument showed that if the X and Y probabilities agreed on all sets, then
they agreed on intervals. We now must prove the opposite; that is, if the X and Y
probabilities agree on all intervals, then they agree on all sets. To show this requires
heavy use of sigma algebras; we will not go into these details here. Suffice it to say that
it is necessary to prove only that the two probability functions agree on all intervals
(Chung 1974, Section 2.2).  □
1.6 Density and Mass Functions

Associated with a random variable X and its cdf F_X is another function, called either
the probability density function (pdf) or probability mass function (pmf). The terms
pdf and pmf refer, respectively, to the continuous and discrete cases. Both pdfs and
pmfs are concerned with "point probabilities" of random variables.

Definition 1.6.1  The probability mass function (pmf) of a discrete random variable
X is given by

f_X(x) = P(X = x)  for all x.
Example 1.6.2 (Geometric probabilities)  For the geometric distribution of
Example 1.5.4, we have the pmf

f_X(x) = P(X = x) = (1 − p)^{x−1} p   for x = 1, 2, ...,
       = 0                            otherwise.

Recall that P(X = x) or, equivalently, f_X(x) is the size of the jump in the cdf at x. We
can use the pmf to calculate probabilities. Since we can now measure the probability
of a single point, we need only sum over all of the points in the appropriate event.
Hence, for positive integers a and b, with a ≤ b,

P(a ≤ X ≤ b) = Σ_{k=a}^{b} f_X(k) = Σ_{k=a}^{b} (1 − p)^{k−1} p.

A special case of this is a = 1, giving

(1.6.1)  P(X ≤ b) = Σ_{k=1}^{b} f_X(k) = F_X(b).
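For the geometric distribution this summation is easy to check numerically; in the sketch below (our choice of p, a, and b), P(a ≤ X ≤ b) computed from the pmf agrees with the cdf difference F_X(b) − F_X(a − 1):

```python
p = 0.4

def F(x):
    """Geometric cdf, F_X(x) = 1 - (1 - p)^x for x = 0, 1, 2, ...."""
    return 1 - (1 - p) ** x

def f(x):
    """Geometric pmf, f_X(x) = (1 - p)^(x-1) * p for x = 1, 2, ...."""
    return (1 - p) ** (x - 1) * p

a, b = 3, 7
interval = sum(f(k) for k in range(a, b + 1))      # P(a <= X <= b) via the pmf
assert abs(interval - (F(b) - F(a - 1))) < 1e-12   # same interval via the cdf
```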
A widely accepted convention, which we will adopt, is to use an uppercase letter
for the cdf and the corresponding lowercase letter for the pmf or pdf.
We must be a little more careful in our definition of a pdf in the continuous case.
If we naively try to calculate P(X = x) for a continuous random variable, we get the
following. Since {X = x} ⊂ {x − ε < X ≤ x} for any ε > 0, we have from Theorem
1.2.9(c) that

P(X = x) ≤ P(x − ε < X ≤ x) = F_X(x) − F_X(x − ε)

for any ε > 0. Therefore,

0 ≤ P(X = x) ≤ lim_{ε↓0} [F_X(x) − F_X(x − ε)] = 0

by the continuity of F_X. However, if we understand the purpose of the pdf, its defi-
nition will become clear.
From Example 1.6.2, we see that a pmf gives us "point probabilities." In the discrete
case, we can sum over values of the pmf to get the cdf (as in (1.6.1)). The analogous
procedure in the continuous case is to substitute integrals for sums, and we get

P(X ≤ x) = F_X(x) = ∫_{−∞}^{x} f_X(t) dt.

Using the Fundamental Theorem of Calculus, if f_X(x) is continuous, we have the
further relationship

(1.6.2)  (d/dx) F_X(x) = f_X(x).

Note that the analogy with the discrete case is almost exact. We "add up" the "point
probabilities" f_X(x) to obtain interval probabilities.
Definition 1.6.3  The probability density function or pdf, f_X(x), of a continuous
random variable X is the function that satisfies

(1.6.3)  F_X(x) = ∫_{−∞}^{x} f_X(t) dt  for all x.
A note on notation: The expression "X has a distribution given by F_X(x)" is abbrevi-
ated symbolically by "X ∼ F_X(x)," where we read the symbol "∼" as "is distributed
as." We can similarly write X ∼ f_X(x) or, if X and Y have the same distribution,
X ∼ Y.
In the continuous case we can be somewhat cavalier about the specification of
interval probabilities. Since P(X = x) = 0 if X is a continuous random variable,

P(a < X < b) = P(a ≤ X ≤ b) = F_X(b) − F_X(a).

Theorem 1.6.5  A function f_X(x) is a pdf (or pmf) of a random variable X if and
only if
a. f_X(x) ≥ 0 for all x.
b. Σ_x f_X(x) = 1 (pmf) or ∫_{−∞}^{∞} f_X(x) dx = 1 (pdf).
Proof: If f_X(x) is a pdf (or pmf), then the two properties are immediate from the
definitions. In particular, for a pdf, using (1.6.3) and Theorem 1.5.3, we have that

1 = lim_{x→∞} F_X(x) = ∫_{−∞}^{∞} f_X(t) dt.

The converse implication is equally easy to prove. Once we have f_X(x), we can define
F_X(x) and appeal to Theorem 1.5.3.  □
From a purely mathematical viewpoint, any nonnegative function with a finite
positive integral (or sum) can be turned into a pdf or pmf. For example, if h(x) is
any nonnegative function that is positive on a set A, 0 elsewhere, and

∫_{x∈A} h(x) dx = K < ∞

for some constant K > 0, then the function f_X(x) = h(x)/K is a pdf of a random
variable X taking values in A.
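As a concrete sketch (the function h is our own choice, not from the text), take h(x) = 1 − x^2 on A = (−1, 1); then K = ∫_A h(x) dx = 4/3, and f_X = h/K integrates to 1:

```python
def h(x):
    """Nonnegative function, positive on A = (-1, 1) and 0 elsewhere."""
    return 1 - x * x if -1 < x < 1 else 0.0

# Approximate K = integral of h over A with a midpoint Riemann sum.
n = 100_000
dx = 2.0 / n
K = sum(h(-1 + (i + 0.5) * dx) for i in range(n)) * dx
assert abs(K - 4 / 3) < 1e-6      # the exact value of the integral is 4/3

# f_X = h / K is then a pdf: nonnegative, and it integrates to 1.
total = sum(h(-1 + (i + 0.5) * dx) / K for i in range(n)) * dx
assert abs(total - 1.0) < 1e-9
```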
Actually, the relationship (1.6.3) does not always hold because F_X(x) may be
continuous but not differentiable. In fact, there exist continuous random variables
for which the integral relationship does not hold for any f_X(x). These cases are
rather pathological and we will ignore them. Thus, in this text, we will assume that
(1.6.3) holds for any continuous random variable. In more advanced texts (for exam-
ple, Billingsley 1995, Section 31) a random variable is called absolutely continuous if
(1.6.3) holds.
1.7 Exercises
1.1 For each of the following experiments, describe the sample space.
(a) Toss a coin four times.
(b) Count the number of insect-damaged leaves on a plant.
(c) Measure the lifetime (in hours) of a particular brand of light bulb.
(d) Record the weights of 10-day-old rats.
(e) Observe the proportion of defectives in a shipment of electronic components.
1.2 Verify the following identities.
(a) A\B = A\(A ∩ B) = A ∩ B^c
(b) B = (B ∩ A) ∪ (B ∩ A^c)
(c) B\A = B ∩ A^c
(d) A ∪ B = A ∪ (B ∩ A^c)
1.3 Finish the proof of Theorem 1.1.4. For any events A, B, and C defined on a sample
space S, show that
(a) A ∪ B = B ∪ A and A ∩ B = B ∩ A.  (commutativity)
(b) A ∪ (B ∪ C) = (A ∪ B) ∪ C and A ∩ (B ∩ C) = (A ∩ B) ∩ C.  (associativity)
(c) (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c.  (DeMorgan's Laws)
1.4 For events A and B, find formulas for the probabilities of the following events in terms
of the quantities P(A), P(B), and P(A ∩ B).
(a) either A or B or both
(b) either A or B but not both
(c) at least one of A or B
(d) at most one of A or B
1.5 Approximately one-third of all human twins are identical (one-egg) and two-thirds are
fraternal (two-egg) twins. Identical twins are necessarily the same sex, with male and
female being equally likely. Among fraternal twins, approximately one-fourth are both
female, one-fourth are both male, and half are one male and one female. Finally, among
all U.S. births, approximately 1 in 90 is a twin birth. Define the following events:
A = {a U.S. birth results in twin females}
B = {a U.S. birth results in identical twins}
C = {a U.S. birth results in twins}
(a) State, in words, the event A ∩ B ∩ C.
(b) Find P(A ∩ B ∩ C).
1.6 Two pennies, one with P(head) = u and one with P(head) = w, are to be tossed
together independently. Define
p_0 = P(0 heads occur),
p_1 = P(1 head occurs),
p_2 = P(2 heads occur).
Can u and w be chosen such that p_0 = p_1 = p_2? Prove your answer.
1.7 Refer to the dart game of Example 1.2.7. Suppose we do not assume that the proba-
bility of hitting the dart board is 1, but rather is proportional to the area of the dart
board. Assume that the dart board is mounted on a wall that is hit with probability
1, and the wall has area A.
(a) Using the fact that the probability of hitting a region is proportional to area,
construct a probability function for P(scoring i points), i = 0, ..., 5. (No points
are scored if the dart board is not hit.)
(b) Show that the conditional probability distribution P(scoring i points|board is hit)
is exactly the probability distribution of Example 1.2.7.
1.8 Again refer to the game of darts explained in Example 1.2.7.
(a) Derive the general formula for the probability of scoring i points.
(b) Show that P(scoring i points) is a decreasing function of i, that is, as the points
increase, the probability of scoring them decreases.
(c) Show that P(scoring i points) is a probability function according to the Kol-
mogorov Axioms.
1.9 Prove the general version of DeMorgan's Laws. Let {A_α : α ∈ Γ} be a (possibly un-
countable) collection of sets. Prove that
(a) (∪_α A_α)^c = ∩_α A_α^c.  (b) (∩_α A_α)^c = ∪_α A_α^c.
1.10 Formulate and prove a version of DeMorgan's Laws that applies to a finite collection
of sets A_1, ..., A_n.
1.11 Let S be a sample space.
(a) Show that the collection B = {∅, S} is a sigma algebra.
(b) Let B = {all subsets of S, including S itself}. Show that B is a sigma algebra.
(c) Show that the intersection of two sigma algebras is a sigma algebra.
1.12 It was noted in Section 1.2.1 that statisticians who follow the deFinetti school do not
accept the Axiom of Countable Additivity, instead adhering to the Axiom of Finite
Additivity.
(a) Show that the Axiom of Countable Additivity implies Finite Additivity.
(b) Although, by itself, the Axiom of Finite Additivity does not imply Countable
Additivity, suppose we supplement it with the following. Let A_1 ⊃ A_2 ⊃ ··· ⊃
A_n ⊃ ··· be an infinite sequence of nested sets whose limit is the empty set, which
we denote by A_n ↓ ∅. Consider the following:
Axiom of Continuity: If A_n ↓ ∅, then P(A_n) → 0.
Prove that the Axiom of Continuity and the Axiom of Finite Additivity imply
Countable Additivity.
1.13 If P(A) = 1/3 and P(B^c) = 1/4, can A and B be disjoint? Explain.
1.14 Suppose that a sample space S has n elements. Prove that the number of subsets that
can be formed from the elements of S is 2^n.
1.15 Finish the proof of Theorem 1.2.14. Use the result established for k = 2 as the basis
of an induction argument.
1.16 How many different sets of initials can be formed if every person has one surname and
(a) exactly two given names?
(b) either one or two given names?
(c) either one or two or three given names?
(Answers: (a) 26^3 (b) 26^3 + 26^2 (c) 26^4 + 26^3 + 26^2)
1.17 In the game of dominoes, each piece is marked with two numbers. The pieces are
symmetrical so that the number pair is not ordered (so, for example, (2, 6) = (6, 2)).
How many different pieces can be formed using the numbers 1, 2, ..., n?
(Answer: n(n + 1)/2)
1.18 If n balls are placed at random into n cells, find the probability that exactly one cell
remains empty.
(Answer: (n choose 2) n!/n^n)
1.19 If a multivariate function has continuous partial derivatives, the order in which the
derivatives are calculated does not matter. Thus, for example, the function f(x, y) of
two variables has equal third partials

∂^3/(∂x ∂y^2) f(x, y) = ∂^3/(∂y^2 ∂x) f(x, y).

(a) How many fourth partial derivatives does a function of three variables have?
(b) Prove that a function of n variables has (n + r − 1 choose r) rth partial derivatives.
1.20 My telephone rings 12 times each week, the calls being randomly distributed among
the 7 days. What is the probability that I get at least one call each day?
(Answer: .2285)
1.21 A closet contains n pairs of shoes. If 2r shoes are chosen at random (2r < n), what
is the probability that there will be no matching pair in the sample?
1.24 Two players, A and B, alternately and independently flip a coin, and the first player
to obtain a head wins. Assume player A flips first.
(a) If the coin is fair, what is the probability that A wins?
(b) Suppose that P(head) = p, not necessarily 1/2. What is the probability that A wins?
(c) Show that, for all p, 0 < p < 1, P(A wins) > 1/2. (Hint: Try to write P(A wins)
in terms of the events E_1, E_2, ..., where E_i = {head first appears on ith toss}.)
(Answers: (a) 2/3 (b) p/(1 − (1 − p)^2))
1.25 The Smiths have two children. At least one of them is a boy. What is the probability
that both children are boys? (See Gardner 1961 for a complete discussion of this
problem.)
1.26 A fair die is cast until a 6 appears. What is the probability that it must be cast more
than five times?
1.27 Verify the following identities for n ≥ 2.
(a) Σ_{k=0}^{n} (−1)^k (n choose k) = 0  (b) Σ_{k=1}^{n} k (n choose k) = n 2^{n−1}
(c) Σ_{k=1}^{n} (−1)^{k+1} k (n choose k) = 0
1.28 A way of approximating large factorials is through the use of Stirling's Formula:

n! ≈ √(2π) n^{n+1/2} e^{−n},

a complete derivation of which is difficult. Instead, prove the easier fact,

lim_{n→∞} n!/(n^{n+1/2} e^{−n}) = a constant.

(Hint: Feller 1968 proceeds by using the monotonicity of the logarithm to establish
bounds on log n! by integrals of log x.)
1.29 ... (for m > 1), the employer can hire the mth candidate
only if the mth candidate is better than the previous m − 1.
Suppose a candidate is hired on the ith trial. What is the probability that the best
candidate was hired?
Suppose that 5% of men and .25% of women are color-blind. A person is chosen at
random and that person is color-blind. What is the probability that the person is
male? (Assume males and females to be in equal numbers.)
Two litters of a particular rodent species have been born, one with two brown-haired
and one gray-haired (litter 1), and the other with three brown-haired and two
gray-haired (litter 2). We select a litter at random and then select an offspring at
random from the selected litter.
(a) What is the probability that the animal chosen is brown-haired?
(b) Given that a brown-haired offspring was selected, what is the probability that the
sampling was from litter 1?
Prove that if P(-) is a legitimate probability function and B is a set with P(B) > 0,
then P(-|B) also satisfies Kolmogorov’s Axioms.
If the probability of hitting a target is 1/5, and ten shots are fired independently, what
is the probability of the target being hit at least twice? What is the conditional prob-
ability that the target is hit at least twice, given that it is hit at least once?
1.37 Here we look at some variations of Example 1.3.4.
(a) In the warden's calculation of Example 1.3.4 it was assumed that if A were to be
pardoned, then with equal probability the warden would tell A that either B or C
would die. However, this need not be the case. The warden can assign probabilities
γ and 1 − γ to these events, as shown here:

Prisoner pardoned   Warden tells A
A                   B dies with probability γ
A                   C dies with probability 1 − γ
B                   C dies
C                   B dies

Calculate P(A|W) as a function of γ. For what values of γ is P(A|W) less than,
equal to, or greater than 1/3?
(b) Suppose again that γ = 1/2, as in the example. After the warden tells A that B
will die, A thinks for a while and realizes that his original calculation was false.
However, A then gets a bright idea. A asks the warden if he can swap fates with C.
The warden, thinking that no information has been passed, agrees to this. Prove
that A's reasoning is now correct and that his probability of survival has jumped
to 2/3!
A similar, but somewhat more complicated, problem, the "Monty Hall problem," is
discussed by Selvin (1975). The problem in this guise gained a fair amount of noto-
riety when it appeared in a Sunday magazine (vos Savant 1990) along with a correct
answer but with questionable explanation. The ensuing debate was even reported on
the front page of the Sunday New York Times (Tierney 1991). A complete and some-
what amusing treatment is given by Morgan et al. (1991) [see also the response by vos
Savant 1991]. Chun (1999) pretty much exhausts the problem with a very thorough
analysis.
1.38 Prove each of the following statements. (Assume that any conditioning event has pos-
itive probability.)
(a) If P(B) = 1, then P(A|B) = P(A) for any A.
(b) If A ⊂ B, then P(B|A) = 1 and P(A|B) = P(A)/P(B).
(c) If A and B are mutually exclusive, then

P(A|A ∪ B) = P(A)/(P(A) + P(B)).

(d) P(A ∩ B ∩ C) = P(A|B ∩ C) P(B|C) P(C).
1.39 A pair of events A and B cannot be simultaneously mutually exclusive and independent.
Prove that if P(A) > 0 and P(B) > 0, then:
(a) If A and B are mutually exclusive, they cannot be independent.
(b) If A and B are independent, they cannot be mutually exclusive.
1.40 Finish the proof of Theorem 1.3.9 by proving parts (b) and (c).
1.41 As in Example 1.3.6, consider telegraph signals "dot" and "dash" sent in the proportion
3:4, where erratic transmissions cause a dot to become a dash with probability 1/4 and
a dash to become a dot with probability 1/3.
(a) If a dash is received, what is the probability that a dash has been sent?
(b) Assuming independence between signals, if the message dot-dot was received,
what is the probability distribution of the four possible messages that could have
been sent?
1.42 The inclusion-exclusion identity of Miscellanea 1.8.1 gets its name from the fact that
it is proved by the method of inclusion and exclusion (Feller 1968, Section IV.1). Here
we go into the details. The probability P(∪_{i=1}^{n} A_i) is the sum of the probabilities of
all the sample points that are contained in at least one of the A_is. The method of
inclusion and exclusion is a recipe for counting these points.
(a) Let E_k denote the set of all sample points that are contained in exactly k of the
events A_1, A_2, ..., A_n. Show that P(∪_{i=1}^{n} A_i) = Σ_{k=1}^{n} P(E_k).
(b) If E_1 is not empty, show that P(E_1) = Σ_{i=1}^{n} P(A_i) − ...
(c) Without loss of generality, assume that E_k is contained in A_1, A_2, ..., A_k. Show
that P(E_k) appears k times in the sum P_1, (k choose 2) times in the sum P_2,
(k choose 3) times in the sum P_3, etc.
(d) Show that

k − (k choose 2) + (k choose 3) − ⋯ ± (k choose k) = 1.

(See Exercise 1.27.)
(e) Show that parts (a)–(d) imply Σ_{k=1}^{n} P(E_k) = P_1 − P_2 + ⋯ ± P_n, establishing
the inclusion-exclusion identity.
1.43 For the inclusion-exclusion identity of Miscellanea 1.8.1:
(a) Derive both Boole's and Bonferroni's Inequality from the inclusion-exclusion iden-
tity.
(b) Show that the P_i satisfy P_i ≥ P_j if i ≤ j and that the sequence of bounds in
Miscellanea 1.8.1 improves as the number of terms increases.
(c) Typically as the number of terms in the bound increases, the bound becomes more
useful. However, Schwager (1984) cautions that there are some cases where there
is not much improvement, in particular if the A_is are highly correlated. Examine
what happens to the sequence of bounds in the extreme case when A_i = A for
every i. (See Worsley 1982 and the correspondence of Worsley 1985 and Schwager
1985.)
1.44 Standardized tests provide an interesting application of probability theory. Suppose
first that a test consists of 20 multiple-choice questions, each with 4 possible answers.
If the student guesses on each question, then the taking of the exam can be modeled
as a sequence of 20 independent events. Find the probability that the student gets at
least 10 questions correct, given that he is guessing.
1.45 Show that the induced probability function defined in (1.4.1) defines a legitimate
probability function in that it satisfies the Kolmogorov Axioms.
1.46 Seven balls are distributed randomly into seven cells. Let X_i = the number of cells
containing exactly i balls. What is the probability distribution of X_3? (That is, find
P(X_3 = x) for every possible x.)
1.47 Prove that the following functions are cdfs.
(a) 1/2 + (1/π) tan^{−1}(x), x ∈ (−∞, ∞)  (b) (1 + e^{−x})^{−1}, x ∈ (−∞, ∞)
(c) e^{−e^{−x}}, x ∈ (−∞, ∞)  (d) 1 − e^{−x}, x ∈ (0, ∞)
(e) the function defined in (1.5.6)
1.48 Prove the necessity part of Theorem 1.5.3.
1.49 A cdf F_X is stochastically greater than a cdf F_Y if F_X(t) ≤ F_Y(t) for all t and F_X(t) <
F_Y(t) for some t. Prove that if X ∼ F_X and Y ∼ F_Y, then

P(X > t) ≥ P(Y > t)  for every t

and

P(X > t) > P(Y > t)  for some t,

that is, X tends to be bigger than Y.
1.50 Verify formula (1.5.4), the formula for the partial sum of the geometric series.
1.51 An appliance store receives a shipment of 30 microwave ovens, 5 of which are (unknown
to the manager) defective. The store manager selects 4 ovens at random, without
replacement, and tests to see if they are defective. Let X = number of defectives
found. Calculate the pmf and cdf of X and plot the cdf.
1.52 Let X be a continuous random variable with pdf f(x) and cdf F(x). For a fixed number
x_0, define the function

g(x) = f(x)/[1 − F(x_0)]  for x ≥ x_0,  and  g(x) = 0  for x < x_0.

Prove that g(x) is a pdf. (Assume that F(x_0) < 1.)
1.53 A certain river floods every year. Suppose that the low-water mark is set at 1 and the
high-water mark Y has distribution function

F_Y(y) = P(Y ≤ y) = 1 − 1/y^2,  1 ≤ y < ∞.

(a) Verify that F_Y(y) is a cdf.
(b) Find f_Y(y), the pdf of Y.
(c) If the low-water mark is reset at 0 and we use a unit of measurement that is 1/10 of
that given previously, the high-water mark becomes Z = 10(Y − 1). Find F_Z(z).
1.54 For each of the following, determine the value of c that makes f(x) a pdf.
(a) f(x) = c sin x, 0 < x < π/2  (b) f(x) = c e^{−|x|}, −∞ < x < ∞
1.8 Miscellanea

1.8.1 Bonferroni and Beyond

The Bonferroni bound of (1.2.10), or Boole's Inequality (Theorem 1.2.11), provides
simple bounds on the probability of an intersection or union. These bounds can be
made more and more precise with the following expansion.
For sets A_1, A_2, ..., A_n, we create a new set of sums of intersection probabilities
as follows. Let

P_1 = Σ_{i=1}^{n} P(A_i)
P_2 = Σ_{1≤i<j≤n} P(A_i ∩ A_j)
P_3 = Σ_{1≤i<j<k≤n} P(A_i ∩ A_j ∩ A_k)
⋮
P_n = P(A_1 ∩ A_2 ∩ ⋯ ∩ A_n).

It can be shown that the P_i satisfy P_i ≥ P_j if i < j, and we have the sequence of
upper and lower bounds

P_1 ≥ P(∪_{i=1}^{n} A_i) ≥ P_1 − P_2
P_1 − P_2 + P_3 ≥ P(∪_{i=1}^{n} A_i) ≥ P_1 − P_2 + P_3 − P_4
⋮
See Exercises 1.42 and 1.43 for details.
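The sandwich of bounds can be seen on a toy example (our own, not from the text): three overlapping events on ten equally likely points. For n = 3 the expansion terminates, and P_1 − P_2 + P_3 recovers the union probability exactly.

```python
from itertools import combinations
from fractions import Fraction

# Ten equally likely points; three overlapping events A_1, A_2, A_3.
S = set(range(10))
A = [set(range(0, 5)), set(range(3, 8)), set(range(6, 10))]

def P(E):
    return Fraction(len(E), len(S))

def P_k(k):
    """P_k = sum of P(intersection) over all k-fold intersections."""
    return sum(P(set.intersection(*sub)) for sub in combinations(A, k))

union = P(set.union(*A))          # P(A_1 u A_2 u A_3)
P1, P2, P3 = P_k(1), P_k(2), P_k(3)

assert P1 >= union >= P1 - P2     # the first pair of Bonferroni-type bounds
assert union == P1 - P2 + P3      # exact for n = 3: inclusion-exclusion
```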
These bounds become increasingly tighter as the number of terms increases, and
they provide a refinement of the original Bonferroni bounds. Applications of these
bounds include approximating probabilities of runs (Karlin and Ost 1988) and
multiple comparisons procedures (Naiman and Wynn 1992).

Chapter 2
Transformations and Expectations
“We want something more than mere theory and preaching now, though.”
Sherlock Holmes
A Study in Scarlet
Often, if we are able to model a phenomenon in terms of a random variable X
with cdf F_X(x), we will also be concerned with the behavior of functions of X. In
this chapter we study techniques that allow us to gain information about functions
of X that may be of interest, information that can range from very complete (the
distributions of these functions) to more vague (the average behavior).
2.1 Distributions of Functions of a Random Variable
If X is a random variable with cdf F_X(x), then any function of X, say g(X), is
also a random variable. Often g(X) is of interest itself and we write Y = g(X) to
denote the new random variable g(X). Since Y is a function of X, we can describe
the probabilistic behavior of Y in terms of that of X. That is, for any set A,

P(Y ∈ A) = P(g(X) ∈ A),

showing that the distribution of Y depends on the functions F_X and g. Depending
on the choice of g, it is sometimes possible to obtain a tractable expression for this
probability.
Formally, if we write y = g(x), the function g(x) defines a mapping from the original
sample space of X, 𝒳, to a new sample space, 𝒴, the sample space of the random
variable Y. That is,

g(x): 𝒳 → 𝒴.

We associate with g an inverse mapping, denoted by g^{−1}, which is a mapping from
subsets of 𝒴 to subsets of 𝒳, and is defined by

(2.1.1)  g^{−1}(A) = {x ∈ 𝒳 : g(x) ∈ A}.

Note that the mapping g^{−1} takes sets into sets; that is, g^{−1}(A) is the set of points
in 𝒳 that g(x) takes into the set A. It is possible for A to be a point set, say A = {y}.
Then

g^{−1}({y}) = {x ∈ 𝒳 : g(x) = y}.
In this case we often write g^{−1}(y) instead of g^{−1}({y}). The quantity g^{−1}(y) can still
be a set, however, if there is more than one x for which g(x) = y. If there is only one
x for which g(x) = y, then g^{−1}(y) is the point set {x}, and we will write g^{−1}(y) = x. If
the random variable Y is now defined by Y = g(X), we can write for any set A ⊂ 𝒴,

P(Y ∈ A) = P(g(X) ∈ A)
(2.1.2)    = P({x ∈ 𝒳 : g(x) ∈ A})
           = P(X ∈ g^{−1}(A)).

This defines the probability distribution of Y. It is straightforward to show that this
probability distribution satisfies the Kolmogorov Axioms.
If X is a discrete random variable, then 𝒳 is countable. The sample space for
Y = g(X) is 𝒴 = {y : y = g(x), x ∈ 𝒳}, which is also a countable set. Thus, Y is also
a discrete random variable. From (2.1.2), the pmf for Y is

f_Y(y) = P(Y = y) = Σ_{x ∈ g^{−1}(y)} P(X = x) = Σ_{x ∈ g^{−1}(y)} f_X(x),  for y ∈ 𝒴,

and f_Y(y) = 0 for y ∉ 𝒴. In this case, finding the pmf of Y involves simply identifying
g^{−1}(y), for each y ∈ 𝒴, and summing the appropriate probabilities.
Example 2.1.1 (Binomial transformation)  A discrete random variable X has
a binomial distribution if its pmf is of the form

(2.1.3)  f_X(x) = P(X = x) = (n choose x) p^x (1 − p)^{n−x},  x = 0, 1, ..., n,

where n is a positive integer and 0 < p < 1. Values such as n and p that can
be set to different values, producing different probability distributions, are called
parameters. Consider the random variable Y = g(X), where g(x) = n − x; that is,
Y = n − X. Here 𝒳 = {0, 1, ..., n} and 𝒴 = {y : y = g(x), x ∈ 𝒳} = {0, 1, ..., n}.
For any y ∈ 𝒴, n − x = g(x) = y if and only if x = n − y. Thus, g^{−1}(y) is the single
point x = n − y, and

f_Y(y) = Σ_{x ∈ g^{−1}(y)} f_X(x)
       = f_X(n − y)
       = (n choose n−y) p^{n−y} (1 − p)^{n−(n−y)}
       = (n choose y) (1 − p)^y p^{n−y}.    ((n choose n−y) = (n choose y), Definition 1.2.17)

Thus, we see that Y also has a binomial distribution, but with parameters n and
1 − p.  ‖
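The conclusion f_Y(y) = f_X(n − y) can be verified numerically; a sketch with our own choice of n = 5 and p = 0.3:

```python
from math import comb

n, p = 5, 0.3

def f_X(x):
    """Binomial(n, p) pmf of (2.1.3)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def f_Y(y):
    """Claimed pmf of Y = n - X: binomial(n, 1 - p)."""
    return comb(n, y) * (1 - p) ** y * p ** (n - y)

# g^{-1}(y) is the single point x = n - y, so f_Y(y) = f_X(n - y).
for y in range(n + 1):
    assert abs(f_Y(y) - f_X(n - y)) < 1e-12
```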
If X and Y are continuous random variables, then in some cases it is possible to
find simple formulas for the cdf and pdf of Y in terms of the cdf and pdf of X and
the function g. In the remainder of this section, we consider some of these cases.
Figure 2.1.1. Graph of the transformation y = sin²(x) of Example 2.1.2
The cdf of Y = g(X) is

F_Y(y) = P(Y ≤ y)
(2.1.4) = P(g(X) ≤ y)
        = P({x ∈ 𝒳 : g(x) ≤ y})
        = ∫_{{x ∈ 𝒳 : g(x) ≤ y}} f_X(x) dx.

From the symmetry of the function sin²(x) and the fact that X has a uniform distri-
bution, we have P(X ≤ x_1) = P(X ≥ x_2) and ...

(2.1.7)  𝒳 = {x : f_X(x) > 0}  and  𝒴 = {y : y = g(x) for some x ∈ 𝒳}.
The pdf of the random variable X is positive only on the set 𝒳 and is 0 elsewhere.
Such a set is called the support set of a distribution or, more informally, the support
of a distribution. This terminology can also apply to a pmf or, in general, to any
nonnegative function.
It is easiest to deal with functions g(x) that are monotone, that is, those that satisfy
either

u > v ⇒ g(u) > g(v)  (increasing)  or  u < v ⇒ g(u) > g(v)  (decreasing).

If the transformation x → g(x) is monotone, then it is one-to-one and onto from
𝒳 → 𝒴. That is, each x goes to only one y and each y comes from at most one x
(one-to-one). Also, for 𝒴 defined as in (2.1.7), for each y ∈ 𝒴 there is an x ∈ 𝒳 such
that g(x) = y (onto). Thus, the transformation g uniquely pairs xs and ys. If g is
monotone, then g^{−1} is single-valued; that is, g^{−1}(y) = x if and only if y = g(x). If g
is increasing, this implies that

(2.1.8)  {x ∈ 𝒳 : g(x) ≤ y} = {x ∈ 𝒳 : x ≤ g^{−1}(y)}.

If g is decreasing, this implies that

(2.1.9)  {x ∈ 𝒳 : g(x) ≤ y} = {x ∈ 𝒳 : x ≥ g^{−1}(y)}.

(A graph will illustrate why the inequality reverses in the decreasing case.) If g(x) is
an increasing function, then using (2.1.4), we can write

F_Y(y) = ∫_{{x ∈ 𝒳 : x ≤ g^{−1}(y)}} f_X(x) dx = ∫_{−∞}^{g^{−1}(y)} f_X(x) dx = F_X(g^{−1}(y)).

If g(x) is decreasing, we have

F_Y(y) = ∫_{g^{−1}(y)}^{∞} f_X(x) dx = 1 − F_X(g^{−1}(y)).

The continuity of X is used to obtain the second equality. We summarize these results
in the following theorem.
Theorem 2.1.3  Let X have cdf F_X(x), let Y = g(X), and let 𝒳 and 𝒴 be defined
as in (2.1.7).
a. If g is an increasing function on 𝒳, F_Y(y) = F_X(g^{−1}(y)) for y ∈ 𝒴.
b. If g is a decreasing function on 𝒳 and X is a continuous random variable, F_Y(y) =
1 − F_X(g^{−1}(y)) for y ∈ 𝒴.
Example 2.1.4 (Uniform-exponential relationship-I)  Suppose X \sim f_X(x) =
1 if 0 < x < 1 and 0 otherwise, the uniform(0,1) distribution. It is straightforward
to check that F_X(x) = x, 0 < x < 1. We now make the transformation Y = g(X) =
-\log X. Since

\frac{d}{dx} g(x) = \frac{d}{dx}(-\log x) = -\frac{1}{x} < 0, \quad \text{for } 0 < x < 1,

g(x) is a decreasing function. As x ranges between 0 and 1, -\log x ranges between
0 and \infty, that is, \mathcal{Y} = (0, \infty). For y > 0, y = -\log x implies x = e^{-y}, so g^{-1}(y) = e^{-y}.
Therefore, for y > 0,

F_Y(y) = 1 - F_X\!\left(g^{-1}(y)\right) = 1 - F_X(e^{-y}) = 1 - e^{-y}.    (F_X(x) = x)

Of course, F_Y(y) = 0 for y \le 0. Note that it was necessary only to verify that
g(x) = -\log x is monotone on (0,1), the support of X.
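As a quick numerical sanity check (a sketch, not part of the original text), one can simulate the transformation of Example 2.1.4 and compare the empirical cdf of Y = -log X with 1 - e^{-y}; the sample size and evaluation points below are illustrative.

```python
import math
import random

random.seed(42)

# Example 2.1.4 numerically: X ~ uniform(0,1), Y = -log X should have
# cdf F_Y(y) = 1 - exp(-y) for y > 0 (the exponential(1) cdf).
# Use 1 - random.random() so the argument of log lies in (0, 1].
n = 200_000
ys = [-math.log(1.0 - random.random()) for _ in range(n)]

for y in (0.5, 1.0, 2.0):
    empirical = sum(v <= y for v in ys) / n
    theoretical = 1 - math.exp(-y)
    print(f"y={y}: empirical {empirical:.4f}, theoretical {theoretical:.4f}")
```

The two columns should agree to roughly Monte Carlo accuracy, about 1/sqrt(n).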
If the pdf of Y is continuous, it can be obtained by differentiating the cdf. The
resulting expression is given in the following theorem.
Theorem 2.1.5  Let X have pdf f_X(x) and let Y = g(X), where g is a monotone
function. Let \mathcal{X} and \mathcal{Y} be defined by (2.1.7). Suppose that f_X(x) is continuous on \mathcal{X}
and that g^{-1}(y) has a continuous derivative on \mathcal{Y}. Then the pdf of Y is given by

(2.1.10)    f_Y(y) = \begin{cases} f_X\!\left(g^{-1}(y)\right) \left| \dfrac{d}{dy} g^{-1}(y) \right| & y \in \mathcal{Y} \\ 0 & \text{otherwise.} \end{cases}

Proof: From Theorem 2.1.3 we have, by the chain rule,

f_Y(y) = \frac{d}{dy} F_Y(y) = \begin{cases} f_X\!\left(g^{-1}(y)\right) \dfrac{d}{dy} g^{-1}(y) & \text{if } g \text{ is increasing} \\ -f_X\!\left(g^{-1}(y)\right) \dfrac{d}{dy} g^{-1}(y) & \text{if } g \text{ is decreasing,} \end{cases}

which can be expressed concisely as (2.1.10).  □
Example 2.1.6 (Inverted gamma pdf)  Let f_X(x) be the gamma pdf

f_X(x) = \frac{1}{(n-1)!\,\beta^n}\, x^{n-1} e^{-x/\beta}, \quad 0 < x < \infty,

where \beta is a positive constant and n is a positive integer. Suppose we want to find the
pdf of g(X) = 1/X. Note that here the support sets \mathcal{X} and \mathcal{Y} are both the interval
(0, \infty). If we let y = g(x), then g^{-1}(y) = 1/y and \frac{d}{dy} g^{-1}(y) = -1/y^2. Applying the
above theorem, for y \in (0, \infty), we get

f_Y(y) = f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy} g^{-1}(y) \right|
       = \frac{1}{(n-1)!\,\beta^n} \left( \frac{1}{y} \right)^{n-1} e^{-1/(\beta y)}\, \frac{1}{y^2}
       = \frac{1}{(n-1)!\,\beta^n} \left( \frac{1}{y} \right)^{n+1} e^{-1/(\beta y)},

a special case of a pdf known as the inverted gamma pdf.  ∥
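One can check numerically that the pdf just derived integrates to 1 (a sketch; the values n = 3, beta = 2 are chosen here only for illustration).

```python
import math

# Inverted gamma pdf from Example 2.1.6 with illustrative n = 3, beta = 2:
#   f_Y(y) = 1/((n-1)! beta^n) * (1/y)^(n+1) * exp(-1/(beta*y)).
n, beta = 3, 2.0

def f_Y(y):
    return (1.0 / (math.factorial(n - 1) * beta**n)) \
        * y**(-(n + 1)) * math.exp(-1.0 / (beta * y))

# Integrate with the substitution y = e^u (trapezoid rule on a uniform u-grid),
# which handles both the y -> 0 and y -> infinity ends gracefully.
lo, hi, m = -14.0, 6.0, 40_000
h = (hi - lo) / m
total = 0.0
for i in range(m + 1):
    u = lo + i * h
    w = 0.5 if i in (0, m) else 1.0
    total += w * f_Y(math.exp(u)) * math.exp(u) * h

print(round(total, 6))  # should be very close to 1
```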
In many applications, the function g may be neither increasing nor decreasing;
hence the above results will not apply. However, it is often the case that g will be
monotone over certain intervals, and that allows us to get an expression for the
distribution of Y = g(X). (If g is not monotone over certain intervals, then we are
in deep trouble.)
Example 2.1.7 (Square transformation)  Suppose X is a continuous random
variable. For y > 0, the cdf of Y = X^2 is

F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}).

(2.1.13)    F_X^{-1}(y) = \inf\{x : F_X(x) \ge y\},

a definition that agrees with (2.1.12) when F_X is nonconstant and provides an F_X^{-1}
that is single-valued even when F_X is not strictly increasing. Using this definition, in
Figure 2.1.2b, we have F_X^{-1}(y) = x_1. At the endpoints of the range of y, F_X^{-1}(y) can
also be defined: F_X^{-1}(1) = \infty if F_X(x) < 1 for all x and, for any F_X, F_X^{-1}(0) = -\infty.
Proof of Theorem 2.1.10: For Y = F_X(X) we have, for 0 < y < 1,

P(Y \le y) = P(F_X(X) \le y)
           = P\!\left(F_X^{-1}(F_X(X)) \le F_X^{-1}(y)\right)
           = P\!\left(X \le F_X^{-1}(y)\right)
           = F_X\!\left(F_X^{-1}(y)\right)
           = y.

At the endpoints we have P(Y \le y) = 1 for y \ge 1 and P(Y \le y) = 0 for y \le 0,
showing that Y has a uniform distribution.

The reasoning behind the equality

P\!\left(F_X^{-1}(F_X(X)) \le F_X^{-1}(y)\right) = P\!\left(X \le F_X^{-1}(y)\right)

is somewhat subtle and deserves additional attention. If F_X is strictly increasing, then
it is true that F_X^{-1}(F_X(x)) = x. (Refer to Figure 2.1.2a.) However, if F_X is flat, it
may be that F_X^{-1}(F_X(x)) \ne x. Suppose F_X is as in Figure 2.1.2b and let x \in [x_1, x_2].
Then F_X^{-1}(F_X(x)) = x_1 for any x in this interval. Even in this case, though, the
probability equality holds, since P(X \le x) = P(X \le x_1) for any x \in [x_1, x_2]. The
flat cdf denotes a region of 0 probability (P(x_1 < X \le x) = F_X(x) - F_X(x_1) = 0).
□
One application of Theorem 2.1.10 is in the generation of random samples from a
particular distribution. If it is required to generate an observation X from a population
with cdf F_X, we need only generate a uniform random number U, between 0 and 1,
and solve for x in the equation F_X(x) = u. (For many distributions there are other
methods of generating observations that take less computer time, but this method is
still useful because of its general applicability.)
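The procedure just described can be sketched in Python for the exponential cdf F_X(x) = 1 - e^{-x/\lambda}, where F_X(x) = u solves to x = -\lambda \log(1-u); the value of \lambda and the sample size below are illustrative.

```python
import math
import random

random.seed(0)

# Inverse-cdf sampling: draw U ~ uniform(0,1) and solve F_X(x) = u.
# For F_X(x) = 1 - exp(-x/lam), the solution is x = -lam * log(1 - u).
def draw_exponential(lam):
    u = random.random()
    return -lam * math.log(1.0 - u)

lam = 2.0
n = 200_000
sample = [draw_exponential(lam) for _ in range(n)]
print(f"sample mean {sum(sample)/n:.3f} (theoretical mean {lam})")
```

The sample mean should land near the exponential mean \lambda, up to Monte Carlo error.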
2.2 Expected Values
The expected value, or expectation, of a random variable is merely its average value,
where we speak of “average” value as one that is weighted according to the probability
distribution. The expected value of a distribution can be thought of as a measure of
center, as we think of averages as being middle values. By weighting the values of
the random variable according to the probability distribution, we hope to obtain a
number that summarizes a typical or expected value of an observation of the random
variable.
Definition 2.2.1  The expected value or mean of a random variable g(X), denoted
by Eg(X), is

Eg(X) = \begin{cases} \displaystyle\int_{-\infty}^{\infty} g(x) f_X(x)\,dx & \text{if } X \text{ is continuous} \\ \displaystyle\sum_{x \in \mathcal{X}} g(x) f_X(x) = \sum_{x \in \mathcal{X}} g(x) P(X = x) & \text{if } X \text{ is discrete,} \end{cases}

provided that the integral or sum exists. If E|g(X)| = \infty, we say that Eg(X) does
not exist. (Ross 1988 refers to this as the “law of the unconscious statistician.” We
do not find this amusing.)
Example 2.2.2 (Exponential mean)  Suppose X has an exponential(\lambda) distri-
bution, that is, it has pdf given by

f_X(x) = \frac{1}{\lambda} e^{-x/\lambda}, \quad 0 \le x < \infty, \ \lambda > 0.

Then EX is given by

EX = \int_0^{\infty} \frac{1}{\lambda}\, x\, e^{-x/\lambda}\,dx
   = -x e^{-x/\lambda} \Big|_0^{\infty} + \int_0^{\infty} e^{-x/\lambda}\,dx    (integration by parts)
   = \int_0^{\infty} e^{-x/\lambda}\,dx = \lambda.  ∥
Example 2.2.3 (Binomial mean)  If X has a binomial distribution, its pmf is
given by

P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, \ldots, n,

where n is a positive integer, 0 < p < 1, and for every fixed pair n and p the pmf
sums to 1. The expected value of X is

EX = \sum_{x=0}^{n} x \binom{n}{x} p^{x} (1-p)^{n-x} = \sum_{x=1}^{n} x \binom{n}{x} p^{x} (1-p)^{n-x},

since the x = 0 term is 0. Using the identity x \binom{n}{x} = n \binom{n-1}{x-1}, we have

EX = \sum_{x=1}^{n} n \binom{n-1}{x-1} p^{x} (1-p)^{n-x}
   = \sum_{y=0}^{n-1} n \binom{n-1}{y} p^{y+1} (1-p)^{n-(y+1)}    (substitute y = x - 1)
   = np \sum_{y=0}^{n-1} \binom{n-1}{y} p^{y} (1-p)^{n-1-y}
   = np,

since the last summation must be 1, being the sum over all possible values of a
binomial(n - 1, p) pmf.  ∥
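The identity EX = np can be confirmed by summing the pmf directly (a small numeric check, not from the text; the (n, p) pairs are arbitrary).

```python
import math

# Sum x * C(n,x) p^x (1-p)^(n-x) over x = 0..n and compare with n*p.
def binomial_mean(n, p):
    return sum(x * math.comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(n + 1))

for n, p in [(10, 0.3), (25, 0.5), (40, 0.9)]:
    print(n, p, binomial_mean(n, p), n * p)
```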
Example 2.2.4 (Cauchy mean)  A classic example of a random variable whose
expected value does not exist is a Cauchy random variable, that is, one with pdf

f_X(x) = \frac{1}{\pi} \frac{1}{1 + x^2}, \quad -\infty < x < \infty.

It is straightforward to check that \int_{-\infty}^{\infty} f_X(x)\,dx = 1, but E|X| = \infty. Write

E|X| = \int_{-\infty}^{\infty} \frac{|x|}{\pi (1 + x^2)}\,dx = \frac{2}{\pi} \int_0^{\infty} \frac{x}{1 + x^2}\,dx.
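To see the divergence concretely, note that the truncated integral \int_{-T}^{T} |x|/(\pi(1+x^2))\,dx has the closed form \log(1+T^2)/\pi, which grows without bound as T \to \infty. The sketch below (not from the text) checks a numeric integral against that closed form for a few truncation points.

```python
import math

# Truncated E|X| for the Cauchy pdf: trapezoid rule on [0, T], doubled
# (the integrand is even).  Closed form: log(1 + T^2) / pi.
def truncated_abs_mean(T, steps=100_000):
    h = T / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * (x / (math.pi * (1 + x * x))) * h
    return 2 * total

for T in (10, 100, 1000):
    print(T, truncated_abs_mean(T), math.log(1 + T * T) / math.pi)
```

The values keep growing (logarithmically in T), which is exactly why E|X| = \infty.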
c. If g_1(x) \ge g_2(x) for all x, then Eg_1(X) \ge Eg_2(X).
d. If a \le g_1(x) \le b for all x, then a \le Eg_1(X) \le b.

That is, there is an h > 0 such that, for all t in -h < t < h, Ee^{tX} exists. If the expectation does
not exist in a neighborhood of 0, we say that the moment generating function does
not exist.
More explicitly, we can write the mgf of X as

M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx \quad \text{if } X \text{ is continuous,}

or

M_X(t) = \sum_{x} e^{tx} P(X = x) \quad \text{if } X \text{ is discrete.}

It is very easy to see how the mgf generates moments. We summarize the result in
the following theorem.
Theorem 2.3.7  If X has mgf M_X(t), then

EX^{n} = M_X^{(n)}(0),

where we define

M_X^{(n)}(0) = \frac{d^n}{dt^n} M_X(t) \Big|_{t=0}.

That is, the nth moment is equal to the nth derivative of M_X(t) evaluated at t = 0.
Proof: Assuming that we can differentiate under the integral sign (see the next
section), we have

\frac{d}{dt} M_X(t) = \frac{d}{dt} \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx
= \int_{-\infty}^{\infty} \left( \frac{d}{dt} e^{tx} \right) f_X(x)\,dx
= \int_{-\infty}^{\infty} \left( x e^{tx} \right) f_X(x)\,dx
= E X e^{tX}.

Thus,

\frac{d}{dt} M_X(t) \Big|_{t=0} = E X e^{tX} \Big|_{t=0} = EX.

Proceeding in an analogous manner, we can establish that

\frac{d^n}{dt^n} M_X(t) \Big|_{t=0} = E X^{n} e^{tX} \Big|_{t=0} = EX^{n}.  □
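The theorem above can be illustrated numerically (an assumed example, not from the text): for the exponential(\lambda) distribution the mgf is M_X(t) = 1/(1 - \lambda t) for t < 1/\lambda, and finite differences of M at t = 0 recover the moments EX = \lambda and EX^2 = 2\lambda^2.

```python
# Exponential(lam) mgf: M(t) = 1/(1 - lam*t) for t < 1/lam.
lam = 1.5

def M(t):
    return 1.0 / (1.0 - lam * t)

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)               # central difference ~ EX
second = (M(h) - 2 * M(0.0) + M(-h)) / h**2    # second difference ~ EX^2
print(first, lam)           # ~ 1.5
print(second, 2 * lam**2)   # ~ 4.5
```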
Example 2.3.8 (Gamma mgf)  In Example 2.1.6 we encountered a special case
of the gamma pdf,

f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta}, \quad 0 < x < \infty, \ \alpha > 0, \ \beta > 0,

where \Gamma(\alpha) denotes the gamma function, some of whose properties are given in Section
3.3. The mgf is given by

M_X(t) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} e^{tx}\, x^{\alpha-1} e^{-x/\beta}\,dx

(2.3.5)    = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha-1} e^{-x\left(\frac{1}{\beta} - t\right)}\,dx
           = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha-1} e^{-x/\left(\frac{\beta}{1 - \beta t}\right)}\,dx.

We now recognize the integrand in (2.3.5) as the kernel of another gamma pdf.
(The kernel of a function is the main part of the function, the part that remains
when constants are disregarded.) Using the fact that, for any positive constants a
and b,

f(x) = \frac{1}{\Gamma(a)\, b^{a}}\, x^{a-1} e^{-x/b}

is a pdf, we have that

\int_0^{\infty} \frac{1}{\Gamma(a)\, b^{a}}\, x^{a-1} e^{-x/b}\,dx = 1

and, hence,

(2.3.6)    \int_0^{\infty} x^{a-1} e^{-x/b}\,dx = \Gamma(a)\, b^{a}.

Applying (2.3.6) to (2.3.5), we have

M_X(t) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, \Gamma(\alpha) \left( \frac{\beta}{1 - \beta t} \right)^{\alpha} = \left( \frac{1}{1 - \beta t} \right)^{\alpha} \quad \text{if } t < \frac{1}{\beta}.

If t \ge 1/\beta, then the quantity (1/\beta) - t, in the integrand of (2.3.5), is nonpositive and
the integral in (2.3.6) is infinite. Thus, the mgf of the gamma distribution exists only
if t < 1/\beta. (In Section 3.3 we will explore the gamma function in more detail.)
The mean of the gamma distribution is given by

EX = \frac{d}{dt} M_X(t) \Big|_{t=0} = \frac{\alpha\beta}{(1 - \beta t)^{\alpha+1}} \Big|_{t=0} = \alpha\beta.

Other moments can be calculated in a similar manner.  ∥
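As a numerical sketch (with \alpha, \beta, and t chosen arbitrarily, subject to t < 1/\beta), the closed form (1/(1-\beta t))^{\alpha} can be compared with direct numeric integration of \int_0^{\infty} e^{tx} f(x)\,dx.

```python
import math

alpha, beta, t = 2.5, 1.5, 0.2  # illustrative values, t < 1/beta

def f(x):  # gamma(alpha, beta) pdf
    return x**(alpha - 1) * math.exp(-x / beta) / (math.gamma(alpha) * beta**alpha)

# Trapezoid rule on [0, 150]; the integrand decays like exp(-x(1/beta - t)),
# so the truncated tail is negligible.
m, hi = 200_000, 150.0
h = hi / m
mgf_num = sum((0.5 if i in (0, m) else 1.0) * math.exp(t * (i * h)) * f(i * h) * h
              for i in range(m + 1))
mgf_closed = (1.0 / (1.0 - beta * t)) ** alpha
print(mgf_num, mgf_closed)
```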
Example 2.3.9 (Binomial mgf)  For a second illustration of calculating a moment
generating function, we consider a discrete distribution, the binomial distribution. The
binomial(n, p) pmf is given in Example 2.2.3. So

M_X(t) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^{x} (1-p)^{n-x} = \sum_{x=0}^{n} \binom{n}{x} \left(p e^{t}\right)^{x} (1-p)^{n-x}.

The binomial formula (see Theorem 3.2.2) gives

(2.3.7)    \sum_{x=0}^{n} \binom{n}{x} u^{x} v^{n-x} = (u + v)^{n}.

Hence, letting u = p e^{t} and v = 1 - p, we have

M_X(t) = \left[ p e^{t} + (1 - p) \right]^{n}.  ∥
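A direct check of this identity (with n, p, and t assumed for illustration):

```python
import math

# Binomial mgf: the sum over the pmf should equal [p*e^t + (1-p)]^n.
n, p, t = 15, 0.3, 0.7
lhs = sum(math.exp(t * x) * math.comb(n, x) * p**x * (1 - p)**(n - x)
          for x in range(n + 1))
rhs = (p * math.exp(t) + (1 - p)) ** n
print(lhs, rhs)
```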
As previously mentioned, the major usefulness of the moment generating function is
not in its ability to generate moments. Rather, its usefulness stems from the fact that,
in many cases, the moment generating function can characterize a distribution. There
are, however, some technical difficulties associated with using moments to characterize
a distribution, which we will now investigate.

If the mgf exists, it characterizes an infinite set of moments. The natural question is
whether characterizing the infinite set of moments uniquely determines a distribution
function. The answer to this question, unfortunately, is no. Characterizing the set of
moments is not enough to determine a distribution uniquely because there may be
two distinct random variables having the same moments.
Example 2.3.10 (Nonunique moments)  Consider the two pdfs given by

f_1(x) = \frac{1}{\sqrt{2\pi}\,x}\, e^{-(\log x)^2/2}, \quad 0 \le x < \infty,

f_2(x) = f_1(x)\left[1 + \sin(2\pi \log x)\right], \quad 0 \le x < \infty.

(The pdf f_1 is a special case of a lognormal pdf.)
It can be shown that if X_1 \sim f_1(x), then

EX_1^{r} = e^{r^2/2}, \quad r = 0, 1, \ldots,
Figure 2.3.2. Two pdfs with the same moments: f_1(x) = \frac{1}{\sqrt{2\pi}\,x} e^{-(\log x)^2/2} and f_2(x) = f_1(x)\left[1 + \sin(2\pi \log x)\right]
so X_1 has all of its moments. Now suppose that X_2 \sim f_2(x). We have

EX_2^{r} = \int_0^{\infty} x^{r} f_1(x)\left[1 + \sin(2\pi \log x)\right] dx
         = EX_1^{r} + \int_0^{\infty} x^{r} f_1(x) \sin(2\pi \log x)\,dx.

However, the transformation y = \log x - r shows that this last integral is that of
an odd function over (-\infty, \infty) and hence is equal to 0 for r = 0, 1, \ldots. Thus, even
though X_1 and X_2 have distinct pdfs, they have the same moments for all r. The two
pdfs are pictured in Figure 2.3.2.

See Exercise 2.35 for details and also Exercises 2.34, 2.36, and 2.37 for more about
mgfs and distributions.  ∥
The problem of uniqueness of moments does not occur if the random variables
have bounded support. If that is the case, then the infinite sequence of moments
does uniquely determine the distribution (see, for example, Billingsley 1995, Section
30). Furthermore, if the mgf exists in a neighborhood of 0, then the distribution is
uniquely determined, no matter what its support. Thus, existence of all moments is
not equivalent to existence of the moment generating function. The following theorem
shows how a distribution can be characterized.
Theorem 2.3.11  Let F_X(x) and F_Y(y) be two cdfs all of whose moments exist.
a. If X and Y have bounded support, then F_X(u) = F_Y(u) for all u if and only if
EX^{r} = EY^{r} for all integers r = 0, 1, 2, \ldots.
b. If the moment generating functions exist and M_X(t) = M_Y(t) for all t in some
neighborhood of 0, then F_X(u) = F_Y(u) for all u.
In the next theorem, which deals with a sequence of mgfs that converges, we do
not treat the bounded support case separately. Note that the uniqueness assump-
tion is automatically satisfied if the limiting mgf exists in a neighborhood of 0 (see
Miscellanea 2.6.1).
Theorem 2.3.12 (Convergence of mgfs)  Suppose \{X_i, i = 1, 2, \ldots\} is a se-
quence of random variables, each with mgf M_{X_i}(t). Furthermore, suppose that

\lim_{i \to \infty} M_{X_i}(t) = M_X(t), \quad \text{for all } t \text{ in a neighborhood of } 0,

and M_X(t) is an mgf. Then there is a unique cdf F_X whose moments are determined
by M_X(t) and, for all x where F_X(x) is continuous, we have

\lim_{i \to \infty} F_{X_i}(x) = F_X(x).

That is, convergence, for |t| \le h, of mgfs to an mgf implies convergence of cdfs.

we can show that M_{X_n}(t) \to M_Y(t) as n \to \infty. The validity of the
approximation in (2.3.9) will then follow from Theorem 2.3.12.
We first must digress a bit and mention an important limit result, one that has wide
applicability in statistics. The proof of this lemma may be found in many standard
calculus texts.

Lemma 2.3.14  Let a_1, a_2, \ldots be a sequence of numbers converging to a, that is,
\lim_{n \to \infty} a_n = a. Then

\lim_{n \to \infty} \left( 1 + \frac{a_n}{n} \right)^{n} = e^{a}.
Returning to the example, we have

M_{X_n}(t) = \left[ p e^{t} + (1 - p) \right]^{n} = \left[ 1 + \frac{1}{n}(e^{t} - 1)np \right]^{n} = \left[ 1 + \frac{1}{n}(e^{t} - 1)\lambda \right]^{n},

because \lambda = np. Now set a_n = a = (e^{t} - 1)\lambda, and apply Lemma 2.3.14 to get

\lim_{n \to \infty} M_{X_n}(t) = e^{\lambda(e^{t} - 1)} = M_Y(t),

the moment generating function of the Poisson.

The Poisson approximation can be quite good even for moderate p and n. In Figure
2.3.3 we show a binomial pmf along with its Poisson approximation, with
\lambda = np. The approximation appears to be satisfactory.  ∥
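The comparison behind Figure 2.3.3 can be reproduced numerically with n = 15, p = .3, and \lambda = np = 4.5 (a sketch, printing the two pmfs side by side rather than plotting them).

```python
import math

n, p = 15, 0.3
lam = n * p  # Poisson parameter matched to the binomial mean

def binom_pmf(x):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def pois_pmf(x):
    return math.exp(-lam) * lam**x / math.factorial(x)

max_diff = max(abs(binom_pmf(x) - pois_pmf(x)) for x in range(n + 1))
for x in range(8):
    print(x, round(binom_pmf(x), 4), round(pois_pmf(x), 4))
print("max abs difference:", round(max_diff, 4))
```

The largest pointwise discrepancy is only a few hundredths, consistent with the remark that the approximation is already satisfactory at these moderate values.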
We close this section with a useful result concerning mgfs.
Theorem 2.3.15  For any constants a and b, the mgf of the random variable aX + b
is given by

M_{aX+b}(t) = e^{bt} M_X(at).

Figure 2.3.3. Poisson (dotted line) approximation to the binomial (solid line), n = 15, p = .3
Proof: By definition,

M_{aX+b}(t) = E\left( e^{(aX+b)t} \right)
            = E\left( e^{(at)X} e^{bt} \right)    (properties of exponentials)
            = e^{bt} E\left( e^{(at)X} \right)    (e^{bt} is constant)
            = e^{bt} M_X(at),    (definition of mgf)

proving the theorem.  □
2.4 Differentiating Under an Integral Sign
In the previous section we encountered an instance in which we desired to interchange
the order of integration and differentiation. This situation is encountered frequently in
theoretical statistics. The purpose of this section is to characterize conditions under
which this operation is legitimate. We will also discuss interchanging the order of
differentiation and summation.
Many of these conditions can be established using standard theorems from calculus
and detailed proofs can be found in most calculus textbooks. Thus, detailed proofs
will not be presented here.
We first want to establish the method of calculating

(2.4.1)    \frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x, \theta)\,dx,

where -\infty < a(\theta), b(\theta) < \infty for all \theta. The rule for differentiating (2.4.1) is called
Leibnitz's Rule and is an application of the Fundamental Theorem of Calculus and
the chain rule.
Theorem 2.4.1 (Leibnitz's Rule)  If f(x, \theta), a(\theta), and b(\theta) are differentiable with
respect to \theta, then

\frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x, \theta)\,dx = f(b(\theta), \theta)\, \frac{d}{d\theta} b(\theta) - f(a(\theta), \theta)\, \frac{d}{d\theta} a(\theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial\theta} f(x, \theta)\,dx.

Notice that if a(\theta) and b(\theta) are constant, we have a special case of Leibnitz's Rule:

\frac{d}{d\theta} \int_{a}^{b} f(x, \theta)\,dx = \int_{a}^{b} \frac{\partial}{\partial\theta} f(x, \theta)\,dx.
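Leibnitz's Rule can be checked numerically on an assumed example (not from the text): f(x, \theta) = e^{-\theta x}, a(\theta) = \theta, b(\theta) = \theta^2. Both sides of the rule are computed below by elementary numerics, the left side as a central difference of the integral and the right side from the three-term formula.

```python
import math

def integral(g, a, b, m=20_000):
    """Composite trapezoid rule for g on [a, b]."""
    h = (b - a) / m
    return sum((0.5 if i in (0, m) else 1.0) * g(a + i * h) * h
               for i in range(m + 1))

def f(x, th):
    return math.exp(-th * x)

th = 1.3

def I(t):  # the integral in (2.4.1) as a function of theta
    return integral(lambda x: f(x, t), t, t * t)

# Left side: d/d(theta) of the integral, by central difference.
eps = 1e-5
lhs = (I(th + eps) - I(th - eps)) / (2 * eps)

# Right side: Leibnitz's Rule, with b'(theta) = 2*theta, a'(theta) = 1,
# and partial f / partial theta = -x * exp(-theta * x).
rhs = f(th * th, th) * (2 * th) - f(th, th) * 1.0 \
      + integral(lambda x: -x * f(x, th), th, th * th)

print(lhs, rhs)
```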
Thus, in general, if we have the integral of a differentiable function over a finite range,
differentiation of the integral poses no problem. If the range of integration is infinite,
however, problems can arise.
Note that the interchange of derivative and integral in the above equation equates
a partial derivative with an ordinary derivative. Formally, this must be the case since
the left-hand side is a function of only 0, while the integrand on the right-hand side
is a function of both @ and x.
The question of whether interchanging the order of differentiation and integration
is justified is really a question of whether limits and integration can be interchanged,
since a derivative is a special kind of limit. Recall that if f(x, \theta) is differentiable, then

\frac{\partial}{\partial\theta} f(x, \theta) = \lim_{\delta \to 0} \frac{f(x, \theta + \delta) - f(x, \theta)}{\delta},

so we have

\int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x, \theta)\,dx = \int_{-\infty}^{\infty} \lim_{\delta \to 0} \frac{f(x, \theta + \delta) - f(x, \theta)}{\delta}\,dx,

while

\frac{d}{d\theta} \int_{-\infty}^{\infty} f(x, \theta)\,dx = \lim_{\delta \to 0} \int_{-\infty}^{\infty} \frac{f(x, \theta + \delta) - f(x, \theta)}{\delta}\,dx.

Therefore, if we can justify the interchanging of the order of limits and integration,
differentiation under the integral sign will be justified. Treatment of this problem
in full generality will, unfortunately, necessitate the use of measure theory, a topic
that will not be covered in this book. However, the statements and conclusions of
some important results can be given. The following theorems are all corollaries of
Lebesgue's Dominated Convergence Theorem (see, for example, Rudin 1976).
Theorem 2.4.2  Suppose the function h(x, y) is continuous at y_0 for each x, and
there exists a function g(x) satisfying
i. |h(x, y)| \le g(x) for all x and y,
ii. \int_{-\infty}^{\infty} g(x)\,dx < \infty.
Then

\lim_{y \to y_0} \int_{-\infty}^{\infty} h(x, y)\,dx = \int_{-\infty}^{\infty} \lim_{y \to y_0} h(x, y)\,dx.
The key condition in this theorem is the existence of a dominating function g(x),
with a finite integral, which ensures that the integrals cannot be too badly behaved.
We can now apply this theorem to the case we are considering by identifying h(x, y)
with the difference (f(x, \theta_0 + \delta) - f(x, \theta_0))/\delta.
Theorem 2.4.3  Suppose f(x, \theta) is differentiable at \theta = \theta_0, that is,

\lim_{\delta \to 0} \frac{f(x, \theta_0 + \delta) - f(x, \theta_0)}{\delta} = \frac{\partial}{\partial\theta} f(x, \theta) \Big|_{\theta = \theta_0}

exists for every x, and there exist a function g(x, \theta_0) and a constant \delta_0 > 0 such that

i. \left| \dfrac{f(x, \theta_0 + \delta) - f(x, \theta_0)}{\delta} \right| \le g(x, \theta_0) for all x and |\delta| \le \delta_0,
ii. \int_{-\infty}^{\infty} g(x, \theta_0)\,dx < \infty.

Then

(2.4.2)    \frac{d}{d\theta} \int_{-\infty}^{\infty} f(x, \theta)\,dx \Big|_{\theta = \theta_0} = \int_{-\infty}^{\infty} \left[ \frac{\partial}{\partial\theta} f(x, \theta) \Big|_{\theta = \theta_0} \right] dx.
Condition (i) is similar to what is known as a Lipschitz condition, a condition
that imposes smoothness on a function. Here, condition (i) is effectively bounding
the variability in the first derivative; other smoothness constraints might bound this
variability by a constant (instead of a function g), or place a bound on the variability
of the second derivative of f.
The conclusion of Theorem 2.4.3 is a little cumbersome, but it is important to realize
that although we seem to be treating \theta as a variable, the statement of the theorem
is for one value of \theta. That is, for each value \theta_0 for which f(x, \theta) is differentiable at
\theta_0 and satisfies conditions (i) and (ii), the order of integration and differentiation can
be interchanged. Often the distinction between \theta and \theta_0 is not stressed and (2.4.2) is
written

(2.4.3)    \frac{d}{d\theta} \int_{-\infty}^{\infty} f(x, \theta)\,dx = \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x, \theta)\,dx.
Typically, f(x, \theta) is differentiable at all \theta, not at just one value \theta_0. In this case,
condition (i) of Theorem 2.4.3 can be replaced by another condition that often proves
easier to verify. By an application of the mean value theorem, it follows that, for fixed
x and \theta_0, and |\delta| \le \delta_0,

\frac{f(x, \theta_0 + \delta) - f(x, \theta_0)}{\delta} = \frac{\partial}{\partial\theta} f(x, \theta) \Big|_{\theta = \theta_0 + \delta^*(x)}

for some number \delta^*(x), where |\delta^*(x)| \le \delta_0. Therefore, condition (i) will be satisfied
if we find a g(x, \theta) that satisfies condition (ii) and

(2.4.4)    \left| \frac{\partial}{\partial\theta} f(x, \theta) \Big|_{\theta = \theta'} \right| \le g(x, \theta) \quad \text{for all } \theta' \text{ such that } |\theta' - \theta| \le \delta_0.
Note that in (2.4.4) \delta_0 is implicitly a function of \theta, as is the case in Theorem 2.4.3.
This is permitted since the theorem is applied to each value of \theta individually. From
(2.4.4) we get the following corollary.

Corollary 2.4.4  Suppose f(x, \theta) is differentiable in \theta and there exists a function
g(x, \theta) such that (2.4.4) is satisfied and \int_{-\infty}^{\infty} g(x, \theta)\,dx < \infty. Then (2.4.3) holds.
Notice that both condition (i) of Theorem 2.4.3 and (2.4.4) impose a uniformity
requirement on the functions to be bounded; some type of uniformity is generally
needed before derivatives and integrals can be interchanged.
Example 2.4.5 (Interchanging integration and differentiation-I)  Let X
have the exponential(\lambda) pdf given by f(x) = (1/\lambda)e^{-x/\lambda}, 0 < x < \infty, and suppose
we want to calculate

(2.4.5)    \frac{d}{d\lambda} EX^{n} = \frac{d}{d\lambda} \int_0^{\infty} x^{n} \frac{1}{\lambda} e^{-x/\lambda}\,dx

for integer n > 0. If we could move the differentiation inside the integral, we would
have

(2.4.6)    \frac{d}{d\lambda} EX^{n} = \int_0^{\infty} x^{n} \frac{\partial}{\partial\lambda} \left( \frac{1}{\lambda} e^{-x/\lambda} \right) dx
         = \frac{1}{\lambda} \int_0^{\infty} x^{n} \left( \frac{x}{\lambda} - 1 \right) \frac{1}{\lambda} e^{-x/\lambda}\,dx
         = \frac{EX^{n+1}}{\lambda^2} - \frac{EX^{n}}{\lambda}.

To justify the interchange of integration and differentiation, we bound the derivative
of x^{n}(1/\lambda)e^{-x/\lambda}. Now

\left| \frac{\partial}{\partial\lambda} \left( \frac{x^{n} e^{-x/\lambda}}{\lambda} \right) \right| = \frac{x^{n} e^{-x/\lambda}}{\lambda^2} \left| \frac{x}{\lambda} - 1 \right| \le \frac{x^{n} e^{-x/\lambda}}{\lambda^2} \left( \frac{x}{\lambda} + 1 \right).

For some constant \delta_0 satisfying 0 < \delta_0 < \lambda, take

g(x, \lambda) = \frac{x^{n} e^{-x/(\lambda + \delta_0)}}{(\lambda - \delta_0)^2} \left( \frac{x}{\lambda - \delta_0} + 1 \right).

We then have

\left| \frac{\partial}{\partial\lambda} \left( \frac{x^{n} e^{-x/\lambda}}{\lambda} \right) \Big|_{\lambda = \lambda'} \right| \le g(x, \lambda) \quad \text{for all } \lambda' \text{ such that } |\lambda' - \lambda| \le \delta_0.

Since the exponential distribution has all of its moments, \int_0^{\infty} g(x, \lambda)\,dx < \infty for all
\lambda > \delta_0 > 0, so the interchange of integration and differentiation is justified.  ∥
The property illustrated for the exponential distribution holds for a large class of
densities, which will be dealt with in Section 3.4.
Notice that (2.4.6) gives us a recursion relation for the moments of the exponential
distribution,

(2.4.7)    EX^{n+1} = \lambda^2 \frac{d}{d\lambda} EX^{n} + \lambda EX^{n},

making the calculation of the (n+1)st moment relatively easy. This type of relation-
ship exists for other distributions. In particular, if X has a normal distribution with
mean \mu and variance 1, so it has pdf f(x) = (1/\sqrt{2\pi}) e^{-(x-\mu)^2/2}, then

EX^{n+1} = \mu EX^{n} + \frac{d}{d\mu} EX^{n}.
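The recursion (2.4.7) can be checked against the known exponential moments EX^n = n!\,\lambda^n (so that \frac{d}{d\lambda} EX^n = n \cdot n!\,\lambda^{n-1}); a small sketch with an illustrative \lambda:

```python
import math

# Verify EX^(n+1) = lam^2 * d/dlam EX^n + lam * EX^n
# using the closed-form exponential moments EX^n = n! * lam^n.
lam = 0.8
for n in range(6):
    moment_n = math.factorial(n) * lam**n
    d_moment_n = n * math.factorial(n) * lam**(n - 1) if n > 0 else 0.0
    recursion = lam**2 * d_moment_n + lam * moment_n
    direct = math.factorial(n + 1) * lam**(n + 1)
    print(n, recursion, direct)
```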
We illustrate one more interchange of differentiation and integration, one involving
the moment generating function.

Example 2.4.6 (Interchanging integration and differentiation-II)  Again
let X have a normal distribution with mean \mu and variance 1, and consider the mgf
of X,

M_X(t) = E e^{tX} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx} e^{-(x-\mu)^2/2}\,dx.

In Section 2.3 it was stated that we can calculate moments by differentiation of M_X(t)
and differentiation under the integral sign was justified:

(2.4.8)    \frac{d}{dt} M_X(t) = \frac{d}{dt} E e^{tX} = E \frac{d}{dt} e^{tX} = E\left( X e^{tX} \right).

We can apply the results of this section to justify the operations in (2.4.8). Notice
that when applying either Theorem 2.4.3 or Corollary 2.4.4 here, we identify t with
the variable \theta in Theorem 2.4.3. The parameter \mu is treated as a constant.

From Corollary 2.4.4, we must find a function g(x, t), with finite integral, that
satisfies

(2.4.9)    \left| \frac{\partial}{\partial t} e^{tx} e^{-(x-\mu)^2/2} \Big|_{t = t'} \right| \le g(x, t) \quad \text{for all } t' \text{ such that } |t' - t| \le \delta_0.
Doing the obvious, we have

\left| \frac{\partial}{\partial t}\, e^{tx} e^{-(x-\mu)^2/2} \right| = \left| x e^{tx} e^{-(x-\mu)^2/2} \right| = |x|\, e^{tx} e^{-(x-\mu)^2/2}.

It is easiest to define our function g(x, t) separately for x \ge 0 and x < 0. We take

g(x, t) = \begin{cases} |x|\, e^{(t - \delta_0)x} e^{-(x-\mu)^2/2} & \text{if } x < 0 \\ |x|\, e^{(t + \delta_0)x} e^{-(x-\mu)^2/2} & \text{if } x \ge 0. \end{cases}

It is clear that this function satisfies (2.4.9); it remains to check that its integral is
finite.
For x \ge 0 we have

g(x, t) = x\, e^{-\left( x^2 - 2x(\mu + t + \delta_0) + \mu^2 \right)/2}.

We now complete the square in the exponent; that is, we write

x^2 - 2x(\mu + t + \delta_0) + \mu^2
= x^2 - 2x(\mu + t + \delta_0) + (\mu + t + \delta_0)^2 - (\mu + t + \delta_0)^2 + \mu^2
= \left( x - (\mu + t + \delta_0) \right)^2 + \mu^2 - (\mu + t + \delta_0)^2,

and so, for x \ge 0,

g(x, t) = x\, e^{-\left[ x - (\mu + t + \delta_0) \right]^2/2}\, e^{-\left[ \mu^2 - (\mu + t + \delta_0)^2 \right]/2}.

Since the last exponential factor in this expression does not depend on x, \int_0^{\infty} g(x, t)\,dx
is essentially calculating the mean of a normal distribution with mean \mu + t + \delta_0, except
that the integration is only over (0, \infty). However, it follows that the integral is finite
because the normal distribution has a finite mean (to be shown in Chapter 3). A
similar development for x < 0 shows that

g(x, t) = |x|\, e^{-\left[ x - (\mu + t - \delta_0) \right]^2/2}\, e^{-\left[ \mu^2 - (\mu + t - \delta_0)^2 \right]/2},

and so \int_{-\infty}^{0} g(x, t)\,dx < \infty. Therefore, we have found an integrable function satisfying
(2.4.9) and the operation in (2.4.8) is justified.  ∥
We now turn to the question of when it is possible to interchange differentiation
and summation, an operation that plays an important role in discrete distributions.
Of course, we are concerned only with infinite sums, since a derivative can always be
taken inside a finite sum.
Example 2.4.7 (Interchanging summation and differentiation)  Let X be a
discrete random variable with the geometric distribution

P(X = x) = \theta(1 - \theta)^{x}, \quad x = 0, 1, \ldots, \quad 0 < \theta < 1.

We have that \sum_{x=0}^{\infty} \theta(1 - \theta)^{x} = 1 and, provided that the operations are justified,

\frac{d}{d\theta} \sum_{x=0}^{\infty} \theta(1 - \theta)^{x} = \sum_{x=0}^{\infty} \frac{\partial}{\partial\theta} \left[ \theta(1 - \theta)^{x} \right]
= \sum_{x=0}^{\infty} \left[ (1 - \theta)^{x} - \theta x (1 - \theta)^{x-1} \right]
= \frac{1}{\theta} \sum_{x=0}^{\infty} \theta(1 - \theta)^{x} - \frac{1}{1 - \theta} \sum_{x=0}^{\infty} x\, \theta(1 - \theta)^{x}.

Since \sum_{x=0}^{\infty} \theta(1 - \theta)^{x} = 1 for all 0 < \theta < 1, its derivative is 0. So we have

(2.4.10)    0 = \frac{1}{\theta} \sum_{x=0}^{\infty} \theta(1 - \theta)^{x} - \frac{1}{1 - \theta} \sum_{x=0}^{\infty} x\, \theta(1 - \theta)^{x}.

Now the first sum in (2.4.10) is equal to 1 and the second sum is EX; hence (2.4.10)
becomes

0 = \frac{1}{\theta} - \frac{1}{1 - \theta}\, EX,

or

EX = \frac{1 - \theta}{\theta}.

We have, in essence, summed the series \sum_{x=0}^{\infty} x\, \theta(1 - \theta)^{x} by differentiating.  ∥
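The conclusion EX = (1 - \theta)/\theta can be confirmed by summing the series numerically (a sketch; \theta and the truncation point are illustrative, and the geometric tail makes the truncation error negligible).

```python
# Geometric mean by brute-force summation: sum x * theta * (1-theta)^x.
theta = 0.25
ex = sum(x * theta * (1 - theta) ** x for x in range(2000))
print(ex, (1 - theta) / theta)
```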
Justification for taking the derivative inside the summation is more straightforward
than the integration case. The following theorem provides the details.
Theorem 2.4.8  Suppose that the series \sum_{x=0}^{\infty} h(\theta, x) converges for all \theta in an
interval (a, b) of real numbers and
i. \frac{\partial}{\partial\theta} h(\theta, x) is continuous in \theta for each x,
ii. \sum_{x=0}^{\infty} \frac{\partial}{\partial\theta} h(\theta, x) converges uniformly on every closed bounded subinterval of (a, b).
Then

(2.4.11)    \frac{d}{d\theta} \sum_{x=0}^{\infty} h(\theta, x) = \sum_{x=0}^{\infty} \frac{\partial}{\partial\theta} h(\theta, x).
The condition of uniform convergence is the key one to verify in order to establish
that the differentiation can be taken inside the summation. Recall that a series con-
verges uniformly if its sequence of partial sums converges uniformly, a fact that we
use in the following example.
Example 2.4.9 (Continuation of Example 2.4.7)  To apply Theorem 2.4.8 we
identify

h(\theta, x) = \theta(1 - \theta)^{x}

and

\frac{\partial}{\partial\theta} h(\theta, x) = (1 - \theta)^{x} - \theta x (1 - \theta)^{x-1},

and verify that \sum_{x=0}^{\infty} \frac{\partial}{\partial\theta} h(\theta, x) converges uniformly. Define S_n(\theta) by

S_n(\theta) = \sum_{x=0}^{n} \left[ (1 - \theta)^{x} - \theta x (1 - \theta)^{x-1} \right].

The convergence will be uniform on [c, d] \subset (0, 1) if, given \epsilon > 0, we can find an N
such that

n \ge N \Rightarrow |S_n(\theta) - S_{\infty}(\theta)| < \epsilon \quad \text{for all } \theta \in [c, d].
Recall the partial sum of the geometric series (1.5.3). If y \ne 1, then we can write

\sum_{k=0}^{n} y^{k} = \frac{1 - y^{n+1}}{1 - y}.

Applying this, we have

\sum_{x=0}^{n} (1 - \theta)^{x} = \frac{1 - (1 - \theta)^{n+1}}{\theta}

and

\sum_{x=0}^{n} \theta x (1 - \theta)^{x-1} = -\theta \frac{d}{d\theta} \sum_{x=0}^{n} (1 - \theta)^{x} = -\theta \frac{d}{d\theta} \left( \frac{1 - (1 - \theta)^{n+1}}{\theta} \right).

Here we (justifiably) pull the derivative through the finite sum. Calculating this
derivative gives

\frac{d}{d\theta} \left( \frac{1 - (1 - \theta)^{n+1}}{\theta} \right) = \frac{\theta(n+1)(1 - \theta)^{n} - \left( 1 - (1 - \theta)^{n+1} \right)}{\theta^2}

and, hence,

S_n(\theta) = \frac{1 - (1 - \theta)^{n+1}}{\theta} - \frac{\left( 1 - (1 - \theta)^{n+1} \right) - (n+1)\theta(1 - \theta)^{n}}{\theta}
          = (n+1)(1 - \theta)^{n}.

It is clear that, for 0 < \theta < 1, S_{\infty}(\theta) = \lim_{n \to \infty} S_n(\theta) = 0. Moreover, for \theta \in [c, d] \subset (0, 1)
we have S_n(\theta) \le (n+1)(1 - c)^{n} \to 0, so the convergence is uniform on any closed
bounded interval. Therefore, the series of derivatives converges uniformly and the
interchange of differentiation and summation is justified.  ∥
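A tiny numeric illustration of this uniformity (values of c and n assumed): since S_n(\theta) = (n+1)(1-\theta)^n decreases in \theta, its supremum over [c, d] \subset (0, 1) is attained at \theta = c, and even that worst case tends to 0.

```python
# Worst-case partial-sum error sup_{theta in [c,d]} S_n(theta) = (n+1)(1-c)^n.
c = 0.1
for n in (10, 100, 1000):
    worst = (n + 1) * (1 - c) ** n
    print(n, worst)
```

The worst case is not small for n = 10, but it decays geometrically, which is the uniform convergence used in the example.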
We close this section with a theorem that is similar to Theorem 2.4.8, but treats
the case of interchanging the order of summation and integration.

Theorem 2.4.10  Suppose the series \sum_{x=0}^{\infty} h(\theta, x) converges uniformly on [a, b] and
that, for each x, h(\theta, x) is a continuous function of \theta. Then

\int_a^b \sum_{x=0}^{\infty} h(\theta, x)\,d\theta = \sum_{x=0}^{\infty} \int_a^b h(\theta, x)\,d\theta.
2.5 Exercises
2.1 In each of the following find the pdf of Y. Show that the pdf integrates to 1.
(a) Y = X^3 and f_X(x) = 42x^5(1 - x), 0 < x < 1
(b) Y = 4X + 3 and f_X(x) = 7e^{-7x}, 0 < x < \infty
(c) Y = X^2 and f_X(x) = 30x^2(1 - x)^2, 0 < x < 1
(See Example A.0.2 in Appendix A.)
2.2 In each of the following find the pdf of Y.
(a) Y = X^2 and f_X(x) = 1, 0 < x < 1
(b) Y = -\log X and X has pdf

f_X(x) = \frac{(n + m + 1)!}{n!\,m!}\, x^{n}(1 - x)^{m}, \quad 0 < x < 1, \quad m, n \text{ positive integers}

(c) Y = e^{X} and X has pdf

f_X(x) = \frac{1}{\sigma^2}\, x\, e^{-(x/\sigma)^2/2}, \quad 0 < x < \infty, \quad \sigma^2 \text{ a positive constant}
2.3 Suppose X has the geometric pmf f_X(x) = \frac{1}{3}\left(\frac{2}{3}\right)^{x}, x = 0, 1, 2, \ldots. Determine the
probability distribution of Y = X/(X + 1). Note that here both X and Y are discrete
random variables. To specify the probability distribution of Y, specify its pmf.
2.4 Let \lambda be a fixed positive constant, and define the function f(x) by f(x) = \frac{1}{2}\lambda e^{-\lambda x} if
x \ge 0 and f(x) = \frac{1}{2}\lambda e^{\lambda x} if x < 0.
(a) Verify that f(x) is a pdf.
(b) If X is a random variable with pdf given by f(x), find P(X < t) for all t. Evaluate
all integrals.
(c) Find P(|X| < t) for all t. Evaluate all integrals.
2.7 Let X have pdf f_X(x) = \frac{2}{9}(x + 1), -1 \le x \le 2.
(a) Find the pdf of Y = X^2. Note that Theorem 2.1.8 is not directly applicable in
this problem.
(b) Show that Theorem 2.1.8 remains valid if the sets A_0, A_1, \ldots, A_k contain \mathcal{X}, and
apply the extension to solve part (a) using A_0 = \emptyset, A_1 = (-2, 0), and A_2 = (0, 2).
2.8 In each of the following show that the given function is a cdf and find F_X^{-1}(y).

(a) F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x} & \text{if } x \ge 0 \end{cases}
2.9
2.10
2.11
2.12
P(Y > y) \ge P(U > y) = 1 - y, \quad \text{for all } y, \ 0 < y < 1,
P(Y > y) > P(U > y) = 1 - y, \quad \text{for some } y, \ 0 < y < 1.