Chap 1-4, Statistical Inference, by Casella and Berger PDF

Statistical Inference
Second Edition

George Casella, University of Florida
Roger L. Berger, North Carolina State University

DUXBURY / THOMSON LEARNING
Australia, Canada, Mexico, Singapore, Spain, United Kingdom, United States

Sponsoring Editor: Carolyn Crockett
Marketing Representative: Tom Ziolkowski
Editorial Assistant: Jennifer Jenkins
Assistant Editor: Ann Day
Production Editor: Tom Novack
Manuscript Editor: Carol Reitz
Permissions Editor: Sue Ewing
Cover Design: Jennifer Mackres
Interior Illustration: Lori Heckelman
Print Buyer: Vena Dyer
Typesetting: Integre Technical Publishing Co.
Cover Printing, Printing and Binding: R. R. Donnelley & Sons Co., Crawfordsville

All products used herein are used for identification purposes only and may be trademarks or registered trademarks of their respective owners.

COPYRIGHT © 2002 the Wadsworth Group. Duxbury is an imprint of the Wadsworth Group, a division of Thomson Learning Inc. Thomson Learning™ is a trademark used herein under license.

For more information about this or any other Duxbury products, contact:
DUXBURY
511 Forest Lodge Road
Pacific Grove, CA 93950 USA
www.duxbury.com
1-800-423-0563 (Thomson Learning Academic Resource Center)

All rights reserved. No part of this work may be reproduced, transcribed, or used in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, Web distribution, or information storage and/or retrieval systems) without the prior written permission of the publisher.

For permission to use material from this work, contact us by
www.thomsonrights.com
fax: 1-800-730-2215
phone: 1-800-730-2214

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data

Casella, George.
Statistical inference / George Casella, Roger L. Berger. 2nd ed.
p. cm.
Includes bibliographical references and indexes.
ISBN 0-534-24312-6
1. Mathematical statistics. 2. Probabilities. I. Berger, Roger L. II. Title.
QA276.C37 2001
519.5 dc21
2001025704

To Anne and Vicki

Duxbury titles of related interest

Daniel, Applied Nonparametric Statistics, 2nd
Derr, Statistical Consulting: A Guide to Effective Communication
Durrett, Probability: Theory and Examples, 2nd
Graybill, Theory and Application of the Linear Model
Johnson, Applied Multivariate Methods for Data Analysts
Kuehl, Design of Experiments: Statistical Principles of Research Design and Analysis, 2nd
Larsen, Marx, & Cooil, Statistics for Applied Problem Solving and Decision Making
Lohr, Sampling: Design and Analysis
Lunneborg, Data Analysis by Resampling: Concepts and Applications
Minh, Applied Probability Models
Minitab Inc., MINITAB™ Student Version 12 for Windows
Myers, Classical and Modern Regression with Applications, 2nd
Newton & Harvill, StatConcepts: A Visual Tour of Statistical Ideas
Ramsey & Schafer, The Statistical Sleuth, 2nd
SAS Institute Inc., JMP-IN: Statistical Discovery Software
Savage, INSIGHT: Business Analysis Software for Microsoft® Excel
Scheaffer, Mendenhall, & Ott, Elementary Survey Sampling, 5th
Shapiro, Modeling the Supply Chain
Winston, Simulation Modeling Using @RISK

To order copies contact your local bookstore or call 1-800-354-9706.
For more information contact Duxbury Press at 511 Forest Lodge Road, Pacific Grove, CA 93950, or go to: www.duxbury.com

Preface to the Second Edition

Although Sir Arthur Conan Doyle is responsible for most of the quotes in this book, perhaps the best description of the life of this book can be attributed to the Grateful Dead sentiment, "What a long, strange trip it's been."

Plans for the second edition started about six years ago, and for a long time we struggled with questions about what to add and what to delete. Thankfully, as time passed, the answers became clearer as the flow of the discipline of statistics became clearer. We see the trend moving away from elegant proofs of special cases to algorithmic solutions of more complex and practical cases. This does not undermine the importance of mathematics and rigor; indeed, we have found that these have become more important. But the manner in which they are applied is changing.

For those familiar with the first edition, we can summarize the changes succinctly as follows. Discussion of asymptotic methods has been greatly expanded into its own chapter. There is more emphasis on computing and simulation (see Section 5.5 and the computer algebra Appendix); coverage of the more applicable techniques has been expanded or added (for example, bootstrapping, the EM algorithm, p-values, logistic and robust regression); and there are many new Miscellanea and Exercises. We have de-emphasized the more specialized theoretical topics, such as equivariance and decision theory, and have restructured some material in Chapters 3-11 for clarity.

There are two things that we want to note. First, with respect to computer algebra programs, although we believe that they are becoming increasingly valuable tools, we did not want to force them on the instructor who does not share that belief.
Thus, the treatment is "unobtrusive" in that it appears only in an appendix, with some hints throughout the book where it may be useful. Second, we have changed the numbering system to one that facilitates finding things. Now theorems, lemmas, examples, and definitions are numbered together; for example, Definition 7.2.4 is followed by Example 7.2.5, and Theorem 10.1.3 precedes Example 10.1.4.

The first four chapters have received only minor changes. We reordered some material (in particular, the inequalities and identities have been split), added some new examples and exercises, and did some general updating. Chapter 5 has also been reordered, with the convergence section being moved further back, and a new section on generating random variables added. The previous coverage of invariance, which was in Chapters 7-9 of the first edition, has been greatly reduced and incorporated into Chapter 6, which otherwise has received only minor editing (mostly the addition of new exercises). Chapter 7 has been expanded and updated, and includes a new section on the EM algorithm. Chapter 8 has also received minor editing and updating, and has a new section on p-values. In Chapter 9 we now put more emphasis on pivoting (having realized that "guaranteeing an interval" was merely "pivoting the cdf"). Also, the material that was in Chapter 10 of the first edition (decision theory) has been reduced, and small sections on loss function optimality of point estimation, hypothesis testing, and interval estimation have been added to the appropriate chapters.

Chapter 10 is entirely new and attempts to lay out the fundamentals of large sample inference, including the delta method, consistency and asymptotic normality, bootstrapping, robust estimators, score tests, etc. Chapter 11 is classic oneway ANOVA and linear regression (which was covered in two different chapters in the first edition). Unfortunately, coverage of randomized block designs has been eliminated for space reasons.
Chapter 12 covers regression with errors in variables and contains new material on robust and logistic regression.

After teaching from the first edition for a number of years, we know (approximately) what can be covered in a one-year course. From the second edition, it should be possible to cover the following in one year:

Chapter 1: Sections 1-7
Chapter 2: Sections 1-3
Chapter 3: Sections 1-6
Chapter 4: Sections 1-7
Chapter 5: Sections 1-6
Chapter 6: Sections 1-3
Chapter 7: Sections 1-3
Chapter 8: Sections 1-3
Chapter 9: Sections 1-3
Chapter 10: Sections 1, 3, 4

Classes that begin the course with some probability background can cover more material from the later chapters.

Finally, it is almost impossible to thank all of the people who have contributed in some way to making the second edition a reality (and helped us correct the mistakes in the first edition). To all of our students, friends, and colleagues who took the time to send us a note or an e-mail, we thank you. A number of people made key suggestions that led to substantial changes in presentation. Sometimes these suggestions were just short notes or comments, and some were longer reviews. Some were so long ago that their authors may have forgotten, but we haven't. So thanks to Arthur Cohen, Sir David Cox, Steve Samuels, Rob Strawderman, and Tom Wehrly. We also owe much to Jay Beder, who has sent us numerous comments and suggestions over the years and possibly knows the first edition better than we do, and to Michael Perlman and his class, who are sending comments and corrections even as we write this.

This book has seen a number of editors. We thank Alex Kugashev, who in the mid-1990s first suggested doing a second edition, and our editor, Carolyn Crockett, who constantly encouraged us. Perhaps the one person (other than us) who is most responsible for this book is our first editor, John Kimmel, who encouraged, published, and marketed the first edition. Thanks, John.

George Casella
Roger L. Berger

Preface to the First Edition

When someone discovers that you are writing a textbook, one (or both) of two questions will be asked. The first is "Why are you writing a book?" and the second is "How is your book different from what's out there?" The first question is fairly easy to answer. You are writing a book because you are not entirely satisfied with the available texts. The second question is harder to answer. The answer can't be put in a few sentences so, in order not to bore your audience (who may be asking the question only out of politeness), you try to say something quick and witty. It usually doesn't work.

The purpose of this book is to build theoretical statistics (as different from mathematical statistics) from the first principles of probability theory. Logical development, proofs, ideas, themes, etc., evolve through statistical arguments. Thus, starting from the basics of probability, we develop the theory of statistical inference using techniques, definitions, and concepts that are statistical and are natural extensions and consequences of previous concepts. When this endeavor was started, we were not sure how well it would work. The final judgment of our success is, of course, left to the reader.

The book is intended for first-year graduate students majoring in statistics or in a field where a statistics concentration is desirable. The prerequisite is one year of calculus. (Some familiarity with matrix manipulations would be useful, but is not essential.) The book can be used for a two-semester, or three-quarter, introductory course in statistics.

The first four chapters cover basics of probability theory and introduce many fundamentals that are later necessary. Chapters 5 and 6 are the first statistical chapters. Chapter 5 is transitional (between probability and statistics) and can be the starting point for a course in statistical theory for students with some probability background.
Chapter 6 is somewhat unique, detailing three statistical principles (sufficiency, likelihood, and invariance) and showing how these principles are important in modeling data. Not all instructors will cover this chapter in detail, although we strongly recommend spending some time here. In particular, the likelihood and invariance principles are treated in detail. Along with the sufficiency principle, these principles, and the thinking behind them, are fundamental to total statistical understanding.

Chapters 7-9 represent the central core of statistical inference: estimation (point and interval) and hypothesis testing. A major feature of these chapters is the division into methods of finding appropriate statistical techniques and methods of evaluating these techniques. Finding and evaluating are of interest to both the theorist and the practitioner, but we feel that it is important to separate these endeavors. Different concerns are important, and different rules are invoked. Of further interest may be the sections of these chapters titled Other Considerations. Here, we indicate how the rules of statistical inference may be relaxed (as is done every day) and still produce meaningful inferences. Many of the techniques covered in these sections are ones that are used in consulting and are helpful in analyzing and inferring from actual problems.

The final three chapters can be thought of as special topics, although we feel that some familiarity with the material is important in anyone's statistical education. Chapter 10 is a thorough introduction to decision theory and contains the most modern material we could include. Chapter 11 deals with the analysis of variance (oneway and randomized block), building the theory of the complete analysis from the more simple theory of treatment contrasts.
Our experience has been that experimenters are most interested in inferences from contrasts, and using principles developed earlier, most tests and intervals can be derived from contrasts. Finally, Chapter 12 treats the theory of regression, dealing first with simple linear regression and then covering regression with "errors in variables." This latter topic is quite important, not only to show its own usefulness and inherent difficulties, but also to illustrate the limitations of inferences from ordinary regression.

As more concrete guidelines for basing a one-year course on this book, we offer the following suggestions. There can be two distinct types of courses taught from this book. One kind we might label "more mathematical," being a course appropriate for students majoring in statistics and having a solid mathematics background (at least 1½ years of calculus, some matrix algebra, and perhaps a real analysis course). For such students we recommend covering Chapters 1-9 in their entirety (which should take approximately 22 weeks) and spending the remaining time customizing the course with selected topics from Chapters 10-12. Once the first nine chapters are covered, the material in each of the last three chapters is self-contained and can be covered in any order. Another type of course is "more practical." Such a course may also be a first course for mathematically sophisticated students, but is aimed at students with one year of calculus who may not be majoring in statistics. It stresses the more practical uses of statistical theory, being more concerned with understanding basic statistical concepts and deriving reasonable statistical procedures for a variety of situations, and less concerned with formal optimality investigations.
Such a course will necessarily omit a certain amount of material, but the following list of sections can be covered in a one-year course:

Chapter   Sections
1         All
2         2.1, 2.2, 2.3
3         3.1, 3.2
4         4.1, 4.2, 4.3, 4.5
5         5.1, 5.2, 5.3.1, 5.4
6         6.1.1, 6.2.1
7         7.1, 7.2.1, 7.2.2, 7.2.3, 7.3.1, 7.3.3, 7.4
8         8.1, 8.2.1, 8.2.3, 8.2.4, 8.3.1, 8.3.2, 8.4
9         9.1, 9.2.1, 9.2.2, 9.2.4, 9.3.1, 9.4
11        11.1, 11.2
12        12.1, 12.2

If time permits, there can be some discussion (with little emphasis on details) of the material in Sections 4.4, 5.5, and 6.1.2, 6.1.3, 6.1.4. The material in Sections 11.3 and 12.3 may also be considered.

The exercises have been gathered from many sources and are quite plentiful. We feel that, perhaps, the only way to master this material is through practice, and thus we have included much opportunity to do so. The exercises are as varied as we could make them, and many of them illustrate points that are either new or complementary to the material in the text. Some exercises are even taken from research papers. (It makes you feel old when you can include exercises based on papers that were new research during your own student days!) Although the exercises are not subdivided like the chapters, their ordering roughly follows that of the chapter. (Subdivisions often give too many hints.) Furthermore, the exercises become (again, roughly) more challenging as their numbers become higher.

As this is an introductory book with a relatively broad scope, the topics are not covered in great depth. However, we felt some obligation to guide the reader one step further in the topics that may be of interest. Thus, we have included many references, pointing to the path to deeper understanding of any particular topic. (The Encyclopedia of Statistical Sciences, edited by Kotz, Johnson, and Read, provides a fine introduction to many topics.)

To write this book, we have drawn on both our past teachings and current work.
We have also drawn on many people, to whom we are extremely grateful. We thank our colleagues at Cornell, North Carolina State, and Purdue, in particular, Jim Berger, Larry Brown, Sir David Cox, Ziding Feng, Janet Johnson, Leon Gleser, Costas Goutis, Dave Lansky, George McCabe, Chuck McCulloch, Myra Samuels, Steve Schwager, and Shayle Searle, who have given their time and expertise in reading parts of this manuscript, offered assistance, and taken part in many conversations leading to constructive suggestions. We also thank Shanti Gupta for his hospitality, and the library at Purdue, which was essential. We are grateful for the detailed reading and helpful suggestions of Shayle Searle and of our reviewers, both anonymous and nonanonymous (Jim Albert, Dan Coster, and Tom Wehrly). We also thank David Moore and George McCabe for allowing us to use their tables, and Steve Hirdt for supplying us with data. Since this book was written by two people who, for most of the time, were at least 600 miles apart, we lastly thank Bitnet for making this entire thing possible.

George Casella
Roger L. Berger

"We have got to the deductions and the inferences," said Lestrade, winking at me. "I find it hard enough to tackle facts, Holmes, without flying away after theories and fancies."

Inspector Lestrade to Sherlock Holmes
The Boscombe Valley Mystery

Contents

1 Probability Theory
  1.1 Set Theory
  1.2 Basics of Probability Theory
    1.2.1 Axiomatic Foundations
    1.2.2 The Calculus of Probabilities
    1.2.3 Counting
    1.2.4 Enumerating Outcomes
  1.3 Conditional Probability and Independence
  1.4 Random Variables
  1.5 Distribution Functions
  1.6 Density and Mass Functions
  1.7 Exercises
  1.8 Miscellanea

2 Transformations and Expectations
  2.1 Distributions of Functions of a Random Variable
  2.2 Expected Values
  2.3 Moments and Moment Generating Functions
  2.4 Differentiating Under an Integral Sign
  2.5 Exercises
  2.6 Miscellanea

3 Common Families of Distributions
  3.1 Introduction
  3.2 Discrete Distributions
  3.3 Continuous Distributions
  3.4 Exponential Families
  3.5 Location and Scale Families
  3.6 Inequalities and Identities
    3.6.1 Probability Inequalities
    3.6.2 Identities
  3.7 Exercises
  3.8 Miscellanea

4 Multiple Random Variables
  4.1 Joint and Marginal Distributions
  4.2 Conditional Distributions and Independence
  4.3 Bivariate Transformations
  4.4 Hierarchical Models and Mixture Distributions
  4.5 Covariance and Correlation
  4.6 Multivariate Distributions
  4.7 Inequalities
    4.7.1 Numerical Inequalities
    4.7.2 Functional Inequalities
  4.8 Exercises
  4.9 Miscellanea

5 Properties of a Random Sample
  5.1 Basic Concepts of Random Samples
  5.2 Sums of Random Variables from a Random Sample
  5.3 Sampling from the Normal Distribution
    5.3.1 Properties of the Sample Mean and Variance
    5.3.2 The Derived Distributions: Student's t and Snedecor's F
  5.4 Order Statistics
  5.5 Convergence Concepts
    5.5.1 Convergence in Probability
    5.5.2 Almost Sure Convergence
    5.5.3 Convergence in Distribution
    5.5.4 The Delta Method
  5.6 Generating a Random Sample
    5.6.1 Direct Methods
    5.6.2 Indirect Methods
    5.6.3 The Accept/Reject Algorithm
  5.7 Exercises
  5.8 Miscellanea

6 Principles of Data Reduction
  6.1 Introduction
  6.2 The Sufficiency Principle
    6.2.1 Sufficient Statistics
    6.2.2 Minimal Sufficient Statistics
    6.2.3 Ancillary Statistics
    6.2.4 Sufficient, Ancillary, and Complete Statistics
  6.3 The Likelihood Principle
    6.3.1 The Likelihood Function
    6.3.2 The Formal Likelihood Principle
  6.4 The Equivariance Principle
  6.5 Exercises
  6.6 Miscellanea

7 Point Estimation
  7.1 Introduction
  7.2 Methods of Finding Estimators
    7.2.1 Method of Moments
    7.2.2 Maximum Likelihood Estimators
    7.2.3 Bayes Estimators
    7.2.4 The EM Algorithm
  7.3 Methods of Evaluating Estimators
    7.3.1 Mean Squared Error
    7.3.2 Best Unbiased Estimators
    7.3.3 Sufficiency and Unbiasedness
    7.3.4 Loss Function Optimality
  7.4 Exercises
  7.5 Miscellanea

8 Hypothesis Testing
  8.1 Introduction
  8.2 Methods of Finding Tests
    8.2.1 Likelihood Ratio Tests
    8.2.2 Bayesian Tests
    8.2.3 Union-Intersection and Intersection-Union Tests
  8.3 Methods of Evaluating Tests
    8.3.1 Error Probabilities and the Power Function
    8.3.2 Most Powerful Tests
    8.3.3 Sizes of Union-Intersection and Intersection-Union Tests
    8.3.4 p-Values
    8.3.5 Loss Function Optimality
  8.4 Exercises
  8.5 Miscellanea

9 Interval Estimation
  9.1 Introduction
  9.2 Methods of Finding Interval Estimators
    9.2.1 Inverting a Test Statistic
    9.2.2 Pivotal Quantities
    9.2.3 Pivoting the CDF
    9.2.4 Bayesian Intervals
  9.3 Methods of Evaluating Interval Estimators
    9.3.1 Size and Coverage Probability
    9.3.2 Test-Related Optimality
    9.3.3 Bayesian Optimality
    9.3.4 Loss Function Optimality
  9.4 Exercises
  9.5 Miscellanea

10 Asymptotic Evaluations
  10.1 Point Estimation
    10.1.1 Consistency
    10.1.2 Efficiency
    10.1.3 Calculations and Comparisons
    10.1.4 Bootstrap Standard Errors
  10.2 Robustness
    10.2.1 The Mean and the Median
    10.2.2 M-Estimators
  10.3 Hypothesis Testing
    10.3.1 Asymptotic Distribution of LRTs
    10.3.2 Other Large-Sample Tests
  10.4 Interval Estimation
    10.4.1 Approximate Maximum Likelihood Intervals
    10.4.2 Other Large-Sample Intervals
  10.5 Exercises
  10.6 Miscellanea

11 Analysis of Variance and Regression
  11.1 Introduction
  11.2 Oneway Analysis of Variance
    11.2.1 Model and Distribution Assumptions
    11.2.2 The Classic ANOVA Hypothesis
    11.2.3 Inferences Regarding Linear Combinations of Means
    11.2.4 The ANOVA F Test
    11.2.5 Simultaneous Estimation of Contrasts
    11.2.6 Partitioning Sums of Squares
  11.3 Simple Linear Regression
    11.3.1 Least Squares: A Mathematical Solution
    11.3.2 Best Linear Unbiased Estimators: A Statistical Solution
    11.3.3 Models and Distribution Assumptions
    11.3.4 Estimation and Testing with Normal Errors
    11.3.5 Estimation and Prediction at a Specified x = x₀
    11.3.6 Simultaneous Estimation and Confidence Bands
  11.4 Exercises
  11.5 Miscellanea

12 Regression Models
  12.1 Introduction
  12.2 Regression with Errors in Variables
    12.2.1 Functional and Structural Relationships
    12.2.2 A Least Squares Solution
    12.2.3 Maximum Likelihood Estimation
    12.2.4 Confidence Sets
  12.3 Logistic Regression
    12.3.1 The Model
    12.3.2 Estimation
  12.4 Robust Regression
  12.5 Exercises
  12.6 Miscellanea

Appendix: Computer Algebra
Table of Common Distributions
References
Author Index
Subject Index

List of Tables
11.3.1 11.3.2 12.3.1 12.4.1 Number of arrangements Values of the joint pmf f(z, y) Three estimators for a binomial p Counts of leukemia cases ‘Two types of errors in hypothesis testing Location-scale pivots Sterne’s acceptance region and confidence set ‘Three 90% normal confidence intervals Bootstrap and Delta Method variances Median/mean asymptotic relative efficiencies Huber estimators Huber estimator asymptotic relative efficiencies, k = 1.5 Poisson LRT statistic Power of robust tests Confidence coefficient for a pivotal interval Confidence coefficients for intervals based on Huber’s M-estimator ANOVA table for oneway el Data pictured in Figure 11.3.1 ANOVA table for simple linear regression ‘ification. Challenger date. Potoroo data Regression M-estimator asymptotic relative efficiencies 16 141 354 360 383 427 431 44. 480 484 485 487 490 497 500 504 538 542 556 594 598 601 List of Figures Dart board for Example 1.2.7 Histogram of averages Caf of Example 1.5.2 Geometric edf, p = .3 Area under logistic curve ‘Transformation of Example 2.1.2 Increasing and nondecreasing cdfs Exponential densities ‘Two pdfs with the same moments Poisson approximation to the binomial Standard normal density Normal approximation to the binomial Beta densities Symmetric beta densities Standard normal density and Cauchy density Lognormal and gamma pdfs Location densities Exponential location densities ‘Members of the same scale family Location-scale families Regions for Example 4.1.12 Regions for Examples 4.5.4 and 4.5.8 Regions for Example 4.5.9 Convex function Jensen's Inequality Region on which fa,v(r,v) > 0 for Example 5.4.7 Histogram of exponential pdf Histogram of Poisson sample variances Beta distribution in accept/reject sampling 19 30 32 36 49 54 60 65 68 105 106 107 108 109 110 7 118 119 120 147 170 175 189 190 232 248 251 252 LIST OF FIGURES Binomial MSE comparison Risk functions for variance estimators LRT statistic Power functions for Example 8.3.2 Power functions for 
Example 8.3.3 Power functions for three tests in Example 8.3.19 Risk function for test in Example 8.3.31 Confidence interval-acceptance region relationship Acceptance region and confidence interval for Example 9.2.3 Credible and confidence intervals from Example 9.2.16 Credible probabilities of the intervals from Example 9.2.16 Coverage probabilities of the intervals from Example 9.2.16. ‘Three interval estimators from Example 9.2.16 Asymptotic relative efficiency for gamma mean estimators Poisson LRT histogram LRT intervals for a binomial proportion Coverage probabilities for nominal .9 binomial confidence procedures Vertical distances that are measured by RSS Geometric description of the BLUE Scheffé bands, t interval, and Bonferroni intervals Distance minimized by orthogonal least squares ‘Three regression lines Creasy-Williams F statistic Challenger data logistic curve Least squares, LAD, and M-estimate fits 333 351 377 384 384 394 401 42 423 437 438 439 449 478 490 502 503 542 547 562 581 583 590 595, 599 List of Examples Event operations Sigma algebra-I Sigma algebra-II Defining probabilities-I Defining probabilities-II Bonferroni's Inequality Lottery-I Tournament Lottery-II Poker Sampling with replacement Calculating an average Four aces Continuation of Example 1.3.1 ‘Three prisoners Coding Chevalier de Meré Tossing two dice Letters Three coin tosses—I Random variables ‘Three coin tosses-II Distribution of a random variable Tossing three coins Tossing for a head Continuous cdf Caf with jumps Identically distributed random variables Geometric probabilities Logistic probabilities Binomial transformation Uniform transformation Uniform-exponential relationship-I Inverted gamma pdf Noo 13 13 4 16 17 18 20 20 2 23 24 25 26 27 28 28 29 30 31 32 33 33 34 36 48 49 51 51 247 219 2.2.2 2.2.3 2.24 226 2.2.7 2.3.3 2.3.5 2.3.8 2.3.13 24.5 2.4.6 2.4.7 24.9 3.2.1 3.2.3 3.2.4 3.2.5 3.2.6 3.2.7 3.3.1 3.3.2 3.4.1 3.4.3 3.44 3.4.6 3.4.8 3.49 3.5.3 3.6.2 3.6.3 3.6.6 3.6.9 4.1.2 
414 415 4a7 418 419 41.11 LIST OF EXAMPLES ‘Square transformation Normal-chi squared relationship Exponential mean Binomial mean Cauchy mean Minimizing distance Uniform-exponential relationship-I Exponential variance Binomial variance Gamma mgf Binomial mgf Nonunique moments Poisson approximation Interchanging integration and differentiation-I Interchanging integration and differentiation-II Interchanging summation and differentiation Continuation of Example 2.4.7 Acceptance sampling Dice probabilities Waiting time Poisson approximation Inverse binomial sampling Failure times Gamma-Poisson relationship Normal approximation Binomial exponentia) family Binomial mean and variance Normal exponential family Continuation of Example 3.4.4 ‘A curved exponential family Normal approximations Exponential location family Mlustrating Chebychev A normal probability inequality Higher-order normal moments Higher-order Poisson moments Sample space for dice Continuation of Example 4.1.2 Joint pmf for dice Marginal pmé for dice Dice probabilities Same marginals, different joint pmf Calculating joint probabilities 52 53 55 56 56 58 BB 59 61 63 64 64 66 1 72 73 4 88 91 93 94 96 98 100 105 111 112 113 44 115 115 118 122 123 125 127 140 141 142 143 144 144 45 41.12 4.2.2 4.2.4 4.2.6 4.2.8 4.2.9 4.2.11 4.2.13 43.1 4.3.3 43.4 43.6 441 44.2 44.5 44.6 44.8 454 45.8 45.9 46.1 4.6.3 46.8 46.13 47A 47.8 5.1.2 5.1.3 5.2.8 5.2.10 5.2.12 5.3.5 5.3.7 5.4.5 5.4.7 5.5.3 5.5.5 5.5.7 5.5.8 5.5.11 5.5.16 5.5.18 5.5.19 5.5.22 LIST OF EXAMPLES Calculating joint probabilities-TT Calculating conditional probabilities Calculating conditional pdfs Checking independence-I Checking independence-IT Joint probability model Expectations of independent variables ‘Megf of a sum of normal variables Distribution of the sum of Poisson variables Distribution of the product of beta varizbles Sum and difference of normal variables Distribution of the ratio of normal variables Binomial-Poisson hierarchy Continuation of 
Example 4.4.1 Generalization of Example 4.4.1 Beta-binomial hierarchy Continuation of Example 4.4.6 Correlation-I Correlation-II Correlation-IIT Multivariate pdfs Multivariate pmf Mef of a sum of gamma variables Multivariate change of variables Covariance inequality ‘An inequality for means Sample pdf-exponential Finite population model Distribution of the mean Sum of Cauchy random variables Sum of Bemoulli random variables Variance ratio distribution Continuation of Example 5.3.5 Uniform order statistic pdf Distribution of the midrange and range Consistency of S? Consistency of Almost sure convergence Convergence in probability, not almost surely Maximum of uniforms Normal approximation to the negative binomial Normal approximation with estimated variance Estimating the odds Continuation of Example 5.5.19 wea 146 148 150 152 153 154 155 156 157 158 159 162 163 163 165 167 168 170 173 174 178 181 183 185 188 191 208 210 215 216 217 224 225 230 231 233 233 234 234 235 239 240 240 242 rodv, 5.5.23 5.5.25 5.5.27 5.6.1 5.6.2 5.6.3 5.6.4 5.6.5 5.6.6 5.6.7 5.6.9 LIST OF EXAMPLES Approximate mean and variance Continuation of Example 5.5.23 Moments of a ratio estimator Exponential lifetime Continuation of Example 5.6.1 Probability Integral Transform Box-Muller algorithm Binomial random variable generation Distribution of the Poisson variance Beta random variable generation—I Beta random variable generation-II Binomial sufficient statistic Normal sufficient statistic Sufficient order statistics Continuation of Example 6.2.4 Uniform sufficient statistic Normal sufficient statistic, both parameters unknown ‘Two normal sufficient statistics Normal minimal sufficient statistic Uniform minimal sufficient statistic Uniform ancillary statistic Location family ancillary statistic Scale family ancillary statistic Ancillary precision Binomial complete sufficient statistic Uniform complete sufficient statistic Using Basu's Theorem-I Using Basu’s Theorem-II Negative binomial likelihood 
Normal fiducial distribution
Evidence function
Binomial/negative binomial experiment
Continuation of Example 6.3.5
Binomial equivariance
Continuation of Example 6.4.1
Conclusion of Example 6.4.1
Normal location invariance
Normal method of moments
Binomial method of moments
Satterthwaite approximation
Normal likelihood
Continuation of Example 7.2.5
Bernoulli MLE
Restricted range MLE
Binomial MLE, unknown number of trials
Normal MLEs, mu and sigma unknown
Continuation of Example 7.2.11
Continuation of Example 7.2.2
Binomial Bayes estimation
Normal Bayes estimators
Multiple Poisson rates
Continuation of Example 7.2.17
Conclusion of Example 7.2.17
Normal MSE
Continuation of Example 7.3.3
MSE of binomial Bayes estimator
MSE of equivariant estimators
Poisson unbiased estimation
Conclusion of Example 7.3.8
Unbiased estimator for the scale uniform
Normal variance bound
Continuation of Example 7.3.14
Conditioning on an insufficient statistic
Unbiased estimators of zero
Continuation of Example 7.3.13
Binomial best unbiased estimation
Binomial risk functions
Risk of normal variance
Variance estimation using Stein's loss
Two Bayes rules
Normal Bayes estimates
Binomial Bayes estimates
Normal LRT
Exponential LRT
LRT and sufficiency
Normal LRT with unknown variance
Normal Bayesian test
Normal union-intersection test
Acceptance sampling
Binomial power function
Normal power function
Continuation of Example 8.3.3
Size of LRT
Size of union-intersection test
Conclusion of Example 8.3.3
UMP binomial test
UMP normal test
Continuation of Example 8.3.15
Nonexistence of UMP test
Unbiased test
An equivalence
Intersection-union test
Two-sided normal p-value
One-sided normal p-value
Fisher's Exact Test
Risk of UMP test
Interval estimator
Continuation of Example 9.1.2
Scale uniform interval estimator
Inverting a normal test
Inverting an LRT
Normal one-sided confidence bound
Binomial one-sided confidence bound
Location-scale pivots
Gamma pivot
Continuation of Example 9.2.8
Normal pivotal interval
Shortest length binomial set
Location exponential interval
Poisson interval estimator
Poisson credible set
Poisson credible and coverage probabilities
Coverage of a normal credible set
Optimizing length
Optimizing expected length
Shortest pivotal interval
UMA confidence bound
Continuation of Example 9.3.6
Poisson HPD region
Normal HPD region
Normal interval estimator
Consistency of X̄
Continuation of Example 10.1.2
Limiting variances
Large-sample mixture variances
Asymptotic normality and consistency
Approximate binomial variance
Continuation of Example 10.1.14
AREs of Poisson estimators
Estimating a gamma mean
Bootstrapping a variance
Bootstrapping a binomial variance
Conclusion of Example 10.1.20
Parametric bootstrap
Robustness of the sample mean
Asymptotic normality of the median
AREs of the median to the mean
Huber estimator
Limit distribution of the Huber estimator
ARE of the Huber estimator
Poisson LRT
Multinomial LRT
Large-sample binomial tests
Binomial score test
Tests based on the Huber estimator
Continuation of Example 10.1.14
Binomial score interval
Binomial LRT interval
Approximate interval
Approximate Poisson interval
More on the binomial score interval
Comparison of binomial intervals
Intervals based on the Huber estimator
Negative binomial interval
Influence functions of the mean and median
Oneway ANOVA
The ANOVA hypothesis
ANOVA contrasts
Pairwise differences
Continuation of Example 11.2.1
Predicting grape crops
Continuation of Example 11.3.1
Estimating atmospheric pressure
Challenger data
Challenger data continued
Robustness of least squares estimates
Catastrophic observations
Asymptotic normality of the LAD estimator
Regression M-estimator
Simulation of regression AREs
Unordered sampling
Univariate transformation
Bivariate transformations
Normal probability
Density of a sum
Fourth moment of sum of uniforms
ARE for a gamma mean
Limit of chi squared mgfs

Chapter 1
Probability Theory

"You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician."
Sherlock Holmes, The Sign of Four

The subject of probability theory is the foundation upon which all of statistics is built, providing a means for modeling populations, experiments, or almost anything else that could be considered a random phenomenon.
Through these models, statisticians are able to draw inferences about populations, inferences based on examination of only a part of the whole. The theory of probability has a long and rich history, dating back at least to the seventeenth century when, at the request of their friend, the Chevalier de Méré, Pascal and Fermat developed a mathematical formulation of gambling odds.

The aim of this chapter is not to give a thorough introduction to probability theory; such an attempt would be foolhardy in so short a space. Rather, we attempt to outline some of the basic ideas of probability theory that are fundamental to the study of statistics. Just as statistics builds upon the foundation of probability theory, probability theory in turn builds upon set theory, which is where we begin.

1.1 Set Theory

One of the main objectives of a statistician is to draw conclusions about a population of objects by conducting an experiment. The first step in this endeavor is to identify the possible outcomes or, in statistical terminology, the sample space.

Definition 1.1.1 The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment.

If the experiment consists of tossing a coin, the sample space contains two outcomes, heads and tails; thus, S = {H, T}. If, on the other hand, the experiment consists of observing the reported SAT scores of randomly selected students at a certain university, the sample space would be the set of all scores between 200 and 800 that are multiples of ten; that is, S = {200, 210, ..., 780, 790, 800}. Finally, consider an experiment where the observation is reaction time to a certain stimulus. Here, the sample space would consist of all positive numbers, that is, S = (0, ∞).

We can classify sample spaces into two types according to the number of elements they contain. Sample spaces can be either countable or uncountable; if the elements of a sample space can be put into 1-1 correspondence with a subset of the integers, the sample space is countable.
Of course, if the sample space contains only a finite number of elements, it is countable. Thus, the coin-toss and SAT score sample spaces are both countable (in fact, finite), whereas the reaction time sample space is uncountable, since the positive real numbers cannot be put into 1-1 correspondence with the integers. If, however, we measured reaction time to the nearest second, then the sample space would be (in seconds) S = {0, 1, 2, 3, ...}, which is then countable. This distinction between countable and uncountable sample spaces is important only in that it dictates the way in which probabilities can be assigned. For the most part, this causes no problems, although the mathematical treatment of the situations is different. On a philosophical level, it might be argued that there can only be countable sample spaces, since measurements cannot be made with infinite accuracy. (A sample space consisting of, say, all ten-digit numbers is a countable sample space.) While in practice this is true, probabilistic and statistical methods associated with uncountable sample spaces are, in general, less cumbersome than those for countable sample spaces, and provide a close approximation to the true (countable) situation.

Once the sample space has been defined, we are in a position to consider collections of possible outcomes of an experiment.

Definition 1.1.2 An event is any collection of possible outcomes of an experiment, that is, any subset of S (including S itself).

Let A be an event, a subset of S. We say the event A occurs if the outcome of the experiment is in the set A. When speaking of probabilities, we generally speak of the probability of an event, rather than a set. But we may use the terms interchangeably.

We first need to define formally the following two relationships, which allow us to order and equate sets:

A ⊂ B ⇔ x ∈ A ⇒ x ∈ B;  (containment)
A = B ⇔ A ⊂ B and B ⊂ A.
(equality)

Given any two events (or sets) A and B, we have the following elementary set operations:

Union: The union of A and B, written A ∪ B, is the set of elements that belong to either A or B or both:
A ∪ B = {x : x ∈ A or x ∈ B}.

Intersection: The intersection of A and B, written A ∩ B, is the set of elements that belong to both A and B:
A ∩ B = {x : x ∈ A and x ∈ B}.

Complementation: The complement of A, written A^c, is the set of all elements that are not in A:
A^c = {x : x ∉ A}.

Example 1.1.3 (Event operations) Consider the experiment of selecting a card at random from a standard deck and noting its suit: clubs (C), diamonds (D), hearts (H), or spades (S). The sample space is
S = {C, D, H, S},
and some possible events are
A = {C, D} and B = {D, H, S}.
From these events we can form
A ∪ B = {C, D, H, S},  A ∩ B = {D},  and  A^c = {H, S}.
Furthermore, notice that A ∪ B = S (the event S) and (A ∪ B)^c = ∅, where ∅ denotes the empty set (the set consisting of no elements).

The elementary set operations can be combined, somewhat akin to the way addition and multiplication can be combined. As long as we are careful, we can treat sets as if they were numbers. We can now state the following useful properties of set operations.

Theorem 1.1.4 For any three events, A, B, and C, defined on a sample space S,
a. Commutativity: A ∪ B = B ∪ A, A ∩ B = B ∩ A;
b. Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C, A ∩ (B ∩ C) = (A ∩ B) ∩ C;
c. Distributive Laws: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
d. DeMorgan's Laws: (A ∪ B)^c = A^c ∩ B^c, (A ∩ B)^c = A^c ∪ B^c.

Proof: The proof of much of this theorem is left as Exercise 1.3. Also, Exercises 1.9 and 1.10 generalize the theorem. To illustrate the technique, however, we will prove the Distributive Law:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).
(You might be familiar with the use of Venn diagrams to "prove" theorems in set theory. We caution that although Venn diagrams are sometimes helpful in visualizing a situation, they do not constitute a formal proof.) To prove that two sets are equal, it must be demonstrated that each set contains the other.
Formally, then,
A ∩ (B ∪ C) = {x ∈ S : x ∈ A and x ∈ (B ∪ C)};
(A ∩ B) ∪ (A ∩ C) = {x ∈ S : x ∈ (A ∩ B) or x ∈ (A ∩ C)}.

We first show that A ∩ (B ∪ C) ⊂ (A ∩ B) ∪ (A ∩ C). Let x ∈ (A ∩ (B ∪ C)). By the definition of intersection, it must be that x ∈ (B ∪ C), that is, either x ∈ B or x ∈ C. Since x also must be in A, we have that either x ∈ (A ∩ B) or x ∈ (A ∩ C); therefore, x ∈ ((A ∩ B) ∪ (A ∩ C)), and the containment is established.

Now assume x ∈ ((A ∩ B) ∪ (A ∩ C)). This implies that x ∈ (A ∩ B) or x ∈ (A ∩ C). If x ∈ (A ∩ B), then x is in both A and B. Since x ∈ B, x ∈ (B ∪ C) and thus x ∈ (A ∩ (B ∪ C)). If, on the other hand, x ∈ (A ∩ C), the argument is similar, and we again conclude that x ∈ (A ∩ (B ∪ C)). Thus, we have established (A ∩ B) ∪ (A ∩ C) ⊂ A ∩ (B ∪ C), showing containment in the other direction and, hence, proving the Distributive Law.

The operations of union and intersection can be extended to infinite collections of sets as well. If A_1, A_2, A_3, ... is a collection of sets, all defined on a sample space S, then
∪_{i=1}^∞ A_i = {x ∈ S : x ∈ A_i for some i},
∩_{i=1}^∞ A_i = {x ∈ S : x ∈ A_i for all i}.

For example, let S = (0, 1] and define A_i = [(1/i), 1]. Then
∪_{i=1}^∞ A_i = ∪_{i=1}^∞ [(1/i), 1] = {x ∈ (0, 1] : x ∈ [(1/i), 1] for some i} = {x ∈ (0, 1] : x > 0} = (0, 1];
∩_{i=1}^∞ A_i = ∩_{i=1}^∞ [(1/i), 1] = {x ∈ (0, 1] : x ∈ [(1/i), 1] for all i} = {x ∈ (0, 1] : x ∈ [1, 1]} = {1}. (the point 1)

It is also possible to define unions and intersections over uncountable collections of sets. If Γ is an index set (a set of elements to be used as indices), then
∪_{a ∈ Γ} A_a = {x ∈ S : x ∈ A_a for some a},
∩_{a ∈ Γ} A_a = {x ∈ S : x ∈ A_a for all a}.
If, for example, we take Γ = {all positive real numbers} and A_a = (0, a), then ∪_{a ∈ Γ} A_a = (0, ∞) is an uncountable union. While uncountable unions and intersections do not play a major role in statistics, they sometimes provide a useful mechanism for obtaining an answer (see Section 8.2.3).

Finally, we discuss the idea of a partition of the sample space.

Definition 1.1.5 Two events A and B are disjoint (or mutually exclusive) if A ∩ B = ∅.
The events A_1, A_2, ... are pairwise disjoint (or mutually exclusive) if A_i ∩ A_j = ∅ for all i ≠ j.

Disjoint sets are sets with no points in common. If we draw a Venn diagram for two disjoint sets, the sets do not overlap. The collection
A_i = [i, i + 1),  i = 0, 1, 2, ...,
consists of pairwise disjoint sets. Note further that ∪_{i=0}^∞ A_i = [0, ∞).

Definition 1.1.6 If A_1, A_2, ... are pairwise disjoint and ∪_{i=1}^∞ A_i = S, then the collection A_1, A_2, ... forms a partition of S.

The sets A_i = [i, i + 1) form a partition of [0, ∞). In general, partitions are very useful, allowing us to divide the sample space into small, nonoverlapping pieces.

1.2 Basics of Probability Theory

When an experiment is performed, the realization of the experiment is an outcome in the sample space. If the experiment is performed a number of times, different outcomes may occur each time or some outcomes may repeat. This "frequency of occurrence" of an outcome can be thought of as a probability. More probable outcomes occur more frequently. If the outcomes of an experiment can be described probabilistically, we are on our way to analyzing the experiment statistically.

In this section we describe some of the basics of probability theory. We do not define probabilities in terms of frequencies but instead take the mathematically simpler axiomatic approach. As will be seen, the axiomatic approach is not concerned with the interpretations of probabilities, but is concerned only that the probabilities are defined by a function satisfying the axioms. Interpretations of the probabilities are quite another matter. The "frequency of occurrence" of an event is one example of a particular interpretation of probability. Another possible interpretation is a subjective one, where rather than thinking of probability as frequency, we can think of it as a belief in the chance of an event occurring.
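The "frequency of occurrence" interpretation is easy to visualize by simulation. Below is a minimal Python sketch (the toss counts and random seed are arbitrary choices for the demonstration, not part of the text): it tosses a fair coin repeatedly and reports the relative frequency of heads, which settles near the probability 1/2 as the number of tosses grows.

```python
import random

random.seed(1)  # fixed seed so the demonstration is reproducible

def relative_frequency_of_heads(n_tosses):
    """Toss a fair coin n_tosses times; return the fraction of heads."""
    heads = sum(random.choice("HT") == "H" for _ in range(n_tosses))
    return heads / n_tosses

# Relative frequency stabilizes as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))

# With 100,000 tosses the relative frequency is very close to .5.
assert abs(relative_frequency_of_heads(100_000) - 0.5) < 0.01
```

No simulation proves a probability statement, of course; it only illustrates the interpretation that the axiomatic development below makes precise.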
1.2.1 Axiomatic Foundations

For each event A in the sample space S we want to associate with A a number between zero and one that will be called the probability of A, denoted by P(A). It would seem natural to define the domain of P (the set where the arguments of the function P(·) are defined) as all subsets of S; that is, for each A ⊂ S we define P(A) as the probability that A occurs. Unfortunately, matters are not that simple. There are some technical difficulties to overcome. We will not dwell on these technicalities; although they are of importance, they are usually of more interest to probabilists than to statisticians. However, a firm understanding of statistics requires at least a passing familiarity with the following.

Definition 1.2.1 A collection of subsets of S is called a sigma algebra (or Borel field), denoted by B, if it satisfies the following three properties:
a. ∅ ∈ B (the empty set is an element of B).
b. If A ∈ B, then A^c ∈ B (B is closed under complementation).
c. If A_1, A_2, ... ∈ B, then ∪_{i=1}^∞ A_i ∈ B (B is closed under countable unions).

The empty set ∅ is a subset of any set. Thus, ∅ ⊂ S. Property (a) states that this subset is always in a sigma algebra. Since S = ∅^c, properties (a) and (b) imply that S is always in B also. In addition, from DeMorgan's Laws it follows that B is closed under countable intersections. If A_1, A_2, ... ∈ B, then A_1^c, A_2^c, ... ∈ B by property (b), and therefore ∪_{i=1}^∞ A_i^c ∈ B. However, using DeMorgan's Law (as in Exercise 1.9), we have
(1.2.1) (∪_{i=1}^∞ A_i^c)^c = ∩_{i=1}^∞ A_i.
Thus, again by property (b), ∩_{i=1}^∞ A_i ∈ B.

Associated with sample space S we can have many different sigma algebras. For example, the collection of the two sets {∅, S} is a sigma algebra, usually called the trivial sigma algebra. The only sigma algebra we will be concerned with is the smallest one that contains all of the open sets in a given sample space S.
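For a finite sample space, the collection of all subsets is a sigma algebra, and the three properties of Definition 1.2.1 can be verified exhaustively. A small Python sketch (S = {1, 2, 3} is an illustrative choice; on a finite collection, countable unions reduce to finite ones):

```python
from itertools import chain, combinations

S = frozenset({1, 2, 3})

def power_set(s):
    """All subsets of s, as a set of frozensets."""
    elems = list(s)
    return {frozenset(c) for c in
            chain.from_iterable(combinations(elems, r)
                                for r in range(len(elems) + 1))}

B = power_set(S)

assert len(B) == 2 ** len(S)                  # 2^n subsets in all
assert frozenset() in B                       # (a) contains the empty set
assert all(S - A in B for A in B)             # (b) closed under complementation
assert all(A | C in B for A in B for C in B)  # (c) closed under unions
```

The same exhaustive check is hopeless for an uncountable S, which is exactly why the Borel sigma algebra of the next examples is described indirectly, through the sets it must contain.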
Example 1.2.2 (Sigma algebra-I) If S is finite or countable, then these technicalities really do not arise, for we define for a given sample space S,
B = {all subsets of S, including S itself}.
If S has n elements, there are 2^n sets in B (see Exercise 1.14). For example, if S = {1, 2, 3}, then B is the following collection of 2^3 = 8 sets:
{1}  {1, 2}  {1, 2, 3}
{2}  {1, 3}  ∅
{3}  {2, 3}
In general, if S is uncountable, it is not an easy task to describe B. However, B is chosen to contain any set of interest.

Example 1.2.3 (Sigma algebra-II) Let S = (−∞, ∞), the real line. Then B is chosen to contain all sets of the form
[a, b], (a, b], (a, b), and [a, b)
for all real numbers a and b. Also, from the properties of B, it follows that B contains all sets that can be formed by taking (possibly countably infinite) unions and intersections of sets of the above varieties.

We are now in a position to define a probability function.

Definition 1.2.4 Given a sample space S and an associated sigma algebra B, a probability function is a function P with domain B that satisfies
1. P(A) ≥ 0 for all A ∈ B.
2. P(S) = 1.
3. If A_1, A_2, ... ∈ B are pairwise disjoint, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

The three properties given in Definition 1.2.4 are usually referred to as the Axioms of Probability (or the Kolmogorov Axioms, after A. Kolmogorov, one of the fathers of probability theory). Any function P that satisfies the Axioms of Probability is called a probability function. The axiomatic definition makes no attempt to tell what particular function P to choose; it merely requires P to satisfy the axioms. For any sample space many different probability functions can be defined. Which one(s) reflects what is likely to be observed in a particular experiment is still to be discussed.

Example 1.2.5 (Defining probabilities-I) Consider the simple experiment of tossing a fair coin, so S = {H, T}.
By a "fair" coin we mean a balanced coin that is equally as likely to land heads up as tails up, and hence the reasonable probability function is the one that assigns equal probabilities to heads and tails, that is,
(1.2.2) P({H}) = P({T}).
Note that (1.2.2) does not follow from the Axioms of Probability but rather is outside of the axioms. We have used a symmetry interpretation of probability (or just intuition) to impose the requirement that heads and tails be equally probable. Since S = {H} ∪ {T}, we have, from Axiom 2, P({H} ∪ {T}) = 1. Also, {H} and {T} are disjoint, so P({H} ∪ {T}) = P({H}) + P({T}) and
(1.2.3) P({H}) + P({T}) = 1.
Simultaneously solving (1.2.2) and (1.2.3) shows that P({H}) = P({T}) = 1/2.
Since (1.2.2) is based on our knowledge of the particular experiment, not the axioms, any nonnegative values for P({H}) and P({T}) that satisfy (1.2.3) define a legitimate probability function. For example, we might choose P({H}) = .2 and P({T}) = .8.

We need general methods of defining probability functions that we know will always satisfy Kolmogorov's Axioms. We do not want to have to check the Axioms for each new probability function, like we did in Example 1.2.5. The following gives a common method of defining a legitimate probability function.

Theorem 1.2.6 Let S = {s_1, ..., s_n} be a finite set. Let B be any sigma algebra of subsets of S. Let p_1, ..., p_n be nonnegative numbers that sum to 1. For any A ∈ B, define P(A) by
P(A) = Σ_{i: s_i ∈ A} p_i.
(The sum over an empty set is defined to be 0.) Then P is a probability function on B. This remains true if S = {s_1, s_2, ...} is a countable set.

Proof: We will give the proof for finite S. For any A ∈ B, P(A) = Σ_{i: s_i ∈ A} p_i ≥ 0, because every p_i ≥ 0. Thus, Axiom 1 is true. Now,
P(S) = Σ_{i: s_i ∈ S} p_i = Σ_{i=1}^n p_i = 1.
Thus, Axiom 2 is true. Let A_1, ..., A_k denote pairwise disjoint events. (B contains only a finite number of sets, so we need consider only finite disjoint unions.) Then,
P(∪_{i=1}^k A_i) = Σ_{j: s_j ∈ ∪A_i} p_j = Σ_{i=1}^k Σ_{j: s_j ∈ A_i} p_j = Σ_{i=1}^k P(A_i).
The first and third equalities are true by the definition of P(A). The disjointedness of the A_is ensures that the second equality is true, because the same p_js appear exactly once on each side of the equality. Thus, Axiom 3 is true and Kolmogorov's Axioms are satisfied.

The physical reality of the experiment might dictate the probability assignment, as the next example illustrates.

Example 1.2.7 (Defining probabilities-II) The game of darts is played by throwing a dart at a board and receiving a score corresponding to the number assigned to the region in which the dart lands. For a novice player, it seems reasonable to assume that the probability of the dart hitting a particular region is proportional to the area of the region. Thus, a bigger region has a higher probability of being hit.

Referring to Figure 1.2.1, we see that the dart board has radius r and the distance between rings is r/5. If we make the assumption that the board is always hit (see Exercise 1.7 for a variation on this), then we have
P(scoring i points) = (Area of region i) / (Area of dart board).
For example,
P(scoring 1 point) = (πr^2 − π(4r/5)^2) / (πr^2).
It is easy to derive the general formula, and we find that
P(scoring i points) = ((6 − i)^2 − (5 − i)^2) / 5^2,
independent of π and r. The sum of the areas of the disjoint regions equals the area of the dart board. Thus, the probabilities that have been assigned to the five outcomes sum to 1, and, by Theorem 1.2.6, this is a probability function (see Exercise 1.8).

Figure 1.2.1. Dart board for Example 1.2.7

Before we leave the axiomatic development of probability, there is one further point to consider. Axiom 3 of Definition 1.2.4, which is commonly known as the Axiom of Countable Additivity, is not universally accepted among statisticians. Indeed, it can be argued that axioms should be simple, self-evident statements.
Comparing Axiom 3 to the other axioms, which are simple and self-evident, may lead us to doubt whether it is reasonable to assume the truth of Axiom 3. The Axiom of Countable Additivity is rejected by a school of statisticians led by deFinetti (1972), who chooses to replace this axiom with the Axiom of Finite Additivity.

Axiom of Finite Additivity: If A ∈ B and B ∈ B are disjoint, then
P(A ∪ B) = P(A) + P(B).

While this axiom may not be entirely self-evident, it is certainly simpler than the Axiom of Countable Additivity (and is implied by it; see Exercise 1.12).

Assuming only finite additivity, while perhaps more plausible, can lead to unexpected complications in statistical theory, complications that, at this level, do not necessarily enhance understanding of the subject. We therefore proceed under the assumption that the Axiom of Countable Additivity holds.

1.2.2 The Calculus of Probabilities

From the Axioms of Probability we can build up many properties of the probability function, properties that are quite helpful in the calculation of more complicated probabilities. Some of these manipulations will be discussed in detail in this section; others will be left as exercises.

We start with some (fairly self-evident) properties of the probability function when applied to a single event.

Theorem 1.2.8 If P is a probability function and A is any set in B, then
a. P(∅) = 0, where ∅ is the empty set;
b. P(A) ≤ 1;
c. P(A^c) = 1 − P(A).

Proof: It is easiest to prove (c) first. The sets A and A^c form a partition of the sample space, that is, S = A ∪ A^c. Therefore,
(1.2.4) P(A ∪ A^c) = P(S) = 1
by the second axiom. Also, A and A^c are disjoint, so by the third axiom,
(1.2.5) P(A ∪ A^c) = P(A) + P(A^c).
Combining (1.2.4) and (1.2.5) gives (c). Since P(A^c) ≥ 0, (b) is immediately implied by (c). To prove (a), we use a similar argument on S = S ∪ ∅. (Recall that both S and ∅ are always in B.)
Since S and ∅ are disjoint, we have
1 = P(S) = P(S ∪ ∅) = P(S) + P(∅),
and thus P(∅) = 0.

Theorem 1.2.8 contains properties that are so basic that they also have the flavor of axioms, although we have formally proved them using only the original three Kolmogorov Axioms. The next theorem, which is similar in spirit to Theorem 1.2.8, contains statements that are not so self-evident.

Theorem 1.2.9 If P is a probability function and A and B are any sets in B, then
a. P(B ∩ A^c) = P(B) − P(A ∩ B);
b. P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
c. If A ⊂ B, then P(A) ≤ P(B).

Proof: To establish (a) note that for any sets A and B we have
B = {B ∩ A} ∪ {B ∩ A^c},
and therefore
(1.2.6) P(B) = P({B ∩ A} ∪ {B ∩ A^c}) = P(B ∩ A) + P(B ∩ A^c),
where the last equality in (1.2.6) follows from the fact that B ∩ A and B ∩ A^c are disjoint. Rearranging (1.2.6) gives (a).
To establish (b), we use the identity
(1.2.7) A ∪ B = A ∪ {B ∩ A^c}.
A Venn diagram will show why (1.2.7) holds, although a formal proof is not difficult (see Exercise 1.2). Using (1.2.7) and the fact that A and B ∩ A^c are disjoint (since A and A^c are), we have
(1.2.8) P(A ∪ B) = P(A) + P(B ∩ A^c) = P(A) + P(B) − P(A ∩ B)
from (a). If A ⊂ B, then A ∩ B = A. Therefore, using (a) we have
0 ≤ P(B ∩ A^c) = P(B) − P(A),
establishing (c).

Formula (b) of Theorem 1.2.9 gives a useful inequality for the probability of an intersection. Since P(A ∪ B) ≤ 1, we have from (1.2.8), after some rearranging,
(1.2.9) P(A ∩ B) ≥ P(A) + P(B) − 1.
This inequality is a special case of what is known as Bonferroni's Inequality (Miller 1981 is a good reference). Bonferroni's Inequality allows us to bound the probability of a simultaneous event (the intersection) in terms of the probabilities of the individual events.

Example 1.2.10 (Bonferroni's Inequality) Bonferroni's Inequality is particularly useful when it is difficult (or even impossible) to calculate the intersection probability, but some idea of the size of this probability is desired. Suppose A and B are two events and each has probability .95. Then the probability that both will occur is bounded below by
P(A ∩ B) ≥ P(A) + P(B) − 1 = .95 + .95 − 1 = .90.
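The bound in Example 1.2.10 can be checked exactly on a finite sample space with equally likely outcomes. In the Python sketch below, the sample space and the two events are illustrative choices (not from the text), sized so that each event has probability .95; for this particular pair the Bonferroni bound happens to be attained with equality.

```python
from fractions import Fraction

S = set(range(20))                      # 20 equally likely outcomes

def P(E):
    """Probability of event E under equally likely outcomes."""
    return Fraction(len(E & S), len(S))

A = set(range(0, 19))                   # P(A) = 19/20 = .95
B = set(range(1, 20))                   # P(B) = 19/20 = .95

assert P(A) == P(B) == Fraction(19, 20)
# Bonferroni's Inequality (1.2.9): P(A ∩ B) >= P(A) + P(B) - 1
assert P(A & B) >= P(A) + P(B) - 1      # here 18/20 >= 18/20, equality
```

Using exact rational arithmetic (`fractions.Fraction`) avoids any floating-point slack in the comparison.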
Note that unless the probabilities of the individual events are sufficiently large, the Bonferroni bound is a useless (but correct!) negative number.

We close this section with a theorem that gives some useful results for dealing with a collection of sets.

Theorem 1.2.11 If P is a probability function, then
a. P(A) = Σ_{i=1}^∞ P(A ∩ C_i) for any partition C_1, C_2, ...;
b. P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i) for any sets A_1, A_2, .... (Boole's Inequality)

Proof: Since C_1, C_2, ... form a partition, we have that C_i ∩ C_j = ∅ for all i ≠ j, and S = ∪_{i=1}^∞ C_i. Hence,
A = A ∩ S = A ∩ (∪_{i=1}^∞ C_i) = ∪_{i=1}^∞ (A ∩ C_i),
where the last equality follows from the Distributive Law (Theorem 1.1.4). We therefore have
P(A) = P(∪_{i=1}^∞ (A ∩ C_i)).
Now, since the C_i are disjoint, the sets A ∩ C_i are also disjoint, and from the properties of a probability function we have
P(∪_{i=1}^∞ (A ∩ C_i)) = Σ_{i=1}^∞ P(A ∩ C_i),
establishing (a).

To establish (b) we first construct a disjoint collection A_1*, A_2*, ..., with the property that ∪_{i=1}^∞ A_i* = ∪_{i=1}^∞ A_i. We define A_i* by
A_1* = A_1,  A_i* = A_i \ (∪_{j=1}^{i−1} A_j),  i = 2, 3, ...,
where the notation A\B denotes the part of A that does not intersect with B. In more familiar symbols, A\B = A ∩ B^c. It should be easy to see that ∪_{i=1}^∞ A_i* = ∪_{i=1}^∞ A_i, and we therefore have
P(∪_{i=1}^∞ A_i) = P(∪_{i=1}^∞ A_i*) = Σ_{i=1}^∞ P(A_i*),
where the last equality follows since the A_i* are disjoint. To see this, we write
A_i* ∩ A_k* = {A_i \ (∪_{j=1}^{i−1} A_j)} ∩ {A_k \ (∪_{j=1}^{k−1} A_j)}  (definition of "\")
= {A_i ∩ (∪_{j=1}^{i−1} A_j)^c} ∩ {A_k ∩ (∪_{j=1}^{k−1} A_j)^c}  (definition of "\")
= {A_i ∩ (∩_{j=1}^{i−1} A_j^c)} ∩ {A_k ∩ (∩_{j=1}^{k−1} A_j^c)}.  (DeMorgan's Laws)
Now if i > k, the first intersection above will be contained in the set A_k^c, which will have an empty intersection with A_k. If k > i, the argument is similar. Further, by construction A_i* ⊂ A_i, so P(A_i*) ≤ P(A_i) and we have
Σ_{i=1}^∞ P(A_i*) ≤ Σ_{i=1}^∞ P(A_i),
establishing (b).

There is a similarity between Boole's Inequality and Bonferroni's Inequality. In fact, they are essentially the same thing. We could have used Boole's Inequality to derive (1.2.9).
If we apply Boole's Inequality to A^c, we have
P(∪_{i=1}^n A_i^c) ≤ Σ_{i=1}^n P(A_i^c),
and using the facts that ∪A_i^c = (∩A_i)^c and P(A_i^c) = 1 − P(A_i), we obtain
1 − P(∩_{i=1}^n A_i) ≤ n − Σ_{i=1}^n P(A_i).
Rearranging gives
P(∩_{i=1}^n A_i) ≥ Σ_{i=1}^n P(A_i) − (n − 1),
which is a more general version of the Bonferroni Inequality of (1.2.9).

1.2.3 Counting

The elementary process of counting can become quite sophisticated when placed in the hands of a statistician. Most often, methods of counting are used in order to construct probability assignments on finite sample spaces, although they can be used to answer other questions also.

Example 1.2.12 (Lottery-I) For a number of years the New York state lottery operated according to the following scheme. From the numbers 1, 2, ..., 44, a person may pick any six for her ticket. The winning number is then decided by randomly selecting six numbers from the forty-four. To be able to calculate the probability of winning we first must count how many different groups of six numbers can be chosen from the forty-four.

Example 1.2.13 (Tournament) In a single-elimination tournament, such as the U.S. Open tennis tournament, players advance only if they win (in contrast to double-elimination or round-robin tournaments). If we have 16 entrants, we might be interested in the number of paths a particular player can take to victory, where a path is taken to mean a sequence of opponents.

Counting problems, in general, sound complicated, and often we must do our counting subject to many restrictions. The way to solve such problems is to break them down into a series of simple tasks that are easy to count, and employ known rules of combining tasks. The following theorem is a first step in such a process and is sometimes known as the Fundamental Theorem of Counting.

Theorem 1.2.14 If a job consists of k separate tasks, the ith of which can be done in n_i ways, i = 1, ..., k, then the entire job can be done in n_1 × n_2 × ··· × n_k ways.

Proof: It suffices to prove the theorem for k = 2 (see Exercise 1.15).
The proof is just a matter of careful counting. The first task can be done in n_1 ways, and for each of these ways we have n_2 choices for the second task. Thus, we can do the job in
(1 × n_2) + (1 × n_2) + ··· + (1 × n_2) = n_1 × n_2
(n_1 terms)
ways, establishing the theorem for k = 2.

Example 1.2.15 (Lottery-II) Although the Fundamental Theorem of Counting is a reasonable place to start, in applications there are usually more aspects of a problem to consider. For example, in the New York state lottery the first number can be chosen in 44 ways, and the second number in 43 ways, making a total of 44 × 43 = 1,892 ways of choosing the first two numbers. However, if a person is allowed to choose the same number twice, then the first two numbers can be chosen in 44 × 44 = 1,936 ways.

The distinction being made in Example 1.2.15 is between counting with replacement and counting without replacement. There is a second crucial element in any counting problem, whether or not the ordering of the tasks is important. To illustrate with the lottery example, suppose the winning numbers are selected in the order 12, 37, 35, 9, 13, 22. Does a person who selected 9, 12, 13, 22, 35, 37 qualify as a winner? In other words, does the order in which the task is performed actually matter? Taking all of these considerations into account, we can construct a 2 × 2 table of possibilities:

Possible methods of counting
            Without replacement    With replacement
Ordered
Unordered

Before we begin to count, the following definition gives us some extremely helpful notation.

Definition 1.2.16 For a positive integer n, n! (read n factorial) is the product of all of the positive integers less than or equal to n. That is,
n! = n × (n − 1) × (n − 2) × ··· × 3 × 2 × 1.
Furthermore, we define 0! = 1.

Let us now consider counting all of the possible lottery tickets under each of these four cases.

1.
Ordered, without replacement  From the Fundamental Theorem of Counting, the first number can be selected in 44 ways, the second in 43 ways, etc. So there are

44 × 43 × 42 × 41 × 40 × 39 = 44!/38! = 5,082,517,440

possible tickets.

2. Ordered, with replacement  Since each number can now be selected in 44 ways (because the chosen number is replaced), there are

44 × 44 × 44 × 44 × 44 × 44 = 44⁶ = 7,256,313,856

possible tickets.

3. Unordered, without replacement  We know the number of possible tickets when the ordering must be accounted for, so what we must do is divide out the redundant orderings. Again from the Fundamental Theorem, six numbers can be arranged in 6 × 5 × 4 × 3 × 2 × 1 ways, so the total number of unordered tickets is

(44 × 43 × 42 × 41 × 40 × 39)/(6 × 5 × 4 × 3 × 2 × 1) = 44!/(6! 38!) = 7,059,052.

This form of counting plays a central role in much of statistics; so much, in fact, that it has earned its own notation.

Definition 1.2.17 For nonnegative integers n and r, where n ≥ r, we define the symbol (n choose r), read n choose r, as

(n choose r) = n!/(r! (n − r)!).

In our lottery example, the number of possible tickets (unordered, without replacement) is (44 choose 6). These numbers are also referred to as binomial coefficients, for reasons that will become clear in Chapter 3.

4. Unordered, with replacement  This is the most difficult case to count. You might first guess that the answer is 44⁶/(6 × 5 × 4 × 3 × 2 × 1), but this is not correct (it is too small). To count in this case, it is easiest to think of placing 6 markers on the 44 numbers. In fact, we can think of the 44 numbers defining bins in which we can place the six markers, M, as shown, for example, in this figure.

(Figure: six markers M distributed among 44 walled bins.)

The number of possible tickets is then equal to the number of ways that we can put the 6 markers into the 44 bins. But this can be further reduced by noting that all we need to keep track of is the arrangement of the markers and the walls of the bins.
Note further that the two outermost walls play no part. Thus, we have to count all of the arrangements of 43 walls (44 bins yield 45 walls, but we disregard the two end walls) and 6 markers. We therefore have 43 + 6 = 49 objects, which can be arranged in 49! ways. However, to eliminate the redundant orderings we must divide by both 6! and 43!, so the total number of arrangements is

49!/(6! 43!) = 13,983,816.

Although all of the preceding derivations were done in terms of an example, it should be easy to see that they hold in general. For completeness, we can summarize these situations in Table 1.2.1.

Table 1.2.1. Number of possible arrangements of size r from n objects

            Without replacement    With replacement
Ordered     n!/(n − r)!            nʳ
Unordered   (n choose r)           (n+r−1 choose r)

1.2.4 Enumerating Outcomes

The counting techniques of the previous section are useful when the sample space S is a finite set and all the outcomes in S are equally likely. Then probabilities of events can be calculated by simply counting the number of outcomes in the event. To see this, suppose that S = {s₁, ..., s_N} is a finite sample space. Saying that all the outcomes are equally likely means that P({sᵢ}) = 1/N for every outcome sᵢ. Then, using Axiom 3 from Definition 1.2.4, we have, for any event A,

P(A) = Σ_{sᵢ ∈ A} P({sᵢ}) = Σ_{sᵢ ∈ A} 1/N = (# of elements in A)/(# of elements in S).

For large sample spaces, the counting techniques might be used to calculate both the numerator and denominator of this expression.

Example 1.2.18 (Poker) Consider choosing a five-card poker hand from a standard deck of 52 playing cards. Obviously, we are sampling without replacement from the deck. But to specify the possible outcomes (possible hands), we must decide whether we think of the hand as being dealt sequentially (ordered) or all at once (unordered).
If we wish to calculate probabilities for events that depend on the order, such as the probability of an ace in the first two cards, then we must use the ordered outcomes. But if our events do not depend on the order, we can use the unordered outcomes. For this example we use the unordered outcomes, so the sample space consists of all the five-card hands that can be chosen from the 52-card deck. There are (52 choose 5) = 2,598,960 possible hands. If the deck is well shuffled and the cards are randomly dealt, it is reasonable to assign probability 1/2,598,960 to each possible hand.

We now calculate some probabilities by counting outcomes in events. What is the probability of having four aces? How many different hands are there with four aces? If we specify that four of the cards are aces, then there are 48 different ways of specifying the fifth card. Thus,

P(four aces) = 48/2,598,960,

less than 1 chance in 50,000. Only slightly more complicated counting, using Theorem 1.2.14, allows us to calculate the probability of having four of a kind. There are 13 ways to specify which denomination there will be four of. After we specify these four cards, there are 48 ways of specifying the fifth. Thus, the total number of hands with four of a kind is (13)(48) and

P(four of a kind) = (13)(48)/2,598,960 = 624/2,598,960.

To calculate the probability of exactly one pair (not two pairs, no three of a kind, etc.) we combine some of the counting techniques. The number of hands with exactly one pair is

(1.2.11)   13 (4 choose 2) (12 choose 3) 4³ = 1,098,240.

Expression (1.2.11) comes from Theorem 1.2.14 because

13 = # of ways to specify the denomination for the pair,
(4 choose 2) = # of ways to specify the two cards from that denomination,
(12 choose 3) = # of ways of specifying the other three denominations,
4³ = # of ways of specifying the other three cards from those denominations.
Thus,

P(exactly one pair) = 1,098,240/2,598,960. ‖

When sampling without replacement, as in Example 1.2.18, if we want to calculate the probability of an event that does not depend on the order, we can use either the ordered or unordered sample space. Each outcome in the unordered sample space corresponds to r! outcomes in the ordered sample space. So, when counting outcomes in the ordered sample space, we use a factor of r! in both the numerator and denominator that will cancel to give the same probability as if we counted in the unordered sample space.

The situation is different if we sample with replacement. Each outcome in the unordered sample space corresponds to some outcomes in the ordered sample space, but the number of outcomes differs.

Example 1.2.19 (Sampling with replacement) Consider sampling r = 2 items from n = 3 items, with replacement. The outcomes in the ordered and unordered sample spaces are these.

Unordered    {1,1}  {2,2}  {3,3}  {1,2}         {1,3}         {2,3}
Ordered      (1,1)  (2,2)  (3,3)  (1,2),(2,1)   (1,3),(3,1)   (2,3),(3,2)
Probability  1/9    1/9    1/9    2/9           2/9           2/9

The probabilities come from considering the nine outcomes in the ordered sample space to be equally likely. This corresponds to the common interpretation of "sampling with replacement"; namely, one of the three items is chosen, each with probability 1/3; the item is noted and replaced; the items are mixed and again one of the three items is chosen, each with probability 1/3. It is seen that the six outcomes in the unordered sample space are not equally likely under this kind of sampling. The formula for the number of outcomes in the unordered sample space is useful for enumerating the outcomes, but ordered outcomes must be counted to correctly calculate probabilities.
Some authors argue that it is appropriate to assign equal probabilities to the unordered outcomes when "randomly distributing r indistinguishable balls into n distinguishable urns." That is, an urn is chosen at random and a ball placed in it, and this is repeated r times. The order in which the balls are placed is not recorded so, in the end, an outcome such as {1,3} means one ball is in urn 1 and one ball is in urn 3.

But here is the problem with this interpretation. Suppose two people observe this process, and Observer 1 records the order in which the balls are placed but Observer 2 does not. Observer 1 will assign probability 2/9 to the event {1,3}. Observer 2, who is observing exactly the same process, should also assign probability 2/9 to this event. But if the six unordered outcomes are written on identical pieces of paper and one is randomly chosen to determine the placement of the balls, then the unordered outcomes each have probability 1/6. So Observer 2 will assign probability 1/6 to the event {1,3}.

The confusion arises because the phrase "with replacement" will typically be interpreted with the sequential kind of sampling we described above, leading to assigning a probability 2/9 to the event {1,3}. This is the correct way to proceed, as probabilities should be determined by the sampling mechanism, not whether the balls are distinguishable or indistinguishable.

Example 1.2.20 (Calculating an average) As an illustration of the distinguishable/indistinguishable approach, suppose that we are going to calculate all possible averages of four numbers selected from

2, 4, 9, 12,

where we draw the numbers with replacement. For example, possible draws are {2,4,4,9} with average 4.75 and {4,4,9,9} with average 6.5. If we are only interested in the average of the sampled numbers, the ordering is unimportant, and thus the total number of distinct samples is obtained by counting according to unordered, with-replacement sampling.
The total number of distinct samples is (n+r−1 choose r). But now, to calculate the probability distribution of the sampled averages, we must count the different ways that a particular average can occur. The value 4.75 can occur only if the sample contains one 2, two 4s, and one 9. The number of possible samples that have this configuration is given in the following table:

(Figure 1.2.2. Histogram of averages of samples with replacement from the four numbers {2, 4, 9, 12}.)

Unordered    Ordered
{2,4,4,9}    (2,4,4,9), (2,4,9,4), (2,9,4,4), (4,2,4,9),
             (4,2,9,4), (4,4,2,9), (4,4,9,2), (4,9,2,4),
             (4,9,4,2), (9,2,4,4), (9,4,2,4), (9,4,4,2)

The total number of ordered samples is nʳ = 4⁴ = 256, so the probability of drawing the unordered sample {2,4,4,9} is 12/256. Compare this to the probability that we would have obtained if we regarded the unordered samples as equally likely: we would have assigned probability 1/(n+r−1 choose r) = 1/(7 choose 4) = 1/35 to {2,4,4,9} and to every other unordered sample.

To count the number of ordered samples that would result in {2,4,4,9}, we argue as follows. We need to enumerate the possible orders of the four numbers {2,4,4,9}, so we are essentially using counting method 1 of Section 1.2.3. We can order the sample in 4 × 3 × 2 × 1 = 24 ways. But there is a bit of double counting here, since we cannot count distinct arrangements of the two 4s. For example, the 24 ways would count {9,4,2,4} twice (which would be OK if the 4s were different). To correct this, we divide by 2! (there are 2! ways to arrange the two 4s) and obtain 24/2 = 12 ordered samples. In general, if there are k places and we have m different numbers repeated k₁, k₂, ..., k_m times, then the number of ordered samples is

k!/(k₁! k₂! ⋯ k_m!).
This type of counting is related to the multinomial distribution, which we will see in Section 4.6. Figure 1.2.2 is a histogram of the probability distribution of the sample averages, reflecting the multinomial counting of the samples.

There is also one further refinement that is reflected in Figure 1.2.2. It is possible that two different unordered samples will result in the same mean. For example, the unordered samples {4,4,12,12} and {2,9,9,12} both result in an average value of 8. The first sample has probability 3/128 and the second has probability 3/64, giving the value 8 a probability of 9/128 = .07. See Example A.0.1 in Appendix A for details on constructing such a histogram. The calculation that we have done in this example is an elementary version of a very important statistical technique known as the bootstrap (Efron and Tibshirani 1993). We will return to the bootstrap in Section 10.1.4. ‖

1.3 Conditional Probability and Independence

All of the probabilities that we have dealt with thus far have been unconditional probabilities. A sample space was defined and all probabilities were calculated with respect to that sample space. In many instances, however, we are in a position to update the sample space based on new information. In such cases, we want to be able to update probability calculations or to calculate conditional probabilities.

Example 1.3.1 (Four aces) Four cards are dealt from the top of a well-shuffled deck. What is the probability that they are the four aces? We can calculate this probability by the methods of the previous section. The number of distinct groups of four cards is

(52 choose 4) = 270,725.

Only one of these groups consists of the four aces and every group is equally likely, so the probability of being dealt all four aces is 1/270,725.

We can also calculate this probability by an "updating" argument, as follows. The probability that the first card is an ace is 4/52.
Given that the first card is an ace, the probability that the second card is an ace is 3/51 (there are 3 aces and 51 cards left). Continuing this argument, we get the desired probability as

(4/52)(3/51)(2/50)(1/49) = 1/270,725. ‖

In our second method of solving the problem, we updated the sample space after each draw of a card; we calculated conditional probabilities.

Definition 1.3.2 If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is

(1.3.1)   P(A|B) = P(A ∩ B)/P(B).

Note that what happens in the conditional probability calculation is that B becomes the sample space: P(B|B) = 1. The intuition is that our original sample space, S, has been updated to B. All further occurrences are then calibrated with respect to their relation to B. In particular, note what happens to conditional probabilities of disjoint sets. Suppose A and B are disjoint, so P(A ∩ B) = 0. It then follows that P(A|B) = P(B|A) = 0.

Example 1.3.3 (Continuation of Example 1.3.1) Although the probability of getting all four aces is quite small, let us see how the conditional probabilities change given that some aces have already been drawn. Four cards will again be dealt from a well-shuffled deck, and we now calculate

P(4 aces in 4 cards | i aces in i cards),   i = 1, 2, 3.

The event {4 aces in 4 cards} is a subset of the event {i aces in i cards}. Thus, from the definition of conditional probability, (1.3.1), we know that

P(4 aces in 4 cards | i aces in i cards)
  = P({4 aces in 4 cards} ∩ {i aces in i cards}) / P(i aces in i cards)
  = P(4 aces in 4 cards) / P(i aces in i cards).

The numerator has already been calculated, and the denominator can be calculated with a similar argument. The number of distinct groups of i cards is (52 choose i), and

P(i aces in i cards) = (4 choose i)/(52 choose i).
Therefore, the conditional probability is given by

P(4 aces in 4 cards | i aces in i cards) = (52 choose i)/[(4 choose i)(52 choose 4)] = (4 − i)! 48!/(52 − i)!.

For i = 1, 2, and 3, the conditional probabilities are .00005, .00082, and .02041, respectively. ‖

For any B for which P(B) > 0, it is straightforward to verify that the probability function P(·|B) satisfies Kolmogorov's Axioms (see Exercise 1.35). You may suspect that requiring P(B) > 0 is redundant. Who would want to condition on an event of probability 0? Interestingly, sometimes this is a particularly useful way of thinking of things. However, we will defer these considerations until Chapter 4.

Conditional probabilities can be particularly slippery entities and sometimes require careful thought. Consider the following often-told tale.

Example 1.3.4 (Three prisoners) Three prisoners, A, B, and C, are on death row. The governor decides to pardon one of the three and chooses at random the prisoner to pardon. He informs the warden of his choice but requests that the name be kept secret for a few days.

The next day, A tries to get the warden to tell him who had been pardoned. The warden refuses. A then asks which of B or C will be executed. The warden thinks for a while, then tells A that B is to be executed.

Warden's reasoning: Each prisoner has a 1/3 chance of being pardoned. Clearly, either B or C must be executed, so I have given A no information about whether A will be pardoned.

A's reasoning: Given that B will be executed, then either A or C will be pardoned. My chance of being pardoned has risen to 1/2.

It should be clear that the warden's reasoning is correct, but let us see why. Let A, B, and C denote the events that A, B, or C is pardoned, respectively. We know that P(A) = P(B) = P(C) = 1/3. Let W denote the event that the warden says B will die.
Using (1.3.1), A can update his probability of being pardoned to

P(A|W) = P(A ∩ W)/P(W).

What is happening can be summarized in this table:

Prisoner pardoned    Warden tells A
A                    B dies   (each with equal
A                    C dies    probability)
B                    C dies
C                    B dies

Using this table, we can calculate

P(W) = P(warden says B dies)
     = P(warden says B dies and A pardoned)
       + P(warden says B dies and C pardoned)
       + P(warden says B dies and B pardoned)
     = 1/6 + 1/3 + 0 = 1/2.

Thus, using the warden's reasoning, we have

(1.3.2)   P(A|W) = P(A ∩ W)/P(W) = P(warden says B dies and A pardoned)/P(warden says B dies) = (1/6)/(1/2) = 1/3.

However, A falsely interprets the event W as equal to the event Bᶜ and calculates

P(A|Bᶜ) = P(A ∩ Bᶜ)/P(Bᶜ) = (1/3)/(2/3) = 1/2.

We see that conditional probabilities can be quite slippery and require careful interpretation. For some other variations of this problem, see Exercise 1.37. ‖

Re-expressing (1.3.1) gives a useful form for calculating intersection probabilities,

(1.3.3)   P(A ∩ B) = P(A|B)P(B),

which is essentially the formula that was used in Example 1.3.1. We can take advantage of the symmetry of (1.3.3) and also write

(1.3.4)   P(A ∩ B) = P(B|A)P(A).

When faced with seemingly difficult calculations, we can break up our calculations according to (1.3.3) or (1.3.4), whichever is easier. Furthermore, we can equate the right-hand sides of these equations to obtain (after rearrangement)

(1.3.5)   P(A|B) = P(B|A)P(A)/P(B),

which gives us a formula for "turning around" conditional probabilities. Equation (1.3.5) is often called Bayes' Rule for its discoverer, Sir Thomas Bayes (although see Stigler 1983). Bayes' Rule has a more general form than (1.3.5), one that applies to partitions of a sample space. We therefore take the following as the definition of Bayes' Rule.

Theorem 1.3.5 (Bayes' Rule) Let A₁, A₂, ... be a partition of the sample space, and let B be any set.
Then, for each i = 1, 2, ...,

P(Aᵢ|B) = P(B|Aᵢ)P(Aᵢ) / Σ_{j=1}^∞ P(B|Aⱼ)P(Aⱼ).

Example 1.3.6 (Coding) When coded messages are sent, there are sometimes errors in transmission. In particular, Morse code uses "dots" and "dashes," which are known to occur in the proportion of 3:4. This means that for any given symbol,

P(dot sent) = 3/7   and   P(dash sent) = 4/7.

Suppose there is interference on the transmission line, and with probability 1/8 a dot is mistakenly received as a dash, and vice versa. If we receive a dot, can we be sure that a dot was sent? Using Bayes' Rule, we can write

P(dot sent | dot received) = P(dot received | dot sent) P(dot sent)/P(dot received).

Now, from the information given, we know that P(dot sent) = 3/7 and P(dot received | dot sent) = 7/8. Furthermore, we can also write

P(dot received) = P(dot received ∩ dot sent) + P(dot received ∩ dash sent)
  = P(dot received | dot sent)P(dot sent) + P(dot received | dash sent)P(dash sent)
  = (7/8)(3/7) + (1/8)(4/7) = 25/56.

Combining these results, we have that the probability of correctly receiving a dot is

P(dot sent | dot received) = (3/8)/(25/56) = 21/25. ‖

In some cases it may happen that the occurrence of a particular event, B, has no effect on the probability of another event, A. Symbolically, we are saying that

(1.3.6)   P(A|B) = P(A).

If this holds, then by Bayes' Rule (1.3.5) and using (1.3.6) we have

(1.3.7)   P(B|A) = P(A|B)P(B)/P(A) = P(A)P(B)/P(A) = P(B),

so the occurrence of A has no effect on B. Moreover, since P(B|A)P(A) = P(A ∩ B), it then follows that

P(A ∩ B) = P(A)P(B),

which we take as the definition of statistical independence.

Definition 1.3.7 Two events, A and B, are statistically independent if

(1.3.8)   P(A ∩ B) = P(A)P(B).

Note that independence could have been equivalently defined by either (1.3.6) or (1.3.7) (as long as either P(A) > 0 or P(B) > 0). The advantage of (1.3.8) is that it treats the events symmetrically and will be easier to generalize to more than two events.
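The Bayes' Rule calculation of Example 1.3.6 is easy to check numerically. The sketch below is ours, not the text's; it uses exact rational arithmetic so no rounding obscures the answer.

```python
from fractions import Fraction

# Prior probabilities of the transmitted symbol (dots:dashes occur 3:4).
prior = {"dot": Fraction(3, 7), "dash": Fraction(4, 7)}

# Channel model: probability that a dot is RECEIVED, given each symbol SENT
# (each symbol is flipped with probability 1/8).
p_receive_dot = {"dot": Fraction(7, 8), "dash": Fraction(1, 8)}

# Bayes' Rule (Theorem 1.3.5): P(dot sent | dot received).
numerator = p_receive_dot["dot"] * prior["dot"]                  # 3/8
denominator = sum(p_receive_dot[s] * prior[s] for s in prior)    # 25/56
posterior = numerator / denominator
print(posterior)  # 21/25
```

The denominator is exactly the partition sum in Theorem 1.3.5, here over the two-event partition {dot sent, dash sent}.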
Many gambling games provide models of independent events. The spins of a roulette wheel and the tosses of a pair of dice are both series of independent events.

Example 1.3.8 (Chevalier de Méré) The gambler introduced at the start of the chapter, the Chevalier de Méré, was particularly interested in the event that he could throw at least 1 six in 4 rolls of a die. We have

P(at least 1 six in 4 rolls) = 1 − P(no six in 4 rolls)
  = 1 − ∏_{i=1}^{4} P(no six on roll i),

where the last equality follows by independence of the rolls. On any roll, the probability of not rolling a six is 5/6, so

P(at least 1 six in 4 rolls) = 1 − (5/6)⁴ = .518. ‖

Independence of A and B implies independence of the complements also. In fact, we have the following theorem.

Theorem 1.3.9 If A and B are independent events, then the following pairs are also independent:
a. A and Bᶜ,
b. Aᶜ and B,
c. Aᶜ and Bᶜ.

Proof: We will prove only (a), leaving the rest as Exercise 1.40. To prove (a) we must show that P(A ∩ Bᶜ) = P(A)P(Bᶜ). From Theorem 1.2.9 we have

P(A ∩ Bᶜ) = P(A) − P(A ∩ B)
  = P(A) − P(A)P(B)   (A and B are independent)
  = P(A)(1 − P(B))
  = P(A)P(Bᶜ). □

Independence of more than two events can be defined in a manner similar to (1.3.8), but we must be careful. For example, we might think that we could say A, B, and C are independent if P(A ∩ B ∩ C) = P(A)P(B)P(C). However, this is not the correct condition.

Example 1.3.10 (Tossing two dice) Let an experiment consist of tossing two dice. For this experiment the sample space is

S = {(1,1), (1,2), ..., (1,6), (2,1), ..., (2,6), ..., (6,1), ..., (6,6)};

that is, S consists of the 36 ordered pairs formed from the numbers 1 to 6. Define the following events:

A = {doubles appear} = {(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)},
B = {the sum is between 7 and 10},
C = {the sum is 2 or 7 or 8}.

The probabilities can be calculated by counting among the 36 possible outcomes.
We have

P(A) = 1/6,   P(B) = 1/2,   and   P(C) = 1/3.

Furthermore,

P(A ∩ B ∩ C) = P(the sum is 8, composed of double 4s) = 1/36 = (1/6)(1/2)(1/3) = P(A)P(B)P(C).

However,

P(B ∩ C) = P(sum equals 7 or 8) = 11/36 ≠ P(B)P(C).

Similarly, it can be shown that P(A ∩ B) ≠ P(A)P(B); therefore, the requirement P(A ∩ B ∩ C) = P(A)P(B)P(C) is not a strong enough condition to guarantee pairwise independence. ‖

A second attempt at a general definition of independence, in light of the previous example, might be to define A, B, and C to be independent if all the pairs are independent. Alas, this condition also fails.

Example 1.3.11 (Letters) Let the sample space S consist of the 3! permutations of the letters a, b, and c along with the three triples of each letter. Thus,

S = { aaa  bbb  ccc
      abc  bca  cba
      acb  bac  cab }.

Furthermore, let each element of S have probability 1/9. Define

Aᵢ = {ith place in the triple is occupied by a}.

It is then easy to count that

P(Aᵢ) = 1/3,   i = 1, 2, 3,

and

P(A₁ ∩ A₂) = P(A₁ ∩ A₃) = P(A₂ ∩ A₃) = 1/9,

so the Aᵢs are pairwise independent. But

P(A₁ ∩ A₂ ∩ A₃) = 1/9 ≠ P(A₁)P(A₂)P(A₃),

so the Aᵢs do not satisfy the probability requirement. ‖

The preceding two examples show that simultaneous (or mutual) independence of a collection of events requires an extremely strong definition. The following definition works.

Definition 1.3.12 A collection of events A₁, ..., Aₙ are mutually independent if for any subcollection A_{i₁}, ..., A_{i_k}, we have

P(⋂_{j=1}^{k} A_{i_j}) = ∏_{j=1}^{k} P(A_{i_j}).

Example 1.3.13 (Three coin tosses-I) Consider the experiment of tossing a coin three times. A sample point for this experiment must indicate the result of each toss. For example, HHT could indicate that two heads and then a tail were observed. The sample space for this experiment has eight points, namely,

{HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}.

Let Hᵢ, i = 1, 2, 3, denote the event that the ith toss is a head. For example,

(1.3.9)   H₁ = {HHH, HHT, HTH, HTT}.
If we assign probability 1/8 to each sample point, then using enumerations such as (1.3.9), we see that P(H₁) = P(H₂) = P(H₃) = 1/2. This says that the coin is fair and has an equal probability of landing heads or tails on each toss.

Under this probability model, the events H₁, H₂, and H₃ are also mutually independent. To verify this we note that

P(H₁ ∩ H₂ ∩ H₃) = P({HHH}) = 1/8 = (1/2)(1/2)(1/2) = P(H₁)P(H₂)P(H₃).

To verify the condition in Definition 1.3.12, we also must check each pair. For example,

P(H₁ ∩ H₂) = P({HHH, HHT}) = 1/4 = (1/2)(1/2) = P(H₁)P(H₂).

The equality is also true for the other two pairs. Thus, H₁, H₂, and H₃ are mutually independent. That is, the occurrence of a head on any toss has no effect on any of the other tosses.

It can be verified that the assignment of probability 1/8 to each sample point is the only probability model that has P(H₁) = P(H₂) = P(H₃) = 1/2 and H₁, H₂, and H₃ mutually independent. ‖

1.4 Random Variables

In many experiments it is easier to deal with a summary variable than with the original probability structure. For example, in an opinion poll, we might decide to ask 50 people whether they agree or disagree with a certain issue. If we record a "1" for agree and "0" for disagree, the sample space for this experiment has 2⁵⁰ elements, each an ordered string of 1s and 0s of length 50. We should be able to reduce this to a reasonable size! It may be that the only quantity of interest is the number of people who agree (equivalently, disagree) out of 50 and, if we define a variable X = number of 1s recorded out of 50, we have captured the essence of the problem. Note that the sample space for X is the set of integers {0, 1, 2, ..., 50} and is much easier to deal with than the original sample space.

In defining the quantity X, we have defined a mapping (a function) from the original sample space to a new sample space, usually a set of real numbers. In general, we have the following definition.
Definition 1.4.1 A random variable is a function from a sample space S into the real numbers.

Example 1.4.2 (Random variables) In some experiments random variables are implicitly used; some examples are these.

Examples of random variables
Experiment                                            Random variable
Toss two dice                                         X = sum of the numbers
Toss a coin 25 times                                  X = number of heads in 25 tosses
Apply different amounts of fertilizer to corn plants  X = yield/acre  ‖

In defining a random variable, we have also defined a new sample space (the range of the random variable). We must now check formally that our probability function, which is defined on the original sample space, can be used for the random variable. Suppose we have a sample space

S = {s₁, ..., sₙ}

with a probability function P and we define a random variable X with range 𝒳 = {x₁, ..., xₘ}. We can define a probability function P_X on 𝒳 in the following way. We will observe X = xᵢ if and only if the outcome of the random experiment is an sⱼ ∈ S such that X(sⱼ) = xᵢ. Thus,

(1.4.1)   P_X(X = xᵢ) = P({sⱼ ∈ S : X(sⱼ) = xᵢ}).

Note that the left-hand side of (1.4.1), the function P_X, is an induced probability function on 𝒳, defined in terms of the original function P. Equation (1.4.1) formally defines a probability function, P_X, for the random variable X. Of course, we have to verify that P_X satisfies the Kolmogorov Axioms, but that is not a very difficult job (see Exercise 1.45). Because of the equivalence in (1.4.1), we will simply write P(X = xᵢ) rather than P_X(X = xᵢ).

A note on notation: Random variables will always be denoted with uppercase letters and the realized values of the variable (or its range) will be denoted by the corresponding lowercase letters. Thus, the random variable X can take the value x.

Example 1.4.3 (Three coin tosses-II) Consider again the experiment of tossing a fair coin three times from Example 1.3.13. Define the random variable X to be the number of heads obtained in the three tosses.
A complete enumeration of the value of X for each point in the sample space is

s      HHH  HHT  HTH  THH  TTH  THT  HTT  TTT
X(s)   3    2    2    2    1    1    1    0

The range for the random variable X is 𝒳 = {0, 1, 2, 3}. Assuming that all eight points in S have probability 1/8, by simply counting in the above display we see that the induced probability function on 𝒳 is given by

x            0    1    2    3
P_X(X = x)   1/8  3/8  3/8  1/8

For example, P_X(X = 1) = P({HTT, THT, TTH}) = 3/8. ‖

Example 1.4.4 (Distribution of a random variable) It may be possible to determine P_X even if a complete listing, as in Example 1.4.3, is not possible. Let S be the 2⁵⁰ strings of 50 0s and 1s, X = number of 1s, and 𝒳 = {0, 1, 2, ..., 50}, as mentioned at the beginning of this section. Suppose that each of the 2⁵⁰ strings is equally likely. The probability that X = 27 can be obtained by counting all of the strings with 27 1s in the original sample space. Since each string is equally likely, it follows that

P_X(X = 27) = (# strings with 27 1s)/(# strings) = (50 choose 27)/2⁵⁰.

In general, for any i ∈ 𝒳,

P_X(X = i) = (50 choose i)/2⁵⁰. ‖

The previous illustrations had both a finite S and finite 𝒳, and the definition of P_X was straightforward. Such is also the case if 𝒳 is countable. If 𝒳 is uncountable, we define the induced probability function, P_X, in a manner similar to (1.4.1). For any set A ⊂ 𝒳,

(1.4.2)   P_X(X ∈ A) = P({s ∈ S : X(s) ∈ A}).

This does define a legitimate probability function for which the Kolmogorov Axioms can be verified. (To be precise, we use (1.4.2) to define probabilities only for a certain sigma algebra of subsets of 𝒳. But we will not concern ourselves with these technicalities.)

1.5 Distribution Functions

With every random variable X, we associate a function called the cumulative distribution function of X.

Definition 1.5.1 The cumulative distribution function or cdf of a random variable X, denoted by F_X(x), is defined by

F_X(x) = P_X(X ≤ x),   for all x.
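For a discrete random variable, Definition 1.5.1 amounts to a running sum of the induced probability function. The sketch below is ours (the helper name `make_cdf` is hypothetical); it builds the cdf of the X of Example 1.4.3 from its probability function.

```python
from fractions import Fraction

def make_cdf(pmf):
    """Given a finite pmf as {value: probability}, return F(x) = P(X <= x)."""
    def F(x):
        # Sum the probabilities of all values not exceeding x.
        return sum(p for v, p in pmf.items() if v <= x)
    return F

# Induced probability function of X = number of heads in three fair tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
F = make_cdf(pmf)
print(F(-1), F(0), F(1.5), F(3))  # 0 1/8 1/2 1
```

Note that F is constant between the possible values of X and jumps by P(X = x) at each x in the range, exactly the step-function behavior discussed next.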
(Figure 1.5.1. Cdf of Example 1.5.2.)

Example 1.5.2 (Tossing three coins) Consider the experiment of tossing three fair coins, and let X = number of heads observed. The cdf of X is

(1.5.1)   F_X(x) = 0     if −∞ < x < 0,
                   1/8   if 0 ≤ x < 1,
                   1/2   if 1 ≤ x < 2,
                   7/8   if 2 ≤ x < 3,
                   1     if 3 ≤ x < ∞.

For instance, F_X(x) = 1 for x ≥ 3 since X is certain to be less than or equal to such a value. ‖

As is apparent from Figure 1.5.1, F_X can be discontinuous, with jumps at certain values of x. By the way in which F_X is defined, however, at the jump points F_X takes the value at the top of the jump. (Note the different inequalities in (1.5.1).) This is known as right-continuity: the function is continuous when a point is approached from the right. The property of right-continuity is a consequence of the definition of the cdf. In contrast, if we had defined F_X(x) = P_X(X < x) (note strict inequality), F_X would then be left-continuous. The size of the jump at any point x is equal to P(X = x).

Every cdf satisfies certain properties, some of which are obvious when we think of the definition of F_X(x) in terms of probabilities.

Theorem 1.5.3 The function F_X(x) is a cdf if and only if the following three conditions hold:
a. lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.
b. F_X(x) is a nondecreasing function of x.
c. F_X(x) is right-continuous; that is, for every number x₀, lim_{x↓x₀} F_X(x) = F_X(x₀).

Outline of proof: To prove necessity, the three properties can be verified by writing F in terms of the probability function (see Exercise 1.48). To prove sufficiency, that if a function F satisfies the three conditions of the theorem then it is a cdf for some random variable, is much harder. It must be established that there exists a sample space S, a probability function P on S, and a random variable X defined on S such that F is the cdf of X. □

Example 1.5.4 (Tossing for a head) Suppose we do an experiment that consists of tossing a coin until a head appears.
Let p = probability of a head on any given toss, and define a random variable X = number of tosses required to get a head. Then, for any x = 1, 2, ...,

(1.5.2)  P(X = x) = (1 - p)^{x-1} p,

since we must get x - 1 tails followed by a head for the event to occur and all trials are independent. From (1.5.2) we calculate, for any positive integer x,

(1.5.3)  P(X ≤ x) = Σ_{i=1}^{x} P(X = i) = Σ_{i=1}^{x} (1 - p)^{i-1} p.

The partial sum of the geometric series is

(1.5.4)  Σ_{k=1}^{n} t^{k-1} = (1 - t^n) / (1 - t),  t ≠ 1,

a fact that can be established by induction (see Exercise 1.50). Applying (1.5.4) to our probability, we find that the cdf of the random variable X is

F_X(x) = P(X ≤ x) = p (1 - (1 - p)^x) / (1 - (1 - p)) = 1 - (1 - p)^x,  x = 1, 2, ....

The cdf F_X(x) is flat between the nonnegative integers, as in Example 1.5.2. It is easy to show that if 0 < p < 1, then F_X(x) satisfies the conditions of Theorem 1.5.3. First,

lim_{x→-∞} F_X(x) = 0

since F_X(x) = 0 for all x < 1, and

lim_{x→∞} F_X(x) = lim_{x→∞} (1 - (1 - p)^x) = 1,

where x goes through only integer values when this limit is taken. To verify property (b), we simply note that the sum in (1.5.3) contains more positive terms as x increases. Finally, to verify (c), note that, for any x, F_X(x + ε) = F_X(x) if ε > 0 is sufficiently small. Hence,

lim_{ε↓0} F_X(x + ε) = F_X(x),

so F_X(x) is right-continuous. F_X(x) is the cdf of a distribution called the geometric distribution (after the series) and is pictured in Figure 1.5.2. ||

Figure 1.5.2. Geometric cdf, p = .3

Example 1.5.5 (Continuous cdf) An example of a continuous cdf is the function

(1.5.5)  F_X(x) = 1 / (1 + e^{-x}),

which satisfies the conditions of Theorem 1.5.3. For example,

lim_{x→-∞} F_X(x) = 0  since  lim_{x→-∞} e^{-x} = ∞,

and

lim_{x→∞} F_X(x) = 1  since  lim_{x→∞} e^{-x} = 0.

Differentiating F_X(x) gives

d/dx F_X(x) = e^{-x} / (1 + e^{-x})^2 > 0,

showing that F_X(x) is increasing. F_X is not only right-continuous, but also continuous. This is a special case of the logistic distribution.
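The closed form of the geometric cdf obtained from the partial-sum formula (1.5.4) is easy to verify numerically. As a quick illustration (ours, not the text's), the following compares the sum in (1.5.3) with 1 - (1-p)^x for several values of p and x:

```python
def geometric_cdf_sum(p, x):
    # Direct partial sum from (1.5.3): sum_{i=1}^{x} (1-p)^(i-1) * p
    return sum((1 - p) ** (i - 1) * p for i in range(1, x + 1))

def geometric_cdf_closed(p, x):
    # Closed form obtained via (1.5.4): F_X(x) = 1 - (1-p)^x
    return 1 - (1 - p) ** x

# The two expressions agree up to floating-point rounding.
for p in (0.1, 0.3, 0.9):
    for x in range(1, 20):
        assert abs(geometric_cdf_sum(p, x) - geometric_cdf_closed(p, x)) < 1e-12
```

The same loop also illustrates properties (a) and (b) of Theorem 1.5.3: the partial sums increase with x and approach 1.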
|| Example 1.5.6 (Cdf with jumps) If F_X is not a continuous function of x, it is possible for it to be a mixture of continuous pieces and jumps. For example, if we modify F_X(x) of (1.5.5) to be, for some ε, 1 > ε > 0,

(1.5.6)  F_Y(y) = { (1 - ε) / (1 + e^{-y})      if y < 0
                    ε + (1 - ε) / (1 + e^{-y})  if y ≥ 0,

then F_Y(y) is the cdf of a random variable Y (see Exercise 1.47). The function F_Y has a jump of height ε at y = 0 and otherwise is continuous. This model might be appropriate if we were observing the reading from a gauge, a reading that could (theoretically) be anywhere between -∞ and ∞. This particular gauge, however, sometimes sticks at 0. We could then model our observations with F_Y, where ε is the probability that the gauge sticks. ||

Whether a cdf is continuous or has jumps corresponds to the associated random variable being continuous or not. In fact, the association is such that it is convenient to define continuous random variables in this way.

Definition 1.5.7 A random variable X is continuous if F_X(x) is a continuous function of x. A random variable X is discrete if F_X(x) is a step function of x.

We close this section with a theorem formally stating that F_X completely determines the probability distribution of a random variable X. This is true if P(X ∈ A) is defined only for events A in B^1, the smallest sigma algebra containing all the intervals of real numbers of the form (a, b), [a, b), (a, b], and [a, b]. If probabilities are defined for a larger class of events, it is possible for two random variables to have the same distribution function but not the same probability for every event (see Chung 1974, page 27). In this book, as in most statistical applications, we are concerned only with events that are intervals, countable unions or intersections of intervals, etc. So we do not consider such pathological cases. We first need the notion of two random variables being identically distributed.
Definition 1.5.8 The random variables X and Y are identically distributed if, for every set A ∈ B^1, P(X ∈ A) = P(Y ∈ A).

Note that two random variables that are identically distributed are not necessarily equal. That is, Definition 1.5.8 does not say that X = Y.

Example 1.5.9 (Identically distributed random variables) Consider the experiment of tossing a fair coin three times as in Example 1.4.3. Define the random variables X and Y by

X = number of heads observed and Y = number of tails observed.

The distribution of X is given in Example 1.4.3, and it is easily verified that the distribution of Y is exactly the same. That is, for each k = 0, 1, 2, 3, we have P(X = k) = P(Y = k). So X and Y are identically distributed. However, for no sample points do we have X(s) = Y(s). ||

Theorem 1.5.10 The following two statements are equivalent:
a. The random variables X and Y are identically distributed.
b. F_X(x) = F_Y(x) for every x.

Proof: To show equivalence we must show that each statement implies the other. We first show that (a) ⇒ (b). Because X and Y are identically distributed, for any set A ∈ B^1, P(X ∈ A) = P(Y ∈ A). In particular, for every x, the set (-∞, x] is in B^1, and

F_X(x) = P(X ∈ (-∞, x]) = P(Y ∈ (-∞, x]) = F_Y(x).

The converse implication, that (b) ⇒ (a), is much more difficult to prove. The above argument showed that if the X and Y probabilities agreed on all sets, then they agreed on intervals. We now must prove the opposite; that is, if the X and Y probabilities agree on all intervals, then they agree on all sets. To show this requires heavy use of sigma algebras; we will not go into these details here. Suffice it to say that it is necessary to prove only that the two probability functions agree on all intervals (Chung 1974, Section 2.2). □
1.6 Density and Mass Functions

Associated with a random variable X and its cdf F_X is another function, called either the probability density function (pdf) or probability mass function (pmf). The terms pdf and pmf refer, respectively, to the continuous and discrete cases. Both pdfs and pmfs are concerned with "point probabilities" of random variables.

Definition 1.6.1 The probability mass function (pmf) of a discrete random variable X is given by

f_X(x) = P(X = x)  for all x.

Example 1.6.2 (Geometric probabilities) For the geometric distribution of Example 1.5.4, we have the pmf

f_X(x) = P(X = x) = { (1 - p)^{x-1} p  for x = 1, 2, ...
                      0                otherwise.

Recall that P(X = x) or, equivalently, f_X(x) is the size of the jump in the cdf at x. We can use the pmf to calculate probabilities. Since we can now measure the probability of a single point, we need only sum over all of the points in the appropriate event. Hence, for positive integers a and b, with a ≤ b, we have

(1.6.1)  P(a ≤ X ≤ b) = Σ_{k=a}^{b} f_X(k) = Σ_{k=a}^{b} (1 - p)^{k-1} p.

As a special case of this we get

P(X ≤ b) = Σ_{k=1}^{b} f_X(k) = F_X(b).  ||

A widely accepted convention, which we will adopt, is to use an uppercase letter for the cdf and the corresponding lowercase letter for the pmf or pdf.

We must be a little more careful in our definition of a pdf in the continuous case. If we naively try to calculate P(X = x) for a continuous random variable, we get the following. Since {X = x} ⊂ {x - ε < X ≤ x} for any ε > 0, we have from Theorem 1.2.9(c) that

P(X = x) ≤ P(x - ε < X ≤ x) = F_X(x) - F_X(x - ε)

for any ε > 0. Therefore,

0 ≤ P(X = x) ≤ lim_{ε↓0} [F_X(x) - F_X(x - ε)] = 0

by the continuity of F_X. However, if we understand the purpose of the pdf, its definition will become clear. From Example 1.6.2, we see that a pmf gives us "point probabilities." In the discrete case, we can sum over values of the pmf to get the cdf (as in (1.6.1)). The analogous procedure in the continuous case is to substitute integrals for sums, and we get

P(X ≤ x) = F_X(x) = ∫_{-∞}^{x} f_X(t) dt.

Using the Fundamental Theorem of Calculus, if f_X(x) is continuous, we have the further relationship

(1.6.2)  d/dx F_X(x) = f_X(x).
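The interval formula (1.6.1) and its cdf counterpart can be checked directly for the geometric pmf. In this small numerical sketch (ours, not the text's), summing the pmf over an interval is compared with the difference of cdf values:

```python
# Illustrative check of (1.6.1) for the geometric pmf f(k) = (1-p)^(k-1) p:
# P(a <= X <= b) = sum_{k=a}^{b} f(k) = F(b) - F(a-1), with F(x) = 1 - (1-p)^x.
p = 0.3
f = lambda k: (1 - p) ** (k - 1) * p
F = lambda x: 1 - (1 - p) ** x if x >= 1 else 0.0  # F(0) = 0

for a in range(1, 10):
    for b in range(a, 15):
        interval = sum(f(k) for k in range(a, b + 1))
        assert abs(interval - (F(b) - F(a - 1))) < 1e-12
```

With a = 1 this reduces to the special case P(X ≤ b) = F_X(b), since F(0) = 0.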
Note that the analogy with the discrete case is almost exact. We "add up" the "point probabilities" f_X(x) to obtain interval probabilities.

Definition 1.6.3 The probability density function or pdf, f_X(x), of a continuous random variable X is the function that satisfies

(1.6.3)  F_X(x) = ∫_{-∞}^{x} f_X(t) dt  for all x.

A note on notation: The expression "X has a distribution given by F_X(x)" is abbreviated symbolically by "X ~ F_X(x)," where we read the symbol "~" as "is distributed as." We can similarly write X ~ f_X(x) or, if X and Y have the same distribution, X ~ Y.

In the continuous case we can be somewhat cavalier about the specification of interval probabilities. Since P(X = x) = 0 if X is a continuous random variable,

P(a < X < b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a ≤ X ≤ b).

Theorem 1.6.5 A function f_X(x) is a pdf (or pmf) of a random variable X if and only if
a. f_X(x) ≥ 0 for all x.
b. Σ_x f_X(x) = 1 (pmf) or ∫_{-∞}^{∞} f_X(x) dx = 1 (pdf).

Proof: If f_X(x) is a pdf (or pmf), then the two properties are immediate from the definitions. In particular, for a pdf, using (1.6.3) and Theorem 1.5.3, we have that

1 = lim_{x→∞} F_X(x) = ∫_{-∞}^{∞} f_X(t) dt.

The converse implication is equally easy to prove. Once we have f_X(x), we can define F_X(x) and appeal to Theorem 1.5.3. □

From a purely mathematical viewpoint, any nonnegative function with a finite positive integral (or sum) can be turned into a pdf or pmf. For example, if h(x) is any nonnegative function that is positive on a set A, 0 elsewhere, and

∫_{{x ∈ A}} h(x) dx = K < ∞

for some constant K > 0, then the function f_X(x) = h(x)/K is a pdf of a random variable X taking values in A.

Actually, the relationship (1.6.3) does not always hold because F_X(x) may be continuous but not differentiable. In fact, there exist continuous random variables for which the integral relationship does not exist for any f_X(x). These cases are rather pathological and we will ignore them. Thus, in this text, we will assume that (1.6.3) holds for any continuous random variable. In more advanced texts (for example, Billingsley 1995, Section 31) a random variable is called absolutely continuous if (1.6.3) holds.
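The normalization idea, dividing a nonnegative function h by its total mass K to get a pdf, can be illustrated numerically. In this sketch (the choice h(x) = e^{-x^2} is ours, not the text's), a crude Riemann sum approximates K, which should come out near √π, and the rescaled function integrates to 1:

```python
import math

def h(x):
    # A nonnegative function on A = (-inf, inf); its tails beyond |x| = 10 are negligible.
    return math.exp(-x * x)

# Crude left Riemann sum for K = integral of h over [-10, 10].
n = 20000
dx = 20.0 / n
grid = [-10 + i * dx for i in range(n)]
K = sum(h(x) * dx for x in grid)

def f(x):
    # The normalized function f = h/K is a (numerical approximation to a) pdf.
    return h(x) / K

total = sum(f(x) * dx for x in grid)
# K is close to sqrt(pi) and total is close to 1.
```

The assertion that ∫ e^{-x^2} dx = √π is a standard calculus fact; here it serves only as a check on the numerical normalization.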
In more advanced texts (for exam- ple, Billingsley 1995, Section 31) a random variable is called absolutely continuous if (1.6.3) holds. 1.7 Exercises 1.1 For each of the following experiments, describe the sample space. (a) Toss a coin four times. (b) Count the number of insect-damaged leaves on a plant, (c) Measure the lifetime (in hours) of a particular brand of light bulb. (a) Record the weights of 10-day-old rats. (c) Observe the proportion of defectives in a shipment of electronic components. 1.2 Verify the following identities. (a) A\B= A\(ANB) = AN BE (b) B= (BN A)U(BN A) (0) B\A=BNA (4) AUB=AUu(BN A’) somery: 38 PROBABILITY THEORY Section 1.7 1.3 Finish the proof of Theorem 1.1.4. For any events A, B, and C defined on a sample space S, show that (a) AUB= BUA and ANB=BNA. (commutativity) (b) AU(BUC) = (AUB)UC and AN(BNC)=(ANB)NC. __ (associativity) () (AUBY = A° OB and (AN BY = AUB. (DeMorgan’s Laws) 1.4 For events A and B, find formulas for the probabilities of the following events in terms ‘of the quantities P(A), P(B), and P(AN B). (a) either A or B or both (b) either A or B but not both {c) al least one of A or B (d) at most one of A or B 1.8 Approximately one-third of all human twins are identical (one-egg) and two-thirds are fraternal (two-egg) twins. Identical twins are necessarily the same sex, with male and female being equally likely. Among fraternal twins, approximately one-fourth are both female, one-fourth are both male, and half are one male and one fernale. Finally, among all U.S. births, epproximately 1 in 90 is a twin birth. Define the following events: [a U.S. birth results in twin females} B = {a US. birth results in identical twins} C= {aUS. birth results in twins} (a) State, in words, the event AN BAC. (b) Find P(AN BNC). - 1.6 Two pennies, one with P(head) = u end one with P(head) = w, are to be tossed together independently. Define Po = P(0 heads occur), Pi = P( head occurs), P(2 heads occur). 
Pa Can u and w be chosen such that po = pi = pa? Prove yout answer. 1.7 Refer to the dart game of Example 1.2.7. Suppose we do not assume that the proba- bility of hitting the dart board is 1, but rather is proportional to the area of the dart board. Assume that the dart board is mounted on a wall that is hit with probability 1, and the wall has area A. (a) Using the fact that the probability of hitting a region is proportional to aréa, construct a probability function for P(scoring i points), i = 0, . (No points are scored if the dart board is not hit.) (b) Show that the conditional probability distribution P(scoring i points|board is hit) is exactly the probability distribution of Example 1.2.7. 1.8 Again refer to the game of darts explained in Example 1.2.7. (a) Derive the general formula for the probability of scoring i points, (b) Show that P(scoring i points) is a decreasing function of i, that is, as the points increase, the probability of scaring them decreases. (c) Show that P(scoring i points) is a probability function according to the Kol- mogorov Axioms. Section 1.7 EXERCISES 30 1.8 Prove the general version of DeMorgan’s Laws. Let {Aq: a € T'} be a (possibly un- countable) collection of sets. Prove that (@) (UeAa}® =aAS. (b) (NaAa)® = Ue: 1.10 Formulate and prove a version of DeMorgan’s Laws that applies to a finite collection of sets A1,...4An- 1.11 Let S be a sample space. (a) Show that the collection B = {0,5} is a sigma algebra, (b) Let B = {all subsets of 5, including S itself}. Show that B is a sigma algebra. (c) Show that the intersection of two sigma algebras is a sigma algebra. 1.12 It was noted in Section 1.2.1 thet statisticians who follow the deFinetti school do not the Axiom of Countable Additivity, instead adhering to the Axiom of Finite Additivity. (a) Show that the Axiom of Countable Additivity implies Finite Additivity. 
(b) Although, by itself, the Axiom of Finite Additivity does not imply Countable Additivity, suppose we supplement it with the following. Let A: > Az D ++: > An > ++» be an infinite sequence of nested sets whose limit is the empty set, which we denote by An | 0. Consider the following: If An 0, then P(A) — 0. Axiom of Continuit; Prove that the Axiom of Continuity and the Axiom of Finite Additivity imply Countable Additivity. 1.13 If P(A) = 4 and P(B*) = }, can A and B be disjoint? Explain. 1.14 Suppose that a sample space S has n elements. Prove that the number of subsets that can be formed from the elements of $ is 2°. 1.15 Finish the proof of Theorem 1.2.14. Use the result established for k = 2 as the basis of an induction argument. 1.16 How many different sets of initials can be formed if every person has one surname and (a) exactly two given names? (b) either one or two given nam (b) either one or two or three given names? (Answers: (a) 26 (b) 26° + 26? (c) 26% + 26° + 267) 1.17 In the game of dominoes, each piece is marked with two numbers. ‘The pieces are symmetrical so that the number pair is not ordered (so, for example, (2,6) = (6,2)). How many different pieces can be formed using the numbers 1,2,...,n? (Answer: n(n + 1)/2) 1.18 If n balls are placed at random into ni cells, find the probability that exactly one cell remains empty. (Answer: (3)n!/n") 1.19 If a multivariate function has continuous partial derivatives, the order in which the derivatives are calculated does not matter. Thus, for example, the function’ f(z, y) of two variables has equal third partials e a Baty!) = Fypgal@y)- (a) How many fourth partial derivatives does a function of three variables have? (b) Prove that function of n variables has (**!-") rth partial derivatives. 1.20 My telephone rings 12 times each week, the calls being randomly distributed among the 7 days. What is the probability that I get at least, one call each day? 
(Answer: .2285) “0 1.21 1.22 1.28 1.24 1.25 3.26 1.27 1.28 1.29 PROBABILITY THEORY Section 1.7 A closet contains n pairs of shoes. If 2r shoes are chosen at random (2r }. (Hint: Try to write P(A wins) in terms of the events Hi, F2,..., where E, = {head first appears on ith toss}.) (Answers: (a) 2/3 (6) =z) ‘The Smiths have two children. At least one of them is a boy. What is the probability that both children are boys? (See Gardner 1961 for a complete discussion of this problem.) A fair die is cast until a 6 appears. What is the probability that it must be cast more than five times? Verify the following identities for n > 2. (®) Dreo(-1)* (2) =0 (b) Dhak (2) = nar () Daye (2) =0 ‘A way of approximating large factorials is through the use of Stirling’s Formula: nla VENOM em, complete derivation of which is difficult. Instead, prove the easier fact, " nl dim, Saiingce = 8 constant. (Hint: Feller 1968 proceeds by using the monotonicity of the logarithm to establish that " et i loge de 1), the employer can hire the mth candidate only if the mth candidate is better than the previous m — 1. Suppose a candidate is hired on the ith trial. What is the probability that the best candidate was hired? Suppose that 5% of men and .25% of women are color-blind. A person is chosen at random and that person is color-blind. What is the probability that the person is male? (Assume males and females to be in equal numbers.) Two litters of a particular rodent species have been born, one with two brown-haired and one gray-haired (litter 1), and the other with three brown-haired and two gray- haired (litter 2). We select a litter at random and then select an offspring at random from the selected litter. (a) What is the probability that the animal chosen is brown-haired? (b) Given that a brown-haired offspring was selected, what is the probability that the sampling was from litter 1? 
Prove that if P(-) is a legitimate probability function and B is a set with P(B) > 0, then P(-|B) also satisfies Kolmogorov’s Axioms. If the probability of hitting a target is 1, and ten shots are fired independently, what is the probability of the target being hit at least twice? What is the conditional prob- ability that the target is hit at least twice, given that it is hit at least once? a 137 1.38 1.39 1.40 141 PROBABILITY THEORY Section 1.7 Here we look at some variations of Example 1.3.4. (a) In the warden’s calculation of Example 1.3.4 it was assumed that if A were to be pardoned, then with equal probability the warden would tell A that either B or C would die. However, this need not be the case. The warden can assign probabilities y and 1 ~7 to these events, as shown here: Prisoner pardoned Warden tells A A B dies with probability A C dies with probability 1 ~~ B C dies c B dies Calculate P(A|W) as a function of y. For what values of is P(A|W) less than, equal to, or greater than 3? (b) Suppose again that = }, as in the example. After the warden tella A that B will die, A thinks for a while and realizes that his original calculation was false. However, A then gets a bright idea. A asks the warden if he can swap fates with C. ‘The warden, thinking that no information has been passed, agrees to this. Prove that A's reasoning is now correct and that his probability of survival has jumped to 2 A similar, but somewhat more complicated, problem, the “Morte Hall problem” is discussed by Selvin (1975). The problem in this guise gained a fair amount of noto- riety when it appeared in a Sunday magazine (vos Savant 1990) along with a correct answer but with questionable explanation. The ensuing debate was even reported on the front page of the Sunday New York Times (Tierney 1991). A complete and some- what amusing treatment is given by Morgan et al. (1991) [see also the response by vos Savant 1991]. 
Chun (1999) pretty much exhausts the problem with a very thorough analysis, Prove each of the following statements. (Assume that any conditioning event has pos- itive probability.) (a) If P(B) =1, then P(A|B) = P(A) for any A. (b) 1 AC B, then P(BIA) = 1 and P(A|B) = P(A)/P(B). (c) If A and B are mutually exclusive, then P(A) Pay + By P(A|AUB) (d) P(AN BNC) = P(AIBNC)P(BIC)P(C). A pair of events A and B cannot be simultaneously mutually exclusive and independent, Prove that if P(A) > 0 and P(B) > 0, then: (a) Mf A and B are mutually exclusive, they cannot be independent. (b) If A and B are independent, they cannot be matuelly exclusive, Finish the proof of Theorem 1.3.9 by proving parts (b) and (c). AAs in Bxample 1.3.6, consider telegraph signals “dot” and “dash” sent in the proportion 3:4, where erratic transmissions cause a re to become a dash with probability } 4 and a dash to become a dot with probability 4 (a) If a dash is received, what is the probability that a dash has been sent? Section 1.7 EXERCISES 43 1.42 1.43 144 1.45 1.46 1.47 1.48 (b) Assuming independence between signals, if the message dot-dot was received, ‘what is the probability distribution of the four possible messages that could have been sent? ‘The inclusion-exclusion identity of Miscellanea 1.8.1 gets it name from the fact that it is proved by the method of inclusion and exclusion (Feller 1968, Section IV.1). Here we go into the details. The probability P(U?_1Ai) is the sum of the probabilities of all the sample points that are contained in at least one of the Ais. The method of inclusion and exclusion is a recipe for counting these points. (a) Let Ex denote the set of all sample points that are contained in exactly & of the events Ai, A2,...,An- Show that P(UR1 Ai) = D7, P(A). (b) If £1 is not empty, show that P(E:) = )3" , P(Ai) - (c) Without loss of generality, assume that Ey is contained in Ay, Az,..., Ax. 
Show that P(E) appears k times in the sum Pi, (5) times in the sum Py, (5) times in the sum Ps, etc. kK) | fk k OOO (See Exercise 1.27.) (d) Show that (e) Show that parts (a) — (¢) imply 0", P(E:) = Py - Pa = the inclusion-exclusion identity. + Py, establishing For the inclusion-ezclusion identity of Miscellanea 1.8.1: (a) Derive both Boole’s and Bonferroni’s Inequality from the inclusion-exclusion iden- tity. (b) Show that the P, satisfy P, > P, if i > j and that the sequence of bounds in Miscellanea 1.8.1 improves as the number of terms increases. (c) Typically as the number of terms in the bound increases, the bound becomes more useful. However, Schwager (1984) cautions that there are some cases where there is not much improvement, in particular if the A,s are highly correlated. Examine what happens to the sequence of bounds in the extreme case when Aj = A for every i. (See Worsley 1982 and the correspondence of Worsley 1985 and Schwager 1985.) Standardized tests provide an intoresting application of probability theory. Suppose first that a test consists of 20 multiple-choice questions, each with 4 possible answers. If the student guesses on each question, then the taking of the exam can be modeled ‘as a sequence of 20 independent events. Find the probability that the student gets at least 10 questions correct, given that he is guessing. Show that the induced probability function defined in (1.4.1) defines a legitimate probability function in that it satisfies the Kolmogorov Axioms Seven balls are distributed randomly into seven cells. Let X; = the number of cells containing exactly i balls. What is the probability distribution of Xs? (Thet is, find P(X3 = 2) for every possible z.) Prove that the following functions are cdfs. (a) 7+ dtan71(2), 2 € (—00, 00) (b) (+e (c) e-* *, 2 € (-00, 00) (a) 1-e (e) the function defined in (1.5.6) Prove the necessity part of Theorem 1.5.3. 
y*, 2 € (—20, 00) *, x € (0,00) “ PROBABILITY THEORY ‘Section 1.8 1.49 A cdf Fx is stochastically greater than a cdf Fy if Fx(t) < Fy (t) for all t and Fx(t) < Fy (t) for some t. Prove that if X ~ Fx end Y ~ Fy, then P(X >t)>P(Y>1) foreveryt and P(X >1)> PY >4) for some t, that is, X tends to be bigger than Y. 1.50 Verify formula (1.5.4), the formula for the partial sum of the geometric series. 1.51 An appliance store receives a shipment of 30 microwave ovens, 5 of which are (unknown to the manager) defective. The store manager selects 4 ovens at random, without replacement, and tests to see if they are defective. Let X = number of defectives found. Calculate the pmf and cdf of X and plot the cdf. 1.52 Let X be acontinuous random variable with pdf f(z) and edf F(z). For a fixed number Zo, define the function olz) = {fom — F(@o)] ee = Prove that 9(2) is a pdf. (Assume that F(z9) < 1.) 1.53 A certain river floods every yeat. Suppose that the low-water mark is set at 1 and the high-water mark ¥ has distribution function Fy) =P Sw)=1- 5, 1s y <0. (a) Verify thet Fy(y) is a cdf. (b) Find fy(y), the pdf of Y. (c) If the low-water mark is reset at 0 and we use a unit of measurement that is 4; of thet given previously, the high-water mark becomes Z = 10(Y — 1). Find Fz(z) 1.54 For each of the following, determine the value of c that makes f(z) a pdf. (a) f(a) =csinz,0<20. 1.8 Miscellanea 1.8.1 Bonferroni and Beyond ‘The Bonferroni bound of (1.2.10), or Boole’s Inequality (Theorem 1.2.11), provides simple bounds on the probability of an intersection or union. These bounds can be made more and more precise with the following expansion. For sets Aj, Az,...An, we create a new set of nested intersections as follows. Let B= S7PtA) i= Section 1.8 MISCELLANEA 45 = OD, CHa) 1gigjn Ps= SD P(AINASNAR) ISi P; if i < j, and we have the sequence of upper and lower bounds P>PURrA) 2 A- Pe pelaeeeae tee tee egeaee ieee eee ee eate aed See Exercises 1.42 and 1.43 for details. 
‘These bounds become increasingly tighter as the number of terms increases, and they provide a refinement of the original Bonferroni bounds. Applications of these bounds include approximating probabilities of runs (Karlin and Ost 1988) and multiple comparisons procedures (Naiman and Wynn 1992). Chapter 2 Transformations and Expectations “We want something more than mere theory and preaching now, though.” Sherlock Holmes A Study in Scarlet Often, if we are able to model a phenomenon in terms of a random variable X with cdf F(z), we will also be concerned with the behavior of functions of X. In this chapter we study techniques that allow us to gain information about functions of X that may be of interest, information that can range from very complete (the distributions of these functions) to more vague (the average behavior). 2.1 Distributions of Functions of a Random Variable If X is a random variable with cdf Fy(z), then any function of X, say g(X), is also a random variable. Often 9(X) is of interest itself and we write Y = g(X) to denote the new random variable g(X). Since Y is a function of X, we can describe the probabilistic behavior of Y in terms of that of X. That is, for any set A, PUY € A) = P(9(X) € A), showing that the distribution of Y depends on the functions Fx and g. Depending on the choice of g, it is sometimes possible to obtain a tractable expression for this probability. Formally, if we write y = g(z), the function g(z) defines a mapping from the original sample space of X, %, to a new sample space, )’, the sample space of the random variable Y. That is, g(a): XY. We associate with g an inverse mapping, denoted by g~!, which is a mapping from subsets of ) to subsets of 2, and is defined by (2.1.1) g (A) = {xe X: g(x) € A}. Note that the mapping g! takes sets into sets; that is, g~!(A) is the set of points in ¥ that g(x) takes into the set A. It is possible for A to be a point set, say A = {y}. 
Then

g^{-1}({y}) = {x ∈ 𝒳 : g(x) = y}.

In this case we often write g^{-1}(y) instead of g^{-1}({y}). The quantity g^{-1}(y) can still be a set, however, if there is more than one x for which g(x) = y. If there is only one x for which g(x) = y, then g^{-1}(y) is the point set {x}, and we will write g^{-1}(y) = x.

If the random variable Y is now defined by Y = g(X), we can write for any set A ⊂ 𝒴,

         P(Y ∈ A) = P(g(X) ∈ A)
(2.1.2)           = P({x ∈ 𝒳 : g(x) ∈ A})
                  = P(X ∈ g^{-1}(A)).

This defines the probability distribution of Y. It is straightforward to show that this probability distribution satisfies the Kolmogorov Axioms.

If X is a discrete random variable, then 𝒳 is countable. The sample space for Y = g(X) is 𝒴 = {y : y = g(x), x ∈ 𝒳}, which is also a countable set. Thus, Y is also a discrete random variable. From (2.1.2), the pmf for Y is

f_Y(y) = P(Y = y) = Σ_{x ∈ g^{-1}(y)} P(X = x) = Σ_{x ∈ g^{-1}(y)} f_X(x),  for y ∈ 𝒴,

and f_Y(y) = 0 for y ∉ 𝒴. In this case, finding the pmf of Y involves simply identifying g^{-1}(y), for each y ∈ 𝒴, and summing the appropriate probabilities.

Example 2.1.1 (Binomial transformation) A discrete random variable X has a binomial distribution if its pmf is of the form

(2.1.3)  f_X(x) = P(X = x) = (n choose x) p^x (1-p)^{n-x},  x = 0, 1, ..., n,

where n is a positive integer and 0 ≤ p ≤ 1. Values such as n and p that can be set to different values, producing different probability distributions, are called parameters. Consider the random variable Y = g(X), where g(x) = n - x; that is, Y = n - X. Here 𝒳 = {0, 1, ..., n} and 𝒴 = {y : y = g(x), x ∈ 𝒳} = {0, 1, ..., n}. For any y ∈ 𝒴, n - x = g(x) if and only if x = n - y.
Thus, g^{-1}(y) is the single point x = n - y, and

f_Y(y) = Σ_{x ∈ g^{-1}(y)} f_X(x)
       = f_X(n - y)
       = (n choose n-y) p^{n-y} (1-p)^{n-(n-y)}
       = (n choose y) (1-p)^y p^{n-y}.    ((n choose n-y) = (n choose y), by Definition 1.2.17)

Thus, we see that Y also has a binomial distribution, but with parameters n and 1 - p. ||

If X and Y are continuous random variables, then in some cases it is possible to find simple formulas for the cdf and pdf of Y in terms of the cdf and pdf of X and the function g. In the remainder of this section, we consider some of these cases.

Figure 2.1.1. Graph of the transformation y = sin^2(x) of Example 2.1.2

The cdf of Y = g(X) is

         F_Y(y) = P(Y ≤ y)
                = P(g(X) ≤ y)
(2.1.4)         = P({x ∈ 𝒳 : g(x) ≤ y})
                = ∫_{{x ∈ 𝒳 : g(x) ≤ y}} f_X(x) dx.

Sometimes there may be difficulty in identifying the set {x ∈ 𝒳 : g(x) ≤ y} and carrying out the integration of f_X(x) over this region, as the next example shows.

Example 2.1.2 Suppose X has a uniform distribution on the interval (0, 2π), that is, f_X(x) = 1/(2π) for 0 < x < 2π and 0 otherwise, and consider Y = sin^2(X). Then (see Figure 2.1.1)

(2.1.5)  P(Y ≤ y) = P(X ≤ x_1) + P(x_2 ≤ X ≤ x_3) + P(X ≥ x_4),

where x_1 < x_2 < x_3 < x_4 are the solutions of sin^2(x) = y in (0, 2π). From the symmetry of the function sin^2(x) and the fact that X has a uniform distribution, we have

P(X ≤ x_1) = P(X ≥ x_4)  and  P(x_2 ≤ X ≤ x_3) = 2P(x_2 ≤ X ≤ π),

so the calculation of P(Y ≤ y) reduces to computing probabilities of intervals for X on (0, π). ||

When transformations are made, it is important to keep track of the sample spaces of the random variables X and Y. We therefore define

(2.1.7)  𝒳 = {x : f_X(x) > 0}  and  𝒴 = {y : y = g(x) for some x ∈ 𝒳}.

The pdf of the random variable X is positive only on the set 𝒳 and is 0 elsewhere. Such a set is called the support set of a distribution or, more informally, the support of a distribution. This terminology can also apply to a pmf or, in general, to any nonnegative function.

It is easiest to deal with functions g(x) that are monotone, that is, those that satisfy either

u > v ⇒ g(u) > g(v)  (increasing)  or  u < v ⇒ g(u) > g(v)  (decreasing).

If the transformation x → g(x) is monotone, then it is one-to-one and onto from 𝒳 → 𝒴. That is, each x goes to only one y and each y comes from at most one x (one-to-one). Also, for 𝒴 defined as in (2.1.7), for each y ∈ 𝒴 there is an x ∈ 𝒳 such that g(x) = y (onto). Thus, the transformation g uniquely pairs xs and ys.

If g is monotone, then g^{-1} is single-valued; that is, g^{-1}(y) = x if and only if y = g(x). If g is increasing, this implies that

(2.1.8)  {x ∈ 𝒳 : g(x) ≤ y} = {x ∈ 𝒳 : x ≤ g^{-1}(y)}.

If g is decreasing, this implies that

(2.1.9)  {x ∈ 𝒳 : g(x) ≤ y} = {x ∈ 𝒳 : x ≥ g^{-1}(y)}.

(A graph will illustrate why the inequality reverses in the decreasing case.)
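Before moving to the continuous case, note that the conclusion of Example 2.1.1, that Y = n - X is binomial(n, 1 - p), is easy to confirm numerically. The following sketch (ours, with arbitrary illustrative values of n and p) checks the pmf identity derived above:

```python
from math import comb

def binom_pmf(n, p, x):
    # Binomial pmf as in (2.1.3): (n choose x) p^x (1-p)^(n-x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 7, 0.35
for y in range(n + 1):
    # pmf of Y = n - X at y is P(X = n - y); it should equal the
    # binomial(n, 1 - p) pmf evaluated at y.
    assert abs(binom_pmf(n, p, n - y) - binom_pmf(n, 1 - p, y)) < 1e-12
```

The check works because (n choose n-y) = (n choose y), exactly the identity used in the derivation.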
If g(x) is an increasing function, then using (2.1.4), we can write

F_Y(y) = ∫_{{x ∈ 𝒳 : x ≤ g^{-1}(y)}} f_X(x) dx = ∫_{-∞}^{g^{-1}(y)} f_X(x) dx = F_X(g^{-1}(y)).

If g(x) is decreasing, we have

F_Y(y) = ∫_{g^{-1}(y)}^{∞} f_X(x) dx = 1 - F_X(g^{-1}(y)).

The continuity of X is used to obtain the second equality. We summarize these results in the following theorem.

Theorem 2.1.3 Let X have cdf F_X(x), let Y = g(X), and let 𝒳 and 𝒴 be defined as in (2.1.7).
a. If g is an increasing function on 𝒳, F_Y(y) = F_X(g^{-1}(y)) for y ∈ 𝒴.
b. If g is a decreasing function on 𝒳 and X is a continuous random variable, F_Y(y) = 1 - F_X(g^{-1}(y)) for y ∈ 𝒴.

Example 2.1.4 (Uniform-exponential relationship-I) Suppose X ~ f_X(x) = 1 if 0 < x < 1 and 0 otherwise, the uniform(0, 1) distribution. It is straightforward to check that F_X(x) = x, 0 < x < 1. We now make the transformation Y = g(X) = -log X. Since

d/dx g(x) = d/dx (-log x) = -1/x < 0,  for 0 < x < 1,

g(x) is a decreasing function. As x ranges between 0 and 1, -log x ranges between 0 and ∞, that is, 𝒴 = (0, ∞). For y > 0, y = -log x implies x = e^{-y}, so g^{-1}(y) = e^{-y}. Therefore, for y > 0,

F_Y(y) = 1 - F_X(g^{-1}(y)) = 1 - F_X(e^{-y}) = 1 - e^{-y}.    (F_X(x) = x)

Of course, F_Y(y) = 0 for y ≤ 0. Note that it was necessary only to verify that g(x) = -log x is monotone on (0, 1), the support of X. ||

If the pdf of Y is continuous, it can be obtained by differentiating the cdf. The resulting expression is given in the following theorem.

Theorem 2.1.5 Let X have pdf f_X(x) and let Y = g(X), where g is a monotone function. Let 𝒳 and 𝒴 be defined by (2.1.7). Suppose that f_X(x) is continuous on 𝒳 and that g^{-1}(y) has a continuous derivative on 𝒴. Then the pdf of Y is given by

(2.1.10)  f_Y(y) = { f_X(g^{-1}(y)) |d/dy g^{-1}(y)|  y ∈ 𝒴
                     0                                otherwise.

Proof: From Theorem 2.1.3 we have, by the chain rule,

f_Y(y) = d/dy F_Y(y) = { f_X(g^{-1}(y)) d/dy g^{-1}(y)   if g is increasing
                         -f_X(g^{-1}(y)) d/dy g^{-1}(y)  if g is decreasing,

which can be expressed concisely as (2.1.10). □

Example 2.1.6 (Inverted gamma pdf) Let f_X(x) be the gamma pdf

f_X(x) = 1/((n-1)! β^n) x^{n-1} e^{-x/β},  0 < x < ∞,

where β is a positive constant and n is a positive integer. Suppose we want to find the pdf of g(X) = 1/X. Note that here the support sets 𝒳 and 𝒴 are both the interval
TRANSFORMATIONS AND EXPECTATIONS Section 2.1

Note that here the support sets 𝒳 and 𝒴 are both the interval (0, ∞). If we let y = g(x), then g^{-1}(y) = 1/y and (d/dy) g^{-1}(y) = −1/y². Applying the above theorem, for y ∈ (0, ∞), we get

    f_Y(y) = f_X(g^{-1}(y)) |(d/dy) g^{-1}(y)|
           = 1/((n−1)! β^n) (1/y)^{n−1} e^{−1/(βy)} (1/y²)
           = 1/((n−1)! β^n) (1/y)^{n+1} e^{−1/(βy)},

a special case of a pdf known as the inverted gamma pdf. ‖

In many applications, the function g may be neither increasing nor decreasing; hence the above results will not apply. However, it is often the case that g will be monotone over certain intervals and that allows us to get an expression for the pdf of Y = g(X). (If g is not monotone over certain intervals, then we are in deep trouble.)

Example 2.1.7 (Square transformation) Suppose X is a continuous random variable. For y > 0, the cdf of Y = X² is

    F_Y(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = F_X(√y) − F_X(−√y).

For a cdf F_X that may not be strictly increasing, define

    F_X^{-1}(y) = inf{x : F_X(x) ≥ y},   (2.1.13)

a definition that agrees with (2.1.12) when F_X is nonconstant and provides an F_X^{-1} that is single-valued even when F_X is not strictly increasing. Using this definition, in Figure 2.1.2b, we have F_X^{-1}(y) = x₁. At the endpoints of the range of y, F_X^{-1}(y) can also be defined: F_X^{-1}(1) = ∞ if F_X(x) < 1 for all x and, for any F_X, F_X^{-1}(0) = −∞.

Proof of Theorem 2.1.10: For Y = F_X(X) we have, for 0 < y < 1,

    P(Y ≤ y) = P(F_X(X) ≤ y)
             = P(F_X^{-1}(F_X(X)) ≤ F_X^{-1}(y))
             = P(X ≤ F_X^{-1}(y))
             = F_X(F_X^{-1}(y))
             = y.

At the endpoints we have P(Y ≤ y) = 1 for y ≥ 1 and P(Y ≤ y) = 0 for y ≤ 0, showing that Y has a uniform distribution.

The reasoning behind the equality

    P(F_X^{-1}(F_X(X)) ≤ F_X^{-1}(y)) = P(X ≤ F_X^{-1}(y))

is somewhat subtle and deserves additional attention. If F_X is strictly increasing, then it is true that F_X^{-1}(F_X(x)) = x. (Refer to Figure 2.1.2a.) However, if F_X is flat, it may be that F_X^{-1}(F_X(x)) ≠ x. Suppose F_X is as in Figure 2.1.2b and let x ∈ [x₁, x₂]. Then F_X^{-1}(F_X(x)) = x₁ for any x in this interval. Even in this case, though, the probability equality holds, since P(X ≤ x) = P(X ≤ x₁) for any x ∈ [x₁, x₂]. The flat cdf denotes a region of 0 probability (P(x₁ < X ≤ x₂) = F_X(x₂) − F_X(x₁) = 0). □

One application of Theorem 2.1.10 is in the generation of random samples from a particular distribution.
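A sketch of this sampling idea in Python (the function names are ours; the exponential cdf is chosen because its inverse has a closed form, x = −λ log(1 − u)):

```python
import math
import random

def sample_exponential(lam, size, rng):
    # Solve F_X(x) = 1 - exp(-x/lam) = u for x, for each uniform draw u
    return [-lam * math.log(1.0 - rng.random()) for _ in range(size)]

rng = random.Random(0)   # seeded for reproducibility
lam = 2.0
xs = sample_exponential(lam, 100_000, rng)
mean = sum(xs) / len(xs)   # should be close to lam
```

The sample mean lands near λ = 2, consistent with the exponential mean computed later in Example 2.2.2.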
If it is required to generate an observation X from a population with cdf F_X, we need only generate a uniform random number U, between 0 and 1, and solve for x in the equation F_X(x) = u. (For many distributions there are other methods of generating observations that take less computer time, but this method is still useful because of its general applicability.)

2.2 Expected Values

The expected value, or expectation, of a random variable is merely its average value, where we speak of "average" value as one that is weighted according to the probability distribution. The expected value of a distribution can be thought of as a measure of center, as we think of averages as being middle values. By weighting the values of the random variable according to the probability distribution, we hope to obtain a number that summarizes a typical or expected value of an observation of the random variable.

Definition 2.2.1 The expected value or mean of a random variable g(X), denoted by Eg(X), is

    Eg(X) = ∫_{−∞}^{∞} g(x) f_X(x) dx                          if X is continuous
          = Σ_{x ∈ 𝒳} g(x) f_X(x) = Σ_{x ∈ 𝒳} g(x) P(X = x)   if X is discrete,

provided that the integral or sum exists. If E|g(X)| = ∞, we say that Eg(X) does not exist. (Ross 1988 refers to this as the "law of the unconscious statistician." We do not find this amusing.)

Example 2.2.2 (Exponential mean) Suppose X has an exponential(λ) distribution, that is, it has pdf given by

    f_X(x) = (1/λ) e^{−x/λ},  0 ≤ x < ∞,  λ > 0.

Then EX is given by

    EX = ∫_0^∞ (x/λ) e^{−x/λ} dx
       = −x e^{−x/λ} |_0^∞ + ∫_0^∞ e^{−x/λ} dx   (integration by parts)
       = ∫_0^∞ e^{−x/λ} dx = λ.

Example 2.2.3 (Binomial mean) If X has a binomial distribution, its pmf is given by

    P(X = x) = (n choose x) p^x (1 − p)^{n−x},  x = 0, 1, ..., n,

where n is a positive integer, 0 < p < 1, and for every fixed pair n and p the pmf sums to 1.
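For a discrete random variable, the expectation in Definition 2.2.1 is a (possibly countable) sum. As an illustrative sketch (ours, not the text's), the binomial pmf just displayed can be summed directly and compared with np, the value derived in Example 2.2.3:

```python
from math import comb

def binom_mean(n, p):
    # E X = sum over x of x * (n choose x) * p^x * (1 - p)^(n - x)
    return sum(x * comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1))

errs = [abs(binom_mean(n, p) - n * p) for n, p in [(10, 0.3), (25, 0.5), (40, 0.9)]]
```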
The expected value of X is

    EX = Σ_{x=0}^{n} x (n choose x) p^x (1 − p)^{n−x} = Σ_{x=1}^{n} x (n choose x) p^x (1 − p)^{n−x},

since the x = 0 term is 0. Using the identity x (n choose x) = n (n−1 choose x−1), we have

    EX = Σ_{x=1}^{n} n (n−1 choose x−1) p^x (1 − p)^{n−x}
       = n Σ_{y=0}^{n−1} (n−1 choose y) p^{y+1} (1 − p)^{n−(y+1)}   (substitute y = x − 1)
       = np Σ_{y=0}^{n−1} (n−1 choose y) p^y (1 − p)^{n−1−y}
       = np,

since the last summation must be 1, being the sum over all possible values of a binomial(n − 1, p) pmf. ‖

Example 2.2.4 (Cauchy mean) A classic example of a random variable whose expected value does not exist is a Cauchy random variable, that is, one with pdf

    f_X(x) = 1/(π(1 + x²)),  −∞ < x < ∞.

It is straightforward to check that ∫_{−∞}^{∞} f_X(x) dx = 1, but E|X| = ∞. Write

    E|X| = ∫_{−∞}^{∞} |x|/(π(1 + x²)) dx = (2/π) ∫_0^∞ x/(1 + x²) dx.

For any positive number M,

    ∫_0^M x/(1 + x²) dx = (1/2) log(1 + M²),

which grows without bound as M → ∞. Thus E|X| = ∞ and EX does not exist.

c. If g₁(x) ≥ g₂(x) for all x, then Eg₁(X) ≥ Eg₂(X).
d. If a ≤ g₁(x) ≤ b for all x, then a ≤ Eg₁(X) ≤ b.

Definition 2.3.6 The moment generating function (mgf) of X, denoted by M_X(t), is

    M_X(t) = E e^{tX},

provided that the expectation exists for t in some neighborhood of 0. That is, there is an h > 0 such that, for all t in −h < t < h, Ee^{tX} exists. If the expectation does not exist in a neighborhood of 0, we say that the moment generating function does not exist.

More explicitly, we can write the mgf of X as

    M_X(t) = ∫_{−∞}^{∞} e^{tx} f_X(x) dx   if X is continuous,

or

    M_X(t) = Σ_x e^{tx} P(X = x)   if X is discrete.

It is very easy to see how the mgf generates moments. We summarize the result in the following theorem.

Theorem 2.3.7 If X has mgf M_X(t), then

    EX^n = M_X^{(n)}(0),

where we define

    M_X^{(n)}(0) = (d^n/dt^n) M_X(t) |_{t=0}.

That is, the nth moment is equal to the nth derivative of M_X(t) evaluated at t = 0.

Proof: Assuming that we can differentiate under the integral sign (see the next section), we have

    (d/dt) M_X(t) = (d/dt) ∫ e^{tx} f_X(x) dx = ∫ ((d/dt) e^{tx}) f_X(x) dx = ∫ x e^{tx} f_X(x) dx = E Xe^{tX}.

Section 2.3 MOMENTS AND MOMENT GENERATING FUNCTIONS 63

Thus,

    (d/dt) M_X(t) |_{t=0} = E Xe^{tX} |_{t=0} = EX.

Proceeding in an analogous manner, we can establish that

    (d^n/dt^n) M_X(t) |_{t=0} = E X^n e^{tX} |_{t=0} = EX^n. □

Example 2.3.8 (Gamma mgf) In Example 2.1.6 we encountered a special case of the gamma pdf,

    f(x) = 1/(Γ(α) β^α) x^{α−1} e^{−x/β},  0 < x < ∞,  α > 0,  β > 0,

where Γ(α) denotes the gamma function, some of whose properties are given in Section 3.3. The mgf is given by

    M_X(t) = 1/(Γ(α) β^α) ∫_0^∞ e^{tx} x^{α−1} e^{−x/β} dx   (2.3.5)
           = 1/(Γ(α) β^α) ∫_0^∞ x^{α−1} e^{−x((1/β) − t)} dx
           = 1/(Γ(α) β^α) ∫_0^∞ x^{α−1} e^{−x/(β/(1−βt))} dx.

We now recognize the integrand in (2.3.5) as the kernel of another gamma pdf.
(The kernel of a function is the main part of the function, the part that remains when constants are disregarded.) Using the fact that, for any positive constants a and b,

    f(x) = 1/(Γ(a) b^a) x^{a−1} e^{−x/b}

is a pdf, we have that

    ∫_0^∞ 1/(Γ(a) b^a) x^{a−1} e^{−x/b} dx = 1

and, hence,

    ∫_0^∞ x^{a−1} e^{−x/b} dx = Γ(a) b^a.   (2.3.6)

Applying (2.3.6) to (2.3.5), we have

    M_X(t) = 1/(Γ(α) β^α) Γ(α) (β/(1−βt))^α = (1/(1−βt))^α   if t < 1/β.

If t ≥ 1/β, then the quantity (1/β) − t, in the integrand of (2.3.5), is nonpositive and the integral in (2.3.6) is infinite. Thus, the mgf of the gamma distribution exists only if t < 1/β. (In Section 3.3 we will explore the gamma function in more detail.)

The mean of the gamma distribution is given by

    EX = (d/dt) M_X(t) |_{t=0} = αβ/(1−βt)^{α+1} |_{t=0} = αβ.

Other moments can be calculated in a similar manner. ‖

Example 2.3.9 (Binomial mgf) For a second illustration of calculating a moment generating function, we consider a discrete distribution, the binomial distribution. The binomial(n, p) pmf is given in Example 2.2.3. So

    M_X(t) = Σ_{x=0}^{n} e^{tx} (n choose x) p^x (1 − p)^{n−x} = Σ_{x=0}^{n} (n choose x) (pe^t)^x (1 − p)^{n−x}.

The binomial formula (see Theorem 3.2.2) gives

    Σ_{x=0}^{n} (n choose x) u^x v^{n−x} = (u + v)^n.   (2.3.7)

Hence, letting u = pe^t and v = 1 − p, we have

    M_X(t) = [pe^t + (1 − p)]^n. ‖

As previously mentioned, the major usefulness of the moment generating function is not in its ability to generate moments. Rather, its usefulness stems from the fact that, in many cases, the moment generating function can characterize a distribution. There are, however, some technical difficulties associated with using moments to characterize a distribution, which we will now investigate.

If the mgf exists, it characterizes an infinite set of moments. The natural question is whether characterizing the infinite set of moments uniquely determines a distribution function. The answer to this question, unfortunately, is no. Characterizing the set of moments is not enough to determine a distribution uniquely because there may be two distinct random variables having the same moments.
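As a numerical aside (a sketch with our parameter values), Theorem 2.3.7 can be checked on the gamma mgf just derived: finite-difference derivatives of M_X(t) = (1 − βt)^{−α} at t = 0 should recover the first moment αβ, and the second central-difference should recover the standard gamma second moment α(α + 1)β².

```python
def M(t, alpha, beta):
    # gamma mgf, valid for t < 1/beta
    return (1.0 - beta * t) ** (-alpha)

alpha, beta = 3.0, 2.0
h = 1e-5
m1 = (M(h, alpha, beta) - M(-h, alpha, beta)) / (2 * h)                        # ~ E X
m2 = (M(h, alpha, beta) - 2 * M(0.0, alpha, beta) + M(-h, alpha, beta)) / h**2  # ~ E X^2
```

With α = 3 and β = 2 these should be close to 6 and 48, respectively.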
Example 2.3.10 (Nonunique moments) Consider the two pdfs given by

    f₁(x) = 1/(√(2π) x) e^{−(log x)²/2},  0 ≤ x < ∞,
    f₂(x) = f₁(x)[1 + sin(2π log x)],  0 ≤ x < ∞.

(The pdf f₁ is a special case of a lognormal pdf.) It can be shown that if X₁ ~ f₁(x), then

    EX₁^r = e^{r²/2},  r = 0, 1, ...,

so X₁ has all of its moments. Now suppose that X₂ ~ f₂(x). We have

    EX₂^r = ∫_0^∞ x^r f₁(x)[1 + sin(2π log x)] dx
          = EX₁^r + ∫_0^∞ x^r f₁(x) sin(2π log x) dx.

However, the transformation y = log x − r shows that this last integral is that of an odd function over (−∞, ∞) and hence is equal to 0 for r = 0, 1, .... Thus, even though X₁ and X₂ have distinct pdfs, they have the same moments for all r. The two pdfs are pictured in Figure 2.3.2. See Exercise 2.35 for details and also Exercises 2.34, 2.36, and 2.37 for more about mgfs and distributions. ‖

Figure 2.3.2. Two pdfs with the same moments: f₁(x) = 1/(√(2π) x) e^{−(log x)²/2} and f₂(x) = f₁(x)[1 + sin(2π log x)]

The problem of uniqueness of moments does not occur if the random variables have bounded support. If that is the case, then the infinite sequence of moments does uniquely determine the distribution (see, for example, Billingsley 1995, Section 30). Furthermore, if the mgf exists in a neighborhood of 0, then the distribution is uniquely determined, no matter what its support. Thus, existence of all moments is not equivalent to existence of the moment generating function. The following theorem shows how a distribution can be characterized.

Theorem 2.3.11 Let F_X(x) and F_Y(y) be two cdfs all of whose moments exist.
a. If X and Y have bounded support, then F_X(u) = F_Y(u) for all u if and only if EX^r = EY^r for all integers r = 0, 1, 2, ... .
b. If the moment generating functions exist and M_X(t) = M_Y(t) for all t in some neighborhood of 0, then F_X(u) = F_Y(u) for all u.
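The key step in Example 2.3.10 — that ∫_0^∞ x^r f₁(x) sin(2π log x) dx = 0 — can be checked numerically (a sketch; the substitution y = log x and the truncation of the integration range are ours):

```python
import math

def simpson(f, a, b, n):
    # composite Simpson's rule; n must be even
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += f(a + i * h) * (4 if i % 2 else 2)
    return s * h / 3

def integrand(y, r):
    # after y = log x:  x^r f1(x) sin(2*pi*log x) dx  ->  e^{ry} phi(y) sin(2*pi*y) dy
    return math.exp(r * y - 0.5 * y * y) / math.sqrt(2 * math.pi) * math.sin(2 * math.pi * y)

# the integral should vanish for r = 0, 1, 2, so f1 and f2 share these moments
vals = [abs(simpson(lambda y, r=r: integrand(y, r), -12.0, 14.0, 20000)) for r in (0, 1, 2)]
```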
es “TRANSFORMATIONS ANDY EXPECTATIONS Section 2.8 In the next theorem, which deals with a sequence of mgfs that converges, we do not treat the bounded support case separately. Note that the uniqueness assump- tion is automatically satisfied if the limiting mgf exists in a neighborhood of 0 (see Miscellanea 2.6.1). Theorem 2.3.12 (Convergence of mgfs) Suppose {X;, i=1,2,...} is a se- quence of random variables, each with mgf Mx,(t). Furthermore, suppose that lim Mx,(t)=Mx(t), for all t in a neighborhood of 0, and Mx(t) is an mgf. Then there is a unique cdf Fx whose moments are determined by Mx(t) and, for ali x where Fx(zx) és continuous, we have lim Fy,(z) = Fx(z). That is, convergence, for |t| My(t) as n — oo. The validity of the approximation in (2.3.9) will then follow from Theorem 2.3.12. We first must digress a bit and mention an important limit result, one that has wide applicability in statistics. The proof of this lemma may be found in many standard calculus texts. Lemma 2.3.14 Let ai,a2,... be a sequence of numbers converging to a, that is, limmysoo @n = a. Then lim (14 *)° Returning to the example, we have : oo Mx(O)= bet +02 = [r+ Met ine] = [14 cena], because \ = np. Now set a, = a = (et —1)A, and apply Lemma 2.3.14 to get lim Mx(t) =e") = My(2), the moment generating function of the Poisson. The Poisson approximation can be quite good even for moderate p and n, In Figure 2.3.3 we show a binomial mass function along with its Poisson epproximation, with = np. The approximation appears to be satisfactory. | We close this section with a useful result concerning mgfs. Theorem 2.3.15 For any constants a and b, the mof of the random variable aX +6 is given by Max+o(t) = Mx (at). cy ‘TRANSFORMATIONS AND EXPECTATIONS: Section 2.4 Figure 2.3.3. 
Poisson (dotted line) approximation to the binomial (solid line), n = 15, p = .3

Proof: By definition,

    M_{aX+b}(t) = E e^{(aX+b)t}
                = E (e^{aXt} e^{bt})   (properties of exponentials)
                = e^{bt} E e^{(at)X}   (e^{bt} is constant)
                = e^{bt} M_X(at),   (definition of mgf)

proving the theorem. ‖

2.4 Differentiating Under an Integral Sign

In the previous section we encountered an instance in which we desired to interchange the order of integration and differentiation. This situation is encountered frequently in theoretical statistics. The purpose of this section is to characterize conditions under which this operation is legitimate. We will also discuss interchanging the order of differentiation and summation.

Many of these conditions can be established using standard theorems from calculus, and detailed proofs can be found in most calculus textbooks. Thus, detailed proofs will not be presented here.

We first want to establish the method of calculating

    (d/dθ) ∫_{a(θ)}^{b(θ)} f(x, θ) dx,   (2.4.1)

where −∞ < a(θ), b(θ) < ∞ for all θ. The rule for differentiating (2.4.1) is called Leibnitz's Rule and is an application of the Fundamental Theorem of Calculus and the chain rule.

Section 2.4 DIFFERENTIATING UNDER AN INTEGRAL SIGN

Theorem 2.4.1 (Leibnitz's Rule) If f(x, θ), a(θ), and b(θ) are differentiable with respect to θ, then

    (d/dθ) ∫_{a(θ)}^{b(θ)} f(x, θ) dx = f(b(θ), θ) (d/dθ) b(θ) − f(a(θ), θ) (d/dθ) a(θ) + ∫_{a(θ)}^{b(θ)} (∂/∂θ) f(x, θ) dx.

Notice that if a(θ) and b(θ) are constant, we have a special case of Leibnitz's Rule:

    (d/dθ) ∫_a^b f(x, θ) dx = ∫_a^b (∂/∂θ) f(x, θ) dx.

Thus, in general, if we have the integral of a differentiable function over a finite range, differentiation of the integral poses no problem. If the range of integration is infinite, however, problems can arise. Note that the interchange of derivative and integral in the above equation equates a partial derivative with an ordinary derivative. Formally, this must be the case since the left-hand side is a function of only θ, while the integrand on the right-hand side is a function of both θ and x.
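Leibnitz's Rule can be illustrated with a quick numerical check (a sketch; the particular integrand and quadrature routine are ours): for I(θ) = ∫_0^{θ²} e^{θx} dx, the rule's terms should sum to a finite-difference approximation of I′(θ).

```python
import math

def I(theta):
    # I(theta) = integral of e^{theta x} from 0 to theta^2 = (e^{theta^3} - 1) / theta
    return (math.exp(theta**3) - 1.0) / theta

def simpson(f, a, b, n=2000):
    # composite Simpson's rule; n must be even
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += f(a + i * h) * (4 if i % 2 else 2)
    return s * h / 3

theta = 0.8
# Leibnitz: f(b(t), t) * b'(t) - f(a(t), t) * a'(t) + integral of df/dtheta,
# with a(t) = 0 (so the second term vanishes) and b(t) = theta^2
leibnitz = math.exp(theta * theta**2) * 2 * theta + simpson(
    lambda x: x * math.exp(theta * x), 0.0, theta**2)
h = 1e-6
numeric = (I(theta + h) - I(theta - h)) / (2 * h)   # central difference of I
```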
The question of whether interchanging the order of differentiation and integration is justified is really a question of whether limits and integration can be interchanged, since a derivative is a special kind of limit. Recall that if f(x, θ) is differentiable, then

    (∂/∂θ) f(x, θ) = lim_{δ→0} [f(x, θ+δ) − f(x, θ)]/δ,

so we have

    (d/dθ) ∫_{−∞}^{∞} f(x, θ) dx = lim_{δ→0} ∫_{−∞}^{∞} [f(x, θ+δ) − f(x, θ)]/δ dx,

while

    ∫_{−∞}^{∞} [(∂/∂θ) f(x, θ)] dx = ∫_{−∞}^{∞} [lim_{δ→0} (f(x, θ+δ) − f(x, θ))/δ] dx.

Therefore, if we can justify the interchanging of the order of limits and integration, differentiation under the integral sign will be justified. Treatment of this problem in full generality will, unfortunately, necessitate the use of measure theory, a topic that will not be covered in this book. However, the statements and conclusions of some important results can be given. The following theorems are all corollaries of Lebesgue's Dominated Convergence Theorem (see, for example, Rudin 1976).

Theorem 2.4.2 Suppose the function h(x, y) is continuous at y₀ for each x, and there exists a function g(x) satisfying

i. |h(x, y)| ≤ g(x) for all x and y,
ii. ∫_{−∞}^{∞} g(x) dx < ∞.

Then

    lim_{y→y₀} ∫_{−∞}^{∞} h(x, y) dx = ∫_{−∞}^{∞} lim_{y→y₀} h(x, y) dx.

The key condition in this theorem is the existence of a dominating function g(x), with a finite integral, which ensures that the integrals cannot be too badly behaved. We can now apply this theorem to the case we are considering by identifying h(x, δ) with the difference (f(x, θ+δ) − f(x, θ))/δ.

Theorem 2.4.3 Suppose f(x, θ) is differentiable at θ = θ₀, that is,

    lim_{δ→0} [f(x, θ₀+δ) − f(x, θ₀)]/δ = (∂/∂θ) f(x, θ) |_{θ=θ₀}

exists for every x, and there exist a function g(x, θ₀) and a constant δ₀ > 0 such that

i. |[f(x, θ₀+δ) − f(x, θ₀)]/δ| ≤ g(x, θ₀) for all x and |δ| ≤ δ₀,
ii. ∫_{−∞}^{∞} g(x, θ₀) dx < ∞.

Then

    (d/dθ) ∫_{−∞}^{∞} f(x, θ) dx |_{θ=θ₀} = ∫_{−∞}^{∞} [(∂/∂θ) f(x, θ) |_{θ=θ₀}] dx.   (2.4.2)

Condition (i) is similar to what is known as a Lipschitz condition, a condition that imposes smoothness on a function.
Here, condition (i) is effectively bounding the variability in the first derivative; other smoothness constraints might bound this variability by a constant (instead of a function g), or place a bound on the variability of the second derivative of f.

The conclusion of Theorem 2.4.3 is a little cumbersome, but it is important to realize that although we seem to be treating θ as a variable, the statement of the theorem is for one value of θ. That is, for each value θ₀ for which f(x, θ) is differentiable at θ₀ and satisfies conditions (i) and (ii), the order of integration and differentiation can be interchanged. Often the distinction between θ and θ₀ is not stressed and (2.4.2) is written

    (d/dθ) ∫_{−∞}^{∞} f(x, θ) dx = ∫_{−∞}^{∞} (∂/∂θ) f(x, θ) dx.   (2.4.3)

Typically, f(x, θ) is differentiable at all θ, not at just one value θ₀. In this case, condition (i) of Theorem 2.4.3 can be replaced by another condition that often proves easier to verify. By an application of the mean value theorem, it follows that, for fixed x and θ₀, and |δ| ≤ δ₀,

    [f(x, θ₀+δ) − f(x, θ₀)]/δ = (∂/∂θ) f(x, θ) |_{θ=θ₀+δ*(x)}

for some number δ*(x), where |δ*(x)| ≤ δ₀. Therefore, condition (i) will be satisfied if we find a g(x, θ) that satisfies condition (ii) and

    |(∂/∂θ) f(x, θ) |_{θ=θ'}| ≤ g(x, θ)  for all θ' such that |θ' − θ| ≤ δ₀.   (2.4.4)

Note that in (2.4.4) δ₀ is implicitly a function of θ, as is the case in Theorem 2.4.3. This is permitted since the theorem is applied to each value of θ individually. From (2.4.4) we get the following corollary.

Corollary 2.4.4 Suppose f(x, θ) is differentiable in θ and there exists a function g(x, θ) such that (2.4.4) is satisfied and ∫_{−∞}^{∞} g(x, θ) dx < ∞. Then (2.4.3) holds.

Notice that both condition (i) of Theorem 2.4.3 and (2.4.4) impose a uniformity requirement on the functions to be bounded; some type of uniformity is generally needed before derivatives and integrals can be interchanged.
Example 2.4.5 (Interchanging integration and differentiation-I) Let X have the exponential(λ) pdf given by f(x) = (1/λ)e^{−x/λ}, 0 < x < ∞, and suppose we want to calculate

    (d/dλ) EX^n = (d/dλ) ∫_0^∞ x^n (1/λ) e^{−x/λ} dx   (2.4.5)

for integer n > 0. If we could move the differentiation inside the integral, we would have

    (d/dλ) EX^n = ∫_0^∞ x^n (∂/∂λ)[(1/λ) e^{−x/λ}] dx
                = ∫_0^∞ (x^n/λ²)(x/λ − 1) e^{−x/λ} dx
                = (1/λ)(EX^{n+1}/λ − EX^n).   (2.4.6)

To justify the interchange of integration and differentiation, we bound the derivative of x^n (1/λ) e^{−x/λ}. Now

    |(∂/∂λ)[x^n (1/λ) e^{−x/λ}]| = (x^n e^{−x/λ}/λ²) |x/λ − 1| ≤ (x^n e^{−x/λ}/λ²)(x/λ + 1).

For some constant δ₀ satisfying 0 < δ₀ < λ, take

    g(x, λ) = (x^n e^{−x/(λ+δ₀)} / (λ−δ₀)²)(x/(λ−δ₀) + 1).

We then have

    |(∂/∂λ)[x^n (1/λ) e^{−x/λ}] |_{λ=λ'}| ≤ g(x, λ)  for all λ' such that |λ' − λ| ≤ δ₀.

Since the exponential distribution has all of its moments, ∫_0^∞ g(x, λ) dx < ∞ as long as λ − δ₀ > 0, so the interchange of integration and differentiation is justified. ‖

The property illustrated for the exponential distribution holds for a large class of densities, which will be dealt with in Section 3.4.

Notice that (2.4.6) gives us a recursion relation for the moments of the exponential distribution,

    EX^{n+1} = λ² (d/dλ) EX^n + λ EX^n,   (2.4.7)

making the calculation of the (n+1)st moment relatively easy. This type of relationship exists for other distributions. In particular, if X has a normal distribution with mean μ and variance 1, so it has pdf f(x) = (1/√(2π)) e^{−(x−μ)²/2}, then

    EX^{n+1} = μ EX^n + (d/dμ) EX^n.

We illustrate one more interchange of differentiation and integration, one involving the moment generating function.

Example 2.4.6 (Interchanging integration and differentiation-II) Again let X have a normal distribution with mean μ and variance 1, and consider the mgf of X,

    M_X(t) = E e^{tX} = (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−(x−μ)²/2} dx.

In Section 2.3 it was stated that we can calculate moments by differentiation of M_X(t) and differentiation under the integral sign was justified:

    (d/dt) M_X(t) = (d/dt) E e^{tX} = E ((d/dt) e^{tX}) = E(Xe^{tX}).   (2.4.8)

We can apply the results of this section to justify the operations in (2.4.8). Notice that when applying either Theorem 2.4.3 or Corollary 2.4.4 here, we identify t with the variable θ in Theorem 2.4.3. The parameter μ is treated as a constant.
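As an aside, the recursion (2.4.7) can be checked numerically against the known exponential moments EX^n = n! λ^n (a sketch; the value of λ and the finite-difference step are ours):

```python
import math

def moment(n, lam):
    # E X^n = n! * lam^n for the exponential(lam) distribution
    return math.factorial(n) * lam**n

lam, h = 1.5, 1e-6
errs = []
for n in range(1, 5):
    d = (moment(n, lam + h) - moment(n, lam - h)) / (2 * h)  # d/d(lam) E X^n
    rhs = lam**2 * d + lam * moment(n, lam)                  # right side of (2.4.7)
    errs.append(abs(rhs - moment(n + 1, lam)))
```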
From Corollary 2.4.4, we must find a function g(x, t), with finite integral, that satisfies

    |(∂/∂t) e^{tx} e^{−(x−μ)²/2} |_{t=t'}| ≤ g(x, t)  for all t' such that |t' − t| ≤ δ₀.   (2.4.9)

Doing the obvious, we have

    |(∂/∂t) e^{t'x − (x−μ)²/2}| = |x| e^{t'x − (x−μ)²/2}.

It is easiest to define our function g(x, t) separately for x ≥ 0 and x < 0. We take

    g(x, t) = |x| e^{(t−δ₀)x} e^{−(x−μ)²/2}   if x < 0,
              |x| e^{(t+δ₀)x} e^{−(x−μ)²/2}   if x ≥ 0.

It is clear that this function satisfies (2.4.9); it remains to check that its integral is finite.

For x ≥ 0 we have

    g(x, t) = x e^{−(x² − 2x(μ+t+δ₀) + μ²)/2}.

We now complete the square in the exponent; that is, we write

    x² − 2x(μ+t+δ₀) + μ² = x² − 2x(μ+t+δ₀) + (μ+t+δ₀)² − (μ+t+δ₀)² + μ²
                         = (x − (μ+t+δ₀))² + μ² − (μ+t+δ₀)²,

and so, for x ≥ 0,

    g(x, t) = x e^{−[x−(μ+t+δ₀)]²/2} e^{[(μ+t+δ₀)² − μ²]/2}.

Since the last exponential factor in this expression does not depend on x, ∫_0^∞ g(x, t) dx is essentially calculating the mean of a normal distribution with mean μ+t+δ₀, except that the integration is only over (0, ∞). However, it follows that the integral is finite because the normal distribution has a finite mean (to be shown in Chapter 3). A similar development for x < 0 shows that

    g(x, t) = |x| e^{−[x−(μ+t−δ₀)]²/2} e^{[(μ+t−δ₀)² − μ²]/2},

and so ∫_{−∞}^0 g(x, t) dx < ∞. Therefore, we have found an integrable function satisfying (2.4.9) and the operation in (2.4.8) is justified. ‖

We now turn to the question of when it is possible to interchange differentiation and summation, an operation that plays an important role in discrete distributions. Of course, we are concerned only with infinite sums, since a derivative can always be taken inside a finite sum.

Example 2.4.7 (Interchanging summation and differentiation) Let X be a discrete random variable with the geometric distribution

    P(X = x) = θ(1−θ)^x,  x = 0, 1, ...,  0 < θ < 1.

We have that Σ_{x=0}^∞ θ(1−θ)^x = 1 and, provided that the operations are justified,

    (d/dθ) Σ_{x=0}^∞ θ(1−θ)^x = Σ_{x=0}^∞ (∂/∂θ)[θ(1−θ)^x]
                              = Σ_{x=0}^∞ [(1−θ)^x − θx(1−θ)^{x−1}]
                              = Σ_{x=0}^∞ (1−θ)^x − (1/(1−θ)) Σ_{x=0}^∞ xθ(1−θ)^x.
Since Σ_{x=0}^∞ θ(1−θ)^x = 1 for all 0 < θ < 1, its derivative is 0. So we have

    Σ_{x=0}^∞ (1−θ)^x − (1/(1−θ)) Σ_{x=0}^∞ xθ(1−θ)^x = 0.   (2.4.10)

Now the first sum in (2.4.10) is equal to 1/θ and the second sum is EX; hence (2.4.10) becomes

    1/θ = (1/(1−θ)) EX,  or  EX = (1−θ)/θ.

We have, in essence, summed the series Σ_{x=0}^∞ xθ(1−θ)^x by differentiating. ‖

Justification for taking the derivative inside the summation is more straightforward than the integration case. The following theorem provides the details.

Theorem 2.4.8 Suppose that the series Σ_{x=0}^∞ h(θ, x) converges for all θ in an interval (a, b) of real numbers and

i. (∂/∂θ) h(θ, x) is continuous in θ for each x,
ii. Σ_{x=0}^∞ (∂/∂θ) h(θ, x) converges uniformly on every closed bounded subinterval of (a, b).

Then

    (d/dθ) Σ_{x=0}^∞ h(θ, x) = Σ_{x=0}^∞ (∂/∂θ) h(θ, x).   (2.4.11)

The condition of uniform convergence is the key one to verify in order to establish that the differentiation can be taken inside the summation. Recall that a series converges uniformly if its sequence of partial sums converges uniformly, a fact that we use in the following example.

Example 2.4.9 (Continuation of Example 2.4.7) To apply Theorem 2.4.8 we identify

    h(θ, x) = θ(1−θ)^x  and  (∂/∂θ) h(θ, x) = (1−θ)^x − θx(1−θ)^{x−1},

and verify that Σ_{x=0}^∞ (∂/∂θ) h(θ, x) converges uniformly. Define S_n(θ) by

    S_n(θ) = Σ_{x=0}^n [(1−θ)^x − θx(1−θ)^{x−1}].

The convergence will be uniform on [c, d] ⊂ (0, 1) if, given ε > 0, we can find an N such that

    n > N ⇒ |S_n(θ) − S_∞(θ)| < ε  for all θ ∈ [c, d].
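The series manipulation in Example 2.4.7 concluded that EX = (1 − θ)/θ; a direct (truncated) summation agrees (a sketch; the truncation point is ours):

```python
def geom_mean(theta, nterms=10_000):
    # E X = sum over x of x * theta * (1 - theta)^x, truncated at nterms
    return sum(x * theta * (1 - theta) ** x for x in range(nterms))

errs = [abs(geom_mean(th) - (1 - th) / th) for th in (0.2, 0.5, 0.8)]
```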
Since $,(8) is continuous, the convergence is uniform on any closed bounded interval. Therefore, the series of derivatives converges uniformly and the interchange of differentiation and summation is justified. " ‘We close this section with a theorem that is similar to Theorem 2.4.8, but treats the case of interchanging the order of summation and integration. Theorem 2.4.10 Suppose the series "%., h(0, x) converges uniformly on [a,b] and that, for each «, h(8,2) is a continuous function of 9. Then be ob Yn.2)49= 9° [ (6,2) dd. e=088 2 220 76 ‘TRANSFORMATIONS AND EXPECTATIONS Section 2.5 2.5 Exercises 2.1 In each of the following find the pdf of ¥. Show that the pdf integrates to 1. (a) ¥ =X and fx(z) = 4225(1-2),0<2<1 (b) ¥=4X 43 and fx(z) = 7e-", 0 <2 < 00 (c) ¥ = X? and fx(z) = 3027(1—2)?,0<2<1 (See Example A.0.2 in Appendix A.) 2.2 In each of the following find the pdf of Y. (a) ¥ =X? and fx(z)=1,0<2<1 (b) ¥ = ~log X and X has pdf ! $m+D! 9 1s) nim! f(z) )™, O<2<1, myn positive integers (©) ¥ =e* and X has pdf ae 2, fx(@) = qqre "2, O< 2 <00, 0” a positive constant 2.3 Suppose X has the geometric pmf fx(z) = 3 (3)", z = 0,1,2,.... Determine the probability distribution of Y = X/(X +1). Note that here both X and ¥ are discrete random variables. To specify the probability distribution of Y, specify its pm. 2.4 Let A be a fixed positive constant, and define the function f(x) by f(z) = $e if 2 > Oand f(z) = }re if <0. (a) Verify that f(z) is a pdf. (b) IX is a random variable with paf given by f(z), find P(X < t) for all ¢. Evaluate all integrals. (©) Find P(|X| 0 2.7 Let X have pdf fy(z)=2(e+1), -1<2<2. (a) Find the pdf of ¥ = X?. Note that Theorem 2.1.8 is not directly applicable in this problem. (b) Show that Theorem 2.1.8 remains valid if the sets Ao, A1,...,Ax contain X, and apply the extension to solve part (a) using Ay =, Ar = (~2,0), and Aa = (0,2). 2.8 ‘in each of the following show that the given function is a cdf and find Fy(y). 
(a) F_X(x) = 0 if x < 0 and F_X(x) = 1 − e^{−x} if x ≥ 0

Section 2.5 EXERCISES

2.10 Let X be a discrete random variable with cdf F_X(x) and define Y = F_X(X). Show that Y is stochastically greater than a uniform(0, 1) random variable U; that is,

    P(Y > y) ≥ P(U > y) = 1 − y,  for all y,  0 < y < 1,
    P(Y > y) > P(U > y) = 1 − y,  for some y,  0 < y < 1.