An Introduction to Probability Theory and Its Applications

WILLIAM FELLER (1906-1970)
Eugene Higgins Professor of Mathematics, Princeton University

VOLUME II, SECOND EDITION

John Wiley & Sons, Inc. New York, London, Sydney, Toronto

Copyright © 1966, 1971 by John Wiley & Sons, Inc. All rights reserved. Published simultaneously in Canada. No part of this book may be reproduced by any means, or transmitted, nor translated into a machine language without the written permission of the publisher.

Library of Congress Catalogue Card Number: 57-10805
ISBN 0 471 25709 5
Printed in the United States of America

To O. E. Neugebauer: o et praesidium et dulce decus meum

Preface to the First Edition

AT THE TIME THE FIRST VOLUME OF THIS BOOK WAS WRITTEN (between 1941 and 1948) the interest in probability was not yet widespread. Teaching was on a very limited scale and topics such as Markov chains, which are now extensively used in several disciplines, were highly specialized chapters of pure mathematics. The first volume may therefore be likened to an all-purpose travel guide to a strange country. To describe the nature of probability it had to stress the mathematical content of the theory as well as the surprising variety of potential applications. It was predicted that the ensuing fluctuations in the level of difficulty would limit the usefulness of the book. In reality it is widely used even today, when its novelty has worn off and its attitude and material are available in newer books written for special purposes. The book seems even to acquire new friends. The fact that laymen are not deterred by passages which proved difficult to students of mathematics shows that the level of difficulty cannot be measured objectively; it depends on the type of information one seeks and the details one is prepared to skip. The traveler often has the choice between climbing a peak or using a cable car.

In view of this success the second volume is written in the same style. It involves harder mathematics, but most of the text can be read on different levels. The handling of measure theory may illustrate this point. Chapter IV contains an informal introduction to the basic ideas of measure theory and the conceptual foundations of probability. The same chapter lists the few facts of measure theory used in the subsequent chapters to formulate analytical theorems in their simplest form and to avoid futile discussions of regularity conditions. The main function of measure theory in this connection is to justify formal operations and passages to the limit that would never be questioned by a non-mathematician. Readers interested primarily in practical results will therefore not feel any need for measure theory.

To facilitate access to the individual topics the chapters are rendered as self-contained as possible, and sometimes special cases are treated separately ahead of the general theory. Various topics (such as stable distributions and renewal theory) are discussed at several places from different angles. To avoid repetitions, the definitions and illustrative examples are collected in chapter VI, which may be described as a collection of introductions to the subsequent chapters. The skeleton of the book consists of chapters V, VIII, and XV. The reader will decide for himself how much of the preparatory chapters to read and which excursions to take. Experts will find new results and proofs, but more important is the attempt to consolidate and unify the general methodology.
Indeed, certain parts of probability suffer from a lack of coherence because the usual grouping and treatment of problems depend largely on accidents of the historical development. In the resulting confusion closely related problems are not recognized as such and simple things are obscured by complicated methods. Considerable simplifications were obtained by a systematic exploitation and development of the best available techniques. This is true in particular for the proverbially messy field of limit theorems (chapters XVI-XVII). At other places simplifications were achieved by treating problems in their natural context. For example, an elementary consideration of a particular random walk led to a generalization of an asymptotic estimate which had been derived by hard and laborious methods in risk theory (and under more restrictive conditions independently in queuing).

I have tried to achieve mathematical rigor without pedantry in style. For example, the statement that 1/(1+ξ²) is the characteristic function of ½e^{-|x|} seems to me a desirable and legitimate abbreviation for the logically correct version that the function which at the point ξ assumes the value 1/(1+ξ²) is the characteristic function of the function which at the point x assumes the value ½e^{-|x|}.

I fear that the brief historical remarks and citations do not render justice to the many authors who contributed to probability, but I have tried to give credit wherever possible. The original work is now in many cases superseded by newer research, and as a rule full references are given only to papers to which the reader may want to turn for additional information. For example, no reference is given to my own work on limit theorems, whereas a paper describing observations or theories underlying an example is cited even if it contains no mathematics.* Under these circumstances the index of authors gives no indication of their importance for probability theory. Another difficulty is to do justice to the pioneer work to which we owe new directions of research, new approaches, and new methods. Some theorems which were considered strikingly original and deep now appear with simple proofs among more refined results. It is difficult to view such a theorem in its historical perspective and to realize that here as elsewhere it is the first step that counts.

* This system was used also in the first volume but was misunderstood by some subsequent writers; they now attribute the methods used in the book to earlier scientists who could not have known them.

ACKNOWLEDGMENTS

Thanks to the support by the U.S. Army Research Office of work in probability at Princeton University I enjoyed the help of J. Goldman, L. Pitt, M. Silverstein, and, in particular, of M. M. Rao. They eliminated many inaccuracies and obscurities. All chapters were rewritten many times and preliminary versions of the early chapters were circulated among friends. In this way I benefited from comments by J. Elliott, R. S. Pinkham, and L. J. Savage. My special thanks are due to J. L. Doob and J. Wolfowitz for advice and criticism. The graph of the Cauchy random walk was supplied by H. Trotter. The printing was supervised by Mrs. H. McDougal, and the appearance of the book owes much to her.

WILLIAM FELLER
October 1965
THE MANUSCRIPT HAD BEEN FINISHED AT THE TIME OF THE AUTHOR'S DEATH but no proofs had been received. I am grateful to the publisher for providing a proofreader to compare the print against the manuscript and for compiling the index. J. Goldman, A. Grunbaum, H. McKean, L. Pitt, and A. Pittenger divided the book among themselves to check on the mathematics. Every mathematician knows what an incredible amount of work that entails. I express my deep gratitude to these men and extend my heartfelt thanks for their labor of love.

CLARA N. FELLER
May 1970

Introduction

THE CHARACTER AND ORGANIZATION OF THE BOOK REMAIN UNCHANGED, but the entire text has undergone a thorough revision. Many parts (Chapter XVII, in particular) have been completely rewritten and a few new sections have been added. At a number of places the exposition was simplified by streamlined (and sometimes new) arguments. Some new material has been incorporated into the text. While writing the first edition I was haunted by the fear of an excessively long volume. Unfortunately, this led me to spend futile months in shortening the original text and economizing on displays. This damage has now been repaired, and a great effort has been spent to make the reading easier. Occasional repetitions will also facilitate a direct access to the individual chapters and make it possible to read certain parts of this book in conjunction with Volume 1. Concerning the organization of the material, see the introduction to the first edition (repeated here), starting with the second paragraph.

I am grateful to many readers for pointing out errors or omissions. I especially thank D. A. Hejhal, of Chicago, for an exhaustive and penetrating list of errata and for suggestions covering the entire book.

WILLIAM FELLER
Princeton, N.J.
January 1970

Abbreviations and Conventions

iff is an abbreviation for "if and only if."

Epoch. This term is used for points on the time axis, while "time" is reserved for intervals and durations. (In discussions of stochastic processes the word "times" carries too heavy a burden. The systematic use of "epoch," introduced by J. Riordan, seems preferable to varying substitutes such as moment, instant, or point.)

Intervals are denoted by bars over the endpoints; the combinations of bars distinguish the open, the closed, and the two half-open intervals with endpoints a and b. This notation is used also in higher dimensions. The pertinent conventions for vector notations and order relations are found in V,1 (and also in IV,2). The symbol (a, b) is reserved for pairs and for points.

R¹, R², Rʳ stand for the line, the plane, and the r-dimensional Cartesian space.

1 refers to volume one; Roman numerals refer to chapters. Thus 1; XI,(3.6) refers to section 3 of chapter XI of volume 1.

▶ indicates the end of a proof or of a collection of examples.

n and N denote, respectively, the normal density and distribution function with zero expectation and unit variance.

O, o, and ~. Let u and v depend on a parameter x which tends, say, to a. Assuming that v is positive we write u = O(v) if u/v remains bounded, u = o(v) if u/v → 0, and u ~ v if u/v → 1.

∫ f(x) U{dx}: for this abbreviation see V,3.

Regarding Borel sets and Baire functions, see the introduction to chapter V.

Contents

CHAPTER I. THE EXPONENTIAL AND THE UNIFORM DENSITIES
1. Introduction
2. Densities. Convolutions
3. The Exponential Density
4. Waiting Time Paradoxes. The Poisson Process
5. The Persistence of Bad Luck
6. Waiting Times and Order Statistics
7. The Uniform Distribution
8. Random Splittings
9. Convolutions and Covering Theorems
10. Random Directions
11. The Use of Lebesgue Measure
12. Empirical Distributions
13. Problems for Solution

CHAPTER II. SPECIAL DENSITIES. RANDOMIZATION
1. Notations and Conventions
2. Gamma Distributions
*3. Related Distributions of Statistics
4. Some Common Densities
5. Randomization and Mixtures
6. Discrete Distributions
7. Bessel Functions and Random Walks
8. Distributions on a Circle
9. Problems for Solution

(* Starred sections are not required for the understanding of the sequel and should be omitted at first reading.)

CHAPTER III. DENSITIES IN HIGHER DIMENSIONS. NORMAL DENSITIES AND PROCESSES
1. Densities
2. Conditional Distributions
3. Return to the Exponential and the Uniform Distributions
*4. A Characterization of the Normal Distribution
5. Matrix Notation. The Covariance Matrix
6. Normal Densities and Distributions
*7. Stationary Normal Processes
8. Markovian Normal Densities
9. Problems for Solution

CHAPTER IV. PROBABILITY MEASURES AND SPACES
1. Baire Functions
2. Interval Functions and Integrals in Rʳ
3. σ-Algebras. Measurability
4. Probability Spaces. Random Variables
5. The Extension Theorem
6. Product Spaces. Sequences of Independent Variables
7. Null Sets. Completion

CHAPTER V. PROBABILITY DISTRIBUTIONS IN Rʳ
1. Distributions and Expectations
2. Preliminaries
3. Densities
4. Convolutions
5. Symmetrization
6. Integration by Parts. Existence of Moments
7. Chebyshev's Inequality
8. Further Inequalities. Convex Functions
9. Simple Conditional Distributions. Mixtures
*10. Conditional Distributions
*11. Conditional Expectations
12. Problems for Solution

CHAPTER VI. A SURVEY OF SOME IMPORTANT DISTRIBUTIONS AND PROCESSES
1. Stable Distributions in R¹
2. Examples
3. Infinitely Divisible Distributions in R¹
4. Processes with Independent Increments
*5. Ruin Problems in Compound Poisson Processes
6. Renewal Processes
7. Examples and Problems
8. Random Walks
9. The Queuing Process
10. Persistent and Transient Random Walks
11. General Markov Chains
*12. Martingales
13. Problems for Solution

CHAPTER VII. LAWS OF LARGE NUMBERS. APPLICATIONS IN ANALYSIS
1. Main Lemma and Notations
2. Bernstein Polynomials. Absolutely Monotone Functions
3. Moment Problems
*4. Application to Exchangeable Variables
*5. Generalized Taylor Formula and Semi-Groups
6. Inversion Formulas for Laplace Transforms
*7. Laws of Large Numbers for Identically Distributed Variables
*8. Strong Laws
*9. Generalization to Martingales
10. Problems for Solution

CHAPTER VIII. THE BASIC LIMIT THEOREMS
1. Convergence of Measures
2. Special Properties
3. Distributions as Operators
4. The Central Limit Theorem
*5. Infinite Convolutions
6. Selection Theorems
7. Ergodic Theorems for Markov Chains
8. Regular Variation
*9. Asymptotic Properties of Regularly Varying Functions
10. Problems for Solution

CHAPTER IX. INFINITELY DIVISIBLE DISTRIBUTIONS AND SEMI-GROUPS
1. Orientation
2. Convolution Semi-Groups
3. Preparatory Lemmas
4. Finite Variances
5. The Main Theorems
6. Example: Stable Semi-Groups
7. Triangular Arrays with Identical Distributions
8. Domains of Attraction
9. Variable Distributions. The Three-Series Theorem
10. Problems for Solution

CHAPTER X. MARKOV PROCESSES AND SEMI-GROUPS
1. The Pseudo-Poisson Type
2. A Variant: Linear Increments
3. Jump Processes
4. Diffusion Processes in R¹
5. The Forward Equation. Boundary Conditions
6. Diffusion in Higher Dimensions
7. Subordinated Processes
8. Markov Processes and Semi-Groups
9. The "Exponential Formula" of Semi-Group Theory
10. Generators. The Backward Equation

CHAPTER XI. RENEWAL THEORY
1. The Renewal Theorem
2. Proof of the Renewal Theorem
3. Refinements
4. Persistent Renewal Processes
5. The Number N_t of Renewal Epochs
6. Terminating (Transient) Processes
7. Diverse Applications
8. Existence of Limits in Stochastic Processes
9. Renewal Theory on the Whole Line
10. Problems for Solution

CHAPTER XII. RANDOM WALKS IN R¹
1. Basic Concepts and Notations
2. Duality. Types of Random Walks
3. Distribution of Ladder Heights. Wiener-Hopf Factorization
3a. The Wiener-Hopf Integral Equation
4. Examples
5. Applications
6. A Combinatorial Lemma
7. Distribution of Ladder Epochs
8. The Arc Sine Laws
9. Miscellaneous Complements
10. Problems for Solution

CHAPTER XIII. LAPLACE TRANSFORMS. TAUBERIAN THEOREMS. RESOLVENTS
1. Definitions. The Continuity Theorem
2. Elementary Properties
3. Examples
4. Completely Monotone Functions. Inversion Formulas
5. Tauberian Theorems
6. Stable Distributions
*7. Infinitely Divisible Distributions
*8. Higher Dimensions
9. Laplace Transforms for Semi-Groups
10. The Hille-Yosida Theorem
11. Problems for Solution

CHAPTER XIV. APPLICATIONS OF LAPLACE TRANSFORMS
1. The Renewal Equation: Theory
2. Renewal-Type Equations: Examples
3. Limit Theorems Involving Arc Sine Distributions
4. Busy Periods and Related Branching Processes
5. Diffusion Processes
6. Birth-and-Death Processes and Random Walks
7. The Kolmogorov Differential Equations
8. Example: The Pure Birth Process
9. Calculation of Ergodic Limits and of First-Passage Times
10. Problems for Solution

CHAPTER XV. CHARACTERISTIC FUNCTIONS
1. Definition. Basic Properties
2. Special Distributions. Mixtures
2a. Some Unexpected Phenomena
3. Uniqueness. Inversion Formulas
4. Regularity Properties
5. The Central Limit Theorem for Equal Components
6. The Lindeberg Conditions
7. Characteristic Functions in Higher Dimensions
*8. Two Characterizations of the Normal Distribution
9. Problems for Solution

CHAPTER XVI*. EXPANSIONS RELATED TO THE CENTRAL LIMIT THEOREM
1. Notations
2. Expansions for Densities
3. Smoothing
4. Expansions for Distributions
5. The Berry-Esséen Theorems
6. Expansions in the Case of Varying Components
7. Large Deviations

CHAPTER XVII. INFINITELY DIVISIBLE DISTRIBUTIONS
1. Infinitely Divisible Distributions
2. Canonical Forms. The Main Limit Theorem
2a. Derivatives of Characteristic Functions
3. Examples and Special Properties
4. Special Properties
5. Stable Distributions and Their Domains of Attraction
*6. Stable Densities
7. Triangular Arrays
*8. The Class L
*9. Partial Attraction. "Universal Laws"
*10. Infinite Convolutions
11. Higher Dimensions
12. Problems for Solution

CHAPTER XVIII. APPLICATIONS OF FOURIER METHODS TO RANDOM WALKS
1. The Basic Identity
*2. Finite Intervals. Wald's Approximation
3. The Wiener-Hopf Factorization
4. Implications and Applications
5. Two Deeper Theorems
6. Criteria for Persistency
7. Problems for Solution

CHAPTER XIX. HARMONIC ANALYSIS
1. The Parseval Relation
2. Positive Definite Functions
3. Stationary Processes
4. Fourier Series
*5. The Poisson Summation Formula
6. Positive Definite Sequences. L² Theory
7. Stochastic Processes and Integrals
8. Problems for Solution

ANSWERS TO PROBLEMS
SOME BOOKS ON COGNATE SUBJECTS
INDEX

An Introduction to Probability Theory and Its Applications

CHAPTER I

The Exponential and the Uniform Densities

1. INTRODUCTION

In the course of volume 1 we had repeatedly to deal with probabilities defined by sums of many small terms, and we used approximations of the form (1.1).¹

Examples. (a) Waiting times.² Consider Bernoulli trials performed at epochs δ, 2δ, 3δ, ..., with probability p_δ of success at any given trial, and denote by T the waiting time for the first success. Then

(1.1)  P\{T > n\delta\} = (1-p_\delta)^n,

and the expected waiting time is E(T) = δ/p_δ. Refinements of this model are obtained by letting δ and p_δ grow smaller in such a way that the expectation δ/p_δ = a remains fixed. To a time interval of duration t there correspond n ≈ t/δ trials, and hence for small δ

(1.2)  P\{T > t\} \approx (1-\delta/a)^{t/\delta} \approx e^{-t/a},

approximately, as can be seen by taking logarithms. This model considers the waiting time as a geometrically distributed discrete random variable, and (1.2) states that "in the limit" one gets an exponential distribution. From the point of view of intuition it would seem more natural to start from the sample space whose points are real numbers and to introduce the exponential distribution directly.

(b) Random choices. To "choose a point at random" in the interval 0,1 is a conceptual experiment with an obvious intuitive meaning. It can be described by discrete approximations, but it is easier to use the whole interval as sample space and to assign to each interval its length as probability. The conceptual experiment of making two independent random choices of points in 0,1 results in a pair of real numbers, and so the natural sample space is a unit square. In this sample space one equates, almost instinctively, "probability" with "area." This is quite satisfactory for some elementary purposes, but sooner or later the question arises as to what the word "area" really means. ▶

As these examples show, a continuous sample space may be conceptually simpler than a discrete model, but the definition of probabilities in it depends on tools such as integration and measure theory. In denumerable sample spaces it was possible to assign probabilities to all imaginable events, whereas in general spaces this naive procedure leads to logical contradictions, and our intuition has to adjust itself to the exigencies of formal logic.

¹ Further examples from volume 1: the arc sine distribution, chapter III, section 4; the distributions for the number of returns to the origin and first-passage times in III,7; the limit theorems for random walks in XIV; the uniform distribution in problem 20 of XI,7.

² Concerning the use of the term epoch, see the list of abbreviations at the front of the book.
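The passage to the limit in (1.2) is easy to check numerically. The following sketch is our addition (plain Python, not part of the original text); the parameter values are arbitrary choices.

```python
import math

a = 1.0   # fixed expectation a = delta / p_delta of the discrete waiting time
t = 2.0   # point at which the two tails are compared

for delta in (0.5, 0.1, 0.01, 0.001):
    p = delta / a                     # success probability per trial of duration delta
    n = int(t / delta)                # number of trials corresponding to duration t
    geometric = (1.0 - p) ** n        # P{T > t} in the quantized model, cf. (1.1)
    exponential = math.exp(-t / a)    # the limit asserted in (1.2)
    print(f"delta={delta:6.3f}   geometric tail={geometric:.6f}   e^(-t/a)={exponential:.6f}")
```

As δ decreases, the printed geometric tail approaches the exponential tail, which is the content of (1.2).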
We shall soon see that the naive approach can lead to trouble even in relatively simple problems, but it is only fair to say that many probabilistically significant problems do not require a clean definition of probabilities. Sometimes they are of an analytic character and the probabilistic background serves primarily as a support for our intuition. More to the point is the fact that complex stochastic processes with intricate sample spaces may lead to significant and comprehensible problems which do not depend on the delicate tools used in the analysis of the whole process. A typical reasoning may run as follows: if the process can be described at all, the random variable Z must have such and such properties, and its distribution must therefore satisfy such and such an integral equation. Although probabilistic arguments can greatly influence the analytical treatment of the equation in question, the latter is in principle independent of the axioms of probability.³

Specialists in various fields are sometimes so familiar with problems of this type that they deny the need for measure theory because they are unacquainted with problems of other types and with situations where vague reasoning did lead to wrong results.⁴

This situation will become clearer in the course of this chapter, which serves as an informal introduction to the whole theory. It describes some analytic properties of two important distributions which will be used throughout this book. Special topics are covered partly because of significant applications, partly to illustrate the new problems confronting us and the need for appropriate tools. It is not necessary to study them systematically or in the order in which they appear.

Throughout this chapter probabilities are defined by elementary integrals, and the limitations of this definition are accepted. The use of a probabilistic jargon, and of terms such as random variable or expectation, may be justified in two ways. They may be interpreted as technical aids to intuition based on the formal analogy with similar situations in volume 1. Alternatively, everything in this chapter may be interpreted in a logically impeccable manner by a passage to the limit from the discrete model described in example 2(a). Although neither necessary nor desirable in principle, the latter procedure has the merit of a good exercise for beginners.

2. DENSITIES. CONVOLUTIONS

A probability density on the line (or R¹) is a function f such that

(2.1)  f(x) \ge 0, \qquad \int_{-\infty}^{+\infty} f(x)\,dx = 1.

For the present we consider only piecewise continuous densities (see V,3 for the general notion). To each density f we let correspond its distribution function⁵ F defined by

(2.2)  F(x) = \int_{-\infty}^{x} f(y)\,dy.

³ Intervals are denoted by bars to preserve the symbol (a, b) for the coordinate notation of points in the plane. See the list of abbreviations at the front of the book.

⁴ The roles of rigor and intuition are subject to misconceptions. As was pointed out in volume 1, natural intuition and natural thinking are a poor affair, but they gain strength with the development of mathematical theory. Today's intuition and applications depend on the most sophisticated theories of yesterday. Furthermore, strict theory represents economy of thought rather than luxury. Indeed, experience shows that in applications most people rely on lengthy calculations rather than simple arguments because these appear risky.
[The nearest illustration is in example 5(a).]

⁵ We recall that by "distribution function" is meant a right continuous non-decreasing function with limits 0 and 1 at -∞ and +∞. Volume 1 was concerned mainly with distributions whose growth is due entirely to jumps. Now we focus our attention on distribution functions defined as integrals. General distribution functions will be studied in chapter V.

It is a monotone continuous function increasing from 0 to 1. We say that f and F are concentrated on the interval a,b if f vanishes outside it. A density f serves to define the probabilities of all intervals, and we speak of a random variable⁶ X with density f to indicate that P{a < X ≤ b} = F(b) - F(a).

Examples. (a) Discrete approximation. Choose δ > 0 and consider the discrete random variable X_δ which for (n-1)δ < x ≤ nδ assumes the constant value nδ. Here n = 0, ±1, ±2, .... In volume 1 we would have used the multiples of δ as sample space, and described the probability distribution of X_δ by saying that

(2.5)  P\{X_\delta = n\delta\} = F(n\delta) - F((n-1)\delta).

⁶ As far as possible we shall denote random variables (that is, functions on the sample space) by capital boldface letters, reserving small letters for numbers or location parameters. This holds in particular for the coordinate variable X, namely the function defined by X(x) = x.
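The discretization (2.5) is easy to watch at work. The sketch below is our own illustration (plain Python), taking the exponential distribution of section 3 as a concrete stand-in for F; the truncation bound is an arbitrary choice.

```python
import math

alpha = 2.0
def F(x):                       # distribution function of the exponential density (3.1)
    return 1.0 - math.exp(-alpha * x) if x > 0 else 0.0

for delta in (1.0, 0.1, 0.01):
    # E(X_delta) = sum over n of n*delta * P{X_delta = n*delta}, with weights (2.5);
    # the series is truncated where the tail is negligible
    terms = int(30 / (alpha * delta))
    mean = sum(n * delta * (F(n * delta) - F((n - 1) * delta))
               for n in range(1, terms + 1))
    print(f"delta={delta:5.2f}   E(X_delta)={mean:.6f}   E(X)={1 / alpha:.6f}")
```

Since X_δ differs from the coordinate variable X by less than δ, the printed expectations approach E(X) = 1/α as δ decreases.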
This implies the multiplication rule for intervals, for example P(X > a, ¥ > 6} = P(X > a)P(Y > b}. The analogy with the discrete case is so obvious that no further explanations are required, Many new random variables may be defined as functions of X and Y, but the most important role is played by the sum S =X + Y. The event A = (S z>0. Note on the notion of random variable. The use of the line or the Cartesian spaces ‘R" as sample spaces sometimes blurs the distinction between random variables and “ordinary” functions of one or more variables. In volume 1 random variable X could assume only denumerably many values and it was then obvious whether we were talking about a function (such as the square or the exponential) defined on the line, or the random variable X* or e* defined in the sample space. Even the outer appearance of these functions was entirely different inasmuch as the “ordinary” exponential assumes all positive values whereas e* had a denumerable range. To see the change in this situation, consider now “two independent random variables X and Y with a common density f.” In other words, the plane ‘R? serves as sample space, and probabilities are defined as integrals of f(z)f(y). Now every function of two variables can be defined in the sample space, and then it becomes a random variable, but it must be borne in mind that a function of ‘two variables can be defined also without reference to our sample space. For example, certain statistical problems compel one to introduce the random variable /(X)f(¥) [see example VI,12(d)]. On the other hand, in introducing ‘our sample space ‘R? we have evidently referred to the “ordinary” function f defined independently of the sample space. This “ordinary” function induces many random variables, namely /(X), f(¥), f(X4¥), ete. Thus the same f may serve either as a random variable or as an ordinary function. 8 THE EXPONENTIAL AND THE UNIFORM DENSITIES 13 ‘As a tule (and in each individual case) it will be clear whether or not we are concerned with a random variable. Nevertheless, in the general theory there arise situations in which functions (such as conditional prob- abilities and expectations) can be considered either as free functions or as random variables, and this is somewhat confusing if the freedom of choice is not properly understood. Note on terminology and notations. To avoid overburdening of sentences itis customary to call E(X), interchangeably, expectation of the variable X, or of the density f, or of the distribution F. Similar liberties wil be taken for other terms. For example, convolution really signifies an operation, but the term is applied also to the result of the operation and the function f'g is referred (0 as “the convolution.” Tn the older literature the terms distribution and frequency function were applied to what we call densities; our distribution functions were described as “cumulative,” and the abbreviation c.df. is still in use, THE EXPONENTIAL DENSITY For arbitrary but fixed « > 0 put (3.1) Siz) = ae, F(a) = for x>0 and F(x) = f(2) = 0 for x <0. Then f isan exponential density, F its distribution function. A trite calculation shows that the expectation equals a, the variance a*. In example 1(a) the exponential distribution was derived as the limit of geometric distributions, and the method of example 2(a) leads to the same result. 
We recall that in stochastic processes the geometric distribution frequently governs waiting times or lifetimes, and that this is due to its “lack of memory,” described in 1; XIII,9: whatever the present age, the residual lifetime is unaffected by the past and has the same distribution as the lifetime itself. It will now be shown that this property carries over to the exponential limit and to no other distribution. Let T be an arbitrary positive variable to be interpreted as life- or waiting time. It is convenient to replace the distribution function of T by its tail G2) U) = PIT > 1}. Intuitively, U(t) is the “probability at birth of a lifetime exceeding 1.” Given an age s, the event that the residual lifetime exceeds 1 is the same as {T > s+1} and the conditional probability of this event (given age s) equals the ratio U(s+2)/U(s). This is the residual lifetime distribution, and it coincides with the total lifetime distribution iff G3) UGH = US) UO, 51> 0. 13 ‘THE EXPONENTIAL DENSITY 9 Inwas shown in 1; XVIL6 that positive solution of this equation is necessarily of the form U(t) = e-*, and hence the lack of aging described above in ltalies holds true if the lifetime distribution is exponential, We shall refer to this lack of memory as the Markov property of the exponential distribution, Analytically it reduces to the statement that only for the exponential distribution F do the tails U = 1—F satisfy (3.3), but this explains the constant occurrence of the exponential dis- tribution in Markov processes. (A stronger version of the Markov property will be described in section 6.) Our description referred to temporal processes, but the argument is general and the Markov property remains meaningful when time is replaced by some other parameter. Examples. (a) Tensile strength. To obtain a continuous analogue to the proverbial finite chain whose strength is that of its weakest link denote by U(t) the probability that a thread of length (of a given material) can sustain a certain fixed load. A thread of length s+f does not snap iff the ‘two segments individually sustain the given load, Assuming that there is no interaction, the two events must be considered independent and U must satisfy (3.3). Here the length of the thread takes over the role of the time parameter, and the length at which the thread will break is an exponentially distributed random variable. (b) Random ensembles of points in space play a role in many connections so that it is important to have an appropriate definition for this concept. Speaking intuitively, the first property that perfect randomness should have is.a lack of interaction between different regions: the observed configuration within region 4, should not permit conclusions concerning the ensemble in a non-overlapping region 4g. Specifically, the probability p that both Ay and A, are empty should equal the product of the probabilities p, and Ps that Ay and Ay be empty. It is plausible that this product rule cannot hold for alf partitions unless the probability p depends only on the volume of the region A but not on its shape. Assuming this to be so, we denote by U(f) the probability that a region of volume t be empty. These prob- abilities then satisfy (3.3) and hence U(r) = e~*'; the constant « depends on the density of the ensemble or, what amounts to the same, on the unit of length. 
It will be shown in the next section that the knowledge of U(t) permits us to calculate the probabilities p,(t) that a region of volume ¢ contains exactly points of the ensemble; they are given by the Poisson dis- tribution p,(t) = e-*"(at)"/n!, We speak accordingly of Poisson ensembles of points, this term being less ambiguous than the term random ensemble which may have other connotations, (©) Ensembles of circles and spheres. Random ensembles of particles present a more intricate problem. For simplicity we assume that the particles, 10 THE EXPONENTIAL AND THE UNIFORM DENSITIES 13 are of a spherical or circular shape, the radius p being fixed. The con- figuration is then completely determined by the centers and it is tempting to assume that these centers form a Poisson ensemble. This, however, is impossible in the strict sense since the mutual distances of centers necessarily exceed 2p. One feels nevertheless that for small radii p the effect of the finite size should be negligible in practice and hence the model of a Poisson ensemble of centers should be usable as an approximation. For a mathematical model we postulate accordingly that the centers form a Poisson ensemble and accept the implied possibility that the circles or spheres intersect. This idealization will have no practical consequences if the dii_p are small, because then the theoretical frequency of intersections be negligible. Thus astronomers treat the stellar system as a Poisson ensemble and the approximation to reality seems excellent. The next two examples show how the model works in practice. (@) Nearest neighbors. We consider a Poisson ensemble of spheres (stars) with density a. The probability that a domain of volume ¢ contains no center equals ¢~*'. Saying that the nearest neighbor to the origin has a distance >r amounts to saying that a sphere of radius r contains no star center in its interior. The volume of such a ball equals 4wr?, and hence in a Poisson ensemble of stars the probability that the nearest neighbor has a distance >r is given by e~**"”, The fact that this expression is independent of the radius p of the stars shows the approximative character of the model and its limitations. In the plane, spheres are replaced by circles and the distribution function for the distance of nearest neighbors is given by 1 — e~**"*, (€) Continuation: free paths. For ease of description we begin with the two-dimensional model. The random ensemble of circular disks may be interpreted as the cross section of a thin forest. I stand at the origin, which is not contained in any disk, and look in the direction of the positive z-axis. ‘The longest interval 0,7 not intersecting any disk represents the visi or free path in the x-direction. It is a random variable and we denote it by L. Denote by A the region formed by the points at a distance

(e) Continuation: free paths. For ease of description we begin with the two-dimensional model. The random ensemble of circular disks may be interpreted as the cross section of a thin forest. I stand at the origin, which is not contained in any disk, and look in the direction of the positive x-axis. The longest interval 0,t not intersecting any disk represents the visibility or free path in the x-direction. It is a random variable and we denote it by L. Denote by A the region formed by the points at a distance ≤ ρ from the segment 0,t of the x-axis. The event {L > t} occurs iff no disk center is contained within A, but it is known in advance that the circle of radius ρ about the origin is empty. The remaining domain has area 2ρt, and we conclude that the distribution of the visibility L is exponential:

P\{L > t\} = e^{-2\alpha\rho t}.

In space the same argument applies, and the relevant region is formed by rotating A about the x-axis. The rectangle 0 < x < t, |y| ≤ ρ generates a cylinder of volume πρ²t, and hence P{L > t} = e^{-απρ²t}. The mean free path is given by E(L) = 1/(απρ²). ▶

The next theorem will be used repeatedly.

Theorem. If X₁, ..., X_n are mutually independent random variables with the exponential distribution (3.1), then the sum X₁ + ⋯ + X_n has a density g_n and distribution function G_n given by

(3.4)  g_n(x) = \alpha\,\frac{(\alpha x)^{n-1}}{(n-1)!}\,e^{-\alpha x}, \qquad x > 0,

(3.5)  G_n(x) = 1 - e^{-\alpha x}\Bigl(1 + \frac{\alpha x}{1!} + \cdots + \frac{(\alpha x)^{n-1}}{(n-1)!}\Bigr), \qquad x > 0.

Proof. For n = 1 the assertion reduces to the definition (3.1). The density g_{n+1} is defined by the convolution

(3.6)  g_{n+1}(x) = \int_0^x g_n(x-y)\,\alpha e^{-\alpha y}\,dy,

and assuming the validity of (3.4) this reduces to

g_{n+1}(x) = \frac{\alpha^{n+1}}{(n-1)!}\,e^{-\alpha x}\int_0^x (x-y)^{n-1}\,dy = \alpha\,\frac{(\alpha x)^n}{n!}\,e^{-\alpha x}.

Thus (3.4) holds by induction for all n. The validity of (3.5) is seen by differentiation. ▶

The densities g_n are among the gamma densities to be introduced in II,2. They represent the continuous analogue of the negative binomial distribution found in 1; VI,8 for the sum of n variables with a common geometric distribution. (See problem 6.)

4. WAITING TIME PARADOXES. THE POISSON PROCESS

Denote by X₁, X₂, ... mutually independent random variables with the common exponential distribution (3.1), and put S₀ = 0,

(4.1)  S_n = X_1 + \cdots + X_n, \qquad n = 1, 2, \ldots.

We introduce a family of new random variables N(t) as follows: N(t) is the number of indices k ≥ 1 such that S_k ≤ t. The event {N(t) = n} occurs iff S_n ≤ t < S_{n+1}. As S_n has the distribution G_n, the probability of this event equals G_n(t) - G_{n+1}(t), or

(4.2)  P\{N(t) = n\} = e^{-\alpha t}\,\frac{(\alpha t)^n}{n!}.

In words, the random variable N(t) has a Poisson distribution with expectation αt. This argument looks like a new derivation of the Poisson distribution, but in reality it merely rephrases the original derivation of 1; VI,6 in terms of random variables. For an intuitive description consider chance occurrences (such as cosmic ray bursts or telephone calls), which we call "arrivals." Suppose that there is no aftereffect in the sense that the past history permits no conclusions as to the future. As we have seen, this condition requires that the waiting time X₁ to the first arrival be exponentially distributed. But at each arrival the process starts from scratch as a probabilistic replica of the whole process: the successive waiting times X_k between arrivals must be independent and must have the same distribution. The sum S_n represents the epoch of the nth arrival and N(t) the number of arrivals within the interval 0,t. In this form the argument differs from the original derivation of the Poisson distribution only by the use of better technical terms. (In the terminology of stochastic processes the sequence {S_n} constitutes a renewal process with exponential interarrival times X_k; for the general notion see VI,6.)
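The correspondence just described between exponential interarrival times and Poisson counts can be checked empirically; the sketch below is our illustration (numpy assumed, arbitrary parameters). It builds the arrival epochs S_n and tabulates N(t) against (4.2).

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(3)
alpha, t = 1.5, 4.0
X = rng.exponential(scale=1 / alpha, size=(100_000, 40))  # interarrival times X_k
S = X.cumsum(axis=1)                                      # arrival epochs S_n
N = (S <= t).sum(axis=1)                                  # N(t), arrivals within 0,t
for n in range(4, 9):
    print(n, np.mean(N == n), exp(-alpha * t) * (alpha * t) ** n / factorial(n))
```

(Forty interarrival times suffice here since the chance that all forty arrivals fall before t = 4 is negligibly small.)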
Two contradictory answers stand to reason: (a) The lack of memory of the Poisson process implies that the distribution of my waiting time should not depend on the epoch of my arrival. In this case E(W,) = EWW,) = & (The epoch of my arrival is “chosen at random” in the interval between two consecutive buses, and for reasons of symmetry my expected waiting time should be half the expected time between two consecutive buses, that is E(W,) = a7, Both arguments appear reasonable and both have been used in practice. What to do about the contradiction? The easiest way out is that of the formalist, who refuses to see a problem if it is not formulated in an impeccable manner. But problems are not solved by ignoring them. 14 WAITING TIME PARADOXES. THE POISSON PROCESS 13 We now show that both arguments are substantially, if not formally, correct. The fallacy lies at an unexpected place and we now proceed to explain it” » We are dealing with interarrival times Xy = $,, X;=8,—S,,.... By assumption the X, have a common exponential distribution with expectation «1, Picking out “any” particular X, yields a random variable, and one has, the intuitive feeling that its expectation should be a? provided the choice is done without knowledge of the sample sequence X,,X.,.... But this is not true. In the example we chose that element X, for which Sur <1 SS where ¢ is fixed. This choice is made without regard to the actual process, but it turns out that the X, so chosen has the double expectation 2a. Given this fact, the argument (8) of the example postulates an expected waiting time a? and the contradiction disappears. This solution of the paradox came as a shock to experienced workers, but it becomes intuitively clear once our mode of thinking is properly adjusted. Roughly speaking, a long interval has a better chance to cover the point 1 than a short one. This vague feeling is supported by the following Proposition. Let X,,Xz,... be mutually independent with a common ‘exponential distribution with expectation a, Let t>0 be fixed, but arbitrary. The element X, satisfying the condition Sy.4t The point is that the density (4.3) is not the common density of the Xy. Its explicit form is of minor interest. [The analogue for arbitrary waiting time distributions is contained in XI,(4.16).] Proof, Let k be the (chance-dependent) index such that S,,<1< 8, and put L, equal to S,—S,... We have to prove that L, has density (4.3). Suppose first «<1 The event {L, t a similar argument applies except that y ranges from 0 to ¢ and we must add to the right side in (4,4) the probability e-* — e* that 0<1 ‘The break in the formula (4.3) at x = 1 is due to the special role of the origin as the starting epoch of the process. Obviously 4.6) limo) = atze, which shows that the special role of the origin wears out, and for an “old” process the distribution of L, is nearly independent of 1. One expresses this conveniently by saying that the “steady state” density of L, is given by the right side in (4.6). With the notations of the proof, the waiting time W, considered in the ‘example is the random variable W, = S,— {The argument of the proof shows also that Gan PO Saat tens & [eatntes-™ = ete dy t-en Thus W, has the same exponential distribution as the X, in accordance with the reasoning (a). (See problem 7.) Finally, a word about the Poisson process. The Poisson variables N(t) were introduced as functions on the sample space of the infinite sequence of random variables X;, X,,-... 
Finally, a word about the Poisson process. The Poisson variables N(t) were introduced as functions on the sample space of the infinite sequence of random variables X₁, X₂, .... This procedure is satisfactory for many purposes, but a different sample space is more natural. The conceptual experiment "observing the number of incoming calls up to epoch t" yields for each positive t an integer, and the result is therefore a step function with unit jumps. The appropriate sample space has these step functions as sample points; the sample space is a function space, the space of all conceivable "paths." In this space N(t) is defined as the value of the ordinate at epoch t and S_n as the coordinate of the nth jump, etc. Events can now be considered that are not easily expressible in terms of the original variables X_k. A typical example of practical interest (see the ruin problem in VI,5) is the event that N(t) > a + bt for some t. The individual path (just as the individual infinite sequence of ±1 in binomial trials) represents the natural and unavoidable object of probabilistic inquiry. Once one gets used to the new phraseology, the space of paths becomes most intuitive.

Unfortunately the introduction of probabilities in spaces of sample paths is far from simple. By comparison, the step from discrete sample spaces to the line, plane, etc., and even to infinite sequences of random variables, is neither conceptually nor technically difficult. Problems of a new type arise in connection with function spaces, and the reader is warned that we shall not deal with them in this volume. We shall be satisfied with an honest treatment of sample spaces of sequences (denumerably many coordinate variables). Reference to stochastic processes in general, and to the Poisson process in particular, will be made freely, but only to provide an intuitive background or to enhance interest in our problems.

Poisson Ensembles of Points

As shown in 1; VI,6, the Poisson law governs not only "points distributed randomly along the time axis," but also ensembles of points (such as flaws in materials or raisins in a cake) distributed randomly in plane or space, provided t is interpreted as area or volume. The basic assumption was that the probability of finding k points in a specified domain depends only on the area or volume of the domain, but not on its shape, and that occurrences in non-overlapping domains are independent. In example 3(b) we used the same assumption to show that the probability that a domain of volume t be empty is given by e^{-αt}. This corresponds to the exponential distribution for the waiting time for the first event, and we see now that the Poisson distribution for the number of events is a simple consequence of it. The same argument applies to random ensembles of points in space, and we have thus a new proof for the fact that the number of points of the ensemble contained in a given domain is a Poisson variable. Easy formal calculations may lead to interesting results concerning such random ensembles of points, but the remarks about the Poisson process apply equally to Poisson ensembles; a complete probabilistic description is complex and beyond the scope of the present volume.

5. THE PERSISTENCE OF BAD LUCK

As everyone knows, he who joins a waiting line is sure to wait for an abnormally long time, and similar bad luck follows us on all occasions. How much can probability theory contribute towards an explanation? For a partial answer we consider three examples typical of a variety of situations. They illustrate unexpected general features of chance fluctuations.

Examples. (a) Record values. Denote by X₀ my waiting time (or financial loss) at some chance event. Suppose that friends of mine expose themselves to the same type of experience, and denote the results by X₁, X₂, .... To exclude bias we assume that X₀, X₁, ... are mutually independent
Denote by Xo my waiting time (or financial loss) at some chance event. Suppose that friends of mine expose themselves to the same type of experience, and denote the results by Xj, Xs... To exclude bias we assume that Xo,X,,... are mutually independent 16 ‘THE EXPONENTIAL AND THE UNIFORM DENSITIES Ls random variables with a common distribution, The nature of the latter really does not matter but, since the exponential distribution serves as a model for randomness, we assume the X, exponentially distributed in accordance with (3.1). For simplicity of description we treat the sequence {X)} as infinite. To find a measure for my ill luck I ask how long it will take before a friend experiences worse luck (we neglect the event of probability zero that X, = X,). More formally, we introduce the waiting time N as the value of the first subscript n such that X,>X.. The event {N>n—I} occurs iff the maximal term of the n-tuple Xo, Xi,-..,Xq-1 appears at the initial place; for reasons of symmetry the probability of this event is , The event {N= n} is the same as (N>n—1}—{N>n}, and hence for n=l, G1) P(N =n} = 1d 1 ntl nati)’ This result fully confirms that I have indeed very bad luck: The random variable N has infinite expectation! It would be bad enough if it took on the average 1000 trials to beat the record of my ill luck, but the actual waiting time has infinite expectation. It will be noted that the argument does not depend on the condition that the X, are exponentially distributed, It follows that whenever the variables X, are independent and have a common continuous distribution function F the first record value has the distribution (5.1). ‘The fact that this distribution is independent of F is used by statisticians for tests of independ- ence. (See also problems 8-11.) The striking and general nature of the result (5.1) combined with the simplicity of the proof are apt to arouse suspicion. The argument is really impeccable (except for the informal presentation), but those who prefer to rely on brute calculation can easily verify the truth of (5.1) from the direct, definition of the probability in question as the (n+1)-tuple integral of atienstzettzn) over the region defined by the inequalities 0 <2) <2, and 0 6. WAITING TIMES AND ORDER STATISTICS ‘An ordered n-tuple (x,..., 24) of real numbers, may be reordered in increasing order of magnitude to obtain the new n-tuple ays Taps s+ sm) Where Ray S Fy S07 * S toy 18 ‘THE EXPONENTIAL AND THE UNIFORM DENSITIES 16 This operation applied to all points of the space ‘R* induces n well-defined functions, which will be denoted by Xqy,..-, Xia» If probabilities are defined in 5" these functions become random variables. We say that (Xay,--++Xiq)_ is obtained by reordering (X,,...,X,) according to increasing magnitude, The variable Xi) is called kth-order statistic® of the given sample X,,...,X,. In particular, Xj) and X,,) are the sample extremes; when n = 2y +1 is odd, X,,41) is the sample median. We apply this notion to the particular case of independent random variables X,...,X, with the common exponential density ae~*. Examples. (a) Parallel waiting lines. Interpret X,,...,X, as the lengths of n service times commencing at epoch 0 at a post office with m counters. The order statistics represent the successive epochs of terminations or, as ‘one might say, the epochs of the successive discharges (the “output process”). In particular, Xq) is the waiting time for the first discharge. 
Now if the assumed lack of aftereffect is meaningful, the waiting time Xj, must have the Markov property, that is, Xq) must be exponentially distributed. As a matter of fact, the event {Xy) > 1} is the simultaneous realization of the n events (X, > 1}, each of which has probability e~*'; because of the assumed independence the probabilities multiply and we have indeed (6.1) P{Xq) > em We can now proceed a step further and consider the situation at epoch Xi). The assumed lack of memory seems to imply that the original situation is restored except that now only m—1 counters are in operation; the continuation of the process should be independent of Xi) and a replica of the whole process. In particular, the waiting time for the next discharge, namely Xi) ~ Xa), should have the distribution (6.2) P{X)—Xay >t} analogous to (6.1). This reasoning leads to the following general proposition concerning the order statistics for independent variables with a common exponential distribution. "Strictly speaking the term “sample statistic" is synonymous with “function of the sample variables,” that is, with random variable, It is used to emphasize linguistically the different role played in a given context by the primary variable (the sample) and some derived variables. For example, the “sample mean” (X,+"+X,)/n is called a statistic. ‘Order statistics occur frequently in the statistical literature, We conform to the standard terminology except that the extremes are usually called extreme “values.” 16 WAITING TIMES AND ORDER STATISTICS 19 Proposition* The variables Xqy),Xy — Xay + independent and the density of Xs) — Xy is given by (n—K)ae™P™, Before verifying this proposition formally let us consider its implications. When n= 2 the difference X,) — Xi is the residual waiting time after the expiration of the shorter of two waiting times. The proposition asserts that this residual waiting time has the same exponential distribution as the original waiting time and is independent of Xq). This is an extension of the Markov property enunciated for fixed epochs t to the chance-dependent stopping time Xq). It is called the strong Markov property. (As we are dealing with only finitely many variables we are in a position to derive the strong Markov property from the weak one, but in more complicated stochastic processes the distinction is essential.) ‘The proof of the proposition serves as an example of formal mai with integrals. For typographical simplicity we let m= 3. As in many similar situations we use a symmetry argument. With probability one, no two among the variables X; are equal, Neglecting an event of probability zero the six possible orderings of X;, Xz, Xz according to magnitude there- fore represent six mutually exclusive events of equal probability. To cal- culate the distribution of the order statistics it suffices therefore to consider the contingency X, < Xz < Xs. Thus 63) PAX) > fy Xer—Xay > fo» Xe Xe > bs} = ” = OP{X, > ty, Xe—Xy > tay Xp—Xp > 15}. (Purely analytically, the space ‘R* is partitioned into six parts congruent to the region defined by 2% <_< ty, each contributing the same amount to the integral. The boundaries where two or more coordinates are equal have probability zero and play no role.) To evaluate the right side in (6.3) we have to integrate ae"**#**9) over the region defined by the inequalities Phy AM >h Bom > hy A simple integration with respect to. 
x leads to emer an sede = (64) = tenn fara, = ementem * This proposition has been discovered repeatedly for purposes of statistical estimation but the usual proofs are computational instead of appealing to the Markov property. See also problem 13, 20 ‘THE EXPONENTIAL AND THE UNIFORM DENSITIES. 16 Thus the joint distribution of the three variables X,), Xi—Xavs Xay—Xewy isa product of three exponential distributions, and this proves the proposition. It follows in particular that E(Xoy—Xis) = (aK). Summing over k=0,1,...,9-1 we obtain (6.5) EX) i +o) Note that this expectation was calculated without knowledge of the distri- bution of X,, and we have here another example of the advantage to be derived from the representation of a random variable as a sum of other variables. (See 1; 1X,3.) (8) Use of the strong Markov property. For picturesque language suppose that at epoch 0 three persons 4, B, and C arrive at a post office and find two counters free, The three service times are independent random variables X, Y, Z with the same exponential distribution. The service times of and B commence immediately, but that of C starts at the epoch Xa) when either A or B is discharged. We show that the Markov property leads to simple answers to various questions. (i) What is the probability that C will not be the last to leave the post office? ‘The answer is }, because epoch Xq) of the first departure establishes symmetry between C and the other person being served. (ii) What is the distribution of the time T spent by C at the post office? Clearly T = Xq) + Z is the sum of two independent variables whose distributions are exponential with parameters 2a and a. The convolution of two exponential distributions is given by (2.14), and itis seen that T has density u(t) = 2a(e*! — e-**) and E(T) = 3/(22) (iii) What is the distribution of the epoch of the Jast departure? Denote the epochs of the successive departures by Xa), Xia), Xie The difference Xia) — Xqy is the sum of the two variables Xia) — Xie) and Xia) — Xo We saw in the preceding example that these variables are independent and have exponential distributions with parameters 2a and a. It follows that Xia) — Xa) has the same density uw as the variable T. Now Xq) is independent of Xj» Xa) and has density 2ae-*'. The convolution formula used in (ji) shows therefore that X,,) has density a Aafe-t!—e-*!—are-tt] and E(Xiq)) = 2/a. The advantage of this method becomes clear on comparison with direct calculations, but the latter apply to arbitrary service time distributions (problem 19). (©) Distribution of order statistics. As a final exercise we derive the distribution of X,). The event {Xq) <1} signifies that at least k among 47 THE UNIFORM. DISTRIBUTION 2 the m variables X, are 0) the probability of the joint event that one among the variables X, lies between ¢ and 1 +h and that k — 1 among the remaining n — 1 variables are 1-+ A. Multiplying the number of choices and the corresponding probabilities leads to (6.7). Beginners are advised to formalize this argument, and also to derive (6.7) from the discrete model. (Continued in problems 13, 17.) > etter tint aiybotg- loiter e ae 7. THE UNIFORM DISTRIBUTION ‘The random variable X is distributed uniformly in the interval a,b if its density is constant = (6—a)* for a)=(-1", O ‘equals the integral of the constant function 1 over the union of the n! congruent regions defined either by the string of inequalities 2% <--> 1 — 4 and by (7.1) the probability for this equals 1°. 
7. THE UNIFORM DISTRIBUTION

The random variable X is distributed uniformly in the interval a,b if its density is constant = (b-a)^{-1} for a < x < b and zero elsewhere. The interval 0,1 may serve as the standard case; for it the distribution function equals x at each point 0 ≤ x ≤ 1. If X₁, ..., X_n are independent and distributed uniformly in 0,1, their joint density is the constant 1, and so for any region A of the n-dimensional unit cube

(7.1)   P{(X₁, ..., X_n) ∈ A} = the integral of the constant function 1 over A.

Examples. (a) Orderings. With probability one no two among the variables X_k are equal, and each prescribed ordering of X₁, ..., X_n according to magnitude corresponds to one of n! congruent regions defined by strings of inequalities of the form x₁ < x₂ < ··· < x_n; each ordering therefore has probability 1/n!. Again, the event that all n variables are ≤ t corresponds to a cube of side t, and by (7.1) the probability for this equals t^n. In particular, for n = 2 the maximum X = X_(2) has the distribution function t². The variable X therefore has density 2t. (Beginners are advised to try a direct computational verification.)

(b) Partitions. The points X₁, ..., X_n partition 0,1 into n + 1 subintervals. By symmetry all these intervals have the same distribution, and the kth order statistic X_(k) is the sum of the first k of them.

(d) Distribution of order statistics. If X₁, ..., X_n are independent and distributed uniformly in 0,1, the number of variables satisfying the inequality 0 < X_j ≤ t has the binomial distribution with "probability of success" t. The event {X_(k) ≤ t} occurs iff at least k among the n variables are ≤ t, and hence

(7.2)   P{X_(k) ≤ t} = Σ_{j=k}^{n} C(n,j) t^j (1-t)^{n-j}.

(e) Limit theorems. For the minimum X_(1) we get from (7.2)

(7.3)   P{nX_(1) > t} = (1 - t/n)^n → e^{-t}.

It is customary to describe this relation by saying that in the limit X_(1) is exponentially distributed with expectation n^{-1}. Similarly

(7.4)   P{nX_(2) > t} = (1 - t/n)^n + t(1 - t/n)^{n-1} → e^{-t}(1 + t),

and on the right one recognizes the tail of the gamma distribution G₂ of (3.5). In like manner it is easily verified that for every fixed k, as n → ∞, the distribution of nX_(k) tends to the gamma distribution G_k (see problem 33). Now G_k is the distribution of the sum of k independent exponentially distributed variables, while X_(k) is the sum of the first k intervals considered in example (b). We can therefore say that the lengths of the successive intervals of our partition behave in the limit as if they were mutually independent exponentially distributed variables. [In view of the obvious relation of (7.2) with the binomial distribution, the central limit theorem may be used to obtain approximations to the distribution of X_(k) when both n and k are large. See problem 34.]

(f) Ratios. Let X be chosen at random in 0,1 and denote by U the length of the shorter of the intervals 0,X and X,1 and by V = 1 - U the length of the longer. The random variable U is uniformly distributed between 0 and ½, because for t < ½ the event {U ≤ t} occurs iff either X ≤ t or 1 - X ≤ t, and therefore has probability 2t. For reasons of symmetry V is uniformly distributed between ½ and 1, and so E(U) = ¼, E(V) = ¾. What can we say about the ratio V/U? It necessarily exceeds 1, and it lies between 1 and t > 1 iff either

(7.5)   1/(t+1) ≤ X ≤ ½  or  ½ ≤ X ≤ t/(t+1).

It follows that

(7.6)   P{V/U ≤ t} = (t-1)/(t+1), t > 1,

and the density of this distribution is given by 2(t+1)^{-2}. It is seen that V/U has infinite expectation. This example shows how little information is contained in the observation that E(V)/E(U) = 3. ▶
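The distribution (7.6) of V/U, and the failure of its expectation to exist, can be observed directly. In the sketch below (NumPy assumed, sample sizes arbitrary) the running sample means keep growing instead of settling down.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.random(500_000)
    u = np.minimum(x, 1 - x)
    r = (1 - u) / u                          # the ratio V/U

    for t in (2.0, 5.0, 20.0):
        print(f"P{{V/U <= {t}}}: {np.mean(r <= t):.4f}",
              f"  predicted (t-1)/(t+1) = {(t - 1) / (t + 1):.4f}")

    # infinite expectation: sample means grow with the sample size
    for m in (10**3, 10**4, 10**5):
        print(m, r[:m].mean())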
8. RANDOM SPLITTINGS

The problem of this section concludes the preceding parade of examples; it is separated from them partly because of its importance in physics, and partly because it will serve as a prototype for general Markov chains. Formally we are concerned with products of the form Z_n = X₁X₂···X_n, where X₁, ..., X_n are mutually independent variables distributed uniformly in 0,1.

Examples for applications. In certain collision processes a physical particle is split into two and its mass m divided between them. Different laws of partition may fit different processes, but it is frequently assumed that the fraction of the parental mass received by each descendant particle is distributed uniformly in 0,1. If one of the two particles is chosen at random and subjected to a new collision then (assuming that there is no interaction, so that the collisions are independent) the masses of the two second-generation particles are given by products mX₁X₂, and so on. (See problem 21.) With trite verbal changes this model applies also to splittings of mineral grains or pebbles, etc. Instead of masses one considers also energy losses under collisions, and the description simplifies somewhat if one is concerned with changes of energy of the same particle in successive collisions. As a last example consider the changes in the intensity of light when passing through matter. Example 10(a) shows that when a light ray passes through a sphere of radius R "in a random direction" the distance traveled through the sphere is distributed uniformly between 0 and 2R. In the presence of uniform absorption such a passage reduces the intensity of the incident ray by a factor that is distributed uniformly in an interval 0,a (where a < 1 depends on the strength of absorption). The scale factor does not seriously affect our model, and it is seen that n independent passages would reduce the intensity of the light by a factor of the form Z_n. ▶

To find the distribution of Z_n we can proceed in two ways.

(i) Reduction to exponential distributions. Since sums are generally preferable to products we pass to logarithms, putting Y_k = -log X_k. The Y_k are mutually independent, and for t > 0

(8.1)   P{Y_k > t} = P{X_k < e^{-t}} = e^{-t}.

Now the distribution function G_n of the sum S_n = Y₁ + ··· + Y_n of n independent exponentially distributed variables was calculated in (3.5), and the distribution function of Z_n = e^{-S_n} is given by 1 - G_n(log t^{-1}) for 0 < t < 1. The density of this distribution function is t^{-1} g_n(log t^{-1}), or

(8.2)   f_n(t) = (1/(n-1)!) (log 1/t)^{n-1}, 0 < t < 1.

Our problem is solved explicitly. This method reveals the advantages to be derived from an appropriate transformation, but its success depends on the accidental equivalence of our problem with one previously solved.

(ii) A recursive procedure has the advantage that it lends itself also to related problems and generalizations. Let F_n(t) = P{Z_n ≤ t} for 0 < t < 1. By definition F₁(t) = t. Suppose F_{n-1} known, and note that Z_n = Z_{n-1}X_n is the product of two independent variables. Given X_n = x, the event {Z_n ≤ t} occurs with certainty when x ≤ t, and for x > t it occurs iff Z_{n-1} ≤ t/x. Thus

(8.3)   F_n(t) = t + ∫_t^1 F_{n-1}(t/x) dx, 0 < t < 1.

For n > 1 we get by differentiation from (8.3)

(8.4)   f_n(t) = ∫_t^1 f_{n-1}(t/x) dx/x, 0 < t < 1,

and trite calculations show that f_n is indeed given by (8.2).
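A histogram check of (8.2) is straightforward; the following sketch (illustrative n and sample size, NumPy assumed) compares simulated products Z_n against the density just derived.

    import numpy as np
    from math import factorial, log

    rng = np.random.default_rng(0)
    n, trials = 4, 400_000                       # arbitrary illustrative values
    z = rng.random((trials, n)).prod(axis=1)     # Z_n = X_1 X_2 ... X_n

    f = lambda t: log(1 / t) ** (n - 1) / factorial(n - 1)   # density (8.2)

    h = 0.01
    for t in (0.05, 0.2, 0.5):
        est = np.mean((z > t - h / 2) & (z <= t + h / 2)) / h
        print(f"t = {t}: histogram {est:.3f}  vs  (8.2) {f(t):.3f}")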
9. CONVOLUTIONS AND COVERING THEOREMS

The results of this section have a mild amusement value in themselves and some obvious applications. Furthermore, they turn up rather unexpectedly in connection with seemingly unrelated topics, such as significance tests in harmonic analysis [example III,3(f)], Poisson processes [XIV,2(a)], and random flights [example 10(e)]. It is therefore not surprising that all the formulas, as well as variants of them, have been derived repeatedly by different methods. The method used in the sequel is distinguished by its simplicity and applicability to related problems.

Let a > 0 be fixed, and denote by X₁, X₂, ... mutually independent random variables distributed uniformly over 0,a. Let S_n = X₁ + ··· + X_n. Our first problem consists in finding the distribution U_n of S_n and its density u_n = U_n'. By definition u₁(x) = a^{-1} for 0 < x < a and u₁(x) = 0 elsewhere. The density of S_{n+1} = S_n + X_{n+1} is given by the convolution

(9.1)   u_{n+1}(x) = (1/a) ∫_{x-a}^{x} u_n(y) dy,

which is the same as

(9.2)   u_{n+1}(x) = (1/a)[U_n(x) - U_n(x-a)].

It is convenient to introduce the notation

(9.3)   x₊ = x for x > 0, x₊ = 0 for x ≤ 0.

Note that (x-a)₊ is zero for x ≤ a and equals x - a for x ≥ a. With this notation the uniform distribution may be written in the form

(9.4)   U₁(x) = [x₊ - (x-a)₊]/a.

Theorem 1. Let S_n be the sum of n independent variables distributed uniformly over 0,a. Let U_n(x) = P{S_n ≤ x} and denote by u_n = U_n' the density of this distribution. Then for n = 1, 2, ... and x > 0

(9.5)   U_n(x) = (1/(a^n n!)) Σ_{ν=0}^{n} (-1)^ν C(n,ν) (x - νa)₊^n,

(9.6)   u_n(x) = (1/(a^n (n-1)!)) Σ_{ν=0}^{n} (-1)^ν C(n,ν) (x - νa)₊^{n-1}.

(These formulas remain true also for x < 0, and for n = 0, provided x₊⁰ is defined to equal 0 on the negative half-axis and 1 on the positive.)

Note that for a point x between (k-1)a and ka only k terms of the sum are different from zero. In practical calculations it is convenient to disregard the limits of summation and to pretend that ν runs from -∞ to ∞. This is possible because, with the standard convention, the binomial coefficients in (9.5) vanish for ν < 0 and ν > n (see 1; II,8).

Proof. For n = 1 the assertion (9.5) reduces to (9.4) and is obviously true. We now prove the two assertions simultaneously by induction. Assume (9.5) to be true for some n ≥ 1. Substituting into (9.2) we get u_{n+1} as the difference of two sums. Changing the summation index ν in the second sum to ν - 1 we get

u_{n+1}(x) = (1/(a^{n+1} n!)) Σ_ν (-1)^ν [C(n,ν) + C(n,ν-1)] (x - νa)₊^n,

which is identical with (9.6) with n replaced by n + 1. Integrating this relation leads to (9.5) with n replaced by n + 1, and this completes the proof. ▶

(An alternative proof, using a passage to the limit from the discrete model, is contained in problem 20 of 1; XI,7.)

Let a = 2b. The variables X_k - b are then distributed uniformly over the symmetric interval -b,b, and hence the sum of n such variables has the same distribution as S_n - nb, which is given by U_n(x + nb). Our theorem may therefore be reformulated in the following equivalent form.

Theorem 1a. The density of the sum of n independent variables distributed uniformly over -b,b is given by

(9.7)   u_n(x + nb) = (1/((2b)^n (n-1)!)) Σ_ν (-1)^ν C(n,ν) (x + (n-2ν)b)₊^{n-1}.

We turn to a theorem which admits of two equivalent formulations, both of which are useful in many special problems arising in applications. By unexpected good luck the required probability can be expressed simply in terms of the density u_n. We prove this analytically by a method of wide applicability. For a proof based on geometric arguments see problem 23.

Theorem 2. On a circle of length t there are given n ≥ 2 arcs of length a whose centers are chosen independently and at random. The probability φ_n(t) that these n arcs cover the whole circle is

(9.8)   φ_n(t) = (n-1)! a^n t^{-(n-1)} u_n(t),

which is the same as

(9.9)   φ_n(t) = Σ_{ν=0}^{n} (-1)^ν C(n,ν) (1 - νa/t)₊^{n-1}.

Before proving it we reformulate the theorem in a form to be used later. Choose one of the n centers as origin and open the circle into an interval of length t. The remaining n - 1 centers are then randomly distributed in 0,t, and theorem 2 obviously expresses the same thing as

Theorem 3. Let the interval 0,t be partitioned into n subintervals by choosing independently at random n - 1 points X₁, ..., X_{n-1} of division. The probability φ_n(t) that none of these subintervals exceeds a in length equals (9.9).

Note that φ_n(t), considered for fixed t as a function of a, represents the distribution function of the maximal length among the n intervals into which 0,t is partitioned. For related questions see problems 22-27.

Proof. It suffices to prove theorem 3. We prove the recursion formula

(9.10)   φ_n(t) = ((n-1)/t) ∫_0^a φ_{n-1}(t-x) ((t-x)/t)^{n-2} dx.

Its truth follows directly from the definition of φ_n as an (n-1)-tuple integral, but it is preferable to read (9.10) probabilistically as follows. The smallest among X₁, ..., X_{n-1} must be less than a, and there are n - 1 choices for it. Given that X₁ = x, the probability that X₁ is leftmost equals [(t-x)/t]^{n-2}. The remaining variables are then distributed uniformly over x,t, and the conditional probability that they satisfy the conditions of the theorem is φ_{n-1}(t-x). Summing over all possibilities we get (9.10).

Let us for the moment define ũ_n by solving (9.8), that is, put ũ_n(t) = t^{n-1}φ_n(t)/(a^n (n-1)!). Then (9.10) reduces to

(9.11)   a ũ_n(t) = ∫_{t-a}^{t} ũ_{n-1}(y) dy,

which is exactly the recursion formula (9.1) which served to define u_n. It suffices therefore to verify the theorem for n = 2. But it is obvious that φ₂(t) = 1 for 0 < t ≤ a and φ₂(t) = (2a-t)/t for a ≤ t ≤ 2a, in agreement with (9.9). This completes the proof. ▶
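Theorem 3, and with it theorem 2 and formula (9.9), lends itself to a direct Monte Carlo check. In the sketch below (NumPy assumed) the values of n, t, a are arbitrary illustrations.

    import numpy as np
    from math import comb

    def phi(n, t, a):
        """Formula (9.9): P{no subinterval of 0,t exceeds a}, n-1 division points."""
        return sum((-1)**v * comb(n, v) * max(1 - v * a / t, 0)**(n - 1)
                   for v in range(n + 1))

    rng = np.random.default_rng(0)
    n, t, a, trials = 5, 1.0, 0.4, 200_000

    pts = np.sort(rng.random((trials, n - 1)) * t, axis=1)
    edges = np.hstack([np.zeros((trials, 1)), pts, np.full((trials, 1), t)])
    max_gap = np.diff(edges, axis=1).max(axis=1)

    print("simulated:", np.mean(max_gap <= a), "  formula (9.9):", phi(n, t, a))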
10. RANDOM DIRECTIONS

Choosing a random direction in the plane R² is the same as choosing at random a point on the circle. If one wishes to specify the direction by its angle with the positive x-axis, the circle should be referred to its arc length θ, with 0 ≤ θ < 2π. For random directions in the space R³ the unit sphere serves as sample space; each domain has a probability equal to its area divided by 4π. Choosing a random direction in R³ is thus equivalent to choosing a point at random on the unit sphere.

Denote by L the length of the projection on a fixed line of a unit vector with random direction in R³, and by L' the length of its projection on a fixed plane. Since the area of a spherical zone is proportional to its height, the event {L > t} corresponds on the unit sphere to two caps of total height 2 - 2t, and the event {L' ≤ t} to two caps of total height 2 - 2√(1-t²). This determines the two distribution functions up to numerical factors, and these follow easily from the condition that both distributions equal 1 at t = 1: the length of the projection on a line is distributed uniformly in 0,1, while the length of the projection on a plane has distribution function 1 - √(1-t²).

   ¹¹ Readers who feel uneasy about the use of conditional probabilities in connection with densities should replace a hypothesis of the form X = x by x - h < X ≤ x + h and pass to the limit.

Examples. (a) Passage through spheres. Let Σ be a sphere of radius r and N a point on it. A line drawn through N in a random direction intersects Σ in P. Then: the length of the segment NP is a random variable distributed uniformly between 0 and 2r. To see this consider the diameter NS of the sphere and the triangle NPS, which has a right angle at P and an angle Θ at N. The length of NP is then 2r cos Θ. But cos Θ is also the projection on the diameter NS of a unit vector in the line NP, and therefore cos Θ is uniformly distributed in 0,1. In physics this model is used to describe the passage of light through "randomly distributed spheres." The resulting absorption of light was used as one example for the random-splitting process in the last section. (See problem 28.)

(b) Circular objects under the microscope. Through a microscope one observes the projection of a cell on the x₁,x₂-plane rather than its actual shape. In certain biological experiments the cells are lens-shaped and may be treated as circular disks. Only the horizontal diameter of the disk projects in its natural length, and the whole disk projects into an ellipse whose minor axis is the projection of the steepest radius. Now it is generally assumed that the orientation of the disk is random, meaning that the direction of its normal is chosen at random. In this case the projection of the unit normal on the x₃-axis is distributed uniformly in 0,1. But the angle between this normal and the x₃-axis equals the angle between the steepest radius and the x₁,x₂-plane, and hence the ratio of the minor to the major axis is distributed uniformly in 0,1. Occasionally the evaluation of experiments was based on the erroneous belief that the angle between the steepest radius and the x₁,x₂-plane should be distributed uniformly.

(c) Why are two violins twice as loud as one? (The question is serious because the loudness is proportional to the square of the amplitude of the vibration.) The incoming waves may be represented by random unit vectors, and the superposition effect of two violins corresponds to the addition of two independent random vectors. By the law of cosines the square of the length of the resulting vector is 2 + 2 cos Θ. Here Θ is the angle between the two random vectors, and hence cos Θ is uniformly distributed in -1,1 and has zero expectation. The expectation of the square of the resultant length is therefore indeed 2. In the plane cos Θ is not uniformly distributed, but for reasons of symmetry its expectation is still zero. Our result therefore holds in any number of dimensions. See also example V,4(e). ▶

By a random vector in R³ is meant a vector drawn in a random direction with a length L which is a random variable independent of its direction.
The probabilistic properties of a random vector are completely determined by those of its projection on the x-axis, and using the latter it is frequently possible to avoid analysis in three dimensions. For this purpose it is important to know the relationship between the distribution function V of the true length L and the distribution function F of the length L_x of the projection on the x-axis. Now L_x = XL, where X is the length of the projection of a unit vector in the given direction. Accordingly, X is distributed uniformly over 0,1 and is independent of L. Given X = x, the event {L_x ≤ t} occurs iff L ≤ t/x, and so¹²

(10.2)   F(t) = ∫_0^1 V(t/x) dx, t > 0.

   ¹² This argument repeats the proof of (8.3).

For the corresponding densities we get by differentiation

(10.3)   f(t) = ∫_0^1 v(t/x) dx/x = ∫_t^∞ v(s) ds/s, t > 0,

and differentiation of the last form leads to

(10.4)   v(t) = -t f'(t), t > 0.

We have thus found the analytic relationship between the density v of the length of a random vector in R³ and the density f of the length of its projection on a fixed direction. The relation (10.3) is used to find f when v is known, and (10.4) in the opposite direction. (The asymmetry between the two formulas is due to the fact that the direction is not independent of the length of the projection.)

Examples. (d) Maxwell distribution for velocities. Consider random vectors in space whose projections on the x-axis have the normal density with zero expectation and unit variance. Since lengths are taken positive we have

(10.5)   f(t) = 2n(t) = √(2/π) e^{-t²/2}, t > 0.

From (10.4) then

(10.6)   v(t) = √(2/π) t² e^{-t²/2}, t > 0.

This is the Maxwell density for velocities in statistical mechanics. The usual derivation combines the preceding argument with a proof that f must be of the form (10.5). (For an alternative derivation see III,4.)

(e) Lord Rayleigh's random flights in R³. Consider n unit vectors whose directions are chosen independently and at random. We seek the distribution of the length L_n of their resultant (vector sum). Instead of studying this resultant directly we consider its projection on the x-axis. This projection is obviously the sum of n independent random variables distributed uniformly over -1,1. The density of this sum is given by (9.7) with b = 1. Substituting into (10.4) one sees that the density of the length L_n is given by¹³

(10.7)   v_n(t) = -(t/(2^n (n-2)!)) Σ_ν (-1)^ν C(n,ν) (t + n - 2ν)₊^{n-2}, t > 0.

This problem occurs in physics and chemistry (the vectors representing, for example, plane waves or molecular links). The reduction to one dimension seems to render this famous problem trivial. The same method applies to random vectors with arbitrary lengths, and thus (10.4) enables us to reduce random-walk problems in R³ to simpler problems in R¹. Even when explicit solutions are hard to get, the central limit theorem provides valuable information [see example VIII,4(b)]. ▶

   ¹³ The standard reference is to a paper by S. Chandrasekhar (reprinted in Wax (1954)), who calculated v_n for small n and the Fourier transform of v_n. Because he used polar coordinates, his W(r) must be multiplied by 4πr² to obtain our v_n.
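The passage from (10.5) to (10.6) can be run backwards experimentally: if speeds follow the Maxwell density, the projections L_x = XL must be half-normal. A minimal check (NumPy assumed; sample size arbitrary):

    import numpy as np
    from math import erfc, sqrt

    rng = np.random.default_rng(0)
    trials = 300_000
    # the norm of a standard normal vector in R^3 has the Maxwell density (10.6)
    speed = np.linalg.norm(rng.standard_normal((trials, 3)), axis=1)
    proj = speed * rng.random(trials)     # L_x = X L, X uniform in 0,1 as in (10.2)

    for t in (0.5, 1.0, 2.0):
        print(f"t = {t}: simulated P{{L_x > t}} = {np.mean(proj > t):.4f},",
              f"half-normal tail = {erfc(t / sqrt(2)):.4f}")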
Random vectors in R² are defined in like manner. The distribution V of the true length and the distribution F of the length of the projection are related by the obvious analogue to (10.2), namely

(10.8)   F(t) = (2/π) ∫_0^1 V(t/x) (1-x²)^{-½} dx.

However, the inversion formula (10.4) has no simple analogue, and to express V in terms of F we must depend on the relatively deep theory of Abel's integral equation.¹⁴ We state without proof that if F has a continuous density f, then

(10.9)   V(t) = 1 - t ∫_0^{π/2} f(t/sin θ) dθ/sin²θ.

(See problems 29-30.)

   ¹⁴ The reduction of (10.8) to Abel's classical integral equation is effected by an appropriate change of variables.

Example. (f) Binary orbits. In observing a spectroscopic binary orbit astronomers can measure only the projections of vectors onto a plane perpendicular to the line of sight. An ellipse in space projects into an ellipse in this plane. The major axis of the true ellipse lies in the plane determined by the line of sight and its projection, and it is therefore reasonable to assume that the angle between the major axis and its projection is uniformly distributed. Measurements determine (in principle) the distribution of the projection. The distribution of the true major axis is then given by the solution (10.9) of Abel's integral equation. ▶

11. THE USE OF LEBESGUE MEASURE

If a set A in 0,1 is the union of finitely many non-overlapping intervals I₁, I₂, ... of lengths λ₁, λ₂, ..., the uniform distribution attributes to it the probability

(11.1)   P{A} = λ₁ + λ₂ + ···.

The following examples will show that some simple, but significant, problems lead to unions of infinitely many non-overlapping intervals. The definition (11.1) is still applicable and identifies P{A} with the Lebesgue measure of A. It is consistent with our program to identify probabilities with the integral of the density f(x) = 1, except that we use the Lebesgue integral rather than the Riemann integral (which need not exist). Of the Lebesgue theory we require only the fact that if A is the union of possibly overlapping intervals I₁, I₂, ..., the measure P{A} exists and does not exceed the sum λ₁ + λ₂ + ··· of the lengths; for non-overlapping intervals the equality (11.1) holds. The use of Lebesgue measure conforms to uninhibited intuition and simplifies matters inasmuch as many formal passages to the limit are justified.

A set N is called a null set if it is contained in sets of arbitrarily small measure, that is, if to each ε there exists a set A ⊃ N such that P{A} < ε. In this case P{N} = 0.

In the following, X stands for a random variable distributed uniformly in 0,1.

Examples. (a) What is the probability of X being rational? The sequence ½, ⅓, ⅔, ¼, ¾, ... contains all the rationals in 0,1 (ordered according to increasing denominators). Choose ε < ½ and denote by I_k an interval of length ε^k centered at the kth point of the sequence. The sum of the lengths of the I_k is ε + ε² + ··· < 2ε, and their union covers the rationals. Therefore, by our definition, the set of all rationals has probability zero, and so X is irrational with probability one.

It is pertinent to ask why such sets should be considered in probability theory. One answer is that nothing can be gained by excluding them and that the use of the Lebesgue theory actually simplifies matters without requiring new techniques. A second answer may be more convincing to beginners and non-mathematicians: the following variants lead to problems of undoubted probabilistic nature.

(b) With what probability does the digit 7 occur in the decimal expansion of X? In the decimal expansion of each x in the open interval between 0.7 and 0.8 the digit 7 appears at the first place. For each n there are 9^{n-1} intervals of length 10^{-n} containing only numbers such that the digit 7 appears at the nth place but not before. (For n = 2 their endpoints are 0.07 and 0.08, next 0.17 and 0.18, etc.)
These intervals are non-overlapping, and their total length is (1/10)(1 + 9/10 + (9/10)² + ···) = 1. Thus our event has probability 1.

Notice that certain numbers have two expansions, for example 0.7 = 0.6999.... To make our question unequivocal we should therefore specify whether the digit 7 must or may occur in the expansion, but our argument is independent of the difference. The reason is that only rationals can have two expansions, and the set of all rationals has probability zero.

(c) Coin tossing and random choice. Let us now see how a "random choice of a point X between 0 and 1" can be described in terms of discrete random variables. Denote by X_k(x) the kth decimal of x. (To avoid ambiguities let us use terminating expansions when possible.) The random variable X_k assumes the values 0, 1, ..., 9, each with probability 1/10, and the X_k are mutually independent. By the definition of a decimal expansion we have the identity

(11.2)   X = Σ_{k=1}^∞ 10^{-k} X_k.

This formula reduces the random choice of a point X to successive choices of its decimals. For further discussion we switch from decimal to dyadic expansions, that is, we replace the basis 10 by 2. Instead of (11.2) we have now

(11.3)   X = Σ_{k=1}^∞ 2^{-k} X_k,

where the X_k are mutually independent random variables assuming the values 0 and 1 with probability ½. These variables are defined on the interval 0,1 on which probability is equated with Lebesgue measure (length). This formulation brings to mind the coin-tossing game of volume 1, in which the sample space consists of infinite sequences of heads and tails, or zeros and ones. A new interpretation of (11.3) is now possible in this sample space: in it, the X_k are coordinate variables, and X is a random variable defined by them; its distribution function is, of course, uniform. Note that the second formulation contains two distinct sample points 0111111... and 1000000... even though the corresponding dyadic expansions represent the same point ½. Nevertheless, the notion of zero probability enables us to identify the two sample spaces. Stated in more intuitive terms: neglecting an event of probability zero, the random choice of a point X between 0 and 1 can be effected by a sequence of coin tossings; conversely, the result of an infinite coin-tossing game may be represented by a point x of 0,1. Every random variable of the coin-tossing game may be represented by a function on 0,1, etc. This convenient and intuitive device has been used since the beginning of probability theory, but it depends on neglecting events of zero probability.

(d) Cantor-type distributions. A distribution with unexpected properties is found by considering in (11.3) the contribution of the even-numbered terms or, what amounts to the same, by considering the random variable

(11.4)   Y = 3 Σ_{k=1}^∞ 4^{-k} X_k.

(The factor 3 is introduced to simplify the discussion; it makes 1 the largest possible value of Y. The contribution of the odd-numbered terms has the same distribution as ⅔Y.) The distribution function F(x) = P{Y ≤ x} is continuous, since Y assumes no individual value with positive probability. On the other hand, the possible values of Y are the numbers whose expansions in the base 4 contain only the digits 0 and 3, and these form a set of Lebesgue measure zero. All the growth of F is thus concentrated on a null set: F is a continuous distribution function of the so-called singular type, and it possesses no density. ▶
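Example (c) is easily mechanized: a uniform variable can be assembled from simulated coin tossings as in (11.3), and the digit-7 question of example (b) can then be sampled. With only the first n decimal digits inspected the probability is 1 - (9/10)^n. (The truncation to 50 binary digits, and all sample sizes, are artifacts of this illustration, not of the text.)

    import numpy as np

    rng = np.random.default_rng(0)
    bits = rng.integers(0, 2, size=(100_000, 50))      # simulated coin tossings
    x = (bits * 0.5 ** np.arange(1, 51)).sum(axis=1)   # X per (11.3), truncated
    print("mean, variance:", x.mean(), x.var())        # near 1/2 and 1/12

    for n in (1, 2, 5, 10):
        digits = (x[:, None] * 10.0 ** np.arange(1, n + 1)).astype(np.int64) % 10
        print(n, np.mean((digits == 7).any(axis=1)), 1 - (9 / 10) ** n)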
12. EMPIRICAL DISTRIBUTIONS

The "empirical distribution function" F_n of n points a₁, ..., a_n on the line is the step function with jumps 1/n at a₁, ..., a_n. In other words, nF_n(x) equals the number of points a_k in -∞,x, and F_n is a distribution function. Given n random variables X₁, ..., X_n, their values at a particular point of the sample space form an n-tuple of numbers, and its empirical distribution function is called the empirical sample distribution. For each x, the value F_n(x) of the empirical sample distribution defines a new random variable, and the empirical distribution of (X₁, ..., X_n) represents a whole family of random variables depending on the parameter x. (In technical language we are concerned with a stochastic process with x as time parameter.) No attempt will be made here to develop the theory of empirical distributions, but the notion may be used to illustrate the occurrence of complicated random variables in simple applications. Furthermore, the uniform distribution will appear in a new light.

Let X₁, ..., X_n stand for mutually independent random variables with a common continuous distribution F. The probability that any two variables assume the same value is zero, and we can therefore restrict our attention to samples of n distinct values. For fixed x the number of variables X_k such that X_k ≤ x has a binomial distribution with probability of "success" p = F(x), and so the random variable F_n(x) has a binomial distribution with possible values 0, 1/n, ..., 1. For large n and x fixed, F_n(x) is therefore likely to be close to F(x), and the central limit theorem tells us more about the probable deviations.

More interesting is the (chance-dependent) graph of F_n as a whole and how close it is to F. A measure for this closeness is the maximum discrepancy, that is,

(12.1)   D_n = sup_x |F_n(x) - F(x)|.

This is a new random variable of great interest to statisticians because of the following property: the probability distribution of the random variable D_n is independent of F (provided, of course, that F is continuous). For the proof it suffices to verify that the distribution of D_n remains unchanged when F is replaced by the uniform distribution. We begin by showing that the variables Y_k = F(X_k) are distributed uniformly in 0,1. For that purpose we restrict t to the interval 0,1, and in this interval we define v as the inverse function of F. The event {F(X_k) ≤ t} is then identical with the event {X_k ≤ v(t)}, and its probability equals F(v(t)) = t, as asserted. Now X_k ≤ x iff Y_k ≤ F(x), and hence the discrepancy at x between the empirical distribution of (X₁, ..., X_n) and F equals the discrepancy at F(x) between the empirical distribution of the uniform sample (Y₁, ..., Y_n) and the uniform distribution; the two maximal discrepancies D_n therefore coincide. The same argument in reverse completes the proof. ▶

The corresponding two-sample problem concerns independent samples X₁, ..., X_n and Y₁, ..., Y_n from a common continuous distribution and the maximal discrepancy D_{n,n} between the two empirical distribution functions. Its possible values are the numbers k/n, and B. V. Gnedenko and V. S. Koroljuk reduced its distribution to a random-walk problem: arrange the 2n sample values in increasing order and record a step +1 for each X and a step -1 for each Y; all orderings being equally likely, one obtains a symmetric random-walk path of 2n steps, and D_{n,n} < r/n iff this path returns to the origin at epoch 2n without touching ±r.

[An explicit expression for the probability in question is contained in 1; XIV,(9.1). The condition of not touching ±r can be realized by putting absorbing barriers at ±r, and so P{D_{n,n} < r/n} is the probability of a return to the origin at epoch 2n when ±r are absorbing barriers. In 1; XIV,(9.1) the interval is 0,a rather than -r,r.]

It was shown in 1; XIV that a limiting procedure leads from random walks to diffusion processes, and in this way it is not difficult to see that the distribution of √n D_{n,n} tends to a limit. Actually this limit was discovered by N. V. Smirnov as early as 1939, and the similar limit for √n D_n by A. Kolmogorov in 1933. Their calculations are very intricate and do not explain the connection with diffusion processes, which is inherent in the Gnedenko-Koroljuk approach. On the other hand, they have given impetus to fruitful work on the convergence of stochastic processes (P. Billingsley, M. F. Donsker, Yu. V. Prohorov, A. V.
Skorohod, and others) Tt may be mentioned thatthe Smienov theorems apply equally to discrepancies Dyn of the empirical distributions of samples of diffrent sizes m' and The randomwalk approach carries over, but loss much of its elegance and simplicity (B. V. Gnedenko, E_L. Rvateva). A great many variants of Dy, have been investigated by statisticians (Gee problem 36) 13. PROBLEMS FOR SOLUTION nal problems it is understood that the given variables are mutually independent. 1, Let X and Y have densities a¢~# concentrated on 0,2. Find the densities of ox Gi) 342K Gi) X -¥ tiv) IXY! (v) The smaller of X and ¥® (vi) The larger of X and ¥*, 2. Do the same problem if the densities of X and ¥ equal} in =T,T ando elsewhere, 3. Find the densities for X + ¥ and X — Y if X has density ze*#( > 0) and the density of Y equals A! for 0 <2 0. 5. Find the distribution functions of X+Y/X and X+Y/Z if the variables X, Y, and Z have a common exponential distribution, 6. Derive the convolution formula (3.6) for the exponential distribution by a direct passage to the limit from the convolution formula for the “negative binomial” distribution of 1; VI8.1). 7. In the Poisson process of section 4, denote by Z the time between epoch and the last preceding arrival or 0 (the “age” of the current interarrival time). Find the distribution of Z and show that it tends to the exponential distribution as Io. 8. In example 5(a) show that the probability of the first record value occurring. at the mth place and being X,> 0°" > Xw-1 1} = =mim+n). {In example $(a) we had_m = 1") (0) If-N is the first index 1 such that Xin 2 Ximoreay SHOW that roo C/(7) For r>2 wehave E(N) < © and 1 PIN < ma} +1 ~ Tae (6) If N iis the first index such that Xjin falls outside the interval between Xqy and X;q) then s(n —1) ab = mn) E(N) < @ PIN 9 Gemear and BN) < 12, (Convolutions of exponential distributions). For j =0,...,n let X, have density Ae-%* for x >0 where A; x Ay unless j =k. Put Vem (g—Ag) > «Cnn AeMlngr Ag) Ga =A Show that Xp -+-*- +X, has a density given by o Pat) = Fy? Analoyne 38! $0 + ne Hint: Use induction, a symmetry argument, and (2.14). No calculations are necessary. 38G. F, Newell, Operations Research, vol. 7 (1959), pp. $89-598. 1_S, Wilks, J. Australian Math, Soe., vol. 1 (1959) pp. 106-112. 13 PROBLEMS FOR SOLUTION 4 13, (Continuation). f Y, has the density je, the density of the sum Zoom (a) flod= Using the proposition of example 6(6) conclude that fa. is the density of the spread Xia —Xyy of @ sample X;,...,Xq if the X; have the common density “7, 14, Pure birth processes. In the pure birth process of 1; XVII,3 the system passes, through a sequence of states Ey > E, +, staying at “Ey for a sojourn time Xy with density e+, Thus Sy = Xp +--+ +Xq is the epoch of the transition Ey > Eqay. Denote by Py(t) the probability Of E, at epoch f. Show that PAO = PIS, > 1) — P{S,-, > 1} and hence that Py. is given by formula (*) of problem 12. "The differential equations of the process, namely PEO) = APD, Prt) = —AgP alt) + AnxPnaaln n>h should be derived (a) from (1), and (b) from the properties of the sums. Sq. Hint: Using inductively @ symmetry argument it suffices to consider the factor of e-*, 15. In example 6(a) for parallel waiting fines we say that the system is in state k if k counters are free. Show that the birth process model of the last example applies with 4, = (n—k)a, Conclude that ao ~([laennerrins ts, 2>0. From this derive the distribution of Xi 16. 
Consider two independent queues of m and 1 > m_ persons respectively, assuming the same exponential distribution for the service times. Show that the probability of the longer queue finishing first equals the probability of obtaining heads before m tails in a fair coin-tossing game, Find the samme probability also by considering the ratio X/Y of two variables with gamma distributions Gy and Ga given in G.5). 17. Example of statistical estimation. tis assumed that the lifetimes of electric bulbs have an exponential distribution with an unknown expectation a“. To estimate a sample of nm bulbs is taken and one observes the lifetimes ey f}, which is the tail of the distribution function of the shortest among the intervals) Prove the recurrence relation o nna 2 fs Conclude that pa(t) = Mt ~ (r+ DANE 23. From a recurrence relation analogous to (*) prove without calculations that for arbitrary iy 20,0641 20 oy) PUL, > tye Dg > tasth SOME my mo taal {This elegant result was derived by B. de Finett*” from geometrical considerations. It contains many interesting special cases. When 2, = h forall { we get the preced- ing problem. Example 7(6) corresponds to the special case where exacily one among ‘the x, is different from zero. ‘The covering theorem 3 of section 9 follows from (##) and the formula 1; 1V.(1.5) for the realization of at least one among m +1 events] 24, Denote by qy(c) the probability that all mutual distances of the Xz exceed 4h. (This differs from problem 22 in that no restrictions are imposed on the end intervals Ly and Ly...) Find a relation analogous to (*) and hence derive 9,(¢). 25. Continuation. Without using the solution of the preceding problems show a Priori that pa(t) = (r—2h)"7-"9,(¢—2h).. 26. Formulate the analogue to problem 24 for a circle and show that problem 23 furnishes its solution. Ap, (2) de. 27. Am isosceles triangle is formed by a unit vector in the z-direction and another in.a random direction. ‘Find the distribution of the length of the third side i) in Rand (il) in 8°. ¥ Giornale Istituto Italiano degli Attuari, vol. 27 (1964) pp. 151-173, in Italian, 113 PROBLEMS FOR SOLUTION 43 28. A unit circle (sphere) about 0 has the north pole on the positive z-axis. A ray enters at the north pole and its angle with the x-axis is distributed uniformly over —Jr, Tr. Find the distribution of the length of the chord within the circle (sphere). "Nore. In % the ray has a random direction and we are concerned with the analogue to example 10(@). In? the problem is new. 29. The ratio of the expected lengths of @ random vector and of its projection fon the x-axis equals 2 in x® and =/2 in 8? Hint: Use (10.2) and (10.8). 30. The length of a random vector is distributed uniformly over 0,7. Find the density of the length of its projection on the z-axis (a) in X%, and (6) in 32, Hint: Use (10.4) and (10.9). 31. Find the distribution function of the projection on the 2-axis of a randomly chosen direction in. 4, 32, Find the analogue in R* to the relation (10.2) between the distributions of the lengths of a random vector and that of its projection on the z-axis. Specialize 10 a unit vector to verify the result of problem 31. 33. A limit theorem for order statistics. (a) Let. X, uniformly in OT. Prove that for k fixed and n-> © 11%, be distributed Pfs < 3 Gyo), > 0, where G. is the gamma distribution (3.5) {see example 7(¢]. (6) If the X, have an arbitrary continuous distribution function F, the same limit exists for P(X, < O(z/n)} where ® is the inverse function of F. (Smimov.) 34. 
A limit theorem for the sample median. The nth-order statistic Xiu) of (Xi,.++Xey-1) is called the sample median. If the X; are independent and uniformly distributed over 0,1 show that PIX iq) — 4 X; 2 -*+ 2 Xw.a . Two distributions F and G, and also their densities f and g, are said to be of the same type if they stand in the relationship a) G(x) = Flar+d), g(x) = af(ax+b), where a> 0. We shall frequently refer to b as a centering parameter, to @ asa scale parameter. These terms are readily understood from the fact that when F serves as distribution function of a random variable X then G is the distribution function of (2) Y= In many contexts only the type of a distribution really matters. ing to common usage the closed interval I should be called the support of f. A new term is introduced because it will be used in the more general sense that a distribution ‘may be concentrated on the set of integers or rationals. 4s 46 SPECIAL DENSITIES, RANDOMIZATION mi The expectation m and variance o* of f (or of F) are defined by (2.3) ma [sees @=[Temmsede= [ese ae—m, provided the integrals converge absolutely. It is clear from (1.2) that in this case g has expectation (m—6)/a and variance oa’. It follows that for each type there exists at most one density with zero expectation and unit variance. ‘We recall from 1,(2.12) that the convolution fi and fy is the probability aeasiy defined by fife of two densities (14) f(z) = “he-nsen dy. When f, and fz are concentrated on 0,0 this formula reduces to as f= (‘Hew ninnan D0, ‘The former represents the density of the sum of two independent random var- iables with densities f, and f;. Note that for g,(z) = f(z+6,) the con- volution g = g, « goisgiven by g(x) = f(x-+b,+b,) asis obvious from (1.2) Finally we recall the standard normal distribution function and its density defined by Lit “ 1.6) n(z) = Ra) = dy. Vie B Our old acquaintance, the normal density with expectation m and variance o*, is given by (=) ome o\e Implicit in the central limit theorem is the basic fact that the family of normal densities is closed under convolutions; in other words, the convolution of two normal densities with expectations m, m, and variances o?, of is the normal density with expectation »m + mz and variance o? = 0? + a3. In view of what has been said it suffices to prove it for m, = m, = 0. It is asserted that a Jue ap [- a ~ fens f op [- « af ~ 4 “ and the truth of this assertion becomes obvious by the change of vs 2 = y(oloy0,) — x(a/00,) where x is fixed. (See problem 1.) m2 GAMMA DISTRIBUTIONS a7 2, GAMMA DISTRIBUTIONS ‘The gamma function T is defined by 1) re = [ererae, 1>0. [See 1; 11,(12.22).] It interpolates the factorials in the sense that P(an+l) =n! for n=0,1,.... Integration by parts shows that T(1) = (1-1) PU—1) for all 1>0. (Problem 2.) ‘The gamma densities concentrated on 0, % are defined by Lrg tense 2.2) frst) = Ft y>0, x>0. Here a > 0 isthe trivial scale parameter, but » > 0 isessential. The special case f,, represents the exponential density, and the densities g, of 1,(3,4) coincide with f,,, (1=1,2,.... A trite calculation shows that the expectation of f,, equals v/a, the variance v/a? The family of gamma densities is closed under convolutions: 3) San * Savy = Saynt H>0, ¥>0. This important property generalizes the theorem of I,3 and will be in constant use; the proof is exceedingly simple. By ([.5) the left side equals. 
(2.4)   (α^{μ+ν}/(Γ(μ)Γ(ν))) e^{-αx} ∫_0^x (x-y)^{μ-1} y^{ν-1} dy.

After the substitution y = xt this expression differs from f_{α,μ+ν}(x) by a numerical factor only, and this factor equals unity since both f_{α,μ+ν} and (2.4) are probability densities. The value of the last integral for x = 1 is the so-called beta integral B(μ,ν), and as a by-product of the proof we have found that

(2.5)   B(μ,ν) = ∫_0^1 (1-y)^{μ-1} y^{ν-1} dy = Γ(μ)Γ(ν)/Γ(μ+ν)

for all μ > 0, ν > 0. [For integral μ and ν this formula is used in 1; VI,(10.8) and (10.9). See also problem 3 of the present chapter.]

As to the graph of f_{α,ν}: it is clearly monotone if ν ≤ 1, and unbounded near the origin when ν < 1. For ν > 1 the graph of f_{α,ν} is bell-shaped, attaining at x = (ν-1)/α its maximum α(ν-1)^{ν-1}e^{-(ν-1)}/Γ(ν), which is close to α[2π(ν-1)]^{-½} (Stirling's formula, problem 12 of 1; II,12). It follows from the central limit theorem that, suitably centered and scaled, the gamma densities approach normality:

(2.6)   (√ν/α) f_{α,ν}((ν + x√ν)/α) → n(x), ν → ∞.
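The convolution rule (2.3) can be spot-checked by comparing quantiles of a simulated sum with quantiles of the claimed gamma law. In the sketch below (NumPy assumed) α, μ, ν are arbitrary illustrative values.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, mu, nu, trials = 2.0, 0.7, 1.8, 300_000

    s = rng.gamma(mu, 1 / alpha, trials) + rng.gamma(nu, 1 / alpha, trials)
    t = rng.gamma(mu + nu, 1 / alpha, trials)   # the law claimed by (2.3)

    qs = np.linspace(0.05, 0.95, 10)
    print(np.quantile(s, qs).round(3))
    print(np.quantile(t, qs).round(3))          # the two rows should agree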
*3. RELATED DISTRIBUTIONS OF STATISTICS

   * This section treats special topics and is not used in the sequel.

The gamma densities play a crucial, though sometimes disguised, role in mathematical statistics. To begin with, in the classical (now somewhat outdated) system of densities introduced by K. Pearson (1894) the gamma densities appear as "type III." A more frequent appearance is due to the fact that for a random variable X with normal density n the square X² has density x^{-½} n(x^{½}) = f_{½,½}(x). In view of the convolution property (2.3) it follows that if X₁, ..., X_n are mutually independent normal variables with expectation 0 and variance σ², then X₁² + ··· + X_n² has density f_{α,n/2} with α = (2σ²)^{-1}. To statisticians X₁² + ··· + X_n² is the "sample variance from a normal population," and its distribution is in constant use. For reasons of tradition (going back to K. Pearson) in this connection f_{α,n/2} is called the chi-square density with n degrees of freedom.

In statistical mechanics X₁² + X₂² + X₃² appears as the square of the speed of particles. Hence v(x) = 2x f_{½,3/2}(x²) represents the density of the speed itself. This is the Maxwell density found by other methods in I,(10.6). (See also the example in III,4.) In queuing theory the gamma distribution is sometimes called Erlangian.

Several random variables (or "statistics") of importance to statisticians are of the form T = X/Y, where X and Y are independent random variables, Y > 0. Denote their distributions by F and G, respectively, and their densities by f and g. As Y is supposed positive, g is concentrated on 0,∞, and so

(3.1)   P{T ≤ t} = P{X ≤ tY} = ∫_0^∞ F(ty) g(y) dy.

By differentiation it is found that the ratio T = X/Y has density

(3.2)   w(t) = ∫_0^∞ f(ty) y g(y) dy.

Examples. (a) If X and Y have densities f_{α,m} and f_{α,n}, then X/Y has density

(3.3)   w(t) = (Γ(m+n)/(Γ(m)Γ(n))) t^{m-1} (1+t)^{-(m+n)}, t > 0.

In fact, the integral in (3.2) equals

(3.4)   (α^{m+n} t^{m-1}/(Γ(m)Γ(n))) ∫_0^∞ e^{-α(1+t)y} y^{m+n-1} dy,

and the substitution α(1+t)y = s reduces it to (3.3).

In the analysis of variance one considers the special case X = X₁² + ··· + X_m² and Y = Y₁² + ··· + Y_n², where X₁, ..., X_m, Y₁, ..., Y_n are mutually independent variables with the common normal density n. The random variable F = (nX)/(mY) is called Snedecor's statistic, and its density (m/n)w((m/n)t) is Snedecor's density, or the F-density. The variable Z = log √F is Fisher's Z-statistic, and its density Fisher's Z-density. The two statistics are, of course, merely notational variants of each other.

(b) Student's T-density. Let X, Y₁, ..., Y_n be independent with the common normal density n. The variable

(3.5)   T = X / √((Y₁² + ··· + Y_n²)/n)

is known to statisticians as Student's T-statistic. We show that its density is given by

(3.6)   w(t) = C_n (1 + t²/n)^{-(n+1)/2}, where C_n = Γ(½(n+1)) / (√(πn) Γ(½n)).

In fact, the numerator in (3.5) has a normal density with zero expectation and variance n, while the density of the denominator is given by 2x f_{½,n/2}(x²). Thus (3.2) takes on the form

(3.7)   w(t) = (2^{1-n/2}/(√(2πn) Γ(½n))) ∫_0^∞ y^n e^{-½(1+t²/n)y²} dy.

The substitution s = ½(1+t²/n)y² reduces the integral to a gamma integral and yields (3.6). ▶

4. SOME COMMON DENSITIES

In the following it is understood that all densities vanish identically outside the indicated interval.

(a) The bilateral exponential is defined by ½αe^{-α|x|}, where α is a scale parameter. It has zero expectation and variance 2α^{-2}. This density is the convolution of the exponential density αe^{-αx} (x > 0) with the mirrored density αe^{αx} (x < 0). In other words, the bilateral exponential is the density of X₁ - X₂ when X₁ and X₂ are independent and have the common exponential density αe^{-αx} (x > 0). In the French literature it is usually referred to as the "second law of Laplace," the first being the normal distribution.

(b) The uniform (or rectangular) density ρ_a and the triangular density τ_a concentrated on -a,a are defined by

(4.1)   ρ_a(x) = 1/(2a), τ_a(x) = (1/a)(1 - |x|/a), |x| < a.

It is easily verified that ρ_a ★ ρ_a = τ_{2a}.

(c) The beta density with parameters μ > 0, ν > 0 is defined by

(4.2)   β_{μ,ν}(x) = (Γ(μ+ν)/(Γ(μ)Γ(ν))) (1-x)^{μ-1} x^{ν-1}, 0 < x < 1.

That (4.2) indeed defines a probability density follows from (2.5). By the same formula it is seen that β_{μ,ν} has expectation ν/(μ+ν) and variance μν/[(μ+ν)²(μ+ν+1)]. If μ < 1, ν < 1 the graph of β_{μ,ν} is U-shaped, tending to ∞ at the limits. If μ > 1, ν > 1 the graph is bell-shaped. For μ = ν = 1 we get the uniform density as a special case. A simple variant of the beta density is defined by

(4.3)   (Γ(μ+ν)/(Γ(μ)Γ(ν))) x^{μ-1} (1+x)^{-(μ+ν)}, x > 0.

If the variable X has density (4.2) then Y = X^{-1} - 1 has density (4.3). In the Pearson system the densities (4.2) and (4.3) appear as types I and VI; the Snedecor density (3.3) is a special case of (4.3). The densities (4.3) are sometimes called after the economist Pareto. It was thought (rather naively from a modern statistical standpoint) that income distributions should have a tail with a density ~ Ax^{-λ} as x → ∞, and (4.3) fulfills this requirement.

(d) The so-called arc sine density

(4.4)   (1/π) x^{-½}(1-x)^{-½}, 0 < x < 1,

is actually the same as the beta density β_{½,½}, but deserves special mention because of its repeated occurrence in fluctuation theory. (It was introduced in 1; III,4 in connection with the unexpected behavior of sojourn times.) The misleading name is unfortunately in general use; actually the distribution function is given by 2π^{-1} arc sin √x. (The beta densities with μ + ν = 1 are sometimes referred to as "generalized arc sine densities.")

(e) The Cauchy density centered at the origin is defined by

(4.5)   γ_t(x) = (1/π) · t/(t² + x²), -∞ < x < ∞,

where t > 0 is a scale parameter. The corresponding distribution function is ½ + π^{-1} arc tan (x/t). The graph of γ_t resembles that of the normal density but approaches the axis so slowly that an expectation does not exist. The importance of the Cauchy densities is due to the convolution formula

(4.6)   γ_s ★ γ_t = γ_{s+t}.

It states that the family of Cauchy densities (4.5) is closed under convolutions. Formula (4.6) can be proved in an elementary (but tedious) fashion by a routine decomposition of the integrand into partial fractions; a simpler proof depends on Fourier analysis. The convolution formula (4.6) has the amazing consequence that for independent variables X₁, ..., X_n with the common density (4.5) the average (X₁ + ··· + X_n)/n has the same density as the X_j.
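This "amazing consequence" is easy to witness numerically. Below (NumPy assumed, n = 50 an arbitrary choice) the empirical distribution of an average of 50 independent Cauchy variables is compared with that of a single one.

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 200_000

    x1 = rng.standard_cauchy(trials)
    xbar = rng.standard_cauchy((trials, 50)).mean(axis=1)

    for t in (0.5, 1.0, 4.0):
        print(f"P{{X <= {t}}} = {np.mean(x1 <= t):.4f},",
              f"P{{average <= {t}}} = {np.mean(xbar <= t):.4f}")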
Example. Consider a laboratory experiment in which a vertical mirror projects a horizontal light ray on a wall. The mirror is free to rotate about a vertical axis through A. We assume that the direction of the reflected ray is chosen "at random," that is, the angle φ between it and the perpendicular AO to the wall is distributed uniformly between -½π and ½π. The light ray intersects the wall at a point at distance X = t tan φ from O (where t is the distance AO of the center A from the wall). It follows at once that the random variable X has density (4.5). If the experiment is repeated n times the average (X₁ + ··· + X_n)/n has the same density, and so the averages do not cluster around 0 as one should expect by analogy with the law of large numbers. ▶

The Cauchy density has the curious property that if X has density γ_t, then 2X has density γ_{2t} = γ_t ★ γ_t. Thus 2X = X + X is the sum of two dependent variables, but its density is given by the convolution formula. More generally, if U and V are two independent variables with common density γ_t and X = aU + bV, Y = cU + dV (with positive coefficients), then X + Y has density γ_{(a+b+c+d)t}, which is the convolution of the densities γ_{(a+b)t} of X and γ_{(c+d)t} of Y; nevertheless, X and Y are not independent. (For a related example see problem 1 in III,9.)

A simple reformulation of this experiment leads to a physical interpretation of the convolution formula (4.6). Our argument shows that if a unit light source is situated at the origin, then γ_t represents the distribution of the intensity of light along the line y = t of the x,y-plane. Then (4.6) expresses Huygens' principle, according to which the intensity of light along y = s + t is the same as if the source were distributed along the line y = s following the density γ_s. (I owe this remark to J. W. Walsh.)

[The Cauchy density corresponds to the special case n = 1 of the family (3.6) of Student's T-densities. In other words, if X and Y are independent random variables with the normal density n, then X/|Y| has the Cauchy density (4.5) with t = 1. For some related densities see problems 5-6.]

The convolution property (2.3) of the gamma densities looks exactly like (4.6), but there is an important difference in that the parameter ν of the gamma densities is essential, whereas (4.6) contains only a scale parameter. With the Cauchy density the type is stable. This stability under convolutions is shared by the normal and the Cauchy densities; the difference is that the scale parameters compose according to the rules σ² = σ₁² + σ₂² and t = t₁ + t₂, respectively. There exist other stable densities with similar properties, and with a systematic terminology we should call the normal and Cauchy densities "symmetric, stable of exponent 2 and 1." (See VI.)

(f) One-sided stable distribution of index ½. If N is the normal distribution function of (1.6), then

(4.7)   F(x) = 2[1 - N(1/√x)], x > 0,

defines a distribution function with density

(4.8)   f(x) = (1/√(2π)) x^{-3/2} e^{-1/(2x)}, x > 0.

Obviously no expectation exists. This distribution was found in 1; III,(7.7) and again in 1; X,1 as limit of the distribution of recurrence times, and this derivation implies a composition rule: writing f_a(x) = (a/√(2π)) x^{-3/2} e^{-a²/(2x)}, so that f₁ = f,

(4.9)   f_a ★ f_b = f_{a+b}.

(A verification by elementary, but rather cumbersome, integrations is possible; the Fourier analytic proof is simpler.) If X₁, ..., X_n are independent random variables with the distribution (4.7), then (4.9) implies that (X₁ + ··· + X_n)n^{-2} has the same distribution, and so the averages (X₁ + ··· + X_n)n^{-1} are likely to be of the order of magnitude of n; instead of converging, they increase over all bounds. (See problems 7 and 8.)
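For (f), note that (4.7) is precisely the distribution of 1/Z² with Z standard normal, which makes the composition rule (4.9) easy to test by simulation; here two summands are used, so the sum scaled by 1/4 should reproduce the original law. (Language and sample size are choices of this illustration.)

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 400_000

    x = 1 / rng.standard_normal(trials) ** 2     # law (4.7): P{1/Z**2 <= s} = 2[1-N(1/sqrt(s))]
    y = 1 / rng.standard_normal(trials) ** 2

    qs = np.linspace(0.1, 0.9, 9)
    print(np.quantile((x + y) / 4, qs).round(3))  # by (4.9): same law as x itself
    print(np.quantile(x, qs).round(3))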
(g) Distributions of the form e-2"*(r > 0,2 > 0) appear in connection with order statistics (see problem 8). Together with the variant 1—e~® they appear (rather mysteriously) under the name of Weibull disteibutions in statistical reliability theory. (h) The logistic distribution function 1 (4.10) Fu) a>o Tee ‘may serve as a warning. An unbelievably huge literature tried to establish a transcendental “law of logistic growth”; measured in appropriate units, practically all growth processes Ws RANDOMIZATION AND MIXTURES 33 ‘were supposed to be represented by a function of the form (4.10) with ¢ representing time. Lengthy tables, complete with chi-square tests, supported this thesis for human populations, for bacterial colonies, development of railroads, etc. Both height and weight of plants and animals were found to follow the logistic law even though itis theoretically ‘lear that these two variables cannot be subject to the same distribution. Laboratory ‘experiments on bacteria showed that not even systematic disturbances can produce other results, Population theory relied on logistic extrapolations (even though they were demonstrably unreliable). ‘The only trouble with the theory is that not only the logistic distribution but also the normal, the Cauchy, and other distributions can be fitted to the same material with the same or better goodness of fit In this competition the logistic distribution plays no distinguished role whatever; most contradictory theoretical models ‘can be supported by the same observational material ‘Theories of this nature are short-lived because they open no new ways, and new con- firmations ofthe same old thing soon grow boring. But the naive reasoning as such has not been superseded by common sense, and so it may be useful to have an explicit demonstration of how misleading a mere goodness of fit can be. 5, RANDOMIZATION AND MIXTURES Let F be a distribution function depending on a parameter 9, and w a probability density. Then (1) Wea) =| ” Fa, 6) u(0) a9 isa monotone function of x increasing from 0 to 1 and hence a distribution function. If F has a continuous density f, then W has. density w given by (52) w(2) -f "fx, 0) u(0) d0, Instead of integrating with respect to a density u we can sum with respect to a discrete probability distribution: if 0,,0,,... are chosen arbitrarily and if py 0, Epp = 1, then (5:3) wa) = LI, 9) Pe defines a new probability density. The process may be described proba- bilistically as randomization; the parameter 6 is treated as random variable and a new probability distribution is defined in the z, -plane, which serves as sample space. Densities of the form (5.3) are called mixtures, and the term is now used generally for distributions and densities of the form (5.1) and (2), We do not propose at this juncture to develop a general theory. Our aim is rather to illustrate by a few examples the scope of the method and its 3 W. Feller, On the lagistic law of growth and its empirical verifications in biology, Acta Biotheoretica, vol. 5 (1940) pp. 31-66. 54 SPECIAL DENSITIES. RANDOMIZATION WS probabilistic content. The examples serve also as preparation for the notion of conditional probabilities. The next section is devoted to examples of discrete distributions obtained by randomization of a continuous parameter. Finally, section 7 illustrates the construction of continuous processes out of random walks; as a by-product we shall obtain distributions occurring in many applications and otherwise requiring hard calculations. 
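As a first concrete illustration of (5.1)-(5.2): mixing the exponential distribution F(x,θ) = 1 - e^{-θx} over a gamma density u for θ yields, by direct integration, W(x) = 1 - (λ/(λ+x))^ν, a Pareto-type law as in (4.3). The ingredients (exponential F, gamma u, the parameter values) are choices of this sketch, not of the text.

    import numpy as np

    rng = np.random.default_rng(0)
    nu, lam, trials = 2.0, 1.0, 300_000

    theta = rng.gamma(nu, 1 / lam, trials)   # randomized parameter, density u
    x = rng.exponential(1 / theta)           # X given theta is exponential

    for t in (0.5, 1.0, 3.0):
        print(t, np.mean(x <= t), 1 - (lam / (lam + t)) ** nu)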
Examples. (a) Ratios. If X is a random variable with density f, then for fixed y > 0 the variable X/y has density f(xy)y. Treating the parameter y as a random variable with density g we get the new density

(5.4)   w(x) = ∫_0^∞ f(xy) y g(y) dy.

This is the same as formula (3.2) on which the discussion in section 3 was based. In probabilistic language, randomizing the denominator y in X/y means considering the random variable X/Y, and we have merely rephrased the derivation of the density (3.2) of X/Y. In this particular case the terminology is a matter of taste.

(b) Random sums. Let X₁, X₂, ... be mutually independent random variables with a common density f. The sum S_n = X₁ + ··· + X_n has the density f^{n★}, namely the n-fold convolution of f with itself. [See I,2.] The number n of terms is a parameter which we now randomize by a probability distribution P{N = n} = p_n. The density of the resulting sum S_N with the random number N of terms is

(5.5)   w = Σ_{n=1}^∞ p_n f^{n★}.

As an example take for {p_n} the geometric distribution p_n = q p^{n-1}, and for f an exponential density. Then f^{n★} = g_n is given by (2.2), and

(5.6)   w(x) = qα e^{-αx} Σ_{n=1}^∞ (pαx)^{n-1}/(n-1)! = qα e^{-qαx}.

(c) Application to queuing. Consider a single server with exponential servicing time distribution (density f(t) = μe^{-μt}) and assume the incoming traffic to be Poisson, that is, the inter-arrival times are independent with density λe^{-λt}, λ < μ. The model is described in 1; XVII,7(b). Arriving customers join a (possibly empty) "waiting line" and are served in order of arrival without interruption. Consider a customer who on his arrival finds n ≥ 0 other customers in the line. The total time that he spends at the server is the sum of the service times of these n customers plus his own service time: a random variable with density f^{(n+1)★}. We saw in 1; XVII,(7.10) that in the steady state the probability of finding exactly n customers in the waiting line equals qp^n with p = λ/μ. Assuming this steady state, we see that the total time T spent by a customer at the server is a random variable with density

Σ_{n=0}^∞ q p^n f^{(n+1)★}(t) = qμ e^{-μt} Σ_{n=0}^∞ (pμt)^n/n! = (μ-λ) e^{-(μ-λ)t}.

Thus E(T) = 1/(μ-λ). (See also problem 10.)

(d) Waiting lines for buses. A bus is supposed to appear every hour on the hour, but is subject to delays. We treat the successive delays X_k as independent random variables with a common distribution F and density f. For simplicity we assume 0 ≤ X_k < 1. Denote by T_x the waiting time of a person arriving at epoch x < 1 after noon. The probability that the bus scheduled for noon has already departed is F(x), and for 0 < t < 1 - x the probability that the person catches the noon bus within time t equals F(t+x) - F(x). (The waiting time beyond 1 - x involves in like manner the delay of the next scheduled bus.) ▶

6. DISCRETE DISTRIBUTIONS

This section is devoted to a quick glance at some results of randomizing binomial and Poisson distributions. The number S_n of successes in n Bernoulli trials has a distribution depending on the probability p of success. Treating p as a random variable with density u leads to the new distribution

(6.1)   P{S_n = k} = C(n,k) ∫_0^1 p^k (1-p)^{n-k} u(p) dp.

Examples. (a) When u(p) = 1, repeated integrations by parts show (6.1) to be independent of k, and (6.1) reduces to the discrete uniform distribution P{S_n = k} = (n+1)^{-1}. More illuminating is an argument due to Bayes. Consider n + 1 independent variables X₀, ..., X_n distributed uniformly between 0 and 1. The integral in (6.1) (with u = 1) equals the probability that exactly k among the variables X₁, ..., X_n will be smaller than X₀; by symmetry the rank of X₀ among the n + 1 variables is equally likely to be any of the n + 1 possible values, and so this probability equals (n+1)^{-1}. In gambling language (6.1) corresponds to the situation when a skew coin is picked by a chance mechanism and then n trials are performed with this coin of unknown structure.
To a gambler the trials do not look independent; indeed, if a Jong sequence of heads is observed it becomes likely that for our coin p is close to 1 and so it is safe to bet on further occurrences of heads. Two formal examples may illustrate estimation and prediction problems of this type, Examples. (6) Given that 1 trials resulted in k successes (= hypothesis H), what is the probability of the event that p Turning to the Poisson distribution let us interpret it as regulating the number of “arrivals” during a time interval of duration 1. ‘The expected m6 DISCRETE DISTRIBUTIONS 37 number of arrivals is af. We illustrate two conceptually different ran- domization procedures. Examples. (d) Randomized time. If the dur random variable with density u, the probability p, of exactly & arrivals becomes 0 n= [eran For example, if the time interval is exponentially distributed, the probability of k=0,1,... new arrivals equals which is a geometric distribution. (€) Stratification. Suppose there are several independent sources for random artivals, each source having a Poisson output, but with different parameters. For example, accidents in a plant during a fixed exposure time # may be assumed to represent Poisson variables, but the parameter will vary from plant to plant. Similarly, telephone calls originating at an individual unit may be Poissonian with the expected number of calls varying from unit to unit. In such processes the parameter a appears as random variable with a density u, and the probability of exactly n arrivals during time ¢ is given by or) rol (65) ne f "¢ 7! u(x) da, For the special case of a gamma density u = fy... we get 61 ro= ("*) GEG nl\B+d Wore which is the limiting form of the Polya distribution as given in problem 24 of 4; V8 and 1; XVI1,(10.2) (setting B = at, » = a? — 1). » ‘Note on spurious contagion. A curious and instructive history attaches to the distribution 462) and its dual nature, ‘The Polya urn model and the Polya process which lead to (6.7) ate models for true contagion where every accident effectively inereases the probability of future accidents. ‘This model enjoyed great popularity, and (6.7) was fitted empirically to a variety of phenomena, a good fit being taken as an indication of true contagion By coincidence, the same distribution (6.7) has been derived previously (in 1920) by M. Greenwood and G. U. Yule with the intent that a good fit should disprove presence of contagion. Their derivation is roughly equivalent to our stratification model, which starts 58 SPECIAL DENSITIES. RANDOMIZATION 7 from the assumption underlying the Poisson process, namely, that there is no aftereffect whatever. We Rave thus the curious fact that a good fit ofthe same distribution may be interpreted in two ways diametccally opposite in their nature as well asin their practical implications. ‘This should serve as a warning against too hasty interpretations of statistical data. ‘The explanation lies in the phenomenon of spurious contagion, described in 1; V,2(d) and above in connection with (6.1). Tn the present situation, having observed m accidents uring a time interval of length + one may estimate the probability of m accidents during ‘a future exposure of duration 1 by a formula analogous to (6.3). 
Note on spurious contagion. A curious and instructive history attaches to the distribution (6.7) and its dual nature. The Polya urn model and the Polya process, which lead to (6.7), are models for true contagion, where every accident effectively increases the probability of future accidents. This model enjoyed great popularity, and (6.7) was fitted empirically to a variety of phenomena, a good fit being taken as an indication of true contagion. By coincidence, the same distribution (6.7) had been derived previously (in 1920) by M. Greenwood and G. U. Yule with the intent that a good fit should disprove the presence of contagion. Their derivation is roughly equivalent to our stratification model, which starts from the assumption underlying the Poisson process, namely, that there is no aftereffect whatever. We have thus the curious fact that a good fit of the same distribution may be interpreted in two ways diametrically opposite in their nature as well as in their practical implications. This should serve as a warning against too hasty interpretations of statistical data. The explanation lies in the phenomenon of spurious contagion, described in 1; V,2(d) and above in connection with (6.1). In the present situation, having observed m accidents during a time interval of length s, one may estimate the probability of n accidents during a future exposure of duration t by a formula analogous to (6.3). The result will depend on m, but this dependence is due to the method of sampling rather than to nature itself; the information concerning the past enables us to make better predictions concerning the future behavior of our sample, and this should not be confused with the future of the whole population.

7. BESSEL FUNCTIONS AND RANDOM WALKS

Surprisingly many explicit solutions in diffusion theory, queuing theory, and other applications involve Bessel functions. It is usually far from obvious that the solutions represent probability distributions, and the analytic theory required to derive their Laplace transforms and other relations is rather complex. Fortunately, the distributions in question (and many more) may be obtained by simple randomization procedures. In this way many relations lose their accidental character, and much hard analysis can be avoided.

By the Bessel function of order p > -1 we shall understand the function I_p defined for all real x by*

(7.1)    I_p(x) = \sum_{k=0}^\infty \frac{(x/2)^{2k+p}}{k!\, Γ(k+p+1)}.

* According to standard usage I_p is the "modified" Bessel function or Bessel function "with imaginary argument." The "ordinary" Bessel function, always denoted by J_p, is defined by inserting the factor (-1)^k on the right in (7.1). Our use of the term Bessel function should be understood as an abbreviation rather than an innovation.

We proceed to describe three procedures leading to three different types of distributions involving Bessel functions.

(a) Randomized Gamma Densities

For fixed p > -1 consider the gamma density f_{1,p+1} of (2.2). Taking the parameter k in f_{1,p+1+k} as an integral-valued random variable subject to a Poisson distribution we get, in accordance with (5.3), the new density

(7.2)    w_p(x) = \sum_{k=0}^\infty e^{-t} \frac{t^k}{k!} f_{1,p+1+k}(x) = e^{-t-x} \sum_{k=0}^\infty \frac{t^k x^{p+k}}{k!\, Γ(p+k+1)}.

Comparing terms in (7.1) and (7.2) one sees that

(7.3)    w_p(x) = e^{-t-x} (x/t)^{p/2} I_p(2\sqrt{tx}),    x > 0.

If p > -1 then w_p is a probability density concentrated on 0, ∞. (For p = -1 the right side is not integrable with respect to x.) Note that t is not a scale parameter, so that these densities are of different types. Incidentally, from this construction and the convolution formula (2.3) for the gamma densities it is clear that

(7.4)    w_p ⋆ w_q = w_{p+q+1},

where the parameter equals t on the left and 2t on the right.
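A direct numerical check of (7.2) and (7.3) is reassuring. The sketch below (order p and parameter t chosen arbitrarily; the Poisson series is truncated after 200 terms) compares the randomized gamma density with the closed Bessel form, using scipy's modified Bessel function:

    import numpy as np
    from scipy import special, stats

    p, t = 0.5, 2.0
    x = np.linspace(0.1, 10.0, 5)

    # Left side of (7.2): Poisson mixture of the gamma densities f_{1, p+1+k}.
    ks = np.arange(200)
    weights = stats.poisson.pmf(ks, t)
    mixture = np.array([np.sum(weights * stats.gamma.pdf(xi, a=p + 1 + ks))
                        for xi in x])

    # Right side of (7.3): closed form involving the Bessel function I_p.
    bessel = np.exp(-t - x) * (x / t) ** (p / 2) * special.iv(p, 2 * np.sqrt(t * x))

    print(np.max(np.abs(mixture - bessel)))   # of the order of rounding errors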
(b) Randomized Random Walks

In discussing random walks one pretends usually that the successive jumps occur at epochs 1, 2, .... It should be clear, however, that this convention merely lends color to the description and that the model is entirely independent of time. An honest continuous-time stochastic process is obtained from the ordinary random walk by postulating that the time intervals between successive jumps correspond to independent random variables with the common density e^{-t}. In other words, the epochs of the jumps are regulated by a Poisson process, but the jumps themselves are random variables assuming the values +1 and -1 with probabilities p and q, independent of each other and of the Poisson process. To each distribution connected with the random walk there corresponds a distribution for the continuous-time process, which is obtained formally by randomization of the number of jumps.

To see the procedure in detail consider the position at a given epoch t. In the basic random walk the nth step leads to the position r > 0 iff among the first n jumps ½(n+r) are positive and ½(n-r) negative. This is impossible unless n - r = 2ν is even. In this case the probability of the position r just after the nth jump is

(7.5)    \binom{2ν+r}{ν} p^{ν+r} q^ν,    n = 2ν + r.

In our Poisson process the probability that up to epoch t exactly n = 2ν + r jumps occur is e^{-t} t^n/n!, and so in our time-dependent process the probability of the position r ≥ 0 at epoch t equals

(7.6)    a_r(t) = \sum_{ν=0}^\infty e^{-t} \frac{t^{2ν+r}}{(2ν+r)!} \binom{2ν+r}{ν} p^{ν+r} q^ν = (p/q)^{r/2}\, e^{-t} I_r(2\sqrt{pq}\, t),

and we reach two conclusions.

(i) If we define I_{-r} = I_r for r = 1, 2, 3, ..., then for fixed t > 0, p, q,

(7.7)    a_r(t) = (p/q)^{r/2}\, e^{-t} I_r(2\sqrt{pq}\, t),    r = 0, ±1, ±2, ...,

represents a probability distribution (that is, a_r ≥ 0 and Σ a_r = 1).

(ii) In our time-dependent random walk a_r(t) equals the probability of the position r at epoch t.

Two famous formulas for Bessel functions are immediate corollaries of this result. First, with the change of notations 2\sqrt{pq}\, t = x and p/q = u², the identity Σ a_r(t) = 1 becomes

(7.8)    e^{\frac{1}{2} x (u + u^{-1})} = \sum_{r=-\infty}^{+\infty} u^r I_r(x).

This is the so-called generating function for Bessel functions, or Schlömilch's formula (which sometimes serves as definition for I_r).

Second, it is clear from the nature of our process that the probabilities a_r(t) must satisfy the Chapman-Kolmogorov equation

(7.9)    a_r(s+t) = \sum_{k=-\infty}^{+\infty} a_k(s)\, a_{r-k}(t),

which expresses the fact that at epoch s the particle must be at some position k and that a transition from k to r is equivalent to a transition from 0 to r-k. We shall return to this relation in XVII,3. [It is easily verified directly from the representation (7.6) and the analogous formula for the probabilities in the random walk.] The Chapman-Kolmogorov relation (7.9) is equivalent to

(7.10)    I_r(s+t) = \sum_{k=-\infty}^{+\infty} I_k(s)\, I_{r-k}(t),

which is known as K. Neumann's identity.

(c) First Passages

For simplicity let us restrict our attention to symmetric random walks, p = q = ½. According to 1; III,(7.5), the probability that the first passage through the point r > 0 occurs at the jump number 2n-r is

(7.11)    \frac{r}{2n-r} \binom{2n-r}{n} 2^{-(2n-r)}.

The random walk being recurrent, such a first passage occurs with probability one, that is, for fixed r the quantities (7.11) add up to unity. In our time-dependent process the epoch of the kth jump has the gamma density f_{1,k} of (2.2). It follows that the epoch of the first passage through r > 0 has density

(7.12)    \sum_n \frac{r}{2n-r} \binom{2n-r}{n} 2^{-(2n-r)} \frac{t^{2n-r-1}}{(2n-r-1)!}\, e^{-t} = \frac{r}{t}\, e^{-t} I_r(t).

Thus:

(i) For fixed r = 1, 2, ...,

(7.13)    v_r(t) = \frac{r}{t}\, e^{-t} I_r(t)

defines a probability density concentrated on 0, ∞.

(ii) The epoch of the first passage through r > 0 has density v_r. (See problem 15.)

This derivation permits another interesting conclusion. A first passage through r+ρ at epoch t presupposes a previous first passage through r at some epoch s < t. Because of the independence of the jumps in the time intervals 0, s and s, t and the lack of memory of the exponential waiting times we must have

(7.14)    v_{r+ρ} = v_r ⋆ v_ρ.

[A computational verification of this relation from (7.12) is easy if one uses the corresponding convolution property of the probabilities (7.11).] Actually the proposition (i) and the relation (7.14) are true for all positive values of the parameters r and ρ.*

* W. Feller, Infinitely divisible distributions and Bessel functions associated with random walks, J. Soc. Indust. Appl. Math., vol. 14 (1966), pp. 864-875.

8. DISTRIBUTIONS ON A CIRCLE

The half-open interval 0, 1 may be taken as representing the points of a circle of unit length, but it is preferable to wrap the whole line around the circle. The circle then receives an orientation, and the arc length runs from -∞ to ∞, but x, x±1, x±2, ... are interpreted as the same point. Addition is modulo 1, just as addition of angles is modulo 2π. A probability density on the circle is a periodic function f ≥ 0 such that

(8.1)    \int_0^1 f(x)\, dx = 1.

Examples. (a) Buffon's needle problem (1777). The traditional formulation is as follows. A plane is partitioned into strips of unit width parallel to the y-axis. A needle of unit length is thrown at random. What is the probability that it lies athwart two strips?

To state the problem formally consider first the center of the needle. Its position is determined by two coordinates, but y is disregarded and x is reduced modulo 1. In this way "the center of the needle" becomes a random variable X on the circle with a uniform distribution. The direction of the needle may be described by the angle (measured clockwise) between the needle and the y-axis. A turn through π restores the position of the needle, and hence the angle is determined only up to a multiple of π. We denote it by Zπ. In Buffon's needle problem it is implied that X and Z are independent and uniformly distributed variables** on the circle with unit length.

** The sample space of the pair (X, Z) is a torus.

If we choose to represent X by values between 0 and 1 and Z by values between -½ and ½, the needle crosses a boundary iff ½ cos Zπ > X or ½ cos Zπ > 1 - X. For a given value z between -½ and ½ the probability that X < ½ cos zπ is the same as the probability that 1 - X < ½ cos zπ, namely ½ cos zπ. Thus the required probability is

(8.2)    \int_{-1/2}^{1/2} \cos zπ\, dz = \frac{2}{π}.
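A simulation sketch of the needle experiment (the sample size is arbitrary) reproduces the value 2/π ≈ 0.6366:

    import numpy as np

    rng = np.random.default_rng(3)
    trials = 1_000_000

    x = rng.uniform(size=trials)               # center of the needle, reduced modulo 1
    z = rng.uniform(-0.5, 0.5, size=trials)    # the angle variable Z

    half_cos = 0.5 * np.cos(np.pi * z)
    crossings = (half_cos > x) | (half_cos > 1 - x)
    print(crossings.mean(), 2 / np.pi)         # both ≈ 0.6366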
A random variable X on the line may be reduced modulo 1 to obtain a variable °X on the circle. Rounding errors in numerical calculations are random variables of this kind. If X has density f, the density of °X is given by*

(8.3)    g(x) = \sum_{n=-\infty}^{+\infty} f(x+n).

Every density on the line thus induces a density on the circle. [It will be seen in XIX,5 that the same g admits of an entirely different representation in terms of Fourier series. For the special case of normal densities see example XIX,5(e).]

* Readers worried about convergence should consider only densities f concentrated on a finite interval. The uniform convergence is obvious if f is monotone for x and -x sufficiently large. Without any conditions on f the series may diverge at some points, but g always represents a density because the partial sums in (8.3) represent a monotone sequence of functions whose integrals tend to 1. (See IV,2.)

Examples. (b) Poincaré's roulette problem. Consider the number of rotations of a roulette wheel as a random variable X with a density f concentrated on the positive half-axis. The observed net result, namely the point °X at which the wheel comes to rest, is the variable X reduced modulo 1. Its density is given by (8.3).

One feels instinctively that "under ordinary circumstances" the density of °X should be nearly uniform. In 1912 H. Poincaré put this vague feeling on the solid basis of a limit theorem. We shall not repeat this analysis because a similar result follows easily from (8.3). The tacit assumption is, of course, that the given density f is spread out effectively over a long interval so that its maximum m is small. Assume for simplicity that f increases up to a point a where it assumes its maximum m = f(a), and that f decreases for x > a. For the density g of the reduced variable °X we have then

(8.4)    g(x) - 1 = \sum_{n=-\infty}^{+\infty} f(x+n) - \int_{-\infty}^{+\infty} f(s)\, ds.

For fixed x the points x + n partition the line into intervals of unit length; on each side of the maximum the function f is monotone, so that every term f(x+n), with the possible exception of the two terms nearest to a, is majorized by the integral of f over an adjacent interval of unit length. Comparing the sum in (8.4) with the integral term by term one finds in this way that |g(x) - 1| ≤ 2m: the density of °X is indeed nearly uniform whenever the maximum m of f is small.

(c) First significant digits. Let Y > 0 be a random variable with some unknown distribution. The first significant digit of Y equals k iff 10^n k ≤ Y < 10^n (k+1) for some integer n, that is, iff the variable X = log_{10} Y reduced modulo 1 lies between log_{10} k and log_{10} (k+1). Whenever the density of °X is nearly uniform, the first significant digit of Y therefore equals k with probability near log_{10} ((k+1)/k); thus the digit 1 occurs with probability near log_{10} 2 ≈ 0.301 rather than the naively expected 1/9.

9. PROBLEMS FOR SOLUTION

8. If X_1, ..., X_n are independent with the common stable density (4.8), then

P{n^{-2} \max(X_1, ..., X_n) ≤ x} → e^{-\sqrt{2/(πx)}},    x > 0.

9. Let X and Y be independent with densities f and g concentrated on 0, ∞. If E(X) < ∞, the ratio X/Y has a finite expectation iff \int_0^\infty y^{-1} g(y)\, dy < ∞.
10. In example 5(c) find the density of the waiting time to the next discharge (a) if at epoch 0 the server is empty, (b) under steady-state conditions.

11. In example 5(d) show that

E(T_x) = F(x)(μ + 1 - x) + \int_x^1 (s-x) f(s)\, ds,

where μ is the expectation of F. From this verify the assertion concerning E(T_x) when x is uniformly distributed.

12. In example 5(d) find the waiting time distribution when f(s) = 1 for 0 < s < 1.

15. Show that in the randomized random walk of section 7 a first passage through r > 0 occurs with probability one provided p ≥ q. Show that the only change in (7.11) is that 2^{-(2n-r)} is replaced by p^n q^{n-r}, and that the conclusion is: for p ≥ q and r = 1, 2, ...,

v_r(t) = \frac{r}{t}\, e^{-t} (p/q)^{r/2} I_r(2\sqrt{pq}\, t)

defines a probability density concentrated on t > 0.

16. Let X and Y be independent variables, and °X and °Y the same variables reduced modulo 1. Show that °(X+Y) is obtained by reducing °X + °Y modulo 1. Verify the corresponding formula for convolutions by direct calculation.

CHAPTER III

Densities in Higher Dimensions. Normal Densities and Processes

For obvious reasons multivariate distributions occur less frequently than one-dimensional distributions, and the material of this chapter will play almost no role in the following chapters. On the other hand, it covers important material, for example, a famous characterization of the normal distribution and tools used in the theory of stochastic processes. Their true nature is best understood when divorced from the sophisticated problems with which they are sometimes connected.

1. DENSITIES

For typographical convenience we refer explicitly to the Cartesian plane R², but it will be evident that the number of dimensions is immaterial. We refer the plane to a fixed coordinate system with coordinate variables X_1, X_2. (A more convenient single-letter notation will be introduced in section 5.)

A non-negative integrable function f defined in R² and such that its integral equals one is called a probability density, or density for short. (All the densities occurring in this chapter are piecewise continuous, and so the concept of integration requires no comment.) The density f attributes to the region Ω the probability

(1.1)    P{Ω} = \iint_Ω f(x_1, x_2)\, dx_1\, dx_2,

provided, of course, that Ω is sufficiently regular for the integral to exist. All such probabilities are uniquely determined by the probabilities of rectangles parallel to the axes, that is, by the knowledge of

(1.2)    P{a_1 < X_1 ≤ b_1,\ a_2 < X_2 ≤ b_2} = \int_{a_1}^{b_1} \int_{a_2}^{b_2} f(x_1, x_2)\, dx_2\, dx_1

for all combinations a_j < b_j. Letting a_1 = a_2 = -∞ we get the distribution function F of f, namely

(1.3)    F(x_1, x_2) = P{X_1 ≤ x_1,\ X_2 ≤ x_2}.

Obviously F(b_1, x_2) - F(a_1, x_2) is the probability of a semi-infinite strip of width b_1 - a_1 and, the rectangle appearing in (1.2) being the difference of two such strips, the probability (1.2) equals the so-called mixed difference F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2). It follows that the knowledge of the distribution function F uniquely determines all probabilities (1.1). Despite the formal analogy with the situation on the line, the concept of distribution function is much less useful in the plane, and it is best to concentrate on the assignment of probabilities (1.1) in terms of the density itself. This assignment differs from the joint probability distribution of two discrete random variables (1; IX,1) in two respects. First, integration replaces summation and, second, probabilities are now assigned only to "sufficiently regular" regions, whereas in discrete sample spaces all sets had probabilities.
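For a concrete check of the mixed-difference rule, take independent exponential coordinate variables, for which the distribution function factors. The sketch below (the rectangle is an arbitrary choice) computes the probability of a rectangle once as a double integral of the density and once as a mixed difference of F:

    import numpy as np
    from scipy import integrate

    # f(x1, x2) = exp(-x1 - x2) on the positive quadrant, with
    # F(x1, x2) = (1 - exp(-x1)) * (1 - exp(-x2)).
    f = lambda x1, x2: np.exp(-x1 - x2)
    F = lambda x1, x2: (1 - np.exp(-x1)) * (1 - np.exp(-x2))

    a1, b1, a2, b2 = 0.3, 1.2, 0.5, 2.0

    prob_integral, _ = integrate.dblquad(f, a1, b1, a2, b2)
    prob_mixed = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)
    print(prob_integral, prob_mixed)   # equal up to quadrature error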
As the present chapter treats only simple examples in which the difference is hardly noticeable, the notions and terms of the discrete theory carry over in a self-explanatory manner. Just as in the preceding chapters, we employ therefore a probabilistic language without any attempt at a general theory (which will be supplied in chapter V).

It is apparent from (1.3) that*

(1.4)    P{X_1 ≤ x_1} = F(x_1, ∞).

Thus F_1(x_1) = F(x_1, ∞) defines the distribution function of X_1, and its density f_1 is given by

(1.5)    f_1(x_1) = \int_{-\infty}^{+\infty} f(x_1, x_2)\, dx_2.

* Here and in the following U(∞) = lim U(x) as x → ∞, and the use of the symbol U(∞) implies the existence of the limit.

When it is desirable to emphasize the connection between X_1 and the pair (X_1, X_2) we again speak of F_1 as marginal distribution** and of f_1 as marginal density. The expectation μ_1 and variance σ_1² of X_1 (if they exist) are given by

(1.6)    μ_1 = E(X_1) = \int\int x_1 f(x_1, x_2)\, dx_1\, dx_2

and

(1.7)    σ_1² = Var(X_1) = \int\int (x_1 - μ_1)² f(x_1, x_2)\, dx_1\, dx_2.

** Projection on the axes is another accepted term.

By symmetry these definitions apply also to X_2. Finally, the covariance of X_1 and X_2 is

(1.8)    Cov(X_1, X_2) = \int\int (x_1 - μ_1)(x_2 - μ_2) f(x_1, x_2)\, dx_1\, dx_2.

The normalized variables X_j σ_j^{-1} are dimensionless, and their covariance, namely ρ = Cov(X_1, X_2) σ_1^{-1} σ_2^{-1}, is the correlation coefficient of X_1 and X_2 (see 1; IX,8).

A random variable U is a function of the coordinate variables X_1 and X_2; again we consider for the present only functions such that the probabilities P{U ≤ t} can be evaluated by integrals of the form (1.1). Thus each random variable will have a unique distribution function, each pair will have a joint distribution, etc.

In many situations it is expedient to change the coordinate variables, that is, to let two variables Y_1, Y_2 play the role previously assigned to X_1, X_2. In the simplest case the Y_j are defined by a linear transformation

(1.9)    X_1 = a_{11} Y_1 + a_{12} Y_2,    X_2 = a_{21} Y_1 + a_{22} Y_2,

with determinant Δ = a_{11} a_{22} - a_{12} a_{21} > 0. Generally a transformation of the form (1.9) may be described either as a mapping from one plane to another or as a change of coordinates in the same plane. Introducing the change of variables (1.9) into the integral (1.1) we get

(1.10)    P{Ω} = \iint_{Ω_1} f(a_{11} y_1 + a_{12} y_2,\ a_{21} y_1 + a_{22} y_2)\, Δ\, dy_1\, dy_2,

the region Ω_1 containing all points (y_1, y_2) whose image (x_1, x_2) is in Ω. Since the events (X_1, X_2) ∈ Ω and (Y_1, Y_2) ∈ Ω_1 are identical, it is seen that the joint density of (Y_1, Y_2) is given by

(1.11)    g(y_1, y_2) = f(a_{11} y_1 + a_{12} y_2,\ a_{21} y_1 + a_{22} y_2)\, Δ.

All this applies equally to higher dimensions. A similar argument applies to more general transformations, except that the determinant Δ is replaced by the Jacobian. We shall use explicitly only the change to polar coordinates

(1.12)    X_1 = R cos Θ,    X_2 = R sin Θ,

with (R, Θ) restricted to R ≥ 0, -π < Θ ≤ π. Here the density of (R, Θ) is given by

(1.13)    g(r, θ) = f(r cos θ,\ r sin θ)\, r.

In three dimensions one uses the geographic longitude φ and latitude θ (with -π < φ ≤ π and -½π ≤ θ ≤ ½π). The coordinate variables in the polar system are then defined by

(1.14)    X_1 = R cos Φ cos Θ,    X_2 = R sin Φ cos Θ,    X_3 = R sin Θ.

For their joint density one gets

(1.15)    g(r, φ, θ) = f(r cos φ cos θ,\ r sin φ cos θ,\ r sin θ)\, r² cos θ.

In the transformation (1.14) the "planes" Θ = -½π and Θ = ½π correspond to the half axes in the x_3-direction, but this singularity plays no role since these half axes have zero probability. A similar remark applies to the origin for polar coordinates in the plane.
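Formula (1.13) is easily tested numerically. For the familiar bivariate normal density f(x_1, x_2) = (2π)^{-1} e^{-(x_1²+x_2²)/2} it yields g(r, θ) = (2π)^{-1} r e^{-r²/2}, that is, Θ is uniform and R has density r e^{-r²/2}, the two being independent. A simulation sketch (sample size arbitrary; the correlation test is of course only a necessary condition for independence):

    import numpy as np

    rng = np.random.default_rng(4)
    x1, x2 = rng.standard_normal((2, 500_000))

    r = np.hypot(x1, x2)
    theta = np.arctan2(x2, x1)

    # (1.13) predicts the marginal density r * exp(-r**2 / 2) for R.
    hist, edges = np.histogram(r, bins=50, range=(0, 4), density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - mid * np.exp(-mid**2 / 2))))  # small binning error
    print(np.corrcoef(r, theta)[0, 1])                       # ≈ 0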
Examples. (a) Independent variables. In the last chapters we considered independent variables X_1 and X_2 with densities f_1 and f_2. This amounts to defining a bivariate density by f(x_1, x_2) = f_1(x_1) f_2(x_2), and the f_j represent the marginal densities.

(b) "Random choice." Let Γ be a bounded region; for simplicity we assume Γ convex. Denote the area of Γ by γ and put f equal to γ^{-1} within Γ and equal to 0 outside Γ. Then f is a density, and the probability of any region Ω ⊂ Γ equals the ratio of the areas of Ω and Γ. By obvious analogy with the one-dimensional situation we say that the pair (X_1, X_2) is distributed uniformly over Γ. The marginal density of X_1 at the abscissa x_1 equals γ^{-1} times the width of Γ at x_1, in the obvious sense of the word. (See problem 1.)

(c) Uniform distribution on a sphere. The unit sphere Σ in three dimensions may be represented in terms of the geographic longitude φ and latitude θ by the equations

(1.16)    x_1 = cos φ cos θ,    x_2 = sin φ cos θ,    x_3 = sin θ.

To each pair (φ, θ) with -π < φ ≤ π and -½π ≤ θ ≤ ½π there corresponds exactly one point of the sphere.

(c) Let the interval 0, 1 be partitioned into n+1 subintervals by n points chosen in it independently and uniformly, and denote the lengths of these subintervals by U_1, ..., U_{n+1}. For the joint density of (U_1, ..., U_n) we get u(u_1, ..., u_n) = n! in the region u_1 > 0, ..., u_n > 0, u_1 + ⋯ + u_n < 1, and hence (U_1, ..., U_n) is distributed uniformly over this region. This result is stronger than the previously established fact that the U_k have a common distribution function [example I,7(b) and problems in I,13].

(d) Once more the randomness of the exponential distribution. Let X_1, ..., X_{n+1} be independent with the common density αe^{-αx} for x > 0. Put S_k = X_1 + ⋯ + X_k. Then (S_1, S_2, ..., S_{n+1}) is obtained from (X_1, ..., X_{n+1}) by a linear transformation of the form (1.9) with determinant 1. Denote by Ω the "octant" of points with x_1 > 0, ..., x_{n+1} > 0. The density of (X_1, ..., X_{n+1}) is concentrated on Ω and is given there by α^{n+1} e^{-α(x_1 + ⋯ + x_{n+1})}. The variables S_1, ..., S_{n+1} map Ω onto the region Ω* defined by 0 < s_1 ≤ s_2 ≤ ⋯ ≤ s_{n+1} < ∞, and [see (1.11)] within Ω* the density of (S_1, ..., S_{n+1}) is given by α^{n+1} e^{-α s_{n+1}}. The marginal density of S_{n+1} is known to be the gamma density α^{n+1} s^n e^{-αs}/n!, and hence the conditional density of the n-tuple (S_1, ..., S_n) given that S_{n+1} = s equals n! s^{-n} for 0 < s_1 < ⋯ < s_n < s (and zero elsewhere). In other words, given that S_{n+1} = s, the variables (S_1, ..., S_n) are uniformly distributed over their possible range. Comparing this with example (b) we may say that given S_{n+1} = s, the variables (S_1, ..., S_n) represent n points chosen independently and at random in the interval 0, s, numbered in their natural order from left to right.

(e) Another distribution connected with the exponential. With a view to a surprising application we give a further example of a transformation. Let again X_1, ..., X_n be independent variables with a common exponential distribution and S_n = X_1 + ⋯ + X_n. Consider the variables U_1, ..., U_n defined by

(3.3)    U_k = X_k/S_n   for k < n,    U_n = S_n,

or, what amounts to the same,

(3.4)    X_k = U_k U_n   for k < n,    X_n = (1 - U_1 - ⋯ - U_{n-1}) U_n.

The density of (X_1, ..., X_n) is concentrated on the octant x_1 > 0, ..., x_n > 0, and in it this density is given by α^n e^{-α(x_1 + ⋯ + x_n)}. The Jacobian of (3.4) equals u_n^{n-1}, and it follows that the joint density of (U_1, ..., U_n) is given by α^n u_n^{n-1} e^{-α u_n} in the region Ω* defined by

u_1 > 0, ..., u_{n-1} > 0,    u_1 + ⋯ + u_{n-1} < 1,    u_n > 0,

and that it vanishes outside Ω*. An integration with respect to u_n shows that the joint density of (U_1, ..., U_{n-1}) equals (n-1)! in the corresponding region and 0 elsewhere. Comparing with example (c) we see that (U_1, ..., U_{n-1}) has the same distribution as if U_k were the length of the kth interval in a random partition of 0, 1 by n-1 points.
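The conclusion of example (e) lends itself to a quick experimental check. The sketch below (arbitrary n and sample size) generates the two (n-1)-tuples, normalized exponentials on the one hand and lengths of a random partition on the other, and compares the distributions of their maxima at a few quantiles:

    import numpy as np

    rng = np.random.default_rng(5)
    n, trials = 6, 100_000

    # Construction of example (e): exponential variables normalized by their sum.
    x = rng.exponential(size=(trials, n))
    u_exp = x[:, :-1] / x.sum(axis=1, keepdims=True)

    # Random partition of (0, 1) by n-1 uniform points: first n-1 interval lengths.
    points = np.sort(rng.uniform(size=(trials, n - 1)), axis=1)
    grid = np.hstack([np.zeros((trials, 1)), points, np.ones((trials, 1))])
    u_part = np.diff(grid, axis=1)[:, :-1]

    # The two (n-1)-tuples should share one distribution; compare the maxima.
    for q in (0.5, 0.9, 0.99):
        print(q, np.quantile(u_exp.max(axis=1), q), np.quantile(u_part.max(axis=1), q))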
Conversely, reasonable assumptions on the random variables X,, ¥, lead to a stochastic process with sample functions given by (3.5). For a time it was fashionable to introduce models of this form and to detect “hidden periodicities” for sunspots, wheat prices, poetic creativity, etc. Such hidden periodicities used to be discovered as easily as witches in medieval times, but even strong faith must be fortified by a statistical test. The method is roughly as follows. A trigonometric polynomial of the form (3.5) with well-chosen frequencies Gy. +5 Wy is fitted to some observational data, and a particularly large amplitude R, is observed. One wishes to prove that this cannot be due to U4 CHARACTERIZATION OF THE NORMAL DISTRIBUTION 1 chance and hence that «, is a true period. To test this conjecture one asks whether the large observed value of R, is plausibly compatible with the hypothesis that all_ components play the same role. For a test one assumes, accordingly, that the coefficients X,,...,¥, are mutually independent with a common normal distribution with zero expectation and variance o. In this case (see 11,3) the R? are mutually independent and have a common exponential distribution with expectation 20% If an observed value R? deviated “significantly” from this predicted expectation it was customary to jump to the conclusion that the hypothesis of equal weights was untenable, and R, represented a “hidden periodicity. The fallacy of this reasoning was exposed by R. A. Fisher (1929) who pointed out that the maximum among m independent observations does not obey the same probability distribution as each variable taken separately. ‘The error of treating the worst case statistically as if it had been chosen at random is still common in medical statistics, but the reason for discussing the matter here is the surprising and amusing connection of Fisher's test of significance with covering theorems. ‘As only the ratios of the several components are significant we normalize the coefficients by letting G6) y, a Rif + RE Since the R? have a common exponential distribution we can use the preceding example with X, = 3. Then Vy = Uy,..-,Vya = Uy, but V,=1—U,—+:+-U,4. Accordingly, the ntuple (Vy....,V,) és distributed as the length of the n intervals into which 0,1 is partitioned by a random distribution of n—\ points. The probability that all V, be less than a is therefore given by formula 1,(9.9) of the covering theorem. This result illustrates the occurrence of unexpected relations between apparently unconnected problems.” » *4, A CHARACTERIZATION OF THE NORMAL DISTRIBUTION Con: jer a non-degenerate linear transformation of coordinate variables 4.) y= ayXy + aXe, Yp = Xs + aerXey 7 Fisher derived the distribution of the maximal term among the V, in 1929 without kknowiedge of the covering theorem, and explained in 1940 the equivalence with the covering theorem after W. L. Stevens had proved the latter. (See papers No, 16 and 37 in Fisher's Contributions to Mathematical Statistics, Jobn Wiley, New York (1950) For an alternative derivation using Fourier analysis see U. Grenander and M. Rosenblatt 950, ™ This section treats a special topic and is not used in the sequel. ® DENSITIES IN HIGHER DIMENSIONS m4 and suppose (without loss of generality) that the determinant A= 1. If X and X, are independent normal variables with variances o? and of the distribution of the pair (¥,, ¥.) is normal with covariance dnt + Miedo [sce example 1(d)]. 
In this case there exist non-trivial choices of the coeffi- cients aj, such that Y, and Y, are independent. The following theorem shows that this property of the univariate normal distribution is not shared by any other distribution. We shall here prove it only for distributions with continuous densities, in which case it reduces to a lemma concerning the functional equation (4.3). By the use of characteristic functions the most general case is reduced to the same equation, and so our proof will really yield the theorem in its greatest generality (see XV,8). ‘The elementary treatment of densities reveals better the basis of the theorem. ‘The transformation (4.1) is meaningful only if no coefficient a,, vanishes. Indeed, suppose for example that a,, = 0. Without loss of generality we may choose the scale parameters so that ay = 1. Then Y, = Xs, and a glance at (4.4) shows that in this case Y, must have the same density as X,. In other words, such a transformation amounts to a mere renaming of the variables, and need not be considered. Theorem. Suppose that X, and X, are independent of each other, and that the same is true of the pair Y,, Ys. If no coefficient aj, vanishes then all four variables are normal. ‘The most interesting special case of (4.1) is presented by rotations, namely transformations of the form (42) Y=Xpcosw+Xsinw, Y¥,= —X sinw + X,cos o where @ is not a multiple of 47. Applying the theorem to them we get Corollary. If X, and Xz are independent and there exists one rotation (4.2) such that Y, and Y, are also independent, then X, and Xz have normal distributions with the same variance. In this ease Y, and Y, are independent for every 1. Example, Maxwell distribution of velocities. In his study of the velocity distributions of molecules in R? Maxwell assumed that in every Cartesian coordinate system the three components of the velocity are mutually independent random variables with zero expectation. Applied to rotations leaving one axis fixed our corollary shows immediately that the three com- ponents are normally distributed with the same variance. As we saw in 11,3 this implies the Maxwell distribution for velocities. > un CHARACTERIZATION OF THE NORMAL DISTRIBUTION 9 ‘The theorem has a long history going back to Maxwell's investigations. Purely prob- abilistic studies were initiated by M. Kac (1940) and S. Bernstein (1941), who proved ‘our corollary assuming finite variances, An impressive number of authors contributed improvements and variants, sometimes by rather deep methods. The development ‘culminates in a result proved by V. P. Skitovig.® Now to the proof in the case of continuous densities. We denote the densities of X, and Y, respectively by u; and fj. For abbreviation we put (43) Ya = yt, + Ae, Ya = Ont, + det, Under the conditions of the theorem we must have 44) Aly falve) = aC) ol). We shall show that this relation implies that (4.5) Fy) = te, u(z) = be" where the exponents are polynomials of degree 2 or lower. The only probability densities of this form are the normal densities. For distributions with continuous densities the theorem is therefore contained in the following Lemma. Suppose that four continuous functions f, and u, are connected by the functional equation (4.4), and that no coefficient aj, vanishes. The functions are then of the form (4.5) where the exponents are polynomials of degree <2. (It is, of course, assumed that none of the functions vanishes identically.) Proof. We note first that none of our functions can have a zero. 
Indeed, otherwise there would exist a domain Q in the x, zyplane in which the two members of (4.4) have no zeros and on whose boundary they vanish. But the two sides require on the one hand that the boundary consists of segments parallel to the axes, on the other hand of segments parallel to the lines y, = const. This contradiction shows that no such boundary exists. ‘We may therefore assume our functions to be strictly positive. Passing to logarithms we can rewrite (4.4) in the form (4.6) rs) + P2(Y2) = 4(%) + alr 9). For fixed fy and hy define the mixed difference operator A by (AI) Ales ta) = Oly, tot) — Oe hy, 24h) — = OG, tythy) + (hy, Zh). * Lavestia Acad. Nauk SSSR, vol. 18 (1954) pp. 185-200. The theorem: Let Xyy.-. 1 Xq ‘be mutually independent, Y, = Eo,X,, and Y, = Eb,X, where no coefficient is 0. If Yq and Y¥, are independent the X, are normally distributed. 80 DENSITIES IN HIGHER DIMENSIONS MLS Because each «, depends on the single variable 2, it follows that Ac, Also (4.8) AgiQn) = GiQh +h) ~ A+) — Ge) + HAH) where we put for abbreviation (4.9) 1 = Oyhy + Ayala, fy = dyhy — Ayah We have thus Ay, + Ag, =0 with , depending on the single variable Ys Keeping yz fixed one sees that Ay,(y,) is a constant depending only on hy and h,, We now choose /, and hy so that 4 =f and f,=0, where 1 is arbitrary, but fixed. The relation Ag, = const. then takes on the form (4.10) PUA+D + PAD — 2a) = AC. Near a point y, at which , assumes a minimum the left side is >0, and hence such a point can exist only if 2(¢) > 0 for all rin some neighborhood of the origin. But in this case g, cannot assume a maximum. Now a continuous function vanishing at three points assumes both a maximum and a minimum. We conclude that if a continuous solution of (4.10) vanishes at three distinct points, then it is identically zero. Every quadratic polynomial q(y,) = ay? + fy; + y satisfies an equation of the form (4.10) (with a different right side), and hence the same is true of the difference gy(y,) — g(y,). But q can be chosen such that this difference vanishes at three prescribed points, and then ,(y,) is identical with g. The same argument applies to g.. and this proves the assertion concerning, fi and fz, Since the variables X, and Y, play the same role, the same ‘argument applies to the densities u, » 5, MATRIX NOTATION. THE COVARIANCE MATRIX ‘The notation employed in section | is messy and becomes more so in higher dimensions. Elegance and economy of thought may be achieved by the use of matrix notation. For ease of reference we summarize the few facts of matrix theory and the notations used in the sequel. The basic rue i: first rows, then columns. Thus an a by @ matrix A has a rowsand columns; its elements are denoted by aya, the first index indicating the row. If B isa fby y matrix withelements 6, the product 4B isthe eeby "matrix with elements aj,byu + dyabau +--+ aipbgy. No product is defined if the number of columns fof A does not agree with the number of rows of B. The associative law (AB)C = ABC) holds, whereas in general AB x BA. The transpose AT isthe (fby matrix with elements af = ays, Obviously (ABT) = BAT. ‘A one by 2 matrix with a single row is called a row vector; a matrix with a single ‘column, a column vector A row vector r= (Fy... +a) is easily printed, but a column * This is veally an abuse of language. 
In a concrete case *, may represent pounds and 2, cows; then (2,2) is no “vector” in the strict sense LS MATRIX NOTATION, THE COVARIANCE MATRIX 8t vector is better indicated by its transpose eT = (cy,....¢,). Note that er is an a by & ‘matrix (of the “multiplication table” type) whereas re is a one by one matrix, oF scalar. ee ‘The zero vector has all components equal to 0. (revtrece) Matrices with the same number of rows and columns are called square matrices. With fa square matrix A there is associated its determinant, a number which will be denoted by |A\. For our purposes it sufies to know that the determinants ate multiplicative: if A find B fare square matrices and C = AB, then |C| = |4\- |B}. The transpose A? has the same determinant as. By identity mairix is meant a square matrix with ones in the main diagonal and zeros atall other places. If 1 isthe identity matrix with r rowsand columnsand A an r by r matrix, obviously JA = Al = A. By inverse of A_is meant a matrix A"? such that AA? & AMA = 1. [Only square matrices can have inverses. The inverse is unique, for if B isany inverse of A wehave AB = 1 and by the astociativelaw A! = (4-14)B = B.) AA square matrix without inverse is called singular. The multiplicative property of deter- ‘minants implies that 2 matrix with zero determinant is singular. The converse is also true if |A| #0 then is non-singular. In other words, a matrix A is singular iff there exists a non-zero vector © such that x4 = 0. ‘A square matrix A is symmetric if aye = days that is. 4 ‘associated with a symmetric r by matrix A is defined by A. The quadratic form past = Sayer where 2... yy are indeterminates, ‘The matrix is positive definite if xAx > O for alt ‘on-2ro veciors 2. If follows from the last criterion that a postive definite matrix is non Singular. ‘Rotations in 84, For completeness we mention briefly a geometric application of matrix calculus although it will not be used inthe sequel “The mer product of tW0 FOW Vectors 2 = (Fy. .»- sy) andy = Cy.) is dined by ryt aye = Sey, ‘The length L of 2 is given by L® = a27. If x and y are vectors of unit length the angle 5 between them is given by €os 8 = 2y?. An a by a matrix A induces a transformation mapping = into $=-r4; for the transpose one has 7 = AT2?. The matrix A is orthogonal ifthe induced transformation preserves lengths and angles, that isto say, if any two row vectors have the same inner product as their images: Thus A is orthogonal iff for any pair of row vectors =, AAT YE my? ‘This implies that AAT is the identity matrix J as can be seen by choosing for = and y vectors with 2 — 1 vanishing components. We have thus found that 4 is orthogonal if” AAT ws J. Since A and AT have the same determinant it follows that it equals +1 or =I. An orthogonal matrix with determinant {is called a rovation matrix and the induced transformation isa rotation, 82 DENSITIES IN HIGHER DIMENSIONS WLS From now on we denote a point of the rdimensional space R’ by a single letter to be interpreted as a row vector. Thus 2 = (z,...,%,) and fle) = flz,...,,), ete. Inequalities are to be interpreted coordinatewise: tZ is the product of the determinants of A and the transformation Y-+ Z and hence it is positive. > Theorem 3. The matrices Q and M are inverses of each other and 66) y= Qny (MI where |M|= |Ql" is the determinant of M. Proof. With the notations of the preceding theorem put (62) D = E(Z"Z) = C™MC. This is a matrix with diagonal elements E(Z3) = a} and zero elements outside the diagonal. 
The density of Z is the product of the normal densities n(z_k σ_k^{-1}) σ_k^{-1} and hence is induced by the matrix D^{-1} with diagonal elements σ_k^{-2}. Now the density of Z is obtained from the density (6.2) of X by the substitution x = zC^{-1} and multiplication by the determinant |C^{-1}|. Accordingly

(6.8)    z D^{-1} z^T = x Q x^T

and

(6.9)    (2π)^r |D| = γ^{-2} |C|².

From (6.8) it is seen that

(6.10)    Q = C D^{-1} C^T,

and in view of (6.7) this implies Q = M^{-1}. From (6.7) it follows also that |D| = |M|·|C|², and hence (6.9) is equivalent to (6.6).

The theorem implies in particular that a factorization of M corresponds to an analogous factorization of Q, and hence we have the

Corollary. If (X_1, X_2) is normally distributed, then X_1 and X_2 are independent iff Cov(X_1, X_2) = 0, that is, iff X_1 and X_2 are uncorrelated.

More generally, if (X_1, ..., X_r) has a normal density, then (X_1, ..., X_n) and (X_{n+1}, ..., X_r) are independent iff Cov(X_j, X_k) = 0 for j ≤ n < k.

Warning. The corollary depends on the joint density of (X_1, X_2) being normal and does not apply if it is only known that the marginal densities of X_1 and X_2 are normal. In the latter case the density of (X_1, X_2) need not be normal and, in fact, need not exist. This fact is frequently misunderstood. (See problems 2-3.)

Theorem 4. A matrix M is the covariance matrix of a normal density iff it is positive definite.

Since the density is induced by the matrix Q = M^{-1}, an equivalent formulation is: A matrix Q induces a normal density (6.2) iff it is positive definite.

Proof. We saw at the end of section 5 that every covariance matrix of a density is positive definite. The converse is trivial when r = 1, and we proceed by induction. Assume Q positive definite. For x_1 = ⋯ = x_{r-1} = 0 we get q(x) = q_{rr} x_r², and hence q_{rr} > 0. Under this hypothesis, we saw that q may be reduced to the form (6.5). Choosing x_r such that y_r = 0 we see that the positive definiteness of Q implies q_1 > 0 for all choices of x_1, ..., x_{r-1} that are not all zero. By the induction hypothesis, therefore, q_1 corresponds to a normal density in r-1 dimensions. From (6.5) it is now obvious that q corresponds to a normal density in r dimensions, and this completes the proof.
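Theorem 3 is easy to verify numerically: sampling from the normal density determined by a covariance matrix M recovers M empirically, and the determinants behave as (6.6) requires. A sketch (the matrix M below is an arbitrary positive definite choice; a Cholesky factor plays the role of the matrix C of the preceding theorem):

    import numpy as np

    rng = np.random.default_rng(7)

    # An arbitrary positive definite covariance matrix M (r = 3).
    a = rng.standard_normal((3, 3))
    M = a @ a.T + 3.0 * np.eye(3)
    Q = np.linalg.inv(M)

    # Build row vectors X with covariance matrix M from independent normal components.
    c = np.linalg.cholesky(M)
    x = rng.standard_normal((500_000, 3)) @ c.T

    print(np.round(np.cov(x.T), 3))   # ≈ M, the inverse of Q
    print(np.round(M, 3))

    # |M| = |Q|**-1, so the normalization constant (6.6) is well defined.
    print(np.isclose(np.linalg.det(M), 1.0 / np.linalg.det(Q)))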
We conclude this general theory by an interpretation of (6.5) in terms of conditional densities which leads to a general formulation of the regression theory explained for the two-dimensional case in example 2(a). Put for abbreviation a_k = -q_{rk}/q_{rr}, so that

(6.11)    y_r = q_{rr} (x_r - a_1 x_1 - ⋯ - a_{r-1} x_{r-1}).

For a probabilistic interpretation of the coefficients a_k we recall that Y_r was found to be independent of X_1, ..., X_{r-1}. In other words, the a_k are numbers such that

(6.12)    T = X_r - a_1 X_1 - ⋯ - a_{r-1} X_{r-1}

is independent of (X_1, ..., X_{r-1}), and this property uniquely characterizes the coefficients a_k.

To obtain the conditional density of X_r for given X_1 = x_1, ..., X_{r-1} = x_{r-1} we must divide the density of (X_1, ..., X_r) by the marginal density of (X_1, ..., X_{r-1}). In view of (6.5) we get an exponential with exponent -½ y_r²/q_{rr}. It follows that the conditional density of X_r for given X_1 = x_1, ..., X_{r-1} = x_{r-1} is normal with expectation a_1 x_1 + ⋯ + a_{r-1} x_{r-1} and variance 1/q_{rr}. Accordingly

(6.13)    E(X_r | X_1, ..., X_{r-1}) = a_1 X_1 + ⋯ + a_{r-1} X_{r-1}.

We have thus proved the following generalization of the two-dimensional regression theory embodied in (2.6).

Theorem 5. If (X_1, ..., X_r) has a normal density, the conditional density of X_r for given X_1, ..., X_{r-1} is again normal. Furthermore, the conditional expectation (6.13) is the unique linear function of X_1, ..., X_{r-1} making T independent of (X_1, ..., X_{r-1}). The conditional variance equals Var(T) = 1/q_{rr}.

Example. Sample mean and variance. In statistics the random variables

(6.14)    \bar X = \frac{1}{n}(X_1 + ⋯ + X_n),    S² = \frac{1}{n} \sum_{k=1}^n (X_k - \bar X)²

are called the sample mean and sample variance of X = (X_1, ..., X_n). It is a curious fact that if X_1, ..., X_n are independent normal variables with E(X_k) = 0 and E(X_k²) = σ², the random variables \bar X and S² are independent. The proof illustrates the applicability of the preceding results. We put Y_k = X_k - \bar X for k < n. Then Cov(\bar X, Y_k) = Cov(\bar X, X_k) - Var(\bar X) = σ²/n - σ²/n = 0, and since (\bar X, Y_1, ..., Y_{n-1}) has a normal distribution, it follows from the corollary that \bar X is independent of (Y_1, ..., Y_{n-1}). But X_n - \bar X = -(Y_1 + ⋯ + Y_{n-1}), so that S² is a function of (Y_1, ..., Y_{n-1}); hence \bar X and S² are independent.

If ρ < r the distribution of Y is degenerate in ρ dimensions.
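The independence of \bar X and S² asserted in the last example can be observed directly. The sketch below (arbitrary n and sample size) computes the correlation of \bar X and S², which vanishes for normal samples; for comparison, the same correlation is clearly positive for exponential samples, to which the example does not apply:

    import numpy as np

    rng = np.random.default_rng(8)
    n, trials = 5, 400_000

    x = rng.standard_normal((trials, n))
    mean = x.mean(axis=1)
    var = x.var(axis=1)                   # sample variance with divisor n, as in (6.14)

    # Independence implies zero correlation (a necessary-condition check).
    print(np.corrcoef(mean, var)[0, 1])   # ≈ 0 for normal samples

    y = rng.exponential(size=(trials, n))
    print(np.corrcoef(y.mean(axis=1), y.var(axis=1))[0, 1])   # clearly nonzero

Zero correlation alone would of course not prove independence, but for normal samples the preceding theory guarantees it.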
