Probability and Calculus
A century ago, French treatises on the theory of probability were commonly called “Le Calcul des
Probabilités”—“The Calculus of Probabilities.” The name has fallen out of fashion, perhaps due to the
potential confusion with integral and differential calculus, but it seems particularly apt for our present topic.
It suggests a system of rules that are generally useful for calculation, where the more modern “probability
theory” has a speculative connotation. Theories may be overthrown or superseded; a calculus can be used
within many theories. Rules for calculation can be accepted even when, as with probability, there may be
different views as to the correct interpretation of the quantities being calculated.
The interpretation of probability has been a matter of dispute for some time, although the terms of the
dispute have not remained constant. To say that an event (such as the occurrence of a Head in the toss
of a coin) has probability 1/2 will mean to some that if the coin is tossed an extraordinarily large number
of times, about half the results will be Heads, while to others it will be seen as a subjective assessment,
an expression of belief about the uncertainty of the event that makes no reference to an idealized (and
not realizable) infinite sequence of tosses. In this book we will not insist upon any single interpretation of
probability; indeed, we will find it convenient to adopt different interpretations at different times, depending
upon the scientific context. Probabilities may be interpreted as long run frequencies, in terms of random
samples from large populations, or as degrees of belief. While not philosophically pure, this opportunistic
approach will have the benefit of permitting us to develop a large body of statistical methodology that can
appeal to and be useful to a large number of people in quite varied situations.
We will, then, begin with a discussion of a set of rules for manipulating or calculating with probabilities,
rules which show how we can go from one assignment of probabilities to another, without prejudice to the
source of the first assignment. Usually we will be interested in reasoning from simple situations to complex
situations.
1.1 Probabilities of Events.
The rules will be introduced within the framework of what we will call an experiment. We will be
purposefully vague as to exactly what we mean by an experiment, only describing it as some process with an
observable outcome. The process may be planned or unplanned, a laboratory exercise or a passive historical
recording of facts about society. For our purposes, the important point that specifies the experiment is that
there is a set, or list, of all possible outcomes of the experiment, called the sample space and denoted S.
An event (say E) is then a set of possible outcomes, a subset of S. We shall see that usually the same
actual experiment may be described in terms of different sample spaces, depending upon the purpose of the
description.
The notation we use for describing and manipulating events is borrowed from elementary set theory. If
E and F are two events, both subsets of the same sample space S, then the complement of E (denoted E^c,
or sometimes E′) is the set of all outcomes not in E, the intersection of E and F (E ∩ F) is the set of all
outcomes in both E and F, and the union of E and F (E ∪ F) is the set of all outcomes in E or in F or in
both E and F . It is often convenient to represent these definitions, and arguments associated with them, in
terms of shaded regions of Venn diagrams, where the rectangle S is the sample space and the areas E and
F two events.
[Figure 1.1]
If E and F have no common outcomes, they are said to be mutually exclusive. Even with only these
elementary definitions, fairly complicated relationships can be described. For example, consider the event
(A ∩ B) ∪ (A ∩ B^c), where A and B are two events. Then a simple consideration of Venn diagrams shows
that in fact this describes the same set of outcomes as A:

(A ∩ B) ∪ (A ∩ B^c) = A.
[Figure 1.2]
Thus even without any notion of what the symbol P for probability may mean, we would have the identity
P((A ∩ B) ∪ (A ∩ B^c)) = P(A).
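Identities of this kind are easy to check by direct enumeration. The following Python sketch is an illustration only; the particular sample space and events are arbitrary choices:

    S = {1, 2, 3, 4, 5, 6}            # a small sample space
    A = {1, 3, 5}                     # an arbitrary event
    B = {1, 2}                        # another arbitrary event
    Bc = S - B                        # the complement of B
    assert (A & B) | (A & Bc) == A    # (A ∩ B) ∪ (A ∩ B^c) = A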
Of course, for such equations to be useful we will need to define probability. As mentioned earlier,
we avoid giving a limited interpretation to probability for the present, though we may, whatever the inter-
pretation, think of it as a measure of uncertainty. But for all interpretations, probability will have certain
properties, namely those of an additive set function. With respect to our general sample space S these are:

(1.1) For any event E in S, 0 ≤ P(E) ≤ 1, and P(S) = 1.

(1.2) If E and F are mutually exclusive events in S (that is, if E ∩ F is empty), then
P(E ∪ F) = P(E) + P(F).

(An alternative would be to measure uncertainty on some other scale, for example by the log odds,
log_e(P(E)/(1 − P(E))). This measures probability on a scale from −∞ to ∞, with 0 as a middle value (corresponding to P(E) = 1/2).
But for present purposes, the zero-to-one scale represented by P(E) is convenient.)
Together (1.1) and (1.2) imply a number of useful other properties. For example,

(1.3) Complementation: For any E in S, P(E^c) = 1 − P(E).
(1.4) General additivity: For any E and F in S (not necessarily mutually exclusive),
P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
(1.5) Finite additivity: For any finite collection of mutually exclusive events E1 , E2 , . . . , En ,
P(E₁ ∪ E₂ ∪ · · · ∪ Eₙ) = P(E₁) + P(E₂) + · · · + P(Eₙ).
Properties (1.1) and (1.2) are not sufficiently strong to imply the more general version of (1.5), namely:
(1.6) Countable additivity: For any countably infinite collection of mutually exclusive events E1 , E2 , . . .
in S,
P(E₁ ∪ E₂ ∪ · · ·) = P(E₁) + P(E₂) + · · · .
For our purposes this is not a restrictive additional condition, so we shall add it to (1.1) and (1.2) as an
assumption we shall make about the probabilities we deal with. In some advanced applications, for example
where the sample space is a set of infinite sequences or a function space, there are useful probability measures
that satisfy (1.1) and (1.2) but not (1.6), however.
Probabilities may in some instances be specified by hypothesis for simple outcomes, and the probabilities
of more complex events computed from these rules. Indeed, in this chapter we shall only consider such
hypothetical probabilities, and turn to empirical questions in a following chapter. A trivial example will
illustrate.
Example 1.A: We might describe the experiment of tossing a single six-sided die by the sample space
S = {1, 2, 3, 4, 5, 6}, where the possible outcomes are the numbers on the upper face when the die comes to
rest. By hypothesis, we might suppose the die is “fair” and interpret this mathematically as meaning that
each of these six outcomes has an equal probability; P ({1}) = 1/6, P ({2}) = 1/6, etc. As mentioned earlier,
this statement is susceptible to several interpretations: it might represent your subjective willingness to bet
on #1 at 5 to 1 odds, or the fact that in an infinite sequence of hypothetical tosses, one-sixth will show #1.
But once we accept the hypothesis of equally likely faces under any interpretation, the calculations we make
are valid under that interpretation. For example, if

E = “the number thrown is odd”

and

F = “the number thrown is less than 3,”
then E = {1} ∪ {3} ∪ {5}, F = {1} ∪ {2}, and rule (1.5) implies
P(E) = P({1}) + P({3}) + P({5}) = 3/6

and

P(F) = P({1}) + P({2}) = 2/6.

Furthermore, E ∩ F = {1}, so P(E ∩ F) = 1/6. Then by rule (1.4),

P(E ∪ F) = P(E) + P(F) − P(E ∩ F) = 3/6 + 2/6 − 1/6 = 4/6.
In this simple situation, and even in more complicated ones, we have alternative ways of computing the same
quantity. Here E ∪ F = {1, 2, 3, 5} and we can also verify that P (E ∪ F ) = 4/6 from rule (1.5).
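Both routes to P(E ∪ F) are easy to confirm by counting outcomes; a minimal Python sketch, assuming the equally likely outcomes of Example 1.A:

    S = {1, 2, 3, 4, 5, 6}
    E = {1, 3, 5}                          # "the number thrown is odd"
    F = {1, 2}                             # "the number thrown is less than 3"
    P = lambda event: len(event) / len(S)  # equally likely outcomes
    print(P(E) + P(F) - P(E & F))          # rule (1.4): 0.666...
    print(P(E | F))                        # direct count, rule (1.5): 0.666...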
1.2 Conditional Probability and Independence

(1.7) Conditional probability: The probability that E occurs given F has occurred is defined to be
P(E|F) = P(E ∩ F)/P(F) if P(F) > 0.
If P (F ) = 0 we leave P (E|F ) undefined for now. Conditional probability may be thought of as relative
probability: P (E|F ) is the probability of E relative to the reduced sample space consisting of only those
outcomes in the event F . In a sense, all probabilities are conditional since even “unconditional” probabilities
are relative to the sample space S, and it is only by custom that we write P (E) instead of the equivalent
P (E|S).
The definition (1.7) is useful when the quantities on the right-hand side are known; we shall make
frequent use of it in a different form, though, when the conditional probability is given and the composite
probability P (E ∩ F ) is sought:
P(E ∩ F) = P(F)P(E|F). (1.8)
Note we need not specify P (F ) > 0 here, for if P (F ) = 0 then P (E ∩F ) = 0 and both sides are zero regardless
of what value might be specified for P (E|F ). We can see that (1.7) and (1.8) relate three quantities, any two
of which determine the third. The third version of this relationship (namely P (F ) = P (E ∩ F )/P (E|F )) is
seldom useful.
Sometimes knowing that F has occurred has no effect upon the specification of the probability of E:
(1.9) Independent events: We say events E and F in S are independent if P (E) = P (E|F ).
By simple manipulation using the previous rules, this can be expressed in two other equivalent ways:

(1.10) P(F) = P(F |E), and

(1.11) P(E ∩ F) = P(E) · P(F).
Indeed, this latter condition (which is not to be confused with the almost opposite notion of “mutually
exclusive”) is often taken as the definition of independence.
Note that independence (unlike, for example, being mutually exclusive) depends crucially upon the
values specified for the probabilities. In the previous example of the die, E and F are independent for the
given specification of probabilities. Using (1.9),
P(E|F) = P(E ∩ F)/P(F) = (1/6)/(2/6) = 1/2 = P(E).

However, if the die were strongly weighted and P({1}) = P({2}) = P({4}) = 1/3, then P(F) = 2/3, P(E) = 1/3,
and P(E ∩ F) = 1/3, so for this specification of probabilities E and F are then not independent.
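The contrast can be checked numerically; a minimal Python sketch using the two probability specifications just discussed:

    E, F = {1, 3, 5}, {1, 2}
    fair = {k: 1/6 for k in range(1, 7)}
    weighted = {1: 1/3, 2: 1/3, 3: 0, 4: 1/3, 5: 0, 6: 0}
    for spec in (fair, weighted):
        P = lambda ev, p=spec: sum(p[k] for k in ev)
        # Independent exactly when P(E ∩ F) = P(E) P(F).
        print(abs(P(E & F) - P(E) * P(F)) < 1e-12)   # True, then False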
Example 1.B: A Random Star. Early astronomers noticed many patterns in the heavens; one which
caught the attention of mathematicians in the eighteenth and nineteenth centuries was the occurrence of six
bright stars (the constellation of the Pleiades) within a small section of the celestial sphere 1◦ square. How
likely, they asked, would such a tight grouping be if the stars were distributed at random in the sky? Could the
occurrence of such a tight cluster be taken as evidence that a common cause, such as gravitational attraction,
tied the six stars together? This turns out to be an extraordinarily difficult question to formulate, much less
answer, but a simpler question can be addressed even with the few rules we have introduced, namely: Let
A be a given area on the surface of the celestial sphere that is a square, 1◦ on a side. A single star is placed
randomly on the sphere. What is the probability it lands in A? A solution requires that we specify what
“placed randomly” means mathematically. Here the sample space S is infinite, namely the points on the
celestial sphere. Specifying probabilities on such a set can be challenging. We give two different solutions.
First Solution: The star is placed by specifying a latitude and a longitude. By “placed randomly” we
may mean that the latitude and longitude are picked independently, the latitude in the range −90◦ to 90◦
with a probability of 1/180 attached to each 1◦ interval, and the longitude in the range 0◦ to 360◦ with a
probability of 1/360 attached to each 1◦ interval. This is not a full specification of the probabilities of the
points on the sphere, but it is sufficient for the present purpose. Suppose A is located at the equator (Figure
1.3). Let
E = “pick latitude in A’s range”
F = “pick longitude in A’s range”
A = “pick a point within A”.
Then A occurs exactly when both E and F occur, and since E and F are independent,

P(A) = P(E) · P(F) = (1/180) · (1/360) = 1/64800.
Second Solution: We may ask how many 1◦ squares make up the area of the sphere; if all are equally
likely, the probability of A is just the reciprocal of this number. It has been known since Archimedes that if
a sphere has radius r, the area of the surface is 4πr2 . (This can be easily remembered as following from the
“Orange Theorem”: if a spherical orange is sliced into four quarters then for each quarter the area of the
two flat juicy sides equals that of the peel. The flat juicy sides are two semicircles of total area πr2 , so the
peel of a quarter orange has area πr2 and the whole peel is 4πr2 .) Now if the units of area are to be square
degrees, then since the circumference is 2πr we will need to choose r so that 2πr = 360◦, or r = 360/(2π). Then
the area of the surface is

4πr² = 4π(360/(2π))² = 360²/π = 41253.
Each square degree being supposed equally likely, we have
P(A) = 1/41253,
which is π/2 times larger than the first solution.
Both solutions are correct; they are based upon different hypotheses. The hypotheses of the first solution
may well be characterized as “placed at random” from one point of view, but they will make it more likely
that a square degree near a pole contains the star than that one on the equator does.
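The difference between the two hypotheses can be illustrated by simulation. The following rough Monte Carlo sketch (the sample size and the choice of a target square at the equator are arbitrary) draws the star by both schemes: the first draws latitude and longitude independently and uniformly, the second draws a point uniformly over the surface area, for which the sine of the latitude is uniform:

    import math, random

    random.seed(0)
    N = 2_000_000
    hits_1, hits_2 = 0, 0
    for _ in range(N):
        # First solution: latitude and longitude independent and uniform.
        lat, lon = random.uniform(-90, 90), random.uniform(0, 360)
        hits_1 += (0 <= lat < 1) and (0 <= lon < 1)
        # Second solution: uniform over surface area (sin(latitude) uniform).
        lat = math.degrees(math.asin(random.uniform(-1, 1)))
        lon = random.uniform(0, 360)
        hits_2 += (0 <= lat < 1) and (0 <= lon < 1)
    print(hits_1 / N, 1 / 64800)   # both about 1.5e-5
    print(hits_2 / N, 1 / 41253)   # both about 2.4e-5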
The original problem is more difficult than the one we solved because we need to ask, if the approximately
1500 bright stars (of 5th magnitude or brighter) are placed “randomly” and independently on the sphere,
and we search out the square degree containing the largest number of stars, what is the chance it contains
six or more? And even this already complicated question glosses over the fact that our interest in a section
1◦ square (rather than 2◦ square, or 1◦ triangular, etc.) was determined a posteriori, after looking at the
data. We shall discuss some aspects of this problem in later chapters.
1.3 Counting
When the sample space S is finite and the outcomes are specified to be equally likely, the calculation of
probabilities becomes an exercise in counting: P (E) is simply the number of outcomes in E divided by the
number of outcomes in S. Nevertheless, counting can be difficult. Indeed, an entire branch of mathematics,
combinatorics, is devoted to counting. We will require two rules for counting, namely those for determining
the numbers of permutations and of combinations of n distinguishable objects taken r at a time. These are
(1.12) The number of ways of choosing r objects from n distinguishable objects where the order of choice
makes a difference is the number of permutations of n objects taken r at a time, given by

P_{r,n} = n!/(n − r)!.
(1.13) The number of ways of choosing r objects from n distinguishable objects where the order of choice
does not make a difference is the number of combinations of n objects taken r at a time (read “n choose r”), given by

\binom{n}{r} = C_{r,n} = n!/(r!(n − r)!).
In both cases, n! denotes n factorial, defined by n! = 1 · 2 · 3 · · · (n − 1) · n for integer n > 0, and we take
0! = 1 for convenience. Thus we have also

P_{r,n} = (1 · 2 · 3 · · · n)/(1 · 2 · · · (n − r)) = (n − r + 1) · · · (n − 1)n

and

\binom{n}{r} = P_{r,n}/r! = ((n − r + 1) · · · (n − 1)n)/(1 · 2 · 3 · · · (r − 1)r).
A variety of identities can be established using these definitions; some easily (e.g. \binom{n}{0} = 1, \binom{n}{1} = n,
\binom{n}{r} = \binom{n}{n−r}), others with more difficulty (e.g. Σ_{r=0}^{n} \binom{n}{r} = 2^n, which can, however, be directly established by
noting the lefthand side gives the number of ways any selection can be made from n objects without regard
to order, which is just the number of subsets, or 2^n).
Example 1.C: If r = 2 people are to be selected from n = 5 to be designated president and vice president
respectively, there are P_{2,5} = 20 ways the selection can be made. If, however, they are to serve as a committee
of two equals (so the committees (A, B) and (B, A) are the same committee), then there are only \binom{5}{2} = 10
ways the selection can be made.
Example 1.D: For an example of a more important type, we could ask how many binary numbers of
length n (= 15, say) are there with exactly r (= 8, say) 1’s. That is, how many possible sequences of 15
0’s and 1’s are there for which the sum of the sequence is 8? The answer is \binom{n}{r} = \binom{15}{8} = 6435, as may be
easily seen by considering the sequence as a succession of n = 15 distinguishable numbered spaces, and the
problem as one of selecting r = 8 from those 15 spaces as the locations for the 8 1’s, the order of the 1’s
being unimportant and the remaining unfilled slots to be filled in by 0’s.
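Both counts are available directly in Python's standard library; a quick check of Examples 1.C and 1.D:

    import math

    print(math.perm(5, 2))    # P_{2,5} = 20 ordered choices (president, vice president)
    print(math.comb(5, 2))    # 10 committees of two equals
    print(math.comb(15, 8))   # 6435 binary sequences of length 15 with eight 1's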
Evaluating n!, P_{r,n}, or \binom{n}{r} can be quite difficult if n is at all large. It is also usually unnecessary, due
to a very close approximation discovered about 1730 by James Stirling and Abraham De Moivre. Stirling’s
formula states that

log_e(n!) ∼ (1/2) log_e(2π) + (n + 1/2) log_e(n) − n, (1.14)

and thus

n! ∼ √(2π) n^{n+1/2} e^{−n}, (1.15)
where “∼” means that the ratio of the two sides tends to 1 as n increases. The approximations can be good
for even small n, as Table 1.1 shows.
[Table 1.1]
Stirling’s formula can be used to derive approximations to P_{r,n} and \binom{n}{r}, namely

P_{r,n} ∼ (1 − r/n)^{−(n+1/2)} (n − r)^r e^{−r}, (1.16)

and

\binom{n}{r} ∼ (1/√(2πn)) (1 − r/n)^{−(n−r+1/2)} (r/n)^{−(r+1/2)}. (1.17)
These too are reasonably accurate approximations; for example \binom{10}{5} = 252, while the approximation gives
258.37. While not needed for most purposes, there are more accurate refinements available. For example,
the bounds

√(2π) n^{n+1/2} e^{−n+1/(12n+1)} < n! < √(2π) n^{n+1/2} e^{−n+1/(12n)} (1.18)
give, for n = 5, 119.9699 < n! = 120 < 120.0026. Feller (1957, Chapter II.9) gives a proof of Stirling’s
formula and a nice discussion.
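Stirling's approximation (1.15) and the bounds (1.18) are easy to check numerically; a minimal Python sketch, reproducing among other things the n = 5 figures quoted above:

    import math

    def stirling(n):
        # (1.15): n! ~ sqrt(2*pi) * n**(n + 1/2) * exp(-n)
        return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

    for n in (1, 2, 5, 10):
        low = stirling(n) * math.exp(1 / (12 * n + 1))   # lower bound of (1.18)
        high = stirling(n) * math.exp(1 / (12 * n))      # upper bound of (1.18)
        print(n, math.factorial(n), round(stirling(n), 4),
              round(low, 4), round(high, 4))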
Sometimes the outcomes of an experiment are expressed as numbers, and at other times we will be
most interested in numerical descriptions that capture only some aspects of the outcomes, even in situations
where we find it easiest to specify the probabilities of the outcomes themselves. We will use the term random
variable for such a description: a function that assigns a numerical value to each outcome in S; a real-valued
function defined on S.
Example 1.E: If a coin is tossed three times, the sample space might be described by a list of 8
three-letter words,
S = {T T T, T T H, T HT, HT T, HHT, HT H, T HH, HHH},
where HHT means that the first two tosses result in Heads, and the third in Tails. One possible random
variable is
X = #H’s in the word.
Another is
Y = #T ’s in the word.
In both cases, the possible values are 0, 1, 2, and 3.
We will call random variables whose values can be listed sequentially in this manner discrete random
variables. In such cases, once the probabilities of the values of the random variable have been specified, they
can be described rather simply, by listing them. A list of the possible values of a discrete random variable
together with the probabilities of these values is called the probability distribution of the random variable;
we shall denote the probability that the random variable X is equal to the possible value x by pX (x), or,
when there is no likely confusion, by p(x).
For the coin example, the specification of the probability distribution of the random variable X = #H’s
can be straightforward. If we assume the coin is “fair” (which we may take to mean that P (H) = 1/2 for
a single toss), and the tosses are independent, then applying the multiplication rule for independent events
(1.11) repeatedly gives us, for example,
P(HHT) = P(H) · P(H) · P(T) = (1/2)³ = 1/8,
and so the 8 points in S are equally likely. Now the event “X = 1” consists of the outcomes {HT T, T HT, T T H}
and by the additivity rule (1.5) it has probability
pX(1) = P(X = 1)
      = P({HTT, THT, TTH})
      = P(HTT) + P(THT) + P(TTH)
      = 1/8 + 1/8 + 1/8 = 3/8.
The full probability distribution is easily found:
x        0    1    2    3
pX(x)   1/8  3/8  3/8  1/8
It is often helpful to think of a probability distribution of a random variable as a distribution of a unit mass
along the real line, with pX(x) giving the mass assigned to the point x. A second, equivalent description is the
cumulative distribution function, defined by FX(x) = P(X ≤ x). Then FX(x) gives the cumulative
mass, starting from the left, up to and including that at the point x.
For the coin example the calculation is simple:
x        0    1    2    3
pX(x)   1/8  3/8  3/8  1/8
FX(x)   1/8  4/8  7/8  8/8 = 1
Graphically, we can depict pX (x) as a system of spikes, and then FX (x) is a jump function with jumps at
each possible value of x.
[Figure 1.4]
Note that the probability distribution pX (x) can be recovered from FX (x) by evaluating the sizes and
locations of the jumps. If x − 1 and x are two consecutive possible values of X, then

pX(x) = FX(x) − FX(x − 1).
The two alternative specifications are thus completely equivalent, given one the other can be found, and we
can choose between them on the grounds of convenience.
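The passage from pX(x) to FX(x) and back is a simple cumulative-sum computation; a minimal Python sketch for the coin example:

    from itertools import accumulate

    xs = [0, 1, 2, 3]
    p = [1/8, 3/8, 3/8, 1/8]        # pX(x), X = #Heads in three tosses
    F = list(accumulate(p))         # FX(x): cumulative sums 1/8, 4/8, 7/8, 1
    # Recover pX from FX through the sizes of the jumps.
    jumps = [F[0]] + [F[i] - F[i-1] for i in range(1, len(F))]
    print(F)       # [0.125, 0.5, 0.875, 1.0]
    print(jumps)   # [0.125, 0.375, 0.375, 0.125]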
The experiment of tossing a fair coin three times is a special case of a broad class of experiments of
immense use in statistics, the class of binomial experiments. These are characterized by three conditions:

(a) The experiment consists of a fixed number n of trials, which are mutually independent.

(b) Each trial results in one of only two possible outcomes:
A, called “success”, or
A^c, called “failure”.

(c) The probability of success on a single trial, P(A) = θ, is the same for all n trials (and so, of
course, is P(A^c) = 1 − θ). This probability θ is called the parameter of the experiment.
For the fair coin example, n = 3, A = H, and θ = 1/2.
The sample space S of the possible outcomes of a binomial experiment consists of a list of “words” of
length n, made up entirely of the “letters” A and A^c. These range from all successes to all failures:

AA · · · AA          (n letters; all successes)
AA · · · AA^c
AA · · · A^cA
   ·
   ·
AA^c · · · AA^c
   ·
   ·
A^cA^c · · · A^cA^c      (all failures)
There are 2 choices for each letter and thus 2 × 2 × · · · × 2 = 2^n different such words. For the coin example,
n = 3 and there are 2³ = 8 outcomes.
Since the trials are independent by hypothesis, it is easy to compute the probability of a single outcome
using the multiplication rule for independent events (1.11). For example:
For n = 2:

P(AA) = P(A) · P(A) = θ · θ = θ²
P(AA^c) = P(A)P(A^c) = θ(1 − θ).

For n = 3:

P(AA^cA) = P(A)P(A^c)P(A) = θ(1 − θ)θ = θ²(1 − θ).
In general, the probability of an outcome will only depend upon the numbers of A’s and A^c’s in the outcome.
If the word AA^c · · · AA^c consists of x A’s and n − x A^c’s, then

P(AA^c · · · AA^c) = P(A)P(A^c) · · · P(A)P(A^c) = θ(1 − θ) · · · θ(1 − θ) = θ^x (1 − θ)^{n−x}.
Now, for binomial experiments we will frequently only be interested in a numerical summary of the
outcome, the random variable
X = # successes = #A’s.
The possible values of X are 0, 1, 2, . . . , n, and its probability distribution can be found as follows: The event
“X = x” consists of exactly those outcomes with x A’s and n − x A^c’s. We have just found that each such
outcome has probability θ^x (1 − θ)^{n−x}. It remains only to determine the number, say C, of outcomes with
x A’s and n − x A^c’s; the desired probability will then be pX(x) = P(X = x) = C · θ^x (1 − θ)^{n−x}. But C
is equal to the number of binary numbers of length n with exactly x 1’s and n − x 0’s (just think of each
A as “1” and each A^c as “0”), and we have found (in Example 1.D) that this number is \binom{n}{x}. Therefore
pX(x) = \binom{n}{x} θ^x (1 − θ)^{n−x}. This probability distribution is called the Binomial Distribution, and is sometimes
given a special symbol that shows its dependence upon n and θ explicitly:

b(x; n, θ) = \binom{n}{x} θ^x (1 − θ)^{n−x} for x = 0, 1, 2, . . . , n.
Figure 1.5 illustrates some examples, for n = 8. The parameters n and θ determine the distribution; for each
integer n ≥ 1 and 0 ≤ θ ≤ 1, we have a different distribution. The Binomial distribution is thus an example
of what is called a parametric family of probability distributions.
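The Binomial probabilities are simple to compute; a minimal Python sketch (writing b(x, n, theta) for the probabilities as above):

    from math import comb

    def b(x, n, theta):
        # b(x; n, theta) = C(n, x) theta^x (1 - theta)^(n - x)
        return comb(n, x) * theta**x * (1 - theta)**(n - x)

    print([b(x, 3, 0.5) for x in range(4)])     # [0.125, 0.375, 0.375, 0.125]
    print(sum(b(x, 8, 0.3) for x in range(9)))  # 1.0, up to rounding error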
The trials that make up a binomial experiment are often called Bernoulli trials, after the Swiss mathe-
matician Jacob Bernoulli (1654–1705) who was instrumental in the early study of this experiment. Bernoulli
trials can be conducted in manners other than that we have discussed; the most important of these is where
rather than conduct a fixed number n of trials, the trials are conducted until a fixed number of successes r
have been observed. Because this is a sort of reversal of the original scheme, it is called the negative binomial
experiment. For example, if r = 1, the trials are conducted until the first success, and the sample space
consists of “words” of increasing length with a single A at the end: A, Ac A, Ac Ac A, Ac Ac Ac A, etc.
Let Z be the number of failures (that is, the number of A^c’s) observed before the rth success. For r = 1
the probability distribution of Z is easy to compute. For example, we will have Z = 3 only for
the outcome A^cA^cA^cA, and since P(A^cA^cA^cA) = (1 − θ)³θ we have pZ(3) = (1 − θ)³θ. More generally,
pZ(z) = (1 − θ)^z θ for z = 0, 1, 2, . . .. Note that Z is a discrete random variable with a countably infinite
number of possible values.
To find the probability distribution of Z in general, we can reason analogously to the way we found the
binomial distribution. The sample space S will consist of words with r A’s, each word ending with an A
(since the experiment terminates with the rth success). The outcomes corresponding to Z = z will be those
with r A’s and z A^c’s, and each of these will have probability θ^r (1 − θ)^z. To find the probability distribution
of Z we need only find the number C of outcomes in S with Z = z; for then pZ(z) = Cθ^r (1 − θ)^z. But
C is the number of “words” of length r + z ending in A, with exactly z A^c’s. This is the same as the
number of “words” of length r + z − 1 with exactly z A^c’s and no restrictions on the last letter, namely
C = \binom{r+z−1}{z} = \binom{r+z−1}{r−1}. We have therefore found the
(1.21) Negative Binomial Distribution: The probability distribution of the number of failures Z before the
rth success in a series of Bernoulli trials with probability of success θ is
nb(z; r, θ) = \binom{r+z−1}{r−1} θ^r (1 − θ)^z for z = 0, 1, 2, . . .
            = 0 otherwise.
This distribution is sometimes called the Pascal distribution, after an early programming language
that will continue to compile until the first bug is encountered. The special case where r = 1, namely
pZ(z) = θ(1 − θ)^z, for z = 0, 1, 2, . . . , is called the geometric distribution. We shall later see that the
negative binomial distribution has a close relationship to another important discrete distribution, the Poisson
distribution.
[Figure 1.6]
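A corresponding sketch of the Negative Binomial probabilities; since Z has countably many possible values, the check that they total 1 must be truncated:

    from math import comb

    def nb(z, r, theta):
        # nb(z; r, theta) = C(r + z - 1, r - 1) theta^r (1 - theta)^z
        return comb(r + z - 1, r - 1) * theta**r * (1 - theta)**z

    print([nb(z, 1, 0.5) for z in range(4)])       # geometric: 0.5, 0.25, 0.125, ...
    print(sum(nb(z, 3, 0.4) for z in range(200)))  # ≈ 1.0 (truncated sum)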
The Binomial and Negative Binomial distributions are more closely related than the mere fact that both
involve Bernoulli trials would suggest. Let
B(x; n, θ) = P(X ≤ x)
NB(z; r, θ) = P(Z ≤ z)
be their respective cumulative distribution functions. Then a little reflection tells us that if we were com-
puting X and Z from the same series of trials, we would have X ≥ r if and only if Z ≤ n − r. Since
P(X ≥ r) = 1 − P(X ≤ r − 1), this means

NB(n − r; r, θ) = 1 − B(r − 1; n, θ),
and so one set of probabilities can be computed from the other. For example, tables of the binomial
distribution can be used to find Negative Binomial probabilities. The Binomial distribution enjoys certain
symmetry properties. In particular

b(x; n, θ) = b(n − x; n, 1 − θ).

This relation allows the computation of Binomial (and hence Negative Binomial) probabilities using a table
of the Binomial distribution for 0 < θ ≤ 1/2.
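The relation between the two cumulative distribution functions can be verified numerically; the sketch below checks it for one arbitrary choice of n, r, and θ:

    from math import comb

    def B(x, n, theta):    # Binomial cumulative distribution function
        return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
                   for k in range(x + 1))

    def NB(z, r, theta):   # Negative Binomial cumulative distribution function
        return sum(comb(r + k - 1, r - 1) * theta**r * (1 - theta)**k
                   for k in range(z + 1))

    n, r, theta = 10, 4, 0.3
    print(NB(n - r, r, theta))       # the two printed values agree
    print(1 - B(r - 1, n, theta))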
A random variable is called continuous if its possible values form an interval, and hence cannot be listed
sequentially.
Example 1.F: Consider the spinner, a disc with a pointer rotating freely around the center point, pointing
at the edge of the disc which is labelled continuously from a to b.
[Figure 1.7]
If the pointer is spun and allowed to come to rest at an (in some sense random) point X, then the sample
space S is the interval {x : a ≤ x < b} and X is a random variable whose possible values are the numbers in
this interval.
Because the values of a continuous random variable cannot be listed, their probabilities cannot be
listed, and another device is used to describe the probability distribution. In a direct extension to the
interpretation of discrete probability distributions as mass distributions, continuous probability distributions
will be described by probability density functions, nonnegative functions which give the probabilities of an
interval through the area under the function over the interval. Mathematically, since areas are given by
integrals, we will define fX (x) (or f (x) if no confusion arises) to be the probability density function of the
continuous random variable X if for any numbers c and d, with c < d,
P(c < X ≤ d) = ∫_c^d fX(x) dx.
[Figure 1.8]
It will necessarily be true of probability density functions that

(i) fX(x) ≥ 0 for all x, and

(ii) ∫_{−∞}^{∞} fX(x) dx = 1.
Indeed, any function satisfying (i) and (ii) may be considered as the probability density of a continuous
random variable. Note that the values of fX (x) do not themselves give probabilities (they may even exceed
1), though we can think heuristically of fX (x)dx (= height fX (x) times base dx) as the probability X falls
in an infinitesimal interval at x:
P (x < X ≤ x + dx) = fX (x)dx
[Figure 1.9]
It is frequently helpful to think of the density function fX (x) as describing the upper boundary of a sheet
of unit mass resting upon the line of possible values, the area under that boundary over an interval being
equal to the mass over that interval.
One consequence of using probability densities to describe distributions is that individual points are
assigned probability zero:
P (X = c) = 0 for any c.
The area or mass exactly over each single point must be considered to be zero, or contradictions would ensue,
as we shall see. As a consequence, for continuous random variables we have, for any c < d,

P(c < X ≤ d) = P(c ≤ X ≤ d) = P(c ≤ X < d) = P(c < X < d).

Example 1.F (continued): For the spinner, coming to rest “at random” may be taken to mean that the
probability density is constant over the interval of possible values:

fX(x) = 1/(b − a) for a ≤ x < b (1.25)
      = 0 otherwise.

This is called the Uniform distribution on the interval from a to b.
[Figure 1.10]
Clearly the total area under fX (x) is 1, and the area or probability over any subinterval (c, d) is (d−c)/(b−a),
proportional to the length d − c of the subinterval. The numbers a and b are the parameters of this
distribution. If we ask what probability could be assigned to any single number c, we see it must be smaller
than that assigned to the interval c ≤ x < c + ε, for any ε > 0, that is, smaller than P(c ≤ X < c + ε) =
ε/(b − a). But no positive number fits that description, and we are forced by the limitations of our number
system to take P (X = c) = 0. This will not cause difficulties or contradictions as long as we follow our rules
and only insist that probabilities be countably additive: having the probability P (a ≤ X < b) = 1, yet each
P (X = c) = 0 does not contradict the additivity rule (1.6) since there are uncountably many c’s between a
and b.
Similarly to the discrete case, we define the cumulative distribution function of a continuous random
variable X by
FX(x) = P(X ≤ x) = ∫_{−∞}^x fX(u) du.
FX (x) is thus the area under fX (x) to the left of x. As before, FX (x) is a nondecreasing
[Figure 1.11]
function, though it is no longer a jump function. It gives an alternative way to describe continuous distri-
butions. The fundamental theorem of calculus holds that

(d/dx) ∫_{−∞}^x fX(u) du = fX(x),

so

(d/dx) FX(x) = fX(x),
and we may find the probability density function from the cumulative distribution function as well as vice
versa.
Example 1.G: The Exponential Distribution. Consider the experiment of burning a lightbulb until failure.
Let X be the time until failure; X may be considered a continuous random variable with possible values
{x : 0 ≤ x < ∞}. In order to specify a class of possible probability distributions for X, we would expect
to have the probability of survival beyond time t, P (X > t), decreasing as t → ∞. One class of decreasing
functions which also have P(X > 0) = 1 are the exponentially decreasing functions P(X > t) = C^t, where
0 < C < 1. Equivalently, writing e^{−θ} for C, where θ > 0 is a fixed parameter, we have

P(X > t) = e^{−θt} for t ≥ 0,

and

FX(t) = P(X ≤ t) = 1 − e^{−θt} for t ≥ 0
      = 0 for t < 0.
The corresponding probability density function is found by differentiation:

fX(t) = θe^{−θt} for t ≥ 0 (1.27)
      = 0 for t < 0.
This is called the Exponential (θ) distribution; θ is a parameter of the distribution (for each θ > 0 we get
a different distribution). When we come to discuss the Poisson process we shall see how the Exponential
distribution arises in that context from more natural and less arbitrary assumptions as a common failure
time distribution or waiting time distribution.
[Figure 1.12]
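The Exponential distribution is easily explored numerically; the sketch below (with an arbitrary θ) checks that a numerical derivative of FX recovers the density (1.27), and shows one way to draw a sample by inverting FX, an idea met again below:

    import math, random

    theta = 2.0
    F = lambda t: 1 - math.exp(-theta * t)       # FX(t), t >= 0
    f = lambda t: theta * math.exp(-theta * t)   # the density (1.27), t >= 0
    t, h = 0.7, 1e-6
    print((F(t + h) - F(t - h)) / (2 * h))       # numerical derivative of FX
    print(f(t))                                  # agrees with the density
    # One draw from the Exponential(theta) distribution, by inverting FX:
    print(-math.log(1 - random.random()) / theta)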
In the calculus of probabilities, it is common to specify a probability distribution for a situation where
that task is simple, and to then reason to a more complicated one. We may choose to describe the sample
space so that the outcomes are equally likely and then deduce the distribution of a random variable whose
possible values are not equally likely. Or we may use ideas of conditional probability to break a complicated
framework down into a number of simple steps, perhaps even into independent trials. In the case of the
binomial experiment, both devices were adopted. Another common route, one which will be particularly
useful in statistical applications, is to go from the distribution of one random variable, say X, whose distri-
bution is easily specified or has previously been determined, to that of another random variable which is a
transformation or function of X, say Y = h(X).
Example 1.H: Suppose X is the time to failure of a lightbulb (or an electronic component), and that we
believe X to have an Exponential (θ) distribution with density (1.27),

fX(x) = θe^{−θx} for x ≥ 0
      = 0 otherwise.

Upon failure, we plan to replace the lightbulb with a second, similar one. The probability that the first
survives beyond time t is P(X > t) = e^{−θt}; the probability the second survives longer than the first is then
Y = e^{−θX}, a random variable that depends upon the time to failure of the first. What is the distribution of
Y?
Example 1.I: Suppose a fair coin is tossed three times, and you receive $2 for every Head. The number
of Heads, X, has a Binomial distribution; what is the distribution of your winnings, Y = 2X? Or what if
you receive Y = X²?
To begin to address the general question of finding the distribution of a transformation Y = h(X) of a
random variable X, consider first the case where h is a strictly monotone function, at least over the range of
possible values of X. This restriction will ensure that each value of Y could have come from only one possible
X, and the ideas will be easier to explain in that case. For example, h(X) = 2X + 3 is strictly monotone,
while h(X) = X 2 is not, although it will be allowed in the present discussion if X takes no negative values,
since it is strictly monotone for nonnegative x.
Example 1.J: The extremely useful transformation h(X) = loge (X) is strictly monotone, though only
defined for X > 0. We can see what happens to a probability distribution under transformation by looking at
this one special case. Figure 1.14 illustrates the effect of this transformation upon the X-scale: it compresses
the upper end of the scale by pulling large values down, while spreading out the scale for small values. The
1-14
gap between X’s of 5 and 6 (namely, 1 X-unit) is narrowed to that between Y ’s of 1.61 and 1.79 (.18 Y -units),
and the gap between X’s of .2 and 1.2 (also 1 X-unit) is expanded to that between Y ’s of -1.61 and .18 (1.79
Y -units). Figure 1.15 illustrates the effect of this transformation upon two probability distributions, one
discrete and one continuous. The effect in the discrete case is particularly easy to describe: as the scale is
warped by the transformation, the locations of the spikes are changed accordingly, but their heights remain
unchanged. In the continuous case, something different occurs. Since the total area that was between 5 and
6 on the X-scale must now fit between 1.61 and 1.79 on the Y -scale, the height of the density over this part
of the Y -scale must be increased. Similarly, the height of the density must be decreased over the part of the
Y -scale where the scale is being expanded, to preserve areas there. The result is a dramatic change in the
appearance of the density. Our object in the remainder of this section is to describe precisely how this can
be done.
If Y = h(X) is a strictly monotone transformation of X, then we can solve for X in terms of Y , that is,
find the inverse transformation X = g(Y ). Given Y = y, the function g “looks back” to see which possible
value x of X produced that value y; it was x = g(y). If Y = h(X) = 2X + 3, then X = g(Y ) = (Y − 3)/2.
If Y = h(X) = log_e(X), for X > 0, then X = g(Y) = e^Y. If Y = h(X) = X², for X > 0, then
X = g(Y) = +√Y.
In terms of this inverse relationship, the solution for the discrete case is immediate. If pX (x) is the
probability distribution function of X, then the probability distribution function of Y is
pY(y) = P(Y = y)
      = P(h(X) = y) (1.28)
      = P(X = g(y))
      = pX(g(y)).
That is, for each value y of Y , simply “look back” to find the x that produced y, namely x = g(y), and
assign y the same probability that had been previously assigned to that x, namely pX (g(y)).
Example 1.K: If X has the Binomial distribution of Example 1.E; that is, pX(x) = \binom{3}{x}(0.5)³ for
x = 0, 1, 2, 3, and Y = X², what is the distribution of Y? Here g(y) = +√y, and so pY(y) = pX(√y) =
\binom{3}{√y}(0.5)³ for √y = 0, 1, 2, 3 (or y = 0, 1, 4, 9). For all other y’s, pY(y) = pX(√y) = 0. That is,

pY(y) = 1/8 for y = 0
      = 3/8 for y = 1
      = 3/8 for y = 4
      = 1/8 for y = 9
      = 0 otherwise.
[Figure 1.16]
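The “look back” recipe (1.28) amounts to relabeling the support; a minimal Python sketch of Example 1.K:

    from math import comb

    pX = {x: comb(3, x) * 0.5**3 for x in range(4)}   # Binomial(3, 1/2), Example 1.E
    h = lambda x: x * x                               # Y = X^2, monotone for x >= 0
    pY = {h(x): p for x, p in pX.items()}             # pY(y) = pX(g(y)), rule (1.28)
    print(pY)   # {0: 0.125, 1: 0.375, 4: 0.375, 9: 0.125}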
In the continuous case, an additional step is required, the rescaling of the density to compensate for the
compression or expansion of the scale and match corresponding areas. For this reason it is not true that
fY (y) = fX (g(y)), where g is the inverse transformation, but instead
fY(y) = fX(g(y)) · |dg(y)/dy|. (1.29)

The rescaling factor |dg(y)/dy| = |g′(y)| is called the Jacobian of the transformation in advanced calculus, and
it is precisely the compensation factor needed to match areas. When |g′(y)| is small, x = g(y) is changing
slowly as y changes (for example, for y near 0 in Figures 1.14 and 1.15), and we scale down. When g(y)
changes rapidly with y, |g 0 (y)| is large (for example, for y near 6 in Figures 1.14 and 1.15), and we scale up.
It is easy to verify that this is the correct factor: simply compute P (Y ≤ a) in two different ways.
First,

P(Y ≤ a) = ∫_{−∞}^a fY(y) dy, (1.30)

by the definition of fY(y). Second, supposing for a moment that h(x) (and hence g(y) also) is monotone
increasing, we have

P(Y ≤ a) = P(h(X) ≤ a) = P(X ≤ g(a)) = ∫_{−∞}^{g(a)} fX(x) dx. (1.31)
Differentiating both (1.30) and (1.31) with respect to a gives fY(y) = fX(g(y)) g′(y). If h(x) and g(y) are
monotone decreasing, the result is the same, but with −g′(y) as the compensation factor; the factor |g′(y)|
covers both cases, and gives us (1.29).
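Formula (1.29) can also be checked by simulation. The following rough Monte Carlo sketch takes the arbitrarily chosen case of X Exponential(1) and Y = log_e(X), and compares the relative frequency of a small interval with the area predicted by (1.29):

    import math, random

    # X ~ Exponential(1), Y = log X; then g(y) = e^y and |g'(y)| = e^y.
    def fY(y):
        x = math.exp(y)                       # g(y)
        return math.exp(-x) * math.exp(y)     # fX(g(y)) |g'(y)|, from (1.29)

    random.seed(1)
    ys = [math.log(random.expovariate(1.0)) for _ in range(200_000)]
    a, b = -0.5, -0.4
    print(sum(a < y <= b for y in ys) / len(ys))   # relative frequency of (a, b]
    print(fY(-0.45) * (b - a))                     # ≈ area under fY over (a, b]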
Example 1.H (Continued). Let X be the time to failure of the first lightbulb, and Y the probability
that the second bulb burns longer than the first. Y depends on X, and is given by Y = h(X) = e^{−θX}. The
random time X has density

fX(x) = θe^{−θx} for x ≥ 0
      = 0 otherwise.

Now log_e(Y) = −θX, and the inverse transformation is X = g(Y) = −log_e(Y)/θ. Both h(x) and g(y)
are monotone decreasing. The inverse g(y) is only defined for 0 < y, but the only possible values of Y are
0 < y ≤ 1. We find g′(y) = −(1/θ) · (1/y), and

|g′(y)| = 1/(θy) for y > 0.
Then

fY(y) = fX(g(y)) |g′(y)|,

and, noting that fX(g(y)) = 0 for y ≤ 0 or y > 1, we have

fY(y) = θe^{−θ(−log_e(y)/θ)} · 1/(θy) = θ · y · 1/(θy) = 1 for 0 < y ≤ 1
      = 0 otherwise.

That is, Y has the Uniform (0, 1) distribution.

Example 1.L: The Probability Integral Transformation. The result of Example 1.H is a special case of a
general fact: for any continuous random variable X with cumulative distribution function FX(x), both
Y = FX(X) and Z (= 1 − Y) have Uniform (0, 1) distributions. Example 1.H concerned Z for the special case of the Exponential
(θ) distribution. Because FX (x) is the integral of the probability density function fX (x), the transformation
h(x) = FX (x) has been called the probability integral transformation. To find the distribution of Y = h(X),
we need to differentiate g(y) = F_X^{−1}(y), defined to be the inverse cumulative distribution function, the
function that for each y, 0 < y < 1, gives the value of x for which FX(x) = y. [Figure 1.17] (For continuous
random variables X with densities, FX(x) is continuous and such an x will exist for all 0 < y < 1. For more
general random variables, F_X^{−1}(y) can be defined as F_X^{−1}(y) = infimum{x : FX(x) ≥ y}.) The derivative of
g(y) = F_X^{−1}(y) can be found by implicit differentiation:
y = FX(x),

so

1 = (d/dy) FX(x) = fX(x) · (dx/dy)

by the chain rule, and so

dx/dy = 1/fX(x),

or, with x = g(y) = F_X^{−1}(y),

g′(y) = (d/dy) F_X^{−1}(y) = 1/fX(F_X^{−1}(y)). (1.32)
But then

fY(y) = fX(g(y)) · |g′(y)|
      = fX(F_X^{−1}(y)) · 1/fX(F_X^{−1}(y)) for 0 < y < 1
      = 1 for 0 < y < 1
      = 0 otherwise,
the Uniform (0, 1) distribution. The fact that Z = 1 − Y also has this distribution can be shown by repeating
this derivation with h(x) = 1 − FX (x), or more simply by transforming the distribution of Y by h(y) = 1 − y,
whose inverse g(z) = 1 − z has |g′(z)| = 1.
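A quick simulation illustrates Example 1.L, again with the Exponential distribution (the choice of θ and the sample size are arbitrary): the transformed values Y = FX(X) should spread evenly over (0, 1):

    import math, random

    theta = 0.5
    random.seed(2)
    xs = [random.expovariate(theta) for _ in range(100_000)]
    ys = [1 - math.exp(-theta * x) for x in xs]      # Y = FX(X)
    counts = [0] * 10                                # occupancy of the tenths of (0, 1)
    for y in ys:
        counts[min(int(y * 10), 9)] += 1
    print([round(c / len(ys), 3) for c in counts])   # each entry ≈ 0.1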
Thus far we have considered only strictly monotone transformations h(x). A full treatment for nonmono-
tone transformations is possible, but it is cumbersome, and not necessary for our anticipated applications.
The following example, involving a transformation that can be broken down into two transformations mono-
tone over different ranges, captures all of the important ideas for even more general cases.
Example 1.M. The Standard Normal and Chi-square (1 d.f.) distributions. Suppose X is a continuous
random variable with probability density function fX (x) defined for all x, −∞ < x < ∞, by
φ(x) = (1/√(2π)) e^{−x²/2}. (1.33)
This distribution is called the Standard Normal distribution, and it is sufficiently important that the symbol
φ(x) is reserved for its density and Φ(x) for its cumulative distribution function. [Figure 1.18]. The cu-
mulative distribution function cannot be written in closed form in terms of simple functions, but it can be
evaluated numerically and is tabled at the back of the book for x ≥ 0. By the symmetry and continuity of
the distribution, this range is sufficient, since

P(X ≤ −x) = P(X ≥ x) = 1 − P(X ≤ x),

or

FX(−x) = 1 − FX(x),

or

Φ(−x) = 1 − Φ(x) for all x.

Now consider the transformation Y = h(X) = X². It is not monotone over the full range of X, but it is
monotone for x < 0 and for x ≥ 0 separately, with the two inverse “branches” g1(y) = −√y and g2(y) = +√y
for y > 0. Treating the two branches separately and adding the results gives

fY(y) = fX(g1(y)) · |g1′(y)| + fX(g2(y)) · |g2′(y)|. (1.34)
Why does this work? In essence, for each y > 0 it recognizes that y could have come from either of two
different x’s, so we “look back” to both, namely x = g1 (y) and x = g2 (y). Heuristically, the probability
appropriate to a small interval of width dy at y will be the sum of those found from the two separate branches
(Figure 1.19).
For our example, the range of y is y > 0, and we find

fX(g1(y)) = (1/√(2π)) e^{−(−√y)²/2} = (1/√(2π)) e^{−y/2},

fX(g2(y)) = (1/√(2π)) e^{−(√y)²/2} = (1/√(2π)) e^{−y/2},

g1′(y) = −1/(2√y)

and

g2′(y) = 1/(2√y),

so

|g1′(y)| = |g2′(y)| = 1/(2√y),

and

fY(y) = (1/√(2π)) e^{−y/2} · 1/(2√y) + (1/√(2π)) e^{−y/2} · 1/(2√y) for y > 0
      = (1/√(2πy)) e^{−y/2} for y > 0
      = 0 for y ≤ 0. (1.35)
We shall encounter this density later; it is called the Chi-square distribution with 1 degree of freedom, a name
that will seem a bit less mysterious later.
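A rough simulation check of (1.35): square a large sample of standard Normal draws and compare the relative frequency of a small, arbitrarily chosen interval with the area predicted by the Chi-square (1 d.f.) density:

    import math, random

    random.seed(3)
    ys = [random.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]    # Y = X^2
    f = lambda y: math.exp(-y / 2) / math.sqrt(2 * math.pi * y)   # (1.35), y > 0
    a, b = 0.5, 0.6
    print(sum(a < y <= b for y in ys) / len(ys))   # relative frequency of (a, b]
    print(f(0.55) * (b - a))                       # ≈ area under fY over (a, b]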
Example 1.N. Linear change of scale. A common and mathematically simple example of a transformation
of a random variable is a linear change of scale. The random variable X may be measured in inches; what is
the distribution of Y = 2.54X, the same quantity measured in centimeters? Or if X is measured in degrees
Fahrenheit, Y = (X − 32◦ )/1.8 is measured in degrees Celsius. The general situation has
Y = aX + b, (1.36)
where a and b are constants. For any a ≠ 0, h(x) = ax + b is a monotone transformation, with inverse
g(y) = (y − b)/a, g′(y) = 1/a, and

|g′(y)| = 1/|a|.

We then have, for any continuous random variable X,

fY(y) = fX((y − b)/a) · 1/|a|. (1.37)
Example 1.M (continued): The Normal (µ, σ²) Distribution. A special case of Example 1.N will be of great
use later, namely where X has a standard normal distribution, and Y is related to X by a linear change of
scale
Y = σX + µ, σ > 0. (1.39)
Then Y has what we will call the Normal (µ, σ²) distribution with density
fY(y) = φ((y − µ)/σ) · (1/σ) (1.40)
      = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}, for −∞ < y < ∞.
[Figure 1.20]
This might be called “general” Normal distribution, as a contrast to the “standard” Normal distribution.
Actually, it is of course a parametric family of densities, with parameters µ and σ. When we encounter this
family of distributions next we shall justify referring to µ as the mean and σ as the standard deviation of
the distribution.
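The density (1.40) is just the change-of-scale rule (1.37) applied to φ; a minimal sketch confirming that the two forms in (1.40) agree at an arbitrarily chosen point:

    import math

    phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)   # (1.33)

    def normal_pdf(y, mu, sigma):
        return phi((y - mu) / sigma) / sigma       # (1.40), first form

    y, mu, sigma = 1.3, 1.0, 2.0
    closed = (math.exp(-(y - mu)**2 / (2 * sigma**2))
              / (math.sqrt(2 * math.pi) * sigma))  # (1.40), second form
    print(normal_pdf(y, mu, sigma), closed)        # equal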