Probability - Probability Distribution Notes by Kristin Kuter
PROBABILITY (KUTER)
Kristin Kuter
Saint Mary's College
MATH 345 - Probability (Kuter)
Kristin Kuter
This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and like the hundreds
of other texts available within this powerful platform, it is freely available for reading, printing and "consuming." Most, but not all,
pages in the library have licenses that may allow individuals to make changes, save, and print this book. Carefully
consult the applicable license(s) before pursuing such effects.
Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs of their
students. Unlike traditional textbooks, LibreTexts’ web based origins allow powerful integration of advanced features and new
technologies to support learning.
The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online platform
for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable textbook costs to our
students and society. The LibreTexts project is a multi-institutional collaborative venture to develop the next generation of open-
access texts to improve postsecondary education at all levels of higher learning by developing an Open Access Resource
environment. The project currently consists of 14 independently operating and interconnected libraries that are constantly being
optimized by students, faculty, and outside experts to supplant conventional paper-based books. These free textbook alternatives are
organized within a central environment that is both vertically (from advanced to basic level) and horizontally (across different fields)
integrated.
The LibreTexts libraries are Powered by MindTouch® and are supported by the Department of Education Open Textbook Pilot
Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable Learning Solutions
Program, and Merlot. This material is based upon work supported by the National Science Foundation under Grant No. 1246120,
1525057, and 1413739. Unless otherwise noted, LibreTexts content is licensed by CC BY-NC-SA 3.0.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation or the US Department of Education.
Have questions or comments? For information about adoptions or adaptations contact [email protected]. More information on our
activities can be found via Facebook (https://facebook.com/Libretexts), Twitter (https://twitter.com/libretexts), or our blog
(http://Blog.Libretexts.org).
1: What is Probability?
1.1: Sample Spaces and Events
1.2: Probability Measures
2: Computing Probabilities
2.1: Equally Likely Outcomes and Counting Techniques (Combinatorics)
2.2: Conditional Probability and Bayes' Rule
2.3: Independent Events
3: Discrete Random Variables
3.1: Introduction to Random Variables
3.2: Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs) for Discrete Random Variables
3.3: Bernoulli and Binomial Distributions
3.4: Hypergeometric, Geometric, and Negative Binomial Distributions
3.5: Poisson Distribution
3.6: Expected Value of Discrete Random Variables
3.7: Variance of Discrete Random Variables
3.8: Moment-Generating Functions (MGFs) for Discrete Random Variables
Index
Glossary
This page titled MATH 345 - Probability (Kuter) is shared under a not declared license and was authored, remixed, and/or curated by Kristin
Kuter.
CHAPTER OVERVIEW
1: What is Probability?
1.1: Sample Spaces and Events
1.2: Probability Measures
This page titled 1: What is Probability? is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
1.1: Sample Spaces and Events
Introduction
We begin with a definition.
Definition 1.1.1
Probability theory provides a mathematical model for chance (or random) phenomena.
While this is not a very informative definition, it does indicate the overall goal of this course, which is to develop a formal,
mathematical structure for the fairly intuitive concept of probability. While most everyone is familiar with the notion of "chance" -
- we informally talk about the chance of it raining tomorrow, or the chance of getting what you want for your birthday -- when it
comes to quantifying the chance of something happening, we need to develop a mathematical model to make things precise and
calculable.
Definition 1.1.2
The sample space for a probability experiment (i.e., an experiment with random outcomes) is the set of all possible outcomes.
The sample space is denoted S .
An outcome is an element of S , generally denoted s ∈ S .
Example 1.1.1
Suppose we toss a coin twice and record the sequence of heads (h ) and tails (t ). A possible outcome of this experiment is then
given by
s = ht,
indicating that the first toss resulted in heads and the second in tails. The sample space for this experiment is the set of all such sequences:
S = {hh, ht, th, tt}.
Example 1.1.2
Suppose we record the time (t ), in minutes, that a car spends waiting for a green light at a particular intersection. A possible
outcome of this experiment is then given by
t = 1.5,
indicating that a particular car waited one and a half minutes for the light to turn green. The sample space consists of all non-
negative numbers, since a measurement of time cannot be negative and, in theory, there is no limit on how long a car could wait
for a green light. We can then write the sample space as follows:
S = {t ∈ R | t ≥ 0} = [0, ∞). (1.1.2)
Definition 1.1.3
An event is a particular subset of the sample space.
Example 1.1.3
Continuing in the context of Example 1.1.1, define A to be the event that at least one heads is recorded. We can write event A as
the following subset of the sample space:
A = {hh, ht, th}.
Example 1.1.4
Continuing in the context of Example 1.1.2, define B to be the event that a car waits at most 2 minutes for the light to turn green.
We can write the event B as the following interval, i.e., a subset of the sample space S given in Equation 1.1.2:
B = {t ∈ R | 0 ≤ t ≤ 2} = [0, 2].
Video: What is Probability? (https://youtu.be/Y_GAW22n1Ms)
Definition 1.1.4
1. The union of two events A and B , denoted A ∪ B , is the set of all outcomes in A or B (or both).
2. The intersection of two events A and B , denoted A ∩ B , is the set of all outcomes in both A and B .
3. The complement of an event A, denoted A^c, is the set of all outcomes in the sample space that are not in A. This may also
be written as follows:
A^c = {s ∈ S | s ∉ A}.
Example 1.1.5
Continuing in the context of both Examples 1.1.1 & 1.1.3, define B to be the event that exactly one heads is recorded:
B = {ht, th}.
Now we can apply the set operations just defined to the events A and B :
A ∩ B = {ht, th} = B
A^c = {tt}
B^c = {hh, tt}
Note the relationship between events A and B : every outcome in B is an outcome in A . In this case, we say that B is a subset of
A , and write
B ⊆ A.
Note also that events A and B are not disjoint, since their intersection is not the empty set. However, if we let C be the event that
no heads are recorded, then
C = {tt},
and
A ∩ C = ∅
B ∩ C = ∅.
Thus, events A and C are disjoint, and events B and C are disjoint.
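These set operations are easy to experiment with on a computer. Below is a minimal sketch using Python's built-in set type (the outcome labels are those of Examples 1.1.1, 1.1.3, and 1.1.5; the variable names are ours):

# Sample space and events from Examples 1.1.1, 1.1.3, and 1.1.5
S = {"hh", "ht", "th", "tt"}
A = {"hh", "ht", "th"}  # at least one heads
B = {"ht", "th"}        # exactly one heads
C = {"tt"}              # no heads

print(A & B)            # intersection: {'ht', 'th'}, which equals B
print(S - A)            # complement of A relative to S: {'tt'}
print(S - B)            # complement of B: {'hh', 'tt'}
print(B <= A)           # B is a subset of A: True
print(A & C == set())   # A and C are disjoint: True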
This page titled 1.1: Sample Spaces and Events is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
1.2: Probability Measures
Now we are ready to formally define probability.
Definition 1.2.1
A probability measure on the sample space S is a function, denoted P , from subsets of S to the real numbers R, such that the
following hold:
1. P (S) = 1
2. If A is any event in S , then P (A) ≥ 0 .
3. If events A1 and A2 are disjoint, then P(A1 ∪ A2) = P(A1) + P(A2).
So essentially, we are defining probability to be an operation on the events of a sample space, which assigns numbers to events in
such a way that the three properties stated in Definition 1.2.1 are satisfied.
Definition 1.2.1 is often referred to as the axiomatic definition of probability, where the three properties give the three axioms of
probability. These three axioms are all we need to assume about the operation of probability in order for many other desirable
properties of probability to hold, which we now state.
1. P(A^c) = 1 − P(A)
2. P(∅) = 0
3. If A ⊆ B , then P (A) ≤ P (B) .
4. P (A) ≤ 1
5. Addition Law: P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
Exercise 1.2.1
Can you prove the five properties of probability measures stated above using only the three axioms of probability
measures stated in Definition 1.2.1?
Answer
(1) For the first property, note that by definition of the complement of an event A we have
A ∪ A^c = S and A ∩ A^c = ∅.
In other words, given any event A, we can represent the sample space S as a disjoint union of A with its complement.
Thus, by the first and third axioms, we derive the first property:
1 = P(S) = P(A ∪ A^c) = P(A) + P(A^c)
⇒ P(A^c) = 1 − P(A)
(2) For the second property, note that we can write S = S ∪ ∅ , and that this is a disjoint union, since anything
intersected with the empty set will necessarily be empty. So, using the first and third axioms, we derive the second
property:
1 = P (S) = P (S ∪ ∅) = P (S) + P (∅) = 1 + P (∅)
⇒ P (∅) = 0
(3) For the third property, suppose A ⊆ B. Note that we can write B = A ∪ (B ∩ A^c), and that this is a disjoint union, since A and A^c are disjoint. Applying the third axiom, we get
P(B) = P(A) + P(B ∩ A^c). (1.2.1)
By the second axiom, we know that P(B ∩ A^c) ≥ 0. Thus, if we remove it from the right-hand side of Equation 1.2.1, we are left with something smaller, which proves the third property:
P(B) = P(A) + P(B ∩ A^c) ≥ P(A) ⇒ P(B) ≥ P(A)
(4) For the fourth property, we will use the third property that we just proved. By definition, any event A is a subset of the sample space S, i.e., A ⊆ S. Thus, by the third property and the first axiom, we derive the fourth property:
P(A) ≤ P(S) = 1 ⇒ P(A) ≤ 1
(5) For the fifth property, note that we can write the union of events A and B as the union of the following two disjoint events:
A ∪ B = A ∪ (A^c ∩ B),
in other words, the union of A and B is given by the union of all the outcomes in A with all the outcomes in B that are not in A. Furthermore, note that event B can be written as the union of the following two disjoint events:
B = (A ∩ B) ∪ (A^c ∩ B),
in other words, B is written as the disjoint union of all the outcomes in B that are also in A with the outcomes in B that are not in A. Applying the third axiom to this second decomposition gives P(A^c ∩ B) = P(B) − P(A ∩ B), which we can substitute into the result of applying the third axiom to the first decomposition:
P(A ∪ B) = P(A) + P(A^c ∩ B) ⇒ P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Note that the axiomatic definition (Definition 1.2.1) does not tell us how to compute probabilities. It simply defines a formal,
mathematical behavior of probability. In other words, the axiomatic definition describes how probability should
theoretically behave when applied to events. To compute probabilities, we use the properties stated above, as the next example
demonstrates.
Example 1.2.1
Continuing in the context of Example 1.1.5, let's define a probability measure on S . Assuming that the coin we toss is fair, then the
outcomes in S are equally likely, meaning that each outcome has the same probability of occurring. Since there are four outcomes,
and we know that probability of the sample space must be 1 (first axiom of probability in Definition 1.2.1), it follows that the
probability of each outcome is 1/4 = 0.25.
The reader can verify that this defines a probability measure satisfying the three axioms.
With this probability measure on the outcomes we can now compute the probability of any event in S by simply counting the
number of outcomes in the event. Thus, we find the probability of events A and B previously defined:
P(A) = P({hh, ht, th}) = 3/4 = 0.75
P(B) = P({ht, th}) = 2/4 = 0.50.
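Since the outcomes are equally likely, these probabilities can be reproduced by counting. A minimal sketch in Python (outcome labels as above):

S = {"hh", "ht", "th", "tt"}  # equally likely outcomes
A = {"hh", "ht", "th"}        # at least one heads
B = {"ht", "th"}              # exactly one heads

def prob(event):
    # probability of an event = (# of outcomes in event) / (# of outcomes in S)
    return len(event) / len(S)

print(prob(A), prob(B))  # 0.75 0.5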
We consider the case of equally likely outcomes further in Section 2.1.
There is another, more empirical, approach to defining probability, given by using relative frequencies and a version of the Law of Large Numbers: the probability of an event A is taken to be its long-run relative frequency,
P(A) = lim_{n→∞} n(A)/n,
where n(A) denotes the number of times that event A occurs in n repetitions of the probability experiment.
This approach to defining probability is sometimes referred to as the frequentist definition of probability. Under this definition,
probability represents a long-run average. The two approaches to defining probability are equivalent. It can be shown that using
relative frequencies to define a probability measure satisfies the axiomatic definition.
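To see the frequentist idea in action, the following minimal sketch (our own simulation, assuming a fair coin) estimates the probability of at least one heads in two tosses by its long-run relative frequency, which should settle near the value 0.75 found in Example 1.2.1:

import random

random.seed(0)  # fix the seed for reproducibility
n = 100_000     # number of repetitions of the two-toss experiment
count = 0       # number of times "at least one heads" occurs

for _ in range(n):
    tosses = random.choice("ht") + random.choice("ht")
    if "h" in tosses:
        count += 1

print(count / n)  # relative frequency; close to 0.75 for large n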
This page titled 1.2: Probability Measures is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
CHAPTER OVERVIEW
2: Computing Probabilities
2.1: Equally Likely Outcomes and Counting Techniques (Combinatorics)
2.2: Conditional Probability and Bayes' Rule
2.3: Independent Events
This page titled 2: Computing Probabilities is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
2.1: Equally Likely Outcomes and Counting Techniques (Combinatorics)
In this chapter, we consider the problem of assigning specific probabilities to outcomes in a sample space. As we saw in Section
1.2, the axiomatic definition of probability (Definition 1.2.1) does not tell us how to compute probabilities. So in this section we
consider the commonly encountered scenario referred to as equally likely outcomes and develop methods for computing
probabilities in this special case.
First, suppose that a sample space S consists of a finite number N of outcomes:
S = {s1, s2, …, sN}.
Suppose further that we denote the probability assigned to each outcome in S as P(si) = pi, for i = 1, …, N. Then the probability of any event A in S is given by adding the probabilities corresponding to the outcomes contained in A, and we can write
P(A) = ∑_{si ∈ A} pi. (2.1.1)
This follows from the third axiom of probability (Definition 1.2.1), since we can write any event as a disjoint union of the outcomes contained in the event. For example, if event A contains three outcomes, then we can write A = {s1, s2, s3} = {s1} ∪ {s2} ∪ {s3}. So the probability of A is given by simply summing up the probabilities assigned to s1, s2, s3. This fact will be useful in the special case of equally likely outcomes, which we consider next.
Definition 2.1.1
The outcomes in a sample space S are equally likely if each outcome has the same probability of occurring.
In general, if outcomes in a sample space S are equally likely, then computing the probability of a single outcome or an event is
very straightforward, as the following exercise demonstrates. You are encouraged to first try to answer the questions for yourself,
and then click "Answer" to see the solution.
Exercise 2.1.1
Suppose that there are N outcomes in the sample space S and that the outcomes are equally likely.
a. What is the probability of a single outcome in S ?
b. What is the probability of an event A in S ?
Answer
First, note that we can represent the outcomes in S as follows:
S = {s1, s2, …, sN}.
For part (a), we can write the probability of each outcome as
P(si) = c, for i = 1, 2, …, N,
where c is some constant. This follows from the fact that the outcomes of S are equally likely and so have the same
probability of occurring. With this set-up and using the axioms of probability (Definition 1.2.1), we have the following:
1 = P(S) = P({s1} ∪ ⋯ ∪ {sN})
  = P(s1) + ⋯ + P(sN)
  = c + ⋯ + c
  = N × c
⇒ c = 1/N.
Thus, the probability of a single outcome is given by 1 divided by the number of outcomes in S .
Now, for an event A in S , suppose it has n outcomes, where n is an integer such that 0 ≤n ≤N . We can represent the
outcomes in A as follows:
A = { a1 , … , an }.
Then, using Equation 2.1.1 and the result of part (a), we have
P(A) = P(a1) + ⋯ + P(an)
     = 1/N + ⋯ + 1/N
     = n(1/N)
     = n/N.
Thus, the probability of an event in S is equal to the number of outcomes in the event divided by the total number of
outcomes in S .
We have already seen an example of a sample space with equally likely outcomes in Example 1.2.1. You are encouraged to revisit
that example and connect it to the results of Exercise 2.1.1.
In general, Exercise 2.1.1 shows that if a sample space S has equally likely outcomes, then the probability of an event A in the sample space is given by
P(A) = (# of outcomes in A) / (# of outcomes in S). (2.1.8)
From this result, we see that in the context of equally likely outcomes calculating probabilities of events reduces to simply counting
the number of outcomes in the event and the sample space. So, we take a break from our discussion of probability, and briefly
introduce some counting techniques.
Video: Calculating Probability (https://youtu.be/YdWaiH3oTq8)
Counting Techniques
First, let's consider the general context of performing multi-step experiments. The following tells us how to count the number of
outcomes in such scenarios.
Multiplication Principle
If one probability experiment has m outcomes and another has n outcomes, then there are m × n total outcomes for the two experiments.
More generally, if there are k probability experiments, with the first experiment having n1 outcomes, the second n2, and so on, then there are n1 × n2 × ⋯ × nk total outcomes for the k experiments performed together.
Example 2.1.1
To demonstrate the Multiplication Principle, consider again the example of tossing a coin twice (see Example 1.2.1). Each toss is a
probability experiment and on each toss, there are two possible outcomes: h or t . Thus, for two tosses, there are 2 × 2 = 4 total
outcomes.
If we toss the coin a third time, there are 2 × 2 × 2 = 8 total outcomes.
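The Multiplication Principle can be verified by brute-force enumeration. A minimal sketch using Python's itertools.product:

import itertools

two_tosses = list(itertools.product("ht", repeat=2))
three_tosses = list(itertools.product("ht", repeat=3))
print(len(two_tosses))    # 2 x 2 = 4 outcomes
print(len(three_tosses))  # 2 x 2 x 2 = 8 outcomes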
Next we define two commonly encountered situations, permutations and combinations, and consider how to count the number of
ways in which they can occur.
Definition 2.1.2
A permutation is an ordered arrangement of objects. For example, "MATH'' is a permutation of four letters from the alphabet.
A combination is an unordered collection of r objects from n total objects. For example, a group of three students chosen from
a class of 10 students.
In order to count the number of possible permutations in a given setting, the Multiplication Principle is applied. For example, if we
want to know the number of possible permutations of the four letters in "MATH'', we compute
4 × 3 × 2 × 1 = 4! = 24,
since there are four letters to select for the first position, three letters for the second, two for the third, leaving only one letter for the
last. In other words, we treat each letter selection as an experiment in a multi-step process.
Counting Permutations
The number of permutations of n distinct objects is given by the following:
n × (n − 1) × ⋯ × 2 × 1 = n!
Counting combinations is a little more complicated, since we are not interested in the order in which objects are selected and so the
Multiplication Principle does not directly apply. Consider the example that a group of three students are chosen from a class of 10.
The group is the same regardless of the order in which the three students are selected. This implies that if we want to count the
number of possible combinations, we need to be careful not to include permutations, i.e., rearrangements, of a certain selection.
This leads to the following result that the number of possible combinations of size r selected from a total of n objects is given by
binomial coefficients.
Counting Combinations
The number of combinations of r objects selected without replacement from n distinct objects is given by
(n choose r) = n! / (r! × (n − r)!).
Note that (n choose r) is also referred to as a binomial coefficient, since it appears in the Binomial Theorem.
Using the above, we can compute the number of possible ways to select three students from a class of 10:
(10 choose 3) = 10! / (3! × 7!) = 120
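Python's standard library (version 3.8 and later) computes factorials and binomial coefficients directly, so the computations above can be checked as follows (a minimal sketch):

import math

print(math.factorial(4))  # 24 permutations of the letters in "MATH"
print(math.comb(10, 3))   # 120 ways to choose 3 students from 10
# math.comb(n, r) implements n! / (r! * (n - r)!):
print(math.factorial(10) // (math.factorial(3) * math.factorial(7)))  # 120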
Example 2.1.2
Consider the example of tossing a coin three times. Note that an outcome is a sequence of heads and tails. Suppose that we are
interested in the number of outcomes with exactly two heads, not in the actual sequence. To find the number of outcomes with
exactly two heads, we need to determine the number of ways to select positions in the sequence for the heads, then the remaining
position will be a tails. If we toss the coin three times, there are three positions to select from, and we want to select two. Since the
order that we make the selection of placements does not matter, we are counting the number of combinations of 2 positions from a
total of 3 positions, i.e.,
(3 choose 2) = 3! / (2! × 1!) = 3.
Of course, this example is small enough that we could have arrived at the answer of 3 using brute force by just listing the
possibilities. However, if we toss the coin a higher number of times, say 50, then the brute force approach becomes infeasible and
we need to make use of binomial coefficients.
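Both counting approaches can be compared directly; a minimal sketch:

import itertools
import math

# brute force: list all 8 sequences and count those with exactly two heads
sequences = itertools.product("ht", repeat=3)
print(sum(1 for seq in sequences if seq.count("h") == 2))  # 3

# binomial coefficient: choose 2 of the 3 positions for the heads
print(math.comb(3, 2))   # 3
# for 50 tosses, brute force would mean 2**50 sequences, but the count is immediate:
print(math.comb(50, 2))  # 1225 sequences with exactly two heads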
This page titled 2.1: Equally Likely Outcomes and Counting Techniques (Combinatorics) is shared under a not declared license and was authored,
remixed, and/or curated by Kristin Kuter.
2.2: Conditional Probability and Bayes' Rule
In many situations, additional information about the result of a probability experiment is known (or at least assumed to be known)
and given that information the probability of some other event is desired. For this scenario, we compute what is referred to
as conditional probability.
Definition 2.2.1
For events A and B , with P (B) > 0 , the conditional probability of A given B , denoted P (A | B) , is given by
P(A | B) = P(A ∩ B) / P(B).
In computing a conditional probability we assume that we know the outcome of the experiment is in event B and then, given that
additional information, we calculate the probability that the outcome is also in event A . This is useful in practice given that partial
information about the outcome of an experiment is often known, as the next example demonstrates.
Example 2.2.1
Continuing in the context of Example 1.2.1, where we considered tossing a fair coin twice, define D to be the event that at least one tails is recorded:
D = {ht, th, tt}.
Let's calculate the conditional probability of A given D, i.e., the probability that at least one heads is recorded (event A ) assuming
that at least one tails is recorded (event D). Recalling that outcomes in this sample space are equally likely, we apply the definition
of conditional probability (Definition 2.2.1) and find
P(A | D) = P(A ∩ D) / P(D) = P({ht, th}) / P({ht, th, tt}) = (2/4) / (3/4) = 2/3 ≈ 0.67.
Note that in Example 1.2.1 we found the unconditional probability of A to be P(A) = 0.75. So, knowing that at least one tails was recorded, i.e., assuming event D occurred, the conditional probability of A given D decreased. This is because, if event D occurs, then the outcome hh in A cannot occur, thereby decreasing the chances that event A occurs.
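For equally likely outcomes, this conditional probability is just a ratio of counts, which a short sketch can confirm (outcome labels as in Example 1.2.1):

S = {"hh", "ht", "th", "tt"}
A = {"hh", "ht", "th"}  # at least one heads
D = {"ht", "th", "tt"}  # at least one tails

# P(A | D) = P(A ∩ D) / P(D) = |A ∩ D| / |D| when outcomes are equally likely
print(len(A & D) / len(D))  # 2/3 ≈ 0.67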
Exercise 2.2.1
Suppose we randomly draw a card from a standard deck of 52 playing cards.
a. If we know that the card is a King, what is the probability that the card is a club?
b. If we instead know that the card is black, what is the probability that the card is a club?
Answer
In order to compute the necessary probabilities, first note that the sample space is given by the set of cards in a
standard deck of playing cards. So the number of outcomes in the sample space is 52. Next, note that the outcomes are
equally likely, since we are randomly drawing the card from the deck.
For part (a), we are looking for the conditional probability that the randomly selected card is a club, given that it is a King. If we let C denote the event that the card is a club and K the event that it is a King, then we are looking to compute
P(C | K) = P(C ∩ K) / P(K). (2.2.1)
To compute these probabilities, we count the number of outcomes in the following events:
# of outcomes in C = # of clubs in a standard deck = 13
# of outcomes in K = # of Kings in a standard deck = 4
# of outcomes in C ∩ K = # of cards that are both a club and a King (the King of clubs) = 1
The probabilities in Equation 2.2.1 are then given by dividing the counts of outcomes in each event by the total
number of outcomes in the sample space (by the boxed Equation 2.1.8 in Section 2.1):
P(C | K) = P(C ∩ K) / P(K) = (1/52) / (4/52) = 1/4 = 0.25.
For part (b), we are looking for the conditional probability that the randomly selected card is a club, given instead that it is black. If we let B denote the event that the card is black, then we are looking to compute
P(C | B) = P(C ∩ B) / P(B). (2.2.2)
To compute these probabilities, we count the number of outcomes in the following events:
# of outcomes in B = # of black cards in a standard deck = 26
# of outcomes in C ∩ B = # of clubs (every club is black) = 13
The probabilities in Equation 2.2.2 are then given by dividing the counts of outcomes in each event by the total
number of outcomes in the sample space:
P(C | B) = P(C ∩ B) / P(B) = (13/52) / (26/52) = 13/26 = 0.5.
Remark: Exercise 2.2.1 demonstrates the following fact. For sample spaces with equally likely outcomes, conditional probabilities are calculated using
P(A | B) = (# of outcomes in A ∩ B) / (# of outcomes in B).
In other words, if we know that the outcome of the probability experiment is in the event B , then we restrict our focus to the
outcomes in that event that are also in A . We can think of this as event B taking the place of the sample space, since we know the
outcome must lie in that event.
Multiplication Law
P (A ∩ B) = P (A | B)P (B) = P (B | A)P (A)
The next two properties are useful when a partition of the sample space exists, where a partition is a way of dividing up the
outcomes in the sample space into non-overlapping sets. A partition is formally defined in the Law of Total Probability below. In
many cases, when a partition exists, it is easy to compute the conditional probability of an event in the sample space given an event
in the partition. The Law of Total Probability then provides a way of using those conditional probabilities of an event, given the partition, to compute the unconditional probability of the event. Following the Law of Total Probability, we state Bayes' Rule, which
is really just an application of the Multiplication Law. Bayes' Rule is used to calculate what are informally referred to as "reverse
conditional probabilities", which are the conditional probabilities of an event in a partition of the sample space, given any other
event.
Law of Total Probability
Suppose B1, B2, …, Bn are events satisfying the following:
1. S = B1 ∪ B2 ∪ ⋯ ∪ Bn
2. Bi ∩ Bj = ∅, for every i ≠ j
3. P(Bi) > 0, for i = 1, …, n
We say that the events B1, B2, …, Bn partition the sample space S. Then for any event A, we can write
P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + ⋯ + P(A | Bn)P(Bn).
Bayes' Rule
Let B1, B2, …, Bn partition the sample space S and let A be an event with P(A) > 0. Then, for j = 1, …, n, we have
P(Bj | A) = P(A | Bj) P(Bj) / P(A).
A common application of the Law of Total Probability and Bayes' Rule is in the context of medical diagnostic testing.
Example 2.2.2
Consider a test that can diagnose kidney cancer. The test correctly detects when a patient has cancer 90% of the time. Also, if a
person does not have cancer, the test correctly indicates so 99.9% of the time. Finally, suppose it is known that 1 in every 10,000
individuals has kidney cancer. We find the probability that a patient has kidney cancer, given that the test indicates she does.
First, note that we are finding a conditional probability. If we let A denote the event that the patient tests positive for cancer, and we let B1 denote the event that the patient actually has cancer, then we want
P(B1 | A).
If we let B2 = B1^c, then we have a partition of all patients (which is the sample space) given by B1 and B2.
In the first paragraph of this example, we are given the following probabilities:
test correctly detects cancer 90% of the time: P(A | B1) = 0.90
test correctly detects no cancer 99.9% of the time: P(A^c | B2) = 0.999 ⇒ P(A | B2) = 1 − P(A^c | B2) = 0.001
1 in every 10,000 individuals has cancer: P(B1) = 0.0001 ⇒ P(B2) = 1 − P(B1) = 0.9999
Since we have a partition of the sample space, we apply the Law of Total Probability to find P(A):
P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) = (0.90)(0.0001) + (0.001)(0.9999) = 0.0010899.
Then, applying Bayes' Rule:
P(B1 | A) = P(A | B1)P(B1) / P(A) = (0.90)(0.0001) / 0.0010899 ≈ 0.0826.
This implies that only about 8% of patients that test positive under this particular test actually have kidney cancer, which is not very
good.
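The arithmetic above is easy to slip on by hand, so here is a minimal sketch of the same computation (the variable names are ours; the numbers are those given in the example):

# given quantities
p_B1 = 0.0001         # P(B1): patient has kidney cancer
p_B2 = 1 - p_B1       # P(B2): patient does not have kidney cancer
p_A_given_B1 = 0.90   # P(A | B1): test detects cancer when present
p_A_given_B2 = 0.001  # P(A | B2) = 1 - 0.999: false positive rate

# Law of Total Probability
p_A = p_A_given_B1 * p_B1 + p_A_given_B2 * p_B2

# Bayes' Rule
p_B1_given_A = p_A_given_B1 * p_B1 / p_A
print(round(p_B1_given_A, 4))  # ≈ 0.0826, i.e., about 8%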
This page titled 2.2: Conditional Probability and Bayes' Rule is shared under a not declared license and was authored, remixed, and/or curated by
Kristin Kuter.
2.3: Independent Events
In this section we consider a property of events that relates to conditional probability, namely independence. First, we define what
it means for a pair of events to be independent, and then we consider collections of more than two events.
Definition 2.3.1
Events A and B are independent if knowing that one occurs does not affect the probability that the other occurs, i.e.,
P (A | B) = P (A) and P (B | A) = P (B). (2.3.1)
Using the definition of conditional probability (Definition 2.2.1), we can derive an alternative to Equations 2.3.1 for determining when two events are independent, as the following example demonstrates.
Example 2.3.1
Suppose that events A and B are independent. We rewrite Equations 2.3.1 using the definition of conditional probability:
P(A | B) = P(A) ⇒ P(A ∩ B) / P(B) = P(A)
and
P(B | A) = P(B) ⇒ P(A ∩ B) / P(A) = P(B)
Multiplying each of the resulting equations through by its denominator, we get
P(A ∩ B) / P(B) = P(A) ⇒ P(A ∩ B) = P(A)P(B)
and
P(A ∩ B) / P(A) = P(B) ⇒ P(A ∩ B) = P(A)P(B)
Both expressions result in P (A ∩ B) = P (A)P (B) . Thus, we have shown that if events A and B are independent, then the
probability of their intersection is equal to the product of their individual probabilities. We state this fact in the next definition.
Definition 2.3.2
Events A and B are independent if
P (A ∩ B) = P (A)P (B).
Generally speaking, Definition 2.3.2 tends to be an easier condition than Definition 2.3.1 to verify when checking whether two
events are independent.
Example 2.3.2
Consider the context of Exercise 2.2.1, where we randomly draw a card from a standard deck of 52 and C denotes the event of
drawing a club, K the event of drawing a King, and B the event of drawing a black card.
Are C and K independent events? Recall that P (C ∩ K) = 1/52 , and note that P (C ) = 13/52 and P (K) = 4/52 . Thus, we
have
P(C ∩ K) = 1/52 = P(C)P(K) = (13/52) × (4/52),
which shows that events C and K are independent.
Definition 2.3.3
Let A1, A2, …, Ak, where k ≥ 3, be a collection of events.
1. The events are pairwise independent if every pair of events in the collection is independent.
2. The events are mutually independent if every sub-collection of events, say Ai1, Ai2, …, Ain, satisfies the following:
P(Ai1 ∩ Ai2 ∩ ⋯ ∩ Ain) = P(Ai1) P(Ai2) ⋯ P(Ain)
Mutual independence is a stronger type of independence, since it implies pairwise independence. But pairwise independence does NOT imply mutual independence, as the following example will demonstrate.
Example 2.3.3
Consider again the context of Example 1.1.1, i.e., tossing a fair coin twice, and define the following events:
A = first toss is heads = {hh, ht}
B = second toss is heads = {hh, th}
C = exactly one toss is heads = {ht, th}
We show that this collection of events - A, B, C - is pairwise independent, but NOT mutually independent. First, we note that the individual probabilities of each event are 0.5:
P(A) = P(B) = P(C) = 2/4 = 0.5
Next, we look at the probabilities of all pairwise intersections to establish pairwise independence:
P(A ∩ B) = P({hh}) = 0.25 = P(A)P(B)
P(A ∩ C) = P({ht}) = 0.25 = P(A)P(C)
P(B ∩ C) = P({th}) = 0.25 = P(B)P(C)
However, note that the three events do not have any outcomes in common, i.e., A ∩ B ∩ C =∅ . Thus, we have
P(A ∩ B ∩ C) = 0 ≠ P(A)P(B)P(C) = 0.125,
which shows that the events are not mutually independent.
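A brute-force check of these computations (a minimal sketch, with outcomes and events defined as above):

from itertools import product

S = set(product("ht", repeat=2))
A = {s for s in S if s[0] == "h"}   # first toss is heads
B = {s for s in S if s[1] == "h"}   # second toss is heads
C = {s for s in S if s[0] != s[1]}  # exactly one toss is heads

def prob(event):
    return len(event) / len(S)

# pairwise independence holds
print(prob(A & B) == prob(A) * prob(B))  # True
print(prob(A & C) == prob(A) * prob(C))  # True
print(prob(B & C) == prob(B) * prob(C))  # True
# but mutual independence fails
print(prob(A & B & C), prob(A) * prob(B) * prob(C))  # 0.0 0.125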
This page titled 2.3: Independent Events is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
CHAPTER OVERVIEW
3: Discrete Random Variables
3.1: Introduction to Random Variables
3.2: Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs) for Discrete Random Variables
3.3: Bernoulli and Binomial Distributions
3.4: Hypergeometric, Geometric, and Negative Binomial Distributions
3.5: Poisson Distribution
3.6: Expected Value of Discrete Random Variables
3.7: Variance of Discrete Random Variables
3.8: Moment-Generating Functions (MGFs) for Discrete Random Variables
This page titled 3: Discrete Random Variables is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
3.1: Introduction to Random Variables
Now that we have formally defined probability and the underlying structure, we add another layer: random variables. Random
variables allow characterization of outcomes, so that we do not need to focus on each outcome specifically. We begin with the
formal definition.
Definition 3.1.1
A random variable is a function from a sample space S to the real numbers R. We denote random variables with capital letters,
e.g.,
X : S → R.
Informally, a random variable assigns numbers to outcomes in the sample space. So, instead of focusing on the outcomes
themselves, we highlight a specific characteristic of the outcomes.
Example 3.1.1
Consider again the context of Example 1.1.1, where we recorded the sequence of heads and tails in two tosses of a fair coin. The sample space for this random experiment is given by
S = {hh, ht, th, tt}.
Suppose we are only interested in tosses that result in heads. We can define a random variable X that tracks the number of heads
obtained in an outcome. So, if outcome hh is obtained, then X will equal 2. Formally, we denote this as follows:
X : S → R
Since there are only four outcomes in S , we can list the value of X for each outcome individually:
function: X
inputs: S ⟶ outputs: R
hh ↦ 2
ht ↦ 1
th ↦ 1
tt ↦ 0
The advantage to defining the random variable X in this context is that the two outcomes ht and th are both assigned a value of 1,
meaning we are not focused on the actual sequence of heads and tails that resulted in obtaining one heads.
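Since a random variable is just a function on the sample space, X from this example can be written down directly; a minimal sketch:

# X maps each outcome to its number of heads
X = {"hh": 2, "ht": 1, "th": 1, "tt": 0}

# equivalently, as a function on outcome strings:
def X_fn(s):
    return s.count("h")

print(X["ht"], X_fn("ht"))  # 1 1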
In Example 3.1.1, note that the random variable we defined only equals one of three possible values: 0, 1, 2. This is an example of
what we call a discrete random variable. We will also encounter another type of random variable: continuous. The next definitions
make precise what we mean by these two types.
Definition 3.1.2
A discrete random variable is a random variable that has only a finite or countably infinite (think integers or whole numbers)
number of possible values.
Definition 3.1.3
A continuous random variable is a random variable with uncountably many possible values (think an interval of real numbers, e.g., [0, 1]).
In this chapter, we take a closer look at discrete random variables, then in Chapter 4 we consider continuous random variables.
This page titled 3.1: Introduction to Random Variables is shared under a not declared license and was authored, remixed, and/or curated by Kristin
Kuter.
3.2: Probability Mass Functions (PMFs) and Cumulative Distribution Functions
(CDFs) for Discrete Random Variables
Since random variables simply assign values to outcomes in a sample space and we have defined probability measures on sample
spaces, we can also talk about probabilities for random variables. Specifically, we can compute the probability that a discrete
random variable equals a specific value (probability mass function) and the probability that a random variable is less than or equal
to a specific value (cumulative distribution function).
Example 3.2.1
Continuing in the context of Example 3.1.1, we compute the probability that the random variable X equals 1. There are two
outcomes that lead to X taking the value 1, namely ht and th . So, the probability that X = 1 is given by the probability of the
event {ht, th}, which is 0.5:
P(X = 1) = P({ht, th}) = (# of outcomes in {ht, th}) / (# of outcomes in S) = 2/4 = 0.5
In Example 3.2.1, the probability that the random variable X equals 1, P (X = 1) , is referred to as the probability mass function of
X evaluated at 1. In other words, the specific value 1 of the random variable X is associated with the probability that X equals that
value, which we found to be 0.5. The process of assigning probabilities to specific values of a discrete random variable is what the
probability mass function is and the following definition formalizes this.
Definition 3.2.1
The probability mass function (pmf) (or frequency function) of a discrete random variable X assigns probabilities to the possible values of the random variable. More specifically, if x1, x2, … denote the possible values of a random variable X, then the pmf is given by
p(xi) = P(X = xi), for i = 1, 2, … (3.2.1)
Note that, in Equation 3.2.1, p(xi) is shorthand for P(X = xi), which represents the probability of the event that the random variable X equals xi.
As we can see in Definition 3.2.1, the probability mass function of a random variable X depends on the probability measure of the
underlying sample space S . Thus, pmf's inherit some properties from the axioms of probability (Definition 1.2.1). In fact, in order
for a function to be a valid pmf it must satisfy the following properties:
1. ∑_{xi} p(xi) = 1, where the sum is taken over all possible values xi of X
2. p(xi) ≥ 0, for every possible value xi
Furthermore, if A is a subset of the possible values of X, then the probability that X takes a value in A is given by
P(X ∈ A) = ∑_{xi ∈ A} p(xi). (3.2.2)
Note that the first property of pmf's stated above follows from the first axiom of probability, namely that the probability of the
sample space equals 1: P (S) = 1 . The second property of pmf's follows from the second axiom of probability, which states that all
probabilities are non-negative.
We now apply the formal definition of a pmf and verify the properties in a specific context.
Example 3.2.2
Returning to Example 3.2.1, now using the notation of Definition 3.2.1, we found that the pmf for X at 1 is given by
p(1) = P(X = 1) = P({ht, th}) = 0.5
Similarly, we find the pmf for X at the other possible values of the random variable:
p(0) = P(X = 0) = P({tt}) = 0.25
p(2) = P(X = 2) = P({hh}) = 0.25
Note that all the values of p are positive (second property of pmf's) and p(0) + p(1) + p(2) = 1 (first property of pmf's). Also, we can demonstrate the third property of pmf's (Equation 3.2.2) by computing the probability that there is at least one heads, i.e., X ≥ 1, which we could represent by setting A = {1, 2} so that we want the probability that X takes a value in A:
P(X ∈ A) = ∑_{xi ∈ A} p(xi) = p(1) + p(2) = 0.5 + 0.25 = 0.75
We can represent probability mass functions numerically with a table, graphically with a histogram, or analytically with a
formula. The following example demonstrates the numerical and graphical representations. In the next three sections, we will see
examples of pmf's defined analytically with a formula.
Example 3.2.3
We represent the pmf we found in Example 3.2.2 in two ways below, numerically with a table and graphically with a histogram.
x    | 0    | 1    | 2
p(x) | 0.25 | 0.50 | 0.25
[Figure 1: Histogram of the pmf of X, with a rectangle of width 1 centered over each of the values 0, 1, 2.]
In the histogram in Figure 1, note that we represent probabilities as areas of rectangles. More specifically, each rectangle in the
histogram has width 1 and height equal to the probability of the value of the random variable X that the rectangle is centered over.
For example, the leftmost rectangle in the histogram is centered at 0 and has height equal to p(0) = 0.25, which is also the area of
the rectangle since the width is equal to 1. In this way, histograms provide a visualization of the distribution of the probabilities
assigned to the possible values of the random variable X. This helps to explain where the common terminology of "probability
distribution" comes from when talking about random variables.
Definition 3.2.2
The cumulative distribution function (cdf) of a random variable X is a function on the real numbers that is denoted as F and is
given by
F (x) = P (X ≤ x), for any x ∈ R. (3.2.3)
Before looking at an example of a cdf, we note a few things about the definition.
First of all, note that we did not specify the random variable X to be discrete. CDFs are also defined for continuous random
variables (see Chapter 4) in exactly the same way.
Second, the cdf of a random variable is defined for all real numbers, unlike the pmf of a discrete random variable, which we only
define for the possible values of the random variable. Implicit in the definition of a pmf is the assumption that it equals 0 for all real
numbers that are not possible values of the discrete random variable, which should make sense since the random variable will never
equal that value. However, cdf's, for both discrete and continuous random variables, are defined for all real numbers. In looking
more closely at Equation 3.2.3, we see that a cdf F considers an upper bound, x ∈ R, on the random variable X, and assigns that
value x to the probability that the random variable X is less than or equal to that upper bound x. This type of probability is referred
to as a cumulative probability, since it could be thought of as the probability accumulated by the random variable up to the
specified upper bound. With this interpretation, we can represent Equation 3.2.3 as follows:
F : R ⟶ R, (3.2.4)
where the inputs are upper bounds on the random variable X and the outputs are cumulative probabilities.
In the case that X is a discrete random variable, with possible values denoted x1, x2, …, xi, …, the cdf of X can be calculated using the third property of pmf's (Equation 3.2.2), since, for a fixed x ∈ R, if we let the set A contain the possible values of X that are less than or equal to x, i.e., A = {xi | xi ≤ x}, then the cdf of X evaluated at x is given by
F(x) = P(X ≤ x) = P(X ∈ A) = ∑_{xi ≤ x} p(xi).
Example 3.2.4
Continuing with Examples 3.2.2 and 3.2.3, we find the cdf for X . First, we find F (x) for the possible values of the random
variable, x = 0, 1, 2:
F(0) = P(X ≤ 0) = P(X = 0) = 0.25
F(1) = P(X ≤ 1) = p(0) + p(1) = 0.25 + 0.50 = 0.75
F(2) = P(X ≤ 2) = p(0) + p(1) + p(2) = 1
Now, if x < 0 , then the cdf F (x) = 0 , since the random variable X will never be negative.
If 0 < x < 1, then the cdf F(x) = 0.25, since the only value of the random variable X that is less than or equal to such a value x is 0. For example, consider x = 0.5. The probability that X is less than or equal to 0.5 is the same as the probability that X = 0, since 0 is the only possible value of X less than 0.5:
F(0.5) = P(X ≤ 0.5) = P(X = 0) = 0.25.
The remaining cases follow similarly: F(x) = 0.75 for 1 ≤ x < 2, and F(x) = 1 for x ≥ 2.
Exercise 3.2.1
For this random variable X, compute the following values of the cdf:
a. F (−3)
b. F (0.1)
c. F (0.9)
d. F (1.4)
e. F (2.3)
f. F (18)
Answer
a. F (−3) = P (X ≤ −3) = 0
b. F(0.1) = P(X ≤ 0.1) = P(X = 0) = 0.25
c. F(0.9) = P(X ≤ 0.9) = P(X = 0) = 0.25
d. F(1.4) = P(X ≤ 1.4) = ∑_{xi ≤ 1.4} p(xi) = p(0) + p(1) = 0.75
e. F(2.3) = P(X ≤ 2.3) = ∑_{xi ≤ 2.3} p(xi) = p(0) + p(1) + p(2) = 1
f. F(18) = P(X ≤ 18) = P(X ≤ 2) = 1
To summarize Example 3.2.4, we write the cdf F as a piecewise function and Figure 2 gives its graph:
F(x) = { 0,    for x < 0
       { 0.25, for 0 ≤ x < 1
       { 0.75, for 1 ≤ x < 2
       { 1,    for x ≥ 2.
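The pmf and cdf of X can be encoded directly, with the cdf computed by accumulating the pmf; a minimal sketch (values from Examples 3.2.2 and 3.2.4):

pmf = {0: 0.25, 1: 0.50, 2: 0.25}

def F(x):
    # cdf: sum the pmf over all possible values <= x
    return sum(p for value, p in pmf.items() if value <= x)

print(F(-3), F(0.1), F(0.9), F(1.4), F(2.3), F(18))
# 0 0.25 0.25 0.75 1.0 1.0  (matching Exercise 3.2.1)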
We end this section with a statement of the properties of cdf's. The reader is encouraged to verify these properties hold for the cdf derived in Example 3.2.4 and to provide an intuitive explanation (or formal explanation using the axioms of probability and the properties of pmf's) for why these properties hold for cdf's in general.
Properties of CDFs
1. 0 ≤ F(x) ≤ 1, for all x ∈ R
2. F is non-decreasing: if x1 ≤ x2, then F(x1) ≤ F(x2)
3. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
This page titled 3.2: Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs) for Discrete Random Variables is shared
under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
3.3: Bernoulli and Binomial Distributions
In this section and the next two, we introduce families of common discrete probability distributions, i.e., probability distributions
for discrete random variables. We refer to these as "families" of distributions because in each case we will define a probability mass
function by specifying an explicit formula, and that formula will incorporate a constant (or set of constants) that are referred to
as parameters. By specifying values for the parameter(s) in the pmf, we define a specific probability distribution for a specific
random variable. For each family of distributions introduced, we will list a set of defining characteristics that will help determine
when to use a certain distribution in a given context.
Bernoulli Distribution
Consider the following example.
Example 3.3.1
Let A be an event in a sample space S . Suppose we are only interested in whether or not the outcome of the underlying probability
experiment is in the specified event A. To track this we can define an indicator random variable, denoted IA, given by
IA(s) = { 1, if s ∈ A,
        { 0, if s ∈ A^c.
In other words, the random variable IA will equal 1 if the resulting outcome is in event A, and IA equals 0 if the outcome is not in A. Thus, IA is a discrete random variable. We can state the probability mass function of IA in terms of the probability that the resulting outcome is in event A, i.e., the probability that event A occurs, P(A):
p(0) = P(IA = 0) = P(A^c) = 1 − P(A)
p(1) = P(IA = 1) = P(A)
In Example 3.3.1, the random variable IA is a Bernoulli random variable, because its pmf has the form of the Bernoulli probability mass function, which we now define.
Definition 3.3.1
A random variable X has a Bernoulli distribution with parameter p, where 0 ≤ p ≤ 1 , if it has only two possible values,
typically denoted 0 and 1. The probability mass function (pmf) of X is given by
p(0) = P (X = 0) = 1 − p,
p(1) = P (X = 1) = p.
The cdf of X is given by
F(x) = { 0,     for x < 0
       { 1 − p, for 0 ≤ x < 1     (3.3.1)
       { 1,     for x ≥ 1.
In Definition 3.3.1, note that the defining characteristic of the Bernoulli distribution is that it models random variables that have
only two possible values. As noted in the definition, the two possible values of a Bernoulli random variable are usually 0 and 1. In
the typical application of the Bernoulli distribution, a value of 1 indicates a "success" and a value of 0 indicates a "failure", where
"success" refers that the event or outcome of interest. The parameter p in the Bernoulli distribution is given by the probability of a
"success". In Example 3.3.1, we were interested in tracking whether or not event A occurred, and so that is what a "success" would
be, which occurs with probability given by the probability of A . Thus, the value of the parameter p for the Bernoulli distribution in
Example 3.3.1 is given by p = P (A) .
Exercise 3.3.1
Derive the general formula for the cdf of the Bernoulli distribution given in Equation 3.3.1.
Hint
First find F (0) and F (1) .
Answer
Recall that the only two values of a Bernoulli random variable X are 0 and 1. So, first, we find the cdf at those two values:
F(0) = P(X ≤ 0) = P(X = 0) = p(0) = 1 − p
F(1) = P(X ≤ 1) = p(0) + p(1) = (1 − p) + p = 1
Now for the other values: a Bernoulli random variable will never be negative, so F(x) = 0, for x < 0. Also, a Bernoulli random variable will always be less than or equal to 1, so F(x) = 1, for x ≥ 1. Lastly, if x is between 0 and 1, then the cdf is given by
F(x) = P(X ≤ x) = P(X = 0) = 1 − p, for 0 ≤ x < 1.
Putting the pieces together yields the formula in Equation 3.3.1.
Binomial Distribution
To introduce the next family of distributions, we use our continuing example of tossing a coin, adding another toss.
Example 3.3.2
Suppose we toss a coin three times and record the sequence of heads (h ) and tails (t ). Supposing that the coin is fair, each toss
results in heads with probability 0.5, and tails with the same probability of 0.5. Since the three tosses are mutually independent, the
probability assigned to any outcome is 0.5^3. More specifically, consider the outcome hth. We could write the probability of this outcome as (0.5)^2 (0.5)^1 to emphasize the fact that two heads and one tails occurred. Note that there are two other outcomes with two heads and one tails: hht and thh. Recall from Example 2.1.2 in Section 2.1 that we can count the number of outcomes with two heads and one tails by counting the number of ways to select positions for the two heads to occur in a sequence of three tosses, which is given by (3 choose 2). In general, note that (3 choose x) counts the number of possible sequences with exactly x heads, for x = 0, 1, 2, 3.
We generalize the above by defining the discrete random variable X to be the number of heads in an outcome. The possible values
of X are x = 0, 1, 2, 3. Using the above facts, the pmf of X is given as follows:
p(0) = P(X = 0) = P({ttt}) = 1/8 = (3 choose 0)(0.5)^0 (0.5)^3
p(1) = P(X = 1) = P({htt, tht, tth}) = 3/8 = (3 choose 1)(0.5)^1 (0.5)^2
p(2) = P(X = 2) = P({hht, hth, thh}) = 3/8 = (3 choose 2)(0.5)^2 (0.5)^1   (3.3.2)
p(3) = P(X = 3) = P({hhh}) = 1/8 = (3 choose 3)(0.5)^3 (0.5)^0
In the above, the fractions are found by calculating the probabilities directly using equally likely outcomes (note that the sample space S has 8 outcomes, see Example 2.1.1). In each line, note where the value of x appears so that we can see the pattern forming. For example, when x = 2, we see in the expression on the right-hand side of Equation 3.3.2 that "2" appears in the binomial coefficient (3 choose 2), which gives the number of outcomes resulting in the random variable equaling 2, and "2" also appears in the exponent on the first 0.5, which gives the probability of two heads occurring.
The pattern exhibited by the random variable X in Example 3.3.2 is referred to as the binomial distribution, which we formalize in
the next definition.
Definition 3.3.2
Suppose that n independent trials of the same probability experiment are performed, where each trial results in either a
"success" (with probability p), or a "failure" (with probability 1 − p ). If the random variable X denotes the total number of
successes in the n trials, then X has a binomial distribution with parameters n and p, which we write X ∼ binomial(n, p).
The probability mass function of X is given by
p(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x), for x = 0, 1, …, n. (3.3.3)
In Example 3.3.2, the independent trials are the three tosses of the coin, so in this case we have parameter n = 3 . Furthermore, we
were interested in counting the number of heads occurring in the three tosses, so a "success" is getting a heads on a toss, which
occurs with probability 0.5 and so parameter p = 0.5 . Thus, the random variable X in this example has a binomial(3, 0.5)
distribution and applying the formula for the binomial pmf given in Equation 3.3.3 when x = 2 we get the same expression on the
right-hand side of Equation 3.3.2:
p(x) = (n choose x) p^x (1 − p)^(n−x) ⇒ p(2) = (3 choose 2) 0.5^2 (1 − 0.5)^(3−2) = (3 choose 2) 0.5^2 0.5^1
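The binomial pmf in Equation 3.3.3 translates directly into code; a minimal sketch using only the standard library:

import math

def binomial_pmf(x, n, p):
    # P(X = x) for X ~ binomial(n, p)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

print(binomial_pmf(2, 3, 0.5))                      # 0.375 = 3/8, as in Example 3.3.2
print([binomial_pmf(x, 3, 0.5) for x in range(4)])  # [0.125, 0.375, 0.375, 0.125]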
In general, we can connect binomial random variables to Bernoulli random variables. If X is a binomial random variable, with
parameters n and p, then it can be written as the sum of n independent Bernoulli random variables, X1, …, Xn. (Note: We will formally define independence for random variables later, in Chapter 5.) Specifically, if we define the random variable Xi, for i = 1, …, n, to be 1 when the i-th trial is a "success", and 0 when it is a "failure", then the sum
X = X1 + X2 + ⋯ + Xn
gives the total number of successes in n trials. This connection between the binomial and Bernoulli distributions will be useful in a later section.
One of the main applications of the binomial distribution is to model population characteristics as in the following example.
Example 3.3.3
Consider a group of 100 voters. If p denotes the probability that a voter will vote for a specific candidate, and we let random
variable X denote the number of voters in the group that will vote for that candidate, then X follows a binomial distribution with
parameters n = 100 and p.
This page titled 3.3: Bernoulli and Binomial Distributions is shared under a not declared license and was authored, remixed, and/or curated by
Kristin Kuter.
3.4: Hypergeometric, Geometric, and Negative Binomial Distributions
In this section, we consider three more families of discrete probability distributions. There are some similarities between the three,
which can make them hard to distinguish at times. So throughout this section we will compare the three to each other and the
binomial distribution, and point out their differences.
Hypergeometric Distribution
Consider the following example.
Example 3.4.1
An urn contains a total of N balls, where some number m of the balls are orange and the remaining N − m are grey. Suppose we
draw n balls from the urn without replacement, meaning once we select a ball we do not place it back in the urn before drawing out
the next one. Then some of the balls in our selection may be orange and some may be grey. We can define the discrete random
variable X to give the number of orange balls in our selection. The probability distribution of X is referred to as the
hypergeometric distribution, which we define next.
Definition 3.4.1
Suppose in a collection of N objects, m are of type 1 and N − m are of another type 2. Furthermore, suppose that n objects are
randomly selected from the collection without replacement. Define the discrete random variable X to give the number of
selected objects that are of type 1. Then X has a hypergeometric distribution with parameters N , m, n. The probability mass
function of X is given by
p(x) = P(X = x) = (m choose x)(N − m choose n − x) / (N choose n)
In some sense, the hypergeometric distribution is similar to the binomial, except that the method of sampling is crucially different.
In each case, we are interested in the number of times a specific outcome occurs in a set number of repeated trials, where we could
consider each selection of an object in the hypergeometric case as a trial. In the binomial case we are interested in the number of
"successes" in the trials, and in the hypergeometric case we are interested in the number of a certain type of object being selected,
which could be considered a "success". However, the trials in a binomial distribution are independent, while the trials in a
hypergeometric distribution are not, because the objects are selected without replacement. If, in Example 3.4.1, the balls were drawn with replacement, then each draw would be an independent Bernoulli trial and the distribution of X would be binomial, since the number of balls in the urn would be the same each time another ball is drawn. However, when the balls are drawn without replacement, the draws are not independent, since the number of balls in the urn decreases after each draw, as does the number of balls of a given type.
Exercise 3.4.1
Suppose your friend has 10 cookies, 3 of which are chocolate chip. Your friend randomly divides the cookies equally between
herself and you. What is the probability that you get all the chocolate chip cookies?
Answer
Let random variable X = number of chocolate chip cookies you get. Then X is hypergeometric with N = 10 total cookies,
m = 3 chocolate chip cookies, and n = 5 cookies selected by your friend to give to you. We want the probability that you get all 3 chocolate chip cookies, i.e., P(X = 3):
P(X = 3) = (3 choose 3)(7 choose 2) / (10 choose 5) ≈ 0.083
Note that X has a hypergeometric distribution and not binomial because the cookies are being selected (or divided) without
replacement.
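The same computation can be done with math.comb; a minimal sketch:

import math

# N = 10 cookies, m = 3 chocolate chip, n = 5 given to you
p = math.comb(3, 3) * math.comb(7, 2) / math.comb(10, 5)
print(round(p, 3))  # 0.083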
Geometric Distribution
Definition 3.4.2
Suppose that a sequence of independent Bernoulli trials is performed, with p = P("success") for each trial. Define the random variable X to give the number of the trial on which the first success occurs. Then X has a geometric distribution with parameter p. The probability mass function of X is given by
p(x) = P(X = x) = P(1st success on x-th trial)
     = P(first (x − 1) trials are failures & x-th trial is a success)
     = (1 − p)^(x−1) p, for x = 1, 2, 3, … (3.4.1)
Exercise 3.4.2
Verify that the pmf for a geometric distribution (Equation 3.4.1) satisfies the two properties for pmf's, i.e.,
1. p(x) ≥ 0, for x = 1, 2, 3, …
2. ∑_{x=1}^{∞} p(x) = 1
Answer
1. Note that 0 ≤ p ≤ 1, so that we also have 0 ≤ (1 − p) ≤ 1 and 0 ≤ (1 − p)^(x−1) ≤ 1, for x = 1, 2, …. Thus, it follows that p(x) = (1 − p)^(x−1) p ≥ 0.
2. Note that the sum of the geometric pmf is a geometric series with first term a = p and ratio r = 1 − p < 1. Thus, we have
∑_{x=1}^{∞} p(x) = ∑_{x=1}^{∞} (1 − p)^(x−1) p = p / (1 − (1 − p)) = p / p = 1 ✓
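A quick numeric check of this series identity, truncating the infinite sum (a minimal sketch):

p = 0.3  # any 0 < p <= 1 works
partial_sum = sum((1 - p)**(x - 1) * p for x in range(1, 200))
print(partial_sum)  # ≈ 1.0, as the identity predicts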
Example 3.4.2
Each of the following is an example of a random variable with the geometric distribution.
1. Toss a fair coin until the first heads occurs. In this case, a "success" is getting a heads ("failure" is getting tails) and so the
parameter p = P (h) = 0.5 .
2. Buy lottery tickets until getting the first win. In this case, a "success" is getting a lottery ticket that wins money, and a "failure"
is not winning. The parameter p will depend on the odds of winning for a specific lottery.
3. Roll a pair of fair dice until getting the first double 1's. In this case, a "success" is getting double 1's, and a "failure" is simply
not getting double 1's (so anything else). To find the parameter p, note that the underlying sample space consists of all possible
rolls of a pair of fair dice, of which there are 6 × 6 = 36 because each die has 6 possible sides. Each of these rolls is equally
likely, so
$$p = P(\text{double 1's}) = \frac{\#\text{ of ways to roll double 1's}}{36} = \frac{1}{36}.$$
The negative binomial distribution generalizes the geometric distribution by considering any number of successes.
Definition 3.4.3
Suppose that a sequence of independent Bernoulli trials is performed, with p = P("success") for each trial. Fix an integer r greater than or equal to 2 and define the random variable X to give the number of the trial on which the $r^{\text{th}}$ success occurs. Then X has a negative binomial distribution with parameters r and p. The probability mass function of X is given by
$$\begin{aligned}
p(x) = P(X = x) &= P(r^{\text{th}} \text{ success is on } x^{\text{th}} \text{ trial}) \\
&= P(\text{first } r-1 \text{ successes in first } x-1 \text{ trials}) \times P(r^{\text{th}} \text{ success on } x^{\text{th}} \text{ trial}) \qquad (\text{binomial with } n = x-1) \\
&= \binom{x-1}{r-1}\,p^{r-1}(1-p)^{(x-1)-(r-1)} \times p \\
&= \binom{x-1}{r-1}\,p^{r}(1-p)^{x-r}, \quad \text{for } x = r, r+1, r+2, \ldots
\end{aligned}$$
Example 3.4.3
For examples of the negative binomial distribution, we can alter the geometric examples given in Example 3.4.2.
1. Toss a fair coin until getting 8 heads. In this case, the parameter p is still given by p = P(h) = 0.5, but now we also have the
parameter r = 8, the number of desired "successes", i.e., heads.
2. Buy lottery tickets until winning 5 times. In this case, the parameter p is still given by the odds of winning the lottery, but now we
also have the parameter r = 5, the number of desired wins.
3. Roll a pair of fair dice until getting 100 double 1's. In this case, the parameter p is still given by $p = P(\text{double 1's}) = \frac{1}{36}$, but now we also have the parameter r = 100, the number of desired double 1's.
Note that for both the geometric and negative binomial distributions the number of possible values the random variable can take is
infinite. These are still discrete distributions though, since we can "list" the values. In other words, the possible values are
countable. This is in contrast to the Bernoulli, binomial, and hypergeometric distributions, where the number of possible values is
finite.
We again note the distinction between the binomial distribution and the geometric and negative binomial distributions. In the
binomial distribution, the number of trials is fixed and we count the number of "successes", whereas in the geometric and
negative binomial distributions, the number of "successes" is fixed and we count the number of trials needed to obtain the desired
number of "successes".
This page titled 3.4: Hypergeometric, Geometric, and Negative Binomial Distributions is shared under a not declared license and was authored,
remixed, and/or curated by Kristin Kuter.
3.5: Poisson Distribution
In this section, we consider our final family of discrete probability distributions. We begin with the definition.
Definition 3.5.1
A random variable X has a Poisson distribution, with parameter λ > 0 , if its probability mass function is given by
$$p(x) = P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad \text{for } x = 0, 1, 2, \ldots \tag{3.5.1}$$
We write X ∼ Poisson(λ) .
The main application of the Poisson distribution is to count the number of times some event occurs over a fixed interval of time or
space. More specifically, if the random variable X denotes the number of times the event occurs during an interval of length T , and
r denotes the average rate at which the event occurs per unit interval, then X has a Poisson distribution with parameter λ = rT .
Example 3.5.1
Suppose typos occur at an average rate of r = 0.01 per page in the Friday edition of the New York Times, which is 45 pages long.
Let X denote the number of typos on the front page. Then X has a Poisson distribution with parameter
λ = 0.01 × 1 = 0.01,
since we are considering an interval of length one page (T =1 ). Thus, the probability that there is at least one typo on the front
page is given by
$$P(X \ge 1) = P\left(\{X = 0\}^{c}\right) = 1 - P(X = 0) = 1 - \frac{e^{-0.01}(0.01)^{0}}{0!} = 1 - e^{-0.01} \approx 0.00995.$$
Now, if we let random variable Y denote the number of typos in the entire paper, then Y has a Poisson distribution with parameter
λ = 0.01 × 45 = 0.45,
since we are considering an interval of T = 45 pages. The probability that there are less than three typos in the entire paper is
$$P(Y < 3) = P(Y = 0, 1, \text{ or } 2) = \frac{e^{-0.45}(0.45)^{0}}{0!} + \frac{e^{-0.45}(0.45)^{1}}{1!} + \frac{e^{-0.45}(0.45)^{2}}{2!} \approx 0.98912.$$
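Both probabilities are quick to reproduce numerically; here is a minimal sketch assuming Python with scipy:

```python
# Reproducing Example 3.5.1 with scipy's Poisson distribution objects.
from scipy.stats import poisson

front_page = poisson(0.01)      # lambda = rate * length = 0.01 * 1
whole_paper = poisson(0.45)     # lambda = 0.01 * 45

print(1 - front_page.pmf(0))    # P(X >= 1) ~ 0.00995
print(whole_paper.cdf(2))       # P(Y < 3) = P(Y <= 2) ~ 0.98912
```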
The Poisson distribution is similar to all previously considered families of discrete probability distributions in that it counts the
number of times something happens. However, the Poisson distribution is different in that there is not an act that is being
repeatedly performed. In other words, there are no set trials, but rather a set window of time or space to observe.
This page titled 3.5: Poisson Distribution is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
3.6: Expected Value of Discrete Random Variables
In this section, and the next, we look at various numerical characteristics of discrete random variables. These give us a way of
classifying and comparing random variables.
Definition 3.6.1
If X is a discrete random variable with possible values $x_1, x_2, \ldots, x_i, \ldots$ and probability mass function $p(x)$, then the
expected value (or mean) of X is denoted E[X] and given by
$$E[X] = \sum_{i} x_i \cdot p(x_i). \tag{3.6.1}$$
The expected value of X may also be denoted as $\mu_X$, or simply $\mu$ if the context is clear.
The expected value of a random variable has many interpretations. First, looking at the formula in Definition 3.6.1 for computing
expected value (Equation 3.6.1), note that it is essentially a weighted average. Specifically, for a discrete random variable, the
expected value is computed by "weighting'', or multiplying, each value of the random variable, x , by the probability that the
i
random variable takes that value, p(x ), and then summing over all possible values. This interpretation of the expected value as a
i
weighted average explains why it is also referred to as the mean of the random variable.
The expected value of a random variable is also interpreted as the long-run average value of the random variable. In other words, if we
repeat the underlying random experiment many times and take the average of the values of the random variable corresponding to
the outcomes, we would get the expected value, approximately. (Note: This interpretation of expected value is similar to the
relative frequency approximation for probability discussed in Section 1.2.) Again, we see that the expected value is related to an
average value of the random variable. Given the interpretation of the expected value as an average, either "weighted'' or "long-run'',
the expected value is often referred to as a measure of center of the random variable.
Finally, the expected value of a random variable has a graphical interpretation. The expected value gives the center of mass of the
probability mass function, which the following example demonstrates.
Example 3.6.1
Consider again the context of Example 1.1.1, where we recorded the sequence of heads and tails in two tosses of a fair coin. In
Example 3.1.1 we defined the discrete random variable X to denote the number of heads obtained. In Example 3.2.2 we found the
pmf of X. We now apply Equation 3.6.1 from Definition 3.6.1 and compute the expected value of X:
$$E[X] = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 0.5 + 0.5 = 1.$$
Thus, we expect that the number of heads obtained in two tosses of a fair coin will be 1 in the long-run or on average. Figure 1
demonstrates the graphical representation of the expected value as the center of mass of the probability mass function.
Figure 1: Histogram of X. The red arrow represents the center of mass, or the expected value of X
Example 3.6.2
Suppose we toss a fair coin three times and define the random variable X to be our winnings on a single play of a game where
we win \$x if the first heads occurs on the $x^{\text{th}}$ toss, for x = 1, 2, 3, and we lose \$1 if no heads occur. The pmf of X is given by the following table:
x     p(x) = P(X = x)
-1    1/8
1     1/2
2     1/4
3     1/8
Applying Definition 3.6.1, the expected value is
$$E[X] = \sum_{i} x_i \cdot p(x_i) = (-1)\cdot\frac{1}{8} + 1\cdot\frac{1}{2} + 2\cdot\frac{1}{4} + 3\cdot\frac{1}{8} = \frac{5}{4} = 1.25.$$
Thus, the expected winnings for a single play of the game is $1.25. In other words, if we played the game multiple times, we expect
the average winnings to be $1.25.
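Since the expected value is just a weighted average, such computations translate directly into a couple of lines of code. A minimal sketch in plain Python:

```python
# Computing E[X] for the game in Example 3.6.2 directly from the pmf,
# as a weighted average of the possible values.
pmf = {-1: 1/8, 1: 1/2, 2: 1/4, 3: 1/8}
expected = sum(x * p for x, p in pmf.items())
print(expected)  # 1.25
```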
For many of the common probability distributions, the expected value is given by a parameter of the distribution as the next
exercise shows for the Poisson distribution.
Exercise 3.6.1
Suppose the discrete random variable X has a Poisson distribution with parameter λ . Show that E[X] = λ .
Hint
Recall from calculus the series expansion: $e^{y} = \sum_{x=1}^{\infty} \dfrac{y^{x-1}}{(x-1)!}$
Answer
First, recall that the pmf for the Poisson distribution is $p(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$, for $x = 0, 1, \ldots$. Then, we apply Definition 3.6.1, giving us
$$\begin{aligned}
E[X] &= \sum_{x=0}^{\infty} x \cdot p(x) \\
&= \sum_{x=0}^{\infty} x \cdot \frac{e^{-\lambda}\lambda^{x}}{x!} \qquad (\text{the } x = 0 \text{ term is } 0 \text{, so we drop it from the sum}) \\
&= \sum_{x=1}^{\infty} x \cdot \frac{e^{-\lambda}\lambda^{x}}{x!} = \sum_{x=1}^{\infty} \frac{e^{-\lambda}\,\lambda\cdot\lambda^{x-1}}{(x-1)!} \\
&= e^{-\lambda}\lambda \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} \\
&= e^{-\lambda}\cdot\lambda\cdot e^{\lambda} \qquad (\text{here we used the series expansion of } e^{y}) \\
&= \lambda,
\end{aligned}$$
as needed.
The expected value may not be exactly equal to a parameter of the probability distribution, but rather it may be a function of the
parameters. The following table gives the expected value for each of the common discrete distributions we considered earlier. We
will see later (Section 3.8) how to derive these results using a technique involving moment-generating functions.
Distribution                 E[X]
Bernoulli(p)                 p
binomial(n, p)               np
hypergeometric(N, n, m)      nm/N
geometric(p)                 1/p
negative binomial(r, p)      r/p
Poisson(λ)                   λ
The following theorems, which we state without proof, demonstrate how to calculate the expected value of functions of random variables.
Theorem 3.6.1
Let X be a random variable and let g be a real-valued function. Define the random variable Y = g(X) .
If X is a discrete random variable with possible values $x_1, x_2, \ldots, x_i, \ldots$ and probability mass function $p(x)$, then the
expected value of Y is given by
$$E[Y] = E[g(X)] = \sum_{i} g(x_i) \cdot p(x_i).$$
To put it simply, Theorem 3.6.1 states that to find the expected value of a function of a random variable, just apply the function to
the possible values of the random variable in the definition of expected value. Before stating an important special case of Theorem
3.6.1, a word of caution regarding order of operations. Note that, in general,
$$E[g(X)] \ne g(E[X])\,! \tag{3.6.2}$$
However, when g is a linear function, $g(x) = ax + b$ for constants a and b, equality does hold:
$$E[aX + b] = aE[X] + b.$$
The above special case is referred to as the linearity of expected value, which implies the following properties of the expected
value:
$$E[cg(X)] = cE[g(X)] \qquad \text{and} \qquad E[g_1(X) + g_2(X)] = E[g_1(X)] + E[g_2(X)].$$
This page titled 3.6: Expected Value of Discrete Random Variables is shared under a not declared license and was authored, remixed, and/or
curated by Kristin Kuter.
3.7: Variance of Discrete Random Variables
We now look at our second numerical characteristic associated to random variables.
Definition 3.7.1
The variance of a random variable X is given by
$$\sigma^{2} = \text{Var}(X) = E\left[(X-\mu)^{2}\right],$$
where $\mu = E[X]$. The standard deviation of X is the square root of the variance:
$$\sigma = SD(X) = \sqrt{\text{Var}(X)}.$$
In words, the variance of a random variable is the average of the squared deviations of the random variable from its mean (expected
value). Notice that the variance of a random variable will result in a number with units squared, but the standard deviation will have
the same units as the random variable. Thus, the standard deviation is easier to interpret, which is why we make a point to define it.
The variance and standard deviation give us a measure of spread for random variables. The standard deviation is interpreted as a
measure of how "spread out'' the possible values of X are with respect to the mean of X, μ = E[X] .
Example 3.7.1
Consider the two random variables $X_1$ and $X_2$, whose probability mass functions are given by the histograms in Figure 1 below.
Note that $X_1$ and $X_2$ have the same mean. However, in looking at the histograms, we see that the possible values of $X_2$ are more
"spread out" from the mean, indicating that the variance (and standard deviation) of $X_2$ is larger.
Figure 1: Histograms for random variables $X_1$ and $X_2$, both with the same expected value but different variance.
Theorem 3.6.1 actually tells us how to compute variance, since it is given by finding the expected value of a function applied to the
random variable. First, if X is a discrete random variable with possible values $x_1, x_2, \ldots, x_i, \ldots$ and probability mass function $p(x)$, then
$$\text{Var}(X) = \sum_{i} (x_i - \mu)^{2} \cdot p(x_i).$$
The above formula follows directly from Definition 3.7.1. However, there is an alternate formula for calculating variance, given by
the following theorem, that is often easier to use.
Theorem 3.7.1
Let X be any random variable, with mean μ . Then the variance of X is
$$\text{Var}(X) = E[X^{2}] - \mu^{2}. \tag{3.7.1}$$
Proof
By the definition of variance (Definition 3.7.1) and the linearity of expectation, we have the following:
$$\begin{aligned}
\text{Var}(X) &= E[(X-\mu)^{2}] \\
&= E[X^{2} + \mu^{2} - 2X\mu] \\
&= E[X^{2}] + E[\mu^{2}] - E[2X\mu] \\
&= E[X^{2}] + \mu^{2} - 2\mu E[X] \qquad (\text{since } \mu \text{ is constant, it can be taken out of the expected value}) \\
&= E[X^{2}] + \mu^{2} - 2\mu^{2} \\
&= E[X^{2}] - \mu^{2}.
\end{aligned}$$
Example 3.7.2
Continuing in the context of Example 3.6.1, we calculate the variance and standard deviation of the random variable X denoting
the number of heads obtained in two tosses of a fair coin. Using the alternate formula for variance, we need to first calculate $E[X^{2}]$, for which we use Theorem 3.6.1:
$$E[X^{2}] = 0^{2}\cdot p(0) + 1^{2}\cdot p(1) + 2^{2}\cdot p(2) = 0 + 0.5 + 1 = 1.5.$$
Thus, recalling that $E[X] = 1$, we have
$$\text{Var}(X) = E[X^{2}] - \mu^{2} = 1.5 - 1^{2} = 0.5 \quad\Rightarrow\quad SD(X) = \sqrt{0.5} \approx 0.707.$$
Exercise 3.7.1
Consider the context of Example 3.6.2, where we defined the random variable X to be our winnings on a single play of a game
involving flipping a fair coin three times. We found that E[X] = 1.25. Now find the variance and standard deviation of X.
Answer
First, find $E[X^{2}]$:
$$E[X^{2}] = \sum_{i} x_i^{2} \cdot p(x_i) = (-1)^{2}\cdot\frac{1}{8} + 1^{2}\cdot\frac{1}{2} + 2^{2}\cdot\frac{1}{4} + 3^{2}\cdot\frac{1}{8} = \frac{11}{4} = 2.75.$$
Then, since $E[X] = 1.25$, we have
$$\text{Var}(X) = E[X^{2}] - (E[X])^{2} = 2.75 - 1.25^{2} = 1.1875 \quad\Rightarrow\quad SD(X) = \sqrt{1.1875} \approx 1.09.$$
Given that the variance of a random variable is defined to be the expected value of squared deviations from the mean, variance is
not linear as expected value is. We do have the following useful property of variance though.
Theorem 3.7.2
Let X be a random variable, and a, b be constants. Then the following holds:
$$\text{Var}(aX + b) = a^{2}\,\text{Var}(X).$$
Exercise 3.7.2
Prove Theorem 3.7.2.
Answer
First, let μ = E[X] and note that by the linearity of expectation we have
E[aX + b] = aE[X] + b = aμ + b.
Now, we use the alternate formula for variance given in Theorem 3.7.1 to prove the result:
$$\begin{aligned}
\text{Var}(aX+b) &= E\left[(aX+b)^{2}\right] - \left(E[aX+b]\right)^{2} \\
&= E[a^{2}X^{2} + 2abX + b^{2}] - (a\mu + b)^{2} \\
&= a^{2}E[X^{2}] + 2ab\,E[X] + b^{2} - a^{2}\mu^{2} - 2ab\mu - b^{2} \\
&= a^{2}E[X^{2}] - a^{2}\mu^{2} = a^{2}\left(E[X^{2}] - \mu^{2}\right) = a^{2}\,\text{Var}(X).
\end{aligned}$$
Theorem 3.7.2 easily follows from a little algebraic modification. Note that the "+ b '' disappears in the formula. There is an
intuitive reason for this. Namely, the "+ b '' corresponds to a horizontal shift of the probability mass function for the random
variable. Such a transformation to this function is not going to affect the spread, i.e., the variance will not change.
As with expected values, for many of the common probability distributions, the variance is given by a parameter or a function of
the parameters for the distribution.
Distribution                 Var(X)
Bernoulli(p)                 p(1 − p)
binomial(n, p)               np(1 − p)
hypergeometric(N, n, m)      n(m/N)(1 − m/N)((N − n)/(N − 1))
geometric(p)                 (1 − p)/p²
negative binomial(r, p)      r(1 − p)/p²
Poisson(λ)                   λ
This page titled 3.7: Variance of Discrete Random Variables is shared under a not declared license and was authored, remixed, and/or curated by
Kristin Kuter.
3.8: Moment-Generating Functions (MGFs) for Discrete Random Variables
The expected value and variance of a random variable are actually special cases of a more general class of numerical characteristics
for random variables given by moments.
Definition 3.8.1
The $r^{\text{th}}$ moment of a random variable X is given by $E[X^{r}]$. The $r^{\text{th}}$ central moment of X is given by $E[(X-\mu)^{r}]$, where $\mu = E[X]$.
Note that the expected value of a random variable is given by the first moment, i.e., when $r = 1$. Also, the variance of a random
variable is given by the second central moment.
As with expected value and variance, the moments of a random variable are used to characterize the distribution of the random
variable and to compare the distribution to that of other random variables. Moments can be calculated directly from the definition,
but, even for moderate values of r, this approach becomes cumbersome. The next definition and theorem provide an easier way to
generate moments.
Definition 3.8.2
The moment-generating function (mgf) of a random variable X is given by
$$M_X(t) = E[e^{tX}], \quad \text{for } t \in \mathbb{R}.$$
Theorem 3.8.1
If random variable X has mgf $M_X(t)$, then
$$M_X^{(r)}(0) = \left.\frac{d^{r}}{dt^{r}}\left[M_X(t)\right]\right|_{t=0} = E[X^{r}].$$
Theorem 3.6.1 tells us how to derive the mgf of a discrete random variable, since the mgf is given by taking the expected value of a
function applied to the random variable:
$$M_X(t) = E[e^{tX}] = \sum_{i} e^{t x_i} \cdot p(x_i).$$
We can now derive the first moment of the Poisson distribution, i.e., derive the fact we mentioned in Section 3.6, but left as an
exercise, that the expected value is given by the parameter λ . We also find the variance.
Example 3.8.1
Let X ∼ Poisson(λ) . Then, the pmf of X is given by
$$p(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad \text{for } x = 0, 1, 2, \ldots$$
Before we derive the mgf for X, we recall from calculus the Taylor series expansion of the exponential function $e^{y}$:
$$e^{y} = \sum_{x=0}^{\infty} \frac{y^{x}}{x!}.$$
Using this expansion, we derive the mgf:
$$M_X(t) = E[e^{tX}] = \sum_{x=0}^{\infty} e^{tx}\cdot\frac{e^{-\lambda}\lambda^{x}}{x!} = e^{-\lambda}\sum_{x=0}^{\infty} \frac{(\lambda e^{t})^{x}}{x!} = e^{-\lambda}\,e^{\lambda e^{t}} = e^{\lambda(e^{t}-1)}.$$
Now we take the first and second derivatives of M X (t) . Remember we are differentiating with respect to t :
$$M_X'(t) = \frac{d}{dt}\left[e^{\lambda(e^{t}-1)}\right] = \lambda e^{t}\,e^{\lambda(e^{t}-1)}$$
$$M_X''(t) = \frac{d}{dt}\left[\lambda e^{t}\,e^{\lambda(e^{t}-1)}\right] = \lambda e^{t}\,e^{\lambda(e^{t}-1)} + \lambda^{2}e^{2t}\,e^{\lambda(e^{t}-1)}$$
Next we evaluate the derivatives at t = 0 to find the first and second moments of X:
$$E[X] = M_X'(0) = \lambda e^{0}\,e^{\lambda(e^{0}-1)} = \lambda$$
$$E[X^{2}] = M_X''(0) = \lambda e^{0}\,e^{\lambda(e^{0}-1)} + \lambda^{2}e^{0}\,e^{\lambda(e^{0}-1)} = \lambda + \lambda^{2}$$
Thus, $\text{Var}(X) = E[X^{2}] - (E[X])^{2} = \lambda + \lambda^{2} - \lambda^{2} = \lambda$, and we have shown that both the mean and variance of the Poisson($\lambda$) distribution are given by the parameter $\lambda$.
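The differentiation above can also be checked symbolically; here is a minimal sketch assuming Python with the sympy library:

```python
# Symbolic check of Example 3.8.1: differentiate the Poisson mgf
# and evaluate at t = 0 to recover the first two moments.
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))                  # mgf of Poisson(lambda)
m1 = sp.simplify(sp.diff(M, t).subs(t, 0))         # first moment: lambda
m2 = sp.simplify(sp.diff(M, t, 2).subs(t, 0))      # second moment: lambda + lambda^2
print(m1, sp.expand(m2), sp.simplify(m2 - m1**2))  # variance = lambda
```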
Note that the mgf of a random variable is a function of t . The main application of mgf's is to find the moments of a random
variable, as the previous example demonstrated. There are more properties of mgf's that allow us to find moments for functions of
random variables.
Theorem 3.8.2
Let X be a random variable with mgf $M_X(t)$, and let a, b be constants. If random variable $Y = aX + b$, then the mgf of Y is
given by
$$M_Y(t) = e^{bt}\,M_X(at).$$
Theorem 3.8.3
If $X_1, \ldots, X_n$ are independent random variables with mgf's $M_{X_1}(t), \ldots, M_{X_n}(t)$, respectively, then the mgf of random
variable $Y = X_1 + \cdots + X_n$ is given by
$$M_Y(t) = M_{X_1}(t)\cdots M_{X_n}(t).$$
Recall that a binomially distributed random variable can be written as a sum of independent Bernoulli random variables. We use
this and Theorem 3.8.3 to derive the mean and variance for a binomial distribution. First, we find the mean and variance of a
Bernoulli distribution.
Example 3.8.2
Recall that X has a Bernoulli(p) distribution if it is assigned the value of 1 with probability p and the value of 0 with probability
1 − p . Thus, the pmf of X is given by
$$p(x) = \begin{cases} 1-p, & \text{if } x = 0 \\ p, & \text{if } x = 1 \end{cases}$$
In order to find the mean and variance of X, we first derive the mgf:
$$M_X(t) = E[e^{tX}] = e^{t(0)}(1-p) + e^{t(1)}p = 1 - p + e^{t}p.$$
Taking derivatives, we have
$$M_X'(t) = \frac{d}{dt}\left[1 - p + e^{t}p\right] = e^{t}p, \qquad M_X''(t) = \frac{d}{dt}\left[e^{t}p\right] = e^{t}p.$$
Next we evaluate the derivatives at t = 0 to find the first and second moments:
$$M_X'(0) = M_X''(0) = e^{0}p = p.$$
Thus, the expected value of X is E[X] = p . Finally, we use the alternate formula for calculating variance:
$$\text{Var}(X) = E[X^{2}] - (E[X])^{2} = p - p^{2} = p(1-p).$$
Example 3.8.3
Let $X \sim \text{binomial}(n, p)$. If $X_1, \ldots, X_n$ denote n independent Bernoulli(p) random variables, then we can write
$X = X_1 + \cdots + X_n$.
In Example 3.8.2, we found the mgf for a Bernoulli(p) random variable. Thus, we have
$$M_{X_i}(t) = 1 - p + e^{t}p, \quad \text{for } i = 1, \ldots, n.$$
Applying Theorem 3.8.3, the mgf of X is
$$M_X(t) = M_{X_1}(t)\cdots M_{X_n}(t) = (1 - p + e^{t}p)^{n}.$$
Taking derivatives and evaluating at $t = 0$:
$$M_X'(t) = n(1 - p + e^{t}p)^{n-1}e^{t}p \quad\Rightarrow\quad E[X] = M_X'(0) = np$$
$$M_X''(t) = \frac{d}{dt}\left[n(1 - p + e^{t}p)^{n-1}e^{t}p\right] = n(n-1)(1 - p + e^{t}p)^{n-2}(e^{t}p)^{2} + n(1 - p + e^{t}p)^{n-1}e^{t}p$$
$$\Rightarrow\quad M_X''(0) = n(n-1)p^{2} + np$$
Thus, the variance of X is
$$\text{Var}(X) = M_X''(0) - \left(M_X'(0)\right)^{2} = n(n-1)p^{2} + np - (np)^{2} = np(1-p).$$
We end with a final property of mgf's that relates to the comparison of the distribution of random variables.
Theorem 3.8.4
The mgf $M_X(t)$ of random variable X uniquely determines the probability distribution of X. In other words, if random
variables X and Y have the same mgf, $M_X(t) = M_Y(t)$, then X and Y have the same probability distribution.
Exercise 3.8.1
Suppose the random variable X has the following mgf:
$$M_X(t) = \left(0.85 + 0.15\,e^{t}\right)^{33}$$
What is the probability distribution of X?
Hint
Use Theorem 3.8.4 and look at Example 3.8.3.
Answer
We found in Example 3.8.3 that the mgf for a binomial distribution is
$$M_X(t) = (1 - p + e^{t}p)^{n},$$
which is the mgf given with p = 0.15 and n = 33 . Thus, X ∼ binomial(33, 0.15).
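The identification can be sanity-checked numerically; a sketch assuming Python with numpy and scipy (the test value t = 0.7 is arbitrary):

```python
# Check that the given mgf matches E[e^{tX}] computed by summing
# over a binomial(33, 0.15) pmf.
import numpy as np
from scipy.stats import binom

n, p, t = 33, 0.15, 0.7
xs = np.arange(n + 1)
mgf_by_sum = np.sum(np.exp(t * xs) * binom.pmf(xs, n, p))
mgf_formula = (1 - p + p * np.exp(t))**n
print(mgf_by_sum, mgf_formula)   # the two values agree
```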
This page titled 3.8: Moment-Generating Functions (MGFs) for Discrete Random Variables is shared under a not declared license and was
authored, remixed, and/or curated by Kristin Kuter.
CHAPTER OVERVIEW
4: Continuous Random Variables
Topic hierarchy
4.1: Probability Density Functions (PDFs) and Cumulative Distribution Functions (CDFs) for Continuous Random Variables
4.2: Expected Value and Variance of Continuous Random Variables
4.3: Uniform Distributions
4.4: Normal Distributions
4.5: Exponential and Gamma Distributions
4.6: Weibull Distributions
4.7: Chi-Squared Distributions
4.8: Beta Distributions
This page titled 4: Continuous Random Variables is shared under a not declared license and was authored, remixed, and/or curated by Kristin
Kuter.
4.1: Probability Density Functions (PDFs) and Cumulative Distribution Functions
(CDFs) for Continuous Random Variables
Probability Density Functions (PDFs)
Recall that continuous random variables have uncountably many possible values (think of intervals of real numbers). Just as for
discrete random variables, we can talk about probabilities for continuous random variables using density functions.
Definition 4.1.1
The probability density function (pdf), denoted f , of a continuous random variable X satisfies the following:
1. $f(x) \ge 0$, for all $x \in \mathbb{R}$
2. $f$ is piecewise continuous
3. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
4. $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$
The first three conditions in the definition state the properties necessary for a function to be a valid pdf for a continuous random
variable. The fourth condition tells us how to use a pdf to calculate probabilities for continuous random variables, which are given
by integrals, the continuous analog to sums.
Example 4.1.1
Let the random variable X denote the time a person waits for an elevator to arrive. Suppose the longest one would need to wait for
the elevator is 2 minutes, so that the possible values of X (in minutes) are given by the interval [0, 2]. A possible pdf for X is given
by
$$f(x) = \begin{cases} x, & \text{for } 0 \le x \le 1 \\ 2 - x, & \text{for } 1 < x \le 2 \\ 0, & \text{otherwise} \end{cases}$$
The graph of f is given below, and we verify that f satisfies the first three conditions in Definition 4.1.1:
1. From the graph, it is clear that $f(x) \ge 0$, for all $x \in \mathbb{R}$.
2. Since there are no holes, jumps, or asymptotes, we see that $f(x)$ is (piecewise) continuous.
3. Finally, we compute:
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{2} f(x)\,dx = \int_{0}^{1} x\,dx + \int_{1}^{2} (2-x)\,dx = \frac{1}{2} + \frac{1}{2} = 1.$$
Figure 1: Graph of pdf for X, f (x)
So, if we wish to calculate the probability that a person waits less than 30 seconds (or 0.5 minutes) for the elevator to arrive, then
we calculate the following probability using the pdf and the fourth property in Definition 4.1.1:
$$P(0 \le X \le 0.5) = \int_{0}^{0.5} f(x)\,dx = \int_{0}^{0.5} x\,dx = \frac{x^{2}}{2}\Big|_{0}^{0.5} = 0.125.$$
Note that, unlike discrete random variables, continuous random variables have zero point probabilities, i.e., the probability that a
continuous random variable equals a single value is always given by 0. Formally, this follows from properties of integrals:
$$P(X = a) = P(a \le X \le a) = \int_{a}^{a} f(x)\,dx = 0.$$
Informally, if we realize that probability for a continuous random variable is given by areas under pdf's, then, since there is no
area in a line, there is no probability assigned to a random variable taking on a single value. This does not mean that a continuous
random variable will never equal a single value, only that we do not assign any probability to single values for the random variable.
For this reason, we only talk about the probability of a continuous random variable taking a value in an INTERVAL, not at a point.
And whether or not the endpoints of the interval are included does not affect the probability. In fact, the following probabilities are
all equal:
$$P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b) = \int_{a}^{b} f(x)\,dx.$$
The cumulative distribution function (cdf) of a continuous random variable X is given by
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$
In other words, the cdf for a continuous random variable is found by integrating the pdf. By the Fundamental Theorem of
Calculus, the pdf of a continuous random variable can be found by differentiating the cdf:
$$f(x) = \frac{d}{dx}\left[F(x)\right].$$
This relationship between the pdf and cdf for a continuous random variable is incredibly useful.
Example 4.1.2
Continuing in the context of Example 4.1.1, we find the corresponding cdf. First, let's find the cdf at two possible values of X ,
x = 0.5 and x = 1.5 :
$$F(0.5) = \int_{-\infty}^{0.5} f(t)\,dt = \int_{0}^{0.5} t\,dt = \frac{t^{2}}{2}\Big|_{0}^{0.5} = 0.125$$
$$F(1.5) = \int_{-\infty}^{1.5} f(t)\,dt = \int_{0}^{1} t\,dt + \int_{1}^{1.5} (2-t)\,dt = \frac{t^{2}}{2}\Big|_{0}^{1} + \left(2t - \frac{t^{2}}{2}\right)\Big|_{1}^{1.5} = 0.5 + (1.875 - 1.5) = 0.875$$
Now we find F(x) more generally, working over the intervals where f(x) has different formulas:
for $x < 0$:  $F(x) = \int_{-\infty}^{x} 0\,dt = 0$
for $0 \le x \le 1$:  $F(x) = \int_{0}^{x} t\,dt = \frac{t^{2}}{2}\Big|_{0}^{x} = \frac{x^{2}}{2}$
for $1 < x \le 2$:  $F(x) = \int_{0}^{1} t\,dt + \int_{1}^{x} (2-t)\,dt = \frac{t^{2}}{2}\Big|_{0}^{1} + \left(2t - \frac{t^{2}}{2}\right)\Big|_{1}^{x} = 0.5 + \left(2x - \frac{x^{2}}{2}\right) - (2 - 0.5) = 2x - \frac{x^{2}}{2} - 1$
for $x > 2$:  $F(x) = 1$
Putting this altogether, we write F as a piecewise function and Figure 2 gives its graph:
$$F(x) = \begin{cases} 0, & \text{for } x < 0 \\ \dfrac{x^{2}}{2}, & \text{for } 0 \le x \le 1 \\ 2x - \dfrac{x^{2}}{2} - 1, & \text{for } 1 < x \le 2 \\ 1, & \text{for } x > 2 \end{cases}$$
Recall that the graph of the cdf for a discrete random variable is always a step function. Looking at Figure 2 above, we note that the
cdf for a continuous random variable is always a continuous function.
Percentiles of a Distribution
Definition 4.1.2
The $(100p)^{\text{th}}$ percentile ($0 \le p \le 1$) of a probability distribution with cdf F is the value $\pi_p$ such that
$$F(\pi_p) = P(X \le \pi_p) = p.$$
To find the percentile $\pi_p$ of a continuous random variable, which is a possible value of the random variable, we specify the cumulative probability p and solve the following equation for $\pi_p$:
$$\int_{-\infty}^{\pi_p} f(t)\,dt = p.$$
Special Cases: There are a few values of p for which the corresponding percentile has a special name.
Median or $50^{\text{th}}$ percentile: $\pi_{.5} = \tilde{\mu} = Q_2$, separates probability (area under pdf) into two equal halves
1st Quartile or $25^{\text{th}}$ percentile: $\pi_{.25} = Q_1$, separates the 1st quarter (25%) of probability (area) from the rest
3rd Quartile or $75^{\text{th}}$ percentile: $\pi_{.75} = Q_3$, separates the 3rd quarter (75%) of probability (area) from the rest
Example 4.1.3
Continuing in the context of Example 4.1.2, we find the median and quartiles.
median: find $\pi_{.5}$ such that $F(\pi_{.5}) = 0.5 \;\Rightarrow\; \pi_{.5} = 1$ (from the graph in Figure 1)
1st quartile: find $Q_1 = \pi_{.25}$ such that $F(\pi_{.25}) = 0.25$. For this, we use the formula and the graph of the cdf in Figure 2:
$$\frac{\pi_{.25}^{2}}{2} = 0.25 \;\Rightarrow\; Q_1 = \pi_{.25} = \sqrt{0.5} \approx 0.707$$
3rd quartile: find $Q_3 = \pi_{.75}$ such that $F(\pi_{.75}) = 0.75$. Again, use the graph of the cdf:
$$2\pi_{.75} - \frac{\pi_{.75}^{2}}{2} - 1 = 0.75 \;\Rightarrow\; (\text{using the Quadratic Formula})\quad Q_3 = \pi_{.75} = \frac{4 - \sqrt{2}}{2} \approx 1.293$$
This page titled 4.1: Probability Density Functions (PDFs) and Cumulative Distribution Functions (CDFs) for Continuous Random Variables is
shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
4.2: Expected Value and Variance of Continuous Random Variables
We now consider the expected value and variance for continuous random variables. Note that the interpretation of each is the same
as in the discrete setting, but we now have a different method of calculating them in the continuous setting.
Definition 4.2.1
If X is a continuous random variable with pdf f(x), then the expected value (or mean) of X is given by
$$\mu = E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx.$$
The formula for the expected value of a continuous random variable is the continuous analog of the expected value of a discrete
random variable, where instead of summing over all possible values we integrate (recall Sections 3.6 & 3.7).
For the variance of a continuous random variable, the definition is the same and we can still use the alternative formula given by
Theorem 3.7.1, only we now integrate to calculate the value:
$$\text{Var}(X) = E[X^{2}] - \mu^{2} = \left(\int_{-\infty}^{\infty} x^{2}\cdot f(x)\,dx\right) - \mu^{2}$$
Example 4.2.1
Consider again the context of Example 4.1.1, where we defined the continuous random variable X to denote the time a person
waits for an elevator to arrive. The pdf of X was given by
$$f(x) = \begin{cases} x, & \text{for } 0 \le x \le 1 \\ 2 - x, & \text{for } 1 < x \le 2 \\ 0, & \text{otherwise} \end{cases}$$
Applying Definition 4.2.1, we compute the expected value of X:
$$E[X] = \int_{0}^{1} x\cdot x\,dx + \int_{1}^{2} x\cdot(2-x)\,dx = \int_{0}^{1} x^{2}\,dx + \int_{1}^{2} (2x - x^{2})\,dx = \frac{1}{3} + \frac{2}{3} = 1.$$
Thus, we expect a person will wait 1 minute for the elevator on average. Figure 1 demonstrates the graphical representation of the
expected value as the center of mass of the pdf.
Figure 1: The red arrow represents the center of mass, or the expected value, of X.
Now we calculate the variance and standard deviation of X, by first finding the expected value of $X^{2}$.
$$E[X^{2}] = \int_{0}^{1} x^{2}\cdot x\,dx + \int_{1}^{2} x^{2}\cdot(2-x)\,dx = \int_{0}^{1} x^{3}\,dx + \int_{1}^{2} (2x^{2} - x^{3})\,dx = \frac{1}{4} + \frac{11}{12} = \frac{7}{6}.$$
Thus, we have
$$\text{Var}(X) = E[X^{2}] - \mu^{2} = \frac{7}{6} - 1 = \frac{1}{6} \quad\Rightarrow\quad SD(X) = \sqrt{\text{Var}(X)} = \frac{1}{\sqrt{6}} \approx 0.408.$$
This page titled 4.2: Expected Value and Variance of Continuous Random Variables is shared under a not declared license and was authored,
remixed, and/or curated by Kristin Kuter.
4.3: Uniform Distributions
Definition 4.3.1
A random variable X has a uniform distribution on interval [a, b], write X ∼ uniform[a, b] , if it has pdf given by
$$f(x) = \begin{cases} \dfrac{1}{b-a}, & \text{for } a \le x \le b \\ 0, & \text{otherwise} \end{cases}$$
The uniform distribution is also sometimes referred to as the box distribution, since the graph of its pdf looks like a box. See
Figure 1 below.
Exercise 4.3.1
Verify that the uniform pdf is a valid pdf, i.e., show that it satisfies the first three conditions of Definition 4.1.1.
Answer
1. In looking either at the formula in Definition 4.3.1 or the graph in Figure 1, we can see that the uniform pdf is always
non-negative, i.e., f (x) ≥ 0 , for all x ∈ R .
2. Given that the uniform pdf is a piecewise constant function, it is also piecewise continuous.
3. Finally, we need to verify that the area under the uniform pdf is equal to 1. This is quickly seen from the graph in Figure
1, since we calculate the area of a rectangle with width $(b-a)$ and height $1/(b-a)$. Thus, the area is
$$(b-a)\times\frac{1}{b-a} = 1.$$
A typical application of the uniform distribution is to model randomly generated numbers. In other words, it provides the
probability distribution for a random variable representing a randomly chosen number between numbers a and b .
The uniform distribution assigns equal probabilities to intervals of equal lengths, since its pdf is constant on the interval [a, b] where
it is non-zero. This is the continuous analog to equally likely outcomes in the discrete setting.
We close the section by finding the expected value of the uniform distribution.
Example 4.3.1
If X has a uniform distribution on the interval [a, b], then we apply Definition 4.2.1 and compute the expected value of X:
$$E[X] = \int_{a}^{b} x\cdot\frac{1}{b-a}\,dx = \frac{1}{b-a}\cdot\frac{b^{2}-a^{2}}{2} = \frac{1}{b-a}\cdot\frac{(b-a)(b+a)}{2} = \frac{a+b}{2}.$$
Thus, the expected value of the uniform[a, b] distribution is given by the average of the parameters a and b , or the midpoint of the
interval [a, b]. This is readily apparent when looking at a graph of the pdf in Figure 1 and remembering the interpretation of
expected value as the center of mass. Since the pdf is constant over [a, b], the center of mass is simply given by the midpoint.
This page titled 4.3: Uniform Distributions is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
4.4: Normal Distributions
Definition 4.4.1
A random variable X has a normal distribution, with parameters μ and σ, write X ∼ normal(μ, σ), if it has pdf given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^{2}/2\sigma^{2}}, \quad \text{for } x \in \mathbb{R}.$$
If a continuous random variable X has a normal distribution with parameters $\mu$ and $\sigma$, then $E[X] = \mu$ and $\text{Var}(X) = \sigma^{2}$.
These facts can be derived using Definition 4.2.1; however, the integral calculations require many tricks. Note that the normal case
is why the notation $\mu$ is often used for the expected value, and $\sigma^{2}$ is used for the variance. So, $\mu$ gives the center of the normal pdf,
and its graph is symmetric about $\mu$, while $\sigma$ determines how spread out the graph is. Figure 1 below shows the graph of two
different normal pdf's.
Example 4.4.1
Suppose $X_1 \sim \text{normal}(0, 2^{2})$ and $X_2 \sim \text{normal}(0, 3^{2})$. So, $X_1$ and $X_2$ are both normally distributed random variables with the
same mean, but $X_2$ has a larger standard deviation. Given our interpretation of standard deviation, this implies that the possible
values of $X_2$ are more "spread out" from the mean. This is easily seen by looking at the graphs of the pdf's corresponding to $X_1$ and $X_2$.
The normal distribution is arguably the most important probability distribution. It is used to model the distribution of population
characteristics such as weight, height, and IQ. The pdf is terribly tricky to work with; in fact, integrals involving the normal pdf
cannot be solved exactly, but rather require numerical methods to approximate. Because of this, there is no closed form for the
corresponding cdf of a normal distribution. Given the importance of the normal distribution though, many software programs have
built in normal probability calculators. There are also many useful properties of the normal distribution that make it easy to work
with. We state these properties without proof below. Note that we also include the connection to expected value and variance given
by the parameters.
For $X \sim \text{normal}(\mu, \sigma)$ and constants a, b: (1) the linear transformation $aX + b$ is also normally distributed, with parameters $a\mu + b$ and $|a|\sigma$; (2) the standardized random variable $\frac{X-\mu}{\sigma}$ has the standard normal distribution, normal(0, 1).
The first property says that any linear transformation of a normally distributed random variable is also normally distributed. The
second property is a special case of the first, since we can re-write the transformation on X as
$$\frac{X-\mu}{\sigma} = \left(\frac{1}{\sigma}\right)X - \frac{\mu}{\sigma}.$$
This transformation, subtracting the mean and dividing by the standard deviation, is referred to as standardizing X, since the
resulting random variable will always have the standard normal distribution with mean 0 and standard deviation 1. In this way,
standardizing a normal random variable has the effect of removing the units. Before the prevalence of calculators and computer
software capable of calculating normal probabilities, people would apply the standardizing transformation to the normal random
variable and use a table of probabilities for the standard normal distribution.
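Software performs the same standardization internally; here is a minimal sketch assuming Python with scipy (the IQ-like values mu = 100 and sigma = 15 are made up for illustration):

```python
# P(X <= x) for X ~ normal(mu, sigma) equals the standard normal
# cdf evaluated at the z-score z = (x - mu)/sigma.
from scipy.stats import norm

mu, sigma, x = 100, 15, 120
z = (x - mu) / sigma
print(norm.cdf(x, loc=mu, scale=sigma), norm.cdf(z))  # both ~0.9088
```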
This page titled 4.4: Normal Distributions is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
4.5: Exponential and Gamma Distributions
In this section, we introduce two families of continuous probability distributions that are commonly used.
Exponential Distributions
Definition 4.5.1
A random variable X has an exponential distribution with parameter λ > 0 , write X ∼ exponential(λ) , if X has pdf given by
$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{for } x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$
Example 4.5.1
A typical application of exponential distributions is to model waiting times or lifetimes. For example, each of the following gives
an application of an exponential distribution.
X = lifetime of a radioactive particle
X = how long you have to wait for an accident to occur at a given intersection
X = length of interval between consecutive occurrences of Poisson distributed events
The parameter $\lambda$ is referred to as the rate parameter; it represents how quickly events occur. For example, in the first case above,
where X denotes the lifetime of a radioactive particle, $\lambda$ would give the rate at which such particles decay.
1. The cdf of X is given by
$$F(x) = \begin{cases} 0, & \text{for } x < 0 \\ 1 - e^{-\lambda x}, & \text{for } x \ge 0 \end{cases}$$
2. For any $0 < p < 1$, the $(100p)^{\text{th}}$ percentile is $\pi_p = \dfrac{-\ln(1-p)}{\lambda}$.
3. The mean of X is $E[X] = \dfrac{1}{\lambda}$.
4. The variance of X is $\text{Var}(X) = \dfrac{1}{\lambda^{2}}$.
5. The mgf of X is
$$M_X(t) = \frac{1}{1 - (t/\lambda)}, \quad \text{for } t < \lambda.$$
6. X satisfies the Memoryless Property, i.e., $P(X > t+s \mid X > s) = P(X > t)$, for any $t, s \ge 0$.
Partial Proof
We prove Properties #1 & #3; the others are left as an exercise.
For the first property, we consider two cases based on the value of x. First, if $x < 0$, then the pdf is constant and equal to 0,
which gives the following for the cdf:
$$F(x) = \int_{-\infty}^{x} 0\,dt = 0.$$
Second, if $x \ge 0$, then
$$F(x) = \int_{0}^{x} \lambda e^{-\lambda t}\,dt = -e^{-\lambda t}\Big|_{0}^{x} = 1 - e^{-\lambda x}.$$
For the third property, we use Definition 4.2.1 to calculate the expected value of a continuous random variable, integrating by parts:
$$E[X] = \int_{-\infty}^{\infty} x\cdot f(x)\,dx = \int_{0}^{\infty} x\cdot\lambda e^{-\lambda x}\,dx = -x e^{-\lambda x}\Big|_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda x}\,dx = 0 + \frac{-e^{-\lambda x}}{\lambda}\Big|_{0}^{\infty} = \frac{1}{\lambda}.$$
In words, the Memoryless Property of exponential distributions states that, given that you have already waited more than s units of
time (X > s) , the conditional probability that you will have to wait t more (X > t + s ) is equal to the unconditional probability
you just have to wait more than t units of time. For example, suppose you are waiting for the bus and the amount of time you have
to wait is exponentially distributed. If you have already been waiting 5 minutes at the bus stop, the probability that you have to wait
more than 4 additional minutes (so more than 9 minutes total) is equal to the probability that you only had to wait more than 4 minutes once
arriving at the bus stop. In calculating the conditional probability, the exponential distribution "forgets" about the condition or the
time already spent waiting and you can just calculate the unconditional probability that you have to wait longer. Note that we saw
earlier that geometric distributions also have the Memoryless Property.
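The bus example is easy to confirm numerically; a minimal sketch assuming Python with scipy (the rate 0.2 per minute is made up for illustration):

```python
# Memoryless Property check: P(X > 9 | X > 5) = P(X > 4).
from scipy.stats import expon

lam = 0.2
wait = expon(scale=1/lam)                 # scipy uses scale = 1/lambda
p_conditional = wait.sf(9) / wait.sf(5)   # sf(x) = P(X > x)
print(p_conditional, wait.sf(4))          # the two values agree
```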
Gamma Distributions
Definition 4.5.2
A random variable X has a gamma distribution with parameters α, λ > 0 , write X ∼ gamma(α, λ), if X has pdf given by
$$f(x) = \begin{cases} \dfrac{\lambda^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\lambda x}, & \text{for } x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$
where $\Gamma(\alpha)$ is a function (referred to as the gamma function) given by the following integral:
$$\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha-1}e^{-t}\,dt.$$
Figure 2: Graph of pdf's for various gamma distributions. On the left, for the purple pdf α = 0.5 and for the green pdf α = 1.5 . On
the right, for the blue pdf α = 4 and for the orange pdf α = 8 . For all pdf's, λ = 5 .
Note that the gamma function, $\Gamma(\alpha)$, ensures that the gamma pdf is valid, i.e., that it integrates to 1, which you are asked to show
in the following exercise. The value of $\Gamma(\alpha)$ depends on the value of the parameter $\alpha$, but for a given value of $\alpha$ it is just a number,
i.e., it is a constant value in the gamma pdf, given specific parameter values. In this case, $\Gamma(\alpha)$ is referred to as a scaling constant,
since it "scales" the rest of the pdf, $\lambda^{\alpha}x^{\alpha-1}e^{-\lambda x}$, which is referred to as the kernel of the distribution, so that the result integrates to 1.
Exercise 4.5.1
Show: $\displaystyle\int_{0}^{\infty} \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\lambda x}\,dx = 1$
Answer
In the integral, we can make the substitution $u = \lambda x \rightarrow du = \lambda\,dx$. Therefore, we have
$$\int_{0}^{\infty} \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\lambda x}\,dx = \frac{1}{\Gamma(\alpha)}\int_{0}^{\infty} (\lambda x)^{\alpha-1}e^{-\lambda x}\,\lambda\,dx = \frac{1}{\Gamma(\alpha)}\int_{0}^{\infty} u^{\alpha-1}e^{-u}\,du = \frac{1}{\Gamma(\alpha)}\,\Gamma(\alpha) = 1.$$
2. The mean of X is $E[X] = \dfrac{\alpha}{\lambda}$.
3. The variance of X is $\text{Var}(X) = \dfrac{\alpha}{\lambda^{2}}$.
The parameter α is referred to as the shape parameter, and λ is the rate parameter. Varying the value of α changes the shape
of the pdf, as is seen in Figure 2 above, whereas varying the value of λ corresponds to changing the units (e.g., from inches to
centimeters) and does not alter the shape of the pdf.
A closed form does not exist for the cdf of a gamma distribution, so computer software must be used to calculate gamma
probabilities. Here is a link to a gamma calculator online. (Note that different notation is used on this online calculator; namely,
$\lambda$ is referred to as $\beta$ instead.)
Example 4.5.2
A typical application of gamma distributions is to model the time it takes for a given number of events to occur. For example, each
of the following gives an application of a gamma distribution.
X = lifetime of 5 radioactive particles
X = how long you have to wait for 3 accidents to occur at a given intersection
In these examples, the parameter λ represents the rate at which the event occurs, and the parameter α is the number of events
desired. So, in the first example, α = 5 and λ represents the rate at which particles decay.
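A minimal sketch for computing gamma probabilities, assuming Python with scipy (the decay rate 0.1 is made up; note scipy also parameterizes by scale = 1/λ):

```python
# Gamma probabilities for the radioactive-particle application above.
from scipy.stats import gamma

alpha, lam = 5, 0.1
lifetime = gamma(a=alpha, scale=1/lam)
print(lifetime.mean(), lifetime.var())   # alpha/lambda = 50, alpha/lambda^2 = 500
print(lifetime.cdf(60))                  # P(X <= 60)
```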
This page titled 4.5: Exponential and Gamma Distributions is shared under a not declared license and was authored, remixed, and/or curated by
Kristin Kuter.
4.6: Weibull Distributions
In this section, we introduce the Weibull distributions, which are very useful in the field of actuarial science.
Weibull Distributions
Definition 4.6.1
A random variable X has a Weibull distribution with parameters α, β > 0 , write X ∼ Weibull(α, β), if X has pdf given by
$$f(x) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\,x^{\alpha-1}e^{-(x/\beta)^{\alpha}}, & \text{for } x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$
Example 4.6.1
A typical application of Weibull distributions is to model lifetimes that are not “memoryless”. For example, each of the following
gives an application of the Weibull distribution.
modeling the lifetime of a car battery
modeling the probability that someone survives past the age of 80 years old
The parameter α is referred to as the shape parameter, and β is the scale parameter. When α = 1 , the Weibull distribution is an
exponential distribution with λ = 1/β , so the exponential distribution is a special case of both the Weibull distributions and the
gamma distributions. We can see the similarities between the Weibull and exponential distributions more readily when comparing
the cdf's of each. The cdf of the Weibull distribution is given below, with proof, along with other important properties, stated
without proof.
1. The cdf of X is given by
$$F(x) = \begin{cases} 0, & \text{for } x < 0 \\ 1 - e^{-(x/\beta)^{\alpha}}, & \text{for } x \ge 0 \end{cases}$$
2. For any $0 < p < 1$, the $(100p)^{\text{th}}$ percentile is $\pi_p = \beta\left(-\ln(1-p)\right)^{1/\alpha}$.
3. The mean of X is $E[X] = \beta\,\Gamma\!\left(1 + \frac{1}{\alpha}\right)$.
4. The variance of X is $\text{Var}(X) = \beta^{2}\left[\Gamma\!\left(1 + \frac{2}{\alpha}\right) - \Gamma\!\left(1 + \frac{1}{\alpha}\right)^{2}\right]$.
Partial Proof
We prove Property #1, but leave #2 as an exercise. Properties #3 and #4 are rather tricky to prove, so we state them without
proof.
For the first property, we consider two cases based on the value of x. First, if $x < 0$, then the pdf is constant and equal to 0,
which gives the following for the cdf:
$$F(x) = \int_{-\infty}^{x} 0\,dt = 0.$$
Second, if $x \ge 0$, then the pdf is $\frac{\alpha}{\beta^{\alpha}}x^{\alpha-1}e^{-(x/\beta)^{\alpha}}$, and the cdf is given by the following integral, which is solved by making the substitution $u = \left(\frac{t}{\beta}\right)^{\alpha}$:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt = \int_{0}^{x} \frac{\alpha}{\beta^{\alpha}}\,t^{\alpha-1}e^{-(t/\beta)^{\alpha}}\,dt = \int_{0}^{(x/\beta)^{\alpha}} e^{-u}\,du = -e^{-u}\Big|_{0}^{(x/\beta)^{\alpha}} = -e^{-(x/\beta)^{\alpha}} - \left(-e^{0}\right) = 1 - e^{-(x/\beta)^{\alpha}}.$$
4.6: Weibull Distributions is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
4.7: Chi-Squared Distributions
In this section, we introduce the chi-squared distributions, which are very useful in statistics.
Chi-Squared Distributions
Definition 4.7.1
A random variable X has a chi-squared distribution with k degrees of freedom, where k is a positive integer, write
$X \sim \chi^{2}(k)$, if X has pdf given by
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(k/2)\,2^{k/2}}\,x^{k/2-1}e^{-x/2}, & \text{for } x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$
Note that the chi-squared distribution is a special case of the gamma distribution with $\alpha = \frac{k}{2}$ and $\lambda = \frac{1}{2}$, which can be used to establish the following properties of the chi-squared distribution: the mean of X is $E[X] = \frac{\alpha}{\lambda} = k$, and the variance of X is $\text{Var}(X) = \frac{\alpha}{\lambda^{2}} = 2k$.
Note that there is no closed form equation for the cdf of a chi-squared distribution in general. But most graphing calculators have a
built-in function to compute chi-squared probabilities. On the TI-84 or 89, this function is named "$\chi^{2}$cdf".
The main applications of the chi-squared distributions relate to their importance in the field of statistics, which result from the
following relationships between the chi-squared distributions and the normal distributions.
1. If $Z \sim \text{normal}(0, 1)$, then $Z^{2}$ has a chi-squared distribution with $k = 1$ degree of freedom.
2. If $X_1, \ldots, X_n$ is a collection of independent, chi-squared random variables each with 1 degree of freedom, i.e.,
$X_i \sim \chi^{2}(1)$ for each $i = 1, \ldots, n$, then the sum $X_1 + \cdots + X_n$ is also chi-squared with $k = n$ degrees of freedom.
4.7: Chi-Squared Distributions is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
4.8: Beta Distributions
In this section, we introduce beta distributions, which are very useful in a branch of statistics known as Bayesian Statistics.
Beta Distributions
Definition 4.8.1
A random variable X has a beta distribution with parameters α, β > 0 , write X ∼ beta(α, β) , if X has pdf given by
$$f(x) = \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, & \text{for } 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \tag{4.8.1}$$
In the formula for the pdf of the beta distribution given in Equation 4.8.1, note that the term with the gamma functions, i.e.,
$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$, is the scaling constant so that the pdf is valid, i.e., integrates to 1. This is similar to the role the gamma function plays
for the gamma distribution introduced in Section 4.5. Ignoring the scaling constant for the beta distribution, we can focus on what
is referred to as the kernel of the distribution, which is given by
$$x^{\alpha-1}(1-x)^{\beta-1}, \quad \text{for } x \in [0, 1]. \tag{4.8.2}$$
The parameters, α and β, are both shape parameters for the beta distribution, varying their values changes the shape of the pdf.
As is the case for the normal, gamma, and chi-squared distributions, there is no closed form equation for the cdf of the beta
distribution and computer software must be used to calculate beta probabilities. Here is a link to a beta calculator online.
Beta distributions are useful for modeling random variables that only take values on the unit interval [0, 1]. In fact, if both
parameters are equal to one, i.e., α = β = 1 , the corresponding beta distribution is equal to the uniform[0, 1] distribution. In
statistics, beta distributions are used to model proportions of random samples taken from a population that have a certain
characteristic of interest. For example, the proportion of surface area in a randomly selected urban neighborhood that is green
space, i.e., parks or garden area.
We state the following important properties of beta distributions without proof: the mean of X is $E[X] = \frac{\alpha}{\alpha+\beta}$, and the variance of X is $\text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}$.
4.8: Beta Distributions is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
CHAPTER OVERVIEW
5: Probability Distributions for Combinations of Random Variables
5.1: Joint Distributions of Discrete Random Variables
5.2: Joint Distributions of Continuous Random Variables
5.3: Conditional Probability Distributions
5.4: Finding Distributions of Functions of Continuous Random Variables
5.5: Sample Mean
This page titled 5: Probability Distributions for Combinations of Random Variables is shared under a not declared license and was authored,
remixed, and/or curated by Kristin Kuter.
5.1: Joint Distributions of Discrete Random Variables
In this chapter we consider two or more random variables defined on the same sample space and discuss how to model the
probability distribution of the random variables jointly. We will begin with the discrete case by looking at the joint probability mass
function for two discrete random variables. In the following section, we will consider continuous random variables.
Definition 5.1.1
If discrete random variables X and Y are defined on the same sample space S , then their joint probability mass function (joint
pmf) is given by
$$p(x, y) = P(X = x \text{ and } Y = y),$$
where (x, y) is a pair of possible values for the pair of random variables (X, Y), and p(x, y) satisfies the following conditions:
1. $0 \le p(x, y) \le 1$
2. $\displaystyle\mathop{\sum\sum}_{(x,y)} p(x, y) = 1$
3. $\displaystyle P\big((X, Y) \in A\big) = \mathop{\sum\sum}_{(x,y)\in A} p(x, y)$, for any set A of possible pairs
Note that conditions #1 and #2 in Definition 5.1.1 are required for p(x, y) to be a valid joint pmf, while the third condition tells us
how to use the joint pmf to find probabilities for the pair of random variables (X, Y ).
In the discrete case, we can obtain the joint cumulative distribution function (joint cdf) of X and Y by summing the joint pmf:
$$F(x, y) = P(X \le x \text{ and } Y \le y) = \sum_{x_i \le x}\sum_{y_j \le y} p(x_i, y_j),$$
where $x_i$ denotes the possible values of X and $y_j$ denotes the possible values of Y. From the joint pmf, we can also obtain the individual pmf's of X and Y, referred to as the marginal pmf's.
Definition 5.1.2
Suppose that discrete random variables X and Y have joint pmf p(x, y). Let $x_1, x_2, \ldots, x_i, \ldots$ denote the possible values of X,
and let $y_1, y_2, \ldots, y_j, \ldots$ denote the possible values of Y. The marginal probability mass functions (marginal pmf's) of X
and Y are respectively given by the following:
$$p_X(x) = \sum_{j} p(x, y_j) \quad (\text{fix a value of } X \text{ and sum over the possible values of } Y)$$
$$p_Y(y) = \sum_{i} p(x_i, y) \quad (\text{fix a value of } Y \text{ and sum over the possible values of } X)$$
Example 5.1.1
Consider again the probability experiment of Example 3.3.2, where we toss a fair coin three times and record the sequence of heads
(h) and tails (t) . Again, we let random variable X denote the number of heads obtained. We also let random variable Y denote the
winnings earned in a single play of a game with the following rules, based on the outcomes of the probability experiment (this is
the same as Example 3.6.2):
player wins $1 if first h occurs on the first toss
player wins $2 if first h occurs on the second toss
player wins $3 if first h occurs on the third toss
player loses $1 if no h occur
Note that the possible values of X are x = 0, 1, 2, 3 , and the possible values of Y are y = −1, 1, 2, 3. We represent the joint
pmf using a table:
Table 1: joint pmf p(x, y) of X and Y
p(x, y)    X = 0    X = 1    X = 2    X = 3
Y = -1     1/8      0        0        0
Y = 1      0        1/8      2/8      1/8
Y = 2      0        1/8      1/8      0
Y = 3      0        1/8      0        0
The values in Table 1 give the values of p(x, y). For example, consider p(0, −1):
$$p(0, -1) = P(X = 0 \text{ and } Y = -1) = P(ttt) = \frac{1}{8}.$$
Since the outcomes are equally likely, the values of p(x, y) are found by counting the number of outcomes in the sample space
S that result in the specified values of the random variables, and then dividing by 8 , the total number of outcomes in S . The sample
space is $S = \{hhh,\, hht,\, hth,\, htt,\, thh,\, tht,\, tth,\, ttt\}$.
Given the joint pmf, we can now find the marginal pmf's. Note that the marginal pmf for X is found by computing sums of the
columns in Table 1, and the marginal pmf for Y corresponds to the row sums. (Note that we found the pmf for X in Example
3.3.2 as well; it is a binomial random variable. We also found the pmf for Y in Example 3.6.2.)
Table 2: marginal pmf's of X and Y
x    p_X(x)        y    p_Y(y)
0    1/8           -1   1/8
1    3/8           1    1/2
2    3/8           2    1/4
3    1/8           3    1/8
Finally, we can find the joint cdf for X and Y by summing over values of the joint frequency function. For example, consider
F(1, 1):
$$F(1, 1) = P(X \le 1 \text{ and } Y \le 1) = \sum_{x \le 1}\sum_{y \le 1} p(x, y) = p(0, -1) + p(0, 1) + p(1, -1) + p(1, 1) = \frac{1}{4}.$$
Table 3: joint cdf of X and Y
F(x, y)    X = 0    X = 1    X = 2    X = 3
Y = -1     1/8      1/8      1/8      1/8
Y = 1      1/8      1/4      1/2      5/8
Y = 2      1/8      3/8      3/4      7/8
Y = 3      1/8      1/2      7/8      1
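Tables 2 and 3 can be recomputed from Table 1 in a few lines; a minimal sketch assuming Python with numpy:

```python
# Marginals are row/column sums of the joint pmf; the joint cdf is a
# double cumulative sum over the grid of values.
import numpy as np

# rows: y = -1, 1, 2, 3; columns: x = 0, 1, 2, 3
joint = np.array([[1, 0, 0, 0],
                  [0, 1, 2, 1],
                  [0, 1, 1, 0],
                  [0, 1, 0, 0]]) / 8

print(joint.sum(axis=0))                   # marginal pmf of X: 1/8, 3/8, 3/8, 1/8
print(joint.sum(axis=1))                   # marginal pmf of Y: 1/8, 1/2, 1/4, 1/8
print(np.cumsum(np.cumsum(joint, 0), 1))   # joint cdf F(x, y), as in Table 3
```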
Theorem 5.1.1
Suppose that X and Y are jointly distributed discrete random variables with joint pmf p(x, y).
If g(X, Y) is a function of these two random variables, then its expected value is given by the following:
$$E[g(X, Y)] = \mathop{\sum\sum}_{(x,y)} g(x, y)\cdot p(x, y).$$
Example 5.1.2
Consider again the discrete random variables we defined in Example 5.1.1 with joint pmf given in Table 1. We will find the
expected value of three different functions applied to (X, Y ).
1. First, we define g(x, y) = xy, and compute the expected value of XY :
$$E[XY] = \mathop{\sum\sum}_{(x,y)} xy\cdot p(x, y) = (0)(-1)\left(\frac{1}{8}\right) + (1)(1)\left(\frac{1}{8}\right) + (2)(1)\left(\frac{2}{8}\right) + (3)(1)\left(\frac{1}{8}\right) + (1)(2)\left(\frac{1}{8}\right) + (2)(2)\left(\frac{1}{8}\right) + (1)(3)\left(\frac{1}{8}\right) = \frac{17}{8} = 2.125$$
2. Next, we define $g(x, y) = x$, and compute the expected value of X:
$$E[X] = \mathop{\sum\sum}_{(x,y)} x\cdot p(x, y) = (0)\left(\frac{1}{8}\right) + (1)\left(\frac{1}{8}\right) + (2)\left(\frac{2}{8}\right) + (3)\left(\frac{1}{8}\right) + (1)\left(\frac{1}{8}\right) + (2)\left(\frac{1}{8}\right) + (1)\left(\frac{1}{8}\right) = \frac{12}{8} = 1.5$$
Recall that $X \sim \text{binomial}(n = 3, p = 0.5)$, and that the expected value of a binomial random variable is given by np. Thus,
we can verify the expected value of X calculated above with Theorem 5.1.1 by using this fact for binomial distributions:
$E[X] = np = 3(0.5) = 1.5$.
3. Finally, we define $g(x, y) = y$, and compute the expected value of Y:
$$E[Y] = \mathop{\sum\sum}_{(x,y)} y\cdot p(x, y) = (-1)\left(\frac{1}{8}\right) + (1)\left(\frac{1}{8}\right) + (1)\left(\frac{2}{8}\right) + (1)\left(\frac{1}{8}\right) + (2)\left(\frac{1}{8}\right) + (2)\left(\frac{1}{8}\right) + (3)\left(\frac{1}{8}\right) = \frac{10}{8} = 1.25$$
Again, we can verify this result by reviewing the calculations done in Example 3.6.2.
Definition 5.1.3
Discrete random variables $X_1, X_2, \ldots, X_n$ are independent if the joint pmf factors into a product of the marginal pmf's:
$$p(x_1, x_2, \ldots, x_n) = p_{X_1}(x_1)\cdot p_{X_2}(x_2)\cdots p_{X_n}(x_n). \tag{5.1.1}$$
It is equivalent to check that this condition holds for the cumulative distribution functions.
Recall the definition of independent events (Definition 2.3.2): A and B are independent events if P (A ∩ B) = P (A) P (B) . This
is the basis for the definition of independent random variables because we can write the pmf's in Equation 5.1.1 in terms of events
as follows:
$$p(x, y) = P(X = x \text{ and } Y = y) = P(X = x)\,P(Y = y) = p_X(x)\,p_Y(y).$$
In the above, we use the idea that if X and Y are independent, then the event that X takes on a given value x is independent of the
event that Y takes the value y .
Example 5.1.3
Consider yet again the discrete random variables defined in Example 5.1.1. According to the definition, X and Y are independent
if
$$p(x, y) = p_X(x)\cdot p_Y(y),$$
for all pairs (x, y). Recall that the joint pmf for (X, Y) is given in Table 1 and that the marginal pmf's for X and Y are given in
Table 2. Note that, for $(x, y) = (0, -1)$, we have the following:
$$p(0, -1) = \frac{1}{8}, \qquad p_X(0) = \frac{1}{8}, \qquad p_Y(-1) = \frac{1}{8} \quad\Rightarrow\quad p(0, -1) = \frac{1}{8} \ne \frac{1}{64} = p_X(0)\cdot p_Y(-1).$$
Thus, X and Y are not independent, or in other words, X and Y are dependent. This should make sense given the definition of X
and Y . The winnings earned depend on the number of heads obtained. So the probabilities assigned to the values of Y will be
affected by the values of X.
We also have the following very useful theorem about the expected value of a product of independent random variables, which is
simply given by the product of the expected values for the individual random variables.
Theorem 5.1.2
If X and Y are independent random variables, then E[XY ] = E[X] E[Y ].
Proof
Assume X and Y are independent random variables. If we let p(x, y) denote the joint pmf of (X, Y), then, by Definition
5.1.3, $p(x, y) = p_X(x)\,p_Y(y)$, for all pairs (x, y). Using this fact and Theorem 5.1.1, we have
$$E[XY] = \mathop{\sum\sum}_{(x,y)} xy\cdot p(x, y) = \mathop{\sum\sum}_{(x,y)} xy\cdot p_X(x)\,p_Y(y) = \left(\sum_{x} x\,p_X(x)\right)\left(\sum_{y} y\,p_Y(y)\right) = E[X]\,E[Y].$$
Theorem 5.1.2 can be used to show that two random variables are not independent: if E[XY ] ≠ E[X] E[Y ], then X and
Y cannot be independent. However, beware using Theorem 5.1.2 to show that random variables are independent. Note that
Theorem 5.1.2 assumes that X and Y are independent and then the property about the expected value follows. The other direction
does not hold. In other words, if E[XY ] = E[X] E[Y ], then X and Y may or may not be independent.
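Continuing the numpy sketch from Table 1, the contrapositive of Theorem 5.1.2 gives a quick numerical test for dependence:

```python
# If E[XY] != E[X] E[Y], then X and Y cannot be independent.
import numpy as np

joint = np.array([[1, 0, 0, 0],
                  [0, 1, 2, 1],
                  [0, 1, 1, 0],
                  [0, 1, 0, 0]]) / 8
xs = np.array([0, 1, 2, 3])
ys = np.array([-1, 1, 2, 3])

E_XY = (np.outer(ys, xs) * joint).sum()   # 17/8 = 2.125
E_X = (xs * joint.sum(axis=0)).sum()      # 1.5
E_Y = (ys * joint.sum(axis=1)).sum()      # 1.25
print(E_XY, E_X * E_Y)   # 2.125 vs 1.875, so X and Y are dependent
```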
This page titled 5.1: Joint Distributions of Discrete Random Variables is shared under a not declared license and was authored, remixed, and/or
curated by Kristin Kuter.
5.2: Joint Distributions of Continuous Random Variables
Having considered the discrete case, we now look at joint distributions for continuous random variables.
Definition 5.2.1
If continuous random variables X and Y are defined on the same sample space S , then their joint probability density
function (joint pdf) is a piecewise continuous function, denoted f (x, y), that satisfies the following.
1. $f(x, y) \ge 0$, for all $(x, y) \in \mathbb{R}^{2}$
2. $\displaystyle\iint_{\mathbb{R}^{2}} f(x, y)\,dx\,dy = 1$
3. $\displaystyle P\big((X, Y) \in A\big) = \iint_{A} f(x, y)\,dx\,dy$, for any $A \subseteq \mathbb{R}^{2}$
The first two conditions in Definition 5.2.1 provide the requirements for a function to be a valid joint pdf. The third condition
indicates how to use a joint pdf to calculate probabilities. As an example of applying the third condition in Definition 5.2.1, the
joint cdf for continuous random variables X and Y is obtained by integrating the joint density function over a set A of the form
$$A = \{(x, y) \in \mathbb{R}^{2} \mid x \le a \text{ and } y \le b\},$$
where a and b are constants. Specifically, if A is given as above, then the joint cdf of X and Y, at the point (a, b), is given by
$$F(a, b) = P(X \le a \text{ and } Y \le b) = \int_{-\infty}^{b}\int_{-\infty}^{a} f(x, y)\,dx\,dy.$$
Note that probabilities for continuous jointly distributed random variables are now volumes instead of areas as in the case of a
single continuous random variable.
As in the discrete case, we can also obtain the individual, marginal pdf's of X and Y from the joint pdf.
Definition 5.2.2
Suppose that continuous random variables X and Y have joint density function f(x, y). The marginal pdf's of X and Y are
respectively given by the following:
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \quad (\text{fix a value of } X \text{, and integrate over all possible values of } Y)$$
$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx \quad (\text{fix a value of } Y \text{, and integrate over all possible values of } X)$$
Example 5.2.1
Suppose a radioactive particle is contained in a unit square. We can define random variables X and Y to denote the x- and y -
coordinates of the particle's location in the unit square, with the bottom left corner placed at the origin. Radioactive particles follow
completely random behavior, meaning that the particle's location should be uniformly distributed over the unit square. This implies
that the joint density function of X and Y should be constant over the unit square, which we can write as
$$f(x,y) = \begin{cases} c, & \text{if } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1 \\ 0, & \text{otherwise,} \end{cases}$$
where c is some unknown constant. We can find the value of c by using the second condition in Definition 5.2.1 and solving the following:
$$\iint_{\mathbb{R}^2} f(x,y)\, dx\, dy = 1 \;\Rightarrow\; \int_0^1\!\!\int_0^1 c\, dx\, dy = 1 \;\Rightarrow\; c\int_0^1\!\!\int_0^1 1\, dx\, dy = 1 \;\Rightarrow\; c = 1.$$
We can now use the joint pdf of X and Y to compute probabilities that the particle is in some specific region of the unit square. For example, consider the region
$$A = \{(x,y) \mid x \geq 0.5 \text{ and } y \leq 0.5\}.$$
If we want the probability that the particle's location is in the lower right corner of the unit square, i.e., the portion of the unit square that intersects with the region A, then we integrate the joint density function over that portion of A in the unit square, which gives the following probability:
$$P(X \geq 0.5 \text{ and } Y \leq 0.5) = \int_0^{0.5}\!\!\int_{0.5}^{1} 1\, dx\, dy = 0.25.$$
Lastly, we find the marginal pdf's of X and Y by integrating the joint pdf over the other variable:
$$f_X(x) = \int_0^1 1\, dy = 1, \quad \text{for } 0 \leq x \leq 1$$
$$f_Y(y) = \int_0^1 1\, dx = 1, \quad \text{for } 0 \leq y \leq 1$$
Note that both X and Y are individually uniform random variables, each over the interval [0, 1]. This should not be too surprising.
Given that the particle's location was uniformly distributed over the unit square, we should expect that the individual coordinates
would also be uniformly distributed over the unit intervals.
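As a quick sanity check (my own illustration, not part of the text), we can simulate the particle's location and confirm both the corner probability computed above and the flat marginals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Uniform location in the unit square: X and Y are independent Uniform(0, 1).
x = rng.random(n)
y = rng.random(n)

# Probability of the lower right corner region {x >= 0.5, y <= 0.5}.
print(np.mean((x >= 0.5) & (y <= 0.5)))  # approximately 0.25

# Empirical marginal of X is flat: roughly 10% of points in each decile.
print(np.histogram(x, bins=10, range=(0, 1))[0] / n)
```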
Example 5.2.2
At a particular gas station, gasoline is stocked in a bulk tank each week. Let random variable X denote the proportion of the tank's
capacity that is stocked in a given week, and let Y denote the proportion of the tank's capacity that is sold in the same week. Note
that the gas station cannot sell more than what was stocked in a given week, which implies that the value of Y cannot exceed the
value of X. A possible joint pdf of X and Y is given by
$$f(x,y) = \begin{cases} 3x, & \text{if } 0 \leq y \leq x \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
Note that this function is only nonzero over the triangular region given by {(x, y) | 0 ≤ y ≤ x ≤ 1} , which is graphed in Figure 2
below:
We find the joint cdf of X and Y at the point (x, y) = (1/2, 1/3):
$$F\!\left(\tfrac{1}{2}, \tfrac{1}{3}\right) = P\!\left(X \leq \tfrac{1}{2} \text{ and } Y \leq \tfrac{1}{3}\right) = \int_0^{1/3}\!\!\int_y^{0.5} 3x\, dx\, dy = \int_0^{1/3} \left(\frac{3}{2}x^2 \Big|_y^{0.5}\right) dy$$
$$= \int_0^{1/3} \left(\frac{3}{8} - \frac{3}{2}y^2\right) dy = \left(\frac{3}{8}y - \frac{1}{2}y^3\right)\Big|_0^{1/3} = \frac{1}{8} - \frac{1}{54} \approx 0.1065$$
Thus, there is a 10.65% chance that less than half the tank is stocked and less than a third of the tank is sold in a given week. Note
that in finding the above integral, we look at where the region given by {(x, y) | x ≤ 1/2, y ≤ 1/3} intersects the region over
which the joint pdf is nonzero, i.e., the region graphed in Figure 2. This tells us what the limits of integration are in the double
integral. Figure 3 below is a graph of the intersection made on desmos.com:
Figure 3: Intersection of {(x, y) | x ≤ 1/2, y ≤ 1/3} with the region over which joint pdf f (x, y) is nonzero.
Next, we find the probability that the amount of gas sold is less than half the amount that is stocked in a given week. In other
words, we find P (Y < 0.5X). In order to find this probability, we need to find the region over which we will integrate the joint
pdf. To do this, look for the intersection of the region given by {(x, y) | y < 0.5x} with the region in Figure 2, which is graphed in
Figure 4 below:
Figure 4: Intersection of {(x, y) | y < 0.5x} with the region over which joint pdf f (x, y) is nonzero.
The calculation is as follows:
$$P(Y < 0.5X) = \int_0^1\!\!\int_0^{0.5x} 3x\, dy\, dx = \int_0^1 \left(3xy\Big|_0^{0.5x}\right) dx = \int_0^1 \left(\frac{3}{2}x^2 - 0\right) dx = \frac{1}{2}x^3\Big|_0^1 = \frac{1}{2}$$
Thus, there is a 50% chance that the amount of gas sold in a given week is less than half of the gas stocked.
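Both probabilities in this example can be checked by simulation. The sketch below (my own illustration, not part of the text) samples X from its marginal pdf f_X(x) = 3x², derived in Example 5.3.5 below, by inverting the cdf F_X(x) = x³, and then samples Y given X = x uniformly on [0, x], which agrees with the joint pdf since f(x, y)/f_X(x) = 1/x:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample from the joint pdf f(x, y) = 3x on 0 <= y <= x <= 1:
# X from its marginal f_X(x) = 3x^2 via inverse cdf (F_X(x) = x^3),
# then Y | X = x uniform on [0, x].
n = 1_000_000
x = rng.random(n) ** (1 / 3)
y = rng.random(n) * x

# Check the two probabilities computed above.
print(np.mean((x <= 0.5) & (y <= 1 / 3)))  # approximately 0.1065
print(np.mean(y < 0.5 * x))                # approximately 0.5
```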
Theorem 5.2.1
Suppose that X and Y are jointly distributed continuous random variables with joint pdf f (x, y).
If g(X, Y) is a function of these two random variables, then its expected value is given by the following:
$$E[g(X,Y)] = \iint_{\mathbb{R}^2} g(x,y)\, f(x,y)\, dx\, dy.$$
Definition 5.2.3
Continuous random variables $X_1, X_2, \ldots, X_n$ are independent if the joint pdf factors into a product of the marginal pdf's:
$$f(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_n}(x_n), \quad \text{for all } (x_1, \ldots, x_n) \in \mathbb{R}^n.$$
It is equivalent to check that this condition holds for the cumulative distribution functions.
Example 5.2.3
Consider the continuous random variables defined in Example 5.2.1, where X and Y gave the location of a radioactive particle.
We will show that X and Y are independent and then verify that Theorem 5.1.2 also applies in the continuous setting.
Recall that we found the marginal pdf's to be the following:
fX (x) = 1, for 0 ≤ x ≤ 1
fY (y) = 1, for 0 ≤ y ≤ 1
Inside the unit square, the joint pdf satisfies f(x, y) = 1 = f_X(x) f_Y(y); and outside the unit square, at least one of the marginal pdf's will be 0, so
$$f(x,y) = 0 = f_X(x)\, f_Y(y).$$
We have thus shown that $f(x,y) = f_X(x)\, f_Y(y)$, for all $(x,y) \in \mathbb{R}^2$, and so by Definition 5.2.3, X and Y are independent.
Now let's look at the expected value of the product of X and Y . To compute this we apply Theorem 5.2.1:
$$E[XY] = \iint_{\mathbb{R}^2} xy\, f(x,y)\, dx\, dy = \int_0^1\!\!\int_0^1 xy \cdot 1\, dx\, dy = \int_0^1 \left(\frac{x^2}{2}y \Big|_0^1\right) dy = \int_0^1 \frac{y}{2}\, dy = \frac{1}{4}$$
Note that both X and Y are uniform on the interval [0, 1]. Therefore, their expected values are both 1/2, the midpoint of [0, 1]. Putting this all together, we have
$$E[XY] = \frac{1}{4} = \frac{1}{2}\cdot\frac{1}{2} = E[X]\, E[Y],$$
as guaranteed by Theorem 5.1.2 for independent random variables.
This page titled 5.2: Joint Distributions of Continuous Random Variables is shared under a not declared license and was authored, remixed, and/or
curated by Kristin Kuter.
5.3: Conditional Probability Distributions
In this section, we consider the probability distribution of one random variable given information about the value of another
random variable. As we will see in the formal definition, this kind of conditional distribution will involve the joint distribution of
the two random variables under consideration, which we introduced in the previous two sections. We begin with discrete random variables, and then consider the continuous case.
Recall that, for events A and B, the conditional probability of A given B is P(A|B) = P(A ∩ B)/P(B), provided P(B) > 0. We use this same concept for events to define conditional probabilities for random variables.
Definition 5.3.1
If X and Y are discrete random variables with joint pmf given by p(x, y), then the conditional probability mass function of X, given that Y = y, is denoted $p_{X|Y}(x|y)$ and given by
$$p_{X|Y}(x|y) = \frac{P(\{X=x\} \cap \{Y=y\})}{P(Y=y)} = \frac{p(x,y)}{p_Y(y)}, \quad \text{provided that } p_Y(y) > 0.$$
Note that if $p_Y(y) = 0$, then for that value of Y the conditional pmf of X does not exist.
Similarly, the conditional probability mass function of Y, given that X = x, is denoted $p_{Y|X}(y|x)$ and given by
$$p_{Y|X}(y|x) = \frac{P(\{Y=y\} \cap \{X=x\})}{P(X=x)} = \frac{p(x,y)}{p_X(x)}, \quad \text{provided that } p_X(x) > 0.$$
Example 5.3.1
For an example of conditional distributions for discrete random variables, we return to the context of Example 5.1.1, where the
underlying probability experiment was to flip a fair coin three times, and the random variable X denoted the number of heads
obtained and the random variable Y denoted the winnings when betting on the placement of the first heads obtained in the three
flips. We found the joint pmf for X and Y in Table 1 of Section 5.1, and the marginal pmf's are given in Table 2. We now find the
conditional distributions of X and Y .
First, to find the conditional distribution of X given a value of Y, we can think of fixing a row in Table 1 and dividing the values of the joint pmf in that row by the marginal pmf of Y for the corresponding value. For example, to find $p_{X|Y}(x|1)$, we divide each entry in the Y = 1 row by $p_Y(1) = 1/2$. Doing this for each row in Table 1 results in the conditional distributions of X given a value of Y, shown in the following table:

x | p_{X|Y}(x|−1) | p_{X|Y}(x|1) | p_{X|Y}(x|2) | p_{X|Y}(x|3)
0 | 1 | 0 | 0 | 0
1 | 0 | 1/4 | 1/2 | 1
2 | 0 | 1/2 | 1/2 | 0
3 | 0 | 1/4 | 0 | 0
Note that every column in the above table sums to 1. This is because each column is a different probability mass function for X given a value of Y. Specifically, if we look at the column for the conditional distribution of X given that Y = 1, $p_{X|Y}(x|1)$, this is the distribution of probability for the number of heads obtained, knowing that the winnings of the game are $1. Recall that we win $1 if
the first heads is on the first toss. In that case, we know the outcome of the probability experiment must be one of
{hhh, hht, hth, htt}. Working with this reduced sample space, we can see how the corresponding probabilities for the values of
X arise.
Similar to the process for the conditional pmf's of X given Y, we can find the conditional pmf's of Y given X. This time, though, we take a column in Table 1, giving the joint pmf for a fixed value of X, and divide by the marginal pmf of X for the corresponding value. The following table gives the results.

y | p_{Y|X}(y|0) | p_{Y|X}(y|1) | p_{Y|X}(y|2) | p_{Y|X}(y|3)
−1 | 1 | 0 | 0 | 0
1 | 0 | 1/3 | 2/3 | 1
2 | 0 | 1/3 | 1/3 | 0
3 | 0 | 1/3 | 0 | 0
Note again that every column in the above table sums to 1. A conditional pmf is a pmf, just found in a specific way.
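For a computational view (my own sketch, not part of the text), the row and column normalizations of Example 5.3.1 can be carried out directly on the joint pmf array; the joint values below are reconstructed from the coin-flip setup of Example 5.1.1:

```python
import numpy as np

# Joint pmf of (X, Y): rows index y in (-1, 1, 2, 3),
# columns index x in (0, 1, 2, 3).
joint = np.array([
    [1/8, 0,   0,   0  ],   # y = -1
    [0,   1/8, 2/8, 1/8],   # y =  1
    [0,   1/8, 1/8, 0  ],   # y =  2
    [0,   1/8, 0,   0  ],   # y =  3
])

p_x = joint.sum(axis=0)   # marginal pmf of X (column sums)
p_y = joint.sum(axis=1)   # marginal pmf of Y (row sums)

# Each row below is p_{X|Y}(.|y); each column of the second array is p_{Y|X}(.|x).
cond_x_given_y = joint / p_y[:, None]
cond_y_given_x = joint / p_x[None, :]

print(cond_x_given_y)   # rows sum to 1
print(cond_y_given_x)   # columns sum to 1
```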
Link to Video: Walkthrough of Example 5.3.1
Informally, we can think of a conditional probability distribution as a probability distribution for a sub-population. In other words,
a conditional probability distribution describes the probability that a randomly selected person from a sub-population has a given
characteristic of interest. In this context, the joint probability distribution is the probability that a randomly selected person from
the entire population has both characteristics of interest. The following example demonstrates these interpretations in a specific
context.
Example 5.3.2
Suppose we are interested in the relationship between an individual's hair and eye color. Based on a random sample of Saint Mary's
students, we have the following joint pmf, with marginal pmf's given in the margins:

Eye Color (Y) \ Hair Color (X) | blonde (1) | red (2) | brown (3) | black (4) | p_Y(y)
blue (1) | 0.12 | 0.05 | 0.12 | 0.01 | 0.30
green (2) | 0.12 | 0.07 | 0.09 | 0 | 0.28
brown (3) | 0.16 | 0.07 | 0.16 | 0.03 | 0.42
p_X(x) | 0.40 | 0.19 | 0.37 | 0.04 | 1.00
The probabilities in the last row and column give the marginal pmf's for X and Y, while the probabilities in the interior give the joint pmf for pairs (X, Y). For instance, p(3, 2) = 0.09 indicates that the joint probability that a randomly selected SMC student has brown hair (X = 3) and green eyes (Y = 2) is 9%, $p_X(3) = 0.37$ indicates that the marginal probability that a randomly selected SMC student has brown hair is 37%, and $p_Y(2) = 0.28$ indicates that the marginal probability that a randomly selected SMC student has green eyes is 28%.
For a fixed value of Y, we can compute the conditional pmf of X given that value. For example, for Y = 1 (blue eyes):
$$p_{X|Y}(2|1) = \frac{p(2,1)}{p_Y(1)} = \frac{0.05}{0.3} = \frac{1}{6} \approx 0.167$$
Note that we can write $p_{X|Y}(2|1) = P(X = 2 \mid Y = 1) = P(\text{red hair} \mid \text{blue eyes})$. Here we are finding the probability that an individual in the sub-population of individuals with blue eyes has red hair. Specifically, we found that 1/6 (or approximately 16.7%) of SMC students with blue eyes have red hair.
Now, let's reverse the order of X and Y, and find $p_{Y|X}(2|1)$:
$$p_{Y|X}(2|1) = \frac{p(1,2)}{p_X(1)} = \frac{0.12}{0.40} = 0.3$$
Now the sub-population is individuals with blonde hair, and we find the probability that an individual in this sub-population has green eyes. Specifically, we found that 30% of SMC students with blonde hair have green eyes.
Similarly, for a fixed x, we also have the following for the conditional pmf of Y, given X = x:
$$p_{Y|X}(y|x) = \frac{p(x,y)}{p_X(x)}, \quad \text{provided that } p_X(x) > 0.$$
A few general properties of conditional pmf's are worth noting:
1. A conditional pmf is a valid pmf: for each fixed y with $p_Y(y) > 0$, we have $p_{X|Y}(x|y) \geq 0$ and $\sum_x p_{X|Y}(x|y) = 1$.
2. In general, the conditional distribution of X given Y does not equal the conditional distribution of Y given X, i.e., $p_{X|Y}(x|y) \neq p_{Y|X}(y|x)$.
3. If X and Y are independent, discrete random variables, then the following are true:
$$p_{X|Y}(x|y) = p_X(x) \qquad p_{Y|X}(y|x) = p_Y(y)$$
In other words, if X and Y are independent, then knowing the value of one random variable does not affect the probability distribution of the other one.
Definition 5.3.2
For discrete random variables X and Y, the conditional expected value of X, given Y = y, is given by
$$\mu_{X|Y=y} = E[X|Y=y] = \sum_x x\, p_{X|Y}(x|y).$$
Similarly, we can define conditional variances. The conditional variance of X, given Y = y, is given by
$$\sigma^2_{X|Y=y} = \text{Var}(X|Y=y) = E\!\left[(X - \mu_{X|Y=y})^2 \,\middle|\, Y=y\right] = \sum_x (x - \mu_{X|Y=y})^2\, p_{X|Y}(x|y)$$
$$= E[X^2|Y=y] - \mu^2_{X|Y=y} = \left(\sum_x x^2\, p_{X|Y}(x|y)\right) - \mu^2_{X|Y=y}.$$
Example 5.3.3
Continuing in the context of Example 5.3.2, we calculate the conditional mean and variance of hair color (X), given that a
randomly selected student has brown eyes (Y = 3 ).
First, we derive the conditional pmf of X, given Y = 3, by taking the row for brown eyes in the joint pmf table and dividing each entry by the marginal pmf for Y at 3, i.e., $p_Y(3) = 0.42$. This gives the following:

x | p_{X|Y}(x|3)
1 | 0.16/0.42 = 8/21
2 | 0.07/0.42 = 1/6
3 | 0.16/0.42 = 8/21
4 | 0.03/0.42 = 1/14
Using the conditional probabilities in the table above, we calculate the following:
$$E[X|Y=3] = \sum_{x=1}^{4} x\, p_{X|Y}(x|3) = (1)\!\left(\frac{8}{21}\right) + (2)\!\left(\frac{1}{6}\right) + (3)\!\left(\frac{8}{21}\right) + (4)\!\left(\frac{1}{14}\right) = \frac{15}{7} \approx 2.14$$
$$E[X^2|Y=3] = \sum_{x=1}^{4} x^2\, p_{X|Y}(x|3) = (1^2)\!\left(\frac{8}{21}\right) + (2^2)\!\left(\frac{1}{6}\right) + (3^2)\!\left(\frac{8}{21}\right) + (4^2)\!\left(\frac{1}{14}\right) = \frac{118}{21}$$
$$\Rightarrow\; \text{Var}(X|Y=3) = E[X^2|Y=3] - (E[X|Y=3])^2 = \frac{118}{21} - \left(\frac{15}{7}\right)^2 = \frac{151}{147} \approx 1.03$$
Thus, the expected hair color of a student with brown eyes is red.
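The arithmetic above is easy to check exactly with Python's fractions module (my own sketch, not part of the text):

```python
from fractions import Fraction as F

# Joint pmf row for brown eyes (Y = 3) over hair colors x = 1, 2, 3, 4.
row = [F(16, 100), F(7, 100), F(16, 100), F(3, 100)]
p_y3 = sum(row)                                  # p_Y(3) = 0.42
cond = [p / p_y3 for p in row]                   # p_{X|Y}(x|3)

xs = [1, 2, 3, 4]
mean = sum(x * p for x, p in zip(xs, cond))      # E[X | Y = 3]
var = sum(x**2 * p for x, p in zip(xs, cond)) - mean**2

print(mean, float(mean))   # 15/7 ≈ 2.14
print(var, float(var))     # 151/147 ≈ 1.03
```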
Definition 5.3.3
If X and Y are continuous random variables with joint pdf given by f (x, y), then the conditional probability density function
(pdf) of X, given that Y = y, is denoted $f_{X|Y}(x|y)$ and given by
$$f_{X|Y}(x|y) = \frac{f(x,y)}{f_Y(y)}, \quad \text{provided that } f_Y(y) > 0.$$
Similarly, we can define the conditional pdf, expected value, and variance of Y , given X = x , by swapping the roles of X and
Y in the above.
A few general properties parallel the discrete case:
1. A conditional pdf is a valid pdf: for each fixed y with $f_Y(y) > 0$, $f_{X|Y}(x|y) \geq 0$ and $\int_{-\infty}^{\infty} f_{X|Y}(x|y)\, dx = 1$.
2. In general, the conditional distribution of X given Y does not equal the conditional distribution of Y given X, i.e., $f_{X|Y}(x|y) \neq f_{Y|X}(y|x)$.
3. If X and Y are independent, continuous random variables, then the following are true:
$$f_{X|Y}(x|y) = f_X(x) \qquad f_{Y|X}(y|x) = f_Y(y)$$
In other words, if X and Y are independent, then knowing the value of one random variable does not affect the probability distribution of the other one.
Example 5.3.4
We verify the third property of conditional pdf's for the radioactive particle example (Example 5.2.1). Recall that we had the following joint pdf and marginal pdf's for X and Y:
$$f(x,y) = 1, \quad \text{for } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1$$
$$f_X(x) = 1, \quad \text{for } 0 \leq x \leq 1$$
$$f_Y(y) = 1, \quad \text{for } 0 \leq y \leq 1$$
We showed in Example 5.2.3 that X and Y are independent. So, if we fix y ∈ [0, 1], the following shows that the conditional pdf of X, given Y = y, is equal to the marginal pdf of X, as stated in the third property of conditional pdf's above:
$$f_{X|Y}(x|y) = \frac{f(x,y)}{f_Y(y)} = \frac{1}{1} = 1 = f_X(x), \quad \text{for } 0 \leq x \leq 1.$$
Example 5.3.5
Continuing in the context of Example 5.2.2, where X gave the amount of gas stocked and Y gave the amount of gas sold at a given
gas station in a given week, we find the conditional pdf of Y given that X = 0.5 . In other words, we find the conditional
probability distribution for the amount of gas sold in a given week, when only half of the tank was stocked.
First, we find the marginal pdf for X:
$$f_X(x) = \int_{\mathbb{R}} f(x,y)\, dy = \int_0^x 3x\, dy = 3xy\Big|_0^x = 3x^2, \quad \text{for } 0 \leq x \leq 1.$$
Then, for $0 \leq y \leq 0.5$, the conditional pdf of Y given X = 0.5 is
$$f_{Y|X}(y|0.5) = \frac{f(0.5, y)}{f_X(0.5)} = \frac{3(0.5)}{3(0.5)^2} = \frac{1}{0.5} = 2.$$
Note that $f_{Y|X}(y|0.5)$ is the pdf for a uniform distribution on the interval [0, 0.5]. Thus, the conditional distribution of the amount of gas sold in a week, given that only half of the tank is stocked, is uniformly distributed between 0 and 0.5. Recognizing this, we can easily compute the conditional expected value of Y, given that X = 0.5, as the midpoint of the interval:
$$E[Y|X=0.5] = \frac{0.5}{2} = 0.25.$$
In other words, given that 50% of the tank is stocked, we expect that 25% will be sold.
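A simulation sketch (my own, not part of the text) confirms this conditional behavior: sampling from the joint pdf and keeping only draws with X near 0.5 yields a conditional sample of Y that looks uniform on [0, 0.5]:

```python
import numpy as np

rng = np.random.default_rng(3)

# Sample from the joint pdf f(x, y) = 3x on 0 <= y <= x <= 1
# (X via inverse cdf of f_X(x) = 3x^2, then Y | X = x uniform on [0, x]).
n = 2_000_000
x = rng.random(n) ** (1 / 3)
y = rng.random(n) * x

# Condition on X near 0.5 and inspect the conditional sample of Y.
sel = np.abs(x - 0.5) < 0.01
print(y[sel].mean())   # approximately 0.25
print(y[sel].max())    # close to 0.5 (conditional support is [0, x] with x ≈ 0.5)
```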
Link to Video: Walkthrough of Example 5.3.5
This page titled 5.3: Conditional Probability Distributions is shared under a not declared license and was authored, remixed, and/or curated by
Kristin Kuter.
5.4: Finding Distributions of Functions of Continuous Random Variables
Single Variable Case
In addition to considering the probability distributions of random variables simultaneously using joint distribution functions, there
is also occasion to consider the probability distribution of functions applied to random variables as the following example
demonstrates.
Example 5.4.1
Consider again the context of Example 5.2.2, where the random variable X represented the amount of gas stocked in a given week
at a specific gas station. In Example 5.3.5, we showed that the pdf of X is given by
$$f_X(x) = \begin{cases} 3x^2, & \text{for } 0 \leq x \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
Suppose that the storage tank at the gas station holds 10,000 gallons; then 10,000X gives the number of gallons stocked in a week. If the current price of gas is $3/gallon and there are fixed delivery costs of $2000, then the total cost to stock 10,000X gallons in a given week is given by the following:
$$3(10{,}000X) + 2000 \;\Rightarrow\; C = 30{,}000X + 2000. \tag{5.4.1}$$
The total cost C is itself a random variable, since it is a function of the random variable X, and we would like to find its probability distribution.
One approach to finding the probability distribution of a function of a random variable relies on the relationship between the pdf
and cdf for a continuous random variable:
$$\frac{d}{dx}\big[F(x)\big] = f(x) \qquad \text{(``derivative of cdf = pdf'')}$$
As we will see in the following examples, it is often easier to find the cdf of a function of a continuous random variable, and then
use the above relationship to derive the pdf.
Example 5.4.2
Continuing with Example 5.4.1, we find the pdf of the total cost, C . First, we derive the cdf for X. If we let 0 ≤ x ≤ 1 , i.e., select
a value of x where the pdf of X is nonzero, then we have
$$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t)\, dt = \int_0^x 3t^2\, dt = t^3\Big|_0^x = x^3.$$
For any x < 0 , the cdf of X is necessarily 0, since X cannot be negative (we cannot stock a negative proportion of the tank). And
for any x > 1 , the cdf of X is necessarily equal to 1, since the proportion of gas stocked will always be less than or equal to 100%
of the tank's capacity. Putting it all together, we have the cdf of X given as follows:
$$F_X(x) = \begin{cases} 0, & x < 0 \\ x^3, & 0 \leq x \leq 1 \\ 1, & 1 < x \end{cases} \tag{5.4.2}$$
We can now use the cdf of X to find the cdf of C . Let c denote a possible value for the random variable C . We relate the cdf of C
to the cdf of X by substituting the expression of C in terms of X given in Equation 5.4.1, and then solving for X, as follows:
$$F_C(c) = P(C \leq c) = P(30000X + 2000 \leq c) = P\!\left(X \leq \frac{c-2000}{30000}\right) = F_X\!\left(\frac{c-2000}{30000}\right)$$
If $0 \leq (c-2000)/30000 \leq 1$, i.e., $2000 \leq c \leq 32000$, then, using the formula for the cdf of X we derived in Equation 5.4.2, we find that the cdf of C is
$$F_C(c) = \left(\frac{c-2000}{30000}\right)^3, \quad \text{for } 2000 \leq c \leq 32000. \tag{5.4.3}$$
Note that 2000 ≤ c ≤ 32000 gives the possible values of C . C cannot be less than $2,000 because of the fixed delivery costs, and
C cannot be more than $32,000, which is the cost of stocking the entire tank.
We now find the pdf of C by taking the derivative of the cdf in Equation 5.4.3:
$$f_C(c) = \frac{d}{dc}\big[F_C(c)\big] = \frac{d}{dc}\left[\left(\frac{c-2000}{30000}\right)^3\right] = 3\left(\frac{c-2000}{30000}\right)^2 \cdot \frac{1}{30000}, \quad \text{for } 2000 \leq c \leq 32000.$$
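As a check on Equation 5.4.3 (my own sketch, not part of the text), we can simulate X by inverting its cdf and compare the empirical distribution of C with the derived formula:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sample X from f_X(x) = 3x^2 via inverse cdf: F_X(x) = x^3 => X = U**(1/3).
x = rng.random(1_000_000) ** (1 / 3)
c = 30_000 * x + 2_000          # total cost, Equation 5.4.1

# Compare the empirical cdf of C to Equation 5.4.3 at a test point.
test = 20_000.0
print(np.mean(c <= test))               # empirical P(C <= 20000)
print(((test - 2_000) / 30_000) ** 3)   # F_C(20000) = 0.6**3 = 0.216
```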
Example 5.4.2 demonstrates the general strategy for finding the probability distribution of a function of a random variable: we first find the cdf of the new random variable in terms of the cdf of the random variable it is a function of (assuming we know that cdf), and then we differentiate to find the pdf. Let's look at another example before formalizing the strategy.
Example 5.4.3
Let X be uniform on [0, 1]. We find the pdf of $Y = X^2$.
Recall that the pdf for a uniform[0, 1] random variable is $f_X(x) = 1$, for $0 \leq x \leq 1$, and that the cdf is
$$F_X(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \leq x \leq 1 \\ 1, & x > 1 \end{cases}$$
Since X is nonnegative, for $0 \leq y \leq 1$ we have
$$F_Y(y) = P(Y \leq y) = P(X^2 \leq y) = P(X \leq \sqrt{y}) = F_X(\sqrt{y}) = \sqrt{y},$$
and differentiating gives the pdf
$$f_Y(y) = \frac{d}{dy}\big[\sqrt{y}\big] = \frac{1}{2\sqrt{y}}, \quad \text{for } 0 \leq y \leq 1.$$
The following formalizes the strategy we took to find the cdf's in the previous examples.
Change-of-Variable Technique
Suppose that X is a continuous random variable with pdf f X (x) , which is nonzero on interval I .
Further suppose that g is a differentiable function that is strictly monotonic on I .
Then the pdf of Y = g(X) is given by
$$f_Y(y) = f_X\big(g^{-1}(y)\big) \times \left|\frac{d}{dy}\big[g^{-1}(y)\big]\right|, \quad \text{for } x \in I \text{ and } y = g(x). \tag{5.4.4}$$
Example 5.4.4
Recall that the random variable X denotes the amount of gas stocked. We now let Y denote the total cost of delivery for the gas stocked, i.e., Y = 30000X + 2000. We know that the pdf of X is given by
$$f_X(x) = 3x^2, \quad \text{for } 0 \leq x \leq 1,$$
so that I = [0, 1], using the notation of the Change-of-Variable technique. We also have that y = g(x) = 30000x + 2000, and note that g is increasing on the interval I. Thus, the Change-of-Variable technique can be applied, and so we find the inverse of g and its derivative:
$$x = g^{-1}(y) = \frac{y-2000}{30000} \quad\text{and}\quad \frac{d}{dy}\big[g^{-1}(y)\big] = \frac{d}{dy}\left[\frac{y-2000}{30000}\right] = \frac{1}{30000}.$$
Applying Equation 5.4.4 then recovers the pdf we found in Example 5.4.2:
$$f_Y(y) = f_X\big(g^{-1}(y)\big) \times \left|\frac{1}{30000}\right| = 3\left(\frac{y-2000}{30000}\right)^2 \cdot \frac{1}{30000}, \quad \text{for } 2000 \leq y \leq 32000.$$
Example 5.4.5
Let the random variable X have pdf given by $f_X(x) = 5x^4$, for $0 \leq x \leq 1$, i.e., I = [0, 1]. Also, let $g(x) = 1 - x^2$. To find the pdf for $Y = g(X) = 1 - X^2$, we first find $g^{-1}$:
$$y = g(x) = 1 - x^2 \;\Rightarrow\; y - 1 = -x^2 \;\Rightarrow\; x = g^{-1}(y) = \sqrt{1-y}.$$
Since g is strictly decreasing on I, the Change-of-Variable technique applies, and Equation 5.4.4 gives
$$f_Y(y) = f_X\big(g^{-1}(y)\big) \times \left|\frac{d}{dy}\big[g^{-1}(y)\big]\right| = 5\left(\sqrt{1-y}\right)^4 \times \left|-\frac{1}{2}(1-y)^{-1/2}\right| = \frac{5}{2}\left(\sqrt{1-y}\right)^3, \quad \text{for } 0 \leq y \leq 1.$$
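The same result can be checked numerically (my own sketch, not part of the text). Here we compare the empirical cdf of Y = 1 − X² against the cdf implied by the derived pdf, which integrates to $F_Y(y) = 1 - (1-y)^{5/2}$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Sample X from f_X(x) = 5x^4 via inverse cdf: F_X(x) = x^5 => X = U**(1/5).
x = rng.random(1_000_000) ** (1 / 5)
y = 1 - x**2

y0 = 0.3
print(np.mean(y <= y0))      # empirical cdf of Y at 0.3
print(1 - (1 - y0) ** 2.5)   # exact: F_Y(0.3) ≈ 0.590
```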
The Change-of-Variable technique requires that a monotonic function g is applied. However, if that is not the case, we can just
consider the monotonic pieces separately, as in the next example.
Example 5.4.6
Let X be uniform on [−1, 1], and let $Y = X^2$. Then the pdf of X is
$$f_X(x) = \frac{1}{2}, \quad \text{for } -1 \leq x \leq 1,$$
and the cdf of X is
$$F_X(x) = \begin{cases} 0, & \text{for } x < -1 \\[1mm] \dfrac{x+1}{2}, & \text{for } -1 \leq x \leq 1 \\[1mm] 1, & \text{for } x > 1 \end{cases}$$
The function $g(x) = x^2$ is not monotonic on [−1, 1], so we treat its two monotonic pieces by working with the cdf directly:
$$F_Y(y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) = \frac{\sqrt{y}+1}{2} - \frac{-\sqrt{y}+1}{2} = \sqrt{y}, \quad \text{if } 0 \leq \sqrt{y} \leq 1,$$
which gives
$$F_Y(y) = \begin{cases} 0, & \text{if } y < 0 \\ \sqrt{y}, & \text{if } 0 \leq y \leq 1 \\ 1, & \text{if } y > 1 \end{cases}$$
Differentiating each piece, we find the pdf of Y:
$$f_Y(y) = \frac{d}{dy}\big[F_Y(y)\big] = \begin{cases} \dfrac{d}{dy}[0] = 0, & \text{if } y < 0 \\[2mm] \dfrac{d}{dy}\big[\sqrt{y}\big] = \dfrac{1}{2\sqrt{y}}, & \text{if } 0 \leq y \leq 1 \\[2mm] \dfrac{d}{dy}[1] = 0, & \text{if } y > 1 \end{cases}$$
In other words,
$$f_Y(y) = \begin{cases} \dfrac{1}{2\sqrt{y}}, & \text{if } 0 \leq y \leq 1 \\[1mm] 0, & \text{otherwise} \end{cases}$$
Functions of normally distributed random variables are of particular importance in statistics. In the next example, we derive the
probability distribution of the square of a standard normal random variable.
Example 5.4.7
Let Z be a standard normal random variable, i.e., $Z \sim N(0,1)$. We find the pdf of $Y = Z^2$.
Let Φ denote the cdf of Z, i.e., $\Phi(z) = P(Z \leq z) = F_Z(z)$. We first find the cdf of $Y = Z^2$ in terms of Φ (recall that there is no closed form expression for Φ). For $y \geq 0$,
$$F_Y(y) = P(Y \leq y) = P(Z^2 \leq y) = P(-\sqrt{y} \leq Z \leq \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}).$$
Letting φ denote the pdf of Z and differentiating with the chain rule, we have
$$f_Y(y) = \frac{d}{dy}\big[F_Y(y)\big] = \frac{d}{dy}\big[\Phi(\sqrt{y}) - \Phi(-\sqrt{y})\big] = \varphi(\sqrt{y})\cdot\frac{1}{2\sqrt{y}} + \varphi(-\sqrt{y})\cdot\frac{1}{2\sqrt{y}}$$
$$= \frac{1}{\sqrt{2\pi}}e^{-(\sqrt{y})^2/2}\cdot\frac{1}{2\sqrt{y}} + \frac{1}{\sqrt{2\pi}}e^{-(-\sqrt{y})^2/2}\cdot\frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi}}e^{-y/2}\cdot\frac{1}{\sqrt{y}}.$$
In summary, if $Y = Z^2$, where $Z \sim N(0,1)$, then the pdf for Y is given by
$$f_Y(y) = \frac{y^{-1/2}}{\sqrt{2\pi}}\, e^{-y/2}, \quad \text{for } y \geq 0.$$
This is the pdf of a gamma distribution with α = 1/2 and λ = 1/2, also referred to as the chi-square distribution with one degree of freedom, denoted χ²(1). See below for a video walkthrough of this example.
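A one-line simulation (my own sketch, not part of the text) confirms the derivation: the empirical cdf of Z² at y₀ = 1 should match P(−1 ≤ Z ≤ 1) ≈ 0.6827:

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.standard_normal(1_000_000)

# P(Z**2 <= 1) = P(-1 <= Z <= 1), the familiar 68% of the normal curve.
print(np.mean(z**2 <= 1.0))   # approximately 0.6827
```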
Multivariate Case
Example 5.4.8
Suppose that X and Y have joint pdf given by
$$f(x,y) = e^{-x}, \quad \text{for } 0 \leq y \leq x < \infty.$$
The following figure shows the region (shaded in blue) over which the joint pdf of X and Y is nonzero.
Let's define the random variable W = g(X, Y ) = X − Y , and find the pdf of W .
First, we find the cdf of W ,
FW (w) = P (W ≤ w) = P (X − Y ≤ w),
for which we need to find where the region given by $\{(x,y) \mid x - y \leq w\}$ intersects the region over which the joint pdf f(x, y) is nonzero. Note that if w < 0, then $P(X - Y \leq w) = 0$, since there is no intersection with where the joint pdf f(x, y) is nonzero. If $w \geq 0$, then
$$P(X - Y \leq w) = \int_0^{\infty}\!\!\int_y^{y+w} e^{-x}\, dx\, dy = \int_0^{\infty} \left(-e^{-x}\Big|_y^{y+w}\right) dy = \int_0^{\infty} \left(-e^{-(y+w)} + e^{-y}\right) dy$$
$$= \int_0^{\infty} \left(-e^{-y}e^{-w} + e^{-y}\right) dy = \int_0^{\infty} e^{-y}\left(1 - e^{-w}\right) dy = \left(1 - e^{-w}\right)\left(-e^{-y}\right)\Big|_0^{\infty} = 1 - e^{-w}.$$
Thus, the cdf of W is $F_W(w) = 1 - e^{-w}$ for $w \geq 0$, and 0 for $w < 0$. Differentiating gives the pdf:
$$f_W(w) = \frac{d}{dw}\big[F_W(w)\big] = \begin{cases} 0, & \text{if } w < 0 \\[1mm] \dfrac{d}{dw}\big[1 - e^{-w}\big] = e^{-w}, & \text{if } w \geq 0 \end{cases}$$
In other words, W = X − Y has an exponential distribution with λ = 1.
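A simulation check (my own sketch, not part of the text): one convenient way to sample from this joint pdf uses the factorization $e^{-x} = e^{-y}\cdot e^{-(x-y)}$ on the support, i.e., Y is exponential(1) and X − Y given Y is exponential(1):

```python
import numpy as np

rng = np.random.default_rng(7)

# Sample (X, Y) with joint pdf e^{-x} on 0 <= y <= x:
# Y ~ Exp(1) (its marginal) and X = Y + Exp(1) (its conditional given Y).
n = 1_000_000
y = rng.exponential(1.0, n)
x = y + rng.exponential(1.0, n)
w = x - y

w0 = 1.5
print(np.mean(w <= w0))    # empirical cdf of W at 1.5
print(1 - np.exp(-w0))     # derived cdf: 1 - e^{-1.5} ≈ 0.777
```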
There is another approach to finding the probability distribution of functions of random variables, which involves moment-
generating functions. Recall the following properties of mgf's.
Theorem 3.8.4
The mgf $M_X(t)$ of random variable X uniquely determines the probability distribution of X. In other words, if random variables X and Y have the same mgf, $M_X(t) = M_Y(t)$, then X and Y have the same probability distribution.
Theorem 3.8.2
Let X be a random variable with mgf $M_X(t)$, and let a, b be constants. If random variable $Y = aX + b$, then the mgf of Y is given by
$$M_Y(t) = e^{bt} M_X(at).$$
Theorem 3.8.3
If $X_1, \ldots, X_n$ are independent random variables with mgf's $M_{X_1}(t), \ldots, M_{X_n}(t)$, respectively, then the mgf of random variable $Y = X_1 + \cdots + X_n$ is given by
$$M_Y(t) = M_{X_1}(t) \cdots M_{X_n}(t) = \prod_{i=1}^{n} M_{X_i}(t).$$
Theorem 3.8.4 states that mgf's are unique, and Theorems 3.8.2 & 3.8.3 combined provide a process for finding the mgf of a linear
combination of random variables. All three theorems provide a Moment-Generating-Function technique for finding the probability
distribution of a function of random variable(s), which we demonstrate with the following examples involving the normal
distribution.
Example 5.4.9
Suppose that $X \sim N(\mu, \sigma)$. It can be shown that the mgf of X is given by
$$M_X(t) = e^{\mu t + (\sigma^2 t^2/2)}, \quad \text{for } t \in \mathbb{R}.$$
Using this mgf formula, we can show that $Z = \frac{X-\mu}{\sigma}$ has the standard normal distribution.
1. First, note that setting μ = 0 and σ = 1 in the formula above gives the mgf of the standard normal distribution: $M_Z(t) = e^{t^2/2}$.
2. Also note that $\frac{X-\mu}{\sigma} = \left(\frac{1}{\sigma}\right)X + \left(\frac{-\mu}{\sigma}\right)$, so by Theorem 3.8.2,
$$M_{\frac{X-\mu}{\sigma}}(t) = e^{-\frac{\mu t}{\sigma}}\, M_X\!\left(\frac{t}{\sigma}\right) = e^{-\frac{\mu t}{\sigma}}\, e^{\frac{\mu t}{\sigma} + \frac{\sigma^2 (t/\sigma)^2}{2}} = e^{t^2/2}.$$
Thus, we have shown that Z and $\frac{X-\mu}{\sigma}$ have the same mgf, which by Theorem 3.8.4, says that they have the same distribution.
Now suppose $X_1, \ldots, X_n$ are each independent normally distributed with means $\mu_1, \ldots, \mu_n$ and sd's $\sigma_1, \ldots, \sigma_n$, respectively. Let's find the probability distribution of the sum $Y = a_1X_1 + \cdots + a_nX_n$ ($a_1, \ldots, a_n$ constants) using the mgf technique. By Theorems 3.8.2 & 3.8.3,
$$M_Y(t) = \prod_{i=1}^{n} M_{X_i}(a_i t) = \prod_{i=1}^{n} e^{\mu_i a_i t + \sigma_i^2 a_i^2 t^2/2} \;\Rightarrow\; M_Y(t) = e^{\mu_Y t + \sigma_Y^2 t^2/2},$$
where $\mu_Y = \sum a_i\mu_i$ and $\sigma_Y^2 = \sum a_i^2\sigma_i^2$. This is the mgf of a normal distribution, so by Theorem 3.8.4, Y is itself normally distributed.
Sums of Independent Normal Random Variables
If $X_1, \ldots, X_n$ are mutually independent normal random variables with means $\mu_1, \ldots, \mu_n$ and standard deviations $\sigma_1, \ldots, \sigma_n$, respectively, then the linear combination
$$Y = a_1X_1 + \cdots + a_nX_n = \sum_{i=1}^{n} a_iX_i$$
is normally distributed with mean and variance given by
$$\mu_Y = a_1\mu_1 + \cdots + a_n\mu_n = \sum_{i=1}^{n} a_i\mu_i, \qquad \sigma_Y^2 = a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2 = \sum_{i=1}^{n} a_i^2\sigma_i^2.$$
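A quick simulation (my own sketch, not part of the text) of one particular linear combination illustrates the boxed result:

```python
import numpy as np

rng = np.random.default_rng(8)

# Y = 2*X1 - 3*X2 with X1 ~ N(1, 2) and X2 ~ N(4, 0.5), independent.
x1 = rng.normal(1, 2, 1_000_000)
x2 = rng.normal(4, 0.5, 1_000_000)
y = 2 * x1 - 3 * x2

# Predicted: mu_Y = 2*1 - 3*4 = -10, sigma_Y = sqrt(4*4 + 9*0.25).
print(y.mean(), y.std())
print(-10, np.sqrt(2**2 * 2**2 + (-3)**2 * 0.5**2))   # -10, ~4.272
```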
This page titled 5.4: Finding Distributions of Functions of Continuous Random Variables is shared under a not declared license and was authored,
remixed, and/or curated by Kristin Kuter.
5.5: Sample Mean
Suppose we are interested in understanding the mean of some population of values, but do not have full information about the
entire population. One approach to solving this problem is to obtain a random sample of a subset of values from the population and
consider the mean of the sample. The mean of the sample is referred to as the sample mean. Since the sample is randomly
selected, the sample mean may be thought of as a function applied to a collection of random variables.
Example 5.5.1
Suppose we want to know the average SAT math score for girls in Indiana. We could randomly select seniors from high schools in
the South Bend School Corporation as a sample from all IN girls, and use the mean SAT math score for the South Bend girls as an
estimate of the overall mean for IN girls.
The mean for the SB girls depends on which sample we randomly select; therefore, the sample mean is a random variable.
The probability distribution of the sample mean is referred to as the sampling distribution of the sample mean. The following
result, which is a corollary to Sums of Independent Normal Random Variables, indicates how to find the sampling distribution when
the population of values follows a normal distribution.
Corollary 5.5.1
If $X_1, \ldots, X_n$ represent the values of a random sample from a $N(\mu, \sigma)$ population, then the sample mean
$$\bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
is also normally distributed; specifically,
$$\bar{X} \sim N(\mu, \sigma/\sqrt{n}).$$
Proof
1. Sample observations are independent when randomly selected. Furthermore, each observation has the same distribution as the population. Since $X_1, \cdots, X_n$ represent the observations in the random sample, $X_1, \cdots, X_n$ are independent and each $X_i \sim N(\mu, \sigma)$.
2. $\bar{X}$ is a sum of independent normally distributed random variables:
$$\bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{X_1}{n} + \frac{X_2}{n} + \cdots + \frac{X_n}{n} = \sum_{i=1}^{n} \frac{1}{n}X_i \;\Longrightarrow\; a_i = \frac{1}{n}, \text{ for } i = 1, \cdots, n.$$
3. By Sums of Independent Normal Random Variables, $\bar{X}$ is normally distributed with
$$\mu_{\bar{X}} = \sum_{i=1}^{n} \frac{1}{n}\mu = \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu = n\cdot\frac{1}{n}\mu = \mu$$
$$\sigma^2_{\bar{X}} = \sum_{i=1}^{n} \left(\frac{1}{n}\right)^2 \sigma^2 = n\cdot\frac{1}{n^2}\sigma^2 = \frac{\sigma^2}{n} \;\Longrightarrow\; \sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$$
So $\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
Example 5.5.2
Suppose that SAT math scores for girls in Indiana are assumed to be N(549, 24).
Find and compare the sampling distributions for the sample means from a sample of size n = 16 and a sample of size n = 36.
For n = 16: sample mean $\bar{X} \sim N\!\left(549, \frac{24}{\sqrt{16}} = 6\right)$
For n = 36: sample mean $\bar{Y} \sim N\!\left(549, \frac{24}{\sqrt{36}} = 4\right)$
Let's find the probability that each sample mean will be within 10 points of the actual population mean (μ = 549):
$\bar{X}$: $P(|\bar{X} - \mu| \leq 10) = P(539 \leq \bar{X} \leq 559) = \text{normalcdf}(539, 559, 549, 6) = 0.9044$
$\bar{Y}$: $P(|\bar{Y} - \mu| \leq 10) = P(539 \leq \bar{Y} \leq 559) = \text{normalcdf}(539, 559, 549, 4) = 0.9876$
The following figure gives the plot of the pdf's for the sampling distributions of $\bar{X}$ (blue) and $\bar{Y}$ (yellow). Note that the spread of the pdf for $\bar{X}$ is larger than for $\bar{Y}$.
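If scipy is available, the calculator command normalcdf(a, b, μ, σ) used above corresponds to a difference of normal cdf's (my own sketch, not part of the text):

```python
from scipy.stats import norm

# normalcdf(a, b, mu, sigma) on a TI calculator is
# norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma) in scipy.
for sigma in (6, 4):   # sd of the sample mean for n = 16 and n = 36
    p = norm.cdf(559, 549, sigma) - norm.cdf(539, 549, sigma)
    print(sigma, round(p, 4))   # 0.9044 and 0.9876
```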
Example 5.5.3
Suppose we are interested in the lifetime of a radioactive particle. We saw in Section 4.5 that the probability distribution of such
lifetimes can be modeled with an exponential distribution. If λ = 3 , for example, then the pdf is skewed right, because there is a
tail of values with very low probabilities off to the right.
Central Limit Theorem
Let $X_1, \ldots, X_n$ be a random sample from any probability distribution with mean μ and sd σ. Then as the sample size n → ∞, the probability distribution of the sample mean approaches the normal distribution. We write:
$$\bar{X} \xrightarrow{d} N(\mu, \sigma/\sqrt{n}), \quad \text{as } n \to \infty \tag{5.5.1}$$
In other words, if n is sufficiently large, we can approximate the sampling distribution of the sample mean as $N(\mu, \sigma/\sqrt{n})$. Furthermore,
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1) \quad \text{as } n \to \infty \tag{5.5.2}$$
The d above the arrow in Equation 5.5.1 stands for distribution and indicates that, as the sample size increases without bound, the limit of the probability distribution of $\bar{X}$ is given by the $N(\mu, \sigma/\sqrt{n})$ distribution. This is referred to as convergence in distribution.
What's "sufficiently large''?
If the distribution of the X is symmetric, unimodal or continuous, then a sample size n as small as 4 or 5 yields an adequate
i
approximation.
If the distribution of the X is skewed, then a sample size n of at least 25 or 30 yields an adequate approximation.
i
If the distribution of the X is extremely skewed, then you may need an even larger n .
i
The following website provides a simulation of sampling distributions and demonstrates the Central Limit Theorem (link
available).
Example 5.5.4
Continuing in the context of Example 5.5.3, suppose we sample n = 25 such radioactive particles. Then the sampling distribution of the mean of the sample is approximated as follows.
Letting $X_1, \ldots, X_{25}$ denote the random sample, we have that each $X_i \sim \text{exponential}(\lambda = 3)$. By the Properties of Exponential Distributions, we know that the mean of an exponential(3) distribution is given by $\mu = \frac{1}{\lambda} = \frac{1}{3}$ and the sd is also $\sigma = \frac{1}{\lambda} = \frac{1}{3}$. By the Central Limit Theorem, the sampling distribution of the sample mean is therefore approximately
$$\bar{X} \approx N\!\left(\frac{1}{3}, \frac{1/3}{\sqrt{25}}\right) = N\!\left(\frac{1}{3}, \frac{1}{15}\right).$$
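A simulation sketch (my own, not part of the text) of this sampling distribution shows the CLT approximation at work:

```python
import numpy as np

rng = np.random.default_rng(9)

# 100,000 replications of the mean of n = 25 exponential(rate=3) lifetimes.
# numpy parameterizes the exponential by its scale = 1/rate = 1/3.
means = rng.exponential(scale=1/3, size=(100_000, 25)).mean(axis=1)

print(means.mean(), means.std())   # approximately 1/3 ≈ 0.333 and 1/15 ≈ 0.067
```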
What is the use of the Central Limit Theorem if we don't know μ , the mean of the population? We can use the CLT to approximate
estimation error probabilities:
$$P(|\bar{X} - \mu| \leq \varepsilon), \tag{5.5.3}$$
where ε > 0 is the desired bound on the estimation error. By Equation 5.5.2, we know that
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1). \tag{5.5.4}$$
From this fact, we can standardize the inequality in Equation 5.5.3 as follows:
$$P(|\bar{X} - \mu| \leq \varepsilon) = P\!\left(\frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} \leq \frac{\varepsilon}{\sigma/\sqrt{n}}\right) \approx P\!\left(|Z| \leq \frac{\varepsilon}{\sigma/\sqrt{n}}\right) = P\!\left(-\frac{\varepsilon}{\sigma/\sqrt{n}} \leq Z \leq \frac{\varepsilon}{\sigma/\sqrt{n}}\right) \tag{5.5.5}$$
Example 5.5.5
Now suppose that we do not know the rate at which the radioactive particle of interest decays, i.e., we do not know the mean lifetime of such particles. We can develop a method for approximating the probability that the mean of a sample of size n = 25 is within 1 unit of the mean lifetime.
In other words, we want $P(|\bar{X} - \mu| \leq 1)$.
By the Central Limit Theorem and Equation 5.5.2, we know that
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{25}} = \frac{\bar{X} - \mu}{\sigma/5} \approx N(0,1).$$
Thus,
$$P(|\bar{X} - \mu| \leq 1) = P\!\left(\frac{|\bar{X} - \mu|}{\sigma/5} \leq \frac{1}{\sigma/5}\right) \approx P\!\left(|Z| \leq \frac{5}{\sigma}\right).$$
This page titled 5.5: Sample Mean is shared under a not declared license and was authored, remixed, and/or curated by Kristin Kuter.
Index
W
Weibull Distributions
4.6: Weibull Distributions