BSTA 2104 Probability and Statistics II Notes Sep Dec 2024


BSTA 2104 PROBABILITY AND STATISTICS II

UNIT LECTURER: MARY KAMINA

2024/2025 SEP-DEC SEMESTER

GROUPS: BCS 2.1, BDAT 2.1, BIT 3.1, BSEN 2.1


Pre-requisite: BSTA 1203 Probability and Statistics I

Course Purpose
The course explores probability distributions of functions of random variables,
an essential skill in proving standard statistical results.

Expected Learning Outcomes


At the end of this course, students should be able to:

1. Explain the concept of both univariate/one-dimensional and bivariate/two-dimensional random variables.

2. Evaluate the distribution of functions of random variables and calculate expectations.

3. Calculate conditional means and variances based on bivariate/two-dimensional distributions.

4. Establish the relationship (link) between various probability distributions.

Course Content

(a) Review of random variables, probability distributions and mathematical expectation.
(b) Law of large numbers.
(c) Conditional probability distributions.
(d) Marginal and conditional probabilities of bivariate discrete distributions.
(e) Marginal and conditional probabilities of continuous distributions.
(f) Covariance and correlation coefficients.
(g) Conditional expectation and variance.
(h) Moments and probability generating function.
(i) Probability distributions: binomial, Poisson, hypergeometric, exponential, normal, beta and gamma, and their links.
(j) Bivariate probability distributions: bivariate normal distribution and transformation of variables.

For this unit, other than the notes provided here, use this website in tandem:
https://www.probabilitycourse.com/preface.php

Mary Kamina © 2024


1 Review of Random Variables, Probability Distributions and Mathematical Expectation
Random Variables
A random variable is a numerical outcome of a random process. It can be
classified into three main types:

1. Discrete Random Variables
Take on a finite or countably infinite set of values. Examples include the
number of heads in a series of coin tosses or the outcome of rolling a die.

2. Continuous Random Variables
Can take on any value within a range. Examples include the exact height
of students in a class or the time taken to run a race.

3. Mixed Random Variables
These are random variables that are neither discrete nor continuous, but
a mixture of both.
To learn more about mixed random variables, see
https://www.probabilitycourse.com/chapter4/4_3_1_mixed.php

Probability Distributions
Probability distributions describe how the values of a random variable are dis-
tributed. The main types are:

1. Discrete Distributions
For example, the binomial distribution describes the number of successes
in a fixed number of independent Bernoulli trials.
2. Continuous Distributions
For example, the normal distribution describes data that clusters around
a mean.

Some Distributions

1. Binomial Distribution
   • PMF: P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k} for k = 0, 1, . . . , n.
   • Mean: E[X] = np
   • Variance: Var(X) = np(1 − p)

2. Normal Distribution
   • PDF: f(x) = \frac{1}{\sqrt{2πσ^2}} e^{−(x−µ)^2/(2σ^2)}
   • Mean: µ
   • Variance: σ^2

3. Exponential Distribution
   • PDF: f(x) = λe^{−λx} for x ≥ 0.
   • Mean: E[X] = 1/λ
   • Variance: Var(X) = 1/λ^2
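As a quick numerical check of these formulas, the following minimal Python sketch (assuming scipy is available; the parameter values n = 10, p = 0.3 and λ = 2 are chosen purely for illustration) compares the closed-form means and variances with the library's values.

# Sketch: compare closed-form moments with scipy's values (illustrative parameters).
from scipy import stats

n, p, lam = 10, 0.3, 2.0                              # hypothetical parameter choices
print(stats.binom.mean(n, p), n * p)                  # binomial mean: both give 3.0
print(stats.binom.var(n, p), n * p * (1 - p))         # binomial variance: both give 2.1
print(stats.expon.mean(scale=1/lam), 1/lam)           # exponential mean: both give 0.5
print(stats.expon.var(scale=1/lam), 1/lam**2)         # exponential variance: both give 0.25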

There are several important concepts that need to be understood about probability distributions.

1. Discrete Probability Distributions
Probability Mass Function (PMF): The PMF of a discrete random variable X, denoted P(X = x), gives the probability that X takes the value x. The PMF satisfies the properties:

\sum_x P(X = x) = 1, \quad 0 ≤ P(X = x) ≤ 1 \ ∀ x   (1)

Example
For a fair six-sided die, the PMF is P(X = x) = 1/6 for x = 1, 2, 3, 4, 5 or 6.

2. Continuous Probability Distributions
Probability Density Function (PDF): The PDF f_X(x) of a continuous random variable X describes the relative likelihood for X to take on a given value. The probability that X lies within an interval [a, b] is given by:

P(a ≤ X ≤ b) = \int_a^b f_X(x) \, dx   (2)

Example
The normal distribution with mean µ and variance σ^2 has PDF

f(x) = \frac{1}{\sqrt{2πσ^2}} e^{−(x−µ)^2/(2σ^2)}   (3)
3. Cumulative Distribution Function (CDF)
The CDF gives the probability that X takes a value less than or equal to x:

F_X(x) = P(X ≤ x)   (4)

(a) For a discrete random variable:

F_X(x) = \sum_{t ≤ x} P(X = t)   (5)



Figure 1: FX (3) = P (X ≤ 3)

(b) For a continuous random variable:

F_X(x) = \int_{−∞}^{x} f_X(t) \, dt   (6)

Properties of the CDF

1. F_X(x) is non-decreasing: F_X(x_1) ≤ F_X(x_2) if x_1 < x_2

2. lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1

3. F_X(x) is right-continuous: lim_{h→0^+} F_X(x + h) = F_X(x)

Theorems related to Probability Distributions


1. The Law of Total Probability
2. Bayes Theorem

Mathematical Expectation
Mathematical expectation, also known as the expected value (EV) or mean, is
a fundamental concept in probability and statistics that provides a measure of
the central tendency of a random variable.

Notation
For a discrete random variable X, the expected value is denoted as E[X].
For a continuous random variable, it’s often represented as µ or E[X].

Discrete Random Variables
For a discrete random variable X with possible values x_1, x_2, . . . , x_n and corresponding probabilities P(X = x_i) = p_i:

E[X] = \sum_{i=1}^{n} x_i · p_i



Continuous Random Variables
For a continuous random variable with probability density function f (x):
E[X] = \int_{−∞}^{∞} x f(x) \, dx

Properties of Expectation
1. Linearity of Expectation: If X and Y are random variables, then:
E[aX + bY ] = aE[X] + bE[Y ]
This holds true regardless of whether X and Y are independent.
2. Expectation of Constants: If c is a constant, then:
E[c] = c

3. Non-negativity: If X ≥ 0 almost surely, then E[X] ≥ 0.


4. Law of Total Expectation: If Y is another random variable, then:
E[X] = E[E[X|Y ]]
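These properties are easy to sanity-check by simulation. A minimal Python sketch (assuming numpy; the constants a = 2, b = 3 and the chosen distributions are purely illustrative) checks linearity of expectation numerically:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, size=1_000_000)      # X with E[X] = 5 (illustrative)
y = rng.exponential(2.0, size=1_000_000)      # Y with E[Y] = 2 (illustrative)
a, b = 2.0, 3.0

# Linearity: E[aX + bY] should be close to a E[X] + b E[Y] = 16.
print(np.mean(a * x + b * y), a * np.mean(x) + b * np.mean(y))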

This link gives a general overview of probability distributions, showing what we are going to cover in this unit:
https://www.statlect.com/probability-distributions/

Now that we have reviewed probability distributions, let's consider their random variables. A probability distribution can involve one, two or more random variables. At this level, we look at distributions with one and with two random variables.

If a distribution has a single random variable, it is called a univariate/one-dimensional distribution. This could either be a univariate discrete distribution or a univariate continuous distribution. If it has two random variables, it is called a bivariate/two-dimensional distribution, which could either be a bivariate discrete distribution or a bivariate continuous distribution. A bivariate continuous distribution describes the joint behavior of two continuous random variables, X and Y, and is characterized by a joint probability density function (PDF).

Univariate/One-Dimensional Random Variables
A univariate/one-dimensional random variable is a variable that can take on different values based on the outcomes of a single random phenomenon. It is characterized by its probability distribution, which describes the likelihood of each possible value. A univariate/one-dimensional random variable can come from either a discrete or a continuous distribution.



Examples for Univariate Random Variables
1. The height of students in a class, which can be any value within a range
(e.g., 150 cm ≤ X ≤ 200 cm).
2. Let X be a continuous random variable representing the height of adults in
a certain population, measured in centimeters. The heights are normally
distributed with a mean (µ) of 170 cm and a standard deviation (σ) of 10
cm.
The probability density function for a normal distribution is given by:

f(x) = \frac{1}{σ\sqrt{2π}} e^{−(x−µ)^2/(2σ^2)}

For our example, the PDF becomes:

f(x) = \frac{1}{10\sqrt{2π}} e^{−(x−170)^2/200}

To find the probability that a randomly selected adult has a height between 160 cm and 180 cm, we calculate:

P(160 ≤ X ≤ 180) = \int_{160}^{180} f(x) \, dx

This integral can be computed using standard normal distribution tables or computational tools.
To find probabilities using the standard normal distribution, we convert the heights to Z-scores:

Z = \frac{X − µ}{σ}

Calculating for our bounds:
• For X = 160: Z = (160 − 170)/10 = −1
• For X = 180: Z = (180 − 170)/10 = 1

Now, we can use Z-tables:

P(160 ≤ X ≤ 180) = P(−1 ≤ Z ≤ 1) ≈ 0.6827

Thus, there is approximately a 68.27% chance that a randomly selected adult from this population will have a height between 160 cm and 180 cm.
3. Let X be the outcome of rolling a fair six-sided die. The possible values are 1, 2, 3, 4, 5, 6 with:

P(X = x) = 1/6, for x = 1, 2, 3, 4, 5, 6.



4. Let X be a discrete random variable representing the outcome of rolling a fair six-sided die. The possible values for X are the integers 1, 2, 3, 4, 5, 6. Since the die is fair, each outcome has an equal probability. The probability mass function P(X = x) can be defined as:

P(X = x) = 1/6 for x = 1, 2, 3, 4, 5, 6

1. Probability of a Specific Outcome
The probability of rolling a 3:

P(X = 3) = 1/6

2. Probability of Rolling an Even Number
The even outcomes are 2, 4, 6. Thus, the probability of rolling an even number is:

P(X is even) = P(X = 2) + P(X = 4) + P(X = 6) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2

3. Cumulative Distribution Function (CDF)
The cumulative distribution function F(x) gives the probability that X is less than or equal to x:

F(x) = P(X ≤ x)

For example: F(3) = P(X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2
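The die example can be reproduced in a few lines. A minimal Python sketch (assuming numpy; the arrays simply encode the PMF above):

import numpy as np

outcomes = np.arange(1, 7)                  # faces 1..6
pmf = np.full(6, 1/6)                       # fair die: P(X = x) = 1/6

print(pmf[outcomes == 3].sum())             # P(X = 3) ≈ 0.1667
print(pmf[outcomes % 2 == 0].sum())         # P(X is even) = 0.5
print(np.cumsum(pmf)[outcomes == 3][0])     # F(3) = P(X <= 3) = 0.5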

Bivariate/Two-Dimensional Random Variables


Bivariate random variables involve two random variables, typically denoted as X
and Y . They describe the outcomes of two related random phenomena, allowing
for the analysis of the relationship between them.

Joint Probability Distribution


1. Discrete Case: For discrete random variables, the joint probability distribution is given by the joint probability mass function P(X = x, Y = y).

2. Continuous Case: For continuous random variables, the joint distribution is defined by a joint probability density function f(x, y), where:

P((X, Y) ∈ A) = \iint_A f(x, y) \, dx \, dy

for any region A.



Independence
Two random variables X and Y are independent if:

P (X = x, Y = y) = P (X = x) · P (Y = y)

for all x and y.

Examples for Bivariate Random Variables


1. Let X be the height (in cm) and Y be the weight (in kg) of individuals in
a sample. The joint distribution can help analyze how height influences
weight. A possible joint PMF could be:

P (X = 170, Y = 65) = 0.1

This indicates a 10% chance of selecting an individual who is 170 cm tall


and weighs 65 kg.
2. Consider X as the score in Mathematics and Y as the score in English
for a group of students. The joint distribution may reveal trends such as
higher math scores correlating with higher English scores, which could be
visualized using a scatter plot.

Covariance and Correlation


The relationship between two bivariate random variables can also be measured
using:

• Covariance:

Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])].

• Correlation Coefficient:

ρ_{X,Y} = \frac{Cov(X, Y)}{σ_X σ_Y},

which standardizes the covariance to a range between −1 and 1, indicating
the strength and direction of the linear relationship.
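As an illustration of these two measures, the following minimal Python sketch (assuming numpy; the small height/weight sample is made up for this sketch, not taken from the examples above) estimates both from data:

import numpy as np

# Hypothetical paired observations: height (cm) and weight (kg).
height = np.array([160, 165, 170, 175, 180, 185])
weight = np.array([55, 60, 65, 68, 74, 80])

cov_xy = np.cov(height, weight)[0, 1]       # sample covariance
rho = np.corrcoef(height, weight)[0, 1]     # correlation coefficient, between -1 and 1
print(cov_xy, rho)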

Examples: Discrete and Continuous Distributions


Discrete Case
1. Consider a fair six-sided die. The expected value is:

E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5



2. Let X be the number of heads in two tosses of a fair coin. The values and probabilities are:

P(X = 0) = 1/4, P(X = 1) = 1/2, P(X = 2) = 1/4.

The expected value is:

E[X] = 0 · \frac{1}{4} + 1 · \frac{1}{2} + 2 · \frac{1}{4} = 0 + \frac{1}{2} + \frac{2}{4} = 1.

3. Suppose you win KES 10 with probability 1/10, KES 20 with probability 1/5, and KES 0 with probability 7/10. Then:

E[X] = 10 · \frac{1}{10} + 20 · \frac{1}{5} + 0 · \frac{7}{10} = 1 + 4 = KES 5.

Continuous Case
1. For a uniform distribution on the interval [0, 1]:

E[X] = \int_0^1 x · 1 \, dx = \left[ \frac{x^2}{2} \right]_0^1 = \frac{1}{2}.

2. For a normal distribution X ∼ N (µ, σ 2 ), the expected value is:

E[X] = µ.

For example, if µ = 5 and σ 2 = 2, then E[X] = 5.

Examples: Probability Distributions and Mathematical Expectation

Theoretical Examples
1. Prove that if X is a random variable that only takes the values 0 and 1,
then Var(X) = E[X](1 − E[X]).
Solution
Var(X) = E[X 2 ] − (E[X])2
Since X 2 = X when X is 0 or 1:

E[X 2 ] = E[X]

Thus:
Var(X) = E[X] − (E[X])2 = E[X](1 − E[X])



2. Show that if X is a constant random variable, then Var(X) = 0.
Solution
If X = c, a constant:

Var(X) = E[(X − E[X])2 ] = E[(c − c)2 ] = E[0] = 0

3. Prove that the expected value of the sum of independent random variables is the sum of their expected values.
Solution
Let X_1, X_2, . . . , X_n be independent random variables. Using the linearity of expectation:

E\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} E[X_i]

The result holds as the expectation of the sum is simply the sum of the expectations; in fact, linearity holds even without independence. What if you have random variables X and Y? See the link
https://www.youtube.com/watch?v=7KeV3wLw0_o
Practical Examples
4. A fair six-sided die is rolled. Define a random variable X as the outcome of the roll. What is E[X] and Var(X)?
Solution

P(X = x) = 1/6, x = 1, 2, 3, 4, 5, 6

E[X] = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5

Var(X) = \frac{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2}{6} − (3.5)^2 = \frac{91}{6} − 12.25 ≈ 2.92
5. If X is a random variable representing the number of heads in three coin flips, find E[X] and Var(X).
Solution
X can take values 0, 1, 2, 3 with probabilities 1/8, 3/8, 3/8, 1/8:

P(X = 0) = \frac{1}{8}, P(X = 1) = \frac{3}{8}, P(X = 2) = \frac{3}{8}, P(X = 3) = \frac{1}{8}

E[X] = 0 · \frac{1}{8} + 1 · \frac{3}{8} + 2 · \frac{3}{8} + 3 · \frac{1}{8} = \frac{3}{8} + \frac{6}{8} + \frac{3}{8} = \frac{12}{8} = 1.5

E[X^2] = 0 · \frac{1}{8} + 1 · \frac{3}{8} + 4 · \frac{3}{8} + 9 · \frac{1}{8} = \frac{24}{8} = 3

Var(X) = E[X^2] − (E[X])^2 = 3 − (1.5)^2 = 0.75



6. A random variable X represents the number of students who pass an exam.
If the probability that any student passes is 0.8, and there are 10 students,
what are E[X] and Var(X)?
Solution
X follows a binomial distribution X ∼ Binomial(10, 0.8):

E[X] = np = 10 × 0.8 = 8

How would you interpret the E[X] = 8?

Var(X) = np(1 − p) = 10 × 0.8 × 0.2 = 1.6
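One way to read E[X] = 8 is as a long-run average: over many classes of 10 students, the average number who pass is about 8. A minimal simulation sketch (assuming numpy):

import numpy as np

rng = np.random.default_rng(42)
passes = rng.binomial(n=10, p=0.8, size=100_000)   # simulate many classes of 10 students

print(passes.mean())   # close to E[X] = np = 8
print(passes.var())    # close to Var(X) = np(1 - p) = 1.6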

Theoretical Examples: Discrete and Continuous Distributions

1. Proof that the sum of the probability mass function (PMF) over all possible values equals 1.

Problem breakdown
Prove that for a discrete random variable X with probability mass func-
tion pX (x), the sum of pX (x) over all possible values of x is equal to 1.

Solution
The probability mass function (PMF) pX (x) of a discrete random variable
X is defined as:
pX (x) = P (X = x)
This represents the probability that X takes a specific value x.
Since X can take one of a countable set of values, say x1 , x2 , . . ., the total
probability for all possible values of X must sum to 1:
\sum_{x_i} p_X(x_i) = 1

This equation is a consequence of the axioms of probability, particularly


the law of total probability, which states that the sum of probabilities of
all mutually exclusive and exhaustive outcomes must equal 1.
Consider a simple case where X can take only three values: x1 , x2 , x3 .
The total probability is:

pX (x1 ) + pX (x2 ) + pX (x3 ) = 1

For a general discrete random variable X, which can take infinitely many
values x1 , x2 , x3 , . . ., the sum extends over all these values:

\sum_{i=1}^{∞} p_X(x_i) = 1



Therefore, by the definition of a probability mass function, the sum of
pX (x) over all possible values x must equal 1, completing the proof.
2. Proof of the limits of the Cumulative Distribution Function (CDF).

Problem breakdown
Show that if X is a random variable with a cumulative distribution func-
tion (CDF) FX (x), then limx→−∞ FX (x) = 0 and limx→∞ FX (x) = 1.

Solution
The cumulative distribution function (CDF) FX (x) of a random variable
X is defined as:
FX (x) = P (X ≤ x)
This represents the probability that the random variable X takes a value
less than or equal to x.
To prove the first part, limx→−∞ FX (x) = 0: As x approaches −∞, the
probability P (X ≤ x) becomes smaller and smaller, because X has a
lower probability of taking extremely negative values. In the limit, as x
approaches −∞, the CDF approaches 0 because:
\lim_{x→−∞} P(X ≤ x) = 0

To prove the second part, limx→∞ FX (x) = 1: As x approaches ∞, the


probability P (X ≤ x) increases because X has a higher probability of
taking values less than or equal to x. In the limit, as x approaches ∞, the
CDF approaches 1 because:
\lim_{x→∞} P(X ≤ x) = 1

This result holds because the total probability of all possible values of X
must sum to 1.
Thus, the limits of the cumulative distribution function are:
\lim_{x→−∞} F_X(x) = 0 \quad \text{and} \quad \lim_{x→∞} F_X(x) = 1

3. Proof that the expectation of a constant random variable is the constant itself.

Problem breakdown
Prove that the expectation of a constant random variable c is c.

Solution
Let X be a constant random variable, meaning X takes the value c with
probability 1:
P (X = c) = 1



For all other values, X has probability 0:

P(X = x) = 0 for x ≠ c

The expectation (or mean) of X is defined as:

E[X] = \sum_x x · p_X(x)

Since X = c with probability 1, the expectation simplifies to:

E[X] = c · P(X = c) = c · 1 = c

Therefore, the expectation of a constant random variable X = c is simply the constant c, completing the proof.
4. Proof that the mean of a Poisson distribution is λ.
Problem breakdown
Prove that the mean of a Poisson distribution with parameter λ is λ.

Solution

Let X be a Poisson random variable with parameter λ, i.e., X ∼ Poisson(λ).


The probability mass function is given by:

P(X = k) = \frac{λ^k e^{−λ}}{k!} \quad \text{for } k = 0, 1, 2, . . .

The expectation (mean) of X is defined as:

E[X] = \sum_{k=0}^{∞} k · P(X = k) = \sum_{k=0}^{∞} k · \frac{λ^k e^{−λ}}{k!}

We can manipulate the summation to express it in a simpler form. Notice that:

k · \frac{λ^k e^{−λ}}{k!} = λ · \frac{λ^{k−1} e^{−λ}}{(k − 1)!}

This allows us to rewrite the expectation as:

E[X] = λ \sum_{k=1}^{∞} \frac{λ^{k−1} e^{−λ}}{(k − 1)!}

Changing the index of summation by setting j = k − 1, we obtain:

E[X] = λ \sum_{j=0}^{∞} \frac{λ^j e^{−λ}}{j!}

The series

\sum_{j=0}^{∞} \frac{λ^j e^{−λ}}{j!}

is e^{−λ} times the Taylor series expansion of e^{λ}, so it equals e^{−λ} · e^{λ} = 1. Therefore:

E[X] = λ × 1 = λ

Thus, the mean of a Poisson distribution with parameter λ is λ, completing the proof.
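A quick numerical check of this result (a minimal sketch assuming scipy; the value λ = 3.5 is arbitrary, and the infinite sum is truncated far in the tail):

import numpy as np
from scipy import stats

lam = 3.5                                        # illustrative Poisson parameter
k = np.arange(0, 200)                            # truncation of the infinite sum
mean_by_sum = np.sum(k * stats.poisson.pmf(k, lam))

print(mean_by_sum, stats.poisson.mean(lam))      # both are (numerically) 3.5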

5. Proof that the CDF of a normal distribution is given by the error function.

Problem breakdown
Show that if X follows a normal distribution N(µ, σ^2), then the CDF of X is given by:

F_X(x) = \frac{1}{2}\left[ 1 + \text{erf}\left( \frac{x − µ}{σ\sqrt{2}} \right) \right]

where erf is the error function.

Solution
Let X be a random variable with a normal distribution X ∼ N(µ, σ^2). The cumulative distribution function (CDF) F_X(x) is defined as:

F_X(x) = P(X ≤ x) = \int_{−∞}^{x} f_X(t) \, dt

where f_X(t) is the probability density function (PDF) of X:

f_X(t) = \frac{1}{σ\sqrt{2π}} e^{−(t−µ)^2/(2σ^2)}

To express this in terms of the error function, we perform a substitution:

u = \frac{t − µ}{σ\sqrt{2}}, \quad \text{which implies} \quad du = \frac{dt}{σ\sqrt{2}}

Thus, the integral becomes:

F_X(x) = \frac{1}{\sqrt{π}} \int_{−∞}^{(x−µ)/(σ\sqrt{2})} e^{−u^2} \, du

Splitting the integral at 0 (the part from −∞ to 0 equals \frac{1}{2}) and using the definition erf(z) = \frac{2}{\sqrt{π}} \int_0^z e^{−u^2} \, du, we get:

F_X(x) = \frac{1}{2}\left[ 1 + \text{erf}\left( \frac{x − µ}{σ\sqrt{2}} \right) \right]

Thus, the CDF of a normal distribution is given by the error function.
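This identity is easy to verify numerically. A minimal sketch (assuming scipy, with illustrative values µ = 170, σ = 10 and x = 180):

import math
from scipy import stats

mu, sigma, x = 170.0, 10.0, 180.0    # illustrative values
via_erf = 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
via_cdf = stats.norm.cdf(x, loc=mu, scale=sigma)

print(via_erf, via_cdf)              # both ≈ 0.8413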
6. Proof that the Variance of a Binomial Distribution is np(1 − p)
Problem breakdown
Prove that the variance of a binomial distribution with parameters n and
p is np(1 − p).

Solution
Let X be a random variable following a binomial distribution with parameters n and p, denoted X ∼ Binomial(n, p). The probability mass function (PMF) of X is:

P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}

for k = 0, 1, 2, . . . , n.
The mean (expectation) of X is given by:

E[X] = np

To find the variance, Var(X), we use:

Var(X) = E[X^2] − (E[X])^2

First, we find E[X^2]. Using the fact that:

E[X^2] = E[X(X − 1) + X]

we have:

E[X(X − 1)] = n(n − 1)p^2 and E[X] = np

Thus:

E[X^2] = n(n − 1)p^2 + np

Substituting this into the variance formula:

Var(X) = E[X^2] − (E[X])^2 = n(n − 1)p^2 + np − (np)^2 = n^2p^2 − np^2 + np − n^2p^2 = np − np^2 = np(1 − p)

Therefore, the variance of a binomial distribution with parameters n and p is np(1 − p).



2 Law of Large Numbers (LLN)
The Law of Large Numbers (LLN) states that as the sample size n increases, the sample average of independent and identically distributed (i.i.d.) random variables X_1, X_2, . . . , X_n approaches the expected value µ = E[X_i]. There are two main forms:

1. Weak LLN (WLLN)
The sample mean converges in probability to the population mean as n → ∞.

2. Strong LLN (SLLN)
The sample mean converges almost surely to the population mean as n → ∞.

Weak Law of Large Numbers (WLLN)
The Weak Law of Large Numbers (WLLN) ensures that for any ϵ > 0:

\lim_{n→∞} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i − µ \right| ≥ ϵ \right) = 0.

Derivation of WLLN
Let X_1, X_2, . . . , X_n be i.i.d. random variables with mean µ and variance σ^2. Define the sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.

Compute the expectation of the sample mean:

E[\bar{X}_n] = µ.

Compute the variance of the sample mean:

Var(\bar{X}_n) = \frac{σ^2}{n}.

Apply Chebyshev's inequality:

P\left( \left| \bar{X}_n − µ \right| ≥ ϵ \right) ≤ \frac{σ^2}{nϵ^2}.

As n → ∞, the right-hand side goes to 0, proving the WLLN.

Strong Law of Large Numbers (SLLN)
The Strong Law of Large Numbers (SLLN) states that the sample mean converges almost surely to the population mean µ:

P\left( \lim_{n→∞} \bar{X}_n = µ \right) = 1.



Derivation of SLLN
Define the partial sums S_n = \sum_{i=1}^{n} X_i.
Relate the sample mean to the partial sums: \bar{X}_n = S_n / n.
Use the Borel-Cantelli Lemma and advanced techniques to show that \bar{X}_n converges almost surely to µ.
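The convergence of the sample mean can be seen directly by simulation. A minimal Python sketch (assuming numpy; exponential data with true mean 2 chosen only for illustration) tracks the running mean:

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)          # i.i.d. draws with mean 2
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])                     # drifts toward the true mean 2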

Central Limit Theorem (CLT)
The Central Limit Theorem (CLT) states that the sum of i.i.d. random variables, when normalized, converges in distribution to a normal distribution as n → ∞:

Z_n = \frac{\sum_{i=1}^{n} (X_i − µ)}{σ\sqrt{n}} \xrightarrow{d} N(0, 1).

Derivation of CLT
Let X_1, X_2, . . . , X_n be independent and identically distributed (i.i.d.) random variables with mean µ = E[X_i] and variance σ^2 = Var(X_i). Define the normalized sum Z_n as:

Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i − µ}{σ}.

As n → ∞, the distribution of Z_n converges in distribution to a standard normal distribution:

Z_n \xrightarrow{d} N(0, 1).

This means that the standardized sum of the random variables approaches a normal distribution with mean 0 and variance 1, regardless of the distribution of X_i, provided X_i has finite mean and variance.
To derive the CLT, we use characteristic functions, which are a powerful tool in probability theory for studying sums of random variables.
Let X_1, X_2, . . . , X_n be i.i.d. random variables with mean µ and variance σ^2. The sum of these random variables is:

S_n = X_1 + X_2 + · · · + X_n.

The sample mean is:

\bar{X}_n = \frac{S_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i.

We are interested in the behavior of S_n as n → ∞. To standardize S_n, we subtract the expected value nµ and divide by the standard deviation σ\sqrt{n}, forming a normalized sum Z_n:

Z_n = \frac{S_n − nµ}{σ\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i − µ}{σ}.
Thus, Z_n is a standardized sum of the random variables. The goal is to show that Z_n converges in distribution to N(0, 1).
The characteristic function of a random variable X is defined as the expected value of e^{itX}:

φ_X(t) = E\left[ e^{itX} \right].

The characteristic function uniquely determines the distribution of a random variable and helps in analyzing sums of independent random variables.
Let Y_i = \frac{X_i − µ}{σ}. These are i.i.d. random variables with mean 0 and variance 1. The characteristic function of Y_i is:

φ_{Y_i}(t) = E\left[ e^{itY_i} \right].

Since the Y_i's are independent, the characteristic function of the normalized sum Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i is:

φ_{Z_n}(t) = \left[ φ_{Y_1}\left( \frac{t}{\sqrt{n}} \right) \right]^n.

We now expand φ_{Y_i}(t) in a Taylor series around t = 0. Since E[Y_i] = 0 and Var(Y_i) = 1, we have:

φ_{Y_i}(t) = 1 − \frac{t^2}{2} + o(t^2).

Substituting this approximation into the characteristic function of Z_n, we get:

φ_{Z_n}(t) = \left( 1 − \frac{t^2}{2n} + o\left( \frac{1}{n} \right) \right)^n.

For large n, we use the fact that \left(1 + \frac{x}{n}\right)^n → e^x as n → ∞. Thus:

φ_{Z_n}(t) ≈ \exp\left( −\frac{t^2}{2} \right).

This is the characteristic function of the standard normal distribution N(0, 1). Since the characteristic function of Z_n converges to the characteristic function of N(0, 1), we conclude that Z_n converges in distribution to N(0, 1):

Z_n \xrightarrow{d} N(0, 1).
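The convergence in distribution can also be checked numerically. A minimal Python sketch (assuming numpy and scipy; exponential summands with mean 2 and n = 500 are chosen only for illustration) compares P(Z_n ≤ 1) with Φ(1):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 500, 10_000
mu, sigma = 2.0, 2.0                                   # exponential(scale=2): mean 2, sd 2
samples = rng.exponential(scale=2.0, size=(reps, n))
z_n = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

print(np.mean(z_n <= 1.0), stats.norm.cdf(1.0))        # both ≈ 0.8413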



Relationship Between LLN and CLT
1. The LLN ensures that the sample mean converges to the population mean
as n → ∞.
2. The CLT describes the distribution of the sample mean for large, but finite
n, and shows that it becomes approximately normal.
3. Key Difference: The LLN is about convergence to the mean, while the
CLT is about the distribution of the sample mean for finite samples.

Theoretical Examples
1. Prove the Weak Law of Large Numbers using Chebyshev’s inequality for
i.i.d. random variables X1 , X2 , . . . , Xn with mean µ and variance σ 2 .

Solution
Compute E[\bar{X}_n] = µ.
Compute Var(\bar{X}_n) = σ^2/n. Apply Chebyshev's inequality:

P\left( |\bar{X}_n − µ| ≥ ϵ \right) ≤ \frac{σ^2}{nϵ^2}.

As n → ∞, the probability goes to 0, proving the WLLN.

2. Sketch the proof of the Strong Law of Large Numbers for i.i.d. random
variables X1 , X2 , . . . , Xn with mean µ and variance σ 2 .

Solution
Define S_n = \sum_{i=1}^{n} X_i. Show that \bar{X}_n = S_n/n converges almost surely to µ.
(Hint: Use the Borel-Cantelli Lemma)

3. Show that the sample proportion in a sequence of Bernoulli trials with


probability of success p converges to p using the WLLN.

Solution
Let X_i = 1 if the i-th trial is a success, and 0 otherwise. Then X_i has mean p and variance p(1 − p).
Use the WLLN to show that the sample proportion \frac{1}{n} \sum_{i=1}^{n} X_i converges to p in probability.

4. Derive the Central Limit Theorem for i.i.d. random variables X1 , X2 , . . . , Xn


with mean µ and variance σ 2 .

Solution
Define Z_n = \frac{\sum_{i=1}^{n} (X_i − µ)}{σ\sqrt{n}}.
Show that Z_n \xrightarrow{d} N(0, 1).

5. Prove the Strong Law of Large Numbers using martingale techniques (ad-
vanced).

Solution
(Hint: use the Martingale Convergence Theorem).

Practical Examples
1. You have collected daily temperatures over 365 days. Use the Weak Law
of Large Numbers to estimate the population mean temperature.

Solution
Let X_i be the temperature on day i.
Compute the sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.
By the WLLN, \bar{X}_n converges in probability to the true population mean temperature as n → ∞.

2. Completion times of 100 employees are recorded. Estimate the average


completion time using the sample mean and explain why this is a good
estimate using the Weak LLN.

Solution
Let X_i be the completion time for the i-th employee.
Compute the sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.
By the WLLN, the sample mean approximates the population mean completion time as n grows large.

3. Heights of 1000 individuals are measured. Estimate the probability that


the sample mean height is within 2 cm of the population mean using the
Central Limit Theorem.

Solution
Assume the heights are normally distributed with mean µ and variance σ^2.
Use the CLT to approximate the distribution of the sample mean:

\bar{X}_n ∼ N\left( µ, \frac{σ^2}{n} \right).

Compute the probability that |\bar{X}_n − µ| ≤ 2 cm.
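For a concrete feel for this computation, a minimal sketch (assuming scipy and a hypothetical population standard deviation σ = 10 cm, which is not given in the problem):

import math
from scipy import stats

sigma, n, tol = 10.0, 1000, 2.0                  # sigma is an assumed value
se = sigma / math.sqrt(n)                        # standard error of the sample mean
prob = stats.norm.cdf(tol / se) - stats.norm.cdf(-tol / se)

print(prob)                                      # P(|sample mean - µ| <= 2) ≈ 1.0 for these numbers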



4. Over 200 days, the number of customers visiting a store is recorded daily.
Use the Strong Law of Large Numbers to explain why the average number
of visits will stabilize as the number of days increases.

Solution
Let X_i be the number of customer visits on day i. Compute the sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.
By the SLLN, \bar{X}_n will almost surely converge to the true mean number of customer visits as n → ∞.

5. A sample of 1000 products is inspected, and 5% are found to be defective.


Estimate the probability that the proportion of defective products in fu-
ture samples will be within 1% of the true defect rate using the Central
Limit Theorem.

Solution
Let X_i be 1 if the i-th product is defective, and 0 otherwise. The sample proportion is \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.
Apply the CLT to approximate the distribution of \bar{X}_n.
Use this distribution to compute the probability that the sample proportion is within 1% of the true defect rate.

Here is a link to more examples in this lesson:
https://www.probabilitycourse.com/chapter7/7_1_3_solved_probs.php

3 Conditional Probability Distributions

Conditional Probability
Conditional probability quantifies the likelihood of an event A occurring given that another event B has occurred. The formula is:

P(A|B) = \frac{P(A ∩ B)}{P(B)}, \quad \text{where } P(B) > 0

Examples on Conditional Probability


1. What is the probability that it will rain tomorrow given that it is cloudy
today? given that P (R ∩ C) = 0.15 (Probability of rain and cloudy) and
P (C) = 0.5 (Probability of cloudy)?

Solution
Identify events: R (Rain), C (Cloudy).
Use the formula: P(R|C) = P(R ∩ C)/P(C).
Substitute values: P(R|C) = 0.15/0.5.
Calculate: P(R|C) = 0.3.
Conclusion: The probability of rain given it is cloudy is 0.3.

2. What is the probability of passing an exam given that a student stud-


ied given P (P ∩ S) = 0.72 (Probability of passing and studying) and
P (S) = 0.8 (Probability of studying)

Solution
Identify events: P (Pass), S (Studied).
Use the formula: P(P|S) = P(P ∩ S)/P(S).
Substitute values: P(P|S) = 0.72/0.8.
Calculate: P(P|S) = 0.9.
Conclusion: The probability of passing given studying is 0.9.

Conditional Probability Distributions
For two random variables X and Y, the conditional probability distribution of X given Y = y is defined as:

P(X = x|Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}, \quad P(Y = y) > 0

Examples on Conditional Probability Distributions


1. What is the probability that a student received an ’A’ given that they
studied for 5 hours given P (X = A, Y = 5) = 0.1 and P (Y = 5) = 0.25?

Solution
Identify events: X (Grade), Y (Hours studied).
Use the formula: P(X = A|Y = 5) = P(X = A, Y = 5)/P(Y = 5).
Substitute values: P(X = A|Y = 5) = 0.1/0.25.
Calculate: P(X = A|Y = 5) = 0.4.
Conclusion: The probability of receiving an 'A' given 5 hours of study is 0.4.

2. What is the probability that food is served in under 10 minutes on Mon-


day given P (X = Under 10 minutes, Y = Monday) = 0.2 and P (Y =
Monday) = 0.5?

Solution
Identify events: X (Wait time), Y (Day).
Use the formula: P(X = Under 10 minutes|Y = Monday) = P(X = Under 10 minutes, Y = Monday)/P(Y = Monday).
Substitute values: P(X = Under 10 minutes|Y = Monday) = 0.2/0.5.
Calculate: P(X = Under 10 minutes|Y = Monday) = 0.4.
Conclusion: The probability of food being served in under 10 minutes on Monday is 0.4.

More Examples for you to practise


1. A software system is tested by two teams. Team A detects bugs 70% of the
time, and Team B detects bugs 80% of the time. Team A performs 40%
of the tests, and Team B performs 60%. Given that a bug was detected,
what is the probability that Team A conducted the test?

Solution: Define A as Team A testing and D as a bug being detected.
We want to find P(A|D).
Using Bayes' Theorem:

P(A|D) = \frac{P(D|A)P(A)}{P(D)}

First, calculate P(D):

P(D) = P(D|A)P(A) + P(D|B)P(B) = (0.7 × 0.4) + (0.8 × 0.6) = 0.74

Now calculate P(A|D):

P(A|D) = \frac{0.7 × 0.4}{0.74} ≈ 0.378

2. A data center has two types of servers. Type A servers fail with a prob-
ability of 0.1 when CPU usage is high, while Type B servers fail with a
probability of 0.2. 30% of the servers are Type A, and 70% are Type B.
If a server fails, what is the probability that it is a Type A server?

Solution
Let A be Type A server, F be a server failure.
We are asked to find P(A|F).
Using Bayes' Theorem:

P(A|F) = \frac{P(F|A)P(A)}{P(F)}

Compute P(F):

P(F) = P(F|A)P(A) + P(F|B)P(B) = (0.1 × 0.3) + (0.2 × 0.7) = 0.17

Now, calculate P(A|F):

P(A|F) = \frac{0.1 × 0.3}{0.17} ≈ 0.176

3. A cloud service provider has two types of servers: high performance (HP)
and low performance (LP). High-performance servers experience overload
with a probability of 0.05, while low-performance servers experience over-
load with a probability of 0.2. 20% of the servers are high-performance.
If an overload occurs, what is the probability it was a high-performance
server?
Solution
Let H be the event of using a high-performance server, and O be an overload.
We want to find P(H|O).
Use Bayes' Theorem:

P(H|O) = \frac{P(O|H)P(H)}{P(O)}

Calculate P(O):

P(O) = P(O|H)P(H) + P(O|L)P(L) = (0.05 × 0.2) + (0.2 × 0.8) = 0.17

Now, calculate P(H|O):

P(H|O) = \frac{0.05 × 0.2}{0.17} ≈ 0.0588

4. In a SaaS (Software as a Service, is a cloud computing model that delivers


software applications over the internet on a subscription basis) company,
60% of users who haven’t logged in for over a month churn, while 20% of
regular users churn. 25% of users haven’t logged in for over a month. If a
user churns, what is the probability that they hadn’t logged in for over a
month?

Solution
Define L as not logged in for over a month and C as churn.
We want to calculate P(L|C).
Using Bayes' Theorem:

P(L|C) = \frac{P(C|L)P(L)}{P(C)}

Calculate P(C):

P(C) = P(C|L)P(L) + P(C|R)P(R) = (0.6 × 0.25) + (0.2 × 0.75) = 0.3

Now, calculate P(L|C):

P(L|C) = \frac{0.6 × 0.25}{0.3} = 0.5

4 Conditional Probability Distributions


A conditional probability distribution describes the probability of one random
variable taking specific values given that another random variable has already
taken a certain value. This applies to both discrete and continuous random
variables.
Given two random variables X and Y, the conditional probability distribution of X given Y = y is:

• For discrete random variables:

P(X = x|Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}, \quad P(Y = y) > 0

• For continuous random variables:

f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \quad f_Y(y) > 0

where f_{X|Y}(x|y) is the conditional probability density function (PDF) of X given Y = y.

Examples on Discrete Conditional Probability Distributions


1. Consider two systems, A and B, where system B fails with probability 0.2, and both fail together with probability 0.1. What is the conditional probability that system A fails given that system B fails?

Solution
Define the events:

A : System A fails, B : System B fails

Known values:

P(A ∩ B) = 0.1, P(B) = 0.2

Apply the conditional probability formula:

P(A|B) = \frac{P(A ∩ B)}{P(B)} = \frac{0.1}{0.2} = 0.5

Conclusion: The probability that system A fails given that system B fails is 0.5.

2. Suppose a bug detection system identifies bugs in two categories: X and Y. If the probability of finding a bug in category Y is 0.5, and the joint probability of finding a bug in both categories is 0.3, what is the conditional probability of finding a bug in X given that there's a bug in Y?

Solution
Define the events:

X : Bug found in category X, Y : Bug found in category Y

Known values:

P(X ∩ Y) = 0.3, P(Y) = 0.5

Apply the conditional probability formula:

P(X|Y) = \frac{P(X ∩ Y)}{P(Y)} = \frac{0.3}{0.5} = 0.6

Conclusion: The conditional probability of finding a bug in X given that there is a bug in Y is 0.6.

3. A data science team tracks whether a customer clicks on a product link C given that they visited the product page V. If 30% of all users click on the link, 40% visit the page, and 20% both visit the page and click, find the conditional probability of a click given a visit.

Solution
Define the events:

C : Customer clicks, V : Customer visits the page

Known values:

P(C ∩ V) = 0.2, P(V) = 0.4

Apply the conditional probability formula:

P(C|V) = \frac{P(C ∩ V)}{P(V)} = \frac{0.2}{0.4} = 0.5

Conclusion: The probability of a click given that the customer visited the page is 0.5.

4. A company manufactures two types of products, and it is known that 15% of all products are defective. Defective type A products make up 5% of all products, and defective type B products make up 10%. If a product is defective, what is the probability it is of type A?

Solution
Define the events:

A : Product is of type A, D : Product is defective

Known values:

P(A ∩ D) = 0.05, P(D) = 0.15

Apply the conditional probability formula:

P(A|D) = \frac{P(A ∩ D)}{P(D)} = \frac{0.05}{0.15} ≈ 0.333

Conclusion: The probability that a defective product is of type A is approximately 0.333.

Examples on Continuous Conditional Probability Distributions

1. In a network, the time T to transfer data is modeled as a continuous random variable. The conditional distribution of T, given that network latency L is fixed at l = 2, is represented by:

f_{T|L}(t|l) = \frac{f_{T,L}(t, l)}{f_L(l)}

If f_{T,L}(t, l) = e^{−(t+l)} and f_L(l) = e^{−l}, find f_{T|L}(t|2).



Solution
Joint PDF:

f_{T,L}(t, l) = e^{−(t+l)}

Marginal PDF for L:

f_L(l) = e^{−l}

Apply the conditional PDF formula:

f_{T|L}(t|l) = \frac{e^{−(t+l)}}{e^{−l}} = e^{−t}

Conclusion: Therefore, f_{T|L}(t|2) = e^{−t}.

2. The CPU usage X at a given time is modeled as a continuous variable. Given that at a certain time the CPU usage is known to be 60%, find the conditional distribution of future CPU usage assuming the joint PDF is:

f_{X,Y}(x, y) = λe^{−λ(x+y)}

and the marginal distribution f_Y(y) is λe^{−λy}, where λ = 1.

Solution
Joint PDF (with λ = 1):

f_{X,Y}(x, y) = e^{−(x+y)}

Marginal PDF for Y:

f_Y(y) = e^{−y}

Apply the conditional PDF formula:

f_{X|Y}(x|y) = \frac{e^{−(x+y)}}{e^{−y}} = e^{−x}

Conclusion: Therefore, f_{X|Y}(x|0.6) = e^{−x}.

3. A data scientist tracks customer spending X based on the time Y spent on the website. The joint distribution of X and Y is given by f_{X,Y}(x, y) = 2e^{−2(x+y)}. Find the conditional distribution f_{X|Y}(x|y).

Solution
Joint PDF:

f_{X,Y}(x, y) = 2e^{−2(x+y)}

Marginal PDF for Y:

f_Y(y) = \int_0^∞ 2e^{−2(x+y)} \, dx = e^{−2y}

Apply the conditional PDF formula:

f_{X|Y}(x|y) = \frac{2e^{−2(x+y)}}{e^{−2y}} = 2e^{−2x}

Conclusion: The conditional distribution is 2e^{−2x}.
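Example 3 can be verified numerically. A minimal sketch (assuming scipy; y = 0.7 is an arbitrary fixed value) checks the marginal by integration and that the conditional density integrates to 1:

import numpy as np
from scipy.integrate import quad

joint = lambda x, y: 2 * np.exp(-2 * (x + y))
y = 0.7                                          # arbitrary fixed value of Y

marginal_y, _ = quad(lambda x: joint(x, y), 0, np.inf)
print(marginal_y, np.exp(-2 * y))                # both ≈ e^(-2y)

conditional = lambda x: joint(x, y) / marginal_y
total, _ = quad(conditional, 0, np.inf)
print(total)                                     # ≈ 1, as a density must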

The Law of Total Probability
The Law of Total Probability states that the probability of an event can be computed by considering all possible ways the event can occur based on a partition of the sample space:

P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i)

Derivation of The Law of Total Probability


1. Express P (A) as a sum over disjoint events Bi :

P (A) = P (A ∩ B1 ) + P (A ∩ B2 ) + · · · + P (A ∩ Bn )

2. Use the definition of conditional probability:

P (A ∩ Bi ) = P (A|Bi )P (Bi )

3. Substitute back into the sum:

P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i)

Examples of the Law of Total Probability

1. What is the probability of server failure, given P(A|B_1) = 0.6 (hardware failure), P(B_1) = 0.3 and P(A|B_2) = 0.4 (software failure), P(B_2) = 0.7?

Solution
Identify events: A (Server failure), B_1 (Hardware failure), B_2 (Software failure).
Apply the Law of Total Probability:

P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2)

Substitute values:

P(A) = (0.6)(0.3) + (0.4)(0.7)

Calculate:

P(A) = 0.18 + 0.28 = 0.46

Conclusion: The probability of server failure is 0.46.

2. What is the probability of a successful software deployment given P(A|B_1) = 0.7 (manual testing), P(B_1) = 0.4 and P(A|B_2) = 0.9 (automated testing), P(B_2) = 0.6?

Solution
Identify events: A (Success), B1 (Manual testing), B2 (Automated test-
ing).
Apply the Law of Total Probability:

P (A) = P (A|B1 )P (B1 ) + P (A|B2 )P (B2 )

Substitute values:

P (A) = (0.7)(0.4) + (0.9)(0.6)

Calculate:
P (A) = 0.28 + 0.54 = 0.82

Conclusion: The probability of a successful software deployment is 0.82.

Bayes Theorem
Bayes’ Theorem relates the conditional and marginal probabilities of random
events:
P(A|B) = \frac{P(B|A)P(A)}{P(B)}



Bayes Theorem Proof
Start with the definition of conditional probability:

P(A|B) = \frac{P(A ∩ B)}{P(B)}

Rewrite P(A ∩ B) using conditional probability:

P(A ∩ B) = P(B|A)P(A)

Substitute into the first equation:

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

Click this link to see the differences between Bayes Theorem and Conditional Probability:
https://www.cuemath.com/data/bayes-theorem/

Examples on Bayes Theorem


1. What is the probability that a server is faulty given that it has crashed
given P (F ) = 0.02 (Faulty), P (C|F ) = 0.95 (Crash given Faulty) and
P (C) = 0.1?

Solution
Identify events: F (Faulty), C (Crash).
Apply Bayes' Theorem:

P(F|C) = \frac{P(C|F)P(F)}{P(C)}

Substitute values:

P(F|C) = \frac{(0.95)(0.02)}{0.1}

Calculate:

P(F|C) = \frac{0.019}{0.1} = 0.19

Conclusion: The probability that the server is faulty given that it crashed is 0.19.



2. What is the probability that a file is malicious given it was flagged given
P (M ) = 0.01 (Malicious), P (F |M ) = 0.9 (Flagged given Malicious) and
P (F ) = 0.05?

Solution
Identify events: M (Malicious), F (Flagged).
Apply Bayes' Theorem:

P(M|F) = \frac{P(F|M)P(M)}{P(F)}

Substitute values:

P(M|F) = \frac{(0.9)(0.01)}{0.05}

Calculate:

P(M|F) = \frac{0.009}{0.05} = 0.18

Conclusion: The probability that the file is malicious given it was flagged is 0.18.
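All of these two-hypothesis Bayes calculations follow the same pattern, so a small helper makes them easy to reproduce. A minimal Python sketch (the function bayes_posterior is our own illustration, not a library routine):

def bayes_posterior(prior_a, like_b_given_a, like_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) via Bayes' theorem."""
    p_b = like_b_given_a * prior_a + like_b_given_not_a * (1 - prior_a)
    return like_b_given_a * prior_a / p_b

# Faulty-server example: P(F) = 0.02, P(C|F) = 0.95 and P(C) = 0.1 overall,
# so P(C|not F) = (0.1 - 0.95*0.02)/0.98.
print(bayes_posterior(0.02, 0.95, (0.1 - 0.95 * 0.02) / 0.98))   # ≈ 0.19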

Conditional Means and Variances

Conditional Mean
It is the expected value of X given Y = y:

E[X|Y = y] = \sum_x x P(X = x|Y = y)

Conditional Variance
It is the variance of X given Y = y:

Var(X|Y = y) = E[X^2|Y = y] − (E[X|Y = y])^2

Examples on Conditional Means and Variances


1. What is the expected salary of an employee given 5 years of experience
given salary distributions: S = {50k, 60k, 70k} with probabilities based
on experience?

Solution
Define possible salaries and probabilities for 5 years of experience.
Calculate the conditional mean:

E[S|Y = 5] = (50k)(0.3) + (60k)(0.4) + (70k)(0.3)

Calculate:

E[S|Y = 5] = 15k + 24k + 21k = 60k

Conclusion: The expected salary given 5 years of experience is KES 60,000.

2. What is the expected bug fix time given that there are 3 developers, if the bug fix times are T = {5, 7, 10} hours with probabilities P(T = 5) = 0.2, P(T = 7) = 0.5, P(T = 10) = 0.3?

Solution
Calculate the expected time:

E[T |D = 3] = (5)(0.2) + (7)(0.5) + (10)(0.3)

Calculate:
E[T |D = 3] = 1 + 3.5 + 3 = 7.5 hours

Conclusion: The expected bug fix time is 7.5 hours when 3 developers are
working.
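As a short continuation of this example (not part of the original solution), the conditional variance follows from the formula above:

E[T^2|D = 3] = (25)(0.2) + (49)(0.5) + (100)(0.3) = 5 + 24.5 + 30 = 59.5

Var(T|D = 3) = E[T^2|D = 3] − (E[T|D = 3])^2 = 59.5 − (7.5)^2 = 59.5 − 56.25 = 3.25 hours^2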

Are Bayes' Theorem and Conditional Probability the same?

The conditional probability of an event A given another event B is defined as:

P(A|B) = \frac{P(A ∩ B)}{P(B)}, \quad \text{where } P(B) > 0

This formula tells us how to calculate the probability of A occurring given that B has already occurred.

Bayes' Theorem is a specific application of the conditional probability formula. It allows us to reverse conditional probabilities, i.e., to compute P(A|B) when we know P(B|A), P(A), and P(B). Bayes' Theorem is derived from the definition of conditional probability and is written as:

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

where P(B) can be expanded using the Law of Total Probability if necessary:

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

Key Differences



1. Conditional Probability Formula: Directly calculates the probability of
one event given another.
2. Bayes’ Theorem: Reverses the known conditional probability to find an-
other conditional probability in cases where direct computation is not
possible. It is derived from the conditional probability formula, but it is
more specific as it gives a way to reverse or ”update” probabilities based
on new information.

5 Marginal and Conditional Probabilities of Bivariate/Two-Dimensional Discrete Distributions
Bivariate Discrete Random Variables
In many real-world problems, we are often interested in two or more variables
simultaneously. These variables might depend on each other, and analyzing
their behavior jointly gives us a better understanding of the system under study.
For example, we might want to study the relationship between the number of
defective products and the number of complaints received in a factory.

A bivariate discrete random variable consists of two discrete random variables,


X and Y , which are defined on the same sample space Ω. The pair (X, Y )
takes values in a finite or countably infinite set, and each value corresponds to
an ordered pair (x, y) in the xy-plane.

Bivariate Discrete Probability Distributions


In probability theory, a bivariate discrete distribution refers to the probability
distribution of two discrete random variables defined on the same sample space.
These variables can be dependent or independent, and the relationship between
them can be captured through joint, marginal, and conditional distributions.
Understanding these distributions is crucial in fields such as data science, soft-
ware engineering, computer science, IT and statistics, where two or more related
phenomena are often studied together. Let’s denote the two discrete random
variables by X and Y

Joint Probability Mass Function (Joint PMF)


The joint probability mass function (PMF) of two discrete random variables X
and Y is denoted by P (X = x, Y = y) and gives the probability that X takes a
specific value x, and Y takes a specific value y. Formally, the joint PMF satisfies
two conditions:

1. P(X = x, Y = y) ≥ 0

2. \sum_x \sum_y P(X = x, Y = y) = 1



The joint PMF describes the complete probabilistic behavior of X and Y
together.

Examples on Joint Distribution of Two Random Variables


1. Consider the example of placing three balls at random into three cells. Let X represent the number of balls in Cell 1, and Y represent the number of cells occupied. The joint probability distribution for the random variables X and Y is shown in the table below; the Total column gives the marginal distribution of X.

X\Y      1       2       3       Total
0        2/27    6/27    0       8/27
1        0       6/27    6/27    12/27
2        0       6/27    0       6/27
3        1/27    0       0       1/27

2. In a computer network, packets are sent to two servers, and each packet
either reaches its destination or is lost. Let X represent the number of
packets reaching Server A and Y represent the number of packets reaching
Server B. If the joint PMF is given by:

X\Y 0 1 2 Total
0 0.1 0.1 0.05 0.25
1 0.1 0.2 0.15 0.45
2 0.05 0.1 0.15 0.30

How can we verify that this is a valid joint PMF?

Solution
We need to verify that the total probability sums to 1. Summing the table:

0.1 + 0.1 + 0.05 + 0.1 + 0.2 + 0.15 + 0.05 + 0.1 + 0.15 = 1

Thus, this is a valid joint PMF.


3. In an e-commerce platform, let X represent the number of items a cus-
tomer buys in a single session, and Y represent the number of items re-
turned. The joint distribution is given by:

X\Y 0 1 2 Total
0 0.2 0.05 0.02 0.27
1 0.15 0.2 0.05 0.4
2 0.08 0.15 0.1 0.33

How can we compute the total probability?



Solution
We need to sum all joint probabilities to check that they equal 1:

0.2 + 0.05 + 0.02 + 0.15 + 0.2 + 0.05 + 0.08 + 0.15 + 0.1 = 1

Thus, the distribution is valid.


4. In a computer system, tasks are scheduled across two processors. Let X
represent the number of tasks on Processor A, and Y represent the number
of tasks on Processor B. The joint PMF is given by:

X\Y 0 1 2 Total
0 0.15 0.1 0.05 0.3
1 0.2 0.2 0.1 0.5
2 0.05 0.1 0.05 0.2

What is the probability that Processor A handles exactly 1 task?

Solution
To find P (X = 1), sum the probabilities for X = 1:

P (X = 1) = 0.2 + 0.2 + 0.1 = 0.5

5. In a software testing process, two test suites are run on the same code. Let
X represent the number of failed test cases in Suite A, and Y represent
the number of failed test cases in Suite B. The joint distribution is given
as:

X\Y 0 1 2 Total
0 0.3 0.15 0.05 0.5
1 0.1 0.15 0.05 0.3
2 0.05 0.1 0.05 0.2

What is the probability that neither test suite has any failed test cases?

Solution
The probability that neither test suite has any failed test cases is P (X =
0, Y = 0) = 0.3.

Marginal Probability
Marginal Probability from Set Theory
Marginal probability can be understood from set theory by considering it as the
probability of a single event occurring without regard to other related events.
It is derived from joint probabilities by ”summing out” or ”marginalizing” over
the outcomes of the other events.



Definitions
• Sample Space (Ω): The set of all possible outcomes.
• Event (A): A subset of the sample space (i.e., A ⊆ Ω) representing a
specific outcome or a set of outcomes.
• Joint Probability: The probability of two events A and B happening
simultaneously is denoted as P (A ∩ B), where ∩ is the intersection of the
sets.

Marginal Probability from Set Theory


Given two events A and B, the marginal probability of event A, denoted P (A),
is the probability of event A happening, regardless of whether event B occurs.
In terms of joint probabilities, the marginal probability of A can be expressed as:

P(A) = \sum_B P(A ∩ B)

Here, we are "summing over" all possible occurrences of B, considering every way that event A can occur in conjunction with different outcomes for event B.

Set Theory Interpretation


• Joint Probability P (A ∩ B)
This is the probability of both A and B occurring together, which is
represented as the intersection of the sets A and B.
• Marginal Probability P (A)
Marginal probability is the probability of being in the set A, regardless of
the status of set B. This means we are not focusing on the intersection
with B anymore but are looking at the overall likelihood of A happening,
which includes every scenario where B might occur or not.

Example from Set Theory


Consider two events A is the event that it rains and B is the event that a foot-
ball game is played.

The joint probability P (A ∩ B) is the probability that it rains and the football
game is played.
The marginal probability P (A) would be the total probability that it rains,
regardless of whether the game happens or not. This would include both:
• The probability that it rains and the game is played.
• The probability that it rains and the game is not played.
In set notation, marginal probability can be seen as focusing on the event A,
whether A ∩ B or A ∩ B c (where B c is the complement of B) occurs.



Marginal Probability Mass Function (Marginal PMF)
The marginal probability mass function of a discrete random variable is obtained
by summing the joint PMF over all possible values of the other variable. The
marginal PMF gives the probability distribution of one variable, disregarding
the other. The marginal PMF of X, denoted as P (X = x), is given by:
P(X = x) = \sum_y P(X = x, Y = y)

The marginal PMF of Y, denoted as P(Y = y), is given by:

P(Y = y) = \sum_x P(X = x, Y = y)

Examples on Marginal Distributions from Joint PMF

1. Using the joint distribution from Example 1 in the Joint Distribution of Two Random Variables section, we can compute the marginal distributions as follows:

Solution

P(X = 0) = 8/27, P(X = 1) = 12/27, P(X = 2) = 6/27, P(X = 3) = 1/27

P(Y = 1) = 3/27, P(Y = 2) = 18/27, P(Y = 3) = 6/27
2. In the example of network packet loss, the second example in the Joint
Distribution of Two Random Variables section, the joint distribution of
packets reaching Servers A and B is:

X\Y 0 1 2 Total
0 0.1 0.1 0.05 0.25
1 0.1 0.2 0.15 0.45
2 0.05 0.1 0.15 0.30

What is the marginal distribution of packets reaching Server A?

Solution
The marginal distribution of packets reaching Server A, P (X), is:

P (X = 0) = 0.25, P (X = 1) = 0.45, P (X = 2) = 0.30


3. In the earlier example of customer purchases and returns, the joint PMF
is given by:

X\Y 0 1 2 Total
0 0.2 0.05 0.02 0.27
1 0.15 0.2 0.05 0.4
2 0.08 0.15 0.1 0.33

What is the marginal distribution of the number of items returned, Y ?

Solution
We sum the joint probabilities over all values of X:

P (Y = 0) = 0.2 + 0.15 + 0.08 = 0.43


P (Y = 1) = 0.05 + 0.2 + 0.15 = 0.4
P (Y = 2) = 0.02 + 0.05 + 0.1 = 0.17

Thus, the marginal distribution of Y is P (Y = 0) = 0.43, P (Y = 1) = 0.4,


and P (Y = 2) = 0.17.
4. Given the joint distribution of tasks handled by two processors:

X\Y 0 1 2 Total
0 0.15 0.1 0.05 0.3
1 0.2 0.2 0.1 0.5
2 0.05 0.1 0.05 0.2

What is the marginal distribution of tasks handled by Processor B?

Solution
To find the marginal distribution of Y (tasks on Processor B):



P (Y = 0) = 0.15 + 0.2 + 0.05 = 0.4
P (Y = 1) = 0.1 + 0.2 + 0.1 = 0.4
P (Y = 2) = 0.05 + 0.1 + 0.05 = 0.2

Thus, the marginal distribution of Y is P (Y = 0) = 0.4, P (Y = 1) = 0.4,


and P (Y = 2) = 0.2.
5. In a software testing process, the joint distribution of the number of bugs
found by two teams, X and Y , is given as:

X\Y 0 1 2 Total
0 0.3 0.15 0.05 0.5
1 0.1 0.15 0.05 0.3
2 0.05 0.1 0.05 0.2

What is the marginal distribution of the number of bugs found by Team A?

Solution
We sum the joint probabilities for each value of X:

P (X = 0) = 0.3 + 0.15 + 0.05 = 0.5


P (X = 1) = 0.1 + 0.15 + 0.05 = 0.3
P (X = 2) = 0.05 + 0.1 + 0.05 = 0.2

Thus, the marginal distribution of X is P (X = 0) = 0.5, P (X = 1) = 0.3,


and P (X = 2) = 0.2.
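All of these marginal calculations are just row and column sums of the joint table. A minimal Python sketch using the software-testing table above (assuming numpy):

import numpy as np

# Joint PMF P(X = x, Y = y) for the software-testing example (rows: X, columns: Y).
joint = np.array([[0.30, 0.15, 0.05],
                  [0.10, 0.15, 0.05],
                  [0.05, 0.10, 0.05]])

print(joint.sum())         # 1.0: valid joint PMF
print(joint.sum(axis=1))   # marginal of X: [0.5, 0.3, 0.2]
print(joint.sum(axis=0))   # marginal of Y: [0.45, 0.40, 0.15]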

Marginal Probability in Discrete Random Variables


When working with discrete random variables, computing marginal and condi-
tional probabilities is essential. Here’s how they are defined:

• Marginal Probability: The probability of an event occurring without


considering other variables. For instance, P (X = x) is the marginal prob-
ability of X = x, computed by summing over all possible values of the
other variable Y .
• Joint Probability: The probability that two random variables X and Y
take on specific values simultaneously, denoted P (X = x, Y = y).
• Conditional Probability: The probability of one event occurring given
that another has occurred, denoted P (Y = y | X = x).

In the context of bivariate discrete distributions, these probabilities are key to


understanding how two random variables interact.



Joint Probability Mass Function (PMF)
For two discrete random variables X and Y , the joint probability mass function
(PMF) is defined as:

PXY (x, y) = P (X = x, Y = y)

This represents the probability that X takes value x and Y takes value y simul-
taneously. The joint PMF contains all the information about the distribution
of X and Y .

Joint Range RXY is the set of all pairs (x, y) where PXY (x, y) > 0:

RXY = {(x, y) | PXY (x, y) > 0}

The joint PMF can be written as:

P (X = x, Y = y) = P ((X = x) ∩ (Y = y))

Lemma 1: The sum of all probabilities in the joint PMF must equal 1:

\sum_{x∈R_X} \sum_{y∈R_Y} P_{XY}(x, y) = 1

Proof

Let (X, Y ) be a pair of discrete random variables with a joint PMF PXY (x, y)
defined over their respective ranges RX and RY . The joint PMF PXY (x, y)
assigns probabilities to every pair (x, y), ensuring that all possible outcomes are
accounted for. By the definition of a probability measure, the total probability
across the sample space must equal 1:
\sum_{x∈R_X} \sum_{y∈R_Y} P_{XY}(x, y) = 1

This is a consequence of the axioms of probability, specifically the normalization


condition, which states that the probability of the entire sample space must sum
to 1. For a finite number of outcomes, we can express:

n = |RX | and m = |RY | (the number of elements in RX and RY ).

Therefore, we can write:

\sum_{i=1}^{n} \sum_{j=1}^{m} P_{XY}(x_i, y_j) = 1



Lemma 2: The marginal PMF for X can be derived by summing the joint PMF over all values of Y:

P_X(x) = \sum_{y∈R_Y} P_{XY}(x, y)

Proof
To derive the marginal PMF P_X(x), we sum the joint PMF P_{XY}(x, y) over all possible values of Y:

P_X(x) = \sum_{y∈R_Y} P_{XY}(x, y)

This signifies that the probability of X being equal to x is the aggregation of all probabilities corresponding to each possible y paired with that x.
Mathematically, this is equivalent to the law of total probability:

P_X(x) = \sum_{y∈R_Y} P(X = x | Y = y) P_Y(y) = P(X = x)

The total probability law states that we can calculate the probability of an event by conditioning on a partition of the sample space.
The marginalization process ensures that P_X(x) captures all possible interactions with Y while isolating the effect of Y.
Thus, we express P_X(x) as:

P_X(x) = \sum_{y∈R_Y} P_{XY}(x, y)

Therefore, Lemma 2 is validated, demonstrating how to derive marginal distributions from joint distributions.

Application of Lemmas in Theorems

Theorem: Marginal PMF for Y

Statement: The marginal PMF for Y can also be derived using a similar approach:

P_Y(y) = \sum_{x∈R_X} P_{XY}(x, y)

Proof
Using the same reasoning as in Lemma 2, we can derive the marginal PMF for Y by summing the joint PMF with respect to X:

P_Y(y) = \sum_{x∈R_X} P_{XY}(x, y)

This reflects the total probability of observing Y taking the value y, accounting for all possible values of X.
We can express it as:

P_Y(y) = \sum_{x∈R_X} P(Y = y | X = x) P_X(x) = P(Y = y)

This application of the law of total probability allows us to conclude:

P_Y(y) = \sum_{x∈R_X} P_{XY}(x, y)

This relationship demonstrates the interdependence of joint and marginal distributions, reaffirming the validity of Lemmas 1 and 2.

Theorem: Marginal PMF for X


Statement: The marginal PMF for X can also be derived as follows:
$$P_X(x) = \sum_{y \in R_Y} P_{XY}(x, y)$$

Proof
By applying Lemma 2, we can assert:
$$P_X(x) = \sum_{y \in R_Y} P_{XY}(x, y)$$
This is equivalent to the aggregation of probabilities of X conditioned on all possible outcomes of Y.
Thus, it can also be expressed as:
$$P_X(x) = \sum_{y \in R_Y} P(X = x \mid Y = y)\, P_Y(y)$$
This shows how marginal PMFs capture the essential probabilities of one variable, eliminating the influence of the other. The equation confirms that P_X(x) is derived from summing the contributions from Y:
$$P_X(x) = \sum_{y \in R_Y} P_{XY}(x, y)$$
Thus, the relationship highlights the importance of joint distributions in understanding marginal behaviors.


Marginal Probability from Joint Probability
From the joint PMF, we can derive the marginal probability mass function for
each variable.

• Marginal PMF of X:
$$P_X(x) = \sum_{y \in R_Y} P_{XY}(x, y)$$

This gives the probability that X = x, regardless of the value of Y .


• Marginal PMF of Y :
$$P_Y(y) = \sum_{x \in R_X} P_{XY}(x, y)$$

This gives the probability that Y = y, regardless of the value of X.

Example
1. Given the joint PMF of X and Y below, find:

   X\Y     0      1      2
    0     1/6    1/4    1/8
    1     1/8    1/6    1/6

(a) Find the marginal PMFs of X and Y .

Solution
The marginal PMF of X. For X = 0:
$$P_X(0) = \sum_{y} P_{XY}(0, y) = \frac{1}{6} + \frac{1}{4} + \frac{1}{8} = \frac{13}{24} \approx 0.5417$$
For X = 1:
$$P_X(1) = \sum_{y} P_{XY}(1, y) = \frac{1}{8} + \frac{1}{6} + \frac{1}{6} = \frac{11}{24} \approx 0.4583$$
(As a check, 13/24 + 11/24 = 1.)

The marginal PMF of Y. For Y = 0:
$$P_Y(0) = P_{XY}(0, 0) + P_{XY}(1, 0) = \frac{1}{6} + \frac{1}{8} = \frac{7}{24}$$
For Y = 1:
$$P_Y(1) = P_{XY}(0, 1) + P_{XY}(1, 1) = \frac{1}{4} + \frac{1}{6} = \frac{10}{24}$$
For Y = 2:
$$P_Y(2) = P_{XY}(0, 2) + P_{XY}(1, 2) = \frac{1}{8} + \frac{1}{6} = \frac{7}{24}$$
(b) Find the conditional probability P (Y = 1 | X = 0)

$$P(Y = 1 \mid X = 0) = \frac{P(X = 0, Y = 1)}{P_X(0)} = \frac{1/4}{13/24} = \frac{6}{13} \approx 0.4615$$
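This bookkeeping is easy to automate. The following is a minimal Python sketch (standard library only; the dictionary `joint` simply encodes the table in this example) that recomputes the marginal PMFs and the conditional probability:

```python
from fractions import Fraction as F

# Joint PMF from the example above; keys are (x, y) pairs.
joint = {
    (0, 0): F(1, 6), (0, 1): F(1, 4), (0, 2): F(1, 8),
    (1, 0): F(1, 8), (1, 1): F(1, 6), (1, 2): F(1, 6),
}

# Marginal PMFs: sum the joint PMF over the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, F(0)) + p
    p_y[y] = p_y.get(y, F(0)) + p

print(p_x)                     # X-marginal: 13/24 and 11/24
print(p_y)                     # Y-marginal: 7/24, 5/12 (= 10/24), 7/24
print(joint[(0, 1)] / p_x[0])  # P(Y = 1 | X = 0) = 6/13
```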

Conditional Probability from Joint PMF


The conditional probability of Y = y given that X = x can be computed from
the joint PMF:
$$P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P_X(x)}$$
This gives the probability that Y = y, provided that X = x. Similarly, the
conditional probability P (X = x | Y = y) can be computed using the same
principles.

Examples on Conditional Probability from Joint PMF


1. In a machine learning model, two discrete random variables are observed
where X represents the user age group in years, and Y represents the time
spent on the website in hours. The joint PMF of the user age group X
and the time spent on the website Y is given by the following table:

X\Y 10 20 30
2 (Age Group: 20-29) 0.2 0.1 0.05
3 (Age Group: 30-39) 0.1 0.15 0.1

(a) Find the marginal PMF of X.

Solution
The marginal PMF of X is found by summing the joint PMF over
all possible values of Y for each X.

For X = 2 (Age Group: 20-29):
$$P_X(2) = \sum_{y \in \{10, 20, 30\}} P_{XY}(2, y) = 0.2 + 0.1 + 0.05 = 0.35$$
For X = 3 (Age Group: 30-39):
$$P_X(3) = \sum_{y \in \{10, 20, 30\}} P_{XY}(3, y) = 0.1 + 0.15 + 0.1 = 0.35$$


Thus, the marginal PMF of X is:
$$P_X(x) = \begin{cases} 0.35, & \text{if } x = 2 \\ 0.35, & \text{if } x = 3 \end{cases}$$

(b) Find the marginal PMF of Y

Solution

To find the marginal PMF of Y from the joint PMF, we sum the
joint probabilities over all possible values of X for each value of Y .
This shows the probabilities associated with each value of Y
For Y = 10:
$$P_Y(10) = \sum_{x \in \{2, 3\}} P_{XY}(x, 10) = P_{XY}(2, 10) + P_{XY}(3, 10) = 0.2 + 0.1 = 0.3$$
For Y = 20:
$$P_Y(20) = \sum_{x \in \{2, 3\}} P_{XY}(x, 20) = P_{XY}(2, 20) + P_{XY}(3, 20) = 0.1 + 0.15 = 0.25$$
For Y = 30:
$$P_Y(30) = \sum_{x \in \{2, 3\}} P_{XY}(x, 30) = P_{XY}(2, 30) + P_{XY}(3, 30) = 0.05 + 0.1 = 0.15$$

Thus, the marginal PMF of Y is:



$$P_Y(y) = \begin{cases} 0.3, & \text{if } y = 10 \\ 0.25, & \text{if } y = 20 \\ 0.15, & \text{if } y = 30 \end{cases}$$

(c) Find the conditional probability P (Y = 20 | X = 3).


The conditional probability is given by:

$$P(Y = 20 \mid X = 3) = \frac{P(X = 3, Y = 20)}{P_X(3)}$$
From the joint PMF table, we know:
P (X = 3, Y = 20) = 0.15
and from the marginal PMF of X:
PX (3) = 0.35

Thus, the conditional probability is:


$$P(Y = 20 \mid X = 3) = \frac{0.15}{0.35} \approx 0.4286$$


2. Let X represent the number of defects detected in a code module and Y
the number of test cases executed. The joint PMF is given as:

X\Y 10 20 30
0 0.1 0.2 0.15
1 0.05 0.25 0.25

Find the marginal PMF of X.

Solution
To find the marginal PMF of X, sum the joint probabilities over all values of Y for each value of X (i.e., along each row of the table):
For X = 0

PX (0) = 0.1 + 0.2 + 0.15 = 0.45

For X = 1

PX (1) = 0.05 + 0.25 + 0.25 = 0.55
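The same row-wise bookkeeping, together with the conditioning step, can be done in a few lines of Python; the sketch below (plain Python, values taken from the defects table above) also shows that a conditional PMF is just a renormalised row of the joint table:

```python
# Joint PMF of (defects X, test cases Y) from the table above.
joint = {
    (0, 10): 0.10, (0, 20): 0.20, (0, 30): 0.15,
    (1, 10): 0.05, (1, 20): 0.25, (1, 30): 0.25,
}

# Marginal PMF of X: sum each row over y.
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
print(p_x)  # roughly {0: 0.45, 1: 0.55}

# Conditional PMF of Y given X = 1: divide the X = 1 row by P(X = 1).
p_y_given_1 = {y: joint[(1, y)] / p_x[1] for y in (10, 20, 30)}
print(p_y_given_1)  # roughly {10: 0.0909, 20: 0.4545, 30: 0.4545}
```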

6 Marginal and Conditional Probabilities of Continuous Distributions
Bivariate/Two-Dimensional Continuous Random Variables
If X and Y are continuous random variables defined on the sample space Ω of a
random experiment, then the pair (X, Y ) is called a bivariate continuous random
variable. This means that (X, Y ) assigns a point in the xy-plane corresponding
to each outcome in the sample space Ω.

Joint and Marginal Distribution and Density Functions


Two-Dimensional Continuous Distribution Function
The distribution function of a two-dimensional continuous random variable
(X, Y ) is a real-valued function and is defined as:

F (x, y) = P (X ≤ x, Y ≤ y) for all real x and y.

Remark: F (x, y) can also be written as FX,Y (x, y).

Joint Probability Density Function


Let (X, Y ) be a continuous random variable assuming all values in some region
R of the xy-plane. Then, a function f (x, y) such that:
$$F_{X,Y}(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(x', y')\, dy'\, dx'$$


is defined to be a joint probability density function.
As in the one-dimensional case, a joint probability density function has the
following properties:
• f (x, y) ≥ 0
• $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, dy\, dx = 1$

Examples
1. Given the joint PDF fX,Y (x, y) = 6xy for 0 < x < 1 and 0 < y < 1, how
do we find the value of the joint distribution function FX,Y (0.5, 0.5)?

Solution
To compute FX,Y (0.5, 0.5), we integrate the joint PDF:
$$F_{X,Y}(0.5, 0.5) = \int_0^{0.5}\int_0^{0.5} 6xy\, dx\, dy$$
First, we integrate with respect to x:
$$\int_0^{0.5} 6xy\, dx = 6y\left[\frac{x^2}{2}\right]_0^{0.5} = 6y \times \frac{0.25}{2} = 0.75y$$
Now, we integrate with respect to y:
$$\int_0^{0.5} 0.75y\, dy = 0.75\left[\frac{y^2}{2}\right]_0^{0.5} = 0.75 \times \frac{0.25}{2} = 0.09375$$
Thus, F_{X,Y}(0.5, 0.5) = 0.09375. (A numerical check of this and the next example is sketched after Example 2.)
2. Suppose the response times X and Y of two servers are modeled with the
joint PDF fX,Y (x, y) = 8xy for 0 < x < 1 and 0 < y < 1. How do we
calculate the probability that both servers respond within 0.3 seconds?

Solution
We need to find:
$$P(0 < X < 0.3,\ 0 < Y < 0.3) = \int_0^{0.3}\int_0^{0.3} 8xy\, dx\, dy$$
First, we integrate with respect to x:
$$\int_0^{0.3} 8xy\, dx = 8y\left[\frac{x^2}{2}\right]_0^{0.3} = 8y \times \frac{0.09}{2} = 0.36y$$
Next, we integrate with respect to y:
$$\int_0^{0.3} 0.36y\, dy = 0.36\left[\frac{y^2}{2}\right]_0^{0.3} = 0.36 \times \frac{0.09}{2} = 0.0162$$
Thus, the probability that both servers respond within 0.3 seconds is
0.0162.
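Double integrals like the two above are easy to check numerically. The sketch below is one way to do it with scipy.integrate.dblquad (assuming SciPy is available; note that dblquad expects the integrand as f(y, x)):

```python
from scipy import integrate

# Joint PDFs exactly as stated in Examples 1 and 2 (integrand signature is f(y, x)).
f1 = lambda y, x: 6 * x * y
f2 = lambda y, x: 8 * x * y

val1, _ = integrate.dblquad(f1, 0, 0.5, lambda x: 0, lambda x: 0.5)
val2, _ = integrate.dblquad(f2, 0, 0.3, lambda x: 0, lambda x: 0.3)

print(val1)  # 0.09375  -> F_{X,Y}(0.5, 0.5) in Example 1
print(val2)  # 0.0162   -> P(X < 0.3, Y < 0.3) in Example 2
```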



Marginal Continuous Distribution Function
Let (X, Y ) be a two-dimensional continuous random variable having f (x, y) as
its joint probability density function. Now, the marginal distribution function
of the continuous random variable X is defined as:
$$F_X(x) = P(X \le x) = F_{X,Y}(x, \infty) = \int_{-\infty}^{x}\int_{-\infty}^{\infty} f(u, y)\, dy\, du$$

The marginal PDF of X is obtained by integrating the joint PDF over all values
of Y .
The marginal distribution function of the continuous random variable Y is de-
fined as:
$$F_Y(y) = P(Y \le y) = F_{X,Y}(\infty, y) = \int_{-\infty}^{y}\int_{-\infty}^{\infty} f(x, v)\, dx\, dv.$$

The marginal PDF of Y is obtained by integrating the joint PDF over all values
of X.

Marginal Probability Density Function


Let (X, Y) be a two-dimensional continuous random variable having F(x, y) and f(x, y) as its distribution function and joint probability density function, respectively, and let F_X(x) and F_Y(y) be the marginal distribution functions of X and Y. Then the marginal probability density function of X is given by
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy,$$
or it may also be obtained as
$$f_X(x) = \frac{d}{dx} F_X(x).$$
The marginal probability density function of Y is given by
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx,$$
or
$$f_Y(y) = \frac{d}{dy} F_Y(y).$$

Examples
1. Given the joint PDF fX,Y (x, y) = 6xy for 0 < x < 1 and 0 < y < 1, how
can we determine the marginal PDF fX (x)?



Solution
We integrate the joint PDF over all values of y:
$$f_X(x) = \int_0^1 6xy\, dy = 6x\int_0^1 y\, dy = 6x\left[\frac{y^2}{2}\right]_0^1 = 6x \times \frac{1}{2} = 3x$$

Thus, the marginal PDF of X is fX (x) = 3x for 0 < x < 1.


2. With the joint PDF fX,Y (x, y) = 6xy, how do we calculate the marginal
probability that X is less than 0.4?

Solution
We find P (X < 0.4) by integrating the marginal PDF:
$$P(X < 0.4) = \int_0^{0.4} 3x\, dx = 3\left[\frac{x^2}{2}\right]_0^{0.4} = 3 \times \frac{0.16}{2} = 0.24$$

Thus, the probability that X is less than 0.4 is 0.24.
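The marginal-PDF calculations in these two examples can also be reproduced symbolically; a minimal sketch using sympy (assuming it is installed) follows:

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f_xy = 6 * x * y                                  # joint PDF from the examples above

f_x = sp.integrate(f_xy, (y, 0, 1))               # marginal PDF of X
print(f_x)                                        # 3*x

p = sp.integrate(f_x, (x, 0, sp.Rational(2, 5)))  # P(X < 0.4)
print(p)                                          # 6/25, i.e. 0.24
```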

Conditional Probability Density Function


Let (X, Y ) be a two-dimensional continuous random variable having the joint
probability density function f (x, y). The conditional probability density func-
tion of Y given X = x is defined as:

$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$$
where f_X(x) > 0 is the marginal density of X.
Similarly, the conditional probability density function of X given Y = y is defined as:
$$f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)}$$
where f_Y(y) > 0 is the marginal density of Y.

Conditional Continuous Distribution Function


For a two-dimensional continuous random variable (X, Y ), the conditional dis-
tribution function of Y given X = x is defined as:
$$F_{Y|X}(y \mid x) = P(Y \le y \mid X = x) = \int_{-\infty}^{y} f_{Y|X}(y' \mid x)\, dy'.$$
Similarly, the conditional distribution function of X given Y = y is defined as:
$$F_{X|Y}(x \mid y) = P(X \le x \mid Y = y) = \int_{-\infty}^{x} f_{X|Y}(x' \mid y)\, dx'.$$
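As an illustration of these definitions, the sketch below (sympy, using the joint PDF f(x, y) = 6xy from the earlier examples purely as a worked case) builds the conditional PDF of Y given X = x and then integrates it to obtain the conditional distribution function:

```python
import sympy as sp

x, y, t = sp.symbols('x y t', positive=True)
f_xy = 6 * x * y                          # joint PDF from the earlier examples

f_x = sp.integrate(f_xy, (y, 0, 1))       # marginal of X: 3*x
f_y_given_x = sp.simplify(f_xy / f_x)     # conditional PDF: 2*y (does not depend on x)

# Conditional distribution function F_{Y|X}(y | x): integrate the conditional PDF up to y.
F_y_given_x = sp.integrate(f_y_given_x.subs(y, t), (t, 0, y))
print(f_y_given_x, F_y_given_x)           # 2*y   y**2
```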



Joint PDF of Continuous Random Variables
For two continuous random variables X and Y, the joint probability density function (PDF) describes the likelihood that X takes a value in a small interval around x and Y takes a value in a small interval around y.
The joint PDF, denoted by fX,Y (x, y), is such that the probability that X
is between a and b, and Y is between c and d is given by:
$$P(a \le X \le b,\ c \le Y \le d) = \int_c^d\int_a^b f_{X,Y}(x, y)\, dx\, dy$$
The joint PDF must satisfy the following conditions:
1. f_{X,Y}(x, y) ≥ 0 for all x and y.
2. The total probability must equal 1:
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = 1$$

Independence of Continuous Random Variables


Two continuous random variables X and Y are independent if the joint PDF
can be factored as the product of their marginal PDFs:

fX,Y (x, y) = fX (x) · fY (y) for all x and y

Examples on Independence of Continuous Random Variables
1. Consider a system where the response times (in seconds) of two servers,
X and Y , are jointly distributed with the PDF:
$$f_{X,Y}(x, y) = \begin{cases} 6xy & \text{if } 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}$$
Find the marginal PDF of X, the response time of Server A.

Solution
To find the marginal PDF of X, integrate the joint PDF over all values of
y:
$$f_X(x) = \int_0^1 6xy\, dy = 6x\left[\frac{y^2}{2}\right]_0^1 = 3x, \quad 0 < x < 1$$

2. In a data processing pipeline, the time taken for processing X (in seconds)
and the number of errors Y are jointly distributed according to the PDF:

$$f_{X,Y}(x, y) = 2e^{-(2x + y)}, \quad x > 0,\ y > 0$$

Find the marginal PDF of the processing time X.

Mary Kamina © 2024


Solution
The marginal PDF of X is found by integrating the joint PDF over all
values of y:
$$f_X(x) = \int_0^{\infty} 2e^{-(2x + y)}\, dy = 2e^{-2x}\int_0^{\infty} e^{-y}\, dy = 2e^{-2x}\left[-e^{-y}\right]_0^{\infty} = 2e^{-2x}, \quad x > 0$$

3. Two processors in a distributed computing system are allocated tasks. Let


X be the time (in seconds) it takes for Processor A to finish its tasks, and
Y be the time for Processor B. The joint PDF is given by:

fX,Y (x, y) = e−(x+y) , x > 0, y > 0

Check if the task completion times X and Y are independent.

Solution
First, find the marginal PDFs:
$$f_X(x) = \int_0^{\infty} e^{-(x+y)}\, dy = e^{-x}, \qquad f_Y(y) = \int_0^{\infty} e^{-(x+y)}\, dx = e^{-y}$$
Since f_{X,Y}(x, y) = f_X(x) · f_Y(y), the variables X and Y are independent. (A symbolic version of this check is sketched after Example 4.)
4. In a software testing scenario, let X represent the number of bugs found
in module A, and Y represent the number of bugs found in module B.
Suppose the joint PDF of X and Y is:
$$f_{X,Y}(x, y) = \begin{cases} 10xy & \text{if } 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}$$

What is the probability that both modules have bug counts less than 0.5?

Solution
We need to compute:
$$P(X < 0.5, Y < 0.5) = \int_0^{0.5}\int_0^{0.5} 10xy\, dx\, dy$$
First, integrate with respect to x:
$$\int_0^{0.5} 10xy\, dx = 10y\left[\frac{x^2}{2}\right]_0^{0.5} = 10y \times 0.125 = 1.25y$$
Now, integrate with respect to y:
$$\int_0^{0.5} 1.25y\, dy = 1.25\left[\frac{y^2}{2}\right]_0^{0.5} = 1.25 \times 0.125 = 0.15625$$
Thus, the probability that both modules have bug counts less than 0.5 is
0.15625.
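The independence check in Example 3 above can also be verified symbolically; here is a minimal sympy sketch (assuming sympy is installed) that recovers the marginals and confirms that the joint PDF factors into their product:

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f_xy = sp.exp(-(x + y))                    # joint PDF from Example 3

f_x = sp.integrate(f_xy, (y, 0, sp.oo))    # exp(-x)
f_y = sp.integrate(f_xy, (x, 0, sp.oo))    # exp(-y)

# Independence holds iff the joint PDF equals the product of the marginals.
print(sp.simplify(f_xy - f_x * f_y) == 0)  # True
```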

More Examples
1. Consider two continuous random variables, X and Y, which have a joint PDF defined as f_{X,Y}(x, y) = 4xy for values 0 < x < 1 and 0 < y < 1. How can we verify that this joint PDF is valid by ensuring that the total probability integrates to 1 over the specified range?

Solution
To confirm that this is a valid joint PDF, we need to compute the total probability:
$$\int_0^1\int_0^1 4xy\, dx\, dy = 4\int_0^1\int_0^1 xy\, dx\, dy$$
First, we perform the integration with respect to x:
$$\int_0^1 xy\, dx = y\left[\frac{x^2}{2}\right]_0^1 = \frac{y}{2}$$
Next, we integrate with respect to y:
$$\int_0^1 \frac{y}{2}\, dy = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$
Thus, the total integral is:
$$4 \times \frac{1}{4} = 1$$
Since the function is also non-negative on its range, this confirms that f_{X,Y}(x, y) is indeed a valid joint PDF.
2. Given the joint PDF fX,Y (x, y) = 6xy for 0 < x < 1 and 0 < y < 1, how
can we calculate the probability that both random variables X and Y fall
within the range of 0.2 to 0.5?

Solution
We need to find:
$$P(0.2 < X < 0.5,\ 0.2 < Y < 0.5) = \int_{0.2}^{0.5}\int_{0.2}^{0.5} 6xy\, dx\, dy$$
First, we integrate with respect to x:
$$\int_{0.2}^{0.5} 6xy\, dx = 6y\left[\frac{x^2}{2}\right]_{0.2}^{0.5} = 6y\left(\frac{0.25}{2} - \frac{0.04}{2}\right) = 6y \times 0.105 = 0.63y$$
Next, we integrate with respect to y:
$$\int_{0.2}^{0.5} 0.63y\, dy = 0.63\left[\frac{y^2}{2}\right]_{0.2}^{0.5} = 0.63 \times \frac{0.25 - 0.04}{2} = 0.63 \times 0.105 = 0.06615$$
Thus, the probability that both X and Y are in the specified range is
0.06615.
3. Given the joint PDF fX,Y (x, y) = 6xy for 0 < x < 1 and 0 < y < 1, how
do we calculate the conditional PDF fX|Y (x|0.5)?

Solution
First, find fY (0.5):
$$f_Y(y) = \int_0^1 f_{X,Y}(x, y)\, dx = \int_0^1 6xy\, dx = 3y \quad\Rightarrow\quad f_Y(0.5) = 3 \times 0.5 = 1.5$$
Now, calculate f_{X|Y}(x | 0.5):
$$f_{X|Y}(x \mid 0.5) = \frac{f_{X,Y}(x, 0.5)}{f_Y(0.5)} = \frac{6x(0.5)}{1.5} = 2x \quad \text{for } 0 < x < 1$$

4. If the joint PDF for two software system response times X and Y is given
by fX,Y (x, y) = 8xy for 0 < x < 1 and 0 < y < 1, how can we calculate
the conditional PDF fX|Y (x|0.2)?

Solution
1. Find f_Y(0.2):
$$f_Y(y) = \int_0^1 8xy\, dx = 4y \quad\Rightarrow\quad f_Y(0.2) = 4 \times 0.2 = 0.8$$
2. Now compute f_{X|Y}(x | 0.2):
$$f_{X|Y}(x \mid 0.2) = \frac{f_{X,Y}(x, 0.2)}{f_Y(0.2)} = \frac{8x(0.2)}{0.8} = 2x \quad \text{for } 0 < x < 1$$

5. For the joint PDF fX,Y (x, y) = 10xy where 0 < x < 1 and 0 < y < 1, how
do we find P (X < 0.4|Y = 0.6)?

Solution
Calculate the conditional PDF f_{X|Y}(x | 0.6). First, find f_Y(0.6):
$$f_Y(y) = \int_0^1 10xy\, dx = 5y \quad\Rightarrow\quad f_Y(0.6) = 5 \times 0.6 = 3$$
Now find f_{X|Y}(x | 0.6):
$$f_{X|Y}(x \mid 0.6) = \frac{f_{X,Y}(x, 0.6)}{f_Y(0.6)} = \frac{10x(0.6)}{3} = \frac{6x}{3} = 2x \quad \text{for } 0 < x < 1$$
Compute P(X < 0.4 | Y = 0.6):
$$P(X < 0.4 \mid Y = 0.6) = \int_0^{0.4} 2x\, dx = \left[x^2\right]_0^{0.4} = 0.16$$
(A symbolic check of this calculation is sketched after these examples.)

6. If the joint PDF fX,Y (x, y) = 4xy for 0 < x < 1 and 0 < y < 1 is given,
how do we find fX|Y (x|0.8)?

Solution
First, find f_Y(0.8):
$$f_Y(y) = \int_0^1 4xy\, dx = 2y \quad\Rightarrow\quad f_Y(0.8) = 2 \times 0.8 = 1.6$$
Now calculate f_{X|Y}(x | 0.8):
$$f_{X|Y}(x \mid 0.8) = \frac{f_{X,Y}(x, 0.8)}{f_Y(0.8)} = \frac{4x(0.8)}{1.6} = 2x \quad \text{for } 0 < x < 1$$

7. In a data analysis project, the joint PDF of the heights X and weights Y
of individuals is given by fX,Y (x, y) = 12xy for 0 < x < 1 and 0 < y < 1.
How do we find the probability that both height and weight are less than
0.5?

Solution
We need to calculate:
$$P(X < 0.5, Y < 0.5) = \int_0^{0.5}\int_0^{0.5} 12xy\, dx\, dy$$
First, integrate with respect to x:
$$\int_0^{0.5} 12xy\, dx = 12y\left[\frac{x^2}{2}\right]_0^{0.5} = 12y \times \frac{0.25}{2} = 1.5y$$
Next, integrate with respect to y:
$$\int_0^{0.5} 1.5y\, dy = 1.5\left[\frac{y^2}{2}\right]_0^{0.5} = 1.5 \times \frac{0.25}{2} = 0.1875$$

Thus, the probability is 0.1875.


8. In a software application, the joint PDF of response time X and memory
usage Y is defined as fX,Y (x, y) = 16xy for 0 < x < 1 and 0 < y < 1.
How do we find the probability that the response time is less than 0.4 and
memory usage is less than 0.3?



Solution
Calculate:
$$P(X < 0.4, Y < 0.3) = \int_0^{0.3}\int_0^{0.4} 16xy\, dx\, dy$$
First, integrate with respect to x:
$$\int_0^{0.4} 16xy\, dx = 16y\left[\frac{x^2}{2}\right]_0^{0.4} = 16y \times \frac{0.16}{2} = 1.28y$$
Next, integrate with respect to y:
$$\int_0^{0.3} 1.28y\, dy = 1.28\left[\frac{y^2}{2}\right]_0^{0.3} = 1.28 \times \frac{0.09}{2} = 0.0576$$

Thus, the probability is 0.0576.


9. In a machine learning model, suppose the joint PDF for features X and
Y is given by fX,Y (x, y) = 20xy for 0 < x < 1 and 0 < y < 1. How do we
determine the probability that both features are below 0.2?

Solution
We calculate:
$$P(X < 0.2, Y < 0.2) = \int_0^{0.2}\int_0^{0.2} 20xy\, dx\, dy$$
Integrate with respect to x:
$$\int_0^{0.2} 20xy\, dx = 20y\left[\frac{x^2}{2}\right]_0^{0.2} = 20y \times \frac{0.04}{2} = 0.4y$$
Next, integrate with respect to y:
$$\int_0^{0.2} 0.4y\, dy = 0.4\left[\frac{y^2}{2}\right]_0^{0.2} = 0.4 \times \frac{0.04}{2} = 0.008$$

Thus, the probability is 0.008.


10. In an experimental study, the joint PDF of two variables X (tempera-
ture) and Y (pressure) is defined as fX,Y (x, y) = 14xy for 0 < x < 1 and
0 < y < 1. How do we compute the probability that temperature is less
than 0.6 and pressure is less than 0.4?

Solution
We find:
$$P(X < 0.6, Y < 0.4) = \int_0^{0.4}\int_0^{0.6} 14xy\, dx\, dy$$
First, integrate with respect to x:
$$\int_0^{0.6} 14xy\, dx = 14y\left[\frac{x^2}{2}\right]_0^{0.6} = 14y \times \frac{0.36}{2} = 2.52y$$
Next, integrate with respect to y:
$$\int_0^{0.4} 2.52y\, dy = 2.52\left[\frac{y^2}{2}\right]_0^{0.4} = 2.52 \times \frac{0.16}{2} = 0.2016$$
Thus, the probability is 0.2016.


11. In a software performance test, the joint PDF of CPU usage X and mem-
ory usage Y is given as fX,Y (x, y) = 18xy for 0 < x < 1 and 0 < y < 1.
How do we find the probability that CPU usage is less than 0.5 and mem-
ory usage is less than 0.7?

Solution
We need to compute:
$$P(X < 0.5, Y < 0.7) = \int_0^{0.7}\int_0^{0.5} 18xy\, dx\, dy$$
First, integrate with respect to x:
$$\int_0^{0.5} 18xy\, dx = 18y\left[\frac{x^2}{2}\right]_0^{0.5} = 18y \times \frac{0.25}{2} = 2.25y$$
Next, integrate with respect to y:
$$\int_0^{0.7} 2.25y\, dy = 2.25\left[\frac{y^2}{2}\right]_0^{0.7} = 2.25 \times \frac{0.49}{2} = 0.55125$$

Thus, the probability is 0.55125.
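Conditional-PDF calculations such as Example 5 above (the joint PDF 10xy conditioned on Y = 0.6) can be confirmed symbolically; a minimal sympy sketch:

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f_xy = 10 * x * y                                         # joint PDF from Example 5 above

f_y = sp.integrate(f_xy, (x, 0, 1))                       # marginal of Y: 5*y
f_x_given_y = (f_xy / f_y).subs(y, sp.Rational(3, 5))     # conditional PDF at y = 0.6
print(sp.simplify(f_x_given_y))                           # 2*x

p = sp.integrate(f_x_given_y, (x, 0, sp.Rational(2, 5)))  # P(X < 0.4 | Y = 0.6)
print(p)                                                  # 4/25, i.e. 0.16
```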

Exercise
1. Let X and Y be two random variables. Then for
$$f_{X,Y}(x, y) = \begin{cases} kxy & \text{for } 0 < x < 4 \text{ and } 1 < y < 5 \\ 0 & \text{otherwise} \end{cases}$$
to be a joint density function, what must be the value of k? As f_{X,Y}(x, y) is the joint probability density function,
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy\, dx = 1$$
$$\int_0^4\int_1^5 kxy\, dy\, dx = 1 \quad\Longleftrightarrow\quad k\left(\int_1^5 y\, dy\right)\left(\int_0^4 x\, dx\right) = 1$$
$$\int_1^5 y\, dy = \frac{5^2}{2} - \frac{1^2}{2} = \frac{25 - 1}{2} = 12, \qquad \int_0^4 x\, dx = \frac{4^2}{2} = 8$$
So,
$$k \times 12 \times 8 = 1 \quad\Rightarrow\quad 96k = 1 \quad\Rightarrow\quad k = \frac{1}{96}.$$
(This value of k, and the probabilities in Exercise 3 below, are verified symbolically in the sketch at the end of this exercise set.)
2. If the joint p.d.f. of a two-dimensional random variable (X, Y ) is given by
$$f_{X,Y}(x, y) = \begin{cases} 2 & \text{for } 0 < x < 1 \text{ and } 0 < y < x \\ 0 & \text{otherwise} \end{cases}$$
Then,
(a) Find the marginal density functions of X and Y .

Solution
Marginal density function of Y is given by
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx = \int_y^1 2\, dx$$
(Since x appears in both of the given ranges, 0 < x < 1 and 0 < y < x, we combine them into 0 < y < x < 1, so for a fixed y the variable x runs from y to 1.)
$$= 2\left[x\right]_y^1 = 2 - 2y = 2(1 - y), \quad 0 < y < 1.$$
Marginal density function of X is given by
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy = \int_0^x 2\, dy \quad [0 < y < x < 1]$$
$$= 2\left[y\right]_0^x = 2x, \quad 0 < x < 1.$$


(b) Find the conditional density functions.

Solution
Conditional density function of Y given X (0 < X < 1) is

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{2}{2x} = \frac{1}{x} \quad \text{for } 0 < y < x.$$
Conditional density function of X given Y (0 < Y < 1) is
$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{2}{2(1 - y)} = \frac{1}{1 - y} \quad \text{for } y < x < 1.$$

(c) Check for independence of X and Y .

Solution
f_{X,Y}(x, y) = 2, whereas
$$f_X(x)\, f_Y(y) = (2x)\bigl(2(1 - y)\bigr) = 4x(1 - y) \ne f_{X,Y}(x, y).$$
⇒ X and Y are not independent.


3. Let (X, Y) be a two-dimensional random variable having joint density function
$$f_{X,Y}(x, y) = \begin{cases} \frac{1}{8}(6 - x - y) & \text{for } 0 < x < 2,\ 2 < y < 4 \\ 0 & \text{elsewhere} \end{cases}$$

Find
(a) P (X < 1, Y < 3)
Solution
$$P(X < 1, Y < 3) = \int_{-\infty}^{1}\int_{-\infty}^{3} f_{X,Y}(x, y)\, dy\, dx = \int_0^1\int_2^3 \frac{1}{8}(6 - x - y)\, dy\, dx$$
$$= \int_0^1 \frac{1}{8}\left[6y - xy - \frac{y^2}{2}\right]_2^3 dx = \int_0^1 \frac{\left(18 - 3x - \frac{9}{2}\right) - \left(12 - 2x - 2\right)}{8}\, dx$$
$$= \int_0^1 \frac{\left(18 - 3x - \frac{9}{2}\right) - (10 - 2x)}{8}\, dx = \int_0^1 \frac{\frac{7}{2} - x}{8}\, dx$$
$$= \frac{1}{8}\left[\frac{7x}{2} - \frac{x^2}{2}\right]_0^1 = \frac{3}{8} = 0.375$$

(b) P (X < 1|Y < 3).

Solution
$$P(X < 1 \mid Y < 3) = \frac{P(X < 1, Y < 3)}{P(Y < 3)}$$
where
$$P(Y < 3) = \int_0^2\int_2^3 \frac{1}{8}(6 - x - y)\, dy\, dx = \int_0^2 \frac{\frac{7}{2} - x}{8}\, dx = \frac{1}{8}\left[\frac{7x}{2} - \frac{x^2}{2}\right]_0^2 = \frac{7 - 2}{8} = \frac{5}{8}$$
Therefore,
$$P(X < 1 \mid Y < 3) = \frac{3/8}{5/8} = \frac{3}{5} = 0.6$$
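The constant k from Exercise 1 and the probabilities from Exercise 3 can be verified symbolically; here is a minimal sympy sketch (assuming sympy is installed):

```python
import sympy as sp

x, y, k = sp.symbols('x y k', positive=True)

# Exercise 1: choose k so that k*x*y integrates to 1 over 0 < x < 4, 1 < y < 5.
total = sp.integrate(k * x * y, (x, 0, 4), (y, 1, 5))
print(sp.solve(sp.Eq(total, 1), k))              # [1/96]

# Exercise 3: f(x, y) = (6 - x - y)/8 on 0 < x < 2, 2 < y < 4.
f = (6 - x - y) / 8
p_joint = sp.integrate(f, (y, 2, 3), (x, 0, 1))  # P(X < 1, Y < 3) = 3/8
p_y = sp.integrate(f, (y, 2, 3), (x, 0, 2))      # P(Y < 3) = 5/8
print(p_joint, p_y, p_joint / p_y)               # 3/8 5/8 3/5
```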

7 Covariance and Correlation


For this topic, refer to the link. When you see solution, click on it to display the workings. https://www.probabilitycourse.com/chapter5/5_3_1_covariance_correlation.php


8 Conditional Expectation and Variance.
For this topic, refer to the link. When you see solution, click on it to display the workings. https://www.probabilitycourse.com/chapter5/5_1_5_conditional_expectation.php

9 Moments and moment generating functions.


For this topic, refer to the links below.
Click here for moments: https://www.statlect.com/fundamentals-of-probability/moments
Click here for moment generating functions: https://www.statlect.com/fundamentals-of-probability/moment-generating-function
Click here for characteristic functions: https://www.statlect.com/fundamentals-of-probability/characteristic-function

10 Probability Distributions and their linkages


At this point we have learnt about univariate discrete distributions, univariate continuous distributions, bivariate discrete distributions and bivariate continuous distributions. These probability distributions are connected/linked to other probability distributions under certain conditions; see https://www.statlect.com/probability-distributions/relationships-among-probability-distributions
Once you have read the above, find more worked examples below.

1. A subscription service has 100 customers, and each has a 10% chance of
leaving. What is the probability that exactly 5 will leave in a given month?

Solution
Using the Binomial PMF with n = 100, p = 0.1, and k = 5:
$$P(X = 5) = \binom{100}{5}(0.1)^5(0.9)^{95} \approx 0.0339$$
(The worked answers in this section are re-checked numerically in the sketch at the end of the section.)

2. A web server receives an average of 3 requests per second. What is the


probability of receiving exactly 5 requests in a second?

Solution
Using the Poisson PMF with λ = 3 and k = 5:

$$P(X = 5) = \frac{3^5 e^{-3}}{5!} \approx 0.1008$$


Figure 2: Binomial Distribution PMF for n = 100, p = 0.1 (probability against number of customers leaving, k).

Figure 3: Poisson Distribution PMF for λ = 3 (probability against number of requests, k).

3. A data scientist is analyzing customer churn in a subscription service. The


service has 100 customers, and each customer has a 10% chance of leaving
the service in any given month. What is the probability that exactly 5



customers will leave in a given month?

Solution
This is a classic example of a Binomial distribution where n = 100, p = 0.1,
and we are interested in finding the probability of k = 5 customers leaving.
Using the Binomial PMF:
$$P(X = 5) = \binom{100}{5}(0.1)^5(0.9)^{95}$$
Calculating the binomial coefficient:
$$\binom{100}{5} = \frac{100!}{5!\,(100 - 5)!} = 75287520$$
Now, substituting the values:
$$P(X = 5) = 75287520 \times (0.1)^5 \times (0.9)^{95} \approx 0.0339$$
Thus, the probability that exactly 5 customers will leave in the given month is approximately 0.0339.

Figure 4: Binomial Distribution for n = 100, p = 0.1.


4. A web server experiences an average of 3 requests per second. What is
the probability that the server will receive exactly 5 requests in a second?

Solution
Here, λ = 3 and we are interested in finding P (X = 5).
Using the Poisson PMF:

$$P(X = 5) = \frac{3^5 e^{-3}}{5!}$$
Calculating the factorial:
$$5! = 120$$
Substituting the values:
$$P(X = 5) = \frac{243\, e^{-3}}{120} \approx 0.1008$$

Thus, the probability that the server will receive exactly 5 requests in a
second is approximately 0.1008.

Figure 5: Poisson Distribution for λ = 3.

5. What is the probability that exactly 3 packets out of 5 transmitted are


successful if the probability of success in each transmission is 0.7?

Solution
We use the Binomial Probability Mass Function (PMF):



 
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$
Here:
$$n = 5, \quad k = 3, \quad p = 0.7$$
Substituting values:
$$P(X = 3) = \binom{5}{3}(0.7)^3(0.3)^2$$
Calculating:
$$P(X = 3) = \frac{5!}{3!\,(5 - 3)!}(0.7)^3(0.3)^2 = 10 \times 0.343 \times 0.09 \approx 0.3087$$

Thus, the probability of exactly 3 successful transmissions is 0.3087.

Figure 6: Binomial Distribution for n = 5, p = 0.7.

6. In a software testing scenario, defects occur with an average rate of 2


defects per hour. What is the probability that exactly 5 defects will be
found in an hour?

Solution
We use the Poisson PMF:



$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Here:
$$\lambda = 2, \quad k = 5$$
Substituting values:
$$P(X = 5) = \frac{2^5 e^{-2}}{5!} = \frac{32\, e^{-2}}{120} \approx 0.0361$$
Thus, the probability of finding exactly 5 defects in an hour is 0.0361.

Figure 7: Poisson Distribution for λ = 2.

7. Suppose a batch of 20 electronic components contains 5 defective ones. If


4 components are selected at random, what is the probability that exactly
2 of them are defective?

Solution
We use the Hypergeometric PMF:

$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
Here:
$$N = 20, \quad K = 5, \quad n = 4, \quad k = 2$$
Substituting values:
$$P(X = 2) = \frac{\binom{5}{2}\binom{15}{2}}{\binom{20}{4}} = \frac{10 \times 105}{4845} \approx 0.2167$$
Thus, the probability of selecting exactly 2 defective components is approximately 0.2167.

Figure 8: Hypergeometric Distribution for N = 20, K = 5, n = 4.

8. In a call center, the average time between customer calls is 10 minutes.


What is the probability that the next call will come within 5 minutes?

Solution
We use the Exponential CDF:
$$P(T \le t) = 1 - e^{-\lambda t}$$
Here:
$$\lambda = \frac{1}{10}, \quad t = 5$$
Substituting values:
$$P(T \le 5) = 1 - e^{-\frac{1}{10} \times 5} = 1 - e^{-0.5} \approx 1 - 0.6065 = 0.3935$$


Thus, the probability that the next call will come within 5 minutes is
0.3935.

Figure 9: Exponential Distribution for λ = 0.1 (probability density against time in minutes).

9. If the scores on a standardized test are normally distributed with a mean


of 100 and a standard deviation of 15, what is the probability that a ran-
domly selected score is between 85 and 115?

Solution
We use the Normal Distribution CDF:
 
$$P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right)$$
Where Z is the standard normal variable:
$$\mu = 100, \quad \sigma = 15, \quad a = 85, \quad b = 115$$
First, compute the Z-scores:
$$Z_1 = \frac{85 - 100}{15} = -1, \qquad Z_2 = \frac{115 - 100}{15} = 1$$
Using the standard normal table, we find:

P (−1 < Z < 1) ≈ 0.6826

Thus, the probability that the score is between 85 and 115 is 0.6826.



Figure 10: Normal Distribution with µ = 100, σ = 15.

10. In a machine learning model, the prior distribution of a parameter θ is


Beta-distributed with α = 2 and β = 3. What is the expected value of θ?

Solution
The expected value of a Beta distribution is:

$$E(\theta) = \frac{\alpha}{\alpha + \beta}$$
Substituting values:
$$E(\theta) = \frac{2}{2 + 3} = \frac{2}{5} = 0.4$$
Thus, the expected value of θ is 0.4.
11. The time until a server experiences two failures follows a Gamma distri-
bution with shape parameter α = 2 and rate parameter λ = 13 . What is
the probability that the server will experience the second failure within 6
hours?

Solution
We use the Gamma CDF for α = 2:

$$P(T \le t) = 1 - e^{-\lambda t}(1 + \lambda t)$$
Here:
$$\alpha = 2, \quad \lambda = \frac{1}{3}, \quad t = 6$$
Substituting values:
$$P(T \le 6) = 1 - e^{-\frac{1}{3} \times 6}\left(1 + \frac{1}{3} \times 6\right) = 1 - 3e^{-2} \approx 0.5940$$
Thus, the probability that the server will experience the second failure within 6 hours is approximately 0.5940.

Figure 11: Beta Distribution with α = 2, β = 3.

When we have some data and we want to categorize a random variable, we can use the chart (Figure 61.15: Distributional Choices) named in the link as a guide: https://tinyheero.github.io/2016/03/17/prob-distr.html
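The worked answers in this section can be reproduced in a few lines with scipy.stats (assuming SciPy is installed); this sketch simply re-evaluates each example:

```python
from scipy import stats

print(stats.binom.pmf(5, n=100, p=0.1))        # ~0.0339  (Binomial, Examples 1 and 3)
print(stats.poisson.pmf(5, mu=3))              # ~0.1008  (Poisson, Examples 2 and 4)
print(stats.binom.pmf(3, n=5, p=0.7))          # ~0.3087  (Binomial, Example 5)
print(stats.poisson.pmf(5, mu=2))              # ~0.0361  (Poisson, Example 6)
print(stats.hypergeom.pmf(2, M=20, n=5, N=4))  # ~0.2167  (Hypergeometric, Example 7)
print(stats.expon.cdf(5, scale=10))            # ~0.3935  (Exponential, Example 8)
print(stats.norm.cdf(115, 100, 15)
      - stats.norm.cdf(85, 100, 15))           # ~0.6827  (Normal, Example 9)
print(stats.beta.mean(a=2, b=3))               # 0.4      (Beta, Example 10)
print(stats.gamma.cdf(6, a=2, scale=3))        # ~0.5940  (Gamma, Example 11)
```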

11 Bivariate Distributions
For this topic, refer to the link. When you see solution, click on it to display the workings. https://www.probabilitycourse.com/chapter5/5_3_2_bivariate_normal_dist.php

Figure 12: Gamma Distribution with α = 2, λ = 1/3 (probability density against time in hours).

Sample Examination Paper 1


Total Marks: 70

Section A: Short Questions (1 Mark Each)


1. Define a discrete random variable.
2. What is a continuous random variable?
3. State one property of the Probability Mass Function (PMF).

4. What is the Law of Large Numbers (LLN)?


5. Define covariance.

Section B: Medium Questions (2 Marks Each)


1. Differentiate between joint probability and marginal probability.

2. Explain the significance of a joint probability mass function (PMF).


3. If the probability that it will rain tomorrow given that it is cloudy today
is 0.3, express this using conditional probability notation.
4. Give an example of a bivariate discrete probability distribution.

5. For a normal distribution with mean µ = 0 and variance σ 2 = 1, compute


the value of f (0).



Section C: Practical Applications (3 Marks Each)
1. In a software testing process, if Suite A fails with a probability of 0.3 and
Suite B with 0.4, what is the joint probability of both suites failing if they
are independent?

2. For a fair six-sided die, calculate the expected value of the outcome.
3. A data center records 2 failures per hour. What is the probability of
recording exactly 3 failures in an hour? Use the Poisson distribution
formula.

4. In a bivariate distribution, if the probability of purchasing an item and


returning it is 0.1 and the probability of returning the item is 0.2, find
the conditional probability that the item was purchased given that it was
returned.

Section D: Problem Solving (4 Marks Each)


1. A machine produces 10,000 widgets in a day. The probability that a widget
is defective is 0.01. What is the probability that exactly 2 widgets will be
defective on a given day? Use the binomial distribution.
2. For a normal distribution with µ = 50 and σ = 10, find the probability
that a randomly selected value will fall between 40 and 60.
3. In an IT network, the time taken to process a request is exponentially
distributed with a mean of 2 seconds. What is the probability that a
request will take less than 1 second to process?

Section E: Long Answer (20 Marks)


1. (a) Define and explain the joint probability distribution for bivariate
discrete random variables. (5 marks)
(b) Using an example from Data Science, explain how to compute the
marginal and conditional probabilities from a joint probability mass
function (PMF). (5 marks)
(c) Solve the following problem: A web server receives an average of 3
requests per second. What is the probability of receiving exactly
5 requests in a second? (Use the Poisson distribution formula) (5
marks)
(d) Explain the relationship between the Law of Large Numbers (LLN)
and the Central Limit Theorem (CLT), and how they apply to statis-
tics. (5 marks)



Sample Examination Paper 2
Total Marks: 70

Section A: Short Questions (1 Mark Each)


1. Define the Central Limit Theorem (CLT).
2. What is the expected value of a random variable?

3. State one property of the cumulative distribution function (CDF).


4. Define a bivariate random variable.
5. What is a probability mass function (PMF)?

Section B: Medium Questions (2 Marks Each)


1. Differentiate between discrete and continuous random variables.

2. Explain the concept of conditional probability with an example.


3. In a binomial distribution, what happens as the number of trials n in-
creases?
4. Write down the general form of the Poisson distribution formula.

5. What is a covariance matrix?

Section C: Practical Applications (3 Marks Each)


1. In a web server, the time between requests is exponentially distributed
with a mean of 3 seconds. What is the probability that the next request
will arrive in less than 2 seconds?

2. For a discrete random variable with probabilities P (X = 1) = 0.2, P (X =


2) = 0.5, P (X = 3) = 0.3, compute the expected value of X.
3. A server receives an average of 4 requests per second. What is the proba-
bility of receiving exactly 6 requests in a second? Use the Poisson distri-
bution formula.

4. In a bivariate probability distribution, the joint probability of two variables


X and Y is given. How do you compute the marginal distribution of X?

Section D: Problem Solving (4 Marks Each)


1. For a binomial distribution with n = 10 and p = 0.6, find the probability
of getting exactly 7 successes.



2. In a data science project, the number of bugs found in two different code
modules follows a bivariate distribution. How would you compute the
conditional probability of bugs in one module given the number of bugs
in the other?
3. The time taken to process a task in a system is normally distributed with
a mean of 15 seconds and a standard deviation of 5 seconds. What is the
probability that a task will take between 10 and 20 seconds to complete?

Section E: Long Answer (20 Marks)


1. (a) Define and explain the concept of conditional probability for bivariate
distributions. (5 marks)
(b) Explain how the Central Limit Theorem applies to sampling distri-
butions, with an example from IT or Data Science. (5 marks)
(c) Solve the following: A call center receives an average of 5 calls per
minute. What is the probability of receiving exactly 8 calls in the
next minute? (Use the Poisson distribution formula). (5 marks)
(d) Explain how the Exponential distribution is related to the Poisson
distribution, and give a real-world application of this in computer
networks. (5 marks)

