Ch4 Random Variables
Probability
Stefano Bonaccorsi
/
/
Table of Contents
Introduction
! Introduction
Expectation
Variance
/
Discrete random variables
Introduction
Learning Goals
1. Know how to compute the expected value (mean) of a discrete random variable.
2. Know the expected value of Bernoulli, binomial and geometric random variables.
3. Be able to compute the variance and standard deviation of a random variable.
4. Understand that standard deviation is a measure of scale or spread.
5. Introduce the Poisson distribution.
/
Expectation
Introduction
Example
Suppose we have a six-sided die marked with five 3’s and one 6. What would you expect the average of 6000 rolls to be?
By a relative frequencies approach, if we knew the value of each roll, we could compute the average by summing all the values and dividing by the number of rolls. Without knowing the exact values, we can compute the expected average as follows.
Since there are five 3’s and one 6, we expect roughly 5/6 of the rolls (around 5000) to land a 3, and 1/6 of the rolls (around 1000) to land a 6.
Assuming this to be exactly true, we have the following table of values and counts:
value:            3      6
expected counts:  5000   1000
The average of these values is then
(5000 · 3 + 1000 · 6) / 6000 = (5/6) · 3 + (1/6) · 6 = 3.5
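A minimal simulation sketch in Python (added for illustration; the sample size 6000 matches the count above): averaging many rolls of this die reproduces the weighted average 3.5.

```python
import random

# Simulate the die described above: five faces marked 3, one face marked 6.
faces = [3, 3, 3, 3, 3, 6]
rolls = [random.choice(faces) for _ in range(6000)]

print(sum(rolls) / len(rolls))   # empirical average, close to 3.5
print(5/6 * 3 + 1/6 * 6)         # weighted average computed above: 3.5
```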
/
Expectation
Introduction
Expectation
Given a random variable X with range R and probability distribution p we define the
expectation of X to be
E[X] = Σ_{j=1}^n xj pj .
If the range of X is infinite, we shall assume in this chapter that X is such that the infinite
series above converges.
Notes:
1. The expected value is also called the mean or average of X and often denoted by µ (“mu”).
2. As seen in the above examples, the expected value need not be a possible value of the random variable. Rather it is a weighted average of the possible values.
3. Expected value is a summary statistic, providing a measure of the location or central tendency of a random variable.
4. If all the values are equally probable, the expected value is just the usual arithmetic average of the values.
/
Expectation
Introduction
Example .
Find E[X] when
(a) X is Bernoulli,
(b) X is uniform,
(c) X = xk with certainty.
/
Expectation
Introduction
Example .
Find E[X] when
(a) X is Bernoulli: E[X] = 1 · p + 0 · (1 − p) = p,
(b) X is uniform: E[X] = Σ_{j=1}^n (1/n) xj ,
(c) X = xk with certainty: E[X] = Σ_{j=1}^n δ_{xk ,xj} xj = xk .
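A small Python sketch of these three cases (the helper expectation and the parameter values p = 0.3, n = 6, xk = 4 are assumptions chosen for illustration):

```python
def expectation(dist):
    """E[X] for a discrete law given as (value, probability) pairs."""
    return sum(x * p for x, p in dist)

p = 0.3                                         # illustrative Bernoulli parameter
bernoulli = [(1, p), (0, 1 - p)]                # E[X] = p
uniform = [(x, 1/6) for x in range(1, 7)]       # E[X] = (1 + ... + 6)/6 = 3.5
certain = [(4.0, 1.0)]                          # X = 4 with certainty, E[X] = 4

print(expectation(bernoulli), expectation(uniform), expectation(certain))
```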
/
Algebraic properties of expectation
Introduction
• If X is a random variable with distribution p then
E[f(X)] = Σ_{j=1}^n f(xj ) pj
• If X and Y are two random variables with joint distribution pij , it makes sense to define
E[X + Y] = Σ_{i=1}^n Σ_{j=1}^m (xi + yj ) pij
and
E[XY] = Σ_{i=1}^n Σ_{j=1}^m xi yj pij
• A random variable X is said to be non-negative if R ⊂ [0, ∞), that is, no value in its range is negative. Note that X² is always non-negative irrespective of whether X is.
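A sketch of these formulas on an invented joint distribution (all numerical values below are assumptions made for the example):

```python
# Illustrative joint law p_ij of a pair (X, Y).
joint = {(0, 0): 0.2, (0, 1): 0.3,
         (1, 0): 0.1, (1, 1): 0.4}

E_X = sum(x * p for (x, y), p in joint.items())
E_Y = sum(y * p for (x, y), p in joint.items())
E_sum = sum((x + y) * p for (x, y), p in joint.items())
E_prod = sum(x * y * p for (x, y), p in joint.items())

print(E_sum, E_X + E_Y)   # E[X + Y] = E[X] + E[Y] (linearity)
print(E_prod, E_X * E_Y)  # E[XY] need not equal E[X]E[Y]
```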
/
Theorem .
Let X and Y be two random variables defined on the same sample space and ε ∈ R.
(a) E[X + Y] = E[X] + E[Y]
(b) E[εX] = εE[X]
(c) If X is a non-negative random variable, then E[X] ≥ 0
It is possible to see that (c) is a special case of the following more general result:
min{R} ≤ E[X] ≤ max{R}, i.e., the mean (or average, or expected value) of X is
contained between the minimum and the maximum of the elements of its range.
/
Proof
By direct computation.
/
Variance
Introduction
Now, take f(X) = (X − µ)² ; then f(X) is non-negative, so E[(X − µ)²] ≥ 0. We define the variance of X, V(X), by
V(X) = E[(X − µ)²] = Σ_{j=1}^n (xj − µ)² pj        ( . )
/
Variance
Introduction
The only problem with ( . ) is the practical one that if the values of X are in (units) then V(X) is measured in (units)². For this reason we find it useful to introduce the standard deviation σ(X) of the random variable X by
σ(X) = √V(X)        ( . )
/
Variance
Introduction
Theorem .
V(X) = E[X²] − µ²
Proof
Expand (X − µ)² = X² − 2µX + µ² and take expectations: V(X) = E[X²] − 2µ E[X] + µ² = E[X²] − µ².
/
Example .
Compute the variance of the following random variables:
(a) X is Bernoulli,
(b) X is uniform in R = {1, . . . , n},
(c) X = xk with certainty,
(d) X is the sum of scores on two fair dice.
/
Solution
(a) X is Bernoulli: V(X) = p(1 − p),
(b) X is uniform in R = {1, . . . , n}: E[X] = (n + 1)/2, V(X) = (n² − 1)/12,
(c) X = xk with certainty: V(X) = 0,
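A quick numerical check of parts (a)–(c) in Python, with illustrative parameters p = 0.3 and n = 6 (part (d) is treated on the next slide):

```python
def mean(dist):
    return sum(x * p for x, p in dist)

def variance(dist):
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist)

p, n = 0.3, 6  # illustrative parameters
print(variance([(1, p), (0, 1 - p)]), p * (1 - p))                   # Bernoulli: p(1 - p)
print(variance([(k, 1/n) for k in range(1, n + 1)]), (n**2 - 1)/12)  # uniform: (n^2 - 1)/12
print(variance([(4.0, 1.0)]))                                        # constant: 0
```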
/
Solution
(d) S is the sum of scores on two fair dice:
Let S = X + Y with X and Y the score on the first (second) die. Then
E[X] = E[Y] = 7/2, hence E[S] = 7.
The law of S, together with the quantities needed for the variance:
xk = Sum :  2      3      4      5      6      7      8      9      10     11     12
pk :        1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
xk pk :     2/36   6/36   12/36  20/36  30/36  42/36  40/36  36/36  30/36  22/36  12/36
xk² pk :    4/36   18/36  48/36  100/36 180/36 294/36 320/36 324/36 300/36 242/36 144/36
Summing the last two rows gives E[S] = 252/36 = 7 and E[S²] = 1974/36 = 54 + 5/6, hence
V(S) = E[S²] − E[S]² = 329/6 − 49 = 35/6.
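A Python sketch that re-derives these numbers by enumerating the 36 equally likely outcomes (exact arithmetic via fractions):

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)                      # each ordered pair of faces is equally likely
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

E_S = sum(s * p for s in sums)
E_S2 = sum(s * s * p for s in sums)
print(E_S, E_S2, E_S2 - E_S**2)          # 7, 329/6, 35/6
```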
/
Variance
Introduction
Consider the following random variables:
X ∼ {(1, 1/5), (2, 1/5), (3, 1/5), (4, 1/5), (5, 1/5)}
Y ∼ {(1, 1/10), (2, 2/10), (3, 4/10), (4, 2/10), (5, 1/10)}
Z ∼ {(1, 1/2), (2, 0), (3, 0), (4, 0), (5, 1/2)}
W ∼ {(1, 0), (2, 0), (3, 1), (4, 0), (5, 0)}
Order them from the largest to the smallest variance.
https://app.wooclap.com/ADCJQM
/
Table of Contents
Covariance and correlation
! Introduction
Expectation
Variance
/
Dependence
Covariance and correlation
Suppose that X and Y are two random variables with means µX and µY respectively. We
say that they are linearly related if we can find constants m and c such that Y = mX + c,
so for each yk , 1 ≤ k ≤ n, we can find xk such that yk = mxk + c.
Each of the points (x1 , y1 ), . . . , (xn , yn ) lies on a straight line.
[Figure: the points (x1 , y1 ), (x2 , y2 ), (x3 , y3 ) plotted in the plane, all lying on one straight line.]
/
Covariance
Covariance and correlation
A quantity that enables us to measure how ‘close’ X and Y are to being linearly related is
the covariance Cov(X, Y).
This is defined by
Cov(X, Y) = E[(X−µX )(Y−µY )]
Note that Cov(X, Y) = Cov(Y, X) and that Cov(X, X) = V(X).
Furthermore, if X and Y are linearly related, then Cov(X, Y) = mV(X).
/
Covariance and random vectors
Covariance and correlation
Suppose that X = (X1 , X2 ) is a random vector with range R = {(ai , bj )} and distribution
pij = P(X1 = ai , X2 = bj ). We know that the expectation of X is a vector
E[X] = (E[X1 ], E[X2 ]); what about the variance? It is natural to replace the square of (X − E[X]) with the matrix product
(X − E[X])ᵀ · (X − E[X]) =
[ (X1 − µ1)²              (X1 − µ1)(X2 − µ2) ]
[ (X1 − µ1)(X2 − µ2)      (X2 − µ2)²         ]
Taking the expectation we recognize, on the diagonal, the variances V(X1 ) and V(X2 ), and
off-diagonal the covariance between X1 and X2 .
The correlation coefficient of two random variables X and Y is defined by
ρ(X, Y) = Cov(X, Y) / (σX σY)
where σX and σY are the standard deviations of X and Y respectively.
If ρ(X, Y) = 0, we say that X and Y are uncorrelated.
Note: if Y = mX + b, then
E[Y] = E[mX + b] = m µX + b,
V(Y) = E[(mX + b − m µX − b)²] = m² E[(X − µX)²] = m² V(X), hence σY = |m| σX ,
Cov(X, Y) = E[(X − µX)(mX + b − m µX − b)] = E[(X − µX) · m(X − µX)] = m V(X),
so ρ(X, Y) = m V(X)/(σX |m| σX) = ±1, with the sign of m.
Example .
Consider a symmetric binary channel, with q0|0 = q1|1 = 0.8 and q0|1 = q1|0 = 0.2
Suppose that the input is described by a symmetric Bernoulli random variable X (hence
p0 = P(X = 0) = 1/2 and p1 = 1/2 as well).
Notice that the random variable Y describing the output is also symmetric Bernoulli. The
error probability is given by ε = 0.2. Find the correlation ρ(X, Y) between input and
output.
[Figure: diagram of the binary channel; each input bit (0 or 1) is received correctly with probability 0.8 and flipped with probability 0.2.]
/
Example .
Consider a symmetric binary channel, with q0|0 = q1|1 = 0.8 and q0|1 = q1|0 = 0.2
Suppose that the input is described by a symmetric Bernoulli random variable X (hence
p0 = P(X = 0) = 1/2 and p1 = 1/2 as well).
Notice that the random variable Y describing the output is also symmetric Bernoulli. The
error probability is given by ε = 0.2. Find the correlation ρ(X, Y) between input and
output.
We have µX = µY = 1/2 and σX = σY = 1/2.
The joint probabilities are
p(0, 0) = p0 q0|0 = 0.4, p(1, 1) = 0.4, p(0, 1) = p0 q1|0 = 0.1, p(1, 0) = 0.1
hence
Cov(X, Y) = E[XY] − µX µY = 1 · 0.4 − (0.5)² = 0.15
so ρ(X, Y) = 0.15 / (0.5 · 0.5) = 0.6
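A small Python sketch reproducing this computation from the joint law (the probabilities are those computed above):

```python
# Joint law of (X, Y) for the symmetric binary channel with crossover probability 0.2.
joint = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}

E_X = sum(x * p for (x, y), p in joint.items())        # 0.5
E_Y = sum(y * p for (x, y), p in joint.items())        # 0.5
E_XY = sum(x * y * p for (x, y), p in joint.items())   # 0.4

cov = E_XY - E_X * E_Y                                 # 0.15
sigma_X = sigma_Y = 0.5                                # std. dev. of a symmetric Bernoulli
print(cov / (sigma_X * sigma_Y))                       # rho = 0.6
```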
/
The Cauchy-Schwarz inequality
Covariance and correlation
Theorem (Cauchy–Schwarz inequality)
(E[XY])² ≤ E[X²] E[Y²]
Proof
For every t ∈ R, the random variable (X + tY)² is ≥ 0, hence its mean is ≥ 0, i.e.,
0 ≤ E[(X + tY)²] = E[X²] + 2t E[XY] + t² E[Y²].
A quadratic polynomial in t that is never negative has non-positive discriminant, hence (E[XY])² − E[X²] E[Y²] ≤ 0.
/
The Cauchy-Schwarz inequality
Covariance and correlation
Corollary (applying the theorem to X − µX and Y − µY )
(i) |Cov(X, Y)| ≤ σX σY
(ii) −1 ≤ ρ(X, Y) ≤ 1
/
Exercise .
Show that, if X, Y and Z are arbitrary random variables and ε and β are real numbers,
then
(i) V(X + ε) = V(X)
(ii) Cov(X, Y) = E[XY] − E[X]E[Y]
(iii) Cov(X, εY + βZ) = ε Cov(X, Y) + β Cov(X, Z)
(iv) V(X + Y) = V(X) + 2 Cov(X, Y) + V(Y)
/
Linearly dependent random variables
Covariance and correlation
Theorem
X and Y are linearly dependent: Y = aX + b
if and only if the correlation is ±1
Proof
If Y = aX + b then µY = a µX + b, V(Y) = a² V(X) and
E[XY] = Σ_{i=1}^n xi (a xi + b) pX (i) = a E[X²] + b E[X],
hence Cov(X, Y) = E[XY] − µX µY = a V(X) and |Cov(X, Y)| = |a| V(X) = √(V(X) · a² V(X)) = σX σY ,
so ρ(X, Y) = a/|a| = ±1.
/
Linearly dependent random variables
Covariance and correlation
Theorem
X and Y are linearly dependent: Y = aX + b
if and only if the correlation is ±1
Proof
Conversely, let X′ = X/σX and Y′ = Y/σY ; then
V(X′ ± Y′ ) = V(X′ ) + V(Y′ ) ± 2 Cov(X′ , Y′ ) = 2 ± 2ρ(X, Y),
so if ρ(X, Y) = ±1 then either V(X′ − Y′ ) = 0 or V(X′ + Y′ ) = 0, and we know that a random variable has variance 0 if and only if it is degenerate; in either case Y is a linear function of X.
/
Independent random variables
Covariance and correlation
Two random variables X and Y are said to be (probabilistically) independent if each of the
events (X = xj ) and (Y = yk ) are independent:
P(X = xj , Y = yk ) = P(X = xj ) P(Y = yk ) for all j, k.
Example
Choose a number at random between 1 and 6. Let X be the remainder of the division of the number by 2, and Y be the remainder of the division of the number by 3. Then X is
uniformly distributed in {0, 1} and Y is uniformly distributed in {0, 1, 2}.
Moreover, X and Y are independent.
However, if we take the number at random between 0 and 6, then X and Y are no longer
independent!
/
Theorem .
If X and Y are independent, then
(a) E[XY] = E[X]E[Y]
(b) Cov(X, Y) = ρ(X, Y) = 0
(c) V(X + Y) = V(X) + V(Y).
Proof
E[XY] = Σ_{j=1}^n Σ_{k=1}^m xj yk P(X = xj , Y = yk ) = Σ_{j=1}^n Σ_{k=1}^m xj yk P(X = xj ) P(Y = yk )
= ( Σ_{j=1}^n xj P(X = xj ) ) ( Σ_{k=1}^m yk P(Y = yk ) ) = E[X] E[Y]
as required.
/
Theorem .
If X and Y are independent, then
(a) E[XY] = E[X]E[Y]
(b) Cov(X, Y) = ρ(X, Y) = 0
(c) V(X + Y) = V(X) + V(Y).
Proof
Since
Cov(X, Y) = E[XY] − E[X]E[Y]
it follows from previous computation that Cov(X, Y) = 0 and, a fortiori, ρ(X, Y) = 0.
/
Theorem .
If X and Y are independent, then
(a) E[XY] = E[X]E[Y]
(b) Cov(X, Y) = ρ(X, Y) = 0
(c) V(X + Y) = V(X) + V(Y).
Proof
Since
V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y)
the thesis follows from the previous computation.
/
Table of Contents
Binomial and Poisson
! Introduction
Expectation
Variance
/
Let X1 , X2 , ..., Xn be i.i.d. Bernoulli random variables: P(Xi = 1) = p, P(Xi = 0) = 1 − p.
The sum S(n) = X1 + · · · + Xn is called a binomial random variable with parameters n and
p, S(n) ∼ B(n, p).
The range of S(n) is {0, 1, . . . , n};
Lemma .
The probability law of S(n) is given by p(k) = C(n, k) p^k (1 − p)^(n−k) for 0 ≤ k ≤ n, where C(n, k) = n!/(k! (n − k)!) denotes the binomial coefficient.
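A Python sketch of the binomial law (math.comb is the standard-library binomial coefficient; the parameters n = 6, p = 0.5 are illustrative):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(S(n) = k) for S(n) ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 6, 0.5  # illustrative parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(sum(pmf))                               # 1.0: the law is a probability distribution
print(sum(k * q for k, q in enumerate(pmf)))  # mean n*p = 3.0
```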
/
Example .
An information source emits a six-digit message into a channel in binary code. Each digit is
chosen independently of the others and is a one with probability . . Calculate the
probability that the message contains
(i) three ones,
(ii) between two and four ones (inclusive),
(iii) no less than two zeros.
/
From Binomial to Poisson distribution
Having dealt with a finite number of i.i.d. Bernoulli random variables it is natural (if you
are a mathematician) to inquire about the behavior of an infinite number of these. Of
course, the passage to the infinite generally involves taking some kind of limit and this
needs to be carried out with care.
We will take the limit of the probability law
p(k) = C(n, k) p^k (1 − p)^(n−k)
as n → ∞ and as p → 0.
In order to obtain a sensible answer we will assume that n increases and p decreases in
such a way that λ = np remains fixed.
/
We denote by Y the corresponding random variable, which is called a Poisson random variable with parameter λ. The range of Y is N.
To obtain the probability law of Y we take the limit
pY (k) = lim_{n→∞} C(n, k) (λ/n)^k (1 − λ/n)^(n−k) = e^(−λ) λ^k / k!
for every k ≥ 0.
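A numerical sketch of this limit in Python, with an illustrative λ = 2: for growing n with p = λ/n fixed, the binomial probabilities approach the Poisson ones.

```python
from math import comb, exp, factorial

lam = 2.0  # illustrative value of lambda

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

for n in (10, 100, 10_000):
    print(n, [round(binom_pmf(k, n, lam / n), 4) for k in range(5)])
print("Poisson", [round(poisson_pmf(k, lam), 4) for k in range(5)])
```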
/
Remember the well-known fact that
Σ_{k=0}^∞ λ^k / k! = e^λ
Prove (either by taking the limit in E[S(n)] = np and V(S(n)) = np(1 − p) or by a direct
computation) that
E[Y] = λ, V(Y) = λ.
/
Example
A typesetter makes, on average, one mistake per 1000 words. Assume that he is setting a book with 100 words to a page. Let S100 be the number of mistakes that he makes on a single page. Then the exact probability distribution for S100 would be obtained by considering S100 as the result of 100 Bernoulli trials with p = 1/1000. The expected value of
S100 is λ = 100(1/1000) = .1. The exact probability that S100 takes a certain value j is
C(100, j) p^j (1 − p)^(100−j) , and the Poisson approximation is e^(−0.1) (0.1)^j / j! .
Numerically, the values of the Poisson law with λ = 0.1 and of the Binomial law with n = 100, p = 0.001 are very close to each other (see the sketch below).
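A short Python sketch computing the first few values of both laws side by side (the range j = 0, . . . , 4 is an illustrative choice):

```python
from math import comb, exp, factorial

n, p = 100, 1 / 1000
lam = n * p  # 0.1

for j in range(5):
    exact = comb(n, j) * p**j * (1 - p)**(n - j)   # binomial probability
    approx = exp(-lam) * lam**j / factorial(j)     # Poisson approximation
    print(j, round(exact, 6), round(approx, 6))
```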
/
Sum of Independent Poisson Random Variables
Let X and Y be independent Poisson random variables with respective means λ1 and λ2 .
Calculate the distribution of X + Y.
Solution
Since the event {X + Y = n} may be written as the union of the disjoint events
{X = k, Y = n−k}, 0 ≤ k ≤ n, we have
P(X + Y = n) = Σ_{k=0}^n P(X = k, Y = n − k) = Σ_{k=0}^n P(X = k) P(Y = n − k)
= Σ_{k=0}^n e^(−λ1) (λ1^k / k!) · e^(−λ2) (λ2^(n−k) / (n − k)!)
= e^(−(λ1+λ2)) (1/n!) Σ_{k=0}^n (n! / (k! (n − k)!)) λ1^k λ2^(n−k)
= e^(−(λ1+λ2)) (λ1 + λ2)^n / n!
that is, X + Y is a Poisson random variable with parameter λ1 + λ2.
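A numerical sketch of this identity in Python: convolving two Poisson laws (illustrative parameters λ1 = 1.5, λ2 = 2.5) reproduces the Poisson law with parameter λ1 + λ2.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam1, lam2 = 1.5, 2.5  # illustrative parameters

# Convolution of the two laws versus the Poisson(lam1 + lam2) law.
for n in range(6):
    conv = sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2) for k in range(n + 1))
    print(n, round(conv, 6), round(poisson_pmf(n, lam1 + lam2), 6))
```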
/
Splitting a Poisson distribution
The number of precipitation phenomena during the month of January in a certain district
is Poisson distributed with mean µ = 14.3. Suppose that each phenomenon can be
classified as normal or extreme and that each of them, independently of the others, is an
extreme phenomenon with probability p = 0.03.
Find the joint distribution that n normal events and m extreme events occur in the next
month of January.
/
Solution
Let N be the total number of events, N1 the number of normal events and N2 the number of extreme events, with N1 + N2 = N, and write λ = 14.3 for the mean of N. Conditioning on N gives
P(N1 = n, N2 = m) = P(N1 = n, N2 = m | N = n + m) P(N = n + m).
Given that n + m events have occurred, the fact that n of them are classified as normal and the
remaining as extreme is just a binomial distribution, hence
P(N1 = n, N2 = m) = C(n + m, n) (1 − p)^n p^m · e^(−λ) λ^(n+m) / (n + m)! = e^(−λ(1−p)) (λ(1 − p))^n / n! · e^(−λp) (λp)^m / m!
Because the preceding joint probability mass function factors into two products, one of which
depends only on n and the other only on m, it follows that N1 and N2 are independent, with N1 Poisson of parameter λ(1 − p) and N2 Poisson of parameter λp.
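A numerical sketch in Python checking the factorization for a few pairs (n, m), with λ = 14.3 and p = 0.03 as in the exercise:

```python
from math import comb, exp, factorial

lam, p = 14.3, 0.03

def poisson_pmf(k, mu):
    return exp(-mu) * mu**k / factorial(k)

# Joint law obtained by conditioning on N = n + m, compared with the factored form.
for n, m in [(0, 0), (3, 1), (10, 0), (14, 2)]:
    joint = poisson_pmf(n + m, lam) * comb(n + m, n) * (1 - p)**n * p**m
    factored = poisson_pmf(n, lam * (1 - p)) * poisson_pmf(m, lam * p)
    print(n, m, round(joint, 10), round(factored, 10))
```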
/
Table of Contents
Geometric, negative binomial, and hypergeometric random variables
! Introduction
Expectation
Variance
/
Example. Geometric distribution
A coin that shows heads (H) with probability p is repeatedly tossed. What is the probability p(r) of getting the first head on the r-th try?
By independence we have p(r) = (1 − p)^(r−1) p (r − 1 tails followed by one head).
Now define a random variable Y taking values in N by: Y is the first try when H appears. Y
is called a geometric random variable and we have pY (r) = p(1 − p)^(r−1) .
Verify that Σ_{r=1}^∞ pY (r) = 1 and that E[Y] = 1/p , V(Y) = (1 − p)/p² .
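A numerical sketch of these identities in Python, truncating the infinite range (the parameter p = 0.25 is illustrative):

```python
p = 0.25  # illustrative parameter

pmf = {r: p * (1 - p)**(r - 1) for r in range(1, 2001)}  # truncation of the infinite range
print(sum(pmf.values()))                                 # ~ 1
print(sum(r * q for r, q in pmf.items()))                # ~ 1/p = 4
mu = 1 / p
print(sum((r - mu)**2 * q for r, q in pmf.items()))      # ~ (1 - p)/p^2 = 12
```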
/
Example
In printing a book, an error can occur in every character with probability p = 0.1
Find the probability that no misprint occurs in the first 10 characters.
/
We want to compute the probability that Y > 10, where Y is a geometric random variable
of parameter p.
This gives us a useful opportunity to develop a general formula to compute the
cumulative distribution FY (n) = P(Y ≤ n)
We have that (Y > n) means that n successive failures have occurred in the first n tries, hence
P(Y > n) = (1 − p)^n and FY (n) = 1 − (1 − p)^n
In our case (with n = 10 and p = 0.1) we get P(Y > 10) = (0.9)^10 ≈ 0.3487
/
We will remain in the same context as above with our sequence X1 , X2 , ... of i.i.d.
Bernoulli random variables. We saw that the geometric random variable could be
interpreted as a ‘waiting time’ until the first 1 is registered.
Now we will consider a more general kind of waiting time, namely, we fix r ∈ N and ask
how long we have to wait (i.e. how many Xj ’s have to be emitted) until we observe r 1’s.
To be specific we define a random variable N, called the negative binomial random
variable with parameters r and p and range {r, r + 1, r + 2, ...} by: N is the smallest value
of n for which X1 + X2 + · · · + Xn = r
We have
P(N = n) = P({r − 1 of the r.v.’s X1 , . . . , Xn−1 take the value 1} ∩ {Xn = 1})
= P({r − 1 of the r.v.’s X1 , . . . , Xn−1 take the value 1}) P({Xn = 1})
= C(n − 1, r − 1) p^r (1 − p)^(n−r)
You should try to prove and convince yourself that N is the sum of r independent
geometric random variables.
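A Python sketch of the negative binomial law (illustrative parameters r = 3, p = 0.4); its mean comes out as r/p, i.e. r times the mean of a geometric random variable, consistent with the remark above.

```python
from math import comb

r, p = 3, 0.4  # illustrative parameters

def negbin_pmf(n, r, p):
    """P(N = n): the r-th 1 occurs exactly at the n-th trial."""
    return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

pmf = {n: negbin_pmf(n, r, p) for n in range(r, 300)}
print(sum(pmf.values()))                    # ~ 1
print(sum(n * q for n, q in pmf.items()))   # ~ r/p = 7.5
```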
/
Suppose we have a supply of n binary symbols (i.e. 0’s and 1’s), m of which take the value 1 (so that n − m take the value 0). Suppose that we wish to form a ‘codeword of length r’, that is, a sequence of r binary symbols out of our supply of n symbols, and that this codeword is chosen at random.
We define a random variable: H = number of 1’s in the codeword
so that H has range {0, 1, 2, ..., p}, where p is the minimum of m and r; H is called the
hypergeometric random variable with parameters n, m and r.
We have
P(H = x) = C(m, x) C(n − m, r − x) / C(n, r)
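A Python sketch of the hypergeometric law with illustrative parameters n = 20, m = 8, r = 5 (the mean r·m/n is a standard fact, used here only as a numerical check):

```python
from math import comb

def hypergeom_pmf(x, n, m, r):
    """P(H = x): x ones in a random codeword of length r drawn from m ones and n - m zeros."""
    return comb(m, x) * comb(n - m, r - x) / comb(n, r)

n, m, r = 20, 8, 5  # illustrative parameters
pmf = [hypergeom_pmf(x, n, m, r) for x in range(min(m, r) + 1)]
print(sum(pmf))                               # 1.0
print(sum(x * q for x, q in enumerate(pmf)))  # mean r*m/n = 2.0
```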
/
Thank you for listening!
Any questions?
/
Application: elementary statistical inference
In statistics we are interested in gaining information about a population which for
reasons of size, time or cost is not directly accessible to measurement.
We are interested in some quality of the members of the population which can be
measured numerically.
Statisticians attempt to learn about the population by studying a sample taken from it at
random.
Clearly, the properties of the population will be reflected in the properties of the sample.
Suppose that we want to gain information about the population mean µ. If our sample is
{x1 , x2 , ..., xn }, we might calculate the sample mean
x̄ = (1/n) Σ_{j=1}^n xj