Lecture 1
Paolo Zacchia
Sample Spaces
Definition 1
Sample Space. The set S collecting all possible outcomes associated
with a certain phenomenon is called the sample space.
Definition 2
Events. A subset of a sample space S, including S itself, is an event.
Examples:
• Coin tossing: A_null = ∅, A_head = {Head}, A_tail = {Tail},
A_full = S_coin = {Head, Tail};
Theorem 1
Properties of Events. Let AS , BS and CS be any three events asso-
ciated with the sample space S. The following properties hold.
a. Commutativity: AS ∪ BS = BS ∪ AS
AS ∩ BS = BS ∩ AS
b. Associativity: AS ∪ (BS ∪ CS ) = (AS ∪ BS ) ∪ CS
AS ∩ (BS ∩ CS ) = (AS ∩ BS ) ∩ CS
c. Distributive Laws: AS ∩ (BS ∪ CS ) = (AS ∩ BS ) ∪ (AS ∩ CS )
AS ∪ (BS ∩ CS ) = (AS ∪ BS ) ∩ (AS ∪ CS )
d. DeMorgan’s Laws: (A_S ∪ B_S)^c = A_S^c ∩ B_S^c
(A_S ∩ B_S)^c = A_S^c ∪ B_S^c
Partitions
Definition 3
Disjoint Events. Two events A1 and A2 are disjoint or mutually
exclusive if A1 ∩ A2 = ∅. The events in a collection A1 , A2 , . . . are
pairwise disjoint or mutually exclusive if Ai ∩ Aj = ∅ for all pairs
i ≠ j.
Definition 4
Partition. The events in a collection A1, A2, . . . form a partition of
the sample space S if they are pairwise disjoint and $\bigcup_{i=1}^{Z} A_i = S$ if the
collection is of finite dimension Z, or $\bigcup_{i=1}^{\infty} A_i = S$ if the collection has an
infinite number of elements.
Definition 5
Sigma Algebra. Given some set S, a sigma algebra (σ-algebra) or
Borel field is a collection of subsets of S, which is denoted as B, that
satisfies the following properties:
a. ∅ ∈ B;
b. for any subset A ∈ B, it is Ac ∈ B;
c. for any countable sequence of subsets A1, A2, · · · ∈ B, it holds that
$\bigcup_{i=1}^{\infty} A_i \in B$.
• However, $\bigcup_{i=1}^{\infty} \left(0, \tfrac{i-1}{i}\right] = (0, 1) \notin B'$, which contradicts the
definition of sigma algebra.
Definition 6
Probability Function. Given a sample space S and an associated
σ-algebra B, a probability function P is a function with domain B
that satisfies the three axioms of probability:
a. P (A) ≥ 0 ∀A ∈ B;
b. P (S) = 1;
c. given a countable sequence of pairwise disjoint subsets written as
A1, A2, · · · ∈ B, then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.
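As a small illustration of Definition 6, the following Python sketch (assuming a fair six-sided die as the sample space, with B taken to be its full power set) checks the three axioms numerically:

```python
from itertools import chain, combinations
from fractions import Fraction

# Hypothetical example: a fair six-sided die with equally likely outcomes.
S = frozenset({1, 2, 3, 4, 5, 6})

def P(A):
    """Probability function induced by equally likely outcomes."""
    return Fraction(len(A), len(S))

def powerset(s):
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

B = powerset(S)  # the power set of a finite S is a sigma-algebra on S

assert all(P(A) >= 0 for A in B)            # axiom a.: nonnegativity
assert P(S) == 1                            # axiom b.: P(S) = 1
A1, A2 = frozenset({1, 2}), frozenset({5})
assert P(A1 | A2) == P(A1) + P(A2)          # axiom c. (finite case): additivity over disjoint events
print("All three axioms hold on this finite example.")
```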
Theorem 2
Properties of Probability Functions (1). If P is some probability
function and A is a set in B, the following properties hold:
a. P (∅) = 0;
b. P (A) ≤ 1;
c. P (Ac ) = 1 − P (A).
Proof.
The observation that A and A^c form a partition of S, so that
P (A) + P (Ac ) = P (S) = 1, proves c.; statements a. and b. then follow.
Properties of probability functions (2/4)
Theorem 3
Properties of Probability Functions (2). If P is some probability
function and A, B are sets in B, the following properties hold:
a. P (B ∩ Ac ) = P (B) − P (A ∩ B);
b. P (A ∪ B) = P (A) + P (B) − P (A ∩ B);
c. if A ⊂ B, it is P (A) ≤ P (B).
Proof.
To prove a. note that B can be expressed as the union of two disjoint
sets B = {B ∩ A} ∪ {B ∩ Ac }, thus P (B) = P (B ∩ A) + P (B ∩ Ac ). To
show b. decompose the union of A and B as A ∪ B = A ∪ {B ∩ Ac },
again two disjoint sets; hence by a. the following holds.
$$P(A \cup B) = P(A) + P(B \cap A^c) = P(A) + P(B) - P(A \cap B)$$
Proof.
Regarding a. note that, by the Distributive Laws of events, it is
$$A = A \cap S = A \cap \left(\bigcup_{i=1}^{\infty} C_i\right) = \bigcup_{i=1}^{\infty} (A \cap C_i)$$
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} A_i^{*}\right) = \sum_{i=1}^{\infty} P(A_i^{*}) \le \sum_{i=1}^{\infty} P(A_i)$$
where the second equality follows from the pairwise disjoint property.
Such additional collection of events can be obtained as:
$$A_1^{*} = A_1, \qquad A_i^{*} = A_i \cap \left(\bigcup_{j=1}^{i-1} A_j\right)^{c} = A_i \cap \bigcap_{j=1}^{i-1} A_j^{c} \quad \text{for } i = 2, 3, \dots$$
Definition 7
Conditional Probability. Consider a sample space S, an associated
σ-algebra B, and any two events A, B ∈ B such that P (B) > 0. The
conditional probability of A given B is written as P ( A| B) and is
defined as follows.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
• Hence:
$$P(A \mid \text{passing}) = \frac{P(A \cap \text{passing})}{P(\text{passing})} = \frac{0.3}{0.9} = \frac{1}{3}$$
$$P(i \mid i > 0) = \frac{P(i \cap \{i > 0\})}{P(i > 0)} = I^{-2}\left[(I + 1 - i)^2 - (I - i)^2\right]$$
$$P(B \mid A) = \frac{P(B \cap A)}{P(A)}$$
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Theorem 5
Bayes’ Theorem. Let A1 , A2 , . . . be a partition of the sample space
S, and B some event B ⊂ S. For i = 1, 2, . . . the following holds.
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{\infty} P(B \mid A_j)\, P(A_j)}$$
Proof.
This follows from Bayes’ Rule for A = Ai and by observing that:
$$P(B) = \sum_{j=1}^{\infty} P(B \cap A_j) = \sum_{j=1}^{\infty} P(B \mid A_j)\, P(A_j)$$
$$P(\text{taker} \mid \text{sick}) = P(\text{sick} \mid \text{taker})\, \frac{P(\text{taker})}{P(\text{sick})} = \frac{4}{9}$$
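The computation behind Bayes' Theorem is easy to reproduce numerically; the Python sketch below uses made-up priors and likelihoods over a three-element partition (the numbers are illustrative, not those of the example above):

```python
# Illustrative numbers only: a partition A_1, A_2, A_3 with priors P(A_j)
# and likelihoods P(B | A_j); they are not the figures of the example above.
priors = [0.5, 0.3, 0.2]        # P(A_1), P(A_2), P(A_3), summing to one
likelihoods = [0.9, 0.5, 0.1]   # P(B | A_1), P(B | A_2), P(B | A_3)

# Law of total probability: P(B) = sum_j P(B | A_j) P(A_j)
p_B = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' Theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posteriors = [l * p / p_B for l, p in zip(likelihoods, priors)]
print(posteriors, sum(posteriors))  # the posteriors sum to one
```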
Two events A and B are statistically independent if:
$$P(A \cap B) = P(A)\, P(B)$$
Definition 9
Mutual statistical independence (multiple events). The events
of any collection A1, A2, . . . , AN are mutually independent if, for
any subcollection Ai1, Ai2, . . . , AiN′ with N′ ≤ N, the following holds.
$$P\left(\bigcap_{j=1}^{N'} A_{i_j}\right) = \prod_{j=1}^{N'} P\left(A_{i_j}\right)$$
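To see the definition at work, the sketch below (assuming three fair coin tosses, with Ai the event that toss i comes up heads) verifies the product rule over every subcollection:

```python
from itertools import product, combinations
from fractions import Fraction

# Sample space: three fair coin tosses, all 8 outcomes equally likely.
S = list(product("HT", repeat=3))
P = lambda A: Fraction(len(A), len(S))

# A_i = "toss i comes up heads"
A = [set(w for w in S if w[i] == "H") for i in range(3)]

# Mutual independence requires the product rule for EVERY subcollection.
for r in range(2, len(A) + 1):
    for sub in combinations(A, r):
        prob_product = Fraction(1)
        for E in sub:
            prob_product *= P(E)
        assert P(set.intersection(*sub)) == prob_product
print("A_1, A_2, A_3 are mutually independent.")
```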
Independence and complementary events
Theorem 6
Independence and Complementary Events. Consider any two
independent events A and B. It can be concluded that the following
pairs of events are independent too:
a. A and Bc ;
b. Ac and B;
c. Ac and Bc .
Proof.
Case a. follows from the definition of independence (second equality):
$$P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A)\,P(B) = P(A)\left[1 - P(B)\right] = P(A)\,P(B^c)$$
Definition 10
Random Variables. A random variable X is a function from the
sample space S to the set of real numbers, X : S → R.
Definition 11
Cumulative Probability Distribution. Given a random variable
X, a cumulative (probability) distribution function (typically
abbreviated as c.d.f.) is a function FX (x) which is defined as follows.
$$F_X(x) = P(X \le x) \quad \text{for all } x \in \mathbb{R}$$
[Figure: the c.d.f. FX2.coins (x) of the two-coin experiment, a step function with jumps at x = 0, 1, 2.]
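A short numerical sketch of this c.d.f. (assuming that X2.coins counts the number of heads in two independent fair coin tosses):

```python
from itertools import product
from fractions import Fraction

# Assumption: X counts the number of heads in two fair coin tosses.
S = list(product("HT", repeat=2))
X = {w: sum(c == "H" for c in w) for w in S}

def cdf(x):
    """F_X(x) = P(X <= x) for the discrete random variable above."""
    return Fraction(sum(X[w] <= x for w in S), len(S))

for x in [-1, 0, 0.5, 1, 2, 3]:
    print(x, cdf(x))   # 0, 1/4, 1/4, 3/4, 1, 1: a right-continuous step function
```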
Properties of cumulative distributions
Theorem 7
Properties of Probability Distribution Functions. A function
F (x) can be a (cumulative) probability distribution function if and only
if the following three conditions hold:
a. limx→−∞ F (x) = 0 and limx→∞ F (x) = 1;
b. F (x) is a nondecreasing function of x;
c. F (x) is right-continuous, that is limx↓x0 F (x) = F (x0 ) ∀x0 ∈ R.
Proof.
(Outline.) Necessity follows directly from the definition of Probability
Functions. Sufficiency requires some reverse engineering, showing how
for each Probability Distribution Function with the above properties,
one can find an appropriate sample space S, an associated probability
function P and a relative random variable X.
Discrete and continuous random variables
Definition 12
Types of Random Variables. A random variable X is continuous
if FX (x) is a continuous function of x, while it is discrete if FX (x) is
a step function of x.
Examples:
• Discrete: X2.coins , Xgrades and similar ones;
$$F_X(x) = \Lambda(x) = \frac{1}{1 + \exp(-x)}$$
[Figure: the logistic c.d.f. Λ (x) and the standard normal c.d.f. Φ (x), both continuous in x.]
Definition 14
Probability Density Function. Given a continuous random variable
X, its probability density function fX (x) (which is often abbreviated
as p.d.f.) is defined as the function that satisfies the following relationship.
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt \quad \text{for all } x \in \mathbb{R}$$
Definition 15
Support of a random variable. Given a random variable X which
is either discrete or continuous, its support X is defined as the set
X ≡ {x : x ∈ R, fX (x) > 0}
In general:
• the support of discrete random variables is a countable
set (corresponding with a countable sample space);
• whereas the support of continuous random variables is an
uncountable set (thus corresponding with an uncountable
sample space).
Mass function and support
• As the support of a discrete random variable is countable,
a p.m.f. has an easy interpretation as a transposition of the
underlying probability function.
• Thus:
$$P(a \le X \le b) = F_X(b) - \lim_{x \uparrow a} F_X(x) = \sum_{t=a}^{b} f_X(t)$$
hence:
$$P(X \le b) = F_X(b) = \sum_{t = \inf X}^{b} f_X(t)$$
and:
$$P(X \in X) = \sum_{t \in X} f_X(t) = 1$$
[Figure: the p.m.f. fX2.coins (x) of the two-coin experiment, with mass points at x = 0, 1, 2.]
Density function and support
• For density functions instead the support is an uncountable
set, and the interpretation of fX (x) ≥ 0 is subtler.
hence:
$$P(X \in X) = \int_{X} f_X(t)\, dt = 1$$
and it is the integrals of fX (x) over intervals that provide a probabilistic interpretation for segments of R.
Theorem 8
Properties of mass and density functions. A function fX (x) is
an appropriate probability mass or density function of a given random
variable X if and only if:
a. fX (x) ≥ 0 for all x ∈ R;
b. $\sum_{x \in X} f_X(x) = 1$ or $\int_{X} f_X(x)\, dx = 1$, respectively for mass and
density functions.
Proof.
(Outline.) Necessity follows directly by the definitions of c.d.f., p.m.f.
and p.d.f.; sufficiency follows by Theorem 7 after having constructed
the associated cumulative distribution FX (x).
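A quick numerical check of Theorem 8 for a density (a sketch assuming the exponential density with unit parameter, f(x) = exp(−x) on R+, and the availability of SciPy):

```python
import math
from scipy.integrate import quad

# Candidate density: the exponential with unit parameter, f(x) = exp(-x) on R+.
f = lambda x: math.exp(-x)

# Condition a.: nonnegativity, spot-checked on a grid of points
assert all(f(x) >= 0 for x in [0.0, 0.5, 1.0, 5.0, 50.0])

# Condition b.: the density must integrate to one over its support
total, _ = quad(f, 0, math.inf)
print(total)   # ~1.0, so f qualifies as a p.d.f. by Theorem 8
```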
Mixing discrete and continuous
• Certain distributions are continuous in some parts of their
support, and discrete in other parts.
[Figure: the c.d.f. Φ≥0 (x), an example of a distribution that is continuous in part of its support and discrete elsewhere.]
Theorem 9
Identical Distribution. Given two random variables X and Y whose
primitive sample space is a subset of the real numbers S ⊆ R, the fol-
lowing two statements are equivalent:
a. X and Y are identically distributed;
b. FX (x) = FY (x) for every x in the relevant support.
Proof.
(Outline.) Clearly here a. implies b. by construction. The reverse is
proved by showing that if the two distributions are identical, they also
share a probability function defined for some sigma algebra B of S.
Transforming random variables
• Sometimes one wants to apply a transformation g (·) to a
random variable X.
Y = g (X)
Theorem 10
Cumulative Distribution of Transformed Random Variables.
Let X and Y = g (X) be two random variables that are related by a
transformation g (·), X and Y their respective supports, and FX (x) the
cumulative distribution of X.
a. If g (·) is increasing in X, it is FY (y) = FX (g −1 (y)) for all y ∈ Y.
Proof.
(Continues. . . )
Cumulative transformed distributions (2/2)
Proof.
(Continued.) This is almost tautological: a. is shown as:
$$F_Y(y) = \int_{-\infty}^{g^{-1}(y)} f_X(x)\, dx = F_X\left(g^{-1}(y)\right)$$
Theorem 11
Density of Transformed Random Variables (simple). Let X and
Y = g (X) be two random variables related by a transformation g (·),
X and Y their respective supports, and fX (x) the probability density
function of X, which is continuous on X. If the inverse of the transfor-
mation function, g −1 (·), is continuously differentiable on Y, the prob-
ability density function of Y can be calculated as follows.
$$f_Y(y) = \begin{cases} f_X\left(g^{-1}(y)\right) \left|\dfrac{d}{dy} g^{-1}(y)\right| & \text{if } y \in Y \\ 0 & \text{if } y \notin Y \end{cases}$$
Proof.
(Continues. . . )
Transformed density functions: simple (2/2)
Proof.
(Continued.) A monotone transformation is either increasing or decreasing;
hence, since g −1 (·) is continuously differentiable on Y, for all y ∈ Y:
$$f_Y(y) = \frac{d}{dy} F_Y(y) = \begin{cases} f_X\left(g^{-1}(y)\right) \dfrac{d}{dy} g^{-1}(y) & \text{if } g(\cdot) \text{ is increasing} \\ -f_X\left(g^{-1}(y)\right) \dfrac{d}{dy} g^{-1}(y) & \text{if } g(\cdot) \text{ is decreasing} \end{cases}$$
[Figure] Note: cumulative distribution function FX (x) on the left, density function fX (x) on the right.
Example: uniform-to-exponential (2/2)
Next, apply to X the transformation Y = − log X: this returns
the exponential distribution with unit parameter. Notice
that the inverse transformation is X = exp (−Y ), and the support of
Y is Y = R+; Y has c.d.f. FY (y) = 1 − exp (−y),
while its p.d.f. is fY (y) = exp (−y), both defined for y > 0.
[Figure] Note: cumulative distribution function FY (y) on the left, density function fY (y) on the right.
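A simulation sketch of this transformation (standard library only): draws from the standard uniform are mapped through Y = − log X, and the empirical c.d.f. is compared with FY (y) = 1 − exp(−y).

```python
import math
import random

random.seed(0)
n = 100_000

# Draw X uniform on (0, 1] (avoiding X = 0) and apply Y = -log X.
ys = [-math.log(1.0 - random.random()) for _ in range(n)]

# Compare the empirical c.d.f. of Y with F_Y(y) = 1 - exp(-y).
for y in [0.5, 1.0, 2.0, 4.0]:
    empirical = sum(v <= y for v in ys) / n
    print(y, round(empirical, 3), round(1 - math.exp(-y), 3))
```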
Transformed density functions: composite
Theorem 12
Density of Transformed Random Variables (composite). Let X
and Y = g (X) be two random variables related by some transforma-
tion g (·), X and Y their respective supports, and fX (x) the probability
density function of X. Suppose further that there exists a partition of
X’s support, X0, X1, . . . , XK such that $\bigcup_{i=0}^{K} X_i = X$, P (X ∈ X0) = 0,
and fX (x) is continuous on each Xi . Finally, suppose that there is a
sequence of functions g1 (x) , . . . , gK (x), each associated with one set in
X1 , . . . , XK , satisfying the following conditions for i = 1, . . . , K:
i. g (x) = gi (x) for every x ∈ Xi ;
ii. gi (x) is monotone in Xi ;
iii. Y = {y : y = gi (x) for some x ∈ Xi }, that is the image of gi (x)
is always equal to the support of Y ;
iv. gi−1 (y) exists and is continuously differentiable in Y.
Then the density of Y can be calculated as follows.
$$f_Y(y) = \begin{cases} \displaystyle\sum_{i=1}^{K} f_X\left(g_i^{-1}(y)\right) \left|\frac{d}{dy} g_i^{-1}(y)\right| & \text{if } y \in Y \\ 0 & \text{if } y \notin Y \end{cases}$$
Example: squaring the standard normal
Let X follow the standard normal distribution Φ (x), and allow
for the transformation Y = X 2 : this is not monotone in X = R,
but it is decreasing in X1 = R−− , increasing in X2 = R++ , while
in both sets it maps onto Y = R++ . Also, P (X = 0) = 0. Thus:
$$f_Y(y) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(-\sqrt{y})^2}{2}\right) \frac{1}{2\sqrt{y}} + \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(\sqrt{y})^2}{2}\right) \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{y}} \exp\left(-\frac{y}{2}\right)$$
[Figure] Note: cumulative distribution function FY (y) on the left, density function fY (y) on the right.
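A simulation sketch of this example (assuming NumPy and SciPy are available): since the derived density is that of a chi-squared variable with one degree of freedom, the empirical distribution of Y = X² should match chi2(1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate Y = X^2 for X following the standard normal distribution.
y = rng.standard_normal(200_000) ** 2

# The density derived above is the chi-squared density with one degree of
# freedom, so the empirical c.d.f. of Y should be close to chi2(1).
for q in [0.5, 1.0, 2.0, 4.0]:
    print(q, round((y <= q).mean(), 3), round(stats.chi2.cdf(q, df=1), 3))
```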
[Figure]
This figure represents the c.d.f. (left) and the quantile function
(right) of the random variable described in this example.
Cumulative Transformation
Theorem 13
Cumulative Transformation. For any continuous random variable
X with cumulative distribution denoted as FX (x), the transformation
P = FX (X) follows a uniform distribution on the unit interval.
Proof.
By the properties of quantile functions (they are monotone increasing
by definition, etc.), for all p ∈ (0, 1) it holds that:
$$\begin{aligned}
P(P \le p) &= P(F_X(X) \le p) \\
&= P(Q_X[F_X(X)] \le Q_X(p)) \\
&= P(X \le Q_X(p)) \\
&= F_X(Q_X(p)) \\
&= p
\end{aligned}$$
The fourth and fifth lines follow from the definition and continuity of
FX (x). Since by construction FP (p) = 0 for p ≤ 0 and FP (p) = 1 for
p ≥ 1, P follows a uniform distribution on the interval (0, 1).
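A simulation sketch of Theorem 13 (assuming X standard normal and SciPy's norm.cdf as FX): the transformed draws should behave like a uniform sample on (0, 1).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Apply P = F_X(X) with X standard normal, so F_X = Phi.
x = rng.standard_normal(100_000)
p = norm.cdf(x)

# If P is uniform on (0, 1), its empirical c.d.f. at p0 is close to p0.
for p0 in [0.1, 0.25, 0.5, 0.9]:
    print(p0, round((p <= p0).mean(), 3))
```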
Uncentered moments
Definition 18
Uncentered Moments. The r-th uncentered moment of a random
variable X with support X, denoted as E [X r ], is defined as follows for
some positive integer r and for discrete random variables:
$$E[X^r] = \sum_{x \in X} x^r f_X(x)$$
• Note: $\lim_{M \to \infty} \left[-y \exp(-y)\right]_{0}^{M} = 0$ and $\int_{0}^{\infty} \exp(-y)\, dy = 1$.
Example: moments in the exponential case (2/2)
• The second uncentered moment is:
$$E\left[Y^2\right] = \int_{0}^{\infty} y^2 \exp(-y)\, dy = \left[-y^2 \exp(-y)\right]_{0}^{\infty} + 2 \int_{0}^{\infty} y \exp(-y)\, dy = 2$$
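The two integrals can also be checked symbolically (a sketch assuming SymPy is available):

```python
import sympy as sp

y = sp.symbols("y", positive=True)
f = sp.exp(-y)   # p.d.f. of the exponential with unit parameter

# First and second uncentered moments, integrating over the support R+.
m1 = sp.integrate(y * f, (y, 0, sp.oo))
m2 = sp.integrate(y**2 * f, (y, 0, sp.oo))
print(m1, m2)    # 1 and 2, matching the integration by parts above
```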
E [Y ] = E [a + bX]
= a + b E [X]
Theorem 14
Markov’s Inequality. Given a nonnegative random variable X ∈ R+
and a constant k > 0, it must be P [X ≥ k] ≤ E [X] /k.
Proof.
Apply the decomposition
$$E[X] = \int_{0}^{+\infty} x f(x)\, dx \ge \int_{k}^{+\infty} x f(x)\, dx \ge k \int_{k}^{+\infty} f(x)\, dx = k\, P[X \ge k]$$
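A quick Monte Carlo check of the inequality (a sketch assuming NumPy and, for concreteness, an exponential variable with E[X] = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonnegative variable: exponential with unit parameter, so E[X] = 1.
x = rng.exponential(scale=1.0, size=200_000)

for k in [0.5, 1.0, 2.0, 5.0]:
    tail = (x >= k).mean()          # P[X >= k], estimated
    bound = x.mean() / k            # Markov's bound E[X]/k
    print(k, round(tail, 3), round(bound, 3))
```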
Theorem 15
Čebyšëv’s Inequality. Given a random variable Y ∈ R and a number
δ > 0, it must be P [|Y − E [Y ]| ≥ δ] ≤ Var [Y ] /δ 2 .
Proof.
Rephrase Markov’s inequality setting X = (Y − E[Y])² and k = δ²,
and notice that:
$$P\left[|Y - E[Y]| \ge \delta\right] \le P\left[(Y - E[Y])^2 \ge \delta^2\right] \le \frac{E\left[(Y - E[Y])^2\right]}{\delta^2} = \frac{\operatorname{Var}[Y]}{\delta^2}$$
as postulated.
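An analogous Monte Carlo check for Čebyšëv's inequality (a sketch assuming NumPy and a uniform variable on (0, 1), whose variance is 1/12):

```python
import numpy as np

rng = np.random.default_rng(1)

# Y uniform on (0, 1): E[Y] = 1/2 and Var[Y] = 1/12.
y = rng.uniform(size=200_000)
mu, var = y.mean(), y.var()

for d in [0.1, 0.2, 0.4]:
    tail = (np.abs(y - mu) >= d).mean()              # P[|Y - E[Y]| >= delta]
    print(d, round(tail, 3), round(var / d**2, 3))   # tail <= Var[Y]/delta^2
```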
Minimum squared error of prediction (1/3)
• Hence:
$$\min_{\widehat{X}} E\left[\left(X - \widehat{X}\right)^2\right] = \min_{\widehat{X}} \left\{\operatorname{Var}[X] + \left(E[X] - \widehat{X}\right)^2\right\}$$
Definition 20
Moment generating function. Given some random variable X with
support X, the moment-generating function MX (t) is defined, for
t ∈ R, as the expectation of the transformation g (X) = exp (tX), so
long as it exists. For discrete random variables this is:
$$M_X(t) = E[\exp(tX)] = \sum_{x \in X} \exp(tx)\, f_X(x)$$
Theorem 16
Moment generation. If a random variable X has an associated mo-
ment generating function MX (t), its r-th uncentered moment can be
calculated as the r-th derivative of the moment generating function eval-
uated at t = 0.
$$E[X^r] = \left.\frac{d^r M_X(t)}{dt^r}\right|_{t=0}$$
Proof.
Note that, for all r = 1, 2, . . . :
$$\frac{d^r M_X(t)}{dt^r} = \frac{d^r}{dt^r} E[\exp(tX)] = E\left[\frac{d^r}{dt^r} \exp(tX)\right] = E\left[X^r \exp(tX)\right]$$
so long as the r-th derivative with respect to t can pass through the
expectation operator. If so, E [X r exp (tX)] = E [X r ] for t = 0.
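As a quick check of Theorem 16 (a sketch assuming SymPy and the standard result that the exponential distribution with unit parameter has m.g.f. MX (t) = 1/(1 − t) for t < 1):

```python
import sympy as sp

t = sp.symbols("t")

# M.g.f. of the exponential with unit parameter, valid for t < 1.
M = 1 / (1 - t)

# Theorem 16: the r-th uncentered moment is the r-th derivative at t = 0.
for r in range(1, 5):
    print(r, sp.diff(M, t, r).subs(t, 0))   # 1, 2, 6, 24, i.e. r!
```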
Example: m.g.f. for coin experiments
• This m.g.f. only exists for t < 1! In fact, the integral that
defines it diverges if t ≥ 1.
M.g.f. of transformations