
Random Variables

Paolo Zacchia

Probability and Statistics

Lecture 1
Sample Spaces

Statistics deals with uncertain “events.” It is thus necessary to
formalize notions about the probability of such events.

However, one must first characterize what the “events” are. To do
this, mathematical probability borrows from set theory.

Definition 1
Sample Space. The set S collecting all possible outcomes associated
with a certain phenomenon is called the sample space.

A sample space provides, for all possible occurrences of a given
phenomenon, the finest-grained characterization that is possible.
Sample Spaces: examples
Countable sample spaces (can be enumerated)
• Tossing a coin: Scoin = {Head, Tail}

• Exam grades: Sexam = {A, B, C, D, E, F}

• E-mails received: Semails = {0, 1, 2, . . . } = N0

Uncountable sample spaces (cannot be enumerated)

• Individual income: Sincome = R+

• Individual wealth: Swealth = R

• Household wealth: Shousehold = Swealth.1 × Swealth.2 = R²
  (e.g. if a household is composed of two individuals, 1 and 2)
Events

Any combination of a sample space’s elements is an event.

Definition 2
Events. A subset of a sample space S, including S itself, is an event.

Examples:
• Coin tossing: Anull = ∅, Ahead = {Head}, Atail = {Tail},
  Afull = Scoin = {Head, Tail};

• Exam pass/fail: Apassing = {A, B, C}, Afailing = {D, E, F};

• Any bracket – a segment of the (positive) real line – of the
  income or wealth space.
Set properties of events

Standard set operations, such as union (∪), intersection (∩)
and complementation (Ac), extend to events.

Theorem 1
Properties of Events. Let AS, BS and CS be any three events asso-
ciated with the sample space S. The following properties hold.

a. Commutativity: AS ∪ BS = BS ∪ AS
                  AS ∩ BS = BS ∩ AS
b. Associativity: AS ∪ (BS ∪ CS) = (AS ∪ BS) ∪ CS
                  AS ∩ (BS ∩ CS) = (AS ∩ BS) ∩ CS
c. Distributive Laws: AS ∩ (BS ∪ CS) = (AS ∩ BS) ∪ (AS ∩ CS)
                      AS ∪ (BS ∩ CS) = (AS ∪ BS) ∩ (AS ∪ CS)
d. De Morgan’s Laws: (AS ∪ BS)c = ASc ∩ BSc
                     (AS ∩ BS)c = ASc ∪ BSc
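These identities can be checked directly with finite sets; a small sketch (the sample space and the events below are illustrative choices, not from the slides):

```python
# A small sample space (one die roll) and three illustrative events.
S = {1, 2, 3, 4, 5, 6}
A, B, C = {1, 2, 3}, {2, 4, 6}, {3, 4}

def complement(E):
    """Complement of the event E relative to the sample space S."""
    return S - E

# De Morgan's Laws
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# Distributive Laws
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)
```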
Partitions

A sample space can be split between events that cannot happen
at the same time.

Definition 3
Disjoint Events. Two events A1 and A2 are disjoint or mutually
exclusive if A1 ∩ A2 = ∅. The events in a collection A1, A2, . . . are
pairwise disjoint or mutually exclusive if Ai ∩ Aj = ∅ for all pairs
i ≠ j.

Definition 4
Partition. The events in a collection A1, A2, . . . form a partition of
the sample space S if they are pairwise disjoint and ∪_{i=1}^Z Ai = S if
the collection is of finite dimension Z; ∪_{i=1}^∞ Ai = S if the collection
has an infinite number of elements.

Examples: Apassing and Afailing, all brackets of a tax system.
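A minimal sketch of Definition 4, checking pairwise disjointness and coverage for the exam-grades partition:

```python
from itertools import combinations

S_exam = {"A", "B", "C", "D", "E", "F"}
A_passing, A_failing = {"A", "B", "C"}, {"D", "E", "F"}

def is_partition(collection, S):
    """True if the events are pairwise disjoint and their union is S."""
    pairwise_disjoint = all(Ai.isdisjoint(Aj)
                            for Ai, Aj in combinations(collection, 2))
    return pairwise_disjoint and set().union(*collection) == S

assert is_partition([A_passing, A_failing], S_exam)
assert not is_partition([A_passing, {"C", "D", "E", "F"}], S_exam)  # overlap on C
```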


Sigma Algebra
The following concept is central in the axiomatic definition of
probability functions.

Definition 5
Sigma Algebra. Given some set S, a sigma algebra (σ-algebra) or
Borel field is a collection of subsets of S, which is denoted as B, that
satisfies the following properties:
a. ∅ ∈ B;
b. for any subset A ∈ B, it is Ac ∈ B;
c. for any countable sequence of subsets A1, A2, · · · ∈ B, it holds that
   ∪_{i=1}^∞ Ai ∈ B.

Note that properties b. and c. together with De Morgan’s Laws
also imply ∩_{i=1}^∞ Ai ∈ B for any appropriate countable sequence
of subsets.
Sigma Algebra: examples
Countable sample spaces (can be enumerated)

• Trivial σ-algebrae: B = {∅, S}

• Any partition of S, and all possible unions of its elements

Uncountable sample spaces (cannot be enumerated)

• If S = R: B would contain all segments of the kind

[a, b] , (a, b] , [a, b) , (a, b)

for any two a, b ∈ R with a ≤ b, plus all their unions and


intersections.

• If S = RK with K > 1, similarly constructed connected sets.


Sigma Algebra: counterexample

• Suppose that B′ contains all the finite disjoint unions of sets
  of the following form, for any two a, b ∈ R with a ≤ b.

  (−∞, a] , (a, b] , (b, ∞) , ∅, R

• Note that this is based off simple partitions of R.

• However, ∪_{i=1}^∞ (0, (i − 1)/i] = (0, 1) ∉ B′, which contradicts
  the definition of sigma algebra.

• Problem: one cannot characterize events that are obtained
  as the union of more primitive events in B′!
Probability function
The following axiomatic definition is due to A. N. Kolmogorov.

Definition 6
Probability Function. Given a sample space S and an associated
σ-algebra B, a probability function P is a function with domain B
that satisfies the three axioms of probability:
a. P (A) ≥ 0 ∀A ∈ B;
b. P (S) = 1;
c. given a countable sequence of pairwise disjoint subsets written as
   A1, A2, · · · ∈ B, then P (∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P (Ai).

The construction of probability functions proceeds as follows.

• For countable sample spaces: assign a number p (s) ≥ 0 to
  each element s ∈ S, such that Σ_{s∈S} p (s) = 1.

• For uncountable sample spaces: use proper measures.
Example: the naïve dart thrower
Consider dart players who score points depending on how close
they get to the center of the dartboard.

[Figure: a dartboard of concentric rings, labeled with scores 0
(outside the board) through 5 (the center).]

Suppose a player hits every point completely at random. What
is the probability that he scores a given number of points i?
Probability measures on the dartboard
• A probability function for an uncountable sample space!

• This has to be proportional to the measure of the areas of
  each “ring” in the dartboard – and the outside area too.

• Let the distance of each consecutive “ring” from the center
  be (I − i + 1) r, where I is the maximum number of points
  (say 5) and r the distance between two contiguous rings.

• Suppose that the area outside the dartboard “measures” T.
  The probability of scoring zero is P (0) = T / (T + πI²r²).

• The probability of scoring 0 < i ≤ I points is as follows.

  P (i) = πr² [(I + 1 − i)² − (I − i)²] × (T + πI²r²)⁻¹

  where πr² [(I + 1 − i)² − (I − i)²] is the area of the i-th
  “ring” and T + πI²r² is the total area.
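A minimal sketch of this probability function in Python; T, I and r are the free parameters of the example, and the default values below are illustrative:

```python
import math

# Dartboard probability function: P(0) is the outside area over the
# total area; P(i) is the area of the i-th ring over the total area.
def dart_pmf(T=10.0, I=5, r=1.0):
    total_area = T + math.pi * I**2 * r**2
    p = {0: T / total_area}
    for i in range(1, I + 1):
        ring_area = math.pi * r**2 * ((I + 1 - i)**2 - (I - i)**2)
        p[i] = ring_area / total_area
    return p

p = dart_pmf()
assert abs(sum(p.values()) - 1.0) < 1e-12   # the axioms require total mass 1
assert p[5] < p[4] < p[3]                   # inner rings are smaller targets
```

The ring areas telescope to πI²r², so the probabilities add up to one for any choice of T, I and r.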
Properties of probability functions (1/4)

Theorem 2
Properties of Probability Functions (1). If P is some probability
function and A is a set in B, the following properties hold:
a. P (∅) = 0;
b. P (A) ≤ 1;
c. P (Ac ) = 1 − P (A).

Proof.
Since A and Ac form a partition of S, it is P (A) + P (Ac) = P (S) = 1,
which proves c.; a. and b. then follow.
Properties of probability functions (2/4)
Theorem 3
Properties of Probability Functions (2). If P is some probability
function and A, B are sets in B, the following properties hold:
a. P (B ∩ Ac ) = P (B) − P (A ∩ B);
b. P (A ∪ B) = P (A) + P (B) − P (A ∩ B);
c. if A ⊂ B, it is P (A) ≤ P (B).

Proof.
To prove a. note that B can be expressed as the union of two disjoint
sets B = {B ∩ A} ∪ {B ∩ Ac }, thus P (B) = P (B ∩ A) + P (B ∩ Ac ). To
show b. decompose the union of A and B as A ∪ B = A ∪ {B ∩ Ac } –
again two disjoint sets; hence by a. the following holds.

P (A ∪ B) = P (A) + P (B ∩ Ac ) = P (A) + P (B) − P (A ∩ B)

Finally, c. follows from a. as A ⊂ B implies that P (A ∩ B) = P (A), so


P (B ∩ Ac ) = P (B) − P (A) ≥ 0.
Properties of probability functions (3/4)
Theorem 4
Properties of Probability Functions (3). If P is some probability
function, the following properties hold:
a. P (A) = Σ_{i=1}^∞ P (A ∩ Ci) for any A ∈ B as well as any partition
   C1, C2, . . . of the sample space such that Ci ∈ B for all i ∈ N;
b. P (∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ P (Ai) for any sets A1, A2, . . . such that
   Ai ∈ B for all i ∈ N.

Proof.
Regarding a. note that, by the Distributive Laws of events, it is

  A = A ∩ S = A ∩ (∪_{i=1}^∞ Ci) = ∪_{i=1}^∞ (A ∩ Ci)

where the intersection sets of the form A ∩ Ci are pairwise disjoint as
the Ci sets are. Hence, a. follows from the third axiom of probability
functions. (Continues. . . )
Properties of probability functions (4/4)
Theorem 4
Proof.
(Continued.) To establish b. construct another collection of pairwise
disjoint events A*1, A*2, . . . such that ∪_{i=1}^∞ Ai = ∪_{i=1}^∞ A*i and

  P (∪_{i=1}^∞ Ai) = P (∪_{i=1}^∞ A*i) = Σ_{i=1}^∞ P (A*i) ≤ Σ_{i=1}^∞ P (Ai)

where the second equality follows from the pairwise disjoint property.
Such additional collection of events can be obtained as:

  A*1 = A1 ,  A*i = Ai ∩ (∪_{j=1}^{i−1} Aj)c = Ai ∩ (∩_{j=1}^{i−1} Ajc)  for i = 2, 3, . . .

that are by construction pairwise disjoint and satisfy ∪_{i=1}^∞ Ai = ∪_{i=1}^∞ A*i.
Furthermore, by construction A*i ⊂ Ai and thus P (A*i) ≤ P (Ai) for
every i, leading to the inequality Σ_{i=1}^∞ P (A*i) ≤ Σ_{i=1}^∞ P (Ai).
Conditional probability

Definition 7
Conditional Probability. Consider a sample space S, an associated
σ-algebra B, and any two events A, B ∈ B such that P (B) > 0. The
conditional probability of A given B is written as P (A | B) and is
defined as follows.

  P (A | B) = P (A ∩ B) / P (B)

• Conditional probability is about defining the probability for
  an event A when restricting (re-defining) the sample space
  to a specific event B.

• To help intuition, we speak about “the probability of A when
  B is fixed.”
Conditional probability: examples (1/3)
• Let us return to the grades example and suppose that:

  P (A) = P (B) = P (C) = 0.3
  P (D) = P (F ) = 0.05
  P (E) = 0

• . . . hence, it is P (passing) = 0.9 and P (failing) = 0.1.

• Note that A ⊂ Apassing, so P (A ∩ passing) = P (A).

• Hence:

  P (A | passing) = P (A ∩ passing) / P (passing) = 0.3 / 0.9 = 1/3

  and similarly P (D | failing) = P (F | failing) = 0.5.

Conditional probability: examples (2/3)
• Let us return to the “naïve dart thrower” example.

• Since it is P (i > 0) = πI²r² / (T + πI²r²), it is:

  P (i | i > 0) = P (i ∩ i > 0) / P (i > 0)
               = I⁻² [(I + 1 − i)² − (I − i)²]

  i.e. the probability of scoring i, conditional on hitting the
  dartboard.

• Suppose that the player learns to consistently score at least
  3 points (with 3 < I): the new probabilities are conditional.

  P (i | i > 2) = P (i ∩ i > 2) / P (i > 2)
               = [(I + 1 − i)² − (I − i)²] / (I − 2)²
Conditional probability: examples (3/3)
• After the partial takeup of a preemptive medical treatment:

  P (taker ∩ healthy) = 0.40
  P (taker ∩ sick) = 0.20
  P (hesitant ∩ healthy) = 0.15
  P (hesitant ∩ sick) = 0.25

• . . . and therefore, it is P (taker) = 0.60, P (hesitant) = 0.40,
  P (healthy) = 0.55 and P (sick) = 0.45.

• It is easier to find a sick person among those who hesitated
  than among those who took the treatment.

  P (sick | taker) = P (taker ∩ sick) / P (taker) = 1/3
  P (sick | hesitant) = P (hesitant ∩ sick) / P (hesitant) = 5/8
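As a sketch, the conditional probabilities above can be computed directly from the joint table, using exact fractions to avoid rounding:

```python
from fractions import Fraction as F

# Joint probabilities from the treatment example, as exact fractions.
joint = {("taker", "healthy"): F(40, 100), ("taker", "sick"): F(20, 100),
         ("hesitant", "healthy"): F(15, 100), ("hesitant", "sick"): F(25, 100)}

def marginal(value, axis):
    """Marginal probability of one coordinate of the joint table."""
    return sum(p for key, p in joint.items() if key[axis] == value)

def conditional(health, group):
    """P(health | group) = P(group ∩ health) / P(group)."""
    return joint[(group, health)] / marginal(group, 0)

assert conditional("sick", "taker") == F(1, 3)
assert conditional("sick", "hesitant") == F(5, 8)
```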
Bayes’ Rule

• One can recast the definition of conditional probability by
  swapping A and B.

  P (B | A) = P (B ∩ A) / P (A)

• Combining the above with the reverse expression delivers
  the following result (again A and B can be swapped).

  P (A | B) = P (B | A) P (A) / P (B)

• This is known as Bayes’ Rule: a powerful expression that
  relates conditional probabilities to one another.
Bayes’ Theorem

Bayes’ Rule extends to multiple pairwise disjoint events.

Theorem 5
Bayes’ Theorem. Let A1, A2, . . . be a partition of the sample space
S, and B some event B ⊂ S. For i = 1, 2, . . . the following holds.

  P (Ai | B) = P (B | Ai) P (Ai) / Σ_{j=1}^∞ P (B | Aj) P (Aj)

Proof.
This follows from Bayes’ Rule for A = Ai and by observing that:

  P (B) = Σ_{j=1}^∞ P (B ∩ Aj) = Σ_{j=1}^∞ P (B | Aj) P (Aj)

from Theorem 4 and the definition of conditional probability.
Bayes’ Rule: example

• Let us return to the previous example about the imperfect
  takeup of an imperfect preemptive medical treatment.

• A simple application of Bayes’ rule is the following.

  P (taker | sick) = P (sick | taker) P (taker) / P (sick) = 4/9

• Thus, one can calculate the reverse conditional probabilities
  without knowing the probabilities of intersected events!
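A quick sketch of this calculation, again with exact fractions:

```python
from fractions import Fraction as F

# Inputs taken from the example: P(sick | taker), P(taker), P(sick).
p_sick_given_taker = F(1, 3)
p_taker, p_sick = F(60, 100), F(45, 100)

# Bayes' Rule: P(taker | sick) = P(sick | taker) P(taker) / P(sick).
p_taker_given_sick = p_sick_given_taker * p_taker / p_sick
assert p_taker_given_sick == F(4, 9)
```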
Independent events
Definition 8
Statistical independence (two events). Two events A and B are
statistically independent if the following holds.

  P (A ∩ B) = P (A) P (B)

• Note: this implies P (A | B) = P (A)! Fixing B does not affect the
  probability of A.

Definition 9
Mutual statistical independence (multiple events). The events
of any collection A1, A2, . . . , AN are mutually independent if, for
any subcollection Ai1, Ai2, . . . , AiN′ with N′ ≤ N, the following holds.

  P (∩_{j=1}^{N′} Aij) = Π_{j=1}^{N′} P (Aij)
Independence and complementary events
Theorem 6
Independence and Complementary Events. Consider any two
independent events A and B. It can be concluded that the following
pairs of events are independent too:
a. A and Bc ;
b. Ac and B;
c. Ac and Bc .

Proof.
Case a. follows from the definition of independence (second equality):

P (A ∩ Bc ) = P (A) − P (A ∩ B)
= P (A) − P (A) P (B)
= P (A) [1 − P (B)]
= P (A) P (Bc )

Cases b. and c. are analogous.
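Theorem 6 can be verified numerically on a small example; the uniform four-point sample space and the two events below are illustrative choices that are independent by construction:

```python
from fractions import Fraction as F

# A uniform probability function on a four-point sample space.
S = {1, 2, 3, 4}

def P(E):
    return F(len(E), len(S))

# A and B are independent: P(A ∩ B) = 1/4 = P(A) P(B).
A, B = {1, 2}, {1, 3}
Ac, Bc = S - A, S - B

assert P(A & B) == P(A) * P(B)       # A and B independent
assert P(A & Bc) == P(A) * P(Bc)     # case a.
assert P(Ac & B) == P(Ac) * P(B)     # case b.
assert P(Ac & Bc) == P(Ac) * P(Bc)   # case c.
```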


Random variables

• Probability functions are defined for generic σ-algebrae but


sometimes these can be complicated to work with.

• It is often convenient to re-formulate probability functions


so that the function’s domain is a subset of R.

Definition 10
Random Variables. A random variable X is a function from the
sample space S to the set of real numbers, X : S → R.

• They are denoted with upper case italic letters like X.

• A specific realization of a random variable (e.g. the value


of X that occurs, or is presumed to occur in the real world)
is denoted with a lower case italic letter like x.
Random variables: examples
• Xcoin is a random variable mapping {Head, Tail} → {0, 1}
  or vice versa, or any other mapping one may find suitable.

• Xcoin has two realizations.

  xcoin = 1 if Tail
  xcoin = 0 if Head

• Xgrade is a random variable mapping {A, B, C, D, E, F} to,
  say, numeric scores for GPA calculation.

• Xemails maps the sample space N0 onto itself.

• Xincome also maps the sample space R+ onto itself.

• Xwealth likewise maps the sample space R onto itself.

Probability distributions
• Random variables re-formulate the sample space, but what
about the probability functions?

Definition 11
Cumulative Probability Distribution. Given a random variable
X, a cumulative (probability) distribution function (typically
abbreviated as c.d.f.) is a function FX (x) which is defined as follows.

FX (x) = P (X ≤ x) for all x ∈ R

• Conventionally, the subscript X in FX associates the c.d.f.


to the random variable X.

• The domain of a c.d.f. is typically the map of the union of


multiple events in the original sample space.
Example: two coins (1/3)
• Suppose we are tossing not one, but two coins. Thus:

  S2.coins = {Head & Head, Head & Tail, Tail & Head, Tail & Tail}

  where & separates the outcomes of the first vs. the second coin.

• The random variable X2.coins ∈ {0, 1, 2} ⊂ R here counts the
  number of “tails.”

  X (Head & Head) = 0
  X (Head & Tail) = 1
  X (Tail & Head) = 1
  X (Tail & Tail) = 2
Example: two coins (2/3)
• Suppose that the coins are balanced: at every attempt, the
  chances to get tail or head are equal.

• Thus all elements in S2.coins have equal probability!

• It follows that P (X2.coins = 0) = P (X2.coins = 2) = 0.25 and
  P (X2.coins = 1) = 0.50.

• The associated c.d.f. is the following.

  FX2.coins (x) = 0    if x ∈ (−∞, 0)
                 .25  if x ∈ [0, 1)
                 .75  if x ∈ [1, 2)
                 1    if x ∈ [2, ∞)
Example: two coins (3/3)

• This c.d.f. is represented graphically below.

• Observe the “discrete steps” when a new tail count is hit.

[Figure: the step-function c.d.f. FX2.coins, equal to 0 before
x = 0, jumping to .25 at x = 0, to .75 at x = 1, and to 1 at
x = 2.]
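The whole construction can be reproduced in a short sketch, enumerating the sample space directly:

```python
from itertools import product

# Enumerate the sample space of two coin tosses; all four outcomes are
# equally likely when the coins are balanced.
S = list(product(["Head", "Tail"], repeat=2))

def X(outcome):
    """The random variable X2.coins: the number of tails."""
    return outcome.count("Tail")

pmf = {x: sum(1 for s in S if X(s) == x) / len(S) for x in range(3)}

def cdf(x):
    """F_X(x) = P(X <= x), a step function."""
    return sum(p for value, p in pmf.items() if value <= x)

assert pmf == {0: 0.25, 1: 0.50, 2: 0.25}
assert (cdf(-1), cdf(0), cdf(1), cdf(2)) == (0, 0.25, 0.75, 1.0)
```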
Properties of cumulative distributions

Theorem 7
Properties of Probability Distribution Functions. A function
F (x) can be a (cumulative) probability distribution function if and only
if the following three conditions hold:
a. limx→−∞ F (x) = 0 and limx→∞ F (x) = 1;
b. F (x) is a nondecreasing function of x;
c. F (x) is right-continuous, that is limx↓x0 F (x) = F (x0 ) ∀x0 ∈ R.

Proof.
(Outline.) Necessity follows directly from the definition of Probability
Functions. Sufficiency requires some reverse engineering, showing how
for each Probability Distribution Function with the above properties,
one can find an appropriate sample space S, an associated probability
function P and a relative random variable X.
Discrete and continuous random variables
Definition 12
Types of Random Variables. A random variable X is continuous
if FX (x) is a continuous function of x, while it is discrete if FX (x) is
a step function of x.
Examples:
• Discrete: X2.coins, Xgrades and similar ones;

• Continuous: the standard logistic distribution;

  FX (x) = Λ (x) = 1 / [1 + exp (−x)]

• Continuous: the standard normal distribution.

  FX (x) = Φ (x) = ∫_{−∞}^{x} (1/√(2π)) exp (−t²/2) dt
Standard logistic and normal distributions

[Figure: the c.d.f.s Λ (x) and Φ (x), plotted for x ∈ [−5, 5].]

The standard logistic assigns a higher probability, relative to the
standard normal, to realizations that are more distant from zero.
Mass and density functions
Definition 13
Probability Mass Function. Given a discrete random variable X,
its probability mass function fX (x) (which is often abbreviated as
p.m.f.) is defined as follows.

fX (x) = P (X = x) for all x ∈ R

Definition 14
Probability Density Function. Given a continuous random variable
X, its probability density function fX (x) (which is often abbreviated
as p.d.f.) is defined as the function that satisfies the following rela-
tionship.

  FX (x) = ∫_{−∞}^{x} fX (t) dt    for all x ∈ R

Note: if the c.d.f. of a continuous random variable is differentiable
everywhere in R, the associated density function is fX (x) = dFX (x)/dx.
Support of random variables

Definition 15
Support of a random variable. Given a random variable X which
is either discrete or continuous, its support X is defined as the set

X ≡ {x : x ∈ R, fX (x) > 0}

where fX (x) is the probability mass or density function associated with


X, as appropriate.

In general:
• the support of discrete random variables is a countable
set (corresponding with a countable sample space);
• whereas the support of continuous random variables is an
uncountable set (thus corresponding with an uncountable
sample space).
Mass function and support
• As the support of a discrete random variable is countable,
  a p.m.f. has an easy interpretation as a transposition of the
  underlying probability function.

• Thus:

  P (a < X ≤ b) = FX (b) − FX (a) = Σ_{a<t≤b} fX (t)

  hence:

  P (X ≤ b) = FX (b) = Σ_{t=inf X}^{b} fX (t)

  and:

  P (X ∈ X) = Σ_{t∈X} fX (t) = 1

  which connects a p.m.f. with the c.d.f. FX (x).

Example: two coins, revisited
The p.m.f. associated with X2.coins is as follows (with figure).

  fX2.coins (x) = .25  if x = 0
                 .50  if x = 1
                 .25  if x = 2
                 0    otherwise

[Figure: the p.m.f. fX2.coins, with bars of height .25 at x = 0,
.50 at x = 1, and .25 at x = 2.]
Density function and support
• For density functions instead the support is an uncountable
  set, and the interpretation of fX (x) ≥ 0 is subtler.

• This is no probability: x has “measure zero” in the support.

• In this case, the definition of c.d.f. implies:

  P (a ≤ X ≤ b) = FX (b) − FX (a) = ∫_{a}^{b} fX (t) dt

  hence:

  P (X ∈ X) = ∫_X fX (t) dt = 1

  a probabilistic interpretation for segments of R.

• Unlike mass functions, density functions can generally take
  values larger than one: their probabilistic interpretation is
  based on the above integral formulations.
Standard logistic and normal densities (1/2)

• The density function associated with the standard logistic
  distribution Λ (x) is:

  λ (x) = dΛ (x)/dx = exp (−x) / [1 + exp (−x)]²

• and the density function corresponding with the standard
  normal distribution is as follows.

  φ (x) = dΦ (x)/dx = (1/√(2π)) exp (−x²/2)

• More general (i.e. non-standard) parameterized versions
  of these distributions exist.
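As a numerical sketch, one can check that λ (x) indeed matches the derivative of Λ (x), here approximated by a central difference:

```python
import math

Lambda = lambda x: 1 / (1 + math.exp(-x))             # logistic c.d.f.
lam = lambda x: math.exp(-x) / (1 + math.exp(-x))**2  # logistic p.d.f.

# The density should match a central-difference derivative of the c.d.f.
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (Lambda(x + h) - Lambda(x - h)) / (2 * h)
    assert abs(numeric - lam(x)) < 1e-8

# At zero the logistic density is 1/4, below the normal's 1/sqrt(2*pi).
phi = lambda x: math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
assert lam(0) == 0.25 and lam(0) < phi(0)
```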
Standard logistic and normal densities (2/2)

[Figure: the densities λ (x) and φ (x), plotted for x ∈ [−5, 5],
with the areas under each curve over [1, 3] shaded.]

• Note: the shaded areas represent the probability that x falls
  in the [1, 3] interval for either distribution.
Properties of mass and density functions

Theorem 8
Properties of mass and density functions. A function fX (x) is
an appropriate probability mass or density function of a given random
variable X if and only if:
a. fX (x) ≥ 0 for all x ∈ R;
b. Σ_{x∈X} fX (x) = 1 or ∫_X fX (x) dx = 1, respectively for mass and
   density functions.

Proof.
(Outline.) Necessity follows directly by the definitions of c.d.f., p.m.f.
and p.d.f.; sufficiency follows by Theorem 7 after having constructed
the associated cumulative distribution FX (x).
Mixing discrete and continuous
• Certain distributions are continuous in some parts of their
  support, and discrete in other parts.

• A unified treatment of these distributions is possible using
  measure theory.

• An example is a truncated standard normal distribution.

  Φ≥0 (x) = 0      if x < 0
            Φ (x)  if x ≥ 0

  Interpretation: negative realizations of x are not measured.

• For such distributions, the definitions of mass and density
  functions are relevant only in subsets of the support.
Truncated standard normal distribution

[Figure: the c.d.f. Φ≥0 (x), plotted for x ∈ [−5, 5]: zero for
x < 0, jumping to 0.5 at x = 0, and following Φ (x) thereafter.]

• Note: it is P (X = 0) = 0.5 and P (X < 0) = 0.

Identical distributions
Definition 16
Identically Distributed Random Variables. Any pair of random
variables X and Y sharing a sample space S and an associated sigma
algebra B are said to be identically distributed if, for every event
A ∈ B, it is P (X ∈ X (A)) = P (Y ∈ Y (A)).

Theorem 9
Identical Distribution. Given two random variables X and Y whose
primitive sample space is a subset of the real numbers S ⊆ R, the fol-
lowing two statements are equivalent:
a. X and Y are identically distributed;
b. FX (x) = FY (x) for every x in the relevant support.

Proof.
(Outline.) Clearly here a. implies b. by construction. The reverse is
proved by showing that if the two distributions are identical, they also
share a probability function defined for some sigma algebra B of S.
Transforming random variables
• Sometimes one wants to apply a transformation g (·) to a
random variable X.
Y = g (X)

• This is interpreted as follows.

P (Y ∈ [a, b]) = P (g (X) ∈ [a, b])

• Interpretation is clearer if g (·) is invertible in the interval of


interest – say [a, b], (a, b], [a, b) or (a, b).
 
P (Y ∈ [a, b]) = P X ∈ g −1 ([a, b])

• Transformations may imply a change in support; e.g. if X


has support X = R, Y = exp (X) has support Y = R++ .
Transformations: discrete vs. continuous

• When X is discrete, it is easy to calculate the distribution
  of its transformations.

• For every point y ∈ Y (where Y is the support of Y ):

  fY (y) = fX (g⁻¹ (y))

• . . . and thus the cumulative distribution for Y would follow.

  FY (y) = Σ_{t=inf Y}^{y} fY (t)

• Continuous distributions entail more complications.
Cumulative transformed distributions (1/2)

Theorem 10
Cumulative Distribution of Transformed Random Variables.
Let X and Y = g (X) be two random variables that are related by a
transformation g (·), X and Y their respective supports, and FX (x) the
cumulative distribution of X.
a. If g (·) is increasing in X, it is FY (y) = FX (g⁻¹ (y)) for all y ∈ Y.
b. If g (·) is decreasing in X and X is a continuous random variable,
   it is FY (y) = 1 − FX (g⁻¹ (y)) for all y ∈ Y.

Proof.
(Continues. . . )
Cumulative transformed distributions (2/2)

Proof.
(Continued.) This is almost tautological: a. is shown as:

  FY (y) = ∫_{−∞}^{g⁻¹(y)} fX (x) dx = FX (g⁻¹ (y))

because an increasing function applied upon some interval preserves its
order. The demonstration of b. is symmetric:

  FY (y) = ∫_{g⁻¹(y)}^{∞} fX (x) dx = 1 − FX (g⁻¹ (y))

because a decreasing function applied upon an interval would invert
its order and because ∫_{−∞}^{a} fX (x) dx + ∫_{a}^{∞} fX (x) dx = 1 if
fX (x) is a density function.
Transformed density functions: simple (1/2)

Theorem 11
Density of Transformed Random Variables (simple). Let X and
Y = g (X) be two random variables related by a transformation g (·),
X and Y their respective supports, and fX (x) the probability density
function of X, which is continuous on X. If the inverse of the transfor-
mation function, g⁻¹ (·), is continuously differentiable on Y, the prob-
ability density function of Y can be calculated as follows.

  fY (y) = fX (g⁻¹ (y)) |d g⁻¹ (y)/dy|  if y ∈ Y
           0                            if y ∉ Y

Proof.
(Continues. . . )
Transformed density functions: simple (2/2)

Theorem 11
Proof.
(Continued.) Increasing and decreasing functions are monotone;
hence, since g⁻¹ (·) is continuously differentiable on Y, for all y ∈ Y:

  fY (y) = d FY (y)/dy
         = fX (g⁻¹ (y)) · (d g⁻¹ (y)/dy)   if g (·) is increasing
           −fX (g⁻¹ (y)) · (d g⁻¹ (y)/dy)  if g (·) is decreasing

by Theorem 10 and the chain rule.

Example: uniform-to-exponential (1/2)
Consider the uniform distribution on the unit interval: a
random variable X with support X = [0, 1], c.d.f.

  FX (x) = 0  if x ∈ (−∞, 0]
           x  if x ∈ (0, 1)
           1  if x ∈ [1, ∞)

and p.d.f. fX (x) = 1 [x ∈ [0, 1]].

[Figure: cumulative distribution function FX (x) on the left,
density function fX (x) on the right.]
Example: uniform-to-exponential (2/2)
Next, apply to X the transformation Y = − log X: this returns
the exponential distribution with unit parameter. Notice
that the inverse transformation is X = exp (−Y ), the support of
Y is Y = R+; Y has c.d.f.

  FY (y) = 1 − exp (−y)

while its p.d.f. is fY (y) = exp (−y), both defined for y > 0.

[Figure: cumulative distribution function FY (y) on the left,
density function fY (y) on the right.]
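This transformation is easy to verify by simulation; a minimal Monte Carlo sketch, drawing uniforms and comparing the empirical distribution of − log X with the exponential c.d.f.:

```python
import math, random

random.seed(0)
n = 200_000
# 1 - random.random() lies in (0, 1], so the logarithm is always defined.
ys = [-math.log(1.0 - random.random()) for _ in range(n)]

# The empirical c.d.f. of Y = -log(X) should approach 1 - exp(-y).
for y in (0.5, 1.0, 2.0):
    empirical = sum(1 for v in ys if v <= y) / n
    assert abs(empirical - (1 - math.exp(-y))) < 0.01
```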
Transformed density functions: composite
Theorem 12
Density of Transformed Random Variables (composite). Let X
and Y = g (X) be two random variables related by some transforma-
tion g (·), X and Y their respective supports, and fX (x) the probability
density function of X. Suppose further that there exists a partition of
X’s support, X0, X1, . . . , XK such that ∪_{i=0}^K Xi = X, P (x ∈ X0) = 0,
and fX (x) is continuous on each Xi. Finally, suppose that there is a
sequence of functions g1 (x), . . . , gK (x), each associated with one set in
X1, . . . , XK, satisfying the following conditions for i = 1, . . . , K:
i. g (x) = gi (x) for every x ∈ Xi;
ii. gi (x) is monotone in Xi;
iii. Y = {y : y = gi (x) for some x ∈ Xi}, that is the image of gi (x)
     is always equal to the support of Y ;
iv. gi⁻¹ (y) exists and is continuously differentiable in Y.
Then the density of Y can be calculated as follows.

  fY (y) = Σ_{i=1}^K fX (gi⁻¹ (y)) |d gi⁻¹ (y)/dy|  if y ∈ Y
           0                                        if y ∉ Y
Example: squaring the standard normal
Let X follow the standard normal distribution Φ (x), and allow
for the transformation Y = X²: this is not monotone in X = R,
but it is decreasing in X1 = R−−, increasing in X2 = R++, while
in both sets it maps onto Y = R++. Also, P (X = 0) = 0. Thus:

  fY (y) = (1/√(2π)) exp (−(−√y)²/2) · |−1/(2√y)|
           + (1/√(2π)) exp (−(√y)²/2) · (1/(2√y))
         = (1/√(2π)) (1/√y) exp (−y/2)

where in the first line:

• the first term accounts for X1 = R−− with g1⁻¹ (y) = −√y;

• the second term accounts for X2 = R++ with g2⁻¹ (y) = √y.
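A Monte Carlo sketch of this result: squaring standard normal draws and comparing empirical interval probabilities with the derived density, integrated numerically by the midpoint rule:

```python
import math, random

random.seed(1)
n = 200_000
ys = [random.gauss(0.0, 1.0)**2 for _ in range(n)]

# The density derived above: f_Y(y) = exp(-y/2) / sqrt(2*pi*y), y > 0.
def f_Y(y):
    return math.exp(-y / 2) / math.sqrt(2 * math.pi * y)

# Compare empirical interval probabilities with midpoint-rule integrals.
for a, b in ((0.5, 1.0), (1.0, 2.0), (2.0, 4.0)):
    empirical = sum(1 for y in ys if a <= y <= b) / n
    m = 1000
    step = (b - a) / m
    integral = sum(f_Y(a + (j + 0.5) * step) * step for j in range(m))
    assert abs(empirical - integral) < 0.01
```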
The chi-squared distribution with one d.f.
The distribution obtained by squaring the standard normal is a
specific kind of a chi-squared (χ²) distribution, that with one
degree of freedom. Its c.d.f. obtains by integrating the p.d.f.:

  FY (y) = ∫_{0}^{y} (1/√(2π)) (1/√t) exp (−t/2) dt.

[Figure: cumulative distribution function FY (y) on the left,
density function fY (y) on the right.]

• Chi-squared distributions are central in statistical inference.

Quantile functions
Definition 17
Quantile Function. The quantile function associated with a random
variable X is the following function with argument p ∈ (0, 1).

QX (p) = inf {x ∈ X : p ≤ FX (x)}

• QX (p) corresponds with the inverse of FX (x) if the latter


is strictly increasing.

• otherwise (if FX (x) is flat on segments of the support of X)


the quantile function returns a “pseudo-inverse” that has the
property

P (QX [FX (X)] ≤ QX (p)) = P (X ≤ QX (p))

for all p ∈ (0, 1), by construction.


Example: non-strictly-monotonic c.d.f. (1/2)
• Consider a random variable with the following distribution.

  FX (x) = [1 + exp (−x)]⁻¹      if x ∈ (−∞, 0]
           0.5                   if x ∈ (0, 3)
           [1 + exp (−x + 3)]⁻¹  if x ∈ [3, ∞)

• This is similar to the standard logistic, but it is “stretched”
  over the (0, 3) interval of the support, where it is flat.

• The corresponding quantile function is as follows.

  QX (p) = log (p) − log (1 − p)      if p ∈ (0, 0.5]
           log (p) − log (1 − p) + 3  if p ∈ (0.5, 1)

• At p = 0.5 the quantile function is discontinuous and equals
  zero: the infimum of the interval where the c.d.f. is flat.
Example: non-strictly-monotonic c.d.f. (2/2)

[Figure] This figure represents the c.d.f. (left) and the quantile
function (right) of the random variable described in this example.
Cumulative Transformation
Theorem 13
Cumulative Transformation. For any continuous random variable
X with cumulative distribution denoted as FX (x), the transformation
P = FX (X) follows a uniform distribution on the unit interval.
Proof.
By the properties of quantile functions (they are monotone increasing
by definition, etc.), for all p ∈ (0, 1) it holds that:

  P (P ≤ p) = P (FX (X) ≤ p)
            = P (QX [FX (X)] ≤ QX (p))
            = P (X ≤ QX (p))
            = FX (QX (p))
            = p

The fourth and fifth equalities follow from the definition and continuity
of FX (x). Since by construction FP (p) = 0 for p ≤ 0 and FP (p) = 1 for
p ≥ 1, P follows a uniform distribution on the interval (0, 1).
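Theorem 13 can be illustrated by simulation: applying the standard normal c.d.f. Φ (available in closed form via the error function) to standard normal draws should produce approximately uniform values:

```python
import math, random

random.seed(2)

# Standard normal c.d.f. via the error function.
def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n = 100_000
ps = [Phi(random.gauss(0.0, 1.0)) for _ in range(n)]

# P = Phi(X) should be uniform: its empirical c.d.f. is near the identity.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    assert abs(sum(1 for v in ps if v <= p) / n - p) < 0.01
```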
Uncentered moments
Definition 18
Uncentered Moments. The r-th uncentered moment of a random
variable X with support X, denoted as E [X^r], is defined as follows for
some positive integer r and for discrete random variables:

  E [X^r] = Σ_{x∈X} x^r fX (x)

and as follows in the case of continuous random variables.

  E [X^r] = ∫_X x^r fX (x) dx

Note: by definition, it is always E [X⁰] = 1.

• The most important uncentered moment is that for r = 1:
  E [X]. It is called the mean.

• It is the “central” value of X in a probabilistic sense.

Centered moments
Definition 19
Centered Moments. The r-th centered moment of a random variable
X with support X, denoted as E [(X − E [X])^r], is defined as follows for
some positive integer r and for discrete random variables:

  E [(X − E [X])^r] = Σ_{x∈X} (x − E [X])^r fX (x)

and as follows in the case of continuous random variables.

  E [(X − E [X])^r] = ∫_X (x − E [X])^r fX (x) dx

• The most important centered moment is the one for r = 2:
  Var [X] ≡ E [(X − E [X])²]. It is called the variance.

• This is a measure of “dispersion” of the distribution of X
  around the mean. It is never negative.
Skewness
• The standardized centered moment of order r = 3 is called
  skewness.

  Skew [X] ≡ E [(X − E [X])³] / (E [(X − E [X])²])^{3/2} ⋛ 0

• It measures the degree of asymmetry of a distribution. It is
  a positive number for asymmetric distributions “skewed” to
  the right of the mean.

• At the same time, it is a negative number for asymmetric
  distributions “skewed” to the left of the mean.

• It equals exactly zero for distributions that are symmetric
  around the mean.
Kurtosis
• The standardized centered moment of order r = 4 is called
  kurtosis.

  Kurt [X] ≡ E [(X − E [X])⁴] / (E [(X − E [X])²])² ≥ 0

• Like the variance, it is always nonnegative.

• It measures the overall “thickness” of the tails of the distri-
  bution – the relative frequency of realizations of X that are
  distant from the mean.

• The kurtosis gives more weight to extreme deviations from
  the mean than the variance does.
From centered to uncentered moments
• Observe that the variance can be conveniently expressed in
  terms of uncentered moments.

  Var [X] = E [(X − E [X])²]
          = E [X²] − 2 E [X] E [X] + E [X]²
          = E [X²] − E [X]²

• This is true more generally for all centered moments.

  E [(X − E [X])^r] = Σ_{i=0}^{r} (r choose i) (−1)^{r−i} E [X^i] E [X]^{r−i}

• This is convenient: uncentered moments can be calculated
  via moment generating or characteristic functions (more on
  this soon).
Example: tossing an unbalanced coin
• Consider an unbalanced coin: let x_coin = 1 for Head while
  x_coin = 0 for Tail, with f_X_coin(1) = 0.6 and f_X_coin(0) = 0.4.

• The mean is calculated as:

    E[X_coin] = 1 · f_X_coin(1) + 0 · f_X_coin(0)
              = 1 · 0.6 + 0 · 0.4
              = 0.6

• . . . whereas the variance is calculated as follows.

    Var[X_coin] = (1 − E[X_coin])^2 · f_X_coin(1) + (0 − E[X_coin])^2 · f_X_coin(0)
                = (0.4)^2 · 0.6 + (−0.6)^2 · 0.4
                = 0.24
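The two calculations above are easy to replicate numerically; a quick Python check (not part of the original derivation — the p.m.f. values are those of the example):

```python
# Mean and variance of the unbalanced coin: X = 1 for Head (prob. 0.6),
# X = 0 for Tail (prob. 0.4).
pmf = {1: 0.6, 0: 0.4}

mean = sum(x * p for x, p in pmf.items())                # E[X] = 0.6
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # Var[X] = 0.24

print(mean, round(var, 10))
```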
Example: tossing many unbalanced coins (1/3)
• Now let X_n.coins be the random variable which counts the
  outcome of n such experiments with an unbalanced coin.

• In total, there are 2^n possible sequences of outcomes; the
  interest though falls on the total number of heads.

• The outcomes that deliver exactly x heads are counted via
  the binomial coefficient (n choose x) = n! / (x! (n − x)!).

• For each such outcome, a head still occurs with probability
  0.6 and a tail with probability 0.4.

• The p.m.f. for X_n.coins is thus as follows.

    f_X_n.coins(x) = (n choose x) · 0.6^x · 0.4^{n−x}
                   = n! / (x! (n − x)!) · 0.6^x · 0.4^{n−x}
Example: tossing many unbalanced coins (2/3)
The mean of X_n.coins is calculated as follows.

    E[X_n.coins] = Σ_{x=0}^{n} x (n choose x) · 0.6^x · 0.4^{n−x}
                 = Σ_{x=1}^{n} n (n−1 choose x−1) · 0.6^x · 0.4^{n−x}
                 = 0.6 · n · Σ_{x=1}^{n} (n−1 choose x−1) · 0.6^{x−1} · 0.4^{(n−1)−(x−1)}
                 = 0.6 · n · Σ_{y=0}^{n−1} (n−1 choose y) · 0.6^y · 0.4^{n−1−y}
                 = 0.6 · n

Note: in the fourth line, it is y = x − 1; the sum there equals 1,
since it adds up the p.m.f. of a binomial distribution with n − 1 trials.


Example: tossing many unbalanced coins (3/3)
The second uncentered moment of X_n.coins thus obtains as:

    E[X_n.coins^2] = Σ_{x=0}^{n} x^2 (n choose x) · 0.6^x · 0.4^{n−x}
                   = Σ_{x=1}^{n} x n (n−1 choose x−1) · 0.6^x · 0.4^{n−x}
                   = n Σ_{y=0}^{n−1} (y + 1) (n−1 choose y) · 0.6^{y+1} · 0.4^{n−1−y}
                   = 0.6 · n Σ_{y=0}^{n−1} y (n−1 choose y) · 0.6^y · 0.4^{n−1−y} +
                     + 0.6 · n Σ_{y=0}^{n−1} (n−1 choose y) · 0.6^y · 0.4^{n−1−y}
                   = 0.6 · n · [0.6 · (n − 1) + 1]

hence, Var[X_n.coins] = E[X_n.coins^2] − E[X_n.coins]^2 = 0.24 · n.
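The closed forms E[X_n.coins] = 0.6 · n and Var[X_n.coins] = 0.24 · n can be checked by summing the p.m.f. directly; a short Python sketch (n = 10 is an arbitrary illustration value):

```python
from math import comb

n, p = 10, 0.6  # n chosen only for illustration
pmf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

m1 = sum(x * w for x, w in pmf.items())      # E[X]   = 0.6 * n = 6.0
m2 = sum(x * x * w for x, w in pmf.items())  # E[X^2] = 0.6n(0.6(n-1)+1) = 38.4
var = m2 - m1 ** 2                           # Var[X] = 0.24 * n = 2.4
```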
Example: moments of the uniform distribution
• Let X follow the uniform distribution on the [0, 1] interval.

• The mean is obtained as:

    E[X] = ∫_0^1 x dx = [x^2/2]_0^1 = 1/2

• while the second uncentered moment is:

    E[X^2] = ∫_0^1 x^2 dx = [x^3/3]_0^1 = 1/3

• thus the variance is as follows.

    Var[X] = E[X^2] − E[X]^2 = 1/3 − 1/4 = 1/12
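A midpoint-rule check of these two integrals, in plain Python (purely a numerical sanity check of the closed forms above):

```python
# Approximate E[X] and E[X^2] for X ~ Uniform[0, 1] with a midpoint sum.
N = 100_000
m1 = m2 = 0.0
for i in range(N):
    x = (i + 0.5) / N  # midpoint of the i-th subinterval
    m1 += x / N
    m2 += x * x / N

var = m2 - m1 ** 2  # should be close to 1/12 ≈ 0.083333
```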
Example: moments in the exponential case (1/2)
• Now let Y = − log(X): therefore, Y follows the exponential
  distribution with unit parameter as examined earlier.

• The mean is obtained as:

    E[Y] = ∫_0^∞ y exp(−y) dy
         = [−y exp(−y)]_0^∞ + ∫_0^∞ exp(−y) dy
         = 1

  where the second line applies integration by parts.

• Note: lim_{M→∞} [−y exp(−y)]_0^M = 0 and ∫_0^∞ exp(−y) dy = 1.
Example: moments in the exponential case (2/2)
• The second uncentered moment is:

    E[Y^2] = ∫_0^∞ y^2 exp(−y) dy
           = [−y^2 exp(−y)]_0^∞ + 2 ∫_0^∞ y exp(−y) dy
           = 2

  Once again, this applies integration by parts and standard
  calculations.

• We can thus derive the variance.

    Var[Y] = E[Y^2] − E[Y]^2 = 2 − 1 = 1
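Both integration-by-parts results can be confirmed numerically; a midpoint-rule sketch, truncating the integral at y = 50 where the integrand is negligible (e^{−50} ≈ 2 · 10^{−22}):

```python
from math import exp

N, top = 200_000, 50.0
h = top / N

m1 = m2 = 0.0
for i in range(N):
    y = (i + 0.5) * h   # midpoint of the i-th subinterval
    w = exp(-y) * h     # density times subinterval width
    m1 += y * w         # -> E[Y]   = 1
    m2 += y * y * w     # -> E[Y^2] = 2

var = m2 - m1 ** 2      # -> Var[Y] = 1
```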
Moments of linear transformations
Let Y = a + bX be a linear transformation of a random variable
X. How are the moments of X and Y related to one another?
• As for the mean:

    E[Y] = E[a + bX] = a + b E[X]

• whereas with regard to the variance it is:

    Var[Y] = E[(Y − E[Y])^2]
           = E[b^2 (X − E[X])^2]
           = b^2 E[(X − E[X])^2]
           = b^2 Var[X]

• These follow from the linear properties of sums/integrals.


Moments of non-linear transformations
What if Y and X are related by a non-linear function Y = g(X)?
• No exact result, but by Jensen’s Inequality one has:

    E[g(X)] ≤ g(E[X]) if g(·) is a concave function;
    E[g(X)] ≥ g(E[X]) if g(·) is a convex function.

• Furthermore, a Taylor approximation around E[X]:

    g(X) ≈ g(E[X]) + g′(E[X]) [X − E[X]]
         = g(E[X]) − g′(E[X]) E[X] + g′(E[X]) X

  suggests E[g(X)] ≈ g(E[X]), which can be a poor approximation
  (Jensen’s Inequality gives the direction of the error); the
  rearrangement of terms in the second line however suggests
  that the following is a good approximation.

    Var[g(X)] ≈ [g′(E[X])]^2 Var[X]
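A quick numerical illustration of the variance approximation, using the hypothetical choice g(x) = exp(x) with X ~ Uniform[0, 1] (chosen because Var[g(X)] then has a closed form):

```python
from math import exp, e

mean_x, var_x = 0.5, 1 / 12            # moments of Uniform[0, 1]

# Delta-method style approximation: Var[g(X)] ≈ g'(E[X])^2 Var[X], g'(x) = exp(x).
approx = exp(mean_x) ** 2 * var_x

# Exact value: Var[e^X] = E[e^(2X)] - E[e^X]^2 = (e^2 - 1)/2 - (e - 1)^2.
exact = (e ** 2 - 1) / 2 - (e - 1) ** 2

# Here the approximation is within roughly 7% of the exact value;
# it improves as Var[X] shrinks.
```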

Markov’s Inequality

Theorem 14
Markov’s Inequality. Given a nonnegative random variable X ∈ R+
and a constant k > 0, it must be P[X ≥ k] ≤ E[X]/k.
Proof.
Apply the decomposition

    E[X] = ∫_0^{+∞} x f(x) dx
         ≥ ∫_k^{+∞} x f(x) dx
         ≥ k ∫_k^{+∞} f(x) dx
         = k P[X ≥ k]

with the first equality requiring X to be nonnegative.
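The inequality is easy to exercise on a discrete example, e.g. the binomial coin count from earlier (n = 10 is an arbitrary choice):

```python
from math import comb

n, p = 10, 0.6
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
mean = sum(x * w for x, w in enumerate(pmf))  # E[X] = 6.0

# P[X >= k] <= E[X] / k must hold for every k > 0.
ok = all(
    sum(w for x, w in enumerate(pmf) if x >= k) <= mean / k + 1e-12
    for k in range(1, n + 1)
)
```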


Čebyšëv’s Inequality

Theorem 15
Čebyšëv’s Inequality. Given a random variable Y ∈ R and a number
δ > 0, it must be P[|Y − E[Y]| ≥ δ] ≤ Var[Y]/δ^2.
Proof.
Rephrase Markov’s inequality setting X = (Y − E[Y])^2 and k = δ^2,
and notice that:

    P[|Y − E[Y]| ≥ δ] ≤ P[(Y − E[Y])^2 ≥ δ^2]
                      ≤ E[(Y − E[Y])^2] / δ^2
                      = Var[Y] / δ^2

as postulated.
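The same exercise for Čebyšëv's inequality on the binomial example (where Var[X] = 0.24 · n = 2.4 for n = 10):

```python
from math import comb

n, p = 10, 0.6
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
mean = sum(x * w for x, w in enumerate(pmf))
var = sum((x - mean) ** 2 * w for x, w in enumerate(pmf))  # 0.24 * n = 2.4

# P[|X - E[X]| >= d] <= Var[X] / d^2 must hold for every d > 0.
ok = all(
    sum(w for x, w in enumerate(pmf) if abs(x - mean) >= d) <= var / d ** 2 + 1e-12
    for d in (0.5, 1, 2, 3, 4)
)
```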
Minimum squared error of prediction (1/3)

• To build intuition about the relationship between the mean
  and the variance, consider the following problem.

    min_X̂ E[(X − X̂)^2]

• This is about guessing (“predicting”) the realization of X
  using a single value X̂.

• The minimand is the mean squared error (MSE). This
  punishes big mistakes more than small mistakes.

• (This is a ubiquitous exercise in statistics! Understanding
  this simple example helps.)
Minimum squared error of prediction (2/3)
Note that:

    E[(X − X̂)^2] = E[(X − E[X] + E[X] − X̂)^2]
                 = E[(X − E[X])^2] + E[(E[X] − X̂)^2]
                   + 2 E[(X − E[X]) (E[X] − X̂)]
                 = Var[X] + E[(E[X] − X̂)^2]

because the third term in the second line must be zero:

    E[(X − E[X]) (E[X] − X̂)] = (E[X] − X̂) E[X − E[X]] = 0

since both E[X] and X̂ are “constants” and E[X − E[X]] = 0.
Minimum squared error of prediction (3/3)

• Hence:

    min_X̂ E[(X − X̂)^2] = min_X̂ { Var[X] + E[(E[X] − X̂)^2] }

• . . . and since Var[X] is a constant, the optimal predictor
  is the mean!

    X̂_optimal = E[X]

• A prediction based on the mean returns a residual MSE that
  is equal to the variance Var[X].

• Therefore, the variance is the prediction error “that cannot
  be removed.”
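A brute-force confirmation that the mean minimizes the MSE, again on the binomial coin-count example (a grid search over candidate predictors; the grid and n = 10 are arbitrary choices):

```python
from math import comb

n, p = 10, 0.6
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

def mse(c):
    """Mean squared error E[(X - c)^2] of the constant predictor c."""
    return sum((x - c) ** 2 * w for x, w in enumerate(pmf))

# Search a fine grid of candidate predictors; the minimizer is E[X] = 6.0,
# and the residual MSE at the minimizer equals Var[X] = 2.4.
best = min((i / 100 for i in range(0, 1001)), key=mse)
```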
Moment generating function

Definition 20
Moment generating function. Given some random variable X with
support X, the moment-generating function M_X(t) is defined, for
t ∈ R, as the expectation of the transformation g(X) = exp(tX), so
long as it exists. For discrete random variables this is:

    M_X(t) = E[exp(tX)] = Σ_{x∈X} exp(tx) f_X(x)

while for continuous random variables, it is as follows.

    M_X(t) = E[exp(tX)] = ∫_X exp(tx) f_X(x) dx

• A moment generating function is often abbreviated as m.g.f.
Moment generation

Theorem 16
Moment generation. If a random variable X has an associated moment
generating function M_X(t), its r-th uncentered moment can be
calculated as the r-th derivative of the moment generating function
evaluated at t = 0.

    E[X^r] = d^r M_X(t) / dt^r |_{t=0}

Proof.
Note that, for all r = 1, 2, . . . :

    d^r M_X(t) / dt^r = d^r E[exp(tX)] / dt^r
                      = E[d^r exp(tX) / dt^r]
                      = E[X^r exp(tX)]

so long as the r-th derivative with respect to t can pass through the
expectation operator. If so, E[X^r exp(tX)] = E[X^r] for t = 0.
Example: m.g.f. for coin experiments

• Recall X_coin for an unbalanced coin. Its m.g.f. is:

    M_X_coin(t) = exp(t · 1) · f_X_coin(1) + exp(t · 0) · f_X_coin(0)
                = 0.6 · exp(t) + 0.4

• while for X_n.coins, it is as follows:

    M_X_n.coins(t) = Σ_{x=0}^{n} exp(tx) (n choose x) · 0.6^x · 0.4^{n−x}
                   = Σ_{x=0}^{n} (n choose x) [0.6 · exp(t)]^x · 0.4^{n−x}
                   = [0.6 · exp(t) + 0.4]^n

  this obtains by another application of the binomial formula.
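The closed form for the n-coin m.g.f. can be sanity-checked in a few lines, comparing the defining sum with [0.6 exp(t) + 0.4]^n at a handful of t values (n = 10 is an arbitrary choice):

```python
from math import comb, exp

n, p = 10, 0.6  # illustration values

def mgf_sum(t):
    """M(t) computed directly from the binomial p.m.f."""
    return sum(exp(t * x) * comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(n + 1))

def mgf_closed(t):
    """M(t) from the closed form [0.6 exp(t) + 0.4]^n."""
    return (p * exp(t) + 1 - p) ** n

ok = all(abs(mgf_sum(t) - mgf_closed(t)) < 1e-9 * mgf_closed(t)
         for t in (-1.0, 0.0, 0.5, 1.0))
```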


Example: m.g.f. for the uniform distribution
• Let X follow the uniform distribution on the [0, 1] interval.
  The m.g.f. is calculated as follows.

    M_X(t) = ∫_0^1 exp(tx) dx
           = [exp(tx)/t]_0^1
           = [exp(t) − 1] / t

• Note: this might seem ill-defined at t = 0, but substituting the
  Taylor expansion

    exp(t) = Σ_{n=0}^{∞} t^n / n!

  into the m.g.f. formula reveals that actually M_X(0) = 1.


Example: m.g.f. for the exponential case

• Let Y = − log(X): as previously, Y follows the exponential
  distribution with unit parameter.

• The m.g.f. is calculated as follows.

    M_Y(t) = ∫_0^∞ exp((t − 1) y) dy
           = lim_{M→∞} [− exp(−(1 − t) y) / (1 − t)]_0^M
           = 1 / (1 − t)

• This m.g.f. only exists for t < 1! In fact, the integral that
  defines it diverges if t ≥ 1.
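The closed form can be checked numerically for an admissible t, say t = 0.5 (a midpoint-rule sketch, truncating the integral at y = 60 where the integrand is negligible):

```python
from math import exp

t, N, top = 0.5, 200_000, 60.0
h = top / N

# Integral defining M_Y(t) = ∫ exp((t - 1) y) dy over [0, ∞).
mgf = sum(exp((t - 1) * ((i + 0.5) * h)) for i in range(N)) * h

# Closed form: 1 / (1 - t) = 2 at t = 0.5.
```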
M.g.f. of transformations

• Let Y = a + bX again be a linear transformation of a random
  variable X.

• If they exist, the two m.g.f.s of X and Y are related to one
  another as follows.

    M_Y(t) = E[exp(tY)]
           = E[exp(ta + tbX)]
           = exp(at) E[exp(btX)]
           = exp(at) M_X(bt)

• Getting exact results for non-linear transformations is not
  as straightforward.
The role of m.g.f.s
• Why are m.g.f.s important? As already seen, they allow one
  to calculate moments, often more easily.

• There is another reason: an m.g.f. is unique to a distribution
  – i.e. each distribution F_X(x) has a distinct m.g.f. M_X(t).

• The proof of this result is outside this lecture’s scope.

• However, a sequence of moments does not uniquely identify
  a distribution, i.e. different distributions can have identical
  moments. Exception: distributions with bounded support.

• Problem: m.g.f.s may not exist (possibly just for some values of t).

• However, the closely related characteristic functions are
  also unique to distributions, and they always exist.
Characteristic functions
Definition 21
Characteristic function. Given some random variable X with support
X, the characteristic function ϕ_X(t) is defined, for t ∈ R and for
discrete random variables, as:

    ϕ_X(t) = E[exp(itX)] = Σ_{x∈X} exp(itx) f_X(x)

while for continuous random variables it is:

    ϕ_X(t) = E[exp(itX)] = ∫_X exp(itx) f_X(x) dx

where i is the imaginary unit.

• Similarly to the case of m.g.f.s, one can calculate moments
  via characteristic functions.

    E[X^r] = (1/i^r) · d^r ϕ_X(t) / dt^r |_{t=0}    for r ∈ N
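A small numerical illustration with the unbalanced coin, whose characteristic function is ϕ(t) = 0.6 exp(it) + 0.4: approximating ϕ′(0) with a central finite difference recovers E[X] = 0.6.

```python
import cmath

def phi(t):
    """Characteristic function of the unbalanced coin (P[Head] = 0.6)."""
    return 0.6 * cmath.exp(1j * t) + 0.4

h = 1e-6
d1 = (phi(h) - phi(-h)) / (2 * h)  # central difference, ≈ φ'(0) = 0.6 i
mean = (d1 / 1j).real              # E[X] = φ'(0) / i ≈ 0.6
```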
