Lecture notes 2
2 Random variables
2.1 Random variable and expectation
2.2 Two random variables
2.3 Averages and their convergence
2 Random variables
This chapter provides a short introduction to the concept of a random variable; this concept is
central in data science and will be important in the rest of the course. Indeed, we will want to
forecast using ideas from probability theory, both to design efficient prediction tools and to
quantify the uncertainties associated with them.
• We can describe the relation between X and an already-defined variable. For example, if we
say that X = − log(U) where U is Uniform in [0, 1], we have fully described X, and we can
simulate it on the computer, at least if we assume that we can simulate Uniform variables
on the computer.
Figure 2.1: Two random variables: a biased coin flip (left), and a Uniform variable in [0, 1]. Here,
b − a is equal to d − c, so the variable is equally likely to land in (a, b) or in (c, d).
• We can describe the probability density function of X, denoted by fX, a function from the
state space to R+, the set of positive reals. For example, the density function of a Uniform(0, 1)
variable is x ↦ 1(x ∈ (0, 1)), and the density function of the Exponential(1) variable is
x ↦ exp(−x). From the probability density function, we can compute the probability that the
random variable lands in any subset of the state space. For example, for any a < b in the state
space, we have P(X ∈ (a, b)) = ∫_a^b fX(x) dx; in words, the area under the curve of fX between
a and b. Here we represent probabilities as integrals of fX, which is convenient because integrals
can be computed by certain humans and all computers.
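Since such probabilities are integrals of fX, a computer can approximate them numerically. A minimal sketch (the function name is illustrative), using a midpoint Riemann sum for the Exponential(1) density, whose integral over (a, b) is exp(−a) − exp(−b) in closed form:

```python
import math

def prob_interval(density, a, b, n=100_000):
    """Approximate P(X in (a, b)) as the integral of the density over (a, b),
    using a midpoint Riemann sum with n slices."""
    h = (b - a) / n
    return sum(density(a + (i + 0.5) * h) for i in range(n)) * h

# Exponential(1) density on the positive reals.
f_exp = lambda x: math.exp(-x)

approx = prob_interval(f_exp, 1.0, 2.0)
exact = math.exp(-1.0) - math.exp(-2.0)   # closed-form value of the integral
```

The two numbers agree to many decimal places, which is the sense in which "all computers" can compute such integrals.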
We can prove that X = − log(U), where U is Uniform in [0, 1], is indeed a random variable with
density function x ↦ exp(−x), via the change of variable formula.
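Before the proof, here is a quick empirical check by simulation (a sketch; variable names are illustrative): samples of − log(U) should behave like Exponential(1) samples, whose mean is 1.

```python
import math
import random

random.seed(0)

# Simulate X = -log(U) with U ~ Uniform(0, 1), many times.
samples = [-math.log(random.random()) for _ in range(100_000)]

# If X is indeed Exponential(1), its mean is 1, so the empirical
# average of the samples should be close to 1.
empirical_mean = sum(samples) / len(samples)
```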
Change of variable. Suppose that X is a random variable with density fX, and that s is a
one-to-one function. Define Y = s(X). What is the density fY?

(change of variable)  fY(y) = fX(s⁻¹(y)) × |d s⁻¹/dy (y)|.  (2.1)

In the above equation, s⁻¹ is the inverse of s: s⁻¹(y) is the number such that s(s⁻¹(y)) = y. The
last term on the right is the absolute value of the derivative of s⁻¹ evaluated at y.
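We can sanity-check (2.1) numerically on this very example: with s(u) = − log(u), we have s⁻¹(y) = exp(−y), fU = 1 on (0, 1), and the formula should return exp(−y). A sketch, approximating the derivative of s⁻¹ by a finite difference (names are illustrative):

```python
import math

def f_uniform(u):
    """Density of Uniform(0, 1)."""
    return 1.0 if 0.0 < u < 1.0 else 0.0

def density_of_transform(s_inv, f_x, y, h=1e-6):
    """Change-of-variable formula: f_Y(y) = f_X(s_inv(y)) * |d s_inv / dy|,
    with the derivative approximated by a central finite difference."""
    deriv = (s_inv(y + h) - s_inv(y - h)) / (2.0 * h)
    return f_x(s_inv(y)) * abs(deriv)

# Y = s(U) with s(u) = -log(u), so s_inv(y) = exp(-y).
s_inv = lambda y: math.exp(-y)

# At y = 1.5 the formula should give exp(-1.5), the Exponential(1) density.
fy = density_of_transform(s_inv, f_uniform, 1.5)
```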
To summarise, we can think of random variables as mathematical objects (more precisely, as
functions), and as concrete objects that we can simulate on the computer (see Figure 2.2 on the
Exponential variable). Both views are useful.
Figure 2.2: Two views on the Exponential random variable. Left: generate X = − log(U) with
U ∼ Uniform(0, 1). Right: probability density function x ↦ exp(−x).
Properties. Once we have defined a variable, we can look at its properties. The expectation of
a random variable X, E[X], also known as its mean, is defined by E[X] = ∫_{−∞}^{+∞} x fX(x) dx, where
fX is the probability density function of X. The integral is not always well defined, so the
expectation may fail to exist: this is the case for example with a Cauchy variable, which has
density x ↦ π⁻¹(1 + x²)⁻¹. Similarly we can define E[h(X)] = ∫_{−∞}^{+∞} h(x) fX(x) dx for a
function h, for example E[X²] = ∫ x² fX(x) dx. It is helpful to know that expectations are defined
as integrals. But this does not mean that we have to resort to (scary!) integral calculations every
time we meet an expectation, thanks to fundamental properties recalled below (linearity, and later
on, the tower property).
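For instance, for the Exponential(1) variable, E[X²] = ∫ x² exp(−x) dx = 2, and a computer can approximate this integral directly. A sketch (the truncation point and names are illustrative):

```python
import math

def expectation(h_func, density, lo, hi, n=200_000):
    """Approximate E[h(X)] = integral of h(x) f_X(x) dx with a midpoint
    Riemann sum over [lo, hi]; the tail beyond hi is truncated."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        total += h_func(x) * density(x) * step
    return total

# Second moment of Exponential(1): the exact value is 2.
# The upper bound 50.0 is an assumed truncation point; the neglected
# tail is of order exp(-50) and thus negligible here.
second_moment = expectation(lambda x: x * x, lambda x: math.exp(-x), 0.0, 50.0)
```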
The expectation is linear, which means the following: if X is equal (with probability one) to a
constant real number c, then E[X] = c, and for any pair of random variables X and Y and any two
real numbers a and b,

(linearity)  E[aX + bY] = aE[X] + bE[Y].
Using linearity, we can find the expectation of a Uniform(a, b) variable X from the expectation of
a Uniform(0, 1) variable U , which is equal to 1/2. Since X is a + (b − a)U , E[X] = a + (b − a)/2.
We can also use the linearity of expectation to show that E[(X − E[X])²] is equal to E[X²] − E[X]².
This is called the variance of X and denoted by V[X]. The variance satisfies, for instance,
V[aX + b] = a²V[X] for any real numbers a and b.
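The identity E[(X − E[X])²] = E[X²] − E[X]² can also be seen numerically: computed on the same simulated sample, the two sides agree up to floating-point error. A sketch with Uniform(0, 1) samples, whose variance is 1/12:

```python
import random

random.seed(1)

# A simulated sample from Uniform(0, 1); its variance is 1/12.
xs = [random.random() for _ in range(50_000)]
n = len(xs)
mean = sum(xs) / n

# Left-hand side: E[(X - E[X])^2], estimated on the sample.
lhs = sum((x - mean) ** 2 for x in xs) / n
# Right-hand side: E[X^2] - E[X]^2, estimated on the same sample.
rhs = sum(x * x for x in xs) / n - mean ** 2
```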
Figure 2.3: Two views on the Normal random variable. Left: normalized histogram of generated
values simulated using (2.5). Right: probability density function ϕ defined in (2.4).
(Normal pdf)  ϕ : x ↦ (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  (2.4)
for x ∈ R. This is one way of defining it. It is “standard” if µ = 0 and σ = 1. Alternatively, we can
describe a Normal variable by its relation to the (already defined) Uniform variable. Suppose that
U1 and U2 are independent (more on independence below) Uniform(0, 1) variables. Define Z as

(Box–Muller)  Z = √(−2 log U1) cos(2π U2).  (2.5)

Then Z follows a standard Normal distribution.
Independence. We now consider a pair of real-valued random variables X and Y . We can put
them in a “vector” V = (X, Y ), in which case, we have one random vector of length 2. The random
vector, like any other random variable, can be described with its probability density function fX,Y.
The variables X and Y are independent if the joint density factorizes:

(independence)  fX,Y(x, y) = fX(x) fY(y).  (2.6)

For example the density of a vector (U1, U2) of independent Uniform(0, 1) variables is the function
(x, y) ↦ 1(x ∈ (0, 1)) × 1(y ∈ (0, 1)). To simulate independent variables, we just simulate them
separately, without sharing of information, recycling or communication between the two simulators.
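As an illustration, we can simulate two Uniform(0, 1) variables separately, with no shared information, and combine them into a Normal draw via the Box–Muller construction (a sketch; we assume this is the construction relating Normals to Uniforms mentioned above):

```python
import math
import random

random.seed(2)

def standard_normal():
    """Draw two independent Uniform(0, 1) variables, simulated separately,
    and combine them into one Normal(0, 1) draw (Box-Muller)."""
    u1 = 1.0 - random.random()   # in (0, 1], avoids log(0)
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

zs = [standard_normal() for _ in range(100_000)]
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs) - mean ** 2
```

The empirical mean and variance come out close to 0 and 1, as expected for a standard Normal.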
Conditioning. Consider again a pair of real-valued random variables X and Y , not necessarily
independent. Conditioning on X, or more precisely on the event {X = x}, means considering
the distribution of the random variables while fixing the value of X to some x ∈ R. As if the
random variable X was solidified into the value x. For example, suppose that X ∼ Normal(0, 1)
and Y = aX + b. Then if we “condition” on {X = x}, Y is equal to ax + b.
Conditioning on {X = x}, the variable Y might have a different distribution than if we do not
condition on {X = x}. We denote the conditional density of Y given {X = x} by y 7→ fY |X (y|x).
For any joint distribution fX,Y , we can always write
(general factorization) fX,Y (x, y) = fX (x) fY |X (y|x) = fY (y) fX|Y (x|y) . (2.7)
From this we get the expression fY |X (y|x) = fX,Y (x, y) /fX (x), so we can obtain an expression for
the conditional density using the joint and the marginal densities.
Note that, since (2.7) is always true, independence as in (2.6) implies that fY|X(y|x) = fY(y) and
that fX|Y(x|y) = fX(x). This corresponds to our intuitive idea of “independence”:
knowing the value of X does not change our understanding of the distribution of Y, and vice versa.
The notion of independence is symmetric in X and Y .
Tower property. We write E[Y |X] or E[Y |X = x] for the expectation of the random variable Y
when we condition on, or know, the value of X, say x. We have the following very useful property,
for any pair of random variables X and Y:

(tower property)  E[Y] = E[E[Y |X]].  (2.8)

For example, suppose that X and W are two independent Normal(0, 1) variables and that Y =
aX + bW. If we condition on the event {X = x}, then Y becomes ax + bW and its distribution
is Normal(ax, b²), thus E[Y |X] = aX. On the other hand, unconditionally the expectation E[Y]
is 0. We can find this by linearity: E[Y] = aE[X] + bE[W] = 0. Or by the tower property: E[Y] =
E[E[Y |X]] = E[aX] = 0.
Products. A useful property of independent variables is that the expectation E[XY] is equal
to the product E[X] E[Y]. Indeed, using the tower property,

E[XY] = E[E[XY |X]] = E[X E[Y |X]] = E[X] E[Y],

where we have used E[Y |X] = E[Y] by independence. Another useful property is that, for any two
functions g and h, if X and Y are independent then g(X) and h(Y ) are independent.
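A quick numerical sanity check of the product rule, with two independently simulated Uniform(0, 1) samples (a sketch; for these, E[X] E[Y] = 1/4):

```python
import random

random.seed(3)

n = 100_000
# Two independent Uniform(0, 1) samples, simulated separately.
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]

e_xy = sum(x * y for x, y in zip(xs, ys)) / n   # estimate of E[XY]
e_x = sum(xs) / n                               # estimate of E[X]
e_y = sum(ys) / n                               # estimate of E[Y]
# For independent variables, E[XY] should match E[X] E[Y] (= 1/4 here).
```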
The covariance of X and Y is defined as

(covariance)  Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y].  (2.9)

The second equality can be checked by expanding the product and using the linearity of
expectations. The covariance is not a very intuitive notion, but note that if X and Y are
independent, then Cov(X, Y) = 0, because then E[XY] = E[X] E[Y]. If Cov(X, Y) = 0 we say that
X and Y are uncorrelated. Independent variables are always uncorrelated.
Uncorrelated but dependent. There are plenty of pairs of variables X and Y such that
Cov(X, Y) = 0 and yet X and Y are dependent. Consider X following a symmetric distribution
around 0, such as a centered Normal distribution or a Uniform distribution on [−1, 1].
Define Y as Y = X². Then X brings a lot of information on Y (in fact, X determines Y
completely), so intuitively the two variables are dependent. On the other hand, we can compute
Cov(X, Y) = E[X³] − E[X] E[X²]. Since X is symmetric around 0, we have E[X³] = 0 and
E[X] = 0, so Cov(X, Y) = 0: the variables are uncorrelated. The covariance is also bilinear:

(bilinearity of covariance)  Cov(aX + bY, cW) = ac Cov(X, W) + bc Cov(Y, W).  (2.10)
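The uncorrelated-but-dependent example above is easy to reproduce by simulation (a sketch; the tolerance is illustrative):

```python
import random

random.seed(4)

n = 100_000
# X ~ Uniform(-1, 1) is symmetric around 0; Y = X^2 is determined by X.
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
# The empirical covariance is close to 0 despite the strong dependence.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
```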
Figure 2.4: Bivariate Normal random variable. Left: samples. Right: contours of the probability
density function.
The correlation of X and Y is defined as

(correlation)  Cor(X, Y) = Cov(X, Y) / √(V[X] V[Y]).  (2.11)
Some properties of the correlation are derived from those of the covariance (symmetry, invariance
by shifts). Some properties are specific to the correlation:
• Invariance by scalings: Cor(aX, Y) = Cor(X, Y) for all a > 0 (and Cor(aX, Y) = −Cor(X, Y)
for a < 0). Invariance by shifts and scalings means that the correlation is insensitive to the
units used for X and Y.
• The correlation is always between −1 and +1. Indeed, for any pair of random variables X
and Y with finite first two moments (i.e. E[X²] and E[Y²] are finite), the Cauchy–Schwarz
inequality states that E[XY]² ≤ E[X²] E[Y²]. If we apply this inequality to the variables
X − E[X] and Y − E[Y], we obtain Cov(X, Y)² ≤ V[X] V[Y] and thus Cor(X, Y) ∈ [−1, 1].
Furthermore, the equality holds only if Y = aX + b for some real numbers a and b. Therefore,
we have Cor(X, Y) = 1 (resp. = −1) if and only if Y = aX + b with a > 0 (resp. with a < 0).
Maximally correlated variables are perfectly aligned.
The latter property hints at a limitation of the correlation coefficient: it really only captures linear
associations.
then we can explicitly invert Σ and compute its determinant. After some work, we can write, for
any pair (x1, x2), the joint density fX1,X2(x1, x2) of (X1, X2) ∼ Normal(µ, Σ) as

(1 / (2πσ1σ2 √(1 − ρ²))) exp( −(1 / (2(1 − ρ²))) [ ((x1 − µ1)/σ1)² + ((x2 − µ2)/σ2)² − 2ρ ((x1 − µ1)/σ1) ((x2 − µ2)/σ2) ] ).
Note that if ρ = 0, then the off-diagonal elements of Σ are zero, i.e. Cov(X1 , X2 ) = 0. But also
the joint density factorizes into a product of marginal densities as in (2.6). In that case, X1 and
X2 are independent. So for variables that are jointly Normal, lack of correlation is equivalent to
independence. It is not true for general pairs of random variables.
Average and expectation. One justification for approximating expectations by averages is the
law of large numbers. Assume that E[|X|] = ∫ |x| fX(x) dx < ∞, and that X1, X2, . . . is a
sequence of independent identical copies of X. Then, writing X̄n = n⁻¹ Σ_{t=1}^{n} Xt for the
average,

(law of large numbers)  X̄n −→ E[X] almost surely, as n → ∞.  (2.12)
The convergence “almost sure” or “a.s.” means that P( lim_{n→∞} n⁻¹ Σ_{t=1}^{n} Xt = E[X] ) = 1; in
words: in every experiment where we would generate such a sequence X1, X2, etc., there is an integer
n large enough so that X̄n is close to E[X]. This is called an “asymptotic” result, because it
describes a phenomenon occurring when n → ∞ and it does not say anything about any finite
value of n.
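The law of large numbers is easy to visualize with a running average (a minimal sketch, with X ∼ Uniform(0, 1) so that E[X] = 1/2):

```python
import random

random.seed(5)

# Running averages of independent Uniform(0, 1) draws; E[X] = 1/2.
n = 100_000
running_sum = 0.0
averages = []
for t in range(1, n + 1):
    running_sum += random.random()
    averages.append(running_sum / t)

# Early averages fluctuate; by n = 100_000 the average has settled near 0.5.
final_average = averages[-1]
```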
If we assume more, we get more. For example if we assume that V[X] < ∞, then we have
Chebyshev’s inequality that states that for all ε > 0 and for all n ≥ 1:
(Chebyshev)  P(|X̄n − E[X]| > ε) ≤ V[X] / (nε²).  (2.13)
Accordingly, the probability that X̄n is more than ε away from E[X] goes to zero as 1/n when
n → ∞. But Chebyshev is a non-asymptotic result: it works for all n. Under the same assumption
V[X] < ∞, the Central Limit Theorem is a purely asymptotic result that states
(CLT)  √n (X̄n − E[X]) −→ Normal(0, V[X]) in distribution, as n → ∞.  (2.14)
The convergence is “in distribution”: the random variable on the left becomes more and more like
the random variable on the right of the arrow.
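A sketch of (2.14) in action: repeat the averaging experiment many times and look at the spread of √n(X̄n − E[X]); for Uniform(0, 1) samples, its variance should be close to V[X] = 1/12 (the sample sizes below are illustrative):

```python
import math
import random

random.seed(6)

n = 1_000      # sample size for each average
reps = 5_000   # number of repeated experiments

# For each experiment, record sqrt(n) * (mean - E[X]) with X ~ Uniform(0, 1).
scaled = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    scaled.append(math.sqrt(n) * (xbar - 0.5))

# By the CLT, these values look like Normal(0, V[X]) with V[X] = 1/12.
emp_mean = sum(scaled) / reps
emp_var = sum(z * z for z in scaled) / reps - emp_mean ** 2
```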
With the same reasoning, if we replace V[X] by the empirical variance σ̂x² defined as Ĉov(x1:n, x1:n),
and if we replace V[Y] by σ̂y², then from Eq. (2.11) we obtain the empirical correlation as

(empirical correlation)  Ĉor(x1:n, y1:n) = n⁻¹ Σ_{t=1}^{n} (xt − x̄n)(yt − ȳn) / √(σ̂x² σ̂y²)
= Σ_{t=1}^{n} (xt − x̄n)(yt − ȳn) / √( Σ_{t=1}^{n} (xt − x̄n)² · Σ_{t=1}^{n} (yt − ȳn)² ).  (2.16)
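A direct implementation of the empirical correlation (2.16) (a sketch; names are illustrative):

```python
import math

def empirical_correlation(xs, ys):
    """Empirical correlation of two samples of equal length, as in (2.16)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)
                    * sum((y - y_bar) ** 2 for y in ys))
    return num / den

# Perfectly linearly related samples: correlation +1 (increasing relation).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
```

On such perfectly aligned samples the function returns 1, matching the discussion of maximal correlation above.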
The ability to approximate “theoretical” quantities such as expectations using samples will be
key in the developments of the next chapters.