Week 12 Notes
12. Week 12
(a) Measures of Central Tendency or location: here, we try to find a ‘central’ value around
which the possible values of the RV are distributed.
(b) Measures of Dispersion: once we have an idea of the ‘central’ value of the RV (equivalently,
the probability distribution), we check the scattering/dispersion of all the possible
values of the RV around this ‘central’ value.
(c) Measures of Skewness: here, we try to quantify the asymmetry of the probability distribu-
tion.
(d) Measures of Kurtosis: here, we try to measure the thickness of the tails of the RV (equiv-
alently, the probability distribution) while comparing with the Normal distribution.
Example 12.2 (Measures of Central Tendency). (a) The Mean of an RV is a good example
of a measure of central tendency. It also has the useful property of linearity. However, it
may be affected by a few extreme values, referred to as outliers. The mean may not exist
for all distributions.
(b) The Median, i.e. a quantile of order 1/2, of an RV is always defined and is usually not affected by
a few outliers. However, the median lacks the linearity property, i.e. a median of X + Y
has no general relationship with the medians of X and Y . Further, a median focuses on
the probabilities with which the values of the RV occur rather than the exact numerical
values. A median need not be unique.
(c) The mode m0 of a probability distribution is the value that occurs with ‘highest probability’,
and is defined by fX (m0 ) = sup {fX (x) : x ∈ SX }, where fX denotes the p.m.f./p.d.f. of X,
as appropriate and SX denotes the support of X. Mode need not be unique. Distributions
with one, two or multiple modes are called unimodal, bimodal or multimodal distributions,
respectively. Usually, the mode is easy to calculate. However, it may so happen that a distribution
has multiple modes situated far apart, in which case the mode may not be suitable as
a measure of central tendency. A small numerical illustration of the three measures follows.
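As a small illustration (not part of the notes; it assumes Python with numpy is available and uses made-up data), the following sketch computes the three measures for a sample containing one outlier:

import numpy as np
from collections import Counter

# made-up sample; the value 100 acts as an outlier
data = np.array([2, 3, 3, 4, 5, 3, 4, 100])

mean = data.mean()                                   # pulled towards the outlier
median = np.median(data)                             # robust to the outlier
mode = Counter(data.tolist()).most_common(1)[0][0]   # most frequent value

print(mean, median, mode)   # 15.5, 3.5, 3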
Note 12.4 (A Measure of Skewness). If the distribution of an RV X is symmetric about the mean
µ, then fX (µ + x) = fX (µ − x), ∀x ∈ R, where fX denotes the p.m.f./p.d.f. of X. If this is not the
case, then one of the following two situations may occur.
(a) (Positively skewed) the distribution may have more probability mass towards the right
hand side of the graph of fX . In this case, the tails on the right hand side are longer.
(b) (Negatively skewed) the distribution may have more probability mass towards the left hand
side of the graph of fX . In this case, the tails on the left hand side are longer.
To measure this asymmetry, we usually look at EZ^3, where $Z = \frac{X - EX}{\sqrt{Var(X)}}$, provided the moments
exist. Note that Z is independent of the units of measurement and
\[
EZ^3 = \frac{E(X - EX)^3}{(Var(X))^{3/2}} = \frac{\mu_3(X)}{(\mu_2(X))^{3/2}}.
\]
We may refer to a distribution as positively or negatively skewed according as the above quantity
is positive or negative. If X ∼ Exponential(λ), then EZ^3 = 2 and hence the distribution of
X is positively skewed.
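The value EZ^3 = 2 for the Exponential distribution can be checked by simulation. The sketch below (an illustration only, assuming numpy is available) estimates EZ^3 from large Exponential(λ) samples for several values of λ, showing that the measure does not depend on the unit of measurement:

import numpy as np

rng = np.random.default_rng(0)
for lam in (0.5, 1.0, 4.0):
    x = rng.exponential(scale=1.0 / lam, size=1_000_000)   # Exponential(lam) sample
    z = (x - x.mean()) / x.std()                            # standardized values
    print(lam, (z ** 3).mean())                             # each estimate is close to 2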
Note 12.5. There are many other measures of skewness used in practice. However, we do not
discuss them in this course.
Note 12.6 (A measure of Kurtosis). The probability distribution of X is said to have higher
(respectively, lower) kurtosis than the Normal distribution, if its p.m.f./p.d.f., in comparison with
the p.d.f. of a Normal distribution, has a sharper (respectively, more rounded) peak and longer/fatter
(respectively, shorter/thinner) tails. To measure the kurtosis of X, we look at EZ^4, where
$Z = \frac{X - EX}{\sqrt{Var(X)}}$, provided the moments exist. Note that Z is independent of the units of measurement
and
\[
EZ^4 = \frac{E(X - EX)^4}{(Var(X))^2} = \frac{\mu_4(X)}{(\mu_2(X))^2}.
\]
If X ∼ N(µ, σ²), then Z ∼ N(0, 1) and hence EZ^4 = 3 (see Remark 8.1). For a general RV X, the
quantity $\frac{\mu_4(X)}{(\mu_2(X))^2} - 3$ is referred to as the excess kurtosis of X. If the excess kurtosis is zero, positive
or negative, then we refer to the corresponding probability distribution as mesokurtic, leptokurtic
or platykurtic, respectively.
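As a hedged illustration (not part of the notes; it assumes numpy and uses the Laplace distribution purely as an example of a heavy-tailed law), the sketch below estimates the excess kurtosis of a Normal sample (close to 0, mesokurtic) and of a Laplace sample (positive, leptokurtic):

import numpy as np

rng = np.random.default_rng(1)
samples = {"normal": rng.normal(size=1_000_000),
           "laplace": rng.laplace(size=1_000_000)}   # heavier tails than the Normal
for name, x in samples.items():
    z = (x - x.mean()) / x.std()
    print(name, (z ** 4).mean() - 3.0)               # estimated excess kurtosis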
Definition 12.7 (Quantile function of an RV). Let X be an RV with the DF FX. The function
QX : (0, 1) → R defined by
\[
Q_X(p) := \inf\{x \in \mathbb{R} : F_X(x) \ge p\}, \quad p \in (0, 1),
\]
is called the quantile function of X.
Proposition 12.8. Let X be an RV with the DF FX and the quantile function QX. If X is a
continuous RV, then FX(X) ∼ Uniform(0, 1). Further, if U ∼ Uniform(0, 1), then QX(U) has the DF FX.
Proof. We prove only the first statement. The proof of the second statement is similar. Take
Y = FX(X). Then,
\[
F_Y(y) = P(Y \le y) = P(F_X(X) \le y) = \begin{cases} 0, & \text{if } y < 0,\\ 1, & \text{if } y \ge 1, \end{cases}
\]
since FX(X) takes values in [0, 1]. Now, for y ∈ [0, 1),
\[
F_Y(y) = P(F_X(X) \le y) = 1 - P(F_X(X) \ge y) = 1 - P(X \ge Q_X(y)) = P(X \le Q_X(y)) = F_X(Q_X(y)) = y.
\]
Here, P(FX(X) = y) = P(x1 ≤ X ≤ x2) = 0 for some x1, x2 ∈ R with FX(x1) = FX(x2), and we have
used the fact that X is a continuous RV. Hence, FY is the DF of the Uniform(0, 1) distribution. □
Note 12.9. Let X be an RV with the quantile function QX. If we can generate random samples
U1, U2, · · · , Un from the Uniform(0, 1) distribution, then QX(U1), QX(U2), · · · , QX(Un) are random samples
from the distribution of X. This observation may be used in practice to generate random samples
from known distributions using the Uniform(0, 1) distribution.
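A minimal sketch of this idea (an illustration only, assuming numpy is available and taking Exponential(λ) as the target, for which FX(x) = 1 − e^{−λx} and hence QX(u) = −ln(1 − u)/λ):

import numpy as np

def exponential_quantile(u, lam):
    # Q_X(u) = inf{x : F_X(x) >= u} with F_X(x) = 1 - exp(-lam * x)
    return -np.log(1.0 - u) / lam

rng = np.random.default_rng(2)
lam = 2.0
u = rng.uniform(size=500_000)          # U_1, ..., U_n ~ Uniform(0, 1)
x = exponential_quantile(u, lam)       # Q_X(U_1), ..., Q_X(U_n)
print(x.mean(), 1.0 / lam)             # sample mean is close to E X = 1 / lam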
Note 12.10 (Moments do not determine the distribution of an RV). Let X ∼ N (0, 1) and consider
Y = e^X. The distribution of Y is usually called the lognormal distribution, since ln Y = X ∼
N(0, 1). Using standard techniques, we can compute the p.d.f. of Y:
\[
f_Y(y) = \begin{cases} \frac{1}{\sqrt{2\pi}}\, y^{-1} \exp\left[-\frac{(\ln y)^2}{2}\right], & \text{if } y > 0,\\ 0, & \text{otherwise}. \end{cases}
\]
It can be shown that the continuous RVs Xα, α ∈ [−1, 1], with the p.d.f.s
\[
f_{X_\alpha}(y) = \begin{cases} f_Y(y)\left[1 + \alpha \sin(2\pi \ln y)\right], & \text{if } y > 0,\\ 0, & \text{otherwise}, \end{cases}
\]
have the same moments as Y. However, the distributions are different. This shows that the
moments of an RV do not determine the distribution. (See the article ‘On a property of the
lognormal distribution’ by C.C. Heyde, published in the Journal of the Royal Statistical Society:
Series B, volume 25 (1963).)
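As a numerical sanity check of this claim (a sketch only, assuming scipy is available and using the densities fXα written above; the substitution x = ln y is used to tame the integrals), the first few moments come out the same for several values of α:

import numpy as np
from scipy.integrate import quad

def moment(k, alpha):
    # integral of y^k * f_Y(y) * (1 + alpha * sin(2*pi*ln y)) dy, after x = ln y
    phi = lambda x: np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)   # standard normal p.d.f.
    integrand = lambda x: np.exp(k * x) * phi(x) * (1.0 + alpha * np.sin(2.0 * np.pi * x))
    return quad(integrand, -np.inf, np.inf)[0]

for k in (1, 2, 3):
    # all three estimates agree with E Y^k = exp(k^2 / 2), whatever the value of alpha
    print(k, [round(moment(k, a), 4) for a in (-1.0, 0.0, 1.0)], np.exp(k * k / 2))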
Note 12.11. Let F and G be DFs and let α ∈ [0, 1].
(a) (Convex combination of DFs) The function H : R → [0, 1] defined by H(x) := αF(x) +
(1 − α)G(x), ∀x ∈ R has the relevant properties and hence is a DF.
(b) (Product of DFs) The function H : R → [0, 1] defined by H(x) := F(x)G(x), ∀x ∈ R has
the relevant properties and hence is a DF. In particular, F^2 is a DF, if F is so (see the empirical sketch below).
In fact, a general DF can be written as a convex combination of discrete DFs and some special
continuous DFs. We do not discuss such results in this course.
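One way to see why these constructions give DFs (an empirical sketch, not claimed in the notes; it assumes numpy): αF + (1 − α)G is the DF of a mixture that picks from F with probability α and from G otherwise, while FG is the DF of the maximum of two independent RVs with DFs F and G. The check below uses F for Exponential(1) and G for Uniform(0, 1):

import numpy as np

rng = np.random.default_rng(3)
n, alpha = 1_000_000, 0.3
x = rng.exponential(size=n)            # DF F(t) = 1 - exp(-t) for t >= 0
y = rng.uniform(size=n)                # DF G(t) = t on [0, 1]
pick = rng.uniform(size=n) < alpha     # choose from F with probability alpha
mixture = np.where(pick, x, y)
maximum = np.maximum(x, y)

t = 0.7
F, G = 1.0 - np.exp(-t), t
print((mixture <= t).mean(), alpha * F + (1.0 - alpha) * G)   # empirical DF vs alpha*F + (1-alpha)*G
print((maximum <= t).mean(), F * G)                           # empirical DF vs F*G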
Remark 12.12. In practice, given a known RV X, many times we need to find out the distribution
of h(X) for some function h : R → R or even, simply, compute the expectations of the form
Eh(X). As already discussed earlier in the course, we can theoretically (i.e., in principle) compute
Eh(X) as $\int_{-\infty}^{\infty} h(x) f_X(x)\, dx$, when X is a continuous RV with p.d.f. fX, for example. However, in
practice, it may happen that this integral does not have a closed form expression – which makes it
challenging to evaluate. The problem becomes more intractable when we look at similar problems
where X is a random vector and the joint/marginal distributions need to be considered. In such
situations, as an alternative, we try to find ‘good’ approximations for the quantities of interest,
where the approximation terms are easier to compute than the original expression. This motivation
leads to the various notions for convergence of RVs. If some quantity of interest involving an RV
X, say EX, is difficult to compute, then we find an appropriate ‘approximating’ sequence of RVs
{Xn }n for X and use the values EXn as an approximation for EX.
Remark 12.13. Given a random sample X1, X2, · · · , Xn from the N(µ, σ²) distribution, consider the
sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Here, we have written X̄n, instead of just X̄, to highlight the
dependence of the sample mean on the sample size n. Recall that $\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$. The behaviour
of X̄n for large n is of interest. This is also another motivation for us to study the convergence of
sequences of RVs.
Definition 12.14 (Convergence in r-th mean). Let X, X1, X2, · · · be RVs defined on the same
probability space (Ω, F, P). Let r ≥ 1. If E|X|^r < ∞, E|Xn|^r < ∞, ∀n and if
\[
\lim_{n \to \infty} E|X_n - X|^r = 0,
\]
then we say that the sequence {Xn}n converges to X in r-th mean.
Note 12.15. (a) If a sequence {Xn}n converges to X in r-th mean for some r ≥ 1, then we
have
\[
\lim_{n \to \infty} E|X_n|^r = E|X|^r \quad \text{and} \quad \lim_{n \to \infty} EX_n^r = EX^r.
\]
Remark 12.16. Even though we have defined the r-th order moments for 0 < r < 1, for technical
reasons we do not consider the convergence in r-th mean in this case. The details are beyond
the scope of this course. In what follows, whenever we consider the convergence in r-th mean, we
assume r ≥ 1.
Definition 12.17 (Convergence in probability). Let X, X1, X2, · · · be RVs defined on the same
probability space (Ω, F, P). If, for all ϵ > 0,
\[
\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0,
\]
then we say that the sequence {Xn}n converges to X in probability and write $X_n \xrightarrow{P} X$ as $n \to \infty$.
Note 12.18. (a) Suppose that a sequence {Xn}n converges to X in probability. Now, for all
ϵ > 0, note that
\[
\lim_{n \to \infty} P(|X_n - X| < \epsilon) = 1 - \lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 1.
\]
(b) The sequence {Xn }n converges to X in probability if and only if the sequence {Xn − X}n
converges to 0 in probability.
Proposition 12.19. Let X, X1, X2, · · · be RVs defined on the same probability space (Ω, F, P). If
the sequence {Xn}n converges to X in r-th mean for some r ≥ 1, then $X_n \xrightarrow{P} X$ as $n \to \infty$.
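A proof of Proposition 12.19 is not spelled out in these notes; a standard argument (stated here only as a sketch) applies Markov's inequality to |Xn − X|^r: for every ϵ > 0,
\[
P(|X_n - X| \ge \epsilon) = P(|X_n - X|^r \ge \epsilon^r) \le \frac{E|X_n - X|^r}{\epsilon^r} \xrightarrow[n \to \infty]{} 0.
\]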
Corollary 12.20. Let {Xn}n be a sequence of RVs with finite second moments. If limn EXn = µ
and limn Var(Xn) = 0, then {Xn}n converges to µ in 2nd mean and, in particular, in probability.
Proof. Write µn := EXn. We have E|Xn − µ|^2 = E[(Xn − µn) + (µn − µ)]^2 = E(Xn − µn)^2 + (µn − µ)^2 = Var(Xn) +
(µn − µ)^2, since the cross term E[(Xn − µn)(µn − µ)] vanishes. By our hypothesis, limn E|Xn − µ|^2 = 0. Hence, {Xn}n converges to µ in 2nd mean.
By Proposition 12.19, the sequence also converges in probability. □
Example 12.21. Let X1, X2, · · · be i.i.d. Uniform(0, θ) RVs, for some θ > 0. The sequence
{Xn}n being i.i.d. means that the collection {Xn : n ≥ 1} is mutually independent and that all
the RVs have the same law/distribution. Here, the common p.d.f. and the common DF are given
by
\[
f(x) = \begin{cases} \frac{1}{\theta}, & \text{if } x \in (0, \theta),\\ 0, & \text{otherwise}, \end{cases}
\qquad
F(x) = \begin{cases} 0, & \text{if } x < 0,\\ \frac{x}{\theta}, & \text{if } 0 \le x < \theta,\\ 1, & \text{if } x \ge \theta. \end{cases}
\]
Consider X(n) = max{X1, X2, · · · , Xn}. Using Proposition 10.15, the marginal p.d.f. of
X(n) is given by
\[
g_{X_{(n)}}(x) = \begin{cases} \frac{n}{\theta^n}\, x^{n-1}, & \text{if } x \in (0, \theta),\\ 0, & \text{otherwise}. \end{cases}
\]
Then,
\[
EX_{(n)} = \int_0^{\theta} x \cdot \frac{n}{\theta^n} x^{n-1}\, dx = \frac{n}{n+1}\,\theta, \qquad EX_{(n)}^2 = \int_0^{\theta} x^2 \cdot \frac{n}{\theta^n} x^{n-1}\, dx = \frac{n}{n+2}\,\theta^2,
\]
and
\[
Var(X_{(n)}) = \frac{n}{n+2}\,\theta^2 - \left(\frac{n}{n+1}\,\theta\right)^2 = \theta^2\, \frac{n(n+1)^2 - n^2(n+2)}{(n+2)(n+1)^2} = \frac{n}{(n+2)(n+1)^2}\,\theta^2.
\]
Now, limn EX(n) = θ and limn Var(X(n)) = 0. Hence, by Corollary 12.20, {X(n)}n converges in
2nd mean to θ and also in probability.
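The convergence in Example 12.21 can also be seen by simulation. The sketch below (an illustration only, assuming numpy is available) generates many copies of X(n) for increasing n and reports their sample mean and variance, which approach θ and 0 respectively:

import numpy as np

rng = np.random.default_rng(4)
theta = 5.0
for n in (10, 100, 1_000):
    # 2000 independent copies of X_(n) = max of n Uniform(0, theta) draws
    x_max = rng.uniform(0.0, theta, size=(2_000, n)).max(axis=1)
    print(n, x_max.mean(), x_max.var())   # mean -> theta, variance -> 0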
Remark 12.22 (Convergence in probability does not imply convergence in r-th mean). Consider a
sequence of discrete RVs {Xn}n with Xn ∼ Bernoulli(1/n), ∀n. Consider Yn := nXn, ∀n. Then the Yn's
are also discrete with the p.m.f.s given by
\[
f_{Y_n}(y) = \begin{cases} 1 - \frac{1}{n}, & \text{if } y = 0,\\ \frac{1}{n}, & \text{if } y = n,\\ 0, & \text{otherwise}. \end{cases}
\]
For all ϵ > 0, we have P(|Yn| ≥ ϵ) = 1/n → 0 as n → ∞ and hence $Y_n \xrightarrow{P} 0$. But, for any r > 1,
E|Yn|^r = n^{r−1}, ∀n, which does not converge to 0. Hence, {Yn}n does not converge to 0 in r-th mean.
Example 12.23. Let X1, X2, · · · be i.i.d. RVs following the N(µ, σ²) distribution. Recall that
$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$. Then limn EX̄n = limn µ = µ and limn Var(X̄n) = limn σ²/n = 0. By
Corollary 12.20, {X̄n}n converges in 2nd mean to µ and also in probability.
Theorem 12.24 (Weak Law of Large Numbers (WLLN)). Let X1 , X2 , · · · be i.i.d. RVs such that
EX1 exists. Then, $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} EX_1$ as $n \to \infty$.
Remark 12.25. We only discuss the proof of Theorem 12.24 when $EX_1^2$ exists. The proof of the
theorem when $EX_1^2$ does not exist is beyond the scope of this course. However, we shall use this
theorem in its full generality.
Proof of WLLN (Theorem 12.24) (assuming $EX_1^2 < \infty$). Observe that $E\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} EX_i = \frac{1}{n}\, n\, EX_1 = EX_1$
and, using the independence of the Xi's, we have
\[
Var(\bar{X}_n) = \frac{1}{n^2} Var\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n} Var(X_1) \xrightarrow[n \to \infty]{} 0.
\]
By Corollary 12.20, the result follows. □
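A quick simulation of the WLLN (an illustration only, assuming numpy is available), using i.i.d. Exponential(1) RVs for which EX1 = 1: the probability that the sample mean deviates from 1 by at least 0.1 shrinks as n grows.

import numpy as np

rng = np.random.default_rng(5)
for n in (10, 100, 10_000):
    # 500 independent realizations of the sample mean X_bar_n
    xbar = rng.exponential(size=(500, n)).mean(axis=1)
    print(n, (np.abs(xbar - 1.0) >= 0.1).mean())   # estimate of P(|X_bar_n - 1| >= 0.1)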