CHAPTER 2
RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
CHAPTER CONTENTS
Mathematical Preliminaries
Probability
Random Variable and Probability Distribution
Properties of Probability Distributions
    Expectation, Median, and Mode
    Variance and Standard Deviation
    Skewness, Kurtosis, and Moments
Transformation of Random Variables
In this chapter, the notions of random variables and probability distributions, which form the basis of probability and statistics, are introduced. Then simple statistics that summarize probability distributions are discussed.
2.1 MATHEMATICAL PRELIMINARIES

Consider a trial of throwing a fair six-sided die, whose possible outcomes form the sample space Ω = {1, 2, 3, 4, 5, 6}. A subset of the sample space is called an event; for example, the event A that an odd number appears is expressed as

A = {1, 3, 5}.
The event with no sample point is called the empty event and denoted by ∅. An
event consisting only of a single sample point is called an elementary event, while
an event consisting of multiple sample points is called a composite event. An event
that includes all possible sample points is called the whole event. Below, the notion
of combining events is explained using Fig. 2.1.
The event that at least one of the events A and B occurs is called the union of events and denoted by A ∪ B. For example, the union of event A that an odd number appears and event B that a number less than or equal to three appears is expressed as

A ∪ B = {1, 2, 3, 5}.

FIGURE 2.1: Combination of events.
On the other hand, the event that both events A and B occur simultaneously is called the intersection of events and denoted by A ∩ B. The intersection of the above events A and B is given by

A ∩ B = {1, 3}.

When A ∩ B = ∅, events A and B are called disjoint events. The event that an odd number appears and the event that an even number appears cannot occur simultaneously and thus are disjoint.
disjoint. For events A, B, and C, the following distributive laws hold:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C).
The event that event A does not occur is called the complementary event of A and denoted by A^c. The complementary event of the event that an odd number appears is that an odd number does not appear, i.e., an even number appears. For the union and intersection of events A and B, the following De Morgan's laws hold:

(A ∪ B)^c = A^c ∩ B^c,
(A ∩ B)^c = A^c ∪ B^c.
2.2 PROBABILITY
Probability is a measure of the likelihood that an event will occur, and the probability that event A occurs is denoted by Pr(A). The Russian mathematician Kolmogorov defined probability by the following three axioms, as an abstraction of the evident properties that probability should satisfy.
1. Non-negativity: For any event A_i,

0 ≤ Pr(A_i) ≤ 1.

2. Unitarity: For the whole event Ω,

Pr(Ω) = 1.

3. Additivity: For any sequence of disjoint events A_1, A_2, . . .,

Pr(A_1 ∪ A_2 ∪ · · · ) = Pr(A_1) + Pr(A_2) + · · · .
From the above axioms, events A and B are shown to satisfy the following additive law:

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

This can be extended to more than two events: for events A, B, and C,

Pr(A ∪ B ∪ C) = Pr(A) + Pr(B) + Pr(C)
              − Pr(A ∩ B) − Pr(B ∩ C) − Pr(A ∩ C) + Pr(A ∩ B ∩ C).
2.3 RANDOM VARIABLE AND PROBABILITY DISTRIBUTION

A variable whose value is determined by a probabilistic trial is called a random variable, and a random variable that takes discrete values is called a discrete random variable. If the probability that discrete random variable x takes each value x is given by

Pr(x) = f(x),

f(x) is called the probability mass function. Note that f(x) should satisfy

∀x, f(x) ≥ 0, and Σ_x f(x) = 1.

FIGURE 2.2: Example of a probability mass function: the outcome of throwing a fair six-sided die (discrete uniform distribution U{1, 2, . . . , 6}).
The outcome of throwing a fair six-sided die, x ∈ {1, 2, 3, 4, 5, 6}, is a discrete random
variable, and its probability mass function is given by f (x) = 1/6 (Fig. 2.2).
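As a quick numerical sketch (not from the book), the following Python snippet samples die rolls and checks that the empirical frequencies approach f(x) = 1/6:

```python
# Empirically checking the probability mass function f(x) = 1/6
# of a fair six-sided die by sampling.
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # outcomes in {1, ..., 6}

for x in range(1, 7):
    # Each empirical frequency should be close to 1/6 ≈ 0.167.
    print(x, np.mean(rolls == x))
```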
A random variable that takes a continuous value is called a continuous random variable. If the probability that continuous random variable x takes a value in [a, b] is given by

Pr(a ≤ x ≤ b) = ∫_a^b f(x) dx,   (2.1)
FIGURE 2.3: Example of (a) a probability density function f(x) and (b) its cumulative distribution function F(x).
f (x) is called a probability density function (Fig. 2.3(a)). Note that f (x) should satisfy
∀x, f(x) ≥ 0, and ∫ f(x) dx = 1.
For example, the outcome of spinning a roulette wheel, x ∈ [0, 2π), is a continuous random variable, and its probability density function is given by f(x) = 1/(2π). Note that
Eq. (2.1) also has an important implication, i.e., the probability that continuous
random variable x exactly takes value b is actually zero:
Pr(b ≤ x ≤ b) = ∫_b^b f(x) dx = 0.
Thus, the probability that the outcome of spinning a roulette wheel is exactly a particular angle is zero.
The probability that continuous random variable x takes a value less than or equal
to b,
F(b) = Pr(x ≤ b) = ∫_{−∞}^b f(x) dx,
is called the cumulative distribution function (Fig. 2.3(b)). The cumulative distribution function F satisfies the following properties:

• Monotone nondecreasing: x < x′ implies F(x) ≤ F(x′).
• Left limit: lim_{x→−∞} F(x) = 0.
• Right limit: lim_{x→+∞} F(x) = 1.
If the derivative of a cumulative distribution function exists, it agrees with the
probability density function:
F ′(x) = f (x).
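This relationship can be checked numerically. Here is a minimal sketch for the roulette example above, where the closed form F(x) = x/(2π) follows from the uniform density:

```python
# Verifying F'(x) = f(x) for the uniform density on [0, 2*pi)
# via a central finite difference.
import numpy as np

f = lambda x: 1.0 / (2.0 * np.pi)   # probability density function
F = lambda x: x / (2.0 * np.pi)     # cumulative distribution function

x, h = 1.0, 1e-6
dF = (F(x + h) - F(x - h)) / (2.0 * h)  # numerical derivative of F
print(dF, f(x))                          # both ≈ 0.15915
```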
2.4 PROPERTIES OF PROBABILITY DISTRIBUTIONS

Expectation, Median, and Mode

The expectation is the average of x weighted according to f(x):

Discrete: E[x] = Σ_x x f(x),
Continuous: E[x] = ∫ x f(x) dx.

FIGURE 2.4: Expectation is the average of x weighted according to f(x), and the median is the 50% point from both the left-hand and right-hand sides. The α-quantile for 0 ≤ α ≤ 1 is a generalization of the median that gives the 100α% point from the left-hand side. The mode is the maximizer of f(x).

Note that, as explained in Section 4.5, there are probability distributions, such as the Cauchy distribution, for which the expectation does not exist (diverges to infinity).
The expectation can be defined for any function ξ of x similarly:

Discrete: E[ξ(x)] = Σ_x ξ(x) f(x),
Continuous: E[ξ(x)] = ∫ ξ(x) f(x) dx.

FIGURE 2.5: Income distribution. The expectation is 62.1 thousand dollars, while the median is 31.3 thousand dollars.
For a constant c, the expectation operator E satisfies the following properties:

E[c] = c,
E[x + c] = E[x] + c,
E[cx] = cE[x].
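A small sampling-based sanity check of these properties (an illustrative sketch; the exponential distribution is an arbitrary choice):

```python
# Checking E[x + c] = E[x] + c and E[cx] = c E[x] with sample means.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1_000_000)
c = 3.0

print(np.mean(x + c), np.mean(x) + c)  # approximately equal
print(np.mean(c * x), c * np.mean(x))  # approximately equal
```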
The median is defined as the value b such that

Pr(x ≤ b) = 1/2.

That is, the median is the "center" of a probability distribution in the sense that it is the 50% point from both the left-hand and right-hand sides. In the example of Fig. 2.5, the median is 31.3 thousand dollars, which indeed sits in the middle of the population, whereas the expectation of 62.1 thousand dollars is pulled upward by a small number of very high incomes.
The α-quantile for 0 ≤ α ≤ 1 is a generalization of the median that gives b such
that
Pr(x ≤ b) = α.
That is, the α-quantile gives the 100α% point from the left-hand side (Fig. 2.4) and
is reduced to the median when α = 0.5.
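As an illustrative sketch (assuming NumPy's np.quantile as the empirical counterpart of Pr(x ≤ b) = α), medians and quantiles can be estimated from samples:

```python
# Empirical median and 0.9-quantile of the standard normal distribution.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

print(np.quantile(x, 0.5))  # median, ≈ 0 for the standard normal
print(np.quantile(x, 0.9))  # 0.9-quantile, ≈ 1.28 for the standard normal
```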
Let us consider a probability density function f defined on a finite interval [a, b].
Then the minimizer of the expected squared error, defined by

E[(x − y)^2] = ∫_a^b (x − y)^2 f(x) dx,
with respect to y is shown to agree with the expectation of x. Similarly, the minimizer y of the expected absolute error, defined by

E[|x − y|] = ∫_a^b |x − y| f(x) dx,   (2.2)

is shown to agree with the median of x. More generally, the expected asymmetric absolute error, defined by

∫_a^b |x − y|_α f(x) dx, where |x − y|_α = α(x − y) if x > y and (1 − α)(y − x) if x ≤ y,

is minimized with respect to y by the α-quantile of x.
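These three claims can be verified numerically. The following sketch (not from the book; the grid search and the exponential test distribution are arbitrary choices) minimizes each expected error over candidate values of y:

```python
# The expected squared error is minimized by the mean, the expected
# absolute error by the median, and the asymmetric absolute error
# |x - y|_alpha by the alpha-quantile.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=200_000)  # a skewed distribution
ys = np.linspace(0.0, 3.0, 601)               # candidate values of y

sq = [np.mean((x - y) ** 2) for y in ys]      # expected squared error
ab = [np.mean(np.abs(x - y)) for y in ys]     # expected absolute error
alpha = 0.9
pin = [np.mean(np.where(x > y, alpha * (x - y), (1 - alpha) * (y - x)))
       for y in ys]                           # asymmetric absolute error

print(ys[np.argmin(sq)], np.mean(x))             # ≈ mean (1.0)
print(ys[np.argmin(ab)], np.median(x))           # ≈ median (log 2 ≈ 0.693)
print(ys[np.argmin(pin)], np.quantile(x, alpha)) # ≈ 0.9-quantile (≈ 2.30)
```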
Another popular statistic is the mode, which is defined as the maximizer of f (x)
(Fig. 2.4).
Variance and Standard Deviation

The variance of x, denoted by V[x], is the expectation of the squared deviation from the expectation:

V[x] = E[(x − E[x])^2].

Expanding the square gives

V[x] = E[x^2] − (E[x])^2,

which often makes the computation easier. For a constant c, the variance operator V satisfies the following properties:
V [c] = 0,
V [x + c] = V [x],
V [cx] = c2V [x].
Note that these properties are quite different from those of the expectation.
The square root of the variance is called the standard deviation and is denoted by D[x]:

D[x] = √V[x].
Conventionally, the variance and the standard deviation are denoted by σ 2 and σ,
respectively.
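A quick sampling check (an assumed example, not from the book) of the identity V[x] = E[x^2] − (E[x])^2 and of the properties above:

```python
# Two ways to compute the variance, plus the variance properties.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=1_000_000)
c = 3.0

v_def = np.mean((x - np.mean(x)) ** 2)     # E[(x - E[x])^2]
v_alt = np.mean(x ** 2) - np.mean(x) ** 2  # E[x^2] - (E[x])^2
print(v_def, v_alt)                        # both ≈ 4 (sigma^2 = 2^2)
print(np.var(x + c), np.var(x))            # V[x + c] = V[x]
print(np.var(c * x), c ** 2 * np.var(x))   # V[cx] = c^2 V[x]
print(np.std(x))                           # D[x] = sqrt(V[x]) ≈ 2
```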
Skewness, Kurtosis, and Moments

The skewness and kurtosis are higher-order statistics that measure, respectively, the asymmetry of a distribution (Fig. 2.6) and the sharpness of its peak relative to its tails (Fig. 2.7):

Skewness: E[(x − E[x])^3] / (V[x])^{3/2},
Kurtosis: E[(x − E[x])^4] / (V[x])^2 − 3,

where subtracting 3 makes the kurtosis of the normal distribution zero. More generally,

µ_k = E[x^k]

is called the kth moment about the origin. The expectation, variance, skewness, and kurtosis can be expressed by using µ_k as
Expectation: µ_1,
Variance: µ_2 − µ_1^2,
Skewness: (µ_3 − 3µ_2µ_1 + 2µ_1^3) / (µ_2 − µ_1^2)^{3/2},
Kurtosis: (µ_4 − 4µ_3µ_1 + 6µ_2µ_1^2 − 3µ_1^4) / (µ_2 − µ_1^2)^2 − 3.
FIGURE 2.6: Skewness.

FIGURE 2.7: Kurtosis.
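As a sketch (assuming SciPy is available for the reference routines), skewness and kurtosis can be computed from raw moments and compared with library results; the exponential distribution, whose skewness is 2 and excess kurtosis is 6, serves as a test case:

```python
# Skewness and kurtosis from raw moments mu_k = E[x^k].
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=1_000_000)

mu1, mu2, mu3, mu4 = (np.mean(x ** k) for k in (1, 2, 3, 4))
var = mu2 - mu1 ** 2
skew = (mu3 - 3 * mu2 * mu1 + 2 * mu1 ** 3) / var ** 1.5
kurt = (mu4 - 4 * mu3 * mu1 + 6 * mu2 * mu1 ** 2 - 3 * mu1 ** 4) / var ** 2 - 3

print(skew, stats.skew(x))      # ≈ 2 for the exponential distribution
print(kurt, stats.kurtosis(x))  # ≈ 6 (SciPy's default is excess kurtosis)
```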
All the moments µ_1, µ_2, . . . can be obtained from the moment-generating function of x, defined as

Discrete: M_x(t) = E[e^{tx}] = Σ_x e^{tx} f(x),
Continuous: M_x(t) = E[e^{tx}] = ∫ e^{tx} f(x) dx.

Indeed, substituting zero into the kth derivative of the moment-generating function with respect to t, M_x^{(k)}(t), gives the kth moment:

M_x^{(k)}(0) = µ_k.
The Taylor series expansion of a function g at the origin is given by

g(t) = g(0) + (g′(0)/1!) t + (g″(0)/2!) t^2 + · · · .

If higher-order terms in the right-hand side are ignored and the infinite sum is approximated by a finite sum, an approximation to g(t) can be obtained. When only the constant term g(0) is used, g(t) is simply approximated by g(0), which is too rough. However, when the linear term (g′(0)/1!) t is included, the approximation gets better. By further including higher-order terms, the approximation gets more accurate and converges to g(t) if all terms are included.

FIGURE 2.8: Taylor series expansion at the origin.
Given that the kth derivative of function e^{tx} with respect to t is x^k e^{tx}, the Taylor series expansion (Fig. 2.8) of function e^{tx} at the origin with respect to t yields

e^{tx} = 1 + tx + (tx)^2/2! + (tx)^3/3! + · · · .
Taking the expectation of both sides gives

E[e^{tx}] = M_x(t) = 1 + µ_1 t + (µ_2/2!) t^2 + (µ_3/3!) t^3 + · · · .

Taking the derivatives of both sides repeatedly yields

M_x′(t) = µ_1 + µ_2 t + (µ_3/2!) t^2 + (µ_4/3!) t^3 + · · · ,
M_x″(t) = µ_2 + µ_3 t + (µ_4/2!) t^2 + (µ_5/3!) t^3 + · · · ,
...
M_x^{(k)}(t) = µ_k + µ_{k+1} t + (µ_{k+2}/2!) t^2 + (µ_{k+3}/3!) t^3 + · · · .

Substituting zero into this gives M_x^{(k)}(0) = µ_k.
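This can be illustrated symbolically. The following sketch (assuming SymPy) uses the exponential distribution with rate 1, whose moment-generating function is M(t) = 1/(1 − t) for t < 1 and whose kth moment is k!:

```python
# Recovering moments as derivatives of the moment-generating function.
import sympy as sp

t = sp.symbols('t')
M = 1 / (1 - t)  # known MGF of the exponential distribution with rate 1

for k in range(1, 5):
    mu_k = sp.diff(M, t, k).subs(t, 0)  # kth derivative at the origin
    print(k, mu_k, sp.factorial(k))     # mu_k equals k!
```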
The moment-generating function does not necessarily exist. Its complex counterpart, the characteristic function

φ_x(t) = M_x(it) = E[e^{itx}],

always exists, where i denotes the imaginary unit such that i^2 = −1. The characteristic function corresponds to the Fourier transform of a probability density function.
2.5 TRANSFORMATION OF RANDOM VARIABLES

Let us consider a linear transformation of random variable x,

r = ax + b,

for constants a and b. The properties of the expectation and variance give E[r] = aE[x] + b and V[r] = a^2 V[x]. Setting a = 1/D[x] and b = −E[x]/D[x] yields

z = x/D[x] − E[x]/D[x] = (x − E[x])/D[x],

which has expectation 0 and variance 1. This transformation is called standardization.
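A sampling sketch of standardization (the gamma distribution here is an arbitrary choice):

```python
# z = (x - E[x]) / D[x] has expectation 0 and variance 1,
# whatever the original distribution of x.
import numpy as np

rng = np.random.default_rng(6)
x = rng.gamma(shape=2.0, scale=3.0, size=1_000_000)

z = (x - np.mean(x)) / np.std(x)  # standardized variable
print(np.mean(z), np.var(z))      # ≈ 0 and ≈ 1
```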
Next, let us consider a general transformation of random variable x into r, with inverse transformation

x = ξ(r).

The probability density function g(r) of the transformed variable is then expressed as

g(r) = f(ξ(r)) dx/dr.

This can be understood from the change of variables in integration, expressed as

∫_X f(x) dx = ∫_R f(ξ(r)) (dx/dr) dr.

This allows us to change the variable of integration from x to r, where dx/dr in the right-hand side corresponds to the ratio of lengths when the variable of integration is changed from x to r. For example, for
FIGURE 2.9: One-dimensional change of variables in integration. For multidimensional cases, see Fig. 4.2.
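As a numerical sketch of this formula (the transformation x = ξ(r) = r^2 with x uniform on (0, 1) is an assumed example), the histogram of r should match g(r) = f(ξ(r)) · dx/dr = 2r:

```python
# Change-of-variables check: x ~ Uniform(0, 1), r = sqrt(x),
# so x = xi(r) = r**2 and g(r) = 1 * d(r**2)/dr = 2r.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, size=1_000_000)  # f(x) = 1 on (0, 1)
r = np.sqrt(x)                              # transformed variable

hist, edges = np.histogram(r, bins=20, range=(0.0, 1.0), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - 2 * centers)))   # small: histogram ≈ g(r) = 2r
```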