
CHAPTER 2

RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
CHAPTER CONTENTS
Mathematical Preliminaries
Probability
Random Variable and Probability Distribution
Properties of Probability Distributions
    Expectation, Median, and Mode
    Variance and Standard Deviation
    Skewness, Kurtosis, and Moments
Transformation of Random Variables

In this chapter, the notions of random variables and probability distributions are
introduced, which form the basis of probability and statistics. Then simple statistics
that summarize probability distributions are discussed.

2.1 MATHEMATICAL PRELIMINARIES


When throwing a six-sided die, the possible outcomes are only 1, 2, 3, 4, 5, 6, and
no others. Such possible outcomes are called sample points and the set of all sample
points is called the sample space.
An event is defined as a subset of the sample space. For example, event A that any
odd number appears is expressed as

A = {1, 3, 5}.

The event with no sample point is called the empty event and denoted by ∅. An
event consisting only of a single sample point is called an elementary event, while
an event consisting of multiple sample points is called a composite event. An event
that includes all possible sample points is called the whole event. Below, the notion
of combining events is explained using Fig. 2.1.
The event that at least one of the events A and B occurs is called the union of
events and denoted by A ∪ B. For example, the union of event A that an odd number
appears and event B that a number less than or equal to three appears is expressed as

A ∪ B = {1, 3, 5} ∪ {1, 2, 3} = {1, 2, 3, 5}.


FIGURE 2.1
Combination of events: (a) event A; (b) event B; (c) complementary event Ac; (d) union of events; (e) intersection of events; (f) disjoint events; (g), (h) distributive laws; (i), (j) De Morgan's laws.

On the other hand, the event that both events A and B occur simultaneously is called
the intersection of events and denoted by A ∩ B. The intersection of the above events
A and B is given by

A ∩ B = {1, 3, 5} ∩ {1, 2, 3} = {1, 3}.

If events A and B never occur at the same time, i.e.,

A ∩ B = ∅,

events A and B are called disjoint events. The event that an odd number appears
and the event that an even number appears cannot occur simultaneously and thus are
disjoint. For events A, B, and C, the following distributive laws hold:

(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C).

The event that event A does not occur is called the complementary event of A and
denoted by Ac . The complementary event of the event that an odd number appears is
that an odd number does not appear, i.e., an even number appears. For the union and
intersection of events A and B, the following De Morgan’s laws hold:

(A ∪ B)c = Ac ∩ B c ,
(A ∩ B)c = Ac ∪ B c .
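
As a quick illustration (not part of the original text), the identities above can be checked mechanically with Python's built-in set type; the die sample space and the event names A, B, C below are just the running example of this section, and the snippet is a minimal sketch rather than anything from the book.

# Checking the event-algebra identities for the six-sided-die example
# with Python's built-in set type (an illustrative sketch).
omega = {1, 2, 3, 4, 5, 6}           # whole event (sample space)
A = {1, 3, 5}                        # an odd number appears
B = {1, 2, 3}                        # a number <= 3 appears
C = {2, 4, 6}                        # an even number appears

print(A | B)                         # union: {1, 2, 3, 5}
print(A & B)                         # intersection: {1, 3}
print(A & C)                         # disjoint events: set()

# Distributive laws
assert (A | B) & C == (A & C) | (B & C)
assert (A & B) | C == (A | C) & (B | C)

# De Morgan's laws (complements taken within omega)
assert omega - (A | B) == (omega - A) & (omega - B)
assert omega - (A & B) == (omega - A) | (omega - B)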

2.2 PROBABILITY
Probability is a measure of the likelihood that an event will occur, and the probability that event A occurs is denoted by Pr(A). The Russian mathematician Kolmogorov defined probability by the following three axioms, as an abstraction of the evident properties that probability should satisfy.
1. Non-negativity: For any event Ai ,

0 ≤ Pr(Ai ) ≤ 1.

2. Unitarity: For entire sample space Ω,

Pr(Ω) = 1.

3. Additivity: For any countable sequence of disjoint events A1 , A2 , . . .,

Pr(A1 ∪ A2 ∪ · · · ) = Pr(A1 ) + Pr(A2 ) + · · · .

From the above axioms, events A and B are shown to satisfy the following
additive law:

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

This can be extended to more than two events: for events A, B, and C,

Pr(A ∪ B ∪ C) = Pr(A) + Pr(B) + Pr(C)


− Pr(A ∩ B) − Pr(A ∩ C) − Pr(B ∩ C)
+ Pr(A ∩ B ∩ C).
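
The additive law is easy to verify numerically for the die example. The sketch below is an illustration added here, assuming the uniform assignment of probability 1/6 to each sample point; the helper pr() is a hypothetical name, not notation from the book.

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def pr(event):
    # Uniform probability: each sample point has probability 1/6.
    return Fraction(len(event), len(omega))

A = {1, 3, 5}   # odd number
B = {1, 2, 3}   # number <= 3

lhs = pr(A | B)
rhs = pr(A) + pr(B) - pr(A & B)
print(lhs, rhs)          # 2/3 2/3
assert lhs == rhs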

FIGURE 2.2
Example of probability mass function: the outcome of throwing a fair six-sided die (discrete uniform distribution U{1, 2, . . . , 6}).

2.3 RANDOM VARIABLE AND PROBABILITY DISTRIBUTION
A variable is called a random variable if probability is assigned to each realization
of the variable. A probability distribution is the function that describes the mapping
from any realized value of the random variable to probability.
A countable set is a set whose elements can be enumerated as 1, 2, 3, . . .. A
random variable that takes a value in a countable set is called a discrete random
variable. Note that the size of a countable set does not have to be finite but can be
infinite such as the set of all natural numbers. If probability for each value of discrete
random variable x is given by

Pr(x) = f (x),

f (x) is called the probability mass function. Note that f (x) should satisfy

∀x, f(x) ≥ 0  and  ∑_x f(x) = 1.

The outcome of throwing a fair six-sided die, x ∈ {1, 2, 3, 4, 5, 6}, is a discrete random
variable, and its probability mass function is given by f (x) = 1/6 (Fig. 2.2).
A random variable that takes a continuous value is called a continuous random
variable. If probability that continuous random variable x takes a value in [a, b] is
given by

Pr(a ≤ x ≤ b) = ∫_a^b f(x) dx,    (2.1)

FIGURE 2.3
Example of probability density function and its cumulative distribution function: (a) probability density function f(x); (b) cumulative distribution function F(x).

f (x) is called a probability density function (Fig. 2.3(a)). Note that f (x) should satisfy

∀x, f(x) ≥ 0  and  ∫ f(x) dx = 1.

For example, the outcome of spinning a roulette, x ∈ [0, 2π), is a continuous random
variable, and its probability density function is given by f (x) = 1/(2π). Note that
Eq. (2.1) also has an important implication, i.e., the probability that continuous
random variable x exactly takes value b is actually zero:

Pr(b ≤ x ≤ b) = ∫_b^b f(x) dx = 0.
Thus, the probability that the outcome of spinning a roulette is exactly a particular
angle is zero.
The probability that continuous random variable x takes a value less than or equal
to b,

F(b) = Pr(x ≤ b) = ∫_{−∞}^{b} f(x) dx,
is called the cumulative distribution function (Fig. 2.3(b)). The cumulative distribu-
tion function F satisfies the following properties:
• Monotone nondecreasing: x < x ′ implies F(x) ≤ F(x ′).
• Left limit: lim x→−∞ F(x) = 0.
• Right limit: lim x→+∞ F(x) = 1.
If the derivative of a cumulative distribution function exists, it agrees with the
probability density function:
F ′(x) = f (x).
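
A rough numerical illustration of the relation F′(x) = f(x), added here for concreteness: it discretizes the roulette density f(x) = 1/(2π) on [0, 2π), accumulates it into an approximate cumulative distribution function, and differentiates that numerically. The grid size and the use of NumPy are arbitrary choices.

import numpy as np

x = np.linspace(0.0, 2 * np.pi, 1001)
f = np.full_like(x, 1.0 / (2 * np.pi))      # density of the roulette outcome
F = np.cumsum(f) * (x[1] - x[0])            # crude cumulative integral of f

dF = np.gradient(F, x)                      # numerical derivative of the CDF
print(np.allclose(dF, f, atol=1e-3))        # True: F'(x) is close to f(x)
print(F[-1])                                # close to 1: total probability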

FIGURE 2.4
Expectation is the average of x weighted according to f (x), and median
is the 50% point both from the left-hand and right-hand sides. α-quantile
for 0 ≤ α ≤ 1 is a generalization of the median that gives the 100α%
point from the left-hand side. Mode is the maximizer of f (x).

Pr(a ≤ x) is called the upper-tail probability or the right-tail probability, while Pr(x ≤ b) is called the lower-tail probability or the left-tail probability. The upper-tail
and lower-tail probabilities together are called the two-sided probability, and either
of them is called a one-sided probability.

2.4 PROPERTIES OF PROBABILITY DISTRIBUTIONS


When discussing properties of probability distributions, it is convenient to have
simple statistics that summarize probability mass/density functions. In this section,
such statistics are introduced.

2.4.1 EXPECTATION, MEDIAN, AND MODE


The expectation is the value that a random variable is expected to take (Fig. 2.4). The
expectation of random variable x, denoted by E[x], is defined as the average of x
weighted according to probability mass/density function f (x):

Discrete: E[x] = ∑_x x f(x),
Continuous: E[x] = ∫ x f(x) dx.

Note that, as explained in Section 4.5, there are probability distributions, such as the Cauchy distribution, for which the expectation does not exist (the defining integral diverges).
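
For a concrete instance of the discrete definition, the expectation of the fair-die outcome can be computed directly from the probability mass function f(x) = 1/6. This is an added illustration; the use of Python's fractions module is just to keep the arithmetic exact.

from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}        # probability mass function
expectation = sum(x * p for x, p in f.items())      # E[x] = sum_x x f(x)
print(expectation)                                   # 7/2, i.e. 3.5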
The expectation can be defined for any function ξ of x similarly:

Discrete: E[ξ(x)] = ∑_x ξ(x) f(x),
Continuous: E[ξ(x)] = ∫ ξ(x) f(x) dx.

FIGURE 2.5
Income distribution. The expectation is 62.1 thousand dollars, while the
median is 31.3 thousand dollars.



For constant c, the expectation operator E satisfies the following properties:

E[c] = c,
E[x + c] = E[x] + c,
E[cx] = cE[x].

Although the expectation represents the “center” of a probability distribution, it can be quite different from what is intuitively expected in the presence of outliers. For example, in the income distribution illustrated in Fig. 2.5, because one person earns 1 million dollars, everybody else is below the expectation of 62.1 thousand dollars. In such a situation, the median is often more appropriate than the expectation; the median is defined as b such that

Pr(x ≤ b) = 1/2.

That is, the median is the “center” of a probability distribution in the sense that it is
the 50% point both from the left-hand and right-hand sides. In the example of Fig. 2.5,
the median is 31.3 thousand dollars and it is indeed in the middle of everybody.
The α-quantile for 0 ≤ α ≤ 1 is a generalization of the median that gives b such
that

Pr(x ≤ b) = α.

That is, the α-quantile gives the 100α% point from the left-hand side (Fig. 2.4) and
is reduced to the median when α = 0.5.
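
As an added numerical illustration of the median and the α-quantile, the snippet below draws samples from an exponential distribution (an arbitrary choice of a skewed distribution, echoing the income example) and reports empirical quantiles; sample-based estimates only approximate the population quantities.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)

print(np.mean(x))                          # expectation, close to 1.0
print(np.median(x))                        # median, close to log(2) ~ 0.69
print(np.quantile(x, [0.25, 0.5, 0.75]))   # alpha-quantiles for alpha = 0.25, 0.5, 0.75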

Let us consider a probability density function f defined on a finite interval [a, b].
Then the minimizer of the expected squared error, defined by

E[(x − y)²] = ∫_a^b (x − y)² f(x) dx,

with respect to y is shown to agree with the expectation of x. Similarly, the minimizer y of the expected absolute error, defined by

E[|x − y|] = ∫_a^b |x − y| f(x) dx,    (2.2)

with respect to y is shown to agree with the median of x. Furthermore, a weighted variant of Eq. (2.2),

∫_a^b |x − y|_α f(x) dx,  where  |x − y|_α = α(x − y) if x > y and (1 − α)(y − x) if x ≤ y,

is minimized with respect to y by the α-quantile of x.
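
A small sample-based check of this claim, added here as an illustration: the weighted absolute error is evaluated on a grid of candidate values y, and its minimizer is compared with the empirical α-quantile. The exponential distribution, the value α = 0.8, and the grid resolution are arbitrary choices, and the agreement is only approximate because samples replace the density f(x).

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=20_000)
alpha = 0.8

def weighted_abs_error(y):
    # alpha * (x - y) when x > y, (1 - alpha) * (y - x) when x <= y, averaged over samples.
    return np.mean(np.where(x > y, alpha * (x - y), (1 - alpha) * (y - x)))

grid = np.linspace(0.0, 4.0, 801)
losses = [weighted_abs_error(y) for y in grid]
minimizer = grid[int(np.argmin(losses))]

print(minimizer)                    # close to ...
print(np.quantile(x, alpha))        # ... the empirical 0.8-quantile (about 1.6 here)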
Another popular statistic is the mode, which is defined as the maximizer of f (x)
(Fig. 2.4).

2.4.2 VARIANCE AND STANDARD DEVIATION


Although the expectation is a useful statistic to characterize probability distributions,
probability distributions can be different even when they share the same expectation.
Here, another statistic called the variance is introduced to represent the spread of the
probability distribution.
The variance of random variable x, denoted by V[x], is defined as

V[x] = E[(x − E[x])²].

In practice, expanding the above expression as

V[x] = E[x² − 2xE[x] + (E[x])²] = E[x²] − (E[x])²

often makes the computation easier. For constant c, variance operator V satisfies the
following properties:

V [c] = 0,
V [x + c] = V [x],
V [cx] = c2V [x].

Note that these properties are quite different from those of the expectation.
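
The shortcut formula V[x] = E[x²] − (E[x])² and the shift/scaling properties can be verified exactly for the fair die. The snippet below is an added sketch using exact rational arithmetic; the constant c = 10 is arbitrary.

from fractions import Fraction

xs = range(1, 7)
f = Fraction(1, 6)

E = sum(x * f for x in xs)                       # E[x]   = 7/2
E2 = sum(x * x * f for x in xs)                  # E[x^2] = 91/6
V = E2 - E**2
print(V)                                         # 35/12

c = 10
V_shift = sum((x + c - (E + c))**2 * f for x in xs)
V_scale = sum((c * x - c * E)**2 * f for x in xs)
assert V_shift == V                              # V[x + c] = V[x]
assert V_scale == c**2 * V                       # V[cx]    = c^2 V[x]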

The square root of the variance is called the standard deviation and is denoted by
D[x]:

D[x] = √V[x].

Conventionally, the variance and the standard deviation are denoted by σ² and σ,
respectively.

2.4.3 SKEWNESS, KURTOSIS, AND MOMENTS


In addition to the expectation and variance, higher-order statistics such as the
skewness and kurtosis are also often used. The skewness and kurtosis represent
asymmetry and sharpness of probability distributions, respectively, and are defined as

Skewness: E[(x − E[x])³] / (D[x])³,
Kurtosis: E[(x − E[x])⁴] / (D[x])⁴ − 3.

The factors (D[x])³ and (D[x])⁴ in the denominators are for normalization, and the −3 included in the definition of the kurtosis makes the kurtosis of the normal distribution zero (see Section 4.2). As illustrated in Fig. 2.6, the right tail is longer than the left tail if the skewness is positive, while the left tail is longer than the right tail if the skewness is negative. The distribution is perfectly symmetric if the skewness is zero. As illustrated in Fig. 2.7, the probability distribution is sharper than the normal distribution if the kurtosis is positive, while it is flatter than the normal distribution if the kurtosis is negative.
The above discussions imply that the statistic,

ν_k = E[(x − E[x])^k],

plays an important role in characterizing probability distributions. ν_k is called the kth moment about the expectation, while

µ_k = E[x^k]

is called the kth moment about the origin. The expectation, variance, skewness, and kurtosis can be expressed by using µ_k as

Expectation: µ₁,
Variance: µ₂ − µ₁²,
Skewness: (µ₃ − 3µ₂µ₁ + 2µ₁³) / (µ₂ − µ₁²)^(3/2),
Kurtosis: (µ₄ − 4µ₃µ₁ + 6µ₂µ₁² − 3µ₁⁴) / (µ₂ − µ₁²)² − 3.
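
As an added check of these expressions, the snippet below estimates the raw moments µ_k from samples of an exponential distribution (theoretical skewness 2 and excess kurtosis 6) and compares the resulting skewness and kurtosis with SciPy's estimators; sample estimates fluctuate, so the agreement is only approximate.

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=200_000)

mu = [np.mean(x**k) for k in range(5)]           # mu[k] = E[x^k], with mu[0] = 1

var = mu[2] - mu[1]**2
skew_from_mu = (mu[3] - 3*mu[2]*mu[1] + 2*mu[1]**3) / var**1.5
kurt_from_mu = (mu[4] - 4*mu[3]*mu[1] + 6*mu[2]*mu[1]**2 - 3*mu[1]**4) / var**2 - 3

print(skew_from_mu, skew(x))            # both roughly 2
print(kurt_from_mu, kurtosis(x))        # both roughly 6 (scipy returns excess kurtosis)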

FIGURE 2.6
Skewness: (a) skewness −0.32; (b) skewness 0; (c) skewness 0.32.

FIGURE 2.7
Kurtosis: (a) kurtosis −1.2; (b) kurtosis 0; (c) kurtosis 3.

Probability distributions become more constrained as the expectation, variance, skewness, and kurtosis are specified. In the limit, if the moments of all orders are specified, the probability distribution is uniquely determined. The moment-generating function allows us to handle the moments of all orders in a systematic way:


Discrete: M_x(t) = E[e^{tx}] = ∑_x e^{tx} f(x),
Continuous: M_x(t) = E[e^{tx}] = ∫ e^{tx} f(x) dx.


Indeed, substituting zero into the kth derivative of the moment-generating function with respect to t, M_x^(k)(t), gives the kth moment:

M_x^(k)(0) = µ_k.

Below, this fact is proved.



The value of function g at point t can be expressed as

g(t) = g(0) + (g′(0)/1!) t + (g′′(0)/2!) t² + ··· .
If higher-order terms in the right-hand side are ignored and the infinite sum
is approximated by a finite sum, an approximation to g(t) can be obtained.
When only the constant term g(0) is used, g(t) is simply approximated
by g(0), which is too rough. However, when the linear term tg′(0)
is also included, the approximation gets better, as illustrated in the figure. By further
including higher-order terms, the approximation gets more accurate and
converges to g(t) if all terms are included.

FIGURE 2.8
Taylor series expansion at the origin.

Given that the kth derivative of function e^{tx} with respect to t is x^k e^{tx}, the Taylor series expansion (Fig. 2.8) of function e^{tx} at the origin with respect to t yields

e^{tx} = 1 + tx + (tx)²/2! + (tx)³/3! + ··· .

Taking the expectation of both sides gives

E[e^{tx}] = M_x(t) = 1 + tµ₁ + t²µ₂/2! + t³µ₃/3! + ··· .
Taking the derivative of both sides yields

M_x′(t) = µ₁ + µ₂t + t²µ₃/2! + t³µ₄/3! + ··· ,
M_x′′(t) = µ₂ + µ₃t + t²µ₄/2! + t³µ₅/3! + ··· ,
⋮
M_x^(k)(t) = µ_k + µ_{k+1}t + t²µ_{k+2}/2! + t³µ_{k+3}/3! + ··· .

Substituting zero into this gives M_x^(k)(0) = µ_k.
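
This relation can also be checked symbolically. The sketch below, added as an illustration, builds the moment-generating function of the fair die, M_x(t) = (1/6) ∑_{j=1}^{6} e^{jt}, and confirms with SymPy that its kth derivative at t = 0 equals µ_k = E[x^k]; SymPy is an assumed dependency.

import sympy as sp

t = sp.symbols('t')
M = sp.Rational(1, 6) * sum(sp.exp(j * t) for j in range(1, 7))      # MGF of the fair die

for k in range(1, 5):
    mgf_moment = sp.diff(M, t, k).subs(t, 0)                          # M_x^(k)(0)
    raw_moment = sp.Rational(1, 6) * sum(j**k for j in range(1, 7))   # mu_k = E[x^k]
    assert sp.simplify(mgf_moment - raw_moment) == 0
    print(k, mgf_moment)          # prints 7/2, 91/6, 147/2, 2275/6 for k = 1..4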

For some probability distributions, the moment-generating function does not exist (the expectation diverges to infinity). On the other hand, its sibling, called the characteristic function, always exists:

φ_x(t) = M_{ix}(t) = M_x(it),

where i denotes the imaginary unit such that i² = −1. The characteristic function corresponds to the Fourier transform of the probability density function.

2.5 TRANSFORMATION OF RANDOM VARIABLES


If random variable x is transformed as

r = ax + b,

the expectation and variance of r are given by

E[r] = aE[x] + b and V[r] = a²V[x].

Setting a = 1/D[x] and b = −E[x]/D[x] yields

z = x/D[x] − E[x]/D[x] = (x − E[x])/D[x],

which has expectation 0 and variance 1. This transformation from x to z is called standardization.
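
A minimal numerical illustration of standardization, added here: samples are shifted and rescaled by their sample mean and standard deviation, which gives (approximately) expectation 0 and variance 1. The normal distribution and its parameters are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)

z = (x - np.mean(x)) / np.std(x)     # z = (x - E[x]) / D[x], estimated from the sample
print(np.mean(z), np.var(z))         # approximately 0.0 and 1.0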
Suppose that random variable x, which has probability density f(x) defined on X, is obtained from random variable r by transformation ξ as

x = ξ(r).

Then the probability density function of r is not simply given by f(ξ(r)), because f(ξ(r)) does not integrate to 1 in general. For example, when x is the height of a person in centimeters and r is the same height in meters, f(ξ(r)) must be scaled by the factor |dx/dr| = 100 to integrate to 1.
More generally, as explained in Fig. 2.9, if the Jacobian dx/dr is not zero, the scale should be adjusted by multiplying by the absolute Jacobian as

g(r) = f(ξ(r)) |dx/dr|.

g(r) integrates to 1 for any transform x = ξ(r) such that dx/dr ≠ 0.

Integration of function f(x) over X can be expressed by using function g(r) on R such that

x = g(r) and X = g(R)

as

∫_X f(x) dx = ∫_R f(g(r)) |dx/dr| dr.

This allows us to change variables of integration from x to r. |dx/dr| in the right-hand side corresponds to the ratio of lengths when variables of integration are changed from x to r. For example, for

f(x) = x and X = [2, 3],

integration of function f(x) over X is computed as

∫_X f(x) dx = ∫_2^3 x dx = [x²/2]_2^3 = 5/2.

On the other hand, g(r) = r² yields

R = [√2, √3], f(g(r)) = r², and dx/dr = 2r.

This results in

∫_R f(g(r)) |dx/dr| dr = ∫_{√2}^{√3} r²·2r dr = [r⁴/2]_{√2}^{√3} = 5/2.

FIGURE 2.9
One-dimensional change of variables in integration. For multidimensional cases, see Fig. 4.2.

For the linear transformation

r = ax + b with a ≠ 0,

x = (r − b)/a yields dx/dr = 1/a, and thus

g(r) = (1/|a|) f((r − b)/a)

is obtained.
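
As an added sanity check of this formula, the snippet below takes f to be the standard normal density (an arbitrary choice), forms g(r) = f((r − b)/a)/|a| for a = 2 and b = 1, and confirms numerically that g integrates to approximately 1.

import numpy as np

a, b = 2.0, 1.0

def f(x):
    # Density of x: standard normal (an arbitrary illustrative choice).
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(r):
    # Density of r = a*x + b obtained by the change-of-variables formula.
    return f((r - b) / a) / abs(a)

r = np.linspace(-20.0, 20.0, 200_001)
print(np.sum(g(r)) * (r[1] - r[0]))     # approximately 1.0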
