Statistical Communication
Germain Drolet
Department of Electrical & Computer Engineering,
Royal Military College of Canada,
P.O. Box 17000, Station Forces,
Kingston, Ontario, CANADA
K7K 7B4
Copyright © 2006 by G. Drolet. All rights reserved.
Permission is granted to make and distribute verbatim copies of these notes
provided the copyright notice and this permission notice are preserved on all
copies.
Preface
The main purpose of these notes is to complement and guide a student through
the study of Chapters 1, 2, 3, and 4 of the textbook by Wozencraft & Jacobs (W&J).
These notes may not be considered a replacement for the textbook since many
proofs and discussions are incomplete. Instead we present sketches of the proofs
and illustrate them with some diagrams in order to help understand the for-
mal proofs presented in the textbook. The more concise format of the notes
may additionally be helpful in quickly locating a concept in the textbook. The
textbook is required for a complete understanding of Chapters 1, 2, 3, 4. Some
examples and results borrowed from lecture notes by Dr. G.E. Séguin are in-
cluded in the notes. Specifically, the general approach of Dr. Séguin’s notes is
used in the presentation of random processes in Chapter 3.
The first objective of the course for which these notes are written is to
provide a foundation in the Theory of Probability, Random Variables, Random
Processes for the Electrical and Computer Engineer. The second objective of
the course is to apply the concepts of probability to the Theory of Detection in
communication systems and to present the foundation for Coding Theory.
The notes are organized as follows:
Chapter 3: follows the approach of Dr. Séguin’s notes and covers the remain-
der of W&J Chapter 3.
August 2006
Germain Drolet
Department of Electrical & Computer Engineering,
Royal Military College of Canada,
P.O. Box 17000, Station Forces,
Kingston, Ontario, CANADA
K7K 7B4
Contents

Preface   iii

2 Probability Theory   1
  2.1 Fundamental Definitions   2
  2.2 Communication problem   11
  2.3 Random variables and distribution functions   14
  2.4 Transformation of random variables   28
    2.4.1 Calculate F_y(α) first, then p_y(α) = dF_y(α)/dα   28
    2.4.2 y = f(x) where f( ) is differentiable and non-constant on any interval   29
    2.4.3 Reversible transformation of random variables   30
    2.4.4 Transformation of a uniformly distributed random variable   34
    2.4.5 Impulses in p_x(α) (Wozencraft & Jacobs, pages 64, 65)   39
  2.5 Conditional probability density   41
  2.6 Statistical Independence   43
  2.7 Mixed probability expressions   47
  2.8 Statistical independence   49
  2.9 Communication example   52
    2.9.1 Details of calculation of P(C)   53
    2.9.2 Design of the optimal receiver for a channel with additive Gaussian noise   56
    2.9.3 Performance of the optimal receiver for a channel with additive Gaussian noise   57
  2.10 Expected Value   59
    2.10.1 Moments of a Gaussian random variable   63
    2.10.2 Moments of some specific random variables   65
  2.11 Limit Theorems   69
    2.11.1 Weak law of large numbers   70
    2.11.2 Central Limit Theorem   71
    2.11.3 Chernoff Bound   73
  2.12 Moments (mixed) of Random Vectors   83
  2.13 Gaussian Random Vectors   89
    2.13.1 Definition   89
    2.13.2 Properties of jointly Gaussian Random Variables   90
Chapter 2

Probability Theory

2.1 Fundamental Definitions
6. f_N(seq, A) = N(seq, A)/N is the relative frequency of occurrence of result A in sequence seq of length N.
Remarks.
2. Notice that Dom(f_∞) = {all results} ≠ {all outcomes} (W&J, page 14).
Example 2.1.1.
f_∞({1}) = lim_{N→∞} N(seq, {1})/N = 1/6

f_∞({1, 3}) = lim_{N→∞} N(seq, {1, 3})/N
            = lim_{N→∞} [N(seq, {1}) + N(seq, {3})]/N
            = lim_{N→∞} N(seq, {1})/N + lim_{N→∞} N(seq, {3})/N
            = f_∞({1}) + f_∞({3}) = 1/3
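A small simulation sketch (not part of the notes) illustrates the relative-frequency interpretation: for a fair die, f_N(seq, {1}) settles near 1/6 as N grows, and counts of disjoint results add exactly.

```python
# Sketch: relative frequencies f_N(seq, A) = N(seq, A)/N for a simulated fair die,
# illustrating f_inf({1,3}) = f_inf({1}) + f_inf({3}).
import random

random.seed(1)
N = 100_000
seq = [random.randint(1, 6) for _ in range(N)]

f_1  = sum(1 for s in seq if s == 1) / N            # f_N(seq, {1})
f_3  = sum(1 for s in seq if s == 3) / N            # f_N(seq, {3})
f_13 = sum(1 for s in seq if s in (1, 3)) / N       # f_N(seq, {1, 3})

print(f_1, f_3, f_13)            # f_1 and f_3 are each close to 1/6
print(f_13 - (f_1 + f_3))        # exactly 0: counts of disjoint results are additive
```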
The above definitions are adequate to describe the physical concept but too
loose and imprecise to describe a mathematical concept. We next give the formal
axiomatic definition of probability system. Under certain physical conditions a
random experiment can be modeled as a probability system and obeys the same
laws. After giving the definitions and its immediate consequences, we illustrate
the relationship between “probability system” and “random experiment”. This
abstract definition should be well understood; this will make it easier to grasp
the concepts of random variables and random processes later.
1. Ω ∈ F
2. A ∈ F ⇒ Ā ∈ F
4. P (Ω) = 1.
5. P (A) ≥ 0, ∀A ∈ F .
Ω is called sample space and its elements are called sample points. The elements
of F are called events (F is called class of events). A probability space will be
denoted by (Ω, F , P : F → [0, 1] ⊂ R) or simply (Ω, F , P ).
Remarks.
1. The definitions given by Wozencraft & Jacobs are limited to the case
|F| ≠ ∞ (W&J, page 20).
8. A₁ ⊃ A₂ ⊃ A₃ ⊃ · · · ∈ F ⇒ P(∩_{i=1}^{∞} A_i) = lim_{n→∞} P(A_n)
Sketch of proof: (2) to (6) only. (1) is hard to prove. (7) and (8) are called
continuity properties and are also difficult to prove; they will be used to prove
the continuity of cumulative distribution functions (defined later).
(3) follows from A ∪ Ā = Ω and A ∩ Ā = ∅.
(2) follows from axiom (3).
(4) follows from axioms (3) and (4).
(5) follows from B = A ∪ (B − A) and property (1).
(6) follows from A ∪ B = A ∪ (Ā ∩ B) and B = (A ∩ B) ∪ (Ā ∩ B).
Remarks.
We will often refer to simple real world random experiments for example/il-
lustration purposes. Conversely, for every random experiment we can construct
a corresponding probability system to be used as idealized mathematical model.
This is done as follows:
3. P(A) ≜ f_∞(A), ∀A ∈ F.
Relation of the Model to the Real World: Wozencraft & Jacobs pages
24 - 29.
This is a difficult paragraph and may be viewed as comprising two interconnected parts.
seq = 2, 8, 9, 5, 10, 4, 7, 8, . . . , 6
seq_shortened = 2, 8, 10, 4, 8, . . . , 6

relative frequency of occurrence of "sum is larger than 3" given "sum is even"
    = N(seq_shortened, A) / (length of seq_shortened)
    = N(seq, A ∩ B) / N(seq, B)
    = [N(seq, A ∩ B)/N] / [N(seq, B)/N]
    = f_N(seq, A ∩ B) / f_N(seq, B)
The latter expression is of the same form as the expression used to define conditional probability. This intuitively explains why P(A|B), as defined above, is a mathematical model for the relative frequency of occurrence of a result given that another result has occurred.
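A short sketch (assumption: the underlying experiment is the throw of two fair dice, with A = "sum is larger than 3" and B = "sum is even") compares the simulated conditional relative frequency with P(A ∩ B)/P(B).

```python
# Sketch: conditional relative frequency f_N(seq, A∩B)/f_N(seq, B) for two fair dice.
import random

random.seed(2)
N = 200_000
sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N)]

N_B  = sum(1 for s in sums if s % 2 == 0)              # N(seq, B)
N_AB = sum(1 for s in sums if s % 2 == 0 and s > 3)    # N(seq, A ∩ B)
print(N_AB / N_B)                                      # simulated P(A|B)

# exact P(A∩B)/P(B) by enumerating the 36 equiprobable outcomes
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P_B  = sum(1 for i, j in pairs if (i + j) % 2 == 0) / 36
P_AB = sum(1 for i, j in pairs if (i + j) % 2 == 0 and i + j > 3) / 36
print(P_AB / P_B)                                      # = 17/18 ≈ 0.944
```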
If we define:
FB = {A ∩ B|A ∈ F } ,
one can also verify that (B, FB , P ( |B)) is another probability system and
corresponds to the hatched portion of figure 2.8 (Wozencraft & Jacobs, page
30).
Theorem 2.
(b) P(A_i | B) = P(A_i) P(B | A_i) / [Σ_{j=1}^{n} P(A_j) P(B | A_j)], for any i = 1, 2, . . . , n; this is known as Bayes' theorem.⁵
Proof. We prove the second part of the theorem only; the proof of the first part
is left as an exercise.
1. B ⊂ ∪_{j=1}^{n} A_j ⇒ B = B ∩ (∪_{j=1}^{n} A_j) = ∪_{j=1}^{n} (B ∩ A_j) and all the B ∩ A_j are pairwise disjoint. It follows that P(B) = Σ_{j=1}^{n} P(B ∩ A_j), from which the result follows.
Bayes theorem is useful in situations where P (B|Aj ) is given for every j but
P (Aj |B) is not directly known. The following example illustrates this.
The (random) experiment consists in first choosing a box at random among the
three boxes (equiprobably, i.e. each has a probability 1/3 of being chosen) and
then draw a ball at random from the box chosen. Let B denote the event “a red
ball has been drawn”. Calculate the probability that the ball was drawn from
box number 2 if it is red, i.e. P (A2 |B).
4 cf: Wozencraft & Jacobs page 31
5 cf: Wozencraft & Jacobs problem 2.6
Solution: From the data given in the problem we have that P (Ai ) = 1/3, i =
1, 2, 3, and clearly B ⊂ ∪3j=1 Aj . It follows from the theorem that:
Figure 2.1:
P (RX = 0 | T X = m0 ) = 0.99
P (RX = 1 | T X = m0 ) = 0.01
P (RX = 0 | T X = m1 ) = 0.01
P (RX = 1 | T X = m1 ) = 0.99
Calculate
1. P (error | RX = 0) = P (T X = m1 | RX = 0),
2. P (error | RX = 1) = P (T X = m0 | RX = 1).
Solution. This problem is equivalent to the following. We are given two boxes
labeled as m0 , m1 which contain balls labeled 0 or 1 as follows:
• box m0 contains 99 balls labeled 0 and 1 ball labeled 1,
• box m1 contains 1 ball labeled 0 and 99 balls labeled 1,
A random experiment consists in first taking one of the two boxes at random
with probabilities P (m0 ) = 1/3, P (m1 ) = 2/3 followed by drawing one ball from
the box chosen. Calculate:
1. the probability that the ball was drawn from box m1 if it is labeled 0,
2. the probability that the ball was drawn from box m0 if it is labeled 1.
We see that the problem is solved similarly to example 2.1.3. We easily find
(verify this):
1. P (error | RX = 0) = 2/101 ≈ 19.8 × 10−3 ,
2. P (error | RX = 1) = 1/199 ≈ 5.03 × 10−3 .
Notice that an error is roughly four times more likely on reception of a 0 than on reception of a 1, even though the transmission of a 1 is only twice as likely as that of a 0.
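A minimal sketch verifying the two conditional error probabilities with Bayes' theorem, using the priors and transition probabilities given in the example:

```python
# Sketch: P(TX = tx | RX = rx) via Bayes' theorem for the binary channel example.
P_m = {"m0": 1/3, "m1": 2/3}                       # a priori probabilities
P_rx_given_tx = {("m0", 0): 0.99, ("m0", 1): 0.01,
                 ("m1", 0): 0.01, ("m1", 1): 0.99}

def posterior(tx, rx):
    num = P_m[tx] * P_rx_given_tx[(tx, rx)]
    den = sum(P_m[m] * P_rx_given_tx[(m, rx)] for m in P_m)
    return num / den

print(posterior("m1", 0), 2 / 101)    # P(error | RX = 0) ≈ 19.8e-3
print(posterior("m0", 1), 1 / 199)    # P(error | RX = 1) ≈ 5.03e-3
```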
P (A ∩ B̄) = P (A) − P (A ∩ B)
= P (A) − P (A)P (B)
= P (A)(1 − P (B))
= P (A)P (B̄)
PS (∅) = 0 ,
PS ({m0 }) = Pm0 ,
PS ({m1 }) = Pm1 ,
1 = PS (Ωsource ) = PS ({m0 , m1 }) = Pm0 + Pm1 = 1 .
for all i. Combining the source and the discrete communication channel
we define:
Sample space:
Probability function:
for every i and every j. One easily verifies that (you should verify
this): ∑∑
PDCC (ΩDCC ) = PDCC ({(mi , rj )}) = 1 .
all i all j
(this corresponds to the above example of mapping m̂( )).
The probability system (ΩD , 2ΩD , PD ) describes the overall operation of the
discrete communication channel. Its performance is measured by its probability
of correct decision P(C) (page 35, W&J). We call C = {(m₀, m₀), (m₁, m₁)} ⊂ Ω_D the correct decision event. For the above example of mapping m̂( ), the correct decision event C ⊂ Ω_D corresponds to C̃ ⊂ Ω_DCC given by:

P(C) = P(C̃) = P_DCC({(m₀, r₀)}) + P_DCC({(m₁, r₁)}) + P_DCC({(m₁, r₂)})
     = Σ_{j=0}^{2} P_DCC({(m̂(r_j), r_j)})
     = Σ_{j=0}^{2} P_S({m̂(r_j)}) P[r_j | m̂(r_j)]    (2.1)
From equation (2.1) it is clear that the probability of correct decision depends
on the assignment made by the decision element m̂( ). The remainder of this
paragraph (page 35 in W&J) shows how the mapping (decision element) can
be chosen in order to maximize P(C) given a priori message probabilities
P ({mi }) = Pmi and channel transition probabilities P [rj |mi ]. It is helpful
to first consider a numerical example.
C = {(m0 , m0 ), (m1 , m1 )} ∈ FD
We then obtain
P (C ) = P (C˜) = PS (m0 ) P [r0 |m0 ] + PS (m1 ) P [r1 |m1 ] + PS (m1 ) P [r2 |m1 ]
= 0.6 × 0.9 + 0.4 × 0.15 + 0.4 × 0.8
= 0.54 + 0.06 + 0.32
= 0.92
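A sketch of equation (2.1), evaluated for every possible decision mapping. Only P[r₀|m₀] = 0.9, P[r₁|m₁] = 0.15 and P[r₂|m₁] = 0.8 appear in the example above; the remaining transition probabilities below are assumed purely for illustration.

```python
# Sketch: P(C) = sum_j P_S({m_hat(r_j)}) P[r_j | m_hat(r_j)] for every decision mapping.
from itertools import product

P_S = {"m0": 0.6, "m1": 0.4}
P_r_given_m = {"m0": {"r0": 0.9,  "r1": 0.05, "r2": 0.05},   # r1, r2 entries assumed
               "m1": {"r0": 0.05, "r1": 0.15, "r2": 0.8}}    # r0 entry assumed

def P_correct(mapping):                 # mapping: dict r_j -> decided message
    return sum(P_S[mapping[r]] * P_r_given_m[mapping[r]][r] for r in mapping)

example = {"r0": "m0", "r1": "m1", "r2": "m1"}
print(P_correct(example))               # 0.92, as computed above

best = max((dict(zip(("r0", "r1", "r2"), choice))
            for choice in product(("m0", "m1"), repeat=3)), key=P_correct)
print(best, P_correct(best))            # picking, for each r_j, the m maximizing P_S({m}) P[r_j|m]
```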
Definition 2.2.1.
Figure 2.2:
x : Ω → R ∪ {±∞}
x : ω ↦ ω² − 1
Figure 2.3:
We now turn our attention to a more general situation involving more than
one random variable. For example, consider the random experiment of throwing
two fair dice labelled as #1 and #2. The first random number is the outcome
of die #1 and the second random number is the sum of both outcomes of dice
#1, #2. We will detail the tools required to calculate the probability of events
defined by multiple random variables.
Given two random variables x1 , x2 we can construct the two probability
distribution functions Fx1 (α1 ), Fx2 (α2 ). As we have seen the probability of any
event involving x1 only and the probability of any event involving x2 only can
respectively be calculated from the functions Fx1 (α1 ) and Fx2 (α2 ). But the two
distributions functions are not sufficient to calculate the probability of an event
involving both variables x1 , x2 , such as P ({2 < x1 ≤ 4} ∩ {−1 < x2 ≤ 1}). In
order to calculate the probability of such an event a new distribution function
is defined:
Definition 2.3.3. Let x1 : Ω → R ∪ {±∞}, x2 : Ω → R ∪ {±∞} be random
variables for the probability system (Ω, F , P : F → [0, 1]). The function:
Fx1 ,x2 : (R ∪ {±∞})2 → [0, 1]
Fx1 ,x2 : (α1 , α2 ) 7→ P ({ω : x1 (ω) ≤ α1 } ∩ {ω : x2 (ω) ≤ α2 })
is called joint probability distribution function of the random variables x1 and
x2 .
Remarks.
1. For any region I ⊂ R², the set {ω : (x₁(ω), x₂(ω)) ∈ I} = {(x₁, x₂) ∈ I} is an event. If I₁, I₂ are two disjoint regions in R² then the events {(x₁, x₂) ∈ I₁} and {(x₁, x₂) ∈ I₂} are disjoint.
2. The properties of joint probability distribution functions are immediate
generalizations of the properties of (1-dimensional) probability distribu-
tion functions (refer to page 40, W&J).
3. The proof of Property VI on page 40, W&J, is similar to the proof of property 3, page 17: first show that
Fx1 ,x2 (a1 , a2 ) = Fx1 ,x2 (a1 , b2 ) +
P ({ω : x1 (ω) ≤ a1 } ∩ {ω : b2 < x2 (ω) ≤ a2 })
Fx1 ,x2 (a1 , a2 ) = Fx1 ,x2 (b1 , a2 ) +
P ({ω : b1 < x1 (ω) ≤ a1 } ∩ {ω : x2 (ω) ≤ a2 })
Ω = [0, 1] ,
F ≡ set of intervals of Ω and “countable” unions and
intersections of such intervals,
P ≡ total length of the union of the intervals making
the event.
The remainder of the example is easy to follow.
Remark. “The use of probability distribution functions is inconvenient in com-
putations”.7 For this reason we introduce the probability density function de-
fined below.
Definition 2.3.5 (page 45, W&J). The function p_x(α) = dF_x(α)/dα is called the probability density function of the random variable x.
Proposition 7. Properties of the probability density function [pages 44 - 47,
W&J]
1. When I ⊂ R ∪ {±∞} is such that {ω : x(ω) ∈ I} ∈ F we can show that P({ω : x(ω) ∈ I}) = ∫_I p_x(α) dα.
2. If lim_{α→a⁺} F_x(α) = lim_{α→a⁻} F_x(α) and lim_{α→a⁺} F_x′(α) ≠ lim_{α→a⁻} F_x′(α), then p_x(a) = lim_{α→a⁺} F_x′(α), i.e. the probability density function p_x( ) is continuous on the right.
3. If lim_{α→a⁺} F_x(α) − lim_{α→a⁻} F_x(α) = P_a ≠ 0 then p_x(α) contains the impulse P_a δ(α − a).⁸
7 bottom of page 43, W&J
8 δ( ) is the Dirac impulse function; it is defined on page 46, W&J. For all t₀ ∈ R and ϵ > 0:

∫_{t₀−ϵ}^{t₀+ϵ} f(t) δ(t − t₀) dt = f(t₀)

if f(t) is continuous at t = t₀. We also have:

f(t) δ(t − t₀) = f(t₀) δ(t − t₀)
f(t) ∗ δ(t − t₀) = f(t − t₀)

if f(t) is continuous at t = t₀ (∗ denotes the convolution).
4. F_x(α) = ∫_{−∞}^{α} p_x(β) dβ

5. p_x(α) ≥ 0, ∀α ∈ R

6. ∫_{−∞}^{∞} p_x(α) dα = 1
Theorem 8. For every function f(α) : R ∪ {±∞} → R ∪ {±∞} satisfying:

f(α) ≥ 0, ∀α,
∫_{−∞}^{∞} f(α) dα = 1,

there exists a probability system (Ω, F, P : F → [0, 1]) and a random variable x : Ω → R ∪ {±∞} such that p_x(α) = f(α).
Sketch of proof. The probability system is constructed as follows:
• Ω = R,
• F is the set of intervals of R = Ω, plus (enumerable) unions, intersections
and complements of such intervals,
• P(I) ≜ ∫_I f(α) dα, ∀I ∈ F.

Next we define the random variable

x : Ω = R → R
x : ω ↦ ω
One can then easily show that px (α) = f (α) (this is somewhat similar to the
construction of page 23 in W&J).
Example 2.3.4. The probability density function of the random variable de-
scribed in example 2.3.2 and distribution function as in figure 2.3 is shown in
figure 2.4.
Figure 2.4:
1. uniform: b > a

p_x(α) = { 1/(b − a) ; a ≤ α < b
           0 ; elsewhere

F_x(α) = { 0 ; α < a
           (α − a)/(b − a) ; a ≤ α < b
           1 ; b ≤ α
3. Poisson: ρ > 0

p_x(α) = Σ_{k=0}^{∞} e^{−ρ} (ρᵏ/k!) δ(α − k)
A typical Poisson probability density function is sketched in figure 2.6.
Remarks.
1. All of the above functions satisfy ∫_{−∞}^{∞} p_x(α) dα = 1 (verify).

2. It can be shown that

∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(α−µ)²/(2σ²)} dα = 1.

This follows from ∫_{−∞}^{∞} e^{−x²} dx = √π.
3. There is no closed-form expression for the Gaussian probability distribution function. It is tabulated (refer to Appendix A) as:

Q(α) = (1/√(2π)) ∫_α^∞ e^{−β²/2} dβ = (1/2)[1 − erf(α/√2)]

erf(α) = (2/√π) ∫_0^α e^{−β²} dβ = 1 − 2Q(√2 α)
represented graphically by the hatched areas in figures 2.7 and 2.8.9 The
function erf( ) is defined in the C math library math.h and most com-
puter packages (MAPLE, MATHEMATICA, MATLAB ) also include an
implementation of the function. The erf( ) and Q( ) functions are plotted
in figure 2.9.
Figure 2.7:

Figure 2.8:
If x is a Gaussian random variable with probability density function

p_x(α) = (1/√(2πσ²)) e^{−(α−µ)²/(2σ²)}

then:
P(x ≥ a) = Q((a − µ)/σ)

P(a ≤ x < b) = P(x ≥ a) − P(x ≥ b) = Q((a − µ)/σ) − Q((b − µ)/σ)
9 Tables generally only list the values of Q(α) for α ≥ 0. If α < 0, we use Q(α) = 1−Q(−α).
10 The parameters µ and σ 2 will respectively be called mean and variance of the random
variable x; refer to section 2.10.
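A small sketch of these formulas using Python's math module (which, like math.h, provides erf/erfc): Q(α) = (1/2) erfc(α/√2). The values of µ, σ, a, b below are assumed for illustration only.

```python
# Sketch: Gaussian tail and interval probabilities via the Q( ) function.
from math import erfc, sqrt

def Q(alpha):
    return 0.5 * erfc(alpha / sqrt(2.0))       # = 0.5*(1 - erf(alpha/sqrt(2)))

mu, sigma = 1.0, 2.0                           # assumed parameters
a, b = 2.0, 5.0
print(Q((a - mu) / sigma))                     # P(x >= a)
print(Q((a - mu) / sigma) - Q((b - mu) / sigma))   # P(a <= x < b)
print(Q(-1.0), 1 - Q(1.0))                     # Q(alpha) = 1 - Q(-alpha) for negative arguments
```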
Figure 2.9: Graphs of the erf(α) function (red) and Q(α) function (blue)
Q(α) = 1/2 − α/√(2π) + Σ_{n=3, n odd}^{∞} [(−1)^{(n−3)/2} ∏_{i=0}^{(n−3)/2} (2i + 1) / (n! √(2π))] αⁿ
     ≈ 1/2 − α/√(2π) + α³/(6√(2π)) − α⁵/(40√(2π)) + α⁷/(336√(2π)) − α⁹/(3456√(2π)) + O(α¹¹)
Definition 2.3.6 (Wozencraft & Jacobs, page 49). Let x1 , x2 be two random
variables with joint probability distribution function Fx1 , x2 (α1 , α2 ). The func-
tion
p_{x₁,x₂}(α₁, α₂) = ∂² F_{x₁,x₂}(α₁, α₂) / (∂α₁ ∂α₂)
is called joint probability density function of x1 , x2 .
Proposition 9. Properties of the joint probability density function [Wozencraft
& Jacobs, pp 50 - 55]
1. When I ⊂ R² is such that {ω : (x₁(ω), x₂(ω)) ∈ I} ∈ F we can show that:

P({ω : (x₁(ω), x₂(ω)) ∈ I}) = ∫∫_I p_{x₁,x₂}(α₁, α₂) dα₁ dα₂
11 this is required to solve problem 2.39 - more details to follow
2.

F_{x₁,x₂}(α₁, α₂) = ∫_{−∞}^{α₁} ∫_{−∞}^{α₂} p_{x₁,x₂}(β₁, β₂) dβ₂ dβ₁

F_{x₁}(α₁) = ∫_{−∞}^{α₁} ∫_{−∞}^{∞} p_{x₁,x₂}(β₁, β₂) dβ₂ dβ₁

F_{x₂}(α₂) = ∫_{−∞}^{∞} ∫_{−∞}^{α₂} p_{x₁,x₂}(β₁, β₂) dβ₂ dβ₁
5. Discontinuities are treated in the same way as in the one dimensional case.
p_{x₁,x₂}(α₁, α₂) = Σ_{i=1}^{2} (1/2) δ(α₁ − i) (1/√(2π)) exp[−(α₂ − i)²/2]
where u( ) is the unit step function, i.e. π radians (180◦ ) must be added
(or subtracted) whenever α1 is negative since the tan−1 ( ) function only
covers angles from −π/2 to +π/2.
Figure 2.10: Plots of the joint Gaussian probability density function of 2 random
variables
Theorem 10. p_{x₁}(α₁) = ∫_{−∞}^{∞} p_{x₁,x₂}(α₁, α₂) dα₂.
Proof. We have:

F_{x₁}(α₁) = F_{x₁,x₂}(α₁, ∞) = ∫_{−∞}^{α₁} [ ∫_{−∞}^{∞} p_{x₁,x₂}(β₁, β₂) dβ₂ ] dβ₁, where the inner integral is denoted h(β₁),
           = ∫_{−∞}^{α₁} h(β₁) dβ₁    (2.2)
which follow from property V page 40 (W&J) and equation W&J (2.56). We
also have:

F_{x₁}(α₁) = ∫_{−∞}^{α₁} p_{x₁}(β₁) dβ₁    (2.3)
from equation W&J (2.40 b). Both RHS of 2.2 and 2.3 give Fx1 (α1 ) for all
values of α1 . The integrands must be the same, i.e.
p_{x₁}(α₁) = h(α₁) = ∫_{−∞}^{∞} p_{x₁,x₂}(α₁, β₂) dβ₂.
Remark. This proof is not 100% rigorous; neither is Wozencraft & Jacobs'. They show that (d/dα) ∫_{−∞}^{α} h(β) dβ = h(α) when h(β) is continuous at β = α. The result remains valid even if this condition is not satisfied, and this follows from the way in which the discontinuities of F_x(α) were handled in the definition of p_x(α) = dF_x(α)/dα.
Example 2.3.7 (Wozencraft & Jacobs, pp 56 - 57). If x1 , x2 are two jointly
Gaussian random variables then x1 is Gaussian and x2 is Gaussian. Therefore
the volume under the surface of figure 2.27 in W&J, inside the strip bounded
by a, b, is obtained with the Q( ) function; there is no need of a 2-dimensional
version of the Q( ) function in this case.
Definition 2.3.7. Let x = (x₁, x₂, . . . , x_k) be a random vector comprising k random variables with joint probability distribution function F_x(α). The function p_x(α) = ∂^k F_x(α)/(∂α₁ ∂α₂ · · · ∂α_k) is called the joint probability density function of the random vector x.
Proposition 11. Properties of the joint probability density function [Wozen-
craft & Jacobs pp 57 - 58]
1. When I ⊂ R^k is such that {ω : x(ω) ∈ I} ∈ F we can show that:

P({ω : x(ω) ∈ I}) = ∫_I p_x(α) dα
2.4 Transformation of random variables

2.4.1 Calculate F_y(α) first, then p_y(α) = dF_y(α)/dα
The following are obtained (when px (α) is non-impulsive) [refer to W&J]:
1. y = bx + a where a, b ∈ R, b ≠ 0:

   ⇒ p_y(α) = (1/|b|) p_x((α − a)/b)

2. Half-wave (linear) rectifier y = { x ; x ≥ 0
                                      0 ; x < 0

3. y = x²:

   ⇒ p_y(α) = { (1/(2√α)) [p_x(√α) + p_x(−√α)] ; α > 0
                0 ; α < 0
S(α) = {β ∈ R : f(β) = α}

Then

p_y(α) = { Σ_{β∈S(α)} p_x(β)/|f′(β)| ; if S(α) ≠ ∅ and f′(β) ≠ 0, ∀β ∈ S(α)
           0 ; if S(α) = ∅
S(α) = {β ∈ R : β² = α} = { {±√α} ; α > 0
                            ∅ ; α < 0
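A minimal numerical sketch of the transformation formula, applied to y = x² with x standard Gaussian (an assumed choice of p_x): S(α) = {±√α} and |f′(β)| = 2|β|, and the resulting p_y is checked against a Monte Carlo histogram.

```python
# Sketch: p_y(alpha) = sum_{beta in S(alpha)} p_x(beta)/|f'(beta)| for y = x^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
y = x**2

def p_x(beta):                          # standard Gaussian density (assumed p_x)
    return np.exp(-beta**2 / 2) / np.sqrt(2 * np.pi)

def p_y(alpha):                         # transformation formula, alpha > 0
    root = np.sqrt(alpha)
    return (p_x(root) + p_x(-root)) / (2 * root)

edges = np.linspace(0.05, 4.0, 40)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_y(centers))))   # small: histogram matches the formula
```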
f ∘ g : R^k → R^k
f ∘ g : β ↦ f(g(β)) = β

and

(b) ∂f_i/∂x_j and ∂g_i/∂x_j exist for all 1 ≤ i, j ≤ k.
2. Wozencraft & Jacobs gives a very good interpretation of this result fol-
lowing equation (2A.7) on page W&J 113.
I = {α | f(α) ≤ β} = {g(γ) | γ ≤ β}

p_x(α) → p_x(g(γ))
dα → |J_g(γ)| dγ
α ∈ I → γ ≤ β    (α = g(γ) ∈ I ⇒ γ ≤ β)

(Wozencraft & Jacobs (2A.6)). Identification of (2.7) with F_y(β) = ∫_{γ≤β} p_y(γ) dγ yields

p_y(γ) = p_x(g(γ)) |J_g(γ)|

(Wozencraft & Jacobs (2A.7)) since equality is verified for every β ∈ R^k.
and the inverse transformation g = (g₁( , ), g₂( , )) is given by:

x = z − w = g₁(z, w)
y = w = g₂(z, w)

z = x y = f₁(x, y)
w = y = f₂(x, y)

and the inverse transformation g = (g₁( , ), g₂( , )) is given by:

x = z/w = g₁(z, w)
y = w = g₂(z, w)
Example 2.4.4. The example of pages 113 - 114 (W&J) is a classic! Equation
Wozencraft & Jacobs(2A.9a) should read
y₂ = f₂(x) = tan⁻¹(x₂/x₁) ± π u(−x₁)
x1
x = f (y)
Example 2.4.5. Consider the random variable y ≥ 0 defined by the transformation y = √(−2 ln(1 − x)), in which x is a random variable uniformly distributed between 0 and 1. Then

x = (1 − e^{−y²/2}) u(y) = f(y)

and from the sketch in figure 2.12, f( ) is clearly seen to be a valid cumulative distribution function. Therefore:

p_y(β) = f′(β), ∀β
       = β e^{−β²/2} u(β)
       ≡ Rayleigh probability density function.
Figure 2.12:
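A quick numerical sketch of example 2.4.5: samples generated by the stated transformation of a uniform random variable are compared against the Rayleigh density derived above.

```python
# Sketch: y = sqrt(-2 ln(1 - x)), x uniform on (0,1), versus beta*exp(-beta^2/2).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200_000)
y = np.sqrt(-2.0 * np.log(1.0 - x))           # the transformation of the example

edges = np.linspace(0.0, 4.0, 41)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
rayleigh = centers * np.exp(-centers**2 / 2)  # p_y(beta) derived above
print(np.max(np.abs(hist - rayleigh)))        # close to 0
```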
p_y(β) = f′(β) = 1/(π√(100 − β²)), −10 ≤ β ≤ 10
Figure 2.13:
f −1 : [0, 1] → R
as follows. Over the range of input values where the function Fy ( ) is invertible,
we let f −1 ( ) = Fy−1 ( ). If Fy ( ) has a step at β0 then define f −1 (α) = β0 for
every α ∈ [Fy (β0− ), Fy (β0 )]. If Fy ( ) is constant over the semi-opened interval
[β0 , β1 [ then f −1 (Fy (β0 )) = β0 . Finally f −1 (0) ≡ largest β such that Fy (β) =
0. The function f −1 ( ) is well-defined over the entire range [0, 1]. The prob-
ability density function of the random variable defined by the transformation
y = f −1 (x) where x is a random variable uniformly distributed in the interval
from 0 to 1 is the desired probability density function py (β). This is illustrated
in the example that follows.
Figure 2.14:
F_y(β) = β²/2 ⇒ F_y(1) = 1/2 = F_y(2).
F_y(β) = β/2 − 1/2 ⇒ F_y(3) = 1.
Fy (β) remains constant at 1 for any β ≥ 3.
In summary we have:
F_y(β) = { 0 ; β < 0
           β²/2 ; 0 < β < 1
           1/2 ; 1 < β < 2
           (β − 1)/2 ; 2 < β < 3
           1 ; β > 3
Figure 2.15:
Figure 2.16:
p̃y (β) = 0
Figure 2.17:
We then obtain

p̃_y(β) = { 0 ; β < 1
           (1/6)/(2√β) ; 1 < β < 4
           (1/6)/√β ; 4 < β < 9
           0 ; β > 9

Finally we add the contribution of the impulses and obtain

p_y(β) = p̃_y(β) + (1/3) δ(β − 1/4) + (1/6) δ(β − 4)
Figure 2.18:
Definition 2.5.1.
Figure 2.19:
2. The function

F_{x₁}(α | x₂ = v) = ∫_{−∞}^{α} p_{x₁}(β | x₂ = v) dβ

F_x(α|A) ≜ P({x ≤ α} | A) = P({x ≤ α}, A) / P(A)

p_x(α|A) ≜ dF_x(α|A)/dα

P(x ∈ B | A) = ∫_B p_x(α|A) dα
We then obtain:

p_{x₁}(α | x₂ = v) = ∂F_{x₁}(α | x₂ = v)/∂α
                   = (1/p_{x₂}(v)) ∂²F_{x₁,x₂}(α, v)/(∂α ∂v)
                   = p_{x₁,x₂}(α, v) / p_{x₂}(v)
Refer to W&J, page 67, last paragraph for a very good visual interpretation.
Example 2.5.1 (Wozencraft & Jacobs, page 68). The resulting graph is ex-
plained as follows: when |ρ| → 1, the random variables x1 , x2 become more
“correlated” (formal definition to be given later ). So if |ρ| → 1 and x2 = v then
x1 is very likely close to v. But it is also more likely to be between 0 and v since
its mean (formal definition to be given later) is 0; that’s why the conditional
probability density function looks more and more like an impulse located just
on the left of v.
Ai = {ω : xi (ω) ∈ Ii }, ∀i = 1, 2, . . . , k,
The following may (or may not) be useful to answer the four questions:
∫ t e^{Kt} dt = e^{Kt} (t/K − 1/K²), ∀K ≠ 0.
1. Show that:

p_x(α) = { 2A e^{−α} ; 0 ≤ α
           0 ; elsewhere

p_y(β) = { A(β + 1) ; −1 ≤ β < 1
           0 ; elsewhere
Solution:

1.

p_x(α) = ∫_{−∞}^{∞} p_{x,y}(α, β) dβ
       = ∫_{−1}^{1} A(β + 1) e^{−α} dβ ; α ≥ 0
       = A e^{−α} [β²/2 + β]_{−1}^{1} ; α ≥ 0
       = 2A e^{−α} ; α ≥ 0

p_y(β) = ∫_{−∞}^{∞} p_{x,y}(α, β) dα
       = ∫_{0}^{∞} A(β + 1) e^{−α} dα ; −1 ≤ β < 1
       = A(β + 1) [−e^{−α}]_{0}^{∞} ; −1 ≤ β < 1
       = A(β + 1) ; −1 ≤ β < 1
3. We have to see if the product p_x(α)p_y(β) equals p_{x,y}(α, β). We obtain:

p_x(α)p_y(β) = { 2A²(β + 1) e^{−α} ; 0 ≤ α, −1 ≤ β < 1
                 0 ; elsewhere

p_{x,y}(α, β) = { (1/2)(β + 1) e^{−α} ; 0 ≤ α, −1 ≤ β < 1
                  0 ; elsewhere

Using the value A = 1/2, the above two expressions are clearly equal and x, y are therefore statistically independent.
4. By definition:

F_z(γ) = P(z ≤ γ) = { 0 ; γ < 0
                      P(y ≤ γ) = F_y(γ) ; γ ≥ 0

If 0 ≤ γ < 1 we easily find:

F_z(γ) = F_y(γ) = ∫_{−1}^{γ} (β + 1)/2 dβ = (γ + 1)²/4

A sketch of F_z(γ) is presented on figure 2.20. Finally:

p_z(γ) = dF_z(γ)/dγ = { (1/4) δ(γ) ; γ = 0
                        (γ + 1)/2 ; 0 < γ < 1
                        0 ; elsewhere

or equivalently:

p_z(γ) = (1/4) δ(γ) + ((γ + 1)/2)(u(γ) − u(γ − 1))

where u( ) denotes the unit step function.
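A short numerical sketch of this example (scipy assumed available): the joint density integrates to 1 for A = 1/2, and it factors into the product of the two marginals found above.

```python
# Sketch: check A = 1/2 and the statistical independence of x and y numerically.
import numpy as np
from scipy.integrate import dblquad

A = 0.5
p_xy = lambda a, b: A * (b + 1) * np.exp(-a) if (a >= 0 and -1 <= b < 1) else 0.0

# integrate over alpha in [0, inf) (outer) and beta in [-1, 1] (inner)
total, _ = dblquad(lambda b, a: p_xy(a, b), 0, np.inf, -1, 1)
print(total)                                   # ≈ 1, confirming A = 1/2

p_x = lambda a: 2 * A * np.exp(-a)             # marginals computed above
p_y = lambda b: A * (b + 1)
for a, b in [(0.3, -0.2), (1.0, 0.5), (2.5, 0.9)]:
    print(p_xy(a, b), p_x(a) * p_y(b))         # equal: x, y statistically independent
```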
p_x(α, A) = ∂^k F_x(α, A)/∂α    (2.8)
From P(A|B) = P(AB)/P(B) we have F_x(α|A) = F_x(α, A)/P(A) and we define

p_x(α|A) = ∂^k F_x(α|A)/∂α = (1/P(A)) ∂^k F_x(α, A)/∂α = p_x(α, A)/P(A)    (2.9)

where P(A) = ∫ · · · ∫_{−∞}^{∞} p_x(α, A) dα. One easily verifies that p_x(α|A) is a valid joint probability density function, that is p_x(α|A) ≥ 0 and ∫ · · · ∫_{−∞}^{∞} p_x(α|A) dα = 1.

We define naturally

P(A | x = a) = lim_{∆→0} [∫_{a−∆}^{a+∆} p_x(α, A) dα] / [∫_{a−∆}^{a+∆} p_x(α) dα] = p_x(a, A)/p_x(a),    (2.10)

where the last equality holds when both functions are continuous at α = a and p_x(α) = p_x(α, Ω). We next show that, as required, 0 ≤ P(A|x = a) ≤ 1.
Lemma 15. A ⊂ B ⇒ P (A)px (α|A) ≤ P (B)px (α|B).
Proof. Since A ⊂ B, then ∀α, ∆ ∈ Rk we have:
{α ≤ x < α + ∆} ∩ A ⊂ {α ≤ x < α + ∆} ∩ B
⇒P ({α ≤ x < α + ∆} ∩ A) ≤ P ({α ≤ x < α + ∆} ∩ B)
∫ ∫ α+∆ ∫ ∫ α+∆
⇒ ... px (β, A)dβ ≤ . . . px (β, B)dβ (2.11)
α α
It follows that
px (α, A) ≤ px (α, B), (2.12)
since (2.11) is verified ∀α, ∆ ∈ R . From the definitions we immediately have:
k
1. We have

p_x(α) = p_x(α, Ω)
       = (1/(4√(2π))) e^{−α²/2} + (1/(4√(2π))) e^{−(α−1)²/2} + (1/(4√(2π))) e^{−α²/8}
13 Wozencraft & Jacobs, page 77, last sentence in subparagraph Statistical Independence.
This is sketched in figure 2.21. We could then use the probability density
function to calculate for example the probability of the event {x ≥ 5}:
P({x ≥ 5}) = ∫_5^∞ p_x(α, Ω) dα
           = (1/4) Q((5 − 0)/√1) + (1/4) Q((5 − 1)/√1) + (1/2) Q((5 − 0)/√4)
           = 3.1128 × 10⁻³
Figure 2.21:
2. We have:
P(A) = ∫_{−∞}^{∞} p_x(α, A) dα
     = ∫_{−∞}^{∞} [p_x(α, {ω₁}) + p_x(α, {ω₂})] dα
     = 1/4 + 1/4 = 1/2
Similarly we would find:
P ({ω1 }) = 1/4
P ({ω2 }) = 1/4
P ({ω3 }) = 1/2
P ({ω1 , ω3 }) = 3/4
P ({ω2 , ω3 }) = 3/4
4. We also have:

P({x ≥ 5} | A) = ∫_5^∞ p_x(α | A) dα
               = ∫_5^∞ p_x(α, A)/P(A) dα
               = 2 ∫_5^∞ p_x(α, A) dα     (the integral equals 7.9895 × 10⁻⁶ from above)
               = 1.600 × 10⁻⁵
Notice that this is different from P ({x ≥ 5}) found previously and con-
sequently, the random variable x and the event A are not statistically
independent. This will be discussed again below.
5. We have:

P(A | x = 5) = p_x(5, A) / p_x(5)

where:

p_x(5) = (1/(4√(2π))) e^{−5²/2} + (1/(4√(2π))) e^{−(5−1)²/2} + (1/(4√(2π))) e^{−5²/8}
       = 4.4159 × 10⁻³

p_x(5, A) = (1/(4√(2π))) e^{−5²/2} + (1/(4√(2π))) e^{−(5−1)²/2}
          = 3.3829 × 10⁻⁵

It follows that:

P(A | x = 5) = 7.66 × 10⁻³
6. We also have:

P({ω₃} | x = 5) = p_x(5, {ω₃}) / p_x(5)

It follows that:

P({ω₃} | x = 5) = 0.99234
Notice that, as expected, we have P(A | x = 5) + P({ω₃} | x = 5) = 1.
7. Although we already know from above that x and A are not statistically independent, we can also verify this by checking whether p_x(α, A) equals P(A) p_x(α).

(a) From above we have:

p_x(α, A) = (1/(4√(2π))) e^{−α²/2} + (1/(4√(2π))) e^{−(α−1)²/2}
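A short sketch re-deriving the numbers of this example (the mixed density used above is (1/4)N(0, 1) + (1/4)N(1, 1) + (1/2)N(0, 4), with A = {ω₁, ω₂}):

```python
# Sketch: numerical check of P({x>=5}), p_x(5), p_x(5, A) and P(A | x = 5).
from math import erfc, exp, pi, sqrt

Q = lambda a: 0.5 * erfc(a / sqrt(2))
g = lambda a, mu, var: exp(-(a - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

P_x_ge_5 = 0.25 * Q(5) + 0.25 * Q(4) + 0.5 * Q(5 / 2)
px5      = 0.25 * g(5, 0, 1) + 0.25 * g(5, 1, 1) + 0.5 * g(5, 0, 4)
px5_A    = 0.25 * g(5, 0, 1) + 0.25 * g(5, 1, 1)          # A = {omega1, omega2}

print(P_x_ge_5)                          # ≈ 3.1128e-3
print(px5)                               # ≈ 4.4159e-3
print(px5_A)                             # ≈ 3.3829e-5
print(px5_A / px5)                       # P(A | x = 5) ≈ 7.66e-3
print(2 * (0.25 * Q(5) + 0.25 * Q(4)))   # P({x >= 5} | A) ≈ 1.600e-5
```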
2.9 Communication example

Figure: transition probabilities P[r_j | m_i] from the messages m₀, m₁ to the received values r₀, r₁, r₂.
• Probability function:
PS ({m0 }) = Pm0
PS ({m1 }) = Pm1
such that Pm0 + Pm1 = 1.
The source output is modulated (mapped into R):
s : Ωsource → R
s: m0 7 → s0
s: m1 7→ s1
which is easily seen to satisfy the axioms of a random variable.
2. Continuous Communication Channel: similarly to the Discrete Commu-
nication Channel of §2.2, it is a transformation with unpredictable char-
acteristics. In the present case, the channel output r is given by:
r =s+n
where n is a random variable statistically independent of s (equivalently n
is statistically independent of the events {m0 } and {m1 }) with probability
density function pn (α). Combining the source, the modulator and the
channel we obtain the following set of results:
{(m0 , ρ), (m1 , ρ) : ρ ∈ R},
and the likelihood of a result is described by the mixed form joint proba-
bility density function
pr (ρ, {mi }) = pr (ρ | {mi })P ({mi })
= pn+si (ρ | {mi })Pmi
= pn (ρ − si | {mi })Pmi
= pn (ρ − si )Pmi (2.13)
for i = 0, 1.
3. The decision element is a deterministic mapping
m̂ : R → {m0 , m1 } = Ωsource
m̂ : ρ 7→ m̂(ρ)
and similarly to §2.2 we define the sample space
ΩD = {(m0 , m0 ), (m0 , m1 ), (m1 , m0 ), (m1 , m1 )}
where the first component of each pair denotes the message transmitted
and the second component denotes the decision made. The probability of
each sample point is:
∫
P ({mi , mj }) = pr (ρ, {mi })dρ
m̂(ρ)=mj
Figure 2.24:
C = {(m0 , m0 ), (m1 , m1 )} ⊂ ΩD
where we define:
R ⊃ I0 = {ρ ∈ R : m̂(ρ) = m0 }
R ⊃ I1 = {ρ ∈ R : m̂(ρ) = m1 }
(this is analogous to equation (2.1) in §2.2 of these notes). The optimum decision
rule (to maximize P (C ) in equation (2.14)) is:
m̂(ρ) = { m₀ ; if p_r(ρ, {m₀}) ≥ p_r(ρ, {m₁})
         m₁ ; otherwise                          (2.15)
for every ρ ∈ R; this is the rule that is used to determine the optimal deci-
sion regions I0 , I1 . Finally, P (C ) of equation (2.14) is expanded as (equation
(2.117a) in W&J):
P(C) = ∫_{I₀} p_r(ρ | {m₀}) P({m₀}) dρ + ∫_{I₁} p_r(ρ | {m₁}) P({m₁}) dρ
     = P_{m₀} ∫_{I₀} p_r(ρ | {m₀}) dρ + P_{m₁} ∫_{I₁} p_r(ρ | {m₁}) dρ

where the two integrals are P(C | {m₀}) and P(C | {m₁}) respectively.
W&J start the analysis with the decision rule of equation W&J (2.108) (equiv-
alent to the decision rule of equation (2.15) above), as do most textbooks; we
have showed that this is optimal.
Since ln( ) is a strictly increasing function we can take ln( ) on both sides of the above inequality to obtain:

m̂ : ρ ↦ m₀ ⇔ [−(ρ − s₀)² + (ρ − s₁)²]/(2σ²) > ln(P({m₁})/P({m₀}))
           ⇔ −ρ² + 2ρs₀ − s₀² + ρ² − 2ρs₁ + s₁² > 2σ² ln(P({m₁})/P({m₀}))
           ⇔ 2ρ(s₀ − s₁) > s₀² − s₁² + 2σ² ln(P({m₁})/P({m₀}))
           ⇔ ρ > (s₀ + s₁)/2 + (σ²/(s₀ − s₁)) ln(P({m₁})/P({m₀})) ≜ a
P(E) = 1 − P(C)
     = P(E | {m₀}) P({m₀}) + P(E | {m₁}) P({m₁})

P(E | {m₀}) = P(r ∈ I₁ | {m₀})
            = P(r < a | {m₀})
            = P(s₀ + n < a | {m₀})
            = P(n < a − s₀ | {m₀})
            = P(n < a − s₀)
            = Q((0 − (a − s₀))/σ)
            = Q((s₀ − a)/σ)

Similarly we find

P(E | {m₁}) = Q((a − s₁)/σ)
16 without loss of generality
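A sketch of the optimal receiver with assumed numerical values (antipodal signals, noise standard deviation and priors chosen purely for illustration), cross-checked by a Monte Carlo simulation.

```python
# Sketch: threshold a and error probability P(E) of the optimal binary receiver.
import numpy as np
from math import erfc, sqrt

s0, s1 = 1.0, -1.0          # assumed signals, s0 > s1
sigma  = 0.8                # assumed noise standard deviation
P0, P1 = 0.6, 0.4           # assumed a priori probabilities

a = (s0 + s1) / 2 + (sigma**2 / (s0 - s1)) * np.log(P1 / P0)   # threshold derived above

Q = lambda x: 0.5 * erfc(x / sqrt(2))
P_E = P0 * Q((s0 - a) / sigma) + P1 * Q((a - s1) / sigma)

rng = np.random.default_rng(3)
N = 500_000
m1_sent = rng.random(N) < P1
r = np.where(m1_sent, s1, s0) + sigma * rng.standard_normal(N)
decide_m1 = r < a                                  # decide m1 when rho < a (s1 < s0)
print(P_E, np.mean(decide_m1 != m1_sent))          # analytical and simulated P(E) agree
```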
Q((s₀ − s₁)/(2σ)) ≥ P({m₀}) Q((s₀ − a)/σ) + P({m₁}) Q((a − s₁)/σ)

where the left-hand side is the useful upper bound (on P(E)).
Proof. Refer to the last paragraph of page 81 and top of page 82 in W & J.
The significance of the upper bound is that it is possible to design a receiver for which the probability of error is no larger than Q((s₀ − s₁)/(2σ)). It also follows that
the probability of error of an optimal receiver is largest when the messages are
equiprobable, i.e. when the source is most unpredictable. The probability of
error of an optimal receiver can be made arbitrarily small by making the mes-
sages unequally likely. For example a source with a priori message probabilities
P ({m0 }) = 1, P ({m1 }) = 0 leads to a trivial receiver that always guesses mes-
sage m0 ; no information is however communicated from such a trivial source.
Theorem 19. Let E_b denote the maximum allowable transmitted energy per message. The optimal receiver will achieve

P(E) ≤ Q(√(2 E_b/(2σ²)))

Definition 2.9.2. The ratio E_b/(2σ²) is the signal-to-noise ratio in the system.
Remarks.
1. W&J uses E_b/σ² as the definition of the signal-to-noise ratio. We will later see why it is preferable to use E_b/(2σ²); this is not important for now. The probability of error of the optimal receiver is sketched in figure 2.25.
Figure 2.25:
In chapter 5 (W&J) other factors are added to the list, such as bandwidth
of the channel, rate of transmission, . . .
E[x] = E[g(y)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(β) p_y(β) dβ    (2.16)
Sketch of proof. The outline of the proof can be represented as in figure 2.26.
Figure 2.26: Outline of the proof: the domain of y is partitioned into the sets B_i = {β : g(β) ∈ I_i}, which g maps onto the intervals I_i containing the points a_i in the domain of x.
and the n-th moment of the probability density function px (α) is then simply the
expected value of the random variable xⁿ. From now on we naturally extend the definition of the expected value to complex-valued random variables:
w: Ω → C
w: ω 7 → x(ω) + jy(ω)
Theorem 23. Let px (α) and Mx (ν) respectively denote the probability density
function and characteristic function of a random variable x of which all the
moments are finite.
1. M_x^{(n)}(0) = jⁿ E[xⁿ], where M_x^{(n)}(0) denotes the n-th derivative of M_x(ν) evaluated at ν = 0.
Sketch of proof.
1. refer to W & J.
2. e^{jνx} = Σ_{n=0}^{∞} (jνx)ⁿ/n!

It follows that M_x(ν) = E[e^{jνx}] = Σ_{n=0}^{∞} (jν)ⁿ E[xⁿ]/n!. Finally, p_x(α) = (1/(2π)) ∫_{−∞}^{∞} M_x(ν) e^{−jνα} dν.
M_x(ν) = e^{jνµ} e^{−ν²σ²/2}

M_x^{(1)}(ν) = (jµ − νσ²) e^{jνµ} e^{−ν²σ²/2}

M_x^{(2)}(ν) = −(µ² + 2jµσ²ν + σ² − ν²σ⁴) e^{jνµ} e^{−ν²σ²/2}

M_x^{(1)}(0) = jµ

M_x^{(2)}(0) = −µ² − σ²
Finally we obtain:

E[x] = M_x^{(1)}(0)/j = µ

E[x²] = M_x^{(2)}(0)/j² = µ² + σ² ⇒ σ_x² = σ²

This shows that the parameters µ and σ² respectively represent E[x] and E[(x − x̄)²] = Var(x), and we consequently write the characteristic function of a Gaussian random variable as M_x(ν) = e^{jνx̄} e^{−ν²σ_x²/2}.
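A symbolic check (sympy assumed available) that the derivatives of the Gaussian characteristic function at ν = 0 reproduce the mean and second moment:

```python
# Sketch: E[x] = M'(0)/j and E[x^2] = M''(0)/j^2 for the Gaussian characteristic function.
import sympy as sp

nu = sp.Symbol('nu', real=True)
mu = sp.Symbol('mu', real=True)
sigma = sp.Symbol('sigma', positive=True)

M = sp.exp(sp.I * nu * mu - nu**2 * sigma**2 / 2)            # M_x(nu)

m1 = sp.simplify(sp.diff(M, nu, 1).subs(nu, 0) / sp.I)        # first moment
m2 = sp.simplify(sp.diff(M, nu, 2).subs(nu, 0) / sp.I**2)     # second moment
print(m1)                         # mu
print(sp.expand(m2))              # mu**2 + sigma**2
print(sp.simplify(m2 - m1**2))    # sigma**2, the variance
```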
Theorem 24. The general form of the Gaussian probability density function is:

p_x(α) = (1/√(2πσ_x²)) e^{−(α−x̄)²/(2σ_x²)}

Proof. Calculate F⁻¹[M_x(−ν)] with the above M_x(ν) using MAPLE, another math package or a table of Fourier transform pairs.
Theorem 25.

1. If x is a Gaussian random variable, then:

E[(x − x̄)ⁿ] = { 0 ; n odd
                n! σ_xⁿ / (2^{n/2} (n/2)!) ; n even
as already established.
We then have:

E[x] = (b + a)/2,
E[x²] = (a² + b² + ab)/3,
E[(x − x̄)²] = (b − a)²/12 = σ_x²
12
where the last equality is obtained by MAPLE. Multiplying the last equa-
We then have
E[x] = N p,
E[x2 ] = N p(N p + 1 − p),
E[(x − x̄)2 ] = N p(1 − p) = σx2
We then have
E[x] = ρ,
E[x2 ] = ρ(ρ + 1),
E[(x − x̄)2 ] = ρ = σx2
Refer to problem 2.34 in W&J for the calculation of the characteristic
function. We find:
M_x(ν) = e^{ρ(e^{jν} − 1)} = e^{ρ(cos(ν) − 1)} e^{jρ sin(ν)}

We easily find:

x̄ = M_x′(0)/j = ρ

E[x²] = M_x″(0)/j² = ρ² + ρ
as already established.
The probability density function px (α) of a random variable x leads to the
probability calculation of events involving x. Through the E( ) operator we also
obtain the moments (when they exist) and the characteristic function Mx (ν).
The moments and/or the characteristic function may conversely be used to
perform probability calculations of events on x by first recovering the probability
density function px (α). In the next section we see that the moments can directly
be used (i.e. without first calculating px (α)) to approximate some probabilities;
Chebyshev’s inequality (lemma 26) is one such example.
m = (1/N) Σ_{i=1}^{N} x_i

E(m) = x̄_i = x̄

Var(m) = σ_{x_i}²/N = σ_x²/N
In this section we study some formal probability statements, referred to as limit
theorems, about the sample mean. In particular we will see that it is fair to
assume that m is close to E(x) as N becomes large.
Remark. If xi is a random variable defined
∑N from the i-th trial of a compound
experiment, then the sample mean N1 i=1 xi is a statistical estimate of the
expected value x̄. The definition of the sample mean does not require that the
xi be statistically independent repeated trials of the same experiment but in
practice this is often the case.
18 This assumes that the sample mean with only 4 random variables has an approximately
Proof. We have

σ_y² = ∫_{−∞}^{∞} (α − ȳ)² p_y(α) dα    (2.17)
     = ∫_{−∞}^{∞} β² p_y(β + ȳ) dβ      (2.18)
     ≥ ∫_{|β|≥ϵ} β² p_y(β + ȳ) dβ
     ≥ ϵ² ∫_{|β|≥ϵ} p_y(β + ȳ) dβ       (2.19)
     = ϵ² ∫_{|α−ȳ|≥ϵ} p_y(α) dα          (2.20)
     = ϵ² P(|y − ȳ| ≥ ϵ)

Equation (2.18) follows from (2.17) using the change of variable β = α − ȳ, and equation (2.20) follows from (2.19) using the change of variable α = β + ȳ.
In concrete words the lemma says that “The variance is a measure of the
randomness of a random variable”.19
In the last two paragraphs of The Weak Law of Large Numbers, Wozen-
craft&Jacobs present a very important special case that confirms the validity
of the estimation of the probability of an event A by Monte Carlo simulations.
19 W&J, page 94
20 W&J, page 96
This is what we meant in item #2 described under Relation of the Model to the
Real World on page 6 of these notes. This also leads to a (conservative) bound
on the confidence intervals given a certain number of repetitions.21
z = (1/√N) Σ_{i=1}^{N} (y_i − ȳ) = √N (m − ȳ)

and denote by F_z(α) the probability distribution function of z. Then:

lim_{N→∞} F_z(α) = ∫_{−∞}^{α} (1/(√(2π) σ_y)) e^{−β²/(2σ_y²)} dβ
In concrete terms, the distribution function of the sample mean becomes
Gaussian as N increases.
Remarks. (pp 108 - 109):
1. The result does not imply that pz (α) is Gaussian (refer to Wozen-
craft&Jacobs, page 108 for a very good discussion).
2. As N becomes large the shape of the distribution function of the sum
approaches the shape of the Gaussian distribution function. “Large N ”
depends on the shape of the density function of each yi . Obviously if the
yi ’s are all Gaussian to start with, then the result is true for any N . In fact
if the yi are all Gaussian they don’t need to be statistically independent,
nor do they need to have the same variance nor the same mean. This
general result will be shown in chapter 3 (W&J),22 but the special case of
two jointly Gaussian random variables of equal unity variance and equal
zero-mean can be shown using equations (2.87) (W&J) and (2.58) (W&J).
This is useful, if not required to solve problems 2.25 (W&J) and 3.2(c)
(W&J).
3. The result explains why we often model the noise in a system with a
Gaussian probability density function (more on this in problems 3.15,
3.16).
4. The result finds applications similar to the weak law of large numbers:

P( (1/√N) Σ_{i=1}^{N} (y_i − ȳ) ≥ a ) ≈ Q(a/σ_y)
5. The approximation:

P( (1/N) Σ_{i=1}^{N} y_i − ȳ ≥ ϵ ) = P( (1/N) Σ_{i=1}^{N} (y_i − ȳ) ≥ ϵ )
                                   = P( (1/√N) Σ_{i=1}^{N} (y_i − ȳ) ≥ ϵ√N )
                                   ≈ Q(ϵ√N/σ_y)    (2.22)
is not normally used. The Chernoff bound (§2.11.3) is much tighter for
large values of N and is also much tighter than the weak law of large num-
bers. The above approximation is nonetheless sometimes used to estimate
the confidence intervals of simulation results, because it is easier to use
than the Chernoff bound.
z = (1/√N) Σ_{i=1}^{N} y_i ⇒ M_z(ν) = [M_y(ν/√N)]^N
(3) Using the Taylor series of M_y(ν) with ȳ = 0, as in the proof of part 2 of theorem 23, we obtain (W&J equation (2.176)):

M_y(ν) = 1 − ν²σ_y²/2 + ν³ f(ν),    where f(ν) → −j E[y³]/3! as ν → 0.
(2) & (3) together with the Taylor series of ln(1 + w) (refer to equations (2.176), (2.177) in W&J):

ln[M_z(ν)] = N ln[ 1 − ν²σ_y²/(2N) + (ν³/N^{3/2}) f(ν/√N) ]    (write the bracketed quantity as 1 + w)
           = N ln(1 + w)
           = N w − N w²/2 + N w³/3 − · · ·    when |w| < 1
           ≈ N w    if |w| ≪ 1

ln(M_z(ν)) ≈ N w = −ν²σ_y²/2 + (ν³/√N) f(ν/√N)

so that, as N → ∞,

M_z(ν) → e^{−ν²σ_y²/2}
In concrete words the above says that if M_y(ν) is the characteristic function of y₁, y₂, . . . , y_N, then the characteristic function of z = (1/√N) Σ_{i=1}^{N} y_i is M_z(ν) = M_y(ν/√N)^N. When N becomes large enough this approaches e^{−ν²σ_y²/2}. This is illustrated for y uniformly distributed between −1/2 and 1/2, for which ȳ = 0 and σ_y² = Var(y) = 1/12 (refer to §2.10.2). The plots shown in figure 2.28 illustrate that M_y(ν/√N)^N approaches e^{−ν²σ_y²/2} as N increases.
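A minimal numerical sketch of figure 2.28: for y uniform on [−1/2, 1/2], M_y(ν) = sin(ν/2)/(ν/2), and [M_y(ν/√N)]^N approaches e^{−ν²/24}.

```python
# Sketch: convergence of M_y(nu/sqrt(N))**N to exp(-nu**2 * sigma_y**2 / 2), sigma_y**2 = 1/12.
import numpy as np

def M_y(nu):
    return np.sinc(nu / (2 * np.pi))        # np.sinc is sin(pi u)/(pi u), so this equals sin(nu/2)/(nu/2)

nu = np.linspace(-6, 6, 241)
limit = np.exp(-nu**2 / 24)                 # e^{-nu^2 sigma_y^2 / 2}
for N in (1, 2, 4, 16, 64):
    approx = M_y(nu / np.sqrt(N))**N
    print(N, np.max(np.abs(approx - limit)))   # maximum gap shrinks as N grows
```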
E[x e^{λ₀x}] / E[e^{λ₀x}] = d
1. The conditions of the Chernoff bound are always verified if the xi ’s are
discrete random variables taking only a finite number of values.
2. d > x̄ ⇒ λ0 ≥ 0
d < x̄ ⇒ λ0 ≤ 0
Figure 2.28: Illustration of lim_{N→∞} M_y(ν/√N)^N = e^{−ν²σ_y²/2}. Panels: (a) M_y(ν), N = 1; (b) M_y(ν/√2)², N = 2; (c) M_y(ν/√N)^N, N = 1, 2, 4; (d) M_y(ν/2)⁴ ≈ e^{−ν²/24}.
1. Exact probability:

P(same face 3N/4 or more times) = 2 Σ_{k=3N/4}^{N} C(N, k) (1/2)^k (1/2)^{N−k}
                                = (1/2^{N−1}) Σ_{k=3N/4}^{N} C(N, k)
Finally we obtain:

P(m ≥ 1/2) ≤ (2/3^{3/4})^N ≈ 0.877383^N
E[x e^{λ₀x}] / E[e^{λ₀x}] = −1/2

with E[e^{λ₀x}] = (e^{λ₀} + e^{−λ₀})/2 and E[x e^{λ₀x}] = (e^{λ₀} − e^{−λ₀})/2. Therefore

(e^{λ₀} − e^{−λ₀}) / (e^{λ₀} + e^{−λ₀}) = −1/2 ⇒ λ₀ = −ln(√3).

It follows that
Finally we obtain:

P(m ≤ −1/2) ≤ (2/3^{3/4})^N ≈ 0.877383^N
In summary:

1. Exact probability value: (1/2^{N−1}) Σ_{k=3N/4}^{N} C(N, k)

2. Weak Law of Large Numbers (upper) bound: 4/N
The graphs shown in figures 2.29 illustrate that the probability goes to 0 as
N → ∞. The weak law of large numbers bound becomes looser as N increases.
Whenever possible the Chernoff bound should be used because it is the tightest
exponential bound.
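A small sketch comparing the three quantities as N grows (the two one-sided Chernoff bounds are added to bound the two-sided event):

```python
# Sketch: exact probability vs. weak-law bound 4/N vs. Chernoff bound 2*(2/3**0.75)**N.
from math import comb, ceil

def exact(N):                        # P(same face at least 3N/4 times), fair coin
    k0 = ceil(3 * N / 4)
    return sum(comb(N, k) for k in range(k0, N + 1)) / 2 ** (N - 1)

for N in (4, 8, 16, 32, 64, 128):
    chernoff = 2 * (2 / 3 ** 0.75) ** N
    weak_law = 4 / N
    print(N, exact(N), weak_law, chernoff)
# The Chernoff bound decays exponentially while the weak-law bound only decays as 1/N.
```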
P(E) = Σ_{k=(N+1)/2}^{N} C(N, k) (1/6)^k (5/6)^{N−k}
E ⇔ Σ_{i=1}^{N} x_i ≥ (N + 1)/2
  ⇔ Σ_{i=1}^{N} x_i ≥ N/2
  ⇔ (1/N) Σ_{i=1}^{N} x_i ≥ 1/2

since by assumption N is odd. Therefore P(E) = P( (1/N) Σ_{i=1}^{N} x_i ≥ 1/2 ), where:
x̄ = (1 − p)(0) + (p)(1) = p
E[x²] = (1 − p)(0)² + (p)(1)² = p
σ_x² = p − p² = p(1 − p)    (not required to apply the Chernoff bound)

and x denotes any one of the random variables x_i. The Chernoff bound (d = 1/2 > x̄ = p) then states that:

P( (1/N) Σ_{i=1}^{N} x_i ≥ 1/2 ) ≤ ( E[e^{λ₀(x−1/2)}] )^N

where λ₀ = ln((1 − p)/p) > 0 since by assumption p < 1/2 (refer to W&J, page 103 for the details). It follows that:

E[e^{λ₀(x−1/2)}] = √(2p) √(2(1 − p)) = 2√(p(1 − p)) < 1    (since p < 1/2)

So P( (1/N) Σ_{i=1}^{N} x_i ≥ 1/2 ) ≤ (2√(p(1 − p)))^N and we immediately see that the probability of error goes to 0 exponentially as N → ∞.
The graphs in figure 2.30 show the behaviour of the Chernoff bound in
comparison to the exact probability of error (obtained in problem 2.10(c)) when
p = 1/6.
Figure 2.29: (a) linear (vertical) scale; (b) logarithmic (vertical) scale.

Figure 2.30: (a) linear (vertical) scale; (b) logarithmic (vertical) scale.
In particular we have:

E[x₁^{i₁} x₂^{i₂} · · · x_k^{i_k}] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} α₁^{i₁} α₂^{i₂} · · · α_k^{i_k} p_x(α) dα
Definition 2.12.1.
density function is not known but statistical estimates of the mixed moments
are available.
Lemma 30. The total number of different n-th moments of a set of v random variables is given by the expression:

Σ_{k=1}^{v} C(v, k) C(n − 1, k − 1)

where, by convention, C(n, k) = 0 whenever k > n.
Sketch of proof. This is equivalent to the number of ways in which a line segment
of length n can be divided into v or fewer sub-segments (all lengths are integer).
Cov(x₁, x₂) = E[x₁x₂] − x̄₁x̄₂ = E[x₁]E[x₂] − x̄₁x̄₂ = 0

ρ ≜ Cov(x₁, x₂) / (σ_{x₁} σ_{x₂})
Proposition 31.
2. If x2 = ax1 , a ̸= 0 then ρ = ±1
3. −1 ≤ ρ ≤ 1.
Proof. Only part 3 of the proposition remains to be proved (parts 1 and 2 were
shown before the statement of the proposition). Let z = x1 + λx2 where λ ∈ R.
Then for every λ we have:
0 ≤ V ar(z) = V ar(x1 + λx2 )
= E[(x1 + λx2 )2 ] − E[x1 + λx2 ]2
= E[x21 ] + λ2 E[x22 ] + 2λE[x1 x2 ]
−E[x1 ]2 − λ2 E[x2 ]2 − 2λE[x1 ]E[x2 ]
= V ar(x1 ) + λ2 V ar(x2 ) + 2λCov(x1 , x2 )
The parabola in λ is always positive or 0 ⇒ the parabola has no distinct real
roots ⇒ its discriminant ∆ is not positive:
0 ≥ ∆ = "b² − 4ac" = 4 Cov(x₁, x₂)² − 4 Var(x₁) Var(x₂)

Therefore Cov(x₁, x₂)² / (Var(x₁) Var(x₂)) ≤ 1 ⇒ ρ² ≤ 1 ⇒ |ρ| ≤ 1.
−1 < ρ < 1, from equation W&J (2.58). The parameter ρ used in this expression
is precisely the correlation coefficient of x1 and x2 . Indeed, in this case we found
that x1 , x2 are both Gaussian with x1 = x2 = 0 and σx21 = σx22 = 1 (refer to
equation (2.64), page 56 in W&J). It follows that the correlation coefficient is
simply given by
correlation coefficient = E[(x₁ − x̄₁)(x₂ − x̄₂)] / √(σ_{x₁}² σ_{x₂}²) = E[x₁x₂]

We easily obtain (using MAPLE or math tables):

correlation coefficient = E[x₁x₂] = ∫∫_{−∞}^{∞} α₁ α₂ p_{x₁,x₂}(α₁, α₂) dα₁ dα₂ = ρ
The significance of ρ in the joint Gaussian density function was pointed out
from the results obtained at W&J pages 52 and 69.
Definition 2.12.2. Two random variables x1 , x2 such that Cov(x1 , x2 ) = 0 are
said to be uncorrelated.
Remark. It follows from proposition 31 that if x1 , x2 are statistically indepen-
dent then they are uncorrelated. The converse is not always true.
Example 2.12.3. The joint probability density function of the two random variables x, y of example 2.6.1 is given by:

p_{x,y}(α, β) = { A(β + 1) e^{−α} ; 0 ≤ α, −1 ≤ β ≤ 1
                 0 ; elsewhere

p_x(α) = 2A e^{−α} ; α ≥ 0

E[x] = 1        E[y] = ?
E[x²] = 2       E[y²] = ?
Var(x) = ?      Var(y) = 2/9
E[xy] = ?
Cov(x, y) = ?
ȳ = E[y] = m_x × aᵀ + b

σ_y² = E[(y − ȳ)²] = a × Λ_x × aᵀ

m_x = (0, 0, 0)

Λ_x = [ 1   0   1/2
        0   1   0
        1/2 0   1  ]

σ_y² = a × Λ_x × aᵀ = (1, 1, 1) × Λ_x × (1, 1, 1)ᵀ = 4
my = mx × AT + b
Λy = A × Λx × AT
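A short numerical sketch of these formulas with numpy (the 3 × 3 example above, plus a generic matrix transformation A and offset b chosen here only for illustration):

```python
# Sketch: mean and covariance of a linear transformation of a random vector.
import numpy as np

m_x = np.zeros(3)
Lambda_x = np.array([[1.0, 0.0, 0.5],
                     [0.0, 1.0, 0.0],
                     [0.5, 0.0, 1.0]])

a = np.array([1.0, 1.0, 1.0])
print(a @ Lambda_x @ a)                 # sigma_y^2 = a Lambda_x a^T = 4

A = np.array([[1.0, 2.0, 0.0],          # assumed matrix for the vector case
              [0.0, 1.0, -1.0],
              [3.0, 0.0, 1.0]])
b = np.array([1.0, 0.0, -2.0])          # assumed offset
m_y = m_x @ A.T + b                     # m_y = m_x A^T + b
Lambda_y = A @ Lambda_x @ A.T           # Lambda_y = A Lambda_x A^T
print(m_y)
print(Lambda_y)
```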
Remark. p_x(α) can be regained from M_x(ν) by (refer W&J equation (3.61)):

p_x(α) = (1/(2π)^k) ∫_{−∞}^{∞} M_x(ν) e^{−jν×αᵀ} dν
Theorem 34.
Proof. We prove the first part of the theorem only. The second part is easy to
prove; refer to the property 2 on page 165 in W&J.
First, if the x_i's are statistically independent then the random variables e^{jν_i x_i} are also statistically independent. It follows that M_x(ν) = E[∏_{i=1}^{k} e^{jν_i x_i}] = ∏_{i=1}^{k} E[e^{jν_i x_i}] = ∏_{i=1}^{k} M_{x_i}(ν_i).
Conversely, if M_x(ν) = ∏_{i=1}^{k} M_{x_i}(ν_i) then

p_x(α) = (1/(2π)^k) ∫_{−∞}^{∞} M_x(ν) e^{−jν×αᵀ} dν
       = ∫_{−∞}^{∞} [ ∏_{i=1}^{k} (1/(2π)) M_{x_i}(ν_i) e^{−jν_i α_i} ] dν
       = ∏_{i=1}^{k} (1/(2π)) ∫_{−∞}^{∞} M_{x_i}(ν_i) e^{−jν_i α_i} dν_i
       = ∏_{i=1}^{k} p_{x_i}(α_i)
2.13.1 Definition

Definition 2.13.1. A random vector x = (x₁, x₂, . . . , x_k) is Gaussian (equivalently the random variables x₁, x₂, . . . , x_k are jointly Gaussian) if

M_x(ν) = e^{−½ ν×Λ_x×νᵀ + jν×m_xᵀ}

where Λ_x and m_x are respectively the covariance matrix and mean vector of x.
Remarks.
The Central Limit Theorem for a single Gaussian random variable (Chapter
2 in W&J) can be generalized to random vectors. If
z = (1/√N) Σ_{i=1}^{N} x_i
y = x × Aᵀ + a

where A is a k × k matrix (A ∈ R^{k×k}) and a ∈ R^k is a 1 × k vector. If x is Gaussian then y is also Gaussian.
Part 2 of the corollary generalizes the result of theorem 25. Using theorem
33, we can give a simpler proof of Part 2 of the corollary (property (4)) than
that given by W&J.
M_y(ν) = E[exp(jν × yᵀ)]
       = E[exp(jν × (x × Aᵀ + b)ᵀ)]
       = exp(jν × bᵀ) E[exp(j(ν × A) × xᵀ)]
       = e^{jν×bᵀ} M_x(ν × A)
       = e^{jν×bᵀ} exp(jν × A × m_xᵀ) exp(−½ ν × A × Λ_x × Aᵀ × νᵀ)
       = exp(jν × (m_x × Aᵀ + b)ᵀ) exp(−½ ν × (A × Λ_x × Aᵀ) × νᵀ)

where m_x × Aᵀ + b = m_y and A × Λ_x × Aᵀ = Λ_y.
a₀ + Σ_{i=1}^{k} a_i x_i = 0 ⇒ a_i = 0, ∀i,

otherwise x is singular.
where

Λ_x = diag[σ₁², σ₂², . . . , σ_k²],

|Λ_x| = ∏_{i=1}^{k} σ_i²,

Λ_x⁻¹ = diag[1/σ₁², 1/σ₂², . . . , 1/σ_k²].
x × Aᵀ = y − m_y

Remark. The above proof also shows that |Λ_y| > 0 and √|Λ_y| is consequently a real number as required. Indeed Λ_y = A × Λ_x × Aᵀ where A is an invertible matrix and Λ_x = diag[σ₁², σ₂², . . . , σ_k²]. It follows that:
Example 2.13.1. (from W&J, page 171) Consider 2 jointly Gaussian random
variables y1 , y2 such that:
E[y₁] = E[y₂] = 0
E[y₁²] = E[y₂²] = σ²
E[y₁y₂] = ρσ²

m_y = (0, 0)

Λ_y = [ Var(y₁) Cov(y₁, y₂) ; Cov(y₂, y₁) Var(y₂) ] = [ σ² ρσ² ; ρσ² σ² ]

|Λ_y| = σ⁴(1 − ρ²)

Λ_y⁻¹ = (1/(σ⁴(1 − ρ²))) [ σ² −ρσ² ; −ρσ² σ² ] = [ 1/(σ²(1−ρ²)) −ρ/(σ²(1−ρ²)) ; −ρ/(σ²(1−ρ²)) 1/(σ²(1−ρ²)) ]
which leads to

(β₁, β₂) × Λ_y⁻¹ × (β₁, β₂)ᵀ = (β₁² − 2ρβ₁β₂ + β₂²) / (σ²(1 − ρ²))

Finally we obtain:

p_y(β₁, β₂) = (1/(2πσ²√(1 − ρ²))) exp( −(β₁² − 2ρβ₁β₂ + β₂²) / (2σ²(1 − ρ²)) ).
2.14 Problems
1. A random experiment consists of drawing two balls from four urns, identified by the letter A and the digits 1, 2, 3, as follows. The first ball is drawn from urn A, which contains 75 balls marked 1, 24 balls marked 2, and 1 ball marked 3. The digit on the ball drawn from urn A indicates the urn from which the second ball must be drawn. The contents of urns 1, 2, 3 are as follows:
where the first component of each sample point is the result of the first ball
(drawn from urn A) and the second component is the colour of the second
ball (drawn from either urn B or C). The class of events is assumed to be
the set of all 16 subsets of Ω.
(b) Show that P [{the second ball is red}] = 0.27. It obviously follows
from this result that P [{the second ball is black}] = 0.73 (you need
not show the latter).
(c) Calculate the probability that the second ball is drawn from urn B
if it is red. Calculate the probability that the second ball is drawn
from urn C if it is red.
(d) Based on your result in part (2c), which urn would you guess the
second ball is drawn from if it is red. Justify your answer. What is
the probability that your guess is correct.
Calculate the following:
(a) If r = 1.5 is received, which of the three messages was most likely
transmitted.
(b) The following decision rule (not necessarily the best) is used in parts (7b) and (7c):

m̂(ρ) = { m₀ ; ρ < 1.25
         m₁ ; 1.25 < ρ < 4.375     (2.24)
         m₂ ; ρ > 4.375

Calculate P(m̂(r) = m₀), P(m̂(r) = m₁), P(m̂(r) = m₂).
Figure 2.31: (a), (b), (c)

Figure 2.32:
9. We wish to simulate a communication system on a digital computer and
estimate the error probability P (E ) by measuring the relative frequency
of error. Let N denote the number of independent uses of the channel in
the simulation and xi , i = 1, 2, . . . , N denote the random variable such
that:
0 ; no error on the i-th use of the channel
xi =
1 ; there is an error on the i-th use of the channel
m = (1/N) Σ_{i=1}^{N} x_i
Hint: For the Weak Law of Large Numbers, use equation (2.21) on page
70 in the notes (in theorem 27). For the Central Limit Theorem, use the
approximation:
P( |(1/N) Σ_{i=1}^{N} x_i − x̄| ≥ ϵ ) ≈ 2 Q(ϵ√N/σ_x)
(c) Calculate the mean and variance of x using the above characteristic
function.
(d) Calculate the mean and variance of x using the probability density
function px (α).
The followings may be useful:
∫ x e^{−x} dx = −(1 + x) e^{−x}

∫ x² e^{−x} dx = −(2 + 2x + x²) e^{−x}

lim_{x→∞} xⁿ e^{−x} = 0, ∀n ∈ N
11. The mean vector and covariance matrix of a random vector x = (x1 , x2 )
are respectively given by:
m_x = (1, 0)

Λ_x = [ 9 −4 ; −4 4 ]
(a) Calculate the mean vector my and covariance matrix Λy of the ran-
dom vector y = (y1 , y2 ) defined by the transformation:
y = x × [ 1 1 ; 3 −1 ] + (0, 1)

m_y = (2, 1)

Λ_y = [ 25 −10 ; −10 15 ]
Figure 2.33:
Chapter 3
Random Waveforms
x: Ω×R → R
x: (ω, t) 7 → x(ω, t)
2. The functions of time x(ω, t), ω ∈ Ω, taken by the random process x(t)
are called sample functions.
Similarly x(t2 ) is another random variable and the behaviour of x(t1 ) and x(t2 )
is specified by the joint probability density function px(t1 ),x(t2 ) (α, β). In general,
for times t1 , t2 , . . . , tk we denote the joint probability function by px(t) (α) where
t = (t1 , t2 , . . . , tk )
x(t) = (x(t1 ), x(t2 ), . . . , x(tk ))
α = (α1 , α2 , . . . , αk )
1 sequence ⇒ denumerable number of components; this is not the case for a process.
A = {ω ∈ Ω : a_i < x(ω, t_i) ≤ b_i, i = 1, 2, . . . , k}

then

P(A) = ∫_{a₁}^{b₁} ∫_{a₂}^{b₂} · · · ∫_{a_k}^{b_k} p_{x(t)}(α) dα_k · · · dα₂ dα₁
For example a random process for which the probability system and the
mapping x : Ω × R → R are known is specified. In applications, four methods
of specification are encountered: 3
2. The probability density functions px(t) (α) are stated directly. This is
only possible in some very particular and simple cases, such as Gaussian
process introduced in §3.6 (pages 186 - 192, W&J).
In the above we define two processes x(t) and y(t) to be equal if and only if the
probability of the event {ω ∈ Ω : x(ω, t) ̸= y(ω, t)} is equal to 0; the concept
is similar to that of equality of random variables as per definition 2.3.8.
x: Ω×R → R
x : (Head, t) 7→ 2
x : (Tail, t) 7→ sin(t)
2 W&J page 133
3 W&J, pages 133 - 135
P(x(0) = 2) = 1/2
P(x(0) = 0) = 1/2

x(π/2) is also a random variable and we have

P(x(π/2) = 2) = 1/2
P(x(π/2) = 1) = 1/2
The probability density functions of x(0), x(π/2) are sketched in figures 3.1,
3.2. Their joint probability density function is sketched in figure 3.3.
Figure 3.1:
Figure 3.2:
Figure 3.3:
Then x(t0 ) = 10 sin(2πt0 +θ) and the probability density function of the random
variable x(t0 ) is obtained using the technique of §2.4.2:
p_{x(t₀)}(α) = Σ_{β∈S(α)} p_θ(β)/|g′(β)|

where

g(β) = 10 sin(2πt₀ + β)
g′(β) = 10 cos(2πt₀ + β)
|g′(β)| = |10 cos(2πt₀ + β)|
As can be seen from figure 3.4, S(α) = {β0 , β1 } (i.e. S(α) contains two ele-
ments) for any −10 < α < 10, and β0 is given by:
β0 = arcsin(α/10) − 2πt0
The expression for β1 is more complicated but it is not required since pθ (β0 ) =
pθ(β₀) = pθ(β₁) = 1/(2π) and |g′(β₁)| = |g′(β₀)| = |10 cos(arcsin(α/10))| (easy to see on the
graph of figure 3.4). We then obtain directly (as long as −10 < α < 10):

p_{x(t₀)}(α) = p_θ(β₀)/|g′(β₀)| + p_θ(β₁)/|g′(β₁)|
             = 2 × (1/(2π)) / |10 cos(2πt₀ + arcsin(α/10) − 2πt₀)|
             = 1/(10π |cos(arcsin(α/10))|)
             = 1/(π√(100 − α²))

for any t₀. p_{x(0)}(α) is sketched on figure 3.5.
Figure 3.4:
Figure 3.5:
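A quick Monte Carlo sketch of example 3.1.2: samples of x(t₀) = 10 sin(2πt₀ + θ) with θ uniform on [0, 2π) follow the density 1/(π√(100 − α²)) regardless of t₀.

```python
# Sketch: first-order density of the random-phase sinusoid versus the formula above.
import numpy as np

rng = np.random.default_rng(4)
theta = rng.uniform(0.0, 2 * np.pi, 300_000)
t0 = 0.37                                    # any t0 gives the same density
samples = 10 * np.sin(2 * np.pi * t0 + theta)

edges = np.linspace(-9.5, 9.5, 39)           # stay away from the endpoints +-10
hist, _ = np.histogram(samples, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
pdf = 1.0 / (np.pi * np.sqrt(100.0 - centers**2))
print(np.max(np.abs(hist - pdf)))            # small everywhere away from +-10
```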
Definition 3.1.3. A random process for which all sample functions are periodic
with (same) period T > 0 is called periodic random process.
The random process of example 3.1.2 above is periodic and its period is 1
[second].
3.2 Stationariness
(W&J pp. 135 - 144)
Definition 3.2.1. A random process x(t) for which
Figure 3.6:
1. Using the technique of §2.4.2, the probability density function of the ran-
dom variable x(t1 ) = f (t1 + τ ) is given by:
∑ pτ (β)
px(t1 ) (α) =
|g ′ (β)|
β∈S(α)
where the transformation g( ) and the set of roots S(α) are respectively
given by:
g(τ ) = f (t1 + τ )
S(α) = {β ∈ [0, T ] : g(β) = f (t1 + β) = α}
We then have:
g ′ (β) ≡ slope of f ( ) = a/T
and one easily sees that ∀α ∈ [0, a] there exists a unique β0 ∈ S(α) (refer
to figure 3.7). We obtain directly:
pτ (β0 ) 1/T 1
px(t1 ) (α) = ′
= =
|g (β0 )| |a/T | a
Figure 3.7:
for every t1 , T, α ∈ R.
in which
h(t) = f (t − t1 + α1 T /a)
It follows that
α1 T
x(t2 ) = h(t2 ) = f (t2 − t1 + ) , a2
a
··· ··· ··· ··· ···
α1 T
x(tk ) = h(tk ) = f (tk − t1 + ) , ak
a
This is equation (3.15a) in W&J and it follows that
∏
k
px(t) (α) = px(t1 ) (α1 ) δ(αi − ai )
i=2
∏
k
α1 T
= px(t1 ) (α1 ) δ(αi − f (ti − t1 + ))
i=2
a
∏
k
α1 T
= px(t1 +T ) (α1 ) δ(αi − f ((ti + T ) − (t1 + T ) + ))
i=2
a
= px(t+T ) (α)
for every T ∈ R.
where z(t) = y(t + T ) (the process z(t) is specified by method 4 of page 102).
Let
Ωy ≡ set of sample functions of y(t)
Ωz ≡ set of sample functions of z(t)
Then we easily see that Ωy = Ωz (any function in Ωz lies in Ωy and vice versa).
Moreover the sample functions are all equiprobable. It follows that if
I = {a(t) ∈ Ωy : a(t) ≤ α}
K = {b(t) ∈ Ωz : b(t) ≤ α}
then I = K and
2π
This follows from W&J Appendix 2A in which we found that if (ρ = 0)
1 −α1 /2
2
α
2π e ; α1 ≥ 0, 0 ≤ α2 < 2π
pr,θ (α1 , α2 ) =
0 ; elsewhere
and we define the random variables x = r cos θ,4 y = r sin θ = x(0), then
1 −(β12 +β22 )/2
pxy (β1 , β2 ) = e
2π
By elimination of the random variable x we obtain
∫ ∞
1
pxy (β1 , β2 )dβ1 = √ e−β2 /2 ,
2
py (β2 ) =
−∞ 2π
and by the change of variables α = β2 the desired result is obtained.
Proof.
mx (t1 ) = E[x(t1 )]
∫ ∞
= αpx(t1 ) (α)dα
−∞
∫ ∞
= αpx(t2 ) (α)dα
−∞
= E[x(t2 )]
= mx (t2 )
Proof.

R_x(t, s) = E[x(t)x(s)] = ∫∫_{−∞}^{∞} αβ p_{x(t),x(s)}(α, β) dα dβ = ∫∫_{−∞}^{∞} αβ p_{x(t+T),x(s+T)}(α, β) dα dβ = E[x(t + T)x(s + T)] = R_x(t + T, s + T)
R_x(τ) ≜ E[x(t)x(t + τ)]
Definition 3.3.2. A random process x(t) for which the mean function is inde-
pendent of time and the autocorrelation function is independent of time origin
is said to be wide sense stationary.
Example 3.3.1. Consider the following random process specified from the ex-
periment consisting in the throw of a coin (P (Head) = P (Tail) = 1/2) as in
example 3.1.1:
x : Ω × R → R
x : (Head, t) ↦ 2
x : (Tail, t) ↦ sin t
1. Mean function:
mx(t) = E[x(t)]
      = (1/2) sin t + (1/2)(2)
      = 1 + (sin t)/2
The mean function mx (t) is sketched in figure 3.8(a).
2. Autocorrelation function:
Rx(t, s) = E[x(t)x(s)]
         = (1/2)(2 · 2) + (1/2)(sin t sin s)
         = 2 + (1/2) sin t sin s
         = 2 + (1/4) cos(t − s) − (1/4) cos(t + s)
3. Autocovariance function:
Figure 3.8: Mean and autocovariance functions for the process of example 3.3.1
Example 3.3.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable uniformly distributed between 0 and 2π (the process of example 3.1.2).
1. Mean function:
mx(t) = E[x(t)]
      = E[10 sin(2πt + θ)]
      = ∫_{−∞}^{∞} 10 sin(2πt + α) pθ(α) dα
      = (10/(2π)) ∫_{0}^{2π} sin(2πt + α) dα
      = 0
2. Autocorrelation function:
Rx(t, s) = E[x(t)x(s)]
         = E[100 sin(2πt + θ) sin(2πs + θ)]
         = 50 cos(2π(t − s)) − 50 E[cos(2π(t + s) + 2θ)]
         = 50 cos(2π(t − s)) − 50 ∫_{0}^{2π} cos(2π(t + s) + 2α) / (2π) dα
         = 50 cos(2π(t − s))
since the last integral equals 0.
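A quick numerical cross-check of mx(t) = 0 and Rx(t, s) = 50 cos(2π(t − s)) is sketched below in Python (a Monte Carlo ensemble average over θ; the sample size and the two time instants are arbitrary assumptions).

    import numpy as np

    rng = np.random.default_rng(1)
    theta = rng.uniform(0.0, 2.0 * np.pi, 500_000)   # one theta per sample function
    t, s = 0.2, 0.7                                  # two arbitrary time instants

    x_t = 10.0 * np.sin(2.0 * np.pi * t + theta)
    x_s = 10.0 * np.sin(2.0 * np.pi * s + theta)

    print(np.mean(x_t))                  # ensemble mean, approximately 0
    print(np.mean(x_t * x_s))            # approximately 50*cos(2*pi*(t - s))
    print(50.0 * np.cos(2.0 * np.pi * (t - s)))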
Example 3.3.3. Let x(t) = r sin(2πt + θ) where r and θ are two random
variables with joint density function
prθ(α, β) = (α/(2π)) e^{−α²/2}  ; 0 ≤ β < 2π and 0 ≤ α
          = 0                   ; elsewhere
1. Mean function:
mx(t) = E[x(t)]
      = ∫_{0}^{∞} ∫_{0}^{2π} α sin(2πt + β) prθ(α, β) dβ dα
      = 0
Figure 3.9: Autocorrelation function for the random process of example 3.3.2
2. Autocorrelation function:
Rx(t, s) = E[x(t)x(s)]
         = ∫_{0}^{∞} ∫_{0}^{2π} α² sin(2πt + β) sin(2πs + β) prθ(α, β) dβ dα
         = cos(2π(t − s))
hence independent of the time origin.5 Again we see that Rx(t, s) = Rx(t − s)
is periodic (with respect to t − s) and has the same period as x(t).
The process is wide sense stationary; it is also stationary in the strict sense
by theorem 39.
Example 3.3.4. Consider a stationary process x(t) with mean function mx = 2
[volts], and autocorrelation function Rx (τ ) = 24 sinc(500 Hz τ ) + 4 [volts2 ].
1. Use Chebyshev’s inequality to estimate the range of x(1 ms) with a prob-
ability of 95%.
2. Let y = x(1 ms) + 2x(2 ms) − 3x(5 ms) + 5 volts. Calculate ȳ, V ar(y).
Solution:
1. We need to find a, b such that P (a < x(1 ms) < b) ≥ 95% = 0.95, or in
other words:
P (x(1 ms) ≤ a or x(1 ms) ≥ b) ≤ 0.05
5 The above is easily verified with MAPLE : use two nested integrations and the function
It is given that
    E[x(1 ms)] = mx = 2 volts
2. Define the random vector x , [x(1 ms), x(2 ms), x(5 ms)]. We first find
its mean vector
mx = E[(x(1 ms), x(2 ms), x(5 ms))] = [2, 2, 2]
and its covariance matrix
Λx = [ Lx(0)     Lx(1 ms)  Lx(4 ms) ]
     [ Lx(1 ms)  Lx(0)     Lx(3 ms) ]
     [ Lx(4 ms)  Lx(3 ms)  Lx(0)    ]
   = [ 24      15.279   0      ]
     [ 15.279  24       −5.093 ]
     [ 0       −5.093   24     ]
Using lemma 32 and noticing that:
    y = x × [1, 2, −3]ᵀ + 5
we easily obtain:
    ȳ = [2, 2, 2] × [1, 2, −3]ᵀ + 5 = 5
    σy² = [1, 2, −3] Λx [1, 2, −3]ᵀ = 458.23
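The same numbers are easy to reproduce numerically from mx and Λx. The small numpy sketch below only evaluates the two matrix products; lemma 32 itself is not reproduced here.

    import numpy as np

    m_x = np.array([2.0, 2.0, 2.0])                      # mean vector of x
    L_x = np.array([[24.0, 15.279, 0.0],                 # covariance matrix of x
                    [15.279, 24.0, -5.093],
                    [0.0, -5.093, 24.0]])
    a = np.array([1.0, 2.0, -3.0])                       # y = x.a + 5

    y_mean = m_x @ a + 5.0                               # = 5
    y_var = a @ L_x @ a                                  # approximately 458.23
    print(y_mean, y_var)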
x(t) = Σ_{i=−∞}^{∞} ai g(t − iT + τ)
where the ai ’s and τ are statistically independent random variables such that
1. Mean function:
mx(t) = E[x(t)]
      = E[ Σ_{i=−∞}^{∞} ai g(t − iT + τ) ]
      = Σ_{i=−∞}^{∞} E[ai g(t − iT + τ)]
      = Σ_{i=−∞}^{∞} E[ai] E[g(t − iT + τ)]
      = 0        (since E[ai] = 0)
independent of time.
2. Autocorrelation function:
Rx(t, s) = E[x(t)x(s)]
         = E[( Σ_{i=−∞}^{∞} ai g(t − iT + τ) )( Σ_{j=−∞}^{∞} aj g(s − jT + τ) )]
         = E[ Σ_{i=−∞}^{∞} Σ_{j=−∞}^{∞} ai aj g(t − iT + τ) g(s − jT + τ) ]
         = Σ_{i=−∞}^{∞} Σ_{j=−∞}^{∞} E[ai aj] E[g(t − iT + τ) g(s − jT + τ)]
         = Σ_{i=−∞}^{∞} E[ai²] E[g(t − iT + τ) g(s − iT + τ)]      (E[ai aj] = 0 for i ≠ j)
         = Σ_{i=−∞}^{∞} E[g(t − iT + τ) g(s − iT + τ)]             (E[ai²] = 1)
= 0.375
2. Rx (τ ) = Rx (−τ ), ∀τ ∈ R.
Sketch of proof.
1. It has been shown that |Cov(x1 , x2 )| ≤ σx1 σx2 for any two (finite variance)
random variables x1 , x2 .7 Consequently, if x(t) is wide sense stationary
6 The same name autocorrelation function is used with both deterministic signals and ran-
dom signals because both functions relate to similar physical quantities when the random
signals are wide-sense stationary (see definition 3.3.3 below and the Wiener-Khinchine theo-
rem in section 3.4 below).
7 Can also be proved by expanding 0 ≤ E[(x(t) ± x(t + τ ))2 ].
Figure 3.10: Binary random process with square pulses: (a) square pulse, (b)–(e) typical sample functions, (f) autocorrelation function.
Figure 3.11: Binary random process with raised-sine pulses, T = 1 ms: (a) raised-sine pulse, (b)–(e) typical sample functions, (f) autocorrelation function.
then
    |Lx(s − t)| ≤ √( Var(x(t)) Var(x(s)) ) = √( Lx(0) Lx(0) ) = Lx(0) = Var(x(t))
We also have |Lx(s − t)| + mx² ≥ |Lx(s − t) + mx²| = |Rx(s − t)|. Therefore
|Rx(τ)| ≤ Lx(0) + mx² = Rx(0), i.e. |Rx(τ)| ≤ |Rx(0)| for any τ = s − t.
2. Rx(−τ) = E[x(t)x(t − τ)] = E[x(t − τ)x((t − τ) + τ)] = Rx(τ) since x(t) is wide sense stationary.
3. Let T be the period of x(t). Then x(t) = x(t + T ) for every t. It follows
that:
Definition 3.3.3. Let x(t) be a wide-sense stationary process with mean func-
tion mx and autocorrelation function Rx (τ ) such that Rx (0) is finite.
Discussion: Consider the set of all the sample functions x(ω, t), ∀ω ∈ Ω. In
the applications to communication, x(ω, t) represents an electric signal (voltage
or current), for which
DC value ≡ lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(ω, t) dt = ⟨x(ω, t)⟩
Total power ≡ lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(ω, t)² dt = ⟨x(ω, t)²⟩
AC power ≡ lim_{T→∞} (1/(2T)) ∫_{−T}^{T} (x(ω, t) − ⟨x(ω, t)⟩)² dt = ⟨(x(ω, t) − ⟨x(ω, t)⟩)²⟩
The sample functions x(ω, t) do not all have the same DC value, total power or
AC power. It can however be shown that whenever Rx(0) is finite the
integrations and expectations can be interchanged. We then obtain:
average DC value ≡ E[⟨x(ω, t)⟩] = E( lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(ω, t) dt )
                 = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} E(x(ω, t)) dt
                 = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} mx dt
                 = mx lim_{T→∞} (1/(2T)) ∫_{−T}^{T} dt
                 = mx

average total power ≡ E[⟨x(ω, t)²⟩] = E( lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(ω, t)² dt )
                    = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} E(x(ω, t)²) dt
                    = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} Rx(0) dt
                    = Rx(0) lim_{T→∞} (1/(2T)) ∫_{−T}^{T} dt
                    = Rx(0)
since E(x(ω, t)2 ) = Rx (0) whenever the process is wide-sense stationary. In the
same manner we can show that
Remarks.
2. The sample function x(ω, t) can be used to define a random process in the
variable f ∈ R as follows: consider the sample functions:
    X(ω, f) = lim_{T→∞} (1/(2T)) | ∫_{−T}^{T} x(ω, t) e^{−j2πft} dt |² .
It follows that:
E( X(ω, f) ) = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} E( x(t)x(s) ) e^{−j2πf(t−s)} dt ds
             = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} Rx(t − s) e^{−j2πf(t−s)} dt ds
Therefore
F⁻¹( E(X(f)) ) = ∫_{−∞}^{∞} ( lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} Rx(t − s) e^{−j2πf(t−s)} dt ds ) e^{j2πfτ} df
               = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} ∫_{−∞}^{∞} Rx(t − s) e^{−j2πf(t−s)} e^{j2πfτ} df dt ds
               = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} Rx(t − s) ∫_{−∞}^{∞} e^{j2πf(s−t+τ)} df dt ds        (inner integral = δ(s − t + τ))
               = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} ∫_{−∞}^{∞} Rx(t − s) δ(s − t + τ) dt ds               (inner integral = Rx((s + τ) − s) = Rx(τ))
               = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} Rx(τ) ds
               = Rx(τ)
where G(f ) = F (g(t)) denotes the Fourier transform of the pulse g(t). In the
following we use:
sinc(x) ≜ sin(πx)/(πx)
rect(x) ≜ 1 for −1/2 < x < 1/2 and 0 elsewhere
∆(x) ≜ 1 − 2|x| for −1/2 < x < 1/2 and 0 elsewhere
It follows that
|G(f )| = (1 ms) |sinc(f /1000)|
for the rectangular pulse.8
• For the triangular pulse we use the Fourier transform pair
we obtain
(1/2) rect(t/(1 ms)) cos(2πt/(1 ms)) ←→ (0.25 ms) sinc((1.0 ms)(f − (1 kHz))) + (0.25 ms) sinc((1.0 ms)(f + (1 kHz)))
                                       = (0.25 ms) sinc((f − 1000)/1000) + (0.25 ms) sinc((f + 1000)/1000)
8 recall that the pulses in example 3.3.5 go from 0 to T = 1 ms, whereas the above goes
from -0.5 ms to +0.5 ms. The time delay affects the phase of the Fourier transform only; it
has no effect on the magnitude, which is all that is required here.
It follows that
|G(f )| =(0.5 ms)sinc(f /1000)+
(0.25 ms)sinc((f − 1000)/1000)+
(0.25 ms)sinc((f + 1000)/1000)
Figure 3.12 shows sketches of the Power spectral densities of the binary ran-
dom processes of example 3.3.5 for the cases of rectangular pulse (green curve),
triangular pulse (blue curve) and raised cosine pulse (red curve). Experimental
Figure 3.12:
measurements were taken which confirm the above results; refer to figure 3.13.
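The magnitudes |G(f)| used above are easy to evaluate numerically, which is a convenient way to reproduce the qualitative comparison of figure 3.12. The Python sketch below is an illustration only: the frequency grid is arbitrary and the power spectral density of the binary process is taken proportional to |G(f)|²/T, as suggested by the discussion of example 3.3.5.

    import numpy as np

    T = 1e-3                                   # pulse duration, 1 ms
    f = np.linspace(-5e3, 5e3, 2001)           # frequency grid [Hz]

    # rectangular pulse of duration T (magnitude of its Fourier transform)
    G_rect = np.abs(T * np.sinc(f * T))        # np.sinc(x) = sin(pi x)/(pi x)

    # raised-cosine pulse 0.5*(1 + cos(2*pi*t/T)) on [0, T]
    G_rc = np.abs(0.5 * T * np.sinc(f * T)
                  + 0.25 * T * np.sinc((f - 1.0 / T) * T)
                  + 0.25 * T * np.sinc((f + 1.0 / T) * T))

    S_rect = G_rect**2 / T                     # PSD up to the model's constants
    S_rc = G_rc**2 / T
    print(S_rect[1000], S_rc[1000])            # values at f = 0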
Theorem 41. (W&J pp. 179 - 180) Let z(t) be a random process with mean
function mz (t), autocorrelation function Rz (t, s) which is finite for all t, s ∈ R.
Define the random process:
y(t) = ∫_{−∞}^{∞} z(α) h(t − α) dα = z(t) ∗ h(t),
Figure 3.13:
my(t) = ∫_{−∞}^{∞} mz(α) h(t − α) dα = mz(t) ∗ h(t)
Ry(t, s) = ∫∫_{−∞}^{∞} Rz(α, β) h(t − α) h(s − β) dα dβ
Ly(t, s) = ∫∫_{−∞}^{∞} Lz(α, β) h(t − α) h(s − β) dα dβ
Sketch of proof: Assuming that expectations and integrals are freely interchangeable we have:
my(t) = E(y(t))
      = E( ∫_{−∞}^{∞} z(α) h(t − α) dα )
      = ∫_{−∞}^{∞} E(z(α)) h(t − α) dα
      = ∫_{−∞}^{∞} mz(α) h(t − α) dα
      = mz(t) ∗ h(t)
Ry(t, s) = E(y(t)y(s))
         = E( ∫_{−∞}^{∞} z(α) h(t − α) dα  ∫_{−∞}^{∞} z(β) h(s − β) dβ )
         = E( ∫∫_{−∞}^{∞} z(α) z(β) h(t − α) h(s − β) dα dβ )
         = ∫∫_{−∞}^{∞} E( z(α)z(β) ) h(t − α) h(s − β) dα dβ
         = ∫∫_{−∞}^{∞} Rz(α, β) h(t − α) h(s − β) dα dβ
Theorem 42 (W&J, pp. 181 - 182). Let z(t) be a wide-sense stationary random
process with mean mz , autocorrelation function Rz (τ ) and power spectral density
Sz (f ). Define the random process
y(t) = ∫_{−∞}^{∞} z(α) h(t − α) dα = z(t) ∗ h(t)
Then y(t) is a wide-sense stationary random process and if Rz (0) is finite then:
my(t) = my = mz ∫_{−∞}^{∞} h(t) dt
Ry(τ) = ∫∫_{−∞}^{∞} Rz(τ + α − β) h(α) h(β) dα dβ
Ly(τ) = ∫∫_{−∞}^{∞} Lz(τ + α − β) h(α) h(β) dα dβ
Sy(f) = Sz(f) |H(f)|²
where H(f) = ∫_{−∞}^{∞} h(t) e^{−j2πft} dt.
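The relation Sy(f) = Sz(f)|H(f)|² can be illustrated numerically by filtering a long realization of (approximately) white noise and comparing the periodogram of the output with Sz(f)|H(f)|². The Python sketch below uses an assumed two-tap FIR filter and unit-variance white samples; it is an illustration under those assumptions, not a proof.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2**16
    z = rng.standard_normal(n)                 # approximately white, Sz(f) = 1 per sample

    h = np.array([1.0, 0.5])                   # assumed impulse response
    y = np.convolve(z, h, mode="same")         # y = z * h

    H = np.fft.rfft(h, n)                      # filter frequency response
    S_y_est = np.abs(np.fft.rfft(y))**2 / n    # crude periodogram estimate of Sy(f)
    S_y_theory = 1.0 * np.abs(H)**2            # Sz(f)|H(f)|^2 with Sz = 1

    # the raw periodogram is noisy, but its average agrees with the theory
    print(np.mean(S_y_est), np.mean(S_y_theory))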
1. Sx (f ) ∈ R, ∀f ,
2. Sx (f ) = Sx (−f ), ∀f ,
3. Sx (f ) ≥ 0, ∀f .
4. Rx(0) = ∫_{−∞}^{∞} Sx(f) df.
Proof.
Ry(0) = ∫_{−∞}^{∞} Sy(f) df
      = ∫_{−∞}^{∞} Sx(f) |H(f)|² df
      ≈ Sx(−f0) ∆f + Sx(f0) ∆f
      = 2 Sx(f0) ∆f
It follows that:
    Sx(f0) ≈ Ry(0) / (2∆f) ≥ 0, ∀f0.
Figure 3.14:
The above proof of property (3) illustrates the physical meaning of the power
spectral density Sx (f ):
3.5 Ergodicity
Up to now, the moments of random processes have been calculated with expres-
sions such as:
E[x(t)] = ∫_{−∞}^{∞} α px(t)(α) dα
Var(x(t)) = E[(x(t) − E[x(t)])²] = ∫_{−∞}^{∞} (α − E[x(t)])² px(t)(α) dα
Rx(s, t) = ∫∫_{−∞}^{∞} αβ px(t)x(s)(α, β) dα dβ
    ⋮
Operations of this form are called ensemble averages: the expectation is calcu-
lated through all the sample functions at fixed time value(s). This is the true
expectation on a random process. It may however be impractical for experi-
mental measurements since it requires a statistically large number of sample functions.
Example 3.5.1. The random process of example 3.1.1 is not ergodic (obviously,
since the process is not stationary; this example illustrates this by calculating
the time averages). We recall that:
x : Ω × R → R
x : (Head, t) ↦ 2
x : (Tail, t) ↦ sin t
All of mx (t), ⟨x(Head, t)⟩, ⟨x(Tail, t)⟩ (and a whole lot more) would have to be
equal for the process to be ergodic.
Example 3.5.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable
uniformly distributed between 0 and 2π. It was found in example 3.3.2 that
mx(t) = E[x(t)] = 0, ∀t. We also have:
    ⟨x(ω, t)⟩ = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} 10 sin(2πt + θ(ω)) dt = 0
for any value of θ(ω). It was also found that Rx(τ) = 50 cos(2πτ). We now
find:
⟨x(ω, t) x(ω, t + τ)⟩ = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} 100 sin(2πt + θ(ω)) sin(2π(t + τ) + θ(ω)) dt
                      = 50 cos(2πτ)
for any value of θ(ω). The above indicates (but does not guarantee) that the
process may be ergodic.
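The time averages quoted above are easy to reproduce numerically for one arbitrary sample function (one fixed value of θ(ω)). The sketch below approximates ⟨x(ω, t)x(ω, t + τ)⟩ by a long finite-time average; the phase, lag, averaging window and step are all assumptions made for the illustration.

    import numpy as np

    theta = 1.234                              # one fixed sample-function phase theta(omega)
    tau = 0.15
    dt = 1e-3
    t = np.arange(0.0, 2000.0, dt)             # long (finite) averaging window

    x_t = 10.0 * np.sin(2.0 * np.pi * t + theta)
    x_t_tau = 10.0 * np.sin(2.0 * np.pi * (t + tau) + theta)

    print(np.mean(x_t))                        # time-average DC value, approximately 0
    print(np.mean(x_t * x_t_tau))              # approximately 50*cos(2*pi*tau)
    print(50.0 * np.cos(2.0 * np.pi * tau))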
In the following section we investigate Gaussian random processes, some of
which are ergodic.
Notice that x(t) is not (wide sense) stationary (nor ergodic). We let x =
(x1 , x2 ) = (x(1), x(3/2)) and find:
mx = (0, −1)
Λx = [ 1         e^{−1/2} ]
     [ e^{−1/2}  1        ]
|Λx| = 1 − e^{−1} = (e − 1)/e
Λx⁻¹ = (e/(e − 1)) [ 1          −e^{−1/2} ]  =  [ e/(e − 1)        e^{1/2}/(1 − e) ]
                   [ −e^{−1/2}  1         ]     [ e^{1/2}/(1 − e)  e/(e − 1)       ]
Similarly any joint probability density function px(t) (α) can be calculated for
any k ∈ N∗ and any t = (t1 , t2 , . . . , tk ) ∈ Rk .
Remarks.
1. Λx(t) is symmetrical as required since Rx (t, s) = E[x(t)x(s)] =
E[x(s)x(t)] = Rx (s, t).
10 W&J top of page 72.
11 We recall that Lx (t, s) = Rx (t, s) − mx (t)mx (s).
2. Random processes which are not Gaussian are in general not specified by
mx (t) and Rx (t, s) or Lx (t, s) alone; other moment functions are required.
Indeed W&J shows two random processes y(t) and z(t) having the same
mean and autocorrelation functions (⇒ same covariance function) at the
top of page 175, but having different probability density functions. This
means that in order to specify these non-Gaussian random processes, more
information is required than that carried by the mean and autocorrelation
functions.
3. (W&J, pp. 175 - 177) A Gaussian random process x(t) is (strict sense)
stationary if and only if it is wide sense stationary. This follows from
the previous remark that any joint probability density function px(t) (α)
is completely determined by the functions mx (t) and Rx (t, s): if x(t) is
wide sense stationary, the functions mx (t) and Rx (t, s) are independent of
the time origin and any px(t) (α) is also consequently independent of the
time origin. Wide sense stationariness of non-Gaussian random processes
does not guarantee their strict sense stationariness. For example, W&J
shows at the top of page 177 a (non-Gaussian) random process which is
wide sense stationary, but not strict sense stationary.
Theorem 44. Let x(t) be a stationary Gaussian random process with autocor-
relation function Rx (τ ). Then:
∫_{−∞}^{∞} |Rx(τ)| dτ < ∞  ⇒  x(t) is ergodic
Theorem 45. (W&J, p. 179) Let x(t), y(t) be random processes such that:
Sketch of proof. Suppose h(t) = 0, ∀|t| > T , for some 0 < T ∈ R. Then
y(t) = ∫_{−T}^{T} h(α) x(t − α) dα
     ≈ Σ_{i=1}^{k} h(αi) x(t − αi) ∆αi
where [αi, αi + ∆i] ∩ [αj, αj + ∆j] = ∅, ∀i ≠ j, and ∪_{i=1}^{k} [αi, αi + ∆i] = [−T, T].
The random variables y(t1), y(t2), . . . , y(tN) can therefore be written in matrix
notation as follows:
    [y(t1), y(t2), . . . , y(tN)]ᵀ = [ h 0 · · · 0 ; 0 h · · · 0 ; · · · ; 0 0 · · · h ]
                                    × [x(t1 − α1), . . . , x(t1 − αk), x(t2 − α1), . . . , x(t2 − αk), . . . , x(tN − α1), . . . , x(tN − αk)]ᵀ
where h = [h(α1 ) ∆α1 , h(α2 ) ∆α2 , . . . , h(αk ) ∆αk ]. It follows from property
4, W&J p. 166 that the random vector [y(t1 ), y(t2 ), . . . , y(tN )] is Gaussian. As
stated in W&J, “the formal proof is mathematically involved” and deals with
infinite weighted sums of Gaussian random variables.
In general, the probability density functions of the output random process
are not of the same nature as those of the input random process, but they are
in the Gaussian case. Combining this result with those of theorems 41 and
42, we see that if x(t) is Gaussian and specified then y(t) is also Gaussian and
specified.
Figure 3.15:
generalized to the present situation to show that any random vector formed by
samples taken from y(t) and z(t) is Gaussian. Thus y(t) and z(t) are jointly
Gaussian random processes.
We recall that the random processes y(t) and z(t) are specified by their
respective mean and autocorrelation functions since they are Gaussian. But
the joint probability density function of the random vector denoted as w in
definition 3.6.2 cannot be written without the knowledge of E[y(tj )z(si )] or
Cov(y(tj ), z(si )) = E[y(tj )z(si )] − my (tj )mz (si ). This is contained in the cross-
correlation function defined next.
Definition 3.6.3.
1. The function Ryz(t, s) ≜ E[y(t)z(s)] is called the cross-correlation function of y(t) and z(t).
2. The function Lyz(t, s) ≜ Cov(y(t), z(s)) is called the cross-covariance function of y(t) and z(t).
One easily shows that Lyz(t, s) = Ryz(t, s) − my(t)mz(s). The
joint probability density function of any random vector such as w in
definition 3.6.2 can be written from the knowledge of the functions
my(t), mz(t), Ry(t, s), Rz(t, s), Ryz(t, s). In such cases y(t) and z(t) are
said to be jointly specified, meaning that any joint probability density function of
the form py(t),z(s)(α, β) can be calculated. If the mean and correlation functions
are independent of the time origin, y(t) and z(t) are said to be jointly stationary
(jointly wide sense stationary if the processes are not jointly Gaussian)
and as usual one argument of the functions can be dropped; we write:
my, mz, Ry(τ), Rz(τ), Ryz(τ) ≜ Ryz(t, t + τ), in which τ = s − t.
Remark. In the case of the autocorrelation, we have Ry (τ ) = Ry (t, t + τ ) =
Ry (t + τ, t) since the autocorrelation function is even as shown in theorem 40.
2. If x(t) is stationary then y(t) and z(t) are jointly stationary and:
Ryz(τ) = ∫∫_{−∞}^{∞} Rx(τ + β − α) hy(α) hz(β) dα dβ
Lyz(τ) = ∫∫_{−∞}^{∞} Lx(τ + β − α) hy(α) hz(β) dα dβ
Proof.
1. Similar to what has been done in the sketch of proof of theorem 41 by
interchanging the order of expectation and integration.
2. The cross-correlation formula follows from the general formula by replac-
ing Rx (t − α, s − β) with Rx (s − t + α − β) and identifying τ = s − t.
The cross-power spectra formula directly follows from the properties of
the Fourier transform applied to both sides of the formula for the cross-
correlation.
2. (W&J, eq. 3.134a) Sn(f) = (N0/2) |H(f)|²
12 W&J, p. 190.
Example 3.6.3. (W&J, pp. 190 - 192) We consider the filtering of a white
Gaussian noise w(t) with an ideal low-pass filter of transfer function W (f ) given
by:
    W(f) = 1  ; |f| < W
         = 0  ; elsewhere
Denoting by n(t) the output of the filter, we have:
5. The total average power of n(t) is W N0 . Since n(t) is ergodic the total
average power of any sample function of n(t) is also W N0 .
Figure 3.16:
mn = E[n] = (E[n1], E[n2], . . . , E[nk]) = (mn, mn, . . . , mn) = 0
Λn = [ Cov(ni, nj) ]_{i,j = 1,...,k}
   = [ Cov( n(i/(2W) + T), n(j/(2W) + T) ) ]_{i,j = 1,...,k}
   = [ Ln(0)           Ln(1/(2W))      Ln(1/W)        · · ·  Ln((k−1)/(2W)) ]
     [ Ln(1/(2W))      Ln(0)           Ln(1/(2W))     · · ·  Ln((k−2)/(2W)) ]
     [ Ln(1/W)         Ln(1/(2W))      Ln(0)          · · ·  Ln((k−3)/(2W)) ]
     [   · · ·            · · ·            · · ·       · · ·      · · ·      ]
     [ Ln((k−1)/(2W))  Ln((k−2)/(2W))  Ln((k−3)/(2W)) · · ·  Ln(0)          ]
   = diag[ N0W, N0W, . . . , N0W ]
pn(α) = Π_{i=1}^{k} pni(αi)
      = (1/(2πN0W)^{k/2}) exp( −(1/(2N0W)) Σ_{i=1}^{k} αi² ).
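The fact that samples of n(t) taken 1/(2W) apart are uncorrelated (so that Λn is diagonal with entries N0W) can be checked with a simple discrete-time approximation: generate white Gaussian noise, apply an (approximately) ideal low-pass filter of bandwidth W, sample at rate 2W and estimate the variance and correlation of the samples. A rough Python sketch with assumed values N0 = 2 and W = 100 Hz:

    import numpy as np

    rng = np.random.default_rng(3)
    fs = 10_000.0                              # simulation rate [Hz]
    W = 100.0                                  # filter bandwidth [Hz]
    N0 = 2.0                                   # two-sided PSD N0/2 = 1 [V^2/Hz]
    n_samp = 2**20

    # white noise with PSD N0/2 in discrete time: variance = (N0/2)*fs
    w = rng.standard_normal(n_samp) * np.sqrt(N0 / 2.0 * fs)

    # ideal low-pass filter |f| < W applied in the frequency domain
    Wf = np.fft.rfft(w)
    f = np.fft.rfftfreq(n_samp, d=1.0 / fs)
    Wf[f > W] = 0.0
    n = np.fft.irfft(Wf, n_samp)

    step = int(fs / (2.0 * W))                 # samples spaced 1/(2W) apart
    nk = n[::step]
    print(np.var(nk), N0 * W)                  # variance approximately N0*W = 200
    print(np.corrcoef(nk[:-1], nk[1:])[0, 1])  # approximately 0 (uncorrelated)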
3.7 Problems
1. Consider two zero-mean jointly Gaussian noise processes, n1 (t) and n2 (t)
such that:
Ri(τ) ≜ E[ni(t) ni(t − τ)] = sin(πτ)/(πτ) ;  i = 1, 2,
R12(τ) ≜ E[n1(t) n2(t − τ)] = sin(πτ)/(2πτ).
Remark: lim_{τ→0} sin(πτ)/(πτ) = 1.
2. Let x(t), y(t) denote two 0-mean jointly Gaussian jointly stationary pro-
cesses such that:
y(t) = x(t) ∗ h(t)
(a) (b)
Figure 3.17:
Figure 3.18:
i.e. ω can take the values 1 or 2 with equal probabilities and θ is uniformly
distributed between 0 and 2π. Calculate the mean function mx (t) and the
autocorrelation Rx(t1, t2) of x(t). Based on your answer, is the process
wide-sense stationary?
mz = 0
Rz(τ) = (AW/2) sinc²(Wτ/2)
Sz(f) = A + 2Af/W  ; −W/2 ≤ f ≤ 0
      = A − 2Af/W  ; 0 ≤ f ≤ W/2
      = 0          ; elsewhere
(a) Express or sketch on a labelled graph the mean function my and the
power spectral density Sy (f ) of the wide-sense stationary random
process y(t) as functions of A and W .
Hint: Do not use the expression for Rz (τ ) and do not use any table
of Fourier transform pairs.
(b) Express the total average power of y(t) as a function of A and W .
Figure 3.19:
Figure 3.20:
Optimum Receiver
Principles
Figure 4.1:
Definition 4.2.1.
s : {mi}_{i=0}^{M−1} → R^N
s : mi ↦ si = (si1, si2, . . . , siN)
The output of the transmitter is thus a random vector s for which the sample
set consists of the M sample vectors s0, s1, . . . , sM−1, and the probabilities
correspond to the a priori message probabilities.
Given a received vector r = ρ, a receiver m̂( ) selects the message m̂(ρ)
and makes a correct decision if and only if the message transmitted is m̂(ρ). In
terms of the a posteriori probabilities we have:
    P(C|r = ρ) = P(m̂(ρ)|r = ρ)
1. An optimum receiver m̂_opt( ) is such that:
    P(E)|m̂_opt( ) = min over all m̂( ) of P(E)|m̂( )
2. A maximum a posteriori receiver m̂_MAP( ) is such that:
    m̂_MAP(ρ) = mk ⇒ P(mk|r = ρ) ≥ P(mj|r = ρ), ∀j = 0, 1, . . . , M − 1.
3. A maximum likelihood receiver m̂_ML( ) is such that:
    m̂_ML(ρ) = mk ⇒ pr(ρ|s = sk) ≥ pr(ρ|s = sj), ∀j = 0, 1, . . . , M − 1.
4. A minimax receiver m̂_MMX( ) (introduced in problem 2.12) is such that:
    max_{P(mi)} ( P(E)|m̂_MMX( ) ) ≤ max_{P(mi)} ( P(E)|m̂( ) ), ∀ m̂( ).
Theorem 50.
1. The maximum a posteriori receiver is optimal.
∀j = 0, 1, . . . , M − 1.
4. If the signal set is completely symmetric (W&J, page 264 and §4.5.3 of
these notes) then the maximum likelihood receiver is a congruent decision
region receiver and it is also minimax.
Proof. (W&J, page 213 - 214) We prove the first three statements of the theorem
only; the fourth statement is proved in §4.5.3.
m̂(ρ) = mk ⇔ P(mk|r = ρ) ≥ P(mj|r = ρ),
1. In case of a tie, i.e. many messages yield the same value of P (mk )pr (ρ|s =
sk ), any one of them can be selected as the MAP estimate; the particular
choice will not affect the resulting probability of error.
2. In case of a tie with the ML decision rule, the receiver also selects arbi-
trarily any of the most likely messages. The particular choice will affect
the resulting probability of error unless the messages are all equiprobable.
the latter being equation (4.17c) in W&J. Substituting equation (4.3) into (4.2)
results into equation (4.18) in W&J, and simplifies to W&J equation (4.19); the
MAP receiver finds the message mi (equivalently si ) that minimizes
|ρ − si |2 − 2σ 2 ln P (mi ) .
m̂_ML(ρ) = mk ⇔ |ρ − sk|² ≤ |ρ − sj|², ∀j.
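Both decision rules translate directly into code once the si, the P(mi) and σ² are known. The Python sketch below is only an illustration of the two rules with made-up signal vectors and a priori probabilities; it is not a reproduction of any W&J example.

    import numpy as np

    s = np.array([[1.0, 1.0],                   # sample vectors s0 ... s3 (assumed)
                  [1.0, -1.0],
                  [-1.0, 1.0],
                  [-1.0, -1.0]])
    P = np.array([0.4, 0.3, 0.2, 0.1])          # a priori message probabilities (assumed)
    sigma2 = 0.5                                # variance of each noise component

    def map_decision(rho):
        # MAP: minimize |rho - s_i|^2 - 2*sigma^2*ln P(m_i)
        metric = np.sum((rho - s)**2, axis=1) - 2.0 * sigma2 * np.log(P)
        return int(np.argmin(metric))

    def ml_decision(rho):
        # ML: minimize |rho - s_i|^2 (nearest signal vector)
        return int(np.argmin(np.sum((rho - s)**2, axis=1)))

    rho = np.array([0.2, 0.1])
    print(map_decision(rho), ml_decision(rho))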
The decision regions may therefore be calculated once and for all from the:
• a priori message probabilities,
• sample vectors of the transmitted random vector,
• variance of the components of n.
The probability of correct decision is derived from the decision regions, and
using equation (4.3) it yields (equations (4.20b), (4.20c) in W&J):
P(C|mi) = ∫· · ·∫_{Ii} pr(ρ|s = si) dρ
        = ∫· · ·∫_{Ii} pn(ρ − si) dρ
        = ∫· · ·∫_{Ii} (1/(2πσ²)^{N/2}) e^{−|ρ − si|²/(2σ²)} dρ
P(E) = 1 − P(C) = 1 − Σ_{i=0}^{M−1} P(mi) P(C|mi)
An equivalent form of the above that will be more useful in §4.5.1 is obtained
below using the change of variable β = ρ − si :
P(C|mi) = ∫· · ·∫_{Ii} pn(ρ − si) dρ
        = ∫· · ·∫_{Ii − si} pn(β) dβ          (4.4)
Remark. The channels need not be physically different nor use different media.
For example, the transmitted vector can be transmitted over different
time intervals with the same channel (time diversity), or using different carrier
frequencies (frequency diversity), or many receiving antennas may be used with
the same transmitter and medium (space diversity).
The analysis of multivector channels is a trivial generalization, since multi-
vectors can be merged together and form super-vectors. A multivector channel
is described mathematically by the M following conditional probability density
functions (equations (4.21), (4.22) in W&J):
Figure 4.2:
encountered examples). The basis functions ϕj (t) and the vectors si may on
the other hand be calculated from the sample functions by the Gram-Schmidt
orthonormalization procedure. The mapping between the mi ’s and the si ’s is
arbitrary.
Definition 4.3.1. (W&J, page 225) The N -dimensional geometric space in
which the M vectors s0 , s1 , . . . , sM −1 are visualized as points (see for example
figures 4.14, 4.A.3, 4.A.4, 4.A.5 in W&J) is called signal space.
si(t) = Σ_{j=1}^{N} si,j φj(t).
In the presence of noise, the previous decomposition does not recover the
transmitted vector since the transmission of s(t) implies the reception of r(t) =
s(t) + nw (t) ̸= s(t) and (equations (4.41a), (4.41b) in W&J):
∫_{−∞}^{∞} r(t) φ1(t) dt = r1 = s1 + n1
∫_{−∞}^{∞} r(t) φ2(t) dt = r2 = s2 + n2
        ⋮                                             (4.5)
∫_{−∞}^{∞} r(t) φN(t) dt = rN = sN + nN
In the above, ni = ∫_{−∞}^{∞} nw(t) φi(t) dt, i = 1, 2, . . . , N (equation (4.43b) in
W&J), and we define r1 = (r1, r2, . . . , rN), n = (n1, n2, . . . , nN), so that
r1 = s + n (equation (4.42) in W&J). The receiver estimates the transmitted
message by maximizing the a posteriori probability from the knowledge of r 1 .
This will easily be done since r 1 = s + n and
• n is a Gaussian random vector with mean vector and covariance matrix
respectively given by (refer to the proof of theorem 64):
mn = 0
Λn = (N0/2) IN
r(t) − Σ_{j=1}^{N} rj φj(t) = r2(t) ≠ 0.
j=1
Figure 4.3:
The value r 2 = ρ2 can be disregarded when making the decision if and only if
pr2 (ρ2 |r 1 = ρ1 , s = si ), the only factor involving r 2 , takes the same value for
all si :
pr2 (ρ2 |r 1 = ρ1 , s = si ) = pr2 (ρ2 |r 1 = ρ1 )
∀i, ∀ρ1, ∀ρ2. In concrete words, this says that r2 is statistically independent of
s when conditioned on r1.
Remarks (W&J, page 220).
1. r 2 is irrelevant if and only if pr2 (ρ2 |r 1 = ρ1 , s = si ) is constant with
respect to i.
for every ρ1 , ρ2 , i.
3. Two possibly useful factorizations when applying the theorem of irrele-
vance:
p_{r2|r1,s} = p_{r2,r1|s} / p_{r1|s}
            = p_{r1|r2,s} p_{r2|s} / p_{r1|s}
Example 4.3.1 (W&J, pages 220, 221). Consider the diagram of figure 4.8 in
W&J. In this example we have:
r1 = s + n1
r2 = n2
This is in general not the same value for all i and r 2 is consequently not irrele-
vant.
Corollary (Theorem of reversibility). (W&J, page 222) The minimum attainable
probability of error is not affected by the introduction of a reversible operation
at the output of a channel (or front end of a receiver).
Figure 4.4:
r1(t) = Σ_{j=1}^{N} rj φj(t) = s(t) + n(t)
r2(t) = r(t) − r1(t)
where r(t) = s(t) + nw (t) is the input to the receiver (received waveform). The
above are described as follows:
• r(t) ≡ received waveform,
• r1(t) ≡ component of r(t) which is contained in the vector space spanned by {φj(t)}_{j=1}^{N},
We first notice that r2 (t) is the result of linear operations performed on nw (t)
(W&J, equation (4.45(b)), also refer to figure 4.4):
r2 (t) = r(t) − r1 (t)
= (s(t) + nw (t)) − (s(t) + n(t))
= nw (t) − n(t)
r2 (t) is consequently a Gaussian random process which is statistically indepen-
dent of s(t). By theorem 64 in Appendix B, r2 (t) and n(t) are statistically
independent 0-mean jointly Gaussian random processes. Let (W&J, equation
(4.46))
r 2 = (r2 (t1 ), . . . , r2 (tq ))
be the random vector obtained through sampling of r2 (t). r 2 is a Gaussian
random vector statistically independent of both s and n and the situation is
similar to that depicted in figure 4.8 in W&J; r 2 is therefore irrelevant for any
t1 , t2 , . . . , tq and for any q. It follows that r2 (t) is irrelevant and an optimal
MAP receiver may base its decision solely upon r 1 . The following summarizes
all of section 4.3 in W&J.
Theorem 51. Consider an additive white Gaussian noise waveform channel as
depicted in figure 4.1 in W&J with
• m ∈ {mi}_{i=0}^{M−1} “with a-priori probabilities” P(mi), i = 0, 1, . . . , M − 1,
• s(t) ∈ {sj(t)}_{j=0}^{M−1} “specified by” the a-priori probabilities,
for i = 0, 1, . . . , M − 1 and j = 1, 2, . . . , N .
mn = (0, 0, . . . , 0)
Λn = diag[N0 /2, N0 /2, . . . , N0 /2]
3. The waveform channel reduces to the additive Gaussian noise vector chan-
nel of figure 4.4 in W&J. The waveform transmitter may be broken down
into a vector transmitter followed by a modulator as depicted in figure 4.17
in W&J. The optimal waveform receiver may be broken down into a de-
tector followed by a MAP vector receiver as depicted in figures 4.16, 4.17
in W&J.
4. The decision regions of the MAP vector receiver are found using W&J
equation (4.19) in which σ 2 = N0 /2.
Remark. (W&J, pp 232 - 233) The performance P (E ) does not depend on the
choice of the orthonormal basis.
j = 1, 2, . . . , N .
2. it applies the MAP decision rule (W&J, equation (4.51)): m̂ = mk if
|r − si|² − N0 ln(P(mi)) is minimum for i = k.
4 We drop the subscript 1 from the relevant vector r1 since it is the only received vector;
r2 is irrelevant and will no longer be used or mentioned.
By changing the sign, dividing by 2 and noticing that the term |r|2 is the same
for all i we obtain the equivalent (W&J equations (4.53a), (4.53b)):
m̂ = mk if ⟨r, si⟩ + ci is maximum for i = k,          (4.6)
where
    ci = (1/2)( N0 ln(P(mi)) − |si|² ),                 (4.7)
for i = 0, 1, . . . , M − 1.
Remarks.
1. The above equivalent decision rule is not as easy to visualize as the decision
rule of equation (4.19) in W&J, but it leads to simpler implementations
since it does not require any squaring devices.
Definition 4.4.1.
2. (W&J page 235) Let φ(t) be identically zero outside some finite time
interval 0 ≤ t ≤ T. A linear time-invariant causal filter of impulse response
h(t) = φ(T − t) is said to be matched to φ(t).
As shown on page 235 in W&J, “if each member of the orthonormal basis
{ }N
ϕj (t) j=1 is identically zero outside some finite time interval, say 0 ≤ t ≤ T ,
the outputs of filters matched to the ϕj (t) are waveforms uj (t) that go through
the desired values rj when their input is r(t). Then the multiplications and in-
tegrations can easily be implemented by matched filters and sample-and-holds.”
Example 4.4.1. (pp. 236 - 237 in W&J): We next illustrate on a special case
that the values ∫_{−∞}^{∞} r(t) φj(t) dt = rj can be obtained by sampling the output of
matched filters when φj(t) = 0, ∀t ∉ [0, T]. Let
φ1(t) = √2000 cos(2π(3 kHz)t)    ; 0 ≤ t ≤ 1 ms
      = 0                        ; elsewhere
φ2(t) = √2000 cos(2π(4.5 kHz)t)  ; 0 ≤ t ≤ 1 ms
      = 0                        ; elsewhere
The functions ϕ1 (t) and ϕ2 (t) are plotted on figure 4.5 and one easily verifies
that ϕ1 (t) and ϕ2 (t) are orthonormal. Suppose the signal r(t) sketched on figure
Figure 4.5: Orthonormal functions φ1(t) and φ2(t) [V] versus t [s].
4.6 is received; this is a sample function of the random process described by:
where nw (t) is a 0-mean white (over the frequency range -100 kHz ≤ f ≤
100 kHz since the sampling frequency is 200 kHz) Gaussian noise with power
spectral density:
Sw(f) = N0/2 = 0.125 × 10−3  ; −100 kHz ≤ f ≤ 100 kHz
      = 0                    ; elsewhere
Figure 4.6: Received signal r(t) [V] versus t [s].
∫_{−∞}^{∞} r(t) φ1(t) dt = r1 = 0.215392
∫_{−∞}^{∞} r(t) φ2(t) dt = r2 = −0.000791
Figure 4.7:
h1 (t) = ϕ1 (1 ms − t)
h2 (t) = ϕ2 (1 ms − t)
at t = 1 ms. The outputs of the filters fed by r(t) are sketched on figure 4.8 and
they are seen to be respectively equal to 0.215392 and −0.000791 at t = 1 ms,
precisely the values of r1 and r2 indicated above.
Figure 4.8: Matched filter outputs [V] versus t [s].
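The equality between the correlations ∫ r(t)φj(t) dt and the matched-filter outputs sampled at t = 1 ms can be reproduced with a discrete-time approximation. The Python sketch below builds its own test signal; the signal amplitudes and noise level are assumptions, not the exact waveform of figure 4.6.

    import numpy as np

    fs = 200_000.0                              # sampling frequency [Hz]
    T = 1e-3                                    # basis functions live on [0, T]
    t = np.arange(0.0, T, 1.0 / fs)
    dt = 1.0 / fs

    phi1 = np.sqrt(2000.0) * np.cos(2.0 * np.pi * 3000.0 * t)
    phi2 = np.sqrt(2000.0) * np.cos(2.0 * np.pi * 4500.0 * t)

    rng = np.random.default_rng(4)
    r = 0.2 * phi1 + 0.05 * rng.standard_normal(t.size)   # assumed received signal

    # direct correlations r_j = integral of r(t) phi_j(t) dt
    r1 = np.sum(r * phi1) * dt
    r2 = np.sum(r * phi2) * dt

    # matched filters h_j(t) = phi_j(T - t); their outputs sampled at t = T equal r1, r2
    h1 = phi1[::-1]
    h2 = phi2[::-1]
    u1 = np.convolve(r, h1)[t.size - 1] * dt    # filter output at t = T
    u2 = np.convolve(r, h2)[t.size - 1] * dt

    print(r1, u1)                               # equal up to numerical precision
    print(r2, u2)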
Remark. The signal-to-noise ratio (definition 4.5.1 on page 169) for the case
illustrated above is:
SNR = (1/√20)² / N0 = (1/20) / (2 × 0.125 × 10−3) = 200 → 23 dB
The signal r(t) is extremely clean contrary to what one might think by look-
ing at figure 4.6. Refer to figure 4.9 for a more realistic illustration in which
SNR = 5 dB.
Figure 4.9: Received signal at SNR = 5 dB [V] versus t [s].
2. The terms ⟨r, si⟩ + ci of equation (4.53a) in W&J can directly be calculated
from the waveform r(t) without first projecting on the basis φj(t) since,
by theorem 63, ⟨r, si⟩ = ∫_{−∞}^{∞} r(t) si(t) dt. This leads to a correlation
receiver (not in W&J) which simplifies to the matched filter receiver of
figure 4.21 when all the waveforms {si(t)}_{i=0}^{M−1} are identically 0 outside a
finite interval 0 ≤ t ≤ T. As remarked on page 239 in W&J, this is not
necessarily a simplification over the matched filter receiver of figure 4.19
(in W&J).
The next theorem illustrates from another point of view the optimality of the
matched filter receiver. The following lemma is needed to prove the theorem.
Lemma 52 (Schwarz inequality). (W&J, page 240) For any pair of finite energy
waveforms a(t), b(t) we have (W&J, equation (4.64)):
( ∫_{−∞}^{∞} a(t) b(t) dt )² ≤ ( ∫_{−∞}^{∞} a²(t) dt )( ∫_{−∞}^{∞} b²(t) dt )
3. Parseval’s relationships.
The result follows immediately from the above, noticing that cos2 θ ≤ 1. In
addition we observe that equality is satisfied iff θ = 0 or θ = π, i.e. iff b(t) =
c a(t) for some non-zero constant c ∈ R.
We will then design the filter (i.e. find the impulse response h(t)) to maximize
the ratio r̄²/Var(r) so that we can decide, based on r, with as little randomness as
possible whether or not φ(t) is present in the received signal r(t).
Figure 4.10:
S/N = r̄²/n̄²
is maximized when h(t) is matched to φ(t).
and
r = y(T) = ∫_{−∞}^{∞} ( φ(T − α) + nw(T − α) ) h(α) dα
         = ∫_{−∞}^{∞} φ(T − α) h(α) dα  +  ∫_{−∞}^{∞} nw(T − α) h(α) dα
               (term #1)                       (term #2)
Therefore:
S/N = ( ∫_{−∞}^{∞} φ(T − α) h(α) dα )² / ( (N0/2) ∫_{−∞}^{∞} h²(β) dβ )
    = ⟨φ(T − t), h(t)⟩² / ( (N0/2) |h(t)|² )
    ≤ |φ(T − t)|² / (N0/2)
and equality is obtained when h(t) ∝ φ(T − t), i.e. when h(t) is matched to
φ(t).
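The conclusion (the output signal-to-noise ratio is largest when h(t) ∝ φ(T − t)) can be illustrated by evaluating S/N = ⟨φ(T − t), h(t)⟩²/((N0/2)|h|²) for the matched choice and for an arbitrary mismatched filter. A small Python sketch, with an assumed pulse φ(t) and an assumed N0:

    import numpy as np

    dt = 1e-4
    t = np.arange(0.0, 1.0, dt)
    phi = np.where(t < 0.5, 1.0, -1.0)          # assumed finite-energy pulse on [0, 1]
    N0 = 2.0

    def out_snr(h):
        num = (np.sum(phi[::-1] * h) * dt)**2   # <phi(T - t), h(t)>^2
        den = (N0 / 2.0) * np.sum(h**2) * dt    # (N0/2) |h|^2
        return num / den

    h_matched = phi[::-1]                       # h(t) = phi(T - t)
    h_mismatched = np.ones_like(t)              # arbitrary other filter

    print(out_snr(h_matched))                   # equals |phi|^2 / (N0/2) = 1
    print(out_snr(h_mismatched))                # strictly smaller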
Definition 4.4.3. The ratio r2 /n2 is called signal-to-noise ratio.6
6 See also definitions 4.5.1 and 4.5.2.
The previous theorem shows that the j-th matched filter of the receiver in
figure 4.19 in W&J extracts from r(t) the j-th component of the (transmitted)
random vector s with the best possible signal-to-noise ratio. This is another
point of view on the optimality of the matched filter receiver.
Remarks.
by:
    E′m = Σ_{i=0}^{M−1} P(mi) |s′i|²
is minimized by choosing a = s̄ = Σ_{i=0}^{M−1} P(mi) si.
Proof.
1. (W&J, top of page 247) The formal proof is tedious. We only need to
notice that equations (4.19), (4.20b) in W&J only depend on the distances
between a given r = ρ and the si , and that these distances are not affected
by rotation and translation of both r = ρ and the set of all si . Therefore
changing S to S ′ requires a change of the decision regions Ii to Ii′ , i =
0, 1, . . . , M − 1, but equation (4.20b) (with the new Ii′ ) will yield the
same answer as before. It follows that (4.20c) also remains unchanged:
P (E ′ ) = P (E ).
2. Follows from “the moment of inertia is minimum when taken around the
centroid (center of gravity) of a system.” (W&J, page 248)
Definition 4.5.2. Consider a linear time invariant filter used to detect a signal
φ(t) as in figure 4.22, page 240, W&J.7 The signal-to-noise ratio of the estimate
is defined as E[r]/√Var(r). When expressed in units of power, the definition
becomes:
    SNR = r̄²/Var(r)
and in the context of figure 4.22 (W&J), this simplifies to r̄²/n̄².
Theorem 55. (binary signals) Let s0 , s1 be two signal vectors in a signal space
with (orthonormal basis) {ϕ1 (t), ϕ2 (t)}, and let d = |s0 − s1 |.
7 Refer to page 512 in M.B. Priestley, Spectral Analysis and Time Series, vol. 1, Academic Press.
P(E) = Q( √(2Es/N0) )
Proof. Refer to W&J, pages 248 - 251. Notice that the probability of error is
more easily calculated, in the present situation, with the formula of equation
(4.4). In the case of unequally likely messages, the expression of ∆ is easily de-
rived from equations (4.6) and (4.7) based on the decision rule of the correlation
receiver.
The probability of error of equally likely binary antipodal and binary or-
thogonal signal sets is plotted as a function of the signal to noise ratio Es /N0
on figure 4.11.
Remarks.
1. Antipodal signalling is more energy efficient than orthogonal signalling.
2. Let
    si(t) = √(2Es/T) sin(2πfc t + iπ)  ; 0 ≤ t ≤ T
          = 0                          ; elsewhere
for i = 0, 1 and fc T ∈ N. The modulation scheme is called phase reversal
keying (PRK) and is a special case of phase shift keying (PSK). It is
a form of antipodal signalling since s1 (t) = −s0 (t) (exercise: find an
orthonormal basis and the corresponding vectors s0 , s1 ).
4.5. PROBABILITY OF ERROR 171
Figure 4.11:
3. Let:
s0(t) = √(2000 Es) cos(2π(3 kHz)t)    ; 0 ≤ t ≤ 1 ms
      = 0                             ; elsewhere
s1(t) = √(2000 Es) cos(2π(4.5 kHz)t)  ; 0 ≤ t ≤ 1 ms
      = 0                             ; elsewhere
This modulation scheme is called frequency shift keying (FSK) and is a
form of orthogonal signalling (exercise: find an orthonormal basis and
the corresponding vectors s0 , s1 ).
4. Let
s0(t) = 0
s1(t) = √(2Es/T) sin(2πfc t)  ; 0 ≤ t ≤ T
      = 0                     ; elsewhere
where fc T ∈ N. This modulation scheme is called amplitude shift keying (ASK)
and is neither antipodal nor orthogonal signalling. It is left as an exercise to
show that ASK has the same energy efficiency as that of orthogonal signalling
(exercise: find an orthonormal basis and the corresponding vectors s0, s1).
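The energy-efficiency comparison in remark 1 above is easy to quantify numerically: for equally likely binary messages, P(E) = Q(√(2Es/N0)) for antipodal signals (theorem 55) and Q(√(Es/N0)) for orthogonal signals (the standard binary orthogonal result plotted in figure 4.11; quoting it here is an assumption of this sketch). The Python snippet below evaluates both curves:

    import math

    def Q(x):
        # Q(x) = 0.5*erfc(x/sqrt(2))
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    for snr_db in range(0, 14, 2):
        snr = 10.0 ** (snr_db / 10.0)                  # Es/N0
        pe_antipodal = Q(math.sqrt(2.0 * snr))
        pe_orthogonal = Q(math.sqrt(snr))
        print(snr_db, pe_antipodal, pe_orthogonal)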
Rectangular decision regions (W&J, pages 251 - 254) occur when “the signal
vector configuration is rectangular and all signals are equally likely”. This
situation is easily analyzed and generalizes the previous theorem.
Example 4.5.1. Consider the signal set depicted in figure 4.12. We have (again
Figure 4.12:
P(C|m0) = ∫∫_{I0 − s0} pn(β) dβ
        = ( ∫_{−∞}^{d/2} pn(β) dβ )( ∫_{−d/2}^{∞} pn(β) dβ )
where pn(β) = (1/√(πN0)) e^{−β²/N0}.
It follows that:
P (C |m0 ) = (1 − p)2
where p = ∫_{d/2}^{∞} pn(β) dβ = Q( d/√(2N0) ). Noticing that Em = 5d²/2, p can be
expressed as a function of SNR: p = Q( √(Em/(5N0)) ). Similarly we find:
P(C|m4) = ∫∫_{I4 − s4} pn(β) dβ
        = ( ∫_{−d/2}^{d/2} pn(β) dβ )( ∫_{−d/2}^{∞} pn(β) dβ )
        = ( Q( −(d/2)/√(N0/2) ) − Q( (d/2)/√(N0/2) ) ) Q( −(d/2)/√(N0/2) )
        = ( 1 − Q( (d/2)/√(N0/2) ) − Q( (d/2)/√(N0/2) ) )( 1 − Q( (d/2)/√(N0/2) ) )
        = (1 − 2p)(1 − p)
P(C|m12) = (1 − 2p)²
Also
P (C |m0 ) = P (C |mi ), i = 1, 2, 3,
P (C |m4 ) = P (C |mi ), i = 5, 6, 7, . . . , 11,
P (C |m12 ) = P (C |mi ), i = 13, 14, 15.
Finally, after simplification we obtain
∑
15
P (C ) = P (mi )P (C |mi )
i=0
= (1 − 3p/2)2
The P (E ) = 1 − P (C ) is plotted on figure 4.13 as a function of the SN R.
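The final expression is convenient for numerical evaluation. The Python snippet below computes P(E) = 1 − (1 − 3p/2)² with p = Q(√(Em/(5N0))) as a function of the signal-to-noise ratio Em/N0 (the range of SNR values is an arbitrary choice):

    import math

    def Q(x):
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    def pe_qam16(snr):                    # snr = Em/N0
        p = Q(math.sqrt(snr / 5.0))
        return 1.0 - (1.0 - 1.5 * p)**2

    for snr_db in range(0, 22, 2):
        snr = 10.0 ** (snr_db / 10.0)
        print(snr_db, pe_qam16(snr))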
In the special case where:
φ1(t) = √(2/T) cos(2πfc t)  ; 0 ≤ t ≤ T
      = 0                   ; elsewhere
φ2(t) = √(2/T) sin(2πfc t)  ; 0 ≤ t ≤ T
      = 0                   ; elsewhere
this modulation is referred to as quadrature amplitude modulation with 16 levels
(QAM-16). It generalizes to other numbers of levels. QAM-4 is also called
quadri-phase shift keying (QPSK).
Theorem 56. (vertices of a hypercube) Let M = 2N equally likely messages
be mapped onto signal vectors s0 , . . . , sM −1 which form the vertices of an N -
dimensional hypercube centered on the origin:
si = (si1, si2, . . . , siN), sij = ±d/2 .
Figure 4.13:
Then P(C) = (1 − p)^N, where p = Q( d/√(2N0) ) = Q( √(2Es/(N N0)) ) and Es = N d²/4 = Em.
as seen on figure 4.35(b) in W&J. Clearly the decision regions are the regions
of R3 delimited by the axes (octants). Suppose s0 is transmitted. Then:
1. no errors ⇔ n1 < d/2, n2 < d/2, n3 < d/2 (as in W&J bottom of page
255),
3. Em = |s|² = N d²/4 = Es (as in equation (4.88a) in W&J).
Therefore:
P(C|m0) = ∫_{−∞}^{d/2} ∫_{−∞}^{d/2} ∫_{−∞}^{d/2} pn(α1) pn(α2) pn(α3) dα1 dα2 dα3
        = ( ∫_{−∞}^{d/2} pn(α) dα )³
        = (1 − p)³
where pn(α) = (1/√(πN0)) e^{−α²/N0} and p = Q( d/√(2N0) ) = Q( √(2Es/(N N0)) ). The symmetry
of the signal set implies that P(C|mi) is the same for every i, hence P(C) = P(C|m0) = (1 − p)^N.
Remark. The irrelevance theorem may be used to give a more global proof of
this result (refer to W&J, top of page 257).
P (E ) is plotted as a function of the SN R for the case N = 4, M = 16 in
figure 4.13. It is seen, not surprisingly, that the QAM-16 modulation yields a
bigger P (E ) for the same value of SN R. P (E ) versus SN R is plotted in figure
4.14 for various values of N .
Proof. (W&J, pages 257 - 259) Notice that the decision regions are no longer
rectangular. In this case we use the receiver structure of figure 4.19 in W&J to
describe the event E and to calculate its probability. With an orthogonal signal
set, the various elements in fig 4.19 become: 8
• φj(t) = sj(t)/√Es,
• c0 = c1 = · · · = cM−1,
Even though it is not clear from the expression, P(E) is a function of the ratio
Es/N0 only; it does not depend on Es and N0 independently. P(E) is
plotted on figure 4.15 as a function of SNR = Es/N0.
Definition 4.5.3. Let {sj}_{j=0}^{M−1} be a set of M = N equally likely, equal energy,
orthogonal signal vectors. The signal vector set {s′j = sj − a}_{j=0}^{M−1}, where
    a = s̄ = (1/M) Σ_{i=0}^{M−1} si ,
is called a simplex signal set.
Remarks.
8 The notation is changed in that j ranges from 0 to M − 1 instead of from 1 to N . We
Figure 4.15:
s′i − s′j = −√( Es^(s) / (M(M − 1)) ) (0, 0, . . . , 0, −M, 0, 0, . . . , 0, +M, 0, . . . , 0)
In the above, the component −M is in position i and the component +M is in
position j. The square of the distance is then:
    |s′i − s′j|² = ( Es^(s) / (M(M − 1)) ) × 2M² = (2M/(M − 1)) Es^(s)
An orthogonal set of signals of equal energy Es^(o) = (M/(M − 1)) Es^(s) is also such that the
square of the distance between any two distinct signals is (2M/(M − 1)) Es^(s). Both sets
consequently have the same P(E) and P(C). For example, a ternary (M = 3)
simplex set of dimension N = 2 with Es^(s) = 1 V² has the same P(E) as the
corresponding orthogonal signal set with energy Es^(o) = 1.5 V². In other words,
the probability of error of a ternary simplex set at a signal-to-noise ratio of 0 dB
(SNR = 1) is equal to the probability of error of a ternary orthogonal set at a
signal-to-noise ratio of 1.76 dB (SNR = 1.5). P(E) is plotted on figure 4.15 as
a function of SNR = Es/N0.
Definition 4.5.4. (W&J, page 261) Let {sj}_{j=0}^{N−1} be a set of equal energy
orthogonal signal vectors. The signal vector set {sj, −sj}_{j=0}^{N−1} is called a
biorthogonal set and clearly M = 2N.
Theorem 58. Let {sj, −sj}_{j=0}^{N−1} contain M = 2N equiprobable biorthogonal
signal vectors with the same energy Es = Em. Then (equation (4.104) in
W&J):
P(C) = ∫_{0}^{∞} pn(α − √Es) ( ∫_{−α}^{α} pn(β) dβ )^{N−1} dα
     = ∫_{0}^{∞} pn(α − √Es) ( 1 − 2 ∫_{α}^{∞} pn(β) dβ )^{N−1} dα
     = ∫_{0}^{∞} pn(α − √Es) ( 1 − 2Q( α/√(N0/2) ) )^{N−1} dα
This becomes:
    no errors ⇔ r0 > 0 and r0 > ri and r0 > −ri, i = 1, 2, . . . , N − 1
Figure 4.16:
We easily see that a symmetrical signal vector set with equiprobable mes-
sages leads to a congruent decision region receiver by the MAP decision rule
(page 263 in W&J). Equivalently, when the signal set is symmetrical, a max-
imum likelihood receiver is a congruent decision region receiver. A congruent
decision region receiver need not be optimal, but over an additive white Gaus-
sian noise channel it satisfies (equation (4.106a) in W&J):
P(C|mi) = P(C|mj)
P(C) = Σ_{i=0}^{M−1} P(mi) P(C|mi) = P(C|mi), ∀i.
Minimax Receiver
Theorem 60. A congruent decision region receiver which is optimal for a cer-
tain choice of the P (mi )’s is minimax. In particular a maximum likelihood
receiver with symmetrical signal vector set is minimax.
Proof. Refer to page 264 in W&J; this proves part 4 of theorem 50.
P(E|mi) ≤ Σ_{k=0, k≠i}^{M−1} P2[si, sk]
where P2 [si , sk ] denotes the P (E ) of a binary system that would use signals
si and sk to communicate one of two equally likely messages. For the additive
white Gaussian noise channel we have (equations (4.110) and previously (4.76b)
in W&J)
    P2[si, sk] = Q( |si − sk| / √(2N0) )
and therefore
P(E|mi) ≤ Σ_{k=0, k≠i}^{M−1} Q( |si − sk| / √(2N0) )
2. For simplex signals:10
P(E) ≤ (M − 1) Q( √( (M/(M − 1)) (Es/N0) ) )
10 Recall that Es^(s) = Es^(o) (1 − 1/M) ⇒ Es^(o) = (M/(M − 1)) Es^(s), where Es^(s), Es^(o) respectively
denote the average signal energy of the simplex and orthogonal signal sets.
Proof.
P(E|m0) ≤ Σ_{k=1}^{M−1} Q( |s0 − sk| / √(2N0) )
        = Σ_{k=1}^{M−1} Q( √(Es/N0) )
        = (M − 1) Q( √(Es/N0) )
2. For a simplex signal set we start with the latter bound, and recalling that
(page 261 in W&J):
    Es^(o) = (M/(M − 1)) Es^(s)
where Es^(s) = Em, we obtain:
    P(E) ≤ (M − 1) Q( √( (M/(M − 1)) (Em/N0) ) )
P(E|m0) ≤ Σ_{k=1}^{N−1} Q( |s0 − sk| / √(2N0) ) + Σ_{k=1}^{N−1} Q( |s0 + sk| / √(2N0) ) + Q( |s0 + s0| / √(2N0) )
P(E) = P(E|m0)
     ≤ 2(N − 1) Q( √(Es/N0) ) + Q( √(2Es/N0) )
     = (M − 2) Q( √(Es/N0) ) + Q( √(2Es/N0) )
These bounds are plotted in figure 4.17 for the case M = 16.
Figure 4.17:
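The three bounds can be evaluated directly. A short Python sketch for M = 16 (so N = 16 orthogonal/simplex signals and N = 8 pairs for the biorthogonal set), as plotted in figure 4.17; the SNR grid is arbitrary:

    import math

    def Q(x):
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    M = 16
    for snr_db in range(0, 14, 2):
        snr = 10.0 ** (snr_db / 10.0)                        # Es/N0
        pe_orth = (M - 1) * Q(math.sqrt(snr))
        pe_simplex = (M - 1) * Q(math.sqrt(M / (M - 1) * snr))
        pe_biorth = (M - 2) * Q(math.sqrt(snr)) + Q(math.sqrt(2.0 * snr))
        print(snr_db, pe_orth, pe_simplex, pe_biorth)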
2. Each message error results in a number of bit errors that can be anywhere
between 1 and log2 (M ).
It may be very difficult to predict the bit error rate of a signal set as a
function of the average signal-to-noise ratio per bit, mainly because it depends
on the mapping of information bit sequences to signal vectors. A simple ap-
proximation may be obtained by noticing that at a low signal-to-noise ratio one
expects that half of the log2 (M ) bits of a message will be incorrect and at a high
signal-to-noise ratio only one of the log2 (M ) bits of a message will be incorrect
(using a Gray code to map the information bit sequences to signal vectors).
After rescaling the horizontal axis to reflect the average signal-to-noise ratio
per bit, some of the curves of the previous section are redrawn as two curves each:
one corresponding to “half of the information bits are incorrect” and the
other corresponding to “only one bit is incorrect”; the true performance
curve should lie somewhere in between the two. Some of the curves are presented
in figures 4.18 and 4.19.
Figure 4.18:
Figure 4.19:
4.7 Problems:
1. Consider the signals in figure 4.20.
Figure 4.20:
Chapter 5
Reliable Communication on
an Unreliable Channel
Figure 5.1: Block diagram: source — (A) — channel — (B) — receiver.
never be made arbitrarily small. Shannon showed how this can be accomplished
in a famous paper published in 1948. In the following we denote:
A^N = {(a1, a2, . . . , aN) | ai ∈ A}
in which N > 1 is an integer. A sequence of N source alphabet symbols is an
element of A^N.
Definition 5.0.1.
1. A block code C of (block) length N and rate R over the alphabet A is a
subset of A^N of size (cardinality) M = A^{RN}. The elements of C are
called codewords. It is convenient to number the codewords and without
loss of generality we write:
    C = {y1, y2, . . . , yM}.
Figure 5.2: Block diagram: the source produces RN symbols (assuming RN is an integer), the encoder maps them to a codeword yi ∈ C, the channel outputs z ∈ B^N, and the decoder produces ŷ(z) ∈ C.
which implies
P(E) = Σ_{i=1}^{M} P(yi) P(E|yi)
     = Σ_{i=1}^{M} P(yi) Σ_{z ∉ Ii} PN(z|yi)
The last expression shows that the probability of error is a function of the a priori
probabilities P(yi), the channel probabilities PN(z|yi), the code C and the
decoder's decision regions Ii, i = 1, 2, . . . , M.
Remark. The last expression assumes that z is a discrete variable; if it is
continuous then all PN(z|yi) become pz(γ|y = yi) and all Σ_z become ∫ dγ.
Definition 5.0.3.
Example 5.1.1. Let A = {0, 1}, A = 2 and the (memoryless) Binary Symmetric
Channel (BSC) shown in figure 5.3 with p < 1/2. We have:
    PN(z|yi) = p^{dH(z,yi)} (1 − p)^{N − dH(z,yi)} = (1 − p)^N ( p/(1 − p) )^{dH(z,yi)}
where dH(z, yi), called the Hamming distance between z and yi, denotes the
number of positions in which z and yi differ. Clearly p < 1/2 ⇒ p/(1 − p) < 1 and
PN(z|yi) is maximized by minimizing dH(z, yi). So if
then we find:
y2 =1,0,0,1
Figure 5.3:
For the code of size M = 2 and the decision regions of the maximum likeli-
hood decoder we have:
P(E|y1) = P(z ∈ I2 | y1) = Σ_{z∈I2} PN(z|y1)
We then notice that for any z ∈ I2 we have PN(z|y1) < PN(z|y2), i.e.
    PN(z|y2)/PN(z|y1) ≥ 1 ⇒ ( PN(z|y2)/PN(z|y1) )^s ≥ 1, ∀s ≥ 0
The above P(E|y1) is bounded by:
P(E|y1) ≤ Σ_{z∈I2} PN(z|y1) ( PN(z|y2)/PN(z|y1) )^s
        = Σ_{z∈I2} PN(z|y1)^{1−s} PN(z|y2)^s
        ≤ Σ_{z∈B^N} PN(z|y1)^{1−s} PN(z|y2)^s ,
∀s ≥ 0. The above bound is convenient because it does not involve the decision
regions; it can be calculated or computed without first having to find the decision
regions. It follows that:
P(E) = P(E|y1) P(y1) + P(E|y2) P(y2)
     ≤ P(y1) Σ_{z∈B^N} PN(z|y1)^{1−s} PN(z|y2)^s + P(y2) Σ_{z∈B^N} PN(z|y2)^{1−s} PN(z|y1)^s
This bound only depends on the code C = {y 1 , y 2 } and the discrete commu-
nication channel probabilities PN (z|y), with or without memory; it assumes
equally likely messages and a maximum likelihood decoder.
For a memoryless channel and s = 1/2 the sum becomes:
Σ_{z∈B^N} √( PN(z|y1) PN(z|y2) ) = Σ_{z∈B^N} Π_{n=1}^{N} √( P(zn|y1n) P(zn|y2n) )
    = Σ_{z1∈B} Σ_{z2∈B} · · · Σ_{zN∈B} Π_{n=1}^{N} √( P(zn|y1n) P(zn|y2n) )
    = Π_{n=1}^{N} Σ_{zn∈B} √( P(zn|y1n) P(zn|y2n) )
The subscript n of zn may be omitted in the above. We notice that if y1n = y2n
Σ_{z∈B} √( P(z|y1n) P(z|y2n) ) = Σ_{z∈B} P(z|y1n) = 1
and those values do not affect the product. We can then rewrite the product in
the above bound as:
P(E) ≤ Π_{n : y1n ≠ y2n} Σ_{z∈B} √( P(z|y1n) P(z|y2n) ) = γ^{dH(y1, y2)}
where γ ≜ Σ_{z∈B} √( P(z|0) P(z|1) ). If z is a continuous variable then
γ = ∫_{−∞}^{∞} √( pr(ρ|0) pr(ρ|1) ) dρ.
Example 5.1.2.
1. Binary symmetric channel with transition probability p. 3 We find:
γ = Σ_{z∈{0,1}} √( P(z|0) P(z|1) ) = 2 √( p(1 − p) )
and P(E) is bounded by
    P(E) ≤ ( 2 √( p(1 − p) ) )^{dH(y1, y2)} .
with (binary) antipodal modulation and a maximum likelihood receiver. We then have
p = Q( √(2EN/N0) ), where EN and N0/2 respectively denote the energy transmitted per channel
use and the double sided noise power spectral density.
and the bound is found to be 14 times larger than the exact value in this
specific case.
2. AWGN channel with binary antipodal modulation, maximum likelihood
(ML) receiver and as before EN and N0 /2 respectively denote the energy
transmitted per channel use and the double sided noise power spectral
density:
γ = ∫_{−∞}^{∞} √( pr(ρ|√EN) pr(ρ|−√EN) ) dρ
In the above we have
    pr(ρ|si) = pn(ρ − si) = (1/√(2πσ²)) e^{−(ρ − si)²/(2σ²)}
and σ² = N0/2. It follows that:
γ = ∫_{−∞}^{∞} √( (1/√(2πσ²)) e^{−(ρ − √EN)²/(2σ²)} (1/√(2πσ²)) e^{−(ρ + √EN)²/(2σ²)} ) dρ
  = ∫_{−∞}^{∞} (1/√(2πσ²)) exp( −((ρ − √EN)² + (ρ + √EN)²) / (4σ²) ) dρ
  = ∫_{−∞}^{∞} (1/√(2πσ²)) exp( −(ρ² + EN) / (2σ²) ) dρ
  = exp( −EN/(2σ²) ) ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−ρ²/(2σ²)} dρ
  = e^{−EN/N0}
Refer to the graph in figure 5.4 for a comparison of the γ corresponding to:
1. BSC,
2. AWGN channel/antipodal modulation/ML receiver.
Since a smaller value of γ is desirable we notice that channel 2 is better (≈ 2
dB more energy efficient).
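The two values of γ compared in figure 5.4 are easy to compute. In the Python sketch below the hard-decision BSC uses p = Q(√(2EN/N0)), as in the footnote above, and the soft-decision AWGN channel uses γ = e^{−EN/N0}; the SNR grid is arbitrary.

    import math

    def Q(x):
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    for snr_db in range(-10, 12, 2):
        snr = 10.0 ** (snr_db / 10.0)                # EN/N0
        p = Q(math.sqrt(2.0 * snr))                  # BSC crossover probability
        gamma_bsc = 2.0 * math.sqrt(p * (1.0 - p))   # hard decisions
        gamma_awgn = math.exp(-snr)                  # soft decisions
        print(snr_db, gamma_bsc, gamma_awgn)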
Figure 5.4: γ versus SNR EN/N0 [dB].
g is simplified as follows:
g = Σ_z Σ_{y1} Σ_{y2} QN(y1) √(PN(z|y1)) QN(y2) √(PN(z|y2))
  = Σ_z ( Σ_{y1} QN(y1) √(PN(z|y1)) )( Σ_{y2} QN(y2) √(PN(z|y2)) )
  = Σ_z ( Σ_{y∈A^N} QN(y) √(PN(z|y)) )²
The average probability of error for any discrete communication channel (with
or without memory) with 2 equally likely messages and a maximum likelihood
decoder is bounded by:
P(E) ≤ g = Σ_{z∈B^N} ( Σ_{y∈A^N} QN(y) √(PN(z|y)) )²
g = Σ_z ( Σ_{y∈A^N} QN(y) Π_{n=1}^{N} √(P(zn|yn)) )²
  = Σ_z ( Σ_{y∈A^N} Π_{n=1}^{N} Q(yn) √(P(zn|yn)) )²
where we choose QN(y) = Π_{n=1}^{N} Q(yn), in which Q( ) is a probability measure
on A (this corresponds to drawing the symbols of y out of A by independent trials). We then have:
g = Σ_z ( Π_{n=1}^{N} Σ_{y∈A} Q(y) √(P(zn|y)) )²
  = Σ_{z∈B^N} Π_{n=1}^{N} ( Σ_{y∈A} Q(y) √(P(zn|y)) )²
  = Π_{n=1}^{N} Σ_{z∈B} ( Σ_{y∈A} Q(y) √(P(z|y)) )²
  = ( Σ_{z∈B} ( Σ_{y∈A} Q(y) √(P(z|y)) )² )^N
  = 2^{−N R(Q)}
where we define
    R(Q) ≜ − log2 ( Σ_{z∈B} ( Σ_{y∈A} Q(y) √(P(z|y)) )² )
As long as Q( ) and the channel probabilities P (z|y) are such that R(Q) > 0
(bits per channel use) then g can be made as small as we wish by taking N large
enough; there must then exist at least one pair of codewords {y 1 , y 2 } ⊂ A N
for which the probability of error is less than g.
In order to obtain the “tightest” bound, the probability measure Q( ) max-
imizing R(Q) is chosen.
Equivalently we have:
R0 = − log2 ( min_{Q( )} Σ_{z∈B} ( Σ_{y∈A} Q(y) √(P(z|y)) )² )
R0 = − log2 Σ_{z∈B} ( (1/2) √(P(z|0)) + (1/2) √(P(z|1)) )²
   = − log2 Σ_{z∈B} ( P(z|0)/4 + P(z|1)/4 + √( P(z|0) P(z|1) )/2 )
   = − log2 ( (1/2)(1 + γ) )
   = 1 − log2(1 + γ)
where we recall that γ = Σ_{z∈B} √( P(z|0) P(z|1) ).
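R0 = 1 − log2(1 + γ) is easily evaluated numerically. The Python sketch below computes it for a BSC with the crossover probability p = 1.25 × 10⁻² used in example 5.3.1 further below, and also evaluates the random-coding bound M·2^{−N R0} for that example's parameters (N = 267, M = 2^160); the numbers are a check of the example, not new results.

    import math

    def r0_bsc(p):
        gamma = 2.0 * math.sqrt(p * (1.0 - p))
        return 1.0 - math.log2(1.0 + gamma)

    p = 1.25e-2
    R0 = r0_bsc(p)                         # approximately 0.7105
    N, k = 267, 160                        # block length and number of information bits
    bound_log2 = k - N * R0                # log2 of M*2^(-N*R0) with M = 2^k
    print(R0, 2.0 ** bound_log2)           # the bound is below 1e-8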
Example 5.2.1. Refer to the graphs on figure 5.5 for a comparison of the R0
corresponding to:
1. BSC,
Figure 5.5: R0 for the BSC and R0 for the AWGN channel versus SNR EN/N0 [dB].
code C = {y 1 , y 2 , . . . , y M } is:
P(C) = Π_{i=1}^{M} QN(yi)
P(E) = P(E|y1) = P(z ∈ ∪_{j≠1} I1j | y1)
     ≤ Σ_{j≠1} P(z ∈ I1j | y1)
     ≤ Σ_{j≠1} g(y1, yj)
Averaging this bound over all codes C (drawn according to P(C)) gives:
Σ_C Π_{l=1}^{M} QN(yl) Σ_{j≠1} g(y1, yj)
  = Σ_{y1∈A^N} Σ_{y2∈A^N} · · · Σ_{yM∈A^N} Σ_{j≠1} Π_{l=1}^{M} QN(yl) g(y1, yj)
  = Σ_{j=2}^{M} Σ_{y1∈A^N} Σ_{y2∈A^N} · · · Σ_{yM∈A^N} g(y1, yj) Π_{l=1}^{M} QN(yl)
  = Σ_{j=2}^{M} Σ_{y1∈A^N} Σ_{yj∈A^N} g(y1, yj) QN(y1) QN(yj) × Σ · · · Σ QN(yl) over all l except 1 and j   (this last factor equals 1)
  = Σ_{j=2}^{M} Σ_{y1∈A^N} Σ_{yj∈A^N} QN(y1) QN(yj) g(y1, yj)
  = (M − 1) g ≤ M g
g = Σ_{z∈B^N} ( Σ_{y∈A^N} QN(y) √(PN(z|y)) )²
where
    RN = (log2 M)/N
    R0 = max_Q R(Q) = − log2 ( min_Q Σ_{z∈B} ( Σ_{y∈A} Q(y) √(P(z|y)) )² )
Example 5.3.1.
1. There exists a code of rate RN = 0.6 that can be used to transmit a 160
bit sequence over a BSC with p = 1.25 × 10⁻² and a P(E) < 10⁻⁸.7 This
is because the channel has R0 ≈ 0.7105, N = 160/0.6 ≈ 267 and
    P(E) ≤ M 2^{−N R0} = 2^{160} × 2^{−267 × 0.7105} ≈ 2^{−29.7} ≈ 1.2 × 10⁻⁹ < 10⁻⁸.
The size of the code is M = 2^160 ≈ 1.5 × 10^48 codewords. Practical design
of such codes/encoders/decoders exceeds the scope of this course.
2. The results presented below are obtained by Monte Carlo simulation for
the transmission of a 5000 bit sequence over the AWGN channel with
binary antipodal signalling and maximum likelihood receiver at a signal-to-noise
ratio EN/N0 = 0 dB. This means that the channel's probability of
bit error is Q(√2) = 7.865 × 10⁻² and the channel's computational cut-off
rate is R0 ≈ 0.548. We distinguish the cases:
(a) No coding: The packet-error-rate (PER) ≈ 1 since every packet/sequence
contains an average of 393 bit errors; it is very unlikely that
a packet be received without a single bit error:
Code Constraint Length   packets   packet errors   bit errors   PER           BER
—                         500       500             196592       1             7.86 × 10−2
7                         500       122             1063         2.44 × 10−1   2.13 × 10−4
9                         500       29              185          5.80 × 10−2   7.40 × 10−5
11                        500       12              72           2.40 × 10−2   2.88 × 10−5
12                        500       10              65           2.00 × 10−2   2.60 × 10−5
13                        263       0               0            0             0
(Uncoded case: there were 196592 bit errors during the transmission of 2500000 bits at a signal to noise ratio of 0.0 dB.)
Table 5.1: simulation results rate 1/2 convolutional codes over an AWGN chan-
nel with binary antipodal modulation, 0 dB signal-to-noise ratio and maximum
likelihood receiver – Viterbi decoding of the convolutional code
7 This corresponds to a signal-to-noise ratio of EN/N0 = 0.4 dB if the BSC is implemented with binary
antipodal signalling over an AWGN channel and a maximum likelihood receiver.
Appendix A
The approximation Q(x) ≈ (1/(x√(2π))) (1 − 0.7/x²) e^{−x²/2} may be used when x > 2.
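For numerical work, Q(x) can also be computed exactly from the complementary error function, and the approximation above is easily compared with it (Python):

    import math

    def Q(x):
        # exact: Q(x) = 0.5*erfc(x/sqrt(2))
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    def Q_approx(x):
        # approximation quoted above, valid for x > 2
        return (1.0 / (x * math.sqrt(2.0 * math.pi))) * (1.0 - 0.7 / x**2) * math.exp(-x * x / 2.0)

    for x in (2.0, 3.0, 4.0, 5.0):
        print(x, Q(x), Q_approx(x))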
Appendix B
Orthonormal Expansion
and Vector Representation
of Continuous Signals
We define a scalar product for finite energy signals. This scalar product
essentially corresponds to the projection described in the motivating problem
below, and is used to obtain a geometric view of a given finite set of (finite
energy) signals. Additionally, we obtain a distance measure between signals.
Signal bases are defined and we show how to perform a basis change. Finally
we prove Parseval's relationships. We start with a review of the concepts with
3-dimensional vectors over the real numbers.
(1, 1, 1), (0, 1, 1), (0, 1, −1): objects measured by team 1 with NSEW reference.
Figure B.1:
where v1 , v2 , v3 ∈ R and ⃗i, ⃗j, ⃗k are the standard unit vectors which form a basis of
R3 . Since the above expansion is unique, the triplet (v1 , v2 , v3 ) can consequently
be used to represent ⃗v with respect to the basis and we write:
⃗v ←→ (v1 , v2 , v3 )
⃗ ←→ (w1 , w2 , w3 )
w
meaning that
⃗ = w1⃗i + w2⃗j + w3⃗k .
w
Definition B.1.2. Let ⃗v = v1⃗i + v2⃗j + v3⃗k, w
⃗ = w1⃗i + w2⃗j + w3⃗k.
⟨ , ⟩ : R³ × R³ → R
⟨ , ⟩ : (v⃗, w⃗) ↦ ⟨v⃗, w⃗⟩ ≜ v1w1 + v2w2 + v3w3
is called dot product (aka scalar product or standard inner product) with
respect to the basis {i⃗, j⃗, k⃗}. One easily shows that:
In light of theorem 62, we see that a vector is normal if its length is 1, and
⃗v and w ⃗ are orthogonal if the projection of ⃗v on w ⃗ (or vice-versa) has a length
of 0 as one would expect intuitively.
The basis {⃗i, ⃗j, ⃗k} used above is by definition orthonormal.1 The expansion
of ⃗v with respect to ⃗i, ⃗j, ⃗k is represented graphically in figure B.2.
Figure B.2:
One easily verifies that any two of w⃗1 , w⃗2 , w⃗3 are orthogonal and that each one
is normalized.
Example B.1.1. Let ⃗i, ⃗j, ⃗k denote the standard unit vectors and define:3
v⃗1 = ⃗i + ⃗j + ⃗k ←→ (1, 1, 1)
v⃗2 = ⃗j + ⃗k ←→ (0, 1, 1)
v⃗3 = ⃗j − ⃗k ←→ (0, 1, −1)
| {z }
with respect to the basis {⃗i, ⃗j, ⃗
k}
Since {w⃗1 , w⃗2 , w⃗3 } forms a basis of R3 they can also be used to expand
v⃗1 , v⃗2 , v⃗3 . We would find (this will be explained below):4
v⃗1 = √3 w⃗1                   ←→ (√3, 0, 0)
v⃗2 = (2/√3) w⃗1 + √(2/3) w⃗2   ←→ (2/√3, √(2/3), 0)
v⃗3 = √2 w⃗3                   ←→ (0, 0, √2)
(with respect to the basis {w⃗1, w⃗2, w⃗3})
Not surprisingly, the representation of the same vectors v⃗1, v⃗2, v⃗3 depends on the basis used to represent them; the dot products, however, are independent of the orthonormal basis used (and this is not a coincidence, as we will see):
\[
\underbrace{\langle \vec{v}_i, \vec{v}_j \rangle}_{\text{wrt } \{\vec{i}, \vec{j}, \vec{k}\}}
= \underbrace{\langle \vec{v}_i, \vec{v}_j \rangle}_{\text{wrt } \{\vec{w}_1, \vec{w}_2, \vec{w}_3\}},
\qquad \forall\, i, j = 1, 2, 3.
\]
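The coordinates with respect to an orthonormal basis are simply the dot products ⟨v⃗i, w⃗j⟩; this is how the expansions above are obtained. A small Python/NumPy sketch of the basis change follows. Since the definition of w⃗1, w⃗2, w⃗3 is not reproduced in these notes, the explicit values used below (w⃗1 = (1, 1, 1)/√3, w⃗2 = (−2, 1, 1)/√6, w⃗3 = (0, 1, −1)/√2) are inferred from the expansions of example B.1.1 and should be treated as an assumption of the sketch.

```python
import numpy as np

# Orthonormal basis {w1, w2, w3} (values inferred from the expansions of example B.1.1)
w = np.array([[ 1.0, 1.0,  1.0],
              [-2.0, 1.0,  1.0],
              [ 0.0, 1.0, -1.0]])
w = w / np.linalg.norm(w, axis=1, keepdims=True)   # normalize each row

# v1, v2, v3 in the standard basis {i, j, k}
v = np.array([[1.0, 1.0,  1.0],
              [0.0, 1.0,  1.0],
              [0.0, 1.0, -1.0]])

# Coordinates with respect to {w1, w2, w3}: c[i, j] = <v_i, w_j>
c = v @ w.T
print(np.round(c, 4))   # rows: (sqrt(3), 0, 0), (2/sqrt(3), sqrt(2/3), 0), (0, 0, sqrt(2))

# The dot products are independent of the orthonormal basis used
print(np.allclose(v @ v.T, c @ c.T))   # True
```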
Why can’t we just use the standard unit vector basis since it is or-
thonormal? Why do we need other orthonormal bases? In R3 not
much is to be gained by using different orthonormal bases, but if we have a
small number of vectors with many components then some orthonormal bases
may give a very simple representation of the vectors. This is illustrated in the
next example.
4 This could correspond to the measurements taken by team #2 in our motivating example,
assuming that their measurements are in agreement with those made by team #1.
with respect to the standard unit vector basis {i⃗1, i⃗2, . . . , i⃗10} in R10. The vectors
\[
\begin{aligned}
\vec{w}_1 &= (-1,\ 2,\ 1,\ 0,\ 0,\ 2,\ -1,\ 0,\ -2,\ 0)/\sqrt{15} \\
\vec{w}_2 &= (2,\ -4,\ -1/2,\ 3/2,\ 3,\ 2,\ -5/2,\ -3/2,\ -2,\ 3)/\sqrt{57}
\end{aligned}
\]
are orthonormal, and v⃗1, v⃗2, v⃗3 can be expanded with respect to the basis {w⃗1, w⃗2}; we may then represent the vectors graphically as in figure B.3. Clearly the representation of v⃗1, v⃗2, v⃗3 in the basis {w⃗1, w⃗2} is much simpler than the representation in the standard unit vector basis.
Figure B.3:
Step 2 w⃗2 = normalize(v⃗2 − ⟨v⃗2, w⃗1⟩w⃗1) if this vector is different from 0. This operation is represented graphically in figure B.4.
Figure B.4:
Step 3 w⃗3 = normalize(v⃗3 − ⟨v⃗3, w⃗1⟩w⃗1 − ⟨v⃗3, w⃗2⟩w⃗2) if this vector is different from 0.
Step 4 . . . and so on . . .
There are as many steps as there are (non-zero) vectors in the given set. The vectors w⃗1, w⃗2, . . . are orthonormal. The number of (non-zero) orthonormal vectors obtained is less than or equal to the number of vectors in the given set.
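The procedure translates directly into code; the following Python/NumPy sketch is a minimal implementation (the function name and the tolerance used to decide whether a residual vector is zero are illustrative choices).

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal set spanning the same space as `vectors`.

    Vectors whose residual after removing the projections is (numerically) zero
    are skipped, so fewer vectors than given may be returned.
    """
    basis = []
    for v in vectors:
        r = np.asarray(v, dtype=float)
        for w in basis:
            r = r - np.dot(r, w) * w       # subtract the projection on each w found so far
        norm = np.linalg.norm(r)
        if norm > tol:                     # keep only non-zero residuals
            basis.append(r / norm)         # normalize
    return basis

# Example B.1.1 revisited
for w in gram_schmidt([[1, 1, 1], [0, 1, 1], [0, 1, -1]]):
    print(np.round(w, 4))
```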
Each vector v⃗i may be viewed as a discrete signal
\[
v_i : \{1, 2, \ldots, N\} \to \mathbb{R}, \qquad n \mapsto v_i(n) = v_{in},
\]
so that we may write v⃗i = (vi(1), vi(2), . . . , vi(N)) in the standard unit vector basis. The Gram-Schmidt procedure applied to the v⃗i returns L ≤ M orthonormal vectors w⃗1, . . . , w⃗L such that
\[
\vec{v}_i = \sum_{j=1}^{L} \langle \vec{v}_i, \vec{w}_j \rangle\, \vec{w}_j ,
\]
and, writing cij = ⟨v⃗i, w⃗j⟩,
\[
\underbrace{\sum_{n=1}^{N} v_i(n)\, v_k(n)}_{\text{wrt standard unit vectors}}
= \underbrace{\sum_{j=1}^{L} c_{ij}\, c_{kj}}_{\text{wrt the basis } \{\vec{w}_j\}} .
\]
Example B.2.2. Three signals v1 (n), v2 (n), v3 (n) are sketched on figure B.5(a).
The signals w1 (n), w2 (n) sketched in figure B.5(b) form an orthonormal ba-
sis suitable to represent v1 (n), v2 (n), v3 (n) (they have been obtained by ap-
plying the Gram-Schmidt procedure). v1 (n), v2 (n), v3 (n) can then be repre-
sented as a point in a Cartesian plane similarly to the sketch in figure B.3.
By a proper choice of the linear combination, v1(n), v2(n), v3(n) can be recovered from w1(n), w2(n) as shown in figure B.5(c) by the sketches of w1(n)√15, w1(n)√15/3 + 2w2(n)√57/3 and w1(n)√15/3 − 4w2(n)√57/3. Finally the invariance of the dot product to the choice of the orthonormal basis is illustrated below with the signals v1(n), v2(n):
\[
\langle v_1(n), v_2(n) \rangle = \sum_{n=1}^{10} v_1(n)\, v_2(n) = 5
\]
\[
\bigl\langle (\sqrt{15},\ 0),\ (\sqrt{15}/3,\ 2\sqrt{57}/3) \bigr\rangle = \frac{15}{3} + 0 = 5 .
\]
Figure B.5: Example for Gram-Schmidt procedure with the sampling of a waveform
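The invariance can also be verified numerically. In the sketch below the waveforms v1(n), v2(n) of figure B.5(a) are not reproduced; they are reconstructed from the coordinates read off the sketches above, so this reconstruction is an assumption of the illustration.

```python
import numpy as np

# Orthonormal vectors w1, w2 as given above
w1 = np.array([-1, 2, 1, 0, 0, 2, -1, 0, -2, 0]) / np.sqrt(15)
w2 = np.array([2, -4, -0.5, 1.5, 3, 2, -2.5, -1.5, -2, 3]) / np.sqrt(57)

# Coordinates of v1(n), v2(n) with respect to {w1, w2} (read off the sketches above)
c1 = np.array([np.sqrt(15), 0.0])
c2 = np.array([np.sqrt(15) / 3, 2 * np.sqrt(57) / 3])

v1 = c1[0] * w1 + c1[1] * w2     # v1(n) in the standard unit vector basis
v2 = c2[0] * w1 + c2[1] * w2     # v2(n) in the standard unit vector basis

print(np.dot(v1, v2))            # 5.0  (dot product wrt standard unit vectors)
print(np.dot(c1, c2))            # 5.0  (dot product wrt the basis {w1, w2})
```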
Suppose that a finite energy signal s(t) is sampled over the same time interval at the rate fs and at the rate 2fs:
\[
\vec{v}_1 = \Bigl( s\bigl(\tfrac{n}{f_s}\bigr) \Bigr)_{n=1}^{N}, \qquad
\vec{v}_2 = \Bigl( s\bigl(\tfrac{n}{2 f_s}\bigr) \Bigr)_{n=1}^{2N} .
\]
Both v⃗1, v⃗2 contain samples of s(t) for 0 < t < N/fs, but v⃗2 contains twice as many samples. Consequently the ordinary dot product ⟨v⃗2, v⃗2⟩ is roughly twice as large as ⟨v⃗1, v⃗1⟩, even though both vectors represent the same waveform.
In order for the dot product of discrete signals to be independent of the sampling frequency we define:
\[
\langle s_1(n), s_2(n) \rangle \triangleq \sum_n \frac{s_1(n)\, s_2(n)}{f_s}
\]
as the dot product of two discrete signals sampled at the rate fs (both signals
need to be sampled at the same rate fs so that they both have the same number
of samples in any given time interval).
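The following Python sketch illustrates why the division by fs is needed: the same pair of waveforms (an illustrative choice) is sampled at two different rates, and the normalized dot product comes out the same in both cases, approximating the integral of s1(t)s2(t) over the observation interval.

```python
import numpy as np

def dot_fs(s1, s2, fs):
    """Dot product of two discrete signals sampled at rate fs (definition above)."""
    return np.sum(s1 * s2) / fs

T = 1.0                                   # observation interval (illustrative)
for fs in (100.0, 1000.0):
    t = np.arange(1, int(T * fs) + 1) / fs
    s1 = np.sin(2 * np.pi * 3 * t)
    s2 = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 5 * t)
    print(fs, dot_fs(s1, s2, fs))         # both ~ 0.5, independent of fs
```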
We now have two different dot products; the context will make clear which one is meant. For finite energy continuous-time signals,
\[
\langle s_1(t), s_2(t) \rangle \triangleq \int_{-\infty}^{\infty} s_1(t)\, s_2(t)\, dt
\]
is the dot product between two signals. The concepts of distance, norm and
projection introduced in definition B.1.2 and theorem 62 naturally extend to
signals. We then have:
1. d(s1(t), s2(t)) = √⟨s1(t) − s2(t), s1(t) − s2(t)⟩ ;

2. |s1(t)| = √⟨s1(t), s1(t)⟩ ;

3. ⟨s1(t), s2(t)⟩/|s2(t)| is the length of the projection of signal s1(t) on signal s2(t). In particular, if |s2(t)| = 1 then ⟨s1(t), s2(t)⟩ is the length of the projection of signal s1(t) on signal s2(t).
Step 2 ϕ2(t) = normalize(s2(t) − ⟨s2(t), ϕ1(t)⟩ϕ1(t)) if this signal is different from 0.
Step 3 ϕ3(t) = normalize(s3(t) − ⟨s3(t), ϕ1(t)⟩ϕ1(t) − ⟨s3(t), ϕ2(t)⟩ϕ2(t)) if this signal is different from 0.
Step 4 . . . and so on . . .
Example B.4.1. (W&J, pp 269 - 272) Find an orthonormal basis for the signals
s1 (t), s2 (t), s3 (t) and s4 (t) sketched in figure B.6 and restricted to the interval
t ∈ [0, 3].
Figure B.6:
\[
\langle \theta_1(t), \theta_1(t) \rangle = \langle s_1(t), s_1(t) \rangle = \int_0^3 s_1^2(t)\, dt
= \int_0^1 (2)^2\, dt + \int_1^2 (-2)^2\, dt + \int_2^3 (2)^2\, dt = 12
\]
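The computation above is easily reproduced numerically by approximating the integral with samples. In the sketch below s1(t) is taken to be 2 on [0, 1), −2 on [1, 2) and 2 on [2, 3), as read off the integral above (figure B.6 itself is not reproduced), and ϕ1(t) = θ1(t)/√12 would then be the first function returned by the Gram-Schmidt procedure.

```python
import numpy as np

fs = 1000.0                               # sampling rate of the numerical approximation
t = np.arange(0, 3, 1 / fs)

# s1(t) = 2 on [0,1), -2 on [1,2), 2 on [2,3)
s1 = np.where((t >= 1) & (t < 2), -2.0, 2.0)

energy = np.sum(s1 * s1) / fs             # approximates <s1(t), s1(t)> = 12
phi1 = s1 / np.sqrt(energy)               # first orthonormal function
print(energy)                             # ~12.0
print(np.sum(phi1 * phi1) / fs)           # ~1.0 : phi1(t) is normalized
```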
Figure B.7:
Figure B.8:
\[
s_i(t) = \sum_{j=1}^{N} s_{ij}\, \phi_j(t), \qquad i = 0, 1, \ldots, M-1,
\]
where ϕj (t) are N appropriately chosen signals and N ≤ M . The set of signals
{ϕj (t)} is not unique (for example, the Gram-Schmidt procedure may give a
different set of orthonormal functions if the signals are simply reordered). The
Gram-Schmidt procedure yields a minimal set of orthonormal functions, the
cardinality of which is called the dimensionality of the set of signals. The coefficients
sij in the above equation are given by:
\[
s_{ij} = \langle s_i(t), \phi_j(t) \rangle = \int_{-\infty}^{\infty} s_i(t)\, \phi_j(t)\, dt .
\]
Figure B.9:
We also have
\[
\langle s_i(t), s_k(t) \rangle = \sum_{j=1}^{N} \langle s_i(t), \phi_j(t) \rangle\, \langle s_k(t), \phi_j(t) \rangle
\]
and more generally, for any finite energy signal r(t) and any finite energy signal s(t) lying in the space generated by the orthonormal functions ϕ1(t), . . . , ϕN(t), we also have:
\[
\langle r(t), s(t) \rangle = \sum_{j=1}^{N} \langle r(t), \phi_j(t) \rangle\, \langle s(t), \phi_j(t) \rangle
\]
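These relationships are easy to verify numerically with sampled signals and the fs-normalized dot product introduced earlier. In the sketch below the orthonormal functions are an illustrative choice (normalized sinusoids on [0, 1)), not the ϕj(t) of any particular example, and r(t), s(t) are built directly as linear combinations of them.

```python
import numpy as np

fs = 1000.0
t = np.arange(0, 1, 1 / fs)

def dot(a, b):
    # fs-normalized dot product, approximating the integral of a(t)b(t)
    return np.sum(a * b) / fs

# An illustrative orthonormal set on [0, 1): normalized sines
phis = [np.sqrt(2) * np.sin(2 * np.pi * k * t) for k in (1, 2, 3)]

# Two signals lying in the space generated by {phi_j}
r = 1.0 * phis[0] - 2.0 * phis[1] + 0.5 * phis[2]
s = 3.0 * phis[0] + 1.0 * phis[2]

lhs = dot(r, s)
rhs = sum(dot(r, p) * dot(s, p) for p in phis)
print(lhs, rhs)     # both ~ 1*3 + 0.5*1 = 3.5
```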
Figure B.10:
Theorem 64. Let nw(t) be a 0-mean white Gaussian noise with power spectral density
\[
S_{n_w}(f) = N_0/2
\]
and define, for N orthonormal finite energy functions ϕ1(t), . . . , ϕN(t),
\[
n_j \triangleq \int_{-\infty}^{\infty} n_w(t)\, \phi_j(t)\, dt, \quad j = 1, \ldots, N, \qquad
n(t) \triangleq \sum_{j=1}^{N} n_j\, \phi_j(t), \qquad
r_2(t) \triangleq n_w(t) - n(t) .
\]
Then n(t) and r2(t) are jointly Gaussian 0-mean random processes, and r2(t) is statistically independent of (n1, n2, . . . , nN).
Sketch of proof. We assume that expectations and integrals are freely interchangeable.
Figure B.11:
1. We have:
\[
n_j = \int_{-\infty}^{\infty} n_w(t)\, \phi_j(t)\, dt \;\equiv\; \text{linear operations on the Gaussian process } n_w(t),
\]
so that (n1, n2, . . . , nN) is a Gaussian random vector and n(t) = Σj nj ϕj(t) is a Gaussian random process.
4. r2 (t) = nw (t)−n(t) where nw (t), n(t) are both Gaussian random processes
and therefore r2 (t) is also a Gaussian random process.
5. Both n(t), r2 (t) result from operations on the same random process nw (t);
n(t) and r2 (t) are then jointly Gaussian random processes.
6. In order to show that n(t) and r2(t) are 0-mean we first calculate:
\[
E(n_j) = \int_{-\infty}^{\infty} E\bigl(n_w(t)\bigr)\, \phi_j(t)\, dt = 0, \qquad \forall j .
\]
It follows that
\[
E\bigl(n(t)\bigr) = E\Bigl( \sum_{j=1}^{N} n_j\, \phi_j(t) \Bigr)
= \sum_{j=1}^{N} \underbrace{E(n_j)}_{0,\ \forall j}\, \phi_j(t) = 0
\]
and, similarly, E(r2(t)) = E(nw(t)) − E(n(t)) = 0.
Thus n(t) and r2(t) are jointly Gaussian 0-mean random processes. Even though this is not required at this time, we next calculate E(ni nj); we'll need this result below:
\[
E(n_i n_j) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} E\bigl(n_w(t)\, n_w(s)\bigr)\, \phi_i(t)\, \phi_j(s)\, dt\, ds
= \frac{N_0}{2} \int_{-\infty}^{\infty} \phi_i(t)\, \phi_j(t)\, dt
= \frac{N_0}{2}\, \delta_{i,j} .
\]
Finally, to show that r2(t) is independent of n1, . . . , nN we compute
\[
E\bigl( n(t)\, r_2(s) \bigr)
= E\Bigl( \sum_{j=1}^{N} n_j\, \phi_j(t)\, \bigl( n_w(s) - n(s) \bigr) \Bigr)
= \sum_{j=1}^{N} \phi_j(t)\, E\Bigl( n_j \bigl( n_w(s) - n(s) \bigr) \Bigr) \qquad \text{(B.1)}
\]
where, for each j,
\[
\begin{aligned}
E\Bigl( n_j \bigl( n_w(s) - n(s) \bigr) \Bigr)
&= \Bigl\langle \frac{N_0}{2}\, \delta(t-s),\ \phi_j(t) \Bigr\rangle - \sum_{i=1}^{N} \frac{N_0}{2}\, \delta_{j,i}\, \phi_i(s) \\
&= \frac{N_0}{2} \int_{-\infty}^{\infty} \delta(t-s)\, \phi_j(t)\, dt - \frac{N_0}{2}\, \phi_j(s) \\
&= \frac{N_0}{2}\, \phi_j(s) - \frac{N_0}{2}\, \phi_j(s) \\
&= 0 .
\end{aligned}
\]
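The content of theorem 64 can be illustrated by a Monte Carlo sketch. White Gaussian noise of power spectral density N0/2 is approximated in discrete time by independent samples of variance (N0/2)fs, the projections nj are computed with the fs-normalized dot product, and the sample averages are compared with E(ni nj) = (N0/2)δi,j and E(nj r2(s)) = 0. The orthonormal functions below are an illustrative choice, and the discrete-time approximation of the noise is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N0, fs, T = 2.0, 100.0, 1.0
N = int(T * fs)
t = np.arange(N) / fs

# Two orthonormal functions on [0, T) (illustrative choice)
phi1 = np.sqrt(2) * np.sin(2 * np.pi * t)
phi2 = np.sqrt(2) * np.sin(4 * np.pi * t)

# Discrete-time approximation of white Gaussian noise with PSD N0/2:
# independent samples of variance (N0/2)*fs, one noise realization per row
trials = 20000
nw = rng.normal(0.0, np.sqrt(N0 / 2 * fs), size=(trials, N))

n1 = nw @ phi1 / fs                             # n_j = <n_w(t), phi_j(t)>
n2 = nw @ phi2 / fs
n = np.outer(n1, phi1) + np.outer(n2, phi2)     # n(t) = sum_j n_j phi_j(t)
r2 = nw - n                                     # r2(t) = n_w(t) - n(t)

print(np.mean(n1 * n1), np.mean(n2 * n2))       # both ~ N0/2 = 1.0
print(np.mean(n1 * n2))                         # ~ 0
print(np.abs(np.mean(n1[:, None] * r2, axis=0)).max())
# last value stays small compared with (N0/2)*max|phi1(s)| ~ 1.4 : r2(t) is uncorrelated with n1
```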