
Lecture Notes: An Introduction to the Theory of

Statistical Communication

Germain Drolet
Department of Electrical & Computer Engineering,
Royal Military College of Canada,
P.O. Box 17000, Station Forces,
Kingston, Ontario, CANADA
K7K 7B4

Tel: (613) 541-6000, extension: 6192


Fax: (613) 544-8107
Email: [email protected]

Copyright © 2006 by G. Drolet. All rights reserved.
Permission is granted to make and distribute verbatim copies of these notes
provided the copyright notice and this permission notice are preserved on all
copies.
Preface

The main purpose of these notes is to complement and guide a student through
the study of Chapters 1, 2, 3, 4 in the textbook

J.M. Wozencraft, I.M. Jacobs, Principles of Communications Engineering,
Waveland Press Inc., U.S.A., 1965 (reprinted 1990), ISBN 0-88133-554-1.

These notes may not be considered a replacement for the textbook since many
proofs and discussions are incomplete. Instead we present sketches of the proofs
and illustrate them with some diagrams in order to help understand the for-
mal proofs presented in the textbook. The more concise format of the notes
may additionally be helpful in quickly locating a concept in the textbook. The
textbook is required for a complete understanding of Chapters 1, 2, 3, 4. Some
examples and results borrowed from lecture notes by Dr. G.E. Séguin are in-
cluded in the notes. Specifically, the general approach of Dr. Séguin’s notes is
used in the presentation of random processes in Chapter 3.
The first objective of the course for which these notes are written is to
provide a foundation in the Theory of Probability, Random Variables, Random
Processes for the Electrical and Computer Engineer. The second objective of
the course is to apply the concepts of probability to the Theory of Detection in
communication systems and to present the foundation for Coding Theory.
The notes are organized as follows:

Chapter 2 up to §2.12: covers W&J Chapter 2.

Chapter 2 §2.12 to §2.13: covers specific results on Random Vectors and


Gaussian Random Vectors. This material is contained in W&J Chap-
ter 3, but we feel that it is beneficial to present the results before the
definition of Random Processes; some of the results are taken from Dr.
Séguin’s notes.

Chapter 3: follows the approach of Dr. Séguin’s notes and covers the remain-
der of W&J Chapter 3.

Chapter 4: covers W&J Chapter 4.


Chapter 5: presents the foundation of Coding Theory. The material is taken


from Dr. Séguin’s lecture notes for a course taught in 1984 at the Uni-
versité Laval, Ste-Foy, Québec, QC, CANADA. Some of the results are
contained in W&J Chapter 5. This chapter is self-contained and no refer-
ences are made to the textbook by W&J.

Suggestions and comments to improve the notes are welcome.

August 2006

Germain Drolet
Department of Electrical & Computer Engineering,
Royal Military College of Canada,
P.O. Box 17000, Station Forces,
Kingston, Ontario, CANADA
K7K 7B4

Tel: (613) 541-6000, extension: 6192


Fax: (613) 544-8107
Email: [email protected]
Contents

Preface iii

2 Probability Theory 1
2.1 Fundamental Definitions . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Communication problem . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Random variables and distribution functions . . . . . . . . . . . 14
2.4 Transformation of random variables . . . . . . . . . . . . . . . . 28
2.4.1 Calculate Fy (α) first, then py (α) = dFy (α)/dα . . . . . . . . . 28
2.4.2 y = f (x) where f ( ) is differentiable and non-constant on
any interval . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 Reversible transformation of random variables . . . . . . 30
2.4.4 Transformation of a uniformly distributed random variable 34
2.4.5 Impulses in px (α) (Wozencraft & Jacobs, pages 64, 65) . . 39
2.5 Conditional probability density . . . . . . . . . . . . . . . . . . . 41
2.6 Statistical Independence . . . . . . . . . . . . . . . . . . . . . . . 43
2.7 Mixed probability expressions . . . . . . . . . . . . . . . . . . . . 47
2.8 Statistical independence . . . . . . . . . . . . . . . . . . . . . . . 49
2.9 Communication example . . . . . . . . . . . . . . . . . . . . . . . 52
2.9.1 Details of calculation of P (C ) . . . . . . . . . . . . . . . . 53
2.9.2 Design of the optimal receiver for a channel with additive
Gaussian noise . . . . . . . . . . . . . . . . . . . . . . . . 56
2.9.3 Performance of the optimal receiver for a channel with
additive Gaussian noise . . . . . . . . . . . . . . . . . . . 57
2.10 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.10.1 Moments of a Gaussian random variable . . . . . . . . . . 63
2.10.2 Moments of some specific random variables . . . . . . . . 65
2.11 Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.11.1 Weak law of large numbers . . . . . . . . . . . . . . . . . 70
2.11.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . 71
2.11.3 Chernoff Bound . . . . . . . . . . . . . . . . . . . . . . . . 73
2.12 Moments (mixed) of Random Vectors . . . . . . . . . . . . . . . 83
2.13 Gaussian Random Vectors . . . . . . . . . . . . . . . . . . . . . . 89
2.13.1 Definition: . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.13.2 Properties of jointly Gaussian Random Variables: . . . . . 90


2.13.3 Joint Gaussian Density Function . . . . . . . . . . . . . . 91


2.14 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

3 Random Waveforms 101


3.1 Fundamental Definitions . . . . . . . . . . . . . . . . . . . . . . . 101
3.2 Stationariness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.3 Moment Functions of a Random Process . . . . . . . . . . . . . . 109
3.4 Correlation Functions and Power Spectra . . . . . . . . . . . . . 123
3.5 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.6 Gaussian random processes . . . . . . . . . . . . . . . . . . . . . 133
3.6.1 Jointly Gaussian Processes . . . . . . . . . . . . . . . . . 135
3.6.2 White Gaussian Noise . . . . . . . . . . . . . . . . . . . . 139
3.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

4 Optimum Receiver Principles 147


4.1 Basic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.2 Vector Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.3 Waveform Channels . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.3.1 Waveform synthesis . . . . . . . . . . . . . . . . . . . . . 153
4.3.2 Recovery of signal vectors . . . . . . . . . . . . . . . . . . 154
4.3.3 Irrelevant Data . . . . . . . . . . . . . . . . . . . . . . . . 154
4.3.4 Theorem of Irrelevance . . . . . . . . . . . . . . . . . . . . 156
4.4 Receiver Implementation . . . . . . . . . . . . . . . . . . . . . . . 160
4.5 Probability of Error . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.5.1 Rectangular Signal Sets . . . . . . . . . . . . . . . . . . . 169
4.5.2 Orthogonal and Related Signal Sets . . . . . . . . . . . . 176
4.5.3 Completely Symmetric Signal Sets and a priori Knowledge . . 180
4.5.4 Union Bound on the Probability of Error . . . . . . . . . 181
4.6 Bit Error Rate and Signal-to-Noise Ratio per Bit . . . . . . . . . 183
4.7 Problems: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

5 Reliable Communication on an Unreliable Channel 187


5.1 Codes of size M = 2 . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.1.1 Memoryless channel . . . . . . . . . . . . . . . . . . . . . 192
5.1.2 Memoryless channel and A = {0, 1} . . . . . . . . . . . . 193
5.2 P (E ) averaged over all possible C ⊂ A N , |C | = M = 2 . . . . . 194
5.2.1 Memoryless channel . . . . . . . . . . . . . . . . . . . . . 196
5.2.2 Special case 2: memoryless channel and A = {0, 1} . . . . 198
5.3 P (E ) averaged over all possible C ⊂ A N , |C | = M . . . . . . . . 199

A Table of the Q( ) and erf( ) functions 203

B Orthonormal Expansion and Vector Representation of Continuous Signals 205
B.1 Three-dimensional vectors in R3 : . . . . . . . . . . . . . . . . . . 206
B.2 Vectors obtained by sampling of a waveform . . . . . . . . . . . . 211

B.3 Effect of sampling frequency . . . . . . . . . . . . . . . . . . . . . 212


B.4 Continuous time signals . . . . . . . . . . . . . . . . . . . . . . . 214
Chapter 2

Probability Theory

Probability is at the essence of communication, and this explains why the
study of communication systems must be preceded by that of probability
theory. Specifically,
• A communication system comprises, among other things, a source that
sends a message. As explained in W&J, chap 1, page 5, the message
produced by the source is random; if we knew what the message was going
to be, then there would be no need for communication. Thus randomness
is inherent to the study of communication systems.
• Also, considering the transmission of a large number of bits through an
imperfect channel, we know from experience that a fraction of the bits are
received incorrectly. This fraction is referred to as the bit error rate. In
chapter 4 we study, among other things, the probability of bit error P (E)
of some specific communication systems, and we would expect P (E) to be
the theoretical model of the experimental bit error rate.¹ Thus again the
study of probability is seen as necessary to understand communication
systems.
Probability is concerned with making precise numerical statements and pre-
dictions about experiments/procedures of which the outcome cannot be pre-
dicted. Each realization of the experiment usually produces a different outcome
which can only be obtained by performing the experiment. Our goal is to analyze
the unpredictable nature of communication systems and design the best receiver,
i.e. the receiver that will most often correctly estimate the transmitted message.
How can one make precise numerical statements about an unpredictable exper-
iment? Even though the outcome of the experiment is unpredictable, the rate
of occurrence of a specific outcome may be predictable. Consider for example:
If a die is tossed a very large number N of times, the outcome 1 will
likely occur close to N/6 times and an outcome that is a multiple of 3
(i.e. 3 or 6) close to N/3 times.
1 Although wrong in general, this statement holds in some special cases.


2.1 Fundamental Definitions


Refer to Wozencraft & Jacobs pp 13 - 20 for the details.
The following definitions are intuitive and serve as a basis for the formal
axiomatic definition of a probability system that will follow.

Definition 2.1.1. A random experiment is a procedure producing an outcome


that cannot be predicted accurately and with certainty at the same time. We
have:

1. outcome: produced by a random experiment,

2. set of all outcomes: which may be infinite,

3. result: set of some (or all or none) of the outcomes,

4. a sequence of length N is a list of N outcomes obtained by repeating a


random experiment N times.

5. N (seq, A) is the number of occurrences of result A in sequence seq.

6. fN (seq, A) = N (seq, A)/N is the relative frequency of occurrence of result A
in sequence seq of length N .

7. relative frequency of occurrence of a result A, denoted by f∞ (A), is defined
by

f∞ (A) ≜ lim_{N→∞} fN (seq, A) = lim_{N→∞} N (seq, A)/N

Remarks.

1. We need not specify the sequence in f∞ (A) since, as N → ∞, all sequences
will converge to the same ratio; this is called statistical regularity and will
be discussed in more detail in §2.11.1.

2. Notice that Dom(f∞ ) = {all results} ̸= {all outcomes} (W&J, page 14).

3. As shown in W&J, pp 14 - 15, f∞ (A ∪ B) = f∞ (A) + f∞ (B) when A, B


are mutually exclusive results.

Example 2.1.1.

random experiment: tossing of a die,

set of all outcomes: {1, 2, 3, 4, 5, 6},

result: {1} and {3, 6} ≡ {multiples of 3} are two examples of results,



relative frequency of occurrence:

f∞ ({1}) = lim_{N→∞} N (seq, {1})/N = 1/6

f∞ ({1, 3}) = lim_{N→∞} N (seq, {1, 3})/N
            = lim_{N→∞} [N (seq, {1}) + N (seq, {3})]/N
            = lim_{N→∞} N (seq, {1})/N + lim_{N→∞} N (seq, {3})/N
            = f∞ ({1}) + f∞ ({3}) = 1/3
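
The statistical regularity invoked here is easy to observe empirically. The following is a minimal simulation sketch (added for illustration, not part of the original notes) that estimates the relative frequencies fN (seq, {1}) and fN (seq, {1, 3}) for a fair die and shows them approaching 1/6 and 1/3 as N grows.

```python
import random

# Sketch: estimate f_N(seq, {1}) and f_N(seq, {1,3}) for a fair die
# and watch them approach 1/6 and 1/3 as N grows.
random.seed(0)

for N in (100, 10_000, 1_000_000):
    seq = [random.randint(1, 6) for _ in range(N)]
    f_1 = sum(1 for outcome in seq if outcome == 1) / N
    f_13 = sum(1 for outcome in seq if outcome in (1, 3)) / N
    print(f"N={N:>9}  f_N({{1}})={f_1:.4f}  f_N({{1,3}})={f_13:.4f}")
```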

The above definitions are adequate to describe the physical concept but too
loose and imprecise to describe a mathematical concept. We next give the formal
axiomatic definition of probability system. Under certain physical conditions a
random experiment can be modeled as a probability system and obeys the same
laws. After giving the definition and its immediate consequences, we illustrate
the relationship between “probability system” and “random experiment”. This
abstract definition should be well understood; this will make it easier to grasp
the concepts of random variables and random processes later.

Definition 2.1.2. (Axiomatic definition) A probability system consists of a non-


empty collection of objects Ω, a non-empty collection F of subsets of Ω and a
function P : F → R satisfying the following (axioms):

1. Ω ∈ F

2. A ∈ F ⇒ Ā ∈ F

3. A1 , A2 , . . . , An , An+1 , . . . ∈ F ⇒ A1 ∪ A2 ∪ A3 ∪ . . . ∈ F , i.e. F is closed


under countable unions.

4. P (Ω) = 1.

5. P (A) ≥ 0, ∀A ∈ F .

6. A1 , A2 , . . . ∈ F and Ai ∩ Aj = ∅, ∀j ̸= i (pairwise disjoints) ⇒ P (A1 ∪


A2 ∪ A3 ∪ . . .) = P (A1 ) + P (A2 ) + . . .

Ω is called sample space and its elements are called sample points. The elements
of F are called events (F is called class of events). A probability space will be
denoted by (Ω, F , P : F → [0, 1] ⊂ R) or simply (Ω, F , P ).

Remarks.

1. The definitions given by Wozencraft & Jacobs are limited to the case
|F | ̸= ∞ (W&J, page 20).

2. If F = 2^Ω when |Ω| = ∞ there may not exist a function P satisfying the
axioms (4), (5) and (6). It can be shown that for any Ω, finite or not, there
is at least one F ⊂ 2^Ω for which a probability measure exists (i.e. satisfying
axioms (1) to (6)).
Theorem 1. Axioms (1) to (6) imply the following:
1. A1 , A2 , . . . , An , An+1 , . . . ∈ F ⇒ any set obtained by a Boolean function
on the events Ai also lies in F .
2. ∅ ∈ F and P (∅) = 0
3. P (Ā) = 1 − P (A), ∀A ∈ F
4. 0 ≤ P (A) ≤ 1, ∀A ∈ F
5. A ⊂ B ⇒ P (A) ≤ P (B), ∀A, B ∈ F
6. P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
7. A1 ⊂ A2 ⊂ A3 ⊂ . . . ∈ F ⇒ P (∪_{i=1}^∞ Ai ) = lim_{n→∞} P (An )

8. A1 ⊃ A2 ⊃ A3 ⊃ . . . ∈ F ⇒ P (∩_{i=1}^∞ Ai ) = lim_{n→∞} P (An )

Sketch of proof: (2) to (6) only. (1) is hard to prove. (7) and (8) are called
continuity properties and are also difficult to prove; they will be used to prove
the continuity of cumulative distribution functions (defined later).
(3) follows from A ∪ Ā = Ω and A ∩ Ā = ∅.
(2) follows from axiom (3).
(4) follows from axioms (3) and (4).
(5) follows from B = A ∪ (B − A) and property (1).
(6) follows from A ∪ B = A ∪ (Ā ∩ B) and B = (A ∩ B) ∪ (Ā ∩ B).

Examples of probability system: Refer to Wozencraft & Jacobs pages 21 - 23
for some examples with
1. finite sample space,
2. real line sample space. In the example “Real-line sample space,”² F is the
set of open and closed intervals of [0, 1] = Ω, plus (enumerable) unions,
intersections and complements of such intervals. Clearly F ≠ 2^Ω since
I = {x ∈ [0, 1] : no “7” in the decimal representation of x} ⊂ Ω
but I ∉ F . As stated in remark 2 above, F must be carefully
specified to ensure the axioms are all satisfied and it may be necessary to
use F ⊊ 2^Ω.
2 Wozencraft & Jacobs, page 21

The correspondence between “probability system” and “random experiment”
is summarized in the following table:

    Probability System        Random Experiment
    -------------------       --------------------------------
    Sample space              Set of all outcomes
    Sample point              Outcome
    Event                     Result
    Probability measure       Relative frequency of occurrence

Remarks.

1. The axiomatic definition has the following advantages:

(a) It is very precise and formal and uses objects/concepts suitable to


the development of a strong theory.
(b) It does not rely on the existence of a physical random experiment.3
We can construct many examples of probability systems more freely.

2. The axiomatic definition has the disadvantage of being suitable to the
modeling of a random experiment only under very specific controlled con-
ditions. We have to be careful when applying a specific law from probability
theory in the context of a real-world random experiment.

We will often refer to simple real-world random experiments for illustration
purposes. Conversely, for every random experiment we can construct a
corresponding probability system to be used as an idealized mathematical model.
This is done as follows:

1. Ω ≜ set of all possible outcomes,

2. F ≜ set of all results whose probability may have to be calculated, plus
any other subsets of Ω required to satisfy the axioms of a class of events
(if Ω is finite it is most convenient to take all the subsets of Ω),

3. P (A) ≜ f∞ (A), ∀A ∈ F .

Relation of the Model to the Real World: Wozencraft & Jacobs pages
24 - 29.
This is a difficult paragraph and may be viewed as comprising two intercon-
nected parts:

1. Of primary importance are the definitions of compound experiment and


binomial distribution (this is used in the problems) from the bottom of
page W&J 25 after equation (2.15c) up to the paragraph starting with
“Our primary interest . . . ” on page W&J 27.
3 refer to the examples in “Real line sample space”, W&J, page 21, for a purely mathemat-

ical construction.

2. Of secondary importance (remainder of this paragraph from page W&J 27


to the bottom of page W&J 29) W&J illustrates that in a long sequence of
M (independent) outcomes of a random experiment, the fraction of these
outcomes that lie in an event A of probability p = P (A), is very likely
close to p, or in other words f∞ (A) = P (A). This will be explained again
with the Weak Law of Large Numbers in §2.11.1 and may be skipped for
now.
Definition 2.1.3. A compound experiment is a random experiment which con-
sists of a sequence of M independent (defined later) trials of a simpler experi-
ment.
If A is an event of the simpler experiment with P (A) = p ̸= 0, P (Ā) = 1 − p,
then the following set is a sample space of the compound experiment (M trials):
ΩM = {(A, A, . . . , A),(Ā, A, A, . . . , A), (A, Ā, A, A, . . . , A),
(Ā, Ā, A, A, . . . , A), . . . , (Ā, Ā, . . . , Ā)}
Let FM be the set of all subsets of ΩM , i.e. FM = 2^ΩM . We also define the
mapping:

PM : FM → [0, 1]
PM : {x} ↦ p^(# of A in x) (1 − p)^(# of Ā in x)

for any sequence x ∈ ΩM , and this together with the axioms of a probability
measure completely determines PM for every event E ∈ FM :

PM (E) = Σ_{x∈E} p^(# of A in x) (1 − p)^(# of Ā in x)

One can easily verify that (ΩM , FM , PM ) is a probability system, i.e. FM is a
valid class of events and PM is a valid probability measure. The following M + 1
events are of special interest:

A^i = {x ∈ ΩM : x has i occurrences of A and M − i occurrences of Ā} ∈ FM ,

for i = 0, 1, . . . , M . It is then seen (bottom of page 26) that

PM (A^i) = (M choose i) p^i (1 − p)^(M−i) ,   i = 0, 1, . . . , M,

where we recall that (M choose i) = M !/(i!(M − i)!). This is called the binomial
distribution (other distributions will be defined later). We see that

A^i ∩ A^j = ∅ whenever i ≠ j,   and   ∪_{i=0}^M A^i = ΩM ,

so we expect [from axioms (4) and (6)] Σ_{i=0}^M PM (A^i) = PM (ΩM ) = 1. This is
verified by the binomial theorem (top of page 27).
Example 2.1.2. The probability of obtaining exactly five heads when throwing a
fair coin 15 times is (15 choose 5) (1/2)^5 (1/2)^10 ≈ 0.0916.
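
As a quick check of the binomial expression, the following sketch (added here, not from the notes) evaluates PM (A^i) for the coin example, i.e. M = 15, p = 1/2, i = 5, and also verifies that the M + 1 probabilities sum to 1.

```python
from math import comb

def binom_pmf(M, i, p):
    """P_M(A^i) = C(M, i) p^i (1 - p)^(M - i)."""
    return comb(M, i) * p**i * (1 - p)**(M - i)

M, p = 15, 0.5
print(f"P(5 heads in 15 tosses) = {binom_pmf(M, 5, p):.4f}")   # ~0.0916
print(f"sum over i of P_M(A^i)  = {sum(binom_pmf(M, i, p) for i in range(M + 1)):.4f}")  # 1.0
```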

Conditional probability: Wozencraft & Jacobs pages 29 - 31.

Definition 2.1.4. If A, B ∈ F and P (B) ≠ 0, then P (A|B) = P (A ∩ B)/P (B) is
called the probability of event A conditioned on event B.

This definition is consistent since by axioms (2) and (3), A ∩ B ∈ F .


(Ω, F , P ( |B)) is easily seen to be a probability system. Conditional prob-
ability is the model for the relative frequency of occurrence of an event A given
that an event B occurs. To illustrate this we consider the random experiment,
the outcome of which is the sum obtained by the throw of two fair dice. We
define the following results:

A = {sum of dice is either 4, 5 or 6}


B = {sum of dice is an even number}

A sequence of N outcomes may look like:

seq = 2, 8, 9, 5, 10, 4, 7, 8, . . . , 6

In order to obtain the relative frequency of occurrence of A given that B has


occurred, we first extract the even numbered outcomes out of seq:

seqshortened = 2, 8, 10, 4, 8, . . . , 6

and calculate the fraction of times the outcome is 4, 5 or 6 in seqshortened :

relative frequency of occurrence of “sum is 4, 5 or 6” given “sum is even”
  = N (seq_shortened , A) / (length of seq_shortened)
  = N (seq, A ∩ B) / N (seq, B)
  = [N (seq, A ∩ B)/N ] / [N (seq, B)/N ]
  = fN (seq, A ∩ B) / fN (seq, B)

The latter expression is of the same form as the expression used to define con-
ditional probability. This intuitively explains that P (A|B) as defined above is
a mathematical model for the relative frequency of occurrence of a result given
that another result has occurred.
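
The identity fN (seq, A ∩ B)/fN (seq, B) derived above is easy to check numerically. The sketch below (an illustration added here, not from the notes) simulates throws of two fair dice with A = {sum is 4, 5 or 6} and B = {sum is even}; the exact conditional probability is P (A ∩ B)/P (B) = (8/36)/(18/36) = 4/9.

```python
import random

# Sketch: conditional relative frequency for two fair dice,
# A = {sum is 4, 5 or 6}, B = {sum is even}.  Exact value: 4/9.
random.seed(1)
N = 500_000
sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N)]

n_B = sum(1 for s in sums if s % 2 == 0)                        # N(seq, B)
n_AB = sum(1 for s in sums if s % 2 == 0 and s in (4, 5, 6))    # N(seq, A ∩ B)

print(f"f_N(A∩B)/f_N(B) = {n_AB / n_B:.4f}   (4/9 = {4/9:.4f})")
```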
If we define:
FB = {A ∩ B|A ∈ F } ,
one can also verify that (B, FB , P ( |B)) is another probability system and
corresponds to the hatched portion of figure 2.8 (Wozencraft & Jacobs, page
30).

Theorem 2.

1. If A1 , A2 , . . . ∈ F and Ai ∩ Aj ∩ B = ∅, ∀i ≠ j, then:

P (A1 ∪ A2 ∪ . . . |B) = Σ_{i=1}^∞ P (Ai |B).

If moreover A1 ∪ A2 ∪ . . . ⊃ B then Σ_{i=1}^∞ P (Ai |B) = 1; this is known as
the theorem of total probability.⁴

2. Let A1 , A2 , . . . , An ∈ F be pairwise disjoint and P (Aj ) ≠ 0, ∀j. If B ∈ F
satisfies P (B) ≠ 0 and B ⊂ ∪_{j=1}^n Aj , then

(a) P (B) = Σ_{j=1}^n P (Aj )P (B|Aj ).

(b) P (Ai |B) = P (Ai )P (B|Ai ) / Σ_{j=1}^n P (Aj )P (B|Aj ), for any i = 1, 2, . . . , n;
this is known as Bayes theorem.⁵

Proof. We prove the second part of the theorem only; the proof of the first part
is left as an exercise.

1. B ⊂ ∪_{j=1}^n Aj ⇒ B = B ∩ (∪_{j=1}^n Aj ) = ∪_{j=1}^n (B ∩ Aj ) and all the B ∩ Aj
are pairwise disjoint. It follows that P (B) = Σ_{j=1}^n P (B ∩ Aj ), from which
the result follows.

2. P (Ai |B) = P (Ai )P (B|Ai )/P (B) from equation (2.21) in W&J. The desired result
follows from part 1 of the theorem.

Bayes theorem is useful in situations where P (B|Aj ) is given for every j but
P (Aj |B) is not directly known. The following example illustrates this.

Example 2.1.3. (Refer to figure 2.1) We are given three boxes A1 , A2 , A3


containing coloured balls as follows:

A1 : 2 red balls and 3 black balls,


A2 : 3 red balls and 5 black balls,
A3 : 4 red balls and 4 black balls.

The (random) experiment consists in first choosing a box at random among the
three boxes (equiprobably, i.e. each has a probability 1/3 of being chosen) and
then draw a ball at random from the box chosen. Let B denote the event “a red
ball has been drawn”. Calculate the probability that the ball was drawn from
box number 2 if it is red, i.e. P (A2 |B).
4 cf: Wozencraft & Jacobs page 31
5 cf: Wozencraft & Jacobs problem 2.6

Solution: From the data given in the problem we have that P (Ai ) = 1/3, i =
1, 2, 3, and clearly B ⊂ ∪_{j=1}^3 Aj . It follows from the theorem that:

P (A2 |B) = P (A2 )P (B|A2 ) / Σ_{j=1}^3 P (Aj )P (B|Aj )
          = (1/3)(3/8) / ((1/3) [2/5 + 3/8 + 4/8])
          = 15/51 ≈ 0.294118

Figure 2.1:

Example 2.1.4. A binary source transmits at random one of two messages


m0 or m1 , with probabilities P (m0 ) = 1/3, P (m1 ) = 2/3. The message is fed
through a random (noisy) channel of input and output respectively denoted as
T X and RX, and with the following transition probabilities:

P (RX = 0 | T X = m0 ) = 0.99
P (RX = 1 | T X = m0 ) = 0.01
P (RX = 0 | T X = m1 ) = 0.01
P (RX = 1 | T X = m1 ) = 0.99

Calculate

1. P (error | RX = 0) = P (T X = m1 | RX = 0),

2. P (error | RX = 1) = P (T X = m0 | RX = 1).

Solution. This problem is equivalent to the following. We are given two boxes
labeled as m0 , m1 which contain balls labeled 0 or 1 as follows:
• box m0 contains 99 balls labeled 0 and 1 ball labeled 1,
• box m1 contains 1 ball labeled 0 and 99 balls labeled 1,
A random experiment consists in first taking one of the two boxes at random
with probabilities P (m0 ) = 1/3, P (m1 ) = 2/3 followed by drawing one ball from
the box chosen. Calculate:
1. the probability that the ball was drawn from box m1 if it is labeled 0,
2. the probability that the ball was drawn from box m0 if it is labeled 1.
We see that the problem is solved similarly to example 2.1.3. We easily find
(verify this):
1. P (error | RX = 0) = 2/101 ≈ 19.8 × 10−3 ,
2. P (error | RX = 1) = 1/199 ≈ 5.03 × 10−3 .

Notice that an error is roughly four times more likely on reception of 0 than
on reception of 1, even though the transmission of 1 is only twice as likely as
that of 0.
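
The “verify this” step in the solution above is a direct application of Bayes theorem; the short sketch below (not part of the notes) carries out the arithmetic.

```python
# Sketch: Bayes theorem for the binary channel of example 2.1.4.
P_m0, P_m1 = 1/3, 2/3
P_rx = {                         # transition probabilities P[RX | TX]
    (0, "m0"): 0.99, (1, "m0"): 0.01,
    (0, "m1"): 0.01, (1, "m1"): 0.99,
}

for rx in (0, 1):
    P_rx_total = P_m0 * P_rx[(rx, "m0")] + P_m1 * P_rx[(rx, "m1")]
    P_m0_given = P_m0 * P_rx[(rx, "m0")] / P_rx_total
    P_m1_given = P_m1 * P_rx[(rx, "m1")] / P_rx_total
    error = P_m1_given if rx == 0 else P_m0_given
    print(f"P(error | RX={rx}) = {error:.5f}")
# prints 2/101 ~ 0.01980 and 1/199 ~ 0.00503
```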

Statistical independence: Wozencraft & Jacobs pages 31-32.


Definition 2.1.5.
1. A, B ∈ F are (statistically) independent if P (A ∩ B) = P (A)P (B) (equiv-
alently P (A|B) = P (A) or P (B|A) = P (B) when P (A)P (B) ̸= 0).
2. In general A1 , A2 , . . . , An ∈ F are (statistically) independent if
P (∩_{i∈I} Ai ) = ∏_{i∈I} P (Ai ), ∀I ⊂ {1, 2, . . . , n}, I ≠ ∅.
3. A1 , A2 , . . . , An ∈ F are pairwise (statistically) independent if P (Ai ∩
Aj ) = P (Ai )P (Aj ), for every 1 ≤ i ̸= j ≤ n.
Theorem 3. Let A1 , A2 , . . . , Am , B1 , B2 , . . . , Bn ∈ F be (statistically) inde-
pendent and define:

A ≡ event obtained from Ai using ∪, ∩ and –


B ≡ event obtained from Bj using ∪, ∩ and –

Then A and B are statistically independent.


Example 2.1.5. Without using the result of the above theorem, show that A
and B̄ are statistically independent if A and B are two statistically independent
events.

Solution: we first notice that A = (A ∩ B) ∪ (A ∩ B̄). Next,

P (A ∩ B̄) = P (A) − P (A ∩ B)
= P (A) − P (A)P (B)
= P (A)(1 − P (B))
= P (A)P (B̄)

2.2 Communication problem


Refer to Wozencraft & Jacobs, pages 33 to 37.
In this example, a probability system is constructed by combining a random
experiment (message source), a random transformation (digital communication
channel) and a deterministic transformation (decision element).
1. Source: We define:
Sample space: Ωsource = {m0 , m1 }
Class of events: 2Ωsource = {∅, {m0 }, {m1 }, {m0 , m1 }}
Probability function:

PS (∅) = 0 ,
PS ({m0 }) = Pm0 ,
PS ({m1 }) = Pm1 ,
1 = PS (Ωsource ) = PS ({m0 , m1 }) = Pm0 + Pm1 = 1 .

Pm0 , Pm1 are called a priori message probabilities.


(Ωsource , 2^Ωsource , PS ) satisfies all the axioms of a probability system.
2. Discrete communication channel: is a transformation with unpredictable
(random) characteristics. It is determined by its transition probabilities
P [rj |mi ] ≥ 0 as in figure 2.10 (W&J). To be valid it must satisfy:

P [rj |mi ] = 1
all j

for all i. Combining the source and the discrete communication channel
we define:

Sample space:

ΩDCC = {(m0 , r0 ), (m0 , r1 ), (m0 , r2 ), (m1 , r0 ), (m1 , r1 ), (m1 , r2 )}

Class of events: 2ΩDCC = {∅, {(m0 , r0 )}, {(m0 , r1 )}, . . . , ΩDCC }



Probability function:

PDCC ({(mi , rj )}) = PS ({mi })P [rj |mi ] ,

for every i and every j. One easily verifies that (you should verify this):

PDCC (ΩDCC ) = Σ_{all i} Σ_{all j} PDCC ({(mi , rj )}) = 1 .

(ΩDCC , 2ΩDCC , PDCC ) forms a probability system.


3. The decision element is a deterministic (non-random) mapping:

m̂ : {r0 , r1 , r2 } → {m0 , m1 }
m̂ : r0 ↦ m0
m̂ : r1 ↦ m1
m̂ : r2 ↦ m1

(This is just one example; there are 8 possible mappings!)

This mapping m̂ can be used together with the probability system
(ΩDCC , 2^ΩDCC , PDCC ) to define the probability system (ΩD , 2^ΩD , PD ) de-
scribed below:

Sample Space: ΩD = {(m0 , m0 ), (m0 , m1 ), (m1 , m0 ), (m1 , m1 )} where


the first component of each pair denotes the message transmitted
and the second component denotes the decision made.
Events: 2ΩD = {∅, {(m0 , m0 )}, {(m0 , m1 )}, . . . , ΩD }
Probability function: completely specified by the probability function
PDCC ( ), the decision mapping m̂( ), the axioms and the following:

PD ({(m0 , m0 )}) = PDCC ({(m0 , r0 )})
PD ({(m0 , m1 )}) = PDCC ({(m0 , r1 ), (m0 , r2 )})
PD ({(m1 , m0 )}) = PDCC ({(m1 , r0 )})
PD ({(m1 , m1 )}) = PDCC ({(m1 , r1 ), (m1 , r2 )})

(this corresponds to the above example of mapping m̂( )).

The probability system (ΩD , 2^ΩD , PD ) describes the overall operation of the
discrete communication channel. Its performance is measured by its probability
of correct decision P (C ) (page 35, W&J). We call C = {(m0 , m0 ), (m1 , m1 )} ⊂
ΩD the correct decision event. For the above example of mapping m̂( ), the
correct decision event C ⊂ ΩD corresponds to C̃ ⊂ ΩDCC given by:

C̃ = {(m0 , r0 ), (m1 , r1 ), (m1 , r2 )}

In general, C̃ is given by:

C̃ = {(mi , rj ) : m̂(rj ) = mi , ∀i ∈ {0, . . . , M − 1}, ∀j ∈ {0, . . . , J − 1}}
   = {(m̂(rj ), rj ) : j ∈ {0, . . . , J − 1}}

The probability of the event C is:

P (C ) = P (C̃) = PDCC ({(m0 , r0 )}) + PDCC ({(m1 , r1 )}) + PDCC ({(m1 , r2 )})
       = Σ_{j=0}^2 PDCC ({(m̂(rj ), rj )})
       = Σ_{j=0}^2 PS ({m̂(rj )}) P [rj |m̂(rj )]        (2.1)

From equation (2.1) it is clear that the probability of correct decision depends
on the assignment made by the decision element m̂( ). The remainder of this
paragraph (page 35 in W&J) shows how the mapping (decision element) can
be chosen in order to maximize P (C ) given a priori message probabilities
P ({mi }) = Pmi and channel transition probabilities P [rj |mi ]. It is helpful
to first consider a numerical example.

Example 2.2.1. Consider the discrete communication channel of figure 2.2.


The a priori and transition probabilities are shown in the figure and we consider
the specific decision element also shown in the figure. We want to calculate the
probability of correct decision P (C ), where

C = {(m0 , m0 ), (m1 , m1 )} ∈ FD

From the paragraph above, we have P (C ) = P (C˜) where

C˜ = {(m0 , r0 ), (m1 , r1 ), (m1 , r2 )} ∈ FDCC

We then obtain

P (C ) = P (C˜) = PS (m0 ) P [r0 |m0 ] + PS (m1 ) P [r1 |m1 ] + PS (m1 ) P [r2 |m1 ]
= 0.6 × 0.9 + 0.4 × 0.15 + 0.4 × 0.8
= 0.54 + 0.06 + 0.32
= 0.92

Finally, the probability of error is P (E ) = 1 − P (C ) = 0.08. We leave it as an


exercise to show that no other decision element would lead to a larger probability
of correct decision; in other words, the decision element of figure 2.2 is optimum.
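
The exercise of checking all 8 decision mappings is easy to automate. The sketch below is not from the notes, and because figure 2.2 is not reproduced here, the transition probabilities that do not appear in the computation above (P [r1 |m0 ], P [r2 |m0 ], P [r0 |m1 ]) are assumed values used only for illustration; the priors (0.6, 0.4) and the entries P [r0 |m0 ] = 0.9, P [r1 |m1 ] = 0.15, P [r2 |m1 ] = 0.8 are taken from the example.

```python
from itertools import product

# A priori probabilities (from example 2.2.1) and transition probabilities
# P[r_j | m_i].  Entries marked "assumed" are NOT given in the text above;
# they stand in for the unreproduced figure 2.2.
P_m = {"m0": 0.6, "m1": 0.4}
P_r_given_m = {
    "m0": {"r0": 0.90, "r1": 0.05, "r2": 0.05},   # 0.05, 0.05 assumed
    "m1": {"r0": 0.05, "r1": 0.15, "r2": 0.80},   # 0.05 assumed
}

best = None
for mapping in product(["m0", "m1"], repeat=3):          # all 8 decision elements
    decision = dict(zip(["r0", "r1", "r2"], mapping))
    P_C = sum(P_m[decision[r]] * P_r_given_m[decision[r]][r] for r in decision)
    if best is None or P_C > best[1]:
        best = (decision, P_C)
    print(decision, f"P(C) = {P_C:.3f}")

print("optimum:", best)   # r0 -> m0, r1 -> m1, r2 -> m1 with P(C) = 0.92
```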

The important results and terminology are summarized in definition 2.2.1


and theorem 4.

Definition 2.2.1.

1. An optimum decision (element) is a decision mapping that yields the


largest P (C ) over any other choice of decision mapping.

Figure 2.2:

2. The a posteriori probability of message mi given the received symbol rj is
defined by the conditional probability:

P ({mi } | {rj }) ≜ PDCC ({(mi , rj )}) / P ({rj })

where P ({rj }) = Σ_{i=0}^{M−1} PDCC ({(mi , rj )}), as long as P ({rj }) ≠ 0.
Theorem 4.
1. An optimum decision mapping is obtained by mapping a received symbol
rj to that message mi that maximizes the product P [rj | mi ]P ({mi }).
2. The decision mapping that maps a received symbol rj (P ({rj }) ̸= 0) to that
message mi that maximizes the a posteriori probability P ({mi } | {rj }) is
an optimum receiver.

2.3 Random variables and distribution functions
Refer to Wozencraft & Jacobs, pages 37 to 44.
We now describe a mathematical concept that is used to model physical
random experiments, the outcomes of which are real numbers. The definition
appears more complicated and abstract than necessary, but this is required and
will later naturally extend to the definition of random processes in chapter 3.
Moreover the abstract definition
1. leads to a simpler definition of joint random variables i.e. many random
variables (such as in definition 2.3.3),
2. extends to the definition of transformation of random variables (section
2.4),

3. indicates that there is an underlying random physical phenomenon, the


result of which is an observed real number.
Definition 2.3.1. Consider a probability system (Ω, F , P : F → [0, 1]) and
the mapping:
x : Ω → R ∪ {±∞}
x : ω 7→ x(ω)
such that {ω : x(ω) ≤ α} ∈ F , ∀α ∈ R ∪ {±∞}, and {ω : x(ω) = ±∞} ∈ F are
events of probability 0. The mapping x is a random variable.
We often use the shorter notation {x ≤ α} ≜ {ω : x(ω) ≤ α} whenever there
is no confusion.
Proposition 5.
1. For every β ∈ R ∪ {±∞} such that β < α we have
i. {x < α} ∈ F , {x = α} ∈ F ,
ii. {x ≥ α} ∈ F , {x > α} ∈ F ,
iii. {β < x ≤ α} ∈ F .
2. For any interval I ⊂ R, the set {ω : x(ω) ∈ I} = {x ∈ I} is an event. If
I1 , I2 are two disjoint intervals in R then the events {x ∈ I1 } and {x ∈ I2 }
are disjoint.
The book gives some examples of random variables. The following describes
other examples of random and “non random” variables.
1. (Ω = {1, 2, 3, 4, 5, 6}; F = 2^Ω ; P (E) = |E|/6, ∀E ∈ F ). This is a probability
space and one easily verifies that:

x : Ω → R ∪ {±∞}
x : ω ↦ ω² − 1

is a random variable. The function:

y : Ω → R ∪ {±∞}
y : ω ↦ 1/(ω − 5)

is not a random variable. The function is not defined (infinite) at ω = 5, so the
axiom P ({ω : y(ω) = ∞}) = 0 and P ({ω : y(ω) = −∞}) = 0 is not
satisfied (here P ({ω : y(ω) = ∞}) = 1/6).
2. (Ω = {1, 2, 3, 4, 5, 6}; F = {∅, Ω, {2, 4, 6}, {1, 3, 5}}; P (E) = |E|/6, ∀E ∈ F ).
This is a probability space. The function:

x : Ω → R ∪ {±∞}
x : ω ↦ ω² − 1

is not a random variable since {ω : x(ω) ≤ 9} = {1, 2, 3} ∉ F .

3. Ω = [0, 1]; F ≡ sub-intervals + countable unions and intersections of such
sub-intervals; P ≡ total length of the sub-interval(s). Consider:

x : Ω → R ∪ {±∞}
x : ω ↦ drop all 7’s in the decimal representation of ω

This is not a random variable since {ω : x(ω) ≤ 1} does not lie in F .⁶
Notice that y : ω ↦ 1/(ω − 1/2) is a random variable; in this case P (y = ∞) =
P (y = −∞) = 0.
Definition 2.3.2. Let x : Ω → R ∪ {±∞} be a random variable for the proba-
bility system (Ω, F , P : F → [0, 1]). The function
Fx : R ∪ {±∞} → [0, 1]
Fx : α 7→ P ({ω : x(ω) ≤ α})
is called probability distribution function of x.
Example 2.3.1. Refer to W&J, page 38.
Example 2.3.2. The random experiment is that of throwing a fair die and the
probability space consists in:
Ω = {1, 2, 3, 4, 5, 6}
F = 2Ω
The probability is determined by:
P ({1}) = P ({2}) = . . . = P ({6}) = 1/6   (⇒ P (E) = |E|/6, ∀E ∈ F )

We define a function x as follows:


x: Ω → R ∪ {±∞}
x: ω 7→ ω2 + 1
One easily verifies that the function x satisfies the axioms of a random variable.
In order to sketch Fx (α) we evaluate it at a few points:
Fx (0) = P ({x ≤ 0}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 0}) = P (∅) = 0
Fx (1) = P ({x ≤ 1}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 1}) = P (∅) = 0
Fx (2) = P ({x ≤ 2}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 2}) = P ({1}) = 1/6
Fx (5) = P ({x ≤ 5}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 5}) = P ({1, 2}) = 1/3
lim− Fx (α) = P ({x < 2}) = P ({ω ∈ Ω : ω 2 + 1 < 2}) = P (∅) = 0
α→2
lim Fx (α) = P ({x < 5}) = P ({ω ∈ Ω : ω 2 + 1 < 5}) = P ({1}) = 1/6
α→5−
... ... ...
Finally we obtain the function sketched in figure 2.3.
6 refer to remark 1 on page 3

Figure 2.3:
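
The evaluation of Fx (α) in example 2.3.2 can also be done mechanically; the following sketch (added here, not from the notes) computes Fx (α) = P ({ω : ω² + 1 ≤ α}) directly from the definition.

```python
from fractions import Fraction

# Sketch: F_x(alpha) = P({omega : omega^2 + 1 <= alpha}) for a fair die.
Omega = [1, 2, 3, 4, 5, 6]

def F_x(alpha):
    favourable = [w for w in Omega if w**2 + 1 <= alpha]
    return Fraction(len(favourable), len(Omega))

for alpha in (0, 1, 2, 5, 10, 37, 100):
    print(f"F_x({alpha:>3}) = {F_x(alpha)}")
# F_x(2) = 1/6, F_x(5) = 1/3, ..., F_x(37) = 1, matching figure 2.3
```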

Theorem 6. (properties of probability distribution functions)


1. 0 ≤ Fx (α) ≤ 1, ∀α ∈ R ∪ {±∞},
2. Fx (−∞) = 0; Fx (∞) = 1,
3. a > b ⇒ Fx (a) − Fx (b) = P ({ω|b < x(ω) ≤ a}),
4. a > b ⇒ Fx (a) ≥ Fx (b),
5. ∀a ∈ R, Fx (α) is continuous on the right at the point α = a,
6. Fx (a) − limα→a− Fx (α) = P ({ω|x(ω) = a})
proof of 5: Define Ai = {ω : x(ω) ≤ a + 1/i}, i = 1, 2, . . .. We then have

A1 ⊃ A2 ⊃ A3 ⊃ . . . ,   ∩_{i=1}^∞ Ai = {ω : x(ω) ≤ a}.

It then follows from property 8 of theorem 1, the continuity property, that

Fx (a) = P ({ω : x(ω) ≤ a}) = P (∩_{i=1}^∞ Ai ) = lim_{n→∞} P (An )
       = lim_{n→∞} Fx (a + 1/n) = lim_{α→a+} Fx (α).

We now turn our attention to a more general situation involving more than
one random variable. For example, consider the random experiment of throwing
two fair dice labelled as #1 and #2. The first random number is the outcome
of die #1 and the second random number is the sum of both outcomes of dice
#1, #2. We will detail the tools required to calculate the probability of events
defined by multiple random variables.
Given two random variables x1 , x2 we can construct the two probability
distribution functions Fx1 (α1 ), Fx2 (α2 ). As we have seen the probability of any
event involving x1 only and the probability of any event involving x2 only can
respectively be calculated from the functions Fx1 (α1 ) and Fx2 (α2 ). But the two
distribution functions are not sufficient to calculate the probability of an event
involving both variables x1 , x2 , such as P ({2 < x1 ≤ 4} ∩ {−1 < x2 ≤ 1}). In
order to calculate the probability of such an event a new distribution function
is defined:
Definition 2.3.3. Let x1 : Ω → R ∪ {±∞}, x2 : Ω → R ∪ {±∞} be random
variables for the probability system (Ω, F , P : F → [0, 1]). The function:
Fx1 ,x2 : (R ∪ {±∞})2 → [0, 1]
Fx1 ,x2 : (α1 , α2 ) 7→ P ({ω : x1 (ω) ≤ α1 } ∩ {ω : x2 (ω) ≤ α2 })
is called joint probability distribution function of the random variables x1 and
x2 .
Remarks.
1. For any region I ⊂ R², the set {ω : (x1 (ω), x2 (ω)) ∈ I} = {(x1 , x2 ) ∈
I} is an event. If I1 , I2 are two disjoint regions in R² then the events
{(x1 , x2 ) ∈ I1 } and {(x1 , x2 ) ∈ I2 } are disjoint.
2. The properties of joint probability distribution functions are immediate
generalizations of the properties of (1-dimensional) probability distribu-
tion functions (refer to page 40, W&J).
3. Property VI on page 40, W&J, is similar to the proof of property 3 of
theorem 6: first show that
Fx1 ,x2 (a1 , a2 ) = Fx1 ,x2 (a1 , b2 ) +
P ({ω : x1 (ω) ≤ a1 } ∩ {ω : b2 < x2 (ω) ≤ a2 })
Fx1 ,x2 (a1 , a2 ) = Fx1 ,x2 (b1 , a2 ) +
P ({ω : b1 < x1 (ω) ≤ a1 } ∩ {ω : x2 (ω) ≤ a2 })

Definition 2.3.4. Let xi : Ω → R ∪ {±∞}, i = 1, 2, 3, . . . , k be k random


variables for the probability system (Ω, F , P : F → [0, 1]).
1. The mapping
x : Ω → (R ∪ {±∞})^k
x : ω ↦ (x1 (ω), x2 (ω), . . . , xk (ω)) ≜ x(ω)
is called random vector.

2. Let a = (a1 , a2 , . . . , ak ), b = (b1 , b2 , . . . , bk ) be two elements of
(R ∪ {±∞})^k . We write a ≤ b to signify that ai ≤ bi , ∀i = 1, 2, . . . , k.

3. The function

Fx : (R ∪ {±∞})^k → [0, 1]
Fx : α ↦ P ({ω : x(ω) ≤ α})

is called joint probability distribution function of the random vector x.


Example 2.3.3 (bottom of page 42 in W&J). Recall that

Ω = [0, 1] ,
F ≡ set of intervals of Ω and “countable” unions and
intersections of such intervals,
P ≡ total length of the union of the intervals making
the event.
The remainder of the example is easy to follow.
Remark. “The use of probability distribution functions is inconvenient in com-
putations”.7 For this reason we introduce the probability density function de-
fined below.
Definition 2.3.5 (page 45, W&J). The function px (α) = dFx (α)/dα is called prob-
ability density function of the random variable x.
Proposition 7. Properties of the probability density function [pages 44 - 47,
W&J]
1. When I ⊂ R ∪ {±∞} is such that {ω : x(ω) ∈ I} ∈ F we can show that
P ({ω : x(ω) ∈ I}) = ∫_I px (α)dα.

2. If lim_{α→a+} Fx (α) = lim_{α→a−} Fx (α) and lim_{α→a+} F′x (α) ≠ lim_{α→a−} F′x (α), then
px (a) = lim_{α→a+} F′x (α), i.e. the probability density function px ( ) is continuous
on the right.

3. If lim_{α→a+} Fx (α) − lim_{α→a−} Fx (α) = Pa ≠ 0 then px (α) contains the impulse
Pa δ(α − a).⁸
7 bottom of page 43, W&J
8 δ( ) is the Dirac impulse function; it is defined on page 46, W&J. For all t0 ∈ R and ϵ > 0:

∫_{t0−ϵ}^{t0+ϵ} f (t)δ(t − t0 )dt = f (t0 )

if f (t) is continuous at t = t0 . We also have:

f (t)δ(t − t0 ) = f (t0 )δ(t − t0 )
f (t) ∗ δ(t − t0 ) = f (t − t0 )

if f (t) is continuous at t = t0 (∗ denotes the convolution).

4. Fx (α) = ∫_{−∞}^α px (β)dβ

5. px (α) ≥ 0, ∀α ∈ R

6. ∫_{−∞}^∞ px (α)dα = 1
Theorem 8. For every function f (α) : R ∪ {±∞} → R ∪ {±∞} satisfying:

f (α) ≥ 0, ∀α,   and   ∫_{−∞}^∞ f (α)dα = 1,

there exists a probability system (Ω, F , P : F → [0, 1]) and a random
variable x : Ω → R ∪ {±∞} such that px (α) = f (α).
Sketch of proof. The probability system is constructed as follows:
• Ω = R,
• F is the set of intervals of R = Ω, plus (enumerable) unions, intersections
and complements of such intervals,

• P (I) ≜ ∫_I f (α)dα, ∀I ∈ F .
Next we define the random variable
x: Ω=R → R
x: ω 7 → ω

One can then easily show that px (α) = f (α) (this is somewhat similar to the
construction of page 23 in W&J).
Example 2.3.4. The probability density function of the random variable de-
scribed in example 2.3.2 and distribution function as in figure 2.3 is shown in
figure 2.4.

Figure 2.4:

Example 2.3.5 (some probability density functions – refer to Wozencraft &


Jacobs pp 47 - 49 for the definitions of the exponential, Rayleigh, Cauchy and
Gaussian probability density functions).

1. uniform: b > a

px (α) = 1/(b − a) for a ≤ α < b;   0 elsewhere

Fx (α) = 0 for α < a;   (α − a)/(b − a) for a ≤ α < b;   1 for b ≤ α

2. Binomial: N ∈ N*, 0 < p < 1

px (α) = Σ_{k=0}^N (N choose k) p^k (1 − p)^(N−k) δ(α − k)

where (N choose k) ≜ N !/(k!(N − k)!). A typical binomial probability density
function is sketched in figure 2.5.

Figure 2.5: Typical binomial probability density function with N = 5, p = 0.4.

3. Poisson: ρ > 0

px (α) = Σ_{k=0}^∞ e^{−ρ} (ρ^k /k!) δ(α − k)

A typical Poisson probability density function is sketched in figure 2.6.

Figure 2.6: First 6 impulses of a typical Poisson probability density function


with λ = 1.5. The Poisson probability density function contains an infinite
number of impulses: 1 for every x ∈ N.

Remarks.

1. All of the above functions satisfy ∫_{−∞}^∞ px (α)dα = 1 (verify).

2. It can be shown that

∫_{−∞}^∞ (1/√(2πσ²)) e^{−(α−µ)²/(2σ²)} dα = 1.

This follows from ∫_{−∞}^∞ e^{−x²} dx = √π.

3. There is no analytical form for the Gaussian probability distribution func-
tion. It is tabulated (refer to Appendix A) as:

Q(α) = (1/√(2π)) ∫_α^∞ e^{−β²/2} dβ = (1/2) [1 − erf(α/√2)]

erf(α) = (2/√π) ∫_0^α e^{−β²} dβ = 1 − 2Q(√2 α)

represented graphically by the hatched areas in figures 2.7 and 2.8.9 The
function erf( ) is defined in the C math library math.h and most com-
puter packages (MAPLE, MATHEMATICA, MATLAB ) also include an
implementation of the function. The erf( ) and Q( ) functions are plotted
in figure 2.9.

Figure 2.7: Definition of the Q( ) function (hatched area under the Gaussian
density (1/√(2π)) e^{−β²/2} to the right of α).

Figure 2.8: Definition of the erf( ) function (hatched area under (2/√π) e^{−β²}
between 0 and α).

4. If x is a Gaussian random variable with probability density function:¹⁰

px (α) = (1/√(2πσ²)) e^{−(α−µ)²/(2σ²)}

then:

P (x ≥ a) = Q((a − µ)/σ)

for any a ∈ R. More generally for any a, b ∈ R, a < b:

P (a ≤ x < b) = P (x ≥ a) − P (x ≥ b) = Q((a − µ)/σ) − Q((b − µ)/σ)

and in particular P (x < b) = 1 − Q((b − µ)/σ) = Q((µ − b)/σ). (A small
numerical sketch using the erf( ) function follows these remarks.)

9 Tables generally only list the values of Q(α) for α ≥ 0. If α < 0, we use Q(α) = 1−Q(−α).
10 The parameters µ and σ 2 will respectively be called mean and variance of the random
variable x; refer to section 2.10.

Figure 2.9: Graphs of the erf(α) function (red) and Q(α) function (blue)

5. The Taylor series of the function Q( ) is given by:¹¹

Q(α) = 1/2 − α/√(2π) + Σ_{n=3, n odd}^∞ (−1)^((n−3)/2) [∏_{i=0}^{(n−3)/2} (2i + 1)] α^n / (n! √(2π))

     ≈ 1/2 − α/√(2π) + α³/(6√(2π)) − α⁵/(40√(2π)) + α⁷/(336√(2π)) − α⁹/(3456√(2π)) + O(α¹¹)

6. The first derivative of the function Q(α) is given by:

dQ(α)/dα = −(1/√(2π)) e^{−α²/2}
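
As mentioned in remark 3, the erf( ) function is available in most environments; a minimal sketch (not from the notes) of how Q( ) and the Gaussian interval probability of remark 4 can be computed with it is given below.

```python
import math

def Q(alpha):
    """Q(alpha) = 0.5 * (1 - erf(alpha / sqrt(2))), as in remark 3."""
    return 0.5 * (1.0 - math.erf(alpha / math.sqrt(2.0)))

def prob_interval(a, b, mu=0.0, sigma=1.0):
    """P(a <= x < b) for a Gaussian x, using remark 4."""
    return Q((a - mu) / sigma) - Q((b - mu) / sigma)

print(f"Q(0) = {Q(0.0):.4f}")                              # 0.5
print(f"Q(1) = {Q(1.0):.4f}")                              # ~0.1587
print(f"P(-1 <= x < 1) = {prob_interval(-1.0, 1.0):.4f}")  # ~0.6827
```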

Definition 2.3.6 (Wozencraft & Jacobs, page 49). Let x1 , x2 be two random
variables with joint probability distribution function Fx1 ,x2 (α1 , α2 ). The func-
tion

px1 ,x2 (α1 , α2 ) = ∂² Fx1 ,x2 (α1 , α2 ) / (∂α1 ∂α2 )

is called joint probability density function of x1 , x2 .
Proposition 9. Properties of the joint probability density function [Wozencraft
& Jacobs, pp 50 - 55]
1. When I ⊂ R² is such that {ω : (x1 (ω), x2 (ω)) ∈ I} ∈ F we can show that:

P ({ω : (x1 (ω), x2 (ω)) ∈ I}) = ∫∫_I px1 ,x2 (α1 , α2 ) dα1 dα2
11 this is required to solve problem 2.39 - more details to follow

2. Fx1 ,x2 (α1 , α2 ) = ∫_{−∞}^{α1} ∫_{−∞}^{α2} px1 ,x2 (β1 , β2 ) dβ2 dβ1

   Fx1 (α1 ) = ∫_{−∞}^{α1} ∫_{−∞}^{∞} px1 ,x2 (β1 , β2 ) dβ2 dβ1

   Fx2 (α2 ) = ∫_{−∞}^{∞} ∫_{−∞}^{α2} px1 ,x2 (β1 , β2 ) dβ2 dβ1

3. px1 ,x2 (α1 , α2 ) ≥ 0, ∀α1 , ∀α2 .

4. ∫∫_{−∞}^{∞} px1 ,x2 (α1 , α2 ) dα1 dα2 = 1.

5. Discontinuities are treated in the same way as in the one dimensional case.

Example 2.3.6 (Wozencraft & Jacobs, pp 53 - 54).

1. Computer generated plots (using MAPLE ) of equation W&J (2.58) are


presented in figure 2.10. A more general expression for the probability
density function of two (and more) jointly Gaussian random variables will
be presented in §2.13.3.

2. The joint probability density function of equation (2.59) in W&J is purely


impulsive because the term (1/36) δ(α1 − i) δ(α2 − j)
is 0 everywhere except at α1 = i and α2 = j where it is an infinitely high
and infinitely narrow spike of volume 1/36.

3. Equation W&J (2.60) should be

px1 ,x2 (α1 , α2 ) = Σ_{i=1}^{2} (1/2) δ(α1 − i) (1/√(2π)) exp[−(α2 − i)²/2]

and, as explained, represents two infinitely high fences at α1 = 1 and


α1 = 2. For any other value of α1 the joint probability density function
is 0 since the impulse δ(α1 − 1) is 0 everywhere except at α1 = 1 and
similarly δ(α1 − 2) is 0 everywhere except at α1 = 2.

4. The change of variable at the bottom of page 54 in W&J is the conversion
from rectangular to polar coordinates. It is understood that

θ = tan⁻¹(α2 /α1 ) ± π u(−α1 )

where u( ) is the unit step function, i.e. π radians (180°) must be added
(or subtracted) whenever α1 is negative since the tan⁻¹( ) function only
covers angles from −π/2 to +π/2 (a short numerical note on this follows
the example).

This example shows that it is sometimes possible to obtain an expression


for the volume under a 2-dimensional Gaussian probability density func-
tion even though it is not possible for a 1-dimensional Gaussian probability
density function. We will also see that the Q( ) and erf( ) functions may
also sometimes be used (problem 3.2 in W&J).
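
As a practical aside (not in the notes), most numerical libraries provide a two-argument arctangent that performs the ±π quadrant correction of item 4 automatically; the only difference is that it returns angles in (−π, π] rather than in [0, 2π).

```python
import math

# atan2(alpha2, alpha1) returns the angle of the point (alpha1, alpha2),
# handling the sign of alpha1 so no manual +/- pi correction is needed.
for alpha1, alpha2 in [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]:
    theta = math.atan2(alpha2, alpha1)
    print(f"({alpha1:+.0f}, {alpha2:+.0f}) -> theta = {math.degrees(theta):+.1f} degrees")
```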

(a) ρ = 0 (b) ρ = 0.9

(c) ρ = −0.9 (d) ρ = −0.7

Figure 2.10: Plots of the joint Gaussian probability density function of 2 random
variables

Theorem 10. px1 (α1 ) = ∫_{−∞}^∞ px1 ,x2 (α1 , α2 ) dα2 .

Proof. We have:

Fx1 (α1 ) = Fx1 ,x2 (α1 , ∞) = ∫_{−∞}^{α1} [ ∫_{−∞}^{∞} px1 ,x2 (β1 , β2 ) dβ2 ] dβ1 = ∫_{−∞}^{α1} h(β1 ) dβ1 ,        (2.2)

where h(β1 ) denotes the bracketed inner integral; this follows from property V
page 40 (W&J) and equation W&J (2.56). We also have:

Fx1 (α1 ) = ∫_{−∞}^{α1} px1 (β1 ) dβ1        (2.3)

from equation W&J (2.40b). Both RHS of (2.2) and (2.3) give Fx1 (α1 ) for all
values of α1 . The integrands must be the same, i.e.

px1 (α1 ) = h(α1 ) = ∫_{−∞}^{∞} px1 ,x2 (α1 , β2 ) dβ2 .

Remark. This proof is not 100% rigorous; neither is Wozencraft & Jacobs’.
They show that (d/dα) ∫_{−∞}^α h(β)dβ = h(α) when h(β) is continuous at β = α. The
result remains valid even if this condition is not satisfied, and this follows from
the way in which the discontinuities of Fx (α) were handled in the definition of
px (α) = dFx (α)/dα.
Example 2.3.7 (Wozencraft & Jacobs, pp 56 - 57). If x1 , x2 are two jointly
Gaussian random variables then x1 is Gaussian and x2 is Gaussian. Therefore
the volume under the surface of figure 2.27 in W&J, inside the strip bounded
by a, b, is obtained with the Q( ) function; there is no need of a 2-dimensional
version of the Q( ) function in this case.
Definition 2.3.7. Let x = (x1 , x2 , . . . , xk ) be a random vector comprising
k random variables with joint probability distribution function Fx (α). The
function px (α) = ∂^k Fx (α) / (∂α1 ∂α2 · · · ∂αk ) is called joint probability density
function of the random vector x.
Proposition 11. Properties of the joint probability density function [Wozen-
craft & Jacobs pp 57 - 58]
1. When I ⊂ R^k is such that {ω : x(ω) ∈ I} ∈ F we can show that:

P ({ω : x(ω) ∈ I}) = ∫_I px (α) dα

2. Discontinuities of Fx (α) are treated in the same way as the 1-dimensional


and 2-dimensional cases.
3. (Elimination of random variables, W&J, pp 55) If x =
(x1 , x2 , . . . , xk ), x′ = (x1 , x2 , . . . , xi−1 , xi+1 , . . . , xk ), α = (α1 , α2 ,
. . . , αk ), α′ = (α1 , α2 , . . . , αi−1 , αi+1 , . . . , αk ) then
px′ (α′ ) = ∫_{−∞}^{∞} px (α) dαi .

Definition 2.3.8. Let x1 , x2 be random variables for the probability system


(Ω, F , P : F → [0, 1]). We write x1 = x2 if P ({ω : x1 (ω) ̸= x2 (ω)}) = 0.

In particular x1 (ω) = x2 (ω), ∀ω ∈ Ω ⇒ x1 = x2 . Indeed x1 (ω) =


x2 (ω), ∀ω ∈ Ω ⇒ {ω : x1 (ω) ̸= x2 (ω)} = ∅, and P (∅) = 0. We usually (i.e.
from now on) exclude from Ω the sample points of probability 0 and in this case
we find that x1 = x2 ⇔ {ω : x1 (ω) ̸= x2 (ω)} = ∅ ⇔ x1 (ω) = x2 (ω), ∀ω ∈ Ω.
Under this assumption any one of the above two statements can be used to
define equality of random variables.

2.4 Transformation of random variables


Refer to Wozencraft & Jacobs, pp 58 - 65.
Let x : Ω → R ∪ {±∞} be a random variable for the probability system
(Ω, F , P : F → [0, 1]), and let
y: Ω → R ∪ {±∞}
y: ω 7 → f (x(ω))
where f : R → R is a piecewise differentiable function. For many such functions,
y satisfies the axioms of a random variable. We say that y is defined by a
transformation of random variable(s) and we write y = f (x). We want to
calculate py (α) from px (α) and the function f . This generalizes to random
vectors.
Remark (Wozencraft & Jacobs, p 64). If px (α) contains impulses, the contri-
bution of each impulse is calculated separately and added to the non-impulsive
part. This will be illustrated in §2.4.5.

2.4.1 Calculate Fy (α) first, then py (α) = dFy (α)/dα
The following are obtained (when px (α) is non-impulsive) [refer to W&J]:

1. y = bx + a where a, b ∈ R, b ≠ 0:

⇒ py (α) = (1/|b|) px ((α − a)/b)

2. Half-wave (linear) rectifier y = x for x ≥ 0, y = 0 for x < 0:

⇒ py (α) = P0 δ(α) + u(α) px (α)

where P0 = ∫_{−∞}^0 px (β)dβ = Fx (0) and u( ) is the unit step function.

3. y = x²

⇒ py (α) = (1/(2√α)) [px (√α) + px (−√α)] for α > 0;   0 for α < 0

Finally, by definition, py (0) = lim_{α→0+} py (α).

4. We skip the example y = x1² + x2² (y is a function of a random vector)
for now. The result is

py (α) = u(α) ∫_0^{2π} α px1 ,x2 (α cos θ, α sin θ) dθ

and will be covered in more generality at §2.4.3.


In light of the preceding examples we observe that the resulting probability
density function py (α) of y = f (x) may roughly be said to take the form:

py (α) = “probability (density) of y = α”
       = “probability (density) of f (x) = α”
       = “probability (density) of x ≜ g(α) = f⁻¹(α)”
       = px (g(α)) × 1/(df (α)/dα) = px (g(α)) × dg(α)/dα        (2.4)

where the factor multiplying px (g(α)) is a scaling factor.

Equation (2.4) is in general wrong. The development preceding it is found to


be helpful in remembering some of the expressions for the techniques we detail
next. The two main reasons why equation (2.4) may not work are that the
inverse function g( ) ≜ f⁻¹( ) need not exist and that, if it exists, it may not
be differentiable.

2.4.2 y = f (x) where f ( ) is differentiable and non-constant on any interval

In this case the inverse function g( ) referred to in the previous paragraph does
not exist, but the equation f (x) = α (loosely speaking f⁻¹(α)) has a countable
number of solutions; for every α ∈ R, we let S(α) denote the set of these roots:

S(α) = {β ∈ R : f (β) = α}

Then

py (α) = Σ_{β∈S(α)} px (β)/|f ′(β)|   if S(α) ≠ ∅ and f ′(β) ≠ 0 ∀β ∈ S(α);
py (α) = 0                            if S(α) = ∅.

This technique is very effective with transformations such as y = x2 , y = |x|.


Example 2.4.1. Let y = x², i.e. f (β) = β² and f ′(β) = 2β. Clearly,

S(α) = {β ∈ R : β² = α} = {±√α} for α > 0;   = ∅ for α < 0.

We then directly obtain:

py (α) = px (−√α)/(2√α) + px (√α)/(2√α)   if α > 0;   py (α) = 0   if α < 0.        (2.5)

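A quick way to convince oneself that equation (2.5) is right is a numerical check. The sketch below (not part of the notes) takes x to be a standard Gaussian random variable, an assumption chosen only for illustration, and compares the relative frequency of y = x² in small bins with the density given by (2.5).

```python
import math, random

def p_x(beta):                      # Gaussian density of x (assumed for the check)
    return math.exp(-beta * beta / 2.0) / math.sqrt(2.0 * math.pi)

def p_y(alpha):                     # equation (2.5)
    if alpha <= 0.0:
        return 0.0
    r = math.sqrt(alpha)
    return (p_x(-r) + p_x(r)) / (2.0 * r)

random.seed(1)
N = 200_000
samples = [random.gauss(0.0, 1.0) ** 2 for _ in range(N)]

# Compare the relative frequency of y in a few small bins with p_y * bin width.
for a in (0.1, 0.5, 1.0, 2.0):
    da = 0.05
    freq = sum(1 for y in samples if a <= y < a + da) / N
    print(f"alpha={a:4.2f}  empirical={freq:.4f}  p_y*da={p_y(a + da/2) * da:.4f}")
```
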
2.4.3 Reversible transformation of random variables


Refer to Wozencraft & Jacobs, pp 111 - 114.
In this case we have an invertible transformation of random vectors and the
scaling factor in equation (2.4) becomes the Jacobian of the transformation.
Specifically for random vectors of length 2 we would have:

y1 = f1 (x1 , x2 )        x1 = g1 (y1 , y2 )
y2 = f2 (x1 , x2 )        x2 = g2 (y1 , y2 )

The joint probability density function py1 ,y2 (β1 , β2 ) is given by:

py1 ,y2 (β1 , β2 ) = px1 ,x2 (g1 (β1 , β2 ), g2 (β1 , β2 )) × |Jg (β1 , β2 )|

where Jg (β1 , β2 ) is the determinant

Jg (β1 , β2 ) = | ∂g1 (β1 , β2 )/∂β1   ∂g1 (β1 , β2 )/∂β2 |
               | ∂g2 (β1 , β2 )/∂β1   ∂g2 (β1 , β2 )/∂β2 |
More generally we have:


Theorem 12. Let x be a random vector for the probability system (Ω, F , P :
F → [0, 1]) and let y be defined by:
y: Ω → (R ∪ {±∞})k
y: ω 7→ f (x(ω)) = (f1 (x(ω)), . . . , fk (x(ω)))
where f : Rk → Rk is a reversible differentiable transformation. Then y is a
random vector and
py (β) = px (g(β))|Jg (β)|
where g = f −1 and Jg (β) is the Jacobian associated with the transformation g.
Remarks.
1. f : Rk → Rk is reversible and differentiable if and only if:
(a) ∃g : Rk → Rk such that:
g ◦ f : Rk → Rk
g◦f : α 7 → g(f (α)) = α

f ◦ g : Rk → Rk
f ◦g : β 7 → f (g(β)) = β
and

(b) ∂fi /∂xj and ∂gi /∂xj exist for all 1 ≤ i, j ≤ k.

2. Wozencraft & Jacobs gives a very good interpretation of this result fol-
lowing equation (2A.7) on page W&J 113.

Proof. (refer to Wozencraft & Jacobs, page 112)

Fy (β) = P ({ω : y(ω) ≤ β}) = P ({ω : f (x(ω)) ≤ β}) = P ({ω : x(ω) ∈ I}) = ∫_I px (α) dα        (2.6)

where (refer to Wozencraft & Jacobs, equation 2A.3b)

I = {α | f (α) ≤ β} = {g(γ) | γ ≤ β},

represented graphically as in figure 2.11. We then apply the change of variables
α = g(γ) to equation (2.6):

px (α) → px (g(γ)),   dα → |Jg (γ)| dγ,   α ∈ I → γ ≤ β   (α = g(γ) ∈ I ⇒ γ ≤ β)

and it follows that

Fy (β) = ∫_{γ≤β} px (g(γ)) |Jg (γ)| dγ,        (2.7)

(Wozencraft & Jacobs (2A.6)). Identification of (2.7) with Fy (β) = ∫_{γ≤β} py (γ) dγ
yields

py (γ) = px (g(γ)) |Jg (γ)|

(Wozencraft & Jacobs (2A.7)) since equality is verified for every β ∈ R^k .

Example 2.4.2. Let x, y be 2 random variables with joint probability density


function px y (α, β) and define the random variable z = x + y. We want to find
the probability density function pz (γ).

Solution: Because the transformation from x, y to z is not invertible we intro-
duce a random variable w = y. The transformation f = (f1 ( , ), f2 ( , )) takes
the form:

z = x + y = f1 (x, y)
w = y     = f2 (x, y)

Figure 2.11: change of variable used in proof

and the inverse transformation g = (g1 ( , ), g2 ( , )) is given by:

x = z − w = g1 (z, w)
y = w     = g2 (z, w)

The Jacobian of the inverse transformation is:

Jg (γ, β) = | ∂g1 (γ, β)/∂γ   ∂g1 (γ, β)/∂β |   =   | 1  −1 |   = 1
            | ∂g2 (γ, β)/∂γ   ∂g2 (γ, β)/∂β |       | 0   1 |

and we may directly write:

pzw (γ, β) = pxy (g1 (γ, β), g2 (γ, β)) |Jg (γ, β)| = pxy (γ − β, β) × 1

Finally the probability density function of z is obtained by elimination of the
random variable w:

pz (γ) = ∫_{−∞}^∞ pzw (γ, β) dβ = ∫_{−∞}^∞ pxy (γ − β, β) dβ;

this is equation (2.87) in W&J.
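
As a small check of equation (2.87), the sketch below (not from the notes) takes x and y independent and uniform on [0, 1), an assumption chosen only for illustration; in that case the convolution integral gives the triangular density pz (γ) = γ for 0 ≤ γ < 1 and pz (γ) = 2 − γ for 1 ≤ γ < 2, which a Monte Carlo histogram reproduces.

```python
import random

def p_z(gamma):                 # convolution of two uniform(0,1) densities
    if 0.0 <= gamma < 1.0:
        return gamma
    if 1.0 <= gamma < 2.0:
        return 2.0 - gamma
    return 0.0

random.seed(2)
N = 200_000
samples = [random.random() + random.random() for _ in range(N)]

dg = 0.1
for g in (0.25, 0.75, 1.25, 1.75):
    freq = sum(1 for z in samples if g <= z < g + dg) / N
    print(f"gamma={g:4.2f}  empirical={freq:.4f}  p_z*dg={p_z(g + dg/2) * dg:.4f}")
```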

Example 2.4.3. Let x, y be 2 random variables with joint probability density


function px y (α, β) and define the random variable z = xy. We want to find the
probability density function pz (γ).

Solution: (Assume that P (x = 0) = P (y = 0) = 0.) Because the transformation
from x, y to z is not invertible we introduce a random variable w = y. The
transformation f = (f1 ( , ), f2 ( , )) takes the form:

z = xy = f1 (x, y)
w = y  = f2 (x, y)

and the inverse transformation g = (g1 ( , ), g2 ( , )) is given by:

x = z/w = g1 (z, w)
y = w   = g2 (z, w)

(notice that P (w = 0) = P (y = 0) = 0). The Jacobian of the inverse transfor-
mation is:

Jg (γ, β) = | ∂g1 (γ, β)/∂γ   ∂g1 (γ, β)/∂β |   =   | 1/β  −γ/β² |   = 1/β
            | ∂g2 (γ, β)/∂γ   ∂g2 (γ, β)/∂β |       |  0      1  |

and we may directly write:

pzw (γ, β) = pxy (g1 (γ, β), g2 (γ, β)) |Jg (γ, β)| = pxy (γ/β, β) × 1/|β|

Finally the probability density function of z is obtained by elimination of the
random variable w:

pz (γ) = ∫_{−∞}^∞ pzw (γ, β) dβ = ∫_{−∞}^∞ pxy (γ/β, β)/|β| dβ;

this is equation (2.89) in W&J.

Example 2.4.4. The example of pages 113 - 114 (W&J) is a classic! Equation
Wozencraft & Jacobs(2A.9a) should read
y2 = f2(x) = tan^{−1}(x2/x1) ± π u(−x1)

where u( ) is the unit step function. If ρ = 0 in equation Wozencraft & Ja-


cobs(2A.11b) then r is Rayleigh distributed, θ is uniformly distributed in the
interval [0, 2π] and r, θ are statistically independent.

2.4.4 Transformation of a uniformly distributed random


variable
Theorem 13. Let y be a random variable and f ( ) a valid cumulative distribu-
tion function12 such that the random variable x defined by the transformation

x = f (y)

is uniformly distributed between 0 and 1. The probability density function of y


is then given by:
py(β) = df(β)/dβ

Proof. Refer to problem 2.16 in W & J.

Example 2.4.5. Consider the random variable y ≥ 0 defined by the transformation y = √(−2 ln(1 − x)) in which x is a random variable uniformly distributed between 0 and 1. Then

x = (1 − e^{−y²/2}) u(y) = f(y)

and from the sketch in figure 2.12, f( ) is clearly seen to be a valid cumulative distribution function. Therefore:

py(β) = f′(β), ∀β
      = β e^{−β²/2} u(β)
      ≡ Rayleigh probability density function.

Figure 2.12:

12 f ( ) is a monotonically increasing function from 0 to 1.



Example 2.4.6. Consider the random variable −10 ≤ y ≤ 10 defined by the


transformation y = 10 sin(π(x − 1/2)) in which x is a random variable uniformly
distributed between 0 and 1. Then
x = (1/π) arcsin(y/10) + 1/2 = f(y)

and from the sketch shown in figure 2.13, f( ) is clearly seen to be a valid cumulative distribution function. Therefore:

py(β) = f′(β) = 1/(π√(100 − β²)),   −10 ≤ β ≤ 10

Figure 2.13:

Corollary. If x, y are two random variables uniformly distributed between 0


and 1 such that pxy (α, β) = px (α) py (β), then the random variable z defined by
the transformation: z = sin(π(x − 1/2)) √(−2 ln(y))
is a Gaussian random variable.
Proof. Recognizing that both y and 1 − y are uniformly distributed between 0
and 1, the result follows from the previous two examples and problem 2.19 in
W & J.
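As a quick sanity check of the corollary, the following C sketch (the sample size is an arbitrary choice, and rand() is assumed adequate for a rough illustration) generates z = sin(π(x − 1/2))√(−2 ln y) from two independent uniform random variables and verifies that the sample mean and variance are close to 0 and 1.

/* Sketch: Gaussian samples from the transformation of the corollary. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    const long N = 1000000;
    double sum = 0.0, sumsq = 0.0;

    for (long i = 0; i < N; i++) {
        double x = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* avoid 0, 1   */
        double y = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* avoid ln(0)  */
        double z = sin(PI * (x - 0.5)) * sqrt(-2.0 * log(y));
        sum += z;
        sumsq += z * z;
    }
    double mean = sum / N;
    double var  = sumsq / N - mean * mean;
    printf("sample mean = %f (expect 0), sample variance = %f (expect 1)\n", mean, var);
    return 0;
}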
A uniformly distributed random variable is often readily available on a com-
puter system and we next show how theorem 13 can be used to transform this
random variable into another random variable y having any desired probability
density function py (β). We first calculate the corresponding desired cumulative
distribution function Fy (β) of the random variable y. Next we define a function

f −1 : [0, 1] → R

as follows. Over the range of input values where the function Fy ( ) is invertible,
we let f −1 ( ) = Fy−1 ( ). If Fy ( ) has a step at β0 then define f −1 (α) = β0 for
every α ∈ [Fy(β0−), Fy(β0)]. If Fy( ) is constant over the half-open interval [β0, β1) then f⁻¹(Fy(β0)) = β0. Finally f⁻¹(0) ≡ the largest β such that Fy(β) =
0. The function f −1 ( ) is well-defined over the entire range [0, 1]. The prob-
ability density function of the random variable defined by the transformation
y = f −1 (x) where x is a random variable uniformly distributed in the interval
from 0 to 1 is the desired probability density function py (β). This is illustrated
in the example that follows.

Example 2.4.7. We want to generate a random variable y with probability


density function shown in figure 2.14. Clearly this is a valid probability density
function and by theorem 7 the random variable y exists. We propose to apply
a transformation of random variable on a random variable x which is uniformly
distributed between 0 and 1.

Figure 2.14:

Step 1: Calculate the cumulative distribution function Fy(β).

Case 1 (0 < β < 1).

Fy(β) = ∫ py(β) dβ = ∫ β dβ = β²/2 + K

We must have Fy(0) = 0 ⇒ K = 0. Therefore:

Fy(β) = β²/2 ⇒ Fy(1) = 1/2 = Fy(2).

Case 2 (2 < β < 3).

Fy(β) = ∫ py(β) dβ = ∫ (1/2) dβ = β/2 + K

We must have Fy(2) = 1/2 ⇒ 2/2 + K = 1/2 ⇒ K = −1/2. Therefore

Fy(β) = β/2 − 1/2 ⇒ Fy(3) = 1.

Fy(β) remains constant at 1 for any β ≥ 3.
In summary we have:

Fy(β) =
  0            ; β < 0
  β²/2         ; 0 < β < 1
  1/2          ; 1 < β < 2
  (β − 1)/2    ; 2 < β < 3
  1            ; β > 3

A sketch of Fy(β) is presented in figure 2.15. In the following we let f( ) = Fy( ).

Figure 2.15:

Step 2: We calculate the inverse function f⁻¹( ) on the interval [0, 1]. We start with the equation α = f(β) and isolate β.

Case 1 (when 0 < α < 1/2).

α = f(β) = β²/2 ⇒ β = √(2α) = f⁻¹(α)

Case 2 (when 1/2 < α < 1).

α = (β − 1)/2 ⇒ β = 2α + 1 = f⁻¹(α)

Case 3 (when α = 1/2).

f⁻¹(1/2) = √(2α) evaluated at α = 1/2, which equals 1

Case 4 (f⁻¹(0) and f⁻¹(1)).

f⁻¹(0) = 0
f⁻¹(1) = 3

In summary we have

f⁻¹(α) =
  0          ; α = 0
  √(2α)      ; 0 < α < 1/2
  1          ; α = 1/2
  2α + 1     ; 1/2 < α < 1
  3          ; α = 1

or more concisely:

f⁻¹(α) =
  √(2α)      ; 0 ≤ α ≤ 1/2
  2α + 1     ; 1/2 < α ≤ 1

A sketch of f⁻¹(α) is presented in figure 2.16.

Step 3: The random variable y = f −1 (x) where x is a random variable uni-


formly distributed in the interval from 0 to 1 has the required probability density
function as illustrated in the simulation below.

Simulation: We generate 10 × 10^6 numbers between 0 and 1 using the function

(double)rand()/(double)RAND_MAX

in C. We apply the function f⁻¹( ) to each of these numbers and sketch a histogram with 500 bins between −0.5 and 3.5. We then scale the vertical axis appropriately (scale the number of points in a bin so that it represents the value of the probability density function):

(# points in bin) / ( (bin width) × (total # points) )

The graph obtained is shown in figure 2.17.
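A minimal C version of this simulation is sketched below (the bin count is reduced so the histogram can be printed as text; everything else follows the steps above).

/* Sketch of the simulation of example 2.4.7. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NBINS 40

static double f_inverse(double a)          /* f^{-1}(alpha) from Step 2 */
{
    if (a <= 0.5)
        return sqrt(2.0 * a);              /* 0 <= alpha <= 1/2 */
    else
        return 2.0 * a + 1.0;              /* 1/2 < alpha <= 1  */
}

int main(void)
{
    const long NSAMP = 10000000;           /* 10 x 10^6 samples, as in the notes */
    const double lo = -0.5, hi = 3.5;
    long hist[NBINS] = {0};
    double width = (hi - lo) / NBINS;

    for (long i = 0; i < NSAMP; i++) {
        double x = (double)rand() / (double)RAND_MAX;   /* uniform on [0,1] */
        double y = f_inverse(x);
        int bin = (int)((y - lo) / width);
        if (bin >= 0 && bin < NBINS) hist[bin]++;
    }
    for (int b = 0; b < NBINS; b++) {
        double beta = lo + (b + 0.5) * width;
        printf("beta = %5.2f   estimated py = %6.3f\n",
               beta, hist[b] / (width * (double)NSAMP));
    }
    return 0;
}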

Figure 2.16:

2.4.5 Impulses in px (α) (Wozencraft & Jacobs, pages 64,


65)
As stated before, if px(α) contains impulses, the contribution(s) of each impulse is (are) calculated separately and added to the non-impulsive part. Refer to W&J, bottom of page 64 and top of page 65 for an example. Another example is detailed below.

Example 2.4.8. Let x be a random variable with probability density function


px (α) sketched in figure 2.18. We want to find the probability density function
py (β) of the random variable defined by y = x2 . We first set aside the two
impulses and respectively denote by p̃x (α), p̃y (β) the non-impulsive parts of
px (α), py (β). Using equation (2.5) we calculate the p̃y (β) as follows. First, if
β < 0 then

p̃y (β) = 0

Next, if β > 0 then

p̃x(√β) = 1/6 ⇔ 1 < √β < 3 ⇔ 1 < β < 9, and 0 elsewhere
p̃x(−√β) = 1/6 ⇔ −3 < −√β < −2 ⇔ 4 < β < 9, and 0 elsewhere

Figure 2.17:

We then obtain

p̃y(β) =
  0              ; β < 1
  (1/6)/(2√β)    ; 1 < β < 4
  (1/6)/√β       ; 4 < β < 9
  0              ; β > 9

Finally we add the contribution of the impulses and obtain

py(β) = p̃y(β) + (1/3) δ(β − 1/4) + (1/6) δ(β − 4)

The probability density function py(β) is sketched on figure 2.19.



Figure 2.18:

2.5 Conditional probability density


Refer to Wozencraft & Jacobs, pp 65 - 70.

Definition 2.5.1.

1. Let x1 , x2 be random variables and v ∈ R be such that px2 (v) ̸= 0. The


function:
px1(α|x2 = v) = lim_{∆→0} [ ∫_{v−∆}^{v+∆} px1,x2(α, β) dβ ] / [ ∫_{v−∆}^{v+∆} px2(β) dβ ]

is called conditional probability density of x1 given x2 = v. When px1,x2(α, β) and px2(β) are continuous at β = v, this simplifies to

px1(α|x2 = v) = px1,x2(α, v) / px2(v)

Figure 2.19:

2. The function
Fx1(α|x2 = v) = ∫_{−∞}^{α} px1(β|x2 = v) dβ

is called conditional probability distribution function of x1 given x2 = v.

Justification (Wozencraft & Jacobs, pp 65 - 66). Let x denote a random vari-


able, (Ω, F , P : F → [0, 1]) a probability system, A ∈ F an event of probability
P (A) ̸= 0. We naturally have:

Fx(α|A) ≜ P({x ≤ α}|A) = P({x ≤ α}, A)/P(A)
px(α|A) ≜ dFx(α|A)/dα
P(x ∈ B|A) = ∫_B px(α|A) dα

from which it follows that:

Fx1(α|x2 = v) = lim_{∆→0} P({x1 ≤ α} | {v < x2 ≤ v + ∆})
  = lim_{∆→0} P({x1 ≤ α} ∩ {v < x2 ≤ v + ∆}) / P({v < x2 ≤ v + ∆})
  = lim_{∆→0} { [P({x1 ≤ α} ∩ {x2 ≤ v + ∆}) − P({x1 ≤ α} ∩ {x2 < v})]/∆ } / { [P({x2 ≤ v + ∆}) − P({x2 < v})]/∆ }
  = lim_{∆→0} { [Fx1,x2(α, v + ∆) − Fx1,x2(α, v)]/∆ } / { [Fx2(v + ∆) − Fx2(v)]/∆ }
  = [∂Fx1,x2(α, v)/∂v] / [dFx2(v)/dv]
  = [∂/∂v Fx1,x2(α, v)] / px2(v)

We then obtain:

px1(α|x2 = v) = ∂Fx1(α|x2 = v)/∂α
  = (1/px2(v)) ∂²Fx1,x2(α, v)/∂α∂v
  = px1,x2(α, v) / px2(v)

Refer to W&J, page 67, last paragraph for a very good visual interpretation.

Example 2.5.1 (Wozencraft & Jacobs, page 68). The resulting graph is ex-
plained as follows: when |ρ| → 1, the random variables x1 , x2 become more
“correlated” (formal definition to be given later ). So if |ρ| → 1 and x2 = v then
x1 is very likely close to v. But it is also more likely to be between 0 and v since
its mean (formal definition to be given later) is 0; that’s why the conditional
probability density function looks more and more like an impulse located just
on the left of v.

Applications (Wozencraft & Jacobs pp 68 - 70):


1. z = x + y ⇒ pz(γ) = ∫_{−∞}^{∞} px,y(γ − β, β) dβ

2. z = xy ⇒ pz(γ) = ∫_{−∞}^{∞} (1/|β|) px,y(γ/β, β) dβ

2.6 Statistical Independence


Refer to Wozencraft & Jacobs, pp 70 - 73.

Definition 2.6.1. k random variables x1, x2, ..., xk are statistically independent if and only if px(α) = ∏_{i=1}^{k} pxi(αi), or equivalently Fx(α) = ∏_{i=1}^{k} Fxi(αi), where x = (x1, x2, ..., xk) and α = (α1, α2, ..., αk) ∈ R^k. Random variables that are not statistically independent are called dependent. k random variables are pairwise (statistically) independent if pxixj(αi, αj) = pxi(αi)pxj(αj), ∀i ≠ j, or equivalently Fxixj(αi, αj) = Fxi(αi)Fxj(αj), ∀i ≠ j.
Remarks.
1. (Wozencraft & Jacobs, page 71) “. . . the statistical independence of a set
of random variables guarantees the statistical independence of any subset
of them.” Consequently, k random variables that are statistically indepen-
dent are also pairwise statistically independent; the converse is not always
true.
2. (Wozencraft & Jacobs, page 71) If the events A1 , A2 , . . . , Ak are such that
each Ai is defined in terms of xi only, i.e.

Ai = {ω : xi (ω) ∈ Ii }, ∀i = 1, 2, . . . , k,

then the events A1 , A2 , . . . , Ak are statistically independent.


3. (Wozencraft & Jacobs, page 72) In the joint Gaussian probability density
function of x1 , x2 given by equation Wozencraft & Jacobs(2.58), page 53,
we see that
x1 and x2 are statistically independent ⇔ ρ = 0
The random variables in figure 2.10(a) are independent and the random
variables in figures 2.10(b), 2.10(c), 2.10(d) are not independent.
4. If z = x + y and x, y are statistically independent then
pz(γ) = ∫_{−∞}^{∞} px(γ − β) py(β) dβ = px(γ) ∗ py(γ)

where the operator ∗ denotes the convolution. This result generalizes to


the case of k statistically independent random variables.
Example 2.6.1. The joint probability density function of two random variables
x, y is given by:
px,y(α, β) = A(β + 1)e^{−α}   ; 0 ≤ α, −1 ≤ β ≤ 1
           = 0                ; elsewhere

The following may (or may not) be useful to answer the four questions:

∫ t e^{Kt} dt = e^{Kt} (t/K − 1/K²),   ∀K ≠ 0.

1. Show that:

   px(α) = 2Ae^{−α}    ; 0 ≤ α, and 0 elsewhere
   py(β) = A(β + 1)    ; −1 ≤ β < 1, and 0 elsewhere

2. Calculate the value of A.

3. Are the random variables x, y statistically independent? Justify your answer.

4. Calculate the probability density function pz(γ) of the random variable z defined by the following transformation (half-wave rectifier):

   z = y   ; 0 ≤ y, and 0 elsewhere

Solution:

1.

   px(α) = ∫_{−∞}^{∞} px,y(α, β) dβ
         = ∫_{−1}^{1} A(β + 1)e^{−α} dβ              ; α ≥ 0
         = Ae^{−α} [β²/2 + β]_{−1}^{1}               ; α ≥ 0
         = 2Ae^{−α}                                  ; α ≥ 0

   py(β) = ∫_{−∞}^{∞} px,y(α, β) dα
         = ∫_{0}^{∞} A(β + 1)e^{−α} dα               ; −1 ≤ β < 1
         = A(β + 1) [e^{−α}]_{∞}^{0}                 ; −1 ≤ β < 1
         = A(β + 1)                                  ; −1 ≤ β < 1

2. One easily verifies that:

   2A = ∫_{−∞}^{∞} px(α) dα = ∫_{−∞}^{∞} py(β) dβ = ∫∫_{−∞}^{∞} px,y(α, β) dα dβ = 1

   We then conclude that A = 1/2.



3. We have to see if the product px(α)py(β) equals pxy(α, β). We obtain:

   px(α)py(β) = 2A²(β + 1)e^{−α}    ; 0 ≤ α, −1 ≤ β < 1, and 0 elsewhere
   pxy(α, β) = (1/2)(β + 1)e^{−α}   ; 0 ≤ α, −1 ≤ β < 1, and 0 elsewhere

   Using the value A = 1/2, the above two expressions are clearly equal and x, y are therefore statistically independent.
4. By definition:

   Fz(γ) = P(z ≤ γ) = 0 for γ < 0, and P(y ≤ γ) = Fy(γ) for γ ≥ 0.

   If 0 ≤ γ < 1 we easily find:

   Fz(γ) = Fy(γ) = ∫_{−1}^{γ} (β + 1)/2 dβ = (γ + 1)²/4

   A sketch of Fz(γ) is presented on figure 2.20. Finally:

   pz(γ) = dFz(γ)/dγ =
     (1/4)δ(γ)    ; γ = 0
     (γ + 1)/2    ; 0 < γ < 1
     0            ; elsewhere

   or equivalently:

   pz(γ) = (1/4)δ(γ) + ((γ + 1)/2)(u(γ) − u(γ − 1))

   where u( ) denotes the unit step function.

Figure 2.20: Sketch of Fz (γ) in example 2.6.1

Definition 2.6.2 (Wozencraft & Jacobs, p 72). The function


Mx(ν) = ∫_{−∞}^{∞} px(α) e^{jνα} dα

is called characteristic function of the random variable x.



Proposition 14. The characteristic function satisfies the following properties:


1. Let Px(ω) = F(px(α)) = P̃x(ω/2π) = P̃x(f), where F denotes the Fourier transform. Then:

   Mx(ν) = Px*(ν) = Px(−ν) = P̃x*(ν/2π) = P̃x(−ν/2π)

2. |Mx(ν)| ≤ 1, ∀ν ∈ R. In particular, Mx(ν) always exists (for any probability density function), since the integral always converges.

3. px(α) = (1/2π) ∫_{−∞}^{∞} Mx(ν) e^{−jνα} dν = F^{−1}[Mx(−ν)].
4. z = x+y where x and y are two statistically independent random variables
=⇒ Mz (ν) = Mx (ν)My (ν).

2.7 Mixed probability expressions


Refer to Wozencraft & Jacobs pages 73-76.
Consider a probability system (Ω, F , P : F → [0, 1]), an event A ∈ F such
that P (A) ̸= 0 and a random vector x : Ω → (R ∪ {±∞})k with probability
density function px (α). P (A) together with px (α) are not sufficient to calculate
probabilities of events involving both x and A, such as {ω : x(ω) ∈ I} ∩ A. We
need expressions/definitions for mixed form probabilities px (α, A), from which
mixed form conditional probabilities px (α|A), P (A|x = a) will be defined.
Remark. We have already seen a special case of mixed form probability ex-
pressions in the section describing conditional probability density functions.
Indeed let x1 : Ω → R ∪ {±∞} and x2 : Ω → R ∪ {±∞} be random variables.
If we are given the observation (or experimental measurement) x2 = v then
we can define the probability density function px1 (α|x2 = v). If in addition
A = {ω : x1 (ω) ∈ I} is an event of non-zero probability, then we can write

P(A|x2 = v) = ∫_I px1(α|x2 = v) dα.

Clearly the above P (A|x2 = v) is a mixed form probability. In this situation


the event A is described in terms of a random variable whereas in the section
“mixed form probability expressions” the event A ∈ F need not be specified in
terms of any of the random variables x1 , . . . , xk of the random vector x.
We start by considering the mixed form distribution function

Fx (α, A) = P ({ω : x(ω) ≤ α} ∩ A),

for every α ∈ (R ∪ {±∞})k and we define naturally

px(α, A) = (∂^k/∂α) Fx(α, A)                               (2.8)

From P(A|B) = P(AB)/P(B) we have Fx(α|A) = Fx(α, A)/P(A) and we define

px(α|A) = (∂^k/∂α) Fx(α|A) = (1/P(A)) (∂^k/∂α) Fx(α, A) = px(α, A)/P(A)    (2.9)

where P(A) = ∫···∫_{−∞}^{∞} px(α, A) dα. One easily verifies that px(α|A) is a valid joint probability density function, that is px(α|A) ≥ 0 and ∫···∫_{−∞}^{∞} px(α|A) dα = 1.
We define naturally

P(A|x = a) = lim_{∆→0} [ ∫_{a−∆}^{a+∆} px(α, A) dα ] / [ ∫_{a−∆}^{a+∆} px(α) dα ] = px(a, A)/px(a),    (2.10)

where the last equality holds when both functions are continuous at α = a and
px (α) = px (α, Ω). We next show that, as required, 0 ≤ P (A|x = a) ≤ 1.
Lemma 15. A ⊂ B ⇒ P (A)px (α|A) ≤ P (B)px (α|B).
Proof. Since A ⊂ B, then ∀α, ∆ ∈ Rk we have:
{α ≤ x < α + ∆} ∩ A ⊂ {α ≤ x < α + ∆} ∩ B
⇒P ({α ≤ x < α + ∆} ∩ A) ≤ P ({α ≤ x < α + ∆} ∩ B)
⇒ ∫···∫_{α}^{α+∆} px(β, A) dβ ≤ ∫···∫_{α}^{α+∆} px(β, B) dβ    (2.11)

It follows that

px(α, A) ≤ px(α, B),                                            (2.12)

since (2.11) is verified ∀α, ∆ ∈ R^k. From the definitions we immediately have:

px (α, A) = P (A)px (α|A) ,


px (α, B) = P (B)px (α|B) ,
and the desired result is obtained by substituting the above into (2.12).
Corollary. 0 ≤ P (A|x = a) ≤ 1.
Proof. By definition and lemma 15:
P(A|x = a) = px(a, A)/px(a) = P(A)px(a|A)/px(a) ≤ P(Ω)px(a|Ω)/px(a) = 1
since px (a|Ω) = px (a). Finally P (A|x = a) ≥ 0 since both px (a, A), px (a) are
≥ 0.
Theorem 16. The above definition of P (A|x = a) is consistent with the equa-
tion:

P(A|x = a) = ∫_I py(β|x = a) dβ
that is obtained when A is specified in terms of a random vector y : Ω → Rl ,
such as A = {ω : y(ω) ∈ I}.

Proof. Easy when all functions are continuous.


Theorem 17 (Bayes’). P(A|x = a) px(a) = px(a|A) P(A) = px(a, A)

2.8 Statistical independence


Refer to Wozencraft & Jacobs, pp 76 - 77.

1. Two random vectors x and y are statistically independent if and only if


px,y (α, β) = px (α)py (β), ∀α, β. If B = {ω : x(ω) ∈ I1 }, C = {ω : y(ω) ∈
I2 } then B and C are statistically independent whenever x and y are
statistically independent.
2. A random vector x and an event A are statistically independent if and
only if px (α, A) = P (A)px (α), ∀α ∈ Rk . If B = {ω : x(ω) ∈ I} then
A and B are statistically independent whenever x and A are statistically
independent.
3. x and y are statistically independent when conditioned on A if and only
if px,y (α, β|A) = px (α|A)py (β|A). If B = {ω : x(ω) ∈ I1 } and C = {ω :
y(ω) ∈ I2 } then P (B ∩ C|A) = P (B|A)P (C|A) whenever x and y are
statistically independent when conditioned on A.
4. “. . . functions of statistically independent random vectors (or random vari-
ables) are statistically independent.”13

Example 2.8.1. Consider a probability system (Ω = {ω1 , ω2 , ω3 }, F = 2Ω , P :


F → [0, 1]), the event A = {ω1 , ω2 } and a random variable x with mixed form
probability density function determined by:
px(α, {ω1}) = (1/(4√(2π))) e^{−α²/2}
px(α, {ω2}) = (1/(4√(2π))) e^{−(α−1)²/2}
px(α, {ω3}) = (1/(4√(2π))) e^{−α²/8}
In the following we calculate the probabilities of some events involving the ran-
dom variable x and/or the event A. We also calculate some conditional proba-
bilities to illustrate the use of the mixed form probability expressions described
in the section.

1. We have

   px(α) = px(α, Ω)
         = (1/(4√(2π))) e^{−α²/2} + (1/(4√(2π))) e^{−(α−1)²/2} + (1/(4√(2π))) e^{−α²/8}

   13 Wozencraft & Jacobs, page 77, last sentence in subparagraph Statistical Independence.

   This is sketched in figure 2.21. We could then use the probability density function to calculate for example the probability of the event {x ≥ 5}:

   P({x ≥ 5}) = ∫_{5}^{∞} px(α, Ω) dα
              = (1/4) Q((5 − 0)/√1) + (1/4) Q((5 − 1)/√1) + (1/2) Q((5 − 0)/√4)
              = 3.1128 × 10^{−3}

Figure 2.21:

2. We have:

   P(A) = ∫_{−∞}^{∞} px(α, A) dα
        = ∫_{−∞}^{∞} [px(α, {ω1}) + px(α, {ω2})] dα
        = 1/4 + 1/4 = 1/2
Similarly we would find:

P ({ω1 }) = 1/4
P ({ω2 }) = 1/4
P ({ω3 }) = 1/2
P ({ω1 , ω3 }) = 3/4
P ({ω2 , ω3 }) = 3/4

3. What is the probability of {x ≥ 5} ∩ A?

   P({x ≥ 5} ∩ A) = ∫_{5}^{∞} px(α, A) dα
                  = ∫_{5}^{∞} [px(α, {ω1}) + px(α, {ω2})] dα
                  = (1/4) Q(5) + (1/4) Q(4)
                  = 7.9895 × 10^{−6}

4. We also have:

   P({x ≥ 5}|A) = ∫_{5}^{∞} px(α|A) dα
                = ∫_{5}^{∞} px(α, A)/P(A) dα
                = 2 ∫_{5}^{∞} px(α, A) dα     (= 2 × 7.9895 × 10^{−6} from above)
                = 1.600 × 10^{−5}

Notice that this is different from P ({x ≥ 5}) found previously and con-
sequently, the random variable x and the event A are not statistically
independent. This will be discussed again below.

5. We have:

   P(A|x = 5) = px(5, A)/px(5)

   where:

   px(5) = (1/(4√(2π))) e^{−5²/2} + (1/(4√(2π))) e^{−(5−1)²/2} + (1/(4√(2π))) e^{−5²/8}
         = 4.4159 × 10^{−3}

   px(5, A) = (1/(4√(2π))) e^{−5²/2} + (1/(4√(2π))) e^{−(5−1)²/2}
            = 3.3829 × 10^{−5}

   It follows that:

   P(A|x = 5) = 7.66 × 10^{−3}

6. We also have:

   P({ω3}|x = 5) = px(5, {ω3})/px(5)

   where as before px(5) = 4.4159 × 10^{−3} and

   px(5, {ω3}) = (1/(4√(2π))) e^{−5²/8}
               = 4.382 × 10^{−3}

   It follows that:

   P({ω3}|x = 5) = 0.99234

   Notice that as expected we have:

   P(A|x = 5) + P({ω3}|x = 5) = 1

7. Although we already know from above that x and A are not statistically independent, we can also verify this by checking whether px(α, A) equals P(A)px(α).

   (a) From above we have:

       px(α, A) = (1/(4√(2π))) e^{−α²/2} + (1/(4√(2π))) e^{−(α−1)²/2}

   (b) Again from above we have:

       P(A)px(α) = (1/2) [ (1/(4√(2π))) e^{−α²/2} + (1/(4√(2π))) e^{−(α−1)²/2} + (1/(4√(2π))) e^{−α²/8} ]

   Clearly the two are not equal and again we find that x and A are not statistically independent.
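The numerical values quoted in this example can be reproduced with a few lines of C, using Q(x) = erfc(x/√2)/2; the sketch below is only a check of the arithmetic above.

/* Sketch: numbers of example 2.8.1. */
#include <stdio.h>
#include <math.h>

static const double PI = 3.14159265358979323846;

static double Q(double x) { return 0.5 * erfc(x / sqrt(2.0)); }

/* mixed density px(alpha, {w_i}): (1/(4 sqrt(2 pi))) exp(-(alpha-mean)^2/(2 var)) */
static double pdf(double alpha, double mean, double var)
{
    return exp(-(alpha - mean) * (alpha - mean) / (2.0 * var)) / (4.0 * sqrt(2.0 * PI));
}

int main(void)
{
    double a = 5.0;
    double P_xge5   = 0.25 * Q(5.0) + 0.25 * Q(4.0) + 0.5 * Q(2.5);  /* item 1 */
    double P_xge5_A = 0.25 * Q(5.0) + 0.25 * Q(4.0);                 /* item 3 */
    double px5      = pdf(a, 0.0, 1.0) + pdf(a, 1.0, 1.0) + pdf(a, 0.0, 4.0);
    double px5_A    = pdf(a, 0.0, 1.0) + pdf(a, 1.0, 1.0);

    printf("P(x >= 5)          = %e\n", P_xge5);                /* ~3.11e-3 */
    printf("P({x >= 5} and A)  = %e\n", P_xge5_A);              /* ~7.99e-6 */
    printf("P({x >= 5} | A)    = %e\n", P_xge5_A / 0.5);        /* ~1.60e-5 */
    printf("P(A | x = 5)       = %e\n", px5_A / px5);           /* ~7.66e-3 */
    printf("P({w3} | x = 5)    = %e\n", pdf(a, 0.0, 4.0) / px5);/* ~0.992   */
    return 0;
}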

2.9 Communication example


Refer to Wozencraft & Jacobs, pp 77 - 84.
The communication example of section 2.2 (Wozencraft & Jacobs, pp 33
- 37) dealt with a discrete communication channel such as that presented in
figure 2.22. The DCC comprises many analog and/or digital components. For
some systems (propagation in space) it can further be broken down in smaller
components such as those presented in figure 2.23. The modulator mapping is
a non-random transformation from the set of all outcomes of a random experiment into the real numbers; it is easily seen to satisfy the definition of a random
variable.14 A lower P (E ) can be achieved when the decision is based on si + n
rather than rj (we will see why in Wozencraft & Jacobs, Chapter 4). This leads
to the diagram of figure Wozencraft & Jacobs 2.34, page 78. Assume for sim-
plicity that there are only 2 possible messages m0 and m1 , with corresponding
“signals” s0 and s1 , and we assume WLOG (without loss of generality) that
s0 > s1 . The optimum decision rule is similar to that of example Wozencraft
14 A generalization of this will later be used to introduce the concept of random processes.

& Jacobs, pp 33 - 37: “. . . suppose r = ρ. As in the DCC, the probability of


correct decision is maximized by mapping ρ into the message m0 or m1 , for
which the a posteriori probability is maximum”. 15 This statement is justified
in §2.9.1 below.

Figure 2.22: Discrete communication channel used in example Wozencraft & Jacobs, pp 33 - 37 (block diagram: SOURCE (random) → DCC (random) → Decision (not random) → estimate of message, with transition probabilities from the messages m0, m1 to the received symbols r0, r1, r2 and decision mapping m̂ : ri ↦ mj).

Figure 2.23: Details of the continuous communication channel

2.9.1 Details of calculation of P (C )


1. The source is described as in §2.2:

• Sample space: Ωsource = {m0 , m1 },


• Class of events: 2Ωsource ,
15 Wozencraft & Jacobs, page 78. Also refer to A communication problem, page 33 in

W&J and page 11 of these notes.



• Probability function:
PS ({m0 }) = Pm0
PS ({m1 }) = Pm1
such that Pm0 + Pm1 = 1.
The source output is modulated (mapped into R):
s : Ωsource → R
s: m0 7 → s0
s: m1 7→ s1
which is easily seen to satisfy the axioms of a random variable.
2. Continuous Communication Channel: similarly to the Discrete Commu-
nication Channel of §2.2, it is a transformation with unpredictable char-
acteristics. In the present case, the channel output r is given by:
r =s+n
where n is a random variable statistically independent of s (equivalently n
is statistically independent of the events {m0 } and {m1 }) with probability
density function pn (α). Combining the source, the modulator and the
channel we obtain the following set of results:
{(m0 , ρ), (m1 , ρ) : ρ ∈ R},
and the likelihood of a result is described by the mixed form joint proba-
bility density function
pr (ρ, {mi }) = pr (ρ | {mi })P ({mi })
= pn+si (ρ | {mi })Pmi
= pn (ρ − si | {mi })Pmi
= pn (ρ − si )Pmi (2.13)
for i = 0, 1.
3. The decision element is a deterministic mapping
m̂ : R → {m0 , m1 } = Ωsource
m̂ : ρ 7→ m̂(ρ)
and similarly to §2.2 we define the sample space
ΩD = {(m0 , m0 ), (m0 , m1 ), (m1 , m0 ), (m1 , m1 )}
where the first component of each pair denotes the message transmitted
and the second component denotes the decision made. The probability of
each sample point is:

P({mi, mj}) = ∫_{m̂(ρ)=mj} pr(ρ, {mi}) dρ

Figure 2.24:

where i, j = 0, 1 (the above expression is analogous to PD ({(m0 , m1 )}) =


PDCC ({(m0 , r1 ), (m0 , r2 )}) = PDCC ({(m0 , r1 )}) + PDCC ({(m0 , r2 )}) on
page 12, §2.2 of these notes). We call

C = {(m0 , m0 ), (m1 , m1 )} ⊂ ΩD

the correct decision event:

P(C) = P({(m0, m0), (m1, m1)})
     = ∫_{m̂(ρ)=m0} pr(ρ, {m0}) dρ + ∫_{m̂(ρ)=m1} pr(ρ, {m1}) dρ
     = ∫_{I0} pr(ρ, {m0}) dρ + ∫_{I1} pr(ρ, {m1}) dρ            (2.14)

where we define:

Definition 2.9.1. The sets:

R ⊃ I0 = {ρ ∈ R : m̂(ρ) = m0 }
R ⊃ I1 = {ρ ∈ R : m̂(ρ) = m1 }

are called decision regions.

A graphical representation of the decision regions is shown in figure 2.24.


Clearly I0 ∩ I1 = ∅, I0 ∪ I1 = R, so that equation (2.14) may be rewritten as:
P(C) = ∫_{−∞}^{∞} pr(ρ, {m̂(ρ)}) dρ

(this is analogous to equation (2.1) in §2.2 of these notes). The optimum decision rule (to maximize P(C) in equation (2.14)) is:

m̂(ρ) = m0 if pr(ρ, {m0}) ≥ pr(ρ, {m1}); m1 otherwise           (2.15)

for every ρ ∈ R; this is the rule that is used to determine the optimal deci-
sion regions I0 , I1 . Finally, P (C ) of equation (2.14) is expanded as (equation
(2.117a) in W&J):
P(C) = ∫_{I0} pr(ρ|{m0}) P({m0}) dρ + ∫_{I1} pr(ρ|{m1}) P({m1}) dρ
     = Pm0 ∫_{I0} pr(ρ|{m0}) dρ + Pm1 ∫_{I1} pr(ρ|{m1}) dρ
     = Pm0 P(C|{m0}) + Pm1 P(C|{m1})

where P(C|{m0}) = ∫_{I0} pr(ρ|{m0}) dρ and P(C|{m1}) = ∫_{I1} pr(ρ|{m1}) dρ.
W&J start the analysis with the decision rule of equation W&J (2.108) (equivalent to the decision rule of equation (2.15) above), as do most textbooks; we have shown that this is optimal.

2.9.2 Design of the optimal receiver for a channel with


additive Gaussian noise
Starting from the probability statement of equation (2.13) and the optimal
decision rule (2.15), we have (equation (2.111) in W&J):
m̂ : ρ 7→ m0 ⇔ P ({m0 })pn (ρ − s0 ) > P ({m1 })pn (ρ − s1 )
Suppose n is 0-mean Gaussian (equation (2.113) in W&J):
pn(α) = (1/√(2πσ²)) e^{−α²/(2σ²)}

We claim that the optimum decision rule is equivalent to

m̂ : ρ ↦ m0 ⇔ ρ > a,

where

a = (s0 + s1)/2 + (σ²/(s0 − s1)) ln( P({m1})/P({m0}) ),
is called decision threshold, assuming that s0 > s1 . If s1 > s0 the inequality
is reversed, and we always use s0 ̸= s1 (obviously). The decision threshold
specifies the decision regions: I0 = {x ∈ R : x ≥ a}, I1 = {x ∈ R : x < a}.
Proof of the claim.

m̂ : ρ ↦ m0 ⇔ (P({m0})/√(2πσ²)) e^{−(ρ−s0)²/(2σ²)} > (P({m1})/√(2πσ²)) e^{−(ρ−s1)²/(2σ²)}
           ⇔ e^{−(ρ−s0)²/(2σ²)} / e^{−(ρ−s1)²/(2σ²)} > P({m1})/P({m0})
           ⇔ exp( −(ρ − s0)²/(2σ²) + (ρ − s1)²/(2σ²) ) > P({m1})/P({m0})

Since ln( ) is a strictly increasing function we can take ln( ) on both sides of the above inequality to obtain:

m̂ : ρ ↦ m0 ⇔ [−(ρ − s0)² + (ρ − s1)²]/(2σ²) > ln( P({m1})/P({m0}) )
           ⇔ −ρ² + 2ρs0 − s0² + ρ² − 2ρs1 + s1² > 2σ² ln( P({m1})/P({m0}) )
           ⇔ 2ρ(s0 − s1) > s0² − s1² + 2σ² ln( P({m1})/P({m0}) )
           ⇔ ρ > (s0 + s1)/2 + (σ²/(s0 − s1)) ln( P({m1})/P({m0}) ) = a

assuming s0 > s1 , otherwise the inequality is reversed in the last expression.


If moreover P ({m0 }) = P ({m1 }) = 1/2, the optimum decision rule becomes:
m̂ : ρ ↦ m0 ⇔ ρ ≥ (s0 + s1)/2
naturally.

2.9.3 Performance of the optimal receiver for a channel


with additive Gaussian noise
The performance of the receiver is measured by its probability of error (same
as the communication example of §2.2):

P (E ) = 1 − P (C )
= P (E |{m0 })P ({m0 }) + P (E |{m1 })P ({m1 })

Assuming WLOG16 that s0 > s1 (similarly to §2.9.2), we obtain:

P(E|{m0}) = P(r ∈ I1 |{m0})
          = P(r < a |{m0})
          = P(s0 + n < a |{m0})
          = P(n < a − s0 |{m0})
          = P(n < a − s0)
          = Q( (0 − (a − s0))/σ )
          = Q( (s0 − a)/σ )

Similarly we find

P(E|{m1}) = Q( (a − s1)/σ )

16 without loss of generality

and it follows that

P(E) = P({m0}) Q( (s0 − a)/σ ) + P({m1}) Q( (a − s1)/σ )

If moreover P({m0}) = P({m1}) = 1/2, then a = (s0 + s1)/2, (s0 − a)/σ = (a − s1)/σ = (s0 − s1)/(2σ), and the expression for the probability of error simplifies to P(E) = Q( (s0 − s1)/(2σ) ).
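The threshold, the decision rule and the resulting P(E) are easy to evaluate numerically; the following C sketch uses arbitrary illustration values for s0, s1, σ and the a priori probabilities (they are not taken from W&J).

/* Sketch of the binary receiver of sections 2.9.2 and 2.9.3. */
#include <stdio.h>
#include <math.h>

static double Q(double x) { return 0.5 * erfc(x / sqrt(2.0)); }

int main(void)
{
    double s0 = 1.0, s1 = -1.0, sigma = 0.8;   /* assumes s0 > s1 */
    double Pm0 = 0.7, Pm1 = 0.3;               /* a priori probabilities */

    /* optimal decision threshold */
    double a = 0.5 * (s0 + s1) + (sigma * sigma / (s0 - s1)) * log(Pm1 / Pm0);

    /* decision rule m_hat(rho): decide m0 iff rho > a */
    double rho = 0.1;
    printf("a = %f, received rho = %f -> decide %s\n", a, rho, (rho > a) ? "m0" : "m1");

    /* probability of error of the optimal receiver */
    double Pe = Pm0 * Q((s0 - a) / sigma) + Pm1 * Q((a - s1) / sigma);
    printf("P(E) = %f\n", Pe);

    /* upper bound discussed below (theorem 18) */
    printf("Q((s0-s1)/(2 sigma)) = %f\n", Q((s0 - s1) / (2.0 * sigma)));
    return 0;
}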
Theorem 18. For any a priori probabilities P({m0}), P({m1}) and corresponding optimal threshold a = (s0 + s1)/2 + (σ²/(s0 − s1)) ln( P({m1})/P({m0}) ) we have:

Q( (s0 − s1)/(2σ) ) ≥ P({m0}) Q( (s0 − a)/σ ) + P({m1}) Q( (a − s1)/σ )

where the left-hand side is the useful upper bound discussed below.

Proof. Refer to the last paragraph of page 81 and top of page 82 in W & J.

The significance of the upper bound is that it is possible to design a receiver for which the probability of error is no larger than Q( (s0 − s1)/(2σ) ). It also follows that the probability of error of an optimal receiver is largest when the messages are equiprobable, i.e. when the source is most unpredictable. The probability of error of an optimal receiver can be made arbitrarily small by making the messages unequally likely. For example a source with a priori message probabilities P({m0}) = 1, P({m1}) = 0 leads to a trivial receiver that always guesses message m0; no information is however communicated from such a trivial source.

Theorem 19. Let Eb denote the maximum allowable transmitted energy per message. The optimal receiver will achieve P(E) ≤ Q( √( 2 Eb/(2σ²) ) ).

Definition 2.9.2. The ratio Eb/(2σ²) is the signal-to-noise ratio in the system.

Remarks.

1. W&J uses Eb/σ² as definition of the signal-to-noise ratio. We will later see why it is preferable to use Eb/(2σ²); not important for now. The probability

of error is plotted as a function of the signal to noise ratio in figure 2.25.


In the worst case (low signal-to-noise ratio), the probability of error is 0.5,
i.e. equivalent to guessing between m0 and m1 without even looking at
the channel output r. For large signal-to-noise ratios, the probability of
error becomes smaller but is never 0. It follows that we cannot guarantee
the correct transmission of an arbitrarily large sequence; this has been the
belief of engineers for a long time until the work of Shannon in 1948.
2. This example illustrates the relationship between the various factors af-
fecting the error performance of a Continuous Communication Channel :
• addition of noise on the channel and in the receiver,
• randomness of the source,

Figure 2.25:

• maximum transmitted energy per message,


• choice of signals.

In chapter 5 (W&J) other factors are added to the list, such as bandwidth
of the channel, rate of transmission, . . .

2.10 Expected Value


Refer to Wozencraft & Jacobs pp 84 - 94 for the details.

The probability density function px (α) of a random variable x is required in


order to make probability calculations of events involving x. Obtaining the
probability density function of a random variable may however be difficult,
especially when modelling a physical random experiment. Moments, when they
exist, are numerical values that characterize a probability density function and
they are calculated using an operator called expected value. They sometimes
have a physical meaning and may be easy to calculate and/or measure. In
many situations the probability density function is obtained by first finding its
moments.

Definition 2.10.1. Let x denote a random variable with probability density



function px (α) and n > 0 an integer.

1. The n-th moment of px (α) is given by:


∫_{−∞}^{∞} α^n px(α) dα

when it exists (i.e. when the integral converges).

2. The expected value of x, denoted by E(x) or by x̄, is the first moment of


its probability density function:
E(x) ≜ ∫_{−∞}^{∞} α px(α) dα

when it exists; E(x) is also called mean of x.

For a random variable x = g(y) defined in terms of a transformation of ran-


dom variables, E(x) = E(g(y)) is the first moment of the probability density
function px (α) which must first be calculated using py (β) and the function g( )
following the techniques detailed in §2.4. The fundamental theorem of expec-
tation that follows shows that the first moment of px (α), and more generally
any moment of px (α), can be calculated directly from the probability density
function py (β) and the function g( ).

Theorem 20 (Fundamental theorem of expectation (W&J, page 85)). Let y


be a k- dimensional random vector with joint probability density function py (β)
and define the random variable x by the transformation x = g(y) where g( ) is
a function from Rk into R. Then

E[x] = E[g(y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(β) py(β) dβ          (2.16)

(refer to equation W&J (2.126b)).

Remark. It is usually faster to calculate E[x] by equation (2.16) rather than deriving px(α) from py(β) and g : R^k → R and then calculating ∫_{−∞}^{∞} α px(α) dα.

Sketch of proof. The outline of the proof can be represented as in figure 2.26.
It then follows that:

E[x] = ∫_{−∞}^{∞} α px(α) dα
     ≈ ∑_{i=−∞}^{∞} ai P({ω | x(ω) ∈ Ii})
     = ∑_{i=−∞}^{∞} ai P({ω | g(y(ω)) ∈ Ii})
     = ∑_{i=−∞}^{∞} g(βi) P({ω | y(ω) ∈ Bi})
     ≈ ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(β) py(β) dβ

if we assume that all Ii ’s and Bi ’s are very small.

Figure 2.26: Proof of the fundamental theorem of expectation (the domain of y in R^k is partitioned into small regions Bi = {β : g(β) ∈ Ii}, each of which maps under g : R^k → R onto a small interval Ii around the point ai in the domain of x).

In light of the theorem of expectation we have


∫_{−∞}^{∞} α^n px(α) dα = E(x^n)

and the n-th moment of the probability density function px (α) is then simply the
expected value of the random variable xn . From now on we naturally extend the

definition of moment of a probability density function to moment of a random


variable: the n-th moment of x is simply the n-th moment of the probability
density function of x.
Proposition 21 (Properties of expected value W&J, pp 87 - 88). Let x, y be
two random variables with mean values E(x), E(y) respectively. Then:
1. E(ax + by) = aE(x) + bE(y), ∀a, b ∈ R.
2. x, y are statistically independent ⇒ E(xy) = E(x)E(y).
Remark. The converse of property (2) is in general not true.
Definition 2.10.2 (W&J, p 88). Let x, y be random variables with probability
density function px (α), py (β) respectively.
1. Whenever finite, ∫_{−∞}^{∞} (α − x̄)^n px(α) dα = E[(x − x̄)^n] is called the n-th central moment of x.
2. The second central moment (if it exists), denoted by σx2 or V ar(x), is
called variance of x.
3. The standard deviation σx is the square root of the variance (if it exists).
Remark (W&J, p 89). The first and second moments can respectively be inter-
preted as center of gravity and moment of inertia.
The followings are easy to prove:
Proposition 22 (Properties of 2nd moments).
1. E(x2 ) = σx2 + x̄2 .
2. z = ax + by in which x, y are statistically independent ⇒ σz2 = a2 σx2 +
b2 σy2 .
3. z = ax ⇒ σz2 = a2 σx2 .
Definition 2.10.3. A complex random variable is a mapping:

w: Ω → C
w: ω 7 → x(ω) + jy(ω)

where x : Ω → R, y : Ω → R are random variables. The complex random


variable w is completely described by the joint probability density function
px,y (α, β) of x and y.
Using the fundamental theorem of expectation we naturally extend the def-
inition of expected value to complex random variables:
E(w) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (α + jβ) px,y(α, β) dα dβ = x̄ + j ȳ

We obtain a new interpretation of the characteristic function of a random vari-


able x:

Mx(ν) = ∫_{−∞}^{∞} px(α) e^{jνα} dα = E[e^{jνx}]

(refer to W&J, equation (2.136)). This, together with statement 4 of section


2.8 on page 49 of these notes, leads to a one-line-proof that “if z = x + y and
x, y are statistically independent, then Mz (ν) = Mx (ν)My (ν).” We also have:

Theorem 23. Let px (α) and Mx (ν) respectively denote the probability density
function and characteristic function of a random variable x of which all the
moments are finite.
1. Mx^{(n)}(0) = j^n E[x^n], where Mx^{(n)}(0) denotes the n-th derivative of Mx(ν) evaluated at ν = 0.

2. The moments completely specify the probability density function px (α).

Sketch of proof.

1. refer to W & J.

2. The Taylor series of ejνx (around ν = 0) is:

e^{jνx} = ∑_{n=0}^{∞} (jνx)^n / n!

It follows that Mx(ν) = E[e^{jνx}] = ∑_{n=0}^{∞} ((jν)^n/n!) E[x^n]. Finally, px(α) = (1/2π) ∫_{−∞}^{∞} Mx(ν) e^{−jνα} dν.

2.10.1 Moments of a Gaussian random variable


Definition 2.10.4. A random variable x is Gaussian if its characteristic func-
tion is of the form
Mx(ν) = e^{jνµ} e^{−ν²σ²/2}

where µ, σ² are constants in R.

We then find (using MAPLE or another math package):

Mx^{(1)}(ν) = (jµ − νσ²) e^{jνµ} e^{−ν²σ²/2}
Mx^{(2)}(ν) = −(µ² + 2jµσ²ν + σ² − ν²σ⁴) e^{jνµ} e^{−ν²σ²/2}

Substituting ν = 0 we then have:

Mx^{(1)}(0) = jµ
Mx^{(2)}(0) = −µ² − σ²
Finally we obtain:

x̄ = Mx^{(1)}(0)/j = µ
E[x²] = Mx^{(2)}(0)/j² = µ² + σ² ⇒ σx² = σ²

This shows that the parameters µ and σ² respectively represent E[x] and E[(x − x̄)²] = Var(x) and we consequently write the characteristic function of a Gaussian random variable as Mx(ν) = e^{jνx̄} e^{−ν²σx²/2}.

Theorem 24. The general form of the Gaussian probability density function
is:
px(α) = (1/√(2πσx²)) e^{−(α−x̄)²/(2σx²)}

Proof. Calculate F −1 [Mx (−ν)] with the above Mx (ν) using MAPLE, another
math package or a table of Fourier transform pairs.
Theorem 25.
1. If x is a Gaussian random variable, then:

   E[(x − x̄)^n] =
     0                              ; n odd
     n! σx^n / (2^{n/2} (n/2)!)     ; n even

2. The sum of N statistically independent Gaussian random variables is also Gaussian.

Sketch of proof.

1. Let y = x − x̄. Then y is a Gaussian random variable with ȳ = 0 and σy² = σx² (easy to prove). It follows that My(ν) = e^{−ν²σx²/2}. From theorem 23 on page 63 we obtain (using MAPLE or another math package):

   ȳ = −j My^{(1)}(0) = 0          E[y²] = −My^{(2)}(0) = σx²
   E[y³] = j My^{(3)}(0) = 0        E[y⁴] = My^{(4)}(0) = 3(σx²)²
   E[y⁵] = −j My^{(5)}(0) = 0       E[y⁶] = −My^{(6)}(0) = 15(σx²)³
   E[y⁷] = j My^{(7)}(0) = 0        E[y⁸] = My^{(8)}(0) = 105(σx²)⁴
   E[y⁹] = −j My^{(9)}(0) = 0       E[y¹⁰] = −My^{(10)}(0) = 945(σx²)⁵
   ...

2. Refer to W&J page 93.



2.10.2 Moments of some specific random variables


1. Exponential probability density function: c > 0

   px(α) = (1/c) e^{−α/c} u(α)

   where u( ) denotes the unit step function. We then have

   E[x] = c,
   E[x²] = 2c²,
   E[(x − x̄)²] = c² = σx²

   (a Monte Carlo check of these moments is sketched after this list). To obtain the characteristic function, we start with the Fourier transform pair:

   e^{−at} u(t) ←→ 1/(a + jω)

   It follows that:

   px(α) = (1/c) e^{−α/c} u(α) ←→ Px(ω) = (1/c)/((1/c) + jω) = 1/(1 + jcω)

   Finally

   Mx(ν) = Px(−ν) = 1/(1 − jcν)

   We easily find:

   x̄ = Mx′(0)/j = c
   E[x²] = Mx′′(0)/j² = 2c²

   as already established.

2. Rayleigh probability density function: b > 0


   px(α) = (2α/b) e^{−α²/b} u(α)

   where u( ) denotes the unit step function.17 We then have

   E[x] = (1/2)√(bπ),
   E[x²] = b,
   E[(x − x̄)²] = b(1 − π/4) = σx² ≈ 0.2146 b
4
17 Equation Wozencraft & Jacobs(7.75b) is not quite the same as equation Wozencraft &

Jacobs(2.46a), but it is equivalent.



The characteristic function is given by:


   Mx(ν) = ∫_{−∞}^{∞} px(α) e^{jνα} dα
         = ∫_{0}^{∞} (2α/b) e^{−α²/b} e^{jνα} dα
         = ∫_{0}^{∞} (2α/b) e^{−α²/b + jνα} dα
         = 1 + (jν√(bπ)/2) e^{−ν²b/4} ( 1 + erf(jν√b/2) )

   where the last equality is obtained by MAPLE. We easily find (using MAPLE again):

   x̄ = Mx′(0)/j = √(bπ)/2
   E[x²] = Mx′′(0)/j² = b
j2

as already established.

3. Uniform probability density function: b > a

   px(α) = 1/(b − a)   ; a ≤ α ≤ b, and 0 elsewhere

   We then have:

   E[x] = (b + a)/2,
   E[x²] = (1/3)(a² + b² + ab),
   E[(x − x̄)²] = (b − a)²/12 = σx²

   The characteristic function is given by:

   Mx(ν) = ∫_{a}^{b} e^{jνα}/(b − a) dα
         = ( j/(a − b) ) ( 1/ν ) ( e^{jνb} − e^{jνa} )

   where the last equality is obtained by MAPLE. Multiplying the last equation by e^{−jν(b+a)/2} e^{jν(b+a)/2} we obtain:

   Mx(ν) = ( e^{jν(b+a)/2}/(b − a) ) ( 1/(jν) ) ( e^{jν(b−a)/2} − e^{−jν(b−a)/2} )
         = ( e^{jν(b+a)/2}/(b − a) ) ( 2/ν ) sin(ν(b − a)/2)
         = e^{jν(b+a)/2} sin(πν(b − a)/(2π)) / (πν(b − a)/(2π))
         = e^{jν(b+a)/2} sinc( ν(b − a)/(2π) )

   We easily find (using MAPLE):

   x̄ = Mx′(0)/j = (a + b)/2
   E[x²] = Mx′′(0)/j² = (a² + ab + b²)/3
as already established.

4. Cauchy probability density function: b > 0


   px(α) = (b/π)/(b² + α²)

   We then have: E[x], E[x²] and E[(x − x̄)²] do not exist. The characteristic function is given by:

   Mx(ν) = e^{−b|ν|}
(refer to problem 2.35 in W&J). The characteristic function is not differ-
entiable at ν = 0 as expected since the first and second moments (in fact
any moment) do not exist.

5. Binomial probability density function: N ∈ N∗ , 0 < p < 1


   px(α) = ∑_{k=0}^{N} (N choose k) p^k (1 − p)^{N−k} δ(α − k)

We then have
E[x] = N p,
E[x2 ] = N p(N p + 1 − p),
E[(x − x̄)2 ] = N p(1 − p) = σx2

The characteristic function is given by:


   Mx(ν) = E[e^{jνx}]
         = ∑_{k=0}^{N} e^{jνk} (N choose k) p^k (1 − p)^{N−k}
         = (1 − p + p e^{jν})^N

   where the last equality is obtained by MAPLE. We easily find:

   x̄ = Mx′(0)/j = Np
   E[x²] = Mx′′(0)/j² = p²N² + Np − p²N
as already established.
6. Poisson probability density function: ρ > 0

   px(α) = ∑_{k=0}^{∞} e^{−ρ} (ρ^k/k!) δ(α − k)

   We then have

   E[x] = ρ,
   E[x²] = ρ(ρ + 1),
   E[(x − x̄)²] = ρ = σx²

   Refer to problem 2.34 in W&J for the calculation of the characteristic function. We find:

   Mx(ν) = e^{ρ(e^{jν} − 1)} = e^{ρ(cos(ν)−1)} e^{jρ sin(ν)}

   We easily find:

   x̄ = Mx′(0)/j = ρ
   E[x²] = Mx′′(0)/j² = ρ² + ρ
as already established.
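Here is the Monte Carlo check of the exponential moments referred to in item 1; the value of c and the sample size are arbitrary illustration choices, and the samples are generated with the inverse-distribution method of theorem 13.

/* Sketch: sample moments of the exponential density (1/c) exp(-x/c) u(x). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const double c = 2.0;              /* arbitrary illustration value */
    const long   N = 1000000;
    double m1 = 0.0, m2 = 0.0;

    for (long i = 0; i < N; i++) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double x = -c * log(u);        /* exponential with mean c */
        m1 += x;
        m2 += x * x;
    }
    m1 /= N;
    m2 /= N;
    printf("E[x]   : sample %f, exact %f\n", m1, c);
    printf("E[x^2] : sample %f, exact %f\n", m2, 2.0 * c * c);
    printf("Var(x) : sample %f, exact %f\n", m2 - m1 * m1, c * c);
    return 0;
}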
The probability density function px (α) of a random variable x leads to the
probability calculation of events involving x. Through the E( ) operator we also
obtain the moments (when they exist) and the characteristic function Mx (ν).
The moments and/or the characteristic function may conversely be used to
perform probability calculations of events on x by first recovering the probability
density function px (α). In the next section we see that the moments can directly
be used (i.e. without first calculating px (α)) to approximate some probabilities;
Chebyshev’s inequality (lemma 26) is one such example.

2.11 Limit Theorems


Refer to Wozencraft & Jacobs, pages 94 to 111.

Motivating problem: You walk into a classroom of 100+ stu-


dents you’ve never seen before. The students just received a test for
which the class average is 75% and the standard deviation is 10 (i.e.
the variance is 100).
For this game, you are asked to pick a group of N students at once
and you win if this group’s average is within 10 points of the class
average, i.e. between 65% and 85%. The prize is $100. Ideally, you
could take the whole class and you would certainly win, but for each
of the N students you pick, you have to pay $1; thus, taking the
whole class would cost you more than you would win.
How many students should you take to be 95% certain of winning,
i.e. 95% chances of making a net 100 − N dollars and 5% chances of losing N dollars?
Answer: With the weak law of large numbers we will find N = 20.
That means you have at least 95% chances of making $80 and at
most 5% chances of loosing $20. With the Central Limit Theorem
we will find that N > 3.84, i.e. 4 students may be sufficient.18
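Anticipating the two limit theorems, here is a sketch of where these numbers come from, under the assumption that the N selected scores behave like statistically independent, identically distributed random variables with mean 75 and variance 100. The sample mean m of N scores then has E(m) = 75 and Var(m) = 100/N. Chebyshev's inequality (lemma 26) gives

P(|m − 75| ≥ 10) ≤ (100/N)/10² = 1/N,

so requiring 1/N ≤ 0.05 gives N ≥ 20. The Central Limit Theorem instead approximates

P(|m − 75| ≥ 10) ≈ 2Q(10/√(100/N)) = 2Q(√N) ≤ 0.05 ⇒ Q(√N) ≤ 0.025 ⇒ √N ≥ 1.96 ⇒ N ≥ 3.84.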

Definition 2.11.1. Let xi , i = 1, 2, . . . , N denote N statistically independent


identically distributed random variables. The random variable m defined by:

m = (1/N) ∑_{i=1}^{N} xi

is called sample mean of x1, x2, ..., xN.

We first note that

E(m) = x̄i = x̄
Var(m) = σ²xi / N = σx²/N
In this section we study some formal probability statements, referred to as limit
theorems, about the sample mean. In particular we will see that it is fair to
assume that m is close to E(x) as N becomes large.
Remark. If xi is a random variable defined from the i-th trial of a compound experiment, then the sample mean (1/N) ∑_{i=1}^{N} xi is a statistical estimate of the
expected value x̄. The definition of the sample mean does not require that the
xi be statistically independent repeated trials of the same experiment but in
practice this is often the case.
18 This assumes that the sample mean with only 4 random variables has an approximately

Gaussian density function.



Lemma 26 (Chebyshev’s inequality, W&J, page 95). Let y be a random variable


with finite variance σy2 . Then for any positive ϵ we have P (|y − ȳ| ≥ ϵ) ≤ σy2 /ϵ2 .

Proof. We have
σy² = ∫_{−∞}^{∞} (α − ȳ)² py(α) dα                 (2.17)
    = ∫_{−∞}^{∞} β² py(β + ȳ) dβ                   (2.18)
    ≥ ∫_{|β|≥ϵ} β² py(β + ȳ) dβ
    ≥ ϵ² ∫_{|β|≥ϵ} py(β + ȳ) dβ                    (2.19)
    = ϵ² ∫_{|α−ȳ|≥ϵ} py(α) dα                      (2.20)
    = ϵ² P(|y − ȳ| ≥ ϵ)

equation (2.18) follows from (2.17) using the change of variable β = α − ȳ and equation (2.20) follows from (2.19) using the change of variable α = β + ȳ.

In concrete words the lemma says that “The variance is a measure of the
randomness of a random variable”.19

2.11.1 Weak law of large numbers


An immediate application of Chebyshev’s inequality is that “the probability that
the sample mean differs from the true mean by more than ϵ > 0 approaches 0
as N → ∞”.20 This is the first of our limit theorems and it is formally stated
in theorem 27.

Theorem 27 (Weak law of large numbers, W&J, pp 96 - 97). Let x1 , x2 , . . . , xN


be statistically independent identically distributed random variables with mean
x̄ and variance σx2 . Then ∀ϵ > 0 we have
P( | (1/N) ∑_{i=1}^{N} xi − x̄ | ≥ ϵ ) ≤ σx²/(Nϵ²).          (2.21)

Proof. Easy; follows immediately from Chebyshev’s inequality

In the last two paragraphs of The Weak Law of Large Numbers, Wozen-
craft&Jacobs present a very important special case that confirms the validity
of the estimation of the probability of an event A by Monte Carlo simulations.
19 W&J, page 94
20 W&J, page 96

This is what we meant in item #2 described under Relation of the Model to the
Real World on page 6 of these notes. This also leads to a (conservative) bound
on the confidence intervals given a certain number of repetitions.21

2.11.2 Central Limit Theorem


Theorem 28 (Central Limit Theorem, W&J, pp 106 - 111). Let y1 , y2 , . . . , yN
denote N statistically independent, identically distributed random variables with
mean ȳ and (finite) variance σy2 . We define the random variable:

z = (1/√N) ∑_{i=1}^{N} (yi − ȳ) = √N (m − ȳ)

and denote by Fz(α) the probability distribution function of z. Then:

lim_{N→∞} Fz(α) = ∫_{−∞}^{α} (1/(√(2π) σy)) e^{−β²/(2σy²)} dβ
In concrete terms, the distribution function of the sample mean becomes
Gaussian as N increases.
Remarks. (pp 108 - 109):
1. The result does not imply that pz (α) is Gaussian (refer to Wozen-
craft&Jacobs, page 108 for a very good discussion).
2. As N becomes large the shape of the distribution function of the sum
approaches the shape of the Gaussian distribution function. “Large N ”
depends on the shape of the density function of each yi . Obviously if the
yi ’s are all Gaussian to start with, then the result is true for any N . In fact
if the yi are all Gaussian they don’t need to be statistically independent,
nor do they need to have the same variance nor the same mean. This
general result will be shown in chapter 3 (W&J),22 but the special case of
two jointly Gaussian random variables of equal unity variance and equal
zero-mean can be shown using equations (2.87) (W&J) and (2.58) (W&J).
This is useful, if not required to solve problems 2.25 (W&J) and 3.2(c)
(W&J).
3. The result explains why we often model the noise in a system with a
Gaussian probability density function (more on this in problems 3.15,
3.16).
4. The result finds applications similar to the weak law of large numbers:

   P( (1/√N) ∑_{i=1}^{N} (yi − ȳ) ≥ a ) ≈ Q(a/σy)

when N is “large” and |a/σy | is “relatively small”.


21 W&J, page 97.
22 property 4 on pages 166 - 168

5. The approximation:
P( (1/N) ∑_{i=1}^{N} yi − ȳ ≥ ϵ ) = P( (1/N) ∑_{i=1}^{N} (yi − ȳ) ≥ ϵ )
                                  = P( (1/√N) ∑_{i=1}^{N} (yi − ȳ) ≥ ϵ√N )
                                  ≈ Q( ϵ√N/σy )              (2.22)

is not normally used. The Chernoff bound (§2.11.3) is much tighter for
large values of N and is also much tighter than the weak law of large num-
bers. The above approximation is nonetheless sometimes used to estimate
the confidence intervals of simulation results, because it is easier to use
than the Chernoff bound.

Sketch of proof. Assume ȳ = 0:

(1) W&J equation (2.174)

z = (1/√N) ∑_{i=1}^{N} yi ⇒ Mz(ν) = [My(ν/√N)]^N

(2) W&J equation (2.175b)

ln[Mz(ν)] = N ln[My(ν/√N)]

(3) Using the Taylor series of My(ν) with ȳ = 0, as in the proof of part 2 of theorem 23, we obtain (W&J equation (2.176)):

My(ν) = 1 − ν²σy²/2 + ν³ f(ν),    where f(ν) goes to −j E[y³]/3! as ν → 0

and it follows that:

My(ν/√N) = 1 − ν²σy²/(2N) + (ν³/N^{3/2}) f(ν/√N),    where f(ν/√N) goes to −j E[y³]/3! as N → ∞

(2) & (3) together with the Taylor series of ln(1 + w) (refer to equations (2.176), (2.177) in W&J):

ln[Mz(ν)] = N ln[ 1 − ( ν²σy²/(2N) − (ν³/N^{3/2}) f(ν/√N) ) ]     (the bracketed term is −w)
          = N ln(1 + w)
          = Nw − Nw²/2 + Nw³/3 − ···     when |w| < 1
          ≈ Nw                           if |w| ≪ 1

* The latter condition |w| ≪ 1 is achieved when N is large enough. We then have:

ln(Mz(ν)) ≈ Nw = −ν²σy²/2 + (ν³/√N) f(ν/√N)

** If N is even larger, then ln[Mz(ν)] = −ν²σy²/2, from which we obtain

Mz(ν) = e^{−ν²σy²/2}

In concrete words the above says that if My(ν) is the characteristic function of y1, y2, ..., yN, then the characteristic function of z = (1/√N) ∑_{i=1}^{N} yi is Mz(ν) = My(ν/√N)^N. When N becomes large enough this approaches e^{−ν²σy²/2}. This is illustrated in the following example.
illustrated in the following example.

Example 2.11.1. Let y be a uniformly distributed random variable with prob-


ability density function sketched in figure 2.27. The characteristic function of y
is given by:
My(ν) = sin(ν/2)/(ν/2)

and σy² = Var(y) = 1/12 (refer to §2.10.2). The plots shown in figure 2.28 illustrate that My(ν/√N)^N approaches e^{−ν²σy²/2} as N increases.
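The comparison can also be made numerically; the C sketch below evaluates My(ν/√N)^N and the limiting function e^{−ν²/24} at a few values of ν (the particular values of ν and N are arbitrary).

/* Sketch: My(nu/sqrt(N))^N versus exp(-nu^2 sigma_y^2/2) for sigma_y^2 = 1/12. */
#include <stdio.h>
#include <math.h>

static double My(double nu)                    /* sin(nu/2)/(nu/2) */
{
    if (fabs(nu) < 1e-12) return 1.0;
    return sin(nu / 2.0) / (nu / 2.0);
}

int main(void)
{
    int Ns[] = {1, 2, 4, 16};
    for (double nu = 0.0; nu <= 8.0; nu += 2.0) {
        printf("nu = %4.1f  limit = %8.5f ", nu, exp(-nu * nu / 24.0));
        for (int k = 0; k < 4; k++) {
            int N = Ns[k];
            printf("  N=%2d: %8.5f", N, pow(My(nu / sqrt((double)N)), N));
        }
        printf("\n");
    }
    return 0;
}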

2.11.3 Chernoff Bound


Refer to W&J, pp 97 - 106.

Figure 2.27: Probability density function of y in example 2.11.1 (py(α) uniform, equal to 1 on the interval [−1/2, 1/2]).

Theorem 29. Let x1, x2, ..., xN be statistically independent, identically distributed random variables and let x denote any one of the xi's. If E[x e^{λ0 x}] and E[e^{λ0 x}] are both finite then:

E[e^{λ0(x−d)}]^N ≥ P( (1/N) ∑_{i=1}^{N} xi ≥ d )    if d > x̄
E[e^{λ0(x−d)}]^N ≥ P( (1/N) ∑_{i=1}^{N} xi ≤ d )    if d < x̄

where λ0 is given implicitly by the equation:

E[x e^{λ0 x}] / E[e^{λ0 x}] = d

Remarks (W&J, page 101).

1. The conditions of the Chernoff bound are always verified if the xi ’s are
discrete random variables taking only a finite number of values.

2. d > x̄ ⇒ λ0 ≥ 0

d < x̄ ⇒ λ0 ≤ 0

3. The Chernoff bound decreases exponentially with N .

4. The Chernoff bound is exponentially tight.

In the following examples we illustrate the application of the Chernoff Bound


and compare its value with the bound obtained by the Weak Law of Large
Numbers and the exact probability value.

Example 2.11.2 (W&J, pp 97 - 106). Consider the compound experiment


of throwing a coin N times. We want to find the probability of obtaining at
least 3N/4 times the same face. We also want to look at the behaviour of this
probability as N becomes large (N → ∞).
Figure 2.28: Illustration of lim_{N→∞} My(ν/√N)^N = e^{−ν²σy²/2}. Panels: (a) My(ν), N = 1; (b) My(ν/√2)², N = 2; (c) My(ν/√N)^N for N = 1, 2, 4; (d) My(ν/2)⁴ ≈ e^{−ν²/24}.

1. Exact probability:

P( same face 3N/4 or more times ) = 2 ∑_{k=3N/4}^{N} (N choose k) (1/2)^k (1/2)^{N−k}
                                  = (1/2^{N−1}) ∑_{k=3N/4}^{N} (N choose k)

This expression does not easily show the behaviour as N increases.


2. Upper bound of the probability with the Weak Law of Large Numbers:
Define N random variables x1, x2, ..., xN by:

xi = 1 if “Tail” on i-th toss, and −1 if “Head” on i-th toss

Then P (xi = 1) = P (xi = −1) = 1/2. The xi ’s are statistically indepen-


dent and are identically distributed:
pxi(α) = (1/2) δ(α + 1) + (1/2) δ(α − 1)

Let m = (1/N) ∑_{i=1}^{N} xi. Then we easily see that:

same face 3N/4 times or more ⇔ |m| ≥ 1/2

In order to invoke the Weak Law of Large Numbers we calculate:

x̄ = 0
σx² = (1/2)(1)² + (1/2)(−1)² = 1

Then:

P(|m| ≥ 1/2) ≤ 1/(N(1/2)²) = 4/N
We now see that the probability goes to 0 as N → ∞.
3. Upper bound of the probability with the Chernoff Bound: Define
x1 , x2 , . . . , xN and m as above. Once again we have x̄ = 0 (and although
it will not be needed σx2 = 1). The Chernoff bound is given by equations
W&J, (2.163a) and (2.163b).
(a) Let d = 1/2 > x̄. Then we have
P (m ≥ 1/2) ≤ E[eλ0 (x−1/2) ]N
where λ0 is obtained by solving the equation:
E[x e^{λ0 x}] / E[e^{λ0 x}] = 1/2

We find:

E[x e^{λ0 x}] = (1/2)e^{λ0} + (1/2)(−e^{−λ0}) = (e^{λ0} − e^{−λ0})/2
E[e^{λ0 x}] = (1/2)e^{λ0} + (1/2)e^{−λ0} = (e^{λ0} + e^{−λ0})/2

Therefore (e^{λ0} − e^{−λ0})/(e^{λ0} + e^{−λ0}) = 1/2 ⇒ λ0 = ln(√3). It follows that

E[e^{λ0(x−1/2)}] = (1/2)e^{λ0(1−1/2)} + (1/2)e^{λ0(−1−1/2)}
                = (e^{λ0/2} + e^{−3λ0/2})/2
                = 2/3^{3/4}

Finally we obtain:

P(m ≥ 1/2) ≤ (2/3^{3/4})^N ≈ 0.877383^N

(b) Let d = −1/2 < x̄. Then we have

P(m ≤ −1/2) ≤ E[e^{λ0(x+1/2)}]^N

where λ0 is obtained by solving the equation:

E[x e^{λ0 x}] / E[e^{λ0 x}] = −1/2

We find (as before):

E[x e^{λ0 x}] = (1/2)e^{λ0} + (1/2)(−e^{−λ0}) = (e^{λ0} − e^{−λ0})/2
E[e^{λ0 x}] = (1/2)e^{λ0} + (1/2)e^{−λ0} = (e^{λ0} + e^{−λ0})/2

Therefore (e^{λ0} − e^{−λ0})/(e^{λ0} + e^{−λ0}) = −1/2 ⇒ λ0 = −ln(√3). It follows that

E[e^{λ0(x+1/2)}] = (1/2)e^{λ0(1+1/2)} + (1/2)e^{λ0(−1+1/2)}
                = (e^{3λ0/2} + e^{−λ0/2})/2
                = 2/3^{3/4}     (same as before)

Finally we obtain:

P(m ≤ −1/2) ≤ (2/3^{3/4})^N ≈ 0.877383^N

The events {m ≤ −1/2} and {m ≥ 1/2} are disjoint. Consequently

P(|m| ≥ 1/2) = P(m ≥ 1/2) + P(m ≤ −1/2) ≤ 2 (2/3^{3/4})^N ≈ 2 × 0.877383^N

Once again we see that the probability goes to 0 as N → ∞.



In summary:
1. Exact probability value: (1/2^{N−1}) ∑_{k=3N/4}^{N} (N choose k)

2. Weak Law of Large Numbers (upper) bound: 4/N

3. Chernoff (upper) bound: 2 × 0.877383^N

The graphs shown in figures 2.29 illustrate that the probability goes to 0 as
N → ∞. The weak law of large numbers bound becomes looser as N increases.
Whenever possible the Chernoff bound should be used because it is the tightest
exponential bound.
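The three expressions in the summary are easily tabulated; the C sketch below prints them for a few values of N that are multiples of 4 (an arbitrary selection), using lgamma() to evaluate the binomial coefficients.

/* Sketch: exact probability, weak-law bound 4/N and Chernoff bound 2*(2/3^(3/4))^N. */
#include <stdio.h>
#include <math.h>

static double log_binom(int n, int k)          /* ln C(n,k) via lgamma */
{
    return lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
}

int main(void)
{
    double chernoff_base = 2.0 / pow(3.0, 0.75);   /* ~0.877383 */

    for (int N = 4; N <= 64; N *= 2) {
        double exact = 0.0;
        for (int k = 3 * N / 4; k <= N; k++)
            exact += exp(log_binom(N, k) - (N - 1) * log(2.0));
        printf("N = %3d  exact = %.3e  WLLN bound = %.3e  Chernoff = %.3e\n",
               N, exact, 4.0 / N, 2.0 * pow(chernoff_base, N));
    }
    return 0;
}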

Example 2.11.3 (W&J, pp 102 - 106). As a second application of the Chernoff


bound, consider the compound experiment of problem 2.10(c). In order to
transmit information on the binary symmetric channel of transition probability
p = 1/6, a symbol is repeatedly transmitted N times (N odd) and the receiver
estimates the transmitted message by taking a majority vote on the received
symbols. If messages “0” and “1” are equally likely the exact probability of
error P (E ) is given by:


P(E) = ∑_{k=(N+1)/2}^{N} (N choose k) (1/6)^k (5/6)^{N−k}

In order to better see the behaviour of this probability as N becomes large


(N → ∞) we calculate the Chernoff bound of the above probability expression.
To be more general we consider the binary symmetric channel of transition
probability p where we assume WLOG p < 1/2 and as before each equally
likely message (“0” or “1”) is transmitted N times (N odd) and the receiver
estimates the message by majority vote on the received symbols. In order to
apply the Chernoff bound we rephrase the problem in terms of random variables
x1 , x2 , . . . , xN defined by:
xi = 0 if the i-th digit is received correctly, and 1 if the i-th digit is not received correctly

It is then easy to show that the error event E is equivalent to:

E ⇔ ∑_{i=1}^{N} xi ≥ (N + 1)/2
  ⇔ ∑_{i=1}^{N} xi ≥ N/2
  ⇔ (1/N) ∑_{i=1}^{N} xi ≥ 1/2
since by assumption N is odd. Therefore P(E) = P( (1/N) ∑_{i=1}^{N} xi ≥ 1/2 ), where:

x̄ = (1 − p)(0) + (p)(1) = p
E[x²] = (1 − p)(0)² + (p)(1)² = p        (not required to apply the Chernoff bound)
σx² = p − p² = p(1 − p)                  (not required to apply the Chernoff bound)

and x denotes any one of the random variables xi. The Chernoff bound (d = 1/2 > x̄ = p) then states that:

P( (1/N) ∑_{i=1}^{N} xi ≥ 1/2 ) ≤ E[e^{λ0(x−1/2)}]^N

where λ0 = ln((1 − p)/p) > 0 since by assumption p < 1/2 (refer W&J, page 103 for the details). It follows that:

E[e^{λ0(x−1/2)}] = √(2p) √(2(1 − p))
                = 2√(p(1 − p)) < 1       (since p < 1/2)

So P( (1/N) ∑_{i=1}^{N} xi ≥ 1/2 ) ≤ ( 2√(p(1 − p)) )^N and we immediately see that the probability of error goes to 0 exponentially as N → ∞.
The graphs in figure 2.30 show the behaviour of the Chernoff bound in
comparison to the exact probability of error (obtained in problem 2.10(c)) when
p = 1/6.
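The same comparison can be reproduced with the short C sketch below, which tabulates the exact P(E) and the Chernoff bound (2√(p(1 − p)))^N for p = 1/6 and a few odd values of N (the particular values of N are arbitrary).

/* Sketch: exact P(E) of the majority-vote receiver versus the Chernoff bound. */
#include <stdio.h>
#include <math.h>

static double log_binom(int n, int k)
{
    return lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
}

int main(void)
{
    double p = 1.0 / 6.0;
    double bound_base = 2.0 * sqrt(p * (1.0 - p));   /* < 1 since p < 1/2 */

    for (int N = 1; N <= 21; N += 4) {               /* odd values of N */
        double Pe = 0.0;
        for (int k = (N + 1) / 2; k <= N; k++)
            Pe += exp(log_binom(N, k) + k * log(p) + (N - k) * log(1.0 - p));
        printf("N = %2d  exact P(E) = %.3e  Chernoff bound = %.3e\n",
               N, Pe, pow(bound_base, N));
    }
    return 0;
}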





 



   



 



         
Figure 2.29: Probability of obtaining 3N/4 times the same face with N throws of a coin. Panels: (a) linear (vertical) scale; (b) logarithmic (vertical) scale.
Figure 2.30: Probability of error as a function of N. Panels: (a) linear (vertical) scale; (b) logarithmic (vertical) scale.


Additional Topics in Probability Theory

We now focus on the special case of Gaussian random vectors. We give a formal definition and state the properties which will later be most important to the analysis of Gaussian processes. Before considering Gaussian random vectors, we study in more generality the (mixed) moments of random vectors.

2.12 Moments (mixed) of Random Vectors


Let x = (x1, . . . , xk) be a random vector. From the theorem of expectation we
have

E[g(x)] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(α) px(α) dα .

In particular we have:

E[x1^{i1} x2^{i2} ··· xk^{ik}] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} α1^{i1} α2^{i2} ··· αk^{ik} px(α) dα

Definition 2.12.1.

1. If the integrals converge, E[x1^{i1} ··· xk^{ik}] is called n-th mixed moment of x,
   where n = i1 + i2 + . . . + ik.

2. If the integrals converge, E[ ∏_{j=1}^{k} (xj − x̄j)^{ij} ] is called n-th mixed central
   moment of x.

3. The second mixed central moment of two random variables x1, x2 is called
   covariance and denoted Cov(x1, x2). Clearly Cov(x1, x2) = E[x1 x2] − x̄1 x̄2.

In general there will be many different n-th mixed moments of a random


vector x. This is illustrated in the example that follows. The practical meaning
of mixed moments is to give information about the joint probability density
function of a random vector, in applications for which the joint probability


density function is not known but statistical estimates of the mixed moments
are available.

Lemma 30. The total number of different n-th moments of a set of v random
variables is given by the expression:
∑_{k=1}^{v} (v choose k) (n−1 choose k−1)

where, by convention, (n choose k) = 0 whenever k > n.

Sketch of proof. This is equivalent to the number of ways in which a line segment
of length n can be divided into v or fewer sub-segments (all lengths are integer).

Example 2.12.1. Let x = (x1, x2). There are 3 second moments of x:
E[x1²], E[x2²], E[x1 x2]. The second central moments are Var(x1) = σ²_{x1}, Var(x2) = σ²_{x2}
and Cov(x1, x2).

If x1 and x2 are statistically independent then:

Cov(x1, x2) = E[x1 x2] − x̄1 x̄2
            = E[x1]E[x2] − x̄1 x̄2
            = 0

If x1 and x2 are related by x2 = ax1, a ≠ 0, then:

Cov(x1, x2) = E[(x1 − x̄1) a (x1 − x̄1)]
            = a Var(x1)
            = +√(Var(x1)Var(x2)) ; a > 0
            = −√(Var(x1)Var(x2)) ; a < 0

since V ar(x2 ) = a2 V ar(x1 ). The (normalized) correlation coefficient is used


as an indicator of the “linear resemblance” between 2 random variables. It is
usually denoted by ρ and defined by:

ρ ≜ Cov(x1, x2) / (σ_{x1} σ_{x2})

Proposition 31.

1. If x1 , x2 are statistically independent then ρ = 0.

2. If x2 = ax1 , a ̸= 0 then ρ = ±1

3. −1 ≤ ρ ≤ 1.

Proof. Only part 3 of the proposition remains to be proved (parts 1 and 2 were
shown before the statement of the proposition). Let z = x1 + λx2 where λ ∈ R.
Then for every λ we have:

0 ≤ Var(z) = Var(x1 + λx2)
           = E[(x1 + λx2)²] − E[x1 + λx2]²
           = E[x1²] + λ²E[x2²] + 2λE[x1 x2] − E[x1]² − λ²E[x2]² − 2λE[x1]E[x2]
           = Var(x1) + λ²Var(x2) + 2λCov(x1, x2)

The parabola in λ is always positive or 0 ⇒ the parabola has no distinct real
roots ⇒ its discriminant ∆ is not positive:

0 ≥ ∆ = “b² − 4ac” = 4 Cov(x1, x2)² − 4 Var(x1)Var(x2)

Therefore Cov(x1, x2)² / (Var(x1)Var(x2)) ≤ 1 ⇒ ρ² ≤ 1 ⇒ |ρ| ≤ 1.

In concrete words |ρ| represents the amount of randomness that a random


variable exhibits with respect to another random variable where a small value
of |ρ| means a lot of randomness.
Remark. This result is somewhat counter intuitive by the fact that the word
covariance contains the root variance. Chebyshev’s inequality, lemma 26, relates
a large variance with a lot of randomness in a random variable. Here we have
that a small |ρ|, which is also related to a relatively small covariance, means a
lot of randomness of a random variable with respect to another random variable.
Example 2.12.2. Let x1 , x2 be jointly Gaussian random variables with joint
probability density function:
px1,x2(α1, α2) = (1/(2π√(1 − ρ²))) exp( −(α1² − 2ρ α1 α2 + α2²) / (2(1 − ρ²)) ),

−1 < ρ < 1, from equation W&J (2.58). The parameter ρ used in this expression
is precisely the correlation coefficient of x1 and x2. Indeed, in this case we found
that x1, x2 are both Gaussian with x̄1 = x̄2 = 0 and σ²_{x1} = σ²_{x2} = 1 (refer to
equation (2.64), page 56 in W&J). It follows that the correlation coefficient is
simply given by

correlation coefficient = E[(x1 − x̄1)(x2 − x̄2)] / √(σ²_{x1} σ²_{x2}) = E[x1 x2]

We easily obtain (using MAPLE or math tables):

correlation coefficient = E[x1 x2] = ∫∫_{−∞}^{∞} α1 α2 px1,x2(α1, α2) dα1 dα2 = ρ

The significance of ρ in the joint Gaussian density function was pointed out
from the results obtained at W&J pages 52 and 69.
Definition 2.12.2. Two random variables x1 , x2 such that Cov(x1 , x2 ) = 0 are
said to be uncorrelated.
Remark. It follows from proposition 31 that if x1 , x2 are statistically indepen-
dent then they are uncorrelated. The converse is not always true.
Example 2.12.3. The joint probability density function of the two random
variables x, y of example 2.6.1 is given by:
px,y(α, β) = A(β + 1)e^{−α} ; 0 ≤ α, −1 ≤ β ≤ 1
           = 0               ; elsewhere

We have shown that A = 1/2, x, y are statistically independent and their
probability density functions are:

px(α) = 2Ae^{−α}   ; α ≥ 0
py(β) = A(β + 1)   ; −1 ≤ β < 1

Calculate the missing values in the following:

E[x] = 1        E[y] = ?
E[x²] = 2       E[y²] = ?
Var(x) = ?      Var(y) = 2/9
E[xy] = ?
Cov(x, y) = ?

Solution: We easily obtain:

E[y] = ∫_{−1}^{1} β(β + 1)/2 dβ = [ β³/6 + β²/4 ]_{−1}^{1} = 1/3
E[y²] = ∫_{−1}^{1} β²(β + 1)/2 dβ = [ β⁴/8 + β³/6 ]_{−1}^{1} = 1/3
Var(x) = E[x²] − E[x]² = 2 − 1 = 1

and since x and y are statistically independent:

E[xy] = E[x]E[y] = (1)(1/3) = 1/3,
Cov(x, y) = 0 .
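
These moments can also be checked by simulation. The short sketch below (our own; the inverse-CDF sampling scheme is an assumption derived from the densities given above) draws x from the exponential density e^{−α} and y from the triangular density (β + 1)/2 on [−1, 1), and estimates the moments.

```python
# Monte Carlo sketch: estimate the moments of example 2.12.3 by simulation.
import random

random.seed(1)
N = 200_000
xs = [random.expovariate(1.0) for _ in range(N)]          # px(a) = e^{-a}, a >= 0
ys = [2 * random.random() ** 0.5 - 1 for _ in range(N)]   # inverse-CDF sampling of py(b) = (b+1)/2

mean = lambda v: sum(v) / len(v)
Ex, Ex2 = mean(xs), mean([a * a for a in xs])
Ey, Ey2 = mean(ys), mean([b * b for b in ys])
Exy = mean([a * b for a, b in zip(xs, ys)])
print("E[y] ~", Ey, " E[y^2] ~", Ey2, " Var(x) ~", Ex2 - Ex**2)
print("E[xy] ~", Exy, " Cov(x,y) ~", Exy - Ex * Ey)       # close to 1/3 and 0
```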

Definition 2.12.3. Let x = (x1 , . . . , xk ) be a random vector with finite first


and second (mixed) moments. Then:

1. the vector mx ≜ (x̄1, . . . , x̄k) is called mean vector and we write mx = E[x].

2. the symmetric matrix Λx defined by:

   Λx = [ Var(x1)       Cov(x1, x2)   . . .   Cov(x1, xk) ]
        [ Cov(x2, x1)   Var(x2)       . . .   Cov(x2, xk) ]
        [ Cov(x3, x1)   Cov(x3, x2)   . . .   Cov(x3, xk) ]
        [ ...           ...           . . .   ...         ]
        [ Cov(xk, x1)   Cov(xk, x2)   . . .   Var(xk)     ]

   is called covariance matrix of x. We write Λx = E[(x − mx)^T × (x − mx)] =
   Cov(x^T × x). If Λx is diagonal then the random variables x1, . . . , xk are
   uncorrelated.

Lemma 32. Let x = (x1, x2, . . . , xk) be a random vector with mean vector and
covariance matrix mx, Λx respectively, and define a random variable
y = ∑_{i=1}^{k} ai xi + b = x × a^T + b, where a = (a1, a2, . . . , ak) ∈ R^k, b ∈ R. The mean and
variance of y are respectively given by:

ȳ = E[y] = mx × a^T + b
σy² = E[(y − ȳ)²] = a × Λx × a^T

Proof. The expression of ȳ follows immediately from the linearity of the operator
E[ ], which implies that E[y] = ∑_{i=1}^{k} ai E[xi] + b = mx × a^T + b. It follows that
(suppose b = 0 for simplicity; the result still holds true when b ≠ 0):

Var(y) = E[(y − ȳ)²] = E[(x × a^T − mx × a^T)²]
       = E[(x × a^T − mx × a^T)^T × (x × a^T − mx × a^T)]
       = E[(a × x^T − a × mx^T) × (x × a^T − mx × a^T)]
       = E[a × (x^T − mx^T) × (x − mx) × a^T]
       = a × E[(x^T − mx^T) × (x − mx)] × a^T
       = a × Λx × a^T

Example 2.12.4. Let x = (x1, x2, x3) be a random vector with mean vector
and covariance matrix:

mx = (0, 0, 0)

Λx = [  1    0   1/2 ]
     [  0    1    0  ]
     [ 1/2   0    1  ]

respectively. Calculate the variance of the random variable y = x1 + x2 + x3.



Solution: In matrix notation, y = x × a^T where a = (1, 1, 1). A direct
application of lemma 32 gives:

σy² = a × Λx × a^T
    = (1, 1, 1) × [  1    0   1/2 ] × [ 1 ]
                  [  0    1    0  ]   [ 1 ]
                  [ 1/2   0    1  ]   [ 1 ]
    = 4
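
Lemma 32 is easy to check numerically. The following sketch (ours, using the mean vector and covariance matrix of example 2.12.4) computes a × Λx × a^T with numpy and, assuming for illustration that x is Gaussian, also estimates the variance by sampling.

```python
# Sketch: verify sigma_y^2 = a Lambda_x a^T for example 2.12.4.
import numpy as np

Lx = np.array([[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.0],
               [0.5, 0.0, 1.0]])
a = np.array([1.0, 1.0, 1.0])

print("a Lambda a^T =", a @ Lx @ a)            # expected: 4

# Monte Carlo cross-check: sample x ~ N(0, Lx) and look at Var(x1 + x2 + x3).
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(3), Lx, size=200_000)
print("sample variance of y =", x.sum(axis=1).var())
```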

This lemma generalizes to the following which we state without proof.

Theorem 33. Let x be a random vector and define a random vector
y = x × A^T + b, where A ∈ R^{k²} is a k × k matrix and b ∈ R^k is a k component vector.
The mean vector and covariance matrix of y are respectively given by:

my = mx × A^T + b
Λy = A × Λx × A^T

Refer to problem 11 at the end of the chapter for a numerical application of


this result.
Remark. Theorem 33 may be further generalized to the transformation of a k
component random vector x to an N component random vector y, k ̸= N , with
a k × N non-squared matrix AT and an N component vector b; the formulas
given in the theorem still hold.

Definition 2.12.4. Let px(α) be the joint probability density function of a
random vector x = (x1, x2, . . . , xk). The function Mx : R^k → C defined by:

Mx(ν) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} px1...xk(α1, . . . , αk) e^{j(ν1α1 + ... + νkαk)} dα1 . . . dαk
      = ∫_{−∞}^{∞} px(α) e^{jν×α^T} dα
      = E[ ∏_{i=1}^{k} e^{jνi xi} ] = E[e^{jν×x^T}]

is called joint characteristic function or multivariate characteristic function of
the random vector x.

Remark. px(α) can be regained from Mx(ν) by (refer W&J equation (3.61)):

px(α) = (1/(2π)^k) ∫_{−∞}^{∞} Mx(ν) e^{−jν×α^T} dν

Theorem 34.

1. The xi's are statistically independent if and only if Mx(ν) = ∏_{i=1}^{k} Mxi(νi).

2. If x = (x1, x2, . . . , xk), x′ = (x1, x2, . . . , xi−1, xi+1, . . . , xk), ν = (ν1, ν2, . . . , νk),
   ν′ = (ν1, ν2, . . . , νi−1, νi+1, . . . , νk) then

   Mx′(ν′) = Mx(ν1, ν2, . . . , νi−1, 0, νi+1, . . . , νk)

Proof. We prove the first part of the theorem only. The second part is easy to
prove; refer to the property 2 on page 165 in W&J.
First, if the xi's are statistically independent then the random variables e^{jνi xi}
are also statistically independent. It follows that
Mx(ν) = E[ ∏_{i=1}^{k} e^{jνi xi} ] = ∏_{i=1}^{k} E[e^{jνi xi}] = ∏_{i=1}^{k} Mxi(νi).
Conversely, if Mx(ν) = ∏_{i=1}^{k} Mxi(νi) then

px(α) = (1/(2π)^k) ∫_{−∞}^{∞} Mx(ν) e^{−jν×α^T} dν
      = (1/(2π)^k) ∫_{−∞}^{∞} ( ∏_{i=1}^{k} Mxi(νi) e^{−jνi αi} ) dν
      = ∏_{i=1}^{k} (1/(2π)) ∫_{−∞}^{∞} Mxi(νi) e^{−jνi αi} dνi
      = ∏_{i=1}^{k} pxi(αi)

Therefore the xi's are statistically independent.

Remark. Mx (ν) can be used to generate the mixed moments of x (refer to


W&J, pp 157-159).

2.13 Gaussian Random Vectors


Refer to W&J, pp 157 - 171.

2.13.1 Definition:
Definition 2.13.1. A random vector x = (x1 , x2 , . . . , xk ) is Gaussian (equiv-
alently the random variables x1 , x2 , . . . , xk are jointly Gaussian) if

Mx(ν) = e^{ −(1/2) ν×Λx×ν^T + jν×mx^T }

where Λx and mx are respectively the covariance matrix and mean vector of x.

Remarks.

1. If k = 1 we obtain Mx(ν) = e^{−(1/2)ν²σx² + jνx̄}, which is in agreement with a
   result already established for a Gaussian random variable.

2. The explicit form of Mx(ν) is:

   Mx(ν) = exp( −(1/2) ∑_{i=1}^{k} ∑_{l=1}^{k} νi Cov(xi, xl) νl + j ∑_{i=1}^{k} νi x̄i )

The Central Limit Theorem for a single Gaussian random variable (Chapter
2 in W&J) can be generalized to random vectors. If

z = (1/√N) ∑_{i=1}^{N} xi

where the xi ’s are identically distributed statistically independent random vec-


tors with finite covariance matrix, then the characteristic function Mz (ν) ap-
proaches the Gaussian form as N → ∞. The argument used to show this is
similar (conceptually) to that used in §2.11.2 (refer to W&J pp 160-163).

2.13.2 Properties of jointly Gaussian Random Variables:


Refer to W&J, pp 164 - 168.

1. The joint probability density function px (α) of a Gaussian random vector


x depends only on the mean vector and the covariance matrix of x.
2. If {x1 , x2 , . . . , xk } are jointly Gaussian random variables then any subset
of these are also jointly Gaussian random variables.
3. Jointly Gaussian random variables are statistically independent if and only
if they are uncorrelated, i.e. their covariance matrix is diagonal.
4. Let x be a random vector and define:

y = x × A^T + a

where A ∈ R^{k²} is a k × k matrix and a ∈ R^k is a 1 × k vector. If x is
Gaussian then y is also Gaussian.

Proof. Refer to W&J, pp 166-167 (easy to read).


Corollary.
1. Property (2) implies that each random variable of a Gaussian random
vector is Gaussian (W&J, page 165).
2. Property (4) implies that any linear combination of jointly Gaussian ran-
dom variables xi is a Gaussian random variable, whether or not the ran-
dom variables xi are statistically independent.

Part 2 of the corollary generalizes the result of theorem 25. Using theorem
33, we can give a simpler proof of Part 2 of the corollary (property (4)) than
that given by W&J.

Theorem 35 (Property (4)). Let x be a Gaussian vector, i.e.

Mx(ν) = exp( jν × mx^T ) exp( −(1/2) ν × Λx × ν^T )

where mx, Λx are respectively the mean vector and the covariance matrix of x,
and define

y = x × A^T + b

where A ∈ R^{k²} is a k × k matrix and b ∈ R^k is a 1 × k row vector. Then y is a
Gaussian vector.

Proof of property (4).

My(ν) = E[exp(jν × y^T)]
      = E[exp(jν × (x × A^T + b)^T)]
      = exp(jν × b^T) E[exp(j(ν × A) × x^T)]
      = e^{jν×b^T} Mx(ν × A)
      = e^{jν×b^T} exp(jν × A × mx^T) exp(−(1/2) ν × A × Λx × A^T × ν^T)
      = exp(jν × (mx × A^T + b)^T) exp(−(1/2) ν × (A × Λx × A^T) × ν^T)

where we recognize my = mx × A^T + b and Λy = A × Λx × A^T from theorem
33. Therefore My(ν) has the form required for y to be a Gaussian random
vector.

2.13.3 Joint Gaussian Density Function
Refer to W&J, pp 168-171.

Definition 2.13.2. A random vector x is said non-singular when

a0 + ∑_{i=1}^{k} ai xi = 0 ⇒ ai = 0, ∀i,

otherwise x is singular.

The following is stated without proof (⇒ is easy to prove, but ⇐ is hard to


prove).

Theorem 36. x is singular ⇔ |Λx | = 0.



Equivalently, x is non-singular if and only if no random variable xj ±


“a constant” is a linear combination of the other random variables of x. If
x is a singular random vector then px (α) contains impulses (refer to the foot-
note on page 168 of W&J). Indeed we have seen (W&J, page 52) that when
x1 , x2 become related by x2 = ax1 + b for some a, b ∈ R, a ̸= 0 (i.e. the corre-
lation coefficient ρ → 1) then the 2-D bell shaped probability density function
becomes flat and reaches very large values; we say that px (α) contains (multi-
dimensional) impulses. Singular Gaussian random vectors are not considered in
the following.
The most general form of joint Gaussian density function is obtained by
inverse transform of
Mx(ν) = exp( −(1/2) ν × Λx × ν^T + j mx × ν^T )
and this is, in general, not easily done. If however y is a non-singular Gaussian
random vector then py (β) can be calculated by an indirect method. We obtain
the following:
Theorem 37. If y = (y1 , y2 , . . . , yk ) is a non-singular Gaussian random vector
then:
py(β) = (1/((2π)^{k/2} |Λy|^{1/2})) exp( −(1/2) (β − my) × Λy^{−1} × (β − my)^T )
Sketch of proof (refer to W&J, pp 168-170). Let x = (x1, x2, . . . , xk) be a random
vector of k 0-mean statistically independent jointly Gaussian random variables.
Then

px(α) = ∏_{i=1}^{k} pxi(αi)

and since each random variable xi is Gaussian (property 2, page 165, W&J)
with mean 0 and variance σi², it follows that

px(α) = ∏_{i=1}^{k} (1/√(2πσi²)) exp( −αi²/(2σi²) )
      = (1/((2π)^{k/2} √(∏_{i=1}^{k} σi²))) exp( −(1/2) ∑_{i=1}^{k} αi²/σi² )
      = (1/((2π)^{k/2} |Λx|^{1/2})) exp( −(1/2) α × Λx^{−1} × α^T )          (2.23)

where

Λx = diag[σ1², σ2², . . . , σk²] ,
|Λx| = ∏_{i=1}^{k} σi² ,
Λx^{−1} = diag[1/σ1², 1/σ2², . . . , 1/σk²] .

Equation (2.23) corresponds to equation (3.86b) in W&J. We see that in this


special case (all statistically independent jointly Gaussian random variables),
px (α) has the general form stated by the theorem.
Next, let y be any non-singular Gaussian random vector. It follows from
property 4 (W&J, top of page 168) that there exists a vector x of k 0-mean
statistically independent jointly Gaussian random variables such that:

x × AT = y − my

(equation 3.87a in W&J) in which my is the mean vector of y and A is a k × k


invertible matrix. Then:

The joint probability density function of x has the form of equation


(3.86b) in W&J. By applying the method of Appendix 2A: “Re-
versible Transformation of Random Variables” to the present situa-
tion (straightforward, long and tedious) and identifying the terms of
the resulting expression, we obtain the form stated in the theorem
(refer to W&J, pp 169-170 for the details).


Remark. The above proof also shows that |Λy| > 0 and |Λy| is consequently
a real number as required. Indeed Λy = A × Λx × A^T where A is an invertible
matrix and Λx = diag[σ1², σ2², . . . , σk²]. It follows that:

det(Λy) = det(A) det(Λx) det(A^T) = det(A)² ∏_{i=1}^{k} σi² > 0

Example 2.13.1. (from W&J, page 171) Consider 2 jointly Gaussian random
variables y1 , y2 such that:

E[y1] = E[y2] = 0
E[y1²] = E[y2²] = σ²
E[y1 y2] = ρσ²

i.e. the correlation coefficient is Cov(y1, y2)/√(Var(y1)Var(y2)) = ρ. Then we easily find

my = (0, 0)

Λy = [ Var(y1)       Cov(y1, y2) ] = [ σ²    ρσ² ]
     [ Cov(y2, y1)   Var(y2)     ]   [ ρσ²   σ²  ]

|Λy| = σ⁴(1 − ρ²)

Λy^{−1} = (1/(σ⁴(1 − ρ²))) [  σ²   −ρσ² ] = [  1/(σ²(1−ρ²))    −ρ/(σ²(1−ρ²)) ]
                           [ −ρσ²   σ²  ]   [ −ρ/(σ²(1−ρ²))     1/(σ²(1−ρ²)) ]

which leads to

(β1, β2) × Λy^{−1} × (β1, β2)^T = (β1² − 2ρβ1β2 + β2²) / (σ²(1 − ρ²))

Finally we obtain:

py(β1, β2) = (1/(2πσ²√(1 − ρ²))) exp( −(β1² − 2ρβ1β2 + β2²) / (2σ²(1 − ρ²)) ) .
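
As a numerical sanity check (our own sketch; the values σ = 1.5, ρ = 0.6 are arbitrary illustration choices), the closed form obtained above can be compared with the general expression of theorem 37 evaluated with numpy.

```python
# Sketch: compare the closed-form bivariate Gaussian density of example 2.13.1
# with the general formula of theorem 37 (zero-mean case).
import numpy as np

sigma, rho = 1.5, 0.6
Ly = np.array([[sigma**2, rho * sigma**2],
               [rho * sigma**2, sigma**2]])

def py_general(b1, b2):
    b = np.array([b1, b2])
    return np.exp(-0.5 * b @ np.linalg.inv(Ly) @ b) / (2 * np.pi * np.sqrt(np.linalg.det(Ly)))

def py_closed_form(b1, b2):
    return np.exp(-(b1**2 - 2*rho*b1*b2 + b2**2) / (2 * sigma**2 * (1 - rho**2))) \
           / (2 * np.pi * sigma**2 * np.sqrt(1 - rho**2))

print(py_general(0.7, -1.2), py_closed_form(0.7, -1.2))   # the two values agree
```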

Example 2.13.2. Problem 3.3, W&J, page 201.

2.14 Problems
1. A random experiment consists in drawing two balls out of four urns labeled
   with the letter A and the digits 1, 2, 3 in the following manner. The first
   ball is drawn from urn A, which contains 75 balls marked 1, 24 balls
   marked 2 and 1 ball marked 3. The digit on the ball drawn from urn A
   indicates the urn from which the second ball is to be drawn. The contents
   of urns 1, 2, 3 are as follows:

   urn 1 contains 1 black ball, 15 white balls, 14 red balls,
   urn 2 contains 1 black ball, 9 white balls, 5 red balls, and
   urn 3 contains 9 black balls, 1 white ball, 5 red balls.

   (a) Calculate the probability that the second ball is black.
   (b) From which urn was the second ball most likely drawn if it is black?

2. A random experiment consists in drawing two balls out of three urns


labeled A, B, C in the following manner. The first ball is drawn from the
urn labeled A which contains 1 ball labeled B and 9 balls labeled C, the
letter B or C indicating from which urn the second ball is to be drawn.
Urn B contains 9 red balls and 1 black ball, and urn C contains 1 red ball
and 4 black balls. A valid sample space for the experiment is:

Ω = {(B, red), (B, black), (C, red), (C, black)}

where the first component of each sample point is the result of the first ball
(drawn from urn A) and the second component is the colour of the second
ball (drawn from either urn B or C). The class of events is assumed to be
the set of all 16 subsets of Ω.

(a) Calculate the probability of the four events


consisting of a single sample point, i.e.
P [{(B, red)}], P [{(B, black)}], P [{(C, red)}], P [{(C, black)}].

(b) Show that P [{the second ball is red}] = 0.27. It obviously follows
from this result that P [{the second ball is black}] = 0.73 (you need
not show the latter).
(c) Calculate the probability that the second ball is drawn from urn B
if it is red. Calculate the probability that the second ball is drawn
from urn C if it is red.
(d) Based on your result in part (2c), which urn would you guess the
second ball is drawn from if it is red. Justify your answer. What is
the probability that your guess is correct.

3. The probability density function of a Gaussian random variable x is given by:

   px(α) = (1/(4√π)) e^{−(α−2)²/16}
Calculate the following:

(a) P ({ω : x(ω) ≥ 4})


(b) P ({ω : x(ω) ≥ 0})
(c) P ({ω : 3 ≤ x(ω) < 4})

4. The joint probability density function of two random variables x and y is given by:

   pxy(α, β) = (1/2) e^{−|β−α|} ; 0 ≤ α < 1
             = 0                 ; elsewhere

   (a) Show that the value of the probability density function of the random
       variable y at 2 is py(2) = (e − 1)/(2e²).
       Recall: ∫ e^u du = e^u
   (b) Calculate the probability that the value of x be between 0.5 and 1 if
       it is known that y = 2.
5. Let y denote a random variable defined by the transformation y = |x| + 1
   where x is a random variable with probability density function:

   px(α) = 1/2 ; −2 ≤ α < −1
         = 1/4 ; 1 ≤ α < 3
         = 0   ; elsewhere

   Calculate and sketch the probability density function py(β) of y.
   Suggestion:
   (a) Calculate S(β) = {α ∈ R : 1 + |α| = β}.
   (b) Show that |g′(α)| = 1, ∀ 0 ≠ α ∈ R where g(α) = 1 + |α|.

   (c) Use the above two results to obtain:

       py(β) = 0                        ; β < 1
             = px(β − 1) + px(1 − β)    ; β > 1

   (d) Sketch py(β).


6. In this problem we consider a discrete communication system with a bi-
nary source producing one of two messages m0 and m1 , with unequal a
priori probabilities P (m0 ) = 1/3, P (m1 ) = 2/3. The message is transmit-
ted on a discrete communication channel having the following transitions
probabilities
P (RX = 0 | T X = m0 ) = 0.99
P (RX = 1 | T X = m0 ) = 0.01
P (RX = 0 | T X = m1 ) = 0.01
P (RX = 1 | T X = m1 ) = 0.99
where T X and RX respectively denote the transmitted message and the
received symbol. The receiver estimates the message m0 whenever RX = 0
and the message m1 whenever RX = 1. With this receiver, it has been
found that P (error | RX = 1) = P (T X = m0 | RX = 1) = 1/199 ≈
0.0050251. Calculate
(a) P (error | RX = 0) = P (T X = m1 | RX = 0),
(b) P (error) of the communication system.
7. In a ternary communication system, a random message m is transmitted,
   where m is one of three equally likely symbols m0, m1 or m2: P(m0) =
   P(m1) = P(m2) = 1/3. The received random variable is denoted by r and,
   from a received value r = ρ, a decision denoted by m̂(ρ) is made: m̂(ρ) ∈
   {m0, m1, m2}. As usual we let E denote the error event, i.e. m̂(r) ≠ m,
   and we let C denote the correct decision event, i.e. m̂(r) = m. The
   conditional probability density functions of the received random variable
   r given m are sketched on figure 2.31.

(a) If r = 1.5 is received, which of the three messages was most likely
transmitted.
   (b) The following decision rule (not necessarily the best) is used in parts
       (7b) and (7c):

       m̂(ρ) = m0 ; ρ < 1.25
             = m1 ; 1.25 < ρ < 4.375          (2.24)
             = m2 ; ρ > 4.375

       Calculate P(m̂(r) = m0), P(m̂(r) = m1), P(m̂(r) = m2).

   (c) Calculate P(E) with the decision rule (2.24).
       Hint: It is easier to first calculate P(C).

Figure 2.31:

8. A communication system is used to transmit a voltage s to a decision


device as shown in Fig. 2.32. Thus the decision device has available two
received voltages, r1 and r2 , in which
r 1 = s + n1 , r 2 = n1 + n2
Assume that n1 , n2 are zero-mean Gaussian random variables with vari-
ances σ12 = σ22 = σ 2 and that s, n1 and n2 are jointly statistically inde-
pendent. The system is used to communicate one of two messages m0 and
m1 with a priori probabilities P[m0] = P[m1] = 1/2. For message ml the
voltage is sl = (−1)^l √E; l = 0, 1.
The optimum decision rule seeks to determine that l for which the mixed
form joint density function pr1 , r2 (ρ1 , ρ2 , {s = sl }) is maximum.
(a) Determine the structure of the optimum decision device.
Hint: Start from
pr1 , r2 (ρ1 , ρ2 , {s = sl }) = P [ml ] pr1 (ρ1 |{s = sl }) pr2 (ρ2 |{s = sl }, r1 = ρ1 )
= P [ml ] pn1 (ρ1 − sl ) pn2 (ρ2 − ρ1 + sl )
and show that

m̂(ρ1, ρ2) = m0 ; 2ρ1 − ρ2 > 0
           = m1 ; otherwise

(b) Calculate the probability of error.
    Hint: By symmetry P(E) = P(E|{s = s0}) = P(E|{s = s1})
    (you need not show this). Starting from the above optimum decision
    device, show that P(E|{s = s1}) reduces to P({n > 2√E}), where
    n = n1 − n2 is a Gaussian random variable. Calculate the mean and
    variance of n and finally express the probability of error in terms of
    the Q( ) function.

Figure 2.32:

9. We wish to simulate a communication system on a digital computer and
estimate the error probability P (E ) by measuring the relative frequency
of error. Let N denote the number of independent uses of the channel in
the simulation and xi , i = 1, 2, . . . , N denote the random variable such
that:

xi = 0 ; no error on the i-th use of the channel
   = 1 ; there is an error on the i-th use of the channel

The xi are identically distributed and statistically independent random


variables. The relative frequency of error, i.e. the estimate of P (E ) is
given by the sample mean:

m = (1/N) ∑_{i=1}^{N} xi

and clearly E[m] = E[xi ] = P (E ). Approximate by means of

(a) the Weak Law of Large Numbers,


(b) the Central Limit Theorem,
23 This problem is based on problem 2.38, page 127 in W&J.

how many independent uses of the channel we must simulate in order to


be 99.9% certain that the observed relative frequency lies within 5% of
the true P [E ], which we may assume to be approximately 0.01.

Hint: For the Weak Law of Large Numbers, use equation (2.21) on page
70 in the notes (in theorem 27). For the Central Limit Theorem, use the
approximation:

P( |(1/N) ∑_{i=1}^{N} xi − x̄| ≥ ϵ ) ≈ 2Q( ϵ√N / σx )

which follows from equation (2.22) on page 72 in the notes.


10. The probability density function of a random variable x is given by:

    px(α) = e^{−(α−a)} ; α ≥ a
          = 0          ; α < a

(a) Show that px (α) is a valid probability density function, as claimed.


(b) Using a table of Fourier transform pairs and properties of the Fourier
transform, show that the characteristic function of x is given by:
        Mx(ν) = e^{jνa} / (1 − jν)

(c) Calculate the mean and variance of x using the above characteristic
function.
(d) Calculate the mean and variance of x using the probability density
function px (α).
    The following may be useful:

    ∫ x e^{−x} dx = −(1 + x) e^{−x}
    ∫ x² e^{−x} dx = −(2 + 2x + x²) e^{−x}
    lim_{x→∞} x^n e^{−x} = 0, ∀n ∈ N

11. The mean vector and covariance matrix of a random vector x = (x1 , x2 )
are respectively given by:
    mx = (1, 0)
    Λx = [  9  −4 ]
         [ −4   4 ]

    (a) Calculate the mean vector my and covariance matrix Λy of the random
        vector y = (y1, y2) defined by the transformation:

        y = x × [ 1   1 ] + (0, 1)
                [ 3  −1 ]

    (b) If the above random vector x is Gaussian then y is also Gaussian.
        Calculate the probability that y = (y1, y2) lies in the hatched region
        of the plane in figure 2.33.
        Note: If you have not been able to solve question (11a), you may
        use the following:

        my = (2, 1)
        Λy = [  25  −10 ]
             [ −10   15 ]

        Those are not the answer to question (11a).

Figure 2.33:
Chapter 3

Random Waveforms

3.1 Fundamental Definitions


Definition 3.1.1.

1. Let (Ω, F , P : F → [0, 1]) be a probability system and let:

x : Ω × R → R
x : (ω, t) ↦ x(ω, t)

be such that x(t0 ) is a random variable for all t0 ∈ R. The mapping x is


called a random process or stochastic process and denoted x(t).

2. The functions of time x(ω, t), ω ∈ Ω, taken by the random process x(t)
are called sample functions.

A random process x(t) can intuitively be understood by an engineer as an


infinite sequence 1 of random variables, a huge (infinite) random vector. Hence
for any t1 ∈ R, x(t1 ) is specified by a probability density function px(t1 ) (α). We
may for example write:
P(a < x(t1) ≤ b) = ∫_{a}^{b} px(t1)(α) dα

Similarly x(t2 ) is another random variable and the behaviour of x(t1 ) and x(t2 )
is specified by the joint probability density function px(t1 ),x(t2 ) (α, β). In general,
for times t1 , t2 , . . . , tk we denote the joint probability function by px(t) (α) where

t = (t1 , t2 , . . . , tk )
x(t) = (x(t1 ), x(t2 ), . . . , x(tk ))
α = (α1 , α2 , . . . , αk )
1 sequence ⇒ denumerable number of components; this is not the case for a process.


x(t) is simply a random vector. For example, if (cf: W&J eq 3.3a)

A = {ω ∈ Ω : ai < x(ω, ti ) ≤ bi , i = 1, 2, . . . , k}

then
P(A) = ∫_{a1}^{b1} ∫_{a2}^{b2} . . . ∫_{ak}^{bk} px(t)(α) dαk . . . dα2 dα1

Definition 3.1.2. A random process x(t) is specified if a rule is given or implied


for determining the joint probability density functions px(t) (α) (or equivalently
Mx(t) (ν)), ∀k ∈ N∗ and all α ∈ Rk , t ∈ Rk . 2

For example a random process for which the probability system and the
mapping x : Ω × R → R are known is specified. In applications, four methods
of specification are encountered: 3

1. The probability system and mapping x : Ω × R → R are known (example


3.1.1 below).

2. The probability density functions px(t) (α) are stated directly. This is
only possible in some very particular and simple cases, such as Gaussian
process introduced in §3.6 (pages 186 - 192, W&J).

3. The process is a function of random variables. For example x(ω, t) =


r(ω) sin(2πt + θ(ω)) where r and θ are random variables. The calculations
leading to px(t) (α) may be long and tedious (refer to example 3.1.2 below).

4. The process is a transformation of one or more processes which are all


specified. For example:

• y(t) = x(t − T ), for some T > 0 (delay),


• r(t) = s(t) + n(t),
• y(t) = ∫_{−∞}^{∞} h(α) x(t − α) dα, for some fixed h(t) (filtering).

In the above we define two processes x(t) and y(t) to be equal if and only if the
probability of the event {ω ∈ Ω : x(ω, t) ̸= y(ω, t)} is equal to 0; the concept
is similar to that of equality of random variables as per definition 2.3.8.

Example 3.1.1. Let Ω = {Head, Tail}, P(Head) = P(Tail) = 1/2 and define the
random process x(t) by:

x : Ω × R → R
x : (Head, t) ↦ 2
x : (Tail, t) ↦ sin(t)
2 W&J page 133
3 W&J, pages 133 - 135

x(0) is a random variable and we see that

P(x(0) = 2) = 1/2
P(x(0) = 0) = 1/2

x(π/2) is also a random variable and we have

P(x(π/2) = 2) = 1/2
P(x(π/2) = 1) = 1/2
The probability density functions of x(0), x(π/2) are sketched in figures 3.1,
3.2. Their joint probability density function is sketched in figure 3.3.

Figure 3.1:

Figure 3.2:

Example 3.1.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable
uniformly distributed between 0 and 2π, i.e.

pθ(β) = 1/(2π) ; 0 ≤ β < 2π
      = 0      ; elsewhere

Figure 3.3:

Then x(t0 ) = 10 sin(2πt0 +θ) and the probability density function of the random
variable x(t0 ) is obtained using the technique of §2.4.2:
px(t0)(α) = ∑_{β∈S(α)} pθ(β) / |g′(β)|

where

g(β) = 10 sin(2πt0 + β)
g′(β) = 10 cos(2πt0 + β)
|g′(β)| = |10 cos(2πt0 + β)|

S(α) = {β ∈ [0, 2π] : 10 sin(2πt0 + β) = α}

As can be seen from figure 3.4, S(α) = {β0 , β1 } (i.e. S(α) contains two ele-
ments) for any −10 < α < 10, and β0 is given by:

β0 = arcsin(α/10) − 2πt0

The expression for β1 is more complicated but it is not required since pθ (β0 ) =
pθ (β1 ) = 1/(2π) and |g ′ (β1 )| = |g ′ (β0 )| = |10 cos(arcsin(α))| (easy to see on the
3.1. FUNDAMENTAL DEFINITIONS 105

graph of figure 3.4). We then obtain directly (as long as −10 < α < 10):
pθ (β0 ) pθ (β1 )
px(t0 ) (α) = ′
+ ′
|g (β0 )| |g (β1 )|
1/(2π)
= 2
|10 cos(2πt0 + (arcsin(α/10) − 2πt0 ))|
1
=
10π| cos(arcsin(α/10))|
1
= √
π 100 − α2
for any t0 . px(0) (α) is sketched on figure 3.5.

Figure 3.4:

Figure 3.5:
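
The density obtained in example 3.1.2 can be checked by simulation. The sketch below (ours, not from W&J) draws many values of θ, computes x(t0) = 10 sin(2πt0 + θ) and compares a histogram of the samples with 1/(π√(100 − α²)).

```python
# Sketch: histogram of x(t0) = 10*sin(2*pi*t0 + theta), theta ~ U[0, 2*pi),
# compared against the derived density 1/(pi*sqrt(100 - alpha^2)).
import numpy as np

rng = np.random.default_rng(0)
t0 = 0.3                                   # any t0 gives the same density
samples = 10 * np.sin(2 * np.pi * t0 + rng.uniform(0, 2 * np.pi, 500_000))

hist, edges = np.histogram(samples, bins=50, range=(-9.5, 9.5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
theory = 1 / (np.pi * np.sqrt(100 - centers**2))
print("max abs deviation:", np.abs(hist - theory).max())   # small, up to histogram noise
```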

Definition 3.1.3. A random process for which all sample functions are periodic
with (same) period T > 0 is called periodic random process.
The random process of example 3.1.2 above is periodic and its period is 1
[second].

3.2 Stationariness
(W&J pp. 135 - 144)
Definition 3.2.1. A random process x(t) for which

px(t+T ) (α) = px(t) (α)

for all α ∈ Rk , t ∈ Rk , k ∈ N∗ and for any T = (T, T, . . . , T ) where T ∈ R, is


called stationary.
Example 3.2.1. In example 3.1.1, we found that px(0) (α) ̸= px(π/2) (α). There-
fore x(t) is not a stationary process.
Example 3.2.2. (W&J pp 136 - 139) Let x(t) = f (t + τ ) where f (t) is the
periodic triangular wave of period T sketched in figure 3.6 and τ is a random
variable uniformly distributed between 0 and T . We see that x(t) is specified
by method 3 of page 102. We show that x(t) is stationary in two steps: first we
prove that px(t1 ) (α) = px(t1 +T ) (α) for all α, t1 , T ∈ R. Next we use the result
to prove that px(t) (α) = px(t+T ) (α), ∀t, T = (T, T, . . . , T ), α ∈ Rk and for all
k ∈ N∗ .

Figure 3.6:

1. Using the technique of §2.4.2, the probability density function of the ran-
dom variable x(t1 ) = f (t1 + τ ) is given by:
px(t1)(α) = ∑_{β∈S(α)} pτ(β) / |g′(β)|

where the transformation g( ) and the set of roots S(α) are respectively
given by:

g(τ ) = f (t1 + τ )
S(α) = {β ∈ [0, T ] : g(β) = f (t1 + β) = α}

We then have:
g ′ (β) ≡ slope of f ( ) = a/T

and one easily sees that ∀α ∈ [0, a] there exists a unique β0 ∈ S(α) (refer
to figure 3.7). We obtain directly:

px(t1)(α) = pτ(β0)/|g′(β0)| = (1/T)/|a/T| = 1/a

as long as 0 ≤ α < a. The latter is equation (3.14) in W&J and it clearly

Figure 3.7:

indicates that the probability is independent of t1 ∈ R. Therefore

px(t1 +T ) (α) = px(t1 ) (α)

for every t1 , T, α ∈ R.

2. We factor px(t) (α) into:

px(t) (α) = px′ (t′ ) (α′ |x(t1 ) = α1 )px(t1 ) (α1 )

in which

x′ (t′ ) = (x(t2 ), x(t3 ), . . . , x(tk ))


t′ = (t2 , t3 , . . . , tk )
α′ = (α2 , α3 , . . . , αk )

and we notice that the knowledge of x(t1 ) = α1 implies the knowledge of


τ which completely specifies the values of x(t2 ), x(t3 ), . . . , x(tk ). Indeed
from t1 and x(t1 ) = α1 , 0 ≤ α1 < a we define

h(t) = f (t − t1 + α1 T /a)

where 0 ≤ α1 T /a < T . Clearly h(t) is a delayed version of f (t) and one


easily sees that

h(t1 ) = f (t1 − t1 + α1 T /a) = f (α1 T /a) = α1 T /a × a/T = α1 = x(t1 ) .



It follows that

x(t2) = h(t2) = f(t2 − t1 + α1T/a) ≜ a2
 ···
x(tk) = h(tk) = f(tk − t1 + α1T/a) ≜ ak

This is equation (3.15a) in W&J and it follows that

px(t)(α) = px(t1)(α1) ∏_{i=2}^{k} δ(αi − ai)
         = px(t1)(α1) ∏_{i=2}^{k} δ(αi − f(ti − t1 + α1T/a))
         = px(t1+T)(α1) ∏_{i=2}^{k} δ(αi − f((ti + T) − (t1 + T) + α1T/a))
         = px(t+T)(α)

for every T ∈ R.

Example 3.2.2 above generalizes to the following:


Theorem 38. Let g(t) be a periodic function of period T and define the periodic
random process y(t) = g(t+τ ) where τ is a random variable uniformly distributed
in the interval [0, T ]. y(t) is stationary.
Sketch of proof. (W&J pp. 140 - 141). Clearly

py(t) (α) = py(t+T ) (α) ⇔ Fy(t) (α) = Fy(t+T ) (α)


⇔ Fy(t) (α) = Fz(t) (α)

where z(t) = y(t + T ) (the process z(t) is specified by method 4 of page 102).
Let
Ωy ≡ set of sample functions of y(t)
Ωz ≡ set of sample functions of z(t)
Then we easily see that Ωy = Ωz (any function in Ωz lies in Ωy and vice versa).
Moreover the sample functions are all equiprobable. It follows that if

I = {a(t) ∈ Ωy : a(t) ≤ α}
K = {b(t) ∈ Ωz : b(t) ≤ α}

then I = K and

Fy(t) (α) = P (I) = P (K) = Fz(t) (α) = Fy(t+T ) (α) .

y(t) is therefore stationary.



Theorem 39. (W&J pp 142). Let g(W , t) be a periodic function of t of pe-


riod T and J parameters W1 , W2 , . . . , WJ , none of which affect the period T of
g(W , t). Define x(t) = g(γ, t + τ ) where τ is a random variable, γ is a random
vector and τ , γ are statistically independent. If τ is uniformly distributed in the
interval [0, T ] then x(t) is stationary.
Proof. Refer to W&J p. 142.
Example 3.2.3. It follows immediately from theorem 38 that the random pro-
cess x(t) = 10 sin(2πt + θ) from example 3.1.2 is stationary. Also theorem 39
implies that the random process defined by equation (3.5) on W&J p. 133 is
stationary when pr,θ (α, β) is defined by equation (3.6). This is illustrated on
W&J pp. 142 - 143.
Remark. (W&J p. 143) “It is readily verified that

px(ti)(α) = px(0)(α) = (1/√(2π)) e^{−α²/2} ”.

This follows from W&J Appendix 2A in which we found that if (ρ = 0)

pr,θ(α1, α2) = (α1/(2π)) e^{−α1²/2} ; α1 ≥ 0, 0 ≤ α2 < 2π
             = 0                    ; elsewhere

and we define the random variables x = r cos θ,⁴ y = r sin θ = x(0), then

pxy(β1, β2) = (1/(2π)) e^{−(β1² + β2²)/2}

By elimination of the random variable x we obtain

py(β2) = ∫_{−∞}^{∞} pxy(β1, β2) dβ1 = (1/√(2π)) e^{−β2²/2} ,

and by the change of variables α = β2 the desired result is obtained.

3.3 Moment Functions of a Random Process


(not in W&J)
Definition 3.3.1.
1. Let x(t) be a random process and n ∈ N∗ . E[x(t1 )x(t2 ) . . . x(tn )] is a
function of t = (t1 , t2 , ..., tn ) ∈ Rn called n-th moment function of x(t) if
it exists.
2. The first moment function of x(t) always exists and is called mean function
and denoted mx (t) , E[x(t)].
4 this variable x is not to be confused with the random process x(t).

3. The second moment function of x(t), if it exists, is called autocorrelation


function and denoted Rx (t, s) , E[x(t)x(s)].
4. Similarly, the n-th central moment function is defined by E[(x(t1 ) −
mx (t1 ))(x(t2 ) − mx (t2 )) . . . (x(tn ) − mx (tn ))] if it exists. The second cen-
tral moment function of x(t), if it exists, is called autocovariance function
and denoted Lx (t, s).
Remarks.
1. Lx(t, s) = E[(x(t) − mx(t))(x(s) − mx(s))]
            = E[x(t)x(s)] − mx(t)mx(s)
            = Rx(t, s) − mx(t)mx(s)
If x(t) is statistically independent of x(s) the covariance reduces to 0 at
(t, s), i.e. x(t) and x(s) are uncorrelated. Also, if mx (t) = 0 or mx (s) = 0
then Lx (t, s) = Rx (t, s).
2. V ar(x(t0 )) = Lx (t0 , t0 ) = Rx (t0 , t0 ) − m2x (t0 ).
3. If x(t) is stationary then the mean function is a constant function of time,
i.e. mx (t1 ) = mx (t2 ), ∀t1 , t2 ∈ R, as shown below.

Proof.

mx(t1) = E[x(t1)]
       = ∫_{−∞}^{∞} α px(t1)(α) dα
       = ∫_{−∞}^{∞} α px(t2)(α) dα
       = E[x(t2)]
       = mx(t2)

Also, the autocorrelation (autocovariance) function is independent of the


time origin, i.e. Rx (t, s) = Rx (t + T, s + T ), ∀t, s, T ∈ R, as shown below.

Proof.

Rx(t, s) = E[x(t)x(s)]
         = ∫∫_{−∞}^{∞} αβ px(t),x(s)(α, β) dα dβ
         = ∫∫_{−∞}^{∞} αβ px(t+T),x(s+T)(α, β) dα dβ
         = E[x(t + T)x(s + T)]
         = Rx(t + T, s + T)

When x(t) is stationary we see that Rx(t, s) is not a function of both t and s,
since by the subtraction of t (or s) from both arguments we obtain Rx(t, s) =
Rx(0, s − t) = Rx(t − s, 0). We consequently drop one argument and write
Rx(s − t) or simply Rx(τ) where τ represents s − t:

Rx(τ) ≜ E[x(t)x(t + τ)]

Similarly, we denote by mx the mean function of a stationary process x(t).

Definition 3.3.2. A random process x(t) for which the mean function is inde-
pendent of time and the autocorrelation function is independent of time origin
is said to be wide sense stationary.

Obviously, a (strict sense) stationary process is also wide sense stationary,


but the converse is not necessarily true. An example of a wide sense stationary
process that is not (strict sense) stationary is presented on page 177 in W&J.
It is also a lot easier to prove (or disprove) the wide sense stationariness of a
process than the (strict sense) stationariness, especially if we only dispose of
experimental data.

Example 3.3.1. Consider the following random process specified from the ex-
periment consisting in the throw of a coin (P (Head) = P (Tail) = 1/2) as in
example 3.1.1:
x: Ω×R → R
x : (Head, t) 7→ 2
x : (Tail, t) 7→ sin t

1. Mean function:

   mx(t) = E[x(t)]
         = (1/2) sin t + (1/2)(2)
         = 1 + (sin t)/2
The mean function mx (t) is sketched in figure 3.8(a).

2. Autocorrelation function:

Rx (t, s) = E[x(t)x(s)]
= (1/2)(2 · 2) + (1/2)(sin t sin s)
= 2 + 1/2 sin t sin s
= 2 + 1/4 cos(t − s) − 1/4 cos(t + s)

recalling that 2 sin u sin v = cos(u − v) − cos(u + v).



3. Autocovariance function:

Lx (t, s) = Rx (t, s) − mx (t)mx (s)


= 2 + 1/2 sin t sin s − (1 + 1/2 sin t)(1 + 1/2 sin s)
= (1 − 1/2 sin t)(1 − 1/2 sin s)

The function Lx (t, s) is sketched in figure 3.8(b). Clearly this process is


not wide sense stationary (⇒ nor strict sense stationary).

(a) mx (t) of example 3.3.1

(b) Lx (t1 , t2 ) of example 3.3.1

Figure 3.8: Mean and autocovariance functions for the process of example 3.3.1

Example 3.3.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable uni-


formly distributed between 0 and 2π.

1. Mean function: From the fundamental theorem of expectation we obtain:

   mx(t) = E[x(t)]
         = E[10 sin(2πt + θ)]
         = ∫_{−∞}^{∞} 10 sin(2πt + α) pθ(α) dα
         = (10/(2π)) ∫_{0}^{2π} sin(2πt + α) dα
         = 0

independently of the time variable t.

2. Autocorrelation function: Again using the fundamental theorem of
   expectation we obtain:

   Rx(t, s) = E[x(t)x(s)]
            = E[100 sin(2πt + θ) sin(2πs + θ)]
            = 50 cos(2π(t − s)) − 50 E[cos(2π(t + s) + 2θ)]
            = 50 cos(2π(t − s)) − 50 ∫_{0}^{2π} cos(2π(t + s) + 2α)/(2π) dα     (the integral is 0)
            = 50 cos(2π(t − s))

We find that Rx (t, s) is independent of the time origin. It is also periodic


with respect to t − s and has the same period as x(t). Rx (t, s) is sketched
in figure 3.9.

The process is clearly wide sense stationary; in fact it is stationary in the


strict sense by theorem 39.

Example 3.3.3. Let x(t) = r sin(2πt + θ) where r and θ are two random
variables with joint density function

prθ(α, β) = (α/(2π)) e^{−α²/2} ; 0 ≤ β < 2π and 0 ≤ α
          = 0                  ; elsewhere

1. Mean function:

mx(t) = E[x(t)] = ∫_{0}^{∞} ∫_{0}^{2π} α sin(2πt + β) prθ(α, β) dβ dα = 0

hence independent of time.



Figure 3.9: Autocorrelation function for the random process of example 3.3.2

2. Autocorrelation function:
Rx(t, s) = E[x(t)x(s)]
         = ∫_{0}^{∞} ∫_{0}^{2π} α² sin(2πt + β) sin(2πs + β) prθ(α, β) dβ dα
         = cos(2π(t − s))
hence independent of time origin.5 Again we see that Rx (t, s) = Rx (t − s)
is periodic (with respect to t − s) and has same period as x(t).
The process is wide sense stationary; it is also stationary in the strict sense
by theorem 39.
Example 3.3.4. Consider a stationary process x(t) with mean function mx = 2
[volts], and autocorrelation function Rx (τ ) = 24 sinc(500 Hz τ ) + 4 [volts2 ].
1. Use Chebyshev’s inequality to estimate the range of x(1 ms) with a prob-
ability of 95%.
2. Let y = x(1 ms) + 2x(2 ms) − 3x(5 ms) + 5 volts. Calculate ȳ, V ar(y).
Solution:
1. We need to find a, b such that P (a < x(1 ms) < b) ≥ 95% = 0.95, or in
other words:
P (x(1 ms) ≤ a or x(1 ms) ≥ b) ≤ 0.05
5 The above is easily verified with MAPLE : use two nested integrations and the function

combine( , trig) to simplify the trigonometric expression.



The Chebyshev inequality states that:

P( |x(1 ms) − x̄(1 ms)| ≥ ϵ ) ≤ σ²_{x(1 ms)} / ϵ²

which may be rewritten as:

P( x(1 ms) − x̄(1 ms) ≥ ϵ  or  x̄(1 ms) − x(1 ms) ≥ ϵ ) ≤ σ²_{x(1 ms)} / ϵ²
P( x(1 ms) ≥ x̄(1 ms) + ϵ  or  x(1 ms) ≤ x̄(1 ms) − ϵ ) ≤ σ²_{x(1 ms)} / ϵ²

where we identify b = x̄(1 ms) + ϵ, a = x̄(1 ms) − ϵ and set the right-hand side to 0.05.

It is given that

x̄(1 ms) = mx = 2 volts
σ²_{x(1 ms)} = Lx(0) = Rx(0) − mx² = 28 − 4 = 24 volts²

It follows that

ϵ² = 24 volts² / 0.05  ⇒  ϵ = 21.91 volts.

The probability of x(1 ms) being outside the range [−19.91, 23.91] volts
is therefore smaller than 0.05; in other words, we are 95% certain that
x(1 ms) lies between −19.91 volts and 23.91 volts. This is a very conservative
estimate of the range; if the process was specified, a much narrower range
could be determined from the probability density function px(1 ms)(α).

2. Define the random vector x ≜ [x(1 ms), x(2 ms), x(5 ms)]. We first find
   its mean vector

   mx = E[ [x(1 ms), x(2 ms), x(5 ms)] ] = [2, 2, 2]

   and its covariance matrix

   Λx = [ Lx(0)      Lx(1 ms)   Lx(4 ms) ]   [ 24       15.279    0      ]
        [ Lx(1 ms)   Lx(0)      Lx(3 ms) ] = [ 15.279   24       −5.093  ]
        [ Lx(4 ms)   Lx(3 ms)   Lx(0)    ]   [ 0        −5.093    24     ]

   Using lemma 32 and noticing that:

   y = x × [1, 2, −3]^T + 5

   we easily obtain:

   ȳ = [2, 2, 2] × [1, 2, −3]^T + 5 = 5

   σy² = [1, 2, −3] × [ 24       15.279    0      ] × [  1 ]
                      [ 15.279   24       −5.093  ]   [  2 ]
                      [ 0        −5.093    24     ]   [ −3 ]
       = 458.23
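
The numbers above are easy to reproduce with numpy (our own sketch; it simply re-evaluates Lx(τ) = 24 sinc(500 Hz · τ) and applies lemma 32).

```python
# Sketch: rebuild Lambda_x from Lx(tau) = 24*sinc(500*tau) and apply lemma 32.
import numpy as np

Lx = lambda tau: 24 * np.sinc(500 * tau)      # np.sinc(x) = sin(pi x)/(pi x), as in the notes
t = np.array([1e-3, 2e-3, 5e-3])              # 1 ms, 2 ms, 5 ms
Lambda = Lx(t[:, None] - t[None, :])          # covariance matrix of [x(1 ms), x(2 ms), x(5 ms)]
m = np.array([2.0, 2.0, 2.0])
a, b = np.array([1.0, 2.0, -3.0]), 5.0

print(Lambda.round(3))                        # diagonal 24, off-diagonals 15.279, -5.093, 0
print("mean of y =", m @ a + b)               # 5
print("variance of y =", a @ Lambda @ a)      # about 458.23
```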

Example 3.3.5. Binary Random Process: Let

x(t) = ∑_{i=−∞}^{∞} ai g(t − iT + τ)

where the ai's and τ are statistically independent random variables such that

pai(α) = (1/2)δ(α − 1) + (1/2)δ(α + 1), ∀i ∈ Z

pτ(α) = 1/T ; 0 ≤ α < T
      = 0   ; elsewhere

and g(t) is a deterministic “pulse” which is 0 outside of the interval [0, T].

1. Mean function:

   mx(t) = E[x(t)]
         = E[ ∑_{i=−∞}^{∞} ai g(t − iT + τ) ]
         = ∑_{i=−∞}^{∞} E[ai g(t − iT + τ)]
         = ∑_{i=−∞}^{∞} E[ai] E[g(t − iT + τ)]          (E[ai] = 0)
         = 0

   independent of time.

2. Autocorrelation function:

   Rx(t, s) = E[x(t)x(s)]
            = E[ ( ∑_{i=−∞}^{∞} ai g(t − iT + τ) ) ( ∑_{j=−∞}^{∞} aj g(s − jT + τ) ) ]
            = E[ ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} ai aj g(t − iT + τ) g(s − jT + τ) ]
            = ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} E[ai aj] E[g(t − iT + τ) g(s − jT + τ)]
            = ∑_{i=−∞}^{∞} E[ai²] E[g(t − iT + τ) g(s − iT + τ)]      (E[ai²] = 1, E[ai aj] = 0 for i ≠ j)
            = ∑_{i=−∞}^{∞} E[g(t − iT + τ) g(s − iT + τ)]

   The expected value inside the above sum is expanded as follows:

   E[g(t − iT + τ) g(s − iT + τ)] = ∫_{0}^{T} g(t − iT + α) g(s − iT + α) pτ(α) dα
                                  = (1/T) ∫_{t−iT}^{t−(i−1)T} g(u) g(s − t + u) du

   where, in the last line, we use the change of variables

   u = t − iT + α ⇔ α = u − t + iT
   dα = du
   α = 0 ⇒ u = t − iT
   α = T ⇒ u = t − (i − 1)T

   It follows that

   Rx(t, s) = ∑_{i=−∞}^{∞} (1/T) ∫_{t−iT}^{t−(i−1)T} g(u) g(s − t + u) du
            = (1/T) ∫_{−∞}^{∞} g(u) g(s − t + u) du
            = (1/T) Rg(s − t)

   where we define:

   Rg(t) = ∫_{−∞}^{∞} g(u) g(u − t) du = g(t) ∗ g(−t)

Hence Rx (t, s) is independent of the time origin.


Remark. The function Rg (t) is called (deterministic) autocorrelation func-
tion of the energy signal g(t). We have then showed that the autocorrela-
tion function Rx (t, s) of the binary random process x(t) is proportional to
the deterministic autocorrelation function Rg (t) at s−t. Even though they
bear the same name, Rx (t, s) and Rg (t) are not to be confused; Rx (t, s) is
used with random processes and Rg (t) is used with deterministic (energy)
signals.6
Case 1 (Square pulse). The pulse is shown in figure 3.10(a). Typical
sample functions are sketched on figures 3.10(b) - 3.10(e), and the corre-
sponding autocorrelation function is plotted on figure 3.10(f).

Case 2 (Raised-sine pulse). The pulse is shown in figure 3.11(a). Typical


sample functions are sketched on figures 3.11(b) - 3.11(e), and the corre-
sponding autocorrelation function is plotted on figure 3.11(f). Notice that
the maximum value 0.375 in the autocorrelation function corresponds to:
(1/T) ∫_{0}^{T} g²(t) dt = (1/(1 ms)) ∫_{0}^{1 ms} ( (1 − cos(2π(1 kHz)t))/2 )² dt = 0.375
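
The deterministic autocorrelation Rg(t) = g(t) ∗ g(−t) used in both cases can also be approximated numerically; the sketch below (ours) does this on a fine time grid for the square and raised-sine pulses and checks the peak values Rx(0) = Rg(0)/T (1 for the square pulse, 0.375 for the raised-sine pulse).

```python
# Sketch: numerical Rg(t) = g(t)*g(-t) for the square and raised-sine pulses (T = 1 ms).
import numpy as np

T, dt = 1e-3, 1e-6
t = np.arange(0, T, dt)
square = np.ones_like(t)
raised_sine = (1 - np.cos(2 * np.pi * t / T)) / 2

for name, g in [("square", square), ("raised-sine", raised_sine)]:
    Rg = np.correlate(g, g, mode="full") * dt      # discrete approximation of the integral
    print(name, "Rx(0) = Rg(0)/T =", Rg.max() / T) # 1.0 and 0.375 respectively
```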

The autocorrelation function is related to the average bandwidth of a random


process as detailed in the next section, but we first present its properties.

Theorem 40 (Properties of Rx (τ )). Let x(t) be a wide-sense stationary pro-


cess. The autocorrelation function Rx (τ ) = E[x(t)x(t + τ )] of x(t) satisfies the
following:

1. Rx (0) ≥ |Rx (τ )|, ∀τ ∈ R.

2. Rx (τ ) = Rx (−τ ), ∀τ ∈ R.

3. If x(t) is periodic then Rx (τ ) is periodic.

4. Rx (0) = E[x2 (t)].

Sketch of proof.

1. It has been shown that |Cov(x1 , x2 )| ≤ σx1 σx2 for any two (finite variance)
random variables x1 , x2 .7 Consequently, if x(t) is wide sense stationary
6 The same name autocorrelation function is used with both deterministic signals and ran-

dom signals because both functions relate to similar physical quantities when the random
signals are wide-sense stationary (see definition 3.3.3 below and the Wiener-Khinchine theo-
rem in section 3.4 below).
7 Can also be proved by expanding 0 ≤ E[(x(t) ± x(t + τ ))2 ].

(a) Square pulse in binary random process (b) typical sample function

(c) other typical sample function (d) yet another sample function

(e) yet another one more sample function (f) autocorrelation function

Figure 3.10: Binary random process with square pulses, T = 1 ms.



(a) Raised-sine pulse in binary random process (b) typical sample function

(c) other typical sample function (d) yet another sample function

(e) yet another one more sample function (f) autocorrelation function

Figure 3.11: Binary random process with raised sine pulses, T = 1 ms.

then

|Lx(s − t)| ≤ √( Var(x(t)) Var(x(s)) ) = Lx(0) = Var(x(t))

(since Var(x(t)) = Var(x(s)) = Lx(0)), which implies that

|Lx(s − t)| + mx² ≤ Lx(0) + mx² = Rx(0)

We also have |Lx(s − t)| + mx² ≥ |Lx(s − t) + mx²| = |Rx(s − t)|. Therefore
|Rx(τ)| ≤ Rx(0) for any τ = s − t.

2.

Rx (τ ) = E[x(t0 )x(t0 + τ )], ∀t0


= E[x(t0 + τ )x(t0 )], ∀t0
= E[x(t)x(t − τ )], ∀t = t0 + τ
= Rx (−τ )

3. Let T be the period of x(t). Then x(t) = x(t + T ) for every t. It follows
that:

Rx (τ ) = E[x(t0 )x(t0 + τ )]; ∀t0


= E[x(t0 )x(t0 + τ + T )]; ∀t0
= Rx (τ + T )

4. Obvious from the definitions.

The definition of the autocorrelation function was motivated by the concept


of moments. The following definition and discussion show that the autocor-
relation function is surprisingly also related to some physical quantities in a
wide-sense stationary random process.

Definition 3.3.3. Let x(t) be a wide-sense stationary process with mean func-
tion mx and autocorrelation function Rx (τ ) such that Rx (0) is finite.

1. mx ≡ average DC value of x(t).

2. Rx (0) ≡ average total (DC + AC) power of x(t).

3. Lx (0) ≡ average AC power of x(t).



Discussion: Consider the set of all the sample functions x(ω, t), ∀ω ∈ Ω. In
the applications to communication, x(ω, t) represents an electric signal (voltage
or current), for which

DC value    ≡ lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(ω, t) dt = ⟨x(ω, t)⟩
Total power ≡ lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(ω, t)² dt = ⟨x(ω, t)²⟩
AC power    ≡ lim_{T→∞} (1/(2T)) ∫_{−T}^{T} (x(ω, t) − ⟨x(ω, t)⟩)² dt = ⟨(x(ω, t) − ⟨x(ω, t)⟩)²⟩

The sample functions x(ω, t) do not all have the same DC values, total power nor
AC power. It can however be shown that whenever Rx(0) is finite then the
integrations and expectations can be interchanged. We then obtain:

average DC value ≡ E[⟨x(ω, t)⟩] = E[ lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(ω, t) dt ]
                 = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} E(x(ω, t)) dt
                 = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} mx dt
                 = mx lim_{T→∞} (1/(2T)) ∫_{−T}^{T} dt
                 = mx

since E(x(ω, t)) = mx(t) = mx whenever the process is wide-sense stationary.
Similarly we have:

( ∫ T )
1
average total power ≡ E[⟨x(ω, t)2 ⟩] = E lim x(ω, t)2 dt
T →∞ 2T −T
∫ T
1
= lim E(x(ω, t)2 )dt
T →∞ 2T −T
∫ T
1
= lim Rx (0)dt
T →∞ 2T −T
∫ T
1
= Rx (0) lim dt
T →∞ 2T −T
| {z }
1
= Rx (0)

since E(x(ω, t)²) = Rx(0) whenever the process is wide-sense stationary. In the
same manner we can show that

E[⟨(x(ω, t) − ⟨x(ω, t)⟩)²⟩] = Lx(0) −→ average AC power

mx² represents the average DC power as shown below:

Rx(0) = Lx(0) + mx²
(average total power = average AC power + average DC power)

3.4 Correlation Functions and Power Spectra


W&J, p 179 - 185.

Definition 3.4.1. Let x(t) be a wide-sense stationary process and Rx(τ) its
autocorrelation function. Then

Sx(f) ≜ F(Rx(τ)) = ∫_{−∞}^{∞} Rx(τ) e^{−j2πfτ} dτ

is called power spectral density of x(t).

Remarks.

1. Rx(τ) can be recovered from Sx(f) by:

   Rx(τ) = ∫_{−∞}^{∞} Sx(f) e^{j2πfτ} df

2. The sample function x(ω, t) can be used to define a random process in the
   variable f ∈ R as follows: consider the sample functions:

   X(ω, f) = lim_{T→∞} (1/(2T)) | ∫_{−T}^{T} x(ω, t) e^{−j2πft} dt |² .

For each ω ∈ Ω we obtain a (possibly) different function X(ω, f ) and


we easily see that X(f ) is a complex random process. It can be shown
that if Rx (0) is finite then E[X(f )] = Sx (f ). This result is known as the
Wiener-Khinchine theorem and means that “Sx (f ) represents the average
frequency content of the sample functions of the process x(t)”. The phys-
ical meaning of Sx (f ) will be illustrated later in theorem 43 (also W&J
pp. 182 - 183).

Sketch of proof - Wiener-Khinchine theorem.
We assume that integrations and expectation are freely interchangeable. We
show that F⁻¹(E(X(f))) = Rx(τ). Given that by definition F⁻¹(Sx(f)) = Rx(τ),
this implies that E(X(f)) = Sx(f). First we have:

X(ω, f) = lim_{T→∞} (1/(2T)) ( ∫_{−T}^{T} x(ω, t) e^{−j2πft} dt ) ( ∫_{−T}^{T} x(ω, s) e^{−j2πfs} ds )*
        = lim_{T→∞} (1/(2T)) ( ∫_{−T}^{T} x(ω, t) e^{−j2πft} dt ) ( ∫_{−T}^{T} x(ω, s) e^{j2πfs} ds )
        = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} x(ω, t) x(ω, s) e^{−j2πf(t−s)} dt ds

It follows that:

E(X(ω, f)) = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} E(x(t)x(s)) e^{−j2πf(t−s)} dt ds,     where E(x(t)x(s)) = Rx(t − s)

Therefore

F⁻¹(E(X(f))) = ∫_{−∞}^{∞} ( lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} Rx(t − s) e^{−j2πf(t−s)} dt ds ) e^{j2πfτ} df
             = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} Rx(t − s) ∫_{−∞}^{∞} e^{−j2πf(t−s)} e^{j2πfτ} df dt ds
             = lim_{T→∞} (1/(2T)) ∫∫_{−T}^{T} Rx(t − s) ∫_{−∞}^{∞} e^{j2πf(s−t+τ)} df dt ds     (inner integral = δ(s − t + τ))
             = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} ∫_{−∞}^{∞} Rx(t − s) δ(s − t + τ) dt ds           (inner integral = Rx((s + τ) − s) = Rx(τ))
             = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} Rx(τ) ds
             = Rx(τ)

Example 3.4.1. We take the Fourier transform of the autocorrelation functions
of the binary random processes sketched in figures 3.10 and 3.11 to estimate the
bandwidth requirements of an average sample function.
From example 3.3.5 we have:

Rx(τ) = (1/T) Rg(τ) = (1/T) g(τ) ∗ g(−τ)

By definition and the properties of the Fourier transform we have:

Sx(f) = F(Rx(τ)) = (1/T) |G(f)|²

where G(f) = F(g(t)) denotes the Fourier transform of the pulse g(t). In the
following we use:

sinc(x) ≜ sin(πx)/(πx)

rect(x) ≜ { 1 ; −1/2 < x < 1/2
          { 0 ; elsewhere

∆(x) ≜ { 1 − 2|x| ; −1/2 < x < 1/2
       { 0        ; elsewhere

• For the rectangular pulse we use the Fourier transform pair:

  rect(t/(1 ms)) ←→ (1 ms) sinc((1 ms) f) = (1 ms) sinc(f/1000)

  It follows that

  |G(f)| = (1 ms) |sinc(f/1000)|

  for the rectangular pulse.⁸

• For the triangular pulse we use the Fourier transform pair

  ∆(t/(1 ms)) ←→ (0.5 ms) sinc²((0.5 ms) f)

  This is G(f) for the triangular pulse.


• For the raised-cosine pulse we notice that
( )
1 + cos(2πt/(1 ms))
g(t + 0.5 ms) = rect(t/(1 ms))
2
1 1
= rect(t/(1 ms)) + rect(t/(1 ms)) cos(2πt/(1 ms))
2 2
Using the property of modulation on the Fourier transform pair
1
rect(t/(1 ms)) ←→ (0.5 ms) sinc((1 ms) f ) =
2
(0.5 ms) sinc(f /1000)

we obtain
1
rect(t/(1 ms)) cos(2πt/(1 ms)) ←→ (0.25 ms)sinc((1.0 ms)(f − (1 kHz)))+
2
(0.25 ms)sinc((1.0 ms)(f + (1 kHz))) =
(0.25 ms)sinc((f − 1000)/1000)+
(0.25 ms)sinc((f + 1000)/1000)
8 recall that the pulses in example 3.3.5 go from 0 to T = 1 ms, whereas the above goes

from -0.5 ms to +0.5 ms. The time delay affects the phase of the Fourier transform only; it
has no effect on the magnitude, which is all that is required here.
126 CHAPTER 3. RANDOM WAVEFORMS

It follows that

|G(f )| = (0.5 ms)sinc(f /1000)+
(0.25 ms)sinc((f − 1000)/1000)+

(0.25 ms)sinc((f + 1000)/1000)

for the raised cosine pulse.

Figure 3.12 shows sketches of the Power spectral densities of the binary ran-
dom processes of example 3.3.5 for the cases of rectangular pulse (green curve),
triangular pulse (blue curve) and raised cosine pulse (red curve). Experimental

Figure 3.12:

measurements were taken which confirm the above results; refer to figure 3.13.
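
The power spectral densities sketched in figure 3.12 can also be estimated from simulated sample functions. The sketch below (ours; the sampling rate, symbol count and trial count are arbitrary simulation choices) generates sample functions of the binary random process with square pulses, averages their periodograms, and compares the result with Sx(f) = (1/T)|G(f)|² = T sinc²(Tf); the same procedure applies to the other pulses.

```python
# Sketch: averaged periodogram of the binary random process (square pulses, T = 1 ms)
# compared with Sx(f) = (1/T)|G(f)|^2 = T*sinc^2(T*f).
import numpy as np

T, fs, n_sym, n_trials = 1e-3, 100e3, 200, 400
sps = int(T * fs)                          # samples per symbol
N = n_sym * sps
rng = np.random.default_rng(0)

psd = np.zeros(N)
for _ in range(n_trials):
    a = rng.choice([-1.0, 1.0], size=n_sym)
    x = np.repeat(a, sps)                  # square pulses of duration T
    X = np.fft.fft(x) / fs                 # approximation of the continuous Fourier transform
    psd += np.abs(X) ** 2 / (N / fs)       # periodogram: |X(f)|^2 / (observation time)
psd /= n_trials

f = np.fft.fftfreq(N, 1 / fs)
theory = T * np.sinc(T * f) ** 2
i = np.argmin(np.abs(f - 500.0))           # compare at f = 500 Hz, say
print("estimate:", psd[i], " theory:", theory[i])
```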

Theorem 41. (W&J pp. 179 - 180) Let z(t) be a random process with mean
function mz (t), autocorrelation function Rz (t, s) which is finite for all t, s ∈ R.
Define the random process:
y(t) = ∫_{−∞}^{∞} z(α) h(t − α) dα = z(t) ∗ h(t),

Figure 3.13:

i.e. y(t) is the output of a (bounded-input-bounded-output stable, linear, invari-


ant) filter of impulse response h(t) with input z(t). Then:

my(t) = ∫_{−∞}^{∞} mz(α) h(t − α) dα = mz(t) ∗ h(t)

Ry(t, s) = ∫∫_{−∞}^{∞} Rz(α, β) h(t − α) h(s − β) dα dβ

Ly(t, s) = ∫∫_{−∞}^{∞} Lz(α, β) h(t − α) h(s − β) dα dβ

Sketch of proof: Assuming that expectations and integrals are freely interchangeable we have:

my(t) = E(y(t))
      = E( ∫_{−∞}^{∞} z(α) h(t − α) dα )
      = ∫_{−∞}^{∞} E(z(α)) h(t − α) dα
      = ∫_{−∞}^{∞} mz(α) h(t − α) dα
      = mz(t) ∗ h(t)

Ry(t, s) = E(y(t)y(s))
         = E( ∫_{−∞}^{∞} z(α) h(t − α) dα ∫_{−∞}^{∞} z(β) h(s − β) dβ )
         = E( ∫∫_{−∞}^{∞} z(α) z(β) h(t − α) h(s − β) dα dβ )
         = ∫∫_{−∞}^{∞} E(z(α)z(β)) h(t − α) h(s − β) dα dβ        (E(z(α)z(β)) = Rz(α, β))
         = ∫∫_{−∞}^{∞} Rz(α, β) h(t − α) h(s − β) dα dβ

By definition and using the above we then have:

Ly(t, s) = Ry(t, s) − my(t) my(s)
         = ∫∫_{−∞}^{∞} Rz(α, β) h(t − α) h(s − β) dα dβ − ∫∫_{−∞}^{∞} mz(α) h(t − α) mz(β) h(s − β) dα dβ
         = ∫∫_{−∞}^{∞} ( Rz(α, β) − mz(α) mz(β) ) h(t − α) h(s − β) dα dβ        (Rz(α, β) − mz(α)mz(β) = Lz(α, β))
         = ∫∫_{−∞}^{∞} Lz(α, β) h(t − α) h(s − β) dα dβ

Theorem 42 (W&J, pp. 181 - 182). Let z(t) be a wide-sense stationary random
process with mean mz, autocorrelation function Rz(τ) and power spectral density
Sz(f). Define the random process

y(t) = ∫_{−∞}^{∞} z(α) h(t − α) dα = z(t) ∗ h(t)

Then y(t) is a wide-sense stationary random process and if Rz(0) is finite then:

my(t) = my = mz ∫_{−∞}^{∞} h(t) dt
Ry(τ) = ∫∫_{−∞}^{∞} Rz(τ + α − β) h(α) h(β) dα dβ
Ly(τ) = ∫∫_{−∞}^{∞} Lz(τ + α − β) h(α) h(β) dα dβ
Sy(f) = Sz(f) |H(f)|²

where H(f) = ∫_{−∞}^{∞} h(t) e^{−j2πft} dt.

Sketch of proof. Top of page 182 in W&J; easy to read.
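
A discrete-time analogue of theorem 42 is easy to demonstrate numerically (our own sketch, not W&J's): filtering white noise z[n] of variance σz² with an FIR impulse response h gives an output whose average power Ry(0) equals σz² ∑ h[n]², which is the discrete counterpart of Ry(0) = ∫ Sz(f)|H(f)|² df when Sz(f) is flat.

```python
# Sketch (discrete-time analogue of theorem 42): for white input z[n] with variance
# sigma_z^2, the filtered output y = z * h has average power Ry(0) = sigma_z^2 * sum(h^2).
import numpy as np

rng = np.random.default_rng(2)
sigma_z = 1.5
z = sigma_z * rng.standard_normal(1_000_000)
h = np.array([0.2, 0.5, 0.9, 0.5, 0.2])          # an arbitrary FIR impulse response

y = np.convolve(z, h, mode="valid")
print("measured Ry(0)  =", np.mean(y**2))
print("predicted Ry(0) =", sigma_z**2 * np.sum(h**2))
```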

Theorem 43 (Properties of Sx (f )). (W& J pp. 184 - 185)

1. Sx (f ) ∈ R, ∀f ,

2. Sx (f ) = Sx (−f ), ∀f ,

3. Sx (f ) ≥ 0, ∀f .
4. Rx(0) = ∫_{−∞}^{∞} Sx(f) df.

Proof.

1. Follows from Rx (τ ) being an even function.

2. Follows from Rx (τ ) and Sx (f ) being real functions.

3. Consider the situation depicted in figure 3.14. We have:

   0 ≤ Ry(0) = ∫_{−∞}^{∞} Sy(f) e^{j2πfτ} df |_{τ=0}
             = ∫_{−∞}^{∞} Sy(f) df
             = ∫_{−∞}^{∞} Sx(f) |H(f)|² df
             ≈ Sx(−f0)∆f + Sx(f0)∆f
             = 2 Sx(f0) ∆f

   It follows that:

   Sx(f0) ≈ Ry(0)/(2∆f) ≥ 0, ∀f0.

Figure 3.14:

The above proof of property (3) illustrates the physical meaning of the power spectral density S_x(f):
\[
S_x(f_0) \approx \frac{\text{average power of } y(t)}{2\Delta f}
= \frac{\text{average power of } x(t) \text{ in the bands } |f \pm f_0| \le \Delta f/2}{2\Delta f}
\]
In concrete words (W&J, pp. 182 - 183):

"S_x(f) describes the distribution of mean power with frequency in the process x(t)."

3.5 Ergodicity
Up to now, the moments of random processes have been calculated with expressions such as:
\[
\overline{x(t)} = E[x(t)] = \int_{-\infty}^{\infty} \alpha\,p_{x(t)}(\alpha)\,d\alpha
\]
\[
\overline{\big(x(t)-\overline{x(t)}\big)^2} = \mathrm{Var}\big(x(t)\big) = \int_{-\infty}^{\infty} \big(\alpha-\overline{x(t)}\big)^2 p_{x(t)}(\alpha)\,d\alpha
\]
\[
R_x(s,t) = \iint_{-\infty}^{\infty} \alpha\beta\,p_{x(t)x(s)}(\alpha,\beta)\,d\alpha\,d\beta
\]
and so on.

Operations of this form are called ensemble averages: the expectation is calcu-
lated through all the sample functions at fixed time value(s). This is the true
expectation on a random process. It may however be impractical for experi-
mental measurements since it requires a statistically large number of sample

functions, as opposed to the following time averages:


\[
\langle x(\omega,t)\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega,t)\,dt
\]
\[
\big\langle \big(x(\omega,t)-\langle x(\omega,t)\rangle\big)^2\big\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}\big(x(\omega,t)-\langle x(\omega,t)\rangle\big)^2\,dt
\]
\[
\langle x(\omega,t)x(\omega,t+\tau)\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega,t)x(\omega,t+\tau)\,dt
\]
and so on.
In the above expressions, x(ω, t) is a sample function of x(t) and different ω’s
generally yield different values in the above integrals. An ergodic process yields
the same values for all sample functions x(ω, t).
Definition 3.5.1. A stationary random process x(t) is ergodic if
\[
E\big[g\big(x(t),x(t+\tau_1),\ldots,x(t+\tau_{n-1})\big)\big]
= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} g\big(x(\omega,t),x(\omega,t+\tau_1),\ldots,x(\omega,t+\tau_{n-1})\big)\,dt
\]
for every function g : R^n → R, for all τ_1, τ_2, ..., τ_{n-1} ∈ R, for all n ∈ N* and for all ω ∈ Ω.
In other words, each sample function of a process that is ergodic contains all the statistical information of the random process; time and ensemble averages consequently yield the same values. In particular all sample functions have the same DC value, average power, AC power and frequency content:
\[
E[x(t)] = m_x(t) = m_x = \langle x(\omega,t)\rangle
\]
\[
E[x(t)x(t+\tau)] = R_x(\tau) = \langle x(\omega,t)x(\omega,t+\tau)\rangle
\]
\[
S_x(f) = \lim_{T\to\infty}\frac{1}{2T}\left|\int_{-T}^{T} x(\omega,t)e^{-j2\pi f t}\,dt\right|^2
\]
since by definition an ergodic process is necessarily stationary (but the converse is not always true).
Remark. If x(t) is wide sense stationary but not ergodic we still refer to
mx as average DC value
Rx (0) as average power
Lx (0) as average AC power
As mentioned in the discussion following definition 3.3.3, the quantities may
differ from one sample function to another and mx , Rx (0), Lx (0) are then the
expected values of the quantities (average DC value, average power, average AC
power) calculated through the ensemble of all the sample functions. Similarly,
if x(t) is wide sense stationary but not ergodic, Sx (f ) represents the ensemble
average of the frequency content of the sample functions (Wiener-Khinchine
theorem).
132 CHAPTER 3. RANDOM WAVEFORMS

Example 3.5.1. The random process of example 3.1.1 is not ergodic (obviously, since the process is not stationary; this example illustrates this by calculating the time averages). We recall that:
\[
x : \Omega\times\mathbb{R} \to \mathbb{R}, \qquad x(\mathrm{Head},t) = 2, \qquad x(\mathrm{Tail},t) = \sin t
\]
Indeed, it was found that m_x(t) = E[x(t)] = 1 + \sin(t)/2. We also have:
\[
\langle x(\mathrm{Head},t)\rangle = \langle 2\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} 2\,dt = 2
\]
\[
\langle x(\mathrm{Tail},t)\rangle = \langle \sin t\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} \sin t\,dt = 0
\]
All of m_x(t), ⟨x(Head,t)⟩, ⟨x(Tail,t)⟩ (and a whole lot more) would have to be equal for the process to be ergodic.
Example 3.5.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable uniformly distributed between 0 and 2π. It was found in example 3.3.2 that m_x(t) = E[x(t)] = 0, ∀t. We also have:
\[
\langle x(\omega,t)\rangle = \langle 10\sin(2\pi t+\theta(\omega))\rangle
= \lim_{T\to\infty}\frac{1}{2T}\underbrace{\int_{-T-\theta(\omega)/2\pi}^{T-\theta(\omega)/2\pi}}_{\text{interval of length } 2T} 10\sin(2\pi t+\theta(\omega))\,dt = 0
\]
for any value of θ(ω). It was also found that R_x(τ) = 50 cos(2πτ). We now find:
\[
\langle x(\omega,t)x(\omega,t+\tau)\rangle
= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} 100\sin(2\pi t+\theta(\omega))\sin(2\pi(t+\tau)+\theta(\omega))\,dt
= 50\cos(2\pi\tau)
\]
for any value of θ(ω). The above indicates (but does not guarantee) that the process may be ergodic.
In the following section we investigate Gaussian random processes, some of
which are ergodic.
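A small numerical sketch of example 3.5.2 (ours, not W&J's): for one fixed draw of θ, i.e. a single sample function, the time averages are computed over a long window and compared with the ensemble values m_x = 0 and R_x(τ) = 50 cos(2πτ). Window length and lags are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi)        # one fixed sample function x(omega, t)

T = 5000.0                                   # averaging window [-T, T]
dt = 1e-3
t = np.arange(-T, T, dt)
x = 10.0 * np.sin(2.0 * np.pi * t + theta)

print("time-average mean:", x.mean())        # close to the ensemble mean 0

for tau in (0.0, 0.1, 0.25, 0.4):
    lag = int(round(tau / dt))
    Rtau_time = np.mean(x[:len(x) - lag] * x[lag:])   # <x(omega,t) x(omega,t+tau)>
    print(f"tau = {tau:4.2f}: time average = {Rtau_time:8.4f}, "
          f"50*cos(2*pi*tau) = {50 * np.cos(2 * np.pi * tau):8.4f}")
```

Repeating the run with a different seed (a different sample function) gives the same time averages, which is exactly the ergodic behaviour suggested in the example.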

3.6 Gaussian random processes


(refer to W&J, pp. 171 - 179, 186 - 192)
Definition 3.6.1. (W&J, p. 171) A random process x(t) is Gaussian if the
random vector x(t) is Gaussian for all t = (t1 , t2 , . . . , tk ) ∈ Rk and for all
k ∈ N∗ .
In many practical situations, the conditions of the central limit theorem are
satisfied which makes “Gaussian processes of the utmost practical (as well as
mathematical) importance”.10
In light of sections 2.13 and 3.3, we see that a Gaussian random process is
specified by its mean function mx (t) and its autocorrelation function Rx (t, s)
or its autocovariance function Lx (t, s).11 The following example illustrates that
the roles played by the mean and autocovariance functions are similar to that
of the mean vector and covariance matrix of a Gaussian random vector.
Example 3.6.1. (W&J, page 174) Let x(t) be a Gaussian random process with:
\[
m_x(t) = \sin(\pi t) \qquad \text{(W\&J, eq. 3.96a)}
\]
\[
L_x(t,s) = e^{-|t-s|} \qquad \text{(W\&J, eq. 3.96b)}
\]
Notice that x(t) is not (wide sense) stationary (nor ergodic). We let x = (x_1, x_2) = (x(1), x(3/2)) and find:
\[
m_{\mathbf{x}} = (0,\,-1)
\]
\[
\Lambda_x = \begin{bmatrix} 1 & e^{-1/2} \\ e^{-1/2} & 1 \end{bmatrix}, \qquad
|\Lambda_x| = 1 - e^{-1} = \frac{e-1}{e}, \qquad
\Lambda_x^{-1} = \frac{e}{e-1}\begin{bmatrix} 1 & -e^{-1/2} \\ -e^{-1/2} & 1 \end{bmatrix}
= \begin{bmatrix} \frac{e}{e-1} & \frac{e^{1/2}}{1-e} \\[4pt] \frac{e^{1/2}}{1-e} & \frac{e}{e-1} \end{bmatrix}
\]
Finally we obtain from equation (W&J, 3.90):
\[
p_{\mathbf{x}}(\boldsymbol{\alpha}) = \frac{1}{2\pi\sqrt{|\Lambda_x|}}\exp\!\Big(\frac{-1}{2}(\boldsymbol{\alpha}-m_{\mathbf{x}})\,\Lambda_x^{-1}\,(\boldsymbol{\alpha}-m_{\mathbf{x}})^T\Big)
\]
Similarly any joint probability density function p_{x(t)}(α) can be calculated for any k ∈ N* and any t = (t_1, t_2, ..., t_k) ∈ R^k.
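The joint density in example 3.6.1 is easy to evaluate numerically. The following sketch (ours) builds Λ_x from L_x(t,s) = e^{-|t-s|} and the mean vector from m_x(t) = sin(πt), then evaluates p_x(α) with SciPy's multivariate normal; the evaluation points are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

m_x = lambda t: np.sin(np.pi * t)            # mean function (W&J, eq. 3.96a)
L_x = lambda t, s: np.exp(-abs(t - s))       # autocovariance function (W&J, eq. 3.96b)

times = np.array([1.0, 1.5])                 # x = (x(1), x(3/2))
mean = m_x(times)                            # = (0, -1) up to round-off
cov = np.array([[L_x(ti, tj) for tj in times] for ti in times])

print("mean vector:", mean)
print("covariance:\n", cov)
print("det:", np.linalg.det(cov), " (should equal 1 - 1/e =", 1 - np.exp(-1), ")")

px = multivariate_normal(mean=mean, cov=cov)
for alpha in ([0.0, -1.0], [1.0, 0.0]):      # evaluate p_x(alpha) at two sample points
    print("p_x(", alpha, ") =", px.pdf(alpha))
```

The same three lines (mean function, covariance function, list of sampling times) produce p_{x(t)}(α) for any k and any t, which is precisely the sense in which the Gaussian process is specified by m_x(t) and L_x(t,s).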
Remarks.
1. Λx(t) is symmetrical as required since Rx (t, s) = E[x(t)x(s)] =
E[x(s)x(t)] = Rx (s, t).
10 W&J top of page 72.
11 We recall that Lx (t, s) = Rx (t, s) − mx (t)mx (s).

2. Random processes which are not Gaussian are in general not specified by
mx (t) and Rx (t, s) or Lx (t, s) alone; other moment functions are required.
Indeed W&J shows two random processes y(t) and z(t) having the same
mean and autocorrelation functions (⇒ same covariance function) at the
top of page 175, but having different probability density functions. This
means that in order to specify these non-Gaussian random processes, more
information is required than that carried by the mean and autocorrelation
functions.

3. (W&J, pp. 175 - 177) A Gaussian random process x(t) is (strict sense) stationary if and only if it is wide sense stationary. This follows from the previous remark that any joint probability density function p_{x(t)}(α) is completely determined by the functions m_x(t) and R_x(t,s): if x(t) is wide sense stationary, the functions m_x(t) and R_x(t,s) are independent of the time origin and any p_{x(t)}(α) is consequently also independent of the time origin. Wide sense stationarity of non-Gaussian random processes does not guarantee their strict sense stationarity. For example, W&J shows at the top of page 177 a (non-Gaussian) random process which is wide sense stationary, but not strict sense stationary.

Theorem 44. Let x(t) be a stationary Gaussian random process with autocorrelation function R_x(τ). Then:
\[
\int_{-\infty}^{\infty} |R_x(\tau)|\,d\tau < \infty \quad\Rightarrow\quad x(t)\ \text{is ergodic}
\]
Remark. The condition of the theorem is not satisfied if m_x ≠ 0; R_x(τ) cannot be replaced by L_x(τ) in the theorem.

Theorem 45. (W&J, p. 179) Let x(t), y(t) be random processes such that:
\[
y(t) = x(t)*h(t) = \int_{-\infty}^{\infty} h(\alpha)x(t-\alpha)\,d\alpha
\]
i.e. y(t) is the output of a bounded-input-bounded-output stable linear invariant filter of impulse response h(t) with input x(t). Then y(t) is Gaussian whenever x(t) is Gaussian.

Sketch of proof. Suppose h(t) = 0, ∀|t| > T, for some 0 < T ∈ R. Then
\[
y(t) = \int_{-T}^{T} h(\alpha)x(t-\alpha)\,d\alpha \approx \sum_{i=1}^{k} h(\alpha_i)\,x(t-\alpha_i)\,\Delta\alpha_i
\]
where α_i ∈ [−T, T], ∆_i ∈ R^+, i = 1, 2, ..., k determine k disjoint intervals [α_i, α_i + ∆_i] covering [−T, T], i.e.
\[
[\alpha_i,\alpha_i+\Delta_i]\cap[\alpha_j,\alpha_j+\Delta_j] = \emptyset,\ \forall i\ne j, \qquad \bigcup_{i=1}^{k}[\alpha_i,\alpha_i+\Delta_i] = [-T,T].
\]
The random variables y(t_1), y(t_2), ..., y(t_N) can therefore be written in matrix notation as follows:
\[
\begin{bmatrix} y(t_1)\\ y(t_2)\\ \vdots\\ y(t_N)\end{bmatrix}
=
\begin{bmatrix} \mathbf{h} & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathbf{h} & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{h}\end{bmatrix}
\times
\begin{bmatrix} x(t_1-\alpha_1)\\ \vdots\\ x(t_1-\alpha_k)\\ x(t_2-\alpha_1)\\ \vdots\\ x(t_2-\alpha_k)\\ \vdots\\ x(t_N-\alpha_1)\\ \vdots\\ x(t_N-\alpha_k)\end{bmatrix}
\]
where h = [h(α_1)∆α_1, h(α_2)∆α_2, ..., h(α_k)∆α_k]. It follows from property 4, W&J p. 166 that the random vector [y(t_1), y(t_2), ..., y(t_N)] is Gaussian. As stated in W&J, "the formal proof is mathematically involved" and deals with infinite weighted sums of Gaussian random variables.

In general, the probability density functions of the output random process are not of the same nature as those of the input random process, but they are for a Gaussian process. Combining this result with those of theorems 41 and 42, we see that if x(t) is Gaussian and specified then y(t) is also Gaussian and specified.

3.6.1 Jointly Gaussian Processes


(W&J, pp. 186 - 188)
Although some of the definitions and results of this subsection apply to any
random processes, we limit the discussion to Gaussian random processes.
Definition 3.6.2. Let y(t) and z(t) be Gaussian random processes. If

w = (y(t1 ), y(t2 ), . . . , y(tk ), z(tk+1 ), z(tk+2 ), . . . , z(tk+l ))

is a Gaussian random vector ∀k, l ∈ N, not both 0, and ∀t = (t1 , t2 , . . . , tk+l ) ∈


Rk+l then y(t) and z(t) are called jointly Gaussian random processes.
Example 3.6.2. (W&J, p. 186) Consider the following situation depicted in
Fig. 3.25, W&J, and figure 3.15. The sketch of proof of theorem 45 can be

Figure 3.15:

generalized to the present situation to show that any random vector formed by
samples taken from y(t) and z(t) is Gaussian. Thus y(t) and z(t) are jointly
Gaussian random processes.

We recall that the random processes y(t) and z(t) are specified by their
respective mean and autocorrelation functions since they are Gaussian. But
the joint probability density function of the random vector denoted as w in
definition 3.6.2 cannot be written without the knowledge of E[y(tj )z(si )] or
Cov(y(tj ), z(si )) = E[y(tj )z(si )] − my (tj )mz (si ). This is contained in the cross-
correlation function defined next.

Definition 3.6.3.

1. The function
\[
R_{yz}(t,s) \triangleq E[y(t)z(s)]
\]
is called the cross-correlation function of the random processes y(t) and z(t).

2. The function
\[
L_{yz}(t,s) \triangleq \mathrm{Cov}\big(y(t),z(s)\big)
\]
is called the cross-covariance function of the random processes y(t) and z(t).

One easily shows that L_{yz}(t,s) = R_{yz}(t,s) − m_y(t)m_z(s). The joint probability density function of any random vector such as w in definition 3.6.2 can be written from the knowledge of the functions m_y(t), m_z(t), R_y(t,s), R_z(t,s), R_{yz}(t,s). In such cases y(t) and z(t) are said to be jointly specified, meaning that any joint probability density function of the form p_{y(t),z(s)}(α, β) can be calculated. If the mean and correlation functions are independent of the time origin, y(t) and z(t) are said to be jointly stationary (jointly wide sense stationary if the processes are not jointly Gaussian) and, as usual, one argument of the functions can be dropped; we write m_y, m_z, R_y(τ), R_z(τ), R_{yz}(τ) ≜ R_{yz}(t, t+τ), in which τ = s − t.
Remark. In the case of the autocorrelation, we have R_y(τ) = R_y(t, t+τ) = R_y(t+τ, t) since the autocorrelation function is even, as shown in theorem 40. For the cross-correlation function we assume, unless otherwise stated, that:
\[
R_{yz}(\tau) \triangleq R_{yz}(t,t+\tau) = R_{zy}(t+\tau,t) \triangleq R_{zy}(-\tau)
\]
and the functions R_{yz}(\,), R_{zy}(\,) are in general not even.
Definition 3.6.4. Let y(t), z(t) be jointly stationary Gaussian random processes with cross-correlation function R_{yz}(τ). Then the function
\[
S_{yz}(f) = \int_{-\infty}^{\infty} R_{yz}(\tau)e^{-j2\pi f\tau}\,d\tau
\]
is called the cross-power spectrum or cross-power spectral density. R_{yz}(τ) can be recovered uniquely from S_{yz}(f) by:
\[
R_{yz}(\tau) = \int_{-\infty}^{\infty} S_{yz}(f)e^{j2\pi f\tau}\,df
\]

In contrast to the functions R_x(τ), S_x(f), no physical interpretation can easily be made of the functions R_{yz}(τ), S_{yz}(f); they do not relate to the average total power or average frequency content. R_{yz}(τ), S_{yz}(f) are used to write the moments of random vectors obtained by sampling random processes. The following properties of the cross-correlation and cross-spectrum are easy to show and stated without proof.

Proposition 46 (Properties of R_{yz}(τ) and S_{yz}(f)).

1. R_{yz}(τ) = R_{zy}(−τ)

2. S_{yz}(f) = S_{zy}(−f)

3. S_{yz}(−f) = S_{yz}(f)*
Theorem 47. Let x(t) be a Gaussian random process and define the jointly Gaussian random processes y(t), z(t) by:
\[
y(t) = x(t)*h_y(t) = \int_{-\infty}^{\infty} x(t-\alpha)h_y(\alpha)\,d\alpha, \qquad
z(t) = x(t)*h_z(t) = \int_{-\infty}^{\infty} x(t-\alpha)h_z(\alpha)\,d\alpha
\]
If R_x(t,s) is finite for all t, s then:

1. (W&J, eq 3.127a)
\[
R_{yz}(t,s) = \iint_{-\infty}^{\infty} R_x(t-\alpha,s-\beta)h_y(\alpha)h_z(\beta)\,d\alpha\,d\beta, \qquad
L_{yz}(t,s) = \iint_{-\infty}^{\infty} L_x(t-\alpha,s-\beta)h_y(\alpha)h_z(\beta)\,d\alpha\,d\beta
\]

2. If x(t) is stationary then y(t) and z(t) are jointly stationary and:
\[
R_{yz}(\tau) = \iint_{-\infty}^{\infty} R_x(\tau+\beta-\alpha)h_y(\alpha)h_z(\beta)\,d\alpha\,d\beta, \qquad
L_{yz}(\tau) = \iint_{-\infty}^{\infty} L_x(\tau+\beta-\alpha)h_y(\alpha)h_z(\beta)\,d\alpha\,d\beta
\]
\[
S_{yz}(f) = S_x(f)H_y(f)H_z(f)^*, \qquad |S_{yz}(f)|^2 = S_y(f)S_z(f)
\]

Proof.
1. Similar to what has been done in the sketch of proof of theorem 41 by
interchanging the order of expectation and integration.
2. The cross-correlation formula follows from the general formula by replac-
ing Rx (t − α, s − β) with Rx (s − t + α − β) and identifying τ = s − t.
The cross-power spectra formula directly follows from the properties of
the Fourier transform applied to both sides of the formula for the cross-
correlation.

Remark. If H_y(f) and H_z(f) are non-overlapping (as in W&J, Fig. 3.26), i.e. H_y(f)H_z(f) = 0, ∀f, and x(t) is stationary, then S_{yz}(f) = 0, ∀f, and this implies that R_{yz}(τ) = 0, ∀τ. We also have from theorem 42 that
\[
m_y = m_x\int_{-\infty}^{\infty} h_y(t)\,dt = m_x H_y(0), \qquad m_z = m_x H_z(0),
\]
and finally m_y m_z = m_x^2 H_y(0)H_z(0) = 0. Therefore:
\[
L_{yz}(\tau) = R_{yz}(\tau) - m_y m_z = 0
\]
"Thus any covariance involving both y(t) and z(t) must be zero". It follows that (W&J, eq 3.130):
\[
p_{y(t)z(s)}(\alpha,\beta) = p_{y(t)}(\alpha)\,p_{z(s)}(\beta) \tag{3.1}
\]
recalling that x(t) must be both Gaussian and stationary.
Definition 3.6.5. Two random processes y(t) and z(t) for which the joint
probability density functions satisfy equation (3.1) for all t, s ∈ Rk+l and for
all k, l ∈ N (not both 0) are called statistically independent.
Theorem 48. Two jointly Gaussian processes x(t), y(t) are statistically inde-
pendent if and only if Lxy (t, s) = 0, ∀t, s.

3.6.2 White Gaussian Noise


(W&J, pp. 188 - 192)
Definition 3.6.6. (W&J, p. 189) A zero-mean stationary Gaussian random process n_w(t) for which the power spectral density S_w(f) and the autocorrelation function R_w(τ) are of the form:
\[
S_w(f) = \frac{N_0}{2},\ \forall f, \qquad R_w(\tau) = \frac{N_0}{2}\delta(\tau)
\]
is called white Gaussian noise.

Remarks.

1. L_w(τ) = R_w(τ) since n_w(t) is zero-mean.

2. White Gaussian noise is ergodic by theorem 44.

3. R_w(τ) = (N_0/2)δ(τ) means that "white Gaussian noise represents the ultimate in randomness".12
The total average power of white Gaussian noise is R_w(0) = \int_{-\infty}^{\infty} S_w(f)\,df = \infty. As a consequence white noise is fictitious; it is a mathematical construction/model that is not found in nature. Its usefulness comes from the fact that filtered white Gaussian noise produces an ergodic (⇒ stationary) zero-mean Gaussian noise (random process) that is meaningful.

The next theorem, of which only part 1 is proved below, results from section 3.4, properties of the Fourier transform and Parseval's theorem:
Theorem 49. Let n_w(t) be a white Gaussian noise with power spectral density S_w(f) = N_0/2 and define:
\[
n(t) = n_w(t)*h(t) = \int_{-\infty}^{\infty} n_w(t-\alpha)h(\alpha)\,d\alpha
\]
in which \int_{-\infty}^{\infty} h^2(t)\,dt is finite. Let H(f) \triangleq \mathcal{F}[h(t)] = \int_{-\infty}^{\infty} h(t)e^{-j2\pi ft}\,dt. Then:

1. (W&J, eq. 3.137)
\[
R_n(\tau) = L_n(\tau) = \frac{N_0}{2}\,h(\tau)*h(-\tau) = \frac{N_0}{2}\int_{-\infty}^{\infty} h(\beta-\tau)h(\beta)\,d\beta
\]

2. (W&J, eq. 3.134a) S_n(f) = \frac{N_0}{2}|H(f)|^2
12 W&J, p. 190.

3. The total average power of n(t) is given by (W&J, eq. 3.134b):
\[
\overline{n^2(t)} = R_n(0) = \frac{N_0}{2}\int_{-\infty}^{\infty} h^2(\alpha)\,d\alpha = \frac{N_0}{2}\int_{-\infty}^{\infty} |H(f)|^2\,df
\]

Proof of part 1: From theorem 42 we have:
\[
R_n(\tau) = \iint_{-\infty}^{\infty} R_{n_w}(\tau+\alpha-\beta)h(\alpha)h(\beta)\,d\alpha\,d\beta
= \iint_{-\infty}^{\infty} \frac{N_0}{2}\delta(\tau+\alpha-\beta)h(\alpha)h(\beta)\,d\alpha\,d\beta
= \frac{N_0}{2}\int_{-\infty}^{\infty} h(\beta)h(\beta-\tau)\,d\beta
= \frac{N_0}{2}R_h(\tau)
\]
where R_h(τ) denotes the deterministic autocorrelation function of h(t).
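The discrete-time analogue of theorem 49 is easy to check numerically. In the sketch below (ours; filter and lags arbitrary), discrete white Gaussian samples with R_w[k] = (N_0/2)δ[k] are passed through an FIR filter h, and the measured autocorrelation of the output is compared with (N_0/2) times the deterministic autocorrelation of h.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)

N0_over_2 = 1.0                               # take N0/2 = 1, so R_w[k] = delta[k]
N = 2 ** 20
w = rng.normal(0.0, np.sqrt(N0_over_2), N)

h = signal.firwin(21, 0.25)                   # an arbitrary filter h[n]
n = signal.lfilter(h, 1.0, w)                 # n = w * h

R_theory = N0_over_2 * np.correlate(h, h, mode="full")   # lags -20 ... 20
lags = np.arange(-20, 21)
R_est = np.array([np.mean(n[:N - abs(k)] * n[abs(k):]) for k in lags])

print("lag   estimate   theory")
for k, est, th in zip(lags, R_est, R_theory):
    if k % 5 == 0:
        print(f"{k:4d}  {est:9.5f}  {th:9.5f}")
```

The estimates agree with (N_0/2)R_h(τ) to within the statistical fluctuation of the averaging, which is the content of part 1 of the theorem.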

Example 3.6.3. (W&J, pp. 190 - 192) We consider the filtering of a white Gaussian noise w(t) with an ideal low-pass filter of transfer function W(f) given by:
\[
W(f) = \begin{cases} 1 & ;\ |f| < W \\ 0 & ;\ \text{elsewhere}\end{cases}
\]
Denoting by n(t) the output of the filter, we have:

1. n(t) is an ergodic (⇒ stationary) Gaussian random process.

2. m_w = 0 ⇒ m_n = m_w\int_{-\infty}^{\infty} h(t)\,dt = 0.

3. S_n(f) = \begin{cases} N_0/2 & ;\ |f| < W \\ 0 & ;\ \text{elsewhere}\end{cases}

4. R_n(τ) = L_n(τ) = \mathcal{F}^{-1}[S_n(f)] = W N_0\,\mathrm{sinc}(2\tau W).^{13} R_n(τ) is sketched on figure 3.16.

5. The total average power of n(t) is W N_0. Since n(t) is ergodic, the total average power of any sample function of n(t) is also W N_0.

6. Define the jointly Gaussian random variables
\[
n_i = n\!\left(\frac{i}{2W} + T\right), \quad i = 1, 2, \ldots, k,
\]

13 Recall that sinc(x) ≜ sin(πx)/(πx).

Figure 3.16:

where T ∈ R. In order to calculate p_n(α) of the Gaussian vector n = (n_1, n_2, ..., n_k) we first calculate the mean vector m_n and covariance matrix Λ_n of n:
\[
m_{\mathbf{n}} = E[\mathbf{n}] = (\overline{n_1},\overline{n_2},\ldots,\overline{n_k}) = (m_n, m_n, \ldots, m_n) = \mathbf{0}
\]
\[
\Lambda_{\mathbf{n}} = \Big[\mathrm{Cov}(n_i,n_j)\Big]_{\substack{i=1,\ldots,k\\ j=1,\ldots,k}}
= \left[\mathrm{Cov}\Big(n\big(\tfrac{i}{2W}+T\big),\,n\big(\tfrac{j}{2W}+T\big)\Big)\right]_{\substack{i=1,\ldots,k\\ j=1,\ldots,k}}
= \begin{bmatrix}
L_n(0) & L_n(\tfrac{1}{2W}) & L_n(\tfrac{1}{W}) & \cdots & L_n(\tfrac{k-1}{2W})\\
L_n(\tfrac{1}{2W}) & L_n(0) & L_n(\tfrac{1}{2W}) & \cdots & L_n(\tfrac{k-2}{2W})\\
L_n(\tfrac{1}{W}) & L_n(\tfrac{1}{2W}) & L_n(0) & \cdots & L_n(\tfrac{k-3}{2W})\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
L_n(\tfrac{k-1}{2W}) & L_n(\tfrac{k-2}{2W}) & L_n(\tfrac{k-3}{2W}) & \cdots & L_n(0)
\end{bmatrix}
= \mathrm{diag}\big[N_0 W, N_0 W, \ldots, N_0 W\big]
\]
since L_n(i/(2W)) = W N_0\,\mathrm{sinc}(i) = 0 for every nonzero integer i; the n_i's are consequently statistically independent. Therefore:
\[
p_{\mathbf{n}}(\boldsymbol{\alpha}) = \prod_{i=1}^{k} p_{n_i}(\alpha_i)
= \frac{1}{(2\pi N_0 W)^{k/2}}\exp\!\left(\frac{-1}{2N_0 W}\sum_{i=1}^{k}\alpha_i^2\right).
\]
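A numerical sketch of this last point (ours; the filter is a long FIR approximation of the ideal low-pass, so the result is only approximate): band-limited Gaussian noise with two-sided power spectral density N_0/2 over |f| < W is generated, sampled at the rate 2W, and the sample covariance matrix of k consecutive samples is compared with N_0 W I.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(3)

N0, W = 2.0, 1.0          # S_w(f) = N0/2, low-pass bandwidth W (Hz)
fs = 16.0                 # simulation rate, well above 2W
N = 2 ** 21

# White noise whose PSD is N0/2 over |f| < fs/2 (variance = (N0/2) * fs).
w = rng.normal(0.0, np.sqrt(N0 / 2.0 * fs), N)

h = signal.firwin(2001, W, fs=fs)             # near-ideal low-pass with cutoff W
n = signal.lfilter(h, 1.0, w)[2000:]          # drop the filter transient

step = int(fs / (2.0 * W))                    # sample n(t) at t_i = i/(2W): every 8th sample
samples = n[::step]

k = 5
X = np.array([samples[i:i + k] for i in range(0, len(samples) - k, k)])
print("estimated covariance matrix:\n", np.round(np.cov(X, rowvar=False), 3))
print("theory: N0*W*I with N0*W =", N0 * W)
```

The diagonal entries come out close to N_0 W and the off-diagonal entries close to 0, illustrating the independence of Nyquist-rate samples of ideally band-limited white Gaussian noise.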

3.7 Problems

1. Consider two zero-mean jointly Gaussian noise processes n_1(t) and n_2(t) such that:14
\[
R_i(\tau) \triangleq \overline{n_i(t)n_i(t-\tau)} = \frac{\sin\pi\tau}{\pi\tau};\quad i = 1, 2,
\qquad
R_{12}(\tau) \triangleq \overline{n_1(t)n_2(t-\tau)} = \frac{\sin\pi\tau}{2\pi\tau}.
\]

(a) Write the covariance matrix of the random vector x ≜ (x_1, x_2, x_3) where:
\[
x_1 \triangleq n_1(t)\big|_{t=0}, \qquad x_2 \triangleq n_1(t)\big|_{t=1}, \qquad x_3 \triangleq n_2(t)\big|_{t=0}.
\]
Remark: \lim_{\tau\to 0}\frac{\sin\pi\tau}{\pi\tau} = 1.

(b) Calculate the variance of the random variable y ≜ x_1 + x_2 + x_3.

2. Let x(t), y(t) denote two 0-mean jointly Gaussian, jointly stationary processes such that:
\[
y(t) = x(t)*h(t)
\]
as shown in figure 3.17. The autocorrelation and cross-correlation functions are shown in figure 3.18. Note that the cross-correlation R_{xy}(τ) denotes R_{xy}(t, t+τ) = E[x(t)y(t+τ)].

(a) Is the process x(t) ergodic given that:
\[
R_x(\tau) = \begin{cases} 1 - 2|\tau| & ;\ -1/2 < \tau < 1/2 \\ 0 & ;\ \text{elsewhere}\end{cases}
\]
Justify your answer.

(b) Calculate the average (total) power of x(t) and of y(t).

(c) Write the expressions of the probability density functions p_{x(1/2)}(α), p_{y(0)}(β).

(d) Calculate the probability of the event described by x(1/2) + 2y(0) > 5.

14 Based on W&J, problem 3.4




Figure 3.17:

Figure 3.18:

3. Consider the random process x(t) = 10 sin(ωt + θ) where ω, θ are two statistically independent random variables with probability density functions:
\[
p_\omega(\alpha) = \frac{1}{2}\delta(\alpha-1) + \frac{1}{2}\delta(\alpha-2),
\qquad
p_\theta(\alpha) = \begin{cases} \frac{1}{2\pi} & ;\ 0 \le \alpha < 2\pi \\ 0 & ;\ \text{elsewhere}\end{cases}
\]
i.e. ω can take the values 1 or 2 with equal probabilities and θ is uniformly distributed between 0 and 2π. Calculate the mean function m_x(t) and the autocorrelation R_x(t_1, t_2) of x(t). Based on your answer, is the process wide-sense stationary?

Hint: If you need to calculate an expected value by integrating with the joint probability density function p_{ω,θ}(α, β), you should carry out the first integration with respect to β; it is a little faster, but it would work either way. The following trigonometric identity may also be useful:
\[
2\sin(u)\sin(v) = \cos(u-v) - \cos(u+v)
\]

4. Let z(t) be a wide-sense stationary random process with mean m_z, autocorrelation function R_z(τ) and power spectral density S_z(f):
\[
m_z = 0, \qquad R_z(\tau) = \frac{AW}{2}\,\mathrm{sinc}^2\!\left(\frac{W\tau}{2}\right),
\]
\[
S_z(f) = \begin{cases} A + \frac{2Af}{W} & ;\ -\frac{W}{2} \le f \le 0 \\[4pt] A - \frac{2Af}{W} & ;\ 0 \le f \le \frac{W}{2} \\[4pt] 0 & ;\ \text{elsewhere}\end{cases}
\]
The power spectral density of z(t) is sketched in figure 3.19. As shown in figure 3.20, z(t) is fed through a linear invariant filter (ideal low-pass filter), the frequency response of which is given by:
\[
H(f) = \begin{cases} 1 & ;\ |f| \le \frac{W}{4} \\ 0 & ;\ \text{elsewhere}\end{cases}
\]
The filter's output is labelled y(t).

(a) Express or sketch on a labelled graph the mean function m_y and the power spectral density S_y(f) of the wide-sense stationary random process y(t) as functions of A and W.
Hint: Do not use the expression for R_z(τ) and do not use any table of Fourier transform pairs.

(b) Express the total average power of y(t) as a function of A and W.

Figure 3.19:

Figure 3.20:

5. Calculate the autocorrelation function R_x(t,s) = E[x(t)x(s)] of the random process:
\[
x(t) = r\sin(2\pi f t + \theta) + y(t)
\]
where

• f ∈ R, f > 0 is not random,

• r is a Rayleigh distributed random variable (b > 0),
\[
p_r(\alpha) = \begin{cases} \frac{2\alpha}{b}e^{-\alpha^2/b} & ;\ \alpha \ge 0 \\ 0 & ;\ \alpha < 0\end{cases}
\]

• θ is a uniformly distributed random variable between 0 and 2π,
\[
p_\theta(\alpha) = \begin{cases} \frac{1}{2\pi} & ;\ 0 \le \alpha < 2\pi \\ 0 & ;\ \text{elsewhere}\end{cases}
\]

• y(t) is a wide sense stationary random process with mean and autocorrelation functions respectively denoted as m_y and R_y(τ),

and y(t), r, θ are statistically independent. Express your answer as a function of t, s, b, R_y(\,), m_y.

Hint: From section §2.10.2 we have \overline{r} = \frac{1}{2}\sqrt{b\pi}, \overline{r^2} = b, \overline{\theta} = \pi, \overline{\theta^2} = 4\pi^2/3 (you need not show this). The following trigonometric identity may be useful:
\[
2\sin(u)\sin(v) = \cos(u-v) - \cos(u+v)
\]
Chapter 4

Optimum Receiver Principles

We detail the operation of a digital communication system and design the receiver that achieves the minimum probability of error in the special case referred to as additive white Gaussian noise (AWGN) channel depicted in figure 4.1. This situation occurs when a very weak received signal is fed into a receiver implemented with analog electrical components. Indeed, assuming that the signal suffers no disturbances during propagation (other than a constant attenuation and constant delay), additive white Gaussian noises are injected into the received signal by the electrical components used to implement the receiver; resistors and active devices (vacuum tubes, transistors, . . . ) introduce white Gaussian noises known respectively as thermal noise and shot noise. The conceptual analysis that we present is independent of the technology used to implement the receiver. Thus the results presented here were valid at the time of writing of W&J and they are still valid now.

We assume known the a priori message probabilities P(m_i), i = 0, ..., M−1, the modulator signal set {s_0(t), s_1(t), ..., s_{M−1}(t)} and the power spectral density S_w(f) = N_0/2 of the white Gaussian noise n_w(t). The source message (as well as the output of the modulator) and the noise process n_w(t) are statistically independent.

4.1 Basic Approach

"[. . . ] replace all transmitted waveforms {s_i(t)}_{i=0}^{M-1} by (finite) N-dimensional vectors."1 This can for example be accomplished by sampling of the signals {s_i(t)}_{i=0}^{M-1} when they have a finite support (i.e. they are 0 outside an interval of finite duration). A better approach is presented in Appendix W&J 4A (Appendix B of these notes) where we see that the replacement of the transmitted

1 W&J p. 212

Figure 4.1:

waveforms by finite dimensional vectors is possible even if the signals do not


have finite support; we then only require that the signals have finite energy,
but in all practical applications they also have finite support. The procedure
detailed in Appendix W&J 4A (Appendix B of these notes) also shows that one
can always assume without loss of generality that the (finite) dimension of the
space spanned by the vectors is less than or equal to M . In §4.3.4, we show that
the replacement of the waveforms by vectors does not affect the optimality of a
receiver.

4.2 Vector Channels

(W&J, pp. 212 - 222)
Up to the paragraphs Theorem of Irrelevance and Theorem of Reversibility (W&J, pp. 220 - 222), this section generalizes and formalizes the definitions and results of the paragraph Communication Example (W&J, pp. 77 - 84) and of problem 2.25 in W&J.

Let S denote a random source producing one of M messages m_0, m_1, ..., m_{M−1}, each with a priori probabilities P(m_0), P(m_1), ..., P(m_{M−1}) such that \sum_{i=0}^{M-1} P(m_i) = 1.

Definition 4.2.1.

1. A vector transmitter is an injective map (1-to-1) from the set of messages {m_i}_{i=0}^{M-1} into a set of vectors (subset of R^N):
\[
s : \{m_i\}_{i=0}^{M-1} \to \mathbb{R}^N, \qquad s : m_i \mapsto \mathbf{s}_i = (s_{i1}, s_{i2}, \ldots, s_{iN})
\]

2. A vector channel is a random transformation that disturbs a transmitted random vector s into a received random vector r. It is described mathematically by the M conditional probability density functions p_r(ρ | s = s_i), i = 0, 1, ..., M−1, ∀ρ ∈ R^N. These conditional probability density functions are also denoted by p_{r|s_i}(ρ), i = 0, 1, ..., M−1.

3. The M mixed-form conditional probability density functions P(m_k | r = ρ), k = 0, 1, 2, ..., M−1, are called a posteriori probabilities.

4. A receiver m̂ is a mapping from the set of all possible received vectors into the set of messages:
\[
\hat{m} : \mathbb{R}^N \to \{m_0, \ldots, m_{M-1}\}, \qquad \hat{m} : \boldsymbol{\rho} \mapsto \hat{m}(\boldsymbol{\rho})
\]

The output of the transmitter is thus a random vector s for which the sample set consists of the M sample vectors s_0, s_1, ..., s_{M−1}, and the probabilities correspond to the a priori message probabilities.

Given a received vector r = ρ, a receiver m̂( ) selects the message m̂(ρ) and makes a correct decision if and only if the message transmitted is m̂(ρ). In terms of the a posteriori probabilities we have:
\[
P(\mathcal{C}\mid \mathbf{r}=\boldsymbol{\rho}) = P(\hat{m}(\boldsymbol{\rho})\mid \mathbf{r}=\boldsymbol{\rho})
\]
It follows that the overall probability of correct decision is (equation (4.7b) in W&J):
\[
P(\mathcal{C}) = \int\!\!\cdots\!\!\int_{-\infty}^{\infty} P(\mathcal{C}\mid \mathbf{r}=\boldsymbol{\rho})\,p_{\mathbf{r}}(\boldsymbol{\rho})\,d\boldsymbol{\rho}
= \int\!\!\cdots\!\!\int_{-\infty}^{\infty} P(\hat{m}(\boldsymbol{\rho})\mid \mathbf{r}=\boldsymbol{\rho})\,p_{\mathbf{r}}(\boldsymbol{\rho})\,d\boldsymbol{\rho} \tag{4.1}
\]
and clearly, different receivers will likely have different P(E) ≜ 1 − P(C). We distinguish the following receivers.
Definition 4.2.2.

1. An optimum receiver m̂_opt( ) is such that:
\[
P(\mathcal{E})\big|_{\hat{m}_{\mathrm{opt}}(\,)} = \min_{\hat{m}(\,)} P(\mathcal{E})
\]

2. A maximum a posteriori receiver m̂_MAP( ) is such that:
\[
\hat{m}_{\mathrm{MAP}}(\boldsymbol{\rho}) = m_k \;\Leftrightarrow\; P(m_k\mid \mathbf{r}=\boldsymbol{\rho}) \ge P(m_j\mid \mathbf{r}=\boldsymbol{\rho}), \quad \forall j = 0, 1, \ldots, M-1.
\]

3. A maximum likelihood receiver m̂_ML( ) is such that:
\[
\hat{m}_{\mathrm{ML}}(\boldsymbol{\rho}) = m_k \;\Leftrightarrow\; p_{\mathbf{r}}(\boldsymbol{\rho}\mid \mathbf{s}=\mathbf{s}_k) \ge p_{\mathbf{r}}(\boldsymbol{\rho}\mid \mathbf{s}=\mathbf{s}_j), \quad \forall j = 0, 1, \ldots, M-1.
\]

4. A minimax receiver m̂_MMX( ) (introduced in problem 2.12) is such that:
\[
\max_{P(m_i)}\Big(P(\mathcal{E})\big|_{\hat{m}_{\mathrm{MMX}}(\,)}\Big) \le \max_{P(m_i)}\Big(P(\mathcal{E})\big|_{\hat{m}(\,)}\Big), \quad \forall\,\hat{m}(\,)
\]

5. The congruent decision region receiver m̂_CDR( ) is defined in §4.5.3 (page 263 in W&J).

Theorem 50.

1. The maximum a posteriori receiver is optimal.

2. The MAP decision rule simplifies to
\[
\hat{m}_{\mathrm{MAP}}(\boldsymbol{\rho}) = m_k \;\Leftrightarrow\; P(m_k)\,p_{\mathbf{r}}(\boldsymbol{\rho}\mid \mathbf{s}=\mathbf{s}_k) \ge P(m_j)\,p_{\mathbf{r}}(\boldsymbol{\rho}\mid \mathbf{s}=\mathbf{s}_j), \quad \forall j = 0, 1, \ldots, M-1.
\]

3. If P(m_i) = 1/M, i = 0, ..., M−1 (equally likely messages) then the maximum likelihood receiver is optimal.

4. If the signal set is completely symmetric (W&J, page 264 and §4.5.3 of these notes) then the maximum likelihood receiver is a congruent decision region receiver and it is also minimax.

Proof. (W&J, page 213 - 214) We prove the first three statements of the theorem only; the fourth statement is proved in §4.5.3.

1. Since p_r(ρ) ≥ 0, P(C) of equation (4.1) is maximized by maximizing P(m̂(ρ) | r = ρ) for each value of ρ. This means:
\[
\hat{m}(\boldsymbol{\rho}) = m_k \;\Leftrightarrow\; P(m_k\mid \mathbf{r}=\boldsymbol{\rho}) \ge P(m_j\mid \mathbf{r}=\boldsymbol{\rho}),
\]
∀j = 0, 1, ..., M−1; this is precisely the MAP rule.

2. Follows from the mixed form of Bayes' rule, as indicated by equations (4.8a), (4.8b), (4.9) in W&J.

3. Follows from (1) and (2) above. The MAP receiver is optimal and its simplified rule can further be simplified when P(m_k) = P(m_j), ∀j, ∀k. This yields the maximum likelihood rule; both MAP and ML receivers are equivalent and optimal in this case.

Remarks. (W&J, page 213)

1. In case of a tie, i.e. many messages yield the same value of P (mk )pr (ρ|s =
sk ), any one of them can be selected as the MAP estimate; the particular
choice will not affect the resulting probability of error.
2. In case of a tie with the ML decision rule, the receiver also selects arbi-
trarily any of the most likely messages. The particular choice will affect
the resulting probability of error unless the messages are all equiprobable.

Definition 4.2.3. (W&J, page 214 - 215)

1. Given a receiver m̂( ), we define
\[
I_k = \{\boldsymbol{\rho}\in\mathbb{R}^N : \hat{m}(\boldsymbol{\rho}) = m_k\},
\]
k = 0, 1, 2, ..., M−1. The sets I_0, ..., I_{M−1}, called decision regions, form a partition of a subset of R^N:
\[
\bigcup_{i=0}^{M-1} I_i \subset \mathbb{R}^N, \qquad
I_i \cap I_j = \begin{cases} \emptyset & ;\ i \ne j \\ I_i = I_j & ;\ i = j\end{cases}
\]

2. Decision regions obtained from a receiver mapping that is optimal are called optimal decision regions.

A receiver only needs to find which region I_k of its decision regions contains r = ρ in order to estimate the transmitted message (similar to a table look-up). The decision regions (figures 4.3, 4.5 in W&J) depend on the receiver m̂( ) only, and are calculated once for all transmissions. Optimal decision regions may be obtained from m̂_MAP( ) and they only depend on the:

• a priori message probabilities,

• vector channel conditional probabilities,

as seen in part 2 of theorem 50.
Definition 4.2.4. (W&J, page 216) An additive Gaussian noise vector channel disturbs the transmitted random vector s according to r = s + n, where n is a zero-mean Gaussian random vector which is statistically independent of s. The vector channel conditional probabilities then reduce to:
\[
p_{\mathbf{r}}(\boldsymbol{\rho}\mid \mathbf{s}=\mathbf{s}_i) = p_{\mathbf{n}}(\boldsymbol{\rho}-\mathbf{s}_i),
\]
i = 0, 1, ..., M−1, where p_n(α) is the joint Gaussian density function of n.

Over an additive Gaussian noise vector channel, the MAP decision rule becomes (equation 4.16 in W&J):
\[
\hat{m}_{\mathrm{MAP}}(\boldsymbol{\rho}) = m_k \;\Leftrightarrow\; P(m_k)\,p_{\mathbf{n}}(\boldsymbol{\rho}-\mathbf{s}_k) \ge P(m_j)\,p_{\mathbf{n}}(\boldsymbol{\rho}-\mathbf{s}_j),\ \forall j. \tag{4.2}
\]
If in addition the N components of n are statistically independent with equal variance σ², then:2
\[
\Lambda_{\mathbf{n}} = \mathrm{diag}[\sigma^2,\sigma^2,\ldots,\sigma^2] = \sigma^2 I_N, \qquad
\Lambda_{\mathbf{n}}^{-1} = \mathrm{diag}[1/\sigma^2,\ldots,1/\sigma^2] = I_N/\sigma^2,
\]
\[
\boldsymbol{\alpha}\,\Lambda_{\mathbf{n}}^{-1}\,\boldsymbol{\alpha}^T = \frac{1}{\sigma^2}\boldsymbol{\alpha}\boldsymbol{\alpha}^T = \frac{1}{\sigma^2}\sum_{i=1}^{N}\alpha_i^2 \triangleq \frac{1}{\sigma^2}|\boldsymbol{\alpha}|^2,
\qquad
p_{\mathbf{n}}(\boldsymbol{\alpha}) = \frac{1}{(2\pi\sigma^2)^{N/2}}\,e^{-|\boldsymbol{\alpha}|^2/(2\sigma^2)} \tag{4.3}
\]

2 I_N denotes the N × N identity matrix.

the latter being equation (4.17c) in W&J. Substituting equation (4.3) into (4.2) results in equation (4.18) in W&J, which simplifies to W&J equation (4.19); the MAP receiver finds the message m_i (equivalently s_i) that minimizes
\[
|\boldsymbol{\rho}-\mathbf{s}_i|^2 - 2\sigma^2\ln P(m_i).
\]
For the ML decision rule we have:
\[
\hat{m}_{\mathrm{ML}}(\boldsymbol{\rho}) = m_k \;\Leftrightarrow\; |\boldsymbol{\rho}-\mathbf{s}_k|^2 \le |\boldsymbol{\rho}-\mathbf{s}_j|^2,\ \forall j.
\]
The decision regions may therefore be calculated once and for all from the:

• a priori message probabilities,

• sample vectors of the transmitted random vector,

• variance of the components of n.

The probability of correct decision is derived from the decision regions, and using equation (4.3) it yields (equations (4.20b), (4.20c) in W&J):
\[
P(\mathcal{C}\mid m_i) = \int\!\!\cdots\!\!\int_{I_i} p_{\mathbf{r}}(\boldsymbol{\rho}\mid \mathbf{s}=\mathbf{s}_i)\,d\boldsymbol{\rho}
= \int\!\!\cdots\!\!\int_{I_i} p_{\mathbf{n}}(\boldsymbol{\rho}-\mathbf{s}_i)\,d\boldsymbol{\rho}
= \frac{1}{(2\pi\sigma^2)^{N/2}}\int\!\!\cdots\!\!\int_{I_i} e^{-|\boldsymbol{\rho}-\mathbf{s}_i|^2/(2\sigma^2)}\,d\boldsymbol{\rho}
\]
\[
P(\mathcal{E}) = 1 - P(\mathcal{C}) = 1 - \sum_{i=0}^{M-1} P(m_i)\,P(\mathcal{C}\mid m_i)
\]
An equivalent form of the above that will be more useful in §4.5.1 is obtained below using the change of variable β = ρ − s_i:
\[
P(\mathcal{C}\mid m_i) = \int\!\!\cdots\!\!\int_{I_i} p_{\mathbf{n}}(\boldsymbol{\rho}-\mathbf{s}_i)\,d\boldsymbol{\rho}
= \int\!\!\cdots\!\!\int_{I_i-\mathbf{s}_i} p_{\mathbf{n}}(\boldsymbol{\beta})\,d\boldsymbol{\beta} \tag{4.4}
\]
where I_i − s_i denotes the region obtained by translating the decision region I_i in such a way that s_i is at the origin. A different technique will be required in §4.5.2.
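The MAP rule above is easy to implement directly; the following sketch (ours, with an arbitrary toy signal set and priors) picks, for each received ρ, the message minimizing |ρ − s_i|² − 2σ² ln P(m_i), and estimates P(E) by Monte Carlo. It also shows that the ML rule (which ignores the priors) cannot do better.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy signal set (M = 4 vectors in R^2), a priori probabilities, noise variance.
s = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
P = np.array([0.4, 0.3, 0.2, 0.1])
sigma2 = 0.5

def map_decide(rho):
    """MAP rule: minimize |rho - s_i|^2 - 2 sigma^2 ln P(m_i)."""
    metric = np.sum((rho - s) ** 2, axis=1) - 2.0 * sigma2 * np.log(P)
    return int(np.argmin(metric))

trials = 200_000
msgs = rng.choice(len(P), size=trials, p=P)
rho = s[msgs] + rng.normal(0.0, np.sqrt(sigma2), size=(trials, 2))   # r = s + n
map_dec = np.array([map_decide(r) for r in rho])
ml_dec = np.argmin(np.sum((rho[:, None, :] - s[None, :, :]) ** 2, axis=2), axis=1)

print("estimated P(E), MAP receiver:", np.mean(map_dec != msgs))
print("estimated P(E), ML receiver :", np.mean(ml_dec != msgs))
```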

Multivector channels (W&J, pp. 219)


Definition 4.2.5. A communication system for which the transmitted random
vector s is applied at the input of two (or more) different channels, and the re-
ceiver observes all outputs, is called a multivector (or diversity) communication
system.

Remark. The channels need not be physically different nor use different medi-
ums. For example, the transmitted vector can be transmitted over different
time intervals with the same channel (time diversity), or using different carrier
frequencies (frequency diversity), or many receiving antennas may be used with
the same transmitter and medium (space diversity).
The analysis of multivector channels is a trivial generalization, since multi-
vectors can be merged together and form super-vectors. A multivector channel
is described mathematically by the M following conditional probability density
functions (equations (4.21), (4.22) in W&J):

pr1 ,r2 (ρ1 , ρ2 |s = si ),

i = 0, 1, 2, . . . , M − 1. The various receivers are implemented in the same way


as before (MAP, maximum likelihood, . . . ). Therefore vector channels include
WLOG the case of multivector channels.

4.3 Waveform Channels


(W&J, pp. 223 - 233)
Consider the general case of a communication channel as depicted in W&J
Figures 1.7 (p. 9) and 4.1 (p. 211). s(t), nw (t), r(t) are random processes
specified by: 3
• s(t) takes any one of the sample functions s0 (t), s1 (t), . . . , sM −1 (t) with a
priori probabilities P (m0 ), P (m1 ), . . . , P (mM −1 ), (specification method
#1),
• nw (t) is a white Gaussian noise process with power spectral density
Sw (f ) = N0 /2 and statistically independent of s(t), (specification method
#2),
• r(t) is given by r(t) = s(t) + nw (t) (specification method #4).
In this section we demonstrate that the receiver implementation for such a
channel reduces to that of an additive Gaussian noise vector channel (as defined
in W&J, pp. 216 - 219 and in §4.2 of these notes) without affecting the minimum
attainable probability of error.

4.3.1 Waveform synthesis


(W&J, pages 223 - 225)
By virtue of the Gram-Schmidt orthonormalization procedure, the transmit-
ter can always be broken down (WLOG) into a vector transmitter followed by
a modulator (synthesizer) as shown in figure 4.2.
The sample functions of s(t) are given by the type of modulation (refer to W&J, bottom of page 224, top of page 225 and figure 4.13 for two frequently encountered examples). The basis functions φ_j(t) and the vectors s_i may on the other hand be calculated from the sample functions by the Gram-Schmidt orthonormalization procedure. The mapping between the m_i's and the s_i's is arbitrary.

3 Refer to definition 3.1.2 and the four general specification methods following the definition.

Figure 4.2:
Definition 4.3.1. (W&J, page 225) The N -dimensional geometric space in
which the M vectors s0 , s1 , . . . , sM −1 are visualized as points (see for example
figures 4.14, 4.A.3, 4.A.4, 4.A.5 in W&J) is called signal space.
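A minimal sketch (ours) of the Gram-Schmidt step mentioned above: starting from sampled versions of the waveforms s_i(t), it produces orthonormal basis functions φ_j(t) and the coordinate vectors s_i, with inner products approximated by Riemann sums. The two example waveforms are arbitrary.

```python
import numpy as np

def gram_schmidt(waveforms, dt):
    """Orthonormalize sampled waveforms; returns (basis, coords) such that
    s_i(t) = sum_j coords[i, j] * basis[j](t) up to numerical error."""
    basis = []
    for s in waveforms:
        residual = np.asarray(s, dtype=float).copy()
        for phi in basis:
            residual -= np.sum(s * phi) * dt * phi        # subtract projections
        energy = np.sum(residual ** 2) * dt
        if energy > 1e-12:                                # keep only independent directions
            basis.append(residual / np.sqrt(energy))
    coords = np.array([[np.sum(s * phi) * dt for phi in basis] for s in waveforms])
    return np.array(basis), coords

fs = 200e3
t = np.arange(0.0, 1e-3, 1.0 / fs)
s0 = 10.0 * np.cos(2 * np.pi * 3e3 * t)
s1 = 10.0 * np.cos(2 * np.pi * 3e3 * t) + 5.0 * np.cos(2 * np.pi * 4.5e3 * t)

basis, coords = gram_schmidt([s0, s1], dt=1.0 / fs)
print("dimension N =", len(basis))
print("signal vectors:\n", coords)
print("Gram matrix (should be ~ identity):\n", np.round(basis @ basis.T / fs, 6))
```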

4.3.2 Recovery of signal vectors

(W&J, pages 227 - 228)
In the absence of noise, the receiver can recover the transmitted vector from the received waveform r(t) = s(t) and the basis {φ_i(t)}_{i=1}^{N}. The block diagram of figure 4.16, page 228 in W&J, performs this. Indeed, suppose the transmitted waveform is s(t) = s_i(t), 0 ≤ i ≤ M−1, which is related to the transmitted vector s_i = (s_{i,1}, ..., s_{i,N}) by:
\[
s_i(t) = \sum_{j=1}^{N} s_{i,j}\,\phi_j(t).
\]
Upon reception of r(t), the receiver recovers:
\[
\int_{-\infty}^{\infty} r(t)\phi_1(t)\,dt = s_{i,1}, \qquad
\int_{-\infty}^{\infty} r(t)\phi_2(t)\,dt = s_{i,2}, \qquad \ldots, \qquad
\int_{-\infty}^{\infty} r(t)\phi_N(t)\,dt = s_{i,N}
\]
since the functions φ_j(t) are orthonormal (equation (4.38) in W&J).
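In discrete time the recovery step is just one inner product per basis function. The sketch below (ours; the basis pair is of the same form as in example 4.4.1 further on, sign aside, and the transmitted vector is arbitrary) shows the noiseless recovery, up to discretization error of the Riemann sums.

```python
import numpy as np

fs = 200e3
t = np.arange(0.0, 1e-3, 1.0 / fs)
dt = 1.0 / fs

# An orthonormal pair on [0, 1 ms].
phi1 = np.sqrt(2000.0) * np.cos(2 * np.pi * 3e3 * t)
phi2 = np.sqrt(2000.0) * np.cos(2 * np.pi * 4.5e3 * t)

s_i = np.array([0.3, -0.7])                 # transmitted vector s_i (arbitrary)
s_t = s_i[0] * phi1 + s_i[1] * phi2         # synthesized waveform s_i(t)

r = s_t                                     # noiseless reception: r(t) = s(t)
recovered = np.array([np.sum(r * phi1) * dt, np.sum(r * phi2) * dt])
print("transmitted s_i:", s_i)
print("recovered      :", recovered)        # equal up to discretization error
```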

4.3.3 Irrelevant Data


(W&J, pages 229-232 and pages 220-222)

In the presence of noise, the previous decomposition does not recover the transmitted vector since the transmission of s(t) implies the reception of r(t) = s(t) + n_w(t) ≠ s(t) and (equations (4.41a), (4.41b) in W&J):
\[
\int_{-\infty}^{\infty} r(t)\phi_1(t)\,dt = r_1 = s_1 + n_1, \qquad
\int_{-\infty}^{\infty} r(t)\phi_2(t)\,dt = r_2 = s_2 + n_2, \qquad \ldots, \qquad
\int_{-\infty}^{\infty} r(t)\phi_N(t)\,dt = r_N = s_N + n_N \tag{4.5}
\]
In the above, n_i = \int_{-\infty}^{\infty} n_w(t)\phi_i(t)\,dt, i = 1, 2, ..., N (equation (4.43b) in W&J), and we define r_1 = (r_1, r_2, ..., r_N), n = (n_1, n_2, ..., n_N), so that r_1 = s + n (equation (4.42) in W&J). The receiver estimates the transmitted message by maximizing the a posteriori probability from the knowledge of r_1. This will easily be done since r_1 = s + n and

• n is a Gaussian random vector with mean vector and covariance matrix respectively given by (refer to the proof of theorem 64):
\[
m_{\mathbf{n}} = \mathbf{0}, \qquad \Lambda_{\mathbf{n}} = \frac{N_0}{2}I_N
\]

• n is independent of s: this follows from n_w(t) being independent of s(t).


We recognize, in the preceding description, the additive Gaussian noise vector channel of definition 4.2.4 in which the N components of n are 0-mean and statistically independent with equal variance N_0/2. A block diagram of this receiver is presented in figure 4.3. However it is not clear that this is the best way to estimate the transmitted message. The reason is that r_1 does not contain all the information of r(t) since (W&J, equation 4.45a):
\[
r(t) - \sum_{j=1}^{N} r_j\,\phi_j(t) = r_2(t) \ne 0.
\]
Does a MAP receiver based on r_1 yield the lowest probability of error? Is a MAP receiver based on r_1 alone still optimal? The answer is yes and the proof that the receiver is still optimal is fundamental in communication and signal detection theory. Assuming the validity of the statement for now, we then see that the general waveform channel communication system can be reduced to a vector channel communication system, as illustrated in figure 4.17, page 233 in W&J. This does not affect the performance nor the generality of the communication system of figure 4.1 in W&J. The design of the optimal receiver and the calculation of its probability of error has been detailed in §4.2.

Figure 4.3:

4.3.4 Theorem of Irrelevance

(W&J, pages 220 - 222)
Consider a vector channel in which two inputs r_1 and r_2 are available at the receiver (as in figure 4.7 in W&J). A MAP receiver may disregard r_2 (i.e. the decision may be based on r_1 alone) and still achieve optimal performance if and only if r_2, when conditioned on r_1, is statistically independent of s:
\[
p_{\mathbf{r}_2\mid \mathbf{r}_1,\mathbf{s}} = p_{\mathbf{r}_2\mid \mathbf{r}_1}.
\]
Proof. The MAP decision rule is:
\[
\hat{m} = m_k \;\Leftrightarrow\; P(m_i)\,p_{\mathbf{r}_1}(\boldsymbol{\rho}_1\mid \mathbf{s}=\mathbf{s}_i)\,p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i)\ \text{is maximized when } i = k.
\]
The value r_2 = ρ_2 can be disregarded when making the decision if and only if p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i), the only factor involving r_2, takes the same value for all s_i:
\[
p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) = p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1)
\]
∀i, ∀ρ_1, ∀ρ_2. In concrete words, this says that r_2 is statistically independent of s when conditioned on r_1.
Remarks (W&J, page 220).

1. r_2 is irrelevant if and only if p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) is constant with respect to i.

2. A sufficient (but not necessary) condition is that
\[
p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) = p_{\mathbf{r}_2}(\boldsymbol{\rho}_2)
\]
for every ρ_1, ρ_2, i.

3. Two possibly useful factorizations when applying the theorem of irrelevance:
\[
p_{\mathbf{r}_2\mid \mathbf{r}_1,\mathbf{s}} = \frac{p_{\mathbf{r}_2,\mathbf{r}_1\mid \mathbf{s}}}{p_{\mathbf{r}_1\mid \mathbf{s}}} = \frac{p_{\mathbf{r}_1\mid \mathbf{r}_2,\mathbf{s}}\;p_{\mathbf{r}_2\mid \mathbf{s}}}{p_{\mathbf{r}_1\mid \mathbf{s}}}
\]

Example 4.3.1 (W&J, pages 220, 221). Consider the diagram of figure 4.8 in W&J. In this example we have:
\[
\mathbf{r}_1 = \mathbf{s}+\mathbf{n}_1, \qquad \mathbf{r}_2 = \mathbf{n}_2
\]
and s, n_1, n_2 are all statistically independent random vectors. We immediately obtain:
\[
p_{\mathbf{r}_2\mid \mathbf{r}_1,\mathbf{s}} = p_{\mathbf{n}_2}
\]
and clearly r_2 = n_2 is irrelevant.

Example 4.3.2 (W&J, page 221). Consider the diagram of figure 4.9 in W&J. It is assumed that s, n_1, n_2 are statistically independent. Then n_2 = r_2 − r_1 and it follows that:
\[
p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) = p_{\mathbf{n}_2}(\boldsymbol{\rho}_2-\boldsymbol{\rho}_1\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) = p_{\mathbf{n}_2}(\boldsymbol{\rho}_2-\boldsymbol{\rho}_1)
\]
since n_2 is statistically independent of s, n_1 and consequently of r_1 = s + n_1. Clearly this is the same for all i and r_2 is consequently irrelevant.

Example 4.3.3. Consider the diagram of figure 4.10 in W&J. It is assumed that n_2 is statistically independent of both n_1, s and consequently of r_1 = s + n_1. Since n_2 = r_2 − n_1 = r_2 − (r_1 − s) = r_2 − r_1 + s we obtain:
\[
p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) = p_{\mathbf{n}_2}(\boldsymbol{\rho}_2-\boldsymbol{\rho}_1+\mathbf{s}_i\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) = p_{\mathbf{n}_2}(\boldsymbol{\rho}_2-\boldsymbol{\rho}_1+\mathbf{s}_i)
\]
This is in general not the same value for all i and r_2 is consequently not irrelevant.
Corollary (Theorem of reversibility). (W&J, page 222) The minimum attainable probability of error is not affected by the introduction of a reversible operation at the output of a channel (or front end of a receiver).

Figure 4.4:

Proof. Refer to figure 4.11(a) in W&J and suppose G is reversible. Then r_2 = G^{-1}(r_1) and "given that r_1 = ρ_1" becomes "given that r_2 = G^{-1}(ρ_1)". Therefore:
\[
p_{\mathbf{r}_2}(\boldsymbol{\rho}_2\mid \mathbf{r}_1=\boldsymbol{\rho}_1,\mathbf{s}=\mathbf{s}_i) = \text{``}\delta\big(\boldsymbol{\rho}_2 - G^{-1}(\boldsymbol{\rho}_1)\big)\text{''},
\]
and r_2 is irrelevant since the above is clearly the same value for all i.

Irrelevant Data continued . . . (pages 230 - 232 in W&J) We now return to the reduction of the additive white Gaussian noise waveform channel into an additive Gaussian noise vector channel. The receiver under review is presented in figure 4.4, and we want to prove that a MAP receiver can base its decision on r_1 and disregard r_2(t). We recall that (W&J, equations (4.42), (4.44c), (4.44a)):
\[
\mathbf{r}_1 = (r_1, r_2, \ldots, r_N) = \mathbf{s}+\mathbf{n}
\]
\[
r_j = \int_{-\infty}^{\infty} r(t)\phi_j(t)\,dt = s_j + n_j, \quad j = 1, \ldots, N
\]
\[
n(t) = \sum_{j=1}^{N} n_j\,\phi_j(t), \qquad
r_1(t) = \sum_{j=1}^{N} r_j\,\phi_j(t) = s(t) + n(t), \qquad
r_2(t) = r(t) - r_1(t)
\]

where r(t) = s(t) + n_w(t) is the input to the receiver (received waveform). The above are described as follows:

• r(t) ≡ received waveform,

• r_1(t) ≡ component of r(t) which is contained in the vector space spanned by {φ_j(t)}_{j=1}^{N},

• n(t) ≡ component of n_w(t) which is contained in the vector space spanned by {φ_j(t)}_{j=1}^{N},

• r_2(t) ≡ component of r(t) which is not contained in the vector space spanned by {φ_j(t)}_{j=1}^{N}.

We first notice that r_2(t) is the result of linear operations performed on n_w(t) (W&J, equation (4.45b); also refer to figure 4.4):
\[
r_2(t) = r(t) - r_1(t) = \big(s(t)+n_w(t)\big) - \big(s(t)+n(t)\big) = n_w(t) - n(t)
\]
r_2(t) is consequently a Gaussian random process which is statistically independent of s(t). By theorem 64 in Appendix B, r_2(t) and n(t) are statistically independent 0-mean jointly Gaussian random processes. Let (W&J, equation (4.46))
\[
\mathbf{r}_2 = \big(r_2(t_1), \ldots, r_2(t_q)\big)
\]
be the random vector obtained through sampling of r_2(t). r_2 is a Gaussian random vector statistically independent of both s and n and the situation is similar to that depicted in figure 4.8 in W&J; r_2 is therefore irrelevant for any t_1, t_2, ..., t_q and for any q. It follows that r_2(t) is irrelevant and an optimal MAP receiver may base its decision solely upon r_1. The following summarizes all of section 4.3 in W&J.
Theorem 51. Consider an additive white Gaussian noise waveform channel as depicted in figure 4.1 in W&J with

• m ∈ {m_i}_{i=0}^{M-1} "with a priori probabilities" P(m_i), i = 0, 1, ..., M−1,

• s(t) ∈ {s_j(t)}_{j=0}^{M-1} "specified by" the a priori probabilities,

• n_w(t) "specified by" m_w(t) = m_w = 0, S_w(f) = N_0/2,

and let φ_1(t), ..., φ_N(t) be an orthonormal basis for the set {s_j(t)}_{j=0}^{M-1}. Define the random vectors n = (n_1, n_2, ..., n_N), s = (s_1, s_2, ..., s_N), r = (r_1, r_2, ..., r_N) by:
\[
n_i = \int_{-\infty}^{\infty} n_w(t)\phi_i(t)\,dt, \qquad
s_i = \int_{-\infty}^{\infty} s(t)\phi_i(t)\,dt, \qquad
\mathbf{r} = \mathbf{s}+\mathbf{n}
\]
Then:

1. s takes any one of the M vectors s_0, s_1, ..., s_{M−1} with a priori probabilities P(m_0), ..., P(m_{M−1}):
\[
s_{ij} = \int_{-\infty}^{\infty} s_i(t)\phi_j(t)\,dt
\]
for i = 0, 1, ..., M−1 and j = 1, 2, ..., N.

2. n is a Gaussian random vector such that:
\[
m_{\mathbf{n}} = (0, 0, \ldots, 0), \qquad \Lambda_{\mathbf{n}} = \mathrm{diag}[N_0/2, N_0/2, \ldots, N_0/2]
\]

3. The waveform channel reduces to the additive Gaussian noise vector channel of figure 4.4 in W&J. The waveform transmitter may be broken down into a vector transmitter followed by a modulator as depicted in figure 4.17 in W&J. The optimal waveform receiver may be broken down into a detector followed by a MAP vector receiver as depicted in figures 4.16, 4.17 in W&J.

4. The decision regions of the MAP vector receiver are found using W&J equation (4.19) in which σ² = N_0/2.

5. The probability of error is calculated with equations W&J (4.20a), (4.20b), (4.20c).

Remark. (W&J, pp 232 - 233) The performance P (E ) does not depend on the
choice of the orthonormal basis.

4.4 Receiver Implementation

(W&J, pages 233 - 245)
From theorem 51 we see that the optimum receiver performs two tasks:

1. it calculates the relevant received vector4 r = (r_1, r_2, ..., r_N) by (W&J, equations (4.50a), (4.50b)):
\[
r_j = \int_{-\infty}^{\infty} r(t)\phi_j(t)\,dt,
\]
j = 1, 2, ..., N;

2. it applies the MAP decision rule (W&J, equation (4.51)): m̂ = m_k if |r − s_i|² − N_0 ln(P(m_i)) is minimum for i = k.

4 We drop the subscript 1 from the relevant vector r_1 since it is the only received vector; r_2 is irrelevant and will no longer be used or mentioned.

The implementation of this decision rule can be simplified by noticing that
\[
|\mathbf{r}-\mathbf{s}_i|^2 = |\mathbf{r}|^2 - 2\langle \mathbf{r},\mathbf{s}_i\rangle + |\mathbf{s}_i|^2
\]
The minimization of the term in equation W&J (4.51) then becomes equivalent to the minimization of
\[
|\mathbf{r}|^2 - 2\langle \mathbf{r},\mathbf{s}_i\rangle - \big(N_0\ln(P(m_i)) - |\mathbf{s}_i|^2\big)
\]
By changing the sign, dividing by 2 and noticing that the term |r|² is the same for all i we obtain the equivalent rule (W&J equations (4.53a), (4.53b)):
\[
\hat{m} = m_k \;\Leftrightarrow\; \langle \mathbf{r},\mathbf{s}_i\rangle + c_i\ \text{is maximum for } i = k, \tag{4.6}
\]
where
\[
c_i = \frac{1}{2}\big(N_0\ln(P(m_i)) - |\mathbf{s}_i|^2\big), \tag{4.7}
\]
for i = 0, 1, ..., M−1.
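The equivalence of equations (4.6)-(4.7) with the minimum-distance form of the MAP rule is easy to verify numerically; the sketch below (ours, arbitrary toy data) shows that both metrics select the same message for any received point.

```python
import numpy as np

rng = np.random.default_rng(5)

N0 = 1.0
s = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])   # arbitrary signal set
P = np.array([0.4, 0.3, 0.2, 0.1])
c = 0.5 * (N0 * np.log(P) - np.sum(s ** 2, axis=1))                # bias terms (eq. 4.7)

def decide_distance(rho):
    # MAP rule of W&J eq. (4.19) with sigma^2 = N0/2
    return int(np.argmin(np.sum((rho - s) ** 2, axis=1) - N0 * np.log(P)))

def decide_correlation(rho):
    # Equivalent rule of eq. (4.6): maximize <rho, s_i> + c_i
    return int(np.argmax(s @ rho + c))

for _ in range(10_000):
    rho = rng.normal(0.0, 2.0, size=2)
    assert decide_distance(rho) == decide_correlation(rho)
print("the two decision rules agree on all tested points")
```

Only the correlation form is retained in the receiver structures below, because it needs no squaring devices.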
Remarks.

1. The above equivalent decision rule is not as easy to visualize as the decision
rule of equation (4.19) in W&J, but it leads to simpler implementations
since it does not require any squaring devices.

2. (W&J, pages 235 - 236) ”The bias terms ci , i = 0, 1, . . . , M − 1 represent


a priori data available to the receiver independently of the received signal
r(t).” They only need to be computed once and for all; squaring devices
and log amplifiers are not required in the final implementation.

Definition 4.4.1.

1. (W&J, page 234) Implementation of the (optimum) MAP receiver with


the equivalent simplified decision rule takes the form of figure 4.18, page
235 in W&J. This is called a correlation receiver.

2. (W&J page 235) Let ϕ(t) be identically zero outside some finite time
interval 0 ≤ t ≤ T . A linear invariant causal filter of impulse response
h(t) = ϕ(T − t) is said to be matched to ϕ(t).

As shown on page 235 in W&J, "if each member of the orthonormal basis {φ_j(t)}_{j=1}^{N} is identically zero outside some finite time interval, say 0 ≤ t ≤ T, the outputs of filters matched to the φ_j(t) are waveforms u_j(t) that go through the desired values r_j when their input is r(t). Then the multiplications and integrations can easily be implemented by matched filters and sample-and-holds."

Definition 4.4.2. (W&J, page 235) An implementation of the MAP correlation


receiver with filters matched to the functions ϕj (t) takes the form of figure 4.19
in W&J; this is called a matched filter receiver.

Example 4.4.1. (pp. 236 - 237 in W&J) We next illustrate on a special case that the values \int_{-\infty}^{\infty} r(t)\phi_j(t)\,dt = r_j can be obtained by sampling the output of matched filters when φ_j(t) = 0, ∀t ∉ [0, T]. Let
\[
\phi_1(t) = \begin{cases} -\sqrt{2000}\,\cos(2\pi(3\ \mathrm{kHz})t) & ;\ 0 \le t \le 1\ \mathrm{ms} \\ 0 & ;\ \text{elsewhere}\end{cases}
\qquad
\phi_2(t) = \begin{cases} -\sqrt{2000}\,\cos(2\pi(4.5\ \mathrm{kHz})t) & ;\ 0 \le t \le 1\ \mathrm{ms} \\ 0 & ;\ \text{elsewhere}\end{cases}
\]
The functions φ_1(t) and φ_2(t) are plotted on figure 4.5 and one easily verifies that φ_1(t) and φ_2(t) are orthonormal.

Figure 4.5: The orthonormal functions φ_1(t) and φ_2(t), 0 ≤ t ≤ 1 ms (amplitude in volts).

Suppose the signal r(t) sketched on figure 4.6 is received; this is a sample function of the random process described by:

\[
r(t) = \underbrace{-10\cos(2\pi(3\ \mathrm{kHz})t)}_{\phi_1(t)/\sqrt{20}\;=\;s_1(t)} + n_w(t), \qquad 0 \le t \le 1\ \mathrm{ms}
\]
where n_w(t) is a 0-mean white (over the frequency range −100 kHz ≤ f ≤ 100 kHz, since the sampling frequency is 200 kHz) Gaussian noise with power spectral density:
\[
S_w(f) = \begin{cases} \frac{N_0}{2} = 0.125\times 10^{-3} & ;\ -100\ \mathrm{kHz} \le f \le 100\ \mathrm{kHz} \\ 0 & ;\ \text{elsewhere}\end{cases}
\]
A correlation receiver would return:

Figure 4.6: The received signal r(t), 0 ≤ t ≤ 1 ms (amplitude in volts).

\[
\int_{-\infty}^{\infty} r(t)\phi_1(t)\,dt = r_1 = 0.215392, \qquad
\int_{-\infty}^{\infty} r(t)\phi_2(t)\,dt = r_2 = -0.000791
\]
which is a point in the 2-dimensional plane that is very close to the noise-free coordinates of φ_1(t)/√20, as seen in figure 4.7.

Figure 4.7:

Equivalently, r_1, r_2 are obtained by sampling the outputs of filters h_1(t) and h_2(t) matched to φ_1(t) and φ_2(t):
\[
h_1(t) = \phi_1(1\ \mathrm{ms} - t), \qquad h_2(t) = \phi_2(1\ \mathrm{ms} - t)
\]
at t = 1 ms. The outputs of the filters fed by r(t) are sketched on figure 4.8; they are respectively equal to 0.215392 and −0.000791 at t = 1 ms, precisely the values of r_1 and r_2 indicated above.
Figure 4.8: Outputs of the filters matched to φ_1(t) and φ_2(t), 0 ≤ t ≤ 2 ms (amplitude in volts).
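This example is easy to reproduce in discrete time. The sketch below (ours) builds φ_1, φ_2, forms a noisy r(t) with the stated N_0/2 at a 200 kHz sampling rate, and checks that sampling the matched-filter outputs at t = 1 ms gives exactly the same numbers as the direct correlations; the specific values will of course differ from those quoted above, since they depend on the noise realization.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(6)

fs = 200e3                                   # sampling rate (noise white over |f| <= 100 kHz)
dt = 1.0 / fs
t = np.arange(0.0, 1e-3, dt)

phi1 = -np.sqrt(2000.0) * np.cos(2 * np.pi * 3e3 * t)
phi2 = -np.sqrt(2000.0) * np.cos(2 * np.pi * 4.5e3 * t)

N0_over_2 = 0.125e-3
s1 = phi1 / np.sqrt(20.0)                            # = -10 cos(2 pi 3 kHz t)
nw = rng.normal(0.0, np.sqrt(N0_over_2 * fs), len(t))  # sampled white noise of PSD N0/2
r = s1 + nw

# Correlation receiver: r_j = integral of r(t) phi_j(t) dt
r1 = np.sum(r * phi1) * dt
r2 = np.sum(r * phi2) * dt

# Matched filters h_j(t) = phi_j(1 ms - t), outputs sampled at t = 1 ms
u1 = signal.lfilter(phi1[::-1], 1.0, r) * dt
u2 = signal.lfilter(phi2[::-1], 1.0, r) * dt
print("correlators   :", r1, r2)
print("matched filter:", u1[-1], u2[-1])             # identical values
```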

Remark. The signal-to-noise ratio (definition 4.5.1 on page 169) for the case illustrated above is:
\[
\mathrm{SNR} = \frac{(1/\sqrt{20})^2}{N_0} = \frac{1/20}{2\times 0.125\times 10^{-3}} = 200 \;\rightarrow\; 23\ \mathrm{dB}
\]
The signal r(t) is extremely clean, contrary to what one might think by looking at figure 4.6. Refer to figure 4.9 for a more realistic illustration in which SNR = 5 dB.

Consequences of Parseval’s relationships


1. (W&J, page 238) The bias term c_i in equation (4.53b) in W&J can be rewritten as:
\[
c_i = \frac{1}{2}\big(N_0\ln(P(m_i)) - E_i\big)
\]
where E_i = |\mathbf{s}_i|^2 = \int_{-\infty}^{\infty} s_i^2(t)\,dt is the energy of the i-th waveform s_i(t).
Figure 4.9: The received signal r(t) when SNR = 5 dB (amplitude in volts).

2. The terms ⟨r, s_i⟩ + c_i of equation (4.53a) in W&J can directly be calculated from the waveform r(t) without first projecting on the basis φ_j(t) since, by theorem 63, ⟨r, s_i⟩ = \int_{-\infty}^{\infty} r(t)s_i(t)\,dt. This leads to a correlation receiver (not in W&J) which simplifies to the matched filter receiver of figure 4.21 when all the waveforms {s_i(t)}_{i=0}^{M-1} are identically 0 outside a finite interval 0 ≤ t ≤ T. As remarked on page 239 in W&J, this is not necessarily a simplification over the matched filter receiver of figure 4.19 (in W&J).
The next theorem illustrates from another point of view the optimality of the matched filter receiver. The following lemma is needed to prove the theorem.

Lemma 52 (Schwarz inequality). (W&J, page 240) For any pair of finite energy waveforms a(t), b(t) we have (W&J, equation (4.64)):
\[
\left(\int_{-\infty}^{\infty} a(t)b(t)\,dt\right)^2 \le \left(\int_{-\infty}^{\infty} a^2(t)\,dt\right)\left(\int_{-\infty}^{\infty} b^2(t)\,dt\right)
\]
(this may equivalently be written as ⟨a(t), b(t)⟩² ≤ |a(t)|²|b(t)|²). In addition, equality is satisfied iff b(t) = c·a(t) for some constant c ∈ R, c ≠ 0.

Sketch of proof. (W&J, page 240 - 241) We use the following:

1. construct an orthonormal basis {ψ_1(t), ψ_2(t)} such that:
\[
a(t) = a_1\psi_1(t) + a_2\psi_2(t), \qquad b(t) = b_1\psi_1(t) + b_2\psi_2(t)
\]
and define a = (a_1, a_2), b = (b_1, b_2);

2. ⟨a, b⟩ = |a||b| cos θ (refer to figure 4.23 in W&J);

3. Parseval's relationships.

The result follows immediately from the above, noticing that cos²θ ≤ 1. In addition we observe that equality is satisfied iff θ = 0 or θ = π, i.e. iff b(t) = c·a(t) for some non-zero constant c ∈ R.

In the diagram of figure 4.10, φ(t) denotes a deterministic signal which is identically zero outside a finite interval 0 ≤ t ≤ T and which may or may not be present in r(t). n_w(t) is a stationary white Gaussian process with power spectral density S_w(f) = N_0/2 and h(t) is the (linear invariant) filter's impulse response. We wish to determine whether or not φ(t) is present in r(t), or in other words, to detect it when present and buried in the noise. The filter's output y(t) is clearly a random process that is sampled at time t = T, yielding the random variable r ≜ y(T).5 We will show that
\[
\overline{r} \triangleq E[r]\ \text{depends only on } \phi(t), h(t), \qquad
\mathrm{Var}(r)\ \text{depends only on } n_w(t), h(t).
\]
We will then design the filter (i.e. find the impulse response h(t)) to maximize the ratio \overline{r}^2/\mathrm{Var}(r), so that we can decide, based on r, with as little randomness as possible whether or not φ(t) is present in the received signal r(t).

Figure 4.10:

Theorem 53 (Optimal linear filtering). (W&J, page 240) The ratio S/N defined by (W&J, equation (4.62)):
\[
\frac{S}{N} = \frac{\overline{r}^2}{\overline{n^2}}
\]
is maximized when h(t) is matched to φ(t).

Proof. (W&J, pages 241 - 242) Clearly
\[
y(t) = r(t)*h(t) = \big(\phi(t)+n_w(t)\big)*h(t)
\]

5 Not to be confused with the random process r(t).

and
\[
r = y(T) = \int_{-\infty}^{\infty}\big(\phi(T-\alpha)+n_w(T-\alpha)\big)h(\alpha)\,d\alpha
= \underbrace{\int_{-\infty}^{\infty}\phi(T-\alpha)h(\alpha)\,d\alpha}_{\text{term \#1}}
+ \underbrace{\int_{-\infty}^{\infty} n_w(T-\alpha)h(\alpha)\,d\alpha}_{\text{term \#2}}
\]
We first show that term #1 ≡ \overline{r}:
\[
\overline{r} = E[r] = E\big[r(t)*h(t)\big]\Big|_{t=T}
= E\!\left[\int_{-\infty}^{\infty}\phi(T-\alpha)h(\alpha)\,d\alpha + \int_{-\infty}^{\infty} n_w(T-\alpha)h(\alpha)\,d\alpha\right]
= \int_{-\infty}^{\infty}\phi(T-\alpha)h(\alpha)\,d\alpha
\]
It follows that term #2 ≡ n and \overline{n} = 0. Then:
\[
\overline{n^2} = E\!\left[\iint_{-\infty}^{\infty} n_w(T-\alpha)n_w(T-\beta)h(\alpha)h(\beta)\,d\alpha\,d\beta\right]
= \iint_{-\infty}^{\infty} R_w(\beta-\alpha)h(\alpha)h(\beta)\,d\alpha\,d\beta
= \iint_{-\infty}^{\infty}\frac{N_0}{2}\delta(\beta-\alpha)h(\alpha)h(\beta)\,d\alpha\,d\beta
= \frac{N_0}{2}\int_{-\infty}^{\infty} h^2(\beta)\,d\beta
\]
Therefore:
\[
\frac{S}{N} = \frac{\left(\int_{-\infty}^{\infty}\phi(T-\alpha)h(\alpha)\,d\alpha\right)^2}{\frac{N_0}{2}\int_{-\infty}^{\infty} h^2(\beta)\,d\beta}
= \frac{\langle\phi(T-t),h(t)\rangle^2}{\frac{N_0}{2}|h(t)|^2}
\le \frac{|\phi(T-t)|^2\,|h(t)|^2}{\frac{N_0}{2}|h(t)|^2}
= \frac{|\phi(T-t)|^2}{N_0/2}
\]
and equality is obtained when h(t) ∝ φ(T − t), i.e. when h(t) is matched to φ(t).
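A quick numerical check of theorem 53 (our sketch; the pulse φ and the mismatched filters are arbitrary): the ratio \overline{r}^2/\overline{n^2} is computed for the matched filter and for a few other choices of h, and never exceeds the bound |φ(T−t)|²/(N_0/2).

```python
import numpy as np

dt = 1e-4
T = 1.0
t = np.arange(0.0, T, dt)
N0 = 2.0

phi = np.where(t < 0.5, 1.0, -0.5)            # an arbitrary finite-energy pulse on [0, T]

def snr(h):
    num = (np.sum(phi[::-1] * h) * dt) ** 2   # ( integral of phi(T - a) h(a) da )^2
    den = (N0 / 2.0) * np.sum(h ** 2) * dt    # (N0/2) integral of h^2
    return num / den

bound = np.sum(phi ** 2) * dt / (N0 / 2.0)    # |phi|^2 / (N0/2)

candidates = {
    "matched h = phi(T-t)": phi[::-1],
    "rectangular h": np.ones_like(t),
    "ramp h": t,
    "random h": np.random.default_rng(7).normal(size=len(t)),
}
for name, h in candidates.items():
    print(f"{name:22s}  S/N = {snr(h):8.4f}   (bound = {bound:.4f})")
```

Only the matched choice attains the bound, as the proof predicts.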
Definition 4.4.3. The ratio \overline{r}^2/\overline{n^2} is called the signal-to-noise ratio.6
6 See also definitions 4.5.1 and 4.5.2.

The previous theorem shows that the j-th matched filter of the receiver in
figure 4.19 in W&J extracts from r(t) the j-th component of the (transmitted)
random vector s with the best possible signal-to-noise ratio. This is another
point of view on the optimality of the matched filter receiver.
Remarks.

1. W&J provides some interesting intuitive explanations on the optimality


of the matched filter on pages 242 - 243.

2. Component accuracy: the previous analysis assumes that the receiver is perfectly synchronized with the arrival of the waveform r(t) and that its orthonormal basis {φ_j(t)}_{j=1}^{N} is exactly the same as that of the transmitter. In practice there is a non-zero probability that the synchronization is lost. As well, the vector space spanned by the orthonormal basis of the receiver is not exactly the same as that of the transmitter.

4.5 Probability of Error


W&J, pages 245 - 266
In order to evaluate the probability of error for some specific signal set
constellations, the decision regions are first determined using the MAP rule of
equation (4.19) in W&J. The choice of equation (4.19) over the equivalent rule of
equations (4.53a), (4.53b) also in W&J, is motivated by the former being more
easily visualized; the latter equivalent rule was derived in order to simplify the
structure of the receiver.

Theorem 54. (W&J, pages 246 - 248) Consider a communication system over the additive Gaussian noise vector channel satisfying equation (4.17c) in W&J with signal vectors S = {s_0, s_1, ..., s_{M−1}} and vector transmitter mapping m_i ↦ s_i, i = 0, 1, ..., M−1. Let P(E) denote the probability of error obtained by using a MAP receiver, and define E_m = \sum_{i=0}^{M-1} P(m_i)|\mathbf{s}_i|^2 = \overline{|\mathbf{s}|^2} ≡ average energy per message.

1. If S′ = {s′_0, ..., s′_{M−1}} is a signal vector set obtained by rotation and translation of S in the signal space, and if S′ is used in the communication system in lieu of S with the vector transmitter mapping m_i ↦ s′_i, the resulting probability of error P(E)′ obtained with a (new) MAP receiver is the same as before, i.e. P(E)′ = P(E).

2. Let a ∈ R^N and S′ = {s′_i = s_i − a : i = 0, 1, ..., M−1}. E′_m defined by:
\[
E'_m = \sum_{i=0}^{M-1} P(m_i)|\mathbf{s}'_i|^2
\]
is minimized by choosing a = \overline{\mathbf{s}} = \sum_{i=0}^{M-1} P(m_i)\mathbf{s}_i.

Proof.

1. (W&J, top of page 247) The formal proof is tedious. We only need to
notice that equations (4.19), (4.20b) in W&J only depend on the distances
between a given r = ρ and the si , and that these distances are not affected
by rotation and translation of both r = ρ and the set of all si . Therefore
changing S to S ′ requires a change of the decision regions Ii to Ii′ , i =
0, 1, . . . , M − 1, but equation (4.20b) (with the new Ii′ ) will yield the
same answer as before. It follows that (4.20c) also remains unchanged:
P (E ′ ) = P (E ).

2. Follows from “the moment of inertia is minimum when taken around the
centroid (center of gravity) of a system.” (W&J, page 248)

Note (Signal-to-noise ratio). We distinguish two definitions of the signal-to-noise


ratio in two different contexts.

Definition 4.5.1. Consider a communication system using the waveforms


−1
{sj (t)}M
j=0 over an additive white Gaussian noise waveform channel, and let
Sw (f ) = N0 /2 denote the power spectral density of the noise. The signal-to-
noise ratio on the channel (SN R) is defined by:
∫∞
−∞
s2 (t)dt Em |s|2
SN R = = =
N0 N0 N0

Definition 4.5.2. Consider a linear time invariant filter used to detect a signal
7
ϕ(t) as in figure 4.22,
√ page 240, W&J. The signal-to-noise ratio of the estimate
is defined as E[r]/ V ar(r). When expressed in units of power, the definition
becomes:
r2
SN R =
V ar(r)
and in the context of figure 4.22 (W&J), this simplifies to r2 /n2 .

In the context of section 4.5 in W&J (more generally in communication theory),


definition 4.5.1 is used. In the context of figure 4.22 in W&J (more generally in
detection/estimation theory), definition 4.5.2 is used.

4.5.1 Rectangular Signal Sets


(W&J, pages 248 - 257)

Theorem 55. (binary signals) Let s0 , s1 be two signal vectors in a signal space
with (orthonormal basis) {ϕ1 (t), ϕ2 (t)}, and let d = |s0 − s1 |.
7 Refer to page 512 in M.B. Priestley, Spectral Analysis and Time Series, vol. 1, Academic

Press, 1983, ISBN: 0125649223.


170 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

1. If the messages are equally likely,


( ) ( )
d/2 d
P (E ) = Q √ =Q √
N0 /2 2N0

s0 = (d/2, 0), s1 = −s0 = (−d/2, 0) then Es = |si |2 =


(a) (antipodal) If √
d /4 ⇒ d = 2 Es . Then (equation (4.77) in W&J):
2

(√ )
Es
P (E ) = Q 2
N0

(b) (orthogonal, equal energy) If ⟨s0 , s1 ⟩ = 0, Es = |s0 |2 = |s1 |2 then


(equation (4.78) in W&J):
(√ )
Es
P (E ) = Q
N0

2. If the messages are unequally likely, then (equations (4.79b), (4.79a) in


W&J):
( ) ( )
d − 2∆ d + 2∆
P (E ) = P (m0 )Q √ + P (m1 )Q √
2N0 2N0
( )
where ∆ = N P (m1 )
2d ln P (m0 ) .
0

Proof. Refer to W&J, pages 248 - 251. Notice that the probability of error is
more easily calculated, in the present situation, with the formula of equation
(4.4). In the case of unequally likely messages, the expression of ∆ is easily de-
rived from equations (4.6) and (4.7) based on the decision rule of the correlation
receiver.
The probability of error of equally likely binary antipodal and binary or-
thogonal signal sets is plotted as a function of the signal to noise ratio Es /N0
on figure 4.11.
Remarks.
1. Antipodal signalling is more energy efficient than orthogonal signalling.
2. Let  √

 2Es
0≤t≤T
T sin(2πfc t + iπ) ;
si (t) =


0 ; elsewhere
for i = 0, 1 and fc T ∈ N. The modulation scheme is called phase reversal
keying (PRK) and is a special case of phase shift keying (PSK). It is
a form of antipodal signalling since s1 (t) = −s0 (t) (exercise: find an
orthonormal basis and the corresponding vectors s0 , s1 ).
4.5. PROBABILITY OF ERROR 171

Figure 4.11:

3. Let:
 √
 2000Es cos(2π(3 kHz)t) ; 0 ≤ t ≤ 1ms
s0 (t) =

0 ; elsewhere
 √
 2000Es cos(2π(4.5 kHz)t) ; 0 ≤ t ≤ 1ms
s1 (t) =

0 ; elsewhere
This modulation scheme is called frequency shift keying (FSK) and is a
form of orthogonal signalling (exercise: find an orthonormal basis and
the corresponding vectors s0 , s1 ).
4. Let
s0 (t) = 0
 √

 2Es
0≤t≤T
T sin(2πfc t) ;
s1 (t) =


0 ; elsewhere
where fc T ∈ N. This modulation scheme is called amplitude shift key-
ing (ASK) and is neither antipodal nor orthogonal signalling. It is left
172 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

as an exercise to show that ASK has the same energy efficiency as that
of orthogonal signalling (exercise: find an orthonormal basis and the
corresponding vectors s0 , s1 ).

Rectangular decision regions (W&J, pages 251 - 254) occur when “the signal
vector configuration is rectangular and all signals are equally likely”. This
situation is easily analyzed and generalizes the previous theorem.

Example 4.5.1. Consider the signal set depicted in figure 4.12. We have (again

Figure 4.12:

we use the formula of equation(4.4)):

∫∫
P (C |m0 ) = pn (β)dβ
I0 −s0
∫ d/2 ∫ ∞
= pn (β)dβ pn (β)dβ
−∞ −d/2

√ 1 e−β /N0 .
2
where pn (β) = πN0
It follows that:

P (C |m0 ) = (1 − p)2

∫∞ ( √ )
where p = d/2
pn (β)dβ = Q d/ 2N0 . Noticing that Em = 5d2 /2, p can be
4.5. PROBABILITY OF ERROR 173
(√ )
Em
expressed as a function of SN R: p = Q 5N0 . Similarly we find:
∫∫
P (C |m4 ) = pn (β)dβ
I4 −s4
∫ d/2 ∫ ∞
= pn (β)dβ pn (β)dβ
−d/2 −d/2
( ( ( d/2 ))( ( −d/2 ))
−d/2 )
= Q √ −Q √ Q √
N0 /2 N0 /2 N0 /2
( ( d/2 ) ( d/2 ))( ( d/2 ))
= 1−Q √ −Q √ 1−Q √
N0 /2 N0 /2 N0 /2
= (1 − 2p)(1 − p)

P (C |m12 ) = (1 − 2p)2
Also
P (C |m0 ) = P (C |mi ), i = 1, 2, 3,
P (C |m4 ) = P (C |mi ), i = 5, 6, 7, . . . , 11,
P (C |m12 ) = P (C |mi ), i = 13, 14, 15.
Finally, after simplification we obtain

15
P (C ) = P (mi )P (C |mi )
i=0
= (1 − 3p/2)2
The P (E ) = 1 − P (C ) is plotted on figure 4.13 as a function of the SN R.
In the special case where:
 √

 2
; 0≤t≤T
T cos(2πfc t)
ϕ1 (t) =


0 ; elsewhere
 √

 2
; 0≤t≤T
T sin(2πfc t)
ϕ2 (t) =


0 ; elsewhere
this modulation is referred to as quadrature amplitude modulation with 16 levels
(QAM-16). It generalizes to other numbers of levels. QAM-4 is also called
quadri-phase shift keying (QPSK).
Theorem 56. (vertices of a hypercube) Let M = 2N equally likely messages
be mapped onto signal vectors s0 , . . . , sM −1 which form the vertices of an N -
dimensional hypercube centered on the origin:
si = (si1 , , si2 , , . . . , siN ), sij = ±d/2 .
174 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

Figure 4.13:

( ) (√ )
N d2
Then P (C ) = (1 − p) N
where p = Q √ d
2N
=Q 2Es
N N0 and Es = 4 =
0

Em .

Proof. For simplicity, let N = 3. There are M = 23 = 8 possible signal vectors

s0 = (−d/2, −d/2, −d/2) s4 = (d/2, −d/2, −d/2)


s1 = (−d/2, −d/2, d/2) s5 = (d/2, −d/2, d/2)
s2 = (−d/2, d/2, −d/2) s6 = (d/2, d/2, −d/2)
s3 = (−d/2, d/2, d/2) s7 = (d/2, d/2, d/2)

as seen on figure 4.35(b) in W&J. Clearly the decision regions are the regions
of R3 delimited by the axes (octants). Suppose s0 is transmitted. Then:

1. no errors ⇔ n1 < d/2, n2 < d/2, n3 < d/2 (as in W&J bottom of page
255),

2. by symmetry P (C |m0 ) = P (C |mi ), i = 1, 2, . . . , 7 (as in W&J top of


page 256),

N d2
3. Em = |s|2 = 4 = Es (as in equation (4.88a) in W&J).
4.5. PROBABILITY OF ERROR 175

Therefore:
∫ d/2 ∫ d/2 ∫ d/2
P (C |m0 ) = pn (α1 )pn (α2 )pn (α3 )dα1 dα2 dα3
−∞ −∞ −∞
( ∫ d/2 )3
= pn (α)dα
−∞
= (1 − p)3

( d ) (√ 2E )
√ 1 e−α /N0
2
where pn (α) = πN0
and p = Q √2N =Q N N0 . The symmetry
s
0

and equiprobability of the signal vectors imply P (C ) = P (C |mi ) = (1 − p)3 .

Remark. The irrelevance theorem may be used to give a more global proof of
this result (refer to W&J, top of page 257).
P (E ) is plotted as a function of the SN R for the case N = 4, M = 16 in
figure 4.13. It is seen, not surprisingly, that the QAM-16 modulation yields a
bigger P (E ) for the same value of SN R. P (E ) versus SN R is plotted in figure
4.14 for various values of N .

Figure 4.14: Vertices of hypercube


176 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

4.5.2 Orthogonal and Related Signal Sets


W&J, pages 257 - 263.
−1
Theorem 57. Let {sj (t)}Mj=0 denote M = N equiprobable orthogonal signal
waveforms with same energy Es = Em . Then (equation (4.96a) in W&J)
∫ ∞ √ (∫ α )M −1
P (C ) = pn (α − Es ) pn (β)dβ dα
−∞ −∞
| {z }
( )
1−Q √ α
N0 /2

Proof. (W&J, pages 257 - 259) Notice that the decision regions are no longer
rectangular. In this case we use the receiver structure of figure 4.19 in W&J to
describe the event E and to calculate its probability. With an orthogonal signal
set, the various elements in fig 4.19 become: 8

• ϕj (t) = sj (t)/ Es ,

• c0 = c1 = . . . = cM −1 ,

• weighting matrix is diagonal.

Suppose s0 is transmitted. We then have:

1. no error ⇔ r0 > r1 , r0 > r2 , . . . , r0 > rM −1 ,



2. r0 = Es + n0 and rj = nj , j = 1, 2, . . . , M − 1,

3. symmetry ⇒ P (C |m0 ) = P (C |mj ), j = 1, 2, . . . M − 1.

The remainder of the proof is straightforward (W&J, page 258).

√ Even though it is not clear from the expression, P (E ) is function of the ratio
ES /N0 only; it does not depend of both ES √ and N0 independently. P (E ) is
plotted on figure 4.15 as a function of SN R = ES /N0 .
M −1
Definition 4.5.3. Let {sj }j=0 be a set of M = N equally likely, equal energy,
−1
orthogonal signal vectors. The signal vector set {s′j = sj − a}M
j=0 where

M −1
1 ∑
a=s= si
M i=0

is a simplex signal vector set.

Remarks.
8 The notation is changed in that j ranges from 0 to M − 1 instead of from 1 to N . We

also write s = (s0 , s1 , . . . , sM −1 ), n = (n0 , n1 , . . . , nM −1 ), r = (r0 , r1 , . . . , rM −1 ).


4.5. PROBABILITY OF ERROR 177

Figure 4.15:

1. The vector a in the above is the center of gravity (equation (4.97a) in


W&J). The center of gravity of a simplex signal set is therefore at the
origin. The M signal vectors of a simplex are linearly dependent; they
span a vector space of dimension M − 1.

2. As previously mentioned in the proof of theorem 54, P (E ) is not affected


by translation of the signal set and consequently the simplex set {s′j =
−1 M −1
sj − a}M
j=0 yields the same P (E ) as the signal vector set {sj }j=0 . Less
energy is required for the transmission of the simplex signal vectors (page
261 in W&J). P (E ) for a given SN R can be obtained from figure 4.37 in
W&J by rescaling the horizontal axis with the factor 1 − 1/M . P (E ) is
much better than vertices of hypercube, rectangular decision regions such
as QAM-16 plotted in figure 4.13; we expand on this below.

3. “A simplex is the optimum set of M signals for use in white Gaussian


noise when the energy is constrained and P (mi ) = 1/M, ∀i” 9 .
(s)
The signal vectors of a simplex set of energy Es and dimension N = M − 1
9 W&J, page 260.
178 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

are of the form:



(s)
Es
s′i = − (+1, +1, . . . , +1, (1 − M ), +1, . . . , +1)
M (M − 1) | {z }
M components

where the component (1 − M ) is in position i + 1 where 0 ≤ i ≤ M − 1.


The square of the distance between two signal vectors s′i and s′j , i ̸= j, is
|si − s′j |2 where:


(s)
Es
s′i − s′j = − (0, 0, . . . , 0, −M, 0, 0, . . . , 0, +M, 0, . . . , 0)
M (M − 1)
In the above, the component −M is in position i and the component +M is in
position j. The square of the distance is then:
(s)
Es
|s′i − s′j |2 = × 2M 2
M (M − 1)
2M
= E (s)
M −1 s
(o) M Es(s)
An orthogonal set of signals of equal energy Es = M −1 is also such that the
(s)
square of the distance between any two distinct signals is M2M −1 Es . Both sets
consequently have the same P (E ) and P (C ). For example, a ternary (M = 3)
(s)
simplex set of dimension N = 2 with Es = 1 V2 has the same P (E ) as the
(o)
corresponding orthogonal signal set with energy Es = 1.5 V2 . In other words,
the probability of error of a ternary simplex set at a signal-to-noise ratio of 0 dB
(SN R = 1) is equal to the probability of error of a ternary orthogonal set at a
signal-to-noise ratio of√1.76 dB (SN R = 1.5). P (E ) is plotted on figure 4.15 as
a function of SN R = ES /N0 .
−1
Definition 4.5.4. (W&J, page 261) Let {sj }N j=0 be a set of equal energy or-
−1
thogonal signal vectors. The signal vector set {sj , −sj }N
j=0 is called biorthog-
onal set and clearly M = 2N .
−1
Theorem 58. Let {sj , −sj }N j=0 contain M = 2N equiprobable biorthogonal
signal vectors with with same energy Es = Em . Then (equation (4.104) in
W&J):
∫ ∞ √ (∫ ∞ )N −1
P (C ) = pn (α − Es ) pn (β)dβ dα
0 −α
∫ ∞ √ ( ∫ ∞ )N −1
= pn (α − Es ) 1 − 2 pn (β)dβ dα
0 α
∫ ∞ √ ( ( α ))N −1
= pn (α − Es ) 1 − 2Q √ dα
0 N0 /2
4.5. PROBABILITY OF ERROR 179

Proof. Similar to orthogonal signal vectors. If s0 is transmitted, then:




 r closer to s0 than to −s0


 and
no errors ⇔ r closer to s0 than to sj , j = 1, 2, . . . , N − 1



 and

r closer to s0 than to −sj , j = 1, 2, . . . , N − 1

This becomes:


 r0 > 0



no errors ⇔ and





r0 > ri and r0 > −ri , i = 1, 2, . . . , N − 1

The remainder of the proof is straightforward.

The expressions of P (C ) in equations (4.96a), (4.104) in W&J can be shown


to depend on SN R = Em /N0 = Es /N0 only and √ not on both Es , N0 . P (E ) is
plotted on figure 4.16 as a function of SN R = ES /N0 .

Figure 4.16:
180 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

4.5.3 Completely Symmetric Signal Sets and a priori


Knowledge
Pages 263 - 264 in W&J.

Congruent Decision Region Receiver

Definition 4.5.5. Consider a receiver m \CDR ( ) and corresponding decision


−1
regions {Ij }M
j=0 . If all Ij ’s are rotations and translations of each others, the
decision regions are said to be congruent and the receiver m \ CDR ( ) is called
congruent decision region receiver.

We easily see that a symmetrical signal vector set with equiprobable mes-
sages leads to a congruent decision region receiver by the MAP decision rule
(page 263 in W&J). Equivalently, when the signal set is symmetrical, a max-
imum likelihood receiver is a congruent decision region receiver. A congruent
decision region receiver need not be optimal, but over an additive white Gaus-
sian noise channel it satisfies (equation (4.106a) in W&J):

P (C |mi ) = P (C |mj )

for all i, j, from which it follows that (equation (4.106b) in W&J):


M −1
P (C ) = P (mi )P (C |mi )
i=0
= P (C |mi ), ∀i.

This proves the following (page 263 in W&J):

Theorem 59. The error performance of a congruent decision region receiver


is invariant to the actual (a priori) source statistics.

A congruent decision region receiver is only guaranteed to be optimum when


the (congruent) decision regions correspond to the MAP decision rule for given
a priori message probabilities; if the a priori probabilities are not all equal, the
receiver still achieves the same P (E ) but a better receiver could be designed.
This can be advantageous from a designer’s point of view (page 264 in W&J).

Minimax Receiver

It has been observed that the P (E ) of a receiver depends on the a priori


message probabilities (problem 2.12 in W&J), and reaches a maximum for a
certain choice of the P (mi )’s.

Definition 4.5.6. A receiver having smallest maximum P (E ) over all possible


choices of P (mi )’s is called minimax receiver.
4.5. PROBABILITY OF ERROR 181

Theorem 60. A congruent decision region receiver which is optimal for a cer-
tain choice of the P (mi )’s is minimax. In particular a maximum likelihood
receiver with symmetrical signal vector set is minimax.
Proof. Refer to page 264 in W&J; this proves part 4 of theorem 50.

4.5.4 Union Bound on the Probability of Error


Pages 264 - 266 in W&J.
The following theorem leads to useful simple expressions bounding the P (E )
of equations (4.96), (4.104) in W&J.
−1
Theorem 61. Let {sj }M j=0 be a set of M equally likely signal vectors. Then
(equation (4.109) in W&J)


M −1
P (E |mi ) ≤ P2 [si , sk ]
k=0
k ̸= i

where P2 [si , sk ] denotes the P (E ) of a binary system that would use signals
si and sk to communicate one of two equally likely messages. For the additive
white Gaussian noise channel we have (equations (4.110) and previously (4.76b)
in W&J) ( )
|si − sk |
P2 [si , sk ] = Q √
2N0
and therefore

M −1 ( )
|si − sk |
P (E |mi ) ≤ Q √
2N0
k=0
k ̸= i

Proof. Refer to page 265 in W&J – easy to read.


Corollary.
1. For orthogonal signals (equation (4.111) in W&J):
(√ )
P (E ) ≤ (M − 1)Q Es /N0

10
2. For simplex signals:
(√( )( ))
M Es
P (E ) ≤ (M − 1)Q
M −1 N0
( )
(s) (o) (o) (s) (s) (o)
10 Recall that Es = Es 1− 1
M
⇒ Es = M
E ,
M −1 s
where Es , Es respectively
denote the average signal energy of the simplex and orthogonal signal sets.
182 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

3. For biorthogonal signals (equation (4.112) in W&J):


√ √
P (E ) ≤ (M − 2)Q( Es /N0 ) + Q( 2Es /N0 )

Proof.

1. For orthogonal signals we have:


√ √
s0 − sk = ( Es , 0, 0, . . . , 0, − Es , 0, 0, . . . , 0)
√ √
where the components Es , − E √s are respectively in positions 0 and
k ̸= 0. It follows that |s0 − sk | = 2Es . Therefore:


M −1 ( )
|s0 − sk |
P (E |m0 ) ≤ Q √
k=1
2N0
∑ −1
M
(√ )
= Q Es /N0
k=1
(√ )
= (M − 1)Q Es /N0

and by symmetry (congruent decision region receiver) we obtain:



P (E ) = P (E |m0 ) = (M − 1)Q( Es /N0 )

2. For a simplex signal set we start with the latter bound, and recalling that
(page 261 in W&J):
M (s)
Es(o) = Es
M −1
(s)
where Es = Em , we obtain:
(√( )( ))
M Em
P (E ) ≤ (M − 1)Q
M −1 N0

3. For biorthogonal signals, we have


N −1 ( ) N∑−1 ( ) ( )
|s0 − sk | |s0 + sk | |s0 + s0 |
P (E |m0 ) = Q √ + Q √ +Q √
k=1
2N0 k=1
2N0 2N0

and we easily find:



|s0 − sk | = 2Es

|s0 + sk | = 2Es

|s0 + s0 | = 2 Es
4.6. BIT ERROR RATE AND SIGNAL-TO-NOISE RATIO PER BIT 183

It follows (by symmetry) that:

P (E ) = P (E |m0 )
(√ ) (√ )
= 2(N − 1)Q Es /N0 + Q 2Es /N0
(√ ) (√ )
= (M − 2)Q Es /N0 + Q 2Es /N0

These bounds are plotted in figure 4.17 for the case M = 16.

Figure 4.17:

4.6 Bit Error Rate and Signal-to-Noise Ratio


per Bit
The graphs presented in the previous section do not present a fair comparison
between signal sets of different sizes because of the following:

1. A set of M signals carries log2 (M ) bits for each transmission of a mes-


sage. The signal energy should be calculated as an average energy per
information bit.

2. Each message error results in a number of bit errors that can be anywhere
between 1 and log2 (M ).
184 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

It may be very difficult to predict the bit error rate of a signal set as a
function of the average signal-to-noise ratio per bit, mainly because it depends
on the mapping of information bit sequences to signal vectors. A simple ap-
proximation may be obtained by noticing that at a low signal-to-noise ratio one
expects that half of the log2 (M ) bits of a message will be incorrect and at a high
signal-to-noise ratio only one of the log2 (M ) bits of a message will be incorrect
(using a Gray code to map the information bit sequences to signal vectors).
After rescaling the horizontal axis to reflect the average signal-to-noise ratio
per bit, some of the curves of the previous section are redrawn as two curves each,
one of which corresponding to “half of the information bits are incorrect” and the
other curve corresponding to “only one bit is incorrect”; the true performance
curve should lie somewhere in between the two. Some of the curves are presented
in figures 4.18 and 4.19.

Figure 4.18:
4.7. PROBLEMS: 185

Figure 4.19:

4.7 Problems:
1. Consider the signals in figure 4.20.

(a) Calculate an orthonormal basis for s1 (t), s2 (t).


Hint: you will find two orthonormal functions ϕ1 (t) and ϕ2 (t).
(b) Calculate the coordinates of s1 (t) and s2 (t) in the basis
{ϕ1 (t), ϕ2 (t)}.
(c) To which function ϕ1 (t) or ϕ2 (t) does s1 (t) resemble the most? To
which function ϕ1 (t) or ϕ2 (t) does s2 (t) resemble the most?
(d) The signal
 2
 t ; 0≤t≤1
g(t) = t2 (u(t) − u(t − 1)) =

0 ; elsewhere
does not lie in the space generated by ϕ1 (t), ϕ2 (t). Calculate the
d in the space generated by ϕ1 (t), ϕ2 (t) that resembles the
signal g(t)
d on the same graph. Does
most the signal g(t). Sketch g(t) and g(t)
your answer appear to make sense?
(e) Calculate a third orthonormal function ϕ3 (t) such that all of s1 (t),
s2 (t) and g(t) lie in the space generated by ϕ1 (t), ϕ2 (t) and ϕ3 (t).
186 CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

Figure 4.20:
Chapter 5

Reliable Communication on
an Unreliable Channel

the situation depicted in figure 5.1 where a source is transmitting a


Consider
message to a receiver via a communication channel. A denotes the source

A B
source channel receiver

Figure 5.1:

alphabet, A , |A | > 1 is the number of source alphabet symbols. The source


is characterized by the a priori message probabilities P (A), ∀a ∈ A , that we
assume known. B denotes the channel output alphabet. The channel is described
by the conditional transition probabilities P (b|a), ∀a ∈ A , ∀b ∈ B. The
receiver attempts to recover the transmitted symbol from the received symbol
using the a priori probabilities and the channel transition probabilities.
In chapter 4 we calculated the probability of error when one symbol from A
is transmitted over the channel. Using the analogy between probabilities and
relative frequency of occurrence of an event, we expect the error rate to equal
the probability of error when multiple bits are transmitted over the channel
(this is the basis of Monte Carlo simulations to estimate the probability of
error by repeated transmission of symbols over a channel). Because the receiver
has a non-zero probability of making an error, this scheme does not allow the
reliable transmission of an arbitrarily long sequence over the channel. This is the
problem that we now address: to transmit an arbitrarily long sequence over the
channel and guarantee (with an arbitrarily small probability of error) that the
sequence is received correctly in its entirety. The event “an error has occurred”
is realized whenever the receiver makes a single or more mistakes in the received
sequence. For a long time it was believed that the probability of this event could

187
188CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

never be made arbitrarily small. Shannon showed how this can be accomplished
in a famous paper published in 1948. In the following we denote:
A N = {(a1 , a2 , . . . , aN )|ai ∈ A }
in which N > 1 is an integer. A sequence of N source alphabet symbols is an
element of A N .
Definition 5.0.1.
1. A block code C of (block) length N and rate R over the alphabet A is a
subset of A N of size (cardinality) M = A RN . The elements of C are
called codewords. It is convenient to number the codewords and without
loss of generality we write:
C = {y 1 , y 2 , . . . , y M }.

2. RN = logN2 M is the bit rate. RN may be larger than 1 and RN = R when


A = 2.
3. If RN ∈ N, RN = K > 1, an encoder is a bijective mapping ϵ : A K → C .
In the above definition the rate could equivalently be defined as:
logA M
R= ≤ 1.
N
The number ⌈logA M ⌉ represents the minimum number of components that are
required in order to represent M messages with different sequences of symbols
from A . It follows that the rate can be interpreted as the (average) fraction of
alphabet symbol sent over the channel per channel use when a codeword taken
at random is transmitted. Similarly, the bit rate RN represents the equivalent
average number of bits that are transmitted per channel use.
Example 5.0.1. log2 40 = 5.322 bits which means that 40 given binary vectors
of length N ≥ 6, can be used to represent all 32 binary vectors of length 5 plus
some 8 binary vectors of length 6. If N = 10 is used then an average of 0.5322
bits is sent each time the channel is used to transmit the symbols of one of the
40 codewords taken at random.
Example 5.0.2. C = {(0, 1, 0, 1), (1, 1, 1, 1), (1, 0, 1, 0), (1, 1, 1, 0)} is a block
1
code of length 4 and rate 21 over the alphabet A = {0, 1}; there are 4 = 2(4)( 2 )
codewords. The mapping:
ϵ : {0, 1}2 → C
ϵ : (0, 0) 7 → (0, 1, 0, 1)
ϵ : (0, 1) 7→ (1, 1, 1, 1)
ϵ : (1, 0) 7→ (1, 0, 1, 0)
ϵ : (1, 1) 7→ (1, 1, 1, 0)
is one of the 4! = 24 possible encoders. Clearly an average of 21 bit from the
source alphabet A = {0, 1} is sent every time the channel is used to transmit a
codeword taken at random.
189

In order to communicate over the (possibly unreliable) channel, the source


and receiver (the receiver is called a decoder in the context of coding for the
correction of errors – see definition below) agree on a block code C and the
source transmits a codeword y i ∈ C ⊂ A N taken at random over the channel
as depicted in figure 5.2. The source is characterized by the a priori probabilities

N
yi ∈ C z∈B decoder ŷ ( z ) ∈ C
source channel

RN
symbols
source encoder

If RN is an integer

Figure 5.2:

P (y i ), ∀y i ∈ C , (we suppose the codewords are equally likely). The channel is


described by the conditional transition probabilities PN (z|y i ), ∀y i ∈ C , ∀z ∈
B N . The decoder yb : B N → C attempts to recover the transmitted codeword
from the received word z ∈ B N using the a priori probabilities and the channel
transition probabilities and makes an error if and only if yb(z) ̸= y i .
Definition 5.0.2.
1. A decoder is a mapping yb : B N → C .
2. Let I1 , I2 , . . . , IM ⊂ B N such that:
Ii , {z ∈ B N : yb(z) = y i }.
Ii is called the decision region of y i , i = 1, 2, . . . , M .
Clearly we have ∪M
i=1 Ii = B
N
and Ii ∩ Ij = ∅, ∀i ̸= j. It follows that:

P (E |y i ) = P (z ∈
/ Ii |y i ) = PN (z|y i )
z ∈I
/ i

which implies

M
P (E ) = P (y i )P (E |y i )
i=1

M ∑
= P (y i ) PN (z|y i )
i=1 z ∈I
/ i

The last expression shows that the probability is a function of the a priori
probabilities P (y i ), the channel probabilities PN (z|y i ), the code C and the
decoders decision regions Ii , i = 1, 2, . . . M .
190CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

Remark. the last expression assumes that z is a discrete ∑variable; if∫ it is con-
tinuous then all PN (z|y i ) become pz (γ|y = y i ) and all z become dγ.

Definition 5.0.3.

1. An optimal decoder minimizes P (E ) so that any other decoder achieves a


probability of error bigger or equal to that of an optimal decoder.

2. A maximum a posteriori decoder is such that:

yb(z) = y i ⇔ P (y i )PN (z|y i ) ≥ P (y j )PN (z|y j ), ∀j ∈ {1, . . . , M }.

3. A maximum likelihood decoder is such that:

yb(z) = y i ⇔ PN (z|y i ) ≥ PN (z|y j ), ∀j ∈ {1, . . . , M }.

It is well known that a maximum a posteriori decoder is optimal and that a


maximum likelihood receiver is optimal when the codewords are equally likely
1
(P (y i ) = M , ∀i). In the following we assume that the codewords are equally
1
likely .

5.1 Codes of size M = 2


We write
C = {y 1 , y 2 }
logA 2
and the rate is R = N . The decision regions are:

I1 = {z ∈ B N : PN (z|y 1 ) ≥ PN (z|y 2 )},


I2 = {z ∈ B N : PN (z|y 2 ) > PN (z|y 1 )} = B N − I1 .

Example 5.1.1. Let A = {0, 1}, A = 2 and the (memoryless) Binary Sym-
metric Channel (BSC) shown in figure 5.3 with p < 12 . We have:

PN (z|y i ) = (1 − p)N −dH (z,yi )pdH (z,yi )


( p )dH (z,yi
= (1 − p)N )
1−p

where dH (z, y i ), called the Hamming distance between z and y i , denotes the
number of positions in which z and y i differ. Clearly p < 12 ⇒ 1−p p
< 1 and
PN (z|y i ) is maximized by minimizing dH (z, y i ). So if

C = {(0, 0, 1, 0), (1, 0, 0, 1)},


1 This can be seen as a worst case scenario – refer to W&J page 81
5.1. CODES OF SIZE M = 2 191

then we find:

I1 = {(0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 1, 1),


(0, 1, 0, 0), (0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 1, 0), (1, 1, 1, 0)}
I2 = {(0, 0, 0, 1), (0, 1, 0, 1), (1, 0, 0, 0),
(1, 0, 0, 1), (1, 0, 1, 1), (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 1, 1)}

Clearly I1 ∪ I2 = {0, 1}4 and I1 ∩ I2 = ∅.


4
{ 0,1}
I1
I2
y1= 0,0,1,0 

y2 =1,0,0,1

Figure 5.3:

For the code of size M = 2 and the decision regions of the maximum likeli-
hood decoder we have:

P (E |y 1 ) = P (z ∈ I2 |y 1 ) = PN (z|y 1 )
z∈I2

We then notice that for any z ∈ I2 we have PN (z|y 1 ) < PN (z|y 2 ), i.e.
( )s
PN (z|y 2 ) PN (z|y 2 )
≥1⇒ ≥ 1, ∀s ≥ 0
PN (z|y 1 ) PN (z|y 1 )
The above P (E |y 1 ) is bounded by:
∑ ( )s
PN (z|y 2 )
P (E |y 1 ) ≤ PN (z|y 1 )
PN (z|y 1 )
z∈I2

= PN (z|y 1 )1−s PN (z|y 2 )s
z∈I2

≤ PN (z|y 1 )1−s PN (z|y 2 )s ,
z∈BN

∀s ≥ 0. The above bound is convenient because it does not involve the decision
regions; it can be calculated or computed without first having to find the decision
192CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

regions. We also find similarly:



P (E |y 2 ) ≤ PN (z|y 2 )1−s PN (z|y 1 )s
z∈BN

∀s ≥ 0. It follows that:
P (E ) = P (E |y 1 )P (y 1 ) + P (E |y 2 )P (y 2 )

≤ P (y 1 ) PN (z|y 1 )1−s PN (z|y 2 )s
z∈BN

+P (y 2 ) PN (z|y 2 )1−s PN (z|y 1 )s
z∈BN

∀s ≥ 0. This bound is seen to be valid for any a priori probabilities. If P (y 1 ) =


P (y 2 ) = 21 it is seen that the value s = 12 gives the smallest (tightest) upper
bound which reduces to:
∑ √
P (E ) ≤ PN (z|y 1 )PN (z|y 2 ) , g(y 1 , y 2 )
z∈BN

This bound only depends on the code C = {y 1 , y 2 } and the discrete commu-
nication channel probabilities PN (z|y), with or without memory; it assumes
equally likely messages and a maximum likelihood decoder.

5.1.1 Memoryless channel


The bound g(y 1 , y 2 ) can be expanded as follows:
∑ √
g(y 1 , y 2 ) = PN (z|y 1 )PN (z|y 2 )
z∈BN
v
uN
∑ u∏
= t P (zn |y1n )P (zn |y2n )
z∈BN n=1

∑ ∏
N √
= P (zn |y1n )P (zn |y2n )
z∈BN n=1

∑ ∑ ∑ ∏
N √
= ··· P (zn |y1n )P (zn |y2n )
z1 ∈B z2 ∈B zN ∈B n=1
N ∑ √

= P (zn |y1n )P (zn |y2n )
n=1 zn ∈B

Given a code C = {y 1 , y 2 }, a memoryless DCC with transition probabilities


P (zn |yn ), equally likely messages and maximum likelihood decoder we have:

N ∑ √
P (E ) ≤ P (zn |y1n )P (zn |y2n )
n=1 zn ∈B
5.1. CODES OF SIZE M = 2 193

The subscript n of zn may be omitted in the above. We notice that if y1n = y2n
∑√ ∑
P (z|y1n )P (z|y2n ) = P (z|y1n ) = 1
z∈B z∈B

and those values do not affect the product. We can then rewrite the product in
the above bound as:

N ∑√
P (E ) ≤ P (z|y1n )P (z|y2n )
z∈B
n=1
y1n ̸= y2n

5.1.2 Memoryless channel and A = {0, 1}


Since there are now only two possibilities for y1n , y2n and we insist that y1n ̸=
y2n for some n, the bound becomes:

N ∑√
P (E ) ≤ P (z|0)P (z|1)
n=1 z∈B

y1n ̸= y2n
= γ dH (y1 ,y2 )
∑ √
where γ , P (z|0)P (z|1). If z is a continuous variable then γ =
∫∞ √ z∈B

−∞
pr (ρ|0)pr (ρ|1)dρ.
2
Example 5.1.2.
1. Binary symmetric channel with transition probability p. 3 We find:
∑ √ √
γ= P (z|0)P (z|1) = 2 p(1 − p)
z∈{0,1}

and P (E ) is bounded by
( )dH (y1 ,y2 )

P (E ) ≤ 2 p(1 − p) .

( )dH (y1 ,y2 )


−3 √ √
If for example p = 10 then γ ≈ 2 p and P (E ) ≤ 2 p . The
repetition code of block length 11 gives the bound
P (E ) ≤ (211 )(10−3 )11/2 = 6.48 × 10−14
2 The bounds obtained in problem 2.39 in W&J are identical to the results obtained in

these examples; the situation described in problem 2.39 corresponds to dH (y 1 , y 2 ) = N .


3 Throughout this chapter we assume that the BSC consists in an AWGN channel used

with (binary) antipodal modulation and a maximum likelihood receiver. We then have p =
(√ )
Q 2EN /N0 , where EN and N0 /2 respectively denote the energy transmitted per channel
use and the double sided noise power spectral density.
194CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

The exact probability of error is given by:


∑11 ( )
11 t
P (E ) = p (1 − p)11−t = 4.6 × 10−15
t=6
t

and the bound is found to be 14 times larger than the exact value in this
specific case.
2. AWGN channel with binary antipodal modulation, maximum likelihood
(ML) receiver and as before EN and N0 /2 respectively denote the energy
transmitted per channel use and the double sided noise power spectral
density: ∫ ∞√ √ √
γ= pr (ρ| EN )pr (ρ| − EN )dρ
−∞
In the above we have
1
e−(ρ−si )
2
/2σ 2
pr (ρ|si ) = pn (ρ − si ) = √
2πσ 2
N0
and σ 2 = 2 . It follows that:
∫ ∞√
1 √ 1 √
γ = √ e−(ρ− EN )2 /2σ2 √ e−(ρ+ EN )2 /2σ2 dρ
2πσ 2 2πσ 2
−∞
∫ ∞ ( √ √ )
1 −(ρ − EN ) − (ρ + EN )2
2
= √ exp dρ
−∞ 2πσ 2 4σ 2
∫ ∞ ( )
1 ρ2 + EN
= √ exp − dρ
−∞ 2πσ 2 2σ 2
( )∫ ∞
EN 1
e−ρ /2σ dρ
2 2
= exp − 2 √
2σ −∞ 2πσ 2
−EN /N0
= e

Refer to the graph in figure 5.4 for a comparison of the γ corresponding to:
1. BSC,
2. AWGN channel/antipodal modulation/ML receiver.
Since a smaller value of γ is desirable we notice that channel 2 is better (≈ 2
dB more energy efficient).

5.2 P (E ) averaged over all possible C ⊂


A N , |C | = M = 2
In the following we calculate (a bound on) the average P (E ) that is obtained
when we chose at random a code of size 2 and block length N . Specifically we
5.2. P (E ) AVERAGED OVER ALL POSSIBLE C ⊂ A N , |C | = M = 2 195

0.8 gamma for BSC


gamma for AWGN

0.6
gamma

0.4

0.2

0
-15 -10 -5 0 5 10 15
SNR EN/N0 [dB]

Figure 5.4:

consider the probability system (Ω = A N , F = 2Ω , QN : F → [0, 1]), in which


QN ( ) is a probability measure and4 :

QN ({y}) = 1
y∈A N

The selection of a code C of size 2 and block length N is a compound random


experiment: 2 codewords are selected by 2 independent trials5 of the random
experiment corresponding to the probability system (A N , 2A , QN ). It follows
N

that the probability of a code C = {y 1 , y 2 } is

p(C ) = P ({y 1 , y 2 }) = QN (y 1 )QN (y 2 )

and the probability of error corresponding to code C is a random variable ξ


given by and bounded by:
∑ ∑
ξ = PN (z|y 1 )P (y 1 ) + PN (z|y 2 )P (y 2 )
z∈I2 z∈I1
≤ g(y 1 , y 2 )
4 for simplicity of notation we let QN (y) , QN ({y})
5 we do not insist that the codewords y 1 and y 2 be different
196CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

where we recall that


∑ √
g(y 1 , y 2 ) = PN (z|y 1 )PN (z|y 2 )
z∈BN

The average probability of error, denoted by P (E ), is given by:



P (E ) = ξP (C )
C
∑ ∑
≤ g(y 1 , y 2 )QN (y 1 )QN (y 2 ) , g
y 1 ∈A N y 2 ∈A N

g is simplified as follows:
∑∑∑ √ √
g = QN (y 1 ) PN (z|y 1 )QN (y 2 ) PN (z|y 2 )
z y1 y2
∑(∑ √
)( ∑

)
= QN (y 1 ) PN (z|y 1 ) QN (y 2 ) PN (z|y 2 )
z y1 y2
∑( ∑ √
)2
= QN (y) pN (z|y)
z y∈A N

The average probability of error for any discrete communication channel (with
or without memory) with 2 equally likely messages and a maximum likelihood
decoder is bounded by:

∑ ( ∑ √
)2
P (E ) ≤ g = QN (y) pN (z|y)
z∈BN y∈A N

5.2.1 Memoryless channel


g takes the form

∑( ∑ ∏
N √ )2
g = QN (y) P (zn |yn )
z y∈A N n=1

∑( ∑ ∏
N √
)2
= Q(yn ) P (zn |yn )
z y∈A N n=1

∏N
where we choose QN (y) = n=1 Q(yn ) in which Q( ) is a probability measure
on A (this corresponds to drawing by independent trials the symbols of y out
5.2. P (E ) AVERAGED OVER ALL POSSIBLE C ⊂ A N , |C | = M = 2 197

of the alphabet A , similarly to a compound experiment). It follows that:

∑( ∏
N ∑ √
)2
g = Q(y) P (zn |y)
z n=1 y∈A
N ( ∑
∑ ∏ √
)2
= Q(y) P (zn |y)
z∈BN n=1 y∈A


N ∑(∑ √
)2
= Q(y) P (z|y)
n=1 z∈B y∈A
( )2 )N
∑(∑ √
= Q(y) P (z|y)
z∈B y∈A
−N R(Q)
= 2

where we define
(∑(∑ )2 )

R(Q) , − log2 Q(y) P (z|y)
z∈B y∈A

Remark. Clearly the above is only useful if

g < 1 ⇔ log2 (g) = −N R(Q) < 0 ⇔ log2 (1/g) = N R(Q) > 0

As long as Q( ) and the channel probabilities P (z|y) are such that R(Q) > 0
(bits per channel use) then g can be made as small as we wish by taking N large
enough; there must then exist at least one pair of codewords {y 1 , y 2 } ⊂ A N
for which the probability of error is less than g.
In order to obtain the “tightest” bound, the probability measure Q( ) max-
imizing R(Q) is chosen.

Definition 5.2.1. R0 , maxQ( ) (R(Q)) is called exponential bound parameter


or computational cut-off rate, and we let g0 , 2−N R0 .

Equivalently we have:
( (∑(∑ )2 ))

R0 = − log2 min Q(y) P (z|y)
Q( )
z∈B y∈A

and P (E ) ≤ g0 = 2−N R0 . When R0 > 0, the average probability of error


decreases exponentially with N .
198CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

5.2.2 Special case 2: memoryless channel and A = {0, 1}


1
The optimal choice for Q( ) is Q(0) = Q(1) = 2 and we find:

∑( √ √
)2
R0 = − log2 (1/2) P (z|0) + (1/2) P (z|1)
z∈B

∑ ( P (z|0) P (z|1) P (z|0)P (z|1)
)
= − log2 + +
4 4 2
z∈B
= − log2 (1/2)(1 + γ)
= 1 − log2 (1 + γ)
∑ √
where we recall that γ = z∈B P (z|0)P (z|1).

Example 5.2.1. Refer to the graphs on figure 5.5 for a comparison of the R0
corresponding to:

1. BSC,

2. AWGN channel/antipodal modulation/ML receiver.

Since a larger R0 is characteristic of a better channel, we notice that channel


2 is a better channel (≈ 2 dB more energy efficient) as remarked earlier on the
plots of γ.

0.9

0.8
R0 for BSC
0.7 R0 for AWGN

0.6
R0

0.5

0.4

0.3

0.2

0.1

0
-15 -10 -5 0 5 10 15
SNR EN/N0 [dB]

Figure 5.5:
5.3. P (E ) AVERAGED OVER ALL POSSIBLE C ⊂ A N , |C | = M 199

5.3 P (E ) averaged over all possible C ⊂


A N , |C | = M

We now calculate the average P (E ) that is obtained when a code of size


M and block length N is chosen at random. Extending the construction
of the previous section, M codewords are selected by M repeated (indepen-
dent) trials of the random experiment corresponding to the probability system

(A N , 2A , QN ), where we recall that y∈A N QN (y) = 1. The probability of
N

code C = {y 1 , y 2 , . . . , y M } is:


M
P (C ) = QN (y i )
i=1

We define the following regions of B N for code C = {y 1 , y 2 , . . . , y M }:

Ii,j = {z ∈ B N : PN (z|y j ) ≥ PN (z|y i )} ,

i, j = 1, 2, . . . , M, i ̸= j (there are M (M − 1) such regions). If the M


codewords are equally likely, we can use the union bound (W & J, p 265) to
bound the probability of error achieved by a maximum likelihood decoder:

P (E ) = P (E |y 1 ) = P (z ∈ ∪j̸=1 I1j |y 1 )

≤ P (z ∈ I1j |y 1 )
j̸=1

≤ g(y 1 , y j )
j̸=1

√ ≤ depending on how ties are broken


where the first equality relation may become

and we recall that g(y 1 , y j ) = z∈BN P (z|y 1 )P (z|y j ) is the upper bound
for the code {y 1 , y j } of size 2 (equally likely and maximum likelihood decoder).
When the code is chosen at random then the above probability of error is a
200CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

random variable, the expected value of which is given by:



P (E |y 1 ) = P (E )P (C )
C
∑ ∑
≤ P (C ) g(y 1 , y j )
C j̸=1

∑∏
M ∑
= QN (y l ) g(y 1 , y j )
C l=1 j̸=1

∑ ∑ ∑ ∑∏
M
= ··· QN (y l )g(y 1 , y j )
y 1 ∈A N y 2 ∈A N y M ∈A N j̸=1 l=1


M ∑ ∑ ∑ ∏
M
= ··· g(y 1 , y j ) QN (y l )
j=2 y 1 ∈A N y 2 ∈A N y M ∈A N l=1
∑M ∑ ∑
y 1 ∈A N y j ∈A N g(y 1 , y j )QN (y 1 )QN (y j )×
j=2
∑ ∑
··· QN (y 2 ) · · · QN (y M )
= y 2 ∈A N y M ∈A N
| {z }
but not y 1 nor y j
| {z }
equals 1


M ∑ ∑
= QN (y 1 )QN (y j )g(y 1 , y j )
j=2 y 1 ∈A N y j ∈A N

= (M − 1)g ≤ M g

Since the above bound P (E |y 1 ) ≤ M g is independent of y 1 we obtain:



P (E ) = P (y 1 )P (E |y 1 ) = P (E |y 1 ) ≤ M g
y1

when all y 1 ∈ A N are equiprobable. We recall that

∑ ( ∑ √
)2
g= QN (y) P (z|y)
z∈BN y∈A N

and when the channel is memoryless g = 2−N R(Q) where


(∑(∑ )2 )

R(Q) = − log2 Q(y) P (z|y)
z∈B y∈A

In order to obtain the tightest possible bound of P (E ) we take maxQ R(Q):

P (E ) ≤ M g0 = 2N RN 2−N R0 = 2−N (R0 −RN )


5.3. P (E ) AVERAGED OVER ALL POSSIBLE C ⊂ A N , |C | = M 201

where
log2 M
RN =
N
R0 = max R(Q)
Q
( (∑(∑ )2 ))

= − log2 min Q(y) P (z|y)
Q
z∈B y∈A

When A = {0, 1} the above expression reduces to R0 = 1 − log2 (1 + γ) where:


 ∑ √

 P (z|0)P (z|1) ; in general


z∈B

 √


 4p(1 − p) ; for BSC with transition probabil-
γ= ity p







 e−EN /N0 ; for AWGN channel/antipodal

modulation/ML receiver

Interpretation There exists at least one code C ⊂ A N of rate RN < R0 for


which
P (E ) ≤ 2−N (R0 −RN )
Any sequence of K source alphabet symbols can be transmitted and received
correctly in its entirety with an arbitrarily small probability of error δ > 0 as
long as RN < R0 and N is large enough, regardless of how unreliable the channel
may be; for given values of RN and R0 , RN < R0 , the longer the sequence, the
more likely it will be received correctly.
The units of R0 and RN are the same, i.e. bits per channel use. RN repre-
sents the average number of equivalent source bits transmitted every time the
channel is used. R0 suggests the existence of an upper limit on the rate of
transmission. The limit, denoted by C, is called capacity and C ≥ R0 . For the
special cases of BSC and AWGN channel with antipodal signals it is given by:

 1 − H2 (p) ; for BSC with transition probabil-


 ity p
C= ( )
 1
 EN

 2 log2 1 + 2 N0 ; for AWGN channel/antipodal
modulation/ML receiver
where H2 ( ) is the binary entropy function.6 Shannon showed that
1. RN < C ⇒ P (E ) goes to 0 as N → ∞,
2. RN > C ⇒ P (E ) goes to 1 as N → ∞.
Refer to W & J bottom of page 321, bottom of page 322 and the discussion on
page 341.
6H
2 (x) , −x log2 (x) − (1 − x) log2 (1 − x), ∀x ∈ [0, 1].
202CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL

Example 5.3.1.
1. There exists a code of rate RN = 0.6 that can be used to transmit a 160
bit sequence over a BSC with p = 1.25 × 10−2 and a P (E ) < 10−8 . 7 This
is because the channel has R0 ≈ 0.7105, N = 160
0.6 = 267 and

2−N (R0 −RN ) = 2−267×0.1 ≈ 9 × 16−9 < 10−8 .

The size of the code is M = 2160 ≈ 1.5 × 1048 codewords. Practical design
of such codes/encoders/decoders exceeds the scope of this course.
2. The results presented below are obtained by Monte Carlo simulation for
the transmission of a 5000 bit sequence over the AWGN channel with
binary antipodal signalling and maximum likelihood receiver at a signal-
to-noise ratio E N
= 0 dB. This means that the channel’s probability of
√N0
bit error is Q( 2) = 7.865 × 10−2 and the channel’s computational cut-off
rate is R0 ≈ 0.548. We distinguish the cases:
(a) No coding: The packet-error-rate (PER) ≈ 1 since every packet/se-
quence contains an average of 393 bit errors; it is very unlikely that
a packet be received with not a single bit error:

P (C ) = (1 − 7.865 × 10−2 )5000 ≈ 1.33 × 10−178 ≈ 0

(b) Convolutional coding with RN = 21 < R0 : We see that as the con-


straint length is increased from 7 to 13 the PER goes to 0, even
though the rate RN remains fixed at 12 .

Code Con-
packet
straint packets bit errors PER BER
errors
Length
— 500 500 196592 1 7.86 × 10−2
There were 196592 bit errors during the transmission of
2500000 bits at a signal to noise ratio of 0.0 dB
7 500 122 1063 2.44 × 10−1 2.13 × 10−4
9 500 29 185 5.80 × 10−2 7.40 × 10−5
11 500 12 72 2.40 × 10−2 2.88 × 10−5
12 500 10 65 2.00 × 10−2 2.60 × 10−5
13 263 0 0 0 0

Table 5.1: simulation results rate 1/2 convolutional codes over an AWGN chan-
nel with binary antipodal modulation, 0 dB signal-to-noise ratio and maximum
likelihood receiver – Viterbi decoding of the convolutional code

7 this corresponds to a signal ratio of EN = 0.4 dB if the BSC is implemented with binary
N0
antipodal signalling over an AWGN channel and a maximum likelihood receiver
Appendix A

Table of the Q( ) and


erf( ) functions

0.7 −x2 /2
The approximation Q(x) ≈ √1 (1
x 2π
− x2 )e may be used when x > 2.

x erf(x) Q(x) x erf(x) Q(x) x erf(x) Q(x)


0.00 0 0.5 1.70 0.9838 0.04457 3.40 1 0.0003369
0.10 0.1125 0.4602 1.80 0.9891 0.03593 3.50 1 0.0002326
0.20 0.2227 0.4207 1.90 0.9928 0.02872 3.60 1 0.0001591
0.30 0.3286 0.3821 2.00 0.9953 0.02275 3.70 1 0.0001078
0.40 0.4284 0.3446 2.10 0.997 0.01786 3.80 1 7.235 × 10−5
0.50 0.5205 0.3085 2.20 0.9981 0.0139 3.90 1 4.810 × 10−5
0.60 0.6039 0.2743 2.30 0.9989 0.01072 4.00 1 3.167 × 10−5
0.70 0.6778 0.242 2.40 0.9993 0.008198 4.10 1 2.066 × 10−5
0.80 0.7421 0.2119 2.50 0.9996 0.00621 4.20 1 1.335 × 10−5
0.90 0.7969 0.1841 2.60 0.9998 0.004661 4.30 1 8.540 × 10−6
1.00 0.8427 0.1587 2.70 0.9999 0.003467 4.40 1 5.413 × 10−6
1.10 0.8802 0.1357 2.80 0.9999 0.002555 4.50 1 3.398 × 10−6
1.20 0.9103 0.1151 2.90 1 0.001866 4.60 1 2.112 × 10−6
1.30 0.934 0.0968 3.00 1 0.00135 4.70 1 1.301 × 10−6
1.40 0.9523 0.08076 3.10 1 0.0009676 4.80 1 7.933 × 10−7
1.50 0.9661 0.06681 3.20 1 0.0006871 4.90 1 4.792 × 10−7
1.60 0.9763 0.0548 3.30 1 0.0004834 5.00 1 2.867 × 10−7

203
204 APPENDIX A. TABLE OF THE Q( ) AND ERF( ) FUNCTIONS
Appendix B

Orthonormal Expansion
and Vector Representation
of Continuous Signals

e define a scalar product for finite energy signals. This scalar product,
W essentially corresponds to the projection described in the motivating prob-
lem below, and is used to obtain a geometric view of a given finite set of (finite
energy) signals. Additionally, we obtain a distance measure between signals.
Signal bases are defined and we show how to perform a basis change. Finally
we prove Parseval’s relationships. We start with a review of the concepts with
3-dimensional vectors over the real numbers.

Motivating problem: Two teams are sent to measure the po-


sitions of three objects with respect to a fixed reference point in 3
dimensions using a meter stick. The first team obtains the coordi-
nates:

(1, 1, 1)
(0, 1, 1) Objects measured by team 1 with
(0, 1, −1) NSEW reference.

with an orthogonal axis grid based on North, South, East, West


centered on the specified fixed reference point. The second team
obtains coordinates:

(
√ 3,√ 0, 0)
Objects measured by team 2 with
(2/ 3, √ 2/3, 0)
shore line reference.
(0, 0, 2)

with an orthogonal axis grid based on a shore line and centered on


the same specified fixed reference point.

205
206APPENDIX B. ORTHONORMAL EXPANSION AND VECTOR REPRESENTATION OF CON

The objects are no longer available and we want to determine if the


measurements from the two teams are in agreement or not. A third
team is then sent to measure the coordinates of the reference axis
axis grid of the second team with respect to the NSEW grid used by
the first team and they found:
√ √ √
(1/√ 3, 1/ √3, 1/ √3) Reference grid of shore line used
(− 2/3, √1/ 6, 1/
√ 6) by team 2 with NSEW reference
(0, 1/ 2, −1/ 2) used by team 1.

In order to determine whether or not the measurements agree, we


use the measurements of team 3 to draw the axis system of team
2 on top of the axis system used by team 1. We next take the
measurements from team 1 and project on the axis system used by
team 2 to calculate what team 2 should have measured. We then
only have to compare with what team 2 actually measured to see if
the measurements agree or not. The situation is sketched on figure
B.1.

Figure B.1:

B.1 Three-dimensional vectors in R3 :


Definition B.1.1. A basis in R3 is a set of three vectors {g⃗1 , g⃗2 , g⃗3 } such that
any vector ⃗v ∈ R3 may be written uniquely in the form a1 g⃗1 +a2 g⃗2 +a3 g⃗3 , ai ∈ R.

Let ⃗v denote a vector in R3 . We can express ⃗v as:

⃗v = v1⃗i + v2⃗j + v3⃗k


B.1. THREE-DIMENSIONAL VECTORS IN R3 : 207

where v1 , v2 , v3 ∈ R and ⃗i, ⃗j, ⃗k are the standard unit vectors which form a basis of
R3 . Since the above expansion is unique, the triplet (v1 , v2 , v3 ) can consequently
be used to represent ⃗v with respect to the basis and we write:

⃗v ←→ (v1 , v2 , v3 )

Similarly another vector w


⃗ maybe written as

⃗ ←→ (w1 , w2 , w3 )
w

meaning that
⃗ = w1⃗i + w2⃗j + w3⃗k .
w
Definition B.1.2. Let ⃗v = v1⃗i + v2⃗j + v3⃗k, w
⃗ = w1⃗i + w2⃗j + w3⃗k.

1. The bilinear map defined by:

⟨ , ⟩ : R3 × R3 → R
⟨ , ⟩ : (⃗v , w)
⃗ 7 → ⟨⃗v , w⟩
⃗ , v1 w1 + v2 w2 + v3 w3

is called dot product (aka scalar product or standard inner product) with
respect to the basis {⃗i, ⃗j, ⃗k}. One easily shows that:

⟨av⃗1 + bv⃗2 , v⃗3 ⟩ = a⟨v⃗1 , v⃗3 ⟩ + b⟨v⃗2 , v⃗3 ⟩


⟨v⃗1 , av⃗2 + bv⃗3 ⟩ = a⟨v⃗1 , v⃗2 ⟩ + b⟨v⃗1 , v⃗3 ⟩

for every a, b ∈ R and for every v⃗1 , v⃗2 , v⃗3 ∈ R3 .


2. The distance between ⃗v and w,
⃗ denoted by d(⃗v , w),
⃗ is defined by:

⃗ = (v1 − w1 )2 + (v2 − w2 )2 + (v3 − w3 )2
d(⃗v , w))

3. The length or norm of ⃗v is the distance between ⃗v and the origin; it is


denoted by |⃗v |.

The following are stated without proof (easy to do in 2 dimensions and it


generalizes to 3 dimensions).
Theorem 62.

⃗ = ⟨⃗v − w,
1. d(⃗v , w) ⃗ ⃗v − w⟩.


2. |⃗v | = ⟨⃗v , ⃗v ⟩.
3. ⟨⃗v , w⟩
⃗ = |⃗v | |w|
⃗ cos(∠⃗v − ∠w),
⃗ where ∠⃗v − ∠w
⃗ is the angle between vectors
⃗v and w.⃗
⟨⃗
v , w⟩

4. |w|
⃗ is the length of the projection of vector ⃗v on vector w.
⃗ In particular,
if |w|
⃗ = 1 then ⟨⃗v , w⟩
⃗ is the length of the projection of vector ⃗v on vector
w.

208APPENDIX B. ORTHONORMAL EXPANSION AND VECTOR REPRESENTATION OF CON

Definition B.1.3. A vector ⃗v is called normal if ⟨⃗v , ⃗v ⟩ = 1 and two vectors


⃗ are called orthogonal if ⟨⃗v , w⟩
⃗v , w ⃗ = 0. A set of vectors {v⃗1 , v⃗2 , . . . , v⃗n } is
called orthonormal if ⟨⃗
vi , v⃗i ⟩ = 1, ∀i and ⟨⃗
vi , v⃗j ⟩ = 0, ∀i ̸= j.

In light of theorem 62, we see that a vector is normal if its length is 1, and
⃗v and w ⃗ are orthogonal if the projection of ⃗v on w ⃗ (or vice-versa) has a length
of 0 as one would expect intuitively.
The basis {⃗i, ⃗j, ⃗k} used above is by definition orthonormal.1 The expansion
of ⃗v with respect to ⃗i, ⃗j, ⃗k is represented graphically in figure B.2.

Figure B.2:

Remark. Not every basis of R3 is orthonormal or even orthogonal; orthonormal-


ity is a property that some bases satisfy.
The following forms another orthonormal basis of R3 :2
√ √ √
w⃗1 = ⃗i/ 3 + ⃗j/ 3 + ⃗k/ 3
√ √ √
w⃗2 = −⃗i 2/3 + ⃗j/ 6 + ⃗k/ 6
√ √
w⃗3 = ⃗j/ 2 − ⃗k/ 2

One easily verifies that any two of w⃗1 , w⃗2 , w⃗3 are orthogonal and that each one
is normalized.

Example B.1.1. Let ⃗i, ⃗j, ⃗k denote the standard unit vectors and define:3

v⃗1 = ⃗i + ⃗j + ⃗k ←→ (1, 1, 1)
v⃗2 = ⃗j + ⃗k ←→ (0, 1, 1)
v⃗3 = ⃗j − ⃗k ←→ (0, 1, −1)
| {z }
with respect to the basis {⃗i, ⃗j, ⃗
k}

1 Thiscould correspond to the basis used by team #1 in our motivating example.


2 Thiscould correspond, in our motivating example, to the measurements taken by team
#3 of the basis used by team #2 in the coordinate system used by team #1.
3 This corresponds to the measurements taken by team #1 in our motivating example.
B.1. THREE-DIMENSIONAL VECTORS IN R3 : 209

Since {w⃗1 , w⃗2 , w⃗3 } forms a basis of R3 they can also be used to expand
v⃗1 , v⃗2 , v⃗3 . We would find (this will be explained below):4
√ √
v⃗1 = 3w⃗1√ √ ←→ ( 3,√
√ 0, 0)
v⃗2 = √ 2w⃗1 / 3 + w⃗2 2/3 ←→ (2/ 3, √ 2/3, 0)
v⃗3 = 2w⃗3 ←→ (0, 0, 2)
| {z }
with respect to the basis {w
⃗1 , w ⃗3 }
⃗2 , w

Not surprisingly, the representation of the same vectors v⃗1 , v⃗2 , v⃗3 depends on
the basis used to represent them, but the dot products are however independent
of the orthonormal basis used (and this is not a coincidence as we will see):
⟨⃗
vi , v⃗j ⟩ = ⟨⃗
vi , v⃗j ⟩ , ∀i, j = 1, 2, 3.
| {z } | {z }
wrt {⃗i, ⃗j, ⃗
k} wrt {w
⃗1 , w ⃗3 }
⃗2 , w

Why do we want a basis to be orthonormal? Because the orthonormality


makes it very easy to find the coordinates of a vector and because the dot
products and invariant under a change of orthonormal basis. A vector ⃗v is
expanded in an orthonormal basis {w⃗1 , w⃗2 , w⃗3 } as follows:
⃗v = ⟨⃗v , w⃗1 ⟩w⃗1 + ⟨⃗v , w⃗2 ⟩w⃗2 + ⟨⃗v , w⃗3 ⟩w⃗3
and the dot products in the above can be expanded with respect to any or-
thonormal basis.
Example B.1.2. The coordinates of v⃗1 , v⃗2 , v⃗3 in the orthonormal basis
{w⃗1 , w⃗2 , w⃗3 } from the previous example are given by:

⟨v⃗1 , w⃗1 ⟩ = 3 ⟨v⃗3 , w⃗1 ⟩ = 0
⟨v⃗1 , w⃗2 ⟩ = 0 ⟨v⃗3 , w⃗2 ⟩ = √
0
⟨v⃗1 , w⃗3 ⟩ = 0 √ ⟨v⃗3 , w⃗3 ⟩ = 2
⟨v⃗2 , w⃗1 ⟩ = 2/
√ 3
⟨v⃗2 , w⃗2 ⟩ = 2/3
⟨v⃗2 , w⃗3 ⟩ = 0
This proves that the measurements taken by teams #1 and #2 are in agreement.
In general the expansion of a vector ⃗v in a basis that is not orthonormal is
not as easily done.

Why can’t we just use the standard unit vector basis since it is or-
thonormal? Why do we need other orthonormal bases? In R3 not
much is to be gained by using different orthonormal bases, but if we have a
small number of vectors with many components then some orthonormal bases
may give a very simple representation of the vectors. This is illustrated in the
next example.
4 This could correspond to the measurements taken by team #2 in our motivating example,

assuming that their measurements are in agreement with those made by team #1.
210APPENDIX B. ORTHONORMAL EXPANSION AND VECTOR REPRESENTATION OF CON

Example B.1.3. Let

v⃗1 = (−1, 2, 1, 0, 0, 2, −1, 0, −2, 0)


v⃗2 = (1, −2, 0, 1, 2, 2, −2, −1, −2, 2)
v⃗3 = (−3, 6, 1, −2, −4, −2, 3, 2, 2, −4)

with respect to the standard unit vector basis {i⃗1 , i⃗2 , . . . , i⃗10 } in R10 . The vectors

w⃗1 = (−1, 2, 1, 0, 0, 2, −1, 0, −2, 0)/ 15

w⃗2 = (2, −4, −1/2, 3/2, 3, 2, −5/2, −3/2, −2, 3)/ 57

form an orthonormal basis (they obviously generate a subspace of R10 ) and we


have:

v⃗1 = ( 15, 0)
√ √
v⃗2 = ( 15/3, 2 57/3) ≈ (1.291, 5.033)
√ √
v⃗3 = ( 15/3, −4 57/3) ≈ (1.291, −10.066)

with respect to the basis {w⃗1 , w⃗2 } and we may represent the vectors graphically
as in figure B.3. Clearly the representation of v⃗1 , v⃗2 , v⃗3 in the basis {w⃗1 , w⃗2 }
is much simpler than the representation in the standard unit vector basis.

Figure B.3:

The Gram-Schmidt orthonormalization procedure is used to compute an or-


thonormal basis from a given set of vectors. It involves the following steps,
starting from a finite set of vectors v⃗1 , v⃗2 , . . . , v⃗N represented in an orthonor-
mal basis.

Step 1 w⃗1 = normalize v⃗1 (assuming that v⃗1 ̸= 0).

Step 2 w⃗2 = normalize v⃗2 − ⟨v⃗2 , w⃗1 ⟩w⃗1 if different than 0. This operation is
represented graphically in figure B.4.
B.2. VECTORS OBTAINED BY SAMPLING OF A WAVEFORM 211

Figure B.4:

Step 3 w⃗3 = normalize v⃗3 − ⟨v⃗3 , w⃗1 ⟩w⃗1 − ⟨v⃗3 , w⃗2 ⟩w⃗2 if different than 0.
Step 4 . . . and so on . . .

There are as many steps as there are (non-zero) vectors in the given set. The
vectors w⃗1 , w⃗2 , . . . are orthonormal. The number of (non-zero) orthonormal
vectors is less than or equal to the number of vectors in the given set.

B.2 Vectors obtained by sampling of a waveform


Signals sampled on a time interval of finite duration produce vectors possibly
consisting of very large number of components. The vector components are
sketched on a discretized time axis indicating the approximate shape of the
signal from which the samples were taken.
Example B.2.1. From a previous example with v⃗1 , v⃗2 , v⃗3 ∈ R10 , the compo-
nents are sketched on a discretized time axis (with linear interpolation) in figure
B.5(a).
Let v⃗i ∈ RN , i = 1, 2, . . . , M denote M vectors obtained by sampling of
waveforms and represented as:

v⃗i = (vi1 , vi2 , . . . , viN )

with respect to a standard unit vector basis. We define the functions

vi : {1, 2, . . . , N } → R
vi : n 7→ vi (n) = vi n

so that we may write v⃗i = (vi (1), vi (2), . . . , vi (N )) in the standard unit vector
basis. The Gram-Schmidt procedure applied to the v⃗i returns L ≤ M orthonor-
mal vectors

w⃗j = (wj1 , wj2 , . . . , wjN )


= (wj (1), wj (2), . . . , wj (N ))
212APPENDIX B. ORTHONORMAL EXPANSION AND VECTOR REPRESENTATION OF CON

in the (same) standard unit vector basis such that:5


L
v⃗i = ⟨⃗
vi , w⃗j ⟩w⃗j ,
j=1

i = 1, 2, . . . , M . This can equivalently be written as



L
vi (n) = cij wj (n) ,
j=1
∑N
n = 1, . . . , N , where cij = ⟨⃗ vi , w⃗j ⟩ = k=1 vi (k)wj (k). In this context,
the Gram-Schmidt procedure obtains L ≤ M orthonormal discrete signals wj (n)
from M discrete signals vi (n) such that any vi (n) can be expressed as a (unique)
linear combination of the wj (n).
As previously remarked in example B.1.1, the dot product of vectors is in-
variant under a change of orthonormal basis. In the present context we have:


N ∑
L
vi (n)vk (n) = cij ckj
n=1 j=1
| {z } | {z }
wrt standard unit vectors wrt the basis {w
⃗j }

Example B.2.2. Three signals v1 (n), v2 (n), v3 (n) are sketched on figure B.5(a).
The signals w1 (n), w2 (n) sketched in figure B.5(b) form an orthonormal ba-
sis suitable to represent v1 (n), v2 (n), v3 (n) (they have been obtained by ap-
plying the Gram-Schmidt procedure). v1 (n), v2 (n), v3 (n) can then be repre-
sented as a point in a Cartesian plane similarly to the sketch in figure B.3.
By a proper choice of the linear combination, v1 (n), v2 (n), v3 (n) can be recov-

ered from
√ w1 (n), w2 (n)√as shown in figure √ B.5(c) by the √sketches of w1 (n) 15,
w1 (n) 15/3 + 2w2 (n) 57/3 and w1 (n) 15/3 − 4w2 (n) 57/3. Finally the in-
variance of the dot product to the choice of the orthonormal basis is illustrated
below with signals v1 (n), v2 (n):


10
⟨v1 (n), v2 (n)⟩ = v1 (n)v2 (n) = 5
n=1
√ √ √ 15
⟨( 15, 0), ( 15/3, 2 57/3)⟩ = +0=5
3

B.3 Effect of sampling frequency


A normalization of the dot product is required in order to make it independent
of the sampling frequency (as it should be as long as the Nyquist sampling
criterion is not violated). To see why, consider a continuous signal s(t) and the
two discrete sampled signals
5 and where the functions wj ( ) are defined similarly to the functions vi ( )
(a) Sketch of three signals obtained from the sampling of three wave- (b) Orthonormal basis signals
forms
B.3. EFFECT OF SAMPLING FREQUENCY
213

(c) Signals recovery from the orthonormal basis

Figure B.5: Example for Gram-Schmidt procedure with the sampling of a waveform
214APPENDIX B. ORTHONORMAL EXPANSION AND VECTOR REPRESENTATION OF CON

( ( )) N
n
v⃗1 = s
fs n=1
( ( )
n )
2N
v⃗2 = s
2fs n=1

Both v⃗1 , v⃗2 contain samples of s(t) for 0 < t < N/fs , but v⃗2 contains twice as
many samples. Consequently:

⟨v⃗1 , v⃗1 ⟩ ≈ ⟨v⃗2 , v⃗2 ⟩/2

In order for the dot product of discrete signals to be independent of the sampling
frequency we define:
∑ s1 (n)s2 (n)
⟨s1 (n), s2 (n)⟩ ,
n
fs

as the dot product of two discrete signals sampled at the rate fs (both signals
need to be sampled at the same rate fs so that they both have the same number
of samples in any given time interval).
We now have two different dot products, and there will not be any ambigu-
ities by the context:

• dot product through the (discrete) time n; requires normalization by the
sampling frequency fs ,

• dot product through the basis $\{w_j(n)\}_{j=1}^{L}$; does not require the
normalization by fs .

B.4 Continuous time signals


Intuitively we take the limit as $f_s \to \infty$, so that $1/f_s \approx dt \to 0$ and $\sum \to \int$:
\[
\langle s_1(t), s_2(t)\rangle = \int_{-\infty}^{\infty} s_1(t)\, s_2(t)\, dt
\]

is the dot product between two signals. The concepts of distance, norm and
projection introduced in definition B.1.2 and theorem 62 naturally extend to
signals. We then have:

1. $d(s_1(t), s_2(t)) = \sqrt{\langle s_1(t) - s_2(t),\, s_1(t) - s_2(t)\rangle}$;

2. $|s_1(t)| = \sqrt{\langle s_1(t), s_1(t)\rangle}$;

3. $\dfrac{\langle s_1(t), s_2(t)\rangle}{|s_2(t)|}$ is the length of the projection of signal s1 (t) on signal s2 (t). In
particular, if |s2 (t)| = 1 then ⟨s1 (t), s2 (t)⟩ is the length of the projection
of signal s1 (t) on signal s2 (t);

4. s1 (t) and s2 (t) are orthonormal if:


\[
\int_{-\infty}^{\infty} s_1(t)\, s_2(t)\, dt = 0 ,
\qquad
\int_{-\infty}^{\infty} s_1^2(t)\, dt = \int_{-\infty}^{\infty} s_2^2(t)\, dt = 1
\]
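These notions can be approximated numerically by replacing the integral by a Riemann sum. The sketch below (the observation window [0, 2π], the step dt and the test signals are illustrative choices) checks orthonormality, the projection property and the distance:

\begin{verbatim}
import numpy as np

T, dt = 2 * np.pi, 1e-4
t = np.arange(0.0, T, dt)

def dot(x, y):                 # <x, y> ~ integral of x(t) y(t) dt over [0, T]
    return np.sum(x * y) * dt

def norm(x):                   # |x| = sqrt(<x, x>)
    return np.sqrt(dot(x, x))

phi1 = np.sin(t) / np.sqrt(np.pi)             # unit-energy signals on [0, 2*pi]
phi2 = np.cos(t) / np.sqrt(np.pi)

print(dot(phi1, phi2))                        # ~ 0 : phi1 and phi2 are orthogonal
print(norm(phi1), norm(phi2))                 # ~ 1 : each has unit norm
s = 3.0 * phi1 + 4.0 * phi2
print(dot(s, phi1), dot(s, phi2))             # ~ 3, 4 : projections recover the coefficients
print(norm(s - 3.0 * phi1 - 4.0 * phi2))      # ~ 0 : distance from s to its expansion
\end{verbatim}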

The Gram-Schmidt orthonormalization procedure for continuous signals follows
the same steps as those previously explained on page 210:

Step 1 ϕ1 (t) = normalize s1 (t), provided s1 (t) ̸= 0.

Step 2 ϕ2 (t) = normalize s2 (t) − ⟨s2 (t), ϕ1 (t)⟩ϕ1 (t), provided this difference is nonzero.

Step 3 ϕ3 (t) = normalize s3 (t) − ⟨s3 (t), ϕ1 (t)⟩ϕ1 (t) − ⟨s3 (t), ϕ2 (t)⟩ϕ2 (t), provided this
difference is nonzero.

Step 4 . . . and so on . . .

Example B.4.1. (W&J, pp 269 - 272) Find an orthonormal basis for the signals
s1 (t), s2 (t), s3 (t) and s4 (t) sketched in figure B.6 and restricted to the interval
t ∈ [0, 3].

Figure B.6: The signals s1 (t), s2 (t), s3 (t), s4 (t) of example B.4.1.

Solution. Apply the Gram-Schmidt orthonormalization procedure to the signals.

Step 1. Since s1 (t) ̸= 0 we let θ1 (t) = s1 (t) and $\phi_1(t) = \theta_1(t)\big/\sqrt{\langle \theta_1(t), \theta_1(t)\rangle}$, where:
\[
\langle \theta_1(t), \theta_1(t)\rangle = \langle s_1(t), s_1(t)\rangle = \int_0^3 s_1^2(t)\, dt
= \int_0^1 (2)^2\, dt + \int_1^2 (-2)^2\, dt + \int_2^3 (2)^2\, dt = 12
\]

ϕ1 (t) is sketched on figure B.7.

Figure B.7: The orthonormal function ϕ1 (t).

Step 2. We need to calculate θ2 (t) = s2 (t) − ⟨s2 (t), ϕ1 (t)⟩ϕ1 (t):


\[
\langle s_2(t), \phi_1(t)\rangle = \int_0^1 \frac{-1}{\sqrt{3}}\, dt + \int_1^2 \frac{-3}{\sqrt{3}}\, dt + \int_2^3 \frac{1}{\sqrt{3}}\, dt
= \frac{-3}{\sqrt{3}} = -\sqrt{3}
\]
It follows that:
\[
\theta_2(t) = s_2(t) - \langle s_2(t), \phi_1(t)\rangle\, \phi_1(t) = s_2(t) + \sqrt{3}\, \phi_1(t) \neq 0
\]
θ2 (t) is sketched on figure B.8 together with $\phi_2(t) = \theta_2(t)\big/\sqrt{\langle \theta_2(t), \theta_2(t)\rangle}$, where
we easily calculate ⟨θ2 (t), θ2 (t)⟩ = 4 + 4 = 8.
Step 3. We need to calculate θ3 (t) = s3 (t) − ⟨s3 (t), ϕ1 (t)⟩ϕ1 (t) −
⟨s3 (t), ϕ2 (t)⟩ϕ2 (t):
\[
\langle s_3(t), \phi_1(t)\rangle = 1/\sqrt{3} + 2/\sqrt{3} + 0 = \sqrt{3} ,
\qquad
\langle s_3(t), \phi_2(t)\rangle = 0 + (-\sqrt{2}) + 0 = -\sqrt{2}
\]
and as illustrated on figure B.9 we obtain θ3 (t) = 0. It is then not necessary
to introduce a third function ϕ3 (t) at this point.

Figure B.8: The functions θ2 (t) and ϕ2 (t).

Step 4. We need to calculate θ4 (t) = s4 (t) − ⟨s4 (t), ϕ1 (t)⟩ϕ1 (t) − ⟨s4 (t), ϕ2 (t)⟩ϕ2 (t):
\[
\langle s_4(t), \phi_1(t)\rangle = -1/\sqrt{3} + 1/\sqrt{3} - \sqrt{3} = -\sqrt{3} ,
\qquad
\langle s_4(t), \phi_2(t)\rangle = 0 - 1/\sqrt{2} - 3/\sqrt{2} = -2\sqrt{2}
\]
and as illustrated on figure B.10 we obtain θ4 (t) = 0. Again, it is not
necessary to introduce a third function ϕ3 (t).
The signals s1 (t), s2 (t), s3 (t), s4 (t) are represented in the orthonormal basis
{ϕ1 (t), ϕ2 (t)} as in figure B.11.
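The example can also be verified numerically. Since the waveforms appear only in the figures, the sketch below assumes, consistently with the integrals computed in steps 1–4, that s1 (t), . . . , s4 (t) are piecewise constant on the unit intervals [0, 1), [1, 2), [2, 3) with the values listed in the array S; these values are an inference from the computations above, not a reproduction of the figures.

\begin{verbatim}
import numpy as np

# assumed piecewise-constant values of s1..s4 on [0,1), [1,2), [2,3),
# inferred from the integrals computed in steps 1-4 above
S = np.array([[ 2.0, -2.0,  2.0],
              [-1.0,  3.0,  1.0],
              [ 1.0, -2.0,  0.0],
              [-1.0, -1.0, -3.0]])

def dot(x, y):                # <x, y> = integral over [0, 3]; unit intervals, so a plain sum
    return float(np.sum(x * y))

basis = []                    # Gram-Schmidt
for s in S:
    theta = s - sum(dot(s, phi) * phi for phi in basis)
    if dot(theta, theta) > 1e-12:
        basis.append(theta / np.sqrt(dot(theta, theta)))

print(len(basis))             # 2 : only phi1(t) and phi2(t) are needed
print([[round(dot(s, phi), 4) for phi in basis] for s in S])
# [[3.4641, 0.0], [-1.7321, 2.8284], [1.7321, -1.4142], [-1.7321, -2.8284]]
# i.e. (2*sqrt(3), 0), (-sqrt(3), 2*sqrt(2)), (sqrt(3), -sqrt(2)), (-sqrt(3), -2*sqrt(2)),
# consistent with the inner products computed in the example
\end{verbatim}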

The findings are summarized in the following.


Theorem 63. Any set of M finite energy signals si (t), i = 0, 1, . . . , M − 1 can
be expressed as:


\[
s_i(t) = \sum_{j=1}^{N} s_{ij}\, \phi_j(t), \qquad i = 0, 1, \dots, M - 1,
\]

where ϕj (t) are N appropriately chosen signals and N ≤ M . The set of signals
{ϕj (t)} is not unique (for example, the Gram-Schmidt procedure may give a
different set of orthonormal functions if the signals are simply reordered). The
Gram-Schmidt procedure yields a minimal set of orthonormal functions, the
cardinality of which is called the dimensionality of the set of signals. The coefficients
sij in the above equation are given by:
\[
s_{ij} = \langle s_i(t), \phi_j(t)\rangle = \int_{-\infty}^{\infty} s_i(t)\, \phi_j(t)\, dt
\]

Figure B.9: Computation of θ3 (t) (the result is θ3 (t) = 0).

We also have

\[
\langle s_i(t), s_k(t)\rangle = \sum_{j=1}^{N} \langle s_i(t), \phi_j(t)\rangle\, \langle s_k(t), \phi_j(t)\rangle
\]
∀ i, k ∈ {0, 1, . . . , M − 1}, and in particular (Parseval’s relationship)
\[
\langle s_i(t), s_i(t)\rangle = \sum_{j=1}^{N} \langle s_i(t), \phi_j(t)\rangle^2
= \int_{-\infty}^{\infty} s_i^2(t)\, dt \;\equiv\; \text{energy of } s_i(t)
\]

Moreover, for any finite energy signal r(t) we have:



\[
\langle r(t), s_i(t)\rangle = \sum_{j=1}^{N} \langle r(t), \phi_j(t)\rangle\, \langle s_i(t), \phi_j(t)\rangle
\]
and more generally for any finite energy signal s(t) lying in the space generated
by the orthonormal functions ϕ1 (t), . . . , ϕN (t), we also have:
\[
\langle r(t), s(t)\rangle = \sum_{j=1}^{N} \langle r(t), \phi_j(t)\rangle\, \langle s(t), \phi_j(t)\rangle
\]
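A quick numerical check of these identities (a sketch only; the discretization step dt, the number of samples and the random test signals are illustrative choices, and NumPy's QR factorization is merely a convenient way to manufacture orthonormal functions):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
dt, K, N, M = 1e-3, 3000, 4, 3               # K samples spanning [0, 3), N basis functions

def dot(x, y):                               # <x, y> ~ integral of x(t) y(t) dt
    return np.sum(x * y) * dt

Q, _ = np.linalg.qr(rng.standard_normal((K, N)))
phi = (Q / np.sqrt(dt)).T                    # rows phi[j] satisfy <phi_i, phi_j> = delta_ij

coeff = rng.standard_normal((M, N))          # coefficients s_ij
s = coeff @ phi                              # s_i(t) = sum_j s_ij phi_j(t)

i, k = 0, 1
print(np.isclose(dot(s[i], s[k]), np.dot(coeff[i], coeff[k])))   # True
print(np.isclose(dot(s[i], s[i]), np.sum(coeff[i] ** 2)))        # Parseval: True

r = rng.standard_normal(K)                   # an arbitrary finite-energy signal r(t)
proj = np.array([dot(r, p) for p in phi])    # <r(t), phi_j(t)>
print(np.isclose(dot(r, s[i]), np.dot(proj, coeff[i])))          # True
\end{verbatim}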

Figure B.10: Computation of θ4 (t) (the result is θ4 (t) = 0).

Theorem 64. Let nw (t) be a 0-mean white Gaussian noise with power spectral
density
Snw (f ) = N0 /2

and define

\[
n_j = \langle n_w(t), \phi_j(t)\rangle, \qquad j = 1, 2, \dots, N
\]
\[
n(t) = \sum_{j=1}^{N} n_j\, \phi_j(t), \qquad r_2(t) = n_w(t) - n(t)
\]

where ϕj (t), j = 1, 2, . . . , N are N orthonormal functions. Then n(t) and r2 (t)


are jointly Gaussian statistically independent 0-mean random processes.

Sketch of proof. We assume that expectations and integrals are freely inter-
changeable.

Figure B.11: Representation of s1 (t), s2 (t), s3 (t), s4 (t) in the orthonormal basis {ϕ1 (t), ϕ2 (t)}.

1. We have:
\[
n_j = \int_{-\infty}^{\infty} n_w(t)\, \phi_j(t)\, dt \;\equiv\; \text{linear operations on the Gaussian process } n_w(t)
\]

By theorem 44 (or a special case of theorem 44) we conclude that nj is a


Gaussian random variable, ∀ j = 1, 2, . . . , N .

2. The nj ’s are Gaussian random variables resulting from linear operations


on the same Gaussian process nw (t). It follows that the random vector

(n1 , n2 , . . . , nN )

is a Gaussian random vector.

3. Samples of the process n(t) at times t1 , . . . , tk may be expressed as:


 
\[
(n(t_1), \dots, n(t_k)) = \underbrace{(n_1, \dots, n_N)}_{\text{Gaussian random vector}} \times
\underbrace{\begin{pmatrix}
\phi_1(t_1) & \phi_1(t_2) & \dots & \phi_1(t_k) \\
\phi_2(t_1) & \phi_2(t_2) & \dots & \phi_2(t_k) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_N(t_1) & \phi_N(t_2) & \dots & \phi_N(t_k)
\end{pmatrix}}_{N \times k \text{ matrix of constants}}
\]

It follows that (n(t1 ), n(t2 ), . . . , n(tk )) is a Gaussian random vector


∀t1 , t2 , . . . , tk and ∀k; n(t) is therefore a Gaussian random process.

4. r2 (t) = nw (t)−n(t) where nw (t), n(t) are both Gaussian random processes
and therefore r2 (t) is also a Gaussian random process.

5. Both n(t), r2 (t) result from operations on the same random process nw (t);
n(t) and r2 (t) are then jointly Gaussian random processes.

6. In order to show that n(t) and r2 (t) are 0-mean we first calculate:
\[
E(n_j) = E\big( \langle n_w(t), \phi_j(t)\rangle \big) = \langle E(n_w(t)), \phi_j(t)\rangle = \langle 0, \phi_j(t)\rangle = 0
\]
It follows that
\[
E(n(t)) = E\Big( \sum_{j=1}^{N} n_j\, \phi_j(t) \Big) = \sum_{j=1}^{N} \underbrace{E(n_j)}_{0,\ \forall j}\, \phi_j(t) = 0
\]
\[
E(r_2(t)) = E\big( n_w(t) - n(t) \big) = E(n_w(t)) - E(n(t)) = 0 - 0 = 0
\]

Thus n(t) and r2 (t) are jointly Gaussian 0-mean random processes. Even
though this is not required at this time, we next calculate E(ni nj ); we’ll
need this shortly:


\begin{align*}
E(n_i n_j) &= E\big( \langle n_w(t), \phi_i(t)\rangle\, \langle n_w(t), \phi_j(t)\rangle \big) \\
&= E\left( \int_{-\infty}^{\infty} n_w(t)\,\phi_i(t)\, dt \int_{-\infty}^{\infty} n_w(s)\,\phi_j(s)\, ds \right) \\
&= E\left( \iint_{-\infty}^{\infty} n_w(t)\, n_w(s)\, \phi_i(t)\,\phi_j(s)\, dt\, ds \right) \\
&= \iint_{-\infty}^{\infty} \underbrace{E\big( n_w(t)\, n_w(s) \big)}_{R_{n_w}(t-s) = \frac{N_0}{2}\,\delta(t-s)} \phi_i(t)\,\phi_j(s)\, dt\, ds \\
&= \frac{N_0}{2} \int_{-\infty}^{\infty} \phi_i(t) \underbrace{\left( \int_{-\infty}^{\infty} \phi_j(s)\, \delta(t-s)\, ds \right)}_{\phi_j(t)} dt \\
&= \frac{N_0}{2} \int_{-\infty}^{\infty} \phi_i(t)\,\phi_j(t)\, dt
 = \frac{N_0}{2}\, \langle \phi_i(t), \phi_j(t)\rangle
 = \begin{cases} 0 & \text{if } i \neq j \\[2pt] \dfrac{N_0}{2} & \text{if } i = j \end{cases}
 = \frac{N_0}{2}\, \delta_{i,j}
\end{align*}
where we define
\[
\delta_{i,j} \triangleq \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}
\]
for convenience.

7. By theorem 48 processes n(t) and r2 (t) are statistically independent if and


only if Ln, r2 (t, s) = 0, ∀ t, s, or equivalently if and only if Rn, r2 (t, s) =
0, ∀ t, s, since n(t) and r2 (t) are 0-mean; this is proved below. First we
have:
\begin{align*}
R_{n, r_2}(t, s) &= E\big( n(t)\, r_2(s) \big) \\
&= E\Big( \Big( \sum_{j=1}^{N} n_j\, \phi_j(t) \Big) \big( n_w(s) - n(s) \big) \Big) \\
&= E\Big( \sum_{j=1}^{N} \phi_j(t)\, n_j \big( n_w(s) - n(s) \big) \Big) \\
&= \sum_{j=1}^{N} \phi_j(t)\, E\Big( n_j \big( n_w(s) - n(s) \big) \Big) \tag{B.1}
\end{align*}

The expected value in equation (B.1) is evaluated below:


\begin{align*}
E\Big( n_j \big( n_w(s) - n(s) \big) \Big) &= E\big( n_j\, n_w(s) \big) - E\big( n_j\, n(s) \big) \\
&= E\big( \langle n_w(t), \phi_j(t)\rangle\, n_w(s) \big) - E\Big( n_j \sum_{i=1}^{N} n_i\, \phi_i(s) \Big) \\
&= \big\langle E\big( n_w(t)\, n_w(s) \big), \phi_j(t) \big\rangle - \sum_{i=1}^{N} E(n_j n_i)\, \phi_i(s) \\
&= \Big\langle \frac{N_0}{2}\, \delta(t - s), \phi_j(t) \Big\rangle - \sum_{i=1}^{N} \frac{N_0}{2}\, \delta_{j,i}\, \phi_i(s) \\
&= \frac{N_0}{2} \int_{-\infty}^{\infty} \delta(t - s)\, \phi_j(t)\, dt - \frac{N_0}{2}\, \phi_j(s) \\
&= \frac{N_0}{2}\, \phi_j(s) - \frac{N_0}{2}\, \phi_j(s) = 0
\end{align*}
(the inner products above are taken over the variable t, with s held fixed, and the expectation is kept explicitly at each step).

Substituting the latter in equation (B.1) yields Rn, r2 (t, s) = 0, ∀ t, s.


This concludes the proof that n(t), r2 (t) are jointly Gaussian statistically
independent 0-mean random processes.
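The statement of theorem 64, and in particular the computation E(ni nj ) = (N0 /2) δi,j , can be checked by simulation. The sketch below approximates the white noise nw (t) on a grid of step dt by independent Gaussian samples of variance (N0 /2)/dt; the values of N0 , dt, the number of trials and the randomly generated orthonormal functions are illustrative choices only.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
N0, dt, K, N, trials = 2.0, 1e-2, 300, 2, 20000

# two orthonormal functions on the grid (orthonormal w.r.t. the sum-times-dt inner product)
Q, _ = np.linalg.qr(rng.standard_normal((K, N)))
phi = (Q / np.sqrt(dt)).T

# one realization of the (approximated) white noise per row
nw = rng.standard_normal((trials, K)) * np.sqrt(N0 / (2 * dt))

coeffs = nw @ phi.T * dt                   # n_j = <n_w, phi_j>, one row per trial
print(coeffs.T @ coeffs / trials)          # ~ (N0/2) I = diag(1, 1)

n = coeffs @ phi                           # n(t)  = sum_j n_j phi_j(t)
r2 = nw - n                                # r2(t) = n_w(t) - n(t)
print(np.mean(n[:, 10] * r2[:, 200]))      # ~ 0 (up to Monte Carlo error)
print(np.mean(n[:, 50] * r2[:, 50]))       # ~ 0 (up to Monte Carlo error)
\end{verbatim}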
