
LECTURE NOTES

AN INTRODUCTION TO DIGITAL COMMUNICATIONS

Armand M. Makowski

© 1997-2011 by Armand M. Makowski
Department of Electrical and Computer Engineering, and Institute for Systems Research,
University of Maryland, College Park, MD 20742. E-mail: [email protected]. Phone: (301) 405-6844
Part I

Preliminaries

Chapter 1

Decision Theory

In this chapter we present the basic ideas of statistical decision theory that will
be used repeatedly in designing optimal receivers in a number of settings. These
design problems can all be reduced to problems of M-ary hypothesis testing, which
we investigate below in generic form.

1.1 The generic hypothesis testing problem


In the statistical hypothesis testing problem, a decision has to be made as to which
of several possible hypotheses (or states of nature) is the correct one. The state
of nature is encoded in a rv H and a decision has to be made on the basis of an
Rd -valued observation vector X which is statistically related to H. Given that a
cost is incurred for making decisions, the decision-maker seeks to determine the
best decision to be implemented. Although several formulations are available
in the literature, here we concentrate on the Bayesian formulation.

1.1.1 The Bayesian model


Let 𝓗 denote a finite set with M elements for some positive integer M ≥ 2,
say 𝓗 := {1, . . . , M} for the sake of concreteness. The rv H takes values in 𝓗
according to the pmf

    p_m := P[H = m],   m = 1, . . . , M.

This pmf p = (p_1, . . . , p_M) is often called the prior on H.


With each of the possible hypotheses m = 1, . . . , M, we associate a probability
distribution function F_m on R^d with the interpretation that F_m is the conditional
distribution of X given H = m, i.e.,

    P[X ≤ x | H = m] = F_m(x),   x ∈ R^d.

The observation rv X is then distributed according to

    P[X ≤ x] = Σ_{m=1}^{M} p_m F_m(x),   x ∈ R^d

by the Law of Total Probabilities, while

    P[X ≤ x, H = m] = p_m F_m(x),   x ∈ R^d, m = 1, . . . , M.

In other words, the conditional probability distribution of the observations given
the hypothesis and the probability distribution of H completely specify the joint
distribution of the rvs H and X.

1.1.2 The optimization problem


On observing the observation vector, the decision-maker implements a decision
rule which returns a state of nature in response to this observation. Thus, an
(admissible) decision rule or detector¹ is simply any mapping d : R^d → 𝓗.² In
the language of Estimation Theory, the mapping d : R^d → 𝓗 can be interpreted as
an estimator for H (on the basis of X), with d(X) representing the corresponding
estimate Ĥ of H (on the basis of X). Let D denote the class of all (admissible)
detection rules.
    As a cost is incurred for making decisions, we introduce the mapping C :
𝓗 × 𝓗 → R with the interpretation that

    C(m, k) = Cost incurred for deciding k when H = m

for all k, m = 1, . . . , M. The use of any admissible rule d in D thus incurs a cost
C(H, d(X)). However, the value of the cost C(H, d(X)) is not available to the
decision-maker³ and attention focuses instead on the expected cost J : D → R
defined by

    J(d) := E[C(H, d(X))],   d ∈ D.

The Bayesian M-ary hypothesis testing problem (P_B) is now formulated as

    (P_B) : Minimize J over the collection D of admissible decision rules.

Solving problem (P_B) amounts to identifying detector(s) d* : R^d → 𝓗 such that

    J(d*) ≤ J(d),   d ∈ D.

Any detector d* : R^d → 𝓗 which minimizes the expected cost is referred to as an
optimal detector.
    The problem (P_B) can be solved for arbitrary cost functions C under fairly
weak assumptions on the distributions F_1, . . . , F_M. Throughout, to simplify matters
somewhat, we assume that for each m = 1, . . . , M, the distribution function
F_m admits a density f_m on R^d, i.e.,

    F_m(x) = ∫_{-∞}^{x_1} . . . ∫_{-∞}^{x_d} f_m(t) dt_1 . . . dt_d,   x = (x_1, . . . , x_d) ∈ R^d.

This assumption is enforced in all cases considered here.

¹ In the statistical literature on Hypothesis Testing such a detector is often called a test, while
in the context of Digital Communications a detector is often referred to as a receiver, for reasons
that will shortly become apparent. We shall follow this tradition in due time!
² Strictly speaking, the definition of an admissible rule should include the property that each of
the sets {x ∈ R^d : d(x) = m}, m = 1, . . . , M, be a Borel subset of R^d.


Rather than discussing the case of a general cost function, we will instead
focus on a special case of paramount importance to Digital Communications. This
occurs when C takes the form

    (1.1)   C(m, k) = 1 if m ≠ k,  0 if m = k,   k, m = 1, . . . , M

and the expected cost reduces to the so-called probability of error

    (1.2)   Er(d) := P[d(X) ≠ H],   d ∈ D.

Versions of the problem with cost (1.1)–(1.2) will be extensively discussed in this
text. The remainder of the discussion assumes this cost structure.
³ Indeed the value of H is not known; in fact it needs to be estimated!

1.2 Identifying the optimal detector


As the first step in solving the problem (P_B), we argue now as to the form of the
optimal detector. We begin by noting that any detector d : R^d → 𝓗 is equivalent
to a partition (Γ_1, . . . , Γ_M) of R^d, that is, a collection of subsets of R^d such that

    Γ_m ∩ Γ_k = ∅,   k ≠ m, k, m = 1, . . . , M

with

    R^d = ∪_{m=1}^{M} Γ_m.

Indeed, any detector d : R^d → 𝓗 induces a partition (Γ_1, . . . , Γ_M) of R^d by
setting

    Γ_m = {x ∈ R^d : d(x) = m},   m = 1, . . . , M.

Conversely, with any partition (Γ_1, . . . , Γ_M) of R^d we can associate a detector
d : R^d → 𝓗 through the correspondence

    d(x) = m if x ∈ Γ_m,   m = 1, . . . , M.

Start with a detector d : R^d → 𝓗 with induced partition (Γ_1, . . . , Γ_M) as
above. We have

    P[d(X) = H] = Σ_{m=1}^{M} p_m P[d(X) = m | H = m]
                = Σ_{m=1}^{M} p_m P[X ∈ Γ_m | H = m]
                = Σ_{m=1}^{M} p_m ∫_{Γ_m} f_m(x) dx.

As we seek to minimize the probability of error, we conclude that it suffices to
maximize

    F(Γ_1, . . . , Γ_M) := Σ_{m=1}^{M} p_m ∫_{Γ_m} f_m(x) dx
                        = ∫_{R^d} ( Σ_{m=1}^{M} 1[x ∈ Γ_m] p_m f_m(x) ) dx

with respect to partitions (Γ_1, . . . , Γ_M) of R^d.


Inspection of the functional F suggests a possible candidate for optimality:
For each m = 1, . . . , M, set

    Γ*_m := {x ∈ R^d : p_m f_m(x) = max_{k=1,...,M} p_k f_k(x)}

with tie breakers if necessary. For sake of concreteness, ties are broken according
to the lexicographic order, i.e., if at point x, it holds that

    p_i f_i(x) = max_{k=1,...,M} p_k f_k(x) = p_j f_j(x)

for distinct values i and j, then x will be assigned to Γ*_i if i < j. With such
precautions, these sets form a partition (Γ*_1, . . . , Γ*_M) of R^d, and the detector
d* : R^d → 𝓗 associated with this partition takes the form

    (1.3)   d*(x) = m iff p_m f_m(x) = max_{k=1,...,M} p_k f_k(x),   x ∈ R^d

with a lexicographic tie-breaker, or more compactly,

    d*(x) = arg max (m = 1, . . . , M : p_m f_m(x)),   x ∈ R^d.

We shall often write that d* prescribes

    (1.4)   Ĥ = m   iff   p_m f_m(x) largest

with the interpretation that upon collecting the observation vector x, the detector
d* selects the state of nature m as its estimate on the basis of x.
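
The rule (1.3) is straightforward to implement once the priors p_m and the densities f_m can be evaluated pointwise. The following Python sketch is only an illustration (it is not part of the original notes); the two Gaussian hypotheses used in the demonstration are assumptions chosen for concreteness.

import numpy as np
from scipy.stats import multivariate_normal

def map_detect(x, priors, densities):
    """Return the index m (1-based) maximizing p_m * f_m(x).

    Ties are broken lexicographically, as in (1.3): np.argmax returns
    the first index achieving the maximum.
    """
    scores = [p * f(x) for p, f in zip(priors, densities)]
    return int(np.argmax(scores)) + 1

# Illustration with M = 2 Gaussian hypotheses on R^2 (assumed densities).
priors = [0.7, 0.3]
densities = [multivariate_normal(mean=[0.0, 0.0]).pdf,
             multivariate_normal(mean=[2.0, 1.0]).pdf]
print(map_detect(np.array([1.5, 0.8]), priors, densities))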

1.3 The detector d* is optimal


That the guess (1.4) is indeed correct forms the content of the next proposition:

Theorem 1.3.1 The detector d* : R^d → 𝓗 given by (1.3) is optimal, in that

    Er(d*) ≤ Er(d)   for any other detector d : R^d → 𝓗.

Proof. Introduce the mapping f : R^d → R by

    f(x) = max_{m=1,...,M} p_m f_m(x),   x ∈ R^d.

The obvious bound

    f(x) ≤ Σ_{m=1}^{M} p_m f_m(x),   x ∈ R^d

implies

    ∫_{R^d} f(x) dx ≤ Σ_{m=1}^{M} ∫_{R^d} p_m f_m(x) dx = Σ_{m=1}^{M} p_m = 1,

and the function f is indeed integrable over all of R^d. This fact will be used without
further mention in the discussion below to validate some of the manipulations
involving integrals.
    For any partition (Γ_1, . . . , Γ_M) of R^d, we need to show that

    (1.5)   F(Γ*_1, . . . , Γ*_M) − F(Γ_1, . . . , Γ_M) ≥ 0,

where

    F(Γ*_1, . . . , Γ*_M) − F(Γ_1, . . . , Γ_M)
        = Σ_{m=1}^{M} ( ∫_{Γ*_m} p_m f_m(x) dx − ∫_{Γ_m} p_m f_m(x) dx ).

Next, for each m = 1, . . . , M, by the definition of Γ*_m and f it holds that

    p_m f_m(x) = f(x),   x ∈ Γ*_m

and

    p_m f_m(x) ≤ f(x),   x ∈ Γ_m.

Therefore,

    F(Γ*_1, . . . , Γ*_M) − F(Γ_1, . . . , Γ_M)
        = Σ_{m=1}^{M} ( ∫_{Γ*_m} f(x) dx − ∫_{Γ_m} p_m f_m(x) dx )
        ≥ Σ_{m=1}^{M} ( ∫_{Γ*_m} f(x) dx − ∫_{Γ_m} f(x) dx )
        = Σ_{m=1}^{M} ∫_{Γ*_m} f(x) dx − Σ_{m=1}^{M} ∫_{Γ_m} f(x) dx
        = ∫_{R^d} f(x) dx − ∫_{R^d} f(x) dx = 0,

and the inequality (1.5) is established.

1.4 Alternate forms of the optimal detector


The optimal detector d* identified in Theorem 1.3.1 is amenable to useful interpretations
which we now develop.

The MAP detector   With the usual caveat on tie breakers, the definition (1.3) of
the optimal detector d* yields

    Choose Ĥ = m   iff   p_m f_m(x) largest
                   iff   p_m f_m(x) / Σ_{k=1}^{M} p_k f_k(x) largest
                   iff   P[H = m | X = x] largest

where the last equivalence follows from Bayes' Theorem in the form

    P[H = m | X = x] = p_m f_m(x) / Σ_{k=1}^{M} p_k f_k(x),   x ∈ R^d

for each m = 1, . . . , M. In particular, d* can be viewed as selecting Ĥ = m
whenever the a posteriori probability of H given the observations X is largest.
In the parlance of Estimation Theory, d* is the Maximum A Posteriori (MAP)
estimator of the parameter H on the basis of the observations X.
    As monotone increasing transformations are order preserving, the optimal detector
d* has the equivalent form

    Choose Ĥ = m   iff   log(p_m f_m(x)) largest.

Uniform prior and the ML detector   There is one situation of great interest,
from both practical and theoretical viewpoints, where further simplifications are
achieved in the structure of the optimal detector. This occurs when the rv H is
uniformly distributed over 𝓗, namely

    (1.6)   P[H = m] = 1/M,   m = 1, . . . , M.

In that case, the optimal detector d* prescribes

    Choose Ĥ = m   iff   f_m(x) largest,

and therefore implements the so-called Maximum Likelihood (ML) estimate of H
on the basis of x.

1.5 An important example


An important special case arises when the distributions F_1, . . . , F_M are all Gaussian
distributions with the same invertible covariance matrix. This is equivalent to

    (1.7)   [X | H = m] =_st μ_m + V,   m = 1, . . . , M

where V is a zero-mean R^d-valued Gaussian rv with covariance matrix Σ. We
assume Σ to be invertible and the mean vectors μ_1, . . . , μ_M to be distinct. An
alternative description, based on (1.7), relates the observation X to the state of
nature H through the measurement equation

    (1.8)   X = μ_H + V

where the rvs H and V are assumed to be mutually independent rvs distributed
as before. Under this observation model, for each m = 1, . . . , M, F_m admits the
density

    (1.9)   f_m(x) = (1/√((2π)^d det(Σ))) e^{−(1/2)(x − μ_m)' Σ^{-1} (x − μ_m)},   x ∈ R^d.

We note that

    (1.10)  log(p_m f_m(x)) = C + log p_m − (1/2)(x − μ_m)' Σ^{-1} (x − μ_m),   x ∈ R^d, m = 1, . . . , M

with constant C given by

    C := −(1/2) log((2π)^d det(Σ)).

This constant being independent of m and x, the optimal detector prescribes

    Choose Ĥ = m   iff   2 log p_m − (x − μ_m)' Σ^{-1} (x − μ_m) largest.
Under uniform prior, this MAP detector becomes the ML detector and takes the
form

    Choose Ĥ = m   iff   (x − μ_m)' Σ^{-1} (x − μ_m) smallest.

    The form of the MAP detector given above very crisply illustrates how the
prior information (p_m) on the hypothesis is modified by the posterior information
collected through the observation vector x. Indeed, at first, if only the prior distribution
were known, and with no further information available, it is reasonable
to select the most likely state of nature H = m, i.e., the one with largest value
of p_m. However, as the observation vector x becomes available, its closeness to
μ_m should provide some indication on the underlying state of nature. More precisely,
if μ_m is the closest (in some sense) to the observation x among all the
vectors μ_1, . . . , μ_M, then this should be taken as an indication of high likelihood
that H = m; here the appropriate notion of closeness is the norm on R^d induced
by Σ^{-1}. The MAP detector combines these two trends when constructing the optimal
decision in the following way: The state of nature H = m may have a rather
small value for its prior p_m, making it a priori unlikely to be the underlying state
of nature, yet this will be offset if the observation x yields an extremely small
value for the distance (x − μ_m)' Σ^{-1} (x − μ_m) to the mean vector μ_m.
    When Σ = σ² I_d for some σ > 0, the components of V are mutually independent,
and the MAP and ML detectors take the simpler forms

    Choose Ĥ = m   iff   log p_m − (1/(2σ²)) ||x − μ_m||² largest

and

    Choose Ĥ = m   iff   ||x − μ_m||² smallest,

respectively. Thus, given the observation vector x, the ML detector returns the
state of nature m whose mean vector μ_m is closest (in the usual Euclidean sense)
to x. This is an example of nearest-neighbor detection.
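
When Σ = σ²I_d, the comparison above is easy to code up; the sketch below (an illustration only, with made-up means, prior and noise level) contrasts the MAP and ML (nearest-neighbor) decisions.

import numpy as np

def gaussian_map_ml(x, means, priors, sigma2):
    """MAP and ML decisions for X = mu_H + V with V ~ N(0, sigma2 * I_d)."""
    d2 = np.array([np.sum((x - mu) ** 2) for mu in means])        # ||x - mu_m||^2
    map_scores = np.log(np.array(priors)) - d2 / (2.0 * sigma2)   # Section 1.5 rule
    map_index = int(np.argmax(map_scores)) + 1
    ml_index = int(np.argmin(d2)) + 1                             # nearest neighbor
    return map_index, ml_index

means = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
print(gaussian_map_ml(np.array([0.9, 0.4]), means, [0.8, 0.1, 0.1], sigma2=1.0))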

1.6 Consecutive observations


As the discussion in Section 1.5 already shows, the MAP and ML detectors can as-
sume simpler forms in structured situations. In the present section we explore pos-
sible simplifications when repeated observations of the state of nature are made.
A convenient setup to carry out the discussion is as follows: Consecutive observations
are collected at time epochs labelled i = 1, . . . , n with n > 1. At
each time epoch, nature is assumed to be in one of L distinct states, labelled
ℓ = 1, . . . , L, and we write 𝓛 = {1, . . . , L}. For each i = 1, . . . , n, the unknown
state of nature at epoch i is encoded in the 𝓛-valued rv H_i, while the observation
is modeled by an R^d-valued rv X_i. The global state of nature over these
n time epochs is the 𝓛^n-valued rv H = (H_1, . . . , H_n), while the R^{nd}-valued
rv X = (X_1, . . . , X_n) represents the cumulative observation over these same
epochs.
    The problem of interest here is that of detecting the global state of nature H
on the basis of the cumulative observation vector X. A number of assumptions
will now be made; they are present in some situations relevant to Digital Communications:
At this point, the 𝓛^n-valued rv H is assumed to have an arbitrary pmf,
say

    p(h) = P[H = h] = P[H_1 = h_1, . . . , H_n = h_n],   h = (h_1, . . . , h_n) ∈ 𝓛^n.

We also assume that the observations X_1, . . . , X_n are conditionally independent
given the global state of nature, with a conditional density of the product form

    (1.11)  f_h(x) = Π_{i=1}^{n} f_{h_i}(x_i),   h = (h_1, . . . , h_n) ∈ 𝓛^n, x = (x_1, . . . , x_n) ∈ R^{nd}.

Note that the functional form of (1.11) implies more than the conditional independence
of the rvs X_1, . . . , X_n as it also stipulates for each i = 1, . . . , n that the
conditional distribution of X_i given H depends only on H_i, the state of nature at
the epoch i when this observation is taken.
    The results obtained earlier apply, for it suffices to identify the state of nature
as the rv H and the observation as X: We then see that the ML detector for H
on the basis of the observation vector X prescribes

    Choose Ĥ = (h_1, . . . , h_n)   iff   Π_{i=1}^{n} f_{h_i}(x_i) largest.

This leads to the following equivalent prescription:

    Choose Ĥ_i = h_i   iff   f_{h_i}(x_i) largest,   i = 1, . . . , n.

In other words the corresponding ML detector reduces to sequentially applying
an appropriate ML detector for deciding the state of nature H_i at epoch i on the
basis of the observation X_i collected only at that epoch, for each i = 1, . . . , n. Of
course this is a great simplification since it can be done sequentially in time.
    We now turn to the MAP detector in the situation when the rvs H_1, . . . , H_n are
mutually independent (but not necessarily identically distributed), i.e.,

    (1.12)  P[H_1 = h_1, . . . , H_n = h_n] = Π_{i=1}^{n} P[H_i = h_i]

with h = (h_1, . . . , h_n) in 𝓛^n. Under this independence assumption on the prior,
the MAP detector for H on the basis of the observation vector X prescribes

    Choose Ĥ = (h_1, . . . , h_n)   iff   Π_{i=1}^{n} P[H_i = h_i] f_{h_i}(x_i) largest.

This time again, a separation occurs under the independence assumption (1.12),
namely the combined prescriptions

    Choose Ĥ_i = h_i   iff   P[H_i = h_i] f_{h_i}(x_i) largest,   i = 1, . . . , n.

Again great simplification is achieved as the MAP detector reduces to sequentially
applying a MAP detector for deciding the state of nature H_i at epoch i on the
basis of the observation X_i collected only at that epoch, for each i = 1, . . . , n.
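
Under (1.11) and (1.12) the joint MAP decision therefore factorizes into per-epoch decisions. The short sketch below (illustrative only; the scalar Gaussian likelihoods and priors are assumptions) makes the separation explicit.

import numpy as np
from scipy.stats import norm

def sequential_map(xs, priors, likelihoods):
    """Per-epoch MAP decisions under assumptions (1.11)-(1.12).

    xs          : observations x_1, ..., x_n
    priors      : priors[i][l] = P[H_i = l+1]
    likelihoods : likelihoods[l] = density f_{l+1}, shared across epochs
    """
    decisions = []
    for i, x in enumerate(xs):
        scores = [priors[i][l] * likelihoods[l](x) for l in range(len(likelihoods))]
        decisions.append(int(np.argmax(scores)) + 1)
    return decisions

# L = 2 states, unit-variance Gaussian observations around -1 and +1 (assumed).
lik = [norm(loc=-1.0).pdf, norm(loc=+1.0).pdf]
pri = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
print(sequential_map([0.3, -0.2, 1.4], pri, lik))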

1.7 Irrelevant data


When applying the ideas of Decision Theory developed in this chapter, we shall
sometimes encounter the following structured situation: The observed data X
admits a natural partitioning into two component vectors, say X = (Y , Z) for
rvs Y and Z which take values in Rp and Rq , respectively, with p + q = d.
To simplify the discussion, we still assume that for each m = 1, . . . , M, the
distribution function F_m admits a density f_m on R^d. In that case, the distribution
of the rv Y given H = m also admits a density g_m given by

    g_m(y) = ∫_{R^q} f_m(y, z) dz,   y ∈ R^p.

It is a simple matter to check for y in R^p that the conditional distribution of the
rv Z given Y = y and H = m admits a density, denoted h_m(·|y). Standard
conditioning arguments readily yield

    (1.13)  f_m(y, z) = g_m(y) h_m(z|y),   y ∈ R^p, z ∈ R^q.
In fact, with the convention 0/0 = 0, we find

    (1.14)  h_m(z|y) = f_m(y, z) / g_m(y),   y ∈ R^p, z ∈ R^q.

Returning to the definition (1.4) of the optimal detector, we see that d* prescribes

    Ĥ = m   iff   p_m g_m(y) h_m(z|y) largest

with a tie-breaker. Therefore, if the conditional density at (1.14) were to not depend
on m, i.e.,

    (1.15)  h_1(z|y) = . . . = h_M(z|y) =: h(z|y),   y ∈ R^p, z ∈ R^q,

then (1.4) reduces to

    (1.16)  Ĥ = m   iff   p_m g_m(y) largest.

The condition (1.15) and the resulting form (1.16) of the optimal detector suggest
that knowledge of Z plays no role in developing inference of H on the basis of
the pair (Y , Z), hence the terminology irrelevant data given to Z.
In a number of cases occurring in practice, the condition (1.15) is guaranteed by
the following stronger conditional independence: (i) The rvs Y and Z are mutually
independent conditionally on the rv H, and (ii) the rv Z is itself independent
of the rv H. In other words, for each m = 1, . . . , M, it holds that

    P[Y ≤ y, Z ≤ z | H = m] = P[Y ≤ y | H = m] P[Z ≤ z | H = m]
                            = P[Y ≤ y | H = m] P[Z ≤ z]

for all y and z in R^p and R^q, respectively. In that case, it is plain that

    f_m(y, z) = g_m(y) h(z),   y ∈ R^p, z ∈ R^q

where h is the unconditional probability density function of Z. The validity of
(1.15) is now immediate with

    h(z|y) = h(z),   y ∈ R^p, z ∈ R^q.

1.8 Sufficient statistics


A mapping T : R^d → R^p is said to be a sufficient statistic for (estimating) H on
the basis of X if the conditional distribution of X given H = m and T(X) does
not depend on m.
    The Fisher-Neyman Factorization Theorem given next provides a convenient
characterization of a sufficient statistic in the framework used here.

Theorem 1.8.1 Assume that for each m = 1, . . . , M, the distribution function F_m
admits a density f_m on R^d. The mapping T : R^d → R^p is a sufficient statistic for
estimating H on the basis of X if and only if there exist mappings h : R^d → R_+
and g_1, . . . , g_M : R^p → R_+ such that

    (1.17)  f_m(x) = h(x) g_m(T(x)),   x ∈ R^d

for each m = 1, . . . , M.

The usefulness of the Fisher-Neyman Factorization Theorem should be apparent:
From the definition (1.4) of the optimal detector, we see that d* prescribes

    (1.18)  Ĥ = m   iff   p_m h(x) g_m(T(x)) largest

with a tie-breaker, a prescription equivalent to

    (1.19)  Ĥ = m   iff   p_m g_m(T(x)) largest

with a tie-breaker. In many applications p is much smaller than d with obvious
advantages from the point of view of storage and implementation: The data x is
possibly high-dimensional but after some processing, the decision concerning the
state of nature can be taken on the basis of the lower-dimensional quantity T(x).
The following example, already introduced in Section 1.5, should clarify the
advantage of using (1.19) over (1.18): Assume the distributions F_1, . . . , F_M to be
Gaussian distributions with the same invertible covariance matrix σ² I_d but with
distinct means μ_1, . . . , μ_M. Further assume that

    μ_m = a_m s,   m = 1, . . . , M

for distinct scalars a_1, . . . , a_M and a non-zero vector s. Then, under these assumptions,
for each m = 1, . . . , M, the distribution F_m admits the density

    (1.20)  f_m(x) = (1/√((2πσ²)^d)) e^{−||x − μ_m||²/(2σ²)},   x ∈ R^d

where

    ||x − μ_m||² = ||x||² − 2 a_m x's + a_m² ||s||².

As a result, the density f_m can be written in the form (1.17) with

    h(x) = (1/√((2πσ²)^d)) e^{−||x||²/(2σ²)},   x ∈ R^d

and

    g_m(t) = e^{(1/(2σ²))(2 a_m t − a_m² ||s||²)},   t ∈ R.

It now follows from Theorem 1.8.1 that the mapping T : R^d → R given by

    T(x) := x's,   x ∈ R^d

is a sufficient statistic for (estimating) H on the basis of X. Here p = 1 while d is
arbitrary (and often very large). While the (high-dimensional) data x is observed,
the decision is taken on the basis of the one-dimensional quantity T(x), namely

    (1.21)  Ĥ = m   iff   log p_m + (1/(2σ²)) (2 a_m T(x) − a_m² ||s||²) largest

upon taking logarithms in (1.19).
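
As a concrete illustration of (1.19) and (1.21), the sketch below collapses a d-dimensional observation to the scalar statistic T(x) = x's before comparing hypotheses. The signal vector s, the amplitudes a_m, the prior and the noise level are all assumptions made up for the example.

import numpy as np

def detect_from_statistic(x, s, amplitudes, priors, sigma2):
    """Decide H via (1.21) using only the scalar statistic T(x) = x's."""
    t = float(x @ s)                      # sufficient statistic T(x)
    s_energy = float(s @ s)               # ||s||^2
    scores = [np.log(p) + (2.0 * a * t - a ** 2 * s_energy) / (2.0 * sigma2)
              for p, a in zip(priors, amplitudes)]
    return int(np.argmax(scores)) + 1

rng = np.random.default_rng(0)
s = np.ones(100) / 10.0                   # unit-norm signal direction (assumed)
amps = [-1.0, +1.0]                       # amplitudes a_1, a_2
x = amps[1] * s + rng.normal(scale=1.0, size=100)
print(detect_from_statistic(x, s, amps, [0.5, 0.5], sigma2=1.0))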

1.9 Exercises
Ex. 1.1 Consider the Bayesian hypothesis problem with an arbitrary cost function
C : 𝓗 × 𝓗 → R. Revisit the arguments of Section 1.2 to identify the optimal
detector.

Ex. 1.2 Show that the detector identified in Exercise 1.1 is indeed the optimal
detector. Arguments similar to the ones given in Section 1.3 can be used.

Ex. 1.3 Specialize Exercise 1.2 to the case M = 2.

Ex. 1.4 Show that the formulations (1.7) and (1.8) are equivalent.

Ex. 1.5 In the setting of Section 1.6, show that the rv H is uniformly distributed
on 𝓛^n if and only if the rvs H_1, . . . , H_n are i.i.d. rvs, each of which is uniformly
distributed on 𝓛. Use this fact to obtain the form of the ML detector from the
results derived in the second half of Section 1.6, under the assumption (1.12) on
the prior.

Ex. 1.6 Consider the situation where the scalar observation X and the state of
nature H are rvs related through the measurement equation

    X = μ_H + V

under the following assumptions: The rvs H and V are mutually independent, the
rv H takes values in some finite set 𝓗 = {1, . . . , M}, and the R-valued rv V
admits a density f_V. Here μ_1, . . . , μ_M denote distinct scalars, say μ_1 < . . . < μ_M.
Find the corresponding ML detector.

Ex. 1.7 Continue Exercise 1.6 when the noise V has a Cauchy distribution with
density

    f_V(v) = 1/(π(1 + v²)),   v ∈ R.

Show that the ML detector implements nearest-neighbor detection.

Ex. 1.8 Consider the multi-dimensional version of Exercise 1.6 with the observation
X and the state of nature H related through the measurement equation

    X = μ_H + V

under the following assumptions: The rvs H and V are mutually independent,
the rv H takes values in some finite set 𝓗 = {1, . . . , M}, and the R^d-valued rv
V admits a density f_V. Here the vectors μ_1, . . . , μ_M are distinct elements of R^d.
Find the ML detector when f_V is of the form

    f_V(v) = g(||v||²),   v ∈ R^d

for some decreasing function g : R_+ → R_+.


Chapter 2

Gaussian Random Variables

This chapter is devoted to a brief discussion of the class of Gaussian rvs. In


particular, for easy reference we have collected various facts and properties to be
used repeatedly.

2.1 Scalar Gaussian rvs


With μ in R and σ ≥ 0,
an R-valued rv X is said to be a Gaussian (or normally distributed) rv with mean
μ and variance σ² if either it is degenerate to a constant with X = μ a.s. (in which
case σ = 0) or the probability distribution of X is of the form

    P[X ≤ x] = ∫_{-∞}^{x} (1/√(2πσ²)) e^{−(t−μ)²/(2σ²)} dt,   x ∈ R

(in which case σ² > 0). Under either circumstance, it can be shown that

    (2.1)   E[e^{iθX}] = e^{iθμ − θ²σ²/2},   θ ∈ R.

It then follows by differentiation that

    (2.2)   E[X] = μ   and   E[X²] = σ² + μ²

so that Var[X] = σ². This confirms the meaning ascribed to the parameters μ and
σ² as mean and variance, respectively.


It is a simple matter to check that if X is normally distributed with mean μ and
variance σ², then for scalars a and b, the rv aX + b is also normally distributed
with mean aμ + b and variance a²σ². In particular, with σ > 0, the rv σ^{-1}(X − μ)
is a Gaussian rv with mean zero and unit variance.

2.2 The standard Gaussian rv


The Gaussian rv with mean zero and unit variance occupies a very special place
among Gaussian rvs, and is often referred to as the standard Gaussian rv. Throughout,
we denote by U the Gaussian rv with zero mean and unit variance. Its probability
distribution function is given by

    (2.3)   P[U ≤ x] = Φ(x) := ∫_{-∞}^{x} φ(t) dt,   x ∈ R

with density function given by

    (2.4)   φ(x) := (1/√(2π)) e^{−x²/2},   x ∈ R.

As should be clear from earlier comments, the importance of this standard rv
U stems from the fact that for any Gaussian rv X with mean μ and variance σ², it
holds that X =_st μ + σU, so that

    P[X ≤ x] = P[σ^{-1}(X − μ) ≤ σ^{-1}(x − μ)]
             = P[U ≤ σ^{-1}(x − μ)]
             = Φ(σ^{-1}(x − μ)),   x ∈ R.

The evaluation of probabilities involving Gaussian rvs thus reduces to the evaluation
of related probabilities for the standard Gaussian rv.
    For each x in R, we note by symmetry that P[U ≤ −x] = P[U > x], so that
Φ(−x) = 1 − Φ(x), and Φ is therefore fully determined by the complementary
probability distribution function of U on [0, ∞), namely

    (2.5)   Q(x) := 1 − Φ(x) = P[U > x],   x ≥ 0.
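
In numerical work Q(x) is usually evaluated through the complementary error function (the relationship Q(x) = (1/2)Erfc(x/√2) is derived in Section 2.4). A minimal sketch, assuming SciPy is available:

import numpy as np
from scipy.special import erfc

def Q(x):
    """Gaussian tail Q(x) = P[U > x], computed as (1/2) erfc(x / sqrt(2))."""
    return 0.5 * erfc(x / np.sqrt(2.0))

print(Q(0.0), Q(1.0), Q(3.0))   # 0.5, about 0.1587, about 1.35e-3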

2.3 Gaussian integrals


There are a number of integrals that can be evaluated explicitly by making use of
the fact that the Gaussian density function (2.4) must integrate to unity. We refer
to these integrals as Gaussian integrals, and provide an expression for them.

Lemma 2.3.1 For every a in R and b > 0, it holds that

    (2.6)   I(a, b) := ∫_{R} e^{ax − bx²} dx = √(π/b) e^{a²/(4b)}.

Proof. To evaluate I(a, b) we use a completion-of-square argument to write

    ax − bx² = −b(x² − (a/b)x) = −b(x − a/(2b))² + a²/(4b),   x ∈ R

so that

    I(a, b) = e^{a²/(4b)} ∫_{R} e^{−b(x − a/(2b))²} dx
            = √(π/b) e^{a²/(4b)} ∫_{R} √(b/π) e^{−b(x − a/(2b))²} dx.

The desired conclusion (2.6) follows once we observe that

    ∫_{R} √(b/π) e^{−b(x − a/(2b))²} dx = 1

as the integral of a Gaussian density with mean μ = a/(2b) and variance σ² = 1/(2b).
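
A quick numerical sanity check of (2.6) (an illustration only, with arbitrarily chosen a and b):

import numpy as np
from scipy.integrate import quad

def I_closed_form(a, b):
    """Right-hand side of (2.6)."""
    return np.sqrt(np.pi / b) * np.exp(a ** 2 / (4.0 * b))

a, b = 1.3, 0.7
numerical, _ = quad(lambda x: np.exp(a * x - b * x ** 2), -np.inf, np.inf)
print(numerical, I_closed_form(a, b))   # the two values agree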

Sometimes we shall be faced with the task of evaluating integrals that reduce
to integrals of the form (2.6). This is taken on in

Lemma 2.3.2 For every pair a and b in R, it holds that

    (2.7)   J(θ; a, b) := ∫_{R} e^{−θ(a + bx)²} φ(x) dx = (1/√(1 + 2θb²)) e^{−θa²/(1 + 2θb²)},   θ > 0.

Proof. Fix θ > 0. For each x in R, we note that

    (1/2)x² + θ(a + bx)² = (1/2)(1 + 2θb²)x² + θa² + 2θabx.

Hence, upon making the change of variable u = x√(1 + 2θb²), we find

    J(θ; a, b) = e^{−θa²} ∫_{R} φ(√(1 + 2θb²) x) e^{−2θabx} dx
               = e^{−θa²} ∫_{R} e^{−2θab u/√(1 + 2θb²)} φ(u) du/√(1 + 2θb²)
               = (e^{−θa²}/√(1 + 2θb²)) ∫_{R} e^{−2θab u/√(1 + 2θb²)} φ(u) du
    (2.8)      = (e^{−θa²}/√(2π(1 + 2θb²))) I(α, β)

with

    α := −2θab/√(1 + 2θb²)   and   β := 1/2.

Applying Lemma 2.3.1, we note that

    α²/(4β) = α²/2 = 2θ²a²b²/(1 + 2θb²)

so that

    (2.9)   I(α, β) = √(2π) e^{α²/2} = √(2π) e^{2θ²a²b²/(1 + 2θb²)}.

The desired conclusion readily follows from (2.8) and (2.9) once we observe that

    −θa² + 2θ²a²b²/(1 + 2θb²) = −θa²/(1 + 2θb²).

As an easy corollary of Lemma 2.3.1, any Gaussian rv X with mean μ and
variance σ² has a moment generating function given by

    (2.10)  E[e^{θX}] = e^{θμ + θ²σ²/2},   θ ∈ R.

Indeed, for each θ in R, direct inspection shows that

    E[e^{θX}] = ∫_{R} e^{θx} (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} dx
              = e^{θμ} ∫_{R} (1/√(2πσ²)) e^{θt − t²/(2σ²)} dt
              = (e^{θμ}/√(2πσ²)) I(θ, 1/(2σ²))

where the second equality is obtained by the change of variable t = x − μ, and
(2.10) follows by making use of Lemma 2.3.1. Observe that (2.1) can also be
obtained formally from (2.10) upon replacing θ in the latter by iθ.

2.4 Evaluating Q(x)


The complementary distribution function (2.5) repeatedly enters the computation
of various probabilities of error. Given its importance, we need to develop good
approximations to Q(x) over the entire range x ≥ 0.

The error function   In the literature on digital communications, probabilities of
error are often expressed in terms of the so-called error function Erf : R_+ → R
and of its complement Erfc : R_+ → R defined by

    (2.11)  Erf(x) = (2/√π) ∫_{0}^{x} e^{−t²} dt,   x ≥ 0

and

    (2.12)  Erfc(x) = (2/√π) ∫_{x}^{∞} e^{−t²} dt,   x ≥ 0.

A simple change of variables (t = u/√2) in these integrals leads to the relationships

    Erf(x) = 2(Φ(x√2) − 1/2)   and   Erfc(x) = 2Q(x√2),

so that

    Erf(x) = 1 − Erfc(x),   x ≥ 0.

Conversely, we also have

    Φ(x) = (1/2)(1 + Erf(x/√2))   and   Q(x) = (1/2) Erfc(x/√2).

Thus, knowledge of any one of the quantities Φ, Q, Erf or Erfc is equivalent to
that of the other three quantities. Although the last two quantities do not have
a probabilistic interpretation, evaluating Erf is computationally more efficient.
Indeed, Erf(x) is an integral of a positive function over the finite interval [0, x]
(and not over an infinite interval as in the other cases).

Chernoff bounds   To approximate Q(x) we begin with a crude bound which
takes advantage of (2.10): Fix x > 0. For each θ > 0, the usual Chernoff bound
argument gives

    P[U > x] ≤ E[e^{θU}] e^{−θx}
             = e^{−θx + θ²/2}
    (2.13)   = e^{−x²/2} e^{(θ−x)²/2}

where in the last equality we made use of a completion-of-square argument. The
best such bound

    (2.14)  Q(x) ≤ e^{−x²/2},   x ≥ 0

is achieved upon selecting θ = x in (2.13). We refer to the bound (2.14) as a
Chernoff bound; it is not very accurate for small x > 0 since lim_{x→0} Q(x) = 1/2
while lim_{x→0} e^{−x²/2} = 1.

Approximating Q(x) (x → ∞)   The Chernoff bound shows that Q(x) decays
to zero for large x at least as fast as e^{−x²/2}. However, sometimes more precise
information is needed regarding the rate of decay of Q(x). This issue is addressed
as follows:
    For each x ≥ 0, a straightforward change of variable yields

    Q(x) = ∫_{x}^{∞} φ(t) dt
         = ∫_{0}^{∞} φ(x + t) dt
    (2.15) = φ(x) ∫_{0}^{∞} e^{−xt} e^{−t²/2} dt.

With the Taylor series expansion of e^{−t²/2} in mind, approximations for Q(x) of
increased accuracy thus suggest themselves by simply approximating the second
exponential factor (namely e^{−t²/2}) in the integral at (2.15) by terms of the form

    (2.16)  Σ_{k=0}^{n} ((−1)^k/(2^k k!)) t^{2k},   n = 0, 1, . . .

To formulate the resulting approximation contained in Proposition 2.4.1 given
next, we set

    Q_n(x) = φ(x) ∫_{0}^{∞} ( Σ_{k=0}^{n} ((−1)^k/(2^k k!)) t^{2k} ) e^{−xt} dt,   x ≥ 0

for each n = 0, 1, . . ..

Proposition 2.4.1 Fix n = 0, 1, . . .. For each x > 0 it holds that

    (2.17)  Q_{2n+1}(x) ≤ Q(x) ≤ Q_{2n}(x),

with

    (2.18)  | Q(x) − Q_n(x) | ≤ ((2n)!/(2^n n!)) x^{−(2n+1)} φ(x),

where

    (2.19)  Q_n(x) = φ(x) Σ_{k=0}^{n} ((−1)^k (2k)!/(2^k k!)) x^{−(2k+1)}.

A proof of Proposition 2.4.1 can be found in Section 2.12. Upon specializing
(2.17) to n = 0 we get

    (2.20)  (1 − 1/x²) e^{−x²/2}/(x√(2π)) ≤ Q(x) ≤ e^{−x²/2}/(x√(2π)),   x > 0

and the asymptotics

    (2.21)  Q(x) ∼ e^{−x²/2}/(x√(2π))   (x → ∞)

follow. Note that the lower bound in (2.20) is meaningful only when x > 1.
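
A quick numerical comparison (illustrative only) of the Chernoff bound (2.14) and the n = 0 bounds (2.20) against the exact value of Q(x):

import numpy as np
from scipy.special import erfc

def Q(x):
    return 0.5 * erfc(x / np.sqrt(2.0))

for x in (1.0, 2.0, 4.0):
    upper = np.exp(-x * x / 2.0) / (x * np.sqrt(2.0 * np.pi))  # upper bound in (2.20)
    lower = (1.0 - 1.0 / x ** 2) * upper                       # lower bound in (2.20)
    chernoff = np.exp(-x * x / 2.0)                            # Chernoff bound (2.14)
    print(f"x={x}: {lower:.3e} <= Q={Q(x):.3e} <= {upper:.3e}; Chernoff {chernoff:.3e}")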

2.5 Gaussian random vectors


Let μ denote a vector in R^d and let Σ be a symmetric and non-negative definite
d × d matrix, i.e., Σ' = Σ and θ'Σθ ≥ 0 for all θ in R^d.
    An R^d-valued rv X is said to be a Gaussian rv with mean vector μ and covariance
matrix Σ if there exist a d × p matrix T for some positive integer p and
i.i.d. zero mean unit variance Gaussian rvs U_1, . . . , U_p such that

    (2.22)  T T' = Σ

and

    (2.23)  X =_st μ + T U_p

where U_p is the R^p-valued rv (U_1, . . . , U_p)'.
    From (2.22) and (2.23) it is plain that

    E[X] = E[μ + T U_p] = μ + T E[U_p] = μ

and

    E[(X − μ)(X − μ)'] = E[T U_p (T U_p)']
                       = T E[U_p U_p'] T'
    (2.24)             = T I_p T' = Σ,

whence

    E[X] = μ   and   Cov[X] = Σ.

Again this confirms the terminology used for μ and Σ as mean vector and covariance
matrix, respectively.
    It is a well-known fact from Linear Algebra that for any symmetric
and non-negative definite d × d matrix Σ, there exists a d × d matrix T such that
(2.22) holds with p = d. This matrix T can be selected to be symmetric and non-negative
definite, and is called the square root of Σ. Consequently, for any vector
μ in R^d and any symmetric non-negative definite d × d matrix Σ, there always
exists an R^d-valued Gaussian rv X with mean vector μ and covariance matrix Σ:
Simply take

    X =_st μ + T U_d

where T is the square root of Σ.
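
The construction X =_st μ + T U_d is exactly how Gaussian vectors are simulated in practice. A minimal sketch, assuming Σ is positive definite so that a Cholesky factor can play the role of T (any T with TT' = Σ would do):

import numpy as np

def sample_gaussian(mu, Sigma, n, rng):
    """Draw n samples of X = mu + T U with T T' = Sigma (T from Cholesky)."""
    T = np.linalg.cholesky(Sigma)             # lower triangular, T @ T.T == Sigma
    U = rng.standard_normal((n, len(mu)))     # i.i.d. standard Gaussian entries
    return mu + U @ T.T

rng = np.random.default_rng(1)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = sample_gaussian(mu, Sigma, 100_000, rng)
print(X.mean(axis=0), np.cov(X.T))            # close to mu and Sigma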

2.6 Characteristic functions


The characteristic function of Gaussian rvs has an especially simple form which
is now developed.

Lemma 2.6.1 The characteristic function of a Gaussian R^d-valued rv X with
mean vector μ and covariance matrix Σ is given by

    (2.25)  E[e^{iθ'X}] = e^{iθ'μ − (1/2)θ'Σθ},   θ ∈ R^d.

Conversely, any R^d-valued rv X whose characteristic function is given by (2.25)
for some vector μ in R^d and symmetric non-negative definite d × d matrix Σ is a
Gaussian R^d-valued rv X with mean vector μ and covariance matrix Σ.

Proof. Consider an R^d-valued rv X which is a Gaussian rv with mean vector μ
and covariance matrix Σ. By definition, there exist a d × p matrix T for some
positive integer p and i.i.d. zero mean unit variance Gaussian rvs U_1, . . . , U_p such
that (2.22) and (2.23) hold.
    For each θ in R^d, we get

    E[e^{iθ'X}] = e^{iθ'μ} E[e^{iθ'T U_p}]
               = e^{iθ'μ} E[e^{i(T'θ)'U_p}]
               = e^{iθ'μ} E[e^{i Σ_{k=1}^{p} (T'θ)_k U_k}]
    (2.26)     = e^{iθ'μ} Π_{k=1}^{p} E[e^{i(T'θ)_k U_k}]
    (2.27)     = e^{iθ'μ} Π_{k=1}^{p} e^{−(1/2)|(T'θ)_k|²}.

The equality (2.26) is a consequence of the independence of the rvs U_1, . . . , U_p,
while (2.27) follows from their Gaussian character (and (2.1)).
    Next, we note that

    Σ_{k=1}^{p} |(T'θ)_k|² = (T'θ)'(T'θ)
    (2.28)                 = θ'(T T')θ = θ'Σθ

upon invoking (2.22). It is now plain from (2.27) that the characteristic function
of the Gaussian R^d-valued rv X is given by (2.25).
    Conversely, consider an R^d-valued rv X with characteristic function of the
form (2.25) for some vector μ in R^d and some symmetric non-negative definite
d × d matrix Σ. By comments made earlier, there exists a d × d matrix T such
that (2.22) holds. By the first part of the proof, the R^d-valued rv X̃ given by
X̃ := μ + T U_d has characteristic function given by (2.25). Since a probability
distribution is completely determined by its characteristic function, it follows that
the rvs X and X̃ obey the same distribution. The rv X̃ being Gaussian with mean
vector μ and covariance matrix Σ, the rv X is necessarily Gaussian as well with
mean vector μ and covariance matrix Σ.

2.7 Existence of a density


In general, an Rd -valued Gaussian rv as defined above may not admit a density
function. To see why, consider the null space of its covariance matrix ,1 namely

N () := {x Rd : x = 0d }.

Observe that 0 = 0 if and only if belongs to N (), in which case (2.25)


yields h 0 i
i (X )
E e =1

and we conclude that


0 (X ) = 0 a.s.
In other words, with probability one, the rv X is orthogonal to the linear
space N ().
To proceed, we assume that the covariance matrix is not trivial (in that it
has some non-zero entries) for otherwise X = a.s. In the non-trivial case, there
are now two possibilities depending on the d d matrix being positive definite
or not. Note that the positive definiteness of , i.e., 0 = 0 necessarily implies
= 0d , is equivalent to the condition N () = {0d }.
If the d d matrix is not positive definite, hence only positive semi-definite,
then the mass of the rv X is concentrated on the orthogonal space N ()
of N (), whence the distribution of X has its support on the linear manifold
+ N () and is singular with respect to Lebesgue measure.
On the other hand, if the d d matrix is positive definite, then the matrix
is invertible, det() 6= 0 and the Gaussian rv X with mean vector and
covariance matrix admits a density function given by
1 1
e 2 (x) (x)
1 0
f (x) = p , x Rd .
(2)d det()
1
This linear space is sometimes called the kernel of .

2.8 Linear transformations


The following result is very useful in many contexts, and shows that linear trans-
formations preserve the Gaussian character:

Lemma 2.8.1 Let b be an element of R^q and let A be a q × d matrix. Then, for
any Gaussian R^d-valued rv X with mean vector μ and covariance matrix Σ,
the R^q-valued rv Y given by

    Y = b + AX

is also a Gaussian rv with mean vector b + Aμ and covariance matrix AΣA'.

Proof. First, by linearity we note that

    E[Y] = E[b + AX] = b + Aμ

so that

    Cov[Y] = E[A(X − μ)(A(X − μ))']
           = A E[(X − μ)(X − μ)'] A'
    (2.29) = AΣA'.

Consequently, the R^q-valued rv Y has mean vector b + Aμ and covariance matrix
AΣA'.
    Next, by the Gaussian character of X, there exist a d × p matrix T for some
positive integer p and i.i.d. zero mean unit variance Gaussian rvs U_1, . . . , U_p such
that (2.22) and (2.23) hold. Thus,

    Y =_st b + A(μ + T U_p)
        = b + Aμ + A T U_p
    (2.30) = μ̃ + T̃ U_p

with

    μ̃ := b + Aμ   and   T̃ := AT
and the Gaussian character of Y is established.

This result can also be established through the evaluation of the characteristic
function of the rv Y . As an immediate consequence of Lemma 2.8.1 we get

Corollary 2.8.1 Consider a Gaussian R^d-valued rv X with mean vector μ and
covariance matrix Σ. For any subset I of {1, . . . , d} with |I| = q ≤ d, the R^q-valued
rv X_I given by X_I = (X_i, i ∈ I)' is a Gaussian rv with mean vector
(μ_i, i ∈ I)' and covariance matrix (Σ_{ij}, i, j ∈ I).

2.9 Independence of Gaussian rvs


Characterizing the mutual independence of Gaussian rvs turns out to be quite
straightforward as the following suggests: Consider the rvs X 1 , . . . , X r where
for each s = 1, . . . , r, the rvX s is an Rds -valued rv with mean vector s and
covariance matrix s . With d = d1 + . . . + dr , let X denote the Rd -valued rv
obtained by concatenating X 1 , . . . , X r , namely

X1
(2.31) X = ... .

Xr
Its mean vector is simply
1
(2.32) = ...

r
while its covariance matrix can be written in block form as

1 1,2 . . . 1,r
2,1 2 . . . 2,r
(2.33) = ..

.. .. ..
. . . .
r,1 r,2 . . . r
with the notation
s,t := Cov[X s , X t ] s, t = 1, . . . , r.
Lemma 2.9.1 With the notation above, assume the Rd -valued rv X to be a Gaus-
sian rv with mean vector and covariance matrix . Then, for each s = 1, . . . , r,
the rv X s is a Gaussian rv with mean vector s and covariance matrix s . More-
over, the rvs X 1 , . . . , X r are mutually independent Gaussian rvs if and only they
are uncorrelated, i.e.,
(2.34) s,t = (s, t)t , s, t = 1, . . . , r.

The first part of Lemma 2.9.1 is a simple rewrite of Corollary 2.8.1. Some-
times we refer to the fact that the rv X is Gaussian by saying that the rvs X 1 , . . . , X r
are jointly Gaussian. A converse to Lemma 2.9.1 is available:

Lemma 2.9.2 Assume that for each s = 1, . . . , r, the rv X_s is a Gaussian rv with
mean vector μ_s and covariance matrix Σ_s. If the rvs X_1, . . . , X_r are mutually
independent, then the R^d-valued rv X is an R^d-valued Gaussian rv with mean
vector μ and covariance matrix Σ as given by (2.33) with (2.34).

It might be tempting to conclude that the Gaussian character of each of the rvs
X 1 , . . . , X r alone suffices to imply the Gaussian character of the combined rv
X. However, it can be shown through simple counterexamples that this is not so.
In other words, the joint Gaussian character of X does not follow merely from
that of its components X 1 , . . . , X r without further assumptions.

2.10 Convergence and limits of Gaussian rvs


In later chapters we will need to define integrals with respect to Gaussian processes.
As in the deterministic case, these stochastic integrals will be defined as
limits of partial sums of the form

    (2.35)  X_n := Σ_{j=1}^{k_n} a_j^{(n)} Y_j^{(n)},   n = 1, 2, . . .

where for each n = 1, 2, . . ., the integer k_n and the coefficients a_j^{(n)}, j = 1, . . . , k_n,
are non-random while the rvs {Y_j^{(n)}, j = 1, . . . , k_n} are jointly Gaussian rvs. Typically,
as n goes to infinity so does k_n. Note that under the foregoing assumptions,
for each n = 1, 2, . . ., the rv X_n is Gaussian with

    (2.36)  E[X_n] = Σ_{j=1}^{k_n} a_j^{(n)} E[Y_j^{(n)}]

and

    (2.37)  Var[X_n] = Σ_{i=1}^{k_n} Σ_{j=1}^{k_n} a_i^{(n)} a_j^{(n)} Cov[Y_i^{(n)}, Y_j^{(n)}].

Therefore, the study of such integrals is expected to pass through the convergence
of the sequence of rvs {X_n, n = 1, 2, . . .} of the form (2.35). Such considerations
lead naturally to the need for the following result:

Lemma 2.10.1 Let {X_k, k = 1, 2, . . .} denote a collection of R^d-valued Gaussian
rvs. For each k = 1, 2, . . ., let μ_k and Σ_k denote the mean vector and
covariance matrix of the rv X_k. The rvs {X_k, k = 1, . . .} converge in distribution
(in law) if and only if there exist an element μ in R^d and a d × d matrix Σ such
that

    (2.38)  lim_{k→∞} μ_k = μ   and   lim_{k→∞} Σ_k = Σ.

In that case,

    X_k ⟹_k X

where X is an R^d-valued Gaussian rv with mean vector μ and covariance matrix
Σ.

The second half of condition (2.38) ensures that the matrix Σ is symmetric
and non-negative definite, hence a covariance matrix.
    Returning to the partial sums (2.35) we see that Lemma 2.10.1 (applied with
d = 1) requires identifying the limits μ = lim_n E[X_n] and σ² = lim_n Var[X_n],
in which case X_n ⟹_n X where X is an R-valued Gaussian rv with mean μ and
variance σ². In Section ?? we discuss a situation where this can be done quite
easily.

2.11 Rvs derived from Gaussian rvs

Rayleigh rvs   A rv X is said to be a Rayleigh rv with parameter σ (σ > 0) if

    (2.39)  X =_st √(Y² + Z²)

with Y and Z independent zero mean Gaussian rvs with variance σ². It is easy to
check that

    (2.40)  P[X > x] = e^{−x²/(2σ²)},   x ≥ 0

with corresponding density function

    (2.41)  (d/dx) P[X ≤ x] = (x/σ²) e^{−x²/(2σ²)},   x ≥ 0.

It is also well known that the rv Θ given by

    (2.42)  Θ := arctan(Z/Y)

is uniformly distributed over [0, 2π) and independent of the Rayleigh rv X, i.e.,

    (2.43)  P[X ≤ x, Θ ≤ θ] = (θ/(2π)) (1 − e^{−x²/(2σ²)}),   θ ∈ [0, 2π), x ≥ 0.

Rice rvs   A rv X is said to be a Rice rv with parameters ν (in R) and σ (σ > 0)
if

    (2.44)  X =_st √((ν + Y)² + Z²)

with Y and Z independent zero mean Gaussian rvs with variance σ². It is easy to
check that X admits a probability density function given by

    (2.45)  (d/dx) P[X ≤ x] = (x/σ²) e^{−(x² + ν²)/(2σ²)} I_0(xν/σ²),   x ≥ 0.

Here,

    (2.46)  I_0(x) := (1/(2π)) ∫_{0}^{2π} e^{x cos t} dt,   x ∈ R

is the modified Bessel function of the first kind of order zero.

Chi-square rvs   For each n = 1, 2, . . ., the Chi-square rv with n degrees of
freedom is the rv defined by

    χ²_n =_st U_1² + . . . + U_n²

where U_1, . . . , U_n are n i.i.d. standard Gaussian rvs.
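
A short simulation sketch checking the Rayleigh tail (2.40); the parameter value is made up for illustration.

import numpy as np

rng = np.random.default_rng(2)
sigma, n = 1.5, 200_000
Y = rng.normal(scale=sigma, size=n)
Z = rng.normal(scale=sigma, size=n)
X = np.sqrt(Y ** 2 + Z ** 2)                        # Rayleigh rv, see (2.39)

x = 2.0
empirical = np.mean(X > x)
theoretical = np.exp(-x ** 2 / (2.0 * sigma ** 2))  # tail probability (2.40)
print(empirical, theoretical)                       # the two values are close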

2.12 A Proof of Proposition 2.4.1


The main idea is to use the Taylor series approximations (2.16) in the relation
(2.15). To do so, we begin by establishing some elementary facts concerning the
Taylor series approximations of the negative exponential ey (y 0): For each
n = 0, 1, . . ., set
n
X (1)k k
(2.47) Hn (y) := y , y 0.
k=0
k!
36 CHAPTER 2. GAUSSIAN RANDOM VARIABLES

Lemma 2.12.1 For each y 0 and n = 0, 1, . . ., it holds that

(2.48) H2n+1 (y) ey H2n (y)

with
yn
(2.49) | Hn (y) ey | .
n!

Proof. Fix y 0 and n = 0, 1, . . .. By differentiation we readily check that


0
Hn+1 (y) = Hn (y),

so that
d y
e Hn+1 (y) = ey Hn (y) .
 
dy
Integrating and using the fact Hn+1 (0) = 1, we find
Z y
y
et Hn (t) dt.

(2.50) e Hn+1 (y) =
0

An easy induction argument now yields (2.48) once we note for the basis step that
H0 (y) > ey for all y > 0.
To obtain the bound (2.49) on the accuracy of approximating ey by Hn (y), we
proceed by induction on n. For n = 0, it is always the case that |ey H0 (y)| 1,
whence (2.49) holds for all y 0 and the basis step is established. Next, we
assume that (2.49) holds for all y 0 for n = m with some m = 0, 1, . . ., namely

ym
(2.51) |ey Hm (y)| , y 0.
m!
Hence, upon invoking (2.50) we observe that
Z y
y
|e Hm+1 (y)| |et Hm (t)|dt
Z0 y m
t y m+1
dt = , y0
0 m! (m + 1)!

and the induction step is established.


Back to the proof of Proposition 2.4.1: Fix x > 0 and n = 0, 1, . . .. As we have
2.13. EXERCISES 37

in mind to use (2.48) to bound the second exponential factor in the integrand of
(2.15), we note that
n
(1)k 2k xt
Z  2 Z
xt t X
e Hn dt = t e dt
0 2 k=0
2k k! 0
n
(1)k (2k+1) 2k u
X Z
= x u e du
k=0
2k k! 0
n
X (1)k (2k)!
(2.52) = x(2k+1)
k=0
2k k!

where the last equality made use of the well-known closed-form expressions
Z
up eu du = p!, p = 0, 1, . . .
0

for the moments of a standard exponential distribution.


The bounds (2.48) together with (2.15) yield the inequalities
Z  2 Z  2
xt t xt t
(x) e H2n+1 dt Q(x) (x) e H2n dt,
0 2 0 2

and (2.17) follows from the evaluation (2.52).


Using the definition of Q(x) and Qn (x) we conclude from (2.49) that
Z   2 
xt t2 t
| Q(x) Qn (x) | = (x) e e 2 Hn dt
2
Z 0 2n
t
(x) ext n dt,
0 2 n!

and (2.18) follows.

2.13 Exercises
Ex. 2.1 Derive the relationships between the quantities Φ, Q, Erf and Erfc which
are given in Section 2.4.

Ex. 2.2 Given the covariance matrix Σ, explain why the representation (2.22)–(2.23)
may not be unique. Give a counterexample.

Ex. 2.3 Give a proof of Lemma 2.9.1 and of Lemma 2.9.2.

Ex. 2.4 Construct an R²-valued rv X = (X_1, X_2) such that the R-valued rvs X_1
and X_2 are each Gaussian but the R²-valued rv X is not (jointly) Gaussian.

Ex. 2.5 Derive the probability distribution function (2.40) of a Rayleigh rv with
parameter σ (σ > 0).

Ex. 2.6 Show by direct arguments that if X is a Rayleigh rv with parameter σ,
then X² is exponentially distributed with parameter (2σ²)^{-1}. [Hint:
Compute E[e^{−θX²}] for a Rayleigh rv X for θ ≥ 0.]

Ex. 2.7 Derive the probability density function (2.45) of a Rice rv with parameters
ν (in R) and σ (σ > 0).

Ex. 2.8 Write a program to evaluate Q_n(x).

Ex. 2.9 Let X_1, . . . , X_n be i.i.d. Gaussian rvs with zero mean and unit variance
and write S_n = X_1 + . . . + X_n. For each a > 0 show that

    (2.53)  P[S_n > na] ∼ e^{−na²/2}/(a√(2πn))   (n → ∞).

This asymptotic is known as the Bahadur-Rao correction to the large deviations
asymptotics of S_n.

Ex. 2.10 Find all the moments E[U^p] (p = 1, . . .) where U is a zero-mean unit
variance Gaussian rv.

Ex. 2.11 Find all the moments E[X^p] (p = 1, . . .) where X is a χ²_n rv with n
degrees of freedom.
Chapter 3

Vector space methods

In this chapter we develop elements of the theory of vector spaces. As we shall


see in subsequent chapters, vector space methods will prove useful in handling the
so-called waveform channels by transforming them into vector channels. Vector
spaces provide a unifying abstraction to carry out this translation. Additional
information can be found in the references [?, ?].

3.1 Vector spaces: Definitions

We begin by introducing the notion of vector space. Consider a set V whose
elements are called vectors while we refer to the elements of R as scalars. We
assume that V is equipped with an internal operation of addition, say + : V × V → V,
with the property that (V, +) is a commutative group. This means that

1. (Commutativity)

    v + w = w + v,   v, w ∈ V

2. (Associativity)

    (u + v) + w = u + (v + w),   u, v, w ∈ V

3. (Existence of a zero vector) There exists an element 0 in V such that

    v + 0 = v = 0 + v,   v ∈ V


4. (Existence of negative vectors) For every vector v in V, there exists a vector
in V, denoted −v, such that

    v + (−v) = 0 = (−v) + v

It is a simple matter to check that there can be only one such zero vector 0, and
that for every vector v in V, its negative −v is unique.
    In order for the group (V, +) to become a vector space on R we need to endow
it with an external multiplication operation whereby multiplying a vector by a
scalar is given a meaning as a vector. This multiplication operation, say · : R × V → V,
is required to satisfy the following properties:

1. (Distributivity)

    (a + b) · v = a · v + b · v,   a, b ∈ R, v ∈ V

2. (Distributivity)

    a · (v + w) = a · v + a · w,   a ∈ R, v, w ∈ V

3. (Associativity)

    a · (b · v) = (ab) · v = b · (a · v),   a, b ∈ R, v ∈ V

4. (Unity law)

    1 · v = v,   v ∈ V

It is customary to drop the multiplication symbol · from the notation, as we do
from now on. Two important examples will be developed in Chapter 4, namely the
usual space R^d and the space of finite energy signals defined on some interval.
    Throughout the remainder of this chapter, we assume given a vector space
(V, +) on R.

3.2 Linear independence


Given a finite collection of vectors v 1 , . . . , v p in V , the vector pi=1 ai v i is called
P
a linear combination of the vectors v 1 , . . . , v p in V (with weights a1 , . . . , ap in
R).
3.3. SUBSPACES AND LINEAR SPANS 41

The vectors v 1 , . . . , v p in V are linearly independent if the relation


p
X
(3.1) ai v i = 0
i=1

with scalars a1 , . . . , ap in R implies

(3.2) a1 = . . . = ap = 0.

In that case, we necessarily have v i 6= 0 for each i = 1, 2, . . . , p (for otherwise


(3.1) does not necessarily imply (3.2)).
If the vectors v 1 , . . . , v p are linearly independent in V , then the relation
p p
X X
ai v i = bi v i
i=1 i=1

with scalars a1 , b1 , . . . , ap , bp implies ai = bi for all i = 1, . . . , p. In other words,


the representation of a vector as a linear combination of a finite number of linearly
independent vectors is necessarily unique.
As we shall see when discussing spaces of signals such as L2 (I), it will be
natural to introduce the following extension of the concept of linear independence:
Consider an arbitrary family {v , A} of elements in V with A some index
set (not necessarily finite). We say that the vectors {v , A} form a linearly
independent family if each of its finite subsets is a linearly independent collection.
Formally, this is equivalent to requiring that for every p = 1, 2, . . . and for every
collection 1 , . . . , p of distinct elements in A, the relation
p
X
(3.3) ai v i = 0
i=1

with scalars a1 , . . . , ap in R implies a1 = . . . = ap = 0.

3.3 Subspaces and linear spans


A (linear) subspace E of the vector space (V, +) (on R) is any subset of V which
is closed under vector addition and multiplication by scalars, i.e.,

v+w E and av E
42 CHAPTER 3. VECTOR SPACE METHODS

whenever v and w are elements of E and a is an arbitrary scalar.


Consider an arbitrary family {v , A} of elements in V with A some index
set (not necessarily finite). We say that v belongs to the (linear) span of {v ,
A}, denoted sp (v , A), if v can be expressed as a linear combination of a
finite number of elements of {v , A}, i.e., there exists a finite number of
indices in A, say 1 , . . . , p for some p, and scalars a1 , . . . , ap in R such that
p
X
v= ai v i .
i=1

This representation is not a priori unique.


The linear span of this family {v , A} is a linear subspace, and is in fact
the smallest linear subspace of E that contains {v , A}. In particular, if A is
finite, say A = {1, . . . , p} for sake of concreteness, then
( p )
X
sp (v 1 , . . . , v p ) := ai v i : (a1 , . . . , ap ) Rp .
i=1

A subspace E of V is now said to have dimension p if there exists p lin-


early independent vectors u1 , . . . , up in E (not merely in V ) such that E =
sp (u1 , . . . , up ). The notion of dimension is well defined in that if v 1 , . . . , v q
is another collection of linearly independent vectors in E (not merely in V ) such
that E = sp (v 1 , . . . , v q ), then p = q. Any set of p linearly independent vectors
w1 , . . . , wp such that E = sp (w1 , . . . , wp ) is called a basis of E.

3.4 Scalar product and norm


Many of the vector spaces of interest are endowed with a scalar product, a notion
which provides a way to measure correlations between vectors. Formally, a scalar
product on the vector space (V, +) is a mapping ⟨·, ·⟩ : V × V → R which satisfies
the following conditions:

1. (Bilinearity) For each v in V, the mappings V → R : w → ⟨v, w⟩ and
V → R : w → ⟨w, v⟩ are linear mappings, i.e.,

    ⟨v, aw + bu⟩ = a⟨v, w⟩ + b⟨v, u⟩

and

    ⟨aw + bu, v⟩ = a⟨w, v⟩ + b⟨u, v⟩

for all u and w in V, and all scalars a and b in R

2. (Symmetry)

    ⟨v, w⟩ = ⟨w, v⟩,   v, w ∈ V

3. (Positive definiteness)

    ⟨v, v⟩ > 0 if v ≠ 0, v ∈ V

It is easy to see that ⟨v, v⟩ = 0 when v = 0, so that

    ⟨v, v⟩ ≥ 0,   v ∈ V.

Put differently, ⟨v, v⟩ = 0 for some vector v in V if and only if v = 0.
    Once a scalar product is available, it is possible to associate with it a notion
of vector length. We define a notion of norm or vector length on V through the
definition

    (3.4)   ||v|| := √⟨v, v⟩,   v ∈ V.

The terminology is justified through the following properties which are commonly
associated with the notion of length in Euclidean geometry.

Proposition 3.4.1 The mapping V → R_+ : v → ||v|| defined by (3.4) satisfies
the following properties:

1. (Homogeneity) For each v in V, it holds that

    ||tv|| = |t| ||v||,   t ∈ R.

2. (Positive definiteness) If ||v|| = 0 for some v in V, then v = 0.

3. (Triangular inequality) For every pair v and w of elements of V, it holds
that

    ||v + w|| ≤ ||v|| + ||w||.

The properties listed in Proposition 3.4.1 form the basis for the notion of norm
in more general settings [?].

Proof. The homogeneity and positive definiteness are immediate consequences
of the definition (3.4) when coupled with the bilinearity of the underlying scalar
product and its positive definiteness. To establish the triangular inequality, consider
elements v and w of V. It holds that

    ||v + w||² = ||v||² + ||w||² + 2⟨v, w⟩
              ≤ ||v||² + ||w||² + 2||v|| ||w||
    (3.5)     = (||v|| + ||w||)²

where the first equality follows by bilinearity of the scalar product, and the inequality
is justified by the Cauchy-Schwartz inequality (discussed in Proposition
3.4.2 below). This establishes the triangular inequality.

We conclude this section with a proof of the Cauchy-Schwartz inequality.

Proposition 3.4.2 The Cauchy-Schwartz inequality

    (3.6)   |⟨v, w⟩| ≤ ||v|| ||w||,   v, w ∈ V

holds, with equality in (3.6) if and only if v and w are co-linear, i.e., there exists a
scalar a in R such that v = aw.

Proof. Fix v and w elements of V, and note that

    Q(t) := ||v + tw||²
    (3.7)  = ||v||² + 2t⟨v, w⟩ + t²||w||²,   t ∈ R

by bilinearity of the scalar product. The fact that Q(t) ≥ 0 for all t in R is
equivalent to the quadratic equation Q(t) = 0 having at most one (double) real
root. This forces the corresponding discriminant to be non-positive, i.e.,

    Δ = (2⟨v, w⟩)² − 4||v||² ||w||² ≤ 0,

and the proof of (3.6) is completed. Equality occurs in (3.6) if and only if Δ = 0,
in which case there exists t* in R such that Q(t*) = 0, whence v + t*w = 0, and
the co-linearity of v and w follows.

    In the remainder of this chapter, all discussions are carried out in the context
of a vector space (V, +) on R equipped with a scalar product ⟨·, ·⟩ : V × V → R.

3.5 Orthogonality
The elements v and w of V are said to be orthogonal if
hv, wi = 0.
We also say that the vectors v 1 , . . . , v p are (pairwise) orthogonal if
hv i , v j i = 0, i 6= j, i, j = 1, . . . , p.
More generally, consider an arbitrary family {v , A} of elements in V with
A some index set (not necessarily finite). We say that this family is an orthogonal
family if every one of its finite subset is itself a collection of orthogonal vectors.
A moment of reflection shows that this is equivalent to requiring the pairwise
conditions
(3.8) hv , v i = 0, 6= A.
Moreover, for any subset E of V , the element v of V is said to be orthogonal
to E if
hv, wi = 0, w E.
If the set E coincides with the linear span of the vectors v 1 , . . . , v p , then v is
orthogonal to E if and only if hv, v i i = 0 for all i = 1, . . . , p.
An important consequence of orthogonality is the following version of Pythago-
ras Theorem.
Proposition 3.5.1 When v and w are orthogonal elements in V , we have Pythago-
ras relation
(3.9) kv + wk2 = kvk2 + kwk2 .

This result can be used to show a relationship between linear independence
and orthogonality.

Lemma 3.5.1 If the non-zero vectors v_1, . . . , v_p are orthogonal, then they are
necessarily linearly independent.

Proof. Indeed, for any scalars a_1, . . . , a_p in R, repeated application of Pythagoras' Theorem yields

        ‖ Σ_{i=1}^p a_i v_i ‖² = Σ_{i=1}^p |a_i|² ‖v_i‖².

Therefore, the constraint Σ_{i=1}^p a_i v_i = 0 implies |a_i|² ‖v_i‖² = 0 for all i =
1, . . . , p. The vectors v_1, . . . , v_p being non-zero, we have ‖v_i‖² ≠ 0 for all
i = 1, . . . , p, so that |a_i|² = 0 for all i = 1, . . . , p. In short, a_1 = . . . = a_p = 0!
Thus, the vectors v_1, . . . , v_p are indeed linearly independent.

The notions of orthogonality and norm come together through the notion of
orthonormality: If the vectors v_1, . . . , v_p are orthogonal with unit norm, they are
said to be orthonormal, a property characterized by

(3.10)    ⟨v_i, v_j⟩ = δ(i, j),   i, j = 1, . . . , p

with δ(i, j) the Kronecker delta.
The usefulness of this notion is already apparent when considering the following representation result.

Lemma 3.5.2 If E is a linear subspace of V spanned by the orthonormal family
u_1, . . . , u_p, then the representation

(3.11)    h = Σ_{i=1}^p ⟨h, u_i⟩ u_i,   h ∈ E

holds, and E has dimension p.


The assumption of Lemma 3.5.2 can always be achieved as should be clear
from the Gram-Schmidt orthonormalization procedure discussed in Section 3.8.

Proof. By the definition of E as the span of the vectors u_1, . . . , u_p, every element
h in E is of the form

(3.12)    h = Σ_{i=1}^p h_i u_i

for an appropriate selection of scalars h_1, . . . , h_p. For each j = 1, . . . , p, we find

        ⟨h, u_j⟩ = ⟨ Σ_{i=1}^p h_i u_i, u_j ⟩ = Σ_{i=1}^p h_i ⟨u_i, u_j⟩ = h_j

upon invoking orthonormality, and (3.11) follows from (3.12).
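The representation (3.11) is easy to illustrate numerically once an orthonormal family is at hand. In the added sketch below, the QR factorization is simply a convenient way to manufacture an orthonormal family in R^d, and all dimensions and coefficients are our own choices; the point is only that the coefficients of h are recovered as scalar products.

```python
import numpy as np

# Illustration of (3.11): for an orthonormal family u_1, ..., u_p, any h in their
# span satisfies h = sum_i <h, u_i> u_i.
rng = np.random.default_rng(1)
d, p = 6, 3
U, _ = np.linalg.qr(rng.normal(size=(d, p)))   # columns of U: an orthonormal family

h = U @ np.array([2.0, -1.0, 0.5])             # an element of E = sp(u_1, ..., u_p)
coeffs = U.T @ h                               # <h, u_i> for i = 1, ..., p
print(coeffs)                                  # recovers [2.0, -1.0, 0.5]
print(np.allclose(h, U @ coeffs))              # True: h = sum_i <h, u_i> u_i
```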

We emphasize that the discussion of Sections 3.4 and 3.5 depends only on
the defining properties of the scalar product. This continues to be the case in the
material of the next section.

3.6 Distance and projection

We can define a notion of distance on V by setting

(3.13)    d(v, w) := ‖v − w‖,   v, w ∈ V.

Consider now the situation where E is a linear subspace of V and v is an
element in V. We are interested in finding an element v⋆ in E which has the
smallest distance to v, namely

(3.14)    d(v, v⋆) = inf_{x ∈ E} d(v, x).

The uniqueness and characterization of such an element v⋆ (when it exists) are
addressed in

Proposition 3.6.1 Let E be a linear subspace of V, and let v denote an arbitrary
element in V. If there exists an element v⋆ in E satisfying (3.14), it is unique and
characterized by the simultaneous validity of the relations

(3.15)    ⟨v − v⋆, h⟩ = 0,   h ∈ E.

Conversely, any element v⋆ in E satisfying (3.15) necessarily satisfies (3.14).

Before giving the proof of Proposition 3.6.1 in the next section we discuss
some easy consequences of the conditions (3.15). These conditions state that
the vector v − v⋆ is orthogonal to E. The unique element v⋆ satisfying these
constraints is often called the projection of v onto E, and at times we shall use the
notation

        v⋆ = Proj_E(v),

in which case (3.15) takes the form

(3.16)    ⟨v − Proj_E(v), h⟩ = 0,   h ∈ E.

It is often useful to view v⋆ as the best approximation of v in E, with v − v⋆
interpreted as the error incurred by approximating v by v⋆. In this interpretation,
(3.15) states that the error is orthogonal to the space of all admissible approxi-
mations (i.e., those in E). If v is itself an element of E, then v − v⋆ is now an
element of E and (3.15) (with h = v − v⋆ now in E) yields ‖v − v⋆‖ = 0 or,
equivalently, Proj_E(v) = v, as expected.

For any element v in V whose projection onto E exists, Pythagoras' Theorem
gives

(3.17)    ‖v‖² = ‖Proj_E(v)‖² + ‖v − Proj_E(v)‖²

as a direct consequence of (3.16).

The linearity of the projection operator is a simple consequence of Proposition
3.6.1 and is left as an exercise to the reader:

Corollary 3.6.1 For any linear subspace E of V, the projection mapping Proj_E :
V → E is a linear mapping wherever defined: For every v and w in V whose
projections Proj_E(v) and Proj_E(w) onto E exist, the projection of av + bw onto
E exists for arbitrary scalars a and b in R, and is given by

        Proj_E(av + bw) = a Proj_E(v) + b Proj_E(w).

We stress again that at this level of generality, there is no guarantee that the
projection always exists. There is however a situation of great practical importance
where this is indeed the case.

Lemma 3.6.1 Assume E to be a linear subspace of V spanned by the orthonormal
family u_1, . . . , u_p for some finite integer p. Then, every element v in V
admits a projection onto E given by

(3.18)    Proj_E(v) = Σ_{i=1}^p ⟨v, u_i⟩ u_i.

For future use, under the conditions of Lemma 3.6.1, we note that

(3.19)    ‖Proj_E(v)‖² = Σ_{i=1}^p |⟨v, u_i⟩|²,   v ∈ V

as a simple consequence of the orthonormality of the family u_1, . . . , u_p.

Proof. Pick an element v in V, and set

        v⋆ := Σ_{i=1}^p ⟨v, u_i⟩ u_i.

The element v⋆ belongs to E, with

        ⟨v − v⋆, u_i⟩ = ⟨v, u_i⟩ − ⟨v⋆, u_i⟩
                      = ⟨v, u_i⟩ − Σ_{j=1}^p ⟨v, u_j⟩ ⟨u_j, u_i⟩
(3.20)                = ⟨v, u_i⟩ − ⟨v, u_i⟩ = 0,   i = 1, . . . , p.

From Lemma 3.5.2 it is plain that v − v⋆ is orthogonal to E, thus v⋆ satisfies
(3.15) and the proof is now completed by invoking Proposition 3.6.1.
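The formula (3.18) and the characterization (3.16) can be checked directly. The following added sketch (dimensions and data are our own choices) computes Proj_E(v) for a subspace spanned by an orthonormal family, and verifies that the error is orthogonal to E, with (3.17) as a by-product.

```python
import numpy as np

# Projection onto E = sp(u_1, ..., u_p) via (3.18); the error v - Proj_E(v)
# is orthogonal to E as in (3.16), and (3.17) follows.
rng = np.random.default_rng(2)
d, p = 6, 3
U, _ = np.linalg.qr(rng.normal(size=(d, p)))   # orthonormal columns u_1, ..., u_p

v = rng.normal(size=d)
proj = U @ (U.T @ v)                           # sum_i <v, u_i> u_i
error = v - proj

print(np.allclose(U.T @ error, 0.0))           # error is orthogonal to each u_i
print(np.isclose(np.dot(v, v),                 # (3.17): ||v||^2 = ||Proj||^2 + ||error||^2
                 np.dot(proj, proj) + np.dot(error, error)))
```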

3.7 A proof of Proposition 3.6.1

First, there can be at most one element in E which satisfies (3.15), for if there were
two such elements, say v⋆_1 and v⋆_2 in E, then

        ⟨v − v⋆_k, h⟩ = 0,   k = 1, 2,  h ∈ E,

so that
        ⟨v⋆_1 − v⋆_2, h⟩ = 0,   h ∈ E.

Using h = v⋆_1 − v⋆_2, an element of E, in this last relation we find ‖v⋆_1 − v⋆_2‖ = 0,
whence v⋆_1 = v⋆_2 necessarily.
Let v⋆ be an element in E which satisfies (3.14). For any h in E, the vector
v⋆ + th is also an element of E for all t in R. Thus, by the definition of v⋆ it holds
that

        ‖v − v⋆‖² ≤ ‖v − (v⋆ + th)‖²,   t ∈ R

with
        ‖v − (v⋆ + th)‖² = ‖v − v⋆‖² + t²‖h‖² − 2t⟨v − v⋆, h⟩.

Consequently,
        t²‖h‖² − 2t⟨v − v⋆, h⟩ ≥ 0,   t ∈ R.

This last inequality readily implies

        t‖h‖² ≥ 2⟨v − v⋆, h⟩,   t > 0

and
        −|t|‖h‖² ≤ 2⟨v − v⋆, h⟩,   t < 0.

Letting t go to zero in each of these last two inequalities yields ⟨v − v⋆, h⟩ ≤ 0
and ⟨v − v⋆, h⟩ ≥ 0, respectively, and the desired conclusion (3.15) follows.

Conversely, consider any element v⋆ in E satisfying (3.15). For each x in E,
(3.15) implies the orthogonality of v − v⋆ and h = v⋆ − x (this last vector being
in E), and Pythagoras' Theorem thus yields

        ‖v − x‖² = ‖v − v⋆‖² + ‖v⋆ − x‖² ≥ ‖v − v⋆‖².

This establishes the minimum distance requirement for v⋆, and (3.15) indeed char-
acterizes the solution to (3.14).

3.8 Gram-Schmidt orthonormalization

As the discussion in Section 3.6 already indicates, the ability to identify Proj_E(v)
is greatly simplified if E is spanned by a finite orthonormal family. While E may
not be first introduced as being generated by a family of orthonormal vectors, it
is however possible to find another family of vectors, this time orthonormal, that
nevertheless spans E. The procedure to do so is known as the Gram-Schmidt
orthonormalization procedure.

More formally, this procedure provides an algorithm to solve the following
problem: Given non-zero vectors v_1, . . . , v_n in V, find a collection of orthonormal
vectors u_1, . . . , u_p in V such that

        sp(v_1, . . . , v_n) = sp(u_1, . . . , u_p).

While there is no a priori constraint on n, it is plain from previous remarks that
p ≤ n. The Gram-Schmidt procedure is iterative and works as follows:
Step 1: Pick v_1 and define the vector u_1 by

        u_1 := v_1 / ‖v_1‖.

This definition is well posed since ‖v_1‖ ≠ 0 for the non-zero vector v_1. Obviously,
‖u_1‖ = 1. Set

        ℓ(1) := 1  and  E_1 := sp(u_1),

and go to Step 2.

At Step k, the procedure has already returned the ℓ orthonormal vectors u_1, . . . , u_ℓ
with ℓ = ℓ(k) ≤ k, and let E_ℓ denote the corresponding linear span, i.e., E_ℓ :=
sp(u_1, . . . , u_ℓ).
Step k + 1: Pick v_{k+1}.

Either v_{k+1} lies in the span E_ℓ, i.e.,

        v_{k+1} = Σ_{j=1}^ℓ ⟨v_{k+1}, u_j⟩ u_j,

in which case set

        ℓ(k + 1) := ℓ(k)  and  E_{ℓ(k+1)} := E_{ℓ(k)},

and go to Step k + 2;

Or v_{k+1} does not lie in E_ℓ, i.e.,

        v_{k+1} ≠ Σ_{j=1}^ℓ ⟨v_{k+1}, u_j⟩ u_j = Proj_{E_ℓ}(v_{k+1}),

in which case define

        u_{ℓ+1} := v'_{k+1} / ‖v'_{k+1}‖

with
        v'_{k+1} := v_{k+1} − Proj_{E_ℓ}(v_{k+1})
                  = v_{k+1} − Σ_{j=1}^ℓ ⟨v_{k+1}, u_j⟩ u_j.

The algorithm is well defined since v'_{k+1} ≠ 0, while v'_{k+1} is orthogonal to E_ℓ by
virtue of (3.16). It is now plain that the vectors u_1, . . . , u_ℓ, u_{ℓ+1} form an orthonormal
family. Set

        ℓ(k + 1) := ℓ(k) + 1  and  E_{ℓ(k+1)} := sp(E_{ℓ(k)} ∪ {u_{ℓ(k)+1}}),

and go to Step k + 2.

This algorithm terminates in a finite number of steps, in fact no more than n
steps. All the projections encountered in the course of running the algorithm do
exist by virtue of Lemma 3.6.1, as they are onto subspaces spanned by a finite
number of orthonormal vectors.
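The procedure translates into a few lines of code. The sketch below is an added illustration for vectors of R^d with the Euclidean scalar product; the numerical tolerance used to decide whether v_{k+1} already lies in the current span is our own choice.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-10):
    """Return an orthonormal family spanning the same subspace as `vectors`."""
    basis = []
    for v in vectors:
        # Subtract the projection of v onto the span of the current basis.
        residual = v - sum(np.dot(v, u) * u for u in basis)
        if np.linalg.norm(residual) > tol:      # v does not lie in the current span
            basis.append(residual / np.linalg.norm(residual))
        # Otherwise v is (numerically) in the span already: skip it, as in the procedure.
    return basis

# Example: three vectors in R^3, the third being a combination of the first two.
v1, v2, v3 = np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([2.0, 1.0, 1.0])
U = gram_schmidt([v1, v2, v3])
print(len(U))                                   # 2: the span has dimension 2
G = np.array([[np.dot(ui, uj) for uj in U] for ui in U])
print(np.allclose(G, np.eye(len(U))))           # True: the family is orthonormal
```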

3.8.1 Exercises
Ex. 3.1 Show that in a commutative group (V, +), there can be only one zero
vector.

Ex. 3.2 Show that in a commutative group (V, +), for every vector v in V, its
negative −v is unique.

Ex. 3.3 Let u_1, . . . , u_p and v_1, . . . , v_q denote two collections of linearly independent
vectors in V. Show that if sp(u_1, . . . , u_p) = sp(v_1, . . . , v_q), then necessarily p = q.

Ex. 3.4 If E is a linear subspace of V, then it necessarily contains the zero element
0. Moreover, v belongs to E if and only if −v belongs to E.

Ex. 3.5 For non-zero vectors v and w in V, we define their correlation coefficient
by

        ρ(v; w) := ⟨v, w⟩ / (‖v‖ ‖w‖).

Ex. 3.6 Show that |ρ(v; w)| ≤ 1. Find a necessary and sufficient condition for
ρ(v; w) = 1, and for ρ(v; w) = −1.

Ex. 3.7 If the set E is the linear span of the vectors v_1, . . . , v_p in V, then show
that v is orthogonal to E if and only if ⟨v, v_i⟩ = 0 for all i = 1, . . . , p.

Ex. 3.8 Consider a linear subspace E which is spanned by the set F in V. Show
that v in V is orthogonal to E if and only if v is orthogonal to F.

Ex. 3.9 Let E_1 and E_2 be subsets of V such that E_1 ⊆ E_2. Assume that for some
v in V, its projection Proj_{E_2}(v) exists and is an element of E_1. Explain why

        Proj_{E_1}(v) = Proj_{E_2}(v).

Ex. 3.10 Prove Corollary 3.6.1.

Ex. 3.11 Repeat Exercise 3.3 using the Gram-Schmidt orthonormalization procedure.

Ex. 3.12 Let (V_1, +) and (V_2, +) denote two vector spaces on R. A mapping
T : V_1 → V_2 is linear if

        T(av + bw) = aT(v) + bT(w),   v, w ∈ V_1,  a, b ∈ R.

For any subset E of V_1, we write T(E) = {T(v), v ∈ E}. For E a linear
subspace of V_1, show that T(E) is a linear subspace of V_2.

Ex. 3.13 For i = 1, 2, let (V_i, +) denote a vector space on R, equipped with its
own scalar product ⟨·, ·⟩_i : V_i × V_i → R, and let ‖·‖_i denote the corresponding
norm. A mapping T : V_1 → V_2 is said to be norm-preserving if

        ‖T(v)‖_2 = ‖v‖_1,   v ∈ V_1.

Show that if the mapping T is linear, then it is norm-preserving if and only if T
preserves the scalar product, i.e.,

        ⟨T(v), T(w)⟩_2 = ⟨v, w⟩_1,   v, w ∈ V_1.

Ex. 3.14
Chapter 4

Finite-dimensional representations

Building on the discussion of Chapter 3, we now present two vector spaces of


interest for subsequent developments.

4.1 Finite-dimensional spaces

The simplest example of a vector space is the space R^d with d some positive
integer. An element v of R^d is identified with the d-uple (v_1, . . . , v_d) with v_i in R
for each i = 1, . . . , d.

In R^d, the addition and multiplication operations are defined componentwise
in the usual way by

        v + w := (v_1 + w_1, . . . , v_d + w_d)

and
        av := (av_1, . . . , av_d),   a ∈ R

for any pair of vectors v = (v_1, . . . , v_d) and w = (w_1, . . . , w_d) in R^d. It is a
simple matter to show that these operations turn (R^d, +) into a vector space on
R. The zero element in (R^d, +) is simply the vector 0 = (0, . . . , 0) with all zero
entries.

Statements on the linear independence of vectors in R^d are statements in Linear
Algebra. Indeed, consider vectors v_1, . . . , v_p in R^d with v_i = (v_{i1}, . . . , v_{id})
for each i = 1, . . . , p. The linear independence requirements (3.1) and (3.2) now
read as requiring that the d simultaneous relations

        Σ_{i=1}^p a_i v_{ij} = 0,   j = 1, . . . , d

with scalars a_1, . . . , a_p in R imply a_1 = . . . = a_p = 0. In other words, the linear
independence of the vectors v_1, . . . , v_p is tantamount to a rank property of the
p × d matrix V = (v_{ij}).
The vector space R^d is endowed with the scalar product given by

        ⟨v, w⟩ := Σ_{i=1}^d v_i w_i,   v, w ∈ R^d.

It is straightforward to check the requisite bilinearity, symmetry and positive
definiteness. The norm induced by this scalar product now takes the form

        ‖v‖ := √⟨v, v⟩ = ( Σ_{i=1}^d |v_i|² )^{1/2},   v ∈ R^d

and the corresponding distance is simply the Euclidean distance on R^d given by

        d(v, w) := ‖v − w‖ = ( Σ_{i=1}^d |v_i − w_i|² )^{1/2},   v, w ∈ R^d.

The vector space R^d contains a very special set of vectors, denoted by e_1, . . . , e_d,
which form an extremely convenient orthonormal family: For each i = 1, . . . , d,
the vector e_i = (e_{i1}, . . . , e_{id}) has all its components zero except the ith, which is
equal to one, i.e.,

        e_{ij} = δ(i, j),   i, j = 1, . . . , d.

Obviously,
        ⟨e_i, e_j⟩ = δ(i, j),   i, j = 1, . . . , d

and for every element v = (v_1, . . . , v_d) in R^d, we can write

        v = (v_1, . . . , v_d)
          = v_1 (1, 0, . . . , 0) + v_2 (0, 1, . . . , 0) + . . . + v_d (0, 0, . . . , 1)
          = v_1 e_1 + . . . + v_d e_d.

Thus, R^d (as a subspace of itself) has dimension d, and therefore no more than d
non-zero vectors can ever be orthogonal, hence orthonormal, in R^d.

As an immediate consequence, any linear subspace E of R^d can always be
viewed as the linear span of a finite number of orthonormal vectors. Hence, by
Lemma 3.6.1 the projection operator onto E is well defined as a mapping Proj_E :
R^d → E on the whole of R^d, where it is linear by Corollary 3.6.1.
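The rank criterion above, and the role of the standard basis e_1, . . . , e_d, are easy to check numerically, as in the following added sketch (the specific vectors are our own choices).

```python
import numpy as np

# Linear independence of v_1, ..., v_p in R^d amounts to the p x d matrix V = (v_ij)
# having rank p; the standard basis e_1, ..., e_d is an orthonormal family.
d = 4
V = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 3.0, 0.0]])        # third row = first + second: dependent
print(np.linalg.matrix_rank(V))             # 2 < 3, so the rows are linearly dependent

E = np.eye(d)                               # rows are e_1, ..., e_d
print(np.allclose(E @ E.T, np.eye(d)))      # <e_i, e_j> = delta(i, j)

v = np.array([3.0, -1.0, 0.5, 2.0])
print(np.allclose(v, sum(v[i] * E[i] for i in range(d))))  # v = v_1 e_1 + ... + v_d e_d
```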

4.2 Signal spaces

Let I be a non-degenerate interval of the real line R, say [a, b] (with a < b),
(−∞, b] or [a, ∞). A (real-valued) signal is any function φ : I → R. The energy
of the signal φ is the quantity E(φ) defined by

        E(φ) := ∫_I |φ(t)|² dt.

The signal φ has finite energy if E(φ) < ∞. The space of all finite energy signals
on the interval I is denoted by L²(I), namely

        L²(I) := {φ : I → R : E(φ) < ∞}.

The set L²(I) can be endowed with a vector space structure by introducing a
vector addition and a multiplication by constants, i.e., for any φ and ψ in L²(I) and
any scalar a in R, we define the signals φ + ψ and aφ by

        (φ + ψ)(t) := φ(t) + ψ(t),   t ∈ I

and
        (aφ)(t) := aφ(t),   t ∈ I.

The signals φ + ψ and aφ are all finite energy signals if φ and ψ are in L²(I). It
is easy to show that, equipped with these operations, (L²(I), +) is a vector space
on R. The zero element for (L²(I), +) will be the zero signal θ : I → R defined
by θ(t) = 0 for all t in I.

In L²(I) the notion of linear independence specializes as follows: The signals
φ_1, . . . , φ_p in L²(I) are linearly independent if

        Σ_{i=1}^p a_i φ_i = θ

with scalars a_1, . . . , a_p in R implies a_1 = . . . = a_p = 0. This is equivalent to the
validity of the simultaneous relations

        Σ_{i=1}^p a_i φ_i(t) = 0,   t ∈ I

with scalars a_1, . . . , a_p in R implying a_1 = . . . = a_p = 0. In contrast with the
situation in R^d, here there is no constraint on p, as the following example shows
[Exercise 4.7].

Example 4.2.1 Take I = [0, 1] and for each k = 0, 1, . . ., define the signal φ_k :
[0, 1] → R by φ_k(t) = t^k (t ∈ I). For each p = 1, 2, . . ., the signals φ_0, φ_1, . . . , φ_p
are linearly independent in L²(I). Therefore, L²(I) cannot be of finite dimension.

Here as well, we can define a scalar product by setting

        ⟨φ, ψ⟩ := ∫_I φ(t)ψ(t) dt,   φ, ψ ∈ L²(I).

We leave it as an exercise to show that this definition gives rise to a scalar product
on L²(I). The norm of a finite energy signal φ is now defined by

        ‖φ‖ := √⟨φ, φ⟩,   φ ∈ L²(I)

or, in extensive form,

        ‖φ‖ = ( ∫_I |φ(t)|² dt )^{1/2} = √E(φ),   φ ∈ L²(I).

It should be noted that this notion of energy norm is not quite a norm on L²(I)
as understood earlier. Indeed, positive definiteness fails here since ‖φ‖ = 0 does
not necessarily imply φ = θ: Just take φ(t) = 1 for t in I ∩ Q and φ(t) = 0 for
t in I ∩ Qᶜ, in which case ‖φ‖ = 0 but φ ≠ θ! This difficulty is overcome by
partitioning L²(I) into equivalence classes, with signals considered as equivalent
if their difference has zero energy, i.e., the two signals φ and φ' in L²(I) are
equivalent if ‖φ − φ'‖² = 0. It is this collection of equivalence classes that should
be endowed with a vector space structure and a notion of scalar product, instead
of the collection of all finite energy signals defined on I. Pointers are provided in
Exercises 4.3-4.6. This technical point will not be pursued any further, as it does
not affect the analyses carried out here. Thus, with a slight abuse of notation, we
will consider the scalar product defined earlier on L²(I) as a bona fide scalar
product.

With these definitions, the notions of orthogonality and orthonormality are
defined as before. However, while in R^d there could be no more than d vectors
which can ever be orthonormal, this is not the case in L²(I) [Exercise 4.8].
Example 4.2.2 Pick I = [0, 1] and for each k = 0, 1, . . ., define the signals φ_k :
I → R by

(4.1)    φ_0(t) = 1,   φ_k(t) = √2 cos(2πkt),   t ∈ I,  k = 1, 2, . . .

For each p = 1, 2, . . ., the signals φ_0, φ_1, . . . , φ_p are orthonormal in L²(I).

The notion of distance on L²(I) associated with the energy norm takes the
special form

        d(φ, ψ) := ( ∫_I |φ(t) − ψ(t)|² dt )^{1/2},   φ, ψ ∈ L²(I).
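The orthonormality claimed in Example 4.2.2 is easy to verify numerically. The following added sketch approximates the scalar product on L²[0, 1] by a midpoint-rule quadrature; the grid size is our own choice.

```python
import numpy as np

# Check numerically that phi_0(t) = 1 and phi_k(t) = sqrt(2) cos(2 pi k t) of
# Example 4.2.2 are orthonormal for <phi, psi> = integral over [0, 1] of phi(t) psi(t) dt.
N = 20000
t = (np.arange(N) + 0.5) / N                  # midpoints of a uniform grid on [0, 1]

def phi(k):
    return np.ones(N) if k == 0 else np.sqrt(2.0) * np.cos(2.0 * np.pi * k * t)

def inner(f, g):
    return np.mean(f * g)                     # midpoint-rule approximation of the integral

G = np.array([[inner(phi(i), phi(j)) for j in range(4)] for i in range(4)])
print(np.round(G, 6))                         # approximately the 4 x 4 identity matrix
```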

4.3 Projections in L²(I)

As we now explore the notion of projection onto a linear subspace E of L²(I),
we shall see shortly that, in sharp contrast with the situation in R^d, existence is not
automatic anymore. In other words, for an arbitrary signal φ in L²(I), there is
no guarantee that there will always be an element φ⋆ in E which has the smallest
distance to φ, i.e.,

(4.2)    d(φ, φ⋆) = inf_{ψ ∈ E} d(φ, ψ).

Additional assumptions are needed on E for (4.2) to hold for all signals φ in L²(I).
However, when φ⋆ does exist, it is necessarily unique by virtue of Proposition
3.6.1.

To gain a better understanding as to why the projection onto E may fail to
exist, consider the situation where a countably infinite family of orthonormal signals
{φ_k, k = 1, 2, . . .} is available. For each n = 1, 2, . . ., let E_n denote the
linear span of the first n signals φ_1, . . . , φ_n. Fix φ in L²(I). By Lemma 3.6.1 the
projection of φ onto E_n always exists, and is given by

        φ̂_n := Proj_{E_n}(φ) = Σ_{k=1}^n ⟨φ, φ_k⟩ φ_k,

and (3.19) yields

        ‖φ̂_n‖² = Σ_{k=1}^n |⟨φ, φ_k⟩|².

With the corresponding error defined by

        e_n := φ − φ̂_n,

we find from (3.17) that

        ‖φ‖² = ‖φ̂_n‖² + ‖e_n‖²

by the orthogonality condition (3.15).

Combining these observations leads to

        ‖e_n‖² = ‖φ‖² − ‖φ̂_n‖² = ‖φ‖² − Σ_{k=1}^n |⟨φ, φ_k⟩|²,

and the convergence

        lim_{n→∞} ‖e_n‖² =: ε(φ)

takes place in a monotonically decreasing manner. Of course, this is consistent
with the geometric viewpoint according to which φ̂_n is the best approximation of
φ among the elements of E_n. The inclusions E_n ⊆ E_{n+1}, n = 1, 2, . . ., imply that
the approximations {φ̂_n, n = 1, 2, . . .} are increasingly accurate, or equivalently,
that the magnitude of the error, namely ‖e_n‖, decreases.
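The monotone decrease of ‖e_n‖² can be observed numerically. The sketch below is an added illustration (the test signal φ(t) = t² and all grid choices are ours): it projects φ onto the nested spans sp(φ_0, . . . , φ_n) of the family (4.1) and prints the successive squared errors, which decrease toward a strictly positive limit, an instance of Case 2.b discussed below.

```python
import numpy as np

# For phi(t) = t^2 on [0, 1] and the cosine family (4.1), compute
# ||e_n||^2 = ||phi||^2 - sum_{k=0}^{n} |<phi, phi_k>|^2 and watch it decrease in n.
# Here the limit is positive: the "sine part" of t^2 cannot be captured by cosines.
N = 20000
t = (np.arange(N) + 0.5) / N                   # midpoint grid on [0, 1]
phi = t ** 2

def basis(k):                                  # phi_0 = 1, phi_k = sqrt(2) cos(2 pi k t)
    return np.ones(N) if k == 0 else np.sqrt(2.0) * np.cos(2.0 * np.pi * k * t)

def inner(f, g):
    return np.mean(f * g)                      # midpoint-rule approximation of the integral

energy = inner(phi, phi)                       # ||phi||^2 = 1/5
captured = 0.0
for n in range(6):
    captured += inner(phi, basis(n)) ** 2
    print(n, energy - captured)                # ||e_n||^2: non-increasing in n
```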
A natural question is to determine the limiting value ε(φ). Several cases arise
depending on whether ε(φ) > 0 or ε(φ) = 0. In the discussion we make use of
the easy identity

(4.3)    E_∞ := sp(φ_k, k = 1, 2, . . .) = ∪_k E_k.

Case 1: If φ belongs to E_∞, then φ is an element of E_p for some p, and φ̂_{p+k} = φ
for all k = 0, 1, . . ., whence e_{p+k} = θ, and ε(φ) = 0. Obviously the projection
onto E_∞ does exist, with φ = Proj_{E_∞}(φ).

Case 2: When φ is not an element of E_∞, the error e_n is never the zero signal, but
two distinct scenarios are possible.

Case 2.a: With φ not in E_∞, if ε(φ) = 0, then φ can be approximated ever more
closely by an element of E_∞ since lim_n ‖φ − φ̂_n‖² = 0. It is then customary
to say that φ is an element of the closure of E_∞, a fact noted

        φ ∈ Ē_∞,   the closure of sp(φ_k, k = 1, 2, . . .).

The set Ē_∞ is called the closure of the linear subspace E_∞; it is itself a linear
subspace of L²(I) which could be defined by

        Ē_∞ := {ψ ∈ L²(I) : ε(ψ) = 0}.

However, Proj_{E_∞}(φ) does not exist, as the following argument by contradiction
shows: If the projection Proj_{E_∞}(φ) were indeed to exist, then it would have
to be an element of E_∞, say φ̂. By the definition of E_∞, the signal φ̂ is an element
of E_p for some p, and it is a simple matter to check that φ̂ = φ̂_{p+k} for all
k = 0, 1, . . .. Consequently, making use of earlier observations, we find

        ‖φ‖² = ‖φ̂_{p+k}‖² + ‖e_{p+k}‖² = ‖φ̂‖² + ‖e_{p+k}‖²,   k = 0, 1, . . .

Letting k go to infinity and using the fact that ε(φ) = 0, we obtain ‖φ‖² = ‖φ̂‖². It
follows from (3.17) that ‖ẽ‖ = 0 since ‖φ‖² = ‖φ̂‖² + ‖ẽ‖² (with ẽ := φ − φ̂).
Therefore, ẽ = θ and φ = φ̂. But this implies that φ was an element of E_∞, and
a contradiction ensues.
On the other hand, Proj_{Ē_∞}(φ) does exist, and it is customary to represent it
formally as an infinite series, namely

(4.4)    Proj_{Ē_∞}(φ) = Σ_{k=1}^∞ ⟨φ, φ_k⟩ φ_k,

to capture the intuitive fact that Proj_{Ē_∞}(φ) is the limiting signal increasingly well
approximated by the projection signals {φ̂_n, n = 1, 2, . . .}. Note that here φ =
Proj_{Ē_∞}(φ).

It follows from the discussion above that infinitely many of the coefficients
{⟨φ, φ_k⟩, k = 1, 2, . . .} must be non-zero, and some care therefore needs to be
exercised in defining the element (4.4) of L²(I): Up to now only finite linear
combinations have been considered. For our purpose, it suffices to note that for
any sequence {c_k, k = 1, 2, . . .} of scalars, the infinite series Σ_{k=1}^∞ c_k φ_k can be
made to represent an element of L²(I) under the summability condition

(4.5)    Σ_{k=1}^∞ |c_k|² < ∞.

This can be achieved by showing that the partial sums

        Σ_{ℓ=1}^k c_ℓ φ_ℓ,   k = 1, 2, . . .

converge in some suitable sense to an element of L²(I) (which is then represented by
Σ_{k=1}^∞ c_k φ_k). We invite the reader to check that indeed

(4.6)    Σ_{k=1}^∞ |⟨φ, φ_k⟩|² < ∞,   φ ∈ L²(I).

Example 4.3.1 Continue with the situation in Example 4.2.2, and set

        φ(t) := Σ_{k=1}^∞ (1/k²) cos(2πkt),   t ∈ I.

The signal φ is a well-defined element of L²(I) with ε(φ) = 0, and yet φ is not an
element of E_∞.

Case 2.b: With φ not in E_∞, if ε(φ) > 0, then φ cannot be an element of
Ē_∞ and therefore cannot be approximated ever so closely by an element in E_∞.
Here Proj_{E_∞}(φ) may not exist, but Proj_{Ē_∞}(φ) always does exist, with

        φ ≠ Proj_{Ē_∞}(φ) = Σ_{k=1}^∞ ⟨φ, φ_k⟩ φ_k.

We follow up these comments with the following examples.

Example 4.3.2 Continue with the situation in Example 4.2.2, and take

        φ(t) := sin(2πt),   t ∈ I.

Here, ε(φ) > 0, and the projection of φ onto E_∞ exists, with Proj_{E_∞}(φ) = θ.

Example 4.3.3 Continue with the situation in Example 4.2.2, and take

        φ(t) := sin(2πt) + Σ_{k=1}^∞ (1/k²) cos(2πkt),   t ∈ I.

This time, it is still the case that ε(φ) > 0, but the projection of φ onto E_∞ does
not exist.

The last two examples show that it is possible to have

        Ē_∞ ≠ L²(I),

a possibility reflecting the fact that the orthonormal family {φ_k, k = 1, 2, . . .} is
not rich enough, in that its (finite) linear combinations are not sufficient to approximate
some element in L²(I) to any prescribed level of accuracy. This motivates
the following definition: The orthonormal family {φ_k, k = 1, 2, . . .} is said to be
complete (in L²(I)) if

        Ē_∞ = L²(I).

This is equivalent to

        ε(φ) = lim_{n→∞} ‖φ − φ̂_n‖² = 0

for every signal φ in L²(I).


Example 4.3.4 Pick I = [0, 1] and for each k = 0, 1, . . ., define the signals φ_k :
I → R by

        φ_{2k}(t) = √2 cos(2πkt),   t ∈ I,  k = 1, 2, . . .

and
        φ_{2k+1}(t) = √2 sin(2π(k+1)t),   t ∈ I,  k = 0, 1, . . .

with φ_0(t) = 1 (t ∈ I). It is a non-trivial fact concerning the structure of the
space L²(I) that the orthonormal family {φ_k, k = 0, 1, . . .} is complete [?].

4.4 Finite-dimensional spaces of L²(I)

The discussion from earlier sections suggests ways to represent finite energy signals.
Given an orthonormal family {φ_k, k = 1, 2, . . .} in L²(I), we associate with
each finite energy signal a sequence of finite-dimensional vectors. Formally, for
each n = 1, 2, . . ., we set

(4.7)    T_n(φ) := (⟨φ, φ_1⟩, . . . , ⟨φ, φ_n⟩),   φ ∈ L²(I).

The vector T_n(φ) is an element of R^n. By restricting our attention to E_n we get
the following useful fact.

Lemma 4.4.1 For each n = 1, 2, . . ., the correspondence T_n : E_n → R^n given by
(4.7) is a norm-preserving bijection, i.e., T_n is onto and one-to-one with

(4.8)    ‖T_n(φ)‖² = Σ_{k=1}^n |⟨φ, φ_k⟩|² = ‖φ‖²,   φ ∈ E_n.

More generally we have

(4.9)    ⟨T_n(φ), T_n(ψ)⟩ = Σ_{k=1}^n ⟨φ, φ_k⟩ ⟨ψ, φ_k⟩ = ⟨φ, ψ⟩,   φ, ψ ∈ E_n.

Proof. First, when restricted to E_n, the projection operator Proj_{E_n} reduces to the
identity, i.e., Proj_{E_n}(φ) = φ whenever φ is an element of E_n. Thus, with the
notation introduced earlier, for any φ in E_n, we have

        φ = φ̂_n = Σ_{k=1}^n ⟨φ, φ_k⟩ φ_k

so that
        ‖φ‖² = Σ_{k=1}^n |⟨φ, φ_k⟩|²

and (4.8) holds. The relation (4.9) is proved in a similar way.

As a result, if T_n(φ) = T_n(φ') for signals φ and φ' in E_n, then T_n(φ − φ') = 0
by linearity, and ‖φ − φ'‖ = ‖T_n(φ − φ')‖ = 0 by isometry. The inescapable
conclusion is that φ = φ', whence T_n is one-to-one.

Finally, any vector v = (v_1, . . . , v_n) in R^n gives rise to a signal φ_v in E_n
through

        φ_v := Σ_{k=1}^n v_k φ_k.

It is plain that ⟨φ_v, φ_k⟩ = v_k for each k = 1, . . . , n, hence T_n(φ_v) = v and the
mapping T_n is onto.

As a result, any element of E_n can be represented uniquely by a vector in
R^n. This correspondence, formalized in Lemma 4.4.1, is norm-preserving and
allows signals in E_n to be viewed as finite-dimensional vectors.

Next, we address the situation of arbitrary signals. To do so, we will need to
assume that the orthonormal family {φ_k, k = 1, 2, . . .} is rich enough.
Theorem 4.4.1 Assume the orthonormal family {φ_k, k = 1, 2, . . .} to be complete
in L²(I). Then, any finite energy signal φ in L²(I) admits a unique representation
as a sequence

        (⟨φ, φ_k⟩, k = 1, 2, . . .).

Moreover, Parseval's identity

(4.10)    ‖φ‖² = Σ_{k=1}^∞ |⟨φ, φ_k⟩|²,   φ ∈ L²(I)

holds.
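Parseval's identity can also be checked numerically, using the complete family of Example 4.3.4. The sketch below is an added illustration; the triangular test signal and the truncation at 200 harmonics are our own choices.

```python
import numpy as np

# Numerical check of Parseval's identity (4.10) with the complete family of
# Example 4.3.4 on [0, 1]: 1, sqrt(2) cos(2 pi k t), sqrt(2) sin(2 pi k t), k = 1, 2, ...
N = 20000
t = (np.arange(N) + 0.5) / N                    # midpoint grid on [0, 1]

def inner(f, g):
    return np.mean(f * g)                       # approximates the integral over [0, 1]

phi = np.minimum(t, 1.0 - t)                    # a triangular finite-energy signal
energy = inner(phi, phi)                        # ||phi||^2

total = inner(phi, np.ones(N)) ** 2             # coefficient on the constant signal
for k in range(1, 201):
    total += inner(phi, np.sqrt(2.0) * np.cos(2.0 * np.pi * k * t)) ** 2
    total += inner(phi, np.sqrt(2.0) * np.sin(2.0 * np.pi * k * t)) ** 2

print(energy, total)                            # the two values agree closely
```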

4.5 Exercises
Ex. 4.1 Consider two families u_1, . . . , u_p and w_1, . . . , w_q of linearly independent
vectors in R^d. Show that we necessarily have p = q whenever

        sp(u_1, . . . , u_p) = sp(w_1, . . . , w_q).

Ex. 4.2 Let u_1, . . . , u_p be an orthonormal family in R^d for some integer p ≤ d.
Find the linear span of the family of 2^p vectors in R^d defined by

        f(b) := Σ_{ℓ=1}^p (−1)^{b_ℓ + 1} u_ℓ

with b = (b_1, . . . , b_p) a binary string of length p, i.e., b_ℓ = 0 or b_ℓ = 1 for
ℓ = 1, . . . , p.

Ex. 4.3 Two signals φ and φ' in L²(I) are said to be equivalent if ‖φ − φ'‖² = 0,
and we then write φ ≡ φ'. Show that this notion defines an equivalence relation on
L²(I).

Ex. 4.4 With the notation of Exercise 4.3, show that addition of signals and multiplication
of signals by scalars are compatible with this equivalence relation ≡.
More precisely, with φ ≡ φ' and ψ ≡ ψ' in L²(I), show that φ + ψ ≡ φ' + ψ' and
aφ ≡ aφ' for every scalar a.

Ex. 4.5 With φ ≡ φ' and ψ ≡ ψ' in L²(I), show that ‖φ‖² = ‖φ'‖² and that
⟨φ, ψ⟩ = ⟨φ', ψ'⟩.

Ex. 4.6 Let L̄²(I) denote the collection of equivalence classes induced on L²(I)
by the equivalence relation ≡. Using Exercise 4.4 and Exercise 4.5, define a
structure of vector space on L̄²(I) and a notion of scalar product.

Ex. 4.7 Show that the signals {φ_k, k = 0, 1, . . .} of Example 4.2.1 are linearly
independent in L²(I).

Ex. 4.8 Show that the signals {φ_k, k = 0, 1, . . .} of Example 4.2.2 form an orthonormal
family in L²(I).

Ex. 4.9 Apply the Gram-Schmidt orthonormalization procedure to the family {φ_k, k =
0, 1, 2} in L²[0, 1] given by

        φ_k(t) = t^k,   t ∈ [0, 1],  k = 0, 1, 2.

Does the answer depend on the order in which the algorithm processes the signals
φ_0, φ_1 and φ_2?

Ex. 4.10 The distinct finite energy signals φ_1, . . . , φ_n defined on [0, 1] have the
property that φ_1(t) = . . . = φ_n(t) for all t in the subinterval [α, β] with 0 < α <
β < 1. Are such signals necessarily linearly independent in L²[0, 1]? Explain.

Ex. 4.11 Starting with a finite energy signal g in L²[0, T] with E(g) > 0, define
the two signals g_c and g_s in L²[0, T] by

        g_c(t) := g(t) cos(2πf_c t)  and  g_s(t) := g(t) sin(2πf_c t),   0 ≤ t ≤ T

for some carrier frequency f_c > 0. Show that the signals g_c and g_s are always
linearly independent in L²[0, T].

Ex. 4.12 Consider the M signals s_1, . . . , s_M in L²[0, T] given by

        s_m(t) = A cos(2πf_c t + θ_m),   0 ≤ t ≤ T,  m = 1, . . . , M

with amplitude A > 0, carrier f_c > 0 and distinct phases 0 ≤ θ_1 < . . . < θ_M <
2π. What is the dimension L of sp(s_1, . . . , s_M)? Find an orthonormal family in
L²[0, T], say φ_1, . . . , φ_L, such that sp(s_1, . . . , s_M) = sp(φ_1, . . . , φ_L). Find the
corresponding finite-dimensional representation.

Ex. 4.13 Apply the Gram-Schmidt orthonormalization procedure to the family of
M signals given in Exercise 4.12.

Ex. 4.14 Same problem as in Exercise 4.12, for the M signals given by

        s_m(t) = A_m g(t),   0 ≤ t ≤ T,  m = 1, . . . , M

with g a pulse in L²[0, T] and distinct amplitudes A_1 < . . . < A_M.



Ex. 4.15 Apply the Gram-Schmidt orthonormalization procedure to the family of
M signals given in Exercise 4.14.

Ex. 4.16 For the collection {φ_k, k = 0, 1, . . .} in Example 4.2.1, find φ in
L²(0, 1) such that φ does not belong to the linear span sp(φ_k, k = 0, 1, . . .),
but does belong to its closure.

Ex. 4.17 Consider a set {s_1, . . . , s_M} of M linearly dependent signals in L²[0, T).
Now partition the interval [0, T) into K non-empty subintervals, say [t_k, t_{k+1})
(k = 0, . . . , K − 1) with t_0 = 0 and t_K = T. For each k = 1, . . . , K, let
α_k = (α_{k1}, . . . , α_{kM}) denote an element of R^M, and define the new constellation
{s⋆_1, . . . , s⋆_M} by

        s⋆_m(t) = α_{km} s_m(t),   t ∈ [t_{k−1}, t_k),  k = 1, . . . , K

for each m = 1, . . . , M. Find conditions on the original constellation {s_1, . . . , s_M}
and on the vectors α_1, . . . , α_K that ensure the linear independence (in L²[0, T))
of the signals {s⋆_1, . . . , s⋆_M}.

Ex. 4.18 Consider a finite energy non-constant pulse g : [0, 1] → R, with g(t) >
0 on the unit interval [0, 1]. Are the signals g and g² linearly independent in
L²[0, 1]? Are the signals g, g², . . . , g^p always linearly independent in L²[0, 1]?

Ex. 4.19 For each λ > 0, let s_λ and c_λ denote the signals R → R given by

        s_λ(t) = sin(λt)  and  c_λ(t) = cos(λt),   t ∈ R.

For T > 0 and λ ≠ μ, find conditions for each of the collections {s_λ, c_λ},
{s_λ, s_μ}, {s_λ, c_λ, s_μ} and {s_λ, c_λ, s_μ, c_μ} (restricted to the interval [0, T]) to be
orthogonal in L²[0, T].

Ex. 4.20 Show (4.3).

Ex. 4.21 Discuss Example 4.3.1.

Ex. 4.22 Discuss Example 4.3.2.

Ex. 4.23 Discuss Example 4.3.3.

Ex. 4.24
