
INTERNATIONAL CENTRE FOR MECHANICAL SCIENCES

COURSES AND LECTURES - No. 166

TOBY BERGER
CORNELL UNIVERSITY
ITHACA, NEW YORK

AND

LEE D. DAVISSON
UNIVERSITY OF SOUTHERN CALIFORNIA
LOS ANGELES, CALIFORNIA

ADVANCES IN SOURCE
CODING

UDINE 1975

SPRINGER-VERLAG WIEN GMBH


This work is subject to copyright.
All rights are reserved,
whether the whole or part of the material is concerned,

specifically those of translation, reprinting, re-use of illustrations,


broadcasting, reproduction by photocopying machine
or similar means, and storage in data banks.

© 1975 by Springer-Verlag Wien


Originally published by Springer-Verlag Wien-New York in 1975

ISBN 978-3-211-81302-7 ISBN 978-3-7091-2928-9 (eBook)


DOI 10.1007/978-3-7091-2928-9
RATE DISTORTION THEORY AND DATA COMPRESSION

TOBY BERGER
School of Electrical Engineering
Cornell University, Ithaca, New York
PREFACE

I am grateful to CISM and to Prof. Giuseppe Longo for the opportunity


to lecture about rate distortion theory research during the 1973 CISM summer
school session. Rate distortion theory is currently the most vital area of probabilistic
information theory, affording significantly increased insight and understanding
about data compression mechanisms in both man-made and natural information
processing systems. The lectures that follow are intended to convey an appreciation
for the important results that have been obtained to date and the challenging
problems that remain to be resolved in this rapidly evolving branch of information
theory. The international composition of the CISM summer school attendees affords
a unique opportunity to attempt to stimulate subsequent contributions to the
theory from many nations. It is my hope that the lectures which follow capitalize
successfully on this promising opportunity.

T. Berger
LECTURE 1

Rate Distortion Theory: An Introduction


In this introductory lecture we present the rudiments of rate distortion
theory, the branch of information theory that treats data compression problems.
The rate distortion function is defined and a powerful iterative algorithm for
calculating it is described. Shannon's source coding theorems are stated and
heuristically discussed.
Shannon's celebrated coding theorem states that a source of entropy
rate H can be transmitted reliably over any channel of capacity C > H. This theorem
also has a converse devoted to the frequently encountered situation in which the
entropy rate of the source exceeds the capacity of the channel. Said converse states
that not all the source data can be recovered reliably when H > C [1].
Since it is always desirable to recover all the data correctly, one can
reduce H either by
(a) slowing down the rate of production of source letters, or
(b) encoding the letters more slowly than they are being produced.
However, (a) often is impossible because the source rate is not under
the communication system designer's control, and (b) often is impossible both
because the data becomes stale, or perhaps even useless, because of long coding
delays and because extensive buffering memory is needed.
If H cannot be lowered, then the only other remedy is to increase C.
This, however, is usually very expensive in practice. We shall assume in all that
follows that H already has been lowered and that C already has been raised as much
as is practically possible but the situation H > C nonetheless continues to prevail.
(This is always true in the important case of analog sources, because their absolute
entropy H is infinite whereas the C of any physical channel is finite.) In such a
situation it is reasonable to attempt to preserve those aspects of the source data that
are the most crucial to the application at hand. That is, the communication system
should be configured to reconstruct the source data at the channel output with
minimum possible distortion subject to the requirement that the rate of transmission
of information cannot exceed the channel capacity. Obviously, there is a tradeoff
between the rate at which information is provided about the output of a data source
and the fidelity with which said output can be reconstructed on the basis of this
information. The mathematical discipline that treats this tradeoff from the
viewpoint of information theory is known as rate distortion theory.


A graphical sketch of a typical rate-distortion tradeoff is shown in
Figure 1. If the rate R at which

Figure 1. A Typical Rate-Distortion Tradeoff

information can be provided about the source output exceeds H, then no distortion
need result. As R decreases from H towards 0, the minimum attainable distortion
steadily increases from 0 to the value D_max associated with the best guess one can
make in the total absence of any information about the source outputs. Curves that
quantify such tradeoffs aptly have been termed rate distortion functions by
Shannon [2].
The crucial property that a satisfactorily defined rate distortion
function R(D) possesses is the following:
(*) "It is possible to compress the data rate from H down to any R > R(D)
and still be able to recover the original source outputs with an average
distortion not exceeding D. Conversely, if the compressed data rate R
satisfies R < R(D), then it is not possible to recover the original source
data from the compressed version with an average distortion of D or
less".
An R(D) curve that possesses the above property clearly functions as
an extension, or generalization, of the concept of entropy. Just as H is the minimum
data rate (channel capacity) needed to transmit the source data with zero average
distortion, R(D) is the minimum data rate (channel capacity) needed to transmit the
data with average distortion D.
Omnis rate distortion theory in tres partes divisa est.
(i) Definition, calculation and bounding of R(D) curves for various data sources
and distortion measures.
(ii) Proving of coding theorems which establish that said R(D) curves do indeed
specify the absolute limit on the rate vs. distortion tradeoff in the sense of
(*).
(iii) Designing and analyzing practical communication systems whose performances
closely approach the ideal limit set by R(D).


We shall begin by defining R(D) for the case of a memoryless
source and a context-free distortion measure. This is the simplest case from the
mathematical viewpoint, but unfortunately it is the least interesting case from
the viewpoint of practical applications. The memoryless nature of the source
limits the extent to which the data rate can be compressed without severe
distortion being incurred. Moreover, because of the assumption that even
isolated errors cannot be corrected from contextual considerations, the ability
to compress the data rate substantially without incurring intolerable distortion
is further curtailed. Accordingly, we subsequently shall extend the
development to situations in which the source has memory and/or the
fidelity criterion has context-dependence.
A discrete memoryless source generates a sequence {X_1, X_2, ...} of
independent identically distributed random variables (i.i.d. r.v.'s). Each X_i
assumes a value in the finite set A = {a_1, ..., a_M} called the source alphabet.
The probability that X_i = a_j will be denoted by P(a_j). This probability does
not depend on i because the source outputs are identically distributed. Let
X = (X_1, ..., X_n) denote a random vector of n successive source outputs, and
let x = (x_1, ..., x_n) ∈ A^n denote a value that X can assume. Then the probability
P_n(x) that X assumes the value x is given by

P_n(x) = ∏_{t=1}^{n} P(x_t)

because the source has been assumed to be memoryless.


A communication system is to be built that will attempt to
convey the sequence of source outputs to an interested user, as depicted in
Figure 2.

Figure 2. A Communication System Linking Source to User

Let Y_1, Y_2, ... denote the sequence of letters received by the user. In
general, the Y_i assume values in an alphabet B = {b_1, ..., b_N} of letters that may differ
both in value and in cardinality from the source alphabet A, although A = B in most
applications.
In order to quantify the rate-distortion tradeoff, we must have a means
of specifying the distortion that results when X_i = a_j and Y_i = b_k. We shall assume
that there is given for this purpose a so-called distortion measure ρ: A×B → [0,∞].
That is, ρ(a_j, b_k) is the penalty, loss, cost, or distortion that results when the source
produces a_j and the system delivers b_k to the user. Moreover, we shall assume that
the distortion ρ_n(x, y) that results when a vector x ∈ A^n of n successive source letters
is represented to the user as y ∈ B^n is of the form

ρ_n(x, y) = n^{-1} ∑_{t=1}^{n} ρ(x_t, y_t).

A family {ρ_n, 1 ≤ n < ∞} of such vector distortion measures will be
referred to as a single-letter, or memoryless, fidelity criterion because the penalty
charged for each error the system makes does not depend on the overall context in
which the error occurs.

Example:

A = B = {0, 1}
ρ(0,0) = ρ(1,1) = 0
ρ(0,1) = α,   ρ(1,0) = β

Then

ρ_3(010, 001) = (1/3)(0 + α + β) = (α + β)/3
Each communication system linking source to user in Figure 2 may be
fully described for statistical purposes by specifying, for each j ∈ {1, ..., M} and
k ∈ {1, ..., N}, the probability Q_{k/j} that a typical system output Y will be equal to
b_k given that the corresponding input X equals a_j. Contracting notation from P(a_j)
to P_j and from ρ(a_j, b_k) to ρ_{jk}, we may associate with each system Q = (Q_{k/j})
two functions of extreme importance, namely

I(Q) = ∑_{j=1}^{M} ∑_{k=1}^{N} P_j Q_{k/j} log( Q_{k/j} / Q_k ),

where

Q_k = ∑_j P_j Q_{k/j},

and

d(Q) = ∑_{j,k} P_j Q_{k/j} ρ_{jk}.
I(Q) is the average mutual information between source and user for the
system Q, and d(Q) is the average distortion with which the source outputs are
reproduced for the user by system Q. For situations in which both the source and
the fidelity criterion are memoryless, Shannon [2] has defined the rate distortion
function R(D) as follows:

R(D) = min_{Q: d(Q) ≤ D} I(Q)

That is, we restrict our attention to those systems Q whose average distortion d(Q)
does not exceed a specified level D of tolerable average distortion, and then we
search among this set of Q's for the one with the minimum value of I(Q). We
minimize rather than maximize I(Q) because we wish to supply as little information
as possible about the source outputs provided, of course, that we preserve the data
with the required fidelity D. In physical terms, we wish to compress the rate of
transmission of information to the lowest value consistent with the requirement that
the average distortion may not exceed D.
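To make the two functionals concrete, the following short numerical sketch (ours, in Python; the function names and the illustrative system Q are not part of the original lectures) evaluates I(Q) and d(Q) for a given source distribution, transition matrix, and distortion matrix.

```python
import numpy as np

def mutual_information(P, Q):
    """I(Q) = sum_{j,k} P_j Q_{k/j} log( Q_{k/j} / Q_k ), in nats."""
    Qk = P @ Q                                   # output marginal Q_k = sum_j P_j Q_{k/j}
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = P[:, None] * Q * np.log(Q / Qk)
    return float(np.nansum(terms))               # 0 log 0 terms contribute nothing

def average_distortion(P, Q, rho):
    """d(Q) = sum_{j,k} P_j Q_{k/j} rho_{jk}."""
    return float(np.sum(P[:, None] * Q * rho))

# Binary source, error-frequency (Hamming) distortion, and a candidate system Q
# that acts like a binary symmetric channel with crossover probability 0.1.
P   = np.array([0.5, 0.5])
rho = np.array([[0.0, 1.0],
                [1.0, 0.0]])
Q   = np.array([[0.9, 0.1],
                [0.1, 0.9]])
print(mutual_information(P, Q) / np.log(2), "bits,", average_distortion(P, Q, rho))
```

This particular Q is D-admissible for any D ≥ 0.1 and supplies about 0.53 bits per letter of information about the source; the minimization over all admissible systems is precisely what R(D) captures.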
Remark: An equivalent definition of the critical curve in the (R,D)-
plane, in which R is the independent and D the dependent variable, is

D(R) = min_{Q: I(Q) ≤ R} d(Q).

This so-called "distortion rate function" concept is more intuitively
satisfying because average distortion gets minimized rather than average mutual
information. Moreover, the constraint I(Q) ≤ R has the physical significance that the
compressed data rate cannot exceed a specified value which in practice would be the
capacity C of the channel in the communication link of Figure 2. It is probably
unfortunate that Shannon chose to employ the R(D) rather than the D(R) approach.
LECTURE 2.

Computation of R(D)

The definition of R(D) that we have given thus far is not imbued with
any physical significance because we have not yet proved the crucial source-coding
theorem and converse for R(D) so defined. We shall temporarily ignore this fact and
concentrate instead on procedures for calculating R(D) curves in practical examples.
Until recently, it was possible to find R(D) only in special examples involving either
small M and N or {P_j} and (ρ_{jk}) with special symmetry properties. A recent
advance by Blahut, however, permits rapid and accurate calculation of R(D) in the
general case [3].
The problem of computing R(D) is one in convex programming. The
mutual information functional I(Q) is convex ∪ in Q (Homework). It must be
minimized subject to the linear equality and inequality constraints

∑_{k=1}^{N} Q_{k/j} = 1,

∑_{j,k} P_j Q_{k/j} ρ_{jk} ≤ D,

and Q_{k/j} ≥ 0. Accordingly, the Kuhn-Tucker theorem can be applied to determine
necessary and sufficient conditions that characterize the optimum transition
probability assignment. Said conditions assume the following form:
Theorem 1. - Fix s ∈ [-∞, 0]. Then there is a line of slope s tangent to the R(D)
curve at the point (d(Q), I(Q)) if and only if there exists a probability vector
{Q_k, 1 ≤ k ≤ N} such that

(i) Q_{k/j} = λ_j Q_k e^{s ρ_{jk}},   where   λ_j = ( ∑_k Q_k e^{s ρ_{jk}} )^{-1},

and

(ii) c_k ≜ ∑_j λ_j P_j e^{s ρ_{jk}}   = 1 for all k such that Q_k > 0,
                                      ≤ 1 for all k such that Q_k = 0.

Proof: Straightforward application of the Kuhn-Tucker theorem.


The theorem above shows that the R(D) curve can be generated
parametrically by choosing different values of s ∈ [-∞, 0] and then determining the
corresponding optimum {Q_k} vectors.

The following theorem describes Blahut's iterative algorithm which
converges to the optimum {Q_k} vector, and hence yields the optimum system of
transition probabilities Q_{k/j} associated with a particular value s of the slope
parameter.

Theorem 2. - Given s ∈ [-∞, 0], choose any probability vector {Q_k^0, 1 ≤ k ≤ N} with
strictly positive components Q_k^0 > 0. Define

Q_k^{r+1} = c_k^r Q_k^r,

where

c_k^r = ∑_j λ_j^r P_j e^{s ρ_{jk}}   and   λ_j^r = ( ∑_k Q_k^r e^{s ρ_{jk}} )^{-1}.

Then {Q_k^r} converges to the {Q_k} that satisfies (i) and (ii) of Theorem 1 as r → ∞.

Proof: Lemma: log x ≤ x - 1 (natural logarithm).

Proof of Lemma: Since log(x) is concave (convex ∩) and both log x and
x - 1 equal zero and have slope 1 at x = 1, the
tangent line x - 1 must lie above the curve log x for all x. This completes the proof of
the lemma. (log x ≤ x - 1 is often called the fundamental inequality.)
Corollary of Lemma: log y ≥ 1 - 1/y. (Proof omitted.)
To prove the theorem we consider the functional V(Q) = I(Q) - s d(Q).
The graphical significance of V(Q) is shown in Figure 3. It is the R-axis intercept of a

Figure 3. Graphical Description of V(Q) (a line of slope s through the point (d(Q), I(Q)) and the R(D) curve)

line of slope s through the point (d(Q), I(Q)) in the (R,D)-plane. For any Q the
point (d(Q), I(Q)) necessarily lies on or above the R(D) curve by definition of the
latter. Since R(D) is convex ∪ (Homework), it follows that V(Q) ≥ V_s, the R-axis
intercept of the line of slope s that is tangent to R(D).
The Blahut iterative formula Q_k^{r+1} = c_k^r Q_k^r can be thought of as the
composition of two successive steps; namely, starting from {Q_k^r}, we have

Step 1.   Q_{k/j}^{r+1} = λ_j^r Q_k^r e^{s ρ_{jk}},   λ_j^r = ( ∑_k Q_k^r e^{s ρ_{jk}} )^{-1}

Step 2.   Q_k^{r+1} = ∑_j P_j Q_{k/j}^{r+1} = ∑_j λ_j^r P_j Q_k^r e^{s ρ_{jk}} = c_k^r Q_k^r
We proceed to show that V(Q^r) → V_s as r → ∞, which in turn clearly
implies from Figure 3 that (d(Q^r), I(Q^r)) converges to the point (D_s, R(D_s)) at
which the slope of R(·) is s. Let us contract notation from V(Q^r) to V^r. Then

V^{r+1} = ∑_{j,k} P_j Q_{k/j}^{r+1} log( Q_{k/j}^{r+1} / Q_k^{r+1} ) - s ∑_{j,k} P_j Q_{k/j}^{r+1} ρ_{jk}

        = ∑_{j,k} P_j Q_{k/j}^{r+1} log( λ_j^r Q_k^r e^{s ρ_{jk}} / Q_k^{r+1} ) - ∑_{j,k} P_j Q_{k/j}^{r+1} log( e^{s ρ_{jk}} )

        = ∑_j P_j log λ_j^r + ∑_k Q_k^{r+1} log( Q_k^r / Q_k^{r+1} ).

Hence, with the aid of the lemma (in the form of its corollary) we get

V^r - V^{r+1} = ∑_{j,k} P_j Q_{k/j}^r log( Q_{k/j}^r e^{-s ρ_{jk}} / (λ_j^r Q_k^r) ) + ∑_k Q_k^{r+1} log( Q_k^{r+1} / Q_k^r )

             ≥ ∑_{j,k} P_j Q_{k/j}^r [ 1 - λ_j^r Q_k^r e^{s ρ_{jk}} / Q_{k/j}^r ] + ∑_k Q_k^{r+1} ( 1 - Q_k^r / Q_k^{r+1} )

             = 1 - ∑_k Q_k^r ∑_j λ_j^r P_j e^{s ρ_{jk}} + 1 - ∑_k Q_k^r

             = 1 - ∑_k Q_k^r c_k^r + 1 - 1 = 1 - ∑_k Q_k^{r+1} = 1 - 1 = 0.

This establishes that V^r is monotonic nonincreasing* in r and hence
converges because it is bounded from below by V_s. It only remains to show that the
value to which V^r converges is V_s rather than some higher value. This is argued by
noting that equality must hold in the above argument at the convergence point, which
in turn requires that Q_k^{r+1}/Q_k^r = c_k^r → 1 for all k such that lim_{r→∞} Q_k^r > 0. That is, the first
of the Kuhn-Tucker conditions must be satisfied at the convergence point. But the
second one also must be satisfied there since, if lim_{r→∞} c_k^r > 1 for any k, then
convergence cannot occur because 0 < Q_k^{r+1} = c_k^r Q_k^r. This completes the proof of
the theorem and also completes Lecture 2.

* In fact, V^r is strictly decreasing whenever V^r > V_s. To see this, note that the inequality used to show the
nonincreasing nature of V^r is strict unless Q_k^{r+1} = Q_k^r for all k, i.e. unless c_k^r = 1 for all k. But c_k^r = 1 for all k implies
that {Q_k^r} is optimum and hence that V^r = V_s.
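Readers who wish to experiment with the algorithm may find a compact numerical sketch helpful. The following Python fragment (ours, not Blahut's original program; function and variable names are illustrative) carries out Steps 1 and 2 for a fixed slope parameter s ≤ 0 and then reports the point (d(Q), I(Q)) reached.

```python
import numpy as np

def blahut_point(P, rho, s, iters=500):
    """Return one point (D, R) of the rate distortion curve, with R in bits.

    P    : source probabilities P_j, shape (M,)
    rho  : distortion matrix rho_{jk}, shape (M, N)
    s    : slope parameter, s <= 0 (natural-log units)
    """
    A = np.exp(s * rho)                             # A[j, k] = exp(s rho_jk)
    q = np.full(rho.shape[1], 1.0 / rho.shape[1])   # Q_k^0 > 0
    for _ in range(iters):
        lam = 1.0 / (A @ q)                         # lambda_j = (sum_k Q_k exp(s rho_jk))^-1
        c = A.T @ (lam * P)                         # c_k = sum_j lambda_j P_j exp(s rho_jk)
        q = q * c                                   # Q_k^{r+1} = c_k^r Q_k^r
    lam = 1.0 / (A @ q)
    Qkj = lam[:, None] * q[None, :] * A             # optimum Q_{k/j} = lambda_j Q_k exp(s rho_jk)
    D = float(np.sum(P[:, None] * Qkj * rho))
    R = float(np.sum(P[:, None] * Qkj * np.log(Qkj / q))) / np.log(2)
    return D, R

# Binary equiprobable source with the error-frequency criterion: the point should
# lie on R(D) = 1 - H_b(D), here with D = e^s / (1 + e^s).
P   = np.array([0.5, 0.5])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])
print(blahut_point(P, rho, s=-2.0))                 # approximately (0.119, 0.473)
```

For this example the output agrees with the closed form R(D) = 1 - H_b(D) obtained for the same source in Example 1 of Lecture 4.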
LECTURE 3.

The Lower Bound Theorem and Error Estimates for the Blahut Algorithm

We have just seen how Blahut's iterative algorithm provides a powerful
tool for the computation of R(D) curves. The following rather remarkable theorem
is most interesting in its own right and also provides a means for ascertaining when
the Blahut iterations have progressed to the point that (d(Q^r), I(Q^r)) is within some
specified ε of the R(D) curve.

Theorem 3.

R(D) = max_{s ≤ 0, λ ∈ Λ_s} ( sD + ∑_j P_j log λ_j ),

where

Λ_s = { λ = (λ_1, ..., λ_M) : λ_j ≥ 0 and ∑_j λ_j P_j e^{s ρ_{jk}} ≤ 1 for all k }.

Remark: This theorem gives a representation of R(D) in the form of a maximum of
a different function over a different convex region. Hence, it permits lower bounds to
R(D) to be generated easily, just as the original definition permitted generation of
upper bounds in the form of I(Q) for any D-admissible Q.
Proof: Choose any s ≤ 0, any λ ∈ Λ_s, and any Q such that d(Q) ≤ D. Then

I(Q) - sD - ∑_j P_j log λ_j
    ≥(1) I(Q) - s d(Q) - ∑_j P_j log λ_j
    = ∑_{j,k} P_j Q_{k/j} log( Q_{k/j} / (λ_j Q_k e^{s ρ_{jk}}) )
    ≥(2) ∑_{j,k} P_j Q_{k/j} ( 1 - λ_j Q_k e^{s ρ_{jk}} / Q_{k/j} )
    = 1 - ∑_k Q_k ∑_j λ_j P_j e^{s ρ_{jk}}
    ≥(3) 1 - ∑_k Q_k = 1 - 1 = 0.

[ (1) results from the conditions s ≤ 0 and d(Q) ≤ D;
  (2) is the fundamental inequality;
  (3) results from the fact that λ ∈ Λ_s. ]

We have just shown that (d(Q) ≤ D) ⇒ (I(Q) ≥ sD + ∑_j P_j log λ_j) for any s ≤ 0
and any λ ∈ Λ_s. It follows that (d(Q) ≤ D) ⇒ (I(Q) ≥ max_{s ≤ 0, λ ∈ Λ_s} [ sD + ∑_j P_j log λ_j ]).

Accordingly,

R(D) = min_{Q: d(Q) ≤ D} I(Q) ≥ max_{s ≤ 0, λ ∈ Λ_s} ( sD + ∑_j P_j log λ_j ),

which establishes the advertised lower bound. That the reverse inequality also holds,
and hence that the theorem is true, is established by recalling that the Q that solves
the original R(D) problem must be of the form

Q_{k/j} = λ_j Q_k e^{s ρ_{jk}},

so that for this Q we have

R(D) = I(Q) = ∑_{j,k} P_j Q_{k/j} log( Q_{k/j} / Q_k ) = ∑_{j,k} P_j Q_{k/j} log( λ_j e^{s ρ_{jk}} )

     = s ∑_{j,k} P_j Q_{k/j} ρ_{jk} + ∑_j P_j log λ_j

     = s d(Q) + ∑_j P_j log λ_j = sD + ∑_j P_j log λ_j,

since d(Q) = D for the Q that solves the R(D) problem. Also, we know that c_k ≤ 1
for this Q, so λ ∈ Λ_s. Thus R(D) is of the form sD + ∑_j P_j log λ_j for some s ≤ 0
and some λ ∈ Λ_s. Therefore

R(D) ≤ max_{s ≤ 0, λ ∈ Λ_s} ( sD + ∑_j P_j log λ_j ),

and the theorem is proved.


Let us return to step (r + 1) of the Blahut algorithm, in which we
defined

Q_{k/j}^{r+1} = λ_j^r Q_k^r e^{s ρ_{jk}}   and   Q_k^{r+1} = c_k^r Q_k^r.

Let d(Q^{r+1}) = D. It then follows that

R(D) ≤ I(Q^{r+1}) = ∑_{j,k} P_j Q_{k/j}^{r+1} log( Q_{k/j}^{r+1} / Q_k^{r+1} )

     = ∑_{j,k} P_j Q_{k/j}^{r+1} log( λ_j^r e^{s ρ_{jk}} / c_k^r )

     = s ∑_{j,k} P_j Q_{k/j}^{r+1} ρ_{jk} + ∑_j P_j log λ_j^r - ∑_k Q_k^r c_k^r log c_k^r

     = s d(Q^{r+1}) + ∑_j P_j log λ_j^r - ∑_k Q_k^r c_k^r log c_k^r  ≜  R_U^r(D),

which is an upper bound to R(D) at the value of distortion associated with iteration
r + 1. We can obtain a lower bound to R(D) at the same value of D with the help of
Theorem 3 by defining

c_max^r = max_k c_k^r = max_k ∑_j λ_j^r P_j e^{s ρ_{jk}}

and then setting

λ'_j = λ_j^r / c_max^r.

It follows that

∑_j λ'_j P_j e^{s ρ_{jk}} = c_k^r / c_max^r ≤ 1

for all k. Thus λ' ∈ Λ_s, so Theorem 3 gives

R(D) ≥ sD + ∑_j P_j log λ'_j,   or

R(D) ≥ sD + ∑_j P_j log λ_j^r - log c_max^r  ≜  R_L^r(D),

which is the desired lower bound. Comparing the two bounds, we see that I(Q^{r+1})
differs from R(d(Q^{r+1})) = R(D) by less than

R_U^r(D) - R_L^r(D) = log c_max^r - ∑_k Q_k^r c_k^r log c_k^r.

Since 1 ≥ lim_{r→∞} c_k^r, with equality whenever lim_{r→∞} Q_k^r > 0, we see that
lim_{r→∞} [ R_U^r(D) - R_L^r(D) ] = 0. The upper and lower bounds therefore tend to coincidence in the limit of
large iteration number. If we need an estimate of R(D) with an error of ε or less, we
need merely iterate on r until log c_max^r - ∑_k Q_k^r c_k^r log c_k^r ≤ ε. Note that both the
iterations and the test to see if we are within ε of R(D) can be carried out without
having to calculate either I(Q^r) or d(Q^r) at step r, let alone their gradients. This is the
reason why the iterations proceed so rapidly in practice.
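In code, the stopping test costs almost nothing beyond the quantities each iteration already produces. A minimal sketch (ours; the function name and calling convention are illustrative and assume the per-iteration vectors Q^r and c^r are available, e.g. from the fragment given at the end of Lecture 2):

```python
import numpy as np

def bound_gap_bits(q, c):
    """Guaranteed gap R_U^r(D) - R_L^r(D) = log c_max - sum_k Q_k^r c_k^r log c_k^r, in bits.

    q : output probability vector Q_k^r before the multiplicative update
    c : multipliers c_k^r = sum_j lambda_j^r P_j exp(s rho_jk)
    Iterating until this gap falls below epsilon guarantees that I(Q^{r+1})
    is within epsilon of R(D) at D = d(Q^{r+1}).
    """
    return float(np.log(np.max(c)) - np.sum(q * c * np.log(c))) / np.log(2)
```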
LECTURE 4.

Extension to Sources and Distortion Measures with Memory

We shall now indicate how to extend the theory developed above to
cases in which the source and/or the distortion measure have memory. Let P_n(x)
denote the probability distribution governing the random source vector X = (X_1, ...,
X_n), and let ρ_n(x, y) denote the distortion incurred when the source vector x is
reproduced as the vector y. Let Q_n(y|x) denote a hypothetical transition probability
assignment, or "system", for transmission of successive blocks of length n. Then
define

d(Q_n) = ∑_x P_n(x) ∑_y Q_n(y|x) ρ_n(x, y),

I(Q_n) = n^{-1} ∑_x P_n(x) ∑_y Q_n(y|x) log( Q_n(y|x) / Q_n(y) ),

where

Q_n(y) = ∑_x P_n(x) Q_n(y|x),   and finally

R_n(D) = min_{Q_n: d(Q_n) ≤ D} I(Q_n).

It is clear that R_n(D) is the rate distortion function for a source that
produces successive n-vectors independently according to P_n(x) when the distortion
in the reproduction of a sequence of such n-vectors is measured by the arithmetic
average of ρ_n over the successive vectors that comprise the sequence. Although the
actual source, if stationary, will produce successive n-vectors that are identically
distributed according to P_n(x), these vectors will not be independent. Hence, R_n(D)
will be an overestimate of the rate needed to achieve average distortion D because it
does not reflect the fact that the statistical dependence between successive source
letters can be exploited to further reduce the required information rate. However,
one expects that this dependence will be useful only near the beginning and end of
each block and hence may be ignored for large n. This tempted Shannon [2] to
define
R(D) = lim inf_{n→∞} R_n(D).

Source coding theorems to the effect that the above prescription for
calculating R(D) does indeed describe the absolute tradeoff between rate and
distortion have been proven under increasingly general conditions by a variety of
authors [2,4-9].
A sufficient but by no means necessary set of conditions is the
following:
(i) The source is strictly stationary.
(ii) The source is ergodic.
(iii) There exist g < ∞ and ρ_g: A^g × B^g → [0, ∞] such that

ρ_n(x, y) = (n - g + 1)^{-1} ∑_{t=1}^{n-g+1} ρ_g(x_t, x_{t+1}, ..., x_{t+g-1}, y_t, ..., y_{t+g-1}).

(This is called a distortion measure of span g and can be used to reflect context-
dependencies when assigning distortions).

(iv) There exists a reference word ŷ ∈ B^g such that E_X ρ_g(X, ŷ) < ∞,

where E_X denotes expectation over a vector X = (X_1, ..., X_g) of g successive random
source outputs.
Comments. Gray and Davisson [9] have recently claimed to have been able to remove the
need for condition (ii). Condition (iv) is automatically satisfied for finite-alphabet
sources; it is needed only when the theory is extended to the important case of
sources with continuous, possibly unbounded, source alphabets.
Under conditions (i) and (iii) the R_n(D) curves will be monotonic
nonincreasing for n ≥ g. They have the physical significance that no system that
operates independently on successive n-vectors from the source can perform below
R_n(D). However, the corresponding positive source coding theorem, that systems can
be found that operate arbitrarily close to R_n(D), is true only in the limit n → ∞.
We now give some explicit examples of rate distortion functions.

Example 1. Binary Memoryless Source and Error Frequency Criterion [Shannon,
1948, 1959; Goblick, 1962]

A = B = {0, 1},   ρ_{jk} = 1 - δ_{jk} = { 0 if j = k, 1 if j ≠ k }.

Assume that zeros are produced with probability 1 - p ≥ 1/2 and ones
with probability p ≤ 1/2.
A simple computation shows that

R(D) = H_b(p) - H_b(D)   for 0 ≤ D ≤ D_max = p,
R(D) = 0                 for D ≥ p,

where H_b(·) is the binary entropy function,

H_b(p) = - p log p - (1 - p) log(1 - p).

In the equiprobable case p = 1/2, we get

R(D) = 1 + D log_2 D + (1 - D) log_2(1 - D) bits/letter.

This formula equals the Gilbert bound, which is a lower bound on the
exponential rate of growth of the maximum number of vectors that can be packed
into binary n-space such that no two of them disagree in fewer than nD places. Rate
distortion theory yields a different interpretation of this curve in terms of covering
theory. Namely, R(D) = 1 + D log_2 D + (1 - D) log_2(1 - D) is the exponential rate of
growth of the minimum number of vectors such that every point in binary n-space
differs from at least one of these vectors in fewer than nD places. Note that in the
covering problem the result is exact, whereas in the corresponding packing problem
the tightest upper bounds known (Elias, Levenshtein) are strictly greater than the
Gilbert bound*. In this sense at least, covering problems are simpler than packing
problems.
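The closed form is trivial to evaluate numerically; a two-function sketch (ours) is given below for readers who wish to plot the curve or compare it with the output of the Blahut iteration of Lecture 2.

```python
import numpy as np

def Hb(p):
    """Binary entropy function in bits, with Hb(0) = Hb(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def R_binary(D, p=0.5):
    """R(D) = Hb(p) - Hb(D) for 0 <= D <= Dmax = p, and 0 beyond Dmax."""
    return float(max(Hb(p) - Hb(D), 0.0))

print(R_binary(0.11))          # about 0.50 bit/letter in the equiprobable case
print(R_binary(0.05, p=0.3))   # about 0.59 bit/letter
```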
Example 2. Gaussian Source and Mean Squared Error (MSE) [Shannon, 1948
and 1959]

* It is widely believed that the Gilbert bound is tight in the packing problem, too, but no proof has been found
despite more than 20 years of attempts.
A = B = ℝ, the real line;   ρ(x, y) = (x - y)^2.

The source letters are governed by the Gaussian probability density

p(x) = (2πσ^2)^{-1/2} exp( -(x - μ)^2 / 2σ^2 ),   x ∈ ℝ.

The R(D) curve given by Shannon [1,2] is

R(D) = (1/2) log( σ^2 / D ),   0 ≤ D ≤ D_max = σ^2.

Example 3. Stationary Gaussian Random Sequence and MSE [Kolmogorov, 1956]

A = B = ℝ,
ρ(x, y) = (x - y)^2,

E X_i = 0,   φ_k = E( X_i X_{i+k} ).

The answer is most easily expressed in terms of the discrete-time
spectral density

Φ(ω) = ∑_{k=-∞}^{∞} φ_k e^{-ikω},   i^2 = -1.

The R and D coordinates are then given parametrically in terms of
θ ∈ [0, sup_ω Φ(ω)] as follows:

D_θ = (1/2π) ∫_{-π}^{π} min[ θ, Φ(ω) ] dω

R(D_θ) = (1/4π) ∫_{-π}^{π} max[ 0, log( Φ(ω)/θ ) ] dω

The cross-hatched region in the accompanying sketch is the area under
the so-called "error spectrum", min[θ, Φ(ω)], associated with the parameter θ.
The optimum system for rate R(D_θ) will yield a reproduction sequence {Y_i} which
is Gaussian and stationarily related to {X_i}, with the error sequence {X_i - Y_i} having
time-discrete spectral density min[θ, Φ(ω)], |ω| ≤ π.
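The parametric formulas are straightforward to evaluate numerically for any given spectral density. The sketch below (ours) does so by simple numerical integration for the illustrative choice Φ(ω) = 1 + cos ω (i.e., φ_0 = 1, φ_{±1} = 1/2), sweeping the "water level" θ.

```python
import numpy as np

def rd_point(Phi, theta, grid=20000):
    """Return (D_theta, R(D_theta)) in bits for a spectral density Phi(w), |w| <= pi."""
    w = np.linspace(-np.pi, np.pi, grid, endpoint=False)
    S = Phi(w)
    D = np.mean(np.minimum(theta, S))                            # (1/2pi) * integral
    with np.errstate(divide='ignore'):
        R = 0.5 * np.mean(np.maximum(0.0, np.log2(S / theta)))   # (1/4pi) * integral
    return float(D), float(R)

Phi = lambda w: 1.0 + np.cos(w)          # a simple nonnegative example spectrum
for theta in (0.05, 0.2, 0.5, 1.0):
    print(theta, rd_point(Phi, theta))
```

As θ rises toward sup_ω Φ(ω), the distortion approaches its maximum value φ_0 (the source variance) and the rate falls to zero, exactly as the formulas require.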

Example 4. Stationary Gaussian Random Process and MSE [Kolmogorov, 1956]

Let
x = {x(t), 0 ≤ t ≤ T},   y = {y(t), 0 ≤ t ≤ T},
and

ρ_T(x, y) = (1/T) ∫_0^T [ x(t) - y(t) ]^2 dt.

The formulas for D_θ and R(D_θ) from the time-discrete case of Example
3 continue to apply except that the limits on the integrals become ±∞ instead of ±π
and the spectral density Φ(ω) is defined by

Φ(ω) = ∫ φ(τ) e^{-iωτ} dτ,   where   φ(τ) = E( X(t) X(t + τ) ).

In the special case of an ideal bandlimited process

Φ(ω) = Φ_0 for |ω| ≤ 2πW,   Φ(ω) = 0 for |ω| > 2πW,

the results specialize to D_θ = 2Wθ and R(D_θ) = W log( Φ_0 / θ ). Eliminating θ yields the
explicit result

R(D) = W log( 2WΦ_0 / D ).

Now the instantaneous signal power is

S = (1/2π) ∫_{-∞}^{∞} Φ(ω) dω = (1/2π) ∫_{-2πW}^{2πW} Φ_0 dω = 2WΦ_0,

and since the mean squared error D in the reconstruction of the signal can be considered
as an effective additive noise power N, the above result often is written as

R(D) = W log( S / N ),

which is the form originally given by Shannon [1].


LECTURE 5.

Algebraic Source Encoding

We shall now prove the source coding theorem for the special case of
the binary equiprobable memoryless source and error frequency fidelity criterion
(Example 1 with p = 1/2). Moreover, the proof will show that a sequence of linear
codes of increasing blocklength n can be found whose (D,R) performances converge
to any specified point on the rate distortion function R(D) = 1 - H_b(D) bits/letter.
The fact that the codes in question are linear is very important from the practical
standpoint because the associated structure considerably simplifies the encoding
procedure.
Let x = (x_1, ..., x_n) ∈ {0,1}^n and y = (y_1, ..., y_n) ∈ {0,1}^n
denote a typical source vector and reproduction vector, respectively. Define

ρ_n(x, y) = n^{-1} d_H(x, y),

where d_H is the Hamming distance function.


Given any set y_1, y_2, ..., y_j of not necessarily linearly independent re-
production vectors, let

B_j = { ∑_{i=1}^{j} a_i y_i : a_i ∈ {0,1} }

denote the [n, j] linear code with generators y_1, ..., y_j (if the generators are not
linearly independent, then not all 2^j of the y ∈ B_j are distinct). For completeness,
let B_0 = {0}. Define

F_j = { x : min_{y ∈ B_j} d_H(x, y) ≤ ℓ },

and let N_j = ||F_j|| denote the cardinality of F_j. It should be clear that N_j is the
number of source vectors x that can be encoded by at least one y ∈ B_j with error
frequency ℓ/n or less. Accordingly, the probability Ω_j that a random vector X
produced by the source cannot be encoded by B_j with average distortion ℓ/n or
less is

Ω_j = 1 - 2^{-n} N_j,

because all 2^n possible values of X are equally likely.


Now consider the addition of another generator y_{j+1}, resulting in a
code B_{j+1}. Then

F_{j+1} = F_j ∪ F_j*,

where

F_j* = ∪_{v ∈ B_j} S_ℓ( v + y_{j+1} ) = { v + y_{j+1} : v ∈ F_j },

S_ℓ(w) denoting the Hamming sphere of radius ℓ centered at w.

The set F_j* clearly is geometrically isomorphic to F_j and therefore contains N_j
elements, too. It follows that

N_{j+1} = 2 N_j - ||F_j ∩ F_j*||.

A good choice of y_{j+1} is one that makes ||F_j ∩ F_j*|| small. We proceed to derive an
upper bound that we are assured will exceed ||F_j ∩ F_j*|| for some choice of y_{j+1}.
This is done by averaging ||F_j ∩ F_j*|| over all 2^n possible choices of y_{j+1} and then
concluding that there must be at least one y_{j+1} such that ||F_j ∩ F_j*|| does not ex-
ceed this average. The average is calculated by observing that, for fixed v ∈ F_j, the
vector v + y_{j+1} ∈ F_j* also belongs to F_j iff y_{j+1} = v + u for some u ∈ F_j. Hence, there
are exactly N_j choices of y_{j+1} such that v + y_{j+1} ∈ F_j ∩ F_j*. Now letting v vary over
the N_j points of F_j shows that there are a total of N_j · N_j = N_j^2 pairs (v, y_{j+1}), v ∈ F_j,
such that v + y_{j+1} ∈ F_j ∩ F_j*. It follows that the average value of ||F_j ∩ F_j*|| is
2^{-n} N_j^2. Therefore, there exists at least one way to choose y_{j+1} such that

N_{j+1} ≥ 2 N_j - 2^{-n} N_j^2.

This implies that

Ω_{j+1} = 1 - 2^{-n} N_{j+1} ≤ 1 - 2^{-n} ( 2 N_j - 2^{-n} N_j^2 ) = ( 1 - 2^{-n} N_j )^2 = Ω_j^2.

It follows by recursion that

Ω_k ≤ Ω_0^{2^k}.

Since N_0 ≥ C(n, ℓ), the binomial coefficient, we have Ω_0 ≤ 1 - 2^{-n} C(n, ℓ) and hence

Ω_k ≤ [ 1 - 2^{-n} C(n, ℓ) ]^{2^k} ≤ exp[ - 2^k 2^{-n} C(n, ℓ) ],

where we have used the fundamental inequality in the last step. Now let n → ∞ and
k → ∞ in such a way that the code rate k/n → R, and let ℓ → ∞ such that ℓ/n → D.
It then follows that the probability Ω_k that the source produces a word that cannot
be encoded with distortion D or less will tend to zero provided

lim 2^k 2^{-n} C(n, ℓ) = ∞   (n → ∞, k/n → R, ℓ/n → D).

Since by Stirling's approximation we know that

C(n, nD) = 2^{n H_b(D) ± O(log n)},

we conclude that Ω_k → 0 provided nR - n + n H_b(D) → ∞, i.e., provided

R > 1 - H_b(D) = R(D).

Hence, we have established the following theorem.



Theorem 4. [Goblick, 1962]

There exist D-admissible linear codes of rate R for the binary
equiprobable memoryless source and error frequency criterion for any
R > R(D) = 1 - H_b(D) bits/letter.
The encoding, or compression, of binary data by means of a linear code
proceeds in exact analogy to the procedure for decoding the word received at the
channel output in the channel coding application. Specifically, when the source
produces x one computes the syndrome

s = H x^T

and then searches for a minimum weight solution z of the equation

s = H z^T.

Such a z is called the leader of the coset C(s) = { v : H v^T = s }. Once z has been
found, which is the hardest part of the procedure, one then encodes (approximates)
x by y = x + z. Said y is a code word since H y^T = H x^T + H z^T = s + s = 0. Hence,
y is expressible as a linear combination (mod 2) of the k generators. This means that
a string of k binary digits suffices for specification of y. In this way the n source
digits are compressed to only k digits that need to be transmitted. The resulting
error frequency is n^{-1} wt(x + y) = n^{-1} wt(z). Accordingly, the expected error
frequency D is the average weight of the coset leaders,

D = n^{-1} 2^{-(n-k)} ∑_{i=1}^{2^{n-k}} w_i,

where w_i is the weight of the leader z_i of the i-th coset, 1 ≤ i ≤ 2^{n-k}. It
follows that for the source encoding application the crucial parameter of an
algebraic code is the average weight of its coset leaders rather than the minimum
distance between any pair of codewords.
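As a small concrete illustration (ours, not worked in the original lectures), take the [7,4] Hamming code as the source code. Every coset leader has weight 0 or 1, so the average leader weight is 7/8 and the expected error frequency is D = (1/7)(7/8) = 1/8, obtained at rate k/n = 4/7 ≈ 0.571 bits/letter, to be compared with R(1/8) = 1 - H_b(1/8) ≈ 0.456 bits/letter.

```python
import numpy as np

# Parity-check matrix of the [7,4] Hamming code; column i (0-based) is the binary
# representation of i+1, so a single error in position i yields syndrome value i+1.
H = np.array([[0, 0, 0, 1, 1, 1, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1, 0, 1]], dtype=int)

def source_encode(x):
    """Approximate the 7-bit source word x by the nearest codeword y = x + z (mod 2)."""
    s = H @ x % 2                              # syndrome s = H x^T
    val = 4 * s[0] + 2 * s[1] + s[2]           # syndrome read as an integer 0..7
    z = np.zeros(7, dtype=int)
    if val:
        z[val - 1] = 1                         # minimum-weight coset leader
    return (x + z) % 2, int(z.sum())           # codeword and the number of bit errors

rng = np.random.default_rng(0)
errors = sum(source_encode(rng.integers(0, 2, 7))[1] for _ in range(20000))
print(errors / (7 * 20000))                    # close to the expected distortion 1/8
```

The 16 codewords can then be indexed by 4 information bits, which is the compressed representation actually transmitted.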
LECTURE 6

Tree Codes for Sources

There are two major difficulties with algebraic source encoding. The
first one is that most sources do not naturally produce outputs in GF(q) for some
prime power q. Of course, the source outputs can be losslessly encoded into a string
of elements from GF(q), but then it is unlikely that minimization of Hamming or
Lee distance will correspond to maximization of fidelity. The second difficulty is
that, even if A = GF(q) and the Hamming or Lee metric is appropriate, the source
produces output vectors that are distributed over A^n in a manner that does not
concentrate probability in the immediate vicinity of the code words. This differs
markedly from the situation that prevails in channel decoding, where the probability
distribution on the space of channel output vectors is concentrated in peaks
centered at the code words (provided the code rate is not extremely close to the
channel capacity). As a result it suffices

Figure 4. Sketch of Probability Distribution Over the Space of Channel Output Words
(probability peaks centered at the code words)

in most instances to limit decoding to the nonoverlapping spheres of
radius t < d/2 centered about the codewords. The majority of algebraic decoding
procedures that have been devised to date simply abort whenever the received word
does not lie inside one of the spheres of radius t centered at the code words. Such
decoding procedures are essentially useless for source encoding because the source
word density is just as great between these spheres as it is inside them. In fact, if one
insists that the limiting rate of a sequence of binary codes be bounded away from 0
and 1 as n → ∞, then a vanishingly small fraction of the space will lie within the
nonoverlapping t-spheres.
It follows from the above discussion that a good algebraic source
encoding algorithm must be complete in the sense that it finds the closest (or at least
a close) code word for every possible received word. At present complete decoding
algorithms are known only for the Hamming codes [Hamming, 1950], t = 2 BCH
codes [Berlekamp, 1968], certain t = 3 BCH codes [Berger and Van der Horst, 1973],
and first-order Reed-Muller codes [Posner, 1968]. Unfortunately, the asymptotic
rate is 1 for these Hamming and BCH codes, and 0 for these Reed-Muller codes, so no
long, good, decodable algebraic source codes of nontrivial rate are known at present.
Another way to introduce structure into a source code that overcomes
most of the shortcomings of the algebraic approach is to employ tree codes. The
concept of tree encoding of sources is most readily introduced by means of an
example. The tree code depicted in Figure 5 provides a means for encoding the
binary source of Example 1 of Lecture 4.

Figure 5. A Binary Tree Code (rate 1/2, depth 3: each three-digit path map indexes a
length-6 codeword; the eight codewords, for path maps 000 through 111, are 000000,
000001, 000110, 000111, 111000, 111001, 111110 and 111111)

The code consists of the 2^3 = 8 different words of length n = 6 that
can be formed by concatenating the three pairs of binary digits encountered along a
path from the root to a leaf. The code word corresponding to each such path is
indicated in Figure 5 directly to the right of the leaf at which the path terminates.
Since there are 2^6 = 64 different words of length 6 that might be produced by the
source, there usually will not be a code word that matches the source word exactly.
Accordingly, we approximate the source word x by whatever y in the code
minimizes ρ_6(x, y). For example, if x = 010110, then y = 000110 is closest in the
sense of the Hamming metric. When traversing the path from the root to the leaf
y = 000110, we branch first up, then down, and then up at the successive nodes
encountered. Hence, this path is specified by the binary path map sequence 010,
where 0 represents "up" and 1 represents "down". In this way we have compressed
a source sequence of length 6 into a path map of length 3. Despite the fact that the
data rate has been compressed by a factor of two, most of the original source digits
can be recovered correctly. The reader may wish to verify that in the equiprobable
case p = 1/2, the average fraction of the digits that are reproduced incorrectly is
3/16 = 0.1875. The best that could be done in this instance by any code of rate
R = 1/2 of any blocklength is D(R = 1/2) = 0.110.
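A brute-force encoder for this toy code is only a few lines; the sketch below (ours) searches all eight paths and reproduces the x = 010110 example from the text. The codebook is our reading of the partially garbled Figure 5.

```python
# Path maps and codewords of the Figure 5 tree code (our reconstruction of the figure).
codebook = {
    "000": "000000", "001": "000001", "010": "000110", "011": "000111",
    "100": "111000", "101": "111001", "110": "111110", "111": "111111",
}

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def tree_encode(x):
    """Exhaustive search: return the path map of the closest codeword and the distortion."""
    path = min(codebook, key=lambda p: hamming(x, codebook[p]))
    return path, hamming(x, codebook[path]) / 6

print(tree_encode("010110"))   # ('010', 0.1666...): one of the six digits is reproduced wrongly
```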
It should be clear that tree codes can be designed with any desired rate.
If there are b branches per node and ℓ letters per branch, then the rate is

R = ℓ^{-1} log b.

Also, the letters on the branches can be chosen from any reproducing alphabet
whatsoever, including a continuous alphabet, so tree coding is not restricted to
GF(q).
Tree encoding of sources was first proposed by Goblick (1962). Jelinek
(1969) proved that tree codes can be found that perform arbitrarily close to the
R(D) curve for any memoryless source and fidelity criterion; this result has been
extended by Berger (1971) to a limited class of sources with memory. In order for
tree coding to become practical, however, an efficient search algorithm must be
devised for finding a good path through the tree. (Note that, unlike in sequential
decoding for channels, one need not necessarily find the best path through the tree;
a good path will suffice). This problem has been attacked with some degree of
success by Anderson and Jelinek (1971, 1973), by Gallager (1973), by Viterbi and
Omura (1973), and by Berger, Dick and Jelinek (1973).
In conclusion, it should be mentioned that delta modulation and more
general differential PCM schemes are special examples of tree encoding procedures.
However, even the adaptive versions of such schemes are necessarily sub-optimal
because they make their branching decisions instantaneously on the basis solely of
past and present source outputs. Performance could be improved by introducing
some coding delay in order to take future source outputs into consideration, too.
LECTURE 7

Theory vs Practice and Some Open Problems

In the final lecture of this overview of rate distortion theory, I shall
begin by attempting to convey some feeling for the current status of the comparison
of theory and practice in data compression. I shall couch the discussion in terms of
the i.i.d. Gaussian source and MSE criterion of Example 2 of Lecture 4 because this
example has been treated the most thoroughly in the literature. However, the
overall flavor of my comments applies to more general sources and distortion
measures, too.
Recall that the rate distortion function for the situation in question is

R(D) = (1/2) log( σ^2 / D ).

When this is plotted with a logarithmic D-axis, it appears as a straight line with
negative slope as sketched in Figure 6. Parallel to R(D)

Figure 6. Performances for the Gauss-MSE Problem (rate in bits versus log D: the R(D)
line, uncoded Lloyd-Max quantizers, entropy-coded Lloyd-Max quantizers, the best tree
code, and the best trellis code)

but approximately 1/4 bit higher lies the performance curve of the best entropy-
coded quantizers. The small separation indicates that there is very little to be gained
by more sophisticated encoding techniques. This is not especially surprising given the
memoryless nature of both the source and the fidelity criterion. It must be
emphasized, however, that entropy coding necessitates the use of a buffer to
implement the conversion to variable-length codewords. Moreover, since the
optimum quantizer has nearly uniformly spaced levels, some of these levels become
many times more probable than others, which leads to difficult buffering problems.
Furthermore, when the buffer overflows it is usually because of an inordinately high
local density of large-magnitude source outputs. This means that the per-letter MSE
incurred when buffer overflows occur tends to be even bigger than σ^2.

As a result, the performance of coded quantizers with buffer overflow
properly taken into account may be considerably poorer than that shown in Figure
6, especially at high rates.
The buffering problem can be circumvented by not applying entropy
coding to the quantizer outputs. However, the performance curves of uncoded
quantizers diverge from R(D) at high R, as indicated by the locus of uncoded
Lloyd-Max quantizers in Figure 6.
Another scheme studied by Berger, Jelinek and Wolf (1972) uses the
permutation codes of Slepian (1965) in reverse for compression of source data. The
performance curve of optimum permutation codes of fixed blocklength and varying
rate essentially coincides with the optimum coded quantizer curve for low rates but
diverges from it at a rate that increases with blocklength. Permutation codes offer
the advantage of synchronous operation, but they are characterized by long coding
delays and the need to partially order the components of the source vector. Berger
(1972) has shown that optimum quantizers and optimum permutation codes
perform identically in the respective limits of perfect entropy coding and of infinite
blocklength. Both are strictly bounded away from R(D).
An extensive study of tree coding techniques for the Gauss-MSE
problem was undertaken by Berger, Dick and Jelinek (1973). They studied source
encoding adaptations of both the Jelinek-Zigangirov stack algorithm for sequential
decoding of channel tree codes and the Viterbi algorithm for trellis codes. The best
performances obtained, determined by extensive simulation, were strictly better
than that of the best coded quantizer, as indicated by the points marked in Figure 6.
The best stack algorithm run had D = 1.243 D(R) and the Viterbi algorithm achieved
D = 1.308 D(R), whereas coded quantizers achieve at best D = 1.415 D(R).
However, it was necessary to search 512 nodes per datum in the Viterbi runs and an
average of 727 nodes per datum in the best of the stack runs.
Hence, real time tree encoding of practical analog sources at high rates is not instru-
mentable at present.
We close with a discussion of several open problems and research areas.

A. Algebraic Source Encoding

1. Derive bounds on the average weights of the coset leaders of families of linear
codes.
2. Are long BCH, Justesen, and/or Goppa source codes good?
3. Find complete decoding algorithms for families of codes with nondegenerate
asymptotic rates.
4. Prove that the ensemble performance of codes with generators chosen in-
dependently at random approaches R(D) as n → ∞.
5. Extend the theory satisfactorily to nonequiprobable sources.

B. Tree Coding
1. Show that there are good convolutional tree codes for sources. (For contin-
uous amplitude sources the tapped shift register is replaced by a feed-forward
digital filter.)
2. Find better algorithms for finding satisfactory paths through the tree (or
trellis).

C. Information-Singularity [Berger (1973)]

Characterize the class of all sources whose MSE rate distortion functions vanish
for all D > 0. This class is surprisingly broad, and its study promises to provide insight
into the fundamental structure of information production mechanisms.

D. Biochemical Data Compression

The equations in chemical thermodynamics that describe the approach to
multiphase chemical equilibrium are mathematically analogous to those which
must be solved in order to calculate a point on an R(D) curve. It has been
postulated [Berger (1971)] that this is not purely coincidental; indeed, the
interaction of a living system with its environment can be modeled by multiphase
chemical equilibrium. By "solving" this multiphase chemical equilibrium
problem, the system efficiently extracts those aspects of the environmental data
that it wishes to record accurately and either rejects or only coarsely encodes the
remainder.
The provocative mathematical analogy with rate distortion theory arises as follows.
If n_j molecules of substance j, 1 ≤ j ≤ M, are injected into a system that
possesses N thermodynamically homogeneous phases, then the number n_{jk} of mole-
cules of substance j that reside in phase k at equilibrium is found by minimizing the
Gibbs free energy functional, which has the form

F = ∑_{j,k} n_{jk} [ c_{jk} + log( n_{jk} / n_k ) ],

where n_k = ∑_j n_{jk} and the c_{jk} are so-called "free energy constants" that can be
experimentally measured. The minimization naturally is subject to the mass balance
conditions ∑_k n_{jk} = n_j and the constraints n_{jk} ≥ 0. Letting n = ∑_j n_j and making the
obvious associations

n_j ↔ n P_j,
c_{jk} ↔ -s ρ_{jk},

one finds that F is of the form F = I(Q) - s d(Q), which we know from earlier work to
be the quantity that one must minimize to find the point on R(D) at which the
slope is s.
Investigation of the realm of applicability of rate distortion theory to
environmental encoding by biochemical systems along the lines of the above
discussion of multiphase chemical equilibrium appears to be a very worthwhile area
for future research. It may well turn out that the principal usefulness of rate distortion
theory will prove to be in applications to biology rather than to communication
theory.
REFERENCES

[1] SHANNON, C.E., (1948). "A Mathematical Theory of Communication",


BSTJ, 27, 379-423, 623-656.

[2] SHANNON, C.E., (1959) "Coding Theoremsfora Discrete Source with a


Fidelity Criterion", IRE Nat'l. Conv. Rec., Part 4, 142-163.

[3] BLAHUT, R.E., (1972) "Computation of Channel Capacity and Rate-


distortion Functions", Trans. IEEE, IT-18, 460-473.

[4] PINKSTER, M.S., (1963) "Sources of Messages", Problemy Peredecü


Informatsii. 14, 5-20.

[5] GALLAGER, R.G., (1968) "Information Theory and Reliable


Communication", Wiley, New York.

[6] BERGER, T., (1968) "Rate Distordon Theory for Sources with Abstract
Alphabetsand Memory", Information and Control, 13, 254-273.

[7] GOBLICK, T.J.) Jr. (1969) "A Coding Theorem for Time-Discrete Analog
Data Sources", Trans. IEEE, IT-15, 401-407.

[8] BERGER, T., (1971) "Rate Distordon Theory. A Mathematical Basis for
Data Compression", Prentice-Hall, Englewood Cliffs, N.Y.

[9) GRAY, R.M., and L.D. DAVISSON (1973) "Source Coding Without
Ergodicity", Presented at 1973 IEEE Intern. Symp. on Inform.
Theory, Ashkelon, Israel.

[10] GOBLICK, T.J., Jr. (1962) "Coding for a Discrete Information Source with
a Distortion Measure", Ph.D.Dissertation, Elec. Eng. Dept. M.I.T.
Cam bridge, Mass.

[11] KOLMOGOROV, A.N., (1956) "On the Shannon Theory of Information


38 References

Transmission in the Case of Continuous Signals", Trans. IEEE, IT-2,


102-108.

[12] HAMMING, R.W., (1950) "Error Detecting and Error Correcting Codes",
BSTJ, 29, 147-160.

[13] BERLEKAMP, E.R., (1968) "Algebraic Coding Theory", McGraw-Hill, N.Y.

[14] BERGER, T., and J.A. VAN DER HORST (1973) "BCH Source Codes",
Submitted to IEEE Trans. on Information Theory.

[15] POSNER, E.C., (1968) In Man H.B."Error Correcting Codes", Wiley, N.Y.
Chapter 2.

[ 16] JELINEK, F., (1969) "Tree Encoding of Memoryless Time-Discrete Sources


with a Fidelity Criterion", Trans. IEEE, IT-15, 584-590.

[17] JELINEK, F., and J.B. ANDERSON (1971) "Instrumentable Tree En-
coding and Information Sources", Trans. IEEE, IT-17, 118-119.

[18] ANDERSON, J.B., and F. JELINEK (1973) "A Two-Cycle Algorithm for
Source Coding with a Fidelity Criterion", Trans. IEEE, IT-19, 77-92.

[19] GALLAGER, R.G., (1973) "Tree Encoding for Symmetrie Sources with a
Distordon Measure", Presented at 1973 IEEE Int'l. Symp. on
Information Theory, Ashkelon, Israel.

[20] VITERBI, A.J., and J.K. OMURA (1974) "Trellis Encoding of Memoryless
Discrete-Time Sources with a Fidelity Criterion", Trans. IEEE,
IT-20, 325-332.

[21] BERGER, T., R.J. DICK and F. JELINEK (1974) "Tree Encoding of
Gaussian Sources", Trans. IEEE, IT-20, 332-336.

[22] BERGER, T., F. JELINEK and J.K. WOLF (1972) "Permutation Codes
for Sources", Trans. IEEE, IT-18, 160-169.

[23] SLEPIAN, D., (1965) "Permutation Modulation", Proc. IEEE, 53, 228-236.

[24] BERGER, T., (1972) "Optimum Quantizers and Permutation Codes", Trans.
IEEE, IT-18, 759-765.

[25] BERGER, T., (1973) "Information- Singular Random Processes", Presented


at Third International Symposium on Information Theory, Tallinn,
Estonia, USSR.
UNIVERSAL SOURCE CODING

LEE D. DAVISSON
Department of Electrical Engineering
University of Southern California, Los Angeles
PREFACE

I am grateful to Professor Giuseppe Longo for inviting me to participate

in this workshop and for the opportunity to present the series of lectures
represented by the following notes.

Lee D. Davisson
I. Introduction

A. Summary
The basic purpose of data compression is to massage a data stream to
reduce the average bit rate required for transmission or storage by removing
unwanted redundancy and/or unnecessary precision. A mathematical formulation of
data compression providing figures of merit and bounds on optimal performance was
developed by Shannon [1,2], both for the case where a perfect compressed
reproduction is required and for the case where a certain specified average distortion
is allowable. Unfortunately, however, Shannon's probabilistic approach requires
advance precise knowledge of the statistical description of the process to be
compressed, a demand rarely met in practice. The coding theorems only apply, or
are meaningful, when the source is stationary and ergodic.
We here present a tutorial description of numerous recent approaches
and results generalizing the Shannon approach to unknown statistical environments.
Simple examples and empirical results are given to illustrate the essential ideas.

B. Source Modelling
Sources with unknown statistical descriptions or with incomplete or
inaccurate statistical descriptions can be modelled as a class of sources from which
nature chooses a particular source and reveals only its outputs and not the source
chosen. The size and content of the class are determined by our knowledge or lack
thereof about the possible source probabilities. Typical classes might be the class of
all stationary sources with a given alphabet, the class of all Bernoulli processes
(sequences of independent, identically distributed binary random variables, i.e., coin
flipping with all possible biases), the class of all Gaussian random processes with
bounded covariances and means, etc.
The source chosen by Nature can be modelled as a random variable, i.e.,
a random choice described by a prior probability distribution, or it can be modelled
as a fixed but unknown source, i.e., we have no such prior.
To model such a class, assume that for each value of an index θ ∈ Λ we
have a discrete time source {X_n; n = ..., -1, 0, 1, ...} with some alphabet A of possible
outputs and a statistical description μ_θ; specifically, μ_θ is a probability measure
on a sample space of all possible doubly infinite sequences drawn from A (and an
appropriate σ-field of subsets called events). Thus μ_θ implies the probability
distributions on all random vectors composed of a finite number of samples of the
process; e.g., μ_θ implies for any integer N the distribution μ_θ^N describing the
random vector X^N = (X_1, ..., X_N). As noted above, the source index may (or may
not) itself be considered as a random variable Θ taking values in the parameter space
Λ as described by some prior probability distribution W(θ). The source described
by μ_θ will often be referred to simply as θ. Similarly, the class will often be
abbreviated to Λ.
We shall usually assume that the individual sources θ are stationary,
i.e., the statistical description μ_θ is invariant to shifts of the time origin. We shall
sometimes require also that the individual sources be ergodic, that is, with
μ_θ-probability-one all sample functions produced by the source are representative
of the source in that all sample moments converge to appropriate expectations and
all relative frequencies of events converge to the appropriate probabilities. This
should not be considered as a limitation on the theory since in practice, most real
sources can be considered to have at least local (i.e., over each output block
presented to an encoder) stationary and ergodic properties.
From the ergodic decomposition of a stationary time series [3,4,5],
any stationary nonergodic source can be viewed as a class of ergodic sources with a
prior. Hence, stationary nonergodic sources are included in our model and the
subsources in a class of stationary sources can be considered ergodic without loss of
generality, i.e., a class of stationary sources can be decomposed into a "finer" class
of ergodic sources.
For example, consider a Bernoulli source, i.e., a binary source in which
the probability θ of a "one" is chosen by "nature" randomly according to the
uniform probability law on [0,1] and then fixed for all time. The user is not told θ,
however. Subsequently the source produces independent-letter outputs with the
chosen, but unknown, probability. The probability, given θ, that any message block
of length N contains n ones in any given pattern is then

θ^n (1 - θ)^{N-n}.

Given θ the source is stationary and ergodic. Otherwise the source is stationary only,
and the probability of any given message containing n ones in N outputs is simply

∫_0^1 θ^n (1 - θ)^{N-n} dθ.

The relative frequency of ones for this source converges to Θ, a random variable. It
is seen that the stationary source can be decomposed by working backwards into an
ergodic class of subsources indexed by θ.
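A quick numerical check (ours) of the mixture probability: with the uniform prior the integral above is the Beta integral and equals n!(N-n)!/(N+1)!, i.e. 1/((N+1)·C(N,n)), which manifestly does not factor into a product over letters.

```python
import numpy as np
from math import comb, factorial

N, n = 10, 3
theta = (np.arange(200000) + 0.5) / 200000                  # midpoint grid on [0, 1]
mixture = np.mean(theta**n * (1 - theta)**(N - n))          # integral over the uniform prior
closed = factorial(n) * factorial(N - n) / factorial(N + 1)
print(mixture, closed, 1 / ((N + 1) * comb(N, n)))          # all three (nearly) coincide
```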
The object of data compression is to process the observed source
messages to reduce the bit rate required for transmission. If the compressed
reproduction is required to be perfect (noiseless source coding), then compression
can only be achieved through the removal of unnecessary redundancy. If the
compressed reproduction need not be perfect (as is usually the case with continuous
alphabets), but need only satisfy some average fidelity constraint, then compression
can also be achieved via removal of unnecessary precision (as in quantization).
We shall here consider only block processors, i.e., compressors that
operate on individual consecutive source blocks or vectors X_i^N = (X_{iN}, X_{iN+1},
..., X_{iN+N-1}), i = 0, 1, ..., independently of previous or future blocks. The index i
will be suppressed subsequently. The block coder maps X^N into a reproduction
x(X^N). The block coding restriction eliminates such methods as run-length
encoding, but is made for analytical simplicity and the fact that block encoders
provide a large and useful class.
The dassical Shannon theory [ 1,2] provides f:tgUres of merit and
bounds on optimal performance for block-encoders (hereafter called simply
encoders) x(X N) when the source is stationary and ergodie and the source
probabilities are known precisely. Universal coding is the extension of these ideas to
the more general situation of a dass of sources, wherein encoders must' be designed
without precise knowledge of the actual statistics of the source to be observed, or
equivalently, where the source is nonergodic. The encoder is called "universal" if the
performance of the code designed without knowledge of the unknown "true" source
converges in the Iimit of long blocklength N to the optimal performance possible if
one knew the true source. Different types of universality are defmed to correspond
to different notions of convergence.
In this tutorial presentation of recent approaches and results on
noiseless universal encoding and universal coding subject to a rate or an average
fidelity constraint, we attempt a unified development with a minimum of
uninformative mathematical detail and a maximum of intuition. As the mere
existence of universal codes is surprising, the first cases considered are nearly trivial in
order to make these types of results believable and to emphasize the fundamental
ideas in the simplest possible context.
The mathematical detail, the proofs of the general theorems, and
complete historical references may be found in [4,10].

II. Variable-Rate Noiseless Source Codes


We first consider uniquely decodable block-to-variable-length source
codes. As we here require perfect reproduction, the alphabet A is assumed discrete.
Given a blocklength N, a code C_N is a codebook, or collection of codewords or
vectors of varying length, together with an encoding rule x̂(X^N) that assigns to each
possible source N-tuple a unique codeword. 𝒞(N) is the class of all uniquely
decodable codes C_N. In this situation compression is achieved by minimizing the
average length of the codewords and hence minimizing the average bit rate required
to perfectly communicate the source. Thus the principal property of interest of a
code C_N is the resulting average length of codewords when applying the code to a
particular source. For a given C_N, define the normalized length function ℓ(x^N | C_N)
as N^{-1} times the codeword length in bits of x̂(x^N), the codeword resulting when
C_N is used to encode the source block x^N. Given a particular source described by
μ_θ, define the average normalized length resulting from using the code C_N on this
source by

ℓ_θ(C_N) = E_θ{ ℓ(X^N | C_N) }

where E_θ denotes expectation over μ_θ, i.e., since the alphabet is discrete,

ℓ_θ(C_N) = Σ_{x^N} μ_θ(x^N) ℓ(x^N | C_N)

where μ_θ is taken as the probability mass function of N-tuples.
The optimal code in 𝒞(N) is the one that minimizes the average
normalized length. Define

λ_θ(N) = inf_{C_N ∈ 𝒞(N)} ℓ_θ(C_N)

as the minimal attainable average normalized length using blocklength-N-to-
variable-length codes on θ. Define

λ_θ = inf_N λ_θ(N)

as the smallest attainable average normalized length for any source blocklength.
Shannon's variable-length source coding theorem relates λ_θ to the
entropy of the source when the source θ is known in advance:

Theorem 2.1 (Shannon): Given a source θ,

(2.1)   N^{-1} H(X^N | θ) ≤ λ_θ(N) < N^{-1} H(X^N | θ) + N^{-1}

where

H(X^N | θ) = - Σ_{x^N} μ_θ(x^N) log μ_θ(x^N)

is the Nth-order entropy of the source θ (or the conditional entropy of the class of
sources given θ). All logarithms are to the base 2. If the source is stationary, then

(2.2)   λ_θ = H(X | θ)

where H(X | θ) = lim_{N→∞} N^{-1} H(X^N | θ) is the entropy rate of the source θ.
An optimal code C_N for a given N can be constructed so that

(2.3)   -log μ_θ(x^N) ≤ N ℓ(x^N | C_N) < -log μ_θ(x^N) + 1 .

Note that the construction satisfies (2.1) but requires advance knowledge of the
source θ for which the code is to be constructed.
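For instance, the per-word lengths in (2.3) can be realized with lengths ⌈-log μ_θ(x^N)⌉, which satisfy the Kraft inequality. A minimal sketch of this construction (the Bernoulli source and blocklength below are illustrative assumptions, not taken from the text):

```python
import math
from itertools import product

def shannon_lengths(theta, N):
    """Codeword lengths ceil(-log2 mu_theta(x^N)) for a Bernoulli(theta) source.
    They satisfy the Kraft inequality, so a uniquely decodable (prefix) code
    with exactly these lengths exists, and they meet the bound (2.3)."""
    lengths = {}
    for x in product((0, 1), repeat=N):
        p = theta ** sum(x) * (1.0 - theta) ** (N - sum(x))
        lengths[x] = math.ceil(-math.log2(p))
    return lengths

theta, N = 0.2, 8
lengths = shannon_lengths(theta, N)
prob = lambda x: theta ** sum(x) * (1 - theta) ** (N - sum(x))
kraft = sum(2.0 ** -L for L in lengths.values())
avg = sum(prob(x) * L for x, L in lengths.items()) / N          # average normalized length
HN = -sum(prob(x) * math.log2(prob(x)) for x in lengths) / N    # N^{-1} H(X^N | theta)
print(kraft <= 1.0, HN <= avg < HN + 1.0 / N)                   # both checks print True
```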
A useful performance index of any code C_N used on a source θ is the
difference between the actual resulting normalized average length and the optimal
attainable for that blocklength. We formalize this quantity as the redundancy, or
conditional redundancy given θ, r_θ(C_N), given by

(2.4)   r_θ(C_N) ≜ ℓ_θ(C_N) - λ_θ(N) .

This definition is slightly different from that usually used [6], r*_θ(C_N) =
ℓ_θ(C_N) - N^{-1} H(X^N | θ), which compares the average length of a particular
code with a possibly unachievable (except in the limit) lower bound. We use (2.4)
here because we feel it is a more realistic comparison, results in a closer analogy with
the figures of merit of later sections, and because the asymptotic N → ∞ results are
unaffected through (2.2) for stationary sources. Clearly 0 ≤ r_θ(C_N) ≤ r*_θ(C_N).
We now consider the more general situation of coding for a class of
sources, i.e., we must now choose a code C_N without knowledge of the actual
source θ being observed. A sequence of these codes {C_N}_{N=1}^∞ will be said to be
universal, in accordance with the definition of Section I, if these codes designed
without knowledge of θ asymptotically perform as well as optimal codes custom
designed for the true (but unknown) θ, i.e., if r_θ(C_N) → 0 as N → ∞ in some sense
regardless of θ. The various types of universal codes will correspond to the various
notions of convergence, e.g., convergence in measure, pointwise convergence and
uniform convergence.
Before formalizing these concepts, it is useful to consider in some detail
a specific simple example to hopefully make believable the remarkable fact that such
universal codes exist and to provide a typical construction.
Suppose that the class Λ consists of all Bernoulli processes, i.e., all
independent, identically distributed sequences of binary random variables. The
sources in the class are uniquely specified, as noted in Section (I,B), by
θ = Pr[X_n = 1], i.e.,

(2.5)   μ_θ(x^n) = θ^{w(x^n)} (1-θ)^{n - w(x^n)} ,   x_i ∈ {0,1} ,

where w(x^n) = Σ_{i=1}^n x_i is the Hamming weight of the binary n-tuple x^n.
Choosing a code C_N to minimize r_θ(C_N) for a particular θ will clearly
result in a large redundancy for some other sources. To account for the entire class
we instead proceed as follows: Each source word x^N is encoded into a codeword
consisting of two parts. The first part of the codeword is w(x^N), the number of ones
in x^N. This specification requires at most log(N+1) + 1 bits. The second part of the
codeword gives the location of the w(x^N) ones by indexing the C(N, w(x^N))
possible location patterns given w(x^N), where C(N,w) denotes the binomial
coefficient. Thus this information can be optimally encoded (given w(x^N)) using
equal-length codewords of at most log C(N, w(x^N)) + 1 bits.

If the actual unknown source is θ, the resulting redundancy using this uniquely
decodable code is bounded above as follows:

r_θ(C_N) ≤ N^{-1} ( log(N+1) + 2 + E_θ[ log C(N, w(X^N)) ] ) - λ_θ(N) .

The N^{-1}(log(N+1) + 2) term clearly vanishes as N → ∞. Using Stirling's
approximation and the ergodic theorem,

N^{-1} log C(N, w(x^N)) = N^{-1} log [ N! / ( w(x^N)! (N - w(x^N))! ) ]

≈ - (w(x^N)/N) log (w(x^N)/N) - (1 - w(x^N)/N) log (1 - w(x^N)/N)   for large N

→ - θ log θ - (1-θ) log(1-θ)   with probability one.

The last step also follows from the strong law of large numbers (a special case of the
ergodic theorem, N^{-1} w(X^N) → θ w.p. 1) and the continuity of the logarithm.
From the Shannon theorem, λ_θ(N) → H(X|θ) = -θ log θ - (1-θ) log(1-θ) as N → ∞,
so that r_θ(C_N) → 0 for all θ for the given sequence of codes.
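To make the two-part construction concrete, here is a minimal sketch (the value of θ and the blocklengths are illustrative assumptions) of the codeword length and its per-letter excess over the entropy rate, which by (2.1)-(2.2) tracks λ_θ(N):

```python
import random
from math import ceil, comb, log2

def two_part_length(x):
    """Bits needed for the two-part universal codeword of a binary block x:
    an index for the weight (N+1 possibilities) plus an index for the pattern."""
    N, w = len(x), sum(x)
    return ceil(log2(N + 1)) + ceil(log2(comb(N, w)))

def entropy(theta):
    return -theta * log2(theta) - (1 - theta) * log2(1 - theta)

random.seed(1)
theta = 0.3
for N in (16, 64, 256, 1024):
    # Monte Carlo estimate of the average normalized length of the code
    avg = sum(two_part_length([1 if random.random() < theta else 0 for _ in range(N)])
              for _ in range(200)) / (200 * N)
    print(N, round(avg - entropy(theta), 4))    # excess over H(X|theta) shrinks with N
```

The same codebook works for every θ; only its per-letter cost above the (θ-dependent) entropy rate is being measured.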
We now proceed to the general definitions of universal codes and the
corresponding existence theorems.
Given a probability measure W on Λ (and an appropriate σ-field), a
sequence of codes {C_N}_{N=1}^∞ is said to be weighted-universal (or Bayes-universal)
if r_θ(C_N) converges to 0 in W-measure, i.e., if

(2.6)   lim_{N→∞} ∫_Λ dW(θ) r_θ(C_N) = 0 .

The sequence is maximin-universal if (2.6) holds for all possible W. The measure W
might be a prior probability or a preference weighting. The sequence is said to be
weakly-minimax universal (or weakly-universal) if r_θ(C_N) → 0 pointwise, i.e.,

(2.7)   lim_{N→∞} r_θ(C_N) = 0 ,   all θ ∈ Λ .

The sequence is said to be strongly-minimax universal (or minimax-universal, or
strongly-universal) if r_θ(C_N) → 0 uniformly in θ, i.e.,

(2.8)   lim_{N→∞} r_θ(C_N) = 0 ,   uniformly in θ .

The types of universal codes are analogous to the types of optimal
estimates in statistics.
Uniform convergence is the strongest and practically most useful type
since it is equivalent to the following: Given an ε > 0, there is an N_ε (not a
function of θ) such that if N ≥ N_ε, then

r_θ(C_N) < ε ,   all θ ∈ Λ .

The advantage here is that a single finite-blocklength code has redundancy less than
ε for all θ.
A strongly-minimax universal sequence of codes is obviously also
weakly-minimax. Since r_θ(C_N) ≥ 0, weakly-minimax code sequences are also
weighted-universal for any prior by a standard theorem of integration. Since r_θ(C_N)
≥ 0, convergence in measure implies convergence W-almost everywhere (with
W-probability one). Thus, if {C_N} is a weighted-universal sequence for Λ, then
there is a set Λ_0 such that W(Λ - Λ_0) = 0 and {C_N} is a weakly-minimax universal
sequence for the class of sources Λ_0. Since convergence W-a.e. implies almost-
uniform convergence, given any ε > 0, there is a set Λ_ε such that W(Λ - Λ_ε) ≤ ε
and {C_N} is a strongly-minimax universal sequence for Λ_ε.
Even though the strongly-minimax universal codes are the most
desirable, the weaker types are usually more easily demonstrated and provide a class
of good code sequences which can be searched for the stronger types. The following
theorem is useful in this regard:

Theorem 2.2: Given a discrete alphabet A, weighted-universal codes
exist for the class of all finite-entropy stationary ergodic sources with alphabet A
and any weighting W that is a probability measure.

Proof: Let μ̄ denote the average or mixture measure induced by W
and the μ_θ, i.e., μ̄_N(x^N) = ∫ dW(θ) μ_θ^N(x^N). This measure is clearly stationary,
and hence application of Shannon's theorem to the mixture measure yields for each
N a code C_N such that

ℓ̄(C_N) = E_μ̄{ ℓ(X^N | C_N) } ≤ N^{-1} H(X^N) + N^{-1} ,

where H(X^N) is the Nth-order entropy of the mixture source and E_μ̄ denotes
expectation with respect to μ̄. Thus for these codes

∫_Λ r_θ(C_N) dW(θ) = ∫ dW(θ) { E_θ{ ℓ(X^N | C_N) } - λ_θ(N) }

= E_μ̄{ ℓ(X^N | C_N) } - ∫ dW(θ) λ_θ(N)

≤ N^{-1} H(X^N) + N^{-1} - N^{-1} ∫ dW(θ) H(X^N | θ)

= N^{-1} { H(X^N) - H(X^N | Θ) } + N^{-1}

= N^{-1} I(X^N ; Θ) + N^{-1} ,

where I(X^N ; Θ) is the average mutual information between the unknown source
index and the output N-tuples. In [5] it is shown for the class considered that
N^{-1} I(X^N ; Θ) → 0 as N → ∞, completing the proof. This convergence to zero
follows since the source outputs of an ergodic source eventually uniquely specify the
source in the limit, and therefore the per-letter average mutual information tends to
zero. The theorem can also be proved by construction [6].
Using the ergodic decomposition, the above theorem can be extended
to the apparently more general class of all stationary processes.
Unfortunately, existence theorems for weakly-minimax and strongly-
minimax universal codes are not as easily obtainable. As an alternative approach, the
following theorem (proved in [6]) provides a construction often yielding universal
codes for certain classes of sources.

Theorem 2.3 (Codebook Theorem): Let {T_j^N ; j = 1,...,J_N} be a
partition of the source N-tuple output space with "representatives" θ_1,...,θ_{J_N}.
Let j(·) be the mapping of x^N into the index of the T_j^N containing it, i.e.,
x^N ∈ T_{j(x^N)}^N. If for all θ there are vanishing sequences ε_θ(N) ≥ 0, δ_θ(N) ≥ 0
and a measure μ_N not depending on θ, such that

then weakly-minimax universal codes exist. If ε_θ(N) and δ_θ(N) do not depend on
θ, then strongly-minimax universal codes exist.
Code constructions follow immediately whenever the codebook
theorem applies. Make a codebook on the source output space for each of the
representative values θ_j. Send the codeword in two parts. The first part consists of
the value j(x^N) encoded with a word of length -log μ_N[T_{j(x^N)}] + 1. The
second part consists of the codeword for x^N in the j(x^N)th codebook.
As an example of the application of the codebook theorem, consider
the binary sources of eqn. (2.5). The representative values are θ_j = j/N, j = 0,1,...,N;
T_j = {x^N : Hamming weight j}; and μ_N(T_j) = (N+1)^{-1}.
Obviously the per-letter cost of the first part is then

N^{-1} log(N+1) ,

which vanishes as N → ∞.

The codebook theorem can be used to solve many other coding
problems. Suppose that for any fixed θ the source is stationary and ergodic. The
source is then called conditionally stationary ergodic. The following theorem can be
proven:

Theorem 2.4: For any conditionally stationary ergodic source,
weighted-universal codes exist. If there exists a probability function q(x_1) such
that

(2.9)   E[ -log q(X_1) | θ ] < ∞

for every θ ∈ Λ, then weakly-minimax universal codes exist. If the supremum of
(2.9) is finite over Λ and there exists a vanishing sequence {ε_k} such that for all
N, M ≥ k and all θ ∈ Λ

(2.10)

then minimax universal codes exist.

Theorem 2.4 is established by the codebook theorem, where the
representative values {θ_j} are kth-order Markov source values which can be taken
from a source histogram. As N → ∞, k → ∞ in such a way that (2.10) is satisfied.

III. Fixed-Rate Noiseless Coding


In some applications variable-length coding is not allowed. Instead, for
some rate R one must encode each source block of length N into a fixed-length
coded representation of RN bits. Assuming that there are more than 2^{RN} possible
message blocks, there will be some nonzero probability of error associated with this
operation. For "noiseless" coding we would like to know how big R must be so that
vanishing probability of error can be assured by choosing N large enough. In direct
analogy to Section II, weighted, maximin, minimax and weakly-minimax universal
codes can be defined over a source parametrized by a random variable Θ on a space
Λ, with performance measured by error probability rather than redundancy. For
conditionally stationary ergodic sources, the results of Section II can be combined
with the law of large numbers (as used in the McMillan asymptotic equipartition
property) to obtain the following theorem:

Theorem 3: For encoding any conditionally stationary ergodic source,
if

R > lim_{N→∞} sup_{θ∈Λ} N^{-1} H(X^N | θ) ,

then weighted and weakly-minimax universal codes exist.
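A toy illustration of the fixed-rate mechanism (the Bernoulli model and the rule of indexing the 2^{RN} most probable blocks are assumptions made for this sketch, not the construction behind Theorem 3): above the entropy rate the error probability falls with N, below it the error probability rises.

```python
from math import comb

def error_probability(theta, N, R):
    """Give fixed RN-bit indices to the 2^{RN} most probable length-N blocks
    of a Bernoulli(theta) source; any other block causes an error."""
    budget = 2 ** int(round(R * N))
    covered, used = 0.0, 0
    # For theta < 1/2 the block probability theta^w (1-theta)^(N-w) decreases
    # with the weight w, so filling the codebook by increasing weight is optimal.
    for w in range(N + 1):
        take = min(comb(N, w), budget - used)
        covered += take * theta ** w * (1 - theta) ** (N - w)
        used += take
        if used == budget:
            break
    return 1.0 - covered

theta = 0.11                     # entropy rate H(X|theta) is about 0.5 bits/letter
for R in (0.4, 0.6):
    print(R, [round(error_probability(theta, N, R), 3) for N in (20, 100, 400)])
```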

IV. Universal Coding on Video Data


The idea of variable-length universal coding was applied to a recorded
satellite picture of the earth consisting of 2400 lines, each line containing 4096 8-bit
samples. The sample-to-sample differences {x_i} were formed and modeled as
having the probability mass function

(4.1)   μ_θ(x_i) = [(1-θ)/(1+θ)] θ^{|x_i|} .

It is well known that this is a reasonable approximation for video data. From (4.1)
and the independence assumption,

(4.2)   μ_θ(x^N) = [(1-θ)/(1+θ)]^N θ^{Σ_{i=1}^N |x_i|} .

Choosing representative values of θ, five codebooks were generated:
one fixed-length PCM coder, three variable-length coders on the individual {x_i},
and one run-length coder. As indicated by the codebook theorem, each block was
encoded by each of the five codebooks, with the shortest codeword chosen for the
actual representation and a prefix code added to denote the codebook. The
resulting average rate was three bits per sample at a block size of N = 64. For
increasing or decreasing block sizes about this value, the rate was found to increase
slowly. For larger block sizes the nonstationarity of the data causes the increase,
whereas for smaller block sizes the prefix "overhead" information causes the
increase. As a basis for comparison, the actual sample entropy of the differences was
calculated and found to be 3.30 bits per sample across the picture. Note that this is
the minimum that any of the usual coding schemes can achieve. The universal coding
scheme can do better than the entropy, in apparent contradiction to the usual
source coding theorem, by taking advantage of the non-stationary nature of the
source.
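The select-the-shortest-codebook step can be sketched schematically as follows (the two toy coders and the block contents are assumptions for illustration; they are not the five coders used on the satellite data):

```python
from math import ceil, log2

def fixed_length_bits(block, bits_per_sample=8):
    """Coder 1: plain fixed-length (PCM-like) representation."""
    return bits_per_sample * len(block)

def run_length_bits(block, bits_per_sample=8):
    """Coder 2: crude run-length coder, short when many samples repeat."""
    runs = 1 + sum(1 for a, b in zip(block, block[1:]) if a != b)
    return runs * (bits_per_sample + ceil(log2(len(block) + 1)))

CODERS = [fixed_length_bits, run_length_bits]

def encode_block(block):
    """Universal step: try every codebook, keep the shortest, prepend its index."""
    lengths = [coder(block) for coder in CODERS]
    best = min(range(len(CODERS)), key=lambda i: lengths[i])
    prefix_bits = ceil(log2(len(CODERS)))        # overhead naming the chosen codebook
    return best, prefix_bits + lengths[best]

for block in ([0] * 64, list(range(64))):        # a flat block and a busy block
    idx, bits = encode_block(block)
    print(idx, bits / len(block))                # chosen coder and bits per sample
```

The prefix overhead is what makes very small blocks costly, while very large blocks suffer because a single codebook must serve data whose statistics drift within the block.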
V. Fixed-Rate Coding Subject to a Fidelity Criterion
We now drop the requirement of a perfect reproduction and require
only that some average fidelity constraint be satisfied. Thus compression is now
attainable by eliminating unnecessary precision as well as redundancy. Let Â be an
available reproducing alphabet, i.e., the possible letters in the compressed
reproduction. Usually, but not necessarily, Â ⊆ A, e.g., the result of quantizing. Let
ρ(x,y) be a nonnegative distortion measure defined on A × Â, i.e., ρ(x,y) ≥ 0 for all
x ∈ A, y ∈ Â. The distortion between N-tuples is assumed to be single-letter, i.e.,

ρ_N(x^N, y^N) ≜ N^{-1} Σ_{i=1}^N ρ(x_i, y_i) .

A codebook C_N is a collection of ‖C_N‖ < ∞ codewords, or N-tuples with entries
in Â. A source block x^N is encoded using C_N by mapping it into the best codeword
in the ρ_N sense, i.e., into the y^N ∈ C_N minimizing ρ_N(x^N, y^N). The resulting y^N
is denoted x̂(x^N). The codebook together with the encoding rule is called a code
and is also denoted by C_N. If the code C_N is used on a source θ, the parameters of
interest are the rate of the code

R(C_N) = N^{-1} log ‖C_N‖

and the average distortion resulting from using C_N on θ,

ρ_θ(C_N) = E_θ{ ρ_N(X^N | C_N) } ,

where

ρ_N(x^N | C_N) = ρ_N(x^N, x̂(x^N)) = min_{y^N ∈ C_N} ρ_N(x^N, y^N) .
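A minimal sketch of these quantities (the real-valued source, squared-error distortion, and randomly generated codebook are assumptions made for illustration; nothing here is prescribed by the text):

```python
import random
from math import log2
random.seed(0)

def rho_N(x, y):
    """Per-letter squared-error distortion between two blocks."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def encode(x, codebook):
    """Encoding rule: map the source block to the best codeword in the rho_N sense."""
    return min(codebook, key=lambda y: rho_N(x, y))

N, size = 8, 64                                    # rate R(C_N) = log2(size)/N bits/letter
codebook = [[random.gauss(0, 1) for _ in range(N)] for _ in range(size)]
blocks = [[random.gauss(0, 1) for _ in range(N)] for _ in range(500)]
avg_distortion = sum(rho_N(x, encode(x, codebook)) for x in blocks) / len(blocks)
print(log2(size) / N, round(avg_distortion, 3))    # (rate, sample-average distortion)
```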

Compression is achieved since the code size ‖C_N‖ is usually much smaller than the
number of possible source N-tuples (which is in general uncountably infinite) and
hence any codeword can be specified using fewer bits than the original source
required. (Strictly speaking, "compression" is achieved if R(C_N) < H(X), the
entropy rate of the source.) Fidelity is lost as a result of the compression, but the
goal is to minimize this hopefully tolerable loss. The optimal performance is now
specified by the minimum attainable average distortion using fixed-rate codes. The
rate may be constrained by channel capacity, available equipment, receiver
limitations, storage media, etc. Let 𝒞(N,R,Â) be the class of alphabet-Â,
blocklength-N codes having rate less than or equal to R. Define

δ_θ(R,N,Â) = inf_{C_N ∈ 𝒞(N,R,Â)} ρ_θ(C_N) ,

δ_θ(R,Â) = inf_N δ_θ(R,N,Â) .

δ_θ parallels the λ_θ performance measure of noiseless coding. It can be shown
[10] that if θ is stationary, then the limit of δ_θ(R,N,Â) as N → ∞ exists and equals
the infimum over N. Shannon's theorem on source coding with a fidelity criterion
relates the desired optimal δ_θ to a well-defined information-theoretic minimization
called the distortion-rate function (DRF). This theorem is important since δ_θ (like
λ_θ) cannot in general be directly evaluated, while the DRF is amenable to fast
computer computation via nonlinear programming techniques [11].
The DRF of a stationary source θ with available reproducing alphabet Â
is defined by

D_θ(R,Â) = lim_{N→∞} D_θ(R,Â,N) ,

D_θ(R,Â,N) = inf_{N^{-1} I(X^N ; X̂^N) ≤ R} E_θ{ ρ_N(X^N, X̂^N) } ,

where the inf is over all test channels (conditional probability measures for X̂^N
given x^N, i.e., random encoders) and I(X^N ; X̂^N) is the average mutual information
between input and output N-tuples of the given source and test channel [12,13].
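As remarked above, the DRF can be evaluated numerically; a minimal sketch in the spirit of the iterative algorithm of [11], for a memoryless source with a finite alphabet and single-letter distortion (the particular source, distortion matrix, and slope values below are illustrative assumptions):

```python
import math

def blahut_point(p, d, s, iters=500):
    """One point on the distortion-rate curve of a memoryless source.
    p: source pmf, d: distortion matrix, s <= 0: slope parameter; R is in bits."""
    n, m = len(p), len(d[0])
    A = [[math.exp(s * d[x][y]) for y in range(m)] for x in range(n)]
    q = [1.0 / m] * m                              # reproduction distribution, iterated
    for _ in range(iters):
        c = [sum(q[y] * A[x][y] for y in range(m)) for x in range(n)]
        q = [q[y] * sum(p[x] * A[x][y] / c[x] for x in range(n)) for y in range(m)]
    c = [sum(q[y] * A[x][y] for y in range(m)) for x in range(n)]
    # Test channel Q(y|x) = q(y) A(x,y) / c(x); report its distortion and mutual information.
    D = sum(p[x] * q[y] * A[x][y] / c[x] * d[x][y] for x in range(n) for y in range(m))
    R = sum(p[x] * q[y] * A[x][y] / c[x] * math.log2(A[x][y] / c[x])
            for x in range(n) for y in range(m))
    return R, D

# Binary source with P(1) = 0.2 under Hamming distortion, where R(D) = h(0.2) - h(D).
p, d = [0.8, 0.2], [[0.0, 1.0], [1.0, 0.0]]
for s in (-2.0, -3.0, -5.0):
    print(tuple(round(v, 3) for v in blahut_point(p, d, s)))
```

Each value of the slope parameter traces out one (rate, distortion) pair; sweeping s sweeps out the curve.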

Theorem 5.1 (Shannon, Gallager, Berger): Given a stationary ergodic
source θ, if there exists a reference reproduction letter a* such that
E_θ{ ρ(X_1, a*) } < ∞, then

δ_θ(R,Â) = D_θ(R,Â) .

Theorem 5.1 resembles Theorem 2.1 in that it relates optimal
performance to an information-theoretic quantity. Unlike Theorem 2.1, however,
Theorem 5.1 only relates these quantities asymptotically, i.e., there is no general
relation between δ_θ(R,Â,N) and D_θ(R,Â,N). Analogous to redundancy in the
noiseless case, define the discrepancy of a rate-R code C_N as the difference between
its actual performance and the optimum attainable for the given class of codes:

d_θ(C_N) ≜ ρ_θ(C_N) - δ_θ(R,N,Â) .

We next consider source coding for a class of sources. A sequence of
codes {C_N}_{N=1}^∞ will be said to be universal if d_θ(C_N) → 0 in some sense for all θ.
The various types of universal fixed-rate codes with a fidelity criterion are defined
by the type of convergence, exactly as in the noiseless case. The comparisons and
relative strengths are obvious generalizations of the noiseless case.
Given a probability measure W on Λ, a sequence of codes {C_N} is said
to be weighted-universal if

(5.1)   lim_{N→∞} ∫_Λ dW(θ) d_θ(C_N) = 0 ,

weakly-minimax universal if

(5.2)   lim_{N→∞} d_θ(C_N) = 0 ,   all θ ∈ Λ ,

and strongly-minimax universal if

(5.3)   lim_{N→∞} d_θ(C_N) = 0 ,   uniformly in θ .

Before proceeding to the general case, we consider, as before, a nearly
trivial case to make the existence of such codes believable and to demonstrate a
typical construction.

Theorem 5.2: Strongly-minimax (and therefore weakly-minimax and
weighted) universal codes exist for any finite class of stationary sources.

Proof: Say the class contains K sources, k = 1,...,K. δ_k(R,Â) can be
shown to be a continuous function of R, so that given an ε > 0 there exists an N
sufficiently large to ensure that for all k

| δ_k(R - N^{-1} log K, Â, N) - δ_k(R,Â) | ≤ ε/2 .

For each source k build a nearly optimal blocklength-N code C_N(k) of rate
R - N^{-1} log K such that

ρ_k(C_N(k)) ≤ δ_k(R - N^{-1} log K, Â, N) + ε/2 .

The extra ε/2 is necessary since a code actually yielding the infimum defining δ_k
may not exist. Form the union codebook C_N = ∪_{k=1}^K C_N(k) containing all the
distinct codewords in all of the subcodes C_N(k). The encoding rule is unchanged,
i.e., a source block is encoded into the best codeword in C_N. Since this word can be
no worse than the best word in any subcode C_N(k), the average distortion resulting
from using C_N on the source k satisfies

ρ_k(C_N) ≤ ρ_k(C_N(k))

≤ δ_k(R - N^{-1} log K, Â, N) + ε/2

≤ δ_k(R,Â) + ε .

The rate of C_N is given by

R(C_N) = N^{-1} log ‖C_N‖

≤ N^{-1} log ( K max_k ‖C_N(k)‖ )

≤ N^{-1} log K + (R - N^{-1} log K) = R ,

completing the proof.


The basic idea is that with a slight decrease in rate (which asymptotically
vanishes), we can build a code that accounts for all possibilities by combining
subcodes for each possible source.
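A sketch of the union construction (the Gaussian stand-in subcodes, squared-error distortion, and the specific rates are assumptions for illustration; real subcodes would be designed for their subsources rather than drawn at random):

```python
import random
from math import log2
random.seed(0)

def rho_N(x, y):                                   # per-letter squared-error distortion
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def make_subcode(scale, size, N):
    """Stand-in subcode matched to a subsource of standard deviation `scale`."""
    return [[random.gauss(0, scale) for _ in range(N)] for _ in range(size)]

N, R, K = 8, 1.0, 4
sub_size = 2 ** int(N * R - log2(K))               # each subcode gets rate R - N^{-1} log K
union = [w for s in (0.5, 1.0, 2.0, 4.0) for w in make_subcode(s, sub_size, N)]
print(log2(len(union)) / N <= R)                   # the union's rate stays within R

x = [random.gauss(0, 2.0) for _ in range(N)]       # a block from the third "subsource"
best = min(union, key=lambda y: rho_N(x, y))       # encoding rule is unchanged
print(round(rho_N(x, best), 3))
```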
The finite case does not generalize immediately since in general there is
an uncountably infinite class of sources and we cannot possibly build a subcode for
each. If, however, the class can be partitioned into a finite number of subclasses such
that sources within a subclass are "similar" or "close" in some way, in that a code
designed for a single representative of the subclass works "well" for all members of
the subclass, then the resulting subcodes can be combined as previously to obtain a
universal code sequence. With differing definitions of "similar" and "well", this
topological approach has resulted in the most general known existence theorems for
weakly- and strongly-minimax universal codes. An example of this approach will be
presented in the proof of the strongly-minimax universal coding theorem which,
unlike the noiseless case, is here the easiest to demonstrate.
We now proceed to statements of the various universal coding
theorems. The required technical assumptions are given for completeness.

Theorem 5.3 (Weighted-Universal Coding Theorem [4,8,14]):
Given a metric distortion measure ρ on (A∪Â) × (A∪Â) such that A
is a separable metric space under ρ and every bounded set of A is totally bounded,
let Λ be the class of all ergodic alphabet-A processes. If there is a reference source
letter a* such that

E_θ{ ρ(X_1, a*) } < ∞ ,   all θ ∈ Λ ,

and if, for the weighting W, ∫_Λ dW(θ) E_θ{ ρ(X_1, a*) } < ∞, then weighted-universal
codes exist for Λ.
The theorem follows directly from the source coding theorem for
stationary nonergodic sources [4], as generalized in [14], since a mixture of ergodic
sources is equivalent to a single stationary source for which there exists a sequence
{C_N} such that

lim_{N→∞} ∫ dW(θ) E_θ{ ρ_N(X^N | C_N) } = ∫_Λ D_θ(R,Â) dW(θ) = ∫_Λ δ_θ(R,Â) dW(θ) ,

yielding the theorem [8]. The proof of the source coding theorem used is a
complicated generalization of random coding arguments and a topological
decomposition of Λ using the ergodic decomposition.
The distortion measure ρ is defined on (A∪Â) × (A∪Â) because the above
theorem is proved using a two-step encoding combining the regular encoding with a
quantization within the source or reproduction alphabet. Hence distortion must be
defined on A×A and Â×Â as well as A×Â.
The previous theorem is conceptually easily generalized to classes of
stationary sources using the ergodic decomposition. Numerous technical
measurability problems arise, however, and such results are more easily obtained
using the following theorems.

Theorem 5.3 (Weakly-Minimax Universal Coding Theorem [8,10]):
Given a metric distortion measure ρ on (A∪Â) × (A∪Â) under which
either A or Â is a separable metric space, weakly-minimax universal codes exist
for the class of all stationary processes with alphabet A.
When A is separable, the theorem is proved using the previously
described topological approach of carving up the class of sources [8]. The distance
used is the distribution or variational distance. When Â is separable, the method of
proof is a generalization of Ziv's combinatoric proof [8,10] that does not involve
the structure of the source class, but attempts to fit a given code structure as well as
possible to whatever source block is observed.
In [8] this theorem is proved first for special simple cases and then for
the general cases, where the topological and combinatoric approaches are compared
and contrasted in some detail. Instead of further considering the details of
weakly-minimax universal codes, however, we proceed to a discussion of strongly-
minimax universal codes, as these are practically the most useful type, the proof is
easy and demonstrates the basic topological approach. In addition, an interim step in
the proof provides an interesting side result giving a measure of the mismatch
occurring when applying a code designed for one source to another.

To state the theorem in its most general form, we require the concept
of the ρ̄ distance between random processes and a simple application. Given two
stationary processes θ and φ, the ρ̄ distance ρ̄(θ,φ) is defined by

ρ̄(θ,φ) = sup_n ρ̄_n(θ,φ) ,   ρ̄_n(θ,φ) = inf_{q ∈ Q_n(θ,φ)} E_q[ ρ_n(X^n, Y^n) ] ,

where Q_n(θ,φ) is the class of all joint distributions q describing random vectors
(X^n, Y^n) such that the marginal distributions specifying X^n and Y^n are μ_θ^n and
μ_φ^n, respectively. Thus ρ̄_n measures how well X^n and Y^n can be matched in the
ρ_n sense by probabilistically connecting the random vectors in a way consistent
with their given distributions. Alternative definitions and properties of the ρ̄
distance are given in [7] and [8]. In particular, it is there proved that ρ̄ is a metric
and that ρ̄ has the following simple (but less useful here) alternative definition:

ρ̄(θ,φ) = inf_{ {W_n} } E[ ρ(X_0, Y_0) ] ,

where the infimum is over stationary random processes {W_n}_{n=-∞}^∞ of pairs
W_n = (X_n, Y_n) such that the coordinate process {X_n} is the θ process and {Y_n} is
the φ process. Thus ρ̄ measures how well the processes can fit together in the ρ
sense at a single time if the processes are stochastically linked in a jointly stationary
manner. The usefulness of ρ̄ is demonstrated by the following simple and intuitive
theorem:

Theorem 5.4 (Mismatch Theorem): For any blocklength N and any
codebook C_N,

| ρ_θ(C_N) - ρ_φ(C_N) | ≤ ρ̄_N(θ,φ) ≤ ρ̄(θ,φ) .

Proof: Let q nearly yield ρ̄_N, i.e., E_q[ ρ_N(X^N, Y^N) ] ≤ ρ̄_N(θ,φ) + ε. If q
describes (X^N, Y^N), then X^N and Y^N are distributed according to μ_θ^N and μ_φ^N,
respectively, so that we have from the triangle inequality that

ρ_θ(C_N) ≤ ρ_φ(C_N) + E_q[ ρ_N(X^N, Y^N) ] ≤ ρ_φ(C_N) + ρ̄_N(θ,φ) + ε .

Since ε is arbitrary, reversing the roles of θ and φ completes the proof.


Using optimal codes in the above theorem immediately yields the
following:

Corollary:

| δ_θ(R,Â,N) - δ_φ(R,Â,N) | ≤ ρ̄(θ,φ) ,   all N ,

and, therefore, if the sources are ergodic,

| δ_θ(R,Â) - δ_φ(R,Â) | ≤ ρ̄(θ,φ) .
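To make the ρ̄_n minimization concrete, the first-order term ρ̄_1 for memoryless marginals on finite alphabets can be computed as a small linear program over couplings (the alphabets, distortion matrix, and the use of numpy/scipy are assumptions for this sketch):

```python
import numpy as np
from scipy.optimize import linprog

def rho_bar_1(p, q, d):
    """min over couplings of E[rho(X, Y)] subject to X ~ p and Y ~ q."""
    m, n = len(p), len(q)
    c = np.asarray(d, dtype=float).reshape(-1)        # objective: sum over (x, y) pairs
    A_eq = np.zeros((m + n, m * n))
    for x in range(m):                                # row marginals must equal p
        A_eq[x, x * n:(x + 1) * n] = 1.0
    for y in range(n):                                # column marginals must equal q
        A_eq[m + y, y::n] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun

# Two Bernoulli marginals under Hamming distortion: the best coupling gives |theta - phi|.
print(rho_bar_1([0.7, 0.3], [0.55, 0.45], [[0, 1], [1, 0]]))   # ~0.15
```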

Theorem 5.4 (Strongly-Minimax Universal Coding Theorem):
If the class Λ is totally bounded under ρ̄, i.e., if given ε > 0 there is a
finite partition {B_k}_{k=1}^K, where K = K(ε), of Λ such that if θ, φ ∈ B_k then
ρ̄(θ,φ) ≤ ε, then there exist strongly-minimax universal codes for Λ.

Proof: For each set B_k in the partition, let k be a representative
source, i.e., k indexes any fixed θ in B_k. Construct as in Theorem 5.2 a universal
code C_N for the finite class of sources k = 1,...,K. Application of this code to Λ
yields the following: Given a source θ, it must lie in some subclass B_k in the
partition, say B_{k(θ)}. We have from the proof of Theorem 5.2 and the above
construction that

ρ_θ(C_N) ≤ δ_{k(θ)}(R,Â) + ε/2 + ρ̄(θ, k(θ))

≤ δ_θ(R,Â) + ε/2 + 2ρ̄(θ, k(θ))

≤ δ_θ(R,Â) + 5ε/2 ,

completing the proof.


Intuitively, to build a strongly-minimax code we "cover" the class by a
finite number of "representative" subcodes that together have rate R. Each subcode
works nearly optimally for a subclass of the class.
An example of a totally bounded class of sources under ρ̄ is the class of
all stationary finite-state wide-sense Markov chains of order less than some finite
integer. Extensions to more general classes along with partial converses may be
found in [8].
In practice one might reverse the order of construction by first
choosing a reasonable number K of representative subcodes, each of rate R and
blocklength N, with resulting rate R + N^{-1} log K and average distortion within
ε(K) of the optimal. An appropriate choice of blocklength will allow use of this
construction on nonstationary sources that are locally stationary [15].
VI. Variable-Rate Coding with Distortion
The average distortion of the encoding at a fixed rate, as in the last
section, depends upon the actual value of θ in effect, that is, the actual stationary
ergodic source seen by the encoder. Thus distortion is a random variable over the
ensemble, with distribution given by the distribution of the parameter θ. In many
applications it may be more desirable to allow the coding rate to depend on θ while
holding the average distortion fixed over the ensemble.
As in the last section, a coding theorem will be established for the
special case of a finite number of subsources, θ = 1,2,...,K, in the ensemble. As
before, the theorem holds for noncountable ensembles as well, but the proof is
considerably more involved. In addition, it will be assumed that there is a maximum
distortion value, ρ_M.
For each value of k, generate a set of codewords according to the usual
coding theorem for stationary ergodic sources. If R_k(D) is the rate-distortion
function in bits of the kth subsource and D is the desired value of average
distortion, each code will contain L_k codewords, where

(6.1)   log L_k = N( R_k(D - ε) + ε ) .

Here ε is an arbitrary positive constant, and the blocksize is chosen large enough so
that the average distortion is D - ε for all θ and so that the probability that there is
no codeword with distortion less than D - ε/2 is less than ε/2ρ_M.
The coded representation of each of the L_k codewords generated for
each k will consist of two parts. The first part will be the fixed-length binary number
equal to k-1, k = 1,2,...,K, using at most log K + 1 bits to identify the codebook.
The second part will be the location of the codeword in a list for each k of length at
most log L_k + 1 bits. Thus the rate of any codeword for a given θ is at most

(6.2)   r_k = (log K + 2)/N + R_k(D - ε) + ε   bits.

Obviously, by choosing N large enough and ε small enough, the rate can
be made arbitrarily close to R_θ(D) for θ = 1,2,...,K.
The achievement of D and R_θ(D) for the combined supercode depends
upon the actual choice of a codeword out of the L = Σ_{j=1}^K L_j codewords in total.
The coding rule is as follows. Upon observing an output block of length N, among
the codewords of distortion less than D - ε/2, find the one of minimum rate, if any.
If there is no codeword with distortion less than D - ε/2, make a random choice.
The average rate for θ = k is then

≤ r_k + ( sup_j r_j ) Prob[ no codeword of distortion less than D - ε/2 in the kth code ]

(6.3)   ≤ r_k + ( sup_j r_j ) ε/2ρ_M

= (log K + 2)/N + R_k(D - ε) + ε + ( sup_j r_j ) ε/2ρ_M ,

which is arbitrarily close to R_k(D) for small enough ε and large enough N. The
average distortion for θ = k is

≤ D - ε/2 + ρ_M Prob[ no codeword of distortion less than D - ε/2 in the kth code ]

≤ D .

Further details can be found in [9].
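A schematic sketch of this selection rule (the Gaussian subcodes, squared-error distortion, and target values are assumptions for illustration, not the construction's actual codebooks):

```python
import random
from math import ceil, log2
random.seed(2)

def rho_N(x, y):                                    # per-letter squared-error distortion
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def encode_variable_rate(x, subcodes, D, eps):
    """Among all codewords with distortion below D - eps/2, pick the one with the
    shortest two-part description (codebook index + list position); otherwise
    fall back to a random codeword."""
    K = len(subcodes)
    best = None
    for k, code in enumerate(subcodes):
        bits = ceil(log2(K)) + ceil(log2(len(code)))
        for i, y in enumerate(code):
            if rho_N(x, y) < D - eps / 2 and (best is None or bits < best[0]):
                best = (bits, k, i)
    if best is None:                                # no adequate codeword: random choice
        k = random.randrange(K)
        i = random.randrange(len(subcodes[k]))
        return ceil(log2(K)) + ceil(log2(len(subcodes[k]))), k, i
    return best

N = 8
subcodes = [[[random.gauss(0, s) for _ in range(N)] for _ in range(2 ** (4 + 2 * j))]
            for j, s in enumerate((0.5, 1.0, 2.0))]  # larger codebooks for "bigger" subsources
x = [random.gauss(0, 1.0) for _ in range(N)]
print(encode_variable_rate(x, subcodes, D=1.0, eps=0.2))
```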

VII. Conclusions
We have summarized a unified formulation of source coding, both
noiseless and with a fidelity criterion, in inaccurately or incompletely specified
statistical environments. Several alternative definitions of universally good coding
algorithms have been presented, the most general known existence theorems stated,
and proofs given for some simple informative cases. The results indicate that
surprisingly good performance is possible in such situations and suggest possible
design philosophies, i.e., (1) estimate the source present and transmit both the
estimate and an optimal codeword for the estimated source, or (2) build several
representative subcodes and send the best word produced by any of them. In the
first approach, asymptotically the estimate well approximates the true source and
takes up a negligible percentage of the codeword. In the second approach, if the
representatives are well chosen, all possible sources are nearly optimally encoded
with an asymptotically negligible increase in rate.
The results described are largely existence theorems and therefore do
not prescribe a specific method of synthesizing data compression systems. They do,
however, provide figures of merit and optimal performance bounds that can serve as
an absolute yardstick for comparison, and as justification for overall design
philosophies that have proved useful in practice.
REFERENCES

[1] SHANNON, C.E., "The Mathematical Theory of Communication", University of Illinois Press, Urbana, Illinois, 1949.

[2] SHANNON, C.E., "Coding Theorems for a Discrete Source with a Fidelity Criterion", in IRE Nat. Conv. Rec., pt. 4, pp. 142-163, 1959.

[3] ROZANOV, YU., "Stationary Random Processes", Holden-Day, San Francisco, 1967.

[4] GRAY, R.M., and DAVISSON, L.D., "Source Coding Theorems without the Ergodic Assumption", IEEE Trans. Inform. Theory, July 1974.

[5] GRAY, R.M., and DAVISSON, L.D., "The Ergodic Decomposition of Discrete Stationary Sources", IEEE Trans. Inform. Theory, September 1974.

[6] DAVISSON, L.D., "Universal Noiseless Coding", IEEE Trans. Inform. Theory, Vol. IT-19, pp. 783-795, November 1973.

[7] GRAY, R.M., NEUHOFF, D., and SHIELDS, P., "A Generalization of Ornstein's d̄ Metric with Applications to Information Theory", Annals of Probability (to be published).

[8] NEUHOFF, D., GRAY, R.M., and DAVISSON, L.D., "Fixed Rate Universal Source Coding with a Fidelity Criterion", submitted to IEEE Trans. Inform. Theory.

[9] PURSLEY, M.B., "Coding Theorems for Non-Ergodic Sources and Sources with Unknown Parameters", USC Technical Report, February 1974.

[10] ZIV, J., "Coding of Sources with Unknown Statistics - Part I: Probability of Encoding Error; Part II: Distortion Relative to a Fidelity Criterion", IEEE Trans. Inform. Theory, Vol. IT-18, 1972.

[11] BLAHUT, R.E., "Computation of Channel Capacity and Rate-Distortion Functions", IEEE Trans. Inform. Theory, Vol. IT-18, No. 4, July 1972, pp. 460-473.

[12] GALLAGER, R.G., "Information Theory and Reliable Communication", New York, Wiley, 1968, ch. 9.

[13] BERGER, T., "Rate Distortion Theory: A Mathematical Basis for Data Compression", Englewood Cliffs, New Jersey, Prentice-Hall, 1971.

[14] NEUHOFF, D., Ph.D. Research, Stanford University, 1973.

[15] GRAY, R.M., and DAVISSON, L.D., "A Mathematical Theory of Data Compression (?)", USCEE Report, September 1974.
CONTENTS

RATE DISTORTION THEORY AND DATA COMPRESSION
By T. Berger

Preface . . . 3
Lecture 1. Rate Distortion Theory: An Introduction . . . 5
Lecture 2. Computation of R(D) . . . 11
Lecture 3. The Lower Bound Theorem & Error Estimates for the Blahut Algorithm . . . 15
Lecture 4. Extension to Sources and Distortion Measures with Memory . . . 19
Lecture 5. Algebraic Source Encoding . . . 25
Lecture 6. Tree Codes for Sources . . . 29
Lecture 7. Theory vs Practice and Some Open Problems . . . 33
References . . . 37

UNIVERSAL SOURCE CODING
By Lee D. Davisson

Preface . . . 43
I. Introduction . . . 45
II. Variable-Rate Noiseless Source Codes . . . 48
III. Fixed-Rate Noiseless Coding . . . 55
IV. Universal Coding on Video Data . . . 56
V. Fixed-Rate Coding Subject to a Fidelity Criterion . . . 57
VI. Variable-Rate Coding with Distortion . . . 66
VII. Conclusions . . . 67
References . . . 69
