Advances in Source Coding Toby Berger
C 0 U R S E S A N D L E C T U R E S - No. 166
TOBY BERGER
CORNELL UNIVERSITY
ITHACA, NEW YORK
AND
LEE D. DAVISSON
UNIVERSITY OF SOUTHERN CALIFORNIA
LOS ANGELES, CALIFORNIA
ADVANCES IN SOURCE
CODING
UDINE 1975
TOBY BERGER
School of Electrical Engineering
Cornell University, Ithaca, New York
PREFACE
T. Berger
LECTURE 1
Rate Distortion Theory: An Introduction

If the rate R at which information can be provided about the source output exceeds the source entropy H, then no distortion
need result. As R decreases from H towards 0, the minimum attainable distortion
steadily increases from 0 to the value D_max associated with the best guess one can
make in the total absence of any information about the source outputs. Curves that
quantify such tradeoffs aptly have been termed rate distortion functions by
Shannon [2].
The crucial property that a satisfactorily defined rate distortion
function R(D) possesses is the following:
"It is possible to compress the data rate from H down to any R > R(D)
and still be able to recover the original source outputs with an average
( *) distortion not exceeding D. Conversely, if the compressed data rate R
satisfies R < R(D), then it is not possible to recover the original source
data from the compressed version with an average distortion of D or
less".
An R(D) curve that possesses the above property clearly functions as
an extension, or generalization, of the concept of entropy. Just as H is the minimum
data rate (channel capacity) needed to transmit the source data with zero average
distortion, R(D) is the minimum data rate (channel capacity) needed to transmit the
data with average distortion D.
Omnis rate distortion theory in tres partes divisa est.
(i) Definition, calculation and bounding of R(D) curves for various data sources
and distortion measures.
(ii) Proving of coding theorems which establish that said R(D) curves do indeed
specify the absolute limit on the rate vs. distortion tradeoff in the sense of
(*).
(iii) Designing and analyzing practical communication systems whose performances
[Figure: Communication System]
applications.
In order to quantify the rate-distortion tradeoff, we must have a means
of specifying the distortion that results when X_i = a_j and Y_i = b_k. We shall assume
that there is given for this purpose a so-called distortion measure ρ: A×B → [0,∞].
That is, ρ(a_j, b_k) is the penalty, loss, cost, or distortion that results when the source
produces a_j and the system delivers b_k to the user. Moreover, we shall assume that
the distortion ρ_n(x, y) that results when a vector x ∈ A^n of n successive source letters
is represented to the user as y ∈ B^n is of the form

ρ_n(x, y) = n^{-1} Σ_{t=1}^{n} ρ(x_t, y_t).
Example:

A = B = {0,1}
ρ(0,0) = ρ(1,1) = 0,  ρ(1,0) = α,  ρ(0,1) = β.

Then

ρ_3(010, 001) = (1/3)(0 + α + β) = (α + β)/3.
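As a concrete illustration of the per-letter distortion ρ_n, here is a minimal Python sketch; the numerical values chosen for α and β are illustrative only and are not from the text.

```python
# Per-letter average distortion rho_n(x, y) = (1/n) * sum_t rho(x_t, y_t)
# Single-letter distortion for A = B = {0, 1}; alpha, beta are illustrative
# placeholders for rho(1,0) and rho(0,1).
alpha, beta = 2.0, 1.0
rho = {(0, 0): 0.0, (1, 1): 0.0, (1, 0): alpha, (0, 1): beta}

def rho_n(x, y):
    """Arithmetic average of single-letter distortions over a block."""
    assert len(x) == len(y)
    return sum(rho[(xi, yi)] for xi, yi in zip(x, y)) / len(x)

# Reproduces the example above: rho_3(010, 001) = (alpha + beta)/3.
print(rho_n([0, 1, 0], [0, 0, 1]))  # -> 1.0 for the values chosen here
```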
Each communication system linking source to user in Figure 2 may be
fully described for statistical purposes by specifying for each j ∈ {1,...,M} and
k ∈ {1,...,N} the probability Q_{k|j} that a typical system output Y will be equal to
b_k given that the corresponding input X equals a_j. Contracting notation from P(a_j)
to P_j and from ρ(a_j, b_k) to ρ_{jk}, we may associate with each system Q = (Q_{k|j})
two functions of extreme importance, namely

I(Q) = Σ_{j=1}^{M} Σ_{k=1}^{N} P_j Q_{k|j} log ( Q_{k|j} / Q_k ),

where

Q_k = Σ_{j=1}^{M} P_j Q_{k|j},
and
d(Q) = Σ_{j,k} P_j Q_{k|j} ρ_{jk}.
I(Q) is the average mutual information between source and user for the
system Q, and d(Q) is the average distortion with which the source outputs are
reproduced for the user by system Q. For situations in which both the source and
the fidelity criterion are memoryless, Shannon [2] has defined the rate distortion
function R(D) as follows:

R(D) = min { I(Q) : d(Q) ≤ D }.
That is, we restrict our attention to those systems Q whose average distortion d(Q)
does not exceed a specified level D of tolerable average distortion, and then we
search among this set of Q's for the one with the minimum value of I(Q). We
minimize rather than maximize I(Q) because we wish to supply as little information
as possible about the source outputs provided, of course, that we preserve the data
with the required fidelity D. In physical terms, we wish to compress the rate of
transmission of information to the lowest value consistent with the requirement that
the average distortion may not exceed D.

Remark: An equivalent definition of the critical curve in the (R,D)-plane, in which
R is the independent and D the dependent variable, is

D(R) = min { d(Q) : I(Q) ≤ R }.
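To make the two functionals concrete, here is a minimal numpy sketch of I(Q) and d(Q) for a memoryless source and test channel; the source probabilities, distortion matrix, and channel values are illustrative, and logarithms are taken base 2.

```python
import numpy as np

P = np.array([0.5, 0.5])                 # source probabilities P_j (illustrative)
rho = np.array([[0.0, 1.0],              # distortion matrix rho_{jk}
                [1.0, 0.0]])
Q = np.array([[0.9, 0.1],                # transition probabilities Q_{k|j}
              [0.2, 0.8]])

def mutual_information(P, Q):
    """I(Q) = sum_{j,k} P_j Q_{k|j} log2( Q_{k|j} / Q_k ), with Q_k = sum_j P_j Q_{k|j}."""
    Qk = P @ Q
    joint = P[:, None] * Q
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2((Q / Qk)[mask])))

def avg_distortion(P, Q, rho):
    """d(Q) = sum_{j,k} P_j Q_{k|j} rho_{jk}."""
    return float(np.sum(P[:, None] * Q * rho))

print(mutual_information(P, Q), avg_distortion(P, Q, rho))
```

R(D) is then the smallest value of I(Q) over all channels Q meeting the distortion constraint d(Q) ≤ D; the next lecture gives an efficient way to perform that minimization.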
LECTURE 2.

Computation of R(D)
The definition of R(D) that we have given thus far is not imbued with
any physical significance because we have not yet proved the crucial source-coding
theorem and converse for R(D) so defined. We shall temporarily ignore this fact and
concentrate instead on procedures for calculating R(D) curves in practical examples.
Until recently, it was possible to find R(D) only in special examples involving either
small M and N or {P_j} and (ρ_{jk}) with special symmetry properties. A recent
advance by Blahut, however, permits rapid and accurate calculation of R(D) in the
general case [3].
The problem of computing R(D) is one in convex programming. The
mutual information functional I(Q) is convex ∪ in Q (Homework). It must be
minimized subject to the linear equality and inequality constraints

Σ_{k=1}^{N} Q_{k|j} = 1,  j = 1,...,M,

Σ_{j,k} P_j Q_{k|j} ρ_{jk} ≤ D,

and Q_{k|j} ≥ 0. Accordingly, the Kuhn-Tucker theorem can be applied to determine
necessary and sufficient conditions that characterize the optimum transition
probability assignment. Said conditions assume the following form:
Theorem 1. - Fix s ∈ [−∞, 0]. Then there is a line of slope s tangent to the R(D)
curve at the point (d(Q), I(Q)) if and only if there exists a probability vector
{Q_k, 1 ≤ k ≤ N} such that

(i) Q_{k|j} = λ_j Q_k e^{sρ_{jk}}, where λ_j = ( Σ_k Q_k e^{sρ_{jk}} )^{-1},

and

(ii) c_k ≜ Σ_j λ_j P_j e^{sρ_{jk}} ≤ 1 for all k, with equality whenever Q_k > 0.

Theorem 2 (Blahut). - Fix s ≤ 0, choose any initial output probability vector {Q_k^0}
with all components strictly positive, and iterate according to

Q_k^{r+1} = c_k^r Q_k^r,

where

c_k^r = Σ_j λ_j^r P_j e^{sρ_{jk}},  λ_j^r = ( Σ_k Q_k^r e^{sρ_{jk}} )^{-1}.

Then {Q_k^r} converges to the {Q_k} that satisfies (i) and (ii) of Theorem 1 as r → ∞.
tangent line x − 1 must lie above the curve log x for all x. This completes the proof of
the lemma. (log x ≤ x − 1 is often called the fundamental inequality.)
Corollary of Lemma: log y ≥ 1 − 1/y. (Proof omitted.)
To prove the theorem we consider the functional V(Q) = I(Q) − s·d(Q).
The graphical significance of V(Q) is shown in Figure 3. It is the R-axis intercept of a
line of slope s through the point (d(Q), I(Q)) in the (R,D)-plane. For any Q the
point (d(Q), I(Q)) necessarily lies on or above the R(D) curve by definition of the
latter. Since R(D) is convex ∪ (Homework), it follows that V(Q) ≥ V_s, the R-axis
intercept of the line of slope s that is tangent to R(D).
The Blahut iterative formula Q_k^{r+1} = c_k^r Q_k^r can be thought of as the
composition of two successive steps; namely, starting from {Q_k^r}, we have

Step 1.  Q_{k|j}^{r+1} = λ_j^r Q_k^r e^{sρ_{jk}},

Step 2.  Q_k^{r+1} = Σ_j P_j Q_{k|j}^{r+1} = Σ_j λ_j^r P_j Q_k^r e^{sρ_{jk}} = c_k^r Q_k^r.
Writing V^r in terms of the λ_j^r and the Q_k^r, a short chain of inequalities, obtained by
applying the fundamental inequality to each logarithm that appears and by using
Σ_j P_j = 1 and Σ_k Q_k^{r+1} = Σ_k c_k^r Q_k^r = 1, shows that

V^r − V^{r+1} ≥ 1 − Σ_k Q_k^r c_k^r + 1 − 1 = 1 − Σ_k Q_k^{r+1} = 1 − 1 = 0,

so the sequence {V^r} is nonincreasing*. Since it is also bounded below by V_s, it
converges, and hence the first
of the Kuhn-Tucker conditions must be satisfied at the convergence point. But the
second one also must be satisfied there since, if lim_{r→∞} c_k^r > 1 for any k, then
convergence cannot occur because Q_k^{r+1} = c_k^r Q_k^r. This completes the proof of
the theorem and also completes Lecture 2.

* In fact, V^r is strictly decreasing whenever V^r > V_s. To see this, note that the inequality used to show the
nonincreasing nature of V^r is strict unless Q_k^{r+1} = Q_k^r for all k, i.e., unless c_k^r = 1 for all k. But c_k^r = 1 for all k implies
that {Q_k^r} is optimum and hence that V^r = V_s.
LECTURE 3.

The Lower Bound Theorem and Error Estimates for the Blahut Algorithm

We have just seen how Blahut's iterative algorithm provides a powerful
tool for the computation of R(D) curves. The following rather remarkable theorem
is most interesting in its own right and also provides a means for ascertaining when
the Blahut iterations have progressed to the point that (d(Q^r), I(Q^r)) is within some
specified ε of the R(D) curve.
Theorem 3.

R(D) = max_{s ≤ 0, λ ∈ Λ_s} ( sD + Σ_j P_j log λ_j ),

where

Λ_s ≜ { λ = (λ_1,...,λ_M) : λ_j ≥ 0 and c_k ≜ Σ_j λ_j P_j e^{sρ_{jk}} ≤ 1 for all k }.

Proof of the lower bound. For any s ≤ 0, any λ ∈ Λ_s, and any Q with d(Q) ≤ D,

I(Q) − sD − Σ_j P_j log λ_j
  ≥ I(Q) − s·d(Q) − Σ_j P_j log λ_j                                   (1)
  = Σ_{j,k} P_j Q_{k|j} log ( Q_{k|j} / (λ_j Q_k e^{sρ_{jk}}) )
  ≥ Σ_{j,k} P_j Q_{k|j} ( 1 − λ_j Q_k e^{sρ_{jk}} / Q_{k|j} )          (2)
  = 1 − Σ_k c_k Q_k
  ≥ 1 − Σ_k Q_k = 1 − 1 = 0,                                          (3)

where (1) results from the conditions s ≤ 0 and d(Q) ≤ D, (2) is the fundamental
inequality, and (3) results from the fact that λ ∈ Λ_s.
We have just shown that d(Q) ≤ D implies I(Q) ≥ sD + Σ_j P_j log λ_j for any s ≤ 0
and any λ ∈ Λ_s. It follows that d(Q) ≤ D implies

I(Q) ≥ max_{s ≤ 0, λ ∈ Λ_s} ( sD + Σ_j P_j log λ_j ).

Accordingly,

R(D) = min_{Q: d(Q) ≤ D} I(Q) ≥ max_{s ≤ 0, λ ∈ Λ_s} ( sD + Σ_j P_j log λ_j ),

which establishes the advertised lower bound. That the reverse inequality also holds,
and hence that the theorem is true, is established by recalling that the Q that solves
the original R(D) problem must be of the form

Q_{k|j} = λ_j Q_k e^{sρ_{jk}},

whence

I(Q) = Σ_{j,k} P_j Q_{k|j} ( log λ_j + sρ_{jk} ) = s Σ_{j,k} P_j Q_{k|j} ρ_{jk} + Σ_j P_j log λ_j
     = sD + Σ_j P_j log λ_j,

since d(Q) = D for the Q that solves the R(D) problem. Also, we know that c_k ≤ 1
for this Q, so λ ∈ Λ_s. Thus R(D) is of the form sD + Σ_j P_j log λ_j for some s ≤ 0
and some λ ∈ Λ_s. Therefore

R(D) ≤ max_{s ≤ 0, λ ∈ Λ_s} ( sD + Σ_j P_j log λ_j ),

and the theorem is proved.
Turning to error estimates, note that the transition probabilities
Q_{k|j}^{r+1} = λ_j^r Q_k^r e^{sρ_{jk}} and output probabilities Q_k^{r+1} = c_k^r Q_k^r produced at
iteration r + 1 satisfy

R(D) ≤ I(Q^{r+1}) = Σ_{j,k} P_j Q_{k|j}^{r+1} log ( Q_{k|j}^{r+1} / Q_k^{r+1} )
                 = Σ_{j,k} P_j Q_{k|j}^{r+1} log ( λ_j^r e^{sρ_{jk}} / c_k^r ) ≜ R_U^r(D),
which is an upper bound to R(D) at the value of distortion associated with iteration
r + 1. We can obtain a lower bound to R(D) at the same value of D with the help of
Theorem 3 by defining

λ_j ≜ λ_j^r / c_max^r,  where c_max^r ≜ max_k c_k^r.

It follows that

Σ_j λ_j P_j e^{sρ_{jk}} = c_k^r / c_max^r ≤ 1

for all k. Thus λ ∈ Λ_s, so Theorem 3 gives

R(D) ≥ s·d(Q^{r+1}) + Σ_j P_j log λ_j = s·d(Q^{r+1}) + Σ_j P_j log λ_j^r − log c_max^r ≜ R_L^r(D),
which is the desired lower bound. Comparing the two bounds, we see that I(Q^{r+1})
differs from R(d(Q^{r+1})) = R(D) by less than

log c_max^r − Σ_k Q_k^r c_k^r log c_k^r.

Since 1 ≥ lim_{r→∞} c_k^r, with equality whenever lim_{r→∞} Q_k^r > 0, we see that
lim_{r→∞} [ R_U^r(D) − R_L^r(D) ] = 0. The upper and lower bounds therefore tend to
coincidence in the limit of large iteration number. If we need an estimate of R(D)
with an error of ε or less, we need merely iterate on r until
log c_max^r − Σ_k Q_k^r c_k^r log c_k^r ≤ ε. Note that both the iterations and the test to
see if we are within ε of R(D) can be carried out without having to calculate either
I(Q^r) or d(Q^r) at step r, let alone their gradients. This is the reason why the
iterations proceed so rapidly in practice.
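The following is a minimal Python sketch of the iteration and stopping test just described; the function name and the numerical example are illustrative, and logarithms are natural (divide by log 2 for bits).

```python
import numpy as np

def blahut_rd(P, rho, s, eps=1e-6, max_iter=10000):
    """One point of the R(D) curve via the Blahut iteration (nats).

    P   : source probabilities P_j            (shape M)
    rho : distortion matrix rho_{jk}          (shape M x N)
    s   : slope parameter, s <= 0
    eps : stop when  log c_max - sum_k Q_k c_k log c_k  <= eps
    """
    A = np.exp(s * rho)                    # e^{s rho_{jk}}
    Qk = np.full(rho.shape[1], 1.0 / rho.shape[1])   # all components > 0
    for _ in range(max_iter):
        lam = 1.0 / (A @ Qk)               # lambda_j = (sum_k Q_k e^{s rho_jk})^-1
        c = (lam * P) @ A                  # c_k = sum_j lambda_j P_j e^{s rho_jk}
        gap = np.log(c.max()) - np.sum(Qk * c * np.log(c))
        Qk = c * Qk                        # Q_k^{r+1} = c_k^r Q_k^r
        if gap <= eps:
            break
    # Report (d(Q), I(Q)) for the transition matrix of Theorem 1, condition (i).
    lam = 1.0 / (A @ Qk)
    Qkj = lam[:, None] * Qk[None, :] * A
    Qout = P @ Qkj
    D = float(np.sum(P[:, None] * Qkj * rho))
    R = float(np.sum(P[:, None] * Qkj * np.log(Qkj / Qout)))
    return D, R

# Binary source with Hamming distortion; compare with h(p) - h(D) in nats.
P = np.array([0.75, 0.25])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])
print(blahut_rd(P, rho, s=-2.0))
```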
LECTURE 4.

Extension to Sources and Distortion Measures with Memory

Suppose now that the distortion in blocks of n successive source letters is measured
by ρ_n, and define

R_n(D) ≜ n^{-1} min { I(Q_n) : d(Q_n) ≤ D },

where the minimum is over transition probability assignments Q_n from A^n to B^n.
It is clear that R_n(D) is the rate distortion function for a source that
produces successive n-vectors independently according to P_n(x) when the distortion
in the reproduction of a sequence of such n-vectors is measured by the arithmetic
average of ρ_n over the successive vectors that comprise the sequence. Although the
actual source, if stationary, will produce successive n-vectors that are identically
distributed according to P_n(x), these vectors will not be independent. Hence, R_n(D)
will be an overestimate of the rate needed to achieve average distortion D because it
does not reflect the fact that the statistical dependence between successive source
letters can be exploited to further reduce the required information rate. However,
one expects that this dependence will be useful only near the beginning and end of
each block and hence may be ignored for large n. This tempted Shannon [2] to
define

R(D) ≜ lim_{n→∞} R_n(D).
Source coding theorems to the effect that the above prescription for
calculating R(D) does indeed describe the absolute tradeoff between rate and
distortion have been proven under increasingly general conditions by a variety of
authors [2,4-9].
A sufficient but by no means necessary set of conditions is the
following
(i) The source is strictly stationary.
(ii) The source is ergodic.
(iii) There exist g < ∞ and ρ_g: A^g × B^g → [0,∞] such that

ρ_n(x, y) = (n − g + 1)^{-1} Σ_{t=1}^{n−g+1} ρ_g(x_t, x_{t+1},...,x_{t+g−1}, y_t,...,y_{t+g−1}).

(This is called a distortion measure of span g and can be used to reflect context
dependencies when assigning distortions.)
(iv) E_X ρ_g(X, y) < ∞ for some y ∈ B^g,

where E_X denotes expectation over a vector X = (X_1,...,X_g) of g successive random
source outputs.
Comments. Gray and Davisson [9] have recently claimed to be able to remove the
need for condition (ii). Condition (iv) is automatically satisfied for finite-alphabet
sources; it is needed only when the theory is extended to the important case of
sources with continuous, possibly unbounded, source alphabets.
Under conditions (i) and (iii) the R_n(D) curves will be monotonic
nonincreasing for n ≥ g. They have the physical significance that no system that
operates independently on successive n-vectors from the source can perform below
R_n(D). However, the corresponding positive source coding theorem that systems can
be found that operate arbitrarily close to R_n(D) is true only in the limit n → ∞.
We now give some explicit examples of rate distortion functions.
Example 1.

A = B = {0,1},  ρ_{jk} = 1 − δ_{jk} = { 0 if j = k, 1 if j ≠ k }.

Assume that zeros are produced with probability 1 − p ≥ 1/2 and ones
with probability p ≤ 1/2. A simple computation shows that

R(D) = h(p) − h(D),  0 ≤ D ≤ p,

where h(x) ≜ −x log₂ x − (1 − x) log₂(1 − x).
For p = 1/2 this formula equals the Gilbert bound, which is a lower bound on the
exponential rate of growth of the maximum number of vectors that can be packed
into binary n-space such that no two of them disagree in fewer than nD places. Rate
distortion theory yields a different interpretation of this curve in terms of covering
theory. Namely, R(D) = 1 + D log₂ D + (1 − D) log₂(1 − D) is the exponential rate of
growth of the minimum number of vectors such that every point in binary n-space
differs from at least one of these vectors in fewer than nD places. Note that in the
covering problem the result is exact, whereas in the corresponding packing problem
the tightest upper bounds known (Elias, Levenshtein [11]) are strictly greater than the
Gilbert bound*. In this sense at least, covering problems are simpler than packing
problems.
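A short Python check of the binary rate distortion function just quoted; the helper names are mine and the curve is evaluated in bits.

```python
import math

def h2(x):
    """Binary entropy in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def binary_rd(p, D):
    """R(D) = h(p) - h(D) for 0 <= D <= p <= 1/2 under Hamming distortion, else 0."""
    if D >= p:
        return 0.0
    return h2(p) - h2(D)

# Equiprobable case: R(D) = 1 + D log2 D + (1-D) log2(1-D), e.g. R(0.110) ~ 0.5.
print(binary_rd(0.5, 0.110))   # ~ 0.500
print(binary_rd(0.25, 0.05))   # a non-equiprobable example
```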
Example 2. Gaussian Source and Mean Squared Error (MSE) [Shannon, 1948 and 1959]
* It is widely believed that the Gilbert bound is tight in the packing problem, too, but no proof has been found
despite more than 20 years of attempts.
R(D) = (1/2) log(σ²/D),  0 ≤ D ≤ D_max = σ²,

where

A = B = ℝ,  ρ(x,y) = (x − y)²,  E X_i = 0,  E X_i² = σ².

Example 3. Time-Discrete Stationary Gaussian Source and MSE.

For a stationary Gaussian sequence with spectral density Φ(ω), |ω| ≤ π, the rate
distortion function is given parametrically by

D_θ = (1/2π) ∫_{−π}^{π} min[ θ, Φ(ω) ] dω,

R(D_θ) = (1/4π) ∫_{−π}^{π} max[ 0, log( Φ(ω)/θ ) ] dω.
The cross-hatched region in the accompanying sketch is the area under
the so-called "error spectrum", min[θ, Φ(ω)], associated with the parameter θ.
The optimum system for rate R(D_θ) will yield a reproduction sequence {Y_i} which
is Gaussian and stationarily related to {X_i}, with the error sequence {X_i − Y_i} having
time-discrete spectral density min[θ, Φ(ω)], |ω| ≤ π.
Let

x = {x(t), 0 ≤ t ≤ T},  y = {y(t), 0 ≤ t ≤ T},

and

ρ_T(x, y) = (1/T) ∫_0^T [x(t) − y(t)]² dt.

The formulas for D_θ and R(D_θ) from the time-discrete case of Example
3 continue to apply except that the limits on the integrals become ±∞ instead of ±π
and the spectral density Φ(ω) is defined by

Φ(ω) = ∫_{−∞}^{∞} φ(τ) e^{−jωτ} dτ,

where

φ(τ) = E( X(t) X(t + τ) ).
In the special case of an ideal bandlimited process,

Φ(ω) = Φ₀ for |ω| ≤ 2πW,  Φ(ω) = 0 for |ω| > 2πW,

the results specialize to D_θ = 2Wθ and R(D_θ) = W log(Φ₀/θ). Eliminating θ yields the
explicit result

R(D) = W log( 2WΦ₀ / D ).

Since the source power 2WΦ₀ can be regarded as a signal power S,
and the mean squared error D in the reconstruction of the signal can be considered
as an effective additive noise power N, the above result often is written as

R(D) = W log( S / N ).
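A numerical sketch of the parametric (reverse water-filling) formulas of Example 3, in Python; the spectral density used is an arbitrary illustration and rates come out in nats.

```python
import numpy as np

def gaussian_rd_point(phi, theta, w):
    """One (D, R) point for a stationary Gaussian source under MSE:
       D_theta    = (1/2pi) int min[theta, phi(w)] dw
       R(D_theta) = (1/4pi) int max[0, log(phi(w)/theta)] dw
    phi : samples of the spectral density on the uniform grid w, |w| <= pi.
    """
    dw = w[1] - w[0]
    D = np.sum(np.minimum(theta, phi)) * dw / (2 * np.pi)
    R = np.sum(np.maximum(0.0, np.log(phi / theta))) * dw / (4 * np.pi)
    return D, R

w = np.linspace(-np.pi, np.pi, 4001)
phi = 1.0 / (1.25 - np.cos(w))      # illustrative spectral density
for theta in (0.2, 0.5, 1.0):
    print(theta, gaussian_rd_point(phi, theta, w))
```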
denote the [n, j] linear code with generators y_1,...,y_j. (If the generators are not
linearly independent, then not all 2^j of the y ∈ B_j are distinct.) For completeness,
let B_0 = {0}. Define

F_j ≜ { x : ρ_n(x, y) ≤ ℓ/n for some y ∈ B_j }

and let N_j = ‖F_j‖ denote the cardinality of F_j. It should be clear that N_j is the
number of source vectors x that can be encoded by at least one y ∈ B_j with error
frequency ℓ/n or less. Accordingly, the probability Q_j that a random vector X
produced by the source cannot be encoded by B_j with average distortion ℓ/n or
less is

Q_j = 1 − 2^{−n} N_j.

Since

F_{j+1} = F_j ∪ ( ∪_{y ∈ B_j} S_ℓ(y + y_{j+1}) ) = F_j ∪ { v + y_{j+1} : v ∈ F_j },

where S_ℓ(z) denotes the set of source words within error frequency ℓ/n of z, it
follows that
N_{j+1} ≥ 2N_j − 2^{−n} N_j²,

whence

Q_{j+1} = 1 − 2^{−n} N_{j+1} ≤ 1 − 2·2^{−n} N_j + (2^{−n} N_j)² = (1 − 2^{−n} N_j)² = Q_j².

Iterating,

Q_k ≤ Q_0^{2^k} = (1 − 2^{−n} N_0)^{2^k} ≤ exp( −2^k 2^{−n} N_0 ),

where we have used the fundamental inequality in the last step. Now let n → ∞ and
k → ∞ in such a way that the code rate k/n → R, and let ℓ → ∞ such that ℓ/n → D.
It then follows that the probability Q_k that the source produces a word that cannot
be encoded with distortion D or less will tend to zero provided

lim 2^k 2^{−n} N_0 = ∞   (n → ∞, k/n → R, ℓ/n → D).

Since by Stirling's approximation we know that
s = H x^T,
s = H z^T.
Such a z is called the leader of the coset C_s = { y : H y^T = s }. Once z has been
found, which is the hardest part of the procedure, one then encodes (approximates)
x by y = x + z. Said y is a code word since H y^T = H x^T + H z^T = s + s = 0. Hence,
y is expressible as a linear combination (mod 2) of the k generators. This means that
a string of k binary digits suffices for specification of y. In this way the n source
digits are compressed to only k digits that need to be transmitted. The resulting
error frequency is n^{−1} wt(x ⊕ y) = n^{−1} wt(z). Accordingly, the expected error
frequency D is the average weight of the coset leaders,

D = n^{−1} 2^{−(n−k)} Σ_{i=1}^{2^{n−k}} w_i,

where w_i denotes the weight of the leader of the i-th coset.
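A small Python sketch of this syndrome (coset-leader) source encoder, using the [7,4] Hamming code purely as an illustration; the code choice and helper names are mine, not from the text.

```python
import itertools
import numpy as np

# Parity-check matrix of the [7,4] Hamming code (illustrative choice).
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
n, r = H.shape[1], H.shape[0]            # n = 7, r = n - k = 3

def syndrome(v):
    return tuple(H @ v % 2)

# Coset leader = a minimum-weight vector with a given syndrome.
leaders = {}
for wt in range(n + 1):
    for ones in itertools.combinations(range(n), wt):
        v = np.zeros(n, dtype=int)
        v[list(ones)] = 1
        leaders.setdefault(syndrome(v), v)
    if len(leaders) == 2 ** r:
        break

def encode(x):
    """Approximate the source word x by the codeword y = x + z, z the coset leader."""
    z = leaders[syndrome(x)]
    return (x + z) % 2

# Expected error frequency D = n^-1 * 2^-(n-k) * (sum of coset-leader weights).
D = sum(int(z.sum()) for z in leaders.values()) / (2 ** r * n)
print(D)                                  # -> 1/8 for the [7,4] Hamming code
print(encode(np.array([1, 0, 1, 1, 0, 0, 1])))
```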
[Figure 5. A Binary Tree Code]
3/16 = 0.1875. The best that could be done in this instance by any code of rate
R = 1/2 of any blocklength is D(R = 1/2) = 0.110.
It should be clear that tree codes can be designed with any desired rate.
If there are b branches per node and ℓ letters per branch, then the rate is

R = ℓ^{−1} log b.

Also, the letters on the branches can be chosen from any reproducing alphabet
whatsoever, including a continuous alphabet, so tree coding is not restricted to
GF(q).
Tree encoding of sources was first proposed by Goblick (1962). Jelinek
(1969) proved that tree codes can be found that perform arbitrarily close to the
R(D) curve for any memoryless source and fidelity criterion; this result has been
extended by Berger (1971) to a limited class of sources with memory. In order for
tree coding to become practical, however, an efficient search algorithm must be
devised for finding a good path through the tree. (Note that, unlike in sequential
decoding for channels, one need not necessarily find the best path through the tree;
a good path will suffice.) This problem has been attacked with some degree of
success by Anderson and Jelinek (1971, 1973), by Gallager (1973), by Viterbi and
Omura (1973), and by Berger, Dick and Jelinek (1973).
In conclusion, it should be mentioned that delta modulation and more
general differential PCM schemes are special examples of tree encoding procedures.
However, even the adaptive versions of such schemes are necessarily sub-optimal
because they make their branching decisions instantaneously on the basis solely of
past and present source outputs. Performance could be improved by introducing
some coding delay in order to take future source outputs into consideration, too.
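To make the path-search problem concrete, here is a minimal Python sketch of exhaustive search for the minimum-distortion path through a small binary tree code. It is a toy illustration, not one of the published algorithms cited above; a practical scheme would keep only a limited list of surviving paths.

```python
import itertools

def tree_code_search(source, branch_letters, depth):
    """Exhaustive search of a depth-`depth` binary tree code.

    branch_letters(prefix, bit) -> tuple of reproduction letters on the branch
    chosen by `bit` after following `prefix`.  Per-letter Hamming distortion.
    """
    n = len(source)
    best = (None, float("inf"))
    for path in itertools.product((0, 1), repeat=depth):
        recon = []
        for i in range(depth):
            recon.extend(branch_letters(path[:i], path[i]))
        d = sum(a != b for a, b in zip(source, recon)) / n
        if d < best[1]:
            best = (path, d)
    return best

# Toy rate-1/2 tree: each branch carries two letters, (bit, parity of the prefix).
def branch_letters(prefix, bit):
    return (bit, (sum(prefix) + bit) % 2)

print(tree_code_search([1, 0, 0, 1, 1, 1], branch_letters, depth=3))
```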
LECTURE 7

Theory vs. Practice and Some Open Problems

When this is plotted with a logarithmic D-axis, it appears as a straight line with
negative slope as sketched in Figure 6. [Figure 6: rate in bits vs. distortion for R(D),
Lloyd-Max quantizers uncoded, and Lloyd-Max quantizers entropy-coded.] Parallel
to R(D), but approximately 1/4 bit higher, lies the performance curve of the best
entropy-coded quantizers. The small separation indicates that there is very little to be gained
by more sophisticated encoding techniques. This is not especially surprising given the
memoryless nature of both the source and the fidelity criterion. It must be
emphasized, however, that entropy coding necessitates the use of a buffer to
implement the conversion to variable-length codewords. Moreover, since the
optimum quantizer has nearly uniformly spaced levels, some of these levels become
many times more probable than others, which leads to difficult buffering problems.
Furthermore, when the buffer overflows it is usually because of an inordinately high
local density of large-magnitude source outputs. This means that the per-letter MSE
incurred when buffer overflows occur tends to be even bigger than σ².
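For reference, a minimal Python sketch of the Lloyd iteration that produces (Lloyd-)Max quantizers for a unit-variance Gaussian source, together with the output entropy relevant to entropy coding; the sample-based approach and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)    # memoryless unit-variance Gaussian source

def lloyd_max(samples, levels, iters=100):
    """Lloyd iteration: alternate nearest-level partition and centroid update."""
    reps = np.quantile(samples, (np.arange(levels) + 0.5) / levels)  # initial levels
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - reps[None, :]), axis=1)
        reps = np.array([samples[idx == i].mean() for i in range(levels)])
    return reps, idx

reps, idx = lloyd_max(samples, levels=4)
mse = np.mean((samples - reps[idx]) ** 2)
p = np.bincount(idx, minlength=len(reps)) / len(idx)
entropy_bits = -np.sum(p[p > 0] * np.log2(p[p > 0]))
print(reps, mse, entropy_bits)   # compare with R(D) = 0.5*log2(1/mse) bits
```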
asymptotic rates.
4. Prove that the ensemble performance of codes with generators chosen
independently at random approaches R(D) as n → ∞.
5. Extend the theory satisfactorily to nonequiprobable sources.
B. Tree Coding
1. Show that there are good convolutional tree codes for sources. (For continuous
amplitude sources the tapped shift register is replaced by a feed-forward
digital filter.)
2. Find better algorithms for finding satisfactory paths through the tree (or
trellis).
where n_k = Σ_j n_{jk} and the C_{jk} are so-called "free energy constants" that can be
experimentally measured. The minimization naturally is subject to the mass balance
conditions Σ_k n_{jk} = n_j and the constraints n_{jk} ≥ 0. Letting n = Σ_j n_j and making the
obvious associations

n_j ↔ n P_j,
[6] BERGER, T., (1968) "Rate Distortion Theory for Sources with Abstract
Alphabets and Memory", Information and Control, 13, 254-273.
[7] GOBLICK, T.J., Jr. (1969) "A Coding Theorem for Time-Discrete Analog
Data Sources", Trans. IEEE, IT-15, 401-407.
[8] BERGER, T., (1971) "Rate Distortion Theory: A Mathematical Basis for
Data Compression", Prentice-Hall, Englewood Cliffs, N.J.
[9] GRAY, R.M., and L.D. DAVISSON (1973) "Source Coding Without
Ergodicity", Presented at 1973 IEEE Intern. Symp. on Inform.
Theory, Ashkelon, Israel.
[10] GOBLICK, T.J., Jr. (1962) "Coding for a Discrete Information Source with
a Distortion Measure", Ph.D. Dissertation, Elec. Eng. Dept., M.I.T.,
Cambridge, Mass.
[12] HAMMING, R.W., (1950) "Error Detecting and Error Correcting Codes",
BSTJ, 29, 147-160.
[14] BERGER, T., and J.A. VAN DER HORST (1973) "BCH Source Codes",
Submitted to IEEE Trans. on Information Theory.
[15] POSNER, E.C., (1968) In Mann, H.B. (Ed.), "Error Correcting Codes", Wiley, N.Y.,
Chapter 2.
[17] JELINEK, F., and J.B. ANDERSON (1971) "Instrumentable Tree Encoding
of Information Sources", Trans. IEEE, IT-17, 118-119.
[18] ANDERSON, J.B., and F. JELINEK (1973) "A Two-Cycle Algorithm for
Source Coding with a Fidelity Criterion", Trans. IEEE, IT-19, 77-92.
[19] GALLAGER, R.G., (1973) "Tree Encoding for Symmetric Sources with a
Distortion Measure", Presented at 1973 IEEE Int'l. Symp. on
Information Theory, Ashkelon, Israel.
[20] VITERBI, A.J., and J.K. OMURA (1974) "Trellis Encoding of Memoryless
Discrete-Time Sources with a Fidelity Criterion", Trans. IEEE,
IT-20, 325-332.
[21] BERGER, T., R.J. DICK and F. JELINEK (1974) "Tree Encoding of
Gaussian Sources", Trans. IEEE, IT-20, 332-336.
[22] BERGER, T., F. JELINEK and J.K. WOLF (1972) "Permutation Codes
for Sources", Trans. IEEE, IT-18, 160-169.
[23] SLEPIAN, D., (1965) "Permutation Modulation", Proc. IEEE, 53, 228-236.
[24] BERGER, T., (1972) "Optimum Quantizers and Permutation Codes", Trans.
IEEE, IT-18, 759-765.
LEE D. DAVISSON
Department of Electrical Engineering
University of Southern California, Los Angeles, California
PREFACE
in this workshop and for the opportunity to present the series of lectures
represented by the following notes.
Lee D. Davisson
I. Introduction
A. Summary
The basic purpose of data compression is to massage a data stream to
reduce the average bit rate required for transmission or storage by removing
unwanted redundancy and/or unnecessary precision. A mathematical formulation of
data compression providing figures of merit and bounds on optimal performance was
developed by Shannon [1,2] both for the case where a perfect compressed
reproduction is required and for the case where a certain specified average distortion
is allowable. Unfortunately, however, Shannon's probabilistic approach requires
advance precise knowledge of the statistical description of the process to be
compressed, a demand rarely met in practice. The coding theorems only apply, or
are meaningful, when the source is stationary and ergodic.
We here present a tutorial description of numerous recent approaches
and results generalizing the Shannon approach to unknown statistical environments.
Simple examples and empirical results are given to illustrate the essential ideas.
B. Source Modelling
Sources with unknown statistical descriptions or with incomplete or
inaccurate statistical descriptions can be modelled as a class of sources from which
nature chooses a particular source and reveals only its outputs and not the source
chosen. The size and content of the class are determined by our knowledge or lack
thereof about the possible source probabilities. Typical classes might be the class of
all stationary sources with a given alphabet, the class of all Bernoulli processes
(sequences of independent, identically distributed binary random variables, i.e., coin
flipping with all possible biases), the class of all Gaussian random processes with
bounded covariances and means, etc.
The source chosen by Nature can be modelled as a random variable, i.e.,
a random choice described by a prior probability distribution, or it can be modelled
as a fixed but unknown source, i.e., we have no such prior.
To model such a class, assume that for each value of an index θ ∈ Λ we
have a discrete time source {X_n; n = ...,−1,0,1,...} with some alphabet A of possible
outputs and a statistical description μ_θ; specifically, μ_θ is a probability measure
on a sample space of all possible doubly infinite sequences drawn from A (and an
appropriate σ-field of subsets called events). Thus μ_θ implies the probability
distributions on all random vectors composed of a finite number of samples of the
N
process, e.g., p. implies for any integer N the distributions p.8 describing the
~_N !J.
random vector X = (X1 ,... ,XN ). As noted above, the source index may (or may
not) itself be considered as arandom variable ® taking values in the parameter space
A as described by some prior probability distribution W( () ). The source described
by p.8 will often be referred to simply as (). Similarly, the dass will often be
abbreviated to A.
We shall usually assume that the individual sources θ are stationary,
i.e., the statistical description μ_θ is invariant to shifts of the time origin. We shall
sometimes require also that the individual sources be ergodic, that is, with
μ_θ-probability-one all sample functions produced by the source are representative
of the source in that all sample moments converge to appropriate expectations and
all relative frequencies of events converge to the appropriate probabilities. This
should not be considered as a limitation on the theory since in practice, most real
sources can be considered to have at least local (i.e., over each output block
presented to an encoder) stationary and ergodic properties.

From the ergodic decomposition of a stationary time series [3,4,5],
any stationary nonergodic source can be viewed as a class of ergodic sources with a
prior. Hence, stationary nonergodic sources are included in our model and the
subsources in a class of stationary sources can be considered ergodic without loss of
generality, i.e., a class of stationary sources can be decomposed into a "finer" class
of ergodic sources.
For example, consider a Bernoulli source, i.e., a binary source in which
the probability θ of a "one" is chosen by "nature" randomly according to the
uniform probability law on [0,1] and then fixed for all time. The user is not told θ,
however. Subsequently the source produces independent-letter outputs with the
chosen, but unknown, probability. The probability, given θ, that any message block
of length N contains n ones in any given pattern is then

θ^n (1 − θ)^{N−n}.

Given θ the source is stationary and ergodic. Otherwise the source is stationary only,
and the probability of any given message containing n ones in N outputs is simply

∫_0^1 θ^n (1 − θ)^{N−n} dθ = n!(N − n)! / (N + 1)! .
The relative frequency of ones for this source converges to Θ, a random variable. It
is seen that the stationary source can be decomposed by working backwards into an
ergodic class of subsources indexed by θ.
The object of data compression is to process the observed source
messages to reduce the bit rate required for transmission. If the compressed
reproduction is required to be perfect (noiseless source coding), then compression
can only be achieved through the removal of unnecessary redundancy. If the
compressed reproduction need not be perfect (as is usually the case with continuous
alphabets), but need only satisfy some average fidelity constraint, then compression
can also be achieved via removal of unnecessary precision (as in quantization).
We shall here consider only block processors, i.e., compressors that
operate on individual consecutive source blocks or vectors
X_i^N = (X_{iN}, X_{iN+1},...,X_{iN+N−1}), i = 0,1,..., independently of previous or future
blocks. The index i will be suppressed subsequently. The block coder maps X^N into
a reproduction X̂(X^N). The block coding restriction eliminates such methods as
run-length encoding, but is made for analytical simplicity and the fact that block
encoders provide a large and useful class.
The classical Shannon theory [1,2] provides figures of merit and
bounds on optimal performance for block-encoders (hereafter called simply
encoders) X̂(X^N) when the source is stationary and ergodic and the source
probabilities are known precisely. Universal coding is the extension of these ideas to
the more general situation of a class of sources, wherein encoders must be designed
without precise knowledge of the actual statistics of the source to be observed, or
equivalently, where the source is nonergodic. The encoder is called "universal" if the
performance of the code designed without knowledge of the unknown "true" source
converges in the limit of long blocklength N to the optimal performance possible if
one knew the true source. Different types of universality are defined to correspond
to different notions of convergence.
In this tutorial presentation of recent approaches and results on
noiseless universal encoding and universal coding subject to a rate or an average
fidelity constraint, we attempt a unified development with a minimum of
uninformative mathematical detail and a maximum of intuition. As the mere
existence of universal codes is surprising, the first cases considered are nearly trivial in
order to make these types of results believable and to emphasize the fundamental
ideas in the simplest possible context.
The mathematical detail, the proofs of the general theorem, and
where E_θ denotes expectation over μ_θ, i.e., since the alphabet is discrete,

ℓ_θ(C_N) = N^{−1} Σ_{x^N} μ_θ^N(x^N) ℓ(x^N | C_N),

where μ_θ^N is taken as the probability mass function of N-tuples.
The optimal code in 𝒞(N) is the one that minimizes the average
normalized length. Define

ℓ_θ(N) ≜ min_{C_N ∈ 𝒞(N)} ℓ_θ(C_N)

as the smallest attainable average normalized length for the given source and blocklength.
Shannon's variable-length source coding theorem relates ℓ_θ(N) to the
entropy of the source when the source θ is known in advance:

(2.1)  N^{−1} H(X^N | θ) ≤ ℓ_θ(N) < N^{−1} H(X^N | θ) + N^{−1},

where

H(X^N | θ) ≜ − Σ_{x^N} μ_θ^N(x^N) log μ_θ^N(x^N)

is the Nth-order entropy of the source θ (or the conditional entropy of the class of
sources given θ). All logarithms are to the base 2. If the source is stationary, then

(2.2)  ℓ_θ ≜ lim_{N→∞} ℓ_θ(N) = H(X | θ),

where H(X|θ) = lim_{N→∞} N^{−1} H(X^N | θ) is the entropy rate of the source θ.
An optimal code C_N for a given N can be constructed so that

(2.3)  −log μ_θ^N(x^N) ≤ ℓ(x^N | C_N) < −log μ_θ^N(x^N) + 1.

Note that the construction satisfies (2.1) but requires advance knowledge of the
source θ for which the code is to be constructed.
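A small Python illustration of the length assignment (2.3) and the resulting bound (2.1) for a known memoryless binary source; this is only a sketch of the lengths (the Kraft inequality guarantees a prefix code with these lengths exists, but the codewords themselves are not constructed here).

```python
import math
from itertools import product

theta = 0.2    # known Bernoulli parameter (illustrative)
N = 8

def mu(x):     # mu_theta^N(x^N) for an i.i.d. binary source
    w = sum(x)
    return theta**w * (1 - theta)**(N - w)

# Length assignment of (2.3): l(x^N) = ceil(-log2 mu(x^N)).
avg_len = sum(mu(x) * math.ceil(-math.log2(mu(x)))
              for x in product((0, 1), repeat=N)) / N
HN = -sum(mu(x) * math.log2(mu(x)) for x in product((0, 1), repeat=N)) / N
print(HN, avg_len, HN + 1 / N)   # N^-1 H <= average normalized length < N^-1 H + N^-1
```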
A useful performance index of any code C_N used on a source θ is the
difference between the actual resulting normalized average length and the optimal
attainable for that blocklength. We formalize this quantity as the redundancy or
conditional redundancy given θ, r_θ(C_N), given by

(2.4)  r_θ(C_N) ≜ ℓ_θ(C_N) − ℓ_θ(N).
the figures of merit of later sections, and because the asymptotic N → ∞ results are
unaffected through (2.2) for stationary sources. Clearly 0 ≤ r_θ(C_N) ≤ r*_θ(C_N).
We now consider the more general situation of coding for a class of
sources, i.e., we must now choose a code C_N without knowledge of the actual
source, θ, being observed. A sequence of these codes {C_N}_{N=1}^∞ will be said to be
universal in accordance with the definition of Section I if these codes designed
without knowledge of θ asymptotically perform as well as optimal codes custom
designed for the true (but unknown) θ, i.e., if r_θ(C_N) → 0 as N → ∞ in some sense
regardless of θ. The various types of universal codes will correspond to the various
notions of convergence, e.g., convergence in measure, pointwise convergence and
uniform convergence.
Before formalizing these concepts, it is useful to consider in some detail
a specific simple example to hopefully make believable the remarkable fact that such
universal codes exist and to provide a typical construction.
Suppose that the class Λ consists of all Bernoulli processes, i.e., all
independent, identically distributed sequences of binary random variables. The
sources in the class are uniquely specified as noted in Section (I,B) by
θ = Pr[X_n = 1], i.e.,

(2.5)  μ_θ(x^n) = θ^{w(x^n)} (1 − θ)^{n − w(x^n)},  x_i ∈ {0,1},

where w(x^n) = Σ_{i=1}^{n} x_i is the Hamming weight of the binary n-tuple x^n.
Choosing a code C_N to minimize r_θ(C_N) for a particular θ will clearly
result in a large redundancy for some other sources. To account for the entire class
we instead proceed as follows: Each source word x^N is encoded into a codeword
consisting of two parts. The first part of the codeword is w(x^N), the number of ones
in x^N. This specification requires at most log(N + 1) bits. The second part of the
codeword gives the location of the w(x^N) ones by indexing the (N choose w(x^N))
possible location patterns given w(x^N). Thus this information can be optimally
encoded (given w(x^N)) using equal length codewords of at most
log (N choose w(x^N)) + 1 bits.
If the actual unknown source is θ, the resulting redundancy using this uniquely
decodable code is bounded above as follows: the N^{−1}(log(N + 1) + 2) term clearly
vanishes as N → ∞, and, using Stirling's approximation and the ergodic theorem,
N^{−1} log (N choose w(x^N)) → −θ log θ − (1 − θ) log(1 − θ) with probability one.
The last step also follows from the strong law of large numbers (a special case of the
ergodic theorem: N^{−1} w(x^N) → θ w.p. 1 as N → ∞) and the continuity of the
logarithm. From the Shannon theorem, ℓ_θ(N) → H(X|θ) = −θ log θ − (1 − θ) log(1 − θ)
as N → ∞, so that r_θ(C_N) → 0 for all θ and the given sequence of codes.
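A minimal Python sketch of this two-part (weight, pattern-index) code and its per-letter length; the helper names and parameter values are mine, and only the code lengths are computed, not the actual pattern enumeration.

```python
import math
import random
from math import comb

def two_part_length(x):
    """Bits used by the (weight, pattern-index) code on the binary word x^N."""
    N, w = len(x), sum(x)
    weight_bits = math.ceil(math.log2(N + 1))                      # first part: w(x^N)
    index_bits = math.ceil(math.log2(comb(N, w))) if comb(N, w) > 1 else 0
    return weight_bits + index_bits

random.seed(1)
theta, N = 0.3, 4000
x = [1 if random.random() < theta else 0 for _ in range(N)]
rate = two_part_length(x) / N
h = -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)
print(rate, h)   # per-letter rate approaches H(X|theta) as N grows, for every theta
```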
We now proceed to the general definitions of universal codes and the
corresponding existence theorems.
Given a probability measure W on Λ (and an appropriate σ-field), a
sequence of codes {C_N}_{N=1}^∞ is said to be weighted-universal (or Bayes universal)
if r_θ(C_N) converges to 0 in W-measure, i.e., if

(2.6)  lim_{N→∞} W( { θ : r_θ(C_N) > ε } ) = 0 for every ε > 0.

The sequence is maximin-universal if (2.6) holds for all possible W. The measure W
might be a prior probability or a preference weighting. The sequence is said to be
weakly-minimax universal (or weakly-universal) if r_θ(C_N) → 0 pointwise, i.e.,

(2.7)  lim_{N→∞} r_θ(C_N) = 0 for all θ ∈ Λ.
The sequence is strongly-minimax universal (or strongly-universal) if r_θ(C_N) → 0
uniformly in θ ∈ Λ. The advantage here is that, given any ε > 0, a single finite
blocklength code has redundancy less than ε for all θ.
A strongly-minimax universal sequence of codes is obviously also
weakly-minimax. Since r_θ(C_N) ≥ 0, weakly-minimax code sequences are also
weighted universal for any prior by a standard theorem of integration. Since r_θ(C_N)
≥ 0, convergence in measure implies convergence W-almost everywhere (with
W-probability one). Thus, if {C_N} is a weighted universal sequence for Λ, then
there is a set Λ_0 such that W(Λ − Λ_0) = 0 and {C_N} is a weakly-minimax universal
sequence for the class of sources Λ_0. Since convergence W-a.e. implies almost-
uniform convergence, given any ε > 0, there is a set Λ_ε such that W(Λ − Λ_ε) ≤ ε
and {C_N} is a strongly-minimax universal sequence for Λ_ε.
Even though the strongly-minimax universal codes are the most
desirable, the weaker types are usually more easily demonstrated and provide a class
of good code sequences which can be searched for the stronger types. The following
theorem is useful in this regard:
N a code C_N such that

where H(X^N) is the Nth order entropy of the mixture source. Thus for these codes

where I(X^N ; Θ) is the average mutual information between the unknown source
index and the output N-tuples. In [5] it is shown for the class considered that
N^{−1} I(X^N ; Θ) → 0 as N → ∞, completing the proof. This convergence to zero
follows since the source outputs of an ergodic source eventually uniquely specify the
source in the limit and therefore the per-letter average mutual information tends to
zero. The theorem can also be proved by construction [6].
Using the ergodic decomposition, the above theorem can be extended
to the apparently more general class of all stationary processes.
Unfortunately, existence theorems for weakly-minimax and strongly-
minimax universal codes are not as easily obtainable. As an alternative approach, the
following theorem (proved in [6]) provides a construction often yielding universal
codes for certain classes of sources.

then weakly-minimax universal codes exist. If these quantities do not depend on
θ, then strongly-minimax universal codes exist.
Code constructions follow immediately whenever the codebook
theorem applies. Make a codebook on the source output space for each of the
representative values θ_j. Send the codeword in two parts. The first part consists of
the value j(x^N) encoded with a word of length −log μ^N[Γ_{j(x^N)}] + 1. The
second part consists of the codeword for x^N in the j(x^N)-th codebook.

As an example of the application of the codebook theorem, consider
the binary sources of eqn. (2.5). The representative values are θ_j = j/N, j = 0,1,
...,N, with Γ_j = {x^N : Hamming weight j}, so that μ^N(Γ_j) = (N + 1)^{−1}.
Obviously the first part of the codeword then costs only

log(N + 1) / N

bits per source letter.
for every θ ∈ Λ, then weighted universal codes exist. If the supremum of (2.9) is
finite over Λ and there exists a vanishing sequence {ε_k} such that for all N, M ≥ k,
all θ ∈ Λ,

(2.10)
It is well known that this is a reasonable approximation for video data. From (4.1)
and the independence assumption,

(4.2)  μ_θ(x^N) = (1 − θ)^N θ^{Σ_{i=1}^{N} |x_i|}.
The rate of a blocklength-N code C_N is

R(C_N) = N^{−1} log ‖C_N‖,

where ‖C_N‖ is the number of codewords, and

ρ_N(x^N | C_N) = ρ_N(x^N, x̂(x^N)) = min_{y^N ∈ C_N} ρ_N(x^N, y^N)

is the distortion with which x^N is reproduced by the best codeword in C_N.
Compression is achieved since the code size ‖C_N‖ is usually much smaller than the
number of possible source N-tuples (which is in general uncountably infinite) and
hence any codeword can be specified by using fewer bits than the original source
required. (Strictly speaking, "compression" is achieved if R(C_N) < H(X), the
entropy rate of the source.) Fidelity is lost as a result of the compression, but the
goal is to minimize this hopefully tolerable loss. The optimal performance is now
specified by the minimum attainable average distortion using fixed rate codes. The
rate may be constrained by channel capacity, available equipment, receiver
limitations, storage media, etc. Let 𝒞(N, R, Â) be the class of alphabet-Â,
blocklength-N codes having rate less than or equal to R. Define

δ_θ(R, Â) = inf_N δ_θ(R, N, Â)

as the optimal performance attainable on the source θ with fixed-rate codes of rate
R. The information-theoretic counterpart D_θ(R, Â) is defined analogously,
where the inf is over all test channels (conditional probability measures for X̂^N given
x^N, random encoders) and I(X^N ; X̂^N) is the average mutual information between
input and output N-tuples of the given source and test channel [12,13].
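A tiny Python sketch of these operational quantities: the rate R(C_N) = N^{-1} log2 ||C_N|| and the per-block distortion obtained by mapping each source block to its best codeword. The random codebook, Hamming distortion, and parameter values are illustrative only.

```python
import math
import random

random.seed(0)
N, num_codewords, theta = 12, 16, 0.2
codebook = [tuple(random.randint(0, 1) for _ in range(N)) for _ in range(num_codewords)]
rate = math.log2(len(codebook)) / N                     # R(C_N) = N^-1 log2 ||C_N||

def rho_N(x, y):                                        # per-letter Hamming distortion
    return sum(a != b for a, b in zip(x, y)) / len(x)

def encode(x, codebook):                                # best codeword for the block x
    return min(codebook, key=lambda y: rho_N(x, y))

blocks = [tuple(1 if random.random() < theta else 0 for _ in range(N)) for _ in range(2000)]
avg_dist = sum(rho_N(x, encode(x, codebook)) for x in blocks) / len(blocks)
print(rate, avg_dist)
```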
then

(5.1)  δ_θ(R, Â) = D_θ(R, Â),

and a sequence of codes is weakly-minimax universal if
Proof: Say the class contains K sources, k = 1,...,K. δ_k(R, Â) can be
shown to be a continuous function of R, so that given an ε > 0 there exists an N
sufficiently large to ensure that for all k there is a code C_N(k) of rate at most R
whose average distortion on source k is within ε/2 of δ_k(R, Â).
The extra ε/2 is necessary since a code actually yielding the infimum defining δ_k
may not exist. Form the union codebook C_N = ∪_{k=1}^{K} C_N(k) containing all the
distinct codewords in all of the subcodes C_N(k). The encoding rule is unchanged,
i.e., a source block is encoded into the best codeword in C_N. Since this word can be
no worse than the best word in any subcode C_N(k), the average distortion resulting
from using C_N on the source k satisfies

E_k ρ_N(X^N | C_N) ≤ δ_k(R, Â) + ε.

The rate of the union code exceeds R by at most N^{−1} log K, which vanishes as
N → ∞, yielding the theorem [8]. The proof of the source coding theorem used is a
complicated generalization of random coding arguments and a topological
decomposition of Λ using the ergodic decomposition.
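A Python sketch of the union-codebook idea: build one small codebook per representative source, pool the codewords, and encode every block with the best word in the pool. All parameters are illustrative.

```python
import math
import random

random.seed(1)
N = 12
thetas = [0.1, 0.5, 0.9]                      # representative sources in the class

def sample_block(theta):
    return tuple(1 if random.random() < theta else 0 for _ in range(N))

def rho_N(x, y):
    return sum(a != b for a, b in zip(x, y)) / N

# Subcode for source k: codewords drawn from that source's own statistics.
subcodes = {k: [sample_block(t) for _ in range(8)] for k, t in enumerate(thetas)}
union_code = list({cw for code in subcodes.values() for cw in code})

rate_penalty = math.log2(len(thetas)) / N      # extra rate for pooling, ~ N^-1 log K
for k, t in enumerate(thetas):
    blocks = [sample_block(t) for _ in range(1000)]
    d_union = sum(min(rho_N(x, y) for y in union_code) for x in blocks) / len(blocks)
    d_own = sum(min(rho_N(x, y) for y in subcodes[k]) for x in blocks) / len(blocks)
    print(k, round(d_own, 3), round(d_union, 3), round(rate_penalty, 3))
# The union code is never worse than the subcode matched to the true source.
```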
The distortion measure ρ is defined on (A∪Â) × (A∪Â) as the above
theorem is proved using a two-step encoding combining the regular encoding with a
quantization within the source or reproduction alphabet. Hence distortion must be
defined on A×A and Â×Â as well as A×Â.

The previous theorem is conceptually easily generalized to classes of
stationary sources using the ergodic decomposition. Numerous technical
measurability problems arise, however, and such results are more easily obtainable
using the following theorems.
To state the theorem in its most general form, we require the concept
of the ρ̄ distance between random processes and a simple application. Given two
stationary processes θ and φ, the ρ̄ distance ρ̄(θ,φ) is defined by

ρ̄(θ, φ) = sup_n ρ̄_n(θ, φ),

ρ̄_n(θ, φ) = inf_{q ∈ Q_n(θ,φ)} E_q[ ρ_n(X^n, Y^n) ],

where Q_n(θ,φ) is the class of all joint distributions q describing random vectors
(X^n, Y^n) such that the marginal distributions specifying X^n and Y^n are μ_θ^n and
μ_φ^n, respectively. Thus ρ̄_n measures how well X^n and Y^n can be matched in the
ρ_n sense by probabilistically connecting the random vectors in a way consistent
with their given distributions. Alternative definitions and properties of the ρ̄
distance are given in [7] and [8]. In particular, it is there proved that ρ̄ is a metric
and that ρ̄ has the following simple (but less useful here) alternative definition:
Corollary:

| δ_θ(R, Â, N) − δ_φ(R, Â, N) | ≤ ρ̄(θ, φ) for all N.

≤ δ_θ(R, Â) + ε( 2 + 2ρ̄(θ, k(θ)) )

≤ δ_θ(R, Â) + 5ε/2.
ε is an arbitrary positive constant and the blocksize is chosen large enough so that
the average distortion is D − ε for all θ and so that the probability that there is no
codeword with distortion less than D − ε/2 is less than ε/2ρ_M.
The coded representation of each of the L_k codewords generated for
each k will consist of two parts. The first part will be the fixed length binary number
equal to k − 1, k = 1,2,...,K, using at most log K + 1 bits to identify the codeword. The
second part will be the location of the codeword in a list for each k of length at
most log L_k + 1 bits. Thus the rate of any codeword for a given θ is at most

(6.2)  r_k = (log K + 2)/N + R_k(D − ε) + ε bits.
Obviously by choosing N large enough and ε small enough, the rate can
be made arbitrarily close to R_θ(D) for θ = 1,2,...,K.

The achievement of D and R_θ(D) for the combined supercode depends
upon the actual choice of a codeword out of the L = Σ_{j=1}^{K} L_j codewords in total. The
coding rule is as follows. Upon observing an output block of length N, among the
codewords of distortion less than D − ε/2, find the one of minimum rate, if any. If
there is no codeword with distortion less than D − ε/2, make a random choice. The
average rate for θ = k is then

r̄_k ≤ r_k + ( sup_j r_j ) · Prob[ no codeword of distortion less than D in the k-th code ]

    = (log K + 2)/N + R_k(D − ε) + ε + ( sup_j r_j ) ε/2ρ_M,
which is arbitrarily close to R_k(D) for small enough ε, large enough N. The average
distortion for θ = k is

≤ D − ε/2 + ρ_M · Prob[ no codeword of distortion less than D in the k-th code ]
≤ D − ε/2 + ε/2 = D.
VII. Conclusions
We have summarized a unified formulation of source coding, both
noiseless and with a fidelity criterion, in inaccurately or incompletely specified
statistical environments. Several alternative definitions of universally good coding
algorithms have been presented, the most general known existence theorems stated,
and proofs given for some simple informative cases. The results indicate that
surprisingly good performance is possible in such situations and suggest possible
design philosophies, i.e., (1) estimate the source present and transmit both the estimate
and an optimal codeword for the given estimated source, or (2) build several
representative subcodes and send the best word produced by any of them. In the
first approach, asymptotically the estimate well approximates the true source and
takes up a negligible percentage of the codeword. In the second approach, if the
representatives are well chosen, all possible sources are nearly optimally encoded
with an asymptotically negligible increase in rate.
The results described are largely existence theorems and therefore do
not prescribe a specific method of synthesizing data compression systems. They do,
however, provide figures of merit and optimal performance bounds that can serve as
an absolute yardstick for comparison and a justification for overall design philosophies
that have proved useful in practice.
REFERENCES
[4] GRAY, R.M., and DAVISSON, L.D., "Source Coding Theorems without the
Ergodic Assumption", IEEE Trans. IT, July 1974.
[7] GRAY, R.M., NEUHOFF, D., and SHIELDS, P., "A Generalization of
Ornstein's d̄ Metric with Applications to Information Theory",
Annals of Probability (to be published).
[8] NEUHOFF, D., GRAY, R.M., and DAVISSON, L.D., "Fixed Rate Universal
Source Coding with a Fidelity Criterion", submitted to IEEE Trans.
IT.
[9] PURSLEY, M.B., "Coding Theorems for Non-Ergodic Sources and Sources
with Unknown Parameters", USC Technical Report, February 1974.
[13] BERGER, T., "Rate Distortion Theory: A Mathematical Basis for Data
Compression", Englewood Cliffs, New Jersey, Prentice-Hall.
[15] GRAY, R.M., and DAVISSON, L.D., "A Mathematical Theory of Data
Compression (?)", USCEE Report, September 1974.
CONTENTS
Preface
Lecture 1. Rate Distortion Theory: An Introduction
Lecture 2. Computation of R(D)
Lecture 3. The Lower Bound Theorem & Error Estimates for the Blahut Algorithm
Lecture 4. Extension to Sources and Distortion Measures with Memory
Lecture 5. Algebraic Source Encoding
Lecture 6. Tree Codes for Sources
Lecture 7. Theory vs Practice and Some Open Problems
References

Preface
I. Introduction
II. Variable-Rate Noiseless Source Codes
III. Fixed-Rate Noiseless Coding
IV. Universal Coding on Video Data
V. Fixed-Rate Coding Subject to a Fidelity Criterion
VI. Variable-Rate Coding with Distortion
VII. Conclusions
References