Chapter 6
Motivations
• Lossy data compression = compressing a source to a rate less than the source
entropy.
[Block diagram: a source with rate H > C is fed into a channel of capacity C, producing an output with unmanageable error.]
• The distortion measure ρ(z, ẑ) can be viewed as the cost of representing the
source symbol z ∈ Z by a reproduction symbol ẑ ∈ Ẑ.
E.g. Lossy data compression is similar to “grouping”: source symbols {1, 2, 3, 4} are partitioned into two groups, {1, 2} and {3, 4}, and each group is represented by a single representative symbol. The corresponding distortion matrix is

[ρ(i, j)] :=
  0 1 2 2
  1 0 2 2
  2 2 0 1
  2 2 1 0
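As a concrete illustration (not part of the slides), the following Python sketch evaluates the average distortion of this grouping under an assumed uniform source, taking symbols 1 and 3 as the (assumed) representatives of the two groups; it also shows the rate saving from 2 bits to 1 bit per symbol.

```python
import numpy as np

# Distortion matrix rho(i, j) from the grouping example; symbols 1..4 are indexed 0..3.
rho = np.array([[0, 1, 2, 2],
                [1, 0, 2, 2],
                [2, 2, 0, 1],
                [2, 2, 1, 0]])

# Grouping quantizer: {1, 2} -> representative 1, {3, 4} -> representative 3 (assumed choice).
quantizer = {0: 0, 1: 0, 2: 2, 3: 2}

p_z = np.full(4, 0.25)  # assumed uniform source distribution

avg_distortion = sum(p_z[z] * rho[z, quantizer[z]] for z in range(4))
print(avg_distortion)             # 0.5: each symbol is kept (cost 0) or mapped within its group (cost 1)
print(np.log2(4), np.log2(2))     # 2 bits/symbol without grouping vs 1 bit/symbol with grouping
```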
6.1.2 Distortion measures I: 6-3
Example 6.5 (Hamming distortion measure) Let the source alphabet and reproduction alphabet be the same, i.e., Z = Ẑ. Then the Hamming distortion measure is given by

ρ(z, ẑ) :=
  0, if z = ẑ;
  1, if z ≠ ẑ.

This is also named the probability-of-error distortion measure because

E[ρ(Z, Ẑ)] = Pr(Z ≠ Ẑ).
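A quick numerical check of this identity (the joint pmf below is an illustrative assumption):

```python
import numpy as np

p_joint = np.array([[0.4, 0.1],    # assumed P_{Z,Zhat}(z, zhat) for z, zhat in {0, 1}
                    [0.2, 0.3]])
rho = 1.0 - np.eye(2)              # Hamming distortion: 0 on the diagonal, 1 elsewhere

expected_distortion = (p_joint * rho).sum()
prob_error = p_joint.sum() - np.trace(p_joint)   # Pr(Z != Zhat)
print(expected_distortion, prob_error)           # both equal 0.3
```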
• The squared error distortion measure has the advantages of simplicity and
having a closed-form solution for most cases of interest, such as when using
least squares prediction.
• Yet, this measure is not ideal for practical situations involving data intended for human observers (such as image and speech data), as it is inadequate for measuring perceptual quality.
• For example, two speech waveforms in which one is a marginally time-shifted version of the other may have a large squared error distortion; however, they sound quite similar to the human ear.
Distortion measure for sequences I: 6-7
Problem: A problem with taking k = n is that the distortion measure for sequences can no longer be defined based on per-letter distortions, and hence a per-letter formula for the best lossy data compression rate cannot be obtained.
Solution: View lossy data compression as a two-step procedure.
Step 1: Find a data compression code

h : Z^n → Ẑ^n

for which the pre-specified distortion constraint and rate constraint are both satisfied.
Step 2: Derive an (asymptotically) lossless data compression block code for the source h(Z^n). The existence of such a code with block length

k > H(h(Z^n)) bits

is guaranteed by Shannon's lossless source coding theorem.
In this way, the overall two-step code

Z^n → Ẑ^n → {0, 1}^k

is established.
Distortion measure for sequences I: 6-9
• Since the second step is already covered by lossless data compression, the lossy data compression theorem is essentially a theorem about the first step.
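A minimal numerical sketch of this two-step view (the source pmf is an assumption; the quantizer reuses the grouping example above): Step 1 maps each source symbol to a group representative, and Step 2 then needs only about H(h(Z)) bits per symbol to describe the representatives losslessly.

```python
import numpy as np

p_z = np.array([0.1, 0.4, 0.3, 0.2])   # assumed pmf over source symbols {1, 2, 3, 4}
quantizer = np.array([0, 0, 2, 2])      # Step 1: {1, 2} -> representative 1, {3, 4} -> 3

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Distribution of the quantizer output h(Z).
p_rep = np.array([p_z[quantizer == r].sum() for r in (0, 2)])

print(entropy(p_z))    # H(Z) ~ 1.85 bits/symbol (lossless rate needed for the raw source)
print(entropy(p_rep))  # H(h(Z)) = 1 bit/symbol (lossless rate needed after Step 1)
```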
6.2 Fixed-length lossy data compression I: 6-10
Proof:
• time-sharing argument:
– If we can use an (n, M1, D1) code ∼C1 to achieve (R1, D1) and an (n, M2, D2)
code ∼C2 to achieve (R2, D2), then for any rational number 0 < λ < 1, we
can use ∼C1 for a fraction λ of the time and use ∼C2 for a fraction 1 − λ
of the time to achieve (Rλ, Dλ), where Rλ = λR1 + (1 − λ)R2 and Dλ =
λD1 + (1 − λ)D2;
– hence the result holds for any real number 0 < λ < 1 by the density of the
rational numbers in R and the continuity of Rλ and Dλ in λ.
• Let r and s be positive integers and let λ = r/(r + s); then 0 < λ < 1.
Achievable Rate-Distortion Pair I: 6-13
• Assume that the pairs (R1, D1) and (R2, D2) are achievable. Then there exist
a sequence of (n, M1, D1) codes ∼C1 and a sequence of (n, M2, D2) codes ∼C2
such that for n sufficiently large,
(1/n) log₂ M₁ ≤ R₁  and  (1/n) log₂ M₂ ≤ R₂.
• Construct a sequence of new codes ∼C of blocklength n_λ = (r + s)n, codebook size M = M₁^r × M₂^s, and compression function h : Z^{(r+s)n} → Ẑ^{(r+s)n} such that

h(z^{(r+s)n}) = (h₁(z₁^n), . . . , h₁(z_r^n), h₂(z_{r+1}^n), . . . , h₂(z_{r+s}^n)),

where

z^{(r+s)n} = (z₁^n, . . . , z_r^n, z_{r+1}^n, . . . , z_{r+s}^n)

and h₁ and h₂ are the compression functions of ∼C₁ and ∼C₂, respectively.
Achievable Rate-Distortion Pair I: 6-14
• The average (or expected) distortion under the additive distortion measure ρ_n and the rate of code ∼C are given by

E[ρ_{(r+s)n}(Z^{(r+s)n}, h(Z^{(r+s)n}))] / ((r+s)n)
  = (1/(r+s)) { E[ρ_n(Z₁^n, h₁(Z₁^n))]/n + · · · + E[ρ_n(Z_r^n, h₁(Z_r^n))]/n
                + E[ρ_n(Z_{r+1}^n, h₂(Z_{r+1}^n))]/n + · · · + E[ρ_n(Z_{r+s}^n, h₂(Z_{r+s}^n))]/n }
  ≤ (1/(r+s)) (r D₁ + s D₂)
  = λ D₁ + (1 − λ) D₂ = D_λ

and

(1/((r+s)n)) log₂ M = (1/((r+s)n)) log₂(M₁^r × M₂^s)
  = (r/(r+s)) (1/n) log₂ M₁ + (s/(r+s)) (1/n) log₂ M₂
  ≤ λ R₁ + (1 − λ) R₂ = R_λ,

respectively, for n sufficiently large. Thus, (R_λ, D_λ) is achievable by ∼C.
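A tiny sketch of the time-sharing computation above (the two achievable pairs are hypothetical values):

```python
def time_share(R1, D1, R2, D2, lam):
    """Rate-distortion pair achieved by using code 1 a fraction lam of the time and code 2 otherwise."""
    return lam * R1 + (1 - lam) * R2, lam * D1 + (1 - lam) * D2

# lam = r/(r+s) with r = 2, s = 3, combining hypothetical pairs (1 bit, 0.0) and (0 bits, 0.5).
print(time_share(1.0, 0.0, 0.0, 0.5, lam=2 / 5))   # -> (0.4, 0.3)
```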
Achievable Rate-Distortion Pair I: 6-15
Definition 6.16 (Distortion typical set) The distortion δ-typical set with respect to the memoryless (product) distribution P_{Z,Ẑ} on Z^n × Ẑ^n and a bounded additive distortion measure ρ_n(·, ·) is defined by

D_n(δ) := { (z^n, ẑ^n) ∈ Z^n × Ẑ^n :
  | −(1/n) log₂ P_{Z^n}(z^n) − H(Z) | < δ,
  | −(1/n) log₂ P_{Ẑ^n}(ẑ^n) − H(Ẑ) | < δ,
  | −(1/n) log₂ P_{Z^n,Ẑ^n}(z^n, ẑ^n) − H(Z, Ẑ) | < δ,
  and | (1/n) ρ_n(z^n, ẑ^n) − E[ρ(Z, Ẑ)] | < δ }.
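To make Definition 6.16 concrete, here is a Python sketch (not part of the slides) that tests membership of a sequence pair in D_n(δ); the joint pmf (assumed to have full support), the Hamming distortion, and δ are illustrative choices.

```python
import numpy as np

def in_distortion_typical_set(z, zhat, p_joint, rho, delta):
    """Check whether (z^n, zhat^n) lies in the distortion delta-typical set D_n(delta)."""
    z, zhat = np.asarray(z), np.asarray(zhat)
    n = len(z)
    p_z, p_zhat = p_joint.sum(axis=1), p_joint.sum(axis=0)

    # Normalized log-likelihoods and distortion of the given pair of sequences.
    logp_z = -np.log2(p_z[z]).sum() / n
    logp_zhat = -np.log2(p_zhat[zhat]).sum() / n
    logp_joint = -np.log2(p_joint[z, zhat]).sum() / n
    dist = rho[z, zhat].sum() / n

    # Entropies and expected distortion under the generic joint distribution.
    H_Z = -(p_z * np.log2(p_z)).sum()
    H_Zhat = -(p_zhat * np.log2(p_zhat)).sum()
    H_ZZhat = -(p_joint * np.log2(p_joint)).sum()
    E_rho = (p_joint * rho).sum()

    return (abs(logp_z - H_Z) < delta and abs(logp_zhat - H_Zhat) < delta
            and abs(logp_joint - H_ZZhat) < delta and abs(dist - E_rho) < delta)

# Illustrative usage: draw an i.i.d. pair sequence from an assumed joint pmf.
p_joint = np.array([[0.4, 0.1], [0.2, 0.3]])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])                 # Hamming distortion
rng = np.random.default_rng(0)
idx = rng.choice(4, size=2000, p=p_joint.ravel())
z, zhat = idx // 2, idx % 2
print(in_distortion_typical_set(z, zhat, p_joint, rho, delta=0.05))  # typically True for large n
```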
AEP for distortion typical set I: 6-17
Theorem 6.17 If (Z₁, Ẑ₁), (Z₂, Ẑ₂), . . ., (Z_n, Ẑ_n), . . . are i.i.d., and ρ_n is a bounded additive distortion measure, then

−(1/n) log₂ P_{Z^n}(Z₁, Z₂, . . . , Z_n) → H(Z) in probability;
−(1/n) log₂ P_{Ẑ^n}(Ẑ₁, Ẑ₂, . . . , Ẑ_n) → H(Ẑ) in probability;
−(1/n) log₂ P_{Z^n,Ẑ^n}((Z₁, Ẑ₁), . . . , (Z_n, Ẑ_n)) → H(Z, Ẑ) in probability;

and

(1/n) ρ_n(Z^n, Ẑ^n) → E[ρ(Z, Ẑ)] in probability.

Proof: Functions of independent random variables are also independent random variables. Thus, by the weak law of large numbers, we have the desired result. □
• It should be pointed out that without the boundedness assumption, the normalized sum of an i.i.d. sequence does not necessarily converge in probability to a finite mean; hence the need for requiring that ρ be bounded.
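As a quick illustration of Theorem 6.17 (not in the slides), the following sketch draws i.i.d. pairs from an assumed joint pmf and shows the normalized additive Hamming distortion concentrating around E[ρ(Z, Ẑ)] as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
p_joint = np.array([[0.4, 0.1],     # assumed joint pmf P_{Z,Zhat}
                    [0.2, 0.3]])
rho = np.array([[0.0, 1.0],
                [1.0, 0.0]])        # Hamming distortion

E_rho = (p_joint * rho).sum()       # E[rho(Z, Zhat)] = 0.3

for n in (10, 100, 10_000):
    idx = rng.choice(4, size=n, p=p_joint.ravel())   # i.i.d. pairs (Z_i, Zhat_i)
    z, zhat = idx // 2, idx % 2
    print(n, rho[z, zhat].mean(), "vs", E_rho)       # empirical average -> 0.3 as n grows
```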
AEP for distortion typical set I: 6-18
Theorem 6.18 (AEP for distortion measure) Given a DMS {(Z_n, Ẑ_n)} with generic joint distribution P_{Z,Ẑ} and any δ > 0, the distortion δ-typical set satisfies
1. P_{Z^n,Ẑ^n}(D_n^c(δ)) < δ for n sufficiently large.
2. For all (z^n, ẑ^n) in D_n(δ),

P_{Ẑ^n}(ẑ^n) ≥ P_{Ẑ^n|Z^n}(ẑ^n|z^n) 2^{−n[I(Z;Ẑ)+3δ]}.

Proof: The first result follows directly from Theorem 6.17 and the definition of the distortion typical set D_n(δ). The second result can be proved as follows:

P_{Ẑ^n|Z^n}(ẑ^n|z^n) = P_{Z^n,Ẑ^n}(z^n, ẑ^n) / P_{Z^n}(z^n)
  = P_{Ẑ^n}(ẑ^n) · P_{Z^n,Ẑ^n}(z^n, ẑ^n) / [ P_{Z^n}(z^n) P_{Ẑ^n}(ẑ^n) ]
  ≤ P_{Ẑ^n}(ẑ^n) · 2^{−n[H(Z,Ẑ)−δ]} / [ 2^{−n[H(Z)+δ]} 2^{−n[H(Ẑ)+δ]} ]
  = P_{Ẑ^n}(ẑ^n) 2^{n[I(Z;Ẑ)+3δ]},

where the inequality follows from the definition of D_n(δ).
AEP for distortion typical set I: 6-19
where ρ(·, ·) is a given single-letter distortion measure. Then the source's rate-distortion function is given by

R(D) = min_{P_{Ẑ|Z}: E[ρ(Z,Ẑ)] ≤ D} I(Z; Ẑ).
Proof: Define

R^(I)(D) := min_{P_{Ẑ|Z}: E[ρ(Z,Ẑ)] ≤ D} I(Z; Ẑ).   (6.3.3)
1. Achievability Part (i.e., R(D + ε) ≤ R^(I)(D) + 4ε for arbitrarily small ε > 0):
We need to show that for any ε > 0, there exist 0 < γ < 4ε and a sequence of lossy data compression codes {(n, M_n, D + ε)}_{n=1}^∞ with

lim sup_{n→∞} (1/n) log₂ M_n ≤ R^(I)(D) + γ < R^(I)(D) + 4ε.
Step 1: Optimizing the conditional distribution. Let P_{Z̃|Z} be the conditional distribution that achieves R^(I)(D), i.e.,

R^(I)(D) = min_{P_{Ẑ|Z}: E[ρ(Z,Ẑ)] ≤ D} I(Z; Ẑ) = I(Z; Z̃).

Then
E[ρ(Z, Z̃)] ≤ D.

Choose M_n to satisfy

R^(I)(D) + γ/2 ≤ (1/n) log₂ M_n ≤ R^(I)(D) + γ

for some γ in (0, 4ε); such a choice exists for all sufficiently large n > N₀ for some N₀. Define

δ := min{ γ/8, ε/(1 + 2ρ_max) },

where the first term is required in Step 4 and the second term is required in Step 5.
Shannon’s lossy source coding theorem I: 6-22
For convenience, we let K(z^n, z̃^n) denote the indicator function of D_n(δ), i.e.,

K(z^n, z̃^n) =
  1, if (z^n, z̃^n) ∈ D_n(δ);
  0, otherwise.

Then

P_{Z̃^n}( { ∼C_n : z^n ∈ J(∼C_n) } ) = ( 1 − Σ_{z̃^n ∈ Ẑ^n} P_{Z̃^n}(z̃^n) K(z^n, z̃^n) )^{M_n}.
Shannon’s lossy source coding theorem I: 6-26
2. Converse Part (i.e., R(D + ε) ≥ R^(I)(D) for arbitrarily small ε > 0 and any D ∈ {D ≥ 0 : R^(I)(D) > 0}): We need to show that for any sequence of {(n, M_n, D_n)}_{n=1}^∞ codes with

lim sup_{n→∞} (1/n) log₂ M_n < R^(I)(D),

there exists ε > 0 such that

D_n = (1/n) E[ρ_n(Z^n, h_n(Z^n))] > D + ε

for n sufficiently large. The proof is as follows.
Step 1: Convexity of mutual information. By the convexity of the mutual information I(Z; Ẑ) with respect to P_{Ẑ|Z} for a fixed P_Z, we have
Finally,

lim sup_{n→∞} (1/n) log₂ M_n < R^(I)(D)

implies the existence of N and γ > 0 such that

(1/n) log₂ M_n < R^(I)(D) − γ

for all n > N. Therefore, for n > N,

R^(I)( (1/n) E[ρ_n(Z^n, h_n(Z^n))] ) ≤ (1/n) log₂ M_n < R^(I)(D) − γ,

which, together with the fact that R^(I)(D) is strictly decreasing, implies that

(1/n) E[ρ_n(Z^n, h_n(Z^n))] > D + ε

for some ε = ε(γ) > 0 and for all n > N.
Hence, (R, D + ε) is not achievable for any R < R^(I)(D), and the operational R(D) satisfies R(D + ε) ≥ R^(I)(D) for arbitrarily small ε > 0.
Shannon’s lossy source coding theorem I: 6-33
3. Summary:
• For D ∈ {D ≥ 0 : R^(I)(D) > 0}, the achievability and converse parts jointly imply that

R^(I)(D) + 4ε ≥ R(D + ε) ≥ R^(I)(D)

for arbitrarily small ε > 0.
• These inequalities, together with the continuity of R^(I)(D), yield

R(D) = R^(I)(D)

for D ∈ {D ≥ 0 : R^(I)(D) > 0}.
• For D ∈ {D ≥ 0 : R^(I)(D) = 0}, the achievability part gives

R^(I)(D) + 4ε = 4ε ≥ R(D + ε) ≥ 0

for arbitrarily small ε > 0. This immediately implies that

R(D) = 0 (= R^(I)(D)). □
Notes I: 6-34
• After introducing
– Shannon’s source coding theorem for block codes
– Shannon’s channel coding theorem for block codes
– Rate-distortion theorem
in the memoryless (and stationary ergodic) system setting, we briefly elucidate
the “key concepts or techniques” behind these lengthy proofs, in particular:
– The notion of a typical set
∗ The typical set construct – specifically,
· δ-typical set for source coding
· joint δ-typical set for channel coding
· distortion typical set for rate-distortion
uses a law of large numbers or AEP argument to claim the existence
of a set with very high probability; hence, the respective information
manipulation can just focus on the set with negligible performance loss.
Notes I: 6-36
where ρ(·, ·) is a given single-letter distortion measure. Then the source's rate-distortion function is given by

R(D) = R̄^(I)(D),

where

R̄^(I)(D) := lim_{n→∞} R_n^(I)(D)   (6.3.5)

is called the asymptotic information rate-distortion function, and

R_n^(I)(D) := min_{P_{Ẑ^n|Z^n}: (1/n) E[ρ_n(Z^n,Ẑ^n)] ≤ D} (1/n) I(Z^n; Ẑ^n).   (6.3.6)
• Question: Can we extend the theorems to cases where the two arguments fail?
• It is obvious that only when new methods (other than the above two) are
developed can the question be answered in the affirmative.
6.4 Calculation of the rate-distortion function I: 6-39
Theorem 6.23 Fix a binary DMS {Z_n}_{n=1}^∞ with marginal distribution P_Z(0) = 1 − P_Z(1) = p, where 0 < p < 1. Then the source's rate-distortion function under the Hamming additive distortion measure is given by

R(D) =
  h_b(p) − h_b(D), if 0 ≤ D < min{p, 1 − p};
  0,               if D ≥ min{p, 1 − p},

where h_b(p) := −p · log(p) − (1 − p) · log(1 − p) is the binary entropy function.

For any P_{Ẑ|Z} satisfying the distortion constraint E[ρ(Z, Ẑ)] = Pr{Z ≠ Ẑ} ≤ D with 0 ≤ D < min{p, 1 − p}, we have

I(Z; Ẑ) = H(Z) − H(Z|Ẑ)
  = h_b(p) − H(Z ⊕ Ẑ | Ẑ)
  ≥ h_b(p) − H(Z ⊕ Ẑ)   (conditioning never increases entropy)
  ≥ h_b(p) − h_b(D),

where the last inequality follows since h_b(x) is increasing for x ≤ 1/2 and Pr{Z ⊕ Ẑ = 1} ≤ D.
• Since the above derivation holds for any such P_{Ẑ|Z}, we have

R(D) ≥ h_b(p) − h_b(D).
6.4 Calculation of the rate-distortion function I: 6-41
• It remains to show that the lower bound is achievable by some P_{Ẑ|Z}, or equivalently, that H(Z|Ẑ) = h_b(D) for some P_{Ẑ|Z} meeting the distortion constraint.
• In the case p ≤ D < 1 − p, we can let P_{Ẑ|Z}(1|0) = P_{Ẑ|Z}(1|1) = 1 to obtain I(Z; Ẑ) = 0 and

E[ρ(Z, Ẑ)] = Σ_{z=0}^{1} Σ_{ẑ=0}^{1} P_Z(z) P_{Ẑ|Z}(ẑ|z) ρ(z, ẑ) = p ≤ D.

• Here “⊕” denotes modulo-two addition; under the additive Hamming distortion measure, ρ_n(z^n, ẑ^n) = Σ_{i=1}^{n} (z_i ⊕ ẑ_i) is exactly the number of bit changes or bit errors after compression.
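The slides give the closed form only; as a numerical cross-check (not part of the slides), the information rate-distortion function can be traced with the Blahut–Arimoto algorithm. The sketch below uses assumed parameters (p = 0.4, Lagrange parameter s = −4) and verifies that the computed rate agrees with h_b(p) − h_b(D) at the resulting distortion.

```python
import numpy as np

def blahut_arimoto_rd(p_z, rho, s, n_iter=500):
    """Compute one (D, R) point of the rate-distortion curve by Blahut-Arimoto.

    p_z: source pmf; rho: distortion matrix rho[z, zhat]; s < 0: Lagrange parameter
    (more negative s gives smaller distortion). Rate R is returned in bits.
    """
    q = np.full(rho.shape[1], 1.0 / rho.shape[1])       # output marginal, start uniform
    for _ in range(n_iter):
        W = q * np.exp(s * rho)                         # unnormalized test channel P_{Zhat|Z}
        W /= W.sum(axis=1, keepdims=True)
        q = p_z @ W                                     # induced output marginal
    D = float((p_z[:, None] * W * rho).sum())
    R = float((p_z[:, None] * W * np.log2(W / q)).sum())
    return D, R

def h_b(x):
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

p = 0.4
p_z = np.array([p, 1 - p])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])                # Hamming distortion
D, R = blahut_arimoto_rd(p_z, rho, s=-4.0)
print(D, R, h_b(p) - h_b(D))                            # R closely matches h_b(p) - h_b(D)
```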
6.4.2 Rate distortion func / the squared error dist I: 6-43
the rate-distortion function for any continuous memoryless source {Z_i} with a pdf of support R, zero mean, variance σ², and finite differential entropy satisfies

R(D) ≤
  (1/2) log₂(σ²/D), for 0 < D ≤ σ²;
  0,                for D > σ²,

with equality holding when the source is Gaussian.
For 0 < D ≤ σ²:
• Choose a dummy Gaussian random variable W with zero mean and variance aD, where a = 1 − D/σ², independent of Z. Let Ẑ = aZ + W. Then

E[(Z − Ẑ)²] = E[(1 − a)²Z²] + E[W²] = (1 − a)²σ² + aD = D,

which satisfies the distortion constraint.
• Note that the variance of Ẑ is equal to E[a²Z²] + E[W²] = a²σ² + aD = σ² − D.
• Consequently,

R(D) ≤ I(Z; Ẑ)
  = h(Ẑ) − h(Ẑ|Z)
  = h(Ẑ) − h(W + aZ|Z)
  = h(Ẑ) − h(W|Z)
  = h(Ẑ) − h(W)   (by the independence of W and Z)
  = h(Ẑ) − (1/2) log₂(2πe(aD))
  ≤ (1/2) log₂(2πe(σ² − D)) − (1/2) log₂(2πe(aD)) = (1/2) log₂(σ²/D).
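A quick Monte Carlo check (not in the slides; σ² = 4 and D = 1 are illustrative values) of the forward test channel Ẑ = aZ + W used above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, D, N = 4.0, 1.0, 1_000_000
a = 1 - D / sigma2

Z = rng.normal(0.0, np.sqrt(sigma2), N)         # Gaussian source
W = rng.normal(0.0, np.sqrt(a * D), N)          # dummy noise, independent of Z
Zhat = a * Z + W

print(np.mean((Z - Zhat) ** 2))                 # ~ D = 1.0 (distortion constraint met)
print(np.var(Zhat))                             # ~ sigma^2 - D = 3.0
print(0.5 * np.log2(sigma2 / D))                # R(D) for the Gaussian source: 1 bit
```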
6.4.2 Rate distortion func / the squared error dist I: 6-45
For D > σ²:
• Let Ẑ satisfy Pr{Ẑ = 0} = 1 (and be independent of Z).
• Then E[(Z − Ẑ)²] = E[Z²] + E[Ẑ²] − 2E[Z]E[Ẑ] = σ² < D, and I(Z; Ẑ) = 0.
Hence, R(D) = 0 for D > σ².
The achievability of this upper bound by a Gaussian source (with zero mean and variance σ²) can be proved by showing that, under the Gaussian source,

(1/2) log₂(σ²/D)

is a lower bound to R(D) for 0 < D ≤ σ².
6.4.2 Rate distortion func / the squared error dist I: 6-46
Indeed, when the source Z is Gaussian and for any f_{Ẑ|Z} such that E[(Z − Ẑ)²] ≤ D, we have

I(Z; Ẑ) = h(Z) − h(Z|Ẑ)
  = (1/2) log₂(2πeσ²) − h(Z − Ẑ|Ẑ)
  ≥ (1/2) log₂(2πeσ²) − h(Z − Ẑ)
  ≥ (1/2) log₂(2πeσ²) − (1/2) log₂(2πe · Var[Z − Ẑ])
  ≥ (1/2) log₂(2πeσ²) − (1/2) log₂(2πe · E[(Z − Ẑ)²])
  ≥ (1/2) log₂(2πeσ²) − (1/2) log₂(2πeD)
  = (1/2) log₂(σ²/D).
6.4.2 Rate distortion func / the squared error dist I: 6-47
• Similarly, for a continuous memoryless source {Z_i} with a pdf of support R and finite differential entropy, its rate-distortion function under the additive squared error distortion measure satisfies

R_G(D) − D(Z‖Z_G) ≤ R(D) ≤ R_G(D),

where R_G(D) = (1/2) log₂(σ²/D) is the rate-distortion function of a Gaussian source with the same variance, and the left-hand side is the Shannon lower bound on the rate-distortion function.
Section 6.4.3 is based on a similar idea but targets the absolute error distortion; hence, we omit it in our lecture. Notably, a correction has been provided for Theorem 6.29 (see the errata for the textbook).
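As an illustration of this sandwich bound (not in the slides), note that R_G(D) − D(Z‖Z_G) = h(Z) − (1/2) log₂(2πeD). The sketch below evaluates both bounds for a zero-mean uniform source, for which h(Z) and σ² are known in closed form; the uniform choice is an assumption for illustration.

```python
import numpy as np

a = 1.0                      # Z ~ Uniform(-a, a)
sigma2 = a**2 / 3            # variance
h_Z = np.log2(2 * a)         # differential entropy in bits

for D in (0.01, 0.1, sigma2):
    upper = 0.5 * np.log2(sigma2 / D)                      # R_G(D)
    lower = h_Z - 0.5 * np.log2(2 * np.pi * np.e * D)      # Shannon lower bound
    print(D, round(lower, 3), round(upper, 3))             # lower <= R(D) <= upper
```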
6.5 Lossy joint source-channel coding theorem I: 6-49
[Block diagram: source Z^m ∈ Z^m → Encoder f^(sc) → X^n → Channel → Y^n → Decoder g^(sc) → Ẑ^m ∈ Ẑ^m.]

Given an additive distortion measure ρ_m = Σ_{i=1}^{m} ρ(z_i, ẑ_i), where ρ is a distortion function on Z × Ẑ, we say that the m-to-n lossy source-channel block code (f^(sc), g^(sc)) satisfies the average distortion fidelity criterion D, where D ≥ 0, if

(1/m) E[ρ_m(Z^m, Ẑ^m)] ≤ D.
6.5 Lossy joint source-channel coding theorem I: 6-50
• Converse part: On the other hand, for any sequence of m-to-n_m lossy source-channel codes (f^(sc), g^(sc)) satisfying the average distortion fidelity criterion D, we have

(m/n_m) · R(D) ≤ C.
6.5 Lossy joint source-channel coding theorem I: 6-51
[Figure: the rate-distortion curve R(D) versus D; the Shannon limit D_SL is the distortion at which R(D) meets the level (1/R_sc) · C.]
6.6 Shannon limit of communication systems I: 6-54
• Thus, the resulting channel crossover probability is

Q(√(2 R_sc γ_b)).   (6.6.6)
• The minimal γ_b (in dB) for a given P_b = D < 1/2 and a source-channel code rate R_sc < 1:

γ_{b,SL} = (1/(2 R_sc)) [ Q^{−1}(SL) ]².

• For R_sc = 1,

SL := h_b^{−1}( 1 − R_sc (1 − h_b(D)) ) = D = P_b

and

γ_{b,SL} = (1/2) [ Q^{−1}(P_b) ]².
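A small Python sketch (not in the slides; scipy is assumed available) evaluating γ_{b,SL} from the formulas above. These values correspond to the Q-based (hard-decision) model of this slide and therefore differ from the binary-input AWGN limits quoted with the figure later in this section.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def h_b(x):
    """Binary entropy function (bits)."""
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def h_b_inv(y):
    """Inverse of h_b restricted to [0, 1/2]."""
    return brentq(lambda x: h_b(x) - y, 1e-12, 0.5)

def shannon_limit_db(Pb, Rsc):
    """gamma_{b,SL} in dB: SL = h_b^{-1}(1 - Rsc(1 - h_b(Pb))), gamma = [Q^{-1}(SL)]^2 / (2 Rsc)."""
    SL = h_b_inv(1 - Rsc * (1 - h_b(Pb)))
    return 10 * np.log10(norm.isf(SL) ** 2 / (2 * Rsc))    # norm.isf is Q^{-1}

print(shannon_limit_db(Pb=1e-5, Rsc=1.0))   # ~9.6 dB: the uncoded BPSK benchmark at Pb = 1e-5
print(shannon_limit_db(Pb=1e-5, Rsc=0.5))   # limit under this hard-decision model
```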
6.6 Shannon limit of communication systems I: 6-57
• The Shannon limit D_SL for this system with rate R_sc is obtained via

D_SL := min{ D : R(D) ≤ (1/R_sc) C(P) }
  = min{ D : (1/2) log₂(σ²/D) ≤ (1/(2 R_sc)) log₂(1 + P/σ_N²) }
  = σ² / (1 + P/σ_N²)^{1/R_sc}.   (6.6.10)
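For instance, (6.6.10) can be evaluated directly (the numbers below are illustrative assumptions):

```python
def D_SL(sigma2, snr, Rsc):
    """Distortion Shannon limit (6.6.10) for a zero-mean Gaussian source over an AWGN channel."""
    return sigma2 / (1.0 + snr) ** (1.0 / Rsc)

print(D_SL(sigma2=1.0, snr=1.0, Rsc=0.5))   # 1 / (1 + 1)^2 = 0.25
```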
[Figure: P_b versus γ_b (dB) for R_sc = 1/2 and R_sc = 1/3. The Shannon limits for the (2, 1) and (3, 1) codes under the binary-input AWGN channel are approximately 0.19 dB and −0.495 dB, respectively.]
• The Shannon limits calculated above are practically relevant due to the invention of near-capacity-achieving channel codes, such as Turbo and LDPC codes.
• For example, the rate-1/2 Turbo coding system proposed in 1993 can approach a bit error rate of 10^{−5} at γ_b = 0.9 dB, which is only 0.714 dB away from the Shannon limit of 0.186 dB.
6.6 Shannon limit of communication systems I: 6-62
• Why lossy data compression (e.g., to transmit a source with entropy larger
than capacity)
• Distortion measure
• Lossy data compression codes
• Rate-distortion function
• Distortion typical set
• AEP for distortion measure
• Rate distortion theorem
Key Notes I: 6-65
Terminology
• Shannon’s source coding theorem → Shannon’s first coding theorem;
• Shannon’s channel coding theorem → Shannon’s second coding theorem;
• Rate distortion theorem → Shannon’s third coding theorem.
• Information transmission theorem → Joint source-channel coding theorem
– Shannon limit (BER versus SNR_b)