TSBK08 Data Compression Exercises: Informationskodning, ISY, Linköpings universitet, 2013
Data compression
Exercises
Contents
1 Information theory
2 Source coding
3 Differential entropy and rate-distortion theory
Problems
1 Information theory
1.1 The random variable X takes values in the alphabet A = {1, 2, 3, 4}. The probability
mass function is p_X(x) = 1/4, x ∈ A.
Calculate H(X).
1.2 The random variable Y takes values in the alphabet A = {1, 2, 3, 4}. The probability
mass function is p_Y(1) = 1/2, p_Y(2) = 1/4, p_Y(3) = p_Y(4) = 1/8.
Calculate H(Y).
1.3 Suppose that X and Y in 1.1 and 1.2 are independent. Consider the random variable
(X, Y ).
a) Determine p_XY(x, y).
b) Calculate H(X, Y).
c) Show that H(X, Y) = H(X) + H(Y) when X and Y are independent.
d) Generalization: Show that
   H(X_1, X_2, ..., X_n) = H(X_1) + ... + H(X_n)
   whenever all the variables are mutually independent.
1.4 The random variable Z takes values in the alphabet A = {1, 2, 3, 4}.
a) Give an example of a probability mass function p_Z that maximizes H(Z). Is p_Z unique?
b) Give an example of a probability mass function p_Z that minimizes H(Z). Is p_Z unique?
1.5 Let the random variable U take values in the infinite alphabet A = {0, 1, 2, ...},
with probability mass function p_U(u) = q^u (1 − q), 0 < q < 1, u ∈ A.
a) Confirm that ∑_{u=0}^{∞} p_U(u) = 1.
b) Calculate H(U).
c) Calculate the average value of U.
1.6 Show that
I(X; Y) = H(X) − H(X | Y) ≥ 0
Hint: Use the inequality ln x ≤ x − 1.
1.7 A binary memoryless source where the two symbols have the probabilities {p, 1 − p}
has the entropy H. Express the entropies of the following sources in terms of H.
a) A memoryless source with four symbols with the probabilities {p/2, p/2, (1 − p)/2, (1 − p)/2}
b) A memoryless source with three symbols with the probabilities {p/2, p/2, 1 − p}
c) A memoryless source with four symbols with the probabilities {p², p(1 − p), (1 − p)p, (1 − p)²}.
1.8 Let X be a random variable and f any deterministic function of the alphabet. Show
that
a) H(f(X) | X) = 0
b) H(X, f(X)) = H(X)
c) H(f(X)) ≤ H(X)
1.9 Let X and Y be two independent random variables. Show that
H(X) ≤ H(X + Y) ≤ H(X, Y)
1.10 Given three random variables X, Y and Z, show that
H(X | Z) ≤ H(X | Y) + H(Y | Z)
Hint: Start with H(X | Y) ≥ H(X | Y, Z) and use the chain rule.
1.11 Show that
H(X_1, ..., X_{n+1}) = H(X_1, ..., X_n) + H(X_{n+1} | X_1, ..., X_n)
1.12 A uniformly distributed random variable X takes values in the alphabet
{0000, 0001, 0010, . . . , 1011} (the numbers 0 to 11 written as four bit binary num-
bers).
a) What is the entropy of each bit?
b) What is the entropy of X?
1.13 A Markov source X_i of order 1 with alphabet {0, 1, 2} has transition probabilities
according to the figure. Calculate the stationary probabilities of the states.
[State diagram: P(0|0) = 0.8, P(1|0) = 0.1, P(2|0) = 0.1, P(0|1) = 0.5, P(1|1) = 0.5, P(0|2) = 0.5, P(2|2) = 0.5.]
1.14 Consider the Markov source in problem 1.13.
a) Calculate the memoryless entropy of the source.
b) Calculate the block entropy H(X_i, X_{i+1}) of the source. Compare this to
H(X_i) + H(X_{i+1}) = 2H(X_i).
c) Calculate the entropy rate of the source.
1.15 A second order Markov source X_i with alphabet {a, b} has the transition probabilities
p(x_i | x_{i−1}, x_{i−2}) below:
p(a|aa) = 0.7, p(b|aa) = 0.3, p(a|ba) = 0.4, p(b|ba) = 0.6
p(a|ab) = 0.9, p(b|ab) = 0.1, p(a|bb) = 0.2, p(b|bb) = 0.8
Calculate the entropies H(X_i), H(X_i, X_{i−1}), H(X_i, X_{i−1}, X_{i−2}), H(X_i | X_{i−1}) and
H(X_i | X_{i−1}, X_{i−2}).
1.16 A fax machine works by scanning paper documents line by line. The symbol alphabet
is black and white pixels, i.e. A = {b, w}. We want to make a random model X_i
for typical documents and calculate limits on the data rate when coding the documents.
From a large set of test documents, the following conditional probabilities
p(x_i | x_{i−1}, x_{i−2}) (note the order) have been estimated.
p(w|w, w) = 0.95 p(b|w, w) = 0.05
p(w|w, b) = 0.9 p(b|w, b) = 0.1
p(w|b, w) = 0.2 p(b|b, w) = 0.8
p(w|b, b) = 0.3 p(b|b, b) = 0.7
a) The given probabilities imply a Markov model of order 2. Draw the state
diagram for this Markov model and calculate the stationary probabilities.
b) Calculate the entropies H(X_i), H(X_i | X_{i−1}) and H(X_i | X_{i−1}, X_{i−2}) for the
model.
2 Source coding
2.1 A suggested code for a source with alphabet A = {1, ..., 8} has the codeword
lengths l_1 = 2, l_2 = 2, l_3 = 3, l_4 = 4, l_5 = 4, l_6 = 5, l_7 = 5 and l_8 = 6. Is it possible
to construct a prefix code with these lengths?
2.2 A memoryless source has the infinite alphabet A = {1, 2, 3, ...} and symbol probabilities
P = {1/2, 1/4, 1/8, ...}, i.e. p(i) = 2^{−i}, i ∈ A.
Construct an optimal binary prefix code for the source and calculate the expected
data rate R in bits/symbol.
2.3 A memoryless source has the alphabet A = {x, y, z} and the symbol probabilities
p(x) = 0.6, p(y) = 0.3, p(z) = 0.1
a) What is the lowest average rate that can be achieved when coding this source?
b) Construct a Huffman code for single symbols from the source and calculate the
average data rate in bits/symbol.
c) Construct a Huffman code for pairs of symbols from the source and calculate
the average data rate in bits/symbol.
2.4 Consider the following Markov source of order 1, where p = 0.5^{1/8}:
[State diagram with states a and b: P(a|a) = P(b|b) = p and P(b|a) = P(a|b) = 1 − p.]
Construct Huffman codes for the source where 2 and 3 symbols are coded at a time.
Calculate the rates of the two codes. Which code is the best?
2.5 Consider the source in problem 2.4. It produces runs of a and b. Instead of coding
symbols, we can code the length of each run. We create a new source R which has
an infinite alphabet of run lengths B = {1, 2, 3, ...}.
a) What is the probability of a run of length r?
b) What is the average run length (in symbols/run) ?
c) What is the entropy of R (in bits/run)?
d) What is the entropy rate of the source (in bits/symbol) ?
2.6 We now want to construct a simple systematic code for the run lengths from the
source in problem 2.4.
a) Construct a four bit fixed-length code for run lengths 1 to 14. Longer runs are
coded as 15 followed by the codeword for the run length minus 15, i.e. the run length 15
is coded as 15 0, the run length 17 is coded as 15 2, the run length 40 is coded
as 15 15 10 and so on.
Calculate the data rate for the code in bits/symbol.
b) Change the codeword length to five bits and do the same as above.
2.7 We now want to code the run lengths from the source in problem 2.4 using Golomb
coding.
a) How should the parameter m be chosen so that we get an optimal code?
b) Calculate the resulting data rate.
2.8 We want to send documents using a fax machine. The fax handles the colours black
and white. Experiments have shown that text areas and image areas of documents
have different statistical properties. The documents are scanned and coded row
by row, according to a time discrete stationary process. The following conditional
probabilities have been estimated from a large set of test data.

Colour of        Probability of colour of next pixel
current pixel    Text area          Image area
                 black    white     black    white
black            0.5      0.5       0.7      0.3
white            0.1      0.9       0.2      0.8

The probability that we are in a text area is 4/5 and the probability that we are in
an image area is 1/5.
Assume that the estimated probabilities are correct and answer the following ques-
tions.
a) Assume that we can neglect the cost of coding which areas of the document are
text and which areas are images. Calculate an upper bound for the
lowest data rate that we can get when coding documents, in bits/pixel.
b) Construct a Huffman code for text areas that has a data rate of 0.65 bits/pixel
or less.
2.9 Consider the following Markov source of order 1:
[State diagram with states A, B, C, D and transition probabilities P(B|A) = 0.5, P(D|A) = 0.5, P(B|B) = 0.2, P(D|B) = 0.8, P(A|C) = 0.1, P(C|C) = 0.9, P(C|D) = 0.7, P(D|D) = 0.3.]
a) Show that it is possible to code the signal from the source with a data rate that
is less than 0.6 bits/symbol.
b) Construct optimal tree codes for single symbols and for pairs of symbols from
the source. Calculate the resulting data rates.
2.10 A memoryless source has the alphabet A = {a_1, a_2, a_3} and probabilities
P = {0.6, 0.2, 0.2}. We want to code the sequence a_3 a_1 a_1 a_2 using arithmetic coding.
Find the codeword. Assume that all calculations can be done with unlimited
precision.
2.11 Do the same as in problem 2.10, but let all probabilities and limits be stored with
six bits precision. Shift out codeword bits as soon as possible.
2.12 Assume that the source is the same as in problem 2.10 and problem 2.11. Let
all probabilities and limits be stored with six bits precision. Decode the codeword
1011101010000. We know that the codeword corresponds to five symbols.
2.13 A system for transmitting simple colour images is using the colours white, black,
red, blue, green and yellow. The source is modelled as a Markov source of order 1
with the following transition probabilities
State probability for next state
white black red blue green yellow
white 0.94 0.02 0.01 0.01 0.01 0.01
black 0.05 0.50 0.15 0.10 0.15 0.05
red 0.03 0.02 0.90 0.01 0.01 0.03
blue 0.02 0.02 0.02 0.90 0.03 0.01
green 0.02 0.02 0.01 0.03 0.90 0.02
yellow 0.03 0.01 0.03 0.01 0.03 0.89
We use arithmetic coding to code sequences from the source. The coder uses the
conditional probabilities while coding.
a) A sequence starts with red, white, white. What interval does this correspond
to? Assume that the previous pixel was red.
b) The decoder is waiting for a new sequence. The bit stream 110100111100001100
is received. What are the first two colours in this sequence? The last pixel in
the previous sequence was black.
2.14 We want to code a stationary binary source with memory, with alphabet A = {a, b}. The
following block probabilities p(x_n, x_{n+1}) have been estimated and can be assumed
to be correct:
p(aa) = 1/7
p(ab) = 1/7
p(ba) = 1/7
p(bb) = 4/7
Construct a codeword for the sequence bbab by using arithmetic coding. The coder
should utilize conditional probabilities. Assume that the symbol before the sequence
to be coded is b.
2.15 Consider the source in problem 1.16. Use arithmetic coding to code the sequence
wbbwww
The memory of the source should be utilized in the coder. Imaginary pixels before
the first pixel can be considered to be white. You can assume that the coder can
store all probabilities and interval limits exactly. Give both the resulting interval
and the codeword.
2.16 Consider the following code for coding sequences from a binary source (alphabet
A = {0, 1}): Choose a block length n. Divide the sequence into blocks of length
n. For each block, count the number of ones, w. Encode the number w as a
binary number using a fixed-length code. Then code the index of the block in an
enumeration of all possible blocks of length n containing exactly w ones, also using
a fixed-length code.
a) How many bits are needed to code a block of w ones and n − w zeros?
b) Assuming n = 15, how many bits are needed to code the sequence
x = 000010000000100?
c) Note that the coding algorithm does not take the symbol probabilities into
account.
Assuming that the source is memoryless, show that the coding algorithm is
universal, i.e. that the average rate approaches the entropy rate of the source
as n → ∞.
2.17 A source has the alphabet A = {a, b}. Code the sequence
ababbaaababbbbaaabaaaaaababba...
using LZ77. The history buer has the size 16. The maximum match length is 15.
Check your solution by decoding the codewords.
2.18 Code the sequence in problem 2.17 using LZ78.
2.19 Code the sequence in problem 2.17 using LZW.
2.20 A source has the alphabet {a, b, c, d, e, f, g, h}. A sequence from the source is coded
using LZW and gives the following index sequence:
5, 0, 8, 10, 4, 3, 12, 12, 3, 6, 17, . . .
The starting dictionary is:
index sequence
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
Decode the index sequence. Also give the resulting dictionary.
2.21 A source has the alphabet A = {a, b, c, d, e, f, g, h}.
Code the sequence beginning with
bagagagagebaggebadhadehadehaf . . .
using LZSS. The history buffer pointer should be coded using 6 bit fixed-length
codewords and the match lengths should be coded using 3 bit fixed-length codewords.
Give the resulting codewords.
2.22 a) A source has the alphabet A = {a, b, c, d}. The sequence dadadabbbc is
coded using the Burrows-Wheeler block transform (BWT). What is the transformed
sequence and the index?
b) The transformed sequence from problem a) is coded with move-to-front coding
(mtf). What is the coded sequence?
2.23 A source has the alphabet A = {v, x, y, z}. A sequence is coded using BWT and
mtf. The resulting index is 1 and the mtf-coded sequence is 2,2,3,0,0,1,1,0,2,0,0.
Decode the sequence
3 Differential entropy and rate-distortion theory
3.1 Calculate the differential entropy of the triangular distribution with probability
density function f(x):
f(x) = 1/a + x/a²   for −a ≤ x ≤ 0
       1/a − x/a²   for 0 ≤ x ≤ a
       0            otherwise
3.2 Calculate the differential entropy of the exponential distribution with probability
density function f(x):
f(x) = (1/a) e^{−x/a}   for x ≥ 0
       0                for x < 0
3.3 Determine the rate-distortion function R(D) for a gaussian process with power
spectral density according to the figure.
[Figure: two-level power spectral density, Φ(f) = 2σ²/(3B) for |f| ≤ B/2, Φ(f) = σ²/(3B) for B/2 < |f| ≤ B, and 0 otherwise.]
3.4 Determine the rate-distortion function for a gaussian process with power spectral
density according to the figure.
[Figure: triangular power spectral density with peak σ²/B at f = 0, falling linearly to 0 at |f| = B.]
3.5 A signal is modelled as a time-discrete gaussian stochastic process X_n with power
spectral density Φ(θ) as below.
[Figure: Φ(θ) = 17 for |θ| ≤ 0.3 and Φ(θ) = 0 for 0.3 < |θ| ≤ 0.5.]
Assume that we encode the signal at an average data rate of 3 bits/sample. What
is the theoretically lowest possible distortion that we can achieve?
Solutions
1.1 H(X) = −∑_{i=1}^{4} p_X(x_i) log p_X(x_i) = −4 · (1/4) log(1/4) = 2
1.2 H(Y) = −∑_{i=1}^{4} p_Y(y_i) log p_Y(y_i) = −(1/2) log(1/2) − (1/4) log(1/4) − 2 · (1/8) log(1/8) = 1.75
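The two values above can be checked with a few lines of Python (a sketch; the helper name entropy is ours, not part of the course material):

    import math

    def entropy(probs):
        """Entropy in bits of a probability mass function given as a list."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([1/4] * 4))             # H(X) -> 2.0
    print(entropy([1/2, 1/4, 1/8, 1/8]))  # H(Y) -> 1.75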
1.3 a) Independence gives p_XY(x, y) = p_X(x) · p_Y(y). Thus

p(x, y)   y = 1   y = 2   y = 3   y = 4
x = 1     1/8     1/16    1/32    1/32
x = 2     1/8     1/16    1/32    1/32
x = 3     1/8     1/16    1/32    1/32
x = 4     1/8     1/16    1/32    1/32

b) Calculating the entropy from the 16 probabilities in a) gives
H(X, Y) = 3.75
Since X and Y are independent you can also calculate the entropy from
H(X, Y) = H(X) + H(Y) = 3.75
c) H(X, Y) = −∑_{x,y} p_XY(x, y) log p_XY(x, y)
           = −∑_{x,y} p_X(x) p_Y(y) (log p_X(x) + log p_Y(y))
           = −∑_x (∑_y p_Y(y)) p_X(x) log p_X(x) − ∑_y (∑_x p_X(x)) p_Y(y) log p_Y(y)
           = H(X) + H(Y)
d) View (X_1, ..., X_{n−1}) as one random variable and make an induction proof.
1.4 a) The solution can for instance be found by using Lagrange multipliers. Set
p_i = p_Z(i), i = 1, ..., 4. We want to maximize
−∑_{i=1}^{4} p_i log p_i
subject to the constraint
∑_{i=1}^{4} p_i = 1
Thus, we want to maximize
−∑_{i=1}^{4} p_i log p_i + λ (∑_{i=1}^{4} p_i − 1)
Differentiating with respect to p_j, we require
∂/∂p_j ( −∑_{i=1}^{4} p_i log p_i + λ (∑_{i=1}^{4} p_i − 1) ) = 0
which gives us
−log p_j − 1/ln 2 + λ = 0
This shows us that all p_j should be equal, since they depend only on λ. Using
the constraint ∑ p_i = 1, we get that all p_i = 1/4.
Thus, the distribution that gives the maximum entropy is the uniform distribution.
b) It is easy to see that the minimal entropy H(Z) = 0 can be achieved when one
probability is 1 and the others are 0. There are four such distributions, and thus
p_Z is not unique.
1.5 a) Hint: ∑_{i=0}^{∞} q^i = 1/(1 − q), |q| < 1.
b) Hint: ∑_{i=0}^{∞} i q^i = q/(1 − q)², |q| < 1.
H(U) = (−q log q − (1 − q) log(1 − q)) / (1 − q) = H_b(q)/(1 − q)
c) E{U} = q/(1 − q)
1.6 −ln 2 · I(X; Y) = ln 2 · (H(X | Y) − H(X)) = ln 2 · (H(X, Y) − H(X) − H(Y))
= E{ ln [p(X) p(Y) / p(X, Y)] } ≤ E{ p(X) p(Y) / p(X, Y) } − 1 = 0
since E{ p(X) p(Y) / p(X, Y) } = ∑_{x,y} p(x, y) · p(x) p(y) / p(x, y) = 1.
Hence I(X; Y) = H(X) − H(X | Y) ≥ 0.
1.7 a) H + 1
b) H +p
c) 2H
1.8 a) H(f(X) | X) = { set Z = f(X) } = H(Z | X)
= −∑_{b_j ∈ A_Z} ∑_{a_i ∈ A_X} p_XZ(a_i, b_j) log p_Z|X(b_j | a_i) = 0
since p_XZ(a_i, b_j) = 0 whenever f(a_i) ≠ b_j, and log p_Z|X(b_j | a_i) = log 1 = 0 whenever f(a_i) = b_j.
b) H(X, f(X)) = { chain rule } = H(X) + H(f(X) | X) = H(X), since H(f(X) | X) = 0 according to a).
c) H(X) − H(f(X)) = { according to b) } = H(X, f(X)) − H(f(X)) = { chain rule } = H(X | f(X)) ≥ 0
1.9 According to problem 1.8, applying a deterministic function to a random variable
can not increase the entropy, which immediately gives us the right inequality.
For the left inequality we set Z = X +Y and show that
H(Z | Y) = −∑_{z,y} p_ZY(z, y) log [p_ZY(z, y) / p_Y(y)] = { x = z − y }
= −∑_{x,y} p_XY(x, y) log [p_XY(x, y) / p_Y(y)]
=(1) −∑_{x,y} p_X(x) p_Y(y) log [p_X(x) p_Y(y) / p_Y(y)]
= (∑_y p_Y(y)) · (−∑_x p_X(x) log p_X(x)) = H(X)
At (1) we used the independence. We can now write
H(X + Y) ≥ H(X + Y | Y) = H(X)
See problem 1.6 for a proof that conditioning can't increase the entropy.
1.10 H(X | Y) + H(Y | Z) ≥(1) H(X | Y, Z) + H(Y | Z)
= H(X | Y, Z) + H(Y | Z) + H(Z) − H(Z)    [the first three terms equal H(X, Y, Z) by the chain rule]
= H(X, Y, Z) − H(Z)
=(2) H(X | Z) + H(Y | X, Z) ≥ H(X | Z)
where we at (1) used the fact that conditioning never increases the entropy and at
(2) we used the chain rule: H(X, Y, Z) = H(Z) + H(X | Z) + H(Y | X, Z).
1.11 The chain rule is proved by
H(X, Y) = E{ −log p_XY(X, Y) } = E{ −log [p_X|Y(X | Y) p_Y(Y)] }
        = E{ −log p_X|Y(X | Y) } + E{ −log p_Y(Y) }
        = H(X | Y) + H(Y)
Now let X = X_{n+1} and Y = (X_1, ..., X_n).
1.12 a) Let (B_8, B_4, B_2, B_1) be four random variables describing the bits. Then the
entropies will be
H(B_8) = −(1/3) log(1/3) − (2/3) log(2/3) ≈ 0.9183
H(B_4) = −(1/3) log(1/3) − (2/3) log(2/3) ≈ 0.9183
H(B_2) = −(1/2) log(1/2) − (1/2) log(1/2) = 1
H(B_1) = −(1/2) log(1/2) − (1/2) log(1/2) = 1
b) H(X) = log 12 ≈ 3.5850
Note that this is smaller than the sum of the entropies for the different bits,
since the bits aren't independent of each other.
1.13 The transition matrix P of the source is
P =
  0.8  0.1  0.1
  0.5  0.5  0
  0.5  0    0.5
The stationary distribution w = (w_0, w_1, w_2) is given by the equation system
w = wP. Replace one of the equations with the equation w_0 + w_1 + w_2 = 1
and solve the equation system. This gives us the solution
w = (1/7)(5, 1, 1) ≈ (0.714, 0.143, 0.143)
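The stationary distribution can also be checked numerically. The sketch below (assuming NumPy; the variable names are ours) does exactly what the text describes: it replaces one balance equation with the normalization constraint and solves the resulting linear system.

    import numpy as np

    P = np.array([[0.8, 0.1, 0.1],
                  [0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.5]])

    # Solve w P = w together with w_0 + w_1 + w_2 = 1:
    # (P^T - I) w = 0, with the last equation replaced by the sum constraint.
    A = P.T - np.eye(3)
    A[-1, :] = 1.0
    b = np.array([0.0, 0.0, 1.0])
    w = np.linalg.solve(A, b)
    print(w)   # approximately [0.714 0.143 0.143] = (5, 1, 1)/7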
1.14 a) H(X_i) = −(5/7) log(5/7) − 2 · (1/7) log(1/7) ≈ 1.1488 [bits/symbol].
b) The block probabilities are given by p(x_i, x_{i+1}) = p(x_i) · p(x_{i+1} | x_i)
symbol pair probability
00 5/7 0.8 = 8/14
01 5/7 0.1 = 1/14
02 5/7 0.1 = 1/14
10 1/7 0.5 = 1/14
11 1/7 0.5 = 1/14
12 0
20 1/7 0.5 = 1/14
21 0
22 1/7 0.5 = 1/14
H(X_i, X_{i+1}) = −(8/14) log(8/14) − 6 · (1/14) log(1/14) ≈ 2.0931 [bits/pair] (≈ 1.0465 [bits/symbol]).
The entropy of pairs is less than twice the memoryless entropy.
c) Since the source is of order 1, the entropy rate is given by H(X_i | X_{i−1}). According
to the chain rule we get
H(X_i | X_{i−1}) = H(X_{i−1}, X_i) − H(X_{i−1}) ≈ 0.9442
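The three numbers in a)-c) can be reproduced with a short script (a sketch; it reuses the transition matrix and the stationary distribution w = (5, 1, 1)/7 from 1.13):

    import math

    P = [[0.8, 0.1, 0.1],
         [0.5, 0.5, 0.0],
         [0.5, 0.0, 0.5]]
    w = [5/7, 1/7, 1/7]                       # stationary distribution from 1.13

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    pair = [w[i] * P[i][j] for i in range(3) for j in range(3)]
    print(H(w))            # ~1.1488  memoryless entropy
    print(H(pair))         # ~2.0931  block entropy H(X_i, X_{i+1})
    print(H(pair) - H(w))  # ~0.9442  entropy rate H(X_i | X_{i-1})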
1.15 The transition matrix P of the source is
P =
  0.7  0    0.3  0
  0.9  0    0.1  0
  0    0.4  0    0.6
  0    0.2  0    0.8
The stationary distribution w = (w_aa, w_ab, w_ba, w_bb) is given by the equation system
w = wP. Replace one of the equations with the equation w_aa + w_ab + w_ba + w_bb = 1
and solve the system. This gives us the solution
w = (1/8)(3, 1, 1, 3)
The stationary distribution is of course also the probabilities for pairs of symbols,
so we can directly calculate H(X_i, X_{i−1}):
H(X_i, X_{i−1}) = −2 · (3/8) log(3/8) − 2 · (1/8) log(1/8) ≈ 1.8113
The probabilities for single symbols are given by the marginal distribution
p(a) = p(aa) + p(ab) = 0.5,  p(b) = p(ba) + p(bb) = 0.5
H(X_i) = 1
The probabilities for three symbols are given by p(x_i, x_{i−1}, x_{i−2}) = p(x_{i−1}, x_{i−2}) · p(x_i | x_{i−1}, x_{i−2}):
p(aaa) = 3/8 · 0.7 = 21/80,   p(baa) = 3/8 · 0.3 = 9/80
p(aba) = 1/8 · 0.4 = 4/80,    p(bba) = 1/8 · 0.6 = 6/80
p(aab) = 1/8 · 0.9 = 9/80,    p(bab) = 1/8 · 0.1 = 1/80
p(abb) = 3/8 · 0.2 = 6/80,    p(bbb) = 3/8 · 0.8 = 24/80
which gives us
H(X_i, X_{i−1}, X_{i−2}) ≈ 2.5925
Now we can calculate the conditional entropies using the chain rule
H(X_i | X_{i−1}) = H(X_i, X_{i−1}) − H(X_{i−1}) ≈ 0.8113
H(X_i | X_{i−1}, X_{i−2}) = H(X_i, X_{i−1}, X_{i−2}) − H(X_{i−1}, X_{i−2}) ≈ 0.7812
1.16 a) Given states (x_i, x_{i−1}), the state diagram looks like
[State diagram with states ww, wb, bw, bb; the transition probabilities are the conditional probabilities p(x_i | x_{i−1}, x_{i−2}) given in the problem.]
The stationary probabilities for this model are
w_ww = 54/68,  w_wb = 3/68,  w_bw = 3/68,  w_bb = 8/68
These probabilities are also probabilities for pairs p(x_i, x_{i−1}).
b) The pair probabilities from above give us the entropy
H(X_i, X_{i−1}) ≈ 1.0246. Probabilities for single symbols can be found as marginal
probabilities
p(w) = p(w, w) + p(w, b) = 57/68,  p(b) = p(b, w) + p(b, b) = 11/68
which gives us the entropy H(X_i) ≈ 0.6385. Using the chain rule, we find
H(X_i | X_{i−1}) = H(X_i, X_{i−1}) − H(X_{i−1}) ≈ 0.3861. Finally, we need probabilities
for three symbols
p(x_i, x_{i−1}, x_{i−2}) = p(x_{i−1}, x_{i−2}) · p(x_i | x_{i−1}, x_{i−2})
p(w, w, w) = 513/680,  p(b, w, w) = 27/680
p(w, w, b) = 27/680,   p(b, w, b) = 3/680
p(w, b, w) = 6/680,    p(b, b, w) = 24/680
p(w, b, b) = 24/680,   p(b, b, b) = 56/680
This gives us the entropy H(X_i, X_{i−1}, X_{i−2}) ≈ 1.4083.
Using the chain rule we find
H(X_i | X_{i−1}, X_{i−2}) = H(X_i, X_{i−1}, X_{i−2}) − H(X_{i−1}, X_{i−2}) ≈ 0.3837.
2.1 Yes, since Kraft's inequality is fulfilled:
∑_{i=1}^{8} 2^{−l_i} = 53/64 < 1
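A quick numerical check of the Kraft sum (a sketch; exact arithmetic via fractions):

    from fractions import Fraction

    lengths = [2, 2, 3, 4, 4, 5, 5, 6]
    kraft_sum = sum(Fraction(1, 2 ** l) for l in lengths)
    print(kraft_sum, kraft_sum <= 1)   # 53/64 True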
2.2 Since the probabilities are dyadic, it is possible to construct a code with the codeword
lengths l_i = −log p(i) = i. This code will have a rate that is equal to the entropy
of the source and will thus be optimal.
[Code tree: symbol 1 at depth 1, symbol 2 at depth 2, symbol 3 at depth 3, symbol 4 at depth 4, and so on.]
R = 2 [bits/symbol].
2.3 a) The lowest rate is given by the entropy rate of the source. Since the source is
memory-less, the entropy rate is
−0.6 log 0.6 − 0.3 log 0.3 − 0.1 log 0.1 ≈ 1.2955 [bits/symbol]
b) Codeword lengths and one possible assignment of codewords:
symbol length codeword
x 1 0
y 2 10
z 2 11
The average codeword length is 1.4 bits/codeword and the average rate is 1.4
bits/symbol.
c) Codeword lengths (not unique) and one possible assignment of codewords:
symbol length codeword
xx 1 0
xy 3 100
xz 4 1100
yx 3 101
yy 4 1110
yz 5 11110
zx 4 1101
zy 6 111110
zz 6 111111
The average codeword length is 2.67 bits/codeword and the average rate is 1.335
bits/symbol.
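The codeword lengths (and therefore the rates) in b) and c) can be verified with a small Huffman routine. This is a sketch using Python's heapq module; function and variable names are ours, and the routine returns code lengths only, which is all that is needed for the rate.

    import heapq
    from itertools import count

    def huffman_lengths(probs):
        """Return a Huffman codeword length for each probability in probs."""
        tick = count()                      # tie-breaker so heap tuples always compare
        heap = [(p, next(tick), [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)
            p2, _, s2 = heapq.heappop(heap)
            for i in s1 + s2:               # every symbol in the merged set gets one bit deeper
                lengths[i] += 1
            heapq.heappush(heap, (p1 + p2, next(tick), s1 + s2))
        return lengths

    p1 = [0.6, 0.3, 0.1]
    p2 = [a * b for a in p1 for b in p1]    # pairs of symbols (memoryless source)
    for probs, n in [(p1, 1), (p2, 2)]:
        L = huffman_lengths(probs)
        print(sum(p * l for p, l in zip(probs, L)) / n)   # 1.4 and ~1.335 bits/symbol

Individual codewords may differ from the tables above, but any optimal prefix code gives the same average length.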
2.4 Approximate probabilities for pairs of symbols: {0.4585, 0.0415, 0.0415, 0.4585}
[Huffman tree for the pairs aa, bb, ab, ba.]
Average codeword length l ≈ 1.6245 [bits/codeword], giving R = l/2 ≈ 0.8122 [bits/symbol]
Approximate probabilities for three symbols: {0.4204, 0.0381, 0.0034, 0.0381, 0.0381, 0.0034, 0.0381, 0.4204}
Average codeword length l ≈ 1.9496 [bits/codeword], giving R = l/3 ≈ 0.6499 [bits/symbol]
It's better to code three symbols at a time.
2.5 a) p(r) = p^{r−1}(1 − p)
b) The average run length is
E{R} = ∑_{r=1}^{∞} r (1 − p) p^{r−1} = 1/(1 − p) = 1/(1 − 0.5^{1/8}) ≈ 12.05 [symbols/run]
c) H(R) = −∑_{r=1}^{∞} (1 − p) p^{r−1} log((1 − p) p^{r−1})
   = −(1 − p) log(1 − p) ∑_{r=1}^{∞} p^{r−1} − (1 − p) log p ∑_{r=1}^{∞} (r − 1) p^{r−1}
   = −(1 − p) · (1/(1 − p)) · log(1 − p) − (1 − p) · (p/(1 − p)²) · log p
   = H_b(p)/(1 − p) ≈ 4.972 [bits/run]
d) H(R)/E{R} = H_b(p) ≈ 0.4126 [bits/symbol]
We could of course get the same answer by calculating the entropy rate directly
from the source.
2.6 a) If we do a direct mapping to the binary representation we get
l = E{codeword length/run}
  = 4 ∑_{r=1}^{14} (1 − p) p^{r−1} + 8 ∑_{r=15}^{29} (1 − p) p^{r−1} + 12 ∑_{r=30}^{44} (1 − p) p^{r−1} + ...
  = 4 ∑_{r=1}^{∞} (1 − p) p^{r−1} + 4 p^{14} ∑_{r=1}^{∞} (1 − p) p^{r−1} + 4 p^{29} ∑_{r=1}^{∞} (1 − p) p^{r−1} + ...
  = 4 ∑_{r=1}^{∞} (1 − p) p^{r−1} (1 + p^{14} ∑_{i=0}^{∞} p^{15i}) = 4 (1 + p^{14}/(1 − p^{15})) ≈ 5.635 [bits/run]
From problem 2.5 we know that the average run length is E{R} ≈ 12.05 [symbols/run],
therefore the data rate is l/E{R} ≈ 0.468 [bits/symbol].
b) In the same way as we did in a) we get
l = 5 (1 + p^{30}/(1 − p^{31})) ≈ 5.399 [bits/run]
and the data rate ≈ 0.448 [bits/symbol].
2.7 a) Since runs of a and b have the same probabilities, we can use the same Golomb
code for both types of runs. Since the probability of a run of length r is p(r) =
p^{r−1}(1 − p) we get an optimal code if we choose m as
m = ⌈−1/log p⌉ = 8
b) A Golomb code with parameter m = 8 has 8 codewords of length 4, 8 codewords
of length 5, 8 codewords of length 6 etc.
The average codeword length is
l = E{codeword length/run}
  = 4 ∑_{r=1}^{8} (1 − p) p^{r−1} + 5 ∑_{r=9}^{16} (1 − p) p^{r−1} + 6 ∑_{r=17}^{24} (1 − p) p^{r−1} + ...
  = 4 ∑_{r=1}^{∞} (1 − p) p^{r−1} + p^{8} ∑_{r=1}^{∞} (1 − p) p^{r−1} + p^{16} ∑_{r=1}^{∞} (1 − p) p^{r−1} + ...
  = 4 + ∑_{i=1}^{∞} p^{8i} = 4 + p^{8}/(1 − p^{8}) = 5 [bits/run]
From problem 2.5 we know that the average run length is E{R} ≈ 12.05 [symbols/run],
therefore the data rate is l/E{R} ≈ 0.415 [bits/symbol].
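Since m = 8 is a power of two, the Golomb code degenerates to a Rice code: a unary quotient followed by a 3-bit remainder. Below is a sketch of such an encoder, coding n = r − 1 for a run of length r; this mapping and the bit convention (q ones followed by a zero) are assumptions of ours, other references use the opposite convention.

    def golomb_encode(n, m=8):
        """Golomb/Rice codeword for a non-negative integer n, m a power of two."""
        q, rem = divmod(n, m)
        k = m.bit_length() - 1              # number of remainder bits, here 3
        return '1' * q + '0' + format(rem, '0{}b'.format(k))

    # Mapping run length r to n = r - 1: runs 1..8 get 4 bits, 9..16 get 5 bits, ...
    for r in (1, 8, 9, 17):
        print(r, golomb_encode(r - 1))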
2.8 a) The theoretical lowest bound on the data rate is given by the entropy rate of
the source. The best model of the source that we can get, according to the
given information, is a Markov model of order 1. In the text areas the model
looks like
[State diagram, text area: P(w|b) = 0.5, P(b|b) = 0.5, P(b|w) = 0.1, P(w|w) = 0.9.]
The stationary probabilities for this source are w_b = 1/6 and w_w = 5/6 and the
entropy rate when we are in a text area is therefore
H_t = w_b · H_b(0.5) + w_w · H_b(0.9) ≈ 0.5575
In the same way, for image areas the entropy rate is
H_p ≈ 0.7857
The total entropy rate for the source is therefore
H = (4/5) H_t + (1/5) H_p ≈ 0.60313
This is the best estimate we can make. The real entropy rate of the source can
be lower, if the memory is longer than just one pixel.
b) The desired data rate can be achieved by coding blocks of three symbols. The
probabilities for the 8 different combinations of three symbols can be calculated
from
p(x_i, x_{i+1}, x_{i+2}) = p(x_i) · p(x_{i+1} | x_i) · p(x_{i+2} | x_i, x_{i+1})
The eight probabilities are
(1/120) · {1, 5, 5, 5, 5, 9, 9, 81}
Construct the code using the Huffman algorithm. The resulting data rate is
0.625 bits/pixel.
2.9 a) It is possible to get arbitrarily close to the entropy rate of the source. For the
given source, the entropy rate is given by
H(S_{n+1} | S_n) = w_A H(S_{n+1} | S_n = A) + w_B H(S_{n+1} | S_n = B) + w_C H(S_{n+1} | S_n = C) + w_D H(S_{n+1} | S_n = D)
where w_A etc. are the stationary probabilities for the states. These are calculated
from the following equation system, plus the fact that they should sum to 1:
(w_A  w_B  w_C  w_D) · P = (w_A  w_B  w_C  w_D),  where
P =
  0    0.5  0    0.5
  0    0.2  0    0.8
  0.1  0    0.9  0
  0    0    0.7  0.3
(w_A  w_B  w_C  w_D) = (1/731)(56, 35, 560, 80)
The entropy rate is
H(S_{n+1} | S_n) = (1/731)(56 H_b(0.5) + 35 H_b(0.2) + 560 H_b(0.1) + 80 H_b(0.3))
≈ 0.567 [bits/symbol]
b) Optimal tree codes can be constructed using Huffman's algorithm. When we
code single symbols we use the stationary probabilities. For example we can get
the following code
symbol probability codeword length
A 56/731 110 3
B 35/731 111 3
C 560/731 0 1
D 80/731 10 2
which gives a data rate of 993/731 ≈ 1.36 bits/symbol
The probabilities of pairs of symbols are obtained from p(x_i, x_{i+1}) = p(x_i) · p(x_{i+1} | x_i).
For instance we can get the following code
symbols probability codeword length
AB 28/731 1010 4
AD 28/731 1110 4
BB 7/731 10110 5
BD 28/731 1111 4
CA 56/731 100 3
CC 504/731 0 1
DC 56/731 110 3
DD 24/731 10111 5
The resulting data rate is (1/2) · 1331/731 ≈ 0.910 bits/symbol
2.10 F(0) = 0, F(1) = 0.6, F(2) = 0.8, F(3) = 1
l^(0) = 0
u^(0) = 1
l^(1) = 0 + (1 − 0) · 0.8 = 0.8
u^(1) = 0 + (1 − 0) · 1 = 1
l^(2) = 0.8 + (1 − 0.8) · 0 = 0.8
u^(2) = 0.8 + (1 − 0.8) · 0.6 = 0.92
l^(3) = 0.8 + (0.92 − 0.8) · 0 = 0.8
u^(3) = 0.8 + (0.92 − 0.8) · 0.6 = 0.872
l^(4) = 0.8 + (0.872 − 0.8) · 0.6 = 0.8432
u^(4) = 0.8 + (0.872 − 0.8) · 0.8 = 0.8576
The interval corresponding to the sequence is [0.8432, 0.8576). The size of the
interval is 0.0144, which means that we need to use at least ⌈−log 0.0144⌉ = 7 bits
in the codeword.
Alternative 1: The smallest number with 7 bits in the interval is (0.1101100)_2 =
0.84375. Since (0.1101101)_2 = 0.8515625 is also in the interval 7 bits will be enough,
and the codeword is 1101100.
Alternative 2: The middle point of the interval is 0.8504 = (0.110110011...)_2.
Truncate to 7+1 = 8 bits, which gives us the codeword 11011001.
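The interval computation can be reproduced with a few lines of exact arithmetic (a sketch; symbol indices 0, 1, 2 stand for a_1, a_2, a_3, and the names are ours):

    from fractions import Fraction as F

    F_cum = [F(0), F(6, 10), F(8, 10), F(1)]   # cumulative distribution F(0..3)

    def interval(symbols):
        low, high = F(0), F(1)
        for s in symbols:                      # s is a symbol index 0, 1 or 2
            width = high - low
            low, high = low + width * F_cum[s], low + width * F_cum[s + 1]
        return low, high

    lo, hi = interval([2, 0, 0, 1])            # a3 a1 a1 a2
    print(float(lo), float(hi))                # 0.8432 0.8576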
2.11 With six bits all values are stored as 1/64:ths. The distribution is then
F(0) = 0, F(1) = 38, F(2) = 51, F(3) = 64
l^(0) = 0 = (000000)_2
u^(0) = 63 = (111111)_2
l^(1) = 0 + ⌊(63 − 0 + 1) · 51/64⌋ = 51 = (110011)_2
u^(1) = 0 + ⌊(63 − 0 + 1) · 64/64⌋ − 1 = 63 = (111111)_2
Shift out 1 to the codeword, shift a 0 into l and a 1 into u
l^(1) = (100110)_2 = 38
u^(1) = (111111)_2 = 63
Shift out 1 to the codeword, shift a 0 into l and a 1 into u
l^(1) = (001100)_2 = 12
u^(1) = (111111)_2 = 63
l^(2) = 12 + ⌊(63 − 12 + 1) · 0/64⌋ = 12 = (001100)_2
u^(2) = 12 + ⌊(63 − 12 + 1) · 38/64⌋ − 1 = 41 = (101001)_2
l^(3) = 12 + ⌊(41 − 12 + 1) · 0/64⌋ = 12 = (001100)_2
u^(3) = 12 + ⌊(41 − 12 + 1) · 38/64⌋ − 1 = 28 = (011100)_2
Shift out 0 to the codeword, shift a 0 into l and a 1 into u
l^(3) = (011000)_2 = 24
u^(3) = (111001)_2 = 57
l^(4) = 24 + ⌊(57 − 24 + 1) · 38/64⌋ = 44 = (101100)_2
u^(4) = 24 + ⌊(57 − 24 + 1) · 51/64⌋ − 1 = 50 = (110010)_2
Since there are no more symbols we don't need to do any more shift operations.
The codeword is the bits that have been shifted out before, plus all of l^(4), i.e.
110101100.
2.12 With six bits all values are stored as 1/64:ths. The distribution is then
F(0) = 0, F(1) = 38, F(2) = 51, F(3) = 64
This means that the interval 0-37 belongs to symbol a_1, the interval 38-50 to symbol
a_2 and the interval 51-63 to symbol a_3.
l^(0) = (000000)_2 = 0
u^(0) = (111111)_2 = 63
t = (101110)_2 = 46
⌊((46 − 0 + 1) · 64 − 1)/(63 − 0 + 1)⌋ = 46 → a_2
l^(1) = 0 + ⌊(63 − 0 + 1) · 38/64⌋ = 38 = (100110)_2
u^(1) = 0 + ⌊(63 − 0 + 1) · 51/64⌋ − 1 = 50 = (110010)_2
Shift out 1, shift 0 into l, 1 into u and a new bit from the codeword into t.
l^(1) = (001100)_2 = 12
u^(1) = (100101)_2 = 37
t = (011101)_2 = 29
⌊((29 − 12 + 1) · 64 − 1)/(37 − 12 + 1)⌋ = 44 → a_2
l^(2) = 12 + ⌊(37 − 12 + 1) · 38/64⌋ = 27 = (011011)_2
u^(2) = 12 + ⌊(37 − 12 + 1) · 51/64⌋ − 1 = 31 = (011111)_2
The first three bits are the same in l and u. Shift them out, shift zeros into l, ones
into u and three new bits from the codeword into t.
l^(2) = (011000)_2 = 24
u^(2) = (111111)_2 = 63
t = (101010)_2 = 42
⌊((42 − 24 + 1) · 64 − 1)/(63 − 24 + 1)⌋ = 30 → a_1
l^(3) = 24 + ⌊(63 − 24 + 1) · 0/64⌋ = 24 = (011000)_2
u^(3) = 24 + ⌊(63 − 24 + 1) · 38/64⌋ − 1 = 46 = (101110)_2
The first two bits in l are 01 and the first two in u are 10. Shift l, u and t one step,
invert the new most significant bit and shift 0 into l, 1 into u and a new bit from
the codeword into t
l^(3) = (010000)_2 = 16
u^(3) = (111101)_2 = 61
t = (110100)_2 = 52
⌊((52 − 16 + 1) · 64 − 1)/(61 − 16 + 1)⌋ = 51 → a_3
l^(4) = 16 + ⌊(61 − 16 + 1) · 51/64⌋ = 52 = (110100)_2
u^(4) = 16 + ⌊(61 − 16 + 1) · 64/64⌋ − 1 = 61 = (111101)_2
The first two bits are the same in l and u. Shift them out, shift zeros into l, ones
into u and two new bits from the codeword into t.
l^(4) = (010000)_2 = 16
u^(4) = (110111)_2 = 55
t = (010000)_2 = 16
⌊((16 − 16 + 1) · 64 − 1)/(55 − 16 + 1)⌋ = 1 → a_1
Since we have now decoded five symbols we don't have to do any more calculations.
The decoded sequence is
a_2 a_2 a_1 a_3 a_1
2.13 a) Under the assumption that we have ordered the colours in the same order as in
the table, with white closest to 0, the interval is [0.05 , 0.07538).
b) green, green
2.14 The probabilities for single symbols p(x_n) can be found as the marginal distribution,
i.e.
p(a) = p(aa) + p(ab) = 2/7,  p(b) = p(ba) + p(bb) = 5/7
The conditional probabilities p(x_{n+1} | x_n) are
p(a|a) = p(aa)/p(a) = (1/7)/(2/7) = 0.5
p(b|a) = p(ab)/p(a) = (1/7)/(2/7) = 0.5
p(a|b) = p(ba)/p(b) = (1/7)/(5/7) = 0.2
p(b|b) = p(bb)/p(b) = (4/7)/(5/7) = 0.8
The interval corresponding to the sequence is [0.424, 0.488)
The interval size is 0.064, which means we need at least ⌈−log 0.064⌉ = 4 bits in
our codeword.
Alternative 1: The smallest number with 4 bits in the interval is (0.0111)_2 = 0.4375.
Since (0.1000)_2 = 0.5 is not inside the interval 4 bits will not be enough, instead we
must use 5 bits. We will use the number (0.01110)_2 = 0.4375 and thus the codeword
is 01110.
Alternative 2: The middle point of the interval is 0.456 = (0.0111010...)_2. Truncate
to 4+1 = 5 bits, which gives us the codeword 01110.
2.15 Assuming that we always place the b interval closest to 0, the sequence corresponds
to the interval [0.078253, 0.088). The interval size is 0.009747 and thus we will need
at least ⌈−log₂ 0.009747⌉ = 7 bits in our codeword, maybe one more.
Write the lower limit as a binary number:
0.078253 = (0.0001010000001...)_2
The smallest binary number with seven bits that is larger than the lower limit is
(0.0001011)_2 = 0.0859375.
Since (0.0001100)_2 = 0.09375 > 0.088, seven bits will not be enough to ensure that
we have a prefix code. Thus we will have to use eight bits.
The codeword is thus: 00010101
If we instead place the w interval closest to 0, the interval for the sequence becomes
[0.912, 0.921747) and the codeword becomes 11101010
2.16 a) In a block of length n, there can be between 0 and n ones. w ones can be placed
in C(n, w) different ways in a block of length n. Thus we will need ⌈log(n + 1)⌉ +
⌈log C(n, w)⌉ bits to code the block.
b) To code how many ones there are, we need 4 bits (we need to send a number
between 0 and 15). The sequence has 2 ones. There are C(15, 2) = 105 different
ways to place two ones in a block of length 15. Thus, we need ⌈log 105⌉ = 7
bits to code how the 2 ones are placed. In total we need 4 + 7 = 11 bits to code
the given sequence.
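The bit counts in a) and b) follow directly from the binomial coefficient; a minimal numerical check (sketch):

    import math

    n, w = 15, 2
    bits = math.ceil(math.log2(n + 1)) + math.ceil(math.log2(math.comb(n, w)))
    print(math.comb(n, w), bits)   # 105 and 4 + 7 = 11 bits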
c) Assume that the symbol probabilities are p and 1 − p. The entropy rate of the
source is −p log p − (1 − p) log(1 − p) = H_b(p).
The number of bits/symbol is given by
R = (1/n) ∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} (⌈log(n + 1)⌉ + ⌈log C(n, w)⌉)
  < (1/n) (2 + log(n + 1) + ∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} log C(n, w))
where we used the fact that ⌈x⌉ < x + 1.
∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} log C(n, w)
  = ∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} log( C(n, w) p^{n−w} (1 − p)^{w} )   (1)
  − ∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} log p^{n−w}                           (2)
  − ∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} log (1 − p)^{w}                       (3)
(1) is the negative entropy of a binomial distribution, thus it is ≤ 0.
(3) = −∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} log (1 − p)^{w}
    = −log(1 − p) ∑_{w=0}^{n} C(n, w) p^{n−w} (1 − p)^{w} w
    = −n log(1 − p) (1 − p) ∑_{w=1}^{n} C(n − 1, w − 1) p^{n−w} (1 − p)^{w−1}
    = −n (1 − p) log(1 − p)
where we used the fact that the last sum is just the sum of a binomial
distribution, which of course is 1. In the same way we can show that
(2) = −n p log p. Since the source is memoryless, the entropy H_b(p) is a lower
bound for R. Thus we get
H_b(p) ≤ R < (1/n)(2 + log(n + 1) − n p log p − n (1 − p) log(1 − p)) =
H_b(p) + (1/n)(2 + log(n + 1))
So when n → ∞ the rate R will approach the entropy of the source, showing
that the coding method is universal.
2.17 Assuming offset 0 at the right of the history buffer.
Codewords:
(offset, length, new symbol)   Binary codeword
(0,0,a)    0000 0000 0
(0,0,b)    0000 0000 1
(1,2,b)    0001 0010 1
(2,1,a)    0010 0001 0
(6,5,b)    0110 0101 1
(8,6,a)    1000 0110 0
(1,4,b)    0001 0100 1
(15,3,a)   1111 0011 0
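A straightforward LZ77 encoder for this setting is sketched below (our own function; offset 0 is the rightmost symbol of a 16-symbol history buffer and matches are capped at length 15). Ties between equally long matches can be resolved differently, so an individual offset may differ from the table above while still decoding to the same sequence.

    def lz77_encode(data, history=16, max_len=15):
        """Encode data as (offset, length, next symbol) triples, offset 0 = rightmost."""
        i, out = 0, []
        while i < len(data):
            best_off, best_len = 0, 0
            for j in range(max(0, i - history), i):      # candidate match starts
                length = 0
                # the match may run into the not-yet-coded part (classic LZ77)
                while (length < max_len and i + length < len(data) - 1
                       and data[j + length] == data[i + length]):
                    length += 1
                if length >= best_len:                   # prefer the most recent match on ties
                    best_len, best_off = length, i - 1 - j
            out.append((best_off, best_len, data[i + best_len]))
            i += best_len + 1
        return out

    for triple in lz77_encode("ababbaaababbbbaaabaaaaaababba"):
        print(triple)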
2.18 The coded sequence of pairs <index, new symbol> is:
< 0, a > < 0, b > < 1, b > < 2, a > < 1, a > < 4, b > < 2, b >
< 4, a > < 3, a > < 5, a > < 5, b > < 3, b > . . .
If we assume that the dictionary is of size 16 we will need 4+1=5 bits to code each
pair.
The dictionary now looks like
index sequence index sequence index sequence
0 - 5 aa 10 aaa
1 a 6 bab 11 aab
2 b 7 bb 12 abb
3 ab 8 baa
4 ba 9 aba
2.19 The coded sequence of <index> is:
< 0 > < 1 > < 2 > < 3 > < 0 > < 2 > < 4 > < 1 >
< 5 > < 7 > < 6 > < 12 > < 3 > < 9 > . . .
If we assume that the dictionary is of size 16 we will need 4 bits to code each index.
The dictionary now looks like
index sequence index sequence index sequence
0 a 6 aa 12 aaa
1 b 7 aba 13 aaab
2 ab 8 abbb 14 bab
3 ba 9 bb 15 bba
4 abb 10 baaa
5 baa 11 abaa
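The index sequence in 2.19 can be checked with a compact LZW encoder (a sketch; the dictionary is grown without the size-16 limit, which does not matter for this short input, and the final index is the flush of the last phrase):

    def lzw_encode(data, alphabet):
        """LZW: output dictionary indices, growing the dictionary as we go."""
        dictionary = {s: i for i, s in enumerate(alphabet)}
        out, current = [], ""
        for symbol in data:
            if current + symbol in dictionary:
                current += symbol
            else:
                out.append(dictionary[current])
                dictionary[current + symbol] = len(dictionary)   # new entry
                current = symbol
        out.append(dictionary[current])                          # flush the last phrase
        return out, dictionary

    indices, d = lzw_encode("ababbaaababbbbaaabaaaaaababba", "ab")
    print(indices)   # [0, 1, 2, 3, 0, 2, 4, 1, 5, 7, 6, 12, 3, 9, 0]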
2.20 The decoded sequence is
fafafafedededdggg . . .
and the dictionary looks like
index word index word index word index word
0 a 5 f 10 faf 15 edd
1 b 6 g 11 fafe 16 dg
2 c 7 h 12 ed 17 gg
3 d 8 fa 13 de 18
4 e 9 af 14 ede 19
and the next word to add at position 18 is gg followed by the first symbol of the
next decoded word.
2.21 The history buffer size will be 2^6 = 64. The alphabet size is 8, thus we will use
log₂ 8 = 3 bits to code a symbol. If we code a single symbol we will use a total of
1 + 3 = 4 bits and if we code a match we will use a total of 1 + 6 + 3 = 10 bits. This
means that it is better to code matches of length 1 and 2 as single symbols. Since
we use 3 bits for the lengths, we can then code match lengths between 3 and 10.
f o l c codeword sequence
0 b 0 001 b
0 a 0 000 a
0 g 0 110 g
1 1 6 1 000001 011 agagag
0 e 0 100 e
1 9 3 1 001001 000 bag
1 4 4 1 000100 001 geba
0 d 0 011 d
0 h 0 111 h
0 a 0 000 a
0 d 0 011 d
0 e 0 100 e
1 3 6 1 000011 011 hadeha
0 f 0 101 f
2.22 a) The transformed sequence is dddabbbaac and the index is 9 (assuming that the
first index is 0).
b) After mtf the sequence is 3,0,0,1,2,0,0,1,0,3.
2.23 Inverse mtf gives the sequence yxzzzxzzyyy.
Inverse BWT gives the sequence xyzzyzyzzyx (assuming that the first index is 0).
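The forward transforms used in 2.22 can be sketched as follows (our own helper functions; this naive BWT sorts all rotations, which is fine for short sequences, and the index convention with the first row as 0 matches the solution):

    def bwt(s):
        """Burrows-Wheeler transform: last column of the sorted rotations + row index."""
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations), rotations.index(s)

    def mtf(s, alphabet):
        table = list(alphabet)
        out = []
        for c in s:
            i = table.index(c)
            out.append(i)
            table.insert(0, table.pop(i))   # move the used symbol to the front
        return out

    t, idx = bwt("dadadabbbc")
    print(t, idx)            # dddabbbaac 9
    print(mtf(t, "abcd"))    # [3, 0, 0, 1, 2, 0, 0, 1, 0, 3]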
3.1 h(X) = (1/2) log(e a²) = (1/2) log(6 e σ²)
3.2 h(X) = log(e a) = (1/2) log(e² σ²)
3.3 R(D) =
  B log( 2√2 σ² / (3D) )             for 0 ≤ D ≤ 2σ²/3
  (1/2) B log( 2σ² / (3D − σ²) )     for 2σ²/3 ≤ D ≤ σ²
  0                                   for D > σ²
3.4 R(D) =
  (B/ln 2) · ( ln[ 1 / (1 − √(1 − D/σ²)) ] − √(1 − D/σ²) )    for 0 ≤ D ≤ σ²
  0                                                            for D > σ²
3.5 We use the rate-distortion function to calculate the theoretically lowest distortion.
The process has the variance
σ² = ∫_{−0.5}^{0.5} Φ(θ) dθ = 17 · 0.6 = 10.2
The rate-distortion function is
R(D) = 0.6 · (1/2) · log( (17 · 0.6)/D ) = 0.3 log(σ²/D)
The minimum distortion for R = 3 is
D = σ² · 2^{−3/0.3} = 10.2/1024 ≈ 0.009961
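A one-line numerical check of the final step (sketch):

    sigma2 = 17 * 0.6              # process variance, 10.2
    R = 3.0                        # bits/sample
    D = sigma2 * 2 ** (-R / 0.3)
    print(D)                       # ~0.009961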