2 Source Coding
In this section, we look at the “source encoder” part of the system. This
part removes redundancy from the message stream or sequence. We will
focus only on binary source coding.
2.1. The material in this section is based on [C & T Ch 2, 4, and 5].
2.5. It is estimated that we may need only about 1 bit per character for
English text.
Definition 2.6. Discrete Memoryless Sources (DMS): Let us be more
specific about the information source.
• The message that the information source produces can be represented
by a vector of characters X1 , X2 , . . . , Xn .
◦ A perpetual message source would produce a never-ending sequence
of characters X1 , X2 , . . ..
• These Xk ’s are random variables (at least from the perspective of the
decoder; otherwise, there is no need for communication).
• For simplicity, we will assume our source to be discrete and memoryless.
◦ Assuming a discrete source means that the random variables are all
discrete; that is, they have supports which are countable. Recall
that “countable” means “finite” or “countably infinite”.
∗ We will further assume that they all share the same support
and that the support is finite. This support is called the source
alphabet.
◦ Assuming a memoryless source means that there is no dependency
among the characters in the sequence.
∗ More specifically,
p_{X1,X2,...,Xn}(x1, x2, . . . , xn) = p_{X1}(x1) × p_{X2}(x2) × · · · × p_{Xn}(xn).   (1)
∗ This means the current model of the source is far from the
source of normal English text. English text has dependency
among the characters. However, this simple model provides a
good starting point.
∗ We will further assume that all of the random variables share
the same probability mass function (pmf). We denote this
shared pmf by p_X(x), in which case (1) becomes
p_{X1,X2,...,Xn}(x1, x2, . . . , xn) = p_X(x1) × p_X(x2) × · · · × p_X(xn).   (2)
· We will also assume that the pmf p_X(x) is known. In practice,
there is an extra step of estimating this p_X(x).
· To save space, we may see the pmf p_X(x) written simply as
p(x), i.e. without the subscript part.
∗ The shared support of X, which is usually denoted by S_X, becomes
the source alphabet. Note that we also often see the support of X
denoted by a script letter 𝒳.
• In conclusion, our simplified source can be characterized by a
random variable X. So, we only need to specify its pmf p_X(x). A small
simulation sketch is given below.
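To make the DMS model concrete, here is a minimal simulation sketch in Python (not part of the original notes; the helper name dms and its arguments are only illustrative). It draws an i.i.d. sequence according to a fixed pmf, which is exactly the behavior assumed in (2); the alphabet {a, b, c, d} and the probabilities {0.04, 0.03, 0.9, 0.03} are the ones that reappear later in these notes.

    import random

    # Source alphabet and pmf; any finite alphabet with probabilities
    # summing to 1 works the same way.
    alphabet = ['a', 'b', 'c', 'd']
    pmf      = [0.04, 0.03, 0.90, 0.03]

    def dms(n, alphabet, pmf, seed=None):
        """Draw X1, ..., Xn i.i.d. from the given pmf (a discrete memoryless source)."""
        rng = random.Random(seed)
        return rng.choices(alphabet, weights=pmf, k=n)

    print(''.join(dms(20, alphabet, pmf)))   # mostly 'c', since p(c) = 0.9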
Example 2.7. See slides.
Definition 2.8. An encoder c(·) is a function that maps each character
in the (source) alphabet to a corresponding (binary) codeword.
• In particular, the codeword corresponding to a source character x is
denoted by c(x).
• Each codeword is constructed from a code alphabet.
◦ A binary codeword is constructed from a two-symbol alphabet,
wherein the two symbols are usually taken as 0 and 1.
◦ It is possible to consider non-binary codewords. The Morse code
discussed in Example 2.13 is one such example.
• Mathematically, we write
c : S_X → {0, 1}*,
where
{0, 1}* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, . . .}
is the set of all finite-length binary strings (ε is the empty string). The
length of the codeword c(x) is denoted by ℓ(x).
◦ In fact, writing this as ℓ(c(x)) may be clearer because we can see
that the length depends on the choice of the encoder. However, we
shall follow the notation above.
Example 2.9. c(red) = 00, c(blue) = 11 is a source code for SX =
{red, blue}.
Example 2.10. Suppose the message is a sequence of basic English words
which occur according to the probabilities provided in the table below.
Definition 2.11. The expected length of a code c(·) for (a DMS source
which is characterized by) a random variable X with probability mass func-
tion pX (x) is given by
E[ℓ(X)] = Σ_{x∈S_X} p_X(x) ℓ(x).
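As a quick numerical illustration of Definition 2.11, here is a sketch (Python, not from the notes). The pmf is the {0.04, 0.03, 0.9, 0.03} source that appears again below; the prefix code used here is a hypothetical one chosen only for illustration.

    # Expected codeword length E[l(X)] = sum_x p_X(x) * l(x)  (Definition 2.11).
    pmf  = {'a': 0.04, 'b': 0.03, 'c': 0.90, 'd': 0.03}
    code = {'a': '110', 'b': '1110', 'c': '0', 'd': '1111'}   # hypothetical prefix code

    expected_length = sum(p * len(code[x]) for x, p in pmf.items())
    print(expected_length)   # 0.04*3 + 0.03*4 + 0.9*1 + 0.03*4 = 1.26 bits per symbol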
would get the same answer if it is replaced by the set {1, 2, 3, 4}, or the
set {a, b, c, d}. All that matters is that the alphabet size is 4, and the
corresponding probabilities are {0.04, 0.03, 0.9, 0.03}.
Therefore, for brevity, we often find a DMS defined only by its
alphabet size and the list of probabilities.
Example 2.13. The Morse code is a reasonably efficient code for the En-
glish alphabet using an alphabet of four symbols: a dot, a dash, a letter
space, and a word space. [See Slides]
• Short sequences represent frequent letters (e.g., a single dot represents
E) and long sequences represent infrequent letters (e.g., Q is represented
by “dash,dash,dot,dash”).
Example 2.14. Thought experiment: Let’s consider the following code
x    p(x)    Codeword c(x)    ℓ(x)
1     4%          0            1
2     3%          1            1
3    90%          0            1
4     3%          1            1
This code is bad because we have ambiguity at the decoder. When a
codeword “0” is received, we don’t know whether to decode it as the source
symbol “1” or the source symbol “3”. If we want to have lossless source
coding, this ambiguity is not allowed.
Definition 2.15. A code is nonsingular if every source symbol in the
source alphabet is mapped to a different codeword.
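Nonsingularity is easy to check mechanically: the encoder must be one-to-one on the source alphabet. A minimal sketch in Python (the helper name is ours), applied to the codes of Examples 2.14 and 2.9:

    def is_nonsingular(code):
        """A code (dict: source symbol -> codeword) is nonsingular
        iff all codewords are distinct."""
        codewords = list(code.values())
        return len(set(codewords)) == len(codewords)

    # Code from Example 2.14: symbols 1 and 3 share codeword '0' (and 2, 4 share '1').
    print(is_nonsingular({1: '0', 2: '1', 3: '0', 4: '1'}))   # False
    print(is_nonsingular({'red': '00', 'blue': '11'}))        # True (Example 2.9)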
2.17. We usually wish to convey a sequence (string) of source symbols. So,
we will need to consider concatenation of codewords; that is, if our source
string is
X1 , X2 , X3 , . . .
then the corresponding encoded string is the concatenation
c(X1)c(X2)c(X3) · · · .
For the decoder to be able to recover the source string from this concatenation,
we can either
(i) use some extra mechanism, such as a special separator symbol between
codewords (which wastes bits), or
(ii) use uniquely decodable codes.
Definition 2.18. A code is called uniquely decodable (UD) if any en-
coded string has only one possible source string producing it.
Example 2.19. The code used in Example 2.16 is not uniquely decodable
because source string “2”, source string “34”, and source string “13” share
the same code string “010”.
2.20. It may not be easy to check unique decodability of a code. (See
Example 2.28.) Also, even when a code is uniquely decodable, one may
have to look at the entire string to determine even the first symbol in the
corresponding source string. Therefore, we focus on a subset of uniquely
decodable codes called prefix codes.
Definition 2.21. A code is called a prefix code if no codeword is a prefix6
of any other codeword.
• Equivalently, a code is called a prefix code if you can put all the
codewords into a binary tree where all of them are leaves.
6 String s1 is a prefix of string s2 if there exists a string s3, possibly empty, such that s2 = s1s3.
• A more appropriate name would be “prefix-free” code.
Example 2.22.
x Codeword c(x)
1 10
2 110
3 0
4 111
Example 2.27.
x Codeword c(x)
1 1
2 10
3 100
4 1000
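The prefix condition is also easy to check mechanically. A minimal sketch in Python (not from the notes; the helper name is ours), applied to the codes of Examples 2.22 and 2.27:

    def is_prefix_code(codewords):
        """Return True iff no codeword is a prefix of any other codeword."""
        for i, u in enumerate(codewords):
            for j, v in enumerate(codewords):
                if i != j and v.startswith(u):
                    return False
        return True

    print(is_prefix_code(['10', '110', '0', '111']))    # True  (Example 2.22)
    print(is_prefix_code(['1', '10', '100', '1000']))   # False (Example 2.27: '1' is a
                                                        # prefix of every other codeword)

Note that the code of Example 2.27 fails the prefix condition even though it is uniquely decodable (a "1" always marks the beginning of a new codeword), which is consistent with the diagram of code classes shown below.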
[Figure: Classes of codes. Prefix codes ⊂ UD codes ⊂ nonsingular codes ⊂ all codes.]
2.2 Huffman Coding
8 The class was the first ever in the area of information theory and was taught by Robert Fano at MIT
in 1951.
◦ Huffman wrote a term paper in lieu of taking a final examination.
◦ It should be noted that in the late 1940s, Fano himself (and independently, also Claude Shannon)
had developed a similar, but suboptimal, algorithm known today as the Shannon-Fano method. The
difference between the two algorithms is that the Shannon-Fano code tree is built from the top down,
while the Huffman code tree is constructed from the bottom up.
• By construction, the Huffman code is a prefix code.
Example 2.31.
x    p_X(x)    Codeword c(x)    ℓ(x)
1 0.5
2 0.25
3 0.125
4 0.125
E[ℓ(X)] =

Note that, for this particular example, the codeword lengths from the Huffman
encoding satisfy
p_X(x) = 1 / 2^ℓ(x);
that is, 2^ℓ(x) is the reciprocal of p_X(x). In other words,
ℓ(x) = log2(1 / p_X(x)) = −log2(p_X(x)).
Therefore,
E[ℓ(X)] = Σ_x p_X(x) ℓ(x) =
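For reference, here is a short Huffman-construction sketch in Python (a standard heap-based implementation; it is not taken from the notes, and the helper name is ours). Applied to the pmf of Example 2.31, it produces codeword lengths 1, 2, 3, 3 and an expected length of 1.75 bits per symbol.

    import heapq
    from itertools import count

    def huffman_code(pmf):
        """Binary Huffman code for a dict {symbol: probability}.
        Returns {symbol: codeword}. Standard bottom-up construction."""
        tiebreak = count()                       # keeps the heap from comparing dicts
        heap = [(p, next(tiebreak), {x: ''}) for x, p in pmf.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)      # merge the two least probable subtrees
            p1, _, c1 = heapq.heappop(heap)
            merged = {x: '0' + w for x, w in c0.items()}
            merged.update({x: '1' + w for x, w in c1.items()})
            heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
        return heap[0][2]

    pmf  = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}   # Example 2.31
    code = huffman_code(pmf)
    avg  = sum(p * len(code[x]) for x, p in pmf.items())
    print(code)   # one valid assignment: {1: '0', 2: '10', 3: '110', 4: '111'}
    print(avg)    # 1.75 bits per symbol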
Example 2.32.
x    p_X(x)    Codeword c(x)    ℓ(x)
‘a’ 0.4
‘b’ 0.3
‘c’ 0.1
‘d’ 0.1
‘e’ 0.06
‘f’ 0.04
E[ℓ(X)] =
Example 2.33.
x    p_X(x)    Codeword c(x)    ℓ(x)
1 0.25
2 0.25
3 0.2
4 0.15
5 0.15
E[ℓ(X)] =
Example 2.34.
x    p_X(x)    Codeword c(x)    ℓ(x)
1/3
1/3
1/4
1/12
E[ℓ(X)] =
E[ℓ(X)] =
2.35. The set of codeword lengths for Huffman encoding is not unique.
There may be more than one set of lengths but all of them will give the
same value of expected length.
Definition 2.36. A code is optimal for a given source (with known pmf) if
it is uniquely decodable and its corresponding expected length is the shortest
among all possible uniquely decodable codes for that source.
2.37. The Huffman code is optimal.
2.3 Source Extension (Extension Coding)
2.38. One can usually (not always) do better in terms of expected length
(per source symbol) by encoding blocks of several source symbols.
Definition 2.39. In n-th extension coding, n successive source sym-
bols are grouped into blocks and the encoder operates on the blocks rather
than on individual symbols. [4, p. 777]
Example 2.40.
x    p_X(x)    Codeword c(x)    ℓ(x)
Y(es) 0.9
N(o) 0.1
E[ℓ(X)] =
YNNYYYNYYNNN...
E[ℓ(X1, X2)] =
E[ℓ(X1, X2, X3)] =
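Here is a sketch of extension coding for the source of Example 2.40 (Python, not from the notes; it repeats the heap-based Huffman construction from the earlier sketch so that it stays self-contained, and the helper names are ours). It builds the product pmf for blocks of n = 1, 2, 3 symbols, constructs a Huffman code for the blocks, and reports the expected length per source symbol: 1 bit/symbol without extension, about 0.645 for the second extension, and about 0.533 for the third.

    import heapq
    from itertools import count, product

    def huffman_lengths(pmf):
        """Codeword lengths of a binary Huffman code for a dict {symbol: probability}."""
        tiebreak = count()                     # keeps the heap from comparing dicts
        heap = [(p, next(tiebreak), {x: 0}) for x, p in pmf.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)    # merge the two least probable subtrees;
            p1, _, c1 = heapq.heappop(heap)    # every symbol in them gets one bit deeper
            merged = {x: d + 1 for x, d in c0.items()}
            merged.update({x: d + 1 for x, d in c1.items()})
            heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
        return heap[0][2]

    base = {'Y': 0.9, 'N': 0.1}                # the source of Example 2.40

    for n in (1, 2, 3):
        # n-th extension: blocks of n i.i.d. source symbols; the block probability
        # is the product of the per-symbol probabilities (the source is a DMS).
        blocks = {}
        for b in product(base, repeat=n):
            p = 1.0
            for ch in b:
                p *= base[ch]
            blocks[''.join(b)] = p
        lengths = huffman_lengths(blocks)
        bits_per_symbol = sum(blocks[b] * lengths[b] for b in blocks) / n
        print(n, round(bits_per_symbol, 4))    # approximately 1.0, 0.645, 0.533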
2.4 Entropy
• The log is to the base 2 and entropy is expressed in bits (per symbol).
◦ The base of the logarithm used in defining H can be chosen to be
any convenient real number b > 1, but if b ≠ 2 the unit will not be
in bits.
◦ If the base of the logarithm is e, the entropy is measured in nats.
◦ Unless otherwise specified, base 2 is our default base.
• Based on continuity arguments, we shall assume that 0 ln 0 = 0.
Example 2.42. The entropy of the random variable X in Example 2.31 is
1.75 bits (per symbol).
Example 2.43. The entropy of a fair coin toss is 1 bit (per toss).
In MATLAB, when the pmf is stored in a row vector pX (with no zero entries), the
entropy can be computed as HX = -pX*(log2(pX))'.
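The MATLAB expression above assumes every entry of pX is strictly positive. Here is a Python sketch of the same computation (not from the notes; the helper name is ours) that also applies the 0 · log 0 = 0 convention:

    from math import log2

    def entropy(pmf):
        """H(X) = -sum p(x) log2 p(x), with the convention 0*log2(0) = 0."""
        return -sum(p * log2(p) for p in pmf if p > 0)

    print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits (Examples 2.31 and 2.42)
    print(entropy([0.5, 0.5]))                  # 1.0 bit   (fair coin, Example 2.43)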
Definition 2.47. Binary Entropy Function: We define h_b(p), h(p), or
H(p) to be −p log p − (1 − p) log(1 − p), whose plot is shown in Figure 3.
[Figure 3: Binary Entropy Function. Plot of H(p) versus p for 0 ≤ p ≤ 1.]

• Logarithmic Bounds: with q = 1 − p,
(ln p)(ln q) ≤ (ln 2) H(p) ≤ (ln p)(ln q) / ln 2.
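A small sketch (Python, not from the notes; the helper name is ours) of the binary entropy function, with a spot check of the logarithmic bounds quoted above at a few values of p:

    from math import log, log2

    def h_b(p):
        """Binary entropy function h_b(p) = -p log2 p - (1-p) log2 (1-p), in bits."""
        if p in (0.0, 1.0):
            return 0.0                       # 0 * log2(0) = 0 convention
        return -p * log2(p) - (1 - p) * log2(1 - p)

    # Spot-check the logarithmic bounds at a few values of p:
    for p in (0.05, 0.1, 0.25, 0.4):
        q = 1 - p
        lower = log(p) * log(q)              # (ln p)(ln q)
        upper = log(p) * log(q) / log(2)     # (ln p)(ln q) / ln 2
        print(round(h_b(p), 4), lower <= log(2) * h_b(p) <= upper)   # prints h_b(p), True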
2.48. Two important facts about entropy:
(a) H(X) ≤ log2 |S_X| with equality if and only if X is a uniform random
variable.
(b) H(X) ≥ 0 with equality if and only if X is deterministic (one value has
probability 1).
In summary,
0 ≤ H(X) ≤ log2 |S_X|,
where the lower bound is achieved by a deterministic X and the upper bound
by a uniform X.
Theorem 2.49. The expected length E[ℓ(X)] of any uniquely decodable
binary code for a random variable X is greater than or equal to the entropy
H(X); that is, E[ℓ(X)] ≥ H(X).
2.51. Given a random variable X, let c_Huffman be the Huffman code for this
X. Then, from the optimality of the Huffman code mentioned in 2.37, the
minimum expected length among all uniquely decodable codes is achieved by
the Huffman code:
L*(X) = L(c_Huffman, X),
where L(c, X) denotes the expected codeword length E[ℓ(X)] when the code
c is used for X.
Theorem 2.52. The optimal code for a random variable X has an expected
length less than H(X) + 1:
L*(X) < H(X) + 1.
2.53. Combining Theorem 2.49 and Theorem 2.52, we have
H(X) ≤ L*(X) < H(X) + 1.   (3)
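A quick numerical check of (3) (Python sketch, not from the notes). The optimal expected lengths quoted in the comments are the Huffman values: 1.75 bits for the dyadic pmf of Example 2.31 and 1 bit for the two-symbol source of Example 2.40.

    from math import log2

    def entropy(pmf):
        return -sum(p * log2(p) for p in pmf if p > 0)

    # (pmf, optimal expected length L*): the L* values are the Huffman expected
    # lengths for these two sources.
    cases = [([0.5, 0.25, 0.125, 0.125], 1.75),
             ([0.9, 0.1], 1.0)]

    for pmf, L_star in cases:
        H = entropy(pmf)
        print(round(H, 4), L_star, H <= L_star < H + 1)   # 1.75 1.75 True, then 0.469 1.0 True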
Definition 2.54. Let L*_n(X) be the minimum expected codeword length
per symbol when the random variable X is encoded with n-th extension
uniquely decodable coding. Of course, this can be achieved by using n-th
extension Huffman coding.
2.55. An extension of (3):
H(X) ≤ L*_n(X) < H(X) + 1/n.   (4)
In particular,
lim_{n→∞} L*_n(X) = H(X).
In other words, by using a large block length, we can achieve an expected
length per source symbol that is arbitrarily close to the value of the entropy.
2.56. Operational meaning of entropy: Entropy of a random variable is the
average length of its shortest description.
2.57. References
• Section 16.1 in Carlson and Crilly [4]
• Chapters 2 and 5 in Cover and Thomas [5]
• Chapter 4 in Fine [6]
• Chapter 14 in Johnson, Sethares, and Klein [8]
• Section 11.2 in Ziemer and Tranter [17]