Chapter 2 - Edited
Introduction
What is information theory (IT)?
What is the purpose of IT?
Why do we need to study information theory?
How are the terms uncertainty, surprise, entropy and
information related?
What is Information Theory
It deals with the concept of information: its mathematical
modeling, measurement, and applications
Provides answers for two fundamental questions in
communication theory:
What is the ultimate limit on data compression?
What is the ultimate transmission rate of reliable communication
over noisy channels?
Shannon showed that reliable (i.e., error-free) communication is
possible for all rates below the channel capacity (using
channel coding).
Any source can be represented in bits at any rate above its
entropy (using source coding).
Rise of digital information technology
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
What is Information?
Definition
The amount of information gained after observing the
event X = x_k is related to the probability of its
occurrence p_k by the logarithmic function:
I(x_k) = \log(1/p_k) = -\log(p_k)
Important Properties:
1. I(x_k) = 0 for p_k = 1: an event that is certain to occur
contains no information.
2. I(x_k) ≥ 0 for 0 ≤ p_k ≤ 1: the occurrence of an event
either provides some or no information, but never brings
a loss of information.
3. I(x_k) > I(x_i) for p_k < p_i: the less probable an
event is, the more information we gain when it occurs.
Cont. …
4. I(x_k x_i) = I(x_k) + I(x_i) if x_k and x_i are statistically
independent.
The base of the logarithm is arbitrary, but it is standard
practice to use base 2; the resulting unit of information is
called the bit. When base e is used, the unit is the nat.
We thus write
I(x_k) = \log_2(1/p_k) = -\log_2(p_k),   for k = 0, 1, …, K-1
When p_k = 1/2, we have I(x_k) = 1 bit. Hence, 1 bit is the amount
of information gained when one of two possible and
equally likely (equiprobable) events occurs.
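To make the definition concrete, here is a minimal Python sketch (the function name self_information is our own illustration, not part of the notes):

```python
import math

def self_information(p):
    """I(x) = log2(1/p) = -log2(p), in bits (use math.log for nats)."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return math.log2(1.0 / p)

print(self_information(1.0))    # 0.0 -> a certain event carries no information
print(self_information(0.5))    # 1.0 bit, as noted above
print(self_information(0.25))   # 2.0 bits -> rarer events carry more information
```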
Entropy of Discrete Memoryless Source
Suppose we have an information source emitting a
sequence of symbols from a finite alphabet:
X = {x_0, x_1, …, x_{K-1}}
Discrete memoryless source: The successive symbols
are statistically independent
Assume that each symbol has a probability of
occurrence
p_k, k = 0, 1, …, K-1, such that \sum_{k=0}^{K-1} p_k = 1
The amount of information I(x_k) produced by the
source during a certain interval depends on the symbol
x_k emitted by the source.
Cont. …
I(x_k) is a discrete random variable that takes the values I(x_0),
I(x_1), …, I(x_{K-1}) with probabilities p_0, p_1, …, p_{K-1},
respectively.
The mean of I(x_k) over the source alphabet X is given by:
H(X) = E[I(x_k)] = \sum_{k=0}^{K-1} p_k I(x_k) = \sum_{k=0}^{K-1} p_k \log_2(1/p_k)
• The quantity H(X) is known as the entropy of the source;
it measures the average information content per source
symbol.
• Note that H(X) depends only on the probabilities of the symbols
in the alphabet X of the source.
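A minimal Python sketch of this definition (the entropy function and its sanity check on the probability vector are our own illustration):

```python
import math

def entropy(probs):
    """H(X) = sum_k p_k * log2(1/p_k) in bits; symbols with p_k = 0 contribute nothing."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # 1.0 bit: two equally likely symbols
print(entropy([0.25, 0.25, 0.5]))   # 1.5 bits: the three-symbol example that follows
print(entropy([1.0]))               # 0.0 bits: a certain symbol carries no uncertainty
```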
Meaning of Entropy
What information about a source does its entropy give us?
It is the amount of uncertainty about the source output before we observe it.
It tells us how many bits of information per symbol we
expect to get on average.
Some properties of Entropy
• Entropy of a source is bounded as follows
0 ≤ 𝐻(𝑋) ≤ log 2 𝐾
• H(X) = 0, if and only if the probability p_k = 1 for some k
and the remaining probabilities in the set are 0.
• H(X) = \log_2 K, if and only if p_k = 1/K for all k (equiprobable
symbols). This upper bound on entropy corresponds to
maximum uncertainty.
Example: Entropy of a Binary Source
Consider a memoryless binary source for which symbol
0 occurs with probability 𝑝0 and symbol 1 with probability
𝑝1 = 1 − 𝑝0 . The entropy of this source equals:
H(X) = \sum_{k=0}^{1} p_k \log_2(1/p_k) = -p_0 \log_2 p_0 - p_1 \log_2 p_1
     = -p_0 \log_2 p_0 - (1 - p_0) \log_2(1 - p_0)
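A short sketch of the resulting binary entropy function H(p_0), assuming the usual convention H(0) = H(1) = 0:

```python
import math

def binary_entropy(p0):
    """H(p0) = -p0*log2(p0) - (1 - p0)*log2(1 - p0), with H(0) = H(1) = 0."""
    if p0 in (0.0, 1.0):
        return 0.0
    return -p0 * math.log2(p0) - (1 - p0) * math.log2(1 - p0)

print(binary_entropy(0.5))   # 1.0 bit: equiprobable symbols, maximum uncertainty
print(binary_entropy(0.1))   # ≈ 0.469 bits: a biased source is more predictable
```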
Example: A Three-Symbol Alphabet
Consider another discrete memoryless source with source
alphabet X = {x_0, x_1, x_2} with respective probabilities
p_0 = 1/4, p_1 = 1/4 and p_2 = 1/2. The entropy of the source becomes:
H(X) = \sum_{k=0}^{2} p_k \log_2(1/p_k) = p_0 \log_2(1/p_0) + p_1 \log_2(1/p_1) + p_2 \log_2(1/p_2)
     = (1/4)\log_2(4) + (1/4)\log_2(4) + (1/2)\log_2(2)
     = 3/2 bits
• In most cases, blocks are considered rather than individual
symbols, with each block consisting of n successive source
symbols.
Cont. …
In the case of a DMS, the entropy of this extended source is
given by:
H(X^n) = n H(X)
Example: Calculate the entropy of the second-order extension of the
previous source.
H(X^2) = \sum_{i=0}^{8} p(\sigma_i) \log_2(1/p(\sigma_i))
       = (1/16)\log_2(16) + (1/16)\log_2(16) + (1/8)\log_2(8) + (1/16)\log_2(16)
         + (1/16)\log_2(16) + (1/8)\log_2(8) + (1/8)\log_2(8) + (1/8)\log_2(8) + (1/4)\log_2(4)
       = 3 bits
       = 2 H(X)
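To illustrate, a small Python check of H(X^2) = 2H(X) for the previous three-symbol source (block probabilities are products of symbol probabilities, which is valid because the source is memoryless):

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

source = [0.25, 0.25, 0.5]   # the three-symbol source of the previous example
# For a DMS the symbols in a block are independent, so block probabilities multiply.
pairs = [p * q for p, q in product(source, repeat=2)]

print(entropy(source))   # 1.5 bits
print(entropy(pairs))    # 3.0 bits = 2 * H(X)
```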
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Source Coding Theorem
An important problem in communication is the efficient
representation of the data generated by a discrete source.
Source encoder
Our primary interest is in developing an efficient source encoder
that satisfies two functional requirements:
1. The code words produced by the encoder are in binary form
2. The source code is uniquely decodable
• If we let the binary code word assigned to symbol 𝑥𝑘 by the
encoder have length 𝑙𝑘 , the average code word length can be
defined as:
Average Codeword Length
L = \sum_{k=0}^{K-1} p_k l_k
where p_k is the probability of occurrence of x_k for k = 0, 1, …, K-1
• The parameter 𝐿 represents the average number of bits per
source symbol used in the source encoding process.
An effective way to reduce the average code word length is
to encode symbols that occur often with short code words and
symbols that occur rarely with longer code words.
This results in a variable-length code.
The code efficiency of the source encoder is defined as
η = L_min / L
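As a small illustration of these definitions, the sketch below computes L and the efficiency for the three-symbol source used earlier; the code words 00, 01, 1 are our own choice, and the efficiency is evaluated as H(X)/L, using the source-coding bound L_min = H(X) discussed on the next slides:

```python
import math

def average_length(probs, lengths):
    """L = sum_k p_k * l_k: average number of code bits per source symbol."""
    return sum(p * l for p, l in zip(probs, lengths))

# Three-symbol source (1/4, 1/4, 1/2) encoded with the code words 00, 01, 1.
probs, lengths = [0.25, 0.25, 0.5], [2, 2, 1]
L = average_length(probs, lengths)
H = sum(p * math.log2(1.0 / p) for p in probs)
print(L, H, H / L)   # 1.5 1.5 1.0 -> this particular code is 100% efficient
```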
Minimum Codeword Length
What is the minimum codeword length for a particular alphabet of
source symbols?
Shannon's source-coding theorem addresses this fundamental
question. It states that, for a discrete memoryless source of entropy H(X),
the average code word length L of any distortionless (uniquely decodable)
source encoding scheme is bounded as L ≥ H(X). Hence L_min = H(X),
and the efficiency may also be written as η = H(X)/L.
Property of prefix code
A prefix code can be decoded by reading the received sequence
from left to right and following the corresponding path in the
code tree until a leaf is reached; by the prefix-free property,
that leaf represents a complete code word.
It is always uniquely decodable. But the converse is not
necessarily true.
Huffman coding Algorithm
The Huffman algorithm is a variable-length coding scheme
based on the source letter probabilities pk, k = 1, 2, ….., L
The coding algorithm is optimum in the sense that the
average number of binary digits required to represent the
source letters is minimum
Satisfies the prefix condition and the sequence of code
words are uniquely and instantaneously decodable
Basic idea: choose codeword lengths so that more-
probable sequences have shorter codewords
The reduction process is continued in a step-by-step
manner until we are left with a final set of two source
statistics (symbols).
Cont. …
The encoding algorithm proceeds as follows;
1. The source symbols are listed in order of decreasing
probability and the two lowest probability symbols are
assigned a 0 and a 1.
2. The probability of the new symbol formed by combining the two
source symbols is placed in the list in accordance with its
value
3. The procedure is repeated until only two source symbols
are left for which a 0 and a 1 can be assigned
• The code word for each original source symbol is found by
working backward and tracing the sequence of 0s and 1s
assigned to that symbol and its successors; a code sketch of
this procedure is given below.
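The sketch below is one possible Python rendering of this procedure (using a heap rather than an explicitly re-sorted list; Huffman codes are not unique, so the code words it prints may differ from the tables that follow, but the average code word length is the same):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return a dict {symbol index: binary codeword} built by repeatedly merging
    the two least probable nodes, as in steps 1-3 above."""
    tie = count()                      # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)                      # two lowest-probability nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}         # prepend 0 on one branch ...
        merged.update({s: "1" + w for s, w in c2.items()})   # ... and 1 on the other
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

probs = [0.35, 0.30, 0.20, 0.10, 0.04, 0.005, 0.005]   # the example on the next slides
code = huffman_code(probs)
avg = sum(p * len(code[i]) for i, p in enumerate(probs))
print(code)
print(round(avg, 2))   # 2.21 bits/symbol, matching the worked example
```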
Example
Given seven source letters x1, x2, …, x7 with probabilities 0.35, 0.30,
0.20, 0.10, 0.04, 0.005, 0.005, respectively. Construct a Huffman code
for this source and calculate its efficiency.
Example
Letter Prob. I(x) Code
x1 0.35 1.5146 00
x2 0.30 1.7370 01
x3 0.20 2.3219 10
x4 0.10 3.3219 110
x5 0.04 4.6439 1110
x6 0.005 7.6439 11110
x7 0.005 7.6439 11111
H(X) = \sum_{k=1}^{7} p(x_k) I(x_k) = 2.11 bits/symbol   and   L = \sum_{k=1}^{7} p(x_k) l_k = 2.21 bits/symbol
Efficiency: η = H(X)/L = 2.11/2.21 × 100% = 95.5%
Example
The above code is not
necessarily unique. We can
devise an alternative code as
shown next for the same source
as above.
x1: 0
x2: 10
x3: 110
x4: 1110
x5: 11110
x6: 111110
x7: 111111
An alternative code for the DMS in the above example.
The average code word length is the same as above (show?).
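A quick numerical check of this claim (a minimal sketch; the two lists hold the code word lengths read off the two codes):

```python
probs  = [0.35, 0.30, 0.20, 0.10, 0.04, 0.005, 0.005]
first  = [2, 2, 2, 3, 4, 5, 5]    # code word lengths from the previous table
second = [1, 2, 3, 4, 5, 6, 6]    # lengths of the alternative code above
print(sum(p * l for p, l in zip(probs, first)))    # ≈ 2.21 bits/symbol
print(sum(p * l for p, l in zip(probs, second)))   # ≈ 2.21 bits/symbol as well
```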
Cont. …
Note that the assignment of 0 to the upper branch and 1 to the
lower branch is arbitrary and by reversing this we obtain an
equally efficient code that satisfies the prefix condition
The above procedure always results in a prefix-free variable-length
code that satisfies the bounds on the average code word length L.
A more efficient procedure is to encode J letters (symbols) at a time.
As an illustration consider the following example
Example: Let the output of a DMS consist of x1, x2 and x3 with
probabilities 0.45, 0.35, 0.2, respectively
Entropy of the source:
H(X) = -\sum_{k=1}^{3} p(x_k) \log_2 p(x_k) ≈ 1.513 bits/symbol
Cont. …
• If these symbols are encoded individually using the Huffman encoding
procedure (x1--0, x2--10 and x3--11), the average code word
length is 1.55 bits/symbol and the efficiency is about 97.6%.
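A quick check of these figures (a minimal sketch; the code word lengths 1, 2, 2 correspond to the code x1--0, x2--10, x3--11 above):

```python
import math

probs = [0.45, 0.35, 0.20]
H = -sum(p * math.log2(p) for p in probs)
L = 0.45 * 1 + 0.35 * 2 + 0.20 * 2          # code words 0, 10, 11
print(round(H, 3), round(L, 2), round(H / L, 3))   # 1.513 1.55 0.976
```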
Lempel-Ziv Coding
A drawback of Huffman coding is that it requires knowledge of
a probabilistic model of the source; unfortunately, in practice,
source statistics are not always known a priori.
To overcome these practical limitations, we may use the
Lempel-Ziv algorithm.
Basically, encoding in the Lempel-Ziv algorithm is accomplished
by parsing the source data stream into segments that are the
shortest subsequences not encountered previously.
Consider the example of an input binary sequence specified as
follows.
000101110010100101 …
It is assumed that the binary symbols 0 and 1 are already
stored in that order in the code book.
Cont. …
Subsequences stored: 0, 1
Data to be parsed: 000101110010100101 …
Starting the encoding process from the left, the shortest
subsequence of the data stream encountered for the first time
and not seen before is 00; so we write,
Subsequences stored: 0, 1, 00
Data to be parsed: 0101110010100101 …
The second shortest subsequence not seen before is 01;
accordingly, we go on to write
Subsequences stored: 0, 1, 00, 01
Data to be parsed: 01110010100101 …
The next shortest subsequence not encountered previously is
011; hence, we write,
Cont. …
Subsequences stored: 0, 1, 00, 01, 011
Data to be parsed: 10010100101 …
We continue in this manner until the given data stream is
parsed completely.
Numerical positions:        1    2    3     4     5     6     7     8     9
Subsequences:               0    1    00    01    011   10    010   100   101
Numerical representation:             11    12    42    21    41    61    62
Binary encoded blocks:                0010  0011  1001  0100  1000  1100  1101
The first row shown indicates the numerical positions of
individual subsequences in the code book.
The second row consists of the resulting subsequences after
the data stream is completely parsed.
Cont. …
The third row shows the numerical representation of each
subsequence in terms of the positions of previously stored subsequences.
The last bit represents the innovation symbol for the particular
subsequence under consideration.
The remaining bits provide the equivalent binary
representation of the “pointer” to the root subsequence that
matches the one in question except for the innovation symbol.
The Lempel-Ziv algorithm uses fixed-length codes to represent a
variable number of source symbols.
The Lempel-Ziv algorithm is now the standard algorithm for file compression.
When applied to English text, it achieves a compaction of
approximately 55%, in contrast to about 43% for the Huffman algorithm.
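To make the parsing rule concrete, here is a minimal Python sketch (the function lz_parse and its preloaded code book of 0 and 1 are our own illustration; it reproduces the parsed subsequences, not the final binary encoded blocks):

```python
def lz_parse(bits, preloaded=("0", "1")):
    """Parse a binary string into the shortest subsequences not encountered previously."""
    book = list(preloaded)          # code book, with 0 and 1 already stored as in the slides
    phrase = ""
    for b in bits:
        phrase += b
        if phrase not in book:      # shortest subsequence not seen before
            book.append(phrase)
            phrase = ""
    return book

print(lz_parse("000101110010100101"))
# ['0', '1', '00', '01', '011', '10', '010', '100', '101'] -- the code book built above
```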
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Discrete Memoryless Channels (DMC)
A DMC is a statistical model with an input X and an output Y that
is a noisy version of X.
It can be represented by a channel transition diagram and, equivalently,
by the matrix of transition probabilities shown below.
Cont. …
P = \begin{bmatrix} p(y_0|x_0) & p(y_1|x_0) & \cdots & p(y_{K-1}|x_0) \\ p(y_0|x_1) & p(y_1|x_1) & \cdots & p(y_{K-1}|x_1) \\ \vdots & \vdots & & \vdots \\ p(y_0|x_{J-1}) & p(y_1|x_{J-1}) & \cdots & p(y_{K-1}|x_{J-1}) \end{bmatrix}
Note that the sum of elements along any row of the matrix is
always equal to one
The joint probability distribution of the random variables X and Y is given by:
p(x_j, y_k) = P(Y = y_k | X = x_j) P(X = x_j) = p(y_k|x_j) p(x_j)
• The marginal probability distribution of the output random
variable Y is obtained by averaging out the dependence of
p(x_j, y_k) on x_j, as shown below:
p(y_k) = \sum_{j=0}^{J-1} P(Y = y_k | X = x_j) P(X = x_j) = \sum_{j=0}^{J-1} p(y_k|x_j) p(x_j)
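As an illustration of these relations, a small sketch using an assumed binary symmetric channel with crossover probability 0.1 and an assumed input distribution (the numbers are our own, not from the notes):

```python
import numpy as np

# Assumed example: a binary symmetric channel with crossover probability 0.1.
# Rows of P are inputs x_j, columns are outputs y_k; each row sums to one.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p_x = np.array([0.75, 0.25])    # assumed input distribution p(x_j)

p_y = p_x @ P                   # p(y_k) = sum_j p(y_k | x_j) p(x_j)
joint = p_x[:, None] * P        # p(x_j, y_k) = p(y_k | x_j) p(x_j)
print(p_y)                      # ≈ [0.7, 0.3]
print(joint.sum())              # ≈ 1.0 (sanity check)
```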
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Mutual Information
Consider two discrete random variables X and Y such that
X = {x_0, x_1, …, x_{J-1}} and Y = {y_0, y_1, …, y_{K-1}}
Suppose we observe Y = y_k and wish to determine,
quantitatively, the amount of information Y = y_k provides
about the event X = x_j, j = 0, 1, 2, …, J-1.
To answer this question, let us define the conditional entropy of
X, selected from alphabet X, given that Y = y_k:
H(X | Y = y_k) = \sum_{j=0}^{J-1} p(x_j|y_k) \log_2( 1 / p(x_j|y_k) )
• Taking the mean of H(X | Y = y_k) over the output alphabet Y:
H(X|Y) = \sum_{k=0}^{K-1} H(X | Y = y_k) p(y_k)
Cont. …
H(X|Y) = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(x_j|y_k) p(y_k) \log_2( 1 / p(x_j|y_k) )
       = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(x_j, y_k) \log_2( 1 / p(x_j|y_k) )
• The quantity H(X|Y) is the conditional entropy of X given Y; it
represents the amount of uncertainty remaining about the
channel input X after the channel output Y has been observed.
• We also know that H(X) represents the uncertainty about the
channel input X before the output is observed.
• Thus, the difference H(X) − H(X|Y) must represent the
amount of uncertainty about the channel input X that is
resolved after observing the output Y.
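Continuing the assumed binary-symmetric-channel example from the previous section, a short sketch of this computation:

```python
import numpy as np

# Joint distribution p(x_j, y_k) from the assumed channel example above
# (input distribution [0.75, 0.25], crossover probability 0.1).
joint = np.array([[0.675, 0.075],
                  [0.025, 0.225]])

p_y = joint.sum(axis=0)                           # p(y_k)
cond = joint / p_y                                # p(x_j | y_k), column by column
H_X_given_Y = (joint * np.log2(1.0 / cond)).sum()
print(H_X_given_Y)   # ≈ 0.399 bits of uncertainty about X remain after observing Y
```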
Cont. …
This quantity is known as the mutual information of the channel:
I(X, Y) = H(X) − H(X|Y)
Similarly, we may write
I(Y, X) = H(Y) − H(Y|X)
Substituting the equations for H(X) and H(X|Y), we have (show):
I(X, Y) = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} p(x_j, y_k) \log( p(x_j|y_k) / p(x_j) )
Unit of the information measure is the nat(s) if the natural
logarithm is used and bit(s) if base 2 is used
Note that ln a = (ln 2) · \log_2 a = 0.69315 · \log_2 a
Properties of mutual information
1. The mutual information is symmetric; that is:
I(X, Y) = I(Y, X)
2. The mutual information is always non-negative; that is:
I(X, Y) ≥ 0, with equality if and only if
p(x_j, y_k) = p(x_j) p(y_k) for all j and k.
The mutual information is zero if and only if X and Y are independent.
3. The mutual information of a channel is related to the joint
entropy of the channel input and channel output by
I(X, Y) = H(X) + H(Y) − H(X, Y)
where the joint entropy H(X, Y) is defined by:
H(X, Y) = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} p(x_j, y_k) \log_2( 1 / p(x_j, y_k) )
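A short numerical check of these properties for the same assumed joint distribution used above:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Same assumed joint distribution p(x_j, y_k) as above.
joint = np.array([[0.675, 0.075],
                  [0.025, 0.225]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

# Definition: I(X,Y) = sum_j sum_k p(x_j, y_k) log2[ p(x_j|y_k) / p(x_j) ]
#                    = sum_j sum_k p(x_j, y_k) log2[ p(x_j, y_k) / (p(x_j) p(y_k)) ]
I_def = (joint * np.log2(joint / np.outer(p_x, p_y))).sum()
I_prop3 = entropy(p_x) + entropy(p_y) - entropy(joint.ravel())   # property 3
print(I_def, I_prop3)    # both ≈ 0.412 bits; non-negative, as property 2 requires
```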
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Channel coding Theorem
The channel coding theorem states that if a discrete memoryless channel has
capacity C and a source generates information at a rate less than C, then
there exists a coding scheme for which the source output can be transmitted
over the channel with an arbitrarily small probability of error; conversely,
reliable transmission is not possible at rates above C.
Block diagram representation of a communication system,
ignoring the source encoder and decoder:
Classification
Block codes
Linear block codes
Cyclic codes
Convolutional codes
Compound codes
Information Capacity Theorem (Shannon Capacity Formula)
For a band-limited, power-limited AWGN channel, the
channel capacity is
C = B \log_2(1 + SNR) = B \log_2(1 + P/(N_0 B))   bits/s, where
B: the bandwidth of the channel
P: the average signal power at the receiver
N_0: the single-sided power spectral density (PSD) of the noise
Important implication:
We can communicate error-free, or with as small a
probability of error as desired, at any rate up to C bits per second
How can we achieve this rate?
Design error detection and correction codes to detect and
correct as many errors as possible .
Example: channel capacity
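As an illustrative sketch of the formula (the telephone-type figures B = 3.4 kHz and SNR = 30 dB are our own assumed values, not taken from the notes):

```python
import math

def shannon_capacity(bandwidth_hz, snr_linear):
    """C = B * log2(1 + SNR) in bits per second for a band-limited AWGN channel."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Assumed telephone-type channel: B = 3.4 kHz, SNR = 30 dB (i.e. 1000 in linear terms).
B = 3400.0
snr = 10 ** (30.0 / 10)
print(shannon_capacity(B, snr))    # ≈ 33,889 bits/s
```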