Chapter 2 - Edited
Introduction
What is information theory (IT)?
What is the purpose of IT?
Why do we need to study information theory?
How are the terms uncertainty, surprise, entropy and
information related?
What is Information Theory
It deals with the concept of information: its mathematical
modeling, measurement, and applications
Provides answers for two fundamental questions in
communication theory:
What is the ultimate limit on data compression?
What is the ultimate transmission rate of reliable communication
over noisy channels?
Shannon showed that reliable (i.e., error-free) communication is
possible for all rates below the channel capacity (using
channel coding).
Any source can be represented in bits at any rate above its
entropy (using source coding).
Rise of digital information technology
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
What is Information?
Definition
The amount of information gained after observing the
event X = x_k is related to the probability of its
occurrence p_k by the logarithmic function:
I(x_k) = \log(1/p_k) = -\log(p_k)
Important Properties:
1. I(x_k) = 0 for p_k = 1: an event that is certain to occur
contains no information.
2. I(x_k) ≥ 0 for 0 ≤ p_k ≤ 1: the occurrence of an event
either provides some or no information, but never brings
a loss of information.
3. I(x_k) > I(x_i) for p_k < p_i: the less probable an
event is, the more information we gain when it occurs.
Cont. …
4. I(x_k x_i) = I(x_k) + I(x_i) if x_k and x_i are statistically
independent.
The base of the logarithm is arbitrary, but it is standard
practice to use base 2; the resulting unit of information is
called the bit. When base e is used, the unit is the nat.
We thus write
I(x_k) = \log_2(1/p_k) = -\log_2(p_k),   for k = 0, 1, …, K-1
When p_k = 1/2, we have I(x_k) = 1 bit. Hence, 1 bit is the amount
of information gained when one of two possible and
equally likely (equiprobable) events occurs.
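To make the definition concrete, here is a minimal Python sketch (the function name self_information is our own illustration, not part of the notes):

```python
import math

def self_information(p):
    """I(x) = log2(1/p) = -log2(p), in bits (use math.log for nats)."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return math.log2(1.0 / p)

print(self_information(1.0))    # 0.0 -> a certain event carries no information
print(self_information(0.5))    # 1.0 bit, as noted above
print(self_information(0.25))   # 2.0 bits -> rarer events carry more information
```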
Entropy of Discrete Memoryless Source
Suppose we have an information source emitting a
sequence of symbols from a finite alphabet:
X = {x_0, x_1, …, x_{K-1}}
Discrete memoryless source: The successive symbols
are statistically independent
Assume that each symbol has a probability of
occurrence
p_k, k = 0, 1, …, K-1, such that \sum_{k=0}^{K-1} p_k = 1
The amount of information I(x_k) produced by the
source during a certain interval depends on the symbol
x_k emitted by the source.
Cont. …
I(x_k) is a discrete random variable that takes the values I(x_0),
I(x_1), …, I(x_{K-1}) with probabilities p_0, p_1, …, p_{K-1},
respectively.
The mean of I(x_k) over the source alphabet X is given by:
H(X) = E[I(x_k)] = \sum_{k=0}^{K-1} p_k I(x_k) = \sum_{k=0}^{K-1} p_k \log_2(1/p_k)
• The quantity H(X) is known as the entropy of the source;
it measures the average information content per source
symbol.
• Note that H(X) depends only on the probabilities of the symbols
in the alphabet X of the source.
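A minimal Python sketch of this definition (the entropy function and its sanity check on the probability vector are our own illustration):

```python
import math

def entropy(probs):
    """H(X) = sum_k p_k * log2(1/p_k) in bits; symbols with p_k = 0 contribute nothing."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # 1.0 bit: two equally likely symbols
print(entropy([0.25, 0.25, 0.5]))   # 1.5 bits: the three-symbol example that follows
print(entropy([1.0]))               # 0.0 bits: a certain symbol carries no uncertainty
```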
Meaning of Entropy
What information about a source does its entropy give us?
It is the amount of uncertainty about the source output before we observe it.
It tells us how many bits of information per symbol we
expect to get on average.
Some properties of Entropy
• Entropy of a source is bounded as follows
0 ≤ 𝐻(𝑋) ≤ log 2 𝐾
• H(X) = 0, if and only if the probability p_k = 1 for some k
and the remaining probabilities in the set are 0.
• H(X) = \log_2 K, if and only if p_k = 1/K for all k (equiprobable
symbols). This upper bound on entropy corresponds to
maximum uncertainty.
Example: Entropy of a Binary Source
Consider a memoryless binary source for which symbol
0 occurs with probability 𝑝0 and symbol 1 with probability
𝑝1 = 1 − 𝑝0 . The entropy of this source equals:
H(X) = \sum_{k=0}^{1} p_k \log_2(1/p_k) = -p_0 \log_2 p_0 - p_1 \log_2 p_1
     = -p_0 \log_2 p_0 - (1 - p_0) \log_2(1 - p_0)
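A short sketch of the resulting binary entropy function H(p_0), assuming the usual convention H(0) = H(1) = 0:

```python
import math

def binary_entropy(p0):
    """H(p0) = -p0*log2(p0) - (1 - p0)*log2(1 - p0), with H(0) = H(1) = 0."""
    if p0 in (0.0, 1.0):
        return 0.0
    return -p0 * math.log2(p0) - (1 - p0) * math.log2(1 - p0)

print(binary_entropy(0.5))   # 1.0 bit: equiprobable symbols, maximum uncertainty
print(binary_entropy(0.1))   # ≈ 0.469 bits: a biased source is more predictable
```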
Example: A Three-Symbol Alphabet
Consider another discrete memoryless source with source
alphabet X = {x_0, x_1, x_2} with respective probabilities
p_0 = 1/4, p_1 = 1/4 and p_2 = 1/2. The entropy of the source becomes:
H(X) = \sum_{k=0}^{2} p_k \log_2(1/p_k) = p_0 \log_2(1/p_0) + p_1 \log_2(1/p_1) + p_2 \log_2(1/p_2)
     = (1/4)\log_2(4) + (1/4)\log_2(4) + (1/2)\log_2(2)
     = 3/2 bits
• In most cases, blocks are considered rather than individual
symbols, with each block consisting of n successive source
symbols.
Cont. …
In the case of a DMS, the entropy of this extended source is
given by:
H(X^n) = n H(X)
Example: Calculate the entropy of the second-order extension of the
previous source.
H(X^2) = \sum_{i=0}^{8} p(\sigma_i) \log_2(1/p(\sigma_i))
       = (1/16)\log_2(16) + (1/16)\log_2(16) + (1/8)\log_2(8) + (1/16)\log_2(16)
         + (1/16)\log_2(16) + (1/8)\log_2(8) + (1/8)\log_2(8) + (1/8)\log_2(8) + (1/4)\log_2(4)
       = 3 bits
       = 2 H(X)
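To illustrate, a small Python check of H(X^2) = 2H(X) for the previous three-symbol source (block probabilities are products of symbol probabilities, which is valid because the source is memoryless):

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

source = [0.25, 0.25, 0.5]   # the three-symbol source of the previous example
# For a DMS the symbols in a block are independent, so block probabilities multiply.
pairs = [p * q for p, q in product(source, repeat=2)]

print(entropy(source))   # 1.5 bits
print(entropy(pairs))    # 3.0 bits = 2 * H(X)
```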
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Source Coding Theorem
An important problem in communication is the efficient
representation of the data generated by a discrete source.
Source encoder
Our primary interest is in developing an efficient source encoder
that satisfies two functional requirements:
1. The code words produced by the encoder are in binary form
2. The source code is uniquely decodable
• If we let the binary code word assigned to symbol 𝑥𝑘 by the
encoder have length 𝑙𝑘 , the average code word length can be
defined as:
Average Codeword Length
L = \sum_{k=0}^{K-1} p_k l_k
where p_k is the probability of occurrence of x_k for k = 0, 1, …, K-1
• The parameter 𝐿 represents the average number of bits per
source symbol used in the source encoding process.
An effective way to reduce the average code word length is
to encode symbols that occur often with short code words and
symbols that occur rarely with longer code words.
This results in a variable-length code.
The code efficiency of the source encoder is defined as
η = L_min / L
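As a small illustration of these definitions, the sketch below computes L and the efficiency for the three-symbol source used earlier; the code words 00, 01, 1 are our own choice, and the efficiency is evaluated as H(X)/L, using the source-coding bound L_min = H(X) discussed on the next slides:

```python
import math

def average_length(probs, lengths):
    """L = sum_k p_k * l_k: average number of code bits per source symbol."""
    return sum(p * l for p, l in zip(probs, lengths))

# Three-symbol source (1/4, 1/4, 1/2) encoded with the code words 00, 01, 1.
probs, lengths = [0.25, 0.25, 0.5], [2, 2, 1]
L = average_length(probs, lengths)
H = sum(p * math.log2(1.0 / p) for p in probs)
print(L, H, H / L)   # 1.5 1.5 1.0 -> this particular code is 100% efficient
```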
Minimum Codeword Length
What is the minimum codeword length for a particular alphabet of
source symbols?
Shannon's source-coding theorem addresses this fundamental
question. It states that, for a discrete memoryless source of entropy H(X),
the average code word length L of any distortionless (uniquely decodable)
source encoding scheme is bounded as L ≥ H(X). Hence L_min = H(X),
and the efficiency may also be written as η = H(X)/L.
Property of prefix code
A prefix code can be decoded by reading the received sequence
from left to right and following the corresponding path in the
code tree until a leaf is reached; by the prefix-free property,
that leaf represents a complete code word.
It is always uniquely decodable. But the converse is not
necessarily true.
Huffman coding Algorithm
The Huffman algorithm is a variable-length coding scheme
based on the source letter probabilities pk, k = 1, 2, ….., L
The coding algorithm is optimum in the sense that the
average number of binary digits required to represent the
source letters is minimum
Satisfies the prefix condition and the sequence of code
words are uniquely and instantaneously decodable
Basic idea: choose codeword lengths so that more-
probable sequences have shorter codewords
The reduction process is continued in a step-by-step
manner until we are left with a final set of two source
statistics (symbols).
Cont. …
The encoding algorithm proceeds as follows;
1. The source symbols are listed in order of decreasing
probability and the two lowest probability symbols are
assigned a 0 and a 1.
2. The probability of the new symbol formed by combining the two
source symbols is placed in the list in accordance with its
value
3. The procedure is repeated until only two source symbols
are left for which a 0 and a 1 can be assigned
• The code word for each original source symbol is found by
working backward and tracing the sequence of 0s and 1s
assigned to that symbol and its successors; a code sketch of
this procedure is given below.
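The sketch below is one possible Python rendering of this procedure (using a heap rather than an explicitly re-sorted list; Huffman codes are not unique, so the code words it prints may differ from the tables that follow, but the average code word length is the same):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return a dict {symbol index: binary codeword} built by repeatedly merging
    the two least probable nodes, as in steps 1-3 above."""
    tie = count()                      # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)                      # two lowest-probability nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}         # prepend 0 on one branch ...
        merged.update({s: "1" + w for s, w in c2.items()})   # ... and 1 on the other
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

probs = [0.35, 0.30, 0.20, 0.10, 0.04, 0.005, 0.005]   # the example on the next slides
code = huffman_code(probs)
avg = sum(p * len(code[i]) for i, p in enumerate(probs))
print(code)
print(round(avg, 2))   # 2.21 bits/symbol, matching the worked example
```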
Example
Given seven source letters x1, x2, …, x7 with probabilities 0.35, 0.30,
0.20, 0.10, 0.04, 0.005, 0.005, respectively. Construct a Huffman code
for this source and calculate its efficiency.
Example
Letter Prob. I(x) Code
x1 0.35 1.5146 00
x2 0.30 1.7370 01
x3 0.20 2.3219 10
x4 0.10 3.3219 110
x5 0.04 4.6439 1110
x6 0.005 7.6439 11110
x7 0.005 7.6439 11111
H(X) = \sum_{k=1}^{7} p(x_k) I(x_k) = 2.11 bits/symbol   and   L = \sum_{k=1}^{7} p(x_k) l_k = 2.21 bits/symbol
Efficiency: η = H(X)/L = 2.11/2.21 × 100% = 95.5%
Example
The above code is not
necessarily unique. We can
devise an alternative code as
shown next for the same source
as above.
x1: 0
x2: 10
x3: 110
x4: 1110
x5: 11110
x6: 111110
x7: 111111
An alternative code for the DMS in the above example.
The average code word length is the same as above (show?).
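A quick numerical check of this claim (a minimal sketch; the two lists hold the code word lengths read off the two codes):

```python
probs  = [0.35, 0.30, 0.20, 0.10, 0.04, 0.005, 0.005]
first  = [2, 2, 2, 3, 4, 5, 5]    # code word lengths from the previous table
second = [1, 2, 3, 4, 5, 6, 6]    # lengths of the alternative code above
print(sum(p * l for p, l in zip(probs, first)))    # ≈ 2.21 bits/symbol
print(sum(p * l for p, l in zip(probs, second)))   # ≈ 2.21 bits/symbol as well
```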
Cont. …
Note that the assignment of 0 to the upper branch and 1 to the
lower branch is arbitrary and by reversing this we obtain an
equally efficient code that satisfies the prefix condition
The above procedure always results in a prefix-free variable-length
code that satisfies the bounds on the average code word length L.
A more efficient procedure is to encode J letters (symbols) at a time.
As an illustration consider the following example
Example: Let the output of a DMS consist of x1, x2 and x3 with
probabilities 0.45, 0.35, 0.2, respectively
Entropy of the source:
H(X) = -\sum_{k=1}^{3} p(x_k) \log_2 p(x_k) ≈ 1.513 bits/symbol
Cont. …
• If these symbols are encoded individually using the Huffman encoding
procedure (x1--0, x2--10 and x3--11), the average code word
length is 1.55 bits/symbol and the efficiency is about 97.6%.
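A quick check of these figures (a minimal sketch; the code word lengths 1, 2, 2 correspond to the code x1--0, x2--10, x3--11 above):

```python
import math

probs = [0.45, 0.35, 0.20]
H = -sum(p * math.log2(p) for p in probs)
L = 0.45 * 1 + 0.35 * 2 + 0.20 * 2          # code words 0, 10, 11
print(round(H, 3), round(L, 2), round(H / L, 3))   # 1.513 1.55 0.976
```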
Lempel-Ziv Coding
A drawback of Huffman coding is that it requires knowledge of
a probabilistic model of the source; unfortunately, in practice,
source statistics are not always known a priori.
To overcome these practical limitations, we may use the
Lempel-Ziv algorithm.
Basically, encoding in the Lempel-Ziv algorithm is accomplished
by parsing the source data stream into segments that are the
shortest subsequences not encountered previously.
Consider the example of an input binary sequence specified as
follows.
000101110010100101 …
It is assumed that the binary symbols 0 and 1 are already
stored in that order in the code book.
Cont. …
Subsequences stored: 0, 1
Data to be parsed: 000101110010100101 …
Starting the encoding process from the left, the shortest
subsequence of the data stream encountered for the first time
and not seen before is 00; so we write,
Subsequences stored: 0, 1, 00
Data to be parsed: 0101110010100101 …
The second shortest subsequence not seen before is 01;
accordingly, we go on to write
Subsequences stored: 0, 1, 00, 01
Data to be parsed: 01110010100101 …
The next shortest subsequence not encountered previously is
011; hence, we write,
Cont. …
Subsequences stored: 0, 1, 00, 01, 011
Data to be parsed: 10010100101 …
We continue in this manner until the given data stream is
parsed completely.
Numerical positions:        1    2    3     4     5     6     7     8     9
Subsequences:               0    1    00    01    011   10    010   100   101
Numerical representation:             11    12    42    21    41    61    62
Binary encoded blocks:                0010  0011  1001  0100  1000  1100  1101
The first row shown indicates the numerical positions of
individual subsequences in the code book.
The second row consists of the resulting subsequences after
the data stream is completely parsed.
Cont. …
The third row shows the numerical representation of each
subsequence in terms of the positions of previously stored subsequences.
The last bit represents the innovation symbol for the particular
subsequence under consideration.
The remaining bits provide the equivalent binary
representation of the “pointer” to the root subsequence that
matches the one in question except for the innovation symbol.
The Lempel-Ziv algorithm uses fixed-length codes to represent a
variable number of source symbols.
The Lempel-Ziv algorithm is now the standard algorithm for file compression.
When applied to English text, it achieves a compaction of
approximately 55%, in contrast to about 43% for the Huffman algorithm.
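To make the parsing rule concrete, here is a minimal Python sketch (the function lz_parse and its preloaded code book of 0 and 1 are our own illustration; it reproduces the parsed subsequences, not the final binary encoded blocks):

```python
def lz_parse(bits, preloaded=("0", "1")):
    """Parse a binary string into the shortest subsequences not encountered previously."""
    book = list(preloaded)          # code book, with 0 and 1 already stored as in the slides
    phrase = ""
    for b in bits:
        phrase += b
        if phrase not in book:      # shortest subsequence not seen before
            book.append(phrase)
            phrase = ""
    return book

print(lz_parse("000101110010100101"))
# ['0', '1', '00', '01', '011', '10', '010', '100', '101'] -- the code book built above
```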
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Discrete Memoryless Channels (DMC)
A DMC is a statistical model with an input X and an output Y that
is a noisy version of X.
It can be represented by a channel transition diagram and, equivalently,
by the matrix of transition probabilities shown below.
Cont. …
P = \begin{bmatrix} p(y_0|x_0) & p(y_1|x_0) & \cdots & p(y_{K-1}|x_0) \\ p(y_0|x_1) & p(y_1|x_1) & \cdots & p(y_{K-1}|x_1) \\ \vdots & \vdots & & \vdots \\ p(y_0|x_{J-1}) & p(y_1|x_{J-1}) & \cdots & p(y_{K-1}|x_{J-1}) \end{bmatrix}
Note that the sum of elements along any row of the matrix is
always equal to one
The joint probability distribution of the random variables X and Y is given by:
p(x_j, y_k) = P(Y = y_k | X = x_j) P(X = x_j) = p(y_k|x_j) p(x_j)
• The marginal probability distribution of the output random
variable Y is obtained by averaging out the dependence of
p(x_j, y_k) on x_j, as shown below:
p(y_k) = \sum_{j=0}^{J-1} P(Y = y_k | X = x_j) P(X = x_j) = \sum_{j=0}^{J-1} p(y_k|x_j) p(x_j)
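As an illustration of these relations, a small sketch using an assumed binary symmetric channel with crossover probability 0.1 and an assumed input distribution (the numbers are our own, not from the notes):

```python
import numpy as np

# Assumed example: a binary symmetric channel with crossover probability 0.1.
# Rows of P are inputs x_j, columns are outputs y_k; each row sums to one.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p_x = np.array([0.75, 0.25])    # assumed input distribution p(x_j)

p_y = p_x @ P                   # p(y_k) = sum_j p(y_k | x_j) p(x_j)
joint = p_x[:, None] * P        # p(x_j, y_k) = p(y_k | x_j) p(x_j)
print(p_y)                      # ≈ [0.7, 0.3]
print(joint.sum())              # ≈ 1.0 (sanity check)
```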
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Mutual Information
Consider two discrete random variables X and Y such that
X = {x_0, x_1, …, x_{J-1}} and Y = {y_0, y_1, …, y_{K-1}}
Suppose we observe Y = y_k and wish to determine,
quantitatively, the amount of information Y = y_k provides
about the event X = x_j, j = 0, 1, 2, …, J-1.
To answer this question, let us define the conditional entropy of
X, selected from alphabet X, given that Y = y_k:
H(X | Y = y_k) = \sum_{j=0}^{J-1} p(x_j|y_k) \log_2( 1 / p(x_j|y_k) )
• Taking the mean of H(X | Y = y_k) over the output alphabet Y:
H(X|Y) = \sum_{k=0}^{K-1} H(X | Y = y_k) p(y_k)
Cont. …
H(X|Y) = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(x_j|y_k) p(y_k) \log_2( 1 / p(x_j|y_k) )
       = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(x_j, y_k) \log_2( 1 / p(x_j|y_k) )
• The quantity H(X|Y) is the conditional entropy of X given Y; it
represents the amount of uncertainty remaining about the
channel input X after the channel output Y has been observed.
• We also know that H(X) represents the uncertainty about the
channel input X before the output is observed.
• Thus, the difference H(X) − H(X|Y) must represent the
amount of uncertainty about the channel input X that is
resolved after observing the output Y.
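Continuing the assumed binary-symmetric-channel example from the previous section, a short sketch of this computation:

```python
import numpy as np

# Joint distribution p(x_j, y_k) from the assumed channel example above
# (input distribution [0.75, 0.25], crossover probability 0.1).
joint = np.array([[0.675, 0.075],
                  [0.025, 0.225]])

p_y = joint.sum(axis=0)                           # p(y_k)
cond = joint / p_y                                # p(x_j | y_k), column by column
H_X_given_Y = (joint * np.log2(1.0 / cond)).sum()
print(H_X_given_Y)   # ≈ 0.399 bits of uncertainty about X remain after observing Y
```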
Cont. …
This quantity is known as the mutual information of the channel:
I(X, Y) = H(X) − H(X|Y)
Similarly, we may write
I(Y, X) = H(Y) − H(Y|X)
Substituting the equations for H(X) and H(X|Y), we have (show):
I(X, Y) = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} p(x_j, y_k) \log( p(x_j|y_k) / p(x_j) )
Unit of the information measure is the nat(s) if the natural
logarithm is used and bit(s) if base 2 is used
Note that ln a = (ln 2) · \log_2 a = 0.69315 · \log_2 a
Properties of mutual information
1. The mutual information is symmetric; that is:
I(X, Y) = I(Y, X)
2. The mutual information is always non-negative; that is:
I(X, Y) ≥ 0, with equality if and only if
p(x_j, y_k) = p(x_j) p(y_k) for all j and k.
The mutual information is zero if and only if X and Y are independent.
3. The mutual information of a channel is related to the joint
entropy of the channel input and channel output by
I(X, Y) = H(X) + H(Y) − H(X, Y)
where the joint entropy H(X, Y) is defined by:
H(X, Y) = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} p(x_j, y_k) \log_2( 1 / p(x_j, y_k) )
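A short numerical check of these properties for the same assumed joint distribution used above:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Same assumed joint distribution p(x_j, y_k) as above.
joint = np.array([[0.675, 0.075],
                  [0.025, 0.225]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

# Definition: I(X,Y) = sum_j sum_k p(x_j, y_k) log2[ p(x_j|y_k) / p(x_j) ]
#                    = sum_j sum_k p(x_j, y_k) log2[ p(x_j, y_k) / (p(x_j) p(y_k)) ]
I_def = (joint * np.log2(joint / np.outer(p_x, p_y))).sum()
I_prop3 = entropy(p_x) + entropy(p_y) - entropy(joint.ravel())   # property 3
print(I_def, I_prop3)    # both ≈ 0.412 bits; non-negative, as property 2 requires
```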
Overview
Brief Introduction to Information Theory
Information and Entropy
Source Coding
Discrete Memoryless Channels
Mutual information
Channel Capacity
Channel Coding Theorem
Channel coding Theorem
The channel coding theorem states that if a discrete memoryless channel has
capacity C and a source generates information at a rate less than C, then
there exists a coding scheme for which the source output can be transmitted
over the channel with an arbitrarily small probability of error; conversely,
reliable transmission is not possible at rates above C.
Block diagram representation of a communication system,
ignoring the source encoder and decoder:
Classification
Block codes
Linear block codes
Cyclic codes
Convolutional codes
Compound codes
Information Capacity Theorem (Shannon Capacity Formula)
For a band-limited, power-limited AWGN channel, the
channel capacity is
C = B \log_2(1 + SNR) = B \log_2(1 + P/(N_0 B))   bits/s, where
B: the bandwidth of the channel
P: the average signal power at the receiver
N_0: the single-sided power spectral density (PSD) of the noise
Important implication:
We can communicate error-free, or with as small a
probability of error as desired, at any rate up to C bits per second
How can we achieve this rate?
Design error detection and correction codes to detect and
correct as many errors as possible .
Example: channel capacity
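As an illustrative sketch of the formula (the telephone-type figures B = 3.4 kHz and SNR = 30 dB are our own assumed values, not taken from the notes):

```python
import math

def shannon_capacity(bandwidth_hz, snr_linear):
    """C = B * log2(1 + SNR) in bits per second for a band-limited AWGN channel."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Assumed telephone-type channel: B = 3.4 kHz, SNR = 30 dB (i.e. 1000 in linear terms).
B = 3400.0
snr = 10 ** (30.0 / 10)
print(shannon_capacity(B, snr))    # ≈ 33,889 bits/s
```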