Information Theory
Mohamed Hamada
Software Engineering Lab
The University of Aizu
Email: [email protected]
URL: http://www.u-aizu.ac.jp/~hamada
Today’s Topics
• Communication Channel
• Noiseless binary channel
• Binary Symmetric Channel (BSC)
• Symmetric Channel
• Mutual Information
• Channel Capacity
Digital Communication Systems

Source coding examples:
1. Huffman Code
3. Lempel-Ziv Code

[Block diagram: Source Encoder → Channel Encoder → Modulator → Channel → De-Modulator → Channel Decoder → Source Decoder]
Digital Communication Systems

Information source models:
1. Memoryless
2. Stochastic
3. Markov
4. Ergodic

[Block diagram: Information Source → Source Encoder → Channel Encoder → Modulator → Channel → De-Modulator → Channel Decoder → Source Decoder → User of Information]
INFORMATION TRANSFER ACROSS CHANNELS

[Diagram: sent messages → source coding → channel coding → channel (carrying symbols) → channel decoding → source decoding → received messages, from source to receiver]
Communication Channel
Examples of channels:
Communication Channel

A channel is described by its transition probabilities p(y | x).

Memoryless:
- the output depends only on the current input
- the input and output alphabets are finite
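A minimal Python sketch (the 2x2 matrix values are assumed for illustration, not taken from the slides) of how a discrete memoryless channel can be represented by its transition-probability matrix, and how the memoryless property factorizes sequence probabilities:

import numpy as np

# rows = inputs x, columns = outputs y; each row sums to 1 (assumed example values)
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

# Memoryless: p(y1..yn | x1..xn) = product of p(yi | xi)
x_seq, y_seq = [0, 1, 1], [0, 1, 0]
prob = np.prod([p_y_given_x[x, y] for x, y in zip(x_seq, y_seq)])
print(prob)   # 0.9 * 0.8 * 0.2 = 0.144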
Noiseless binary channel

Channel: input 0 → output 0, input 1 → output 1 (no errors).

Transition matrix p(y | x):
         y=0   y=1
  x=0     1     0
  x=1     0     1
Binary Symmetric Channel (BSC)
(Noisy channel)

An error source emits an error bit e, with P(e = 1) = p.
The output is y_i = x_i ⊕ e (the input bit XORed with the error bit).

Input 0 → output 0 with probability 1-p, output 1 with probability p.
Input 1 → output 1 with probability 1-p, output 0 with probability p.
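A short illustrative simulation (Python, with an assumed crossover probability p = 0.1) of the BSC model y_i = x_i ⊕ e_i, where the error source emits e_i = 1 with probability p:

import numpy as np

rng = np.random.default_rng(0)
p = 0.1                                    # crossover probability (assumed value)
x = rng.integers(0, 2, size=100_000)       # random input bits
e = (rng.random(x.size) < p).astype(int)   # error source: P(e = 1) = p
y = x ^ e                                  # channel output y_i = x_i XOR e_i
print((x != y).mean())                     # empirical error rate, close to p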
Binary Symmetric Channel (BSC)
(Noisy channel)

Viewed as a transition diagram, the BSC maps:
  0 → 0 with probability 1-p,   0 → 1 with probability p
  1 → 1 with probability 1-p,   1 → 0 with probability p
Symmetric Channel
(Noisy channel)

Channel: X → Y with transition matrix

              y1    y2    y3
p(y|x):  x1   0.3   0.2   0.5
         x2   0.5   0.3   0.2
         x3   0.2   0.5   0.3
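A quick check (illustrative sketch) that the matrix above is symmetric in the channel-coding sense: every row and every column is a permutation of the same probability set {0.2, 0.3, 0.5}:

import numpy as np

P = np.array([[0.3, 0.2, 0.5],
              [0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3]])
rows_ok = all(sorted(row) == sorted(P[0]) for row in P)
cols_ok = all(sorted(col) == sorted(P[:, 0]) for col in P.T)
print(rows_ok, cols_ok)   # True True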
Mutual Information
Mutual Information (MI)

Channel: X → p(y|x) → Y

I(X,Y) = H(Y) - H(Y|X)

Proof:
I(X,Y) = ∑x ∑y p(x,y) log2( p(y|x) / p(y) )
       = - ∑x ∑y p(x,y) log2 p(y) + ∑x ∑y p(x,y) log2 p(y|x)
       = H(Y) - H(Y|X)
• I(X,Y)=H(Y)-H(Y|X)
• I(X,Y)=H(X)-H(X|Y)
• I(X,Y)=H(X)+H(Y)-H(X,Y)
• I(X,Y)=I(Y,X)
• I(X,X)=H(X)
[Venn diagram relating H(X), H(Y), and I(X,Y)]
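A numerical sketch (the input distribution and channel matrix are assumed example values) that checks the identities above by computing I(X,Y) three ways from a joint distribution:

import numpy as np

def H(p):
    # entropy in bits of a probability vector (zero entries ignored)
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x = np.array([0.5, 0.5])                 # assumed input distribution
p_y_given_x = np.array([[0.9, 0.1],        # assumed channel p(y|x)
                        [0.2, 0.8]])

p_xy = p_x[:, None] * p_y_given_x          # joint p(x, y)
p_y = p_xy.sum(axis=0)                     # marginal p(y)

H_Y_given_X = sum(p_x[i] * H(p_y_given_x[i]) for i in range(len(p_x)))
H_X_given_Y = sum(p_y[j] * H(p_xy[:, j] / p_y[j]) for j in range(len(p_y)))

print(H(p_y) - H_Y_given_X)                # I(X,Y) = H(Y) - H(Y|X)
print(H(p_x) - H_X_given_Y)                # I(X,Y) = H(X) - H(X|Y)
print(H(p_x) + H(p_y) - H(p_xy))           # I(X,Y) = H(X) + H(Y) - H(X,Y)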
Mutual Information with 2 channels

Cascade of two channels: X → [Channel 1: p(y|x)] → Y → [Channel 2: p(z|y)] → Z
Mutual Information with 2 channels
Theorem (data-processing inequality): I(X,Y) ≥ I(X,Z)
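A numerical illustration (both channel matrices are assumed example values) of this data-processing inequality for the cascade X → Y → Z:

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_x, channel):
    p_joint = p_x[:, None] * channel       # joint distribution of input and output
    return H(p_x) + H(p_joint.sum(axis=0)) - H(p_joint)

p_x = np.array([0.5, 0.5])
ch1 = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(y|x), assumed
ch2 = np.array([[0.8, 0.2], [0.3, 0.7]])   # p(z|y), assumed
overall = ch1 @ ch2                        # p(z|x) for the cascade X -> Y -> Z

print(mutual_information(p_x, ch1))        # I(X,Y)
print(mutual_information(p_x, overall))    # I(X,Z), never larger than I(X,Y)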
Channel Capacity
Transmission efficiency

On average I need H(X) bits per source output to describe the source symbols X. After observing the channel output Y, I need only H(X|Y) bits per source output.

Channel: X → Y. Before the channel: H(X) bits; after observing Y: H(X|Y) bits.

The channel therefore conveys H(X) - H(X|Y) = I(X,Y) bits per source output.

Note that:
• During data compression, we remove all redundancy in the data to form the most compressed version possible.
Example: noiseless binary channel

Input probabilities: p(x=0) = 1/2, p(x=1) = 1/2.

Transition matrix p(y|x):
         y=0   y=1
  x=0     1     0
  x=1     0     1

H(X) = H(p, 1-p) = H(1/2, 1/2) = -1/2 log 1/2 - 1/2 log 1/2 = 1 bit
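A two-line check (Python) of the entropy computation above; since the channel is noiseless, H(Y|X) = 0 and so I(X,Y) = H(X):

import numpy as np
p = 0.5
print(-p * np.log2(p) - (1 - p) * np.log2(1 - p))   # H(1/2, 1/2) = 1.0 bit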
BSC Channel (on this slide p denotes the probability of correct transmission)

Transition diagram: 0 → 0 and 1 → 1 with probability p; 0 → 1 and 1 → 0 with probability 1-p.

Transition matrix p(y|x):
         y=0    y=1
  x=0     p     1-p
  x=1    1-p     p

Here the mutual information is
I(X,Y) = H(Y) - H(Y|X) = H(Y) - ∑x p(x) H(Y|X=x)
       = H(Y) - ∑x p(x) H(P) = H(Y) - H(P) ∑x p(x) = H(Y) - H(P)     (note that ∑x p(x) = 1)
where H(P) = H(p, 1-p) is the entropy of one row of the transition matrix.
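A sketch (with an assumed value p = 0.9 for correct transmission, following this slide's convention) that sweeps the input distribution of a BSC, showing I(X,Y) = H(Y) - H(p, 1-p) and that the maximum, i.e. the capacity, equals 1 - H(p, 1-p) and is reached by the uniform input:

import numpy as np

def H(probs):
    probs = np.asarray(probs, dtype=float).ravel()
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

p = 0.9                                    # probability of correct transmission (assumed)
bsc = np.array([[p, 1 - p],
                [1 - p, p]])

best = 0.0
for q in np.linspace(0.01, 0.99, 99):      # input distribution (q, 1-q)
    p_y = np.array([q, 1 - q]) @ bsc       # output distribution
    best = max(best, H(p_y) - H([p, 1 - p]))   # I(X,Y) = H(Y) - H(p, 1-p)

print(best, 1 - H([p, 1 - p]))             # both equal the capacity C = 1 - H(p, 1-p)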
For example, consider a symmetric channel with the following transition matrix:

              y1    y2    y3
p(y|x):  x1   0.3   0.2   0.5
         x2   0.5   0.3   0.2
         x3   0.2   0.5   0.3

It is clear that this channel is symmetric, since each row (and each column) is a permutation of the same set of probabilities {0.2, 0.3, 0.5}.

Hence the capacity C of this channel is:
C = log 3 - H(0.2, 0.3, 0.5) ≈ 0.1 bits
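A quick numerical verification (Python) of the capacity value claimed above:

import numpy as np
row = np.array([0.3, 0.2, 0.5])            # any row of the transition matrix
C = np.log2(3) - float(-np.sum(row * np.log2(row)))
print(C)                                   # ≈ 0.0995 bits, i.e. about 0.1 bits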
Channel Capacity
(Noisy) Symmetric channel

Channel: X → Y

C = log |Y| - H(r)

where |Y| is the size of the output alphabet and r is any row of the transition matrix.