Information Theory
Information theory: Discrete and continuous messages, message source, zero-memory source, discrete memoryless source, extension of a zero-memory source, Markov sources and their entropy, channels with and without memory, Hartley's and Shannon's laws.
Introduction to Information Theory
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a
message selected at another point.
(Claude Shannon, 1948)
Throughout this book we have studied electrical communication primarily in terms of signals: desired information-bearing signals corrupted by noise and interference signals. Although signal theory has proved to be a valuable tool, it does not come to grips with the fundamental communication process of information transfer. Recognizing the need for a broader viewpoint, Claude Shannon drew upon the earlier work of Nyquist and Hartley and the concurrent investigations of Wiener to develop and set forth in 1948 a radically new approach he called "A Mathematical Theory of Communication."
Shannon's paper isolated the central task of communication engineering in this question: Given a message-producing source, not of our choosing, how should the messages be represented for reliable transmission over a communication channel with its inherent physical limitations? To address that question, Shannon concentrated on the message information per se rather than on the signals. His approach was soon renamed information theory, and it has subsequently evolved into a hybrid mathematical and engineering discipline. Information theory deals with three basic concepts: the measure of source information, the information capacity of a channel, and coding as a means of utilizing channel capacity for information transfer. The term coding is taken here in the broadest sense of message representation, including both discrete and continuous waveforms.
If the rate of information from a source does not exceed the capacity of a communication channel then there exists a
coding technique such that the information can be transmitted over the channel with an arbitrarily small frequency of
errors despite the presence of noise.
(Claude Shannon, 1948)
The surprising, almost incredible aspect of this statement is its promise of error-free transmission on a noisy channel, a condition achieved with the help of coding. The coding process generally involves two distinct encoding/decoding operations, portrayed diagrammatically by Fig. below. The channel encoder/decoder units perform the task of error control coding. Information theory asserts that optimum channel coding yields an equivalent noiseless channel with a well-defined capacity for information transmission. The source encoder/decoder units then match the source to the equivalent noiseless channel, provided that the source information rate falls within the channel capacity.
The information source emits a number of discrete message symbols with probabilities P1, P2, ..., PQ such that P1 + P2 + ... + PQ = 1.
Information sources may take a variety of different forms. For example, in radio broadcasting the source is generally an audio source (voice or music). In TV broadcasting the information source is a video source whose output is a moving image. The outputs of these sources are analog signals, and hence the sources are called analog sources. In contrast, computers and storage devices (magnetic or optical disks) produce discrete outputs (usually binary or ASCII characters) and hence are called discrete sources. Whether a source is analog or discrete, a digital communication system is designed to transmit information in digital form. Consequently, the output of the source must be converted to a format that can be transmitted digitally.
The simplest type of discrete source is one that emits a binary sequence of the form 1010101100…, where the alphabet consists of the two letters {1, 0}. In general, a discrete information source with an alphabet of Q possible symbols, say {x1, x2, ..., xQ}, emits a sequence of letters selected from that alphabet. In statistical terms, we assume that each letter in the alphabet {x1, x2, ..., xQ} has a given probability Pk, that is
Pk = P(X = xk), 1 ≤ k ≤ Q, where P1 + P2 + ... + PQ = 1.
Information sources can be classified as having memory or being memoryless. A source with memory is one for which the current symbol depends on the previous symbols. A memoryless source is one for which each symbol is independent of the previous symbols: the symbols are chosen for transmission independently of one another, i.e., the emission of one symbol does not depend on any other symbol in the same alphabet. Such a source need not remember the symbols emitted earlier, and is therefore called a zero-memory information source or memoryless information source.
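As a quick illustration (a sketch of my own, not from these notes, using a made-up alphabet and probabilities), the following Python snippet models a zero-memory source: every symbol is drawn independently from one fixed probability distribution, so no emitted symbol depends on the symbols emitted before it.

import random

# Hypothetical example alphabet and symbol probabilities (must sum to 1).
symbols = ["x1", "x2", "x3", "x4"]
probs = [0.5, 0.25, 0.125, 0.125]

def zero_memory_source(n):
    # Emit n symbols, each chosen independently of all previous ones.
    return random.choices(symbols, weights=probs, k=n)

print(zero_memory_source(10))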
INFORMATION MEASURE
We begin our study of information theory with the measure of information. Then we apply the information measure to determine the information rate of discrete sources. Particular attention will be given to binary coding for discrete memoryless sources.
Here we use information as a technical term, not to be confused with "knowledge" or "meaning", concepts that defy precise definition and quantitative measurement. In the context of communication, information is simply the commodity produced by the source for transfer to some user at the destination.
Suppose, for instance, that you're planning a trip to a distant city. To determine what clothes to pack, you might hear
one of the following forecasts:
The sun will rise.
There will be scattered rainstorms.
There will be a tornado.
The first message conveys virtually no information since you are quite sure in advance that the sun will rise. The
forecast of rain, however, provides information not previously available to you. The third forecast gives you more
information, tornadoes being rare and unexpected events; you might even decide to cancel the trip!
Notice that the messages have been listed in order of decreasing likelihood and increasing information. The
less likely the message, the more information it conveys. We thus conclude that information measure must be
related to uncertainty, the uncertainty of the user as to what the message will be.
Whether you prefer the source or user viewpoint, it should be evident that information measure involves the probability of a message. If xi denotes an arbitrary message and P(xi) = Pi is the probability of the event that xi is selected for transmission, then the amount of information associated with xi should be some function of Pi. Specifically, Shannon defined the information measure by the logarithmic function

I(xi) = log2(1/Pi) bits

This self-information measure satisfies the following requirements:
i) The self-information must be non-negative, I(xi) ≥ 0, since 0 ≤ Pi ≤ 1.
ii) The lowest possible information must be zero, which occurs for a sure message (Pi = 1).
iii) More information should be carried if the message is a less likely one.
iv) For independent message symbols, the total self-information should equal the sum of the individual self-informations.
Proof: Let Si and Sj be two consecutive independent symbols chosen for transmission with probabilities Pi and Pj respectively. Since the symbols are independent, the probability of the pair is PiPj, and the total self-information contained in both Si and Sj is

I(Si, Sj) = log2[1/(PiPj)] = log2(1/Pi) + log2(1/Pj) = I(Si) + I(Sj)
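A quick numerical check of property iv), with two illustrative probabilities of my own choosing: because the joint probability of independent symbols is the product PiPj, the self-information of the pair equals the sum of the individual self-informations.

from math import log2

Pi, Pj = 0.5, 0.125              # example probabilities of two independent symbols
I_i = log2(1 / Pi)               # 1 bit
I_j = log2(1 / Pj)               # 3 bits
I_pair = log2(1 / (Pi * Pj))     # self-information of the pair
print(I_i + I_j, I_pair)         # both equal 4.0 bits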
If the source emits a long sequence of L symbols, the symbol sk occurs about PkL times, so the total self-information is
Itotal = P1L log2(1/P1) + P2L log2(1/P2) + ... + PQL log2(1/PQ) bits
The amount of information produced by the source during an arbitrary symbol interval is a discrete random variable having the possible values I1, I2, ..., IQ. The expected information per symbol is then given by the statistical average, which is called the source entropy:

H(X) = P1 log2(1/P1) + P2 log2(1/P2) + ... + PQ log2(1/PQ) bits/symbol
But we'll interpret the above equation from the more pragmatic observation that when the source emits a sequence of n >> 1 symbols, the total information to be transferred is about nH(X) bits. Since the source produces r symbols per second on average, the time duration of this sequence is about n/r. The information must therefore be transferred at the average rate nH(X)/(n/r) = rH(X) bits per second. Formally, we define the source information rate R = rH(X) bits/second, a critical quantity relative to transmission.
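As a small worked sketch (the symbol probabilities and symbol rate r below are arbitrary examples, not values from these notes), the snippet computes the source entropy H(X) and the information rate R = rH(X) defined above.

from math import log2

def entropy(probs):
    # Source entropy in bits/symbol: sum of Pk * log2(1/Pk).
    return sum(p * log2(1 / p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]   # example source; probabilities sum to 1
r = 1000                            # example symbol rate in symbols/second

H = entropy(probs)                  # 1.75 bits/symbol for this source
R = r * H                           # information rate in bits/second
print(H, R)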
Properties of Entropy
The value of H(X) for a given source depends upon the symbol probabilities Pi and the alphabet size M. Nonetheless, the source entropy always falls within the limits

0 ≤ H(X) ≤ log2 M

The lower bound corresponds to no uncertainty (one symbol has probability 1 and the rest have probability 0).
Upper Bound: The upper bound corresponds to maximum uncertainty or freedom of choice, which occurs when Pi = 1/M for all i, so the symbols are equally likely.
To illustrate the variation of H(X) between these extremes, take the special but important case of a binary source (M = 2) with

P1 = p and P2 = 1 - p

Substituting these probabilities into the equation above yields the binary entropy

H(X) = p log2(1/p) + (1 - p) log2(1/(1 - p)) bits/symbol
The plot in the figure above displays a rather broad maximum centered at p = 1 - p = 1/2, where H(X) = log2 2 = 1 bit/symbol; H(X) then decreases monotonically to zero as p → 1 or 1 - p → 1.
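To reproduce this behaviour numerically (a sketch; the sample values of p are arbitrary), the snippet below evaluates the binary entropy at several values of p: the maximum of 1 bit/symbol occurs at p = 1/2, and H(X) falls toward zero as p approaches 0 or 1.

from math import log2

def binary_entropy(p):
    # Binary entropy in bits/symbol; taken as 0 at p = 0 or p = 1.
    if p in (0.0, 1.0):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

for p in [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]:
    print(p, round(binary_entropy(p), 4))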
The entropy function is continuous in the interval (0, 1), since the logarithm of a continuous function is itself continuous. To establish the upper bound, we use the inequality ln v ≤ v - 1, which holds for all v > 0 with equality only at v = 1. This can be seen by plotting the straight line y = v - 1 and the curve y = ln v on the same graph, as shown below.
Similarly, the third extension of the source will have 2³ = 8 symbols, given by
S1S1S1 occurring with probability P1³
S1S1S2 occurring with probability P1²P2
S1S2S1 occurring with probability P1²P2
S1S2S2 occurring with probability P1P2²
S2S1S1 occurring with probability P1²P2
S2S1S2 occurring with probability P1P2²
S2S2S1 occurring with probability P1P2²
S2S2S2 occurring with probability P2³
-------------------------------
Total probability = (P1 + P2)³ = 1
Similarly, the entropy of the third extension is H(S³) = 3H(S).
In general, the nth extension of a binary source will have 2ⁿ symbols, and the entropy of the nth extended source is given by
H(Sⁿ) = nH(S)
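The relation H(Sⁿ) = nH(S) is easy to verify numerically. The sketch below (illustrative only, for a binary zero-memory source with example probabilities) forms the nth extension by multiplying symbol probabilities and compares its entropy with n times the entropy of the original source.

from math import log2, prod
from itertools import product

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

P = [0.7, 0.3]          # example binary zero-memory source S
n = 3                   # order of the extension

# Each symbol of the nth extension is a block of n source symbols; its
# probability is the product of the individual probabilities.
ext_probs = [prod(block) for block in product(P, repeat=n)]

print(entropy(ext_probs))   # H(S^n)
print(n * entropy(P))       # n * H(S): the two values agree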
For an rth-order Markov source, the self-information of a symbol Si emitted when the source is in the state (Sj1, Sj2, Sj3, ..., Sjr) is log2[1/P(Si | Sj1, Sj2, Sj3, ..., Sjr)], so the entropy of that state is

H(S | Sj1, Sj2, Sj3, ..., Sjr) = Σi P(Si | Sj1, Sj2, Sj3, ..., Sjr) log2[1/P(Si | Sj1, Sj2, Sj3, ..., Sjr)]

The entropy of the source is then the average entropy of the states, i.e., we average the above equation over the Q^r possible states, weighting each state by its probability of occurrence. For a zero-memory source, rather than a Markov source, P(Si | Sj1, Sj2, Sj3, ..., Sjr) = P(Si).
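As an illustrative sketch (the two-state transition matrix below is a made-up example, and the per-state weighting by the stationary state probabilities is my assumption about how the averaging is carried out), the entropy of a first-order binary Markov source is computed by averaging the entropy of each state over the states.

from math import log2

# Example first-order Markov source with states/symbols {s1, s2}.
# T[i][j] = P(next symbol = sj | current state = si); each row sums to 1.
T = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary state probabilities of a two-state chain, solved analytically.
p1 = T[1][0] / (T[0][1] + T[1][0])
state_probs = [p1, 1 - p1]

def state_entropy(row):
    # Entropy of the symbols emitted from one state.
    return sum(p * log2(1 / p) for p in row if p > 0)

# Source entropy = weighted average of the per-state entropies.
H = sum(ps * state_entropy(row) for ps, row in zip(state_probs, T))
print(H)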
Joint entropy: Consider two random variables X and Y, with m possibilities for X and n possibilities for Y. If P(xi, yj) is the joint probability, P(xi) is the input probability and P(yj) is the output probability, then the entropy of the joint event, called the joint entropy, is defined as

H(X, Y) = Σi Σj P(xi, yj) log2[1/P(xi, yj)]
Conditional Entropy:
From the definition of conditional probability we have
P(xk | yj) = P(xk, yj)/P(yj)
Then the entropy of X, given that the particular symbol yj has been received, is
H(X | yj) = Σk P(xk | yj) log2[1/P(xk | yj)]
Taking the average of the above entropy function over all admissible characters received, we obtain the average conditional entropy (the equivocation)
H(X | Y) = Σj P(yj) H(X | yj) = Σj Σk P(xk, yj) log2[1/P(xk | yj)]
Note: For the conditional probability matrix (CPM) [P(X|Y)], if you add all the elements in any column, the sum should be equal to unity. Similarly, if you add all the elements along any row of the CPM [P(Y|X)], the sum shall be unity.
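A small numerical sketch of these definitions (the joint probability matrix is a made-up example): it computes H(X, Y), H(X | Y), and H(Y | X), and confirms the note above that every column of P(X | Y) and every row of P(Y | X) sums to unity.

from math import log2

# Example joint probability matrix Pxy[i][j] = P(xi, yj); entries sum to 1.
Pxy = [[0.30, 0.10],
       [0.05, 0.55]]

Px = [sum(row) for row in Pxy]                              # marginal P(xi)
Py = [sum(Pxy[i][j] for i in range(2)) for j in range(2)]   # marginal P(yj)

# Joint entropy H(X,Y) = sum over i,j of P(xi,yj) * log2(1/P(xi,yj))
Hxy = sum(p * log2(1 / p) for row in Pxy for p in row if p > 0)

# Conditional entropies via P(x|y) = P(x,y)/P(y) and P(y|x) = P(x,y)/P(x)
Hx_given_y = sum(Pxy[i][j] * log2(Py[j] / Pxy[i][j])
                 for i in range(2) for j in range(2) if Pxy[i][j] > 0)
Hy_given_x = sum(Pxy[i][j] * log2(Px[i] / Pxy[i][j])
                 for i in range(2) for j in range(2) if Pxy[i][j] > 0)
print(Hxy, Hx_given_y, Hy_given_x)

# Column sums of P(X|Y) and row sums of P(Y|X) are both unity.
print([sum(Pxy[i][j] / Py[j] for i in range(2)) for j in range(2)])
print([sum(Pxy[i][j] / Px[i] for j in range(2)) for i in range(2)])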
Channel
A communication channel is the medium through which the symbols generated by the source flow to the receiver. A discrete memoryless channel is a statistical model with input X and output Y: it accepts symbols xk from the source X and generates output symbols yj, as shown in the figure.
(Figure: a discrete memoryless channel with inputs x1, x2, ..., xm, outputs y1, y2, ..., yn, and transition probabilities P(yj | xk).)
If the alphabets of X and Y are infinite, then the channel is a continuous channel, whereas if the alphabets of X and Y are finite, the channel is a discrete channel. It is memoryless when the current output depends only on the current input symbol and not on any previous input symbols. Such a discrete memoryless channel is shown in the figure, with m inputs generated by X and n outputs received by the receiver.
The channel is represented by the conditional probabilities P(yj | xk), where P(yj | xk) is the probability of obtaining the output yj given that the input is xk; it is called the channel transition probability.
In fact, a channel can be characterized completely by means of a channel matrix. The channel matrix is the matrix of channel transition probabilities [P(Y|X)], represented by

            | P(y1|x1)  P(y2|x1)  ...  P(yn|x1) |
[P(Y|X)] =  | P(y1|x2)  P(y2|x2)  ...  P(yn|x2) |
            | ...                               |
            | P(y1|xm)  P(y2|xm)  ...  P(yn|xm) |
The conditional probabilities P(yj|xk) then have special significance as the channel's forward transition
probabilities. By way of example, Fig. below depicts the forward transitions for a noisy channel with two source
symbols and three destination symbols.
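As a concrete sketch matching the figure just described (all numbers are arbitrary examples), the channel below has two source symbols and three destination symbols; the destination probabilities follow from P(yj) = P(x1)P(yj | x1) + P(x2)P(yj | x2).

# Example channel matrix [P(Y|X)]: row k lists P(yj | xk), each row sums to 1.
PY_given_X = [[0.80, 0.15, 0.05],   # forward transitions from x1
              [0.10, 0.20, 0.70]]   # forward transitions from x2

PX = [0.6, 0.4]                     # example source probabilities

# Destination probabilities: P(yj) = sum over k of P(xk) * P(yj | xk)
PY = [sum(PX[k] * PY_given_X[k][j] for k in range(2)) for j in range(3)]
print(PY)                           # the three values sum to 1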
The mutual information is
I(X; Y) = H(X) - H(X | Y)
This equation says that the average information transfer per symbol equals the source entropy minus the equivocation H(X | Y). Correspondingly, the equivocation represents the information lost in the noisy channel.
Alternatively, I(X; Y) = H(Y) - H(Y | X).
This says that the information transferred equals the destination entropy H(Y) minus the noise entropy H(Y | X) added by the channel.
The relationship between these quantities is illustrated below.
(Figure: diagram showing the source entropy H(X), the destination entropy H(Y), the equivocation H(X | Y), the noise entropy H(Y | X), and the mutual information I(X; Y).)
For a channel with statistically independent input and output, we have H(X | Y) = H(X) and H(Y | X) = H(Y), so that
I(X; Y) = H(X) - H(X | Y) = H(X) - H(X) = 0
i.e., no information is transferred through the channel. Hence this channel has the largest internal loss (it is a completely lossy channel), in contrast to a noise-free channel, which is a lossless network.
Properties of Mutual Information
I(X; Y) = I(Y; X)
I(X; Y) ≥ 0
I(X; Y) = H(Y) - H(Y | X)
I(X; Y) = H(X) - H(X | Y)
I(X; Y) = H(X) + H(Y) - H(X, Y)
I(X; Y) = H(X) = H(Y) (for a noise-free channel)
I(X; Y) = 0 (for a channel with statistically independent input and output, i.e., a completely lossy channel)
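The identities above can be checked numerically; the sketch below (reusing a made-up joint probability matrix) verifies that H(X) - H(X | Y), H(Y) - H(Y | X), and H(X) + H(Y) - H(X, Y) all give the same non-negative value of I(X; Y).

from math import log2

# Example joint probability matrix P(xi, yj); entries sum to 1.
Pxy = [[0.30, 0.10],
       [0.05, 0.55]]

Px = [sum(row) for row in Pxy]
Py = [sum(Pxy[i][j] for i in range(2)) for j in range(2)]

def H(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

Hxy = H([p for row in Pxy for p in row])   # joint entropy H(X,Y)
Hx_given_y = Hxy - H(Py)                   # H(X|Y) = H(X,Y) - H(Y)
Hy_given_x = Hxy - H(Px)                   # H(Y|X) = H(X,Y) - H(X)

print(H(Px) - Hx_given_y)          # I(X;Y) from the equivocation
print(H(Py) - Hy_given_x)          # I(X;Y) from the noise entropy
print(H(Px) + H(Py) - Hxy)         # I(X;Y) from the joint entropy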
For a lossless channel, H(X | Y) = 0 and hence I(X; Y) = H(X) - H(X | Y) = H(X), i.e., the mutual information (the information transfer) equals the source entropy and no source information is lost in transmission. Consequently, the channel capacity per symbol is
CS = max H(X) = log2 m, where m is the number of symbols in X.
For a deterministic channel, H(Y | X) = 0 and I(X; Y) = H(Y), so
CS = max[H(Y)] = log2 n bits/symbol
C = r log2 n bps, where n is the number of symbols in Y.
For a noiseless channel we have I(X; Y) = H(X) = H(Y), since H(Y | X) = H(X | Y) = 0. The channel capacity is given by
CS = log2 m = log2 n
C = r log2 m = r log2 n
For a BSC with transition probability α (defined below),
I(X; Y) = H(Y) + α log2 α + (1 - α) log2(1 - α)
and the channel capacity is given by
CS = 1 + α log2 α + (1 - α) log2(1 - α)
C = r[1 + α log2 α + (1 - α) log2(1 - α)]
Binary Symmetric Channel
A BSC has two inputs (x1 = 0, x2 = 1) and two outputs (y1 = 0, y2 = 1), with a channel matrix given by

            | 1-α    α  |
P(Y|X) =    |  α    1-α |

where α is the transition probability.
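The capacity formula for the BSC is easy to evaluate; the sketch below (with illustrative values of α) shows that CS = 1 bit/symbol when α = 0 or α = 1, and that CS drops to zero at α = 1/2, where the output tells us nothing about the input.

from math import log2

def bsc_capacity(alpha):
    # CS = 1 + alpha*log2(alpha) + (1 - alpha)*log2(1 - alpha) bits/symbol
    if alpha in (0.0, 1.0):
        return 1.0
    return 1 + alpha * log2(alpha) + (1 - alpha) * log2(1 - alpha)

for alpha in [0.0, 0.01, 0.1, 0.25, 0.5, 0.9, 1.0]:
    print(alpha, round(bsc_capacity(alpha), 4))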
We know from the Hartley-Shannon law that the capacity of a bandlimited channel corrupted by additive white Gaussian noise is
C = B log2(1 + S/(N0 B)) bits/second
where B is the bandwidth, S the average signal power, and N0 the noise power spectral density. Thus, if N0 and R have fixed values, information transmission at the rate R ≤ C requires
S/(N0 R) ≥ (B/R)(2^(R/B) - 1)
A plot of this relation reveals that bandwidth compression (B/R < 1) demands a dramatic increase of signal power, while bandwidth expansion (B/R > 1) reduces S/(N0 R) asymptotically toward a distinct limiting value of about -1.6 dB as B/R → ∞.
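The trade-off just described can be tabulated. Under the relation above, the minimum required S/(N0 R) equals (B/R)(2^(R/B) - 1); the sketch below (an illustration, with arbitrary sample values of B/R) prints this requirement in dB and shows the approach to the limiting value 10 log10(ln 2) ≈ -1.6 dB.

from math import log10, log

def required_snr_db(b_over_r):
    # Minimum S/(N0*R) in dB for reliable transmission at rate R <= C.
    ratio = b_over_r * (2 ** (1 / b_over_r) - 1)
    return 10 * log10(ratio)

for b_over_r in [0.1, 0.5, 1, 2, 10, 100, 1000]:
    print(b_over_r, round(required_snr_db(b_over_r), 2))

print(round(10 * log10(log(2)), 2))   # limiting value as B/R -> infinity, about -1.6 dB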