
Information Theory

Presented By:
Er. Amit Mahajan
What is information theory

 Information theory is needed to enable a communication system to
carry information (signals) from a sender to a receiver over a
communication channel.
 It deals with the mathematical modeling and analysis of
communication systems.
 Its major task is to answer questions about signal compression and
achievable transfer rates.
 Those answers are provided by two key quantities: entropy and
channel capacity.
Entropy
 Entropy is defined in terms of the probabilistic behaviour of a
source of information.
 In information theory the source output is modeled as a discrete
random variable taking values in a fixed finite alphabet with known
probabilities.
 Entropy is the average information content per source symbol.
Mutual Information
• Mutual information is defined through the conditional entropy of the
channel input X, which is selected from a known alphabet
– conditional entropy is the uncertainty remaining
about the channel input after the channel output has
been observed
• Mutual information has several properties:
– it is symmetric: I(X;Y) = I(Y;X)
– it is always nonnegative
– it is related to the joint entropy of the channel input and
channel output
Mutual Information
• The mutual information of two random variables is
a quantity that measures the mutual dependence of
the two variables. The most common unit of
measurement of mutual information is the bit, when
logarithms to the base 2 are used.
• The mutual information between two discrete
random variables X and Y is denoted by I(X;Y) and defined
as I(X;Y) = H(X) − H(X|Y)
• Mutual information is a useful concept for measuring
the amount of information shared between the input and
output of a noisy channel.
Mutual Information
• Mutual information measures the information that X
and Y share: it measures how much knowing one of
these variables reduces our uncertainty about the
other. For example, if X and Y are independent, then
knowing X does not give any information about Y and
vice versa, so their mutual information is zero. At the
other extreme, if X and Y are identical then all
information conveyed by X is shared with Y: knowing
X determines the value of Y and vice versa. As a
result, the mutual information is the same as the
uncertainty contained in Y (or X) alone, namely the
entropy of Y (or X: clearly if X and Y are identical they
have equal entropy).
Mutual Information
 The mutual information of two discrete random variables X and
Y can be defined as:

   I(X;Y) = Σx Σy p(x,y) log2 [ p(x,y) / (p1(x) p2(y)) ]

where p(x,y) is the joint probability distribution
function of X and Y, and p1(x) and p2(y) are the
marginal probability distribution functions of X and Y
respectively.
 Mutual information tells how much information
one random variable carries about another one.
 Mutual information quantifies how far the joint
distribution of X and Y is from what the joint
distribution would be if X and Y were independent.
 Mutual information is a measure of dependence:
I(X;Y) = 0 if and only if X and Y are independent random
variables.
 If X and Y are independent, then p(x,y) = p(x) p(y), and
therefore:

   log2 [ p(x,y) / (p(x) p(y)) ] = log2(1) = 0,  so I(X;Y) = 0.
• Mutual information can be equivalently expressed as

   I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X,Y)

where H(X) and H(Y) are the marginal entropies, H(X|Y)
and H(Y|X) are the conditional entropies, and
H(X,Y) is the joint entropy of X and Y.
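The relations above can be checked numerically. The following short Python sketch (added for illustration; the 2x2 joint distribution is an arbitrary made-up example, not from the slides) computes I(X;Y) directly from the joint distribution and verifies that it equals H(X) + H(Y) − H(X,Y):

import math

# Hypothetical 2x2 joint distribution p(x, y), chosen only for illustration.
p_xy = [[0.30, 0.20],
        [0.10, 0.40]]

p_x = [sum(row) for row in p_xy]        # marginal p1(x)
p_y = [sum(col) for col in zip(*p_xy)]  # marginal p2(y)

def entropy(dist):
    """H = -sum p log2 p, with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# I(X;Y) = sum over x, y of p(x,y) log2( p(x,y) / (p1(x) p2(y)) )
mi = sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# Equivalent form: I(X;Y) = H(X) + H(Y) - H(X,Y)
h_x = entropy(p_x)
h_y = entropy(p_y)
h_xy = entropy([p for row in p_xy for p in row])
print(round(mi, 4), round(h_x + h_y - h_xy, 4))   # both values are about 0.1245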
Entropy for discrete ensembles

The entropy H of a discrete random variable X with possible
values {x1, …, xn} is

   H(X) = E[I(X)]

where E is the expected value operator, and I(X) is the information
content or self-information of X.
I(X) is itself a random variable. If p denotes the probability
mass function of X then the entropy can explicitly be written
as

   H(X) = Σi p(xi) I(xi) = − Σi p(xi) logb p(xi)
Entropy for discrete ensembles
 For a random variable X with n outcomes {x1, …, xn}, the Shannon
entropy, a measure of uncertainty denoted by H(X), is
defined as

   H(X) = − Σi p(xi) logb p(xi)

where p(xi) is the probability mass function of outcome xi.
 Consider a set of n possible outcomes (events) with equal
probability p(xi) = 1/n.
The uncertainty for such a set of n outcomes is then

   H(X) = − Σi (1/n) logb (1/n) = logb(n)

b is the base of the logarithm used. Common values of b
are 2, Euler's number e and 10, and the unit of entropy is
the bit for b = 2, the nat for b = e, and the dit (or digit) for b = 10.
 In the case of p(xi) = 0 for some i, the value of the
corresponding summand 0 logb 0 is taken to be 0, which
is consistent with the limit

   lim p→0+  p logb p = 0
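The definition above can be sketched in a few lines of Python (added for illustration, not part of the original slides); the 0 logb 0 convention is handled by skipping zero-probability outcomes, and the base b is left as a parameter:

import math

def shannon_entropy(probs, b=2.0):
    """H(X) = -sum_i p(x_i) log_b p(x_i); summands with p = 0 contribute 0."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

# Uniform distribution over n outcomes gives H = log_b(n)
n = 8
print(shannon_entropy([1.0 / n] * n))            # 3.0 bits for b = 2
print(shannon_entropy([1.0 / n] * n, b=math.e))  # log(8), about 2.079 nats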

The logarithm is used to provide the additivity property for
independent sources of uncertainty. For example, consider appending to each
of the n values of a first die the value of a second die, which has m possible
outcomes.

There are thus mn possible outcomes.

The uncertainty for such a set of mn outcomes is then

   logb(mn) = logb(m) + logb(n)

i.e. the uncertainty of playing with two dice is obtained by adding the
uncertainty of the second die logb(m) to the uncertainty of the first
die logb(n).
 In the uniform case the probability of each event is 1/n, so the
uncertainty of any single outcome is logb(n) = − logb(1/n).
 In the case of a non-uniform probability mass function
(or density in the case of continuous random variables),
the self-information of outcome xi is

   I(xi) = − logb p(xi)

which is also called a surprisal; the lower the
probability p(xi), the higher the uncertainty or the
surprise for the outcome xi.
 The average uncertainty, with ⟨·⟩ being the average
operator, is obtained by

   H(X) = ⟨I(X)⟩ = Σi p(xi) I(xi) = − Σi p(xi) logb p(xi)

and is used as the definition of the entropy
H(X). The above also explains why information
entropy and information uncertainty can be used
interchangeably.
EXAMPLE

 Entropy of a coin toss as a function of the probability of it coming up
heads.
 Consider tossing a coin with known, not necessarily fair, probabilities of
coming up heads or tails.
 The entropy of the unknown result of the next toss of the coin is maximised
if the coin is fair (that is, if heads and tails both have equal probability 1/2).
This is the situation of maximum uncertainty as it is most difficult to predict
the outcome of the next toss; the result of each toss of the coin delivers a
full 1 bit of information.
 However, if we know the coin is not fair, but comes up
heads or tails with probabilities p and q, then there is less
uncertainty. Every time, one side is more likely to come
up than the other. The reduced uncertainty is quantified
in a lower entropy: on average each toss of the coin
delivers less than a full 1 bit of information.

The extreme case is that of a double-headed coin which
never comes up tails. Then there is no uncertainty. The
entropy is zero: each toss of the coin delivers no
information.
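A short sketch (added for illustration) of the coin-toss entropy H(p) = −p log2 p − (1−p) log2(1−p), showing 1 bit for a fair coin, less than 1 bit for a biased coin, and 0 bits for a double-headed coin:

import math

def coin_entropy(p_heads):
    """Entropy of one toss of a coin that shows heads with probability p_heads."""
    h = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:
            h -= p * math.log2(p)
    return h

for p in (0.5, 0.9, 1.0):
    print(p, round(coin_entropy(p), 3))
# 0.5 -> 1.0 bit   (fair coin, maximum uncertainty)
# 0.9 -> 0.469 bit (biased coin, less than a full bit per toss)
# 1.0 -> 0.0 bit   (double-headed coin, no information)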
Shannon Noiseless Coding Theorem

 Source coding is a mapping from (a sequence of) symbols from an
information source to a sequence of alphabet symbols (usually bits)
such that the source symbols can be exactly recovered from the
binary bits (lossless source coding) or recovered within some
distortion (lossy source coding). This is the concept behind data
compression.
 In information theory, the source coding theorem informally states
that:
“N i.i.d. random variables each with entropy H(X) can be compressed
into more than N H(X) bits with negligible risk of information loss,
as N tends to infinity; but conversely, if they are compressed into
fewer than N H(X) bits it is virtually certain that information will be
lost.”
Shannon's statement
Let X be a random variable taking values in some finite alphabet Σ1 and let f be a
decipherable code from Σ1 to Σ2, where |Σ2| = a. Let S denote the resulting word
length of f(X).
If f is optimal in the sense that it has the minimal expected word length for X, then

   H(X) / log2(a)  <=  E[S]  <  H(X) / log2(a) + 1

Proof
Let si denote the word length of each possible xi (1 <= i <= n). Define
qi = a^(-si) / C, where C is chosen so that Σi qi = 1. Then

   H(X) = − Σi pi log2(pi)
       <= − Σi pi log2(qi)
       <= − Σi pi log2(a^(-si))
        = Σi si pi log2(a) = E[S] log2(a)

where the second line follows from Gibbs' inequality and the third line follows from
Kraft's inequality: C = Σi a^(-si) <= 1, so log2(C) <= 0.

so

   E[S] >= H(X) / log2(a)

For the second inequality we may set si = ceil(−log_a(pi)), so that

   −log_a(pi) <= si < −log_a(pi) + 1

and so

   a^(-si) <= pi

and

   Σi a^(-si) <= Σi pi = 1

and so by Kraft's inequality there exists a prefix-free code having those word lengths.
Thus the minimal S satisfies

   E[S] = Σi pi si < Σi pi (−log_a(pi) + 1) = H(X) / log2(a) + 1
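The second half of the proof suggests concrete word lengths si = ceil(−log_a pi). The sketch below (added for illustration; it assumes a binary code alphabet, a = 2, and uses the source probabilities of the Shannon-Fano example on the next slides) checks that these lengths satisfy Kraft's inequality and that their expected length lies within the stated bound:

import math

def shannon_code_lengths(probs, a=2):
    """Word lengths s_i = ceil(-log_a p_i), as in the proof above."""
    return [math.ceil(-math.log(p, a)) for p in probs]

probs = [0.4, 0.2, 0.12, 0.08, 0.08, 0.08, 0.04]  # source used in the next example
a = 2
lengths = shannon_code_lengths(probs, a)

H = -sum(p * math.log2(p) for p in probs)
expected_S = sum(p * s for p, s in zip(probs, lengths))

assert sum(a ** (-s) for s in lengths) <= 1                     # Kraft's inequality holds
assert H / math.log2(a) <= expected_S < H / math.log2(a) + 1    # the stated bound holds
print(round(H, 3), round(expected_S, 3))                        # about 2.421 and 3.04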
Shannon-Fano Algorithm
 List the source symbols in order
of decreasing probability.
 Partition the set into two subsets that
are as close to equiprobable as
possible, and assign 0 to the upper
subset and 1 to the lower subset.
 Continue this process, each time
partitioning the subsets into parts with
probabilities as nearly equal as possible,
until further partitioning is not
possible.
For M = 2,

Message   Probability   Encoded Message   Length
x1        0.4           00                2
x2        0.2           01                2
x3        0.12          100               3
x4        0.08          101               3
x5        0.08          110               3
x6        0.08          1110              4
x7        0.04          1111              4

Average length L = 0.4*2 + 0.2*2 + 0.12*3 + 0.08*3 + 0.08*3 + 0.08*4 + 0.04*4
                 = 2.52 letters/message

H(X) = − Σ p(xi) log2 p(xi) = 2.42 bits/message

Efficiency = H(X) / (L log2 M) = 2.42 / (2.52 × 1) ≈ 96%


For M = 3,

Message   Probability   Encoded Message   Length
x1        0.4           -1                1
x2        0.2           0 -1              2
x3        0.12          0 0               2
x4        0.08          1 -1              2
x5        0.08          1 0               2
x6        0.08          1 1 -1            3
x7        0.04          1 1 0             3

Average length L = 0.4*1 + 0.2*2 + 0.12*2 + 0.08*2 + 0.08*2 + 0.08*3 + 0.04*3
                 = 1.72 letters/message

H(X) = − Σ p(xi) log2 p(xi) = 2.42 bits/message

Efficiency = H(X) / (L log2 M) = 2.42 / (1.72 × log2 3) ≈ 88.7%
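The following Python sketch (added for illustration) is one way to implement the binary (M = 2) Shannon-Fano partitioning described above. The splitting rule and the integer weights are implementation choices made for this sketch; with ties in the split broken toward the later cut point, it reproduces the code lengths of the M = 2 table and hence the 2.52 letters/message average.

def shannon_fano(symbols):
    """symbols: list of (name, weight) pairs sorted by decreasing weight.
    Returns a dict mapping each name to its binary Shannon-Fano codeword."""
    codes = {name: "" for name, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(w for _, w in group)
        best_diff, cut, running = None, 1, 0
        # choose the cut that makes the two parts as close to equiprobable as
        # possible; ties go to the later cut, which matches the table above
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(2 * running - total)
            if best_diff is None or diff <= best_diff:
                best_diff, cut = diff, i
        upper, lower = group[:cut], group[cut:]
        for name, _ in upper:
            codes[name] += "0"     # 0 for the upper subset
        for name, _ in lower:
            codes[name] += "1"     # 1 for the lower subset
        split(upper)
        split(lower)

    split(symbols)
    return codes

# probabilities scaled by 100 to integer weights so the tie at the first split is exact
source = [("x1", 40), ("x2", 20), ("x3", 12), ("x4", 8),
          ("x5", 8), ("x6", 8), ("x7", 4)]
codes = shannon_fano(source)
avg_len = sum(w * len(codes[name]) for name, w in source) / 100
print(codes)      # x1: 00, x2: 01, x3: 100, x4: 101, x5: 110, x6: 1110, x7: 1111
print(avg_len)    # 2.52 letters/message, as in the M = 2 table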
Huffman Coding Algorithm
 Encoding algorithm
 Order the symbols by decreasing probabilities
 Starting from the bottom, assign 0 to the least probable
symbol and 1 to the next least probable symbol
 Combine the two least probable symbols into one composite
symbol
 Reorder the list with the composite symbol
 Repeat Step 2 until only two symbols remain in the list
 Huffman tree
 Nodes: symbols or composite symbols
 Branches: from each node, 0 defines one branch while 1
defines the other
 [Figure: Huffman tree with root node, branches labelled 1 and 0, and leaves]
 Decoding algorithm
 Start at the root, follow the branches based on the bits
received
 When a leaf is reached, a symbol is decoded
Huffman Coding Example

Merging steps (list after each merge of the two least probable symbols):
  Step 1:  A 0.35   B 0.17   C 0.17   D 0.16   E 0.15
  Step 2:  A 0.35   DE 0.31  B 0.17   C 0.17
  Step 3:  A 0.35   BC 0.34  DE 0.31
  Step 4:  BCDE 0.65   A 0.35

Huffman Codes:
  A   0
  B   111
  C   110
  D   101
  E   100

[Figure: Huffman tree — the root branches to BCDE (1) and A (0); BCDE branches to
BC (1) and DE (0); BC branches to B (1) and C (0); DE branches to D (1) and E (0)]

Average code-word length = 0.35 x 1 + 0.65 x 3 = 2.30 bits per symbol
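The merging procedure above can be sketched in a few lines of Python (added for illustration; the exact 0/1 labels may differ from the tree shown, but the code lengths and the 2.30 bits/symbol average match):

import heapq
from itertools import count

def huffman_codes(probs):
    """probs: dict of symbol -> probability. Returns dict of symbol -> codeword."""
    tiebreak = count()  # unique ids keep the heap from ever comparing the dicts
    # each heap entry: (probability, tie-break id, {symbol: partial codeword})
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # least probable group
        p1, _, group1 = heapq.heappop(heap)   # next least probable group
        # assign 0 to the least probable group, 1 to the other, then merge them
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.35, "B": 0.17, "C": 0.17, "D": 0.16, "E": 0.15}
codes = huffman_codes(probs)
avg = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(f"{avg:.2f}")   # 2.30 bits per symbol, matching the example above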