Uncertain We Are of The Outcome
The above situation can be captured by using probability. Since each event is assumed
to be independent, the probability of the ith (1 ≤ i ≤ N) event is pi = 1/N (all events are
assumed to be equally probable), and the amount of information associated with the
occurrence of this event, or self-information, is given by −log2 pi. If pi = 1 then the
information is zero (certainty), and if pi = 0 it is infinity; if pi equals 0.5, it is one bit,
corresponding to N = 2. If N = 4, pi = 0.25 and the information is 2 bits, and so on.
Note that in the case of tossing a coin there are two possible events: head or tail. If you
consider the tossing of the coin to be an experiment, the question is how much total
information this experiment has. This can be quantified if we can describe the
outcome of the experiment in some reasonable fashion. Let's encode the outcome
head by the bit 1 and the outcome tail by the bit 0. Thus, a minimal
description of this experiment needs only one bit. Note that the experiment is the sum total
of all the events. If we take the self-information of each event, multiply it by its
probability, and sum over all the events, we get, intuitively, a measure of the
information content or average information of the experiment. It just so happens that
this quantity is also one bit for the coin toss, since the probability of either head or
tail is 0.5 and the self-information of each event is 1 bit. This 1 bit also expresses how
uncertain we are of the outcome. How do we generalize the definition?
Suppose we have a set of N events whose probabilities of occurrence are p1, p2, …, pN.
Can we measure how much choice is involved, or how uncertain we are of the
outcome? Such a measure is precisely the entropy of the experiment or source, denoted
H(p1, p2, …, pN) = − Σi pi log pi. [More precisely, it is called the first-order entropy. Higher-order
entropies depend on contextual information. The true entropy is the infinite-order entropy.
But, by popular use, entropy most often refers to first-order entropy unless stated
otherwise. Read the discussion from Sayood, pp. 14-16.]
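To make the definition concrete, here is a minimal Python sketch (the function name entropy1 is just an illustrative choice) that computes the first-order entropy of a probability distribution and reproduces the one-bit value for the fair coin.

import math

def entropy1(probs):
    """First-order entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy1([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy1([0.25] * 4))    # four equally likely events: 2.0 bits
print(entropy1([0.9, 0.1]))    # biased coin: about 0.47 bits -- less uncertainty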
It is reasonable to require the following properties of H.
1. H should be continuous in the pi; that is, a small change in the value of pi should
cause a small change in the value of H.
The entropy defined above satisfies 0 ≤ H ≤ log N. The proof uses the elementary inequalities

ln(x) ≤ x − 1   and   ln(x) ≥ 1 − 1/x,   (1)

or, written in terms of the logarithm used for H,

log(x) ≤ log(e)(x − 1)   and   log(x) ≥ log(e)(1 − 1/x).   (2)

To obtain the left inequality, note that −p log p ≥ 0 for 0 ≤ p ≤ 1, with equality if p = 1. Hence
H ≥ 0. To obtain the right inequality, note that Σi pi = 1, so we can write:

H − log N = − Σi pi log pi − Σi pi log N
          = − Σi pi (log N + log pi)   (3)
          = − Σi pi log(N pi)
          ≤ Σi pi log(e) (1/(N pi) − 1)   (4)
          = log(e) (Σi 1/N − Σi pi) = log(e)(1 − 1) = 0

Hence H ≤ log N, with equality when all the pi are equal to 1/N.
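These bounds are easy to check numerically. The sketch below (an illustrative test, repeating the entropy1 helper from the earlier sketch so the snippet is self-contained) draws random distributions over N = 8 symbols and verifies that the entropy never exceeds log2 N.

import math, random

def entropy1(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 8
for _ in range(5):
    w = [random.random() for _ in range(N)]
    probs = [x / sum(w) for x in w]          # a random distribution over N symbols
    H = entropy1(probs)
    assert 0.0 <= H <= math.log2(N) + 1e-12  # 0 <= H <= log2 N
    print(f"H = {H:.3f} bits   (log2 N = {math.log2(N):.0f} bits)")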
Joint Probability
Suppose there are two discrete events, X and Y, with N possibilities for X and M
possibilities for Y. Let p(i,j) be the probability of the joint occurrence of i (1 ≤ i ≤ N)
for X and j (1 ≤ j ≤ M) for Y. The entropy of the joint event is

H(X,Y) = − Σi∈X Σj∈Y p(i,j) log p(i,j)   (5)
       = − Σi,j p(i,j) log p(i,j)   (6)

Given the joint probabilities, the marginal probabilities are p(i) = Σj∈Y p(i,j) and
p(j) = Σi∈X p(i,j), so the entropies H(X) and H(Y) can be easily obtained as

H(X) = − Σi p(i) log p(i)   (7)
H(Y) = − Σj p(j) log p(j)   (8)

The conditional probability of i given j is

p(i/j) = p(i,j) / p(j)   (9)

Similarly we have

p(j/i) = p(i,j) / p(i)   (10)

which gives

p(i,j) = p(j) p(i/j) = p(i) p(j/i)   (11)

and

p(i/j) = p(i) p(j/i) / p(j)   (12)
p(j/i) = p(j) p(i/j) / p(i)   (13)

and so on.
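Equations (5)-(13) can be illustrated with a small joint distribution held in a table. The Python sketch below (the 2x2 table of probabilities is an invented example) computes the joint entropy, the marginals, and the conditional probabilities, and checks the factorization in Eq. (11).

import math

# Invented joint distribution p(i,j) for two binary events X and Y.
p = {(0, 0): 0.4, (0, 1): 0.1,
     (1, 0): 0.2, (1, 1): 0.3}

# Marginals: p(i) = sum over j of p(i,j); p(j) = sum over i of p(i,j).
pX = {i: p[(i, 0)] + p[(i, 1)] for i in (0, 1)}
pY = {j: p[(0, j)] + p[(1, j)] for j in (0, 1)}

# Entropies, Eqs. (5)-(8).
H_XY = -sum(v * math.log2(v) for v in p.values() if v > 0)
H_X = -sum(v * math.log2(v) for v in pX.values() if v > 0)
H_Y = -sum(v * math.log2(v) for v in pY.values() if v > 0)

# Conditional probabilities, Eqs. (9) and (10), and the identity (11).
for (i, j), v in p.items():
    p_i_given_j = v / pY[j]
    p_j_given_i = v / pX[i]
    assert abs(pY[j] * p_i_given_j - pX[i] * p_j_given_i) < 1e-12

print(f"H(X,Y) = {H_XY:.3f}, H(X) = {H_X:.3f}, H(Y) = {H_Y:.3f} bits")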
Conditional Entropy
The entropy H(X/Y) is defined to be the average of the entropy of X over all values of Y:

H(X/Y) = Σj∈Y p(j) H(X/Y = j)   (14)
       = Σj∈Y p(j) ( − Σi∈X p(i/j) log p(i/j) )   (15)
       = − Σj∈Y Σi∈X p(j) p(i/j) log p(i/j)   (16)
       = − Σi,j p(i,j) log p(i/j)   (17)

Similarly,

H(Y/X) = − Σi,j p(i) p(j/i) log p(j/i)   (18)
       = − Σi,j p(i,j) log p(j/i)   (19)
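Continuing with a small invented joint table, the sketch below computes H(X/Y) both from the defining average in Eq. (14) and from the joint-probability form in Eq. (17), and confirms that the two agree; H(Y/X) is computed from Eq. (19).

import math

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # invented p(i,j)
pX = {i: p[(i, 0)] + p[(i, 1)] for i in (0, 1)}
pY = {j: p[(0, j)] + p[(1, j)] for j in (0, 1)}

# Eq. (14): average over j of the entropy of the conditional distribution p(./j).
H_X_given_Y_avg = 0.0
for j in (0, 1):
    cond = [p[(i, j)] / pY[j] for i in (0, 1)]
    H_X_given_Y_avg += pY[j] * -sum(c * math.log2(c) for c in cond if c > 0)

# Eq. (17): -sum over (i,j) of p(i,j) log p(i/j).
H_X_given_Y = -sum(v * math.log2(v / pY[j]) for (i, j), v in p.items() if v > 0)
assert abs(H_X_given_Y - H_X_given_Y_avg) < 1e-9

# Eq. (19): -sum over (i,j) of p(i,j) log p(j/i).
H_Y_given_X = -sum(v * math.log2(v / pX[i]) for (i, j), v in p.items() if v > 0)
print(f"H(X/Y) = {H_X_given_Y:.3f} bits, H(Y/X) = {H_Y_given_X:.3f} bits")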
The above definitions of the joint and conditional entropies are naturally justified by the
fact that the joint entropy of two discrete random variables is the entropy of one variable plus
the conditional entropy of the other. This is expressed by the theorem:

Theorem 3: H(X,Y) = H(X) + H(Y/X)   (20)

Proof:
H(X,Y) = − Σi,j p(i,j) log p(i,j)
       = − Σi,j p(i,j) log [p(i) p(j/i)]   [By Eqn. (10)]
       = − Σi,j p(i,j) log p(i) − Σi,j p(i,j) log p(j/i)
       = H(X) + H(Y/X)

We can similarly prove that

H(X,Y) = H(Y) + H(X/Y)   (21)

or

H(X,Y) = H(X) + H(Y/X) = H(Y) + H(X/Y)   (22)

which gives

H(X) − H(X/Y) = H(Y) − H(Y/X)   (23)
The quantity expressed by Eqn. (23) is called the mutual information, denoted I(X:Y)
or I(Y:X). It can also be defined as the relative entropy between the joint distribution
p(i,j) and the product distribution p(i)p(j). That is,

I(X:Y) = Σi,j p(i,j) log [ p(i,j) / (p(i) p(j)) ]
       = Σi,j p(i,j) log [ p(i/j) / p(i) ]   [By Eqn. (9)]
       = H(X) − H(X/Y) = H(Y) − H(Y/X)   (24)
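Both the chain rule in Eqs. (20)-(22) and the three expressions for the mutual information in Eq. (24) can be verified numerically on any joint distribution; the sketch below does so for an invented 2x2 table.

import math

def H(values):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(v * math.log2(v) for v in values if v > 0)

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # invented p(i,j)
pX = {i: p[(i, 0)] + p[(i, 1)] for i in (0, 1)}
pY = {j: p[(0, j)] + p[(1, j)] for j in (0, 1)}

H_XY = H(p.values())
H_X, H_Y = H(pX.values()), H(pY.values())
H_X_given_Y = -sum(v * math.log2(v / pY[j]) for (i, j), v in p.items() if v > 0)
H_Y_given_X = -sum(v * math.log2(v / pX[i]) for (i, j), v in p.items() if v > 0)

# Chain rule, Eqs. (20)-(22).
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-9
assert abs(H_XY - (H_Y + H_X_given_Y)) < 1e-9

# Mutual information, Eq. (24): relative-entropy form and the two entropy differences.
I_rel = sum(v * math.log2(v / (pX[i] * pY[j])) for (i, j), v in p.items() if v > 0)
assert abs(I_rel - (H_X - H_X_given_Y)) < 1e-9
assert abs(I_rel - (H_Y - H_Y_given_X)) < 1e-9
print(f"I(X:Y) = {I_rel:.4f} bits")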
Information Sources
An event in the above discussion could be a message in the context of
communication application. Thus the above discussion is applicable to an artificial
situation when the information source is free to choose only between several definite
messages. A more natural situation is when the information source makes a sequence of
choices from a set of elementary symbols (letters of an alphabet or words) or musical
notes or web pages. The successive sequences are governed by probabilities which are
not independent but at any stage, depend on the preceding choices. For example, if the
source is English language text, not all letters are equiprobable; there are no words
For a source that emits independent symbols with probabilities pi, a sufficiently long
sequence of N symbols will, with high probability, contain about N·pi occurrences of
symbol i, so the probability p of such a typical sequence satisfies

log p ≈ N Σi pi log pi = −NH   (27)

or

H ≈ (log 1/p) / N   (28)

H is thus approximately the logarithm of the reciprocal of the probability of a typical
long sequence, divided by the length of the sequence. It can be rigorously proved that
when N is large, the right-hand side is very close to H.
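This approximation can be observed experimentally: generate a long random sequence of independent symbols, take the log-probability of the particular sequence obtained, and divide by its length. The sketch below (the symbol probabilities and the sequence length are invented for illustration) shows the estimate settling close to H.

import math, random

probs = {'a': 0.5, 'b': 0.3, 'c': 0.2}     # invented symbol probabilities
H = -sum(q * math.log2(q) for q in probs.values())

N = 100_000                                # length of the generated sequence
symbols, weights = zip(*probs.items())
seq = random.choices(symbols, weights=weights, k=N)

# Eq. (28): H is approximately log(1/p) of the observed sequence divided by N.
log_inv_p = -sum(math.log2(probs[s]) for s in seq)
print(f"H            = {H:.4f} bits/symbol")
print(f"log(1/p) / N = {log_inv_p / N:.4f} bits/symbol")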
3. Consider now the case when successive symbols are not chosen independently, but
their probabilities depend on the preceding letter (and not the one before that). This is
called the order(1) model, which can be described by a set of transition probabilities pi(j),
the probability that letter i is followed by letter j [the bigram ij]. The indices i and j run
over all symbols. An equivalent way is to specify all the bigram frequencies p(i,j). We
have already encountered these probabilities as the conditional probability and joint
probability of two events, and they are related as in Eqs. (9)-(11) above.
(Read now Section 2.3, pp. 22-26 from K. Sayood and go through the example of
entropy calculation of an image using first the probability model and then the Markov
model.)
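As an illustration of the order(1) idea, the sketch below (the sample string is invented and far too short to be statistically meaningful) estimates both the order(0) entropy of a text and the entropy of its order(1) model, i.e. the conditional entropy of a letter given the letter before it.

import math
from collections import Counter

text = "the theory of information measures the uncertainty of the source"  # invented sample

# Order(0): single-symbol probabilities p(i).
uni = Counter(text)
n = len(text)
H0 = -sum((c / n) * math.log2(c / n) for c in uni.values())

# Order(1): bigram counts give p(i,j) and the transition probabilities p(j/i).
bi = Counter(zip(text, text[1:]))
row = Counter()
for (i, j), c in bi.items():
    row[i] += c

H1 = 0.0
for (i, j), c in bi.items():
    p_ij = c / (n - 1)            # joint bigram probability p(i,j)
    p_j_given_i = c / row[i]      # transition probability pi(j)
    H1 -= p_ij * math.log2(p_j_given_i)

print(f"order(0) entropy: {H0:.3f} bits/symbol")
print(f"order(1) entropy: {H1:.3f} bits/symbol")   # typically lower: context helps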
Ergodic Process
If every sequence produced by the process has the same statistical properties, the source is
called ergodic. For English text, if the length of the sequence is very large, this
assumption is approximately true. For an ergodic source:
1. The Markov graph is connected.
2. The greatest common divisor of the lengths of all cycles in the graph is 1.
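Both conditions can be checked mechanically for a small Markov source. The sketch below (the 3-state transition matrix is an invented example) tests connectivity by reachability and tests the cycle-length condition through the return times of a single state, which for a connected graph have the same greatest common divisor as the cycle lengths.

import math
from functools import reduce

# Invented transition matrix: P[i][j] is the probability that state i is followed by state j.
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
S = range(len(P))

def reachable(src):
    """Set of states reachable from src by following nonzero transitions."""
    seen, stack = {src}, [src]
    while stack:
        i = stack.pop()
        for j in S:
            if P[i][j] > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

connected = all(reachable(i) == set(S) for i in S)        # condition 1

def return_times(state, max_len=12):
    """Walk lengths n <= max_len for which the chain can return to `state`."""
    times, cur = [], {state}
    for n in range(1, max_len + 1):
        cur = {j for i in cur for j in S if P[i][j] > 0}
        if state in cur:
            times.append(n)
    return times

period = reduce(math.gcd, return_times(0), 0)             # condition 2: period == 1
print("connected:", connected, " period:", period, " ergodic:", connected and period == 1)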
Footnote 1: Most text files do not use more than 128 symbols, which include the alphanumerics,
punctuation marks, and some special symbols. Thus, a 7-bit ASCII code should be enough.
Footnote 2: This situation gives rise to what is called the zero-frequency problem. One cannot assume the
probabilities to be zero, because that would imply an infinite number of bits to encode the first few symbols,
since log 0 is infinity. There are many different methods of handling this problem, but the equiprobability
assumption is a fair and practical one.
The entropy of the source defines a lower limit on the average number of bits needed to encode the source
symbols [ShW98]. The worst model from the information-theoretic point of view is the
order(-1) model, the equiprobable model, which gives the maximum value Hm of the entropy.
Thus, for the 8-bit ASCII code, the value of this entropy is 8 bits. The redundancy R is
defined to be the difference (see footnote 3) between the maximum entropy Hm and the actual entropy H.
As we build better and better models by going to higher order k, the entropy becomes lower,
yielding a higher value of redundancy. The crux of lossless compression
research boils down to developing compression algorithms that find an encoding of
the source using a model with the minimum possible entropy, exploiting the maximum
amount of redundancy. But incorporating a higher-order model is computationally
expensive, and the designer must be aware of other performance metrics such as decoding
or decompression complexity (the process of decoding is the reverse of the encoding
process, in which the redundancy is restored so that the text is again human readable),
the speed of execution of the compression and decompression algorithms, and the use of additional
memory.
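As a concrete illustration of redundancy, the sketch below (the sample string and the 8-bit maximum are illustrative assumptions) compares the order(-1) entropy Hm = log2 256 = 8 bits with the order(0) entropy of a short text and reports R = Hm − H.

import math
from collections import Counter

text = "this is a small sample of english text used only for illustration"
counts = Counter(text)
n = len(text)

Hm = 8.0                                                          # order(-1): 8 bits per symbol
H0 = -sum((c / n) * math.log2(c / n) for c in counts.values())    # order(0) entropy
R = Hm - H0

print(f"Hm = {Hm:.2f} bits, H = {H0:.2f} bits, redundancy R = {R:.2f} bits/symbol")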
Good compression means less storage space to store or archive the data, and it also means
less bandwidth needed to transmit data from source to destination. This is achieved
with the use of a channel, which may be a simple point-to-point connection or a complex
entity like the Internet. For the purpose of discussion, assume that the channel is noiseless,
that is, it does not introduce errors during transmission, and that it has a channel capacity C,
which is the maximum number of bits that can be transmitted per second. Since the entropy
H denotes the average number of bits required to encode a symbol, C/H denotes the
average number of symbols that can be transmitted over the channel per second [ShW98].
A second fundamental theorem of Shannon says that however clever you may get
in developing a compression scheme, you will never be able to transmit on average
more than C/H symbols per second [ShW98]. In other words, to use the available
bandwidth effectively, H should be as low as possible, which means employing a
compression scheme that yields the minimum BPC.
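A small worked example (the capacity figure is invented) makes the point: for a fixed noiseless channel, halving the bits per symbol roughly doubles the symbol throughput.

C = 56_000                   # channel capacity in bits per second (invented figure)
for H in (8.0, 4.0, 2.0):    # average bits per symbol under successively better models
    print(f"H = {H:.1f} bits/symbol -> at most {C / H:,.0f} symbols per second")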
References
[ShW98] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, 1998.
[Sh48] C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, 1948.
[CoT91] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991.
Footnote 3: Shannon's original definition of redundancy is R/Hm, which is the fraction of the structure of the text message
determined by the inherent property of the language that governs the generation of specific sequences of words in the
text.