
Algebra of Information Measures

Louis Wehenkel

Institut Montefiore, University of Liège, Belgium

ELEN060-2
Information and coding theory
February 2021

1 / 40
Outline

• Entropies and information measures

• Chain rules for entropy and information

• More about independence, and conditional independence

• Translation of these properties into properties of information measures

• Data processing inequality

• Bayesian networks and decision trees

2 / 40
Conditional (a posteriori) entropy

H(X|Y) = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi|Yj).   (1)

The entropy of X knowing that Y = Yj is


H(X|Yj) = − ∑_{i=1}^{n} P(Xi|Yj) log P(Xi|Yj),   (2)

it is nonnegative (it is an entropy) and one has


H(X|Y) = ∑_{j=1}^{m} P(Yj) H(X|Yj),   (3)

hence the latter is also nonnegative.


And concavity of Hn implies: H(X |Y) ≤ H(X ), which is a fundamental property!
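As a quick numerical sketch (not part of the original slides, and using an arbitrary example joint table), both expressions (1) and (3) for H(X|Y) can be checked against each other, together with the bound H(X|Y) ≤ H(X):

```python
from math import log2

# Arbitrary example joint distribution P(Xi ∩ Yj): rows = X, columns = Y.
P = [[0.25, 0.25],
     [0.00, 0.50]]

Py = [sum(P[i][j] for i in range(2)) for j in range(2)]   # marginal P(Yj)
Px = [sum(P[i][j] for j in range(2)) for i in range(2)]   # marginal P(Xi)

def H(dist):
    """Entropy (in bits) of a probability vector, with 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

# Equation (1): H(X|Y) = -sum_ij P(Xi ∩ Yj) log P(Xi|Yj)
HXgY = -sum(P[i][j] * log2(P[i][j] / Py[j])
            for i in range(2) for j in range(2) if P[i][j] > 0)

# Equation (3): H(X|Y) = sum_j P(Yj) H(X|Yj) gives the same value
HXgY_alt = sum(Py[j] * H([P[i][j] / Py[j] for i in range(2)])
               for j in range(2))

print(HXgY, HXgY_alt, H(Px))  # both equal, and H(X|Y) <= H(X)
```

The same pattern works for any joint table; only the two-by-two size is hard-coded here for brevity.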

3 / 40
Joint entropy and its relationship with conditional entropy

H(X, Y) ≜ − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi ∩ Yj)
        = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log ( P(Yj)P(Xi|Yj) )
        = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Yj)
          − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Xi|Yj)
        = − ∑_{j=1}^{m} P(Yj) ( ∑_{i=1}^{n} P(Xi|Yj) ) log P(Yj) + H(X|Y)
        = H(Y) + H(X|Y)
        = H(X) + H(Y|X).
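The two decompositions at the end of the derivation can be verified numerically; the joint table below is an arbitrary illustration, not from the slides:

```python
from math import log2

# Arbitrary example joint table P(Xi ∩ Yj); any valid table works.
P = [[0.1, 0.2, 0.1],
     [0.3, 0.1, 0.2]]
n, m = 2, 3
Px = [sum(row) for row in P]
Py = [sum(P[i][j] for i in range(n)) for j in range(m)]

plog = lambda p: -p * log2(p) if p > 0 else 0.0

HXY = sum(plog(P[i][j]) for i in range(n) for j in range(m))  # H(X,Y)
HX = sum(plog(p) for p in Px)                                 # H(X)
HY = sum(plog(p) for p in Py)                                 # H(Y)
# compute the conditional entropies directly from the definition,
# so that the check is not circular:
HXgY = -sum(P[i][j] * log2(P[i][j] / Py[j])
            for i in range(n) for j in range(m) if P[i][j] > 0)
HYgX = -sum(P[i][j] * log2(P[i][j] / Px[i])
            for i in range(n) for j in range(m) if P[i][j] > 0)

print(HXY, HY + HXgY, HX + HYgX)  # all three coincide
```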

4 / 40
Inequalities related to the entropy

One deduces the following inequalities :


H(X , Y) ≥ max (H(X ), H(Y))
H(X , Y) ≤ H(X ) + H(Y)
Conclusion :
H(X , Y) ≤ H(X ) + H(Y) ≤ 2H(X , Y) (4)

Particular cases :
X and Y independent : H(X , Y) = H(X ) + H(Y)
(because then P (Xi ∩ Yj ) = P (Xi )P (Yj )) (⇒ H(X |Y) = H(X ))
X function of Y : H(X , Y) = H(Y).
(because then H(X |Y) = 0) (since H(X |Yj ) = 0, ∀j = 1, . . . , m)

5 / 40
Mutual information

I(X; Y) = ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log [ P(Xi ∩ Yj) / ( P(Xi)P(Yj) ) ].   (5)

One can derive :

I(X ; Y) = H(X ) − H(X |Y) = H(Y) − H(Y|X )

and hence
I(X ; Y) = H(X ) + H(Y) − H(X , Y)
which we may also write as

H(X , Y) = H(X ) + H(Y) − I(X ; Y)

Main conclusion :
0 ≤ I(X ; Y) ≤ min{H(X ), H(Y)}
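A small numerical check (with an arbitrary joint table, chosen for illustration) that definition (5) and the expression via entropies agree, and that the stated bounds hold:

```python
from math import log2

P = [[0.125, 0.375],   # arbitrary example joint P(Xi ∩ Yj)
     [0.375, 0.125]]
Px = [sum(row) for row in P]                 # marginal P(Xi)
Py = [P[0][j] + P[1][j] for j in range(2)]   # marginal P(Yj)

H = lambda d: -sum(p * log2(p) for p in d if p > 0)

# Definition (5)
I_def = sum(P[i][j] * log2(P[i][j] / (Px[i] * Py[j]))
            for i in range(2) for j in range(2) if P[i][j] > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
HXY = H([P[i][j] for i in range(2) for j in range(2)])
I_alt = H(Px) + H(Py) - HXY
print(I_def, I_alt)  # equal, and 0 <= I(X;Y) <= min(H(X), H(Y))
```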

6 / 40
Exercises.

1. Show that indeed (and in the given order)


1. H(X , Y) = H(Y) + H(X |Y) = H(X ) + H(Y|X )
2. H(X , Y) ≥ max{H(X ), H(Y)}
3. H(X |Y) ≤ H(X )
4. H(X , Y) ≤ H(X ) + H(Y)
5. I(X ; Y) = H(X ) + H(Y) − H(X , Y) = H(X ) − H(X |Y)

2. A tournament between two teams consists of a sequence of at most 5 games which stops as soon as
one of the two teams has won three games. Let a and b denote the two teams and X a r.v. which
represents the outcome of a tournament between a and b. For example, X = aaa, babab, bbaaa are possible
values of X (there are other possible values). Let Y denote the random variable giving the
number of games played (thus Y ∈ {3, 4, 5}).
Suppose that the teams are of the same strength and the outcomes of the successive games are
independent, and compute H(X), H(Y), H(X|Y) and H(Y|X).
Let Z ∈ {a, b} denote the random variable which identifies the team winning the tournament.
Determine H(X|Z), compare with H(X) and justify the result. Determine H(Z|X), and justify.

7 / 40
Summary

Particular cases
X and Y independent : I(X ; Y) = 0 (necessary and sufficient).
X function of Y : I(X ; Y) = H(X ).
X one-to-one function of Y : I(X ; Y) = H(X ) = H(Y)
8 / 40
Exercises.

1. Consider the following contingency table

      Y1    Y2
X1   1/3   1/3
X2    0    1/3

Compute (logarithms in base 2) :


1. H(X ), H(Y)
2. H(X |Y), H(Y|X )
3. H(X , Y)
4. H(Y) − H(Y|X )
5. I(X ; Y)
6. Draw a Venn diagram.
2. Consider three random variables X , Y, Z.
Prove that H(X , Y|Z) = H(X |Z) + H(Y|X , Z).

9 / 40
Other important properties (1)

1. Chain rules
A. Entropies
H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, . . . , X1)

B. Informations

I(X1, X2, . . . , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1)

NB: Conditional mutual information of X and Y given Z is defined by


I(X; Y|Z) ≜ H(X|Z) − H(X|Y, Z).

Almost the same as before, but one uses P(·|Z) (and averages w.r.t. Zi).
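Both chain rules can be verified on a small example. The sketch below (not from the slides) draws a random joint distribution over three binary variables and checks the two identities; the marginalization helper is deliberately naive:

```python
from math import log2
from itertools import product
import random

# Random example joint distribution over three binary variables (X1, X2, Y),
# stored as a dict mapping outcomes to probabilities.
random.seed(0)
w = [random.random() for _ in range(8)]
Z = sum(w)
P = {xyz: wi / Z for xyz, wi in zip(product((0, 1), repeat=3), w)}

def H(vars_idx, given_idx=()):
    """Conditional entropy H(vars | given) computed from the joint table."""
    acc = 0.0
    for outcome, p in P.items():
        if p == 0:
            continue
        def marg(idx):
            # marginal probability of the event matching `outcome` on coords idx
            return sum(q for o, q in P.items()
                       if all(o[i] == outcome[i] for i in idx))
        p_vg = marg(tuple(vars_idx) + tuple(given_idx))
        p_g = marg(given_idx) if given_idx else 1.0
        acc += -p * log2(p_vg / p_g)
    return acc

def I(a, b, given=()):
    """I(a; b | given) = H(a | given) - H(a | b, given)."""
    return H(a, given) - H(a, tuple(b) + tuple(given))

# Chain rule for entropy: H(X1,X2,Y) = H(X1) + H(X2|X1) + H(Y|X1,X2)
lhs = H((0, 1, 2))
rhs = H((0,)) + H((1,), (0,)) + H((2,), (0, 1))
# Chain rule for information: I(X1,X2;Y) = I(X1;Y) + I(X2;Y|X1)
lhs_i = I((0, 1), (2,))
rhs_i = I((0,), (2,)) + I((1,), (2,), (0,))
print(lhs, rhs, lhs_i, rhs_i)  # lhs == rhs in both cases
```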

10 / 40
Outline of proofs.

Chain rule for entropies, by repeated application of the two variable expansion rule :

H(X1 , X2 ) = H(X1 ) + H(X2 |X1 ) (6)


H(X1 , X2 , X3 ) = H(X1 ) + H(X2 , X3 |X1 ) (7)
= H(X1 ) + H(X2 |X1 ) + H(X3 |X2 , X1 ) (8)
..
. (9)
H(X1 , X2 , . . . , Xn ) = H(X1 ) + H(X2 |X1 ) + . . . + H(Xn |Xn−1 , . . . , X1 ) (10)

Chain rule for information :

I(X1 , X2 , . . . , Xn ; Y) = H(X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn |Y) (11)


= ∑_{i=1}^{n} H(Xi |Xi−1 , . . . , X1 ) − ∑_{i=1}^{n} H(Xi |Xi−1 , . . . , X1 , Y)   (12)

= ∑_{i=1}^{n} I(Xi ; Y|Xi−1 , . . . , X1 )   (13)

Equivalent definition of I(X ; Y|Z)

I(X; Y|Z) = ∑_{i,j,k} P(Xi , Yj , Zk ) log [ P(Xi , Yj |Zk ) / ( P(Xi |Zk )P(Yj |Zk ) ) ] = ∑_{k} P(Zk ) I(X; Y|Zk )   (14)

11 / 40
Other important properties (2)

2. Conditional independence and data processing inequality


Consider three discrete random variables : X , Y, Z
They are said to form a Markov chain if Z is conditionally indep. of X given Y.
Notation : Z ⊥ X |Y ⇔ Zi ⊥ Xj |Yk , ∀i, j, k.
In other words P (Z|X , Y) = P (Z|Y)
Interpretation :
Conditioning : suppose Y = Yk given ⇒ P (·) → P (·|Yk )
The probability measure becomes a conditional probability measure.
Cond. indep. ≡ independence under the conditional measure, for any Yk .
Independence is a symmetric relation : Z ⊥ X |Y ⇔ X ⊥ Z|Y.
X , Y, Z form a Markov chain which is denoted by X ↔ Y ↔ Z

(Diagrams: the three graphs X → Y → Z, X ← Y ← Z, and X ← Y → Z all encode the same conditional independence Z ⊥ X|Y.)

12 / 40
The graphical representation is again a particular case of a Bayesian belief network, which will be
introduced more precisely later on.
Bayesian belief networks provide a general and very powerful tool in order to handle conditional
independence. Conditional independence is very important as a notion, because for many physical
problems it may be used to represent causal relationships. Thus, the structure of conditional
independence of stochastic models may be deduced from physical causality and structure.
Consider a communication system composed of two channels in series : X represents messages chosen
by a source, Y messages at the receiving end of the first channel, and Z the messages at the receiving
end of the second channel. These three random variables obviously represent a Markov chain.
Similarly, look at an industrial two stage process : X represents the characteristics of the input material;
Y the characteristics of the output of the first stage and Z the characteristics of the output of the
second stage. If Y is a precise enough description, then again we have a Markov chain. This means that
if we are able to observe the output of the first stage, and want to predict what will happen during the
second stage, the history X of the material is irrelevant.
This notion of sufficiently precise description of a process at an intermediate stage, is what we call in
system theory the state of the system.

13 / 40
NB : these ideas may be applied to sets of variables :

X1 , X2 , . . . ↔ Y1 , Y2 , . . . ↔ Z1 , Z2 , . . .

X1 ↔ X2 ↔ · · · ↔ Xk ↔ · · · ↔ Xn−1 ↔ Xn
Remarks.
If X ↔ Y ↔ Z then

P (X , Y, Z) = P (X )P (Y|X )P (Z|Y) = P (Z)P (Y|Z)P (X |Y).

Data processing inequality


If X ↔ Y ↔ Z form a Markov chain then I(X ; Y) ≥ I(X ; Z).
Indeed : chain rule of information applied in two ways to I(X ; Y, Z):

I(X ; Z) + I(X ; Y|Z) = I(X ; Y, Z) = I(X ; Y) + I(X ; Z|Y).

Since X and Z are conditionally independent given Y, we have I(X ; Z|Y) = 0, and hence


I(X ; Z) ≤ I(X ; Y).
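The inequality is easy to observe numerically. A sketch (with arbitrary transition tables, chosen only for illustration) builds a Markov chain X → Y → Z and compares I(X;Y) with I(X;Z):

```python
from math import log2
from itertools import product

# A Markov chain X -> Y -> Z built from arbitrary example transition tables.
Px = [0.3, 0.7]
PygX = [[0.9, 0.1], [0.2, 0.8]]   # P(Y|X)
PzgY = [[0.8, 0.2], [0.3, 0.7]]   # P(Z|Y): Z depends on X only through Y

P = {(x, y, z): Px[x] * PygX[x][y] * PzgY[y][z]
     for x, y, z in product((0, 1), repeat=3)}

def mi(ai, bi):
    """I(A;B) between coordinates ai and bi of the joint table."""
    pa, pb, pab = {}, {}, {}
    for o, p in P.items():
        pa[o[ai]] = pa.get(o[ai], 0) + p
        pb[o[bi]] = pb.get(o[bi], 0) + p
        pab[(o[ai], o[bi])] = pab.get((o[ai], o[bi]), 0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

print(mi(0, 1), mi(0, 2))  # I(X;Y) >= I(X;Z): the data processing inequality
```

Changing the transition tables changes the two values, but never reverses the inequality.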

14 / 40
Examples
If Z is a function of Y it is conditionally independent of X .
(Hence also X ↔ Y ↔ Y)
If Z is a function of Y and another r.v. independent of X and Y, it is also conditionally
independent of X .
Interpretation
The theorem tells us that whatever we do with Y in terms of data processing, there is
no hope to gain more information about X than what is provided by Y :
⇒ no way to create information by data processing.
Questions:
If A is an event of positive probability, what is the value of P (A|A)?
What is the meaning (value) of P (X , Y, Y) ?
Is it true that P (Y|X , Y) = P (Y|Y) ?

15 / 40
Another consequence
If X ↔ Y ↔ Z then I(X ; Y|Z) ≤ I(X ; Y).
In other words, in a Markov chain conditioning decreases mutual information.
This property is not true in general.
In other words, it is possible that I(X ; Y|Z) > I(X ; Y) when X , Y, Z do not form a
Markov chain.
For example
Consider the double coin flipping experiment.
Compute I(H1 ; S) and I(H1 ; S|H2 ).
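A numerical check of this example; note that the reading of the experiment used here is an assumption: H1 and H2 are taken to be independent fair coin flips and S = H1 + H2 the number of heads.

```python
from math import log2
from itertools import product

# Assumed setup (hypothetical reading of the experiment): H1, H2 are
# independent fair coin flips and S = H1 + H2 is the number of heads.
P = {(h1, h2, h1 + h2): 0.25 for h1, h2 in product((0, 1), repeat=2)}

def H(f):
    """Entropy of the distribution of f(outcome) under the joint table P."""
    d = {}
    for o, p in P.items():
        d[f(o)] = d.get(f(o), 0) + p
    return -sum(p * log2(p) for p in d.values() if p > 0)

# I(H1;S) = H(H1) + H(S) - H(H1,S)
I_h1_s = H(lambda o: o[0]) + H(lambda o: o[2]) - H(lambda o: (o[0], o[2]))
# I(H1;S|H2) = H(H1,H2) + H(S,H2) - H(H1,S,H2) - H(H2)
I_h1_s_g_h2 = (H(lambda o: (o[0], o[1])) + H(lambda o: (o[2], o[1]))
               - H(lambda o: o) - H(lambda o: o[1]))
print(I_h1_s, I_h1_s_g_h2)  # 0.5 and 1.0: conditioning increased I here
```

Under this reading, conditioning on H2 increases the mutual information, confirming that the property fails when the three variables do not form a Markov chain.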
This finishes our study of information measures (algebra).
We will come back later to these notions for continuous random variables.

16 / 40
Exercises
1. Let X , Y, Z be three binary random variables. One gives the following information :

• P (X = 0) = P (Y = 0) = 0.5,
• P (X , Y) = P (X )P (Y)
• Z = (X + Y) mod 2 (i.e. Z = 1 ⇔ X ≠ Y).
(a) What is the value of P (Z = 0) ?
(b) What is the value of H(X ), H(Y), H(Z) ?
(c) What is the value of H(X , Y), H(X , Z), H(Y, Z), H(X , Y, Z) ?
(d) What is the value of I(X ; Y), I(X ; Z), I(Y; Z) ?
(e) What is the value of I(X ; Y, Z), I(Y; X , Z), I(Z; X , Y) ?
(f) What is the value of I(X ; Y|Z), I(Y; X |Z), I(Z; X |Y) ?
(g) Can you draw a Venn diagram which summarizes the situation ?
2. Let X , Y, Z be three discrete random variables. Show that

(a) H(X , Y|Z) ≥ H(X |Z);


(b) I(X , Y; Z) ≥ I(X ; Z);
(c) H(X , Y, Z) − H(X , Y) ≤ H(X , Z) − H(X );
(d) I(X ; Z|Y) ≥ I(Z; Y|X ) − I(Z; Y) + I(X ; Z).
17 / 40
Graphical models for probabilistic inference

Classical logic :
- Start with a theory : set of axioms which are supposed to hold in the physical world (if
X has wings then X is a bird)
- Add observations from the real world : facts (Tweety has wings)
- Infer conclusions about other properties of the real world : Tweety is a bird.
Probabilistic logic :
Same, but statements and axioms are of probabilistic nature.
Inference : from a probabilistic model and observations from the real world, draw
conclusions about unobserved variables.
Graphical models : represent relationships among variables by a graph.
NB.: not all models are graphical...

18 / 40
Main questions
1. How to build models : from first principles, from observations of nature, or from both.
2. How to use models : deductive inference.
Now, we focus on probabilistic (deductive) inference with graphical models :
⇒ Bayesian networks, decision trees.
Model probabilistic relationships among a set of variables
- We will consider only discrete variables, but theory extends to continuous variables
- Bayesian networks : models for joint probability distributions P (A, B, . . . , U)
- Decision trees : models for conditional probability distributions P (A|B, . . . , U)

19 / 40
Bayesian networks : models for P (A, B, . . . , U)
NB:
We consider only the case where A, B, . . . , U take a finite number of values. Thus, the
number of possible combinations of values is also finite.
Thus P (A, B, . . . , U) can be represented explicitly as a multidimensional table of
numbers in [0; 1] : contingency table
But :
1. Explicit representation becomes quickly intractable (when the number of variables
increases).
2. Explicit representation says nothing about structural relationships of variables (e.g.
conditional independence)
Bayesian networks : compact representation, tractable, and interpretable (explicitly).

20 / 40
Example of inference using an explicit representation :
Given P (A, B, C, D, E, F) (model) and the fact (observation or hypothesis) that
B = Bj and C = Ck , what is the probability of event A = Ai ?
In other words compute : P (Ai |Bj , Ck )
Answer :

1. P(Ai |Bj , Ck ) = P(Ai , Bj , Ck ) / P(Bj , Ck ).

2. P(Ai , Bj , Ck ) = ∑_{D} ∑_{E} ∑_{F} P(Ai , Bj , Ck , D, E, F )

3. P(Bj , Ck ) = ∑_{A} ∑_{D} ∑_{E} ∑_{F} P(A, Bj , Ck , D, E, F )

Comments :
Suppose that the variables assume three values each, then P(A, B, C, D, E, F) is given
by 3⁶ − 1 = 728 numbers.
The two sums comprise respectively 3³ = 27 and 3⁴ = 81 terms.
In applications (e.g. coding) : thousands of variables ⇒ trivial method breaks down.
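The brute-force computation above can be sketched directly (the joint table here is random, purely for illustration), which also makes the operation counts concrete:

```python
from itertools import product
import random

# Sketch of the brute-force approach with six ternary variables A..F.
random.seed(1)
vals = (0, 1, 2)
w = {k: random.random() for k in product(vals, repeat=6)}
tot = sum(w.values())
P = {k: v / tot for k, v in w.items()}   # explicit table: 3**6 = 729 entries

Ai, Bj, Ck = 0, 1, 2                     # the hypothesis and the observations

# P(Ai, Bj, Ck): sum over D, E, F  (3**3 = 27 terms)
p_abc = sum(P[(Ai, Bj, Ck, d, e, f)]
            for d, e, f in product(vals, repeat=3))
# P(Bj, Ck): sum over A, D, E, F  (3**4 = 81 terms)
p_bc = sum(P[(a, Bj, Ck, d, e, f)]
           for a, d, e, f in product(vals, repeat=4))

print(len(P), p_abc / p_bc)  # 729 table entries; the conditional P(Ai|Bj,Ck)
```

With thousands of variables the table itself is already unrepresentable, which is exactly the point made above.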

21 / 40
Same problem : we add some structural knowledge
Suppose we know (e.g. because of physical knowledge about the problem) that :

P (A, B, C, D, E, F) = P (A, B, C)P (D, E, F|A)

and that
P (A, B, C) = P (B)P (C)P (A|BC)

Now we need to specify the model :


− For P(B) and P(C) we need 4 = 2 + 2 numbers.
− For P(A|BC) we need 2 × 3 × 3 = 18.
− For P(D, E, F|A) we need 3 × (3³ − 1) = 78.
⇒ Structural knowledge reduces the size of our model from 728 to 4 + 18 + 78 = 100.
Computation of P (Ai |Bj , Ck ) : trivial (table lookup)
What about computation of P (Bj , Ck |Ai ) ? (with and without structural knowledge)

22 / 40
Models are useful to provide not only accurate but also compact representations of the reality. In
general, there is a tradeoff between model complexity and accuracy. Models are useful only if we are
able to exploit them in order to understand or predict behavior of reality : in most situations tractability
is possible only at the expense of accuracy.
Next week, when we will focus on channel coding, we will see that in order to efficiently exploit noisy
channels it is necessary to manipulate very long sequences of symbols (long messages). For example, in
the context of Turbo-codes typical message lengths which are manipulated are in the interval
[1000 . . . 100000]. This means that we need to manipulate joint probability distributions of more than
1000 to 100000 binary variables, which would be totally impossible if we were to use explicit
table-lookup models.
For those who are not yet convinced, let us make the explicit calculation : if N = 1000, a channel code
will comprise 2¹⁰⁰⁰ ≈ 10³⁰¹ code words. If every electron of the Universe (there are about 10⁸⁰) were a
1000 GHz processor able to store and retrieve the probability of such a code word in a single
instruction, one could handle 10¹² × 10⁸⁰ = 10⁹² code words per second, and in a period equal to the
age of the Universe (3 × 10¹⁷ seconds), these computers would handle about 3 × 10¹⁰⁹ code words. To
handle all words, we would still need to wait for a period equal to about 10¹⁹⁰ times the age of our Universe !
Nevertheless, by using compact models it is possible to handle the channel encoding and decoding tasks
efficiently (in linear time with respect to the message length).
Later we will introduce stochastic process models. A stochastic process is a sequence of random
variables corresponding to successive time instants (we will only consider discrete time models in this
course). As time can grow indefinitely, a stochastic process is actually an infinite collection of random
variables. Still, it is possible to devise very compact probabilistic models of such processes : actually
with a few numbers it is possible to characterize the joint probability distribution of any finite
sub-collection of random variables of the process.

23 / 40
Bayesian network : definition

Directed acyclic graph :


• Nodes model variables (one node for each variable)
• Arcs model causal relations among variables (conditional independence relations)

24 / 40
The figure illustrates an example Bayesian network, which is supposed to model the relationships
among the color of eyes of different people in a family. The network actually models the ancestral
relationships among the persons of this family. Each node represents the color (blue or brown) of one
person. The arcs indicate which are the children of a person.
Note that this model does not pretend to be a correct view of Mendelean genetics. We will see later
that this model is slightly more complex, but can still be easily represented by a Bayesian network. For
the time being, we will use this naive picture of genetics as our running example to explain main
concepts in Bayesian networks.
Terminology and notation
We use the same notation (round uppercase) to represent variables and nodes, since they are in
one-to-one correspondence.
Let Xk denote a node in the graph G. Then we denote by :
• P(Xk ) the set of parent nodes of Xk , i.e. the origins of the arcs pointing towards Xk .
• F (Xk ) the children of Xk , i.e. the set {Xj ∈ G | Xk ∈ P(Xj )}.
• D(Xk ) the descendants of Xk , i.e. the set of nodes which are in F (Xk ) or are descendants of a node
in F (Xk ).
• N D(Xk ) the non-descendants of Xk , i.e. the set G \ ({Xk } ∪ D(Xk )).
The figure illustrates these notions for the node M, and shows how this node partitions the graph.
Defining property of Bayesian networks
For any variable X ∈ G and any subset of variables W ⊆ N D(X ), we have
P (X |P(X ), W) = P (X |P(X )), i.e. once the parents of a variable are given, it becomes independent of
all its other non-descendants.

25 / 40
Factorisation property
Suppose we are given a Bayesian network G = {X1 , . . . Xn } and for each variable
Xi ∈ G we are also given P (Xi |P(Xi )), then
P(X1 , . . . , Xn ) = ∏_{i=1}^{n} P(Xi |P(Xi ))

Note that for those variables for which P(Xi ) = ∅ we are given the prior P (Xi ).
Comments :
As long as the P (Xi |P(Xi )) are not specified, a Bayesian network is meant to represent
all distributions which can be factorized in this way.
Any probability distribution may be represented in many ways by a Bayesian network,
but not necessarily all conditional independence structures may be derived explicitly
from a Bayesian network structure.
There exist probability distributions leading to independence relations which can not be
represented completely by any Bayesian network.
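The factorisation property is easy to illustrate on the chain network X → Y → Z. The numbers below are arbitrary; the point is only that the product of the given conditional tables is a valid joint distribution satisfying the defining property:

```python
from itertools import product

# Sketch for the chain X -> Y -> Z: specify P(X), P(Y|X), P(Z|Y)
# (arbitrary example numbers) and recover the joint by factorisation.
Px = [0.6, 0.4]
PygX = [[0.7, 0.3], [0.1, 0.9]]   # P(Y|X)
PzgY = [[0.5, 0.5], [0.2, 0.8]]   # P(Z|Y)

joint = {(x, y, z): Px[x] * PygX[x][y] * PzgY[y][z]
         for x, y, z in product((0, 1), repeat=3)}

# The factorised product is a genuine probability distribution ...
print(sum(joint.values()))        # sums to 1 (up to rounding)
# ... and it satisfies the defining property P(Z | X, Y) = P(Z | Y):
for x, y in product((0, 1), repeat=2):
    p_xy = sum(joint[(x, y, z)] for z in (0, 1))
    p_z_given_xy = joint[(x, y, 1)] / p_xy
    assert abs(p_z_given_xy - PzgY[y][1]) < 1e-12
```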

26 / 40
Simple examples : some (all ?) three variable networks

X → Y → Z, plus X → Z (complete DAG) : P(X, Y, Z) = P(X)P(Y|X)P(Z|X, Y)
X    Y    Z (no arcs) : P(X, Y, Z) = P(X)P(Y)P(Z)
X → Y → Z (chain) : P(X, Y, Z) = P(X)P(Y|X)P(Z|Y)
Z → Y → X (reversed chain) : P(X, Y, Z) = P(Z)P(Y|Z)P(X|Y)
X → Z ← Y (collider) : P(X, Y, Z) = P(Z|X, Y)P(Y)P(X)
X ← Z → Y (fork) : P(X, Y, Z) = P(Z)P(X|Z)P(Y|Z)

27 / 40
The correct Mendel model of eye colors.
Produced by JavaBayes tool.

28 / 40
Here is the complete model of our earlier example related to genetics.
We have added new variables denoted by GT... for each individual which denote the two versions of the
gene which determine the eye color for each individual. The variables may take on three values bb, bB,
and BB, where b stands for blue and B for brown. We assume that the prior (or marginal) probability
of these three values for the grand-parent generation are 0.25, 0.5, 0.25; all the other (conditional)
probability distributions are deduced from the Mendel model : the relation between parent and children
genotypes assume that one of the two chromosomes is chosen at random (0.5 probability) and the
relation between phenotype and genotype is deterministic, assuming that B (brown) is the dominant
character.
Notice that this network models the genotype (genes) of the individuals, and the relationship between
the genotype and the observed variables (eye colors). In spite of the fact that genotypes can not be
observed directly, it is possible to use this model to infer unobserved genotypes and phenotypes from
the observed phenotypes (eye colors).
One particularity of this network is that all the conditional probabilities are identical (all people behave
in the same way from the viewpoint of our model). The network can be extended to a whole population
and be used to model relationships between successive generations and how one can observe genetic
drift.
The present example is available on the web page http://www.montefiore.ulg.ac.be/˜lwh/javabayes,
where you can use a Java applet to simulate the network and see how it reacts to observations. On the
same page you can also try out the earlier naive version of the same problem, and compare the
differences.
The prior probability distributions have been chosen so that without any observations all individuals
have the same marginal probability distribution of genotypes (and hence phenotypes). This is what we
will later on denote by stationary conditions. It turns out that, even if in the earlier generations the
prior distribution is different from the stationary distribution, after a large enough number of
generations the system converges to the stationary distribution.

29 / 40
Graphical models of communication systems

First-order Markov model of a black & white scanner

(Figure : a chain of binary (0/1) pixel variables linked in sequence.)

Same compressed by a 4 bit block code

(Figure : the same pixel chain, with groups of four pixels feeding block code-word variables.)

Same, encoded and sent through a noisy, memoryless communication channel

(Figure : binary source variables, followed by the convolutional code, the channel noise, and the received message.)

30 / 40
D-separation and conditional independence relations induced by a
BN

Some comments on the notion of independence of sets of random variables

Let A = {X1 , . . . , Xl } and B = {Y1 , . . . , Ym } be two sets of random variables.
- What is the meaning of A ⊥ B ?
- Is it true that A ⊥ B ⇒ (∀i, j : Xi ⊥ Yj ) ?
- And/or is the converse true, i.e. (∀i, j : Xi ⊥ Yj ) ⇒ A ⊥ B ?
D-separation: definition
Let us denote by A, B, C three disjoint subsets of r.v. of a BN, and let us assume that
A and C are non empty.
Let us consider paths over the undirected version of the DAG, from A to C.
We say that A and C are d-separated by B if all paths from A to C are blocked by B.

31 / 40
By definition, a path is blocked if it goes through a variable, say Xk , such that one of the following holds :
1. The pattern → Xk → appears in the path and Xk ∈ B
2. The pattern → Xk ← appears in the path and ({Xk } ∪ D(Xk )) ∩ B = ∅
3. The pattern ← Xk → appears in the path and Xk ∈ B

(Figures : the three patterns, blocked respectively when Xk ∈ B, when neither Xk nor any of its descendants is in B, and when Xk ∈ B.)

All paths which are not blocked are said to be active (w.r.t. A, C and B).
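The collider pattern (rule 2) is the least intuitive one, and is easy to check numerically. A sketch, using the deterministic example Z = XOR(X, Y) with X, Y independent fair bits:

```python
from itertools import product

# Collider X -> Z <- Y with Z = XOR(X, Y): X and Y are independent,
# but become dependent once the collider Z is observed.
P = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def marg(idx):
    """Marginal distribution over the given coordinates."""
    d = {}
    for o, p in P.items():
        key = tuple(o[i] for i in idx)
        d[key] = d.get(key, 0) + p
    return d

pxy, px, py = marg((0, 1)), marg((0,)), marg((1,))
# Unconditionally: P(X,Y) = P(X)P(Y), i.e. the path is blocked by Z.
dependent = any(abs(pxy[(x, y)] - px[(x,)] * py[(y,)]) > 1e-12
                for x, y in pxy)
print(dependent)  # False: X and Y are independent

# Conditioning on Z = 0 activates the path: now X determines Y.
pxy_g0 = {(x, y): p / 0.5 for (x, y, z), p in P.items() if z == 0}
print(pxy_g0)  # {(0, 0): 0.5, (1, 1): 0.5}: X = Y with certainty
```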

32 / 40
D-separation: fundamental property
If A, B, C are three disjoint sets of variables (B may be empty) of a Bayesian network,
then “A and C are d-separated by B” ⇒ A ⊥ C|B.
Notice that, if A and C are d-separated by B, then any subset of A is d-separated from
any subset of C by B.
Notice also that we can change directions of some arrows in the graph without changing
d-separations, provided that we don’t change the set of → Xk ← structures.
Thus, to represent the conditional independences one often uses so-called essential
graphs, obtained from a DAG by replacing arrows which do not participate in a
V -structure by lines.
Belief propagation
D-separation also leads to the design of effective belief propagation algorithms.
(See course notes and subsequent lessons).

33 / 40
There is much more to say about Bayesian belief networks, but limited time in the context of this
course does not allow to go further in depth.
Bayesian networks were proposed in the eighties by Judea Pearl, in order to provide modelling tools for
reasoning under uncertainty in artificial intelligence (e.g. expert systems for medical diagnosis).
In the meanwhile, both theory and practice have progressed significantly, and although the field has not
yet reached full maturity there are already many significant real applications.
One of the complex questions, as regards inference, is to devise efficient algorithms to propagate
evidence through the network. If the network has a tree structure this is a rather easy task (a
generalization of the forward-backward algorithm used for hidden Markov chains, leading to an efficient
algorithm). If the network is not a tree, one approach consists of grouping variables so as to yield a tree
(the so-called junction tree algorithm); another approach is to use approximate (but efficient) algorithms
for probability propagation.
The other main problem under consideration in research concerns the automatic design of probabilistic
models from data. Here also, there is still a lot to do.

34 / 40
Probabilistic reasoning and questionnaires

Let us consider a medical diagnostic problem and its probabilistic model. Let us denote
by D a variable which is true when the patient under consideration has a certain disease
(say hepatitis).
Let Ω denote the set of all possible patients who may visit an M.D.
In order to make a diagnosis, the doctor will typically try to look at the symptoms
(concentration of various types of blood cells, eye color, skin color, temperature . . . )
and ask questions about antecedents (factors, such as age, nutrition, smoking, addiction
to heroin,. . . ).
Note that not all questions have the same relevance, and in general the relevance of a
question depends on the already observed variables. Typically, the doctor would
like to reach conclusions about the diagnosis by asking only relevant and informative
questions.
Problem : how to design an efficient strategy for the diagnosis ?

35 / 40
Probabilistic model

Suppose that we have a model for P (D, A1 , . . . , An , S1 , . . . , Sm ).


We can measure the residual uncertainty of the diagnosis problem by

H(D|A1 , . . . , An , S1 , . . . , Sm )

i.e. the uncertainty which can not be reduced by observations.


If the disease is well known, hopefully this quantity will be small.
Note that, if we forbid the use of one of the possible observations, say A1 , then the
residual uncertainty increases

H(D|A2 , . . . , An , S1 , . . . , Sm ) ≥ H(D|A1 , . . . , An , S1 , . . . , Sm )

but this does not mean that all questions are relevant in all cases.
Suppose we are allowed only to ask one single question (observe one of Ai or Sj ), then
we would choose X ∈ {A1 , . . . , An , S1 , . . . , Sm } maximizing I(X ; D).
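The single-question selection can be sketched directly. The tiny model below is hypothetical (one disease variable D and two candidate observations named A1 and S1, with a random joint table), but the selection rule is exactly the one stated above:

```python
from math import log2
from itertools import product
import random

# Hypothetical joint model over (D, A1, S1), given as a random table.
random.seed(2)
w = {k: random.random() for k in product((0, 1), repeat=3)}
tot = sum(w.values())
P = {k: v / tot for k, v in w.items()}   # coordinates: (D, A1, S1)

def mi(i):
    """I(X_i; D) where D is coordinate 0 of the joint table."""
    pd, px, pdx = {}, {}, {}
    for o, p in P.items():
        pd[o[0]] = pd.get(o[0], 0) + p
        px[o[i]] = px.get(o[i], 0) + p
        pdx[(o[0], o[i])] = pdx.get((o[0], o[i]), 0) + p
    return sum(p * log2(p / (pd[d] * px[x]))
               for (d, x), p in pdx.items() if p > 0)

scores = {name: mi(i) for i, name in ((1, "A1"), (2, "S1"))}
best = max(scores, key=scores.get)
print(scores, best)  # the most informative single question
```

Repeating this selection after each observed answer, on the conditional distribution, is exactly the hill-climbing tree construction described on the next pages.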

36 / 40
Strategy : same as decision tree

37 / 40
The test nodes of the tree (square boxes on the figure) represent essentially questions or observations
that may be made by the doctor. The terminal nodes represent conclusions that will be drawn : the
doctor stops asking questions and decides that it is either very likely or very unlikely that the patient
has hepatitis, or possibly decides that he is still uncertain and the patient should go to a specialist (who
will ask more questions).
The tree structure defines the strategy that the doctor will use to reach a decision : the top-node (root
of the tree) defines the first question, and successors define the substrategies depending on the
obtained answer. Note that the terminal nodes of the tree are a function T of the test variables : using
the tree is equivalent to observing this variable.
Tree construction algorithms :
A good decision tree is one that minimizes the average conditional entropy at the leaf nodes and at the
same time minimizes complexity of the tree (different measures). If we know
P (D, A1 , . . . , An , S1 , . . . , Sm ) (say we have a Bayesian network) we can try to find an optimal tree,
say one which minimizes
H(D|T ) + βComplexity

Brute force :
- generate all possible trees (there is only a finite number of trees)
- for each tree compute H(D|T ) + βComplexity (can be done using P (D, A1 , . . . , An , S1 , . . . , Sm ))
- keep the best one.
Hill climbing :
- select the variable maximizing I(X ; D) at the root node
- for each value Xi of X use P (D, A1 , . . . , An , S1 , . . . , Sm |Xi ) to build subtree.
- stop when H(D|T ) + βComplexity starts to increase.

38 / 40
Further reading

• D. MacKay, Information theory, inference, and learning algorithms


• Chapter 2

39 / 40
Frequently asked questions

• State and graphically represent the algebraic relations among these quantities and
their main properties (bounds, inequalities and equalities among quantities), and
explain the main steps of the mathematical proofs of these relations.
• State the chaining rule for joint entropies. Define the notion of conditional mutual
information and state the chaining rule for mutual informations. Explain the main
steps of the mathematical proofs of these two chaining rules.
• Define the notion of Markov chain (over three discrete random variables). State,
prove, discuss and illustrate the data processing inequality. Give an example where
the data processing inequality can not be applied and where it is also not satisfied.
State and discuss the corollaries of the data processing inequality.

40 / 40
