
Algebra of Information Measures

Louis Wehenkel

Institut Montefiore, University of Liège, Belgium

ELEN060-2
Information and coding theory
February 2021

1 / 40
Outline

• Entropies and information measures

• Chain rules for entropy and information

• More about independence, and conditional independence

• Translation of these properties into properties of information measures

• Data processing inequality

• Bayesian networks and decision trees

2 / 40
Conditional (a posteriori) entropy

H(X|Y) = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi|Yj).   (1)

The entropy of X knowing that Y = Yj is


H(X|Yj) = − ∑_{i=1}^{n} P(Xi|Yj) log P(Xi|Yj),   (2)

it is nonnegative (it is an entropy) and one has


H(X|Y) = ∑_{j=1}^{m} P(Yj) H(X|Yj),   (3)

hence the latter is also nonnegative.


And concavity of Hn implies: H(X |Y) ≤ H(X ), which is a fundamental property!
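As a quick numerical sketch (not part of the original slides, and using an arbitrary example joint table), both expressions (1) and (3) for H(X|Y) can be checked against each other, together with the bound H(X|Y) ≤ H(X):

```python
from math import log2

# Arbitrary example joint distribution P(Xi ∩ Yj): rows = X, columns = Y.
P = [[0.25, 0.25],
     [0.00, 0.50]]

Py = [sum(P[i][j] for i in range(2)) for j in range(2)]   # marginal P(Yj)
Px = [sum(P[i][j] for j in range(2)) for i in range(2)]   # marginal P(Xi)

def H(dist):
    """Entropy (in bits) of a probability vector, with 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

# Equation (1): H(X|Y) = -sum_ij P(Xi ∩ Yj) log P(Xi|Yj)
HXgY = -sum(P[i][j] * log2(P[i][j] / Py[j])
            for i in range(2) for j in range(2) if P[i][j] > 0)

# Equation (3): H(X|Y) = sum_j P(Yj) H(X|Yj) gives the same value
HXgY_alt = sum(Py[j] * H([P[i][j] / Py[j] for i in range(2)])
               for j in range(2))

print(HXgY, HXgY_alt, H(Px))  # both equal, and H(X|Y) <= H(X)
```

The same pattern works for any joint table; only the two-by-two size is hard-coded here for brevity.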

3 / 40
Joint entropy and its relationship with conditional entropy

H(X, Y) ≜ − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi ∩ Yj)
        = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log ( P(Yj)P(Xi|Yj) )
        = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Yj)
          − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Xi|Yj)
        = − ∑_{j=1}^{m} P(Yj) ( ∑_{i=1}^{n} P(Xi|Yj) ) log P(Yj) + H(X|Y)
        = H(Y) + H(X|Y)
        = H(X) + H(Y|X).
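The two decompositions at the end of the derivation can be verified numerically; the joint table below is an arbitrary illustration, not from the slides:

```python
from math import log2

# Arbitrary example joint table P(Xi ∩ Yj); any valid table works.
P = [[0.1, 0.2, 0.1],
     [0.3, 0.1, 0.2]]
n, m = 2, 3
Px = [sum(row) for row in P]
Py = [sum(P[i][j] for i in range(n)) for j in range(m)]

plog = lambda p: -p * log2(p) if p > 0 else 0.0

HXY = sum(plog(P[i][j]) for i in range(n) for j in range(m))  # H(X,Y)
HX = sum(plog(p) for p in Px)                                 # H(X)
HY = sum(plog(p) for p in Py)                                 # H(Y)
# compute the conditional entropies directly from the definition,
# so that the check is not circular:
HXgY = -sum(P[i][j] * log2(P[i][j] / Py[j])
            for i in range(n) for j in range(m) if P[i][j] > 0)
HYgX = -sum(P[i][j] * log2(P[i][j] / Px[i])
            for i in range(n) for j in range(m) if P[i][j] > 0)

print(HXY, HY + HXgY, HX + HYgX)  # all three coincide
```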

4 / 40
Inequalities related to the entropy

One deduces the following inequalities :


H(X , Y) ≥ max (H(X ), H(Y))
H(X , Y) ≤ H(X ) + H(Y)
Conclusion :
H(X , Y) ≤ H(X ) + H(Y) ≤ 2H(X , Y) (4)

Particular cases :
X and Y independent : H(X , Y) = H(X ) + H(Y)
(because then P (Xi ∩ Yj ) = P (Xi )P (Yj )) (⇒ H(X |Y) = H(X ))
X function of Y : H(X , Y) = H(Y).
(because then H(X |Y) = 0) (since H(X |Yj ) = 0, ∀j = 1, . . . , m)

5 / 40
Mutual information

I(X; Y) = ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log [ P(Xi ∩ Yj) / ( P(Xi)P(Yj) ) ].   (5)

One can derive :

I(X ; Y) = H(X ) − H(X |Y) = H(Y) − H(Y|X )

and hence
I(X ; Y) = H(X ) + H(Y) − H(X , Y)
which we may also write as

H(X , Y) = H(X ) + H(Y) − I(X ; Y)

Main conclusion :
0 ≤ I(X ; Y) ≤ min{H(X ), H(Y)}
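A small numerical check (with an arbitrary joint table, chosen for illustration) that definition (5) and the expression via entropies agree, and that the stated bounds hold:

```python
from math import log2

P = [[0.125, 0.375],   # arbitrary example joint P(Xi ∩ Yj)
     [0.375, 0.125]]
Px = [sum(row) for row in P]                 # marginal P(Xi)
Py = [P[0][j] + P[1][j] for j in range(2)]   # marginal P(Yj)

H = lambda d: -sum(p * log2(p) for p in d if p > 0)

# Definition (5)
I_def = sum(P[i][j] * log2(P[i][j] / (Px[i] * Py[j]))
            for i in range(2) for j in range(2) if P[i][j] > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
HXY = H([P[i][j] for i in range(2) for j in range(2)])
I_alt = H(Px) + H(Py) - HXY
print(I_def, I_alt)  # equal, and 0 <= I(X;Y) <= min(H(X), H(Y))
```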

6 / 40
Exercises.

1. Show that indeed (and in the given order)


1. H(X , Y) = H(Y) + H(X |Y) = H(X ) + H(Y|X )
2. H(X , Y) ≥ max{H(X ), H(Y)}
3. H(X |Y) ≤ H(X )
4. H(X , Y) ≤ H(X ) + H(Y)
5. I(X ; Y) = H(X ) + H(Y) − H(X , Y) = H(X ) − H(X |Y)

2. A tournament between two teams consists of a sequence of at most 5 games which stops as soon as
one of the two teams has won three games. Let a and b denote the two teams and X a r.v. which
represents the outcome of a tournament between a and b. For example, X = aaa, babab, bbaaa are possible
values of X (there are other possible values). Let Y denote the random variable giving the
number of games played (thus Y ∈ {3, 4, 5}).
Suppose that the teams are of the same strength and the outcomes of the successive games are
independent, and compute H(X), H(Y), H(X|Y) and H(Y|X).
Let Z ∈ {a, b} denote the random variable which identifies the team winning the tournament.
Determine H(X|Z), compare with H(X) and justify the result. Determine H(Z|X), and justify.

7 / 40
Summary

Particular cases
X and Y independent : I(X ; Y) = 0 (necessary and sufficient).
X function of Y : I(X ; Y) = H(X ).
X one-to-one function of Y : I(X ; Y) = H(X ) = H(Y)
8 / 40
Exercises.

1. Consider the following contingency table

      Y1    Y2
X1   1/3   1/3
X2    0    1/3

Compute (logarithms in base 2) :


1. H(X ), H(Y)
2. H(X |Y), H(Y|X )
3. H(X , Y)
4. H(Y) − H(Y|X )
5. I(X ; Y)
6. Draw a Venn diagram.
2. Consider three random variables X , Y, Z.
Prove that H(X , Y|Z) = H(X |Z) + H(Y|X , Z).

9 / 40
Other important properties (1)

1. Chain rules
A. Entropies
H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, . . . , X1)

B. Informations

I(X1, X2, . . . , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1)

NB: Conditional mutual information of X and Y given Z is defined by


I(X; Y|Z) ≜ H(X|Z) − H(X|Y, Z).

Almost the same as before, but one uses P(·|Z) (and averages w.r.t. Zi).
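Both chain rules can be verified on a small example. The sketch below (not from the slides) draws a random joint distribution over three binary variables and checks the two identities; the marginalization helper is deliberately naive:

```python
from math import log2
from itertools import product
import random

# Random example joint distribution over three binary variables (X1, X2, Y),
# stored as a dict mapping outcomes to probabilities.
random.seed(0)
w = [random.random() for _ in range(8)]
Z = sum(w)
P = {xyz: wi / Z for xyz, wi in zip(product((0, 1), repeat=3), w)}

def H(vars_idx, given_idx=()):
    """Conditional entropy H(vars | given) computed from the joint table."""
    acc = 0.0
    for outcome, p in P.items():
        if p == 0:
            continue
        def marg(idx):
            # marginal probability of the event matching `outcome` on coords idx
            return sum(q for o, q in P.items()
                       if all(o[i] == outcome[i] for i in idx))
        p_vg = marg(tuple(vars_idx) + tuple(given_idx))
        p_g = marg(given_idx) if given_idx else 1.0
        acc += -p * log2(p_vg / p_g)
    return acc

def I(a, b, given=()):
    """I(a; b | given) = H(a | given) - H(a | b, given)."""
    return H(a, given) - H(a, tuple(b) + tuple(given))

# Chain rule for entropy: H(X1,X2,Y) = H(X1) + H(X2|X1) + H(Y|X1,X2)
lhs = H((0, 1, 2))
rhs = H((0,)) + H((1,), (0,)) + H((2,), (0, 1))
# Chain rule for information: I(X1,X2;Y) = I(X1;Y) + I(X2;Y|X1)
lhs_i = I((0, 1), (2,))
rhs_i = I((0,), (2,)) + I((1,), (2,), (0,))
print(lhs, rhs, lhs_i, rhs_i)  # lhs == rhs in both cases
```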

10 / 40
Outline of proofs.

Chain rule for entropies, by repeated application of the two variable expansion rule :

H(X1 , X2 ) = H(X1 ) + H(X2 |X1 ) (6)


H(X1 , X2 , X3 ) = H(X1 ) + H(X2 , X3 |X1 ) (7)
= H(X1 ) + H(X2 |X1 ) + H(X3 |X2 , X1 ) (8)
..
. (9)
H(X1 , X2 , . . . , Xn ) = H(X1 ) + H(X2 |X1 ) + . . . + H(Xn |Xn−1 , . . . , X1 ) (10)

Chain rule for information :

I(X1 , X2 , . . . , Xn ; Y) = H(X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn |Y) (11)


= ∑_{i=1}^{n} H(Xi |Xi−1 , . . . , X1 ) − ∑_{i=1}^{n} H(Xi |Xi−1 , . . . , X1 , Y)   (12)

= ∑_{i=1}^{n} I(Xi ; Y|Xi−1 , . . . , X1 )   (13)

Equivalent definition of I(X ; Y|Z)

I(X; Y|Z) = ∑_{i,j,k} P(Xi , Yj , Zk ) log [ P(Xi , Yj |Zk ) / ( P(Xi |Zk )P(Yj |Zk ) ) ] = ∑_{k} P(Zk ) I(X; Y|Zk )   (14)

11 / 40
Other important properties (2)

2. Conditional independence and data processing inequality


Consider three discrete random variables : X , Y, Z
They are said to form a Markov chain if Z is conditionally indep. of X given Y.
Notation : Z ⊥ X |Y ⇔ Zi ⊥ Xj |Yk , ∀i, j, k.
In other words P (Z|X , Y) = P (Z|Y)
Interpretation :
Conditioning : suppose Y = Yk given ⇒ P (·) → P (·|Yk )
The probability measure becomes a conditional probability measure.
Cond. indep. ≡ independence under the conditional measure, for any Yk .
Independence is a symmetric relation : Z ⊥ X |Y ⇔ X ⊥ Z|Y.
X , Y, Z form a Markov chain which is denoted by X ↔ Y ↔ Z

(Diagrams: the three graphs X → Y → Z, X ← Y ← Z, and X ← Y → Z all encode the same conditional independence Z ⊥ X|Y.)

12 / 40
The graphical representation is again a particular case of a Bayesian belief network, which will be
introduced more precisely later on.
Bayesian belief networks provide a general and very powerful tool in order to handle conditional
independence. Conditional independence is very important as a notion, because for many physical
problems it may be used to represent causal relationships. Thus, the structure of conditional
independence of stochastic models may be deduced from physical causality and structure.
Consider a communication system composed of two channels in series : X represents messages chosen
by a source, Y messages at the receiving end of the first channel, and Z the messages at the receiving
end of the second channel. These three random variables obviously represent a Markov chain.
Similarly, look at an industrial two stage process : X represents the characteristics of the input material;
Y the characteristics of the output of the first stage and Z the characteristics of the output of the
second stage. If Y is a precise enough description, then again we have a Markov chain. This means that
if we are able to observe the output of the first stage, and want to predict what will happen during the
second stage, the history X of the material is irrelevant.
This notion of sufficiently precise description of a process at an intermediate stage, is what we call in
system theory the state of the system.

13 / 40
NB : these ideas may be applied to sets of variables :

X1 , X2 , . . . ↔ Y1 , Y2 , . . . ↔ Z1 , Z2 , . . .

X1 ↔ X2 ↔ · · · ↔ Xk ↔ · · · ↔ Xn−1 ↔ Xn
Remarks.
If X ↔ Y ↔ Z then

P (X , Y, Z) = P (X )P (Y|X )P (Z|Y) = P (Z)P (Y|Z)P (X |Y).

Data processing inequality


If X ↔ Y ↔ Z form a Markov chain then I(X ; Y) ≥ I(X ; Z).
Indeed : chain rule of information applied in two ways to I(X ; Y, Z):

I(X ; Z) + I(X ; Y|Z) = I(X ; Y, Z) = I(X ; Y) + I(X ; Z|Y).

Since X and Z are conditionally independent given Y, we have I(X ; Z|Y) = 0, and hence


I(X ; Z) ≤ I(X ; Y).
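The inequality is easy to observe numerically. A sketch (with arbitrary transition tables, chosen only for illustration) builds a Markov chain X → Y → Z and compares I(X;Y) with I(X;Z):

```python
from math import log2
from itertools import product

# A Markov chain X -> Y -> Z built from arbitrary example transition tables.
Px = [0.3, 0.7]
PygX = [[0.9, 0.1], [0.2, 0.8]]   # P(Y|X)
PzgY = [[0.8, 0.2], [0.3, 0.7]]   # P(Z|Y): Z depends on X only through Y

P = {(x, y, z): Px[x] * PygX[x][y] * PzgY[y][z]
     for x, y, z in product((0, 1), repeat=3)}

def mi(ai, bi):
    """I(A;B) between coordinates ai and bi of the joint table."""
    pa, pb, pab = {}, {}, {}
    for o, p in P.items():
        pa[o[ai]] = pa.get(o[ai], 0) + p
        pb[o[bi]] = pb.get(o[bi], 0) + p
        pab[(o[ai], o[bi])] = pab.get((o[ai], o[bi]), 0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

print(mi(0, 1), mi(0, 2))  # I(X;Y) >= I(X;Z): the data processing inequality
```

Changing the transition tables changes the two values, but never reverses the inequality.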

14 / 40
Examples
If Z is a function of Y it is conditionally independent of X .
(Hence also X ↔ Y ↔ Y)
If Z is a function of Y and another r.v. independent of X and Y, it is also conditionally
independent of X .
Interpretation
The theorem tells us that whatever we do with Y in terms of data processing, there is
no hope to gain more information about X than what is provided by Y :
⇒ no way to create information by data processing.
Questions:
If A is an event of positive probability, what is the value of P (A|A)?
What is the meaning (value) of P (X , Y, Y) ?
Is it true that P (Y|X , Y) = P (Y|Y) ?

15 / 40
Another consequence
If X ↔ Y ↔ Z then I(X ; Y|Z) ≤ I(X ; Y).
In other words, in a Markov chain conditioning decreases mutual information.
This property is not true in general.
In other words, it is possible that I(X ; Y|Z) > I(X ; Y) when X , Y, Z do not form a
Markov chain.
For example
Consider the double coin flipping experiment.
Compute I(H1 ; S) and I(H1 ; S|H2 ).
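A numerical check of this example; note that the reading of the experiment used here is an assumption: H1 and H2 are taken to be independent fair coin flips and S = H1 + H2 the number of heads.

```python
from math import log2
from itertools import product

# Assumed setup (hypothetical reading of the experiment): H1, H2 are
# independent fair coin flips and S = H1 + H2 is the number of heads.
P = {(h1, h2, h1 + h2): 0.25 for h1, h2 in product((0, 1), repeat=2)}

def H(f):
    """Entropy of the distribution of f(outcome) under the joint table P."""
    d = {}
    for o, p in P.items():
        d[f(o)] = d.get(f(o), 0) + p
    return -sum(p * log2(p) for p in d.values() if p > 0)

# I(H1;S) = H(H1) + H(S) - H(H1,S)
I_h1_s = H(lambda o: o[0]) + H(lambda o: o[2]) - H(lambda o: (o[0], o[2]))
# I(H1;S|H2) = H(H1,H2) + H(S,H2) - H(H1,S,H2) - H(H2)
I_h1_s_g_h2 = (H(lambda o: (o[0], o[1])) + H(lambda o: (o[2], o[1]))
               - H(lambda o: o) - H(lambda o: o[1]))
print(I_h1_s, I_h1_s_g_h2)  # 0.5 and 1.0: conditioning increased I here
```

Under this reading, conditioning on H2 increases the mutual information, confirming that the property fails when the three variables do not form a Markov chain.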
This finishes our study of information measures (algebra).
We will come back later to these notions for continuous random variables.

16 / 40
Exercises
1. Let X , Y, Z be three binary random variables. One gives the following information :

• P (X = 0) = P (Y = 0) = 0.5,
• P (X , Y) = P (X )P (Y)
• Z = (X + Y) mod 2 (i.e. Z = 1 ⇔ X ≠ Y).
(a) What is the value of P (Z = 0) ?
(b) What is the value of H(X ), H(Y), H(Z) ?
(c) What is the value of H(X , Y), H(X , Z), H(Y, Z), H(X , Y, Z) ?
(d) What is the value of I(X ; Y), I(X ; Z), I(Y; Z) ?
(e) What is the value of I(X ; Y, Z), I(Y; X , Z), I(Z; X , Y) ?
(f) What is the value of I(X ; Y|Z), I(Y; X |Z), I(Z; X |Y) ?
(g) Can you draw a Venn diagram which summarizes the situation ?
2. Let X , Y, Z be three discrete random variables. Show that

(a) H(X , Y|Z) ≥ H(X |Z);


(b) I(X , Y; Z) ≥ I(X ; Z);
(c) H(X , Y, Z) − H(X , Y) ≤ H(X , Z) − H(X );
(d) I(X ; Z|Y) ≥ I(Z; Y|X ) − I(Z; Y) + I(X ; Z).
17 / 40
Graphical models for probabilistic inference

Classical logic :
- Start with a theory : set of axioms which are supposed to hold in the physical world (if
X has wings then X is a bird)
- Add observations from the real world : facts (Tweety has wings)
- Infer conclusions about other properties of the real world : Tweety is a bird.
Probabilistic logic :
Same, but statements and axioms are of probabilistic nature.
Inference : from a probabilistic model and observations from the real world, draw
conclusions about unobserved variables.
Graphical models : represent relationships among variables by a graph.
NB.: not all models are graphical...

18 / 40
Main questions
1. How to build models : from first principles, from observations of nature, or from both.
2. How to use models : deductive inference.
Now, we focus on probabilistic (deductive) inference with graphical models :
⇒ Bayesian networks, decision trees.
Model probabilistic relationships among a set of variables
- We will consider only discrete variables, but theory extends to continuous variables
- Bayesian networks : models for joint probability distributions P (A, B, . . . , U)
- Decision trees : models for conditional probability distributions P (A|B, . . . , U)

19 / 40
Bayesian networks : models for P (A, B, . . . , U)
NB:
We consider only the case where A, B, . . . , U take a finite number of values. Thus, the
number of possible combinations of values is also finite.
Thus P (A, B, . . . , U) can be represented explicitly as a multidimensional table of
numbers in [0; 1] : contingency table
But :
1. Explicit representation becomes quickly intractable (when the number of variables
increases).
2. Explicit representation says nothing about structural relationships of variables (e.g.
conditional independence)
Bayesian networks : compact representation, tractable, and interpretable (explicitly).

20 / 40
Example of inference using an explicit representation :
Given P (A, B, C, D, E, F) (model) and the fact (observation or hypothesis) that
B = Bj and C = Ck , what is the probability of event A = Ai ?
In other words compute : P (Ai |Bj , Ck )
Answer :

1. P(Ai |Bj , Ck ) = P(Ai , Bj , Ck ) / P(Bj , Ck ).

2. P(Ai , Bj , Ck ) = ∑_{D} ∑_{E} ∑_{F} P(Ai , Bj , Ck , D, E, F )

3. P(Bj , Ck ) = ∑_{A} ∑_{D} ∑_{E} ∑_{F} P(A, Bj , Ck , D, E, F )

Comments :
Suppose that the variables assume three values each, then P(A, B, C, D, E, F) is given
by 3⁶ − 1 = 728 numbers.
The two sums comprise respectively 3³ = 27 and 3⁴ = 81 terms.
In applications (e.g. coding) : thousands of variables ⇒ trivial method breaks down.
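The brute-force computation above can be sketched directly (the joint table here is random, purely for illustration), which also makes the operation counts concrete:

```python
from itertools import product
import random

# Sketch of the brute-force approach with six ternary variables A..F.
random.seed(1)
vals = (0, 1, 2)
w = {k: random.random() for k in product(vals, repeat=6)}
tot = sum(w.values())
P = {k: v / tot for k, v in w.items()}   # explicit table: 3**6 = 729 entries

Ai, Bj, Ck = 0, 1, 2                     # the hypothesis and the observations

# P(Ai, Bj, Ck): sum over D, E, F  (3**3 = 27 terms)
p_abc = sum(P[(Ai, Bj, Ck, d, e, f)]
            for d, e, f in product(vals, repeat=3))
# P(Bj, Ck): sum over A, D, E, F  (3**4 = 81 terms)
p_bc = sum(P[(a, Bj, Ck, d, e, f)]
           for a, d, e, f in product(vals, repeat=4))

print(len(P), p_abc / p_bc)  # 729 table entries; the conditional P(Ai|Bj,Ck)
```

With thousands of variables the table itself is already unrepresentable, which is exactly the point made above.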

21 / 40
Same problem : we add some structural knowledge
Suppose we know (e.g. because of physical knowledge about the problem) that :

P (A, B, C, D, E, F) = P (A, B, C)P (D, E, F|A)

and that
P (A, B, C) = P (B)P (C)P (A|BC)

Now we need to specify the model :


− For P(B) and P(C) we need 4 = 2 + 2 numbers.
− For P(A|BC) we need 2 × 3 × 3 = 18.
− For P(D, E, F|A) we need 3 × (3³ − 1) = 78.
⇒ Structural knowledge reduces the size of our model from 728 to 4 + 18 + 78 = 100.
Computation of P (Ai |Bj , Ck ) : trivial (table lookup)
What about computation of P (Bj , Ck |Ai ) ? (with and without structural knowledge)

22 / 40
Models are useful to provide not only accurate but also compact representations of the reality. In
general, there is a tradeoff between model complexity and accuracy. Models are useful only if we are
able to exploit them in order to understand or predict behavior of reality : in most situations tractability
is possible only at the expense of accuracy.
Next week, when we will focus on channel coding, we will see that in order to efficiently exploit noisy
channels it is necessary to manipulate very long sequences of symbols (long messages). For example, in
the context of Turbo-codes typical message lengths which are manipulated are in the interval
[1000 . . . 100000]. This means that we need to manipulate joint probability distributions of more than
1000 to 100000 binary variables, which would be totally impossible if we were to use explicit
table-lookup models.
For those who are not yet convinced, let us make the explicit calculation : if N = 1000, a channel code
will comprise 2¹⁰⁰⁰ ≈ 10³⁰¹ code words. If every electron of the Universe (there are about 10⁸⁰) were a
1000 GHz processor able to store and retrieve the probability of such a code word in a single
instruction, one could handle 10¹² × 10⁸⁰ = 10⁹² code words per second, and in a period equal to the
age of the Universe (3 × 10¹⁷ seconds), these computers would handle about 3 × 10¹⁰⁹ code words. To
handle all words, we would still need to wait for a period equal to about 10¹⁹⁰ times the age of our Universe !
Nevertheless, by using compact models it is possible to handle the channel encoding and decoding tasks
efficiently (in linear time with respect to the message length).
Later we will introduce stochastic process models. A stochastic process is a sequence of random
variables corresponding to successive time instants (we will only consider discrete time models in this
course). As time can grow indefinitely, a stochastic process is actually an infinite collection of random
variables. Still, it is possible to devise very compact probabilistic models of such processes : actually
with a few numbers it is possible to characterize the joint probability distribution of any finite
sub-collection of random variables of the process.

23 / 40
Bayesian network : definition

Directed acyclic graph :


• Nodes model variables (one node for each variable)
• Arcs model causal relations among variables (conditional independence relations)

24 / 40
The figure illustrates an example Bayesian network, which is supposed to model the relationships
among the color of eyes of different people in a family. The network actually models the ancestral
relationships among the persons of this family. Each node represents the color (blue or brown) of one
person. The arcs indicate which are the children of a person.
Note that this model does not pretend to be a correct view of Mendelean genetics. We will see later
that this model is slightly more complex, but can still be easily represented by a Bayesian network. For
the time being, we will use this naive picture of genetics as our running example to explain main
concepts in Bayesian networks.
Terminology and notation
We use the same notation (round uppercase) to represent variables and nodes, since they are in
one-to-one correspondence.
Let Xk denote a node in the graph G. Then we denote by :
• P(Xk ) the set of parent nodes of Xk , i.e. the origins of the arcs pointing towards Xk .
• F (Xk ) the children of Xk , i.e. the set {Xj ∈ G | Xk ∈ P(Xj )}.
• D(Xk ) the descendants of Xk , i.e. the set of nodes which are in F (Xk ) or are descendants of a node
in F (Xk ).
• N D(Xk ) the non-descendants of Xk , i.e. the set G \ ({Xk } ∪ D(Xk )).
The figure illustrates these notions for the node M, and shows how this node partitions the graph.
Defining property of Bayesian networks
For any variable X ∈ G and any subset of variables W ⊆ N D(X ), we have
P (X |P(X ), W) = P (X |P(X )), i.e. once the parents of a variable are given, it becomes independent of
all its other non-descendants.

25 / 40
Factorisation property
Suppose we are given a Bayesian network G = {X1 , . . . Xn } and for each variable
Xi ∈ G we are also given P (Xi |P(Xi )), then
P(X1 , . . . , Xn ) = ∏_{i=1}^{n} P(Xi |P(Xi ))

Note that for those variables for which P(Xi ) = ∅ we are given the prior P (Xi ).
Comments :
As long as the P (Xi |P(Xi )) are not specified, a Bayesian network is meant to represent
all distributions which can be factorized in this way.
Any probability distribution may be represented in many ways by a Bayesian network,
but not necessarily all conditional independence structures may be derived explicitly
from a Bayesian network structure.
There exist probability distributions leading to independence relations which can not be
represented completely by any Bayesian network.
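The factorisation property is easy to illustrate on the chain network X → Y → Z. The numbers below are arbitrary; the point is only that the product of the given conditional tables is a valid joint distribution satisfying the defining property:

```python
from itertools import product

# Sketch for the chain X -> Y -> Z: specify P(X), P(Y|X), P(Z|Y)
# (arbitrary example numbers) and recover the joint by factorisation.
Px = [0.6, 0.4]
PygX = [[0.7, 0.3], [0.1, 0.9]]   # P(Y|X)
PzgY = [[0.5, 0.5], [0.2, 0.8]]   # P(Z|Y)

joint = {(x, y, z): Px[x] * PygX[x][y] * PzgY[y][z]
         for x, y, z in product((0, 1), repeat=3)}

# The factorised product is a genuine probability distribution ...
print(sum(joint.values()))        # sums to 1 (up to rounding)
# ... and it satisfies the defining property P(Z | X, Y) = P(Z | Y):
for x, y in product((0, 1), repeat=2):
    p_xy = sum(joint[(x, y, z)] for z in (0, 1))
    p_z_given_xy = joint[(x, y, 1)] / p_xy
    assert abs(p_z_given_xy - PzgY[y][1]) < 1e-12
```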

26 / 40
Simple examples : some (all ?) three variable networks

X → Y → Z, plus X → Z (complete DAG) : P(X, Y, Z) = P(X)P(Y|X)P(Z|X, Y)
X    Y    Z (no arcs) : P(X, Y, Z) = P(X)P(Y)P(Z)
X → Y → Z (chain) : P(X, Y, Z) = P(X)P(Y|X)P(Z|Y)
Z → Y → X (reversed chain) : P(X, Y, Z) = P(Z)P(Y|Z)P(X|Y)
X → Z ← Y (collider) : P(X, Y, Z) = P(Z|X, Y)P(Y)P(X)
X ← Z → Y (fork) : P(X, Y, Z) = P(Z)P(X|Z)P(Y|Z)

27 / 40
The correct Mendel model of eye colors.
Produced by JavaBayes tool.

28 / 40
Here is the complete model of our earlier example related to genetics.
We have added new variables denoted by GT... for each individual which denote the two versions of the
gene which determine the eye color for each individual. The variables may take on three values bb, bB,
and BB, where b stands for blue and B for brown. We assume that the prior (or marginal) probability
of these three values for the grand-parent generation are 0.25, 0.5, 0.25; all the other (conditional)
probability distributions are deduced from the Mendel model : the relation between parent and children
genotypes assume that one of the two chromosomes is chosen at random (0.5 probability) and the
relation between phenotype and genotype is deterministic, assuming that B (brown) is the dominant
character.
Notice that this network models the genotype (genes) of the individuals, and the relationship between
the genotype and the observed variables (eye colors). In spite of the fact that genotypes can not be
observed directly, it is possible to use this model to infer unobserved genotypes and phenotypes from
the observed phenotypes (eye colors).
One particularity of this network is that all the conditional probabilities are identical (all people behave
in the same way from the viewpoint of our model). The network can be extended to a whole population
and be used to model relationships between successive generations and how one can observe genetic
drift.
The present example is available on the web page http://www.montefiore.ulg.ac.be/˜lwh/javabayes,
where you can use a Java applet to simulate the network and see how it reacts to observations. On the
same page you can also try out the earlier naive version of the same problem, and compare the
differences.
The prior probability distributions have been chosen so that without any observations all individuals
have the same marginal probability distribution of genotypes (and hence phenotypes). This is what we
will later on denote by stationary conditions. It turns out that, even if in the earlier generations the
prior distribution is different from the stationary distribution, after a large enough number of
generations the system converges to the stationary distribution.

29 / 40
Graphical models of communication systems

First-order Markov model of a black & white scanner

(Figure : a chain of binary (0/1) pixel variables linked in sequence.)

Same compressed by a 4 bit block code

(Figure : the same pixel chain, with groups of four pixels feeding block code-word variables.)

Same, encoded and sent through a noisy, memoryless communication channel

(Figure : binary source variables, followed by the convolutional code, the channel noise, and the received message.)

30 / 40
D-separation and conditional independence relations induced by a
BN

Some comments on the notion of independence of sets of random variables

Let A = {X1 , . . . , Xl } and B = {Y1 , . . . , Ym } be two sets of random variables.
- What is the meaning of A ⊥ B ?
- Is it true that A ⊥ B ⇒ (∀i, j : Xi ⊥ Yj ) ?
- And/or is the converse true, i.e. (∀i, j : Xi ⊥ Yj ) ⇒ A ⊥ B ?
D-separation: definition
Let us denote by A, B, C three disjoint subsets of r.v. of a BN, and let us assume that
A and C are non empty.
Let us consider paths over the undirected version of the DAG, from A to C.
We say that A and C are d-separated by B if all paths from A to C are blocked by B.

31 / 40
By definition, a path is blocked if it goes through a variable, say Xk , such that one of the following holds :
1. The pattern → Xk → appears in the path and Xk ∈ B
2. The pattern → Xk ← appears in the path and ({Xk } ∪ D(Xk )) ∩ B = ∅
3. The pattern ← Xk → appears in the path and Xk ∈ B

(Figures : the three patterns, blocked respectively when Xk ∈ B, when neither Xk nor any of its descendants is in B, and when Xk ∈ B.)

All paths which are not blocked are said to be active (w.r.t. A, C and B).
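The collider pattern (rule 2) is the least intuitive one, and is easy to check numerically. A sketch, using the deterministic example Z = XOR(X, Y) with X, Y independent fair bits:

```python
from itertools import product

# Collider X -> Z <- Y with Z = XOR(X, Y): X and Y are independent,
# but become dependent once the collider Z is observed.
P = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def marg(idx):
    """Marginal distribution over the given coordinates."""
    d = {}
    for o, p in P.items():
        key = tuple(o[i] for i in idx)
        d[key] = d.get(key, 0) + p
    return d

pxy, px, py = marg((0, 1)), marg((0,)), marg((1,))
# Unconditionally: P(X,Y) = P(X)P(Y), i.e. the path is blocked by Z.
dependent = any(abs(pxy[(x, y)] - px[(x,)] * py[(y,)]) > 1e-12
                for x, y in pxy)
print(dependent)  # False: X and Y are independent

# Conditioning on Z = 0 activates the path: now X determines Y.
pxy_g0 = {(x, y): p / 0.5 for (x, y, z), p in P.items() if z == 0}
print(pxy_g0)  # {(0, 0): 0.5, (1, 1): 0.5}: X = Y with certainty
```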

32 / 40
D-separation: fundamental property
If A, B, C are three disjoint sets of variables (B may be empty) of a Bayesian network,
then “A and C are d-separated by B” ⇒ A ⊥ C|B.
Notice that, if A and C are d-separated by B, then any subset of A is d-separated from
any subset of C by B.
Notice also that we can change directions of some arrows in the graph without changing
d-separations, provided that we don’t change the set of → Xk ← structures.
Thus, to represent the conditional independences one often uses so-called essential
graphs, obtained from a DAG by replacing arrows which do not participate in a
V -structure by lines.
Belief propagation
D-separation also leads to the design of effective belief propagation algorithms.
(See course notes and subsequent lessons).

33 / 40
There is much more to say about Bayesian belief networks, but limited time in the context of this
course does not allow to go further in depth.
Bayesian networks were proposed in the eighties by Judea Pearl, in order to provide modelling tools for
reasoning under uncertainty in artificial intelligence (e.g. expert systems for medical diagnosis).
In the meanwhile, both theory and practice have progressed significantly, and although the field has not
yet reached full maturity there are already many significant real applications.
One of the complex questions, as regards inference, is to devise efficient algorithms to propagate
evidence through the network. If the network has a tree structure this is a rather easy task (a
generalization of the forward-backward algorithm used for hidden Markov chains, leading to an efficient
algorithm). If the network is not a tree, one approach consists of grouping variables so as to yield a tree
(the so-called junction tree algorithm); another approach is to use approximate (but efficient) algorithms
for probability propagation.
The other main problem under consideration in research concerns the automatic design of probabilistic
models from data. Here also, there is still a lot to do.

34 / 40
Probabilistic reasoning and questionnaires

Let us consider a medical diagnostic problem and its probabilistic model. Let us denote
by D a variable which is true when the patient under consideration has a certain disease
(say hepatitis).
Let Ω denote the set of all possible patients who may visit an M.D.
In order to make a diagnosis, the doctor will typically try to look at the symptoms
(concentration of various types of blood cells, eye color, skin color, temperature . . . )
and ask questions about antecedents (factors, such as age, nutrition, smoking, addiction
to heroin,. . . ).
Note that not all questions have the same relevance, and in general the relevance of a
question depends on the already observed variables. Typically, the doctor would
like to reach conclusions about the diagnosis by asking only relevant and informative
questions.
Problem : how to design an efficient strategy for the diagnosis ?

35 / 40
Probabilistic model

Suppose that we have a model for P (D, A1 , . . . , An , S1 , . . . , Sm ).


We can measure the residual uncertainty of the diagnosis problem by

H(D|A1 , . . . , An , S1 , . . . , Sm )

i.e. the uncertainty which can not be reduced by observations.


If the disease is well known, hopefully this quantity will be small.
Note that, if we forbid the use of one of the possible observations, say A1 , then the
residual uncertainty increases

H(D|A2 , . . . , An , S1 , . . . , Sm ) ≥ H(D|A1 , . . . , An , S1 , . . . , Sm )

but this does not mean that all questions are relevant in all cases.
Suppose we are allowed only to ask one single question (observe one of Ai or Sj ), then
we would choose X ∈ {A1 , . . . , An , S1 , . . . , Sm } maximizing I(X ; D).
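The single-question selection can be sketched directly. The tiny model below is hypothetical (one disease variable D and two candidate observations named A1 and S1, with a random joint table), but the selection rule is exactly the one stated above:

```python
from math import log2
from itertools import product
import random

# Hypothetical joint model over (D, A1, S1), given as a random table.
random.seed(2)
w = {k: random.random() for k in product((0, 1), repeat=3)}
tot = sum(w.values())
P = {k: v / tot for k, v in w.items()}   # coordinates: (D, A1, S1)

def mi(i):
    """I(X_i; D) where D is coordinate 0 of the joint table."""
    pd, px, pdx = {}, {}, {}
    for o, p in P.items():
        pd[o[0]] = pd.get(o[0], 0) + p
        px[o[i]] = px.get(o[i], 0) + p
        pdx[(o[0], o[i])] = pdx.get((o[0], o[i]), 0) + p
    return sum(p * log2(p / (pd[d] * px[x]))
               for (d, x), p in pdx.items() if p > 0)

scores = {name: mi(i) for i, name in ((1, "A1"), (2, "S1"))}
best = max(scores, key=scores.get)
print(scores, best)  # the most informative single question
```

Repeating this selection after each observed answer, on the conditional distribution, is exactly the hill-climbing tree construction described on the next pages.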

36 / 40
Strategy : same as decision tree

37 / 40
The test nodes of the tree (square boxes on the figure) represent essentially questions or observations
that may be made by the doctor. The terminal nodes represent conclusions that will be drawn : the
doctor stops asking questions and decides that it is either very likely or very unlikely that the patient
has hepatitis, or possibly decides that he is still uncertain and the patient should go to a specialist (who
will ask more questions).
The tree structure defines the strategy that the doctor will use to reach a decision : the top-node (root
of the tree) defines the first question, and successors define the substrategies depending on the
obtained answer. Note that the terminal nodes of the tree are a function T of the test variables : using
the tree is equivalent to observing this variable.
Tree construction algorithms :
A good decision tree is one that minimizes the average conditional entropy at the leaf nodes and at the
same time minimizes complexity of the tree (different measures). If we know
P (D, A1 , . . . , An , S1 , . . . , Sm ) (say we have a Bayesian network) we can try to find an optimal tree,
say one which minimizes
H(D|T ) + βComplexity

Brute force :
- generate all possible trees (there is only a finite number of trees)
- for each tree compute H(D|T ) + βComplexity (can be done using P (D, A1 , . . . , An , S1 , . . . , Sm ))
- keep the best one.
Hill climbing :
- select the variable maximizing I(X ; D) at the root node
- for each value Xi of X use P (D, A1 , . . . , An , S1 , . . . , Sm |Xi ) to build subtree.
- stop when H(D|T ) + βComplexity starts to increase.

38 / 40
Further reading

• D. MacKay, Information theory, inference, and learning algorithms


• Chapter 2

39 / 40
Frequently asked questions

• State and graphically represent the algebraic relations among these quantities and
their main properties (bounds, inequalities and equalities among quantities), and
explain the main steps of the mathematical proofs of these relations.
• State the chaining rule for joint entropies. Define the notion of conditional mutual
information and state the chaining rule for mutual informations. Explain the main
steps of the mathematical proofs of these two chaining rules.
• Define the notion of Markov chain (over three discrete random variables). State,
prove, discuss and illustrate the data processing inequality. Give an example where
the data processing inequality can not be applied and where it is also not satisfied.
State and discuss the corollaries of the data processing inequality.

40 / 40
