
Graphical Models

Christophe Ambroise
[email protected]
UEVE, UMR CNRS 8071

November 7, 2023

Introduction

3
Practical matters
Reference document

The lecture closely follows and largely borrows material from

"Machine Learning: A Probabilistic Perspective" (MLAPP) by Kevin P. Murphy, in particular Chapter 4 and Chapter 10: Directed graphical models (Bayes nets).


Practical matters
Evaluation

The course will be evaluated through a project in R or Python
(carried out by 2 or 3 students). Each project will be different and graded
on the basis of:

the code (1/3)
a presentation (15 minutes) (1/3)
a report (1/3)

5
What is a graphical model?
A graphical model is a probability distribution in a factorized form

There are two main types of representation of the factorization:

directed graphical model


undirected graphical model
Why the term graph?

Conditional independences between variables are well modeled via graphs

What is it useful for?


reduce the number of parameters
→ may be used for supervised or unsupervised approaches
allow exploratory data analysis by providing a simple graphical
representation
→ “approach causality”

7
What problems does it raise?
learning the parameters of a given factorized form
learning the structure of the graphical model (factorized form)

Directed Graphical Models


(Chapter 10 MLAPP)

10
Joint distribution
Observation

Suppose we observe multiple correlated variables, such as words


in a document, pixels in an image, or genes in a microarray.
Joint distribution

How can we compactly represent the joint distribution p(x|θ)?

11

Chain Rule
By the chain rule of probability, we can always represent a joint
distribution as follows, using any ordering of the variables:

$p(x_{1:V}) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_1, x_2, x_3) \cdots p(x_V \mid x_{1:V-1})$

The problem of the number of parameters

If each variable can take K states, the successive factors require $O(K) + O(K^2) + O(K^3) + \cdots$ values, so there are $O(K^V)$ parameters in the system.

12
Conditional independence
The key to efficiently representing large joint distributions is to
make some assumptions about conditional independence (CI).

X ⊥ Y |Z ⇔ p(X, Y |Z) = p(X|Z)p(Y |Z)

X is conditionally independent of Y given Z if, once you know Z, knowing Y does not help you guess X.

13

Conditional independence: an example


Setting: picking a card at random from a traditional deck of cards

1. if the full set of colors and values is present, then color ⊥ value

2. if all diamond face cards (⧫) are discarded from the deck, then
color ⊥̸ value but still color ⊥ value | Facecard

$P(\text{King} \mid \text{Facecard}) = 1/3 = P(\clubsuit \mid \text{Facecard})$

$P(\text{King}\clubsuit \mid \text{Facecard}) = 1/9 = P(\text{King} \mid \text{Facecard})\,P(\clubsuit \mid \text{Facecard})$

14
Simplification of chain rule
Simplification of the chain rule factorization

Let us assume that $x_{t+1} \perp x_{1:t-1} \mid x_t$ (first-order Markov assumption). Then

$p(x_{1:V}) = p(x_1) \prod_{t=2}^{V} p(x_t \mid x_{t-1})$

which requires only $K - 1 + K^2$ parameters.

15

Graphical models
A graphical model (GM) is a way to represent a joint distribution by
making Conditional Independence (CI) assumptions.

the nodes in the graph represent random variables,


and the (lack of) edges represent CI assumptions.

A better name for these models would in fact be "independence diagrams".

There are several kinds of graphical model, depending on whether


the graph is directed,
undirected,
or some combination of directed and undirected.

16

Example of directed and undirected graphical


model

17
Graph terminology
A graph G = (V, E) consists of

a set of nodes or vertices, $V = \{1, \ldots, V\}$, and

a set of edges, $E = \{(s, t) : s, t \in V\}$.


Adjacency matrix

We can represent the graph by its adjacency matrix, in which we


write G(s, t) = 1 to denote (s, t) ∈ E , that is, if s → t is an
edge in the graph. If G(s, t) = 1 iff G(t, s) = 1, we say the
graph is undirected, otherwise it is directed.

We usually assume G(s, s) = 0, which means there are no self-loops.
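As a small illustration (a hypothetical three-node DAG, not one from the lecture), the adjacency matrix makes the parent and child sets easy to read off in R:

## Hypothetical 3-node DAG: 1 -> 2, 1 -> 3, 2 -> 3
G <- matrix(0, 3, 3)
G[1, 2] <- 1; G[1, 3] <- 1; G[2, 3] <- 1

pa <- function(s) which(G[, s] == 1)   # parents: nodes feeding into s
ch <- function(s) which(G[s, ] == 1)   # children: nodes fed by s

pa(3)            # 1 2
ch(1)            # 2 3
isSymmetric(G)   # FALSE, so the graph is directed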

Graph terminology
Parent: For a directed graph, the parents of a node is the set of
all nodes that feed into it: pa(s) ≜ {t : G(t, s) = 1}.
Child: For a directed graph, the children of a node is the set of
all nodes that feed out of it: ch(s) ≜ {t : G(s, t) = 1}.
Family: For a directed graph, the family of a node is the node
and its parents, f am(s) = s ∪ pa(s).
Root: For a directed graph, a root is a node with no parents.
Leaf: For a directed graph, a leaf is a node with no children.
Ancestors: For a directed graph, the ancestors are the parents,
grand-parents, etc of a node. That is, the ancestors of t is the
set of nodes that connect to t via a trail:
anc(t) ≜ {s : s ⇝ t}.

Descendants: For a directed graph, the descendants of a node are its
children, grand-children, etc. That is, the descendants of s are the set of
nodes that can be reached from s via a trail: desc(s) ≜ {t : s ⇝ t}.

Graph terminology
Clique: For an undirected graph, a clique is a set of nodes that
are all neighbors of each other.
A maximal clique is a clique which cannot be made any larger
without losing the clique property.
Neighbors: For any graph, we define the neighbors of a node as
the set of all immediately connected nodes,
nbr(s) ≜ {t : G(s, t) = 1 ∨ G(t, s) = 1}. For an undirected
graph, we write s ∼ t to indicate that s and t are neighbors.


Degree: The degree of a node is the number of neighbors. For
directed graphs, we speak of the in-degree and out-degree,
which count the number of parents and children.
Cycle or loop: For any graph, we define a cycle or loop to be a
series of nodes such that we can get back to where we started
by following edges.
DAG: A directed acyclic graph or DAG is a directed graph with no
directed cycles.

Directed graphical models


A directed graphical model or DGM is a GM whose graph is a
DAG.
These are more commonly known as Bayesian networks
These models are also called belief networks
Finally, these models are sometimes called causal networks,
because the directed arrows are sometimes interpreted as
representing causal relations.

21
Topological ordering of DAGs
nodes can be ordered such that parents come before children
it can be constructed from any DAG
The ordered Markov property

a node only depends on its immediate parents

$x_s \perp x_{\mathrm{pred}(s)\setminus \mathrm{pa}(s)} \mid x_{\mathrm{pa}(s)}$

where pa(s) are the parents of node s, and pred(s) are the
predecessors of node s in the ordering.
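As a quick illustration (the same kind of hypothetical three-node DAG; the igraph package is assumed to be available), a topological ordering can be computed directly:

library(igraph)
## Hypothetical DAG: 1 -> 2, 1 -> 3, 2 -> 3
G <- matrix(c(0, 1, 1,
              0, 0, 1,
              0, 0, 0), 3, 3, byrow = TRUE)
g <- graph_from_adjacency_matrix(G, mode = "directed")
is_dag(g)      # TRUE: an ordering with parents before children exists
topo_sort(g)   # one such topological ordering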

22

General form of factorization


V

p(x 1:V ) = ∏ p(x t |x pa(t) )

t=1

if the Conditional Independence assumptions encoded in DAG G


are correct

23
Examples

25

Naive Bayes classifiers


$p(y, \mathbf{x}) = p(y) \prod_{j} p(x_j \mid y)$

The naive Bayes assumption is rather naive, since it assumes the


features are conditionally independent.
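As a minimal sketch (hypothetical CPTs, not taken from the lecture), the naive Bayes factorization can be evaluated directly:

## Hypothetical CPTs: binary class y and two binary features x1, x2
p_y  <- c(0.6, 0.4)                          # p(y = 0), p(y = 1)
p_x1 <- rbind(c(0.9, 0.1), c(0.3, 0.7))      # rows: y = 0, 1; cols: x1 = 0, 1
p_x2 <- rbind(c(0.8, 0.2), c(0.5, 0.5))      # rows: y = 0, 1; cols: x2 = 0, 1

## joint probability under the factorization p(y) p(x1 | y) p(x2 | y)
p_joint <- function(y, x1, x2)
  p_y[y + 1] * p_x1[y + 1, x1 + 1] * p_x2[y + 1, x2 + 1]

p_joint(1, 1, 0)   # p(y = 1) p(x1 = 1 | y = 1) p(x2 = 0 | y = 1) = 0.4 * 0.7 * 0.5 = 0.14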

26
Markov and hidden Markov models
Markov chain

$p(x_{1:T}) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)\cdots = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})$

Hidden Markov Model

The hidden variables often represent quantities of interest, such as


the identity of the word that someone is currently speaking. The
observed variables are what we measure, such as the acoustic
waveform.

27

Directed Gaussian graphical models


Consider a DGM where all the variables are real-valued, and all the
conditional probability distributions have the following form:

$p(x_t \mid x_{\mathrm{pa}(t)}) = \mathcal{N}\big(x_t \mid \mu_t + \mathbf{w}_t^{\mathsf T} x_{\mathrm{pa}(t)},\ \sigma_t^2\big)$

Directed GGM (Gaussian Bayes net)

p(x) = N (x|μ, Σ)

28
Directed GGM (Gaussian Bayes net)
For convenience, let us rewrite the CPDs as

$x_t = \mu_t + \sum_{s \in \mathrm{pa}(t)} w_{ts}\,(x_s - \mu_s) + \sigma_t z_t$

where $z_t \sim \mathcal{N}(0, 1)$, $\sigma_t$ is the conditional standard deviation of
$x_t$ given its parents, $w_{ts}$ is the strength of the $s \to t$ edge, and $\mu_t$
is the local mean.

Mean

The global mean is just the concatenation of the local means,
$\mu = (\mu_1, \ldots, \mu_D)^{\mathsf T}$.
29

Directed GGM (Gaussian Bayes net)


Covariance matrix

In matrix form, $(\mathbf{x} - \mu) = W(\mathbf{x} - \mu) + S\mathbf{z}$,

where $S \triangleq \mathrm{diag}(\sigma_1, \ldots, \sigma_D)$. Let $\mathbf{e} \triangleq S\mathbf{z} = (I - W)(\mathbf{x} - \mu)$.

We have

$\Sigma = \mathrm{cov}(\mathbf{x} - \mu) = \mathrm{cov}\big((I - W)^{-1}\mathbf{e}\big) = \mathrm{cov}(U S \mathbf{z}) = U S^2 U^{\mathsf T}$

where $U = (I - W)^{-1}$.

30
Examples
Two extreme cases

Isolated vertices: naive Bayes case, where $\Sigma = S^2$ is diagonal; p vertices, no edges

Fully connected graph: p vertices, p(p − 1)/2 directed edges

Click to go to the exercise on Directed GGM

31

Learning

33
Learning from complete data (with known
graph structure)
If all the variables are fully observed in each case, so there is no
missing data and there are no hidden variables, we say the data is
complete.

$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} p(\mathbf{x}_i \mid \theta) = \prod_{i=1}^{N} \prod_{t \in V} p(x_{it} \mid \mathbf{x}_{i,\mathrm{pa}(t)}, \theta_t)$

The likelihood decomposes according to the graph structure.

Click to go to the Sprinkler exercise


Discrete distribution

$N_{tck} \triangleq \sum_{i=1}^{N} \mathbb{I}(x_{i,t} = k,\ x_{i,\mathrm{pa}(t)} = c)$

and thus $\hat{p}(x_t = k \mid x_{\mathrm{pa}(t)} = c) = \dfrac{N_{tck}}{\sum_{k'} N_{tck'}}$.

Of course, the MLE suffers from the zero-count problem.
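As a minimal sketch (simulated data for a single binary node with one binary parent; names and probabilities are hypothetical), the MLE is just a normalized contingency table:

## MLE of p(B = k | A = c) from complete data, by counting
set.seed(1)
A <- rbinom(200, 1, 0.4)                        # parent
B <- rbinom(200, 1, ifelse(A == 1, 0.8, 0.2))   # child, hypothetical CPT
N_ck  <- table(A, B)                            # counts N_tck for node t = B
theta <- prop.table(N_ck, margin = 1)           # normalize over k within each parent state c
theta   # rows close to (0.8, 0.2) for A = 0 and (0.2, 0.8) for A = 1
## rarely (or never) observed parent configurations give zero counts,
## which is the zero-count problem; adding Dirichlet pseudo-counts fixes it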

34
Conditional independence
properties of DGMs

36

Diverging edges (fork)


With the DAG

A ← C → B

we have

A ⊥̸ B

but

A ⊥ B|C

Exercise
37
Chain (Head - tail)
With the DAG

A → C → B

we have

A ⊥̸ B

but

A ⊥ B|C

Exercise
38

Converging edges (V) and collider


With the DAG

A → C ← B

we have

A ⊥ B

but

A ⊥̸ B | C

Exercise: show it
Independence map

a directed graph G is an I-map (independence map) for p, or that p


is Markov wrt G,

iff I (G) ⊆ I (p), where I (p) is the set of all CI statements that
hold for distribution p.

This allows us to use the graph as a safe proxy for p


Minimal I-map

The fully connected graph is an I-map of all distributions, so we are
usually interested in a minimal I-map: an I-map from which no edge can be
removed without it ceasing to be an I-map.


d-separation
The “d” in d-separation and d-connection stands for
dependence.
d-separation is related to the ideas of active path and active vertex
on a path
a path is active if it carries information, or dependence.
Thus, when the conditioning set is empty, only paths that
correspond to "causal connection" are active (creating
dependence).

40
d-separation: example of Pearl (1988)
two independent causes of your car refusing to start: having no
gas and having a dead battery.

dead battery –> car won’t start <– no gas

Being told that the battery is charged tells you nothing about
whether there is gas.
Being told that the battery is charged after being told that
the car won't start tells you that the gas tank must be empty.

So independent causes are made dependent by conditioning on a
common effect, which in the directed graph representing the
causal structure is the same as conditioning on a collider.

d-separation
When a vertex is in the conditioning set, its status with respect to
being active or inactive flip-flops. If we condition on C:

Are variables A and B d-separated by C (in boldface)?

1. A –> C –> B: inactive
2. A <– C <– B: inactive
3. A <– C –> B: inactive
4. A –> C <– B: C is a collider and is thus inactive when the
conditioning set is empty, so conditioning on C makes it
active (produces dependence)

42
Formal d-separation definition
an undirected path P is d-separated by a set of nodes E iff at least
one of the following conditions hold:

P contains a chain, s → m → t or s ← m ← t where


m ∈ E

P contains a fork, s ← m → t where m ∈ E

P contains a collider, s → m ← t, where m ∉ E and neither is
any descendant of m.

43

Alternative formulation of d-connection:


If G is a directed graph in which X, Y and E are disjoint sets of
vertices, then X and Y are d-connected by E in G if and only if there
exists an undirected path P between some vertex in X and some
vertex in Y such that

for every collider C on P, either C or a descendant of C is in E
(active path),
and no non-collider on P is in E (no inactive path).

X and Y are d-separated by E in G if and only if they are not d-
connected by E in G (all paths are inactive).
Independence requires all possible paths to be inactive, whereas
dependence requires only one leak (one active path).

see https://www.youtube.com/watch?v=yDs_q6jKHb0 for examples

44

d-separation versus conditional independence


a set of nodes A is d-separated from a different set of nodes B
given a third observed set E iff each undirected path from every
node a ∈ A to every node b ∈ B is d-separated by E:

x A ⊥ G x B |x E ⇔ A is d-separated from B given E
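As a minimal sketch (assuming the bnlearn package, already used in the exercises below, and its dsep() function), the collider A → C ← B can be checked directly:

library(bnlearn)
nodes <- c("A", "B", "C")
g   <- empty.graph(nodes)
adj <- matrix(0L, 3, 3, dimnames = list(nodes, nodes))
adj["A", "C"] <- 1L   # A -> C
adj["B", "C"] <- 1L   # B -> C, so C is a collider
amat(g) <- adj

dsep(g, "A", "B")        # TRUE: A and B are marginally d-separated
dsep(g, "A", "B", "C")   # FALSE: conditioning on the collider opens the path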

45
Consequences of d-separation

Directed local Markov property

From the d-separation criterion, one can conclude that


t ⊥ nd(t)∖pa(t)|pa(t) where the non-descendants of a node

nd(t) are all the nodes except for its descendants

46

Consequences of d-separation
Ordered Markov property

A special case of directed local Markov property is when we only


look at predecessors of a node according to some topological
ordering. We have t ⊥ pred(t)∖pa(t)|pa(t)

47
Markov blanket
The set of nodes that renders a node t conditionally independent of
all the other nodes in the graph is called t’s Markov blanket

mb(t) ≜ pa(t) ∪ ch(t) ∪ copa(t)

The Markov blanket of a node in a DGM is equal to the parents, the


children, and the co-parents.

48

Markov blanket
To understand the Markov blanket, one can start from the local
Markov property, which blocks the dependence on the non-descendants
by conditioning on the parents.

To further block the paths through the descendants of t, one has to

condition on the children of t.
But conditioning on the children opens the paths to the
co-parents.
Thus one also needs to condition on the co-parents to block all
paths (see the sketch below).
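As a minimal sketch (a hypothetical four-node DAG; bnlearn's mb() is assumed to return the Markov blanket of a node in a network structure):

library(bnlearn)
nodes <- c("A", "B", "C", "D")
g   <- empty.graph(nodes)
adj <- matrix(0L, 4, 4, dimnames = list(nodes, nodes))
adj["A", "C"] <- 1L   # A -> C
adj["B", "C"] <- 1L   # B -> C: B is a co-parent of A
adj["C", "D"] <- 1L   # C -> D
amat(g) <- adj

mb(g, "A")   # expected "B" "C": the child C and the co-parent B (A has no parents)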

49
Graphical Model Learning
Structure (chapter 26
MLAPP)

51

Introduction
Two main applications of structure learning:

1. knowledge discovery (requires a graph topology)


2. density estimation (requires a fully specified model).
Main obstacle

the number of possible graphs is exponential in the number of
nodes: a simple upper bound is $O(2^{V(V-1)/2})$.

52
Relevance network
A relevance network is a way of visualizing the pairwise mutual
information between multiple random variables:

we simply choose a threshold α


draw an edge from node i to node j if I(X i ; X j ) > α

Major problem

the graphs are usually very dense,


most variables are dependent on most other variables, even
after thresholding the MIs.
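As a minimal sketch (simulated Gaussian data and a hypothetical threshold; the mutual information is computed with the Gaussian formula given on the next slide):

set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- x1 + rnorm(n); x3 <- x2 + rnorm(n); x4 <- rnorm(n)
X  <- cbind(x1, x2, x3, x4)

rho <- cor(X)
MI  <- -0.5 * log(1 - rho^2)   # Gaussian mutual information
diag(MI) <- 0
alpha <- 0.1                   # hypothetical threshold
A <- (MI > alpha) * 1          # adjacency matrix of the relevance network
A

Even though x1 and x3 are only indirectly linked through x2, their mutual information also exceeds the threshold, which illustrates why relevance networks tend to be dense.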

53

Gaussian case
In the Gaussian case, $I(X_i; X_j) = -\frac{1}{2}\log(1 - \rho_{ij}^2)$, where
$\rho_{ij}$ is the correlation coefficient, so we are essentially visualizing $\Sigma$;

this is known as the covariance graph.


Exercise: Gaussian mutual information

Show the previous statement

54
Dependency networks

55

Learning tree structures


Since the problem of structure learning for general graphs is NP-
hard (Chickering 1996), we start by considering the special case of
trees. Trees are special because we can learn their structure
efficiently

56
Joint Distribution associated to a directed tree
A directed tree, with a single root node r, defines a joint distribution
as follows

$p(x \mid T) = \prod_{t \in V} p(x_t \mid x_{\mathrm{pa}(t)})$

The distribution is a product over the edges and the choice of root
does not matter
Symmetrization

To make the model more symmetric, it is preferable to use an
undirected tree:

$p(x \mid T) = \prod_{t \in V} p(x_t) \prod_{(s,t) \in E} \frac{p(x_s, x_t)}{p(x_s)\,p(x_t)}$

57
Chow-Liu algorithm for finding the ML tree
structure (1968)
Goal: the Chow–Liu algorithm constructs the tree distribution
approximation that has the minimum Kullback–Leibler divergence
to the actual distribution (in practice, the empirical distribution, which
maximizes the data likelihood).
Principle (see the sketch after this list)

1. Compute the weight I(s, t) of each (possible) edge (s, t)
2. Find a maximum weight spanning tree (MST)
3. Give directions to the edges of the MST by choosing a root node
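A minimal sketch of steps 1 and 2 (simulated Gaussian data; the igraph package is assumed, and the maximum weight spanning tree is obtained by running mst() on negated weights):

library(igraph)
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- x1 + rnorm(n); x3 <- x2 + rnorm(n)
X  <- cbind(x1, x2, x3)

## 1. edge weights: pairwise (Gaussian) mutual information
MI <- -0.5 * log(1 - cor(X)^2)
diag(MI) <- 0

## 2. maximum weight spanning tree = minimum spanning tree on negated weights
g    <- graph_from_adjacency_matrix(MI, mode = "undirected", weighted = TRUE)
tree <- mst(g, weights = -E(g)$weight)
as_edgelist(tree)   # expected: x1 -- x2 and x2 -- x3

## 3. orient the edges away from an arbitrarily chosen root to obtain a DAG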

58

Chow-Liu algorithm for finding the ML tree


structure (1968)
log-likelihood

$\log P(\mathcal{D} \mid \theta, T) = \sum_{t}\sum_{k} N_{tk} \log p(x_t = k) + \sum_{(s,t) \in E}\sum_{j,k} N_{stjk} \log \frac{p(x_s = j, x_t = k)}{p(x_s = j)\,p(x_t = k)}$

thus $\hat{p}(x_t = k) = \dfrac{N_{tk}}{N}$ and $\hat{p}(x_s = j, x_t = k) = \dfrac{N_{stjk}}{N}$.

Mutual information of a pair of variables

$I(s, t) = \sum_{j,k} \hat{p}(x_s = j, x_t = k) \log \frac{\hat{p}(x_s = j, x_t = k)}{\hat{p}(x_s = j)\,\hat{p}(x_t = k)}$

The Kullback–Leibler divergence

Plugging the MLEs into the normalized log-likelihood gives

$\frac{\log P(\mathcal{D} \mid \hat{\theta}_{ML}, T)}{N} = \sum_{t}\sum_{k} \hat{p}(x_t = k) \log \hat{p}(x_t = k) + \sum_{(s,t) \in E} I(s, t)$

so maximizing the likelihood over tree structures (equivalently, minimizing the KL divergence to the empirical distribution) amounts to maximizing the sum of the mutual informations of the edges.

59

Chow-Liu algorithm
There are several algorithms for finding a max spanning tree
(MST). The two best known are

Prim's algorithm
Kruskal's algorithm

Both can be implemented to run in $O(E \log V)$ time, where
$E = V^2$ is the number of edges and V is the number of nodes.

60
Exercise: Gaussian Chow-Liu
1. Show that in the Gaussian case, $I(s, t) = -\frac{1}{2}\log(1 - \rho_{st}^2)$,
where $\rho_{st}$ is the correlation coefficient (see Exercise 2.13,
Murphy)
2. Given a realization of n Gaussian vectors of size p, find the ML
tree-structured covariance matrix using the Chow-Liu algorithm.

61

TAN: Tree-Augmented Naive Bayes


Naive Bayes with Chow-Liu
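As a minimal sketch (assuming bnlearn's tree.bayes() implements TAN and that its built-in discrete data set learning.test is available; the choice of "A" as the class node is arbitrary):

library(bnlearn)
data(learning.test)                                # built-in discrete data set
tan <- tree.bayes(learning.test, training = "A")   # naive Bayes on class "A" + Chow-Liu tree over the features
arcs(tan)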

62
Mixtures of trees
A single tree is rather limited in its expressive power.
Learning a mixture of trees (Meila and Jordan 2000), where
each mixture component may have a different tree topology, is
an alternative.
Another option is integrating out over all possible trees;
this can be done in $O(V^3)$ time using the matrix tree theorem.

63

Learning DAG structures

Three DAGs. G1 and G3 are Markov equivalent, G2 is not.


Graphs are Markov equivalent

if they encode the same set of CI assumptions


64
Learning DAG structures
An ill posed problem

when we learn the DAG structure from data, we will not be able to
uniquely identify all of the edge directions

we can learn DAG structure “up to Markov equivalence”.

Do not read too much into the meaning of particular edge


orientations, since we can often change them without changing
the model in any observable way.

65

Exact structural inference


Exact structural inference is based on the computation of exact
posterior over graphs, p(G|D).

It requires:

the computation of the likelihood p(D|G)


the computation of the prior p(G)

This allows us to compare different graphs in terms of their
posterior probability, and possibly to find the MAP graph if the search space is small.

66
Exact structural inference (categorical case)
Let $x_{it} \in \{1, \ldots, K_t\}$ be the value of node t in case i,
where $K_t$ is the number of states for node t.

$\theta_{tck} \triangleq p(x_t = k \mid x_{\mathrm{pa}(t)} = c)$, for $k = 1:K_t$ and $c = 1:C_t$,
where $C_t$ is the number of parent combinations (possible conditioning cases).

Let $d_t = \dim(\mathrm{pa}(t))$ be the degree or fan-in of node t, so that
$C_t = K^{d_t}$.
67

Exact structural inference (categorical case)


Prior

$p(\theta) = \prod_{t=1}^{V} p(\theta_t) = \prod_{t=1}^{V} \prod_{c=1}^{C_t} p(\theta_{tc})$

where $C_t$ is the number of parent combinations (possible conditioning cases).

Likelihood

$p(\mathcal{D} \mid G, \theta) = \prod_{t=1}^{V} \prod_{c=1}^{C_t} \prod_{k=1}^{K_t} \theta_{tck}^{N_{tck}}$

where $N_{tck}$ is the number of times node t is in state k while its parents
are in configuration c.

68

Exact structural inference (categorical case)


Choosing a Dirichlet prior $p(\theta_{tc}) = \mathrm{Dir}(\theta_{tc} \mid \alpha_{tc})$ allows us to
compute the marginal likelihood $p(\mathcal{D} \mid G)$ in closed form:

$p(\mathcal{D} \mid G) = \prod_{t=1}^{V} \prod_{c=1}^{C_t} \frac{B(N_{tc} + \alpha_{tc})}{B(\alpha_{tc})}$

where $B(\cdot)$ is the multivariate Beta function applied to the count vectors $N_{tc} = (N_{tck})_k$ and $\alpha_{tc} = (\alpha_{tck})_k$.

Local scoring

For node t and its parents,

$\mathrm{score}(N_{t,\mathrm{pa}(t)}) \triangleq \prod_{c=1}^{C_t} \frac{B(N_{tc} + \alpha_{tc})}{B(\alpha_{tc})}$

The marginal likelihood factorizes according to the graph structure.
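As a minimal sketch (hypothetical data for a single binary node with one binary parent, and a symmetric Dirichlet prior), the local log-score can be computed with lgamma():

## log of the multivariate Beta function: log B(a) = sum(lgamma(a)) - lgamma(sum(a))
log_mbeta <- function(a) sum(lgamma(a)) - lgamma(sum(a))

## local log-score of node t (values x_t) given its parent values x_pa
local_log_score <- function(x_t, x_pa, alpha = 1) {
  lev <- sort(unique(x_t))
  K   <- length(lev)
  sum(sapply(split(x_t, x_pa), function(xc) {                # loop over parent configurations c
    N_tc <- tabulate(factor(xc, levels = lev), nbins = K)    # counts N_tck for this configuration
    log_mbeta(N_tc + alpha) - log_mbeta(rep(alpha, K))
  }))
}

set.seed(1)
pa <- rbinom(100, 1, 0.5)
xt <- rbinom(100, 1, ifelse(pa == 1, 0.9, 0.2))
local_log_score(xt, pa)   # log of the local marginal-likelihood factor for this node

bnlearn also provides a score() function with a BDe-type option that computes this for a whole network, though the exact argument names should be checked in its documentation.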

69

Setting the prior


How should we set the hyper-parameters α tck ?

Jeffreys prior of the form α tck = 1/2 violates a property called


likelihood equivalence
This property says that if G1 and G2 are Markov equivalent,
they should have the same marginal likelihood
BDe prior

Geiger and Heckerman (1997) proved that, for complete


graphs, the only prior that satisfies likelihood equivalence and
parameter independence is the Dirichlet prior, where the pseudo
counts have the form
$\alpha_{tck} = \alpha\, p_0(x_t = k, x_{\mathrm{pa}(t)} = c)$

where α > 0 is called the equivalent sample size, and p 0 is some


prior joint probability distribution. This is called the BDe prior
(Bayesian Dirichlet likelihood equivalent).

70

Example of Exact structural inference


(Neapolitan 2003, p.438)

71
Scaling up to larger graphs
The main challenge in computing the posterior over DAGs is that
there are so many possible graphs.

Consequently, we must settle for finding a locally optimal MAP


DAG.
Popular solution: Greedy hill climbing

72

Learning causal DAGs


Causal models

predict the effects of interventions to, or manipulations of, a


system.
Causal claims are inherently stronger, yet more useful, than
purely associative claims
Causal interpretation of DAGs

We interpret A → B in a DAG to mean that "A directly causes B", so if we
manipulate A, then B will change.
Known as the causal Markov assumption.

73
Intervention
Perfect intervention

represents the act of setting a variable to some known value


A real world example of such a perfect intervention is a gene
knockout experiment
do calculus notation

We write $do(X_i = x_i)$ to denote the event that we set $X_i$ to $x_i$.

A causal model makes inferences of the form
$p(x \mid do(X_i = x_i))$,

which is different from making inferences of the form $p(x \mid X_i = x_i)$.

74

Observing versus doing


Consider a 2 node DGM S → Y

S = 1 if you smoke
S = 0 otherwise,
Y = 1 if you have yellow-stained fingers
Y = 0 otherwise.

If I observe you have yellow fingers, I am licensed to infer that you


are probably a smoker (since nicotine causes yellow stains):

p(S = 1|Y = 1) > p(S = 1)


If I intervene and paint your fingers yellow, I am no longer licensed
to infer this, since I have disrupted the normal causal mechanism.
Thus

p(S = 1|do(Y = 1)) = p(S = 1)

75

Graph surgery

One way to model perfect interventions is to use graph surgery:

represent the joint distribution by a DGM,
cut the arcs coming into any nodes that were set by intervention.
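As a minimal sketch on the two-node example S → Y from the previous slide (bnlearn is assumed, as in the exercises below):

library(bnlearn)
nodes <- c("S", "Y")
g   <- empty.graph(nodes)
adj <- matrix(0L, 2, 2, dimnames = list(nodes, nodes))
adj["S", "Y"] <- 1L          # smoking -> yellow fingers
amat(g) <- adj

## do(Y = 1): cut every arc coming into the intervened node Y
adj_do <- adj
adj_do[, "Y"] <- 0L
g_do <- empty.graph(nodes)
amat(g_do) <- adj_do

arcs(g)      # S -> Y
arcs(g_do)   # no arcs left: under do(Y = 1), S carries no information about Y

bnlearn also appears to provide a mutilated() function for this operation on fitted networks; that is worth checking before relying on it.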
76
Exercises on Directed
Graphical Models

78

Exercise: Gaussian Bayesian Network

Data

Let us consider the following graph $x_1 \to x_2 \to x_3$ where

$E[x_1] = b_1$, $E[x_2] = b_2$, $E[x_3] = b_3$
$x_1 = b_1 + z_1$
$x_2 = b_2 + (x_1 - b_1) + z_2$
$x_3 = b_3 + \frac{1}{2}(x_2 - b_2) + z_3$
$\sigma_1 = \sigma_2 = \sigma_3 = 1$

Problem

79
Exercise: Directed GGM

$\mu = (0, 1, 2)^{\mathsf T}$

$\mathrm{diag}(S) = (1, 1, 1)$

$W = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}$

80

Exercise: Directed GGM

We can observe that the precision matrix has the same support as W.

n=1000
mu=c(0,1,2)
sigma=c(1,1,1)
W=matrix(c(0,1,0,0,0,1/2,0,0,0),3,3)   # strictly lower-triangular weight matrix
U=solve(diag(rep(1,3))-W)              # U = (I - W)^{-1}
S=diag(sigma)
Sigma=U%*%S^2%*%t(U)                   # Sigma = U S^2 U^T
solve(Sigma)                           # precision matrix
[,1] [,2] [,3]
[1,] 2 -1.00 0.0
[2,] -1 1.25 -0.5
[3,] 0 -0.50 1.0

81
Exercise: Directed GGM
First solution (direct)

library(mvtnorm)
Xprime=rmvnorm(n,mean=c(0,1,2),sigma=Sigma)   # sample directly from N(mu, Sigma)

Second solution (constructive)

X=matrix(0,n,3)
Z=matrix(rnorm(n*3),n,3)
for (i in 1:n)
  for (j in 1:3)
    X[i,j]=mu[j]+sigma[j]*Z[i,j] + sum(W[j,]*(X[i,]-mu))   # ancestral sampling in topological order
Click to go Back to Lecture

82

Sprinkler Exercise
Let us define the structure of the network

library(bnlearn)
library(visNetwork)
variables<-c("Nuageux","Arrosage","Pluie","HerbeMouillee")
net<-empty.graph(variables)
adj = matrix(0L, ncol = 4, nrow = 4, dimnames=list(variables, variables))
adj["Nuageux","Arrosage"]<-1
adj["Nuageux","Pluie"]<-1
adj["Arrosage","HerbeMouillee"]<-1
adj["Pluie","HerbeMouillee"]<-1
amat(net)=adj

83
Sprinkler Exercise

#plot.network(net) # for a nice html plot
plot(net)

84

Sprinkler Exercise
Simulate a sample according to the model

85
Basic simulation using conditional
probability tables
Function for one event (one line of the data frame); the rain probability
given "not cloudy" was truncated in the source and is assumed here to be 0.2.

NAPHM1<-function(i){
  # draw one observation (Nuageux, Arrosage, Pluie, HerbeMouillee) from the CPTs
  N<-rbinom(1,size = 1,prob = 1/2)
  if (N==1) {A<-rbinom(1,size = 1,prob = 0.1)} else {A<-rbinom(1,size = 1,prob = 0.5)}
  if (N==1) {P<-rbinom(1,size = 1,prob = 0.8)} else {P<-rbinom(1,size = 1,prob = 0.2)}  # 0.2 assumed
  if (A+P==0) {
    HM<-rbinom(1,size = 1,prob = 0.1)
  } else if (A+P==1) {
    HM<-rbinom(1,size = 1,prob = 0.9)
  } else {
    HM<-rbinom(1,size = 1,prob = 0.99)
  }
  as.logical(c(N,A,P,HM))
}

86

Basic simulation using conditional
probability tables

n<-1000
X<-data.frame(t(sapply(1:n,NAPHM1)))
names(X)<-c("Nuageux","Arrosage","Pluie","HerbeMouillee")
head(X)
Nuageux Arrosage Pluie HerbeMouillee
1 TRUE FALSE TRUE TRUE
2 TRUE FALSE TRUE TRUE
3 TRUE FALSE TRUE TRUE
4 FALSE TRUE FALSE TRUE
5 FALSE TRUE FALSE FALSE
6 TRUE TRUE TRUE TRUE

87
Learning the parameters

mean(X$Nuageux) -> pNuageux                                 # p(Nuageux)
lapply(sousTableauxNuageux<-split(X,X$Nuageux),             # p(Arrosage | Nuageux)
       function(XsousTableau){mean(XsousTableau$Arrosage)})
lapply(sousTableauxNuageux<-split(X,X$Nuageux),             # p(Pluie | Nuageux)
       function(XsousTableau){mean(XsousTableau$Pluie)})
lapply(sousTableauxNuageux<-split(X,X$Arrosage + X$Pluie),  # p(HerbeMouillee | Arrosage + Pluie)
       function(XsousTableau){mean(XsousTableau$HerbeMouillee)})

Back to lecture

88

Exercises: directed Graphical Models


Joint distribution and graphical decomposition (Bishop 8.3)

The joint distribution over three binary variables

89
Exercises: directed Graphical Models
Bishop 8.3

Consider three binary variables a, b, c ∈ {0, 1} having the joint
distribution given in the table above. Show by direct evaluation that
this distribution has the property that a and b are marginally
dependent, so that p(a, b) ≠ p(a)p(b), but that they become
independent when conditioned on c, so that
p(a, b ∣ c) = p(a ∣ c)p(b ∣ c) for both c = 0 and c = 1.

90

Exercises: directed Graphical Models


Bishop 8.4

Show by direct evaluation that p(a, b, c) = p(a)p(c ∣ a)p(b ∣ c) .


Draw the corresponding directed graph.

91
Local Markov Property
directed local Markov property

t ⊥ nd(t)∖pa(t)|pa(t) where the non-descendants of a node


nd(t) are all the nodes except for its descendants

With the topological ordering we have

$p(x_t \mid x_1, \cdots, x_{t-1}) = p(x_t \mid x_{\mathrm{nd}(t)}) = p(x_t \mid x_{\mathrm{pa}(t)})$

Thus

$p(x_t, x_{\mathrm{nd}(t)\setminus\mathrm{pa}(t)} \mid x_{\mathrm{pa}(t)}) = p(x_{\mathrm{nd}(t)\setminus\mathrm{pa}(t)} \mid x_{\mathrm{pa}(t)})\; p(x_t \mid x_{\mathrm{pa}(t)}, x_{\mathrm{nd}(t)\setminus\mathrm{pa}(t)})$

$= p(x_{\mathrm{nd}(t)\setminus\mathrm{pa}(t)} \mid x_{\mathrm{pa}(t)})\; p(x_t \mid x_{\mathrm{pa}(t)})$

92

Gaussian mutual information

$I(s, t) = E\left[\log \frac{p(x_s, x_t)}{p(x_s)\,p(x_t)}\right]$

$= -\frac{1}{2}\log \frac{|\Sigma|}{|\mathrm{diag}(\sigma_1^2, \sigma_2^2)|} - \frac{1}{2} E\left[\mathbf{z}^{\mathsf T}\Sigma^{-1}\mathbf{z} - \mathbf{z}^{\mathsf T}\begin{pmatrix}1/\sigma_1^2 & 0\\ 0 & 1/\sigma_2^2\end{pmatrix}\mathbf{z}\right]$

$= -\frac{1}{2}\log(1 - \rho^2) - \frac{1}{2}\,\mathrm{trace}\left(E[\mathbf{z}\mathbf{z}^{\mathsf T}]\left(\Sigma^{-1} - \begin{pmatrix}1/\sigma_1^2 & 0\\ 0 & 1/\sigma_2^2\end{pmatrix}\right)\right)$

$= -\frac{1}{2}\log(1 - \rho^2) - \frac{1}{2}\,\mathrm{trace}\left(I - \begin{pmatrix}1 & \sigma_{12}/\sigma_2^2\\ \sigma_{12}/\sigma_1^2 & 1\end{pmatrix}\right)$

$= -\frac{1}{2}\log(1 - \rho^2)$

where $\mathbf{z} = \begin{pmatrix}x_s\\ x_t\end{pmatrix} - \begin{pmatrix}\mu_s\\ \mu_t\end{pmatrix}$ and $E[\mathbf{z}\mathbf{z}^{\mathsf T}] = \Sigma$; note that
$|\Sigma| = \sigma_1^2\sigma_2^2(1-\rho^2)$ and that the trace in the second-to-last line equals $2 - 2 = 0$.

93

KL-divergence
Maximizing log-likelihood is equivalent to minimizing KL-
divergence

94
Projects

96

List 2023
Explain a concept and illustrate with an example:
10 minutes OBS recording
Commented Code Notebook (not a full report).

1. Simulation of images using a Strauss model (Markov Random
Field). You may use the paper "Markov Random Field Texture
Models". Code for the simulation; bonus: estimation of the
parameters.
2. Implementation of the Graphical Lasso. You may use the paper
"Sparse inverse covariance estimation with the graphical lasso".
Original code of the algorithm, illustrated with the Sachs data.
3. Program your own Restricted Boltzmann Machine for prediction.
You may use the paper "A Practical Guide to Training Restricted
Boltzmann Machines". Original code of the algorithm, with an
illustration on the MNIST dataset.
4. Structural equation models (SEM) using the NoTears approach.
You may use the paper "DAGs with NO TEARS: Continuous
Optimization for Structure Learning". Use the code from
https://github.com/xunzheng/notears and illustrate with one
