Directed Graphical Models
Christophe Ambroise
[email protected]
UEVE, UMR CNRS 8071
November 7, 2023
Introduction
Practical matters
Reference document
a code (1/3)
a presentation (15 minutes) (1/3)
a report (1/3)
What is a graphical model ?
A graphical model is a probability distribution in a factorized form
What problems does it raise ?
learning the parameters of a given factorized form
learning the structure of the graphical model (factorized form)
Joint distribution
Observation
Chain Rule
By the chain rule of probability, we can always represent a joint
distribution as follows, using any ordering of the variables:

p(x_{1:V}) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_V | x_{1:V−1})

Representing each conditional as a table requires O(K) + O(K^2) + O(K^3) + … values: there are O(K^V) parameters
in the system.
Conditional independence
The key to efficiently representing large joint distributions is to
make some assumptions about conditional independence (CI).
2. if all diamond face cards (⧫) are discarded from the deck, then
color ⊥̸ value, but still color ⊥ value | Facecard
Simplification of chain rule
Simplification of the chain rule factorization: under a first-order Markov assumption,

p(x_{1:V}) = p(x_1) ∏_{t=2}^{V} p(x_t | x_{t−1}),

which needs only K − 1 + K^2 parameters (assuming a single shared transition table).
Graphical models
A graphical model (GM) is a way to represent a joint distribution by
making Conditional Independence (CI) assumptions.
Graph terminology
A graph G = (V, E) consists of a set of nodes (vertices) V = {1, …, V}
and a set of edges E = {(s, t) : s, t ∈ V}.
Graph terminology
Parent: For a directed graph, the parents of a node are the set of
all nodes that feed into it: pa(s) ≜ {t : G(t, s) = 1}.
Child: For a directed graph, the children of a node are the set of
all nodes that feed out of it: ch(s) ≜ {t : G(s, t) = 1}.
Family: For a directed graph, the family of a node is the node
and its parents, fam(s) = {s} ∪ pa(s).
Root: For a directed graph, a root is a node with no parents.
Leaf: For a directed graph, a leaf is a node with no children.
Ancestors: For a directed graph, the ancestors are the parents,
grandparents, etc. of a node. That is, the ancestors of t are the
set of nodes that connect to t via a trail:
anc(t) ≜ {s : s ⇝ t}.
Graph terminology
Clique: For an undirected graph, a clique is a set of nodes that
are all neighbors of each other.
A maximal clique is a clique which cannot be made any larger
without losing the clique property.
Neighbors: For any graph, we define the neighbors of a node as
the set of all immediately connected nodes,
nbr(s) ≜ {t : G(s, t) = 1 ∨ G(t, s) = 1}. For an undirected
graph, we write s ∼ t to indicate that s and t are neighbors.
Topological ordering of DAGs
nodes can be ordered such that parents come before children
it can be constructed from any DAG
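In R, a topological ordering can be obtained directly; a minimal sketch assuming the bnlearn package (the DAG string is an arbitrary illustrative example):

library(bnlearn)

# a small illustrative DAG: A -> B, A -> C, B -> D, C -> D
dag <- model2network("[A][B|A][C|A][D|B:C]")
node.ordering(dag)   # nodes listed with parents before children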
The ordered Markov property
x_s ⊥ x_{pred(s)∖pa(s)} | x_{pa(s)}
where pa(s) are the parents of node s, and pred(s) are the
predecessors of node s in the ordering.
Hence a DAG defines the joint distribution

p(x_{1:V} | G) = ∏_{t=1}^{V} p(x_t | x_{pa(t)})
Examples
Markov and hidden Markov models
Markov chain
p(x_{1:T}) = p(x_1) ∏_{t=2}^{T} p(x_t | x_{t−1})
p(x_t | x_{pa(t)}) = N(x_t | μ_t + w_t^T x_{pa(t)}, σ_t^2)

p(x) = N(x | μ, Σ)
Directed GGM (Gaussian Bayes net)
For convenience, let us rewrite the CPDs as

x_t = μ_t + ∑_{s ∈ pa(t)} w_{ts} (x_s − μ_s) + σ_t z_t

where z_t ∼ N(0, 1) and μ = (μ_1, …, μ_D)^T.
In matrix form, (x − μ) = W (x − μ) + S z, with S ≜ diag(σ_1, …, σ_D).
We have

Σ = cov(x − μ) = cov((I − W)^{−1} S z) = cov(U S z) = U S^2 U^T

where U = (I − W)^{−1}.
Examples
Two extreme cases
Learning
Learning from complete data (with known
graph structure)
If all the variables are fully observed in each case, so there is no
missing data and there are no hidden variables, we say the data is
complete.
The counts are N_tck ≜ ∑_{i=1}^{N} I(x_{it} = k, x_{i,pa(t)} = c), and thus the MLE of the CPT entries is

θ̂_tck = p̂(x_t = k | x_{pa(t)} = c) = N_tck / ∑_{k′} N_tck′

Of course, the MLE suffers from the zero-count problem when N is small.
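In practice these count-and-normalize estimates are what bnlearn's bn.fit returns for complete discrete data; a minimal sketch using the package's built-in learning.test data and an illustrative DAG over its variables (both are assumptions, not part of the lecture):

library(bnlearn)

data(learning.test)                                     # complete discrete data shipped with bnlearn
dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")    # illustrative structure over its variables
fit <- bn.fit(dag, learning.test, method = "mle")       # CPTs = normalized counts N_tck
fit$D                                                   # conditional probability table of node D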
Conditional independence
properties of DGMs
Fork (Tail - tail)
With the DAG
A ← C → B
we have
A ⊥̸ B
but
A ⊥ B | C
Exercise
Chain (Head - tail)
With the DAG
A → C → B
we have
A ⊥̸ B
but
A ⊥ B | C
Exercise
Collider (Head - head)
With the DAG
A → C ← B
we have
A ⊥ B
but
A ⊥̸ B | C
Exercise
Show it
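A quick base-R simulation can illustrate the collider (explaining-away) effect; the binary variables and the OR gate below are an arbitrary illustrative choice:

set.seed(1)
n <- 1e5
A <- rbinom(n, 1, 0.5)            # two marginally independent causes
B <- rbinom(n, 1, 0.5)
C <- as.integer(A | B)            # collider: common effect of A and B

cor(A, B)                         # close to 0: A and B are independent
cor(A[C == 1], B[C == 1])         # clearly negative: conditioning on C couples A and B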
Independence map
We say that G is an independence map (I-map) for p, or that p is Markov
with respect to G, iff I(G) ⊆ I(p), where I(p) is the set of all CI statements that
hold for distribution p.
d-separation: example of Pearl (1988)
Consider two independent causes of your car refusing to start: having no
gas and having a dead battery.
Telling you that the battery is charged tells you nothing about
whether there is gas;
telling you that the battery is charged after I have told you that
the car won’t start tells you that the gas tank must be empty (explaining away).
d-separation
When a vertex is in the conditioning set, its status with respect to
being active or inactive flip-flops: if we condition on C, a chain or a fork
through C becomes blocked, while a collider at C becomes active.
Formal d-separation definition
An undirected path P is d-separated by a set of nodes E iff at least
one of the following conditions holds:
1. P contains a chain, s → m → t or s ← m ← t, where m ∈ E;
2. P contains a fork, s ← m → t, where m ∈ E;
3. P contains a collider, s → m ← t, where m ∉ E and no descendant of m is in E.

We then say that X is d-separated from Y given E iff every undirected path
from every node in X to every node in Y is d-separated by E.
Consequences of d-separation
Consequences of d-separation
Ordered Markov property
Markov blanket
The set of nodes that renders a node t conditionally independent of
all the other nodes in the graph is called t’s Markov blanket:
mb(t) ≜ ch(t) ∪ pa(t) ∪ copa(t), the children, parents and co-parents of t.
Markov blanket
To understand the Markov blanket, one can start from the directed local
Markov property, which blocks the dependence on non-descendants
by conditioning on the parents.
Graphical Model Structure Learning (chapter 26, MLAPP)
Introduction
Two main applications of structure learning: knowledge discovery
(the graph as an interpretable model of the domain) and density
estimation (the graph as the backbone of a joint distribution used for prediction).
Relevance network
A relevance network is a way of visualizing the pairwise mutual
information between multiple random variables:
Major problem: the resulting graphs are usually very dense, since most
variables are dependent on most other variables, even after thresholding
the mutual information.
Gaussian case
In the Gaussian case, I(X_i; X_j) = −(1/2) log(1 − ρ_ij^2), where
ρ_ij is the correlation coefficient, so we are essentially visualizing Σ.
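A minimal sketch of building a Gaussian relevance network in R (the toy data and the threshold are assumptions); note how dense the resulting adjacency typically is:

set.seed(2)
n <- 500; p <- 6
X <- matrix(rnorm(n * p), n, p) %*% matrix(rnorm(p * p), p, p)  # correlated toy data
R <- cor(X)
MI <- -0.5 * log(1 - R^2)        # pairwise Gaussian mutual information
diag(MI) <- 0
MI > 0.05                        # adjacency of the relevance network: typically very dense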
Dependency networks
Joint distribution associated to a directed tree
A directed tree, with a single root node r, defines a joint distribution
as follows:

p(x | T) = ∏_{t ∈ V} p(x_t | x_{pa(t)}), with pa(r) = ∅ for the root.

The distribution is a product over the edges, and the choice of root
does not matter.
Symmetrization

p(x | T) = ∏_{t ∈ V} p(x_t) ∏_{(s,t) ∈ E} p(x_s, x_t) / ( p(x_s) p(x_t) )
Chow-Liu algorithm for finding the ML tree structure (1968)
Goal: the Chow-Liu algorithm constructs the tree-distribution
approximation that has the minimum Kullback–Leibler divergence
to the actual distribution (equivalently, that maximizes the data likelihood).
Principle
Let N_tk ≜ ∑_i I(x_it = k) and N_stjk ≜ ∑_i I(x_is = j, x_it = k); thus

p̂(x_t = k) = N_tk / N  and  p̂(x_s = j, x_t = k) = N_stjk / N.

Mutual information of a pair of variables:

Î(s, t) = ∑_{j,k} p̂(x_s = j, x_t = k) log [ p̂(x_s = j, x_t = k) / ( p̂(x_s = j) p̂(x_t = k) ) ]

The Kullback–Leibler divergence (normalized log-likelihood):

(1/N) log P(D | θ̂_ML, T) = ∑_t ∑_k p̂(x_t = k) log p̂(x_t = k) + ∑_{(s,t) ∈ E(T)} Î(s, t)
Chow-Liu algorithm
There are several algorithms for finding a maximum spanning tree
(MST). The two best known are Prim’s algorithm and Kruskal’s algorithm.
Exercise: Gaussian Chow-Liu
1. Show that in the Gaussian case, I(s, t) = −(1/2) log(1 − ρ_st^2),
where ρ_st is the correlation coefficient (see Exercise 2.13, Murphy).
2. Given a sample of n Gaussian vectors of dimension p, find the ML
tree-structured covariance matrix using the Chow-Liu algorithm
(a sketch is given below).
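A minimal sketch for item 2, assuming the igraph package (the simulated data stands in for the observed sample):

library(igraph)

set.seed(3)
n <- 1000; p <- 5
X <- matrix(rnorm(n * p), n, p)                 # replace with the observed Gaussian sample
R <- cor(X)
MI <- -0.5 * log(1 - R^2)                       # Gaussian mutual information I(s, t)
diag(MI) <- 0

g <- graph_from_adjacency_matrix(MI, mode = "undirected", weighted = TRUE, diag = FALSE)
tree <- mst(g, weights = -E(g)$weight)          # maximum-weight spanning tree via negated weights
as_edgelist(tree)                               # edges of the Chow-Liu tree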
Mixtures of trees
A single tree is rather limited in its expressive power.
learning a mixture of trees (Meila and Jordan 2000), where
each mixture component may have a different tree topology, is
an alternative;
integrating out over all possible trees is another.
When we learn the DAG structure from data, we will not be able to
uniquely identify all of the edge directions.
It requires:
Exact structural inference (categorical case)
Let x_it ∈ {1, ⋯, K_t} be the value of node t in case i,
where K_t is the number of states of node t.
The likelihood factorizes as

p(D | G, θ) = ∏_{t=1}^{V} ∏_{c=1}^{C_t} ∏_{k=1}^{K_t} θ_tck^{N_tck}

where C_t is the number of parent configurations of node t.
The marginal likelihood is

p(D | G) = ∏_{t=1}^{V} ∏_{c=1}^{C_t} B(N_tc + α_tc) / B(α_tc)

where N_tc = ∑_k N_tck and α_tc = ∑_k α_tck.
Local scoring
For node t and its parents:

score(N_{t,pa(t)}) ≜ ∏_{c=1}^{C_t} B(N_tc + α_tc) / B(α_tc)
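This local score is what bnlearn computes when scoring a network; a minimal sketch on the package's built-in learning.test data with an illustrative DAG (both are assumptions), where score() returns the log of the BDe marginal likelihood, i.e. the sum of the local log-scores:

library(bnlearn)

data(learning.test)
dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
score(dag, learning.test, type = "bde", iss = 1)                   # log marginal likelihood (BDe)
score(dag, learning.test, type = "bde", iss = 1, by.node = TRUE)   # local score of each family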
Scaling up to larger graphs
The main challenge in computing the posterior over DAGs is that
there are so many possible graphs.
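Greedy local search is the usual answer: hill-climbing adds, removes or reverses one edge at a time, keeping the move that most improves the decomposable score. A minimal sketch, again on bnlearn's illustrative learning.test data:

library(bnlearn)

data(learning.test)
dag.hat <- hc(learning.test, score = "bde", iss = 1)   # greedy hill-climbing on the BDe score
modelstring(dag.hat)                                   # estimated structure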
Intervention
Perfect intervention
S = 1 if you smoke
S = 0 otherwise,
Y = 1 if you have yellow-stained fingers
Y = 0 otherwise.
Graph surgery
x_2 = b_2 + (x_1 − b_1) + z_2
x_3 = b_3 + (1/2)(x_2 − b_2) + z_3
σ_1 = σ_2 = σ_3 = 1
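A hedged simulation sketch contrasting conditioning with intervening on x_2 in this chain (the value 3 is an arbitrary choice): conditioning on x_2 changes our belief about its cause x_1, whereas do(x_2), i.e. graph surgery, leaves the distribution of x_1 untouched.

set.seed(4)
n <- 1e5
b <- c(0, 1, 2)                            # illustrative values of b, matching the exercise below
z <- matrix(rnorm(3 * n), n, 3)            # sigma_1 = sigma_2 = sigma_3 = 1

# observational regime
x1 <- b[1] + z[, 1]
x2 <- b[2] + (x1 - b[1]) + z[, 2]
x3 <- b[3] + 0.5 * (x2 - b[2]) + z[, 3]

mean(x1[abs(x2 - 3) < 0.1])                # conditioning on x2 ~ 3: x1 shifts well above 0

# intervention do(x2 = 3): graph surgery removes the edge x1 -> x2
x3.do <- b[3] + 0.5 * (3 - b[2]) + z[, 3]
mean(x1)                                   # x1 keeps its marginal distribution (about 0)
mean(x3.do)                                # x3 responds as under conditioning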
Problem
Exercise: Directed GGM

μ = (0, 1, 2)^T
diag(S) = (1, 1, 1)

W =
  0    0    0
  1    0    0
  0   1/2   0
n <- 1000
mu <- c(0, 1, 2)
sigma <- c(1, 1, 1)
W <- matrix(c(0, 1, 0, 0, 0, 1/2, 0, 0, 0), 3, 3)  # W[t, s] = w_ts, filled column-wise
U <- solve(diag(rep(1, 3)) - W)                    # U = (I - W)^(-1)
S <- diag(sigma)
Sigma <- U %*% S^2 %*% t(U)                        # Sigma = U S^2 U^T
solve(Sigma)                                       # precision matrix
[,1] [,2] [,3]
[1,] 2 -1.00 0.0
[2,] -1 1.25 -0.5
[3,] 0 -0.50 1.0
Exercise: Directed GGM
First solution (direct): draw from the joint Gaussian.

library(mvtnorm)
Xprime <- rmvnorm(n, mean = c(0, 1, 2), sigma = Sigma)
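Second solution (ancestral sampling), as a sketch: draw each node given its parents, in topological order, using the structural equations (this reuses n, mu, sigma and Sigma from the code above):

z <- matrix(rnorm(3 * n), n, 3)
X <- matrix(0, n, 3)
X[, 1] <- mu[1] + sigma[1] * z[, 1]
X[, 2] <- mu[2] + 1   * (X[, 1] - mu[1]) + sigma[2] * z[, 2]
X[, 3] <- mu[3] + 0.5 * (X[, 2] - mu[2]) + sigma[3] * z[, 3]
colMeans(X)   # should be close to mu
cov(X)        # should be close to Sigma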
Sprinkler Exercise
Let us define the structure of the network:

library(bnlearn)
library(visNetwork)
variables <- c("Nuageux", "Arrosage", "Pluie", "HerbeMouillee")
net <- empty.graph(variables)
adj <- matrix(0L, ncol = 4, nrow = 4, dimnames = list(variables, variables))
adj["Nuageux", "Arrosage"] <- 1
adj["Nuageux", "Pluie"] <- 1
adj["Arrosage", "HerbeMouillee"] <- 1
adj["Pluie", "HerbeMouillee"] <- 1
amat(net) <- adj
Sprinkler Exercise

#plot.network(net) # for a nice html plot
plot(net)
Sprinkler Exercise
Simulate a sample according to the model.
Basic simulation using conditional probability tables
Function for one event (one line of the data frame):

NAPHM1 <- function() {
  N <- rbinom(1, size = 1, prob = 1/2)                                                  # Nuageux
  if (N == 1) {A <- rbinom(1, size = 1, prob = 0.1)} else {A <- rbinom(1, size = 1, prob = 0.5)}   # Arrosage
  if (N == 1) {P <- rbinom(1, size = 1, prob = 0.8)} else {P <- rbinom(1, size = 1, prob = 0.2)}   # Pluie (the value 0.2 is cut off in the slide; assumed here)
  if (A + P == 0) {
    HM <- rbinom(1, size = 1, prob = 0.1)                                               # HerbeMouillee
  } else if (A + P == 1) {
    HM <- rbinom(1, size = 1, prob = 0.9)
  } else {
    HM <- rbinom(1, size = 1, prob = 0.99)
  }
  as.logical(c(N, A, P, HM))
}
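The glue step that assembles the simulated cases into the data frame X used below can be written as follows (the sample size and column names are assumptions chosen to match the learning code):

set.seed(5)
n <- 10000
X <- as.data.frame(t(replicate(n, NAPHM1())))
names(X) <- c("Nuageux", "Arrosage", "Pluie", "HerbeMouillee")
head(X)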
Learning the parameters
mean(X$Nuageux) -> pNuageux                                          # P(Nuageux = TRUE)
lapply(split(X, X$Nuageux),
       function(XsousTableau) {mean(XsousTableau$Arrosage)})         # P(Arrosage | Nuageux)
lapply(split(X, X$Nuageux),
       function(XsousTableau) {mean(XsousTableau$Pluie)})            # P(Pluie | Nuageux)
lapply(split(X, X$Arrosage + X$Pluie),
       function(XsousTableau) {mean(XsousTableau$HerbeMouillee)})    # P(HerbeMouillee | Arrosage + Pluie)
Back to lecture
Exercises: Directed Graphical Models
Bishop 8.3
Local Markov Property
Directed local Markov property: x_t ⊥ x_{nd(t)∖pa(t)} | x_{pa(t)}, where nd(t) denotes
the non-descendants of t.
Thus, since pred(t) ⊆ nd(t) for any topological ordering, the ordered Markov property follows.
I(x_s; x_t) = KL( p(x_s, x_t) ‖ p(x_s) p(x_t) )

= − (1/2) log ( |Σ| / |diag(σ_1^2, σ_2^2)| ) − (1/2) E[ z^T Σ^{−1} z − z^T diag(1/σ_1^2, 1/σ_2^2) z ]

= − (1/2) log(1 − ρ^2) − (1/2) trace( E[z z^T] ( Σ^{−1} − diag(1/σ_1^2, 1/σ_2^2) ) )

= − (1/2) log(1 − ρ^2) − (1/2) trace( I − [ 1 , σ_12/σ_2^2 ; σ_12/σ_1^2 , 1 ] )

= − (1/2) log(1 − ρ^2)

where z = (x_s, x_t)^T − (μ_s, μ_t)^T and E[z z^T] = Σ.
KL-divergence
Maximizing the log-likelihood is equivalent to minimizing the KL divergence.
Projects
List 2023
Explain a concept and illustrate with an example:
10 minutes OBS recording
Commented Code Notebook (not a full report).