Prob Nets
Cantabria University
The Joint Probability Distribution (JPD) of a set of $n$ binary variables involves a huge number of parameters, $2^n$ (larger than $10^{25}$ for only 100 variables).
x  y  z  p(x, y, z)
0  0  0  0.12
0  0  1  0.18
0  1  0  0.04
0  1  1  0.16
1  0  0  0.09
1  0  1  0.21
1  1  0  0.02
1  1  1  0.18
We can use the qualitative structure of the model to simplify the probabilistic structure.
Graphically specified models have two components: a qualitative structure (factorization) and a quantitative structure (parameter estimation).
Definition 1 (Conditional probability). Let $X$ and $Y$ be two disjoint subsets of variables such that $p(y) > 0$. Then the conditional probability distribution (CPD) of $X$ given $Y = y$ is given by
$$p(X = x \mid Y = y) = p(x \mid y) = \frac{p(x, y)}{p(y)}. \tag{1}$$
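As a worked instance of (1), the following sketch uses the three-variable table of $p(x, y, z)$ given earlier, marginalizing out $z$; the choice of the event $x = 0$, $y = 0$ is only for illustration.

```latex
\begin{align*}
p(x=0, y=0) &= p(0,0,0) + p(0,0,1) = 0.12 + 0.18 = 0.30,\\
p(y=0)      &= 0.12 + 0.18 + 0.09 + 0.21 = 0.60,\\
p(x=0 \mid y=0) &= \frac{p(x=0, y=0)}{p(y=0)} = \frac{0.30}{0.60} = 0.5 .
\end{align*}
```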
Definition 2 (Independence of two variables). Let $X$ and $Y$ be two disjoint subsets of the set of random variables $\{X_1, \ldots, X_n\}$. Then $X$ is said to be independent of $Y$ if and only if
$$p(x \mid y) = p(x), \tag{2}$$
for all possible values $x$ and $y$ of $X$ and $Y$; otherwise $X$ is said to be dependent on $Y$.
Also, if $X$ is independent of $Y$, we can combine (1) and (2) and obtain $p(x, y)/p(y) = p(x)$, which implies
$$p(x, y) = p(x)\,p(y). \tag{3}$$
If $\{X_1, \ldots, X_m\}$ are independent, then
$$p(x_1, \ldots, x_m) = \prod_{i=1}^{m} p(x_i). \tag{4}$$
CONDITIONAL INDEPENDENCE
Definition 3 (Conditional independence). Let $X$, $Y$ and $Z$ be three disjoint sets of variables. Then $X$ is said to be conditionally independent of $Y$ given $Z$ if and only if
$$p(x \mid z, y) = p(x \mid z),$$
or, equivalently,
$$p(x, y \mid z) = p(x \mid z)\,p(y \mid z).$$
When $X$ and $Y$ are conditionally independent given $Z$, we write $I(X, Y \mid Z)$. The statement $I(X, Y \mid Z)$ is referred to as a conditional independence statement (CIS). Similarly, when $X$ and $Y$ are conditionally dependent given $Z$, we write $D(X, Y \mid Z)$, which is called a conditional dependence statement.
The definition of conditional independence conveys the idea that once $Z$ is known, knowing $Y$ can no longer influence the probability of $X$. In other words, if $Z$ is already known, knowledge of $Y$ does not add any new information about $X$.
Note that (unconditional) independence can be treated as a particular case of conditional independence. For example, we can write $I(X, Y \mid \emptyset)$ to mean that $X$ and $Y$ are unconditionally independent.
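The definition can be checked numerically on a small JPD. The sketch below is one possible way to do so, using the three-variable table of p(x, y, z) given earlier; the function names `marginal` and `cond_independent` and the tolerance `tol` are illustrative assumptions, and the JPD is assumed to list every configuration explicitly. It tests the cross-multiplied form p(x, y, z) p(z) = p(x, z) p(y, z), which avoids dividing by zero.

```python
# JPD of three binary variables keyed by (x, y, z); values copied from the table of p(x, y, z).
P = {
    (0, 0, 0): 0.12, (0, 0, 1): 0.18,
    (0, 1, 0): 0.04, (0, 1, 1): 0.16,
    (1, 0, 0): 0.09, (1, 0, 1): 0.21,
    (1, 1, 0): 0.02, (1, 1, 1): 0.18,
}

def marginal(jpd, keep):
    """Sum the JPD over every variable whose position is not listed in `keep`."""
    out = {}
    for values, p in jpd.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_independent(jpd, i, j, given, tol=1e-9):
    """Test I(X_i, X_j | X_given) via p(xi, xj, z) * p(z) == p(xi, z) * p(xj, z)."""
    given = list(given)
    p_ijz = marginal(jpd, [i, j] + given)
    p_iz = marginal(jpd, [i] + given)
    p_jz = marginal(jpd, [j] + given)
    p_z = marginal(jpd, given)
    for key, p in p_ijz.items():
        xi, xj, z = key[0], key[1], key[2:]
        if abs(p * p_z[z] - p_iz[(xi,) + z] * p_jz[(xj,) + z]) > tol:
            return False
    return True

x_y_marginal = cond_independent(P, 0, 1, [])    # I(X, Y | empty set)
x_y_given_z = cond_independent(P, 0, 1, [2])    # I(X, Y | Z)
```

For this particular table, X and Y turn out to be unconditionally independent, while they become dependent once Z is given, so a CIS that holds for one conditioning set need not hold for another.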
PROBABILISTIC NETWORK MODELS: BAYESIAN NETWORKS
FACTORIZATIONS OF A JPD
Definition 4 (Factorization by potentials). Let $C_1, \ldots, C_m$ be subsets of a set of variables $X = \{X_1, \ldots, X_n\}$. The JPD of $X$ is said to factorize by potentials if
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{m} \psi_i(c_i),$$
where the functions $\psi_i$, called factor potentials, are nonnegative.
Definition 5 (Chain rule factorizations). Any JPD of a set of ordered variables $\{X_1, \ldots, X_n\}$ can be expressed as a product of $m$ CPDs of the form
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{m} p(y_i \mid b_i), \tag{5}$$
where $B_i = \{Y_1, \ldots, Y_{i-1}\}$.
Example 1 (Chain rule). Consider a case of four variables $\{X_1, \ldots, X_4\}$. Then the following are equivalent chain rule factorizations of the JPD:
$$p(x_1, \ldots, x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_1, x_2, x_3)$$
and
$$p(x_1, \ldots, x_4) = p(x_1 \mid x_2, x_3, x_4)\,p(x_2 \mid x_3, x_4)\,p(x_3 \mid x_4)\,p(x_4).$$
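The chain rule can be verified numerically: for any ordering of the variables, multiplying the successive conditionals reproduces the joint. The sketch below does this for the three-variable table of p(x, y, z) given earlier; the helper names `marginal` and `chain_rule` are illustrative, and the code assumes every prefix marginal is strictly positive (true for this table).

```python
# JPD of three binary variables keyed by (x, y, z); values copied from the table of p(x, y, z).
P = {
    (0, 0, 0): 0.12, (0, 0, 1): 0.18,
    (0, 1, 0): 0.04, (0, 1, 1): 0.16,
    (1, 0, 0): 0.09, (1, 0, 1): 0.21,
    (1, 1, 0): 0.02, (1, 1, 1): 0.18,
}

def marginal(jpd, keep):
    """Sum the JPD over every variable whose position is not listed in `keep`."""
    out = {}
    for values, p in jpd.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def chain_rule(jpd, order):
    """Rebuild each joint probability as prod_k p(y_k | y_1, ..., y_{k-1}) for the ordering `order`."""
    # Marginals over each prefix of the ordering: p(), p(y1), p(y1, y2), ...
    prefix = [marginal(jpd, order[:k]) for k in range(len(order) + 1)]
    rebuilt = {}
    for values in jpd:
        prob = 1.0
        for k in range(1, len(order) + 1):
            head = tuple(values[i] for i in order[:k])
            # p(y_k | y_1..y_{k-1}) = p(y_1..y_k) / p(y_1..y_{k-1})
            prob *= prefix[k][head] / prefix[k - 1][head[:-1]]
        rebuilt[values] = prob
    return rebuilt

forward = chain_rule(P, [0, 1, 2])    # p(x) p(y|x) p(z|x,y)
backward = chain_rule(P, [2, 1, 0])   # p(z) p(y|z) p(x|y,z)
```

Both orderings return the original table up to rounding, which is the sense in which different chain rule factorizations are equivalent.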
IMPOSING INDEPENDENCIES
Consider the variables $\{X_1, X_2, X_3, X_4\}$ and suppose we have
$$I(X_3, X_1 \mid X_2) \quad \text{and} \quad I(X_4, \{X_1, X_3\} \mid X_2). \tag{6}$$
We wish to compute the constraints among the parameters of the JPD imposed by these CISs. The first of these statements implies
$$p(x_3 \mid x_1, x_2) = p(x_3 \mid x_2), \tag{7}$$
and the second statement implies
$$p(x_4 \mid x_1, x_2, x_3) = p(x_4 \mid x_2). \tag{8}$$
Note that the general form of the JPD is not a suitable representation for calculating the constraints given by (7) and (8). However, substituting these two equalities into the chain rule factorization $p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_1, x_2, x_3)$, we obtain
$$p(x_1, \ldots, x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)\,p(x_4 \mid x_2). \tag{9}$$
Therefore, for binary variables, the two CISs in (6) reduce the number of free parameters from 15 to 7.
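The parameter counts 15 and 7 can be reproduced by counting free values in each CPD. The sketch below assumes all variables share the same number of states (two here); the function name `free_parameters` is illustrative. A CPD with k parents needs (cardinality - 1) free values for each of the cardinality^k parent configurations.

```python
def free_parameters(parent_counts, cardinality=2):
    """Free parameters in a product of CPDs, given the number of parents of each variable."""
    return sum((cardinality - 1) * cardinality ** k for k in parent_counts)

# Full chain rule: p(x1) p(x2|x1) p(x3|x1,x2) p(x4|x1,x2,x3)
full_chain = free_parameters([0, 1, 2, 3])

# Reduced factorization (9): p(x1) p(x2|x1) p(x3|x2) p(x4|x2)
reduced = free_parameters([0, 1, 1, 1])
```

Here `full_chain` is 1 + 2 + 4 + 8 = 15 and `reduced` is 1 + 2 + 2 + 2 = 7.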
Definition 6 (Dependency model). Any model $M$ of a set of variables $\{X_1, \ldots, X_n\}$ from which we can determine whether $I(X, Y \mid Z)$ is true, for all possible triplets of disjoint subsets $X$, $Y$, and $Z$, is called a dependency model.
A JPD (with the definition of conditional independence) is a dependency model. A graph can also define a dependency model (using the corresponding separation criterion). The qualitative structure of a probabilistic model can be represented by a graphical dependency model that provides a way to factorize the corresponding JPD.
Definition 7 (U-separation). Let $X$, $Y$, and $Z$ be three disjoint subsets of nodes in an undirected graph $G$. We say that $Z$ separates $X$ and $Y$ if and only if every path between each node in $X$ and each node in $Y$ contains at least one node in $Z$. When $Z$ separates $X$ and $Y$ in $G$, we write $I(X, Y \mid Z)_G$; otherwise $D(X, Y \mid Z)_G$.
Given an undirected graph, one can derive all CISs from the graph using the above U-separation criterion.
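U-separation reduces to a reachability test: Z separates X and Y exactly when no node of Y can be reached from X by a search that never enters Z. The following is a minimal sketch of that idea; the function name `u_separated`, the adjacency-dictionary representation, and the five-node graph are hypothetical choices made for illustration.

```python
from collections import deque

def u_separated(adj, xs, ys, zs):
    """Return True if zs separates xs from ys in the undirected graph `adj`.

    `adj` maps each node to the set of its neighbours.
    """
    blocked, targets = set(zs), set(ys)
    frontier = deque(x for x in xs if x not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in targets:
            return False          # found a path from X to Y that avoids Z
        for nb in adj[node]:
            if nb not in blocked and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return True

# Hypothetical graph: A-B, A-C, B-D, C-D, D-E
adj = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
```

In this graph, `u_separated(adj, {"A"}, {"E"}, {"D"})` holds because every path from A to E passes through D, whereas `{"B"}` alone does not separate A from D because of the path A-C-D.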
U-SEPARATION EXAMPLE
[Figure: an undirected graph on the nodes A, B, C, D, E, F, G, H, I, repeated in several panels, including (a) $I(A, I \mid E)$ and (b) $D(A, I \mid B)$.]
Every path between A and I contains E; thus $I(A, I \mid E)_G$.
There is a path (A-C-E-I) that does not contain B; thus $D(A, I \mid B)_G$.
Every path between the two subsets contains either B or E; thus $I(\{A, C\}, \{D, H\} \mid \{B, E\})_G$.
There is a path (A-B-D) that does not contain the variables E and I, so $\{E, I\}$ does not separate A from D.
D-SEPARATION
Definition 8 (D-separation). Let $X$, $Y$, and $Z$ be three disjoint subsets of nodes in a DAG $D$; then $Z$ is said to D-separate $X$ and $Y$ if and only if along every undirected path from each node in $X$ to each node in $Y$ there is an intermediate node $A$ such that either
1. $A$ is a head-to-head node in the path, and neither $A$ nor its descendants are in $Z$, or
2. $A$ is not a head-to-head node in the path and $A$ is in $Z$.
When $Z$ D-separates $X$ and $Y$ in $D$, we write $I(X, Y \mid Z)_D$ to indicate that this CIS is derived from $D$; otherwise we write $D(X, Y \mid Z)_D$ to indicate that $X$ and $Y$ are conditionally dependent given $Z$ in the graph $D$.
Definition 9 (D-separation). Let $X$, $Y$, and $Z$ be three disjoint subsets of nodes in a DAG $D$; then $Z$ is said to D-separate $X$ and $Y$ if and only if $Z$ separates $X$ and $Y$ in the moral graph of the smallest ancestral set containing $X$, $Y$, and $Z$.
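Definition 9 translates fairly directly into three steps: collect the ancestral set, moralize it, and run an undirected separation test. The sketch below follows those steps; the function name `d_separated`, the parent-dictionary representation, and the four-node DAG (A -> C <- B, C -> D) are hypothetical choices for illustration, not taken from the slides.

```python
from collections import deque
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """Test I(X, Y | Z)_D using the moral graph of the smallest ancestral set.

    `parents` maps each node of the DAG to the set of its parents.
    """
    # 1. Smallest ancestral set: X, Y, Z together with all of their ancestors.
    ancestral = set()
    stack = list(set(xs) | set(ys) | set(zs))
    while stack:
        node = stack.pop()
        if node not in ancestral:
            ancestral.add(node)
            stack.extend(parents[node])
    # 2. Moral graph: drop directions and link every pair of parents of a common child.
    adj = {node: set() for node in ancestral}
    for child in ancestral:
        for p in parents[child]:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(parents[child], 2):
            adj[p].add(q)
            adj[q].add(p)
    # 3. U-separation: breadth-first search from X that never enters Z.
    blocked, targets = set(zs), set(ys)
    frontier = deque(x for x in xs if x not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in targets:
            return False
        for nb in adj[node]:
            if nb not in blocked and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return True

# Hypothetical DAG: A -> C <- B, C -> D
parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
```

With this DAG, A and B are D-separated by the empty set, but conditioning on the head-to-head node C or on its descendant D makes them dependent, in agreement with condition 1 of Definition 8.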
D-SEPARATION EXAMPLE
[Figure: a DAG over the variables Employment (E), Investment income (V), Health (H), Wealth (W), Contributions (C), and Happiness (P). Panels: (a) $I(E, V \mid \emptyset)$; (b) $D(E, H \mid P)$.]
Symmetry: If $X$ is conditionally independent (c.i.) of $Y$, then $Y$ is c.i. of $X$:
$$I(X, Y \mid Z) \Leftrightarrow I(Y, X \mid Z).$$
Decomposition: If $X$ is c.i. of $Y \cup W$ given $Z$, then $X$ is c.i. of $Y$ given $Z$, and $X$ is c.i. of $W$ given $Z$, that is,
$$I(X, Y \cup W \mid Z) \Rightarrow I(X, Y \mid Z) \text{ and } I(X, W \mid Z).$$
Weak Union:
$$I(X, Y \cup W \mid Z) \Rightarrow I(X, W \mid Z \cup Y) \text{ and } I(X, Y \mid Z \cup W).$$
Contraction: If $W$ is irrelevant to $X$ after the learning of some irrelevant information $Y$, then $W$ must have been irrelevant before we knew $Y$, that is,
$$I(X, W \mid Z \cup Y) \text{ and } I(X, Y \mid Z) \Rightarrow I(X, Y \cup W \mid Z).$$
The weak union and contraction properties together mean that irrelevant information should not alter the relevance of other relevant information in the system. In other words, what was relevant remains relevant, and what was irrelevant remains irrelevant.
The above four properties hold for any JPD.
GRAPHICAL ILLUSTRATION
[Figure: graphical illustration of the properties. Panels: (a) Symmetry; (b) Decomposition; (d) Contraction; (e) Intersection.]
OTHER PROPERTIES
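1. Strong Union: If $X$ is c.i. of $Y$ given $Z$, then $X$ is also c.i. of $Y$ given $Z \cup W$, that is,
$$I(X, Y \mid Z) \Rightarrow I(X, Y \mid Z \cup W).$$
[Figure: two DAGs, panels (a) and (b), illustrating the violation of the strong union property by DAGs.]
2. Intersection:
$$I(X, W \mid Z \cup Y) \text{ and } I(X, Y \mid Z \cup W) \Rightarrow I(X, Y \cup W \mid Z).$$
3. Strong Transitivity:
$$D(X, A \mid Z) \text{ and } D(A, Y \mid Z) \Rightarrow D(X, Y \mid Z).$$
4. Weak Transitivity:
$$D(X, A \mid Z) \text{ and } D(A, Y \mid Z) \Rightarrow D(X, Y \mid Z) \text{ or } D(X, Y \mid Z \cup A).$$
5. Chordality:
$$D(A, C \mid B) \text{ and } D(A, C \mid D) \Rightarrow D(A, C \mid B \cup D) \text{ or } D(B, D \mid A \cup C).$$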
GRAPHICAL ILLUSTRATION
[Figure: graphical illustration of the remaining properties, ending with panel (d) Chordality.]
Graphs display the relationships among the variables explicitly and are intuitive and easy to explain. It is important to analyze whether or not dependency models associated with probabilistic models can be given by graphical models.
Definition 10 (Perfect map). A graph $G$ is said to be a perfect map of a dependency model $M$ if every CIS derived from $G$ can also be derived from $M$ and vice versa, that is,
$$I(X, Y \mid Z)_M \Leftrightarrow I(X, Y \mid Z)_G \Leftrightarrow Z \text{ separates } X \text{ from } Y.$$
Unfortunately, not every dependency model can be represented by a directed or undirected perfect map.
Example 2 (Dependency model with no directed perfect map). Consider the set of three variables $\{X, Y, Z\}$ and the dependency model
$$M = \{I(X, Y \mid Z),\ I(Y, Z \mid X),\ I(Y, X \mid Z),\ I(Z, Y \mid X)\}.$$
There is no directed acyclic graph (DAG) $D$ that is a perfect map of the dependency model $M$.
[Figure: candidate DAGs on the nodes {X, Y, Z}, panels (a) to (h).]
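Table: for each candidate graph G, CISs that hold in G but not in M, or in M but not in G.
Graph  CISs
(a)    $I(X, Z \mid \emptyset)$
(b)    $I(X, Z \mid \emptyset)$, $I(X, Y \mid \emptyset)$
(c)    $I(Y, Z \mid \emptyset)$
(d)    $I(X, Z \mid \emptyset)$
(e)    $I(Y, Z \mid X)$, $I(X, Y \mid \emptyset)$
(f)    $I(X, Z \mid Y)$, $I(X, Y \mid \emptyset)$
(g)    $I(X, Y \mid Z)$, $I(X, Y \mid \emptyset)$
(h)    $I(X, Y \mid \emptyset)$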
Definition 11 (Independence map). A graph $G$ is said to be an independence map (I-map) of a dependency model $M$ if
$$I(X, Y \mid Z)_G \Rightarrow I(X, Y \mid Z)_M,$$
that is, if all CISs derived from $G$ hold in $M$.
Definition 12 (Dependency map). A graph $G$ is said to be a dependency map (D-map) of a dependency model $M$ if
$$D(X, Y \mid Z)_G \Rightarrow D(X, Y \mid Z)_M,$$
that is, if all conditional dependence statements derived from $G$ hold in $M$ (equivalently, every CIS in $M$ also holds in $G$).
Definition 13 (Minimal I-map). A graph $G$ is said to be a minimal I-map of a dependency model $M$ if it is an I-map of $M$ but ceases to be an I-map of $M$ when any link is removed from it.
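When the CISs of a graph and of a model are both available as explicit sets, the three map notions become set comparisons: an I-map requires the graph's CISs to be contained in the model's, a D-map requires the converse, and a perfect map requires equality. The sketch below illustrates this under strong simplifying assumptions: only statements between single variables are listed, the encoding `cis` and the function names are hypothetical, and the graph's CISs are written out by hand for the three-node graph whose only link joins X and Z.

```python
def cis(a, b, given=()):
    """Encode I(a, b | given) so that I(a, b | .) and I(b, a | .) compare equal."""
    return (frozenset([a, b]), frozenset(given))

def is_i_map(cis_g, cis_m):
    return cis_g <= cis_m      # every CIS of G holds in M

def is_d_map(cis_g, cis_m):
    return cis_m <= cis_g      # every CIS of M holds in G

def is_perfect_map(cis_g, cis_m):
    return cis_g == cis_m

# Dependency model M of Example 2 (symmetric statements collapse to two entries).
M = {cis("X", "Y", ["Z"]), cis("Y", "Z", ["X"])}

# Graph with the single link X-Z and Y isolated: Y is separated from X and Z
# by any conditioning set, while X and Z are never separated.
G = {
    cis("X", "Y"), cis("X", "Y", ["Z"]),
    cis("Y", "Z"), cis("Y", "Z", ["X"]),
}
```

Under these assumptions the graph is a D-map of M but not an I-map, because it asserts unconditional independencies such as I(X, Y | ∅) that M does not contain.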
A necessary and sufficient condition for a dependency model $M$ to have an undirected perfect map is that $M$ must satisfy the following properties:
Symmetry:
$$I(X, Y \mid Z)_M \Leftrightarrow I(Y, X \mid Z)_M.$$
Decomposition:
$$I(X, Y \cup W \mid Z)_M \Rightarrow I(X, Y \mid Z)_M \text{ and } I(X, W \mid Z)_M.$$
Intersection:
$$I(X, W \mid Z \cup Y)_M \text{ and } I(X, Y \mid Z \cup W)_M \Rightarrow I(X, Y \cup W \mid Z)_M.$$
Strong union:
$$I(X, Y \mid Z)_M \Rightarrow I(X, Y \mid Z \cup W)_M.$$
Strong transitivity:
$$I(X, Y \mid Z)_M \Rightarrow I(X, A \mid Z)_M \text{ or } I(Y, A \mid Z)_M,$$
where $A$ is a single node not in $\{X, Y, Z\}$.
MARKOV NETWORKS
Definition 14 (Markov network). A Markov network is a pair $(G, \Psi)$, where $G$ is an undirected graph and $\Psi = \{\psi_1(c_1), \ldots, \psi_m(c_m)\}$ is a set of positive potential functions defined on the cliques $C_1, \ldots, C_m$ of $G$ that defines the JPD $p(x)$ as
$$p(x) = \prod_{i=1}^{m} \psi_i(c_i). \tag{10}$$
If the undirected graph $G$ is triangulated, then $p(x)$ can also be factorized, using probability functions $P = \{p(r_1 \mid s_1), \ldots, p(r_m \mid s_m)\}$, as
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{m} p(r_i \mid s_i), \tag{11}$$
where $R_i$ and $S_i$ are the residual and the separator of the clique $C_i$. In this case, the Markov network model is defined by $(G, P)$. The graph $G$ is an undirected I-map of $p(x)$.
Thus, a Markov network can be used to define the qualitative structure of a probabilistic model through a factorization of the corresponding JPD in terms of potential functions or probability functions. The quantitative structure is then obtained by numerically specifying the functions appearing in the factorization.
[Figure: (a) an undirected graph on the nodes A, B, C, D, E, F; (b) the same graph with numbered nodes and its cliques $C_1, \ldots, C_4$.]
The cliques of the graph are:
$$C_1 = \{A, B, C\}, \quad C_2 = \{B, C, E\}, \quad C_3 = \{B, D\}, \quad C_4 = \{C, F\}. \tag{12}$$
$$p(a, b, c, d, e, f) = \psi_1(c_1)\,\psi_2(c_2)\,\psi_3(c_3)\,\psi_4(c_4) = \psi_1(a, b, c)\,\psi_2(b, c, e)\,\psi_3(b, d)\,\psi_4(c, f). \tag{13}$$
Since the graph is triangulated, another factorization of the JPD in terms of probability functions can be obtained:
i  Clique C_i  Separator S_i  Residual R_i
1  A, B, C     ∅              A, B, C
2  B, C, E     B, C           E
3  B, D        B              D
4  C, F        C              F
$$p(a, b, c, d, e, f) = \prod_{i=1}^{4} p(r_i \mid s_i) = p(a, b, c)\,p(e \mid b, c)\,p(d \mid b)\,p(f \mid c). \tag{14}$$
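The separator and residual columns of the table can be derived mechanically from the ordered cliques. The sketch below assumes the usual construction, in which the separator of a clique is its intersection with the union of all earlier cliques and the residual is what remains, and it assumes the cliques are already given in an order suitable for a triangulated graph (as in (12)). The function name `separators_and_residuals` is illustrative.

```python
def separators_and_residuals(cliques):
    """Return a list of (S_i, R_i) pairs for an ordered list of cliques.

    S_i is C_i intersected with the union of C_1, ..., C_{i-1};
    R_i is C_i minus S_i.
    """
    seen = set()
    pairs = []
    for clique in cliques:
        separator = clique & seen
        pairs.append((separator, clique - separator))
        seen |= clique
    return pairs

# Cliques C1, ..., C4 from (12), in the same order.
cliques = [{"A", "B", "C"}, {"B", "C", "E"}, {"B", "D"}, {"C", "F"}]
pairs = separators_and_residuals(cliques)
```

Reading the pairs as p(r_i | s_i) gives p(a, b, c), p(e | b, c), p(d | b), and p(f | c), which is the factorization in (14).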
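A necessary condition for a dependency model $M$ to have a directed perfect map is that $M$ must satisfy the following properties:
Symmetry:
$$I(X, Y \mid Z)_M \Leftrightarrow I(Y, X \mid Z)_M.$$
Composition-Decomposition:
$$I(X, Y \cup W \mid Z)_M \Leftrightarrow I(X, Y \mid Z)_M \text{ and } I(X, W \mid Z)_M.$$
Intersection:
$$I(X, W \mid Z \cup Y)_M \text{ and } I(X, Y \mid Z \cup W)_M \Rightarrow I(X, Y \cup W \mid Z)_M.$$
Weak union:
$$I(X, Y \cup Z \mid W)_M \Rightarrow I(X, Y \mid W \cup Z)_M.$$
Weak transitivity:
$$I(X, Y \mid Z)_M \text{ and } I(X, Y \mid Z \cup A)_M \Rightarrow I(X, A \mid Z)_M \text{ or } I(Y, A \mid Z)_M.$$
Contraction:
$$I(X, Y \mid Z \cup W)_M \text{ and } I(X, W \mid Z)_M \Rightarrow I(X, Y \cup W \mid Z)_M.$$
Chordality:
$$I(A, B \mid C \cup D)_M \text{ and } I(C, D \mid A \cup B)_M \Rightarrow I(A, B \mid C)_M \text{ or } I(A, B \mid D)_M.$$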
BAYESIAN NETWORKS
Definition 15 (Bayesian network). A Bayesian network is a pair $(D, P)$, where $D$ is a DAG, $P = \{p(x_1 \mid \pi_1), \ldots, p(x_n \mid \pi_n)\}$ is a set of $n$ CPDs, one for each variable, and $\Pi_i$ is the set of parents of node $X_i$ in $D$. The set $P$ defines the associated JPD as
$$p(x) = \prod_{i=1}^{n} p(x_i \mid \pi_i). \tag{15}$$
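As an illustration of (15), the sketch below builds the JPD of a small Bayesian network from its CPDs. The DAG is the one implied by factorization (9), X1 -> X2, X2 -> X3, X2 -> X4; the numerical CPD values are hypothetical and chosen only for the example, and the names `parents`, `cpd`, and `joint` are illustrative. All variables are assumed binary, with each table storing p(node = 1 | parents).

```python
from itertools import product

# Parents of each node: X1 -> X2, X2 -> X3, X2 -> X4 (the structure behind (9)).
parents = {"X1": (), "X2": ("X1",), "X3": ("X2",), "X4": ("X2",)}

# Hypothetical CPDs: cpd[node][parent values] = p(node = 1 | parents).
cpd = {
    "X1": {(): 0.3},
    "X2": {(0,): 0.2, (1,): 0.7},
    "X3": {(0,): 0.5, (1,): 0.1},
    "X4": {(0,): 0.4, (1,): 0.9},
}

def joint(assignment):
    """p(x) as the product over nodes of p(x_i | pi_i), for a full 0/1 assignment."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpd[node][tuple(assignment[q] for q in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

nodes = list(parents)
# The factorization defines a valid JPD: the 16 probabilities sum to one.
total = sum(joint(dict(zip(nodes, values)))
            for values in product([0, 1], repeat=len(nodes)))
```

The seven numbers stored in `cpd` are exactly the seven free parameters of a four-variable binary network with this structure, compared with fifteen for an unrestricted JPD.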