Reasoning With Uncertainty - Probabilistic Reasoning: Version 2 CSE IIT, Kharagpur
Module 10
Reasoning with Uncertainty Probabilistic reasoning
Lesson 28
Bayes Networks
If a node has no parents, then the CPT reduces to a table giving the marginal distribution on that random variable.
Consider another example, in which all nodes are binary, i.e., have two possible values, which we will denote by T (true) and F (false).
We see that the event "grass is wet" (W=true) has two possible causes: either the water sprinkler is on (S=true) or it is raining (R=true). The strength of this relationship is shown in the table. For example, we see that Pr(W=true | S=true, R=false) = 0.9 (second row), and hence Pr(W=false | S=true, R=false) = 1 - 0.9 = 0.1, since each row must sum to one. Since the C node has no parents, its CPT specifies the prior probability that it is cloudy (in this case, 0.5). (Think of C as representing the season: if it is a cloudy season, it is less likely that the sprinkler is on and more likely that it is raining.)
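As a minimal sketch of how such a table can be stored, the CPT for W can be held as a mapping from the parent values (S, R) to Pr(W=true | S, R). Only the 0.9 entry is quoted in the text above; the other three numbers are illustrative assumptions, and the names `cpt_w` and `pr_w` are hypothetical.

```python
# CPT for W, keyed by the parent values (S, R).
# Each entry is Pr(W=true | S, R). Only the (True, False) entry (0.9) is
# quoted in the text; the remaining values are illustrative assumptions.
cpt_w = {
    (False, False): 0.0,
    (True, False): 0.9,
    (False, True): 0.9,
    (True, True): 0.99,
}


def pr_w(w, s, r):
    """Return Pr(W=w | S=s, R=r); the W=false entry is the complement."""
    p_true = cpt_w[(s, r)]
    return p_true if w else 1.0 - p_true


# Each row of the CPT (one row per parent configuration) sums to one.
for (s, r) in cpt_w:
    assert abs(pr_w(True, s, r) + pr_w(False, s, r) - 1.0) < 1e-12

# Pr(W=false | S=true, R=false) = 1 - 0.9
print(round(pr_w(False, True, False), 3))
```

Storing only Pr(W=true | S, R) is enough for a binary node, because the other column of each row is fixed by the sum-to-one constraint.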
By the chain rule of probability, the joint distribution over all four nodes is
P(C, S, R, W) = P(C) * P(S|C) * P(R|C,S) * P(W|C,S,R)
By using conditional independence relationships, we can rewrite this as
P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)
where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.
We can see that the conditional independence relationships allow us to represent the joint more compactly. Here the savings are minimal, but in general, if we had n binary nodes, the full joint would require O(2^n) space to represent, whereas the factored form would require only O(n * 2^k) space, where k is the maximum fan-in of a node. Fewer parameters also make learning easier.
The intuitive meaning of an arrow from a parent to a child is that the parent directly influences the child. The direction of this influence is often taken to represent causal influence. The conditional probabilities give the strength of causal influence. A 0 or 1 in a CPT represents a deterministic influence.
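The space comparison can be checked by counting table entries directly. The sketch below assumes the four-node structure described above (C is the parent of S and R, which are the parents of W); the names `parents`, `full_joint_size` and `factored_size` are hypothetical.

```python
# Parents of each node in the cloudy / sprinkler / rain / wet-grass network.
parents = {
    "C": [],
    "S": ["C"],
    "R": ["C"],
    "W": ["S", "R"],
}


def full_joint_size(n):
    """Number of entries in the full joint table over n binary variables."""
    return 2 ** n


def factored_size(parents):
    """Total number of CPT rows: one per configuration of each node's parents.

    For a binary node, each row holds one free number, Pr(node=T | parents).
    """
    return sum(2 ** len(pa) for pa in parents.values())


n = len(parents)
k = max(len(pa) for pa in parents.values())  # maximum fan-in

print(full_joint_size(n))       # 16 entries in the full joint
print(factored_size(parents))   # 1 + 2 + 2 + 4 = 9 CPT rows
print(n * 2 ** k)               # 16, the n * 2^k upper bound on CPT rows
```

For four nodes the gap is small (9 rows against 16 entries), but the full joint doubles with every added node, while the factored form grows linearly in n for a fixed fan-in k.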
Note that a node is NOT independent of its descendants given its parents. Generally,
Bayesian networks allow a compact representation of probability distributions. An unstructured table representation of the medical expert system joint would require 2^8 - 1 = 255 numbers. With the structure imposed by the conditional independence assumptions, this reduces to 18 numbers. Structure also allows efficient inference, of which more later.
If E d-separates X and Y, then X and Y are conditionally independent given E. E d-separates X and Y if every undirected path from a node in X to a node in Y is blocked given E.
Defining d-separation: a path is blocked given a set of nodes E if there is a node Z on the path for which one of these three conditions holds:
1. Z is in E and Z has one arrow on the path coming in and one arrow going out.
2. Z is in E and Z has both path arrows leading out.
3. Neither Z nor any descendant of Z is in E, and both path arrows lead in to Z.
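The three blocking conditions can be sketched as a brute-force check that enumerates every undirected path between two nodes. This is an illustration for small graphs only, not an efficient algorithm; the function names are hypothetical, and the example graph is assumed to be the four-node network with C as the parent of S and R, which are the parents of W.

```python
def children_of(parents):
    """Invert a node -> parents mapping into a node -> children mapping."""
    children = {v: [] for v in parents}
    for v, pa in parents.items():
        for p in pa:
            children[p].append(v)
    return children


def descendants(node, children):
    """Return the set of all descendants of node."""
    seen, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


def undirected_paths(x, y, parents, children):
    """Yield every simple path from x to y, ignoring arrow directions."""
    def extend(path):
        last = path[-1]
        if last == y:
            yield path
            return
        for nxt in parents[last] + children[last]:
            if nxt not in path:
                yield from extend(path + [nxt])
    yield from extend([x])


def path_blocked(path, evidence, parents, children):
    """Check the three blocking conditions at each interior node Z."""
    for a, z, b in zip(path, path[1:], path[2:]):
        both_arrows_in = a in parents[z] and b in parents[z]
        if both_arrows_in:
            # Condition 3: neither Z nor any descendant of Z is in E.
            if z not in evidence and not (descendants(z, children) & evidence):
                return True
        elif z in evidence:
            # Conditions 1 and 2: chain or common parent, with Z in E.
            return True
    return False


def d_separated(x, y, evidence, parents):
    """True if every undirected path from x to y is blocked given evidence."""
    children = children_of(parents)
    evidence = set(evidence)
    return all(path_blocked(p, evidence, parents, children)
               for p in undirected_paths(x, y, parents, children))


sprinkler = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}
print(d_separated("S", "R", {"C"}, sprinkler))       # True
print(d_separated("S", "R", {"C", "W"}, sprinkler))  # False
```

In the second query, observing W unblocks the path S - W - R, because both arrows lead in to W and W is in E, so condition 3 no longer applies.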
Recall that a Bayesian network is composed of related (random) variables, and that a variable incorporates an exhaustive set of mutually exclusive events, exactly one of which is true. How shall we represent the two hypothesis events in a problem? Variables whose values are observable and which are relevant to the hypothesis events are called information variables. What are the information variables in a problem? In this problem we have three variables; what is the causal structure between them? Actually, the whole notion of cause, let alone determining causal structure, is very controversial. Often (but not always) your intuitive notion of causality will help you. Sometimes we need mediating variables, which are neither information variables nor hypothesis variables, to represent causal structures.
We see that the log-likelihood scoring function decomposes according to the structure of the graph, and hence we can maximize the contribution to the log-likelihood of each node independently (assuming the parameters in each node are independent of the other nodes). In cases where N (the number of training cases) is small compared to the number of parameters that require fitting, we can use a numerical prior to regularize the problem. In this case, we call the estimates Maximum A Posteriori (MAP) estimates, as opposed to Maximum Likelihood (ML) estimates.
Consider estimating the Conditional Probability Table for the W node. If we have a set of training data, we can just count the number of times the grass is wet when it is raining and the sprinkler is on, N(W=1,S=1,R=1), the number of times the grass is wet when it is raining and the sprinkler is off, N(W=1,S=0,R=1), etc. Given these counts (which are the sufficient statistics), we can find the Maximum Likelihood Estimate of the CPT as follows:
P_ML(W=w | S=s, R=r) = N(W=w,S=s,R=r) / N(S=s,R=r)
where the denominator is N(S=s,R=r) = N(W=0,S=s,R=r) + N(W=1,S=s,R=r). Thus "learning" just amounts to counting (in the case of multinomial distributions). For Gaussian nodes, we can compute the sample mean and variance, and use linear regression to estimate the weight matrix. For other kinds of distributions, more complex procedures are necessary. As is well known from the HMM literature, ML estimates of CPTs are prone to sparse data problems, which can be solved by using (mixtures of) Dirichlet priors (pseudo counts). This results in a Maximum A Posteriori (MAP) estimate. For Gaussians, we can use a Wishart prior, etc.
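The counting procedure for the W node can be sketched as follows. The training samples are made-up toy data, and `estimate_cpt_w` and `pseudo_count` are hypothetical names. Adding the same pseudo count c to every cell corresponds to the MAP estimate under a symmetric Beta (two-valued Dirichlet) prior with parameter c + 1; this is one simple choice of prior, not the only one.

```python
from collections import Counter


def estimate_cpt_w(data, pseudo_count=0.0):
    """Estimate Pr(W=1 | S=s, R=r) from (w, s, r) samples by counting.

    pseudo_count=0 gives the ML estimate N(W=1,S=s,R=r) / N(S=s,R=r).
    A positive pseudo_count is added to every cell before normalizing,
    which smooths rows that have few or no observations.
    """
    counts = Counter(data)
    cpt = {}
    for s in (0, 1):
        for r in (0, 1):
            n1 = counts[(1, s, r)] + pseudo_count
            n0 = counts[(0, s, r)] + pseudo_count
            total = n0 + n1  # N(S=s, R=r) plus any pseudo counts
            # With no data and no prior, the row is undefined.
            cpt[(s, r)] = n1 / total if total > 0 else None
    return cpt


# Toy (w, s, r) samples, invented purely for illustration.
data = [(1, 1, 1)] * 3 + [(0, 1, 1)] + [(1, 0, 1)] * 2 + [(0, 0, 0)] * 4

ml = estimate_cpt_w(data)
smoothed = estimate_cpt_w(data, pseudo_count=1.0)

print(ml[(1, 1)])        # 3 / 4 = 0.75
print(ml[(1, 0)])        # None: S=1, R=0 never occurs in the toy data
print(smoothed[(1, 0)])  # 1 / 2 = 0.5, supplied entirely by the prior
```

The (S=1, R=0) row shows the sparse data problem: the ML estimate is undefined because the denominator is zero, while the smoothed estimate falls back on the prior.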