Module-4
Course Notes - 18CS71
Artificial Intelligence and Machine Learning
Syllabus: Bayesian Learning: Introduction, Bayes theorem, Bayes theorem and concept learning, ML and LS
error hypothesis, ML for predicting probabilities, MDL principle, Bayes optimal classifier, Gibbs algorithm, Naive Bayes
classifier, BBN, EM Algorithm.
Textbook 1: Chapter 6
Textbooks:
1. Tom M Mitchell, “Machine Learning”, 1st Edition, McGraw Hill Education, 2017.
2. Elaine Rich, Kevin K and S B Nair, “Artificial Intelligence”, 3rd Edition, McGraw Hill Education, 2017.
Bayes Theorem
In machine learning we are often interested in determining the best hypothesis from some space
H, given the observed training data D. Bayes theorem provides a way to calculate the
probability of a hypothesis based on its prior probability, the probabilities of observing various
data given the hypothesis, and the observed data itself.
To define Bayes theorem precisely, let us first introduce a little notation.
• We shall write P(h) to denote the initial probability that hypothesis h holds, before we
have observed the training data. P(h) is often called the prior probability of h and may
reflect any background knowledge we have about the chance that h is a correct
hypothesis.
• Similarly, we will write P(D) to denote the prior probability that training data D will
be observed (i.e., the probability of D given no knowledge about which hypothesis holds).
• Next, we will write P(D|h) to denote the probability of observing data D given some
world in which hypothesis h holds. In general, we write P(x|y) to denote the probability
of x given y. In machine learning problems we are interested in the probability P(h|D)
that h holds given the observed training data D. P(h|D) is called the posterior probability
of h, because it reflects our confidence that h holds after we have seen the
training data D. Notice the posterior probability P(h|D) reflects the influence of the
training data D, in contrast to the prior probability P(h), which is independent of D.
Bayes theorem provides a way to calculate the posterior probability P(h|D) from the prior
probability P(h), together with P(D) and P(D|h):
P(h|D) = P(D|h) P(h) / P(D)    …(1)
As one might intuitively expect, P(h|D) increases with P(h) and with P(D|h) according to Bayes
theorem. It is also reasonable to see that P(h|D) decreases as P(D) increases, because the more
probable it is that D will be observed independent of h, the less evidence D provides in support
of h.
In many learning scenarios, the learner considers some set of candidate hypotheses H and is
interested in finding the most probable hypothesis h ∈ H given the observed data D (or at least
one of the maximally probable if there are several). Any such maximally probable hypothesis
is called a maximum a posteriori (MAP) hypothesis. We can determine the MAP hypotheses
by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.
More precisely, we will say that hMAP is a MAP hypothesis provided,
hMAP ≡ argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)    …(2)
Notice in the final step above we dropped the term P(D) because it is a constant independent
of h. In some cases, we will assume that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H). In this case we can further simplify the above equation and need only
consider the term P(D|h) to find the most probable hypothesis. P(D|h) is often called the
likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a
maximum likelihood (ML) hypothesis, hML
hML = argmax h∈H P(D|h)    …(3)
In order to make clear the connection to machine learning problems, we introduced Bayes
theorem above by referring to the data D as training examples of some target function and
referring to H as the space of candidate target functions.
Example: To illustrate Bayes rule, consider a medical diagnosis problem in which there are
two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that
the patient does not. The available data is from a particular laboratory test with two possible
outcomes: ⊕ (positive) and ⊖ (negative). We have prior knowledge that over the entire
population of people only .008 have this disease. Furthermore, the lab test is only an imperfect
indicator of the disease. The test returns a correct positive result in only 98% of the cases in
which the disease is actually present and a correct negative result in only 97% of the cases in
which the disease is not present. In other cases, the test returns the opposite result.
Suppose we now observe a new patient for whom the lab test returns a positive result. Should
we diagnose the patient as having cancer or not?
Solution: The above situation can be summarized by the following probabilities:
P(cancer) = .008            P(¬cancer) = .992
P(⊕|cancer) = .98           P(⊖|cancer) = .02
P(⊕|¬cancer) = .03          P(⊖|¬cancer) = .97
Observing a positive test result and applying Equation (2), we find
P(⊕|cancer) P(cancer) = .98 × .008 = .0078
P(⊕|¬cancer) P(¬cancer) = .03 × .992 = .0298
Thus, hMAP = ¬cancer. The exact posterior probabilities can be determined by normalizing
the above quantities so that they sum to 1 (e.g., P(cancer|⊕) = .0078/(.0078 + .0298) = .21).
This step is warranted because Bayes theorem states that the posterior probabilities are just the
above quantities divided by the probability of the data, P(⊕). Although P(⊕) was not
provided directly as part of the problem statement, we can calculate it in this fashion because
we know that P(cancer|⊕) and P(¬cancer|⊕) must sum to 1.
Notice that while the posterior probability of cancer is significantly higher than its prior
probability, the most probable hypothesis is still that the patient does not have cancer.
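The arithmetic above can be checked with a few lines of Python; the following sketch simply encodes the probabilities given in the problem statement.

```python
# Priors and test characteristics from the problem statement.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer            # 0.992
p_pos_given_cancer = 0.98              # correct positive rate
p_pos_given_not_cancer = 1 - 0.97      # 1 - correct negative rate = 0.03

# Unnormalized posteriors: P(+|h) * P(h) for each hypothesis.
joint_cancer = p_pos_given_cancer * p_cancer              # ~.0078
joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~.0298

# Normalizing by P(+) = sum of the two joint terms gives the posteriors.
p_pos = joint_cancer + joint_not_cancer
print(round(joint_cancer / p_pos, 2))      # P(cancer|+)  ~= 0.21
print(round(joint_not_cancer / p_pos, 2))  # P(~cancer|+) ~= 0.79
```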
As this example illustrates, the result of Bayesian inference depends strongly on the prior
probabilities, which must be available in order to apply the method directly. Note also that in
this example the hypotheses are not completely accepted or rejected, but rather become more
or less probable as more data is observed.
Bayes Theorem and Concept Learning
Bayes theorem can serve as the basis for a straightforward concept learning algorithm that computes
the posterior probability of each candidate hypothesis and outputs the most probable one:
BRUTE-FORCE MAP LEARNING algorithm
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax h∈H P(h|D)
This algorithm may require significant computation, because it applies Bayes theorem to each
hypothesis in H to calculate P(h|D). While this may prove impractical for large hypothesis
spaces, the algorithm is still of interest because it provides a standard against which we may
judge the performance of other concept learning algorithms.
We assume the following.
1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than any
other.
Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to
assign the same prior probability to every hypothesis h in H:
P(h) = 1/|H| for all h in H
Now, P(D|h) is the probability of observing the target values D = (d1, …, dm) for the fixed set of
instances (x1 . . . xm), given a world in which hypothesis h holds (i.e., given a world in which
h is the correct description of the target concept c). Since we assume noise-free training data,
the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi).
Therefore,
P(D|h) = 1 if di = h(xi) for all di in D
P(D|h) = 0 otherwise    …(4)
In other words, the probability of data D given hypothesis h is 1 if D is consistent with h, and
0 otherwise. Recalling Bayes theorem, we have
P(h|D) = P(D|h) P(h) / P(D)
First consider the case where h is inconsistent with the training data D. Here P(D|h) = 0 due to
Equation (4). Thus, the posterior probability of an inconsistent hypothesis is
P(h|D) = 0 · P(h) / P(D) = 0
Now consider the case where h is consistent with D. Since Equation (4) defines P(D|h) = 1
when h is consistent with D, we have
P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|
where VSH,D is the Version Space (subset of hypotheses) from H that are consistent with D.
The derivation of P(D) from the theorem of total probability is as follows:
P(D) = Σ hi∈H P(D|hi) P(hi)
     = Σ hi∈VSH,D 1 · (1/|H|)  +  Σ hi∉VSH,D 0 · (1/|H|)
     = |VSH,D| / |H|
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed
P(h) and P(D|h) is
P(h|D) = 1/|VSH,D| if h is consistent with D, and P(h|D) = 0 otherwise.
In other words, every consistent hypothesis is a MAP hypothesis, if we assume a uniform prior
probability distribution over H (i.e., P(hi) = P(hj) for all i, j), and if we assume
deterministic, noise free training data.
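As an illustration, the following sketch carries out brute-force MAP learning in this noise-free setting; the tiny threshold hypothesis space and the two training examples are invented for the sketch, not taken from the text.

```python
# Brute-force MAP learning over a toy hypothesis space (invented for this
# sketch): threshold concepts h_t(x) = (x >= t) over instances x in {0,...,4}.
hypotheses = [lambda x, t=t: x >= t for t in range(5)]

# Noise-free training data: (instance, classification) pairs.
data = [(0, False), (3, True)]

# P(D|h) is 1 for consistent hypotheses and 0 otherwise (Equation 4).
consistent = [h for h in hypotheses if all(h(x) == d for x, d in data)]

# With the uniform prior P(h) = 1/|H|, Bayes theorem gives P(h|D) = 1/|VS|
# for consistent hypotheses and 0 for the rest.
for t, h in enumerate(hypotheses):
    posterior = 1 / len(consistent) if h in consistent else 0.0
    print(f"t={t}: P(h|D) = {posterior:.3f}")   # t = 1, 2, 3 each get 1/3
```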
The Bayesian framework allows one way to characterize the behavior of learning algorithms
(e.g., FIND-S), even when the learning algorithm does not explicitly manipulate probabilities.
By identifying probability distributions P(h) and P(D|h) under which the algorithm outputs
optimal (i.e., MAP) hypotheses, we can characterize the implicit assumptions under which this
algorithm behaves optimally. Thus, Bayesian analysis can be used to show that a particular
learning algorithm outputs a MAP hypothesis even though it may not explicitly use Bayes rule
or calculate probabilities in any form.
So far we discussed a special case of Bayesian reasoning, where P(D|h) takes on values of only
0 and 1, reflecting the deterministic predictions of hypotheses and the assumption of noise-free
training data. In the next section, we model learning from noisy training data, by allowing
P(D|h) to take on values other than 0 and 1, and by introducing into P(D|h) additional
assumptions about the probability distributions that govern the noise.
Maximum Likelihood and Least-Squared Error Hypotheses
Consider the problem of learning a continuous-valued target function f, given noisy training
examples of the form di = f(xi) + ei, where ei is a random noise variable.
Before showing why a hypothesis that minimizes the sum of squared errors in this setting is
also a maximum likelihood hypothesis, let us quickly review two basic concepts from
probability theory: probability densities and Normal distributions.
Probability densities:
First, in order to discuss probabilities over continuous variables such as e, we must introduce
probability densities. The reason, roughly, is that we wish for the total probability over all
possible values of the random variable to sum to one. In the case of continuous variables we
cannot achieve this by assigning a finite probability to each of the infinite set of possible values
for the random variable. Instead, we speak of a probability density for continuous variables
such as e and require that the integral of this probability density over all possible values be one.
In general, we will use lower case p to refer to the probability density function, to distinguish
it from a finite probability P (which we will sometimes refer to as a probability mass). The
probability density p(x0) is the limit, as ε goes to zero, of 1/ε times the probability that x will take
on a value in the interval [x0, x0 + ε).
Probability density function:
p(x0) ≡ lim ε→0 (1/ε) P(x0 ≤ x < x0 + ε)
Normal distribution: A Normal (Gaussian) distribution with mean μ and standard deviation σ is
described by the probability density function
p(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²))
A Normal distribution is fully determined by two parameters in the above formula: μ and σ. If
the random variable X follows a normal distribution, then:
• The probability that X will fall into the interval (a, b) is given by ∫_a^b p(x) dx
• The expected, or mean value of X, E[X], is E[X] = μ
• The variance of X, Var(X), is Var(X) = σ²
• The standard deviation of X, σX, is σX = σ
The Central Limit Theorem states that the sum of a large number of independent, identically
distributed random variables follows a distribution that is approximately Normal.
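The following sketch defines the Normal density just given and numerically checks the listed properties; the values μ = 2.0 and σ = 1.5 are arbitrary choices for the illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density p(x) of a Normal distribution with mean mu and std dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

mu, sigma, dx = 2.0, 1.5, 0.001
xs = [mu - 8 * sigma + i * dx for i in range(int(16 * sigma / dx))]

# Riemann-sum checks of the listed properties: total mass, mean, variance.
mass = sum(normal_pdf(x, mu, sigma) * dx for x in xs)
mean = sum(x * normal_pdf(x, mu, sigma) * dx for x in xs)
var = sum((x - mu) ** 2 * normal_pdf(x, mu, sigma) * dx for x in xs)
print(round(mass, 4), round(mean, 4), round(var, 4))  # ~1.0, ~mu, ~sigma**2
```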
Prove: The maximum likelihood hypothesis hML minimizes the sum of the squared errors between
the observed training values di and the hypothesis predictions h(xi).
Proof: From equation (3) we have
hML = argmax h∈H p(D|h)
Let the set of training instances be (x1, …, xm), and consider the data D to be the
corresponding sequence of target values D = (d1, …, dm), where di = f(xi) + ei. Assuming the
training examples are mutually independent given h, we can write P(D|h) as the product of the
various p(di|h):
hML = argmax h∈H ∏ i=1..m p(di|h)
Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ²,
each di must also obey a Normal distribution with variance σ² centered around the true target
value f(xi) rather than zero. Therefore p(di|h) can be written as a Normal distribution with
variance σ² and mean μ = f(xi). Let us write the formula for this Normal distribution to describe
p(di|h), using the general formula for a Normal distribution and substituting the appropriate μ
and σ². Because we are writing the expression for the probability of di given that h is the correct
description of the target function f, we will also substitute μ = f(xi) = h(xi), yielding
hML = argmax h∈H ∏ i=1..m (1/√(2πσ²)) e^(−(di − h(xi))²/(2σ²))
We now apply a transformation common in maximum likelihood calculations: rather than maximizing
this complicated expression we maximize its (less complicated) logarithm, which is justified
because ln p is a monotonic function of p:
hML = argmax h∈H Σ i=1..m [ ln(1/√(2πσ²)) − (di − h(xi))²/(2σ²) ]
The first term in this expression is a constant independent of h, and can therefore be discarded,
yielding
hML = argmax h∈H Σ i=1..m −(di − h(xi))²/(2σ²)
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity,
and the constant factor 1/(2σ²) can also be discarded, yielding
hML = argmin h∈H Σ i=1..m (di − h(xi))²
The above equation shows that the maximum likelihood hypothesis hML is the one that minimizes
the sum of the squared errors between the observed training values di and the hypothesis
predictions h(xi).
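A quick numerical illustration of this equivalence: for data corrupted by Gaussian noise, the hypothesis that maximizes the log likelihood is exactly the one that minimizes the sum of squared errors. The target function f(x) = 2x, the noise level, and the grid of candidate slopes below are all invented for the sketch.

```python
import math, random

random.seed(0)
sigma = 1.0
xs = [i / 10 for i in range(50)]
ds = [2.0 * x + random.gauss(0, sigma) for x in xs]   # d_i = f(x_i) + e_i

# Candidate hypotheses: h_w(x) = w * x for a grid of slopes w in [1.5, 2.5].
ws = [w / 100 for w in range(150, 251)]

def log_likelihood(w):
    # sum_i ln p(d_i|h_w) under Normal noise with variance sigma^2
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2) for x, d in zip(xs, ds))

def squared_error(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

w_ml = max(ws, key=log_likelihood)
w_ls = min(ws, key=squared_error)
print(w_ml == w_ls, w_ml)  # True: the same w maximizes likelihood, minimizes SSE
```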
Limitations: The above analysis considers noise only in the target value of the training
example and does not consider noise in the attributes describing the instances themselves.
Maximum Likelihood Hypotheses for Predicting Probabilities
Consider the setting in which we wish to learn a nondeterministic (probabilistic) function
f : X → {0, 1}. In this case we wish to learn the target function f' : X → [0, 1] such that
f'(x) = P(f(x) = 1). For example, in a medical
patient example, if x is one of those indistinguishable patients of which 92% survive, then f'(x)
= 0.92 whereas the probabilistic function f (x) will be equal to 1 in 92% of cases and equal to
0 in the remaining 8%.
How can we learn f' using, say, a neural network? One obvious, brute-force way would be to
first collect the observed frequencies of 1's and 0's for each possible value of x and then train
the neural network to output the target frequency for each x. As we shall see below, we can
instead train a neural network directly from the observed training examples of f, yet still derive
a maximum likelihood hypothesis for f'.
What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in
this setting? To answer this question, we must first obtain an expression for P(D|h). Let us
assume the training data D is of the form D = {(x1, d1), …, (xm, dm)}, where di is the observed 0
or 1 value for f(xi). Recall that in the maximum likelihood, least-squared error analysis of the
previous section, we made the simplifying assumption that the instances (x1, …, xm) were fixed.
This enabled us to characterize the data by considering only the target values di. Although we
could make a similar simplifying assumption in this case, let us avoid it here in order to
demonstrate that it has no impact on the final outcome. Thus, treating both xi and di as random
variables, and assuming that each training example is drawn independently, we can write
P(D|h) as
P(D|h) = ∏ i=1..m P(xi, di|h)    …(8)
It is reasonable to assume, furthermore, that the probability of encountering any particular
instance xi is independent of the hypothesis h, so P(xi, di|h) = P(di|h, xi) P(xi).
Now what is the probability P(di | h, xi) of observing di = 1 for a single instance xi, given a
world in which hypothesis h holds? Recall that h is our hypothesis regarding the target function,
which computes this very probability.
Therefore, P(di = 1|h, xi) = h(xi), and in general
P(di|h, xi) = h(xi) if di = 1
P(di|h, xi) = 1 − h(xi) if di = 0    …(9)
In order to substitute for P(D|h) in Equation (8), let us first re-express Equation (9) in a more
mathematically manipulable form, as
P(di|h, xi) = h(xi)^di (1 − h(xi))^(1−di)    …(10)
Substituting Equation (10) into Equation (8), we obtain
P(D|h) = ∏ i=1..m h(xi)^di (1 − h(xi))^(1−di) P(xi)    …(11)
Now we write an expression for the maximum likelihood hypothesis. The term P(xi) is a constant
independent of h, so it can be dropped:
hML = argmax h∈H ∏ i=1..m h(xi)^di (1 − h(xi))^(1−di)    …(12)
The expression on the right side of Equation (12) can be seen as a generalization of the
Binomial distribution. The expression in Equation (12) describes the probability that flipping
each of m distinct coins will produce the outcome (d1, …, dm), assuming that each coin xi has
probability h(xi) of producing a heads. Note the Binomial distribution is similar, but makes the
additional assumption that the coins have identical probabilities of turning up heads (i.e., that
h(xi) = h(xj), for every i, j). In both cases we assume the outcomes of the coin flips are mutually
independent, an assumption that fits our current setting.
As in earlier cases, we will find it easier to work with the log of the likelihood, yielding
hML = argmax h∈H Σ i=1..m [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]    …(13)
Equation (13) describes the quantity that must be maximized in order to obtain the maximum
likelihood hypothesis in our current problem setting. This result is analogous to our earlier
result showing that minimizing the sum of squared errors produces the maximum likelihood
hypothesis in the earlier problem setting. Note the similarity between Equation (13) and the
general form of the entropy function, −Σi pi log pi, discussed in Chapter 3. Because of this
similarity, the negation of the above quantity is sometimes called the cross entropy.
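A small sketch of Equation (13): given observed boolean targets and two candidate probability-predicting hypotheses (all values invented for the example), the hypothesis with the larger value of this sum, equivalently the smaller cross entropy, is the maximum likelihood choice.

```python
import math

# Observed boolean targets d_i and two candidate hypotheses, each giving
# h(x_i) = the predicted probability that d_i = 1 (illustrative values).
d = [1, 1, 0, 1, 0, 1, 1, 0]
h_a = [0.9, 0.8, 0.2, 0.7, 0.3, 0.8, 0.9, 0.1]
h_b = [0.6, 0.6, 0.5, 0.6, 0.5, 0.6, 0.6, 0.4]

def log_likelihood(h):
    # Equation (13): sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))
    return sum(di * math.log(hi) + (1 - di) * math.log(1 - hi)
               for di, hi in zip(d, h))

for name, h in (("h_a", h_a), ("h_b", h_b)):
    ll = log_likelihood(h)
    print(name, round(ll, 3), "cross entropy:", round(-ll, 3))
# h_a has the larger log likelihood (smaller cross entropy), so h_a = h_ML here.
```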
MINIMUM DESCRIPTION LENGTH (MDL) PRINCIPLE
Recall Occam's razor, a popular inductive bias that can be summarized as "choose the shortest
explanation for the observed data." There have been
several arguments in the long-standing debate regarding Occam's razor. Here we consider a
Bayesian perspective on this issue and a closely related principle called the Minimum
Description Length (MDL) principle.
The Minimum Description Length principle is motivated by interpreting the definition of hMAP in
the light of basic concepts from information theory. Consider again the now familiar definition
of hMAP:
hMAP = argmax h∈H P(D|h) P(h)
which is equivalent to maximizing the log, or equivalently minimizing the negative log:
hMAP = argmax h∈H [ log2 P(D|h) + log2 P(h) ]
     = argmin h∈H [ −log2 P(D|h) − log2 P(h) ]    …(16)
Above equation can be interpreted as a statement that short hypotheses are preferred, assuming
a particular representation scheme for encoding hypotheses and data.
To explain this, let us introduce a basic result from information theory: Consider the problem
of designing a code to transmit messages drawn at random, where the probability of
encountering message i is pi. We are interested here in the most compact code; that is, we are
interested in the code that minimizes the expected number of bits we must transmit in order to
encode a message drawn at random. Clearly, to minimize the expected code length we should
assign shorter codes to messages that are more probable. Shannon and Weaver (1949) showed
that the optimal code (i.e., the code that minimizes the message length) assigns -log2 pi bits to
encode message i. We will refer to the number of bits required to encode message i using code
C as the description length of message i with respect to C, which we denote by LC(i).
Let us interpret the above equation for hMAP in light of this result from coding theory:
• −log2 P(h) is the description length of h under the optimal encoding for the hypothesis
space H; that is, LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
• −log2 P(D|h) is the description length of the training data D given hypothesis h, under its
optimal encoding; that is, LCD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for
describing data D assuming that both the sender and receiver know the hypothesis h.
Therefore, hMAP = argmin h∈H [ LCH(h) + LCD|h(D|h) ].
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths. Of course, to apply this principle in
practice we must choose specific encodings or representations appropriate for the given
learning task. Assuming we use the codes C1 and C2 to represent the hypothesis and the data
given the hypothesis, we can state the MDL principle as
hMDL = argmin h∈H [ LC1(h) + LC2(D|h) ]    …(17)
The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses, CH,
and if we choose C2 to be the optimal encoding of the data given a hypothesis, CD|h, then
hMDL = hMAP.
Intuitively, we can think of the MDL principle as recommending the shortest method for re-
encoding the training data, where we count both the size of the hypothesis and any additional
cost of encoding the data given this hypothesis.
MDL principle provides a way of trading off hypothesis complexity for the number of errors
committed by the hypothesis. It might select a shorter hypothesis that makes a few errors over
a longer hypothesis that perfectly classifies the training data. Viewed in this light, it provides
one method for dealing with the issue of overfitting the data.
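A toy numeric sketch of this trade-off; all bit counts are invented for illustration, and the bits() helper shows the −log2 pi coding result quoted above.

```python
import math

def bits(p):
    """Optimal code length, in bits, for a message of probability p (-log2 p)."""
    return -math.log2(p)

# Toy comparison (all bit counts invented): a short hypothesis that
# misclassifies a few examples vs. a longer one that fits D perfectly.
L_h_short, L_errors_short = 10.0, 6.0   # L_C1(h) and L_C2(D|h) in bits
L_h_long, L_errors_long = 25.0, 0.0

print("short hypothesis total:", L_h_short + L_errors_short)  # 16.0 bits
print("long hypothesis total: ", L_h_long + L_errors_long)    # 25.0 bits
# MDL prefers the 16-bit encoding, trading a few errors for a shorter hypothesis.

print(bits(0.25))  # a message of probability 1/4 gets a 2-bit code
```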
BAYES OPTIMAL CLASSIFIER
So far we have asked which is the most probable hypothesis given the training data. A closely
related and often more significant question is: what is the most probable classification of a new
instance given the training data? In general, the most probable classification of the new instance
is obtained by combining the predictions of all hypotheses, weighted by their posterior
probabilities. If the possible classification of the new example can take on any value vj from
some set V, then the probability P(vj|D) that the correct classification for the new instance is vj
is just
P(vj|D) = Σ hi∈H P(vj|hi) P(hi|D)
The optimal classification of the new instance is the value vj for which P(vj|D) is maximum:
Bayes optimal classification:  argmax vj∈V Σ hi∈H P(vj|hi) P(hi|D)    …(6.18)
Any system that classifies new instances according to Equation (6.18) is called a Bayes
optimal classifier, or Bayes optimal learner. No other classification method using the same
hypothesis space and same prior knowledge can outperform this method on average. This
method maximizes the probability that the new instance is classified correctly, given the
available data, hypothesis space, and prior probabilities over the hypotheses.
The labeling of instances defined in this way need not correspond to the instance labeling of
any single hypothesis h from H. One way to view this situation is to think of the Bayes
optimal classifier as effectively considering a hypothesis space H' different from the space of
hypotheses H to which Bayes theorem is being applied. In particular, H' effectively includes
hypotheses that perform comparisons between linear combinations of predictions from
multiple hypotheses in H.
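The following sketch implements Equation (6.18) for a hypothetical three-hypothesis posterior (the values .4, .3, .3 are illustrative, not from the text); it also shows the point just made, since the resulting classification differs from that of the single MAP hypothesis h1.

```python
# Bayes optimal classification (Equation 6.18): weight each hypothesis's
# prediction by its posterior. Posteriors and predictions are illustrative.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # P(v|h) = 1 for the voted v

def bayes_optimal(values=("+", "-")):
    # score[v] = sum over hypotheses of P(v|h_i) * P(h_i|D)
    score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
             for v in values}
    return max(score, key=score.get), score

print(bayes_optimal())  # '-' wins with weight 0.6, although h_MAP = h1 says '+'
```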
GIBBS ALGORITHM
Although the Bayes optimal classifier obtains the best performance that can be achieved from
the given training data, it can be quite costly to apply. The expense is due to the fact that it
computes the posterior probability for every hypothesis in H and then combines the
predictions of each hypothesis to classify each new instance. An alternative, less optimal
method is the Gibbs algorithm (see Opper and Haussler 1991), defined as follows:
1. Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
2. Use h to predict the classification of the next instance x.
Given a new instance to classify, the Gibbs algorithm simply applies a hypothesis drawn at
random according to the current posterior probability distribution. Surprisingly, it can be
shown that under certain conditions the expected misclassification error for the Gibbs
algorithm is at most twice the expected error of the Bayes optimal classifier (Haussler et al.
1994). More precisely, the expected value is taken over target concepts drawn at random
according to the prior probability distribution assumed by the learner.
Under this condition, the expected value of the error of the Gibbs algorithm is at worst twice
the expected value of the error of the Bayes optimal classifier. This result has an interesting
implication for the concept learning problem described earlier. In particular, it implies that if
the learner assumes a uniform prior over H, and if target concepts are in fact drawn from such
a distribution when presented to the learner, then classifying the next instance according to a
hypothesis drawn at random from the current version space (according to a uniform
distribution), will have expected error at most twice that of the Bayes optimal classifier.
Again, we have an example where a Bayesian analysis of a non-Bayesian algorithm yields
insight into the performance of that algorithm.
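For comparison, a sketch of the Gibbs algorithm over the same illustrative posterior used above: each classification applies a single hypothesis drawn according to P(h|D).

```python
import random

random.seed(1)
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # same illustrative posterior
predictions = {"h1": "+", "h2": "-", "h3": "-"}

def gibbs_classify():
    # Draw a single hypothesis according to P(h|D) and use it alone.
    h, = random.choices(list(posteriors), weights=list(posteriors.values()))
    return predictions[h]

votes = [gibbs_classify() for _ in range(10000)]
print(votes.count("-") / len(votes))  # ~0.6: each draw picks h2 or h3 w.p. .6
```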
NAIVE BAYES CLASSIFIER
The naive Bayes classifier applies to learning tasks where each instance x is described by a
conjunction of attribute values and the target function f(x) can take on any value from some
finite set V. A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values (a1, a2, ..., an).
The Bayesian approach to classifying the new instance is to assign the most probable target
value, vMAP, given the attribute values (a1, a2, …, an) that describe the instance.
vMAP = argmax vj∈V P(vj | a1, a2, …, an)
Using Bayes theorem, this expression can be rewritten as
vMAP = argmax vj∈V P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
     = argmax vj∈V P(a1, a2, …, an | vj) P(vj)    …(19)
Now we could attempt to estimate the two terms in Equation (19) based on the training data. It
is easy to estimate each of the P(vj) simply by counting the frequency with which each target
value vj occurs in the training data. However, estimating the different P(al, a2, ... an | vj) terms
in this fashion is not feasible unless we have a very, very large set of training data. (The problem
is that the no. of these terms = no. of possible instances * no. of possible target values.)
The naive Bayes classifier is based on the simplifying assumption that the attribute values are
conditionally independent given the target value. In other words, the assumption is that given
the target value of the instance, the probability of observing the conjunction a1, a2, …, an is
just the product of the probabilities for the individual attributes: P(a1, a2, …, an | vj) = Πi P(ai |
vj). Substituting this into Equation (19), we have the approach used by the naive Bayes
classifier.
Naive Bayes classifier:
vNB = argmax vj∈V P(vj) Πi P(ai | vj)    …(20)
where vNB denotes the target value output by the naive Bayes classifier. (Note that the number
of distinct P(ai|vj) terms that must be estimated is just the number of distinct attribute values
times the number of distinct target values, a far smaller number.)
To summarize, the naive Bayes learning method involves a learning step in which the various
P(vj) and P(ai|vj) terms are estimated, based on their frequencies over the training data. The set
of these estimates corresponds to the learned hypothesis. This hypothesis is then used to
classify each new instance by applying the rule in Equation (20).
One interesting difference between the naive Bayes learning method and other learning
methods we have considered is that there is no explicit search through the space of possible
hypotheses. Instead, the hypothesis is formed without searching, simply by counting the
frequency of various data combinations within the training examples.
Illustration: Consider the following data.
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Let us use the naive Bayes classifier and the training data from this table to classify the
following novel instance:
(Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new
instance. Instantiating Equation (20) to fit the current task, the target value vNB is given by
vNB = argmax vj∈{yes,no} P(vj) P(Outlook = sunny|vj) P(Temperature = cool|vj)
      P(Humidity = high|vj) P(Wind = strong|vj)
The probabilities of the different target values can easily be estimated based on their
frequencies over the 14 training examples:
P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36
Similarly, we can estimate the conditional probabilities; for example, those for Wind = strong are
P(Wind = strong | PlayTennis = yes) = 3/9 = .33
P(Wind = strong | PlayTennis = no) = 3/5 = .60
Using these and the analogous estimates for the remaining attribute values, we obtain
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206
Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance,
based on the probability estimates learned from the training data.
Furthermore, by normalizing the above quantities to sum to one we can calculate the
conditional probability that the target value is no, given the observed attribute values. For the
current example, this probability is .0206 / (.0206 + .0053) = .795.
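The entire worked example can be reproduced with a short sketch over the 14-row table above; probabilities are estimated as simple frequencies, as in the text, and the printed values match the .0053, .0206, and .795 computed above.

```python
# The 14 PlayTennis examples: (Outlook, Temp, Humidity, Wind, PlayTennis).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),   ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def p_label(v):                 # P(vj): relative frequency of the target value
    return sum(row[-1] == v for row in data) / len(data)

def p_attr(i, a, v):            # P(ai|vj): frequency among rows with label v
    rows = [row for row in data if row[-1] == v]
    return sum(row[i] == a for row in rows) / len(rows)

query = ("Sunny", "Cool", "High", "Strong")
scores = {}
for v in ("Yes", "No"):
    s = p_label(v)
    for i, a in enumerate(query):   # Equation (20): P(vj) * prod_i P(ai|vj)
        s *= p_attr(i, a, v)
    scores[v] = s

print(scores)                               # {'Yes': ~0.0053, 'No': ~0.0206}
print(scores["No"] / sum(scores.values()))  # ~0.795
```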
Estimating Probabilities: In the above computations we estimated conditional probabilities by
the fraction of observed counts, e.g.
P(Wind = strong | PlayTennis = no) = nc/n = 3/5
where n = 5 is the total number of training examples for which PlayTennis = no, and nc = 3 is the
number of these for which Wind = strong. This fraction provides a good estimate of the
probability in many cases, but the estimate is poor when n is very small or nc is 0. There are two
difficulties. First, nc/n produces a biased underestimate of the probability. Second, when this
probability estimate is zero, this probability term will dominate the Bayes classifier if the future
query contains Wind = strong. The reason is that the quantity calculated in Equation (20)
requires multiplying all the other probability terms by this zero value.
To avoid this difficulty, we can adopt a Bayesian approach to estimating the probability, using
the m-estimate defined as follows.
m-estimate of probability:
(nc + m·p) / (n + m)    …(22)
Here, nc and n are defined as before, p is our prior estimate of the probability we wish to
determine, and m is a constant called the equivalent sample size, which determines how heavily
to weight p relative to the observed data.
A typical method for choosing p in the absence of other information is to assume uniform
priors; that is, if an attribute has k possible values we set p = 1/k. For example, in estimating
P(Wind = strong | PlayTennis = no) we note the attribute Wind has two possible values, so
uniform priors would correspond to choosing p = .5. Note that if m is zero, the m-estimate is
equivalent to the simple fraction nc/n. If both n and m are nonzero, then the observed fraction
nc/n and the prior p will be combined according to the weight m. The reason m is called the
equivalent sample size is that Equation (22) can be interpreted as augmenting the n actual
observations by an additional m virtual samples distributed according to p.
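A one-function sketch of this estimator may help; the value m = 4 below is an arbitrary illustration (the text leaves m as a design choice).

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability (Equation 22): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Estimating P(Wind = strong | PlayTennis = no): n_c = 3, n = 5.
# Wind has k = 2 values, so the uniform prior is p = 1/2.
print(m_estimate(3, 5, 0.5, m=0))  # 0.6    -> with m = 0 this is just n_c/n
print(m_estimate(3, 5, 0.5, m=4))  # 0.555... -> pulled toward the prior p = .5
```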
BAYESIAN BELIEF NETWORKS
A Bayesian belief network describes the probability distribution governing a set of variables by
specifying a set of conditional independence assumptions along with a set of conditional
probabilities. It relies on the following definition: let X, Y, and Z be three discrete-valued
random variables. We say that X is conditionally independent of Y given Z if the probability
distribution governing X is independent of the value of Y given a value for Z; that is, if
P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z). We commonly write the above expression in
abbreviated form as P(X|Y, Z) = P(X|Z). This definition of conditional independence can be
extended to sets of variables as well. We say that the set of variables X1 … Xl is conditionally
independent of the set of variables Y1 … Ym given the set of variables Z1 … Zn, if
P(X1 … Xl | Y1 … Ym, Z1 … Zn) = P(X1 … Xl | Z1 … Zn)
Note the correspondence between this definition and our use of conditional independence in
the definition of the naive Bayes classifier. The naive Bayes classifier assumes that the instance
attribute A1 is conditionally independent of instance attribute A2 given the target value V. This
allows the naive Bayes classifier to calculate P(A1, A2|V) in Equation (20) as follows:
P(A1, A2|V) = P(A1|A2, V) P(A2|V)    …(23)
            = P(A1|V) P(A2|V)    …(24)
Equation (23) is just the general form of the product rule of probability from Table 6.1.
Equation (24) follows because if A1 is conditionally independent of A2 given V, then by our
definition of conditional independence P(A1|A2, V) = P(A1|V).
Representation
A Bayesian belief network (Bayesian network for short) represents the joint probability
distribution for a set of variables. For example, the Bayesian network in Figure 6.3 represents
the joint probability distribution over the boolean variables Storm, Lightning, Thunder,
ForestFire, Campfire, and BusTourGroup. In general, a Bayesian network represents the joint
probability distribution by specifying a set of conditional independence assumptions
(represented by a directed acyclic graph), together with sets of local conditional probabilities.
Each variable in the joint space is represented by a node in the Bayesian network.
For each variable two types of information are specified.
1. First, the network arcs represent the assertion that the variable is conditionally
independent of its non-descendants in the network given its immediate predecessors in
the network. We say X is a descendant of Y if there is a directed path from Y to X.
2. Second, a conditional probability table is given for each variable, describing the
probability distribution for that variable given the values of its immediate predecessors.
The joint probability for any desired assignment of values (y1, …, yn) to the tuple of
network variables (Y1, …, Yn) can be computed by the formula
P(y1, …, yn) = ∏ i=1..n P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network. Note
the values of P(yi | Parents(Yi)) are precisely the values stored in the conditional
probability table associated with node Yi.
To illustrate, the Bayesian network in Figure 6.3 represents the joint probability distribution
over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and
BusTourGroup. Consider the node Campfire. The network nodes and arcs represent the
assertion that Campfire is conditionally independent of its non-descendants Lightning and
Thunder, given its immediate parents Storm and BusTourGroup. This means that once we
know the value of the variables Storm and BusTourGroup, the variables Lightning and Thunder
provide no additional information about Campfire. The right side of the figure shows the
conditional probability table associated with the variable Campfire. The top left entry in this
table, for example, expresses the assertion that
P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
Note this table provides only the conditional probabilities of Campfire given its parent variables
Storm and BusTourGroup. The set of local conditional probability tables for all the variables,
together with the set of conditional independence assumptions described by the network,
describe the full joint probability distribution for the network.
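The following sketch computes this product of local conditional probabilities for a three-variable fragment of the Figure 6.3 network (Storm and BusTourGroup as parents of Campfire). Only the 0.4 table entry is taken from the text; every other probability is a hypothetical placeholder.

```python
from itertools import product

# Joint probability via P(y1,...,yn) = prod_i P(yi | Parents(Yi)).
# Only the 0.4 entry is quoted in the text; all other numbers are placeholders.
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.1, False: 0.9}
p_campfire = {  # P(Campfire = True | Storm, BusTourGroup)
    (True, True): 0.4,                        # the table entry quoted above
    (True, False): 0.1, (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    pc = p_campfire[(storm, bus)]
    return p_storm[storm] * p_bus[bus] * (pc if campfire else 1 - pc)

print(joint(True, True, True))  # 0.2 * 0.1 * 0.4 ~= 0.008
# Sanity check: the joint distribution sums to 1 over all 8 assignments.
print(sum(joint(s, b, c) for s, b, c in product([True, False], repeat=3)))
```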
One attractive feature of Bayesian belief networks is that they allow a convenient way to
represent causal knowledge such as the fact that Lightning causes Thunder. In the terminology
of conditional independence, we express this by stating that Thunder is conditionally
independent of other variables in the network, given the value of Lightning.
Inference
We might wish to use a Bayesian network to infer the value of some target variable (e.g.,
ForestFire) given the observed values of the other variables. Of course, given that we are
dealing with random variables it will not generally be correct to assign the target variable a
single determined value. What we really wish to infer is the probability distribution for the
target variable, which specifies the probability that it will take on each of its possible
values given the observed values of the other variables. This inference step can be
straightforward if values for all of the other variables in the network are known exactly. In the
more general case we may wish to infer the probability distribution for some variable (e.g.,
ForestFire) given observed values for only a subset of the other variables (e.g., Thunder and
BusTourGroup may be the only observed values available).
In general, a Bayesian network can be used to compute the probability distribution for any
subset of network variables given the values or distributions for any subset of the remaining
variables.
The EM Algorithm
In many practical learning settings, only a subset of the relevant instance features might be
observable. For example, in training or using the Bayesian belief network, we might have data
where only a subset of the network variables Storm, Lightning, Thunder, ForestFire, Campfire,
and BusTourGroup have been observed. Many approaches have been proposed to handle the
problem of learning in the presence of unobserved variables. If some variable is sometimes
observed and sometimes not, then we can use the cases for which it has been observed to learn
to predict its values when it is not.
In this section we describe the EM algorithm (Dempster et al. 1977), a widely used approach
to learning in the presence of unobserved variables. The EM algorithm can be used even for
variables whose value is never directly observed, provided the general form of the probability
distribution governing these variables is known.
Application: The EM algorithm has been used to train Bayesian belief networks (Heckerman
1995) as well as radial basis function neural networks. The EM algorithm is also the basis for
many unsupervised clustering algorithms (e.g., Cheeseman et al. 1988), and it is the basis for
the widely used Baum-Welch forward-backward algorithm for learning Partially Observable
Markov Models (Rabiner 1989).
Estimating Means of k Gaussians
The easiest way to introduce the EM algorithm is via an example. Consider a problem in which
the data D is a set of instances generated by a probability distribution that is a mixture of k
distinct Normal distributions. This problem setting is illustrated in Figure 6.4 for the case where
k = 2 and where the instances are the points shown along the x axis. Each instance is generated
using a two-step process. First, one of the k Normal distributions is selected at random. Second,
a single random instance xi is generated according to this selected distribution.
This process is repeated to generate a set of data points as shown in the figure. To simplify
our discussion, we consider the special case where the selection of the single Normal
distribution at each step is based on choosing each with uniform probability, where each of the
k Normal distributions has the same known variance σ². The learning task is to output a
hypothesis h = (μ1, …, μk)
that describes the means of each of the k distributions. We would like to find a maximum
likelihood hypothesis for these means; that is, a hypothesis h that maximizes p(D |h).
Note it is easy to calculate the maximum likelihood hypothesis for the mean of a single Normal
distribution given the observed data instances x1, x2, . . . , xm drawn from this single distribution.
Earlier we showed that the maximum likelihood hypothesis is the one that minimizes
the sum of squared errors over the m training instances. Now the problem of finding the mean
of a single distribution is just a special case of the problem discussed. Restating using our
current notation, we have
μML = argmin μ Σ i=1..m (xi − μ)²    …(6.27)
In this case, the sum of squared errors is minimized by the sample mean
μML = (1/m) Σ i=1..m xi    …(6.28)
Our problem here, however, involves a mixture of k different Normal distributions, and we
cannot observe which instances were generated by which distribution. Thus, we have a
prototypical example of a problem involving hidden variables. In the example of Figure 6.4,
we can think of the full description of each instance as the triple (xi, zi1, zi2), where xi is the
observed value of the ith instance and where zi1 and zi2 indicate which of the two Normal
distributions was used to generate the value xi. In particular, zij has the value 1 if xi was created
by the jth Normal distribution and 0 otherwise. Here xi is the observed variable in the description
of the instance, and zi1 and zi2 are hidden variables. If the values of zi1 and zi2 were observed,
we could use Equation (6.27) to solve for the means μ1 and μ2. Because they are not, we will
instead use the EM algorithm.
Applied to our k-means problem the EM algorithm searches for a maximum likelihood
hypothesis by repeatedly re-estimating the expected values of the hidden variables zij given its
current hypothesis (μ1 . . . μ k), then recalculating the maximum likelihood hypothesis using
these expected values for the hidden variables.
We will first describe this instance of the EM algorithm, and later state the EM algorithm in its
general form.
Applied to the problem of estimating the two means for Figure 6.4, the EM algorithm first
initializes the hypothesis to h = (μ1, μ2), where μ1 and μ2 are arbitrary initial values. It then
iteratively re-estimates h by repeating the following two steps until the procedure converges to
a stationary value for h.
Step 1: Calculate the expected value E[zij] of each hidden variable zij, assuming the current
hypothesis h = (μ1, μ2) holds.
Step 2: Calculate a new maximum likelihood hypothesis h' = (μ'1, μ'2), assuming the value taken
on by each hidden variable zij is its expected value E[zij] calculated in Step 1. Then replace the
hypothesis h = (μ1, μ2) by the new hypothesis h' = (μ'1, μ'2) and iterate.
Let us examine how both of these steps can be implemented in practice. Step 1 must calculate
the expected value of each zij. This E[zij] is just the probability that instance xi was generated
by the jth Normal distribution:
E[zij] = p(x = xi | μ = μj) / Σ n=1..2 p(x = xi | μ = μn)
       = e^(−(xi−μj)²/(2σ²)) / Σ n=1..2 e^(−(xi−μn)²/(2σ²))
Thus, the first step is implemented by substituting the current values (μ1, μ2) and the observed
xi into the above expression.
In the second step we use the E[zij] calculated during Step 1 to derive a new maximum
likelihood hypothesis h' = (μ'1, μ'2). As we will discuss later, the maximum likelihood
hypothesis in this case is given by
μj ← Σ i=1..m E[zij] xi / Σ i=1..m E[zij]
Note this expression is similar to the sample mean from Equation (6.28) that is used to estimate
μ for a single Normal distribution. Our new expression is just the weighted sample mean for
μj, with each instance weighted by the expectation E[zij] that it was generated by the jth Normal
distribution.
The above algorithm for estimating the means of a mixture of k Normal distributions illustrates
the essence of the EM approach: The current hypothesis is used to estimate the unobserved
variables, and the expected values of these variables are then used to calculate an
improved hypothesis. It can be proved that on each iteration through this loop, the EM
algorithm increases the likelihood P(D|h) unless it is at a local maximum. The algorithm thus
converges to a local maximum likelihood hypothesis for (μ1, μ2).
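The two steps can be sketched directly in Python for k = 2; the true means (0 and 4), the sample sizes, and the initialization at the sample extremes below are invented for the illustration.

```python
import math, random

random.seed(0)
sigma = 1.0
# Sample data from a mixture of two Normals (true means 0 and 4, illustrative).
xs = [random.gauss(0.0, sigma) for _ in range(100)] + \
     [random.gauss(4.0, sigma) for _ in range(100)]

mu = [min(xs), max(xs)]          # arbitrary (but distinct) initial hypothesis h
for _ in range(50):
    # Step 1 (E): E[z_ij] = p(x_i|mu_j) / sum_n p(x_i|mu_n); the shared
    # factor 1/sqrt(2*pi*sigma^2) cancels in the ratio.
    def e_z(x, j):
        ps = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
        return ps[j] / sum(ps)
    # Step 2 (M): mu_j <- weighted sample mean, weighted by E[z_ij].
    mu = [sum(e_z(x, j) * x for x in xs) / sum(e_z(x, j) for x in xs)
          for j in range(2)]

print([round(m, 2) for m in mu])   # converges near the true means 0 and 4
```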
General Statement of EM Algorithm
More generally, let Y = X ∪ Z denote the full data, where X denotes the observed data and Z the
unobserved data, and let h denote the current hypothesized values of the parameters θ. The EM
algorithm searches for the maximum likelihood hypothesis h' by seeking the h' that maximizes
E[ln P(Y|h')]. This expected value is taken over the probability distribution
governing Y, which is determined by the unknown parameters θ. Let us consider exactly what
this expression signifies. First, P(Y|h’) is the likelihood of the full data Y given hypothesis h'.
It is reasonable that we wish to find a h' that maximizes some function of this quantity. Second,
maximizing the logarithm of this quantity ln(P(Y|h’)) also maximizes P(Y|h’), as we have
discussed on several occasions already. Third, we introduce the expected value E[ln P(Y|h’)]
because the full data Y is itself a random variable. Given that the full data Y is a combination
of the observed data X and unobserved data Z, we must average over the possible values of the
unobserved Z, weighting each according to its probability. In other words we take the expected
value E[ln P(Y|h')] over the probability distribution governing the random variable Y. The
distribution governing Y is determined by the completely known values for X, plus the
distribution governing Z.
What is the probability distribution governing Y? In general, we will not know this distribution
because it is determined by the parameters θ that we are trying to estimate. Therefore, the EM
algorithm uses its current hypothesis h in place of the actual parameters θ to estimate the
distribution governing Y. Let us define a function Q(h’|h) that gives E[ln P(Y |h')] as a function
of h', under the assumption that θ = h and given the observed portion X of the full data Y.
We write this function Q in the form Q(h'|h) to indicate that it is defined in part by the
assumption that the current hypothesis h is equal to θ. In its general form, the EM algorithm
repeats the following two steps until convergence:
Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X
to estimate the probability distribution over Y: Q(h'|h) ← E[ln P(Y|h') | h, X].
Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q
function: h ← argmax h' Q(h'|h).
When the function Q is continuous, the EM algorithm converges to a stationary point of the
likelihood function P(Y|h'). When this likelihood function has a single maximum, EM will
converge to this global maximum likelihood estimate for h'. Otherwise, it is guaranteed only to
converge to a local maximum. In this respect, EM shares some of the same limitations as other
optimization methods such as gradient descent, line search, and conjugate gradient discussed
in Chapter 4.
Derivation of the k Means Algorithm
For the problem of estimating the means of a mixture of k Normal distributions, the full
description of each instance is yi = (xi, zi1, …, zik), so
ln P(Y|h') = Σ i=1..m ln p(yi|h') = Σ i=1..m [ ln(1/√(2πσ²)) − (1/2σ²) Σ j=1..k zij (xi − μ'j)² ]
Because this expression is linear in the hidden variables zij, its expected value is obtained
simply by substituting E[zij] for zij:
Q(h'|h) = E[ln P(Y|h')] = Σ i=1..m [ ln(1/√(2πσ²)) − (1/2σ²) Σ j=1..k E[zij] (xi − μ'j)² ]
The maximization step chooses h' = (μ'1, …, μ'k) to maximize Q(h'|h), which amounts to
minimizing the weighted sum of squared errors Σi Σj E[zij] (xi − μ'j)². Setting the derivative
with respect to each μ'j to zero yields
μj ← Σ i=1..m E[zij] xi / Σ i=1..m E[zij]
which is exactly the update used in Step 2 above. Replacing each E[zij] by a hard 0/1 assignment
of xi to its nearest current mean gives the familiar k-means clustering procedure.
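For contrast, a sketch of the hard-assignment limit just mentioned, i.e., ordinary k-means for k = 2 on the same kind of data; initialization at the sample extremes is an arbitrary choice for the illustration.

```python
import random

random.seed(0)
# Data drawn from two well-separated Normals (illustrative).
xs = [random.gauss(0.0, 1.0) for _ in range(100)] + \
     [random.gauss(4.0, 1.0) for _ in range(100)]

mu = [min(xs), max(xs)]   # arbitrary initialization at the sample extremes
for _ in range(20):
    # Hard E step: z_ij is forced to 0 or 1 -- each point belongs wholly
    # to the cluster of its nearest current mean.
    clusters = [[], []]
    for x in xs:
        j = 0 if (x - mu[0]) ** 2 <= (x - mu[1]) ** 2 else 1
        clusters[j].append(x)
    # M step: with 0/1 weights the weighted mean is the plain average.
    mu = [sum(c) / len(c) if c else mu[j] for j, c in enumerate(clusters)]

print([round(m, 2) for m in mu])   # approaches the two cluster centres
```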
Summary
• Bayesian methods provide the basis for probabilistic learning methods that
accommodate (and require) knowledge about the prior probabilities of alternative
hypotheses and about the probability of observing various data given the hypothesis.
Bayesian methods allow assigning a posterior probability to each candidate hypothesis,
based on these assumed priors and the observed data.
• Bayesian methods can be used to determine the most probable hypothesis given the
data (the maximum a posteriori, or MAP, hypothesis). This is the optimal hypothesis in the
sense that no other hypothesis is more likely.
• The naive Bayes classifier is a Bayesian learning method that has been found to be
useful in many practical applications. It is called "naive" because it incorporates the
simplifying assumption that attribute values are conditionally independent, given the
classification of the instance. When this assumption is met, the naive Bayes classifier
outputs the MAP classification. Even when this assumption is not met, as in the case of
learning to classify text, the naive Bayes classifier is often quite effective. Bayesian
belief networks provide a more expressive representation for sets of conditional
independence assumptions among subsets of the attributes.
• The framework of Bayesian reasoning can provide a useful basis for analyzing certain
learning methods that do not directly apply Bayes theorem. For example, under certain
conditions it can be shown that minimizing the squared error when learning a real-
valued target function corresponds to computing the maximum likelihood hypothesis.
• The Minimum Description Length principle recommends choosing the hypothesis that
minimizes the description length of the hypothesis plus the description length of the
data given the hypothesis. Bayes theorem and basic results from information theory can
be used to provide a rationale for this principle.
• In many practical learning tasks, some of the relevant instance variables may be
unobservable. The EM algorithm provides a quite general approach to learning in the
presence of unobservable variables. This algorithm begins with an arbitrary initial
hypothesis. It then repeatedly calculates the expected values of the hidden variables
(assuming the current hypothesis is correct), and then recalculates the maximum
likelihood hypothesis (assuming the hidden variables have the expected values
calculated by the first step). This procedure converges to a local maximum likelihood
hypothesis, along with estimated values for the hidden variables.
*****