Module-4
Course Notes - 18CS71
Artificial Intelligence and Machine Learning
Syllabus: Bayesian Learning: Introduction, Bayes theorem, Bayes theorem and concept learning, ML and LS
error hypothesis, ML for predicting probabilities, MDL principle, Bayes optimal classifier, Gibbs algorithm, Naive Bayes
classifier, BBN, EM Algorithm.
Textbook 1: Chapter 6
Textbooks:
1. Tom M Mitchell, “Machine Learning”, 1st Edition, McGraw Hill Education, 2017.
2. Elaine Rich, Kevin K and S B Nair, “Artificial Intelligence”, 3rd Edition, McGraw Hill Education, 2017.
Bayes Theorem
In machine learning we are often interested in determining the best hypothesis from some space
H, given the observed training data D. Bayes theorem provides a way to calculate the
probability of a hypothesis based on its prior probability, the probabilities of observing various
data given the hypothesis, and the observed data itself.
To define Bayes theorem precisely, let us first introduce a little notation.
• We shall write P(h) to denote the initial probability that hypothesis h holds, before we
have observed the training data. P(h) is often called the prior probability of h and may
reflect any background knowledge we have about the chance that h is a correct
hypothesis.
• Similarly, we will write P(D) to denote the prior probability that training data D will
be observed (i.e., the probability of D given no knowledge about which hypothesis holds).
• Next, we will write P(D|h) to denote the probability of observing data D given some
world in which hypothesis h holds. In general, we write P(x|y) to denote the probability
of x given y. In machine learning problems we are interested in the probability P(h|D)
that h holds given the observed training data D. P(h|D) is called the posterior probability
of h, because it reflects our confidence that h holds after we have seen the
training data D. Notice the posterior probability P(h|D) reflects the influence of the
training data D, in contrast to the prior probability P(h), which is independent of D.
Bayes theorem provides a way to calculate the posterior probability P(h|D) from the prior
probability P(h), together with P(D) and P(D|h):
P(h|D) = P(D|h) P(h) / P(D)    …(1)
As one might intuitively expect, P(h|D) increases with P(h) and with P(D|h) according to Bayes
theorem. It is also reasonable to see that P(h|D) decreases as P(D) increases, because the more
probable it is that D will be observed independent of h, the less evidence D provides in support
of h.
In many learning scenarios, the learner considers some set of candidate hypotheses H and is
interested in finding the most probable hypothesis h ∈ H given the observed data D (or at least
one of the maximally probable if there are several). Any such maximally probable hypothesis
is called a maximum a posteriori (MAP) hypothesis. We can determine the MAP hypotheses
by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.
More precisely, we will say that hMAP is a MAP hypothesis provided,
hMAP ≡ argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)    …(2)
Notice in the final step above we dropped the term P(D) because it is a constant independent
of h. In some cases, we will assume that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H). In this case we can further simplify the above equation and need only
consider the term P(D|h) to find the most probable hypothesis. P(D|h) is often called the
likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a
maximum likelihood (ML) hypothesis, hML
hML = argmax h∈H P(D|h)    …(3)
In order to make clear the connection to machine learning problems, we introduced Bayes
theorem above by referring to the data D as training examples of some target function and
referring to H as the space of candidate target functions.
Example: To illustrate Bayes rule, consider a medical diagnosis problem in which there are
two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that
the patient does not. The available data is from a particular laboratory test with two possible
outcomes: ⊕ (positive) and ⊖ (negative). We have prior knowledge that over the entire
population of people only .008 have this disease. Furthermore, the lab test is only an imperfect
indicator of the disease. The test returns a correct positive result in only 98% of the cases in
which the disease is actually present and a correct negative result in only 97% of the cases in
which the disease is not present. In other cases, the test returns the opposite result.
Suppose we now observe a new patient for whom the lab test returns a positive result. Should
we diagnose the patient as having cancer or not?
Solution: The above situation can be summarized by the following probabilities:
P(cancer) = .008            P(¬cancer) = .992
P(⊕|cancer) = .98           P(⊖|cancer) = .02
P(⊕|¬cancer) = .03          P(⊖|¬cancer) = .97
Observing a positive test result and applying Equation (2), we find
P(⊕|cancer) P(cancer) = .98 × .008 = .0078
P(⊕|¬cancer) P(¬cancer) = .03 × .992 = .0298
Thus, hMAP = ¬cancer. The exact posterior probabilities can be determined by normalizing
the above quantities so that they sum to 1 (e.g., P(cancer|⊕) = .0078/(.0078 + .0298) = .21).
This step is warranted because Bayes theorem states that the posterior probabilities are just the
above quantities divided by the probability of the data, P(⊕). Although P(⊕) was not
provided directly as part of the problem statement, we can calculate it in this fashion because
we know that P(cancer|⊕) and P(¬cancer|⊕) must sum to 1.
Notice that while the posterior probability of cancer is significantly higher than its prior
probability, the most probable hypothesis is still that the patient does not have cancer.
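The arithmetic above can be checked with a few lines of Python; the following sketch simply encodes the probabilities given in the problem statement.

```python
# Priors and test characteristics from the problem statement.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer            # 0.992
p_pos_given_cancer = 0.98              # correct positive rate
p_pos_given_not_cancer = 1 - 0.97      # 1 - correct negative rate = 0.03

# Unnormalized posteriors: P(+|h) * P(h) for each hypothesis.
joint_cancer = p_pos_given_cancer * p_cancer              # ~.0078
joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~.0298

# Normalizing by P(+) = sum of the two joint terms gives the posteriors.
p_pos = joint_cancer + joint_not_cancer
print(round(joint_cancer / p_pos, 2))      # P(cancer|+)  ~= 0.21
print(round(joint_not_cancer / p_pos, 2))  # P(~cancer|+) ~= 0.79
```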
As this example illustrates, the result of Bayesian inference depends strongly on the prior
probabilities, which must be available in order to apply the method directly. Note also that in
this example the hypotheses are not completely accepted or rejected, but rather become more
or less probable as more data is observed.
Bayes Theorem and Concept Learning
Bayes theorem can serve as the basis for a straightforward concept learning algorithm that computes
the posterior probability of each candidate hypothesis and outputs the most probable one:
BRUTE-FORCE MAP LEARNING algorithm
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax h∈H P(h|D)
This algorithm may require significant computation, because it applies Bayes theorem to each
hypothesis in H to calculate P(h|D). While this may prove impractical for large hypothesis
spaces, the algorithm is still of interest because it provides a standard against which we may
judge the performance of other concept learning algorithms.
We assume the following.
1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than any
other.
Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to
assign the same prior probability to every hypothesis h in H:
P(h) = 1/|H| for all h in H
Now, P(D|h) is the probability of observing the target values D = (d1, …, dm) for the fixed set of
instances (x1 . . . xm), given a world in which hypothesis h holds (i.e., given a world in which
h is the correct description of the target concept c). Since we assume noise-free training data,
the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi).
Therefore,
P(D|h) = 1 if di = h(xi) for all di in D
P(D|h) = 0 otherwise    …(4)
In other words, the probability of data D given hypothesis h is 1 if D is consistent with h, and
0 otherwise. Recalling Bayes theorem, we have
P(h|D) = P(D|h) P(h) / P(D)
First consider the case where h is inconsistent with the training data D. Here P(D|h) = 0 due to
Equation (4). Thus, the posterior probability of an inconsistent hypothesis is
P(h|D) = 0 · P(h) / P(D) = 0
Now consider the case where h is consistent with D. Since Equation (4) defines P(D|h) = 1
when h is consistent with D, we have
P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|
where VSH,D is the Version Space (subset of hypotheses) from H that are consistent with D.
The derivation of P(D) from the theorem of total probability is as follows:
P(D) = Σ hi∈H P(D|hi) P(hi)
     = Σ hi∈VSH,D 1 · (1/|H|)  +  Σ hi∉VSH,D 0 · (1/|H|)
     = |VSH,D| / |H|
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed
P(h) and P(D|h) is
P(h|D) = 1/|VSH,D| if h is consistent with D, and P(h|D) = 0 otherwise.
In other words, every consistent hypothesis is a MAP hypothesis, if we assume a uniform prior
probability distribution over H (i.e., P(hi) = P(hj) for all i, j), and if we assume
deterministic, noise free training data.
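As an illustration, the following sketch carries out brute-force MAP learning in this noise-free setting; the tiny threshold hypothesis space and the two training examples are invented for the sketch, not taken from the text.

```python
# Brute-force MAP learning over a toy hypothesis space (invented for this
# sketch): threshold concepts h_t(x) = (x >= t) over instances x in {0,...,4}.
hypotheses = [lambda x, t=t: x >= t for t in range(5)]

# Noise-free training data: (instance, classification) pairs.
data = [(0, False), (3, True)]

# P(D|h) is 1 for consistent hypotheses and 0 otherwise (Equation 4).
consistent = [h for h in hypotheses if all(h(x) == d for x, d in data)]

# With the uniform prior P(h) = 1/|H|, Bayes theorem gives P(h|D) = 1/|VS|
# for consistent hypotheses and 0 for the rest.
for t, h in enumerate(hypotheses):
    posterior = 1 / len(consistent) if h in consistent else 0.0
    print(f"t={t}: P(h|D) = {posterior:.3f}")   # t = 1, 2, 3 each get 1/3
```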
The Bayesian framework allows one way to characterize the behavior of learning algorithms
(e.g., FIND-S), even when the learning algorithm does not explicitly manipulate probabilities.
By identifying probability distributions P(h) and P(D|h) under which the algorithm outputs
optimal (i.e., MAP) hypotheses, we can characterize the implicit assumptions under which this
algorithm behaves optimally. Thus, Bayesian analysis can be used to show that a particular
learning algorithm outputs a MAP hypothesis even though it may not explicitly use Bayes rule
or calculate probabilities in any form.
So far we discussed a special case of Bayesian reasoning, where P(D|h) takes on values of only
0 and 1, reflecting the deterministic predictions of hypotheses and the assumption of noise-free
training data. In the next section, we model learning from noisy training data, by allowing
P(D|h) to take on values other than 0 and 1, and by introducing into P(D|h) additional
assumptions about the probability distributions that govern the noise.
Maximum Likelihood and Least-Squared Error Hypotheses
Consider the problem of learning a continuous-valued target function f, given noisy training
examples of the form di = f(xi) + ei, where ei is a random noise variable.
Before showing why a hypothesis that minimizes the sum of squared errors in this setting is
also a maximum likelihood hypothesis, let us quickly review two basic concepts from
probability theory: probability densities and Normal distributions.
Probability densities:
First, in order to discuss probabilities over continuous variables such as e, we must introduce
probability densities. The reason, roughly, is that we wish for the total probability over all
possible values of the random variable to sum to one. In the case of continuous variables we
cannot achieve this by assigning a finite probability to each of the infinite set of possible values
for the random variable. Instead, we speak of a probability density for continuous variables
such as e and require that the integral of this probability density over all possible values be one.
In general, we will use lower case p to refer to the probability density function, to distinguish
it from a finite probability P (which we will sometimes refer to as a probability mass). The
probability density p(x0) is the limit, as ε goes to zero, of 1/ε times the probability that x will take
on a value in the interval [x0, x0 + ε).
Probability density function:
p(x0) ≡ lim ε→0 (1/ε) P(x0 ≤ x < x0 + ε)
Normal distribution: A Normal (Gaussian) distribution with mean μ and standard deviation σ is
described by the probability density function
p(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²))
A Normal distribution is fully determined by two parameters in the above formula: μ and σ. If
the random variable X follows a normal distribution, then:
• The probability that X will fall into the interval (a, b) is given by ∫_a^b p(x) dx
• The expected, or mean value of X, E[X], is E[X] = μ
• The variance of X, Var(X), is Var(X) = σ²
• The standard deviation of X, σX, is σX = σ
The Central Limit Theorem states that the sum of a large number of independent, identically
distributed random variables follows a distribution that is approximately Normal.
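The following sketch defines the Normal density just given and numerically checks the listed properties; the values μ = 2.0 and σ = 1.5 are arbitrary choices for the illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density p(x) of a Normal distribution with mean mu and std dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

mu, sigma, dx = 2.0, 1.5, 0.001
xs = [mu - 8 * sigma + i * dx for i in range(int(16 * sigma / dx))]

# Riemann-sum checks of the listed properties: total mass, mean, variance.
mass = sum(normal_pdf(x, mu, sigma) * dx for x in xs)
mean = sum(x * normal_pdf(x, mu, sigma) * dx for x in xs)
var = sum((x - mu) ** 2 * normal_pdf(x, mu, sigma) * dx for x in xs)
print(round(mass, 4), round(mean, 4), round(var, 4))  # ~1.0, ~mu, ~sigma**2
```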
Prove: The maximum likelihood hypothesis hML minimizes the sum of the squared errors between
the observed training values di and the hypothesis predictions h(xi).
Proof: From equation (3) we have
hML = argmax h∈H p(D|h)
Let the set of training instances be (x1, …, xm), and consider the data D to be the
corresponding sequence of target values D = (d1, …, dm), where di = f(xi) + ei. Assuming the
training examples are mutually independent given h, we can write P(D|h) as the product of the
various p(di|h):
hML = argmax h∈H ∏ i=1..m p(di|h)
Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ²,
each di must also obey a Normal distribution with variance σ² centered around the true target
value f(xi) rather than zero. Therefore p(di|h) can be written as a Normal distribution with
variance σ² and mean μ = f(xi). Let us write the formula for this Normal distribution to describe
p(di|h), using the general formula for a Normal distribution and substituting the appropriate μ
and σ². Because we are writing the expression for the probability of di given that h is the correct
description of the target function f, we will also substitute μ = f(xi) = h(xi), yielding
hML = argmax h∈H ∏ i=1..m (1/√(2πσ²)) e^(−(di − h(xi))²/(2σ²))
We now apply a transformation common in maximum likelihood calculations: rather than maximizing
this complicated expression we maximize its (less complicated) logarithm, which is justified
because ln p is a monotonic function of p:
hML = argmax h∈H Σ i=1..m [ ln(1/√(2πσ²)) − (di − h(xi))²/(2σ²) ]
The first term in this expression is a constant independent of h, and can therefore be discarded,
yielding
hML = argmax h∈H Σ i=1..m −(di − h(xi))²/(2σ²)
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity,
and the constant factor 1/(2σ²) can also be discarded, yielding
hML = argmin h∈H Σ i=1..m (di − h(xi))²
The above equation shows that the maximum likelihood hypothesis hML is the one that minimizes
the sum of the squared errors between the observed training values di and the hypothesis
predictions h(xi).
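A quick numerical illustration of this equivalence: for data corrupted by Gaussian noise, the hypothesis that maximizes the log likelihood is exactly the one that minimizes the sum of squared errors. The target function f(x) = 2x, the noise level, and the grid of candidate slopes below are all invented for the sketch.

```python
import math, random

random.seed(0)
sigma = 1.0
xs = [i / 10 for i in range(50)]
ds = [2.0 * x + random.gauss(0, sigma) for x in xs]   # d_i = f(x_i) + e_i

# Candidate hypotheses: h_w(x) = w * x for a grid of slopes w in [1.5, 2.5].
ws = [w / 100 for w in range(150, 251)]

def log_likelihood(w):
    # sum_i ln p(d_i|h_w) under Normal noise with variance sigma^2
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2) for x, d in zip(xs, ds))

def squared_error(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

w_ml = max(ws, key=log_likelihood)
w_ls = min(ws, key=squared_error)
print(w_ml == w_ls, w_ml)  # True: the same w maximizes likelihood, minimizes SSE
```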
Limitations: The above analysis considers noise only in the target value of the training
example and does not consider noise in the attributes describing the instances themselves.
Maximum Likelihood Hypotheses for Predicting Probabilities
Consider the setting in which we wish to learn a nondeterministic (probabilistic) function
f : X → {0, 1}. In this case we wish to learn the target function f' : X → [0, 1] such that
f'(x) = P(f(x) = 1). For example, in a medical
patient example, if x is one of those indistinguishable patients of which 92% survive, then f'(x)
= 0.92 whereas the probabilistic function f (x) will be equal to 1 in 92% of cases and equal to
0 in the remaining 8%.
How can we learn f' using, say, a neural network? One obvious, brute-force way would be to
first collect the observed frequencies of 1's and 0's for each possible value of x and then train
the neural network to output the target frequency for each x. As we shall see below, we can
instead train a neural network directly from the observed training examples of f, yet still derive
a maximum likelihood hypothesis for f'.
What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in
this setting? To answer this question, we must first obtain an expression for P(D|h). Let us
assume the training data D is of the form D = {(x1, d1), …, (xm, dm)}, where di is the observed 0
or 1 value for f(xi). Recall that in the maximum likelihood, least-squared error analysis of the
previous section, we made the simplifying assumption that the instances (x1, …, xm) were fixed.
This enabled us to characterize the data by considering only the target values di. Although we
could make a similar simplifying assumption in this case, let us avoid it here in order to
demonstrate that it has no impact on the final outcome. Thus, treating both xi and di as random
variables, and assuming that each training example is drawn independently, we can write
P(D|h) as
P(D|h) = ∏ i=1..m P(xi, di|h)    …(8)
It is reasonable to assume, furthermore, that the probability of encountering any particular
instance xi is independent of the hypothesis h, so P(xi, di|h) = P(di|h, xi) P(xi).
Now what is the probability P(di | h, xi) of observing di = 1 for a single instance xi, given a
world in which hypothesis h holds? Recall that h is our hypothesis regarding the target function,
which computes this very probability.
Therefore, P(di = 1|h, xi) = h(xi), and in general
P(di|h, xi) = h(xi) if di = 1
P(di|h, xi) = 1 − h(xi) if di = 0    …(9)
In order to substitute for P(D|h) in Equation (8), let us first re-express Equation (9) in a more
mathematically manipulable form, as
P(di|h, xi) = h(xi)^di (1 − h(xi))^(1−di)    …(10)
Substituting Equation (10) into Equation (8), we obtain
P(D|h) = ∏ i=1..m h(xi)^di (1 − h(xi))^(1−di) P(xi)    …(11)
Now we write an expression for the maximum likelihood hypothesis. The term P(xi) is a constant
independent of h, so it can be dropped:
hML = argmax h∈H ∏ i=1..m h(xi)^di (1 − h(xi))^(1−di)    …(12)
The expression on the right side of Equation (12) can be seen as a generalization of the
Binomial distribution. The expression in Equation (12) describes the probability that flipping
each of m distinct coins will produce the outcome (d1, …, dm), assuming that each coin xi has
probability h(xi) of producing a heads. Note the Binomial distribution is similar, but makes the
additional assumption that the coins have identical probabilities of turning up heads (i.e., that
h(xi) = h(xj), for every i, j). In both cases we assume the outcomes of the coin flips are mutually
independent, an assumption that fits our current setting.
As in earlier cases, we will find it easier to work with the log of the likelihood, yielding
hML = argmax h∈H Σ i=1..m [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]    …(13)
Equation (13) describes the quantity that must be maximized in order to obtain the maximum
likelihood hypothesis in our current problem setting. This result is analogous to our earlier
result showing that minimizing the sum of squared errors produces the maximum likelihood
hypothesis in the earlier problem setting. Note the similarity between Equation (13) and the
general form of the entropy function, −Σi pi log pi, discussed in Chapter 3. Because of this
similarity, the negation of the above quantity is sometimes called the cross entropy.
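A small sketch of Equation (13): given observed boolean targets and two candidate probability-predicting hypotheses (all values invented for the example), the hypothesis with the larger value of this sum, equivalently the smaller cross entropy, is the maximum likelihood choice.

```python
import math

# Observed boolean targets d_i and two candidate hypotheses, each giving
# h(x_i) = the predicted probability that d_i = 1 (illustrative values).
d = [1, 1, 0, 1, 0, 1, 1, 0]
h_a = [0.9, 0.8, 0.2, 0.7, 0.3, 0.8, 0.9, 0.1]
h_b = [0.6, 0.6, 0.5, 0.6, 0.5, 0.6, 0.6, 0.4]

def log_likelihood(h):
    # Equation (13): sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))
    return sum(di * math.log(hi) + (1 - di) * math.log(1 - hi)
               for di, hi in zip(d, h))

for name, h in (("h_a", h_a), ("h_b", h_b)):
    ll = log_likelihood(h)
    print(name, round(ll, 3), "cross entropy:", round(-ll, 3))
# h_a has the larger log likelihood (smaller cross entropy), so h_a = h_ML here.
```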
MINIMUM DESCRIPTION LENGTH (MDL) PRINCIPLE
Recall Occam's razor, a popular inductive bias that can be summarized as "choose the shortest
explanation for the observed data." There have been
several arguments in the long-standing debate regarding Occam's razor. Here we consider a
Bayesian perspective on this issue and a closely related principle called the Minimum
Description Length (MDL) principle.
The Minimum Description Length principle is motivated by interpreting the definition of hMAP in
the light of basic concepts from information theory. Consider again the now familiar definition
of hMAP:
hMAP = argmax h∈H P(D|h) P(h)
which is equivalent to maximizing the log, or equivalently minimizing the negative log:
hMAP = argmax h∈H [ log2 P(D|h) + log2 P(h) ]
     = argmin h∈H [ −log2 P(D|h) − log2 P(h) ]    …(16)
Above equation can be interpreted as a statement that short hypotheses are preferred, assuming
a particular representation scheme for encoding hypotheses and data.
To explain this, let us introduce a basic result from information theory: Consider the problem
of designing a code to transmit messages drawn at random, where the probability of
encountering message i is pi. We are interested here in the most compact code; that is, we are
interested in the code that minimizes the expected number of bits we must transmit in order to
encode a message drawn at random. Clearly, to minimize the expected code length we should
assign shorter codes to messages that are more probable. Shannon and Weaver (1949) showed
that the optimal code (i.e., the code that minimizes the message length) assigns -log2 pi bits to
encode message i. We will refer to the number of bits required to encode message i using code
C as the description length of message i with respect to C, which we denote by LC(i).
Let us interpret the above equation for hMAP in light of this result from coding theory:
• −log2 P(h) is the description length of h under the optimal encoding for the hypothesis
space H; that is, LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
• −log2 P(D|h) is the description length of the training data D given hypothesis h, under its
optimal encoding; that is, LCD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for
describing data D assuming that both the sender and receiver know the hypothesis h.
Therefore, hMAP = argmin h∈H [ LCH(h) + LCD|h(D|h) ].
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths. Of course, to apply this principle in
practice we must choose specific encodings or representations appropriate for the given
learning task. Assuming we use the codes C1 and C2 to represent the hypothesis and the data
given the hypothesis, we can state the MDL principle as
hMDL = argmin h∈H [ LC1(h) + LC2(D|h) ]    …(17)
The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses, CH,
and if we choose C2 to be the optimal encoding of the data given a hypothesis, CD|h, then
hMDL = hMAP.
Intuitively, we can think of the MDL principle as recommending the shortest method for re-
encoding the training data, where we count both the size of the hypothesis and any additional
cost of encoding the data given this hypothesis.
MDL principle provides a way of trading off hypothesis complexity for the number of errors
committed by the hypothesis. It might select a shorter hypothesis that makes a few errors over
a longer hypothesis that perfectly classifies the training data. Viewed in this light, it provides
one method for dealing with the issue of overfitting the data.
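A toy numeric sketch of this trade-off; all bit counts are invented for illustration, and the bits() helper shows the −log2 pi coding result quoted above.

```python
import math

def bits(p):
    """Optimal code length, in bits, for a message of probability p (-log2 p)."""
    return -math.log2(p)

# Toy comparison (all bit counts invented): a short hypothesis that
# misclassifies a few examples vs. a longer one that fits D perfectly.
L_h_short, L_errors_short = 10.0, 6.0   # L_C1(h) and L_C2(D|h) in bits
L_h_long, L_errors_long = 25.0, 0.0

print("short hypothesis total:", L_h_short + L_errors_short)  # 16.0 bits
print("long hypothesis total: ", L_h_long + L_errors_long)    # 25.0 bits
# MDL prefers the 16-bit encoding, trading a few errors for a shorter hypothesis.

print(bits(0.25))  # a message of probability 1/4 gets a 2-bit code
```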
BAYES OPTIMAL CLASSIFIER
So far we have asked which is the most probable hypothesis given the training data. A closely
related and often more significant question is: what is the most probable classification of a new
instance given the training data? In general, the most probable classification of the new instance
is obtained by combining the predictions of all hypotheses, weighted by their posterior
probabilities. If the possible classification of the new example can take on any value vj from
some set V, then the probability P(vj|D) that the correct classification for the new instance is vj
is just
P(vj|D) = Σ hi∈H P(vj|hi) P(hi|D)
The optimal classification of the new instance is the value vj for which P(vj|D) is maximum:
Bayes optimal classification:  argmax vj∈V Σ hi∈H P(vj|hi) P(hi|D)    …(6.18)
Any system that classifies new instances according to Equation (6.18) is called a Bayes
optimal classifier, or Bayes optimal learner. No other classification method using the same
hypothesis space and same prior knowledge can outperform this method on average. This
method maximizes the probability that the new instance is classified correctly, given the
available data, hypothesis space, and prior probabilities over the hypotheses.
The labeling of instances defined in this way need not correspond to the instance labeling of
any single hypothesis h from H. One way to view this situation is to think of the Bayes
optimal classifier as effectively considering a hypothesis space H' different from the space of
hypotheses H to which Bayes theorem is being applied. In particular, H' effectively includes
hypotheses that perform comparisons between linear combinations of predictions from
multiple hypotheses in H.
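The following sketch implements Equation (6.18) for a hypothetical three-hypothesis posterior (the values .4, .3, .3 are illustrative, not from the text); it also shows the point just made, since the resulting classification differs from that of the single MAP hypothesis h1.

```python
# Bayes optimal classification (Equation 6.18): weight each hypothesis's
# prediction by its posterior. Posteriors and predictions are illustrative.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # P(v|h) = 1 for the voted v

def bayes_optimal(values=("+", "-")):
    # score[v] = sum over hypotheses of P(v|h_i) * P(h_i|D)
    score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
             for v in values}
    return max(score, key=score.get), score

print(bayes_optimal())  # '-' wins with weight 0.6, although h_MAP = h1 says '+'
```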
GIBBS ALGORITHM
Although the Bayes optimal classifier obtains the best performance that can be achieved from
the given training data, it can be quite costly to apply. The expense is due to the fact that it
computes the posterior probability for every hypothesis in H and then combines the
predictions of each hypothesis to classify each new instance. An alternative, less optimal
method is the Gibbs algorithm (see Opper and Haussler 1991), defined as follows:
1. Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
2. Use h to predict the classification of the next instance x.
Given a new instance to classify, the Gibbs algorithm simply applies a hypothesis drawn at
random according to the current posterior probability distribution. Surprisingly, it can be
shown that under certain conditions the expected misclassification error for the Gibbs
algorithm is at most twice the expected error of the Bayes optimal classifier (Haussler et al.
1994). More precisely, the expected value is taken over target concepts drawn at random
according to the prior probability distribution assumed by the learner.
Under this condition, the expected value of the error of the Gibbs algorithm is at worst twice
the expected value of the error of the Bayes optimal classifier. This result has an interesting
implication for the concept learning problem described earlier. In particular, it implies that if
the learner assumes a uniform prior over H, and if target concepts are in fact drawn from such
a distribution when presented to the learner, then classifying the next instance according to a
hypothesis drawn at random from the current version space (according to a uniform
distribution), will have expected error at most twice that of the Bayes optimal classifier.
Again, we have an example where a Bayesian analysis of a non-Bayesian algorithm yields
insight into the performance of that algorithm.
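For comparison, a sketch of the Gibbs algorithm over the same illustrative posterior used above: each classification applies a single hypothesis drawn according to P(h|D).

```python
import random

random.seed(1)
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # same illustrative posterior
predictions = {"h1": "+", "h2": "-", "h3": "-"}

def gibbs_classify():
    # Draw a single hypothesis according to P(h|D) and use it alone.
    h, = random.choices(list(posteriors), weights=list(posteriors.values()))
    return predictions[h]

votes = [gibbs_classify() for _ in range(10000)]
print(votes.count("-") / len(votes))  # ~0.6: each draw picks h2 or h3 w.p. .6
```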
NAIVE BAYES CLASSIFIER
The naive Bayes classifier applies to learning tasks where each instance x is described by a
conjunction of attribute values and the target function f(x) can take on any value from some
finite set V. A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values (a1, a2, ..., an).
The Bayesian approach to classifying the new instance is to assign the most probable target
value, vMAP, given the attribute values (a1, a2, …, an) that describe the instance.
vMAP = argmax vj∈V P(vj | a1, a2, …, an)
Using Bayes theorem, this expression can be rewritten as
vMAP = argmax vj∈V P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
     = argmax vj∈V P(a1, a2, …, an | vj) P(vj)    …(19)
Now we could attempt to estimate the two terms in Equation (19) based on the training data. It
is easy to estimate each of the P(vj) simply by counting the frequency with which each target
value vj occurs in the training data. However, estimating the different P(al, a2, ... an | vj) terms
in this fashion is not feasible unless we have a very, very large set of training data. (The problem
is that the no. of these terms = no. of possible instances * no. of possible target values.)
The naive Bayes classifier is based on the simplifying assumption that the attribute values are
conditionally independent given the target value. In other words, the assumption is that given
the target value of the instance, the probability of observing the conjunction a1, a2, …, an is
just the product of the probabilities for the individual attributes: P(a1, a2, …, an | vj) = Πi P(ai |
vj). Substituting this into Equation (19), we have the approach used by the naive Bayes
classifier.
Naive Bayes classifier:
vNB = argmax vj∈V P(vj) Πi P(ai | vj)    …(20)
where vNB denotes the target value output by the naive Bayes classifier. (Note that the number
of distinct P(ai|vj) terms that must be estimated is just the number of distinct attribute values
times the number of distinct target values, a far smaller number.)
To summarize, the naive Bayes learning method involves a learning step in which the various
P(vj) and P(ai|vj) terms are estimated, based on their frequencies over the training data. The set
of these estimates corresponds to the learned hypothesis. This hypothesis is then used to
classify each new instance by applying the rule in Equation (20).
One interesting difference between the naive Bayes learning method and other learning
methods we have considered is that there is no explicit search through the space of possible
hypotheses. Instead, the hypothesis is formed without searching, simply by counting the
frequency of various data combinations within the training examples.
Illustration: Consider the following data.
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Let us use the naive Bayes classifier and the training data from this table to classify the
following novel instance:
(Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new
instance. Instantiating Equation (20) to fit the current task, the target value vNB is given by
vNB = argmax vj∈{yes,no} P(vj) P(Outlook = sunny|vj) P(Temperature = cool|vj)
      P(Humidity = high|vj) P(Wind = strong|vj)
The probabilities of the different target values can easily be estimated based on their
frequencies over the 14 training examples:
P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36
Similarly, we can estimate the conditional probabilities; for example, those for Wind = strong are
P(Wind = strong | PlayTennis = yes) = 3/9 = .33
P(Wind = strong | PlayTennis = no) = 3/5 = .60
Using these and the analogous estimates for the remaining attribute values, we obtain
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206
Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance,
based on the probability estimates learned from the training data.
Furthermore, by normalizing the above quantities to sum to one we can calculate the
conditional probability that the target value is no, given the observed attribute values. For the
current example, this probability is .0206 / (.0206 + .0053) = .795.
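The entire worked example can be reproduced with a short sketch over the 14-row table above; probabilities are estimated as simple frequencies, as in the text, and the printed values match the .0053, .0206, and .795 computed above.

```python
# The 14 PlayTennis examples: (Outlook, Temp, Humidity, Wind, PlayTennis).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),   ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def p_label(v):                 # P(vj): relative frequency of the target value
    return sum(row[-1] == v for row in data) / len(data)

def p_attr(i, a, v):            # P(ai|vj): frequency among rows with label v
    rows = [row for row in data if row[-1] == v]
    return sum(row[i] == a for row in rows) / len(rows)

query = ("Sunny", "Cool", "High", "Strong")
scores = {}
for v in ("Yes", "No"):
    s = p_label(v)
    for i, a in enumerate(query):   # Equation (20): P(vj) * prod_i P(ai|vj)
        s *= p_attr(i, a, v)
    scores[v] = s

print(scores)                               # {'Yes': ~0.0053, 'No': ~0.0206}
print(scores["No"] / sum(scores.values()))  # ~0.795
```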
Estimating Probabilities: In the above computations we estimated conditional probabilities by
the fraction of observed counts, e.g.
P(Wind = strong | PlayTennis = no) = nc/n = 3/5
where n = 5 is the total number of training examples for which PlayTennis = no, and nc = 3 is the
number of these for which Wind = strong. This fraction provides a good estimate of the
probability in many cases, but the estimate is poor when n is very small or nc is 0. There are two
difficulties. First, nc/n produces a biased underestimate of the probability. Second, when this
probability estimate is zero, this probability term will dominate the Bayes classifier if the future
query contains Wind = strong. The reason is that the quantity calculated in Equation (20)
requires multiplying all the other probability terms by this zero value.
To avoid this difficulty, we can adopt a Bayesian approach to estimating the probability, using
the m-estimate defined as follows.
m-estimate of probability:
(nc + m·p) / (n + m)    …(22)
Here, nc and n are defined as before, p is our prior estimate of the probability we wish to
determine, and m is a constant called the equivalent sample size, which determines how heavily
to weight p relative to the observed data.
A typical method for choosing p in the absence of other information is to assume uniform
priors; that is, if an attribute has k possible values we set p = 1/k. For example, in estimating
P(Wind = strong | PlayTennis = no) we note the attribute Wind has two possible values, so
uniform priors would correspond to choosing p = .5. Note that if m is zero, the m-estimate is
equivalent to the simple fraction nc/n. If both n and m are nonzero, then the observed fraction
nc/n and the prior p will be combined according to the weight m. The reason m is called the
equivalent sample size is that Equation (22) can be interpreted as augmenting the n actual
observations by an additional m virtual samples distributed according to p.
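A one-function sketch of this estimator may help; the value m = 4 below is an arbitrary illustration (the text leaves m as a design choice).

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability (Equation 22): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Estimating P(Wind = strong | PlayTennis = no): n_c = 3, n = 5.
# Wind has k = 2 values, so the uniform prior is p = 1/2.
print(m_estimate(3, 5, 0.5, m=0))  # 0.6    -> with m = 0 this is just n_c/n
print(m_estimate(3, 5, 0.5, m=4))  # 0.555... -> pulled toward the prior p = .5
```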
BAYESIAN BELIEF NETWORKS
A Bayesian belief network describes the probability distribution governing a set of variables by
specifying a set of conditional independence assumptions along with a set of conditional
probabilities. It relies on the following definition: let X, Y, and Z be three discrete-valued
random variables. We say that X is conditionally independent of Y given Z if the probability
distribution governing X is independent of the value of Y given a value for Z; that is, if
P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z). We commonly write the above expression in
abbreviated form as P(X|Y, Z) = P(X|Z). This definition of conditional independence can be
extended to sets of variables as well. We say that the set of variables X1 … Xl is conditionally
independent of the set of variables Y1 … Ym given the set of variables Z1 … Zn, if
P(X1 … Xl | Y1 … Ym, Z1 … Zn) = P(X1 … Xl | Z1 … Zn)
Note the correspondence between this definition and our use of conditional independence in
the definition of the naive Bayes classifier. The naive Bayes classifier assumes that the instance
attribute A1 is conditionally independent of instance attribute A2 given the target value V. This
allows the naive Bayes classifier to calculate P(A1, A2|V) in Equation (20) as follows:
P(A1, A2|V) = P(A1|A2, V) P(A2|V)    …(23)
            = P(A1|V) P(A2|V)    …(24)
Equation (23) is just the general form of the product rule of probability from Table 6.1.
Equation (24) follows because if A1 is conditionally independent of A2 given V, then by our
definition of conditional independence P(A1|A2, V) = P(A1|V).
Representation
A Bayesian belief network (Bayesian network for short) represents the joint probability
distribution for a set of variables. For example, the Bayesian network in Figure 6.3 represents
the joint probability distribution over the boolean variables Storm, Lightning, Thunder,
ForestFire, Campfire, and BusTourGroup. In general, a Bayesian network represents the joint
probability distribution by specifying a set of conditional independence assumptions
(represented by a directed acyclic graph), together with sets of local conditional probabilities.
Each variable in the joint space is represented by a node in the Bayesian network.
For each variable two types of information are specified.
1. First, the network arcs represent the assertion that the variable is conditionally
independent of its non-descendants in the network given its immediate predecessors in
the network. We say X is a descendant of Y if there is a directed path from Y to X.
2. Second, a conditional probability table is given for each variable, describing the
probability distribution for that variable given the values of its immediate predecessors.
The joint probability for any desired assignment of values (y1, …, yn) to the tuple of
network variables (Y1, …, Yn) can be computed by the formula
P(y1, …, yn) = ∏ i=1..n P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network. Note
the values of P(yi | Parents(Yi)) are precisely the values stored in the conditional
probability table associated with node Yi.
To illustrate, the Bayesian network in Figure 6.3 represents the joint probability distribution
over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and
BusTourGroup. Consider the node Campfire. The network nodes and arcs represent the
assertion that Campfire is conditionally independent of its non-descendants Lightning and
Thunder, given its immediate parents Storm and BusTourGroup. This means that once we
know the value of the variables Storm and BusTourGroup, the variables Lightning and Thunder
provide no additional information about Campfire. The right side of the figure shows the
conditional probability table associated with the variable Campfire. The top left entry in this
table, for example, expresses the assertion that
P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
Note this table provides only the conditional probabilities of Campfire given its parent variables
Storm and BusTourGroup. The set of local conditional probability tables for all the variables,
together with the set of conditional independence assumptions described by the network,
describe the full joint probability distribution for the network.
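The following sketch computes this product of local conditional probabilities for a three-variable fragment of the Figure 6.3 network (Storm and BusTourGroup as parents of Campfire). Only the 0.4 table entry is taken from the text; every other probability is a hypothetical placeholder.

```python
from itertools import product

# Joint probability via P(y1,...,yn) = prod_i P(yi | Parents(Yi)).
# Only the 0.4 entry is quoted in the text; all other numbers are placeholders.
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.1, False: 0.9}
p_campfire = {  # P(Campfire = True | Storm, BusTourGroup)
    (True, True): 0.4,                        # the table entry quoted above
    (True, False): 0.1, (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    pc = p_campfire[(storm, bus)]
    return p_storm[storm] * p_bus[bus] * (pc if campfire else 1 - pc)

print(joint(True, True, True))  # 0.2 * 0.1 * 0.4 ~= 0.008
# Sanity check: the joint distribution sums to 1 over all 8 assignments.
print(sum(joint(s, b, c) for s, b, c in product([True, False], repeat=3)))
```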
One attractive feature of Bayesian belief networks is that they allow a convenient way to
represent causal knowledge such as the fact that Lightning causes Thunder. In the terminology
of conditional independence, we express this by stating that Thunder is conditionally
independent of other variables in the network, given the value of Lightning.
Inference
We might wish to use a Bayesian network to infer the value of some target variable (e.g.,
ForestFire) given the observed values of the other variables. Of course, given that we are
dealing with random variables it will not generally be correct to assign the target variable a
single determined value. What we really wish to infer is the probability distribution for the
target variable, which specifies the probability that it will take on each of its possible
values given the observed values of the other variables. This inference step can be
straightforward if values for all of the other variables in the network are known exactly. In the
more general case we may wish to infer the probability distribution for some variable (e.g.,
ForestFire) given observed values for only a subset of the other variables (e.g., Thunder and
BusTourGroup may be the only observed values available).
In general, a Bayesian network can be used to compute the probability distribution for any
subset of network variables given the values or distributions for any subset of the remaining
variables.
The EM Algorithm
In many practical learning settings, only a subset of the relevant instance features might be
observable. For example, in training or using the Bayesian belief network, we might have data
where only a subset of the network variables Storm, Lightning, Thunder, ForestFire, Campfire,
and BusTourGroup have been observed. Many approaches have been proposed to handle the
problem of learning in the presence of unobserved variables. If some variable is sometimes
observed and sometimes not, then we can use the cases for which it has been observed to learn
to predict its values when it is not.
In this section we describe the EM algorithm (Dempster et al. 1977), a widely used approach
to learning in the presence of unobserved variables. The EM algorithm can be used even for
variables whose value is never directly observed, provided the general form of the probability
distribution governing these variables is known.
Application: The EM algorithm has been used to train Bayesian belief networks (Heckerman
1995) as well as radial basis function neural networks. The EM algorithm is also the basis for
many unsupervised clustering algorithms (e.g., Cheeseman et al. 1988), and it is the basis for
the widely used Baum-Welch forward-backward algorithm for learning Partially Observable
Markov Models (Rabiner 1989).
Estimating Means of k Gaussians
The easiest way to introduce the EM algorithm is via an example. Consider a problem in which
the data D is a set of instances generated by a probability distribution that is a mixture of k
distinct Normal distributions. This problem setting is illustrated in Figure 6.4 for the case where
k = 2 and where the instances are the points shown along the x axis. Each instance is generated
using a two-step process. First, one of the k Normal distributions is selected at random. Second,
a single random instance xi is generated according to this selected distribution.
This process is repeated to generate a set of data points as shown in the figure. To simplify
our discussion, we consider the special case where the selection of the single Normal
distribution at each step is based on choosing each with uniform probability, where each of the
k Normal distributions has the same known variance σ². The learning task is to output a
hypothesis h = (μ1, …, μk)
that describes the means of each of the k distributions. We would like to find a maximum
likelihood hypothesis for these means; that is, a hypothesis h that maximizes p(D |h).
Note it is easy to calculate the maximum likelihood hypothesis for the mean of a single Normal
distribution given the observed data instances x1, x2, . . . , xm drawn from this single distribution.
Earlier we showed that the maximum likelihood hypothesis is the one that minimizes
the sum of squared errors over the m training instances. Now the problem of finding the mean
of a single distribution is just a special case of the problem discussed. Restating using our
current notation, we have
μML = argmin μ Σ i=1..m (xi − μ)²    …(6.27)
In this case, the sum of squared errors is minimized by the sample mean
μML = (1/m) Σ i=1..m xi    …(6.28)
Our problem here, however, involves a mixture of k different Normal distributions, and we
cannot observe which instances were generated by which distribution. Thus, we have a
prototypical example of a problem involving hidden variables. In the example of Figure 6.4,
we can think of the full description of each instance as the triple (xi, zi1, zi2), where xi is the
observed value of the ith instance and where zi1 and zi2 indicate which of the two Normal
distributions was used to generate the value xi. In particular, zij has the value 1 if xi was created
by the jth Normal distribution and 0 otherwise. Here xi is the observed variable in the description
of the instance, and zi1 and zi2 are hidden variables. If the values of zi1 and zi2 were observed,
we could use Equation (6.27) to solve for the means μ1 and μ2. Because they are not, we will
instead use the EM algorithm.
Applied to our k-means problem the EM algorithm searches for a maximum likelihood
hypothesis by repeatedly re-estimating the expected values of the hidden variables zij given its
current hypothesis (μ1 . . . μ k), then recalculating the maximum likelihood hypothesis using
these expected values for the hidden variables.
We will first describe this instance of the EM algorithm, and later state the EM algorithm in its
general form.
Applied to the problem of estimating the two means for Figure 6.4, the EM algorithm first
initializes the hypothesis to h = (μ1, μ2), where μ1 and μ2 are arbitrary initial values. It then
iteratively re-estimates h by repeating the following two steps until the procedure converges to
a stationary value for h.
Step 1: Calculate the expected value E[zij] of each hidden variable zij, assuming the current
hypothesis h = (μ1, μ2) holds.
Step 2: Calculate a new maximum likelihood hypothesis h' = (μ'1, μ'2), assuming the value taken
on by each hidden variable zij is its expected value E[zij] calculated in Step 1. Then replace the
hypothesis h = (μ1, μ2) by the new hypothesis h' = (μ'1, μ'2) and iterate.
Let us examine how both of these steps can be implemented in practice. Step 1 must calculate
the expected value of each zij. This E[zij] is just the probability that instance xi was generated
by the jth Normal distribution:
E[zij] = p(x = xi | μ = μj) / Σ n=1..2 p(x = xi | μ = μn)
       = e^(−(xi−μj)²/(2σ²)) / Σ n=1..2 e^(−(xi−μn)²/(2σ²))
Thus, the first step is implemented by substituting the current values (μ1, μ2) and the observed
xi into the above expression.
In the second step we use the E[zij] calculated during Step 1 to derive a new maximum
likelihood hypothesis h' = (μ'1, μ'2). As we will discuss later, the maximum likelihood
hypothesis in this case is given by
μj ← Σ i=1..m E[zij] xi / Σ i=1..m E[zij]
Note this expression is similar to the sample mean from Equation (6.28) that is used to estimate
μ for a single Normal distribution. Our new expression is just the weighted sample mean for
μj, with each instance weighted by the expectation E[zij] that it was generated by the jth Normal
distribution.
The above algorithm for estimating the means of a mixture of k Normal distributions illustrates
the essence of the EM approach: The current hypothesis is used to estimate the unobserved
variables, and the expected values of these variables are then used to calculate an
improved hypothesis. It can be proved that on each iteration through this loop, the EM
algorithm increases the likelihood P(D|h) unless it is at a local maximum. The algorithm thus
converges to a local maximum likelihood hypothesis for (μ1, μ2).
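The two steps can be sketched directly in Python for k = 2; the true means (0 and 4), the sample sizes, and the initialization at the sample extremes below are invented for the illustration.

```python
import math, random

random.seed(0)
sigma = 1.0
# Sample data from a mixture of two Normals (true means 0 and 4, illustrative).
xs = [random.gauss(0.0, sigma) for _ in range(100)] + \
     [random.gauss(4.0, sigma) for _ in range(100)]

mu = [min(xs), max(xs)]          # arbitrary (but distinct) initial hypothesis h
for _ in range(50):
    # Step 1 (E): E[z_ij] = p(x_i|mu_j) / sum_n p(x_i|mu_n); the shared
    # factor 1/sqrt(2*pi*sigma^2) cancels in the ratio.
    def e_z(x, j):
        ps = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
        return ps[j] / sum(ps)
    # Step 2 (M): mu_j <- weighted sample mean, weighted by E[z_ij].
    mu = [sum(e_z(x, j) * x for x in xs) / sum(e_z(x, j) for x in xs)
          for j in range(2)]

print([round(m, 2) for m in mu])   # converges near the true means 0 and 4
```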
General Statement of EM Algorithm
More generally, let Y = X ∪ Z denote the full data, where X denotes the observed data and Z the
unobserved data, and let h denote the current hypothesized values of the parameters θ. The EM
algorithm searches for the maximum likelihood hypothesis h' by seeking the h' that maximizes
E[ln P(Y|h')]. This expected value is taken over the probability distribution
governing Y, which is determined by the unknown parameters θ. Let us consider exactly what
this expression signifies. First, P(Y|h’) is the likelihood of the full data Y given hypothesis h'.
It is reasonable that we wish to find a h' that maximizes some function of this quantity. Second,
maximizing the logarithm of this quantity ln(P(Y|h’)) also maximizes P(Y|h’), as we have
discussed on several occasions already. Third, we introduce the expected value E[ln P(Y|h’)]
because the full data Y is itself a random variable. Given that the full data Y is a combination
of the observed data X and unobserved data Z, we must average over the possible values of the
unobserved Z, weighting each according to its probability. In other words we take the expected
value E[ln P(Y|h')] over the probability distribution governing the random variable Y. The
distribution governing Y is determined by the completely known values for X, plus the
distribution governing Z.
What is the probability distribution governing Y? In general, we will not know this distribution
because it is determined by the parameters θ that we are trying to estimate. Therefore, the EM
algorithm uses its current hypothesis h in place of the actual parameters θ to estimate the
distribution governing Y. Let us define a function Q(h’|h) that gives E[ln P(Y |h')] as a function
of h', under the assumption that θ = h and given the observed portion X of the full data Y.
We write this function Q in the form Q(h'|h) to indicate that it is defined in part by the
assumption that the current hypothesis h is equal to θ. In its general form, the EM algorithm
repeats the following two steps until convergence:
Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X
to estimate the probability distribution over Y: Q(h'|h) ← E[ln P(Y|h') | h, X].
Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q
function: h ← argmax h' Q(h'|h).
When the function Q is continuous, the EM algorithm converges to a stationary point of the
likelihood function P(Y|h'). When this likelihood function has a single maximum, EM will
converge to this global maximum likelihood estimate for h'. Otherwise, it is guaranteed only to
converge to a local maximum. In this respect, EM shares some of the same limitations as other
optimization methods such as gradient descent, line search, and conjugate gradient discussed
in Chapter 4.
Derivation of the k Means Algorithm
For the problem of estimating the means of a mixture of k Normal distributions, the full
description of each instance is yi = (xi, zi1, …, zik), so
ln P(Y|h') = Σ i=1..m ln p(yi|h') = Σ i=1..m [ ln(1/√(2πσ²)) − (1/2σ²) Σ j=1..k zij (xi − μ'j)² ]
Because this expression is linear in the hidden variables zij, its expected value is obtained
simply by substituting E[zij] for zij:
Q(h'|h) = E[ln P(Y|h')] = Σ i=1..m [ ln(1/√(2πσ²)) − (1/2σ²) Σ j=1..k E[zij] (xi − μ'j)² ]
The maximization step chooses h' = (μ'1, …, μ'k) to maximize Q(h'|h), which amounts to
minimizing the weighted sum of squared errors Σi Σj E[zij] (xi − μ'j)². Setting the derivative
with respect to each μ'j to zero yields
μj ← Σ i=1..m E[zij] xi / Σ i=1..m E[zij]
which is exactly the update used in Step 2 above. Replacing each E[zij] by a hard 0/1 assignment
of xi to its nearest current mean gives the familiar k-means clustering procedure.
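For contrast, a sketch of the hard-assignment limit just mentioned, i.e., ordinary k-means for k = 2 on the same kind of data; initialization at the sample extremes is an arbitrary choice for the illustration.

```python
import random

random.seed(0)
# Data drawn from two well-separated Normals (illustrative).
xs = [random.gauss(0.0, 1.0) for _ in range(100)] + \
     [random.gauss(4.0, 1.0) for _ in range(100)]

mu = [min(xs), max(xs)]   # arbitrary initialization at the sample extremes
for _ in range(20):
    # Hard E step: z_ij is forced to 0 or 1 -- each point belongs wholly
    # to the cluster of its nearest current mean.
    clusters = [[], []]
    for x in xs:
        j = 0 if (x - mu[0]) ** 2 <= (x - mu[1]) ** 2 else 1
        clusters[j].append(x)
    # M step: with 0/1 weights the weighted mean is the plain average.
    mu = [sum(c) / len(c) if c else mu[j] for j, c in enumerate(clusters)]

print([round(m, 2) for m in mu])   # approaches the two cluster centres
```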
Summary
• Bayesian methods provide the basis for probabilistic learning methods that
accommodate (and require) knowledge about the prior probabilities of alternative
hypotheses and about the probability of observing various data given the hypothesis.
Bayesian methods allow assigning a posterior probability to each candidate hypothesis,
based on these assumed priors and the observed data.
• Bayesian methods can be used to determine the most probable hypothesis given the
data (the maximum a posteriori, or MAP, hypothesis). This is the optimal hypothesis in the
sense that no other hypothesis is more likely.
• The naive Bayes classifier is a Bayesian learning method that has been found to be
useful in many practical applications. It is called "naive" because it incorporates the
simplifying assumption that attribute values are conditionally independent, given the
classification of the instance. When this assumption is met, the naive Bayes classifier
outputs the MAP classification. Even when this assumption is not met, as in the case of
learning to classify text, the naive Bayes classifier is often quite effective. Bayesian
belief networks provide a more expressive representation for sets of conditional
independence assumptions among subsets of the attributes.
• The framework of Bayesian reasoning can provide a useful basis for analyzing certain
learning methods that do not directly apply Bayes theorem. For example, under certain
conditions it can be shown that minimizing the squared error when learning a real-
valued target function corresponds to computing the maximum likelihood hypothesis.
• The Minimum Description Length principle recommends choosing the hypothesis that
minimizes the description length of the hypothesis plus the description length of the
data given the hypothesis. Bayes theorem and basic results from information theory can
be used to provide a rationale for this principle.
• In many practical learning tasks, some of the relevant instance variables may be
unobservable. The EM algorithm provides a quite general approach to learning in the
presence of unobservable variables. This algorithm begins with an arbitrary initial
hypothesis. It then repeatedly calculates the expected values of the hidden variables
(assuming the current hypothesis is correct), and then recalculates the maximum
likelihood hypothesis (assuming the hidden variables have the expected values
calculated by the first step). This procedure converges to a local maximum likelihood
hypothesis, along with estimated values for the hidden variables.
*****