Bayesian Learning: Salma Itagi, Svit
Module - 4
BAYESIAN LEARNING
INTRODUCTION
Bayesian learning methods are relevant to our study of machine learning for two different
reasons.
First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to certain
types of learning problems.
Second, Bayesian methods provide a useful perspective for understanding many learning
algorithms that do not explicitly manipulate probabilities.
Features of Bayesian learning methods include:
Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example.
Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting
(1) A prior probability for each candidate hypothesis, and
(2) A probability distribution over observed data for each possible hypothesis.
Bayesian methods can accommodate hypotheses that make probabilistic predictions
(e.g., hypotheses such as "this pneumonia patient has a 93% chance of complete
recovery").
New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.
An Example
To illustrate Bayes rule, consider a medical diagnosis problem in which there are two
alternative hypotheses:
(1) That the patient has a particular form of cancer, and
(2) That the patient does not have cancer.
The available data is from a particular laboratory test with two possible outcomes: ⊕
(positive) and ⊖ (negative). We have prior knowledge that over the entire population of
people only 0.008 have this disease.
The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the disease is
not present. In other cases, the test returns the opposite result. Suppose we now observe a
new patient for whom the lab test returns a positive result. Should we diagnose the patient as
having cancer or not?
The above situation can be summarized by the following probabilities:

P(cancer) = 0.008          P(¬cancer) = 0.992
P(⊕ | cancer) = 0.98       P(⊖ | cancer) = 0.02
P(⊕ | ¬cancer) = 0.03      P(⊖ | ¬cancer) = 0.97
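The diagnosis question above can be answered by comparing the unnormalized posteriors P(⊕|h)P(h) for the two hypotheses. A minimal sketch, using only the numbers stated in the text:

```python
# Applying Bayes rule to the cancer-test example from the text.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer          # 0.992
p_pos_given_cancer = 0.98            # correct positive rate
p_pos_given_not = 1 - 0.97           # 0.03, false-positive rate

# Unnormalized posteriors P(+|h) * P(h) for each hypothesis
post_cancer = p_pos_given_cancer * p_cancer     # 0.98 * 0.008 = 0.00784
post_not = p_pos_given_not * p_not_cancer       # 0.03 * 0.992 = 0.02976

# Normalize to get P(cancer | +)
p_cancer_given_pos = post_cancer / (post_cancer + post_not)
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")   # ~0.21
print("hMAP =", "cancer" if post_cancer > post_not else "not cancer")
```

Even after a positive test, the MAP hypothesis is that the patient does not have cancer, because the disease's low prior outweighs the test's accuracy.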
In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm we
must specify what values are to be used for P(h) and for P(D|h). Here let us choose them to
be consistent with the following assumptions:
1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is more probable than any
other.
Given these assumptions, what values should we specify for P(h)? According to
assumptions 2 and 3, it is reasonable to assign the same prior probability to every
hypothesis:

P(h) = 1 / |H|   for all h in H
What choice shall we make for P(D|h)? Since we assume noise-free training data, the
probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi).
Therefore,

P(D|h) = 1 if di = h(xi) for each di in D, and 0 otherwise
In other words, the probability of data D given hypothesis h is 1 if D is consistent with h, and
0 otherwise. Let us consider the first step of this algorithm. Recalling Bayes theorem, we
have

P(h|D) = P(D|h) P(h) / P(D)

First consider the case where h is inconsistent with the training data D. Since Equation (6.4)
defines P(D|h) to be 0 when h is inconsistent with D, we have

P(h|D) = (0 · P(h)) / P(D) = 0

Now consider the case where h is consistent with D. Since P(D|h) is 1 when h is consistent
with D, we have

P(h|D) = (1 · (1/|H|)) / P(D) = (1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|

where VS_H,D is the subset of hypotheses from H that are consistent with D (i.e., the
version space of H with respect to D).
We can derive P(D) from the theorem of total probability and the fact that the hypotheses are
mutually exclusive:

P(D) = Σ_{hi ∈ H} P(D|hi) P(hi) = Σ_{hi ∈ VS_H,D} 1 · (1/|H|) = |VS_H,D| / |H|
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our
assumed P(h) and P(D|h) is

P(h|D) = 1 / |VS_H,D| if h is consistent with D, and 0 otherwise

where |VS_H,D| is the number of hypotheses from H consistent with D. The evolution of
probabilities associated with hypotheses is depicted schematically in Figure 6.1.
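The result above can be sketched directly: under the noise-free assumptions, the posterior is uniform over the version space and zero elsewhere. The tiny hypothesis space below is hypothetical, chosen only to make the computation concrete:

```python
# Brute-force MAP posterior under the noise-free, uniform-prior assumptions:
# P(h|D) = 1/|VS_H,D| if h is consistent with D, 0 otherwise.
# Hypothetical hypothesis space: boolean functions of one boolean attribute.
H = {
    "always_true":  lambda x: True,
    "always_false": lambda x: False,
    "identity":     lambda x: x,
    "negation":     lambda x: not x,
}

D = [(True, True), (False, False)]   # (instance, target) pairs, noise free

# Version space VS_H,D: hypotheses consistent with every training example
vs = [name for name, h in H.items() if all(h(x) == d for x, d in D)]

# Posterior: uniform over the version space, zero elsewhere
posterior = {name: (1.0 / len(vs) if name in vs else 0.0) for name in H}
print(posterior)   # only "identity" is consistent, so its posterior is 1.0
```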
The assumption is that given the target value of the instance, the probability of observing the
conjunction a1, a2, ..., an is just the product of the probabilities for the individual attributes:
P(a1, a2, ..., an | vj) = Π_i P(ai|vj). Substituting this into Equation (6.19), we have the approach
used by the naive Bayes classifier:

vNB = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)

where vNB denotes the target value output by the naive Bayes classifier.
An Illustrative Example
Consider the dataset of 14 instances and 4 attributes that we have used in Decision tree
learning module.
To calculate vNB we now require 10 probabilities that can be estimated from the training
data. First, the probabilities of the different target values can easily be estimated based on
their frequencies over the 14 training examples:

P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36

Similarly, we can estimate the conditional probabilities. For example, those for Wind = strong
are

P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
Using these probability estimates and similar estimates for the remaining attribute values, we
calculate vNB according to Equation (6.21). Thus, the naive Bayes classifier assigns the target
value PlayTennis = no to this new instance, based on the probability estimates learned from
the training data.
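The whole calculation can be sketched end to end. The 14-row PlayTennis table below is assumed from the decision-tree module, and the new instance classified is the textbook's example, ⟨Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong⟩:

```python
from collections import Counter

# Naive Bayes: vNB = argmax_v P(v) * prod_i P(ai|v), estimated by frequencies.
# Rows: (Outlook, Temperature, Humidity, Wind, PlayTennis) - assumed table.
data = [
    ("sunny", "hot", "high", "weak", "no"),     ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),  ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rain", "mild", "high", "strong", "no"),
]

def classify(instance):
    targets = Counter(row[-1] for row in data)
    scores = {}
    for v, n_v in targets.items():
        score = n_v / len(data)                     # P(v)
        for i, a in enumerate(instance):
            n_ai = sum(1 for row in data if row[-1] == v and row[i] == a)
            score *= n_ai / n_v                     # P(ai | v)
        scores[v] = score
    return max(scores, key=scores.get), scores

v_nb, scores = classify(("sunny", "cool", "high", "strong"))
print(v_nb, scores)   # "no", with scores approximately no: 0.0206, yes: 0.0053
```

Note the scores 0.0206 and 0.0053 are unnormalized; normalizing gives a conditional probability of about 0.795 that PlayTennis = no for this instance.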
Estimating Probabilities
Up to this point we have estimated probabilities by the fraction of times the event is observed
to occur over the total number of opportunities. For example, in the above case we estimated
P(Wind = strong | PlayTennis = no) by the fraction nc/n, where n = 5 is the total number of
training examples for which PlayTennis = no, and nc = 3 is the number of these for which
Wind = strong. While this observed fraction provides a good estimate in many cases, it
provides poor estimates when nc is very small. This raises two difficulties. First, nc/n
produces a biased underestimate of the probability. Second, when this probability estimate is
zero, this probability term will dominate the Bayes classifier if the future query contains
Wind = strong. To avoid this difficulty we can adopt a Bayesian approach to estimating the
probability, using the m-estimate defined as follows:

(nc + m p) / (n + m)
Here, nc and n are defined as before, p is our prior estimate of the probability we wish to
determine, and m is a constant called the equivalent sample size, which determines how
heavily to weight p relative to the observed data.
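A minimal sketch of the m-estimate, applied to the P(Wind = strong | PlayTennis = no) numbers above (nc = 3, n = 5), with a uniform prior p = 0.5 over the two Wind values and an assumed equivalent sample size m = 2:

```python
# m-estimate of probability: (nc + m*p) / (n + m).
# The prior p acts as if m additional "virtual" examples were observed.
def m_estimate(nc, n, p, m):
    """Bayesian-smoothed probability estimate."""
    return (nc + m * p) / (n + m)

print(m_estimate(3, 5, 0.5, 2))   # (3 + 1) / 7 ~ 0.571, pulled toward the prior
print(m_estimate(0, 5, 0.5, 2))   # nonzero even when nc = 0, avoiding zero terms
```

Note that with m = 0 the estimate reduces to the raw fraction nc/n, while large m pushes it toward the prior p.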
Equation (6.23) is just the general form of the product rule of probability:

P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V)

Equation (6.24) follows because if A1 is conditionally independent of A2 given V, then by
our definition of conditional independence P(A1 | A2, V) = P(A1 | V), so that

P(A1, A2 | V) = P(A1 | V) P(A2 | V)
Representation
A Bayesian belief network (Bayesian network for short) represents the joint probability
distribution for a set of variables. In general, a Bayesian network represents the joint
probability distribution by specifying a set of conditional independence assumptions
(represented by a directed acyclic graph), together with sets of local conditional probabilities.
Inference
We might wish to use a Bayesian network to infer the value of some target variable (e.g.,
ForestFire) given the observed values of the other variables. This inference step can be
straightforward if values for all of the other variables in the network are known exactly. In
the more general case we may wish to infer the probability distribution for some variable
(e.g., ForestFire) given observed values for only a subset of the other variables (e.g., Thunder
and BusTourGroup may be the only observed values available). In general, a Bayesian
network can be used to compute the probability distribution for any subset of network
variables given the values or distributions for any subset of the remaining variables. Exact
inference of probabilities in general for an arbitrary Bayesian network is known to be NP-
hard (Cooper 1990). Numerous methods have been proposed for probabilistic inference in
Bayesian networks, including exact inference methods and approximate inference methods
that sacrifice precision to gain efficiency.
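Exact inference by enumeration can be sketched on a tiny network. The structure and all CPT numbers below are hypothetical; only the variable names echo the forest-fire example in the text:

```python
from itertools import product

# Hypothetical network: Storm -> ForestFire <- Campfire, so the joint factors as
#   P(S, C, F) = P(S) * P(C) * P(F | S, C)
p_storm = {True: 0.2, False: 0.8}
p_campfire = {True: 0.1, False: 0.9}
p_fire = {  # P(ForestFire=True | Storm, Campfire), made-up numbers
    (True, True): 0.5, (True, False): 0.3,
    (False, True): 0.2, (False, False): 0.01,
}

def joint(s, c, f):
    pf = p_fire[(s, c)]
    return p_storm[s] * p_campfire[c] * (pf if f else 1 - pf)

# P(Storm=True | ForestFire=True): sum the joint over the hidden variable
# Campfire in the numerator, and over all unobserved variables in the denominator.
num = sum(joint(True, c, True) for c in (True, False))
den = sum(joint(s, c, True) for s, c in product((True, False), repeat=2))
print(f"P(Storm | ForestFire) = {num / den:.3f}")
```

Enumeration like this is exponential in the number of hidden variables, which is why exact inference is NP-hard in general and approximate methods are used for large networks.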
Learning Bayesian Belief Networks
Can we devise effective algorithms for learning Bayesian belief networks from training data?
Several different settings for this learning problem can be considered. First, the network
structure might be given in advance, or it might have to be inferred from the training data.
Second, all the network variables might be directly observable in each training example, or
some might be unobservable.
In this case, the maximum likelihood hypothesis is the one that minimizes the sum of
squared errors over the observed instances, and the sum of squared errors is minimized by
the sample mean.
Our problem here, however, involves a mixture of k different Normal distributions, and we
cannot observe which instances were generated by which distribution. Thus, we have a
prototypical example of a problem involving hidden variables. We can think of the full
description of each instance as the triple (xi, zi1, zi2), where xi is the observed value of the ith
instance and where zi1 and zi2 indicate which of the two Normal distributions was used to
generate the value xi. In particular, zij has the value 1 if xi was created by the jth Normal
distribution and 0 otherwise. Here xi is the observed variable in the description of the
instance, and zi1 and zi2 are hidden variables.
Applied to the problem of estimating the two means, the EM algorithm first initializes the
hypothesis to h = ⟨μ1, μ2⟩, where μ1 and μ2 are arbitrary initial values. It then iteratively
re-estimates h by repeating the following two steps until the procedure converges to a
stationary value for h:
Step 1: Calculate the expected value E[zij] of each hidden variable zij, assuming the current
hypothesis h = ⟨μ1, μ2⟩ holds.
Step 2: Calculate a new maximum likelihood hypothesis h' = ⟨μ1', μ2'⟩, assuming the value
taken on by each hidden variable zij is its expected value E[zij] calculated in Step 1. Then
replace the hypothesis h by the new hypothesis h' and iterate.
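The two steps above can be sketched as follows, assuming a known, equal variance σ² and an equal-weight mixture of two Normals; the synthetic data and initial values are hypothetical:

```python
import math
import random

# EM for the means of a mixture of two Normals with known, equal variance.
random.seed(0)
sigma = 1.0
# Synthetic sample: half from N(0, 1), half from N(4, 1)
xs = [random.gauss(0, sigma) for _ in range(200)] + \
     [random.gauss(4, sigma) for _ in range(200)]

mu1, mu2 = -1.0, 1.0   # arbitrary initial hypothesis h = <mu1, mu2>
for _ in range(50):
    # E step: E[z_i1] is proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
    e1 = []
    for x in xs:
        a = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
        b = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
        e1.append(a / (a + b))
    # M step: each mean becomes the E[z_ij]-weighted sample mean
    mu1 = sum(e * x for e, x in zip(e1, xs)) / sum(e1)
    mu2 = sum((1 - e) * x for e, x in zip(e1, xs)) / sum(1 - e for e in e1)

print(round(mu1, 2), round(mu2, 2))   # close to the true means 0 and 4
```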
In the above two-means example the parameters of interest were θ = ⟨μ1, μ2⟩, and the
full data were the triples ⟨xi, zi1, zi2⟩, of which only the xi were observed.
In general let X = {x1, ..., xm} denote the observed data in a set of m independently
drawn instances, let Z = {z1, ..., zm} denote the unobserved data in these same
instances, and let Y = X ∪ Z denote the full data.
We use h to denote the current hypothesized values of the parameters θ, and h' to
denote the revised hypothesis that is estimated on each iteration of the EM algorithm.
The EM algorithm searches for the maximum likelihood hypothesis h' by seeking the
h' that maximizes E[ln P(Y|h')].
Let us define a function Q(h’|h) that gives E[ln P(Y|h')] as a function of h', under the
assumption that θ = h and given the observed portion X of the full data Y.
In its general form, the EM algorithm repeats the following two steps until
convergence:
Step 1: Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the
observed data X to estimate the probability distribution over Y:

Q(h'|h) ← E[ln P(Y|h') | h, X]

Step 2: Maximization (M) step: Replace hypothesis h by the hypothesis h' that
maximizes this Q function:

h ← argmax_{h'} Q(h'|h)
When the function Q is continuous, the EM algorithm converges to a stationary point of the
likelihood function P(Y|h).
In this respect, EM shares some of the same limitations as other optimization methods such as
gradient descent, line search, and conjugate gradient.
Derivation of the k Means Algorithm
Let us use the EM algorithm to derive the algorithm for estimating the means of a mixture of k
Normal distributions. To apply EM we must derive an expression for Q(h'|h) that applies to
our k-means problem.
First, let us derive an expression for ln p(Y|h'). Note the probability p(yi|h') of a single
instance yi = ⟨xi, zi1, ..., zik⟩ of the full data can be written

p(yi|h') = p(xi, zi1, ..., zik | h') = (1/√(2πσ²)) e^(−(1/(2σ²)) Σ_{j=1}^{k} zij (xi − μj')²)

Thus,

ln P(Y|h') = ln Π_{i=1}^{m} p(yi|h') = Σ_{i=1}^{m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} zij (xi − μj')² )

Note the above expression for ln P(Y|h') is a linear function of the zij. In general, for any
function f(z) that is a linear function of z, the following equality holds:

E[f(z)] = f(E[z])
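This linearity property is easy to check numerically. The sketch below verifies E[f(z)] = f(E[z]) for a linear f on a uniform sample, and contrasts it with a nonlinear function where the equality fails:

```python
import random

# Numeric check: for linear f, E[f(z)] = f(E[z]); for nonlinear f it fails.
random.seed(1)
zs = [random.random() for _ in range(100_000)]   # z ~ Uniform(0, 1)

f = lambda z: 3 * z + 2   # linear
g = lambda z: z * z       # nonlinear, for contrast

mean = sum(zs) / len(zs)
ef = sum(f(z) for z in zs) / len(zs)
eg = sum(g(z) for z in zs) / len(zs)

print(abs(ef - f(mean)))   # essentially 0: equality holds for linear f
print(abs(eg - g(mean)))   # about 1/12: E[z^2] - E[z]^2 is the variance, not 0
```

It is exactly this property that lets EM replace each zij in ln P(Y|h') by its expected value E[zij] when computing Q(h'|h).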