
Machine Learning Unit -4 18CS62

BAYESIAN LEARNING
Syllabus:
Bayesian Learning: Introduction, Bayes theorem, Bayes theorem and concept learning, ML and LS
error hypothesis, ML for predicting probabilities, MDL principle, Naive Bayes classifier, Bayesian belief
networks, EM algorithm.
Bayesian reasoning provides a probabilistic approach to inference. It is based on the
assumption that the quantities of interest are governed by probability distributions and that
optimal decisions can be made by reasoning about these probabilities together with observed
data. It is important to machine learning because it provides a quantitative approach to
weighing the evidence supporting alternative hypotheses.
Bayesian reasoning provides the basis for learning algorithms that directly manipulate
probabilities, as well as a framework for analyzing the operation of other algorithms that do
not explicitly manipulate probabilities.

SOME IMPORTANT TERMS

• Probability?
– Probability is the branch of mathematics concerned with numerical descriptions of how likely an event is to occur or not occur.
– The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility (the event does not occur) and 1 indicates certainty (the event occurs).
• Inference?
– A conclusion reached on the basis of evidence and reasoning; in machine learning, a conclusion drawn from the observed data set together with background knowledge.
• Hypothesis?
– A proposed explanation or statement made on the basis of limited evidence (data sets), used as a starting point for further investigation.
– Two common forms are the null hypothesis (no effect) and the alternative hypothesis (an effect exists).
• Prior Probability?
– A general belief about the quantity of interest held before the data are seen.
– Prior probability represents what is originally believed before new evidence is introduced.


INTRODUCTION

Bayesian learning methods are relevant to the study of machine learning for two different reasons.
1. First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to certain
types of learning problems
2. The second reason is that they provide a useful perspective for understanding many
learning algorithms that do not explicitly manipulate probabilities.

Features of Bayesian Learning Methods

1. Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example
2. Prior knowledge can be combined with observed data to determine the final probability
of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a
prior probability for each candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
3. Bayesian methods can accommodate hypotheses that make probabilistic predictions
4. New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
5. Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.

Practical difficulties in applying Bayesian methods

1. One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known in
advance they are often estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to determine
the Bayes optimal hypothesis in the general case. In certain specialized situations, this
computational cost can be significantly reduced.


BAYES THEOREM

 Bayes’ Theorem (also known as Bayes’ rule) is a deceptively simple formula used to
calculate conditional probability. The Theorem was named after English
mathematician Thomas Bayes (1701-1761). Bayes Theorem is also widely used in the
field of machine learning.
 Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.

Notations
 P(h): prior probability of h; reflects any background knowledge about the chance that h is correct
 P(D): prior probability of D; the probability that D will be observed
 P(D|h): probability of observing D given a world in which h holds
 P(h|D): posterior probability of h; reflects confidence that h holds after D has been observed

 Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

– P(h|D) increases with P(h) and with P(D|h), according to Bayes theorem.
– P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independently of h, the less evidence D provides in support of h.


Maximum a Posteriori (MAP) Hypothesis


 In many learning scenarios, the learner considers some set of candidate hypotheses H
and is interested in finding the most probable hypothesis h ∈ H given the observed data
D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP)
hypothesis.
 We can use Bayes theorem to calculate the posterior probability of each candidate hypothesis; hMAP is a MAP hypothesis provided

hMAP = argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)

– P(D) can be dropped, because it is a constant independent of h

Maximum Likelihood (ML) Hypothesis


 In some cases, it is assumed that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H).
 In this case the equation for hMAP can be simplified, and we need only consider the term P(D|h) to find the most probable hypothesis:

hML = argmax h∈H P(D|h)

 P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis.

Example
 Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: + (positive) and − (negative).
 We have prior knowledge that over the entire population of people only 0.008 have this disease. Furthermore, the lab test is only an imperfect indicator of the disease.
 The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the
disease is not present. In other cases, the test returns the opposite result.


 The above situation can be summarized by the following probabilities:

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+|cancer) = 0.98       P(−|cancer) = 0.02
P(+|¬cancer) = 0.03      P(−|¬cancer) = 0.97

Suppose a new patient is observed for whom the lab test returns a positive (+) result.
Should we diagnose the patient as having cancer or not? Applying the MAP rule:

P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

Thus hMAP = ¬cancer: even though the test is positive, it is more probable that the patient does not have cancer.

The exact posterior probabilities can also be determined by normalizing the above quantities so that they sum to 1:

P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
P(¬cancer|+) = 0.0298 / (0.0078 + 0.0298) ≈ 0.79
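To make the arithmetic concrete, here is a minimal Python sketch (not part of the original notes); the probabilities are hard-coded from the example above and the variable names are only illustrative.

p_cancer = 0.008              # prior P(cancer)
p_not_cancer = 0.992          # prior P(~cancer)
p_pos_given_cancer = 0.98     # P(+ | cancer)
p_pos_given_not_cancer = 0.03 # P(+ | ~cancer)

# Unnormalized posteriors P(+|h) * P(h) for the two hypotheses
score_cancer = p_pos_given_cancer * p_cancer              # 0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.0298

# hMAP is the hypothesis with the larger unnormalized score (~cancer here);
# normalizing gives the exact posteriors.
total = score_cancer + score_not_cancer
print("P(cancer | +)  =", round(score_cancer / total, 3))      # ~0.21
print("P(~cancer | +) =", round(score_not_cancer / total, 3))  # ~0.79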

Basic formulas for calculating probabilities are summarized below:

• Product rule: probability of a conjunction of two events A and B: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: probability of a disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Bayes theorem: the posterior probability of h given D: P(h|D) = P(D|h) P(h) / P(D)
• Theorem of total probability: if events A1, ..., An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B|Ai) P(Ai)


BAYES THEOREM AND CONCEPT LEARNING

What is the relationship between Bayes theorem and the problem of concept learning?

Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, so we can use it as the basis for a straightforward learning algorithm that calculates the probability of each possible hypothesis and outputs the most probable one.

Brute-Force Bayes Concept Learning

Consider the concept learning problem


 Assume the learner considers some finite hypothesis space H defined over the instance
space X, in which the task is to learn some target concept c : X → {0,1}.
 Learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)) where xi is
some instance from X and where di is the target value of xi (i.e., di = c(xi)).
 The sequence of target values are written as D = (d1 . . . dm).

We can design a straightforward concept learning algorithm to output the maximum a posteriori
hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm:

1. For each hypothesis h in H, calculate the posterior probability

   P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

   hMAP = argmax h∈H P(h|D)

In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we must specify what values are to be used for P(h) and for P(D|h).

Let's choose P(h) and P(D|h) to be consistent with the following assumptions:
 The training data D is noise free (i.e., di = c(xi))
 The target concept c is contained in the hypothesis space H
 Do not have a priori reason to believe that any hypothesis is more probable than any
other.


What values should we specify for P(h)?


 Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H.
 Assume the target concept is contained in H and require that these prior probabilities sum to 1. Therefore, we set

P(h) = 1 / |H|   for all h in H

What choice shall we make for P(D|h)?


 P(D|h) is the probability of observing the target values D = (d1 . . .dm) for the fixed set
of instances (x1 . . . xm), given a world in which hypothesis h holds
 Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D|h) = 1 if di = h(xi) for all di in D
P(D|h) = 0 otherwise

Given these choices for P(h) and for P(D|h) we now have a fully-defined problem for the above
BRUTE-FORCE MAP LEARNING algorithm.

Recalling Bayes theorem, we have

P(h|D) = P(D|h) P(h) / P(D)

Consider first the case where h is inconsistent with the training data D. Since P(D|h) = 0 when h is inconsistent with D,

P(h|D) = (0 · P(h)) / P(D) = 0

The posterior probability of a hypothesis inconsistent with D is zero.

Now consider the case where h is consistent with D. Since P(D|h) = 1 when h is consistent with D, P(h) = 1/|H|, and P(D) = |VSH,D| / |H| (by the theorem of total probability, summing over the consistent hypotheses),

P(h|D) = (1 · 1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|

Where VSH,D is the subset of hypotheses from H that are consistent with D (the version space of H with respect to D).

To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h) and P(D|h) is

P(h|D) = 1 / |VSH,D|   if h is consistent with D
P(h|D) = 0             otherwise
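As an illustration (a hedged sketch, not from the notes), the following Python code applies the algorithm to a tiny, invented hypothesis space of boolean functions over one-bit instances, under exactly these assumptions (uniform prior, noise-free data):

from itertools import product

# Instances are single bits; a hypothesis is any boolean function {0,1} -> {0,1}.
instances = [0, 1]
hypotheses = [dict(zip(instances, outputs)) for outputs in product([0, 1], repeat=2)]

# Noise-free training data: (x, c(x)) pairs for an assumed target concept c(x) = x.
D = [(0, 0), (1, 1)]

prior = 1.0 / len(hypotheses)  # uniform prior P(h) = 1/|H|

def likelihood(h):
    # P(D|h) = 1 if h is consistent with every training example, else 0
    return 1.0 if all(h[x] == d for x, d in D) else 0.0

scores = [likelihood(h) * prior for h in hypotheses]   # P(D|h) P(h)
posteriors = [s / sum(scores) for s in scores]         # normalize by P(D)
h_map = hypotheses[posteriors.index(max(posteriors))]
print(posteriors)   # 1/|VS| for each consistent hypothesis, 0 otherwise
print(h_map)        # here the single consistent hypothesis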

The Evolution of Probabilities Associated with Hypotheses

 Figure (a): initially, all hypotheses have the same probability.
 Figures (b) and (c): as training data accumulates, the posterior probability of inconsistent hypotheses becomes zero, while the total probability (summing to 1) is shared equally among the remaining consistent hypotheses.

MAP Hypotheses and Consistent Learners


 A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero
errors over the training examples.
 Every consistent learner outputs a MAP hypothesis, if we assume
– a uniform prior probability distribution over H (P(hi) = P(hj) for all i, j), and
– deterministic, noise free training data (P(D|h) =1 if D and h are consistent, and 0
otherwise).

Example:
 Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the probability distributions P(h) and P(D|h) defined above.
 Are there other probability distributions for P(h) and P(D|h) under which FIND-S outputs MAP hypotheses? Yes.
 Because FIND-S outputs a maximally specific hypothesis from the version space, its output hypothesis will be a MAP hypothesis relative to any prior probability distribution that favours more specific hypotheses.

Note
 The Bayesian framework is a way to characterize the behaviour of learning algorithms.
 By identifying probability distributions P(h) and P(D|h) under which the output is an optimal (MAP) hypothesis, the implicit assumptions of the algorithm can be characterized (its inductive bias).
 Inductive inference is modelled by an equivalent probabilistic reasoning system based on Bayes theorem.


MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

Consider the problem of learning a continuous-valued target function such as neural network
learning, linear regression, and polynomial curve fitting

A straightforward Bayesian analysis will show that under certain assumptions any learning
algorithm that minimizes the squared error between the output hypothesis predictions
and the training data will output a maximum likelihood (ML) hypothesis

 Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R ], and training examples of the form <xi, di>.
 The problem faced by L is to learn an unknown target function f : X → R
 A set of m training examples is provided, where the target value of each example is
corrupted by random noise drawn according to a Normal probability distribution with
zero mean (di = f(xi) + ei)
 Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random
variable representing the noise.
– It is assumed that the values of the ei are drawn independently and that they
are distributed according to a Normal distribution with zero mean.
 The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP hypothesis assuming all hypotheses are equally probable a priori.

Using the definition of hML we have

hML = argmax h∈H p(D|h)

Assuming the training examples are mutually independent given h, we can write p(D|h) as the product of the various p(di|h):

hML = argmax h∈H Π(i=1..m) p(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ2, each di must also obey a Normal distribution with variance σ2 centred around the true target value f(xi). Because we are writing the expression for p(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi), and

hML = argmax h∈H Π(i=1..m) (1/√(2πσ2)) e^( −(di − h(xi))² / (2σ2) )


We now maximize the less complicated logarithm, which is justified because ln p is a monotonic function of p:

hML = argmax h∈H Σ(i=1..m) [ ln(1/√(2πσ2)) − (di − h(xi))² / (2σ2) ]

The first term in this expression is a constant independent of h, and can therefore be discarded, yielding

hML = argmax h∈H Σ(i=1..m) − (di − h(xi))² / (2σ2)

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:

hML = argmin h∈H Σ(i=1..m) (di − h(xi))² / (2σ2)

Finally, we discard constants that are independent of h:

hML = argmin h∈H Σ(i=1..m) (di − h(xi))²

Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).
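A small numerical check of this result (a hedged Python sketch with an invented linear hypothesis class and synthetic Gaussian noise, not taken from the notes):

import math
import random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                            # assumed 'true' target function
xs = [i / 10.0 for i in range(30)]
ds = [f(x) + random.gauss(0.0, 0.5) for x in xs]       # d_i = f(x_i) + e_i, Gaussian noise

def sse(w, b):
    # sum of squared errors of the hypothesis h(x) = w*x + b
    return sum((d - (w * x + b)) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w, b, sigma=0.5):
    # log of the Normal likelihood of the same data under h(x) = w*x + b
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - (w * x + b)) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

# Over a grid of candidate hypotheses, the argmin of SSE is the argmax of the log-likelihood.
grid = [(w / 10.0, b / 10.0) for w in range(0, 41) for b in range(0, 41)]
best_by_sse = min(grid, key=lambda p: sse(*p))
best_by_ll = max(grid, key=lambda p: log_likelihood(*p))
print(best_by_sse == best_by_ll, best_by_sse)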

Note:
Why is it reasonable to choose the Normal distribution to characterize noise?
 Good approximation of many types of noise in physical systems
 Central Limit Theorem shows that the sum of a sufficiently large number of
independent, identically distributed random variables itself obeys a Normal distribution
Only noise in the target value is considered, not in the attributes describing the instances
themselves


MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES

 Consider the setting in which we wish to learn a nondeterministic (probabilistic) function f : X → {0, 1}, which has two discrete output values.
 We want a function approximator whose output is the probability that f(x) = 1. In other words, we want to learn the target function

f' : X → [0, 1]

such that

f'(x) = P(f(x) = 1)

 For example, if x describes a medical patient and f(x) indicates whether the patient survives, then if x is one of those indistinguishable patients of which 92% survive, f'(x) = 0.92, whereas the probabilistic function f(x) will be equal to 1 in 92% of cases and equal to 0 in the remaining 8%.

How can we learn f ' using a neural network?


 A brute-force way would be to first collect the observed frequencies of 1's and 0's for each possible value of x and then train the neural network to output the target frequency for each x.

What criterion should we optimize in order to find a maximum likelihood hypothesis for f ' in
this setting?
 First obtain an expression for P(D|h).
 Assume the training data D is of the form D = {(x1, d1) . . . (xm, dm)}, where di is the observed 0 or 1 value for f(xi).
 Treating both xi and di as random variables, and assuming that each training example is drawn independently, we can write P(D|h) as

P(D|h) = Π(i=1..m) P(xi, di | h)                                  (1)

Applying the product rule,

P(D|h) = Π(i=1..m) P(di | h, xi) P(xi)                            (2)

The probability P(di|h, xi) is

P(di|h, xi) = h(xi)        if di = 1
P(di|h, xi) = 1 − h(xi)    if di = 0                              (3)

We can re-express it in a more mathematically manipulable form, as

P(di|h, xi) = h(xi)^di (1 − h(xi))^(1−di)                         (4)


Using Equation (4) to substitute for P(di|h, xi) in Equation (2), we obtain

P(D|h) = Π(i=1..m) h(xi)^di (1 − h(xi))^(1−di) P(xi)              (5)

We can now write an expression for the maximum likelihood hypothesis

hML = argmax h∈H Π(i=1..m) h(xi)^di (1 − h(xi))^(1−di) P(xi)

The last term P(xi) is a constant independent of h, so it can be dropped:

hML = argmax h∈H Π(i=1..m) h(xi)^di (1 − h(xi))^(1−di)            (6)

It is easier to work with the log of the likelihood, yielding

hML = argmax h∈H Σ(i=1..m) [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]   (7)

Equation (7), the quantity we denote G(h, D) and often called the cross-entropy criterion, describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in our current problem setting.

Gradient Search to Maximize Likelihood in a Neural Net


Derive a weight-training rule for neural network learning that seeks to maximize G(h,D)
using gradient ascent
 The gradient of G(h,D) is given by the vector of partial derivatives of G(h,D) with respect to the various network weights that define the hypothesis h represented by the learned network.
 In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is

∂G(h,D)/∂wjk = Σ(i=1..m) [ (di − h(xi)) / (h(xi)(1 − h(xi))) ] · ∂h(xi)/∂wjk

 Suppose our neural network is constructed from a single layer of sigmoid units. Then

∂h(xi)/∂wjk = σ'(xi) xijk = h(xi)(1 − h(xi)) xijk

where xijk is the kth input to unit j for the ith training example, and σ' is the derivative of the sigmoid squashing function.

 Finally, substituting this expression into the gradient above, we obtain a simple expression for the derivatives that constitute the gradient:

∂G(h,D)/∂wjk = Σ(i=1..m) (di − h(xi)) xijk

Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than gradient descent search. On each iteration of the search the weight vector is adjusted in the direction of the gradient, using the weight update rule

wjk ← wjk + Δwjk,   where   Δwjk = α Σ(i=1..m) (di − h(xi)) xijk

Where α is a small positive constant that determines the step size of the gradient ascent search.
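A minimal Python sketch of this update rule for a single sigmoid unit (the data set, learning rate, and number of iterations below are invented purely for illustration):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Training examples: (inputs with a leading 1 for the bias weight, target d in {0,1})
data = [([1.0, 0.2, 0.9], 1), ([1.0, 0.8, 0.1], 0),
        ([1.0, 0.3, 0.7], 1), ([1.0, 0.9, 0.2], 0)]
w = [0.0, 0.0, 0.0]   # weights w_jk of the single unit
alpha = 0.1           # step size of the gradient ascent

for _ in range(1000):
    # gradient of G(h, D): dG/dw_k = sum_i (d_i - h(x_i)) * x_ik
    grad = [sum((d - sigmoid(sum(wk * xk for wk, xk in zip(w, x)))) * x[k]
                for x, d in data)
            for k in range(len(w))]
    w = [wk + alpha * gk for wk, gk in zip(w, grad)]   # ascent, not descent

print([round(wk, 2) for wk in w])   # weights that (locally) maximize the likelihood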

MINIMUM DESCRIPTION LENGTH PRINCIPLE

 We have already learnt about Occam's razor, a popular inductive bias that can be summarized as “choose the shortest explanation for the observed data”.
 Here, we consider a Bayesian perspective on Occam's razor through a closely related principle called the Minimum Description Length (MDL) principle.
 The Minimum Description Length (MDL) principle is motivated by interpreting the definition
of hMAP in the light of basic concepts from information theory.

hMAP = argmax h∈H P(D|h) P(h)

which can be equivalently expressed in terms of maximizing the log2:

hMAP = argmax h∈H [ log2 P(D|h) + log2 P(h) ]

or alternatively, minimizing the negative of this quantity:

hMAP = argmin h∈H [ −log2 P(D|h) − log2 P(h) ]                    (1)

 This equation (1) can be interpreted as a statement that short hypotheses are preferred, assuming a particular representation scheme for encoding hypotheses and data.
 −log2 P(h): the description length of h under the optimal encoding for the hypothesis space H, LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
 −log2 P(D|h): the description length of the training data D given hypothesis h, under the optimal encoding for describing D when h is known: LCD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h.


 We can rewrite Equation (1) to show that hMAP is the hypothesis h that minimizes the sum given by the description length of the hypothesis plus the description length of the data given the hypothesis:

hMAP = argmin h∈H [ LCH(h) + LCD|h(D|h) ]

Where CH and CD|h are the optimal encodings for H and for D given h.

 The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths.

Minimum Description Length principle: Choose hMDL where

hMDL = argmin h∈H [ LC1(h) + LC2(D|h) ]

– Where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis, respectively.
– The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.

Application to Decision Tree Learning

Apply the MDL principle to the problem of learning decision trees from some training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
 For C1: C1 might be some obvious encoding, in which the description length grows with
the number of nodes and with the number of edges
 For C2: Suppose that the sequence of instances (x1 . . . xm) is already known to both the transmitter and receiver, so that we need only transmit the classifications (f(x1) . . . f(xm)).
 Now if the training classifications (f(x1) . . . f(xm)) are identical to the predictions of the hypothesis, then there is no need to transmit any information about these examples. The description length of the classifications given the hypothesis is ZERO.
 If examples are misclassified by h, then for each misclassification we need to transmit
a message that identifies which example is misclassified as well as its correct
classification
 The hypothesis hMDL under the encoding C1 and C2 is just the one that minimizes the
sum of these description lengths.
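The trade-off can be sketched numerically in Python. The encoding costs below (a fixed number of bits per tree node, and log2-based costs to identify a misclassified example and transmit its correct class) are invented placeholders, not the optimal codes discussed above.

import math

def total_description_length(num_nodes, num_errors, m, num_classes, bits_per_node=4.0):
    l_h = num_nodes * bits_per_node                      # L_C1(h): grows with the tree size
    # L_C2(D|h): for each misclassified example, say which of the m examples it is
    # and transmit its correct classification.
    l_d_given_h = num_errors * (math.log2(m) + math.log2(num_classes))
    return l_h + l_d_given_h

m = 1000
large_tree = total_description_length(num_nodes=60, num_errors=2, m=m, num_classes=2)
small_tree = total_description_length(num_nodes=10, num_errors=15, m=m, num_classes=2)
print(large_tree, small_tree)   # hMDL is whichever tree gives the smaller sum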


NAIVE BAYES CLASSIFIER

 The naive Bayes classifier applies to learning tasks where each instance x is described by
a conjunction of attribute values and where the target function f (x) can take on any value
from some finite set V.
 A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2, ..., am).
 The learner is asked to predict the target value, or classification, for this new instance.
 The Bayesian approach to classifying the new instance is to assign the most probable target value, vMAP, given the attribute values (a1, a2, ..., am) that describe the instance:

vMAP = argmax vj∈V P(vj | a1, a2, ..., am)

Use Bayes theorem to rewrite this expression as

vMAP = argmax vj∈V P(a1, a2, ..., am | vj) P(vj) / P(a1, a2, ..., am)
     = argmax vj∈V P(a1, a2, ..., am | vj) P(vj)                          (1)

 The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. That is, given the target value of the instance, the probability of observing the conjunction (a1, a2, ..., am) is just the product of the probabilities for the individual attributes:

P(a1, a2, ..., am | vj) = Π(i) P(ai | vj)

Substituting this into Equation (1) gives the naive Bayes classifier:

vNB = argmax vj∈V P(vj) Π(i) P(ai | vj)                                    (2)

Where vNB denotes the target value output by the naive Bayes classifier.


An Illustrative Example
 Let us apply the naive Bayes classifier to a concept learning problem i.e., classifying
days according to whether someone will play tennis.
 The below table provides a set of 14 training examples of the target concept PlayTennis,
where each day is described by the attributes Outlook, Temperature, Humidity, and
Wind

Day Outlook Temperature Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

 Use the naive Bayes classifier and the training data from this table to classify the
following novel instance:
< Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong >
 Our task is to predict the target value (yes or no) of the target concept PlayTennis for

this new instance

 The probabilities of the different target values can easily be estimated based on their
frequencies over the 14 training examples
– P(P1ayTennis = yes) = 9/14 = 0.64
– P(P1ayTennis = no) = 5/14 = 0.36


 Similarly, estimate the conditional probabilities. For example, those for Wind =
strong
– P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
– P(Wind = strong | PlayTennis = no) = 3/5 = 0.60

 Calculate vNB according to Equation (2):

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 0.0206

Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance, based on the probability estimates learned from the training data.

By normalizing the above quantities to sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values:

P(PlayTennis = no | sunny, cool, high, strong) = 0.0206 / (0.0206 + 0.0053) ≈ 0.795
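The same computation can be written as a short Python sketch (hedged: this is an illustration using plain relative-frequency estimates, not a general-purpose implementation):

from collections import Counter

# The 14 PlayTennis examples from the table above: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
target_counts = Counter(row[-1] for row in data)

def p_attr_given_v(attr_index, value, v):
    # relative-frequency estimate of P(a_i = value | target = v)
    rows = [r for r in data if r[-1] == v]
    return sum(1 for r in rows if r[attr_index] == value) / len(rows)

new_instance = ("Sunny", "Cool", "High", "Strong")
scores = {}
for v, count in target_counts.items():
    score = count / len(data)                    # P(v)
    for i, a in enumerate(new_instance):
        score *= p_attr_given_v(i, a, v)         # multiply by P(a_i | v)
    scores[v] = score

v_nb = max(scores, key=scores.get)
print(scores)   # roughly {'Yes': 0.0053, 'No': 0.0206}
print(v_nb)     # 'No'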

Estimating Probabilities
 We have estimated probabilities by the fraction of times the event is observed to occur
over the total number of opportunities.
 For example, in the above case we estimated P(Wind = strong | Play Tennis = no) by
the fraction nc /n where, n = 5 is the total number of training examples for which
PlayTennis = no, and nc = 3 is the number of these for which Wind = strong.
 When nc = 0, the estimate nc/n is zero, and this probability term will dominate the quantity calculated in Equation (2), because computing vNB requires multiplying all the other probability terms by this zero value.
 To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using the m-estimate defined as follows.
m-estimate of probability:

P = (nc + m·p) / (n + m)
 p is our prior estimate of the probability we wish to determine, and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.
 A typical method for choosing p in the absence of other information is to assume uniform priors; that is, if an attribute has k possible values we set p = 1/k.
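A one-function Python sketch of this estimator, applied to the Wind example above (the choice m = 1 is arbitrary and only for illustration):

def m_estimate(n_c, n, p, m):
    # n_c: examples matching the attribute value, n: examples of this class,
    # p: prior estimate of the probability, m: equivalent sample size
    return (n_c + m * p) / (n + m)

# Wind has k = 2 values (Weak, Strong), so the uniform prior is p = 1/2.
print(m_estimate(n_c=3, n=5, p=0.5, m=1))   # P(Wind=strong | PlayTennis=no), lightly smoothed
print(m_estimate(n_c=0, n=5, p=0.5, m=1))   # a zero count no longer yields a hard zero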


BAYESIAN BELIEF NETWORKS

 The naive Bayes classifier makes significant use of the assumption that the values of the
attributes a1… an are conditionally independent given the target value v.
 This assumption dramatically reduces the complexity of learning the target function
 A Bayesian belief network describes the probability distribution governing a set of
variables by specifying a set of conditional independence assumptions along with a set of
conditional probabilities
 Bayesian belief networks allow stating conditional independence assumptions that apply
to subsets of the variables

Notation
 Consider an arbitrary set of random variables Y1 . . . Yn , where each variable Yi can
take on the set of possible values V(Yi).
 The joint space of the set of variables Y is defined to be the cross product V(Y1) × V(Y2) × ... × V(Yn).
 In other words, each item in the joint space corresponds to one of the possible assignments of values to the tuple of variables (Y1 . . . Yn). The probability distribution over this joint space is called the joint probability distribution.
 The joint probability distribution specifies the probability for each of the possible
variable bindings for the tuple (Y1 . . . Yn).
 A Bayesian belief network describes the joint probability distribution for a set of
variables.

Conditional Independence
Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of
Y given Z if the probability distribution governing X is independent of the value of Y given a
value for Z; that is, if

(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

Where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z).

The above expression is written in abbreviated form as

P(X | Y, Z) = P(X | Z)

Conditional independence can be extended to sets of variables. The set of variables X1 . . . Xl is conditionally independent of the set of variables Y1 . . . Ym given the set of variables Z1 . . . Zn if

P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)


The naive Bayes classifier assumes that the instance attribute A1 is conditionally independent of instance attribute A2 given the target value V. This allows the naive Bayes classifier to calculate P(A1, A2 | V) as follows:

P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V) = P(A1 | V) P(A2 | V)

Representation
A Bayesian belief network represents the joint probability distribution for a set of variables.
Bayesian networks (BN) are represented by directed acyclic graphs.

The Bayesian network in the above figure represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.

A Bayesian network (BN) represents the joint probability distribution by specifying a set of
conditional independence assumptions
 BN represented by a directed acyclic graph, together with sets of local conditional
probabilities
 Each variable in the joint space is represented by a node in the Bayesian network
 The network arcs represent the assertion that the variable is conditionally independent
of its non-descendants in the network given its immediate predecessors in the network.
 A conditional probability table (CPT) is given for each variable, describing the
probability distribution for that variable given the values of its immediate predecessors

The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of network variables (Y1 . . . Yn) can be computed by the formula

P(y1, . . . , yn) = Π(i=1..n) P(yi | Parents(Yi))

Where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
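This formula can be sketched in Python for a tiny two-node network A → B. The CPT numbers below are invented for illustration; they are not the Storm/Campfire tables from the figure.

# Network structure: parents of each boolean variable
parents = {"A": [], "B": ["A"]}

# CPTs: P(var = True | parent assignment), keyed by the tuple of parent values
cpt = {
    "A": {(): 0.3},
    "B": {(True,): 0.9, (False,): 0.2},
}

def prob(var, value, assignment):
    key = tuple(assignment[p] for p in parents[var])
    p_true = cpt[var][key]
    return p_true if value else 1.0 - p_true

def joint(assignment):
    # P(y1, ..., yn) = product over i of P(y_i | Parents(Y_i))
    result = 1.0
    for var, value in assignment.items():
        result *= prob(var, value, assignment)
    return result

print(joint({"A": True, "B": True}))    # 0.3 * 0.9 = 0.27
print(joint({"A": False, "B": True}))   # 0.7 * 0.2 = 0.14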


Example:
Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire
is conditionally independent of its non-descendants Lightning and Thunder, given its
immediate parents Storm and BusTourGroup.

This means that once we know the value of the variables Storm and BusTourGroup, the
variables Lightning and Thunder provide no additional information about Campfire
The conditional probability table associated with the variable Campfire asserts, for example, that
P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
Inference
 Use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given
the observed values of the other variables.
 Inference can be straightforward if values for all of the other variables in the network are
known exactly.
 A Bayesian network can be used to compute the probability distribution for any subset of
network variables given the values or distributions for any subset of the remaining
variables.
 Exact inference in an arbitrary Bayesian network is known to be NP-hard.

Learning Bayesian Belief Networks


Effective algorithms can be devised for learning Bayesian belief networks from training data by considering several different settings for the learning problem.
 First, the network structure might be given in advance, or it might have to be inferred from
the training data.
 Second, all the network variables might be directly observable in each training example,
or some might be unobservable.
 In the case where the network structure is given in advance and the variables are fully observable in the training examples, learning the conditional probability tables is straightforward: we simply estimate the conditional probability table entries from their observed frequencies in the training data.
 In the case where the network structure is given but only some of the variable values are observable in the training data, the learning problem is more difficult. This learning problem can be compared to learning the weights for the hidden units of an artificial neural network (ANN).


Gradient Ascent Training of Bayesian Network


The gradient ascent rule maximizes P(D|h) by following the gradient of ln P(D|h) with respect to the parameters that define the conditional probability tables of the Bayesian network.

Let wijk denote a single entry in one of the conditional probability tables. In particular, let wijk denote the conditional probability that the network variable Yi will take on the value yij, given that its immediate parents Ui take on the values given by uik.

The gradient of ln P(D|h) is given by the derivatives ∂ ln P(D|h) / ∂wijk for each of the wijk. As can be shown, each of these derivatives can be calculated as

∂ ln P(D|h) / ∂wijk = Σ(d∈D) Ph(Yi = yij, Ui = uik | d) / wijk

To derive this gradient, which is defined by the set of derivatives for all i, j, and k, assume the training examples d in the data set D are drawn independently; then ln P(D|h) = Σ(d∈D) ln Ph(d), and we can write the derivative as

∂ ln Ph(D) / ∂wijk = Σ(d∈D) ∂ ln Ph(d) / ∂wijk

where we write the abbreviation Ph(D) to represent P(D|h).


THE EM ALGORITHM

The EM (Expectation-Maximization) algorithm can be used even for variables whose value is never directly observed, provided the general form of the probability distribution governing these variables is known.


Estimating Means of k Gaussians

 Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions.

 This problem setting is illustrated in the figure for the case where k = 2, where the instances are the points shown along the x axis.
 Each instance is generated using a two-step process.
 First, one of the k Normal distributions is selected at random.
 Second, a single random instance xi is generated according to this selected
distribution.
 This process is repeated to generate a set of data points as shown in the figure.
 To simplify, consider the special case
 The selection of the single Normal distribution at each step is based on choosing
each with uniform probability
 Each of the k Normal distributions has the same variance σ2, known value.
 The learning task is to output a hypothesis h = (μ1 , . . . ,μk) that describes the means of
each of the k distributions.
 We would like to find a maximum likelihood hypothesis for these means; that is, a hypothesis h that maximizes p(D|h).

Recall that for a single Normal distribution, the maximum likelihood estimate of the mean μ is the value that minimizes the sum of squared errors over the m observed instances:

μML = argmin over μ of Σ(i=1..m) (xi − μ)²

In this case, the sum of squared errors is minimized by the sample mean

μML = (1/m) Σ(i=1..m) xi

 Our problem here, however, involves a mixture of k different Normal distributions, and
we cannot observe which instances were generated by which distribution.
 Consider the full description of each instance as the triple (xi, zi1, zi2),

 where xi is the observed value of the ith instance and


 where zi1 and zi2 indicate which of the two Normal distributions was used to
generate the value xi
 In particular, zij has the value 1 if xi was created by the jth Normal distribution and 0
otherwise.
 Here xi is the observed variable in the description of the instance, and zi1 and zi2 are hidden variables.
 If the values of zi1 and zi2 were observed, we could use the preceding equation to solve for the means μ1 and μ2.
 Because they are not, we will instead use the EM algorithm.

EM algorithm (applied to the two-means problem):

1. Initialize the hypothesis h = (μ1, μ2) to arbitrary values.
2. Step 1 (Estimation): Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = (μ1, μ2) holds:

E[zij] = p(x = xi | μ = μj) / Σ(n=1..2) p(x = xi | μ = μn)
       = e^( −(xi − μj)² / 2σ² ) / Σ(n=1..2) e^( −(xi − μn)² / 2σ² )

3. Step 2 (Maximization): Calculate a new maximum likelihood hypothesis h' = (μ1', μ2'), assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in Step 1; then replace h by h' and iterate:

μj ← Σ(i=1..m) E[zij] xi / Σ(i=1..m) E[zij]

Note this expression is similar to the sample mean used earlier to estimate μ for a single Normal distribution. Our new expression is just the weighted sample mean for μj, with each instance weighted by the expectation E[zij] that it was generated by the jth Normal distribution.
The above algorithm for estimating the means of a mixture of k Normal distributions illustrates
the essence of the EM approach: The current hypothesis is used to estimate the unobserved
variables, and the expected values of these variables are then used to calculate an
improved hypothesis. It can be proved that on each iteration through this loop, the EM
algorithm increases the likelihood P(D|h) unless it is at a local maximum. The algorithm thus
converges to a local maximum likelihood hypothesis for (μ1, μ2).
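A runnable Python sketch of this two-means EM loop (the synthetic one-dimensional data, the known equal variance, and the fixed number of iterations are all illustrative choices, not prescribed by the notes):

import math
import random

random.seed(1)
sigma = 1.0
true_means = [0.0, 5.0]                      # used only to generate synthetic data
xs = [random.gauss(random.choice(true_means), sigma) for _ in range(200)]

mu = [random.uniform(min(xs), max(xs)) for _ in range(2)]   # current hypothesis h = (mu1, mu2)
for _ in range(50):
    # E-step: E[z_ij] = exp(-(x_i - mu_j)^2 / 2 sigma^2) / sum_n exp(-(x_i - mu_n)^2 / 2 sigma^2)
    e = []
    for x in xs:
        w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
        total = sum(w)
        e.append([wj / total for wj in w])
    # M-step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = [sum(e[i][j] * xs[i] for i in range(len(xs))) / sum(e[i][j] for i in range(len(xs)))
          for j in range(2)]

print(sorted(round(m, 2) for m in mu))   # converges toward the two cluster means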

General Statement of EM Algorithm


 More generally, the EM algorithm can be applied in many settings where we wish to estimate
some set of parameters θ that describe an underlying probability distribution, given only the
observed portion of the full data produced by this distribution.

 In the above two-means example the parameters of interest were θ = (μ1, μ2), and the full
data were the triples (xi, zi1, zi2) of which only the xi were observed.

 In general, let X = {x1, ..., xm} denote the observed data in a set of m independently drawn instances, let Z = {z1, ..., zm} denote the unobserved data in these same instances, and let Y = X ∪ Z denote the full data.

 Note the unobserved Z can be treated as a random variable whose probability distribution
depends on the unknown parameters θ and on the observed data X.

 Similarly, Y is a random variable because it is defined in terms of the random variable Z.

 We use h to denote the current hypothesized values of the parameters θ, and h' to denote the
revised hypothesis that is estimated on each iteration of the EM algorithm.

 The EM algorithm searches for the maximum likelihood hypothesis h' by seeking the h' that
maximizes E[ln P(Y|h')]. This expected value is taken over the probability distribution
governing Y, which is determined by the unknown parameters θ.

 Let us consider exactly what this expression signifies.


o First, P(Y|h’) is the likelihood of the full data Y given hypothesis h'. It is reasonable
that we wish to find a h' that maximizes some function of this quantity.
o Second, maximizing the logarithm of this quantity ln(P(Y|h’)) also maximizes P(Y|h’),
as we have discussed on several occasions already.
o Third, we introduce the expected value E[ln P(Y|h’)] because the full data Y is itself a
random variable.


 Given that the full data Y is a combination of the observed data X and unobserved data Z, we
must average over the possible values of the unobserved Z, weighting each according to its
probability. In other words we take the expected value E[ln P(Y|h')] over the probability
distribution governing the random variable Y. The distribution governing Y is determined by
the completely known values for X, plus the distribution governing Z.
 In general, we will not know this distribution because it is determined by the parameters θ that we are trying to estimate. Therefore, the EM algorithm uses its current hypothesis h in place of the actual parameters θ to estimate the distribution governing Y. Let us define a function Q(h'|h) that gives E[ln P(Y|h')] as a function of h', under the assumption that θ = h and given the observed portion X of the full data Y:

Q(h'|h) = E[ ln P(Y|h') | h, X ]

 In its general form, the EM algorithm repeats the following two steps until convergence:

Step 1 (Estimation, E step): Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
Q(h'|h) ← E[ ln P(Y|h') | h, X ]

Step 2 (Maximization, M step): Replace hypothesis h by the hypothesis h' that maximizes this Q function:
h ← argmax h' Q(h'|h)
When the function Q(h’|h) is continuous, the EM algorithm converges to a stationary point of the
likelihood function P(Y|h'). When this likelihood function has a single maximum, EM will
converge to this global maximum likelihood estimate for h'. Otherwise, it is guaranteed only to
converge to a local maximum.
Derivation of the k Means Algorithm
 The k-means algorithm is an unsupervised learning algorithm.
 Given a data set of items with certain features, and values for these features, the algorithm categorizes the items into k groups (clusters) of similar items.
 To calculate similarity, a distance measure such as the Euclidean, Manhattan, Hamming, or cosine distance is used.
 Here is the pseudocode for implementing the k-means algorithm; a runnable sketch follows the pseudocode.
Input: Algorithm K-Means (K number of clusters, D list of data points)
1. Choose K random data points as the initial centroids (cluster centers).
2. Repeat until the cluster centers stabilize:
a) Allocate each point in D to the nearest of the K centroids.
b) Recompute each centroid using all the points currently assigned to that cluster.
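A runnable Python sketch of the pseudocode above, using squared Euclidean distance; the sample points and the fixed iteration cap are invented for illustration:

import random

def k_means(points, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                 # step 1: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # step 2a: assign to nearest centroid
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = []
        for j, cluster in enumerate(clusters):           # step 2b: recompute each centroid
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[j])       # keep the old centroid for an empty cluster
        if new_centroids == centroids:                   # cluster centers have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
print(k_means(points, k=2))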



Questions on Unit-4

1. State Bayes theorem. What are the relevance and features of Bayesian learning methods? Explain the practical difficulties in applying Bayesian methods.
2. Define the Maximum a Posteriori (MAP) and Maximum Likelihood (ML) hypotheses. Derive the relations for hMAP and hML using Bayes theorem.
3. Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer (+) and (2) that the patient does not (−). A patient takes
a lab test and the result comes back positive. The test returns a correct positive result in only
98% of the cases in which the disease is actually present, and a correct negative result in only
97% of the cases in which the disease is not present. Furthermore, .008 of the entire
population has this cancer. Determine whether the patient has Cancer or not using MAP
hypothesis.
4. Explain Brute force Bayes Concept Learning
5. What are Consistent Learners?
6. Discuss Maximum Likelihood and Least Square Error Hypothesis
7. Describe Maximum Likelihood Hypothesis for predicting probabilities.
8. Explain the Gradient Search to Maximize Likelihood in a Neural Net.
9. Describe the concept of MDL. Obtain the equation for hMDL
10. Explain Naïve Bayes Classifier with an Example
11. What are Bayesian Belief nets? Where are they used?
12. Explain Bayesian belief network and conditional independence with example
13. Explain Gradient Ascent Training of Bayesian Networks
14. Explain the concept of EM Algorithm. Discuss what are Gaussian Mixtures


