BAYESIAN LEARNING
INTRODUCTION
• Why Bayesian?
• For two different reasons:
– Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain learning problems and are competitive with decision tree and neural network algorithms.
– They are useful for learning to classify text documents, such as electronic news articles.
Bayesian Learning
Features of Bayesian learning methods:
• Each observed training example can incrementally decrease or
increase the estimated probability that a hypothesis is correct.
– This provides a more flexible approach to learning than
algorithms that completely eliminate a hypothesis if it is
found to be inconsistent with any single example.
• Prior knowledge can be combined with observed data to
determine the final probability of a hypothesis. In Bayesian
learning, prior knowledge is provided by asserting
– a prior probability for each candidate hypothesis, and
– a probability distribution over observed data for each possible
hypothesis.
Bayesian Learning
• Bayesian methods can accommodate hypotheses
that make probabilistic predictions
• New instances can be classified by combining the
predictions of multiple hypotheses, weighted by
their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
Difficulties with Bayesian Methods
• Require initial knowledge of many probabilities
– When these probabilities are not known in advance they are
often estimated based on background knowledge, previously
available data, and assumptions about the form of the
underlying distributions.
• Significant computational cost is required to determine the
Bayes optimal hypothesis in the general case (linear in the
number of candidate hypotheses).
– In certain specialized situations, this computational cost can
be significantly reduced.
Bayes Theorem
• In machine learning, we try to determine the best
hypothesis from some hypothesis space H, given the
observed training data D.
• In Bayesian learning, the best hypothesis means the
most probable hypothesis, given the data D plus any
initial knowledge about the prior probabilities of the
various hypotheses in H.
• Bayes theorem provides a way to calculate the
probability of a hypothesis based on its prior
probability, the probabilities of observing various data
given the hypothesis, and the observed data itself.
Bayes Theorem
P(h) is the prior probability of hypothesis h
– P(h) denotes the initial probability that hypothesis h holds, before observing the training data.
– P(h) may reflect any background knowledge we have about the chance that h is correct. If we have no such prior knowledge, then each candidate hypothesis might simply be assigned the same prior probability.
P(D) is the prior probability of training data D
– The probability of D given no knowledge about which hypothesis holds.
P(h|D) is the posterior probability of h given D
– P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.
– The posterior probability P(h|D) reflects the influence of the training data D, in contrast to the prior probability P(h), which is independent of D.
P(D|h) is the probability of D given h (the likelihood)
– The probability of observing data D given some world in which hypothesis h holds.
– In general, we write P(x|y) to denote the probability of event x given event y.
Bayes Theorem
• In ML problems, we are interested in the probability P(h|D) that h
holds given the observed training data D.
• Bayes theorem provides a way to calculate the posterior probability
P(h|D), from the prior probability P(h), together with P(D) and P(D|h).
Bayes Theorem:   P(h|D) = P(D|h) P(h) / P(D)
• P(h|D) increases with P(h) and P(D|h) according to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more probable it is
that D will be observed independent of h, the less evidence D provides
in support of h.
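To make the role of P(D) explicit, it can be expanded by the law of total probability over the candidate hypotheses (a standard identity, not stated on the slide above):

\[
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}, \qquad
P(D) = \sum_{h' \in H} P(D \mid h')\, P(h')
\]

This expansion is exactly what the cancer example later in this section uses when it normalizes .0078 and .0298 by their sum.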
Prior and Posterior Probability
• Prior probability : The probability of an
event before new data is collected
• Posterior probability : The probability of an
event after new data is collected
Maximum A Posteriori (MAP) Hypothesis, hMAP
➢ The learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D.
➢ Such a maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis, hMAP.
➢ The MAP hypothesis is determined by using Bayes theorem to calculate the posterior probability of each candidate hypothesis, as in the formula below.
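Written out (the last step drops P(D) because it is the same for every hypothesis and so does not affect the argmax):

\[
h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)
\]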
Maximum Likelihood (ML) Hypothesis, hML
• If we assume that every hypothesis in H is equally probable a priori, i.e. P(hi) = P(hj) for all hi and hj in H, then we need consider only P(D|h) to find the most probable hypothesis (see the formula below).
• P(D|h) is often called the likelihood of the data D given h.
• Any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis, hML.
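Under this equal-priors assumption the MAP hypothesis reduces to the maximum likelihood hypothesis:

\[
h_{ML} = \arg\max_{h \in H} P(D \mid h)
\]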
Example
➢ A medical diagnosis problem in which there are two
alternative hypotheses:
➢ (1) that the patient has a particular form of cancer, and
➢ (2) that the patient does not.
➢ The available data is from a particular laboratory test
with two possible outcomes: + (positive) and - (negative)
➢ We have prior knowledge that over the entire population
of people only .008 have this disease. Furthermore, the
lab test is only an imperfect indicator of the disease.
➢ The test returns a correct positive result in only 98% of
the cases in which the disease is actually present and a
correct negative result in only 97% of the cases in which
the disease is not present
➢ In other cases, the test returns the opposite result
Example - Does the patient have cancer or not?
P(cancer) = .008 P(notcancer) = .992
P(+|cancer) = .98 P(-|cancer) = .02
P(+|notcancer) = .03 P(-|notcancer) = .97
• A patient takes a lab test and the result comes back positive.
P(+|cancer) P(cancer) = .98 * .008 = .0078
P(+|notcancer) P(notcancer) = .03 * .992 = .0298 ➔ hMAP is notcancer
• Since P(cancer|+) + P(notcancer|+) must be 1
P(cancer|+) = .0078 / (.0078+.0298) = .21
P(notcancer|+) = .0298 / (.0078+.0298) = .79
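These numbers can be verified with a few lines of Python (a minimal sketch; the variable names are ours, not part of the slide):

# Probabilities given on the slide
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

# Unnormalized posteriors for a positive test result
num_cancer = p_pos_given_cancer * p_cancer              # ~0.0078
num_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~0.0298

# Normalize so the two posteriors sum to 1 (the denominator is P(+))
evidence = num_cancer + num_not_cancer
print(round(num_cancer / evidence, 2))      # 0.21 -> P(cancer | +)
print(round(num_not_cancer / evidence, 2))  # 0.79 -> P(notcancer | +)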
Solution
The above situation is summarized by the probabilities listed on the previous slide.
Maximum a posteriori hypothesis
Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? The maximum a posteriori hypothesis is found by comparing P(+|cancer)P(cancer) = .0078 with P(+|notcancer)P(notcancer) = .0298; since the latter is larger, hMAP = notcancer.
MINIMUM DESCRIPTION LENGTH PRINCIPLE
• https://www.youtube.com/watch?v=tRHpFG3P2k8
• https://www.youtube.com/watch?v=0kufNLe31t0
BAYES OPTIMAL CLASSIFIER
• https://www.youtube.com/watch?v=o5x361YstFI
• https://www.youtube.com/watch?v=7R3b59ohivU
• https://www.youtube.com/watch?v=kWV_dVKnm2c
• https://www.youtube.com/watch?v=t51t8kPGvis
• https://www.youtube.com/watch?v=i4qF0-Jroq0
GIBBS ALGORITHM
• https://www.youtube.com/watch?v=602Bus31zgc
• https://www.youtube.com/watch?v=o5x361YstFI
NAIVE BAYES CLASSIFIER
• https://www.youtube.com/watch?v=XzSlEA4ck2I
• https://www.youtube.com/watch?v=AUPmlIY_Rkw
• https://www.youtube.com/watch?v=caRLHyyUudg
• https://www.youtube.com/watch?v=CICk9ApEC3U
Naïve Bayes
• https://www.youtube.com/watch?v=Ab4viREnP74
• Gaussian Naive Bayes Classifier (for continuous values)
• https://www.youtube.com/watch?v=kufuBE6TJew
• Solved example: Naive Bayes classification (Age, Income, Student, Credit Rating, Buys Computer) - Mahesh
• https://www.youtube.com/watch?v=ztYAWF8tzLI
• Classify the new example as Senior or Junior
• https://www.youtube.com/watch?v=Tw4U4a8VmIs
More:
• https://www.youtube.com/watch?v=QPvHY9t1Ouw
• Text classifier
• https://www.youtube.com/watch?v=fgbG7fHQwJk
• Spam mail classifier
• https://www.youtube.com/watch?v=YcsDbCvRBxg
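For readers who want to try the spam/text classification idea in code, here is a minimal sketch using scikit-learn's MultinomialNB; the toy messages and labels below are invented for illustration and are not taken from the linked videos:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: 1 = spam, 0 = not spam (hypothetical examples)
messages = [
    "win a free prize now",
    "limited offer, claim your free money",
    "meeting moved to 3pm tomorrow",
    "can you review my report today",
]
labels = [1, 1, 0, 0]

# Bag-of-words counts feed the multinomial naive Bayes model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

# Classify a new message by the most probable class under naive Bayes
test = vectorizer.transform(["claim your free prize"])
print(model.predict(test))        # expected: [1] (spam)
print(model.predict_proba(test))  # posterior class probabilities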
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
• https://www.youtube.com/watch?v=Yj5jkzPtucM
• https://www.youtube.com/watch?v=lx9PkgeO5Hc