
MACHINE LEARNING

BAYESIAN LEARNING
Two Roles for Bayesian Methods
 Provides practical learning algorithms:
– Naive Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities) with observed data
– Requires prior probabilities

 Provides useful conceptual framework


– Provides “gold standard” for evaluating other learning algorithms
– Additional insight into Occam’s razor

Bayes Theorem

 P(h|D) = P(D|h) P(h) / P(D)

 P(h) = prior probability of hypothesis h
 P(D) = prior probability of training data D
 P(h|D) = probability of h given D (the posterior probability)
 P(D|h) = probability of D given h (the likelihood)

Choosing Hypotheses

 Generally want the most probable hypothesis given the
training data, i.e. the maximum a posteriori (MAP) hypothesis hMAP:

 hMAP = argmax h∈H P(h|D)
      = argmax h∈H P(D|h) P(h) / P(D)
      = argmax h∈H P(D|h) P(h)

 If we assume P(hi) = P(hj) for all i, j, we can simplify further and
choose the maximum likelihood (ML) hypothesis:

 hML = argmax h∈H P(D|h)

Bayes Theorem
 Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease
is not present. Furthermore, .008 of the entire population
have this cancer.
 P(cancer) = .008        P(¬cancer) = .992
 P(+|cancer) = .98       P(-|cancer) = .02
 P(+|¬cancer) = .03      P(-|¬cancer) = .97
Basic Formulas for Probabilities
 Product Rule: probability P(A ∧ B) of a conjunction of
two events A and B:
P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
 Sum Rule: probability of a disjunction of two events A
and B:
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
 Theorem of total probability: if events A1,…, An are
mutually exclusive with Σi P(Ai) = 1, then
P(B) = Σi P(B | Ai) P(Ai)
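A quick numeric check of these rules in Python; the probabilities below are made-up values chosen only to be mutually consistent:

```python
# Made-up, mutually consistent probabilities for two events A and B.
p_b = 0.50
p_a_given_b = 0.40        # P(A | B)
p_a_given_not_b = 0.20    # P(A | not B)

# Theorem of total probability: B and not-B are mutually exclusive and exhaustive.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)    # 0.30

# Product rule and sum rule.
p_a_and_b = p_a_given_b * p_b                            # P(A and B) = 0.20
p_a_or_b = p_a + p_b - p_a_and_b                         # P(A or B) = 0.60

# Applying the product rule the other way round recovers P(B | A) (Bayes theorem).
p_b_given_a = p_a_and_b / p_a                            # 0.20 / 0.30 = 0.667
print(p_a, p_a_and_b, p_a_or_b, round(p_b_given_a, 3))
```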
BRUTE FORCE MAP HYPOTHESIS LEARNER

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax h∈H P(h|D)
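A minimal Python sketch of this brute-force learner, assuming the hypothesis space, the prior P(h) and the likelihood P(D|h) are all given explicitly (the coin-bias example at the bottom is made up for illustration):

```python
from math import comb

# Brute-force MAP learner: score every h by P(D|h) P(h); P(D) is the same for all h.
def brute_force_map(hypotheses, prior, likelihood, data):
    scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    p_data = sum(scores.values())                      # P(D) = sum_h P(D|h) P(h)
    posteriors = {h: s / p_data for h, s in scores.items()}
    h_map = max(posteriors, key=posteriors.get)        # highest posterior probability
    return h_map, posteriors

# Toy example: two coin-bias hypotheses, data = 8 heads in 10 flips, uniform prior.
hypotheses = [0.5, 0.8]
prior = lambda h: 1 / len(hypotheses)
likelihood = lambda heads, h: comb(10, heads) * h**heads * (1 - h)**(10 - heads)
print(brute_force_map(hypotheses, prior, likelihood, 8))   # hMAP is the 0.8 coin
```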

Naive Bayes Classifier
 Along with decision trees, neural networks, and nearest
neighbor, one of the most practical learning methods.
 When to use
– Moderate or large training set available
– Attributes that describe instances are conditionally
independent given classification
 Successful applications:
– Diagnosis
– Classifying text documents
Naive Bayes Classifier

 Assume the target function is f : X → V, where each instance x is
described by a tuple of attribute values (a1, a2, …, an). The most
probable target value is

 vMAP = argmax vj∈V P(vj | a1, a2, …, an)
      = argmax vj∈V P(a1, a2, …, an | vj) P(vj)

 Naive Bayes assumption: the attributes are conditionally independent
given the target value, so P(a1, a2, …, an | vj) = Πi P(ai | vj).
This gives the Naive Bayes classifier:

 vNB = argmax vj∈V P(vj) Πi P(ai | vj)
Naïve Bayes Classifier Example

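The worked example from the original slides is not reproduced here. As a stand-in, the Python sketch below classifies a new instance with a small, made-up PlayTennis-style dataset; the rows, attribute values and the query instance are all assumptions, used only to show how vNB = argmax P(vj) Πi P(ai|vj) is evaluated (no smoothing of zero counts, to keep it short).

```python
# Naive Bayes on a tiny, made-up dataset. Each row: (Outlook, Humidity, Wind, PlayTennis).
data = [
    ("sunny", "high", "weak", "no"),     ("sunny", "high", "strong", "no"),
    ("overcast", "high", "weak", "yes"), ("rain", "high", "weak", "yes"),
    ("rain", "normal", "weak", "yes"),   ("rain", "normal", "strong", "no"),
    ("overcast", "normal", "strong", "yes"), ("sunny", "normal", "weak", "yes"),
]

def naive_bayes_classify(instance, data):
    scores = {}
    for v in {row[-1] for row in data}:                 # each candidate target value vj
        rows = [r for r in data if r[-1] == v]
        score = len(rows) / len(data)                   # P(vj), estimated by frequency
        for i, a in enumerate(instance):                # product of P(ai | vj)
            score *= sum(1 for r in rows if r[i] == a) / len(rows)
        scores[v] = score
    return max(scores, key=scores.get), scores

print(naive_bayes_classify(("sunny", "high", "strong"), data))
# -> ('no', {...}): the 'no' score (0.375 * 2/3 * 2/3 * 2/3) beats the 'yes' score.
```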
Learning to Classify Text (1/4)

 Why?
– Learn which news articles are of interest
– Learn to classify web pages by topic

 Naive Bayes is among the most effective algorithms for this task.

 What attributes shall we use to represent text
documents? One common choice is sketched below.
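The follow-up slides that answer this question are not included in this extract. One common choice (an assumption here, not taken from the original) is a bag-of-words representation: each distinct word becomes an attribute and its value is the word's count in the document, which then feeds the P(wk | vj) estimates used by Naive Bayes.

```python
# Bag-of-words: one attribute per vocabulary word, valued by how often it occurs.
from collections import Counter

def bag_of_words(document: str) -> Counter:
    tokens = document.lower().split()      # deliberately crude tokenizer
    return Counter(tokens)

print(bag_of_words("the quick brown fox jumps over the lazy dog"))
# Counter({'the': 2, 'quick': 1, 'brown': 1, ...})
```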

Bayes Optimal Classifier
 Bayes optimal classification:

 argmax vj∈V Σ hi∈H P(vj | hi) P(hi | D)

 Example:
P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0
therefore
 Σ hi∈H P(+|hi) P(hi|D) = .4
 Σ hi∈H P(-|hi) P(hi|D) = .6
and
 argmax vj∈V Σ hi∈H P(vj|hi) P(hi|D) = -
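The same computation as a short Python sketch; the dictionaries simply encode the numbers from the example above:

```python
# Bayes optimal classification: weight every hypothesis's prediction by P(hi|D).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}       # P(hi | D)
predicts = {"h1": {"+": 1.0, "-": 0.0},             # P(vj | hi)
            "h2": {"+": 0.0, "-": 1.0},
            "h3": {"+": 0.0, "-": 1.0}}

scores = {v: sum(predicts[h][v] * posterior[h] for h in posterior)
          for v in ("+", "-")}
print(scores)                        # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))   # '-', the Bayes optimal classification
```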

Gibbs Classifier
 The Bayes optimal classifier provides the best result, but it can be
expensive when there are many hypotheses.
 Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance
 Surprising fact: Assume target concepts are drawn at
random from H according to priors on H. Then:
E[errorGibbs] ≤ 2 E[errorBayesOptimal]
 Suppose a correct, uniform prior distribution over H; then
– Pick any hypothesis from the version space VS, with uniform probability
– Its expected error is no worse than twice that of the Bayes optimal classifier
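A minimal sketch of the Gibbs procedure, under the simplifying assumption that the posterior P(h|D) and the hypotheses themselves are available as small Python dictionaries:

```python
import random

# Gibbs classifier: sample ONE hypothesis according to P(h|D), then let it classify x.
def gibbs_classify(x, posterior, hypotheses):
    names = list(posterior)
    chosen = random.choices(names, weights=[posterior[n] for n in names], k=1)[0]
    return hypotheses[chosen](x)

# Reusing the three hypotheses from the Bayes optimal example above.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
hypotheses = {"h1": lambda x: "+", "h2": lambda x: "-", "h3": lambda x: "-"}
print(gibbs_classify(None, posterior, hypotheses))   # '+' ~40% of the time, '-' ~60%
```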
Expectation Maximization (EM)
 When to use:
– Data is only partially observable
– Unsupervised clustering (target value unobservable)
– Supervised learning (some instance attributes unobservable)
 Some uses:
– Train Bayesian Belief Networks
– Unsupervised clustering (AUTOCLASS)
– Learning Hidden Markov Models

EM Algorithm
 Converges to local maximum likelihood h
and provides estimates of hidden variables zij

 In fact, local maximum in E[ln P(Y|h)]


– Y is the complete data (observable plus unobservable
variables)
– Expected value is taken over possible values of
unobserved variables in Y

General EM Problem
 Given:
– Observed data X = {x1,…, xm}
– Unobserved data Z = {z1,…, zm}
– Parameterized probability distribution P(Y|h), where
 Y = {y1,…, ym} is the full data, with yi = xi ∪ zi
 h are the parameters
 Determine: h that (locally) maximizes E[ln P(Y|h)]
 Many uses:
– Train Bayesian belief networks
– Unsupervised clustering (e.g., k-means)
– Hidden Markov Models
GENERAL EM METHOD
Define a likelihood function Q(h'|h) over the full data Y = X ∪ Z,
using the observed X and the current parameters h to estimate Z:
Q(h'|h) ← E[ln P(Y | h') | h, X]
EM Algorithm:
 Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the
observed data X to estimate the probability distribution over Y .
Q(h'|h) ← E[ln P(Y | h') | h, X]

 Maximization (M) step: Replace hypothesis h by the hypothesis h' that
maximizes this Q function.
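As a concrete instance of these two steps, here is a short sketch of EM for a mixture of two one-dimensional Gaussians with known, equal variance and equal mixing weights (the data and the initial means are made up; this is only an illustration of the E/M alternation, not the general algorithm):

```python
import math

# EM for two 1-D Gaussians with known variance sigma and equal mixing weights.
def em_two_gaussians(xs, mu=(-1.0, 1.0), sigma=1.0, iters=50):
    mu = list(mu)
    for _ in range(iters):
        # E step: expected membership z_ij of each point under the current means.
        resp = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            resp.append([wj / sum(w) for wj in w])
        # M step: choose new means that maximize the expected log likelihood,
        # i.e. the responsibility-weighted averages of the data.
        for j in range(2):
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
    return mu

data = [0.1, 0.4, 0.6, 0.9, 4.0, 4.3, 4.6, 5.1]   # made-up sample
print(em_two_gaussians(data))                      # means settle near ~0.5 and ~4.5
```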

