
MACHINE LEARNING

BAYESIAN LEARNING
Two Roles for Bayesian Methods
• Provides practical learning algorithms:
– Naive Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities) with observed data
– Requires prior probabilities
• Provides useful conceptual framework:
– Provides “gold standard” for evaluating other learning algorithms
– Additional insight into Occam’s razor

2
Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

3
Choosing Hypotheses
• Generally want the most probable hypothesis given the training data, the
maximum a posteriori (MAP) hypothesis hMAP:

  hMAP = argmax_{h ∈ H} P(h|D)
       = argmax_{h ∈ H} P(D|h) P(h) / P(D)
       = argmax_{h ∈ H} P(D|h) P(h)

• If we assume P(hi) = P(hj) for all i, j, we can simplify further and choose
the maximum likelihood (ML) hypothesis:

  hML = argmax_{h ∈ H} P(D|h)

4
Bayes Theorem
• Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease
is not present. Furthermore, .008 of the entire population
have this cancer.

P(cancer) = .008          P(¬cancer) = .992
P(+|cancer) = .98         P(−|cancer) = .02
P(+|¬cancer) = .03        P(−|¬cancer) = .97
5
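As a quick arithmetic check, here is a minimal Python sketch (variable names are my own) that applies Bayes theorem to the figures above; note that even after a positive test, the MAP hypothesis is still ¬cancer, since P(cancer|+) ≈ 0.21.

```python
# Figures taken from the cancer example above.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer               # 0.992
p_pos_given_cancer = 0.98                 # correct positive rate
p_pos_given_not_cancer = 1 - 0.97         # incorrect positive rate = 0.03

# Unnormalized posteriors P(+|h) P(h) for each hypothesis h.
joint_cancer = p_pos_given_cancer * p_cancer              # ≈ .0078
joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ≈ .0298

# Bayes theorem: divide by P(+) = Σ_h P(+|h) P(h).
p_pos = joint_cancer + joint_not_cancer
print("P(cancer|+)  =", joint_cancer / p_pos)      # ≈ 0.21
print("P(¬cancer|+) =", joint_not_cancer / p_pos)  # ≈ 0.79
```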
Basic Formulas for Probabilities
• Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum Rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Theorem of total probability: if events A1,…, An are mutually exclusive
  with Σ_{i=1}^{n} P(Ai) = 1, then
  P(B) = Σ_{i=1}^{n} P(B|Ai) P(Ai)
6
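These identities are easy to sanity-check numerically. The sketch below (my own toy joint distribution, not from the slides) verifies each rule for two binary events:

```python
# Toy joint distribution over two binary events A and B (illustrative numbers).
p = {(True, True): 0.2, (True, False): 0.3,
     (False, True): 0.1, (False, False): 0.4}

p_A = sum(v for (a, b), v in p.items() if a)        # marginal P(A) = 0.5
p_B = sum(v for (a, b), v in p.items() if b)        # marginal P(B) = 0.3
p_A_and_B = p[(True, True)]
p_B_given_A = p_A_and_B / p_A
p_B_given_not_A = p[(False, True)] / (1 - p_A)

# Product rule: P(A ∧ B) = P(B|A) P(A)
assert abs(p_A_and_B - p_B_given_A * p_A) < 1e-12

# Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
p_A_or_B = sum(v for (a, b), v in p.items() if a or b)
assert abs(p_A_or_B - (p_A + p_B - p_A_and_B)) < 1e-12

# Total probability: P(B) = P(B|A) P(A) + P(B|¬A) P(¬A)
assert abs(p_B - (p_B_given_A * p_A + p_B_given_not_A * (1 - p_A))) < 1e-12
print("product, sum and total-probability rules all hold")
```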
BRUTE FORCE MAP HYPOTHESIS LEARNER

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax_{h ∈ H} P(h|D)

7
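A direct Python sketch of this brute-force procedure, assuming the caller supplies the hypothesis space together with prior and likelihood functions (all names here are illustrative):

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis by scoring every h in the hypothesis space.

    prior(h)            -> P(h)
    likelihood(data, h) -> P(D|h)
    """
    # Unnormalized posteriors P(D|h) P(h); P(D) is a constant, so it does not
    # change which hypothesis wins the argmax.
    scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    evidence = sum(scores.values())                      # P(D)
    posteriors = {h: s / evidence for h, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Example: the cancer diagnosis of slide 5, with a single positive test result.
prior = {"cancer": 0.008, "¬cancer": 0.992}
likelihood = lambda data, h: {"cancer": 0.98, "¬cancer": 0.03}[h]  # P(+|h)
print(brute_force_map(prior.keys(), prior.get, likelihood, "+"))   # MAP: ¬cancer
```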
Naive Bayes Classifier
• Along with decision trees, neural networks, and nearest neighbor, one of
  the most practical learning methods.
• When to use:
– Moderate or large training set available
– Attributes that describe instances are conditionally independent given
  the classification
• Successful applications:
– Diagnosis
– Classifying text documents
8
Naive Bayes Classifier
Assume the target function is f : X → V, where each instance x is described
by attributes ⟨a1, a2, …, an⟩. The most probable value of f(x) is:

  vMAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
       = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)
9
Naive Bayes Classifier
Naive Bayes assumption:

  P(a1, a2, …, an | vj) = Π_i P(ai | vj)

which gives the Naive Bayes classifier:

  vNB = argmax_{vj ∈ V} P(vj) Π_i P(ai | vj)
10
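A minimal sketch of this classifier for discrete attributes, assuming training data is supplied as (attribute-tuple, class) pairs; all function and variable names below are illustrative:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label) pairs with discrete values."""
    class_counts = Counter(label for _, label in examples)
    # attr_counts[class][attribute_index][value] = number of training examples
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            attr_counts[label][i][value] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}
    return priors, attr_counts, class_counts

def classify(instance, priors, attr_counts, class_counts):
    """vNB = argmax_v P(v) * prod_i P(a_i|v), with simple add-one smoothing."""
    best, best_score = None, -1.0
    for v, prior in priors.items():
        score = prior
        for i, value in enumerate(instance):
            counts = attr_counts[v][i]
            # Add-one (Laplace) smoothing so unseen values do not zero the product.
            score *= (counts[value] + 1) / (class_counts[v] + len(counts) + 1)
        if score > best_score:
            best, best_score = v, score
    return best

# Toy usage: two binary attributes, two classes (illustrative data only).
data = [((1, 0), "yes"), ((1, 1), "yes"), ((0, 1), "no"), ((0, 0), "no")]
model = train_naive_bayes(data)
print(classify((1, 1), *model))   # "yes" on this toy data
```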
Naïve Bayes Classifier Example

11
12
13
Learning to Classify Text (1/4)
• Why?
– Learn which news articles are of interest
– Learn to classify web pages by topic
• Naive Bayes is among the most effective algorithms
• What attributes shall we use to represent text documents? (one common
  choice is sketched below)

14
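One common choice, and a natural fit for Naive Bayes, is to use the words themselves as attributes (a bag-of-words representation). The sketch below (illustrative names, tiny toy corpus) estimates per-class word probabilities with Laplace smoothing and classifies by summing log-probabilities:

```python
import math
from collections import Counter, defaultdict

def train_text_nb(docs):
    """docs: list of (text, label) pairs. Returns priors, word counts and vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)        # word_counts[label][word] = count
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        vocab.update(words)
        word_counts[label].update(words)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify_text(text, priors, word_counts, vocab):
    """argmax over classes of log P(class) + Σ_w log P(w|class)."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Tiny toy corpus (illustrative only).
docs = [("cheap loans win prize", "spam"), ("meeting agenda attached", "ham")]
model = train_text_nb(docs)
print(classify_text("win a cheap prize", *model))   # -> "spam"
```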
Bayes Optimal Classifier
• Bayes optimal classification:

  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)

• Example:
  P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
  P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
  P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0

  therefore
  Σ_{hi ∈ H} P(+|hi) P(hi|D) = .4
  Σ_{hi ∈ H} P(−|hi) P(hi|D) = .6

  and
  argmax_{vj ∈ {+,−}} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = −
15
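The example is just a posterior-weighted vote of the hypotheses, which this small sketch (names are my own) reproduces:

```python
# Posterior weights P(hi|D) and per-hypothesis predictions P(v|hi) from the
# example above.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predicts = {"h1": {"+": 1.0, "-": 0.0},
            "h2": {"+": 0.0, "-": 1.0},
            "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(posterior, predicts, values=("+", "-")):
    # Weight each hypothesis's vote P(v|h) by its posterior P(h|D) and pick
    # the value with the largest total.
    totals = {v: sum(predicts[h][v] * p for h, p in posterior.items())
              for v in values}
    return max(totals, key=totals.get), totals

print(bayes_optimal(posterior, predicts))   # ('-', {'+': 0.4, '-': 0.6})
```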
Gibbs Classifier
• The Bayes optimal classifier provides the best result, but can be
  expensive if there are many hypotheses.
• Gibbs algorithm:
  1. Choose one hypothesis at random, according to P(h|D)
  2. Use this to classify the new instance
• Surprising fact: assume the target concepts are drawn at random from H
  according to the priors on H. Then:
  E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
• Suppose a correct, uniform prior distribution over H. Then:
– Pick any hypothesis from VS, with uniform probability
– Its expected error is no worse than twice that of the Bayes optimal classifier
16
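A sketch of the Gibbs procedure, reusing the posterior P(h|D) and the per-hypothesis predictions from the Bayes optimal example above (random.choices draws a hypothesis in proportion to its posterior):

```python
import random

def gibbs_classify(posterior, predicts, values=("+", "-")):
    # 1. Draw a single hypothesis h with probability proportional to P(h|D).
    hyps = list(posterior)
    h = random.choices(hyps, weights=[posterior[k] for k in hyps], k=1)[0]
    # 2. Classify the new instance with that one hypothesis.
    return max(values, key=lambda v: predicts[h][v])

# Reusing `posterior` and `predicts` from the Bayes optimal sketch above:
print(gibbs_classify(posterior, predicts))   # "+" about 40% of runs, "-" about 60%
```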
Expectation Maximization (EM)
• When to use:
– Data is only partially observable
– Unsupervised clustering (target value unobservable)
– Supervised learning (some instance attributes unobservable)
• Some uses:
– Train Bayesian Belief Networks
– Unsupervised clustering (AUTOCLASS)
– Learning Hidden Markov Models
17
EM Algorithm
• Converges to a local maximum likelihood hypothesis h and provides
  estimates of the hidden variables zij
• In fact, finds a local maximum of E[ln P(Y|h)]
– Y is the complete data (observable plus unobservable variables)
– The expected value is taken over possible values of the unobserved
  variables in Y

18
General EM Problem
• Given:
– Observed data X = {x1,…, xm}
– Unobserved data Z = {z1,…, zm}
– Parameterized probability distribution P(Y|h), where
  ∗ Y = {y1,…, ym} is the full data, with yi = xi ∪ zi
  ∗ h are the parameters
• Determine: h that (locally) maximizes E[ln P(Y|h)]
• Many uses:
– Train Bayesian belief networks
– Unsupervised clustering (e.g., k-means)
– Hidden Markov Models
19
GENERAL EM METHOD
Define a likelihood function Q(h'|h) which calculates Y = X ∪ Z, using the
observed X and the current parameters h to estimate Z:
  Q(h'|h) ← E[ln P(Y|h') | h, X]

EM Algorithm:
• Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and
  the observed data X to estimate the probability distribution over Y:
  Q(h'|h) ← E[ln P(Y|h') | h, X]
• Maximization (M) step: Replace hypothesis h by the hypothesis h' that
  maximizes this Q function:
  h ← argmax_{h'} Q(h'|h)

20
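As a concrete illustration (not part of the slides), here is a compact EM sketch for a mixture of two equal-variance Gaussians, where the unobserved zij indicate which Gaussian generated each point: the E step computes the expected memberships E[zij] under the current means, and the M step re-estimates the means.

```python
import math
import random

def em_two_gaussians(xs, sigma=1.0, iters=50):
    """EM for a mixture of two equal-variance Gaussians with unknown means."""
    mu = [min(xs), max(xs)]              # crude initialization of the two means
    for _ in range(iters):
        # E step: expected membership E[z_ij] of each point in each component.
        resp = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M step: re-estimate each mean as the responsibility-weighted average.
        for j in range(2):
            weight = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / weight
    return mu

# Toy data drawn from two Gaussians centred at 0 and 5 (illustrative only).
random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)] + \
       [random.gauss(5, 1) for _ in range(100)]
print(em_two_gaussians(data))   # roughly [0, 5]
```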
