Bayesian Learning Note
BAYESIAN LEARNING
Two Roles for Bayesian Methods
Provides practical learning algorithms:
– Naive Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities) with observed data
– Requires prior probabilities
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
– P(h) = prior probability of hypothesis h
– P(D) = prior probability of training data D
– P(h|D) = probability of h given D (posterior)
– P(D|h) = probability of D given h (likelihood)
Choosing Hypotheses
Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis hMAP:
hMAP = argmax over h in H of P(h|D) = argmax over h in H of P(D|h) P(h) / P(D) = argmax over h in H of P(D|h) P(h)
If every hypothesis in H is equally probable a priori, we need only consider P(D|h), giving the maximum likelihood hypothesis:
hML = argmax over h in H of P(D|h)
Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease
is not present. Furthermore, .008 of the entire population
have this cancer.
P(cancer) = .008        P(¬cancer) = .992
P(+|cancer) = .98       P(−|cancer) = .02
P(+|¬cancer) = .03      P(−|¬cancer) = .97
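A quick worked check of the MAP decision for a positive test result, written as a minimal Python sketch (variable names are illustrative):

# MAP decision for the cancer example above, using Bayes theorem.
p_cancer = 0.008                  # prior P(cancer)
p_not_cancer = 1 - p_cancer       # prior P(¬cancer) = 0.992
p_pos_given_cancer = 0.98         # P(+|cancer)
p_pos_given_not = 0.03            # P(+|¬cancer)

# Unnormalized posteriors: P(h|+) is proportional to P(+|h) P(h)
score_cancer = p_pos_given_cancer * p_cancer      # = .0078
score_not = p_pos_given_not * p_not_cancer        # = .0298

# Normalize to get P(cancer|+)
p_cancer_given_pos = score_cancer / (score_cancer + score_not)
print(p_cancer_given_pos)         # ≈ 0.21, so hMAP = ¬cancer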
Basic Formulas for Probabilities
Product Rule: probability P(A ∧ B) of a conjunction of
two events A and B:
P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Sum Rule: probability of a disjunction of two events A
and B:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Theorem of total probability: if events A1,…, An are
mutually exclusive with Σ from i=1 to n of P(Ai) = 1, then
P(B) = Σ from i=1 to n of P(B|Ai) P(Ai)
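A tiny numeric sanity check of these three rules, using a made-up joint distribution over two events (the numbers are purely illustrative):

# Check the product, sum, and total-probability rules on a small
# made-up joint distribution P(A=a, B=b).
joint = {(True, True): 0.12, (True, False): 0.28,
         (False, True): 0.18, (False, False): 0.42}

p_a = sum(v for (a, b), v in joint.items() if a)            # P(A)
p_b = sum(v for (a, b), v in joint.items() if b)            # P(B)
p_a_and_b = joint[(True, True)]                             # P(A ∧ B)
p_a_or_b = sum(v for (a, b), v in joint.items() if a or b)  # P(A ∨ B)

assert abs(p_a_and_b - (p_a_and_b / p_b) * p_b) < 1e-9      # product rule
assert abs(p_a_or_b - (p_a + p_b - p_a_and_b)) < 1e-9       # sum rule
p_b_total = joint[(True, True)] / p_a * p_a + joint[(False, True)] / (1 - p_a) * (1 - p_a)
assert abs(p_b - p_b_total) < 1e-9                          # total probability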
BRUTE FORCE MAP HYPOTHESIS LEARNER
1. For each hypothesis h in H, calculate the posterior probability
P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
hMAP = argmax over h in H of P(h|D)
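A minimal sketch of this learner, assuming the hypothesis space, prior, and likelihood are supplied as ordinary Python callables (the function names are illustrative):

# Brute-force MAP learner: score every hypothesis by P(D|h) P(h),
# normalize by P(D), and return the hypothesis with the largest posterior.
def brute_force_map(hypotheses, prior, likelihood, data):
    # prior(h) returns P(h); likelihood(data, h) returns P(D|h)
    scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    p_data = sum(scores.values())                  # P(D), by total probability
    posteriors = {h: s / p_data for h, s in scores.items()}
    h_map = max(posteriors, key=posteriors.get)
    return h_map, posteriors

Enumerating H explicitly like this is only practical when the hypothesis space is small.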
Naive Bayes Classifier
Along with decision trees, neural networks, and nearest
neighbor, one of the most practical learning methods.
When to use:
– Moderate or large training set available
– Attributes that describe instances are conditionally
independent given classification
Successful applications:
– Diagnosis
– Classifying text documents
Naive Bayes Classifier
Assume the target function is f: X → V, where each instance x is described by attributes a1, a2, …, an.
The most probable target value is:
vMAP = argmax over vj in V of P(vj | a1, a2, …, an) = argmax over vj in V of P(a1, a2, …, an | vj) P(vj)
Naive Bayes assumption (attributes are conditionally independent given the classification):
P(a1, a2, …, an | vj) = Π over i of P(ai | vj)
which gives the Naive Bayes classifier:
vNB = argmax over vj in V of P(vj) Π over i of P(ai | vj)
Naïve Bayes Classifier Example
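As a worked illustration of the rule above, here is a minimal Python sketch on a small made-up weather-style dataset (the data, attribute names, and helper functions are illustrative; no smoothing is applied, to keep the counts easy to follow):

# Naive Bayes on a tiny made-up categorical dataset.
# Each training example is (attribute dict, class label).
from collections import Counter, defaultdict

train = [
    ({"outlook": "sunny",    "wind": "weak"},   "no"),
    ({"outlook": "sunny",    "wind": "strong"}, "no"),
    ({"outlook": "rain",     "wind": "weak"},   "yes"),
    ({"outlook": "overcast", "wind": "weak"},   "yes"),
    ({"outlook": "rain",     "wind": "strong"}, "no"),
    ({"outlook": "overcast", "wind": "strong"}, "yes"),
]

def naive_bayes_learn(examples):
    class_counts = Counter(label for _, label in examples)      # counts of each vj
    cond_counts = defaultdict(Counter)                          # (vj, attr) -> value counts
    for attrs, label in examples:
        for attr, value in attrs.items():
            cond_counts[(label, attr)][value] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}   # P(vj)
    return priors, cond_counts, class_counts

def naive_bayes_classify(x, priors, cond_counts, class_counts):
    # vNB = argmax over vj of P(vj) Π over i of P(ai|vj), estimated from counts
    def score(c):
        s = priors[c]
        for attr, value in x.items():
            s *= cond_counts[(c, attr)][value] / class_counts[c]
        return s
    return max(priors, key=score)

priors, cond_counts, class_counts = naive_bayes_learn(train)
print(naive_bayes_classify({"outlook": "sunny", "wind": "weak"},
                           priors, cond_counts, class_counts))   # -> "no"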
Learning to Classify Text (1/4)
Why?
– Learn which news articles are of interest
– Learn to classify web pages by topic
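A minimal sketch of how naive Bayes is commonly applied to text: represent each document as a bag of words and estimate P(word | class) from word counts with add-one smoothing (the toy corpus and names are illustrative):

# Bag-of-words naive Bayes text classifier (illustrative toy corpus).
import math
from collections import Counter, defaultdict

docs = [("buy cheap meds now", "spam"),
        ("meeting agenda for monday", "ham"),
        ("cheap tickets buy now", "spam"),
        ("project meeting notes", "ham")]

def train_text_nb(docs):
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)          # class -> word -> count
    vocab = set()
    for text, label in docs:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify_text_nb(text, priors, word_counts, vocab):
    # Work in log space to avoid underflow; add-one smoothing handles unseen words.
    def log_score(c):
        total = sum(word_counts[c].values())
        s = math.log(priors[c])
        for w in text.split():
            s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        return s
    return max(priors, key=log_score)

priors, word_counts, vocab = train_text_nb(docs)
print(classify_text_nb("cheap meds for monday", priors, word_counts, vocab))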
Bayes Optimal Classifier
Bayes optimal classification:
argmax over vj in V of Σ over hi in H of P(vj|hi) P(hi|D)
Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
therefore
Σ over hi in H of P(+|hi) P(hi|D) = .4
Σ over hi in H of P(−|hi) P(hi|D) = .6
and
argmax over vj in V of Σ over hi in H of P(vj|hi) P(hi|D) = −
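The same computation, as a short Python sketch:

# Bayes optimal classification for the three-hypothesis example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(hi|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}      # each hi predicts + or - deterministically

def bayes_optimal_score(value):
    # Σ over hi of P(value|hi) P(hi|D); here each P(value|hi) is 1 or 0
    return sum(p for h, p in posteriors.items() if predictions[h] == value)

print(bayes_optimal_score("+"), bayes_optimal_score("-"))    # 0.4 and 0.6
print(max(["+", "-"], key=bayes_optimal_score))              # "-" is the Bayes optimal class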
Gibbs Classifier
Bayes optimal classifier provides the best result, but can be
expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance
Surprising fact: Assume target concepts are drawn at
random from H according to priors on H. Then:
E[errorGibbs] ≤ 2 E[errorBayesOptimal]
Suppose correct, uniform prior distribution over H, then
– Pick any hypothesis from VS, with uniform probability
– Its expected error no worse than twice Bayes optimal
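A minimal sketch of the Gibbs procedure, assuming the posteriors P(h|D) and each hypothesis's predictions are available (names are illustrative):

# Gibbs classifier: sample a single hypothesis according to P(h|D),
# then use only that hypothesis to classify the new instance.
import random

def gibbs_classify(x, posteriors, predict):
    # posteriors: dict mapping hypothesis -> P(h|D); predict(h, x) -> label
    hypotheses = list(posteriors)
    weights = [posteriors[h] for h in hypotheses]
    h = random.choices(hypotheses, weights=weights, k=1)[0]
    return predict(h, x)

# Reusing the three-hypothesis example: over many calls this returns "+"
# about 40% of the time and "-" about 60% of the time.
print(gibbs_classify(None, {"h1": 0.4, "h2": 0.3, "h3": 0.3},
                     lambda h, x: {"h1": "+", "h2": "-", "h3": "-"}[h]))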
Expectation Maximization (EM)
When to use:
– Data is only partially observable
– Unsupervised clustering (target value unobservable)
– Supervised learning (some instance attributes unobservable)
Some uses:
– Train Bayesian Belief Networks
– Unsupervised clustering (AUTOCLASS)
– Learning Hidden Markov Models
EM Algorithm
Converges to a local maximum likelihood hypothesis h
and provides estimates of the hidden variables zij
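To make the E and M steps concrete, here is a minimal sketch of EM for estimating the means of a mixture of two equal-variance Gaussians, where zij is the expected probability that point xi was generated by component j (the data, variance, and initialization are made up for illustration):

# EM for the means of a mixture of two Gaussians with known, equal variance.
import math
import random

def em_two_gaussians(xs, sigma=1.0, iters=50):
    mu = [min(xs), max(xs)]                     # crude initial guesses for the two means
    for _ in range(iters):
        # E step: expected membership z[i][j] of each point xi in component j
        z = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            z.append([wj / sum(w) for wj in w])
        # M step: re-estimate each mean as the z-weighted average of the data
        for j in range(2):
            mu[j] = (sum(z[i][j] * xs[i] for i in range(len(xs)))
                     / sum(z[i][j] for i in range(len(xs))))
    return mu

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
print(em_two_gaussians(xs))                     # approximately [0, 5]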
General EM Problem
Given:
– Observed data X = {x1,…, xm}
– Unobserved data Z = {z1,…, zm}
– Parameterized probability distribution P(Y|h), where
Y = {y1,…, ym} is the full data, with yi = xi ∪ zi
h are the parameters
Determine: h that (locally) maximizes E[ln P(Y|h)]
Many uses:
– Train Bayesian belief networks
– Unsupervised clustering (e.g., k-means)
– Hidden Markov Models
GENERAL EM METHOD
Define a likelihood function Q(h'|h) which calculates
Y = X ∪ Z using observed X and current parameters h to
estimate Z:
Q(h'|h) ← E[ln P(Y|h') | h, X]
EM Algorithm:
Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the
observed data X to estimate the probability distribution over Y.
Q(h'|h) ← E[ln P(Y|h') | h, X]
Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes
this Q function.
h ← argmax over h' of Q(h'|h)