Unit 3: Bayesian Learning

Bayesian Learning
• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm
Two Roles for Bayesian Methods
• Provides practical learning algorithms:
– Naive Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities) with observed data
– Requires prior probabilities
• Provides useful conceptual framework:
– Provides a “gold standard” for evaluating other learning algorithms
– Gives additional insight into Occam’s razor
Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(D|h) = probability of D given h
• P(h|D) = probability of h given D
Choosing Hypotheses

• Generally want the most probable hypothesis given the training data, the maximum a posteriori hypothesis h_MAP:
  h_MAP = argmax_{h in H} P(h|D)
        = argmax_{h in H} P(D|h) P(h) / P(D)
        = argmax_{h in H} P(D|h) P(h)
• If we assume P(hi) = P(hj) for all i, j, we can further simplify and choose the maximum likelihood (ML) hypothesis:
  h_ML = argmax_{h in H} P(D|h)
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability
   h_MAP = argmax_{h in H} P(h|D)
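A minimal Python sketch of this learner, assuming the caller supplies the hypothesis space plus prior and likelihood functions (all names below are illustrative, not from the slides):

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis by scoring every h in H.

    prior(h)            -> P(h)
    likelihood(data, h) -> P(D|h)
    P(D) is the same for every h, so it can be dropped from the argmax.
    """
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))
```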
Relation to Concept Learning (1/2)
Relation to Concept Learning (2/2)
• Assume fixed set of instances <x1,…, xm>
• Assume D is the set of classifications: D = <c(x1),…,c(xm)>
• Choose P(D|h):
– P(D|h) = 1 if h consistent with D
– P(D|h) = 0 otherwise
• Choose P(h) to be uniform distribution
– P(h) = 1/|H| for all h in H
• Then, P(h|D) = 1/|VS_H,D| if h is consistent with D, and P(h|D) = 0 otherwise
  (VS_H,D is the version space of H with respect to D)
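A short derivation of that posterior for a consistent h, sketched from the choices above (the denominator P(D) comes from summing P(D|h)P(h) over H):

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
            = \frac{1 \cdot \tfrac{1}{|H|}}{\tfrac{|VS_{H,D}|}{|H|}}
            = \frac{1}{|VS_{H,D}|},
\qquad
P(D) = \sum_{h \in H} P(D \mid h)\, P(h) = \frac{|VS_{H,D}|}{|H|}
```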
Evolution of Posterior Probabilities
Characterizing Learning Algorithms by Equivalent MAP Learners
Learning A Real Valued Function (1/2)
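The standard result this slide works toward, sketched under the usual assumptions (training values d_i = f(x_i) + e_i with i.i.d. zero-mean Gaussian noise e_i, and equal priors over hypotheses): maximizing likelihood is equivalent to minimizing the sum of squared errors.

```latex
h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)
       = \arg\max_{h \in H} \prod_{i=1}^{m}
            \frac{1}{\sqrt{2\pi\sigma^2}}
            \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right)
       = \arg\min_{h \in H} \sum_{i=1}^{m} \bigl(d_i - h(x_i)\bigr)^2
```

(take logs and drop the terms that do not depend on h)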
Learning to Predict Probabilities
• Consider predicting survival probability from patient data
• Training examples <xi, di>, where di is 1 or 0
• Want to train neural network to output a probability given xi
(not a 0 or 1)
• In this case one can show
  h_ML = argmax_{h in H} Σ_{i=1..m} [ d_i ln h(x_i) + (1 − d_i) ln (1 − h(x_i)) ]
• Weight update rule for a sigmoid unit: w_jk ← w_jk + Δw_jk, where
  Δw_jk = η Σ_{i=1..m} (d_i − h(x_i)) x_ijk
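A minimal Python sketch of the quantity being maximized (the helper name is illustrative):

```python
import math

def log_likelihood(ds, hs):
    """Sum_i d_i ln h(x_i) + (1 - d_i) ln (1 - h(x_i)).

    ds: observed 0/1 outcomes d_i
    hs: the network's predicted probabilities h(x_i), each strictly in (0, 1)
    The ML hypothesis is the one whose predictions maximize this sum.
    """
    return sum(d * math.log(h) + (1 - d) * math.log(1 - h)
               for d, h in zip(ds, hs))
```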
Minimum Description Length Principle (1/2)

Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
  h_MDL = argmin_{h in H} L_C1(h) + L_C2(D|h)
where L_C(x) is the description length of x under encoding C
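The usual information-theoretic reading behind this preference, sketched assuming optimal codes for hypotheses and for data given a hypothesis:

```latex
h_{MAP} = \arg\max_{h} P(D \mid h)\, P(h)
        = \arg\max_{h} \bigl[ \log_2 P(D \mid h) + \log_2 P(h) \bigr]
        = \arg\min_{h} \bigl[ -\log_2 P(D \mid h) - \log_2 P(h) \bigr]
```

Here −log2 P(h) is the description length of h under the optimal code for the hypothesis space and −log2 P(D|h) is the description length of D given h, so preferring h_MAP amounts to minimizing L_C1(h) + L_C2(D|h) under those codes.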
Bayes Optimal Classifier

The most probable classification of a new instance combines the predictions of all hypotheses, weighted by their posterior probabilities:
  argmax_{vj in V} Σ_{hi in H} P(vj|hi) P(hi|D)
• Example:
  P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
  P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
  P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
  therefore
  Σ_{hi in H} P(+|hi) P(hi|D) = .4
  Σ_{hi in H} P(−|hi) P(hi|D) = .6
  and
  argmax_{vj in V} Σ_{hi in H} P(vj|hi) P(hi|D) = −
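A few lines of Python reproducing that vote (the dictionary layout is illustrative):

```python
def bayes_optimal(posteriors, predictions, labels=("+", "-")):
    """posteriors[h] = P(h|D); predictions[h][v] = P(v|h)."""
    vote = lambda v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
    return max(labels, key=vote)

posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": {"+": 1, "-": 0},
               "h2": {"+": 0, "-": 1},
               "h3": {"+": 0, "-": 1}}
print(bayes_optimal(posteriors, predictions))   # prints "-" (0.6 vs 0.4)
```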
Gibbs Classifier

• The Bayes optimal classifier provides the best result, but can be expensive when there are many hypotheses
• Gibbs algorithm:
  1. Choose one hypothesis at random, according to P(h|D)
  2. Use this hypothesis to classify the new instance
• Surprising fact: if target concepts are drawn at random from H according to the priors on H, then
  E[error_Gibbs] ≤ 2 · E[error_BayesOptimal]
Naive Bayes Classifier (1/2)
Naive Bayes Classifier (2/2)
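A sketch of the Naive Bayes classifier in the usual notation, for an instance described by attribute values a_1, …, a_n:

```latex
v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \dots, a_n)
        = \arg\max_{v_j \in V} P(a_1, \dots, a_n \mid v_j)\, P(v_j)
```

The Naive Bayes assumption is P(a_1, …, a_n | v_j) = Π_i P(a_i | v_j), which gives the Naive Bayes classifier:

```latex
v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)
```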
Naive Bayes: Example
• Consider PlayTennis again, and new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong>
• Want to compute:
  v_NB = argmax_{vj in {yes, no}} P(vj) P(Outlk=sun|vj) P(Temp=cool|vj) P(Humid=high|vj) P(Wind=strong|vj)
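A small Python sketch of that computation; the probability estimates below are the usual relative-frequency estimates from the 14-example PlayTennis table (9 yes, 5 no) and should be treated as assumed values rather than read off these slides:

```python
# Relative-frequency estimates from the standard PlayTennis data (assumed here).
p_class = {"yes": 9/14, "no": 5/14}
p_attr = {
    "yes": {"Outlk=sun": 2/9, "Temp=cool": 3/9, "Humid=high": 3/9, "Wind=strong": 3/9},
    "no":  {"Outlk=sun": 3/5, "Temp=cool": 1/5, "Humid=high": 4/5, "Wind=strong": 3/5},
}

def naive_bayes(instance):
    """Return the v_NB decision and the (unnormalized) per-class scores."""
    scores = {}
    for v in p_class:
        score = p_class[v]
        for a in instance:
            score *= p_attr[v][a]
        scores[v] = score
    return max(scores, key=scores.get), scores

instance = ["Outlk=sun", "Temp=cool", "Humid=high", "Wind=strong"]
print(naive_bayes(instance))   # "no" wins: roughly 0.021 vs 0.0053 for "yes"
```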
Naive Bayes: Subtleties (1/2)
1. Conditional independence assumption is often violated
   – ...but it works surprisingly well anyway; the estimated posteriors need not be correct, we only need the argmax over them to pick the right class
Naive Bayes: Subtleties (2/2)

2. What if none of the training instances with target value vj have attribute value ai? Then the estimate P̂(ai|vj) = 0, which zeroes out the whole product. Typical solution is the Bayesian m-estimate:
   P̂(ai|vj) = (nc + m·p) / (n + m)
   where
   – n is the number of training examples for which v = vj
   – nc is the number of examples for which v = vj and a = ai
   – p is the prior estimate for P̂(ai|vj)
   – m is the weight given to the prior (i.e. number of “virtual” examples)
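A minimal Python sketch of the m-estimate (the function name is illustrative):

```python
def m_estimate(nc, n, p, m):
    """Smoothed estimate of P(a_i | v_j).

    nc: count of examples with v = v_j and a = a_i
    n:  count of examples with v = v_j
    p:  prior estimate for P(a_i | v_j), e.g. 1/k for k attribute values
    m:  equivalent sample size ("virtual" examples) given to the prior
    """
    return (nc + m * p) / (n + m)

# An attribute value never observed with class v_j no longer gets probability 0:
print(m_estimate(nc=0, n=9, p=1/3, m=3))   # 0.0833... instead of 0
```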
Learning to Classify Text (1/4)
• Why?
– Learn which news articles are of interest
– Learn to classify web pages by topic
Learning to Classify Text (2/4)

Target concept Interesting? : Document → {+, −}
1. Represent each document by a vector of words
   – one attribute per word position in the document
2. Learning: Use training examples to estimate
   – P(+) and P(−)
   – P(doc|+) and P(doc|−)
Naive Bayes conditional independence assumption:
   P(doc|vj) = Π_{i=1..length(doc)} P(ai = wk | vj)
where P(ai = wk | vj) is the probability that the word in position i is wk, given vj
(plus one more assumption: position does not matter, i.e. P(ai = wk | vj) = P(am = wk | vj) for all i, m)
Learning to Classify Text (3/4)

LEARN_NAIVE_BAYES_TEXT (Examples, V)
1. Collect all words and other tokens that occur in Examples
   • Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms
   • For each target value vj in V do
     – docsj ← subset of Examples for which the target value is vj
     – P(vj) ← |docsj| / |Examples|
     – Textj ← a single document created by concatenating all members of docsj
     – n ← total number of word positions in Textj
     – for each word wk in Vocabulary
       • nk ← number of times word wk occurs in Textj
       • P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
Learning to Classify Text (4/4)

CLASSIFY_NAIVE_BAYES_TEXT (Doc)
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where
  vNB = argmax_{vj in V} P(vj) Π_{i in positions} P(ai|vj)
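A compact Python sketch of the two procedures above, assuming documents arrive as lists of tokens (function names and data layout are illustrative):

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples, targets):
    """examples: list of (tokens, target) pairs; targets: the class values in V."""
    vocabulary = {w for tokens, _ in examples for w in tokens}
    priors, word_probs = {}, {}
    for v in targets:
        docs_v = [tokens for tokens, t in examples if t == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = [w for tokens in docs_v for w in tokens]   # Text_j: concatenation of docs_v
        counts, n = Counter(text_v), len(text_v)
        # "add one" smoothed estimates P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    positions = [w for w in doc if w in vocabulary]
    # sum of logs rather than a product of many small probabilities, to avoid underflow
    score = lambda v: math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in positions)
    return max(priors, key=score)
```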
Twenty NewsGroups

Bayesian Belief Networks

Interesting because:
• Naive Bayes assumption of conditional independence is too restrictive
• But it’s intractable without some such assumptions...
• Bayesian belief networks describe conditional independence among subsets of variables
  → allows combining prior knowledge about (in)dependencies among variables with observed training data
(also called Bayes Nets)
Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
  P(X | Y, Z) = P(X | Z)
• Example: Thunder is conditionally independent of Rain, given Lightning:
  P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Inference in Bayesian Networks

• How can we infer the (probabilities of) values of one or more network variables, given observed values of others?
  – The Bayes net contains all the information needed for this inference
  – If only one variable has an unknown value, the inference is easy
  – In the general case, exact inference is NP-hard
• In practice, approximate inference methods (e.g., Monte Carlo) work well in many cases
Learning of Bayesian Networks
• Several variants of this learning task
– Network structure might be known or unknown
– Training examples might provide values of all
network variables, or just some
Learning Bayes Nets
• Suppose structure known, variables partially
observable
• e.g., observe ForestFire, Storm, BusTourGroup,
Thunder, but not Lightning, Campfire...
– Similar to training neural network with hidden units
– In fact, can learn network conditional probability tables
using gradient ascent!
– Converge to network h that (locally) maximizes P(D|h)
Gradient Ascent for Bayes Nets
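A sketch of the update rule the previous bullets refer to, assuming w_ijk denotes the conditional probability table entry P(Y_i = y_ij | U_i = u_ik) for network variable Y_i with parent values u_ik, and η is a small learning rate:

```latex
w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D}
    \frac{P_h\!\left(Y_i = y_{ij},\, U_i = u_{ik} \mid d\right)}{w_{ijk}}
```

After each pass, renormalize the w_ijk so that each row of the table sums to 1 and stays in [0, 1]; repeating this performs gradient ascent on ln P(D|h) toward a local maximum.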
Summary: Bayesian Belief Networks
• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to
lower the sample complexity
• Active research area
– Extend from boolean to real-valued variables
– Parameterized distributions instead of tables
– Extend to first-order instead of propositional systems
– More effective inference methods
– …
Expectation Maximization (EM)
• When to use:
– Data is only partially observable
– Unsupervised clustering (target value unobservable)
– Supervised learning (some instance attributes
unobservable)
• Some uses:
– Train Bayesian Belief Networks
– Unsupervised clustering (AUTOCLASS)
– Learning Hidden Markov Models
Generating Data from Mixture of k Gaussians

Each instance x is generated by
1. choosing one of the k Gaussians with uniform probability
2. generating an instance at random according to that Gaussian
EM for Estimating k Means (2/2)

• EM Algorithm: Pick random initial h = <μ1, μ2>, then iterate:
  E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds:
    E[zij] = p(x = xi | μ = μj) / Σ_{n=1..2} p(x = xi | μ = μn)
           = exp(−(xi − μj)² / 2σ²) / Σ_{n=1..2} exp(−(xi − μn)² / 2σ²)
  M step: Calculate a new maximum likelihood hypothesis h′ = <μ1′, μ2′>, assuming each hidden variable zij takes on its expected value E[zij] computed above; then replace h by h′:
    μj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]
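A small Python sketch of those two steps for the two-mean case, assuming a known common variance σ² (the sample data and names are illustrative):

```python
import math
import random

def em_two_means(xs, sigma=1.0, iters=50):
    """EM estimate of the means of a mixture of two equal-variance Gaussians."""
    mu = random.sample(xs, 2)                      # random initial hypothesis <mu1, mu2>
    for _ in range(iters):
        # E step: expected value of each hidden indicator z_ij
        E = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            E.append([wj / sum(w) for wj in w])
        # M step: each mean becomes an E[z_ij]-weighted average of the data
        mu = [sum(E[i][j] * xs[i] for i in range(len(xs))) / sum(E[i][j] for i in range(len(xs)))
              for j in range(2)]
    return mu

xs = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(em_two_means(xs))    # roughly [0, 5], up to ordering
```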
EM Algorithm
General EM Problem
• Given:
– Observed data X = {x1,…, xm}
– Unobserved data Z = {z1,…, zm}
– Parameterized probability distribution P(Y|h), where
• Y = {y1,…, ym} is the full data, where yi = xi ∪ zi
• h are the parameters
• Determine: h that (locally) maximizes E[ln P(Y|h)]
• Many uses:
– Train Bayesian belief networks
– Unsupervised clustering (e.g., k means)
– Hidden Markov Models
General EM Method
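A sketch of the general method in the notation of the previous slide: define Q(h′|h) as the expected log-likelihood of the full data Y under candidate parameters h′, with the expectation taken using the current hypothesis h and the observed data X.

```latex
Q(h' \mid h) \;=\; E\left[\, \ln P(Y \mid h') \;\middle|\; h, X \,\right]
```

E step: calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.
M step: replace h by the hypothesis h′ that maximizes this Q function: h ← argmax_{h′} Q(h′|h).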