0% found this document useful (0 votes)

24 views49 pages

Unit 3 Bayesian Learning

Uploaded by

N MAHBOOB SUBANI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

24 views49 pages

Unit 3 Bayesian Learning

Uploaded by

N MAHBOOB SUBANI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 49

Machine Learning

Bayesian Learning
Bayesian Learning
• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm

2
Two Roles for Bayesian Methods
• Provides practical learning algorithms:
– Naive Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities) with
observed data
– Requires prior probabilities

• Provides useful conceptual framework

– Provides “gold standard” for evaluating other learning
algorithms
– Additional insight into Occam’s razor

3
Bayes Theorem

4
Choosing Hypotheses

• Generally want the most probable hypothesis given the

training data
Maximum a posteriori hypothesis hMAP:

• If assume P(hi) = P(hj) then can further simplify, and

choose the Maximum likelihood (ML) hypothesis

5
Bayes Theorem

• Does patient have cancer or not?

A patient takes a lab test and the result comes back
positive. The test returns a correct positive result in only
98% of the cases in which the disease is actually present,
and a correct negative result in only 97% of the cases in
which the disease is not present. Furthermore, .008 of the
entire population have this cancer.
P(cancer) = P(cancer) =
P(|cancer) = P(|cancer) =
P(|cancer) = P(|cancer) =
6
Basic Formulas for Probabilities

• Product Rule: probability P(A  B) of a conjunction of

two events A and B:
P(A  B) = P(A | B) P(B) = P(B | A) P(A)
• Sum Rule: probability of a disjunction of two events A
and B:
P(A  B) = P(A) + P(B) - P(A  B)
• Theorem of total probability: if events A1,…, An are
mutually exclusive with , then

7
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior

probability

2. Output the hypothesis hMAP with the highest

posterior probability

8
Relation to Concept Learning(1/2)

• Consider our usual concept learning task

– instance space X, hypothesis space H, training
examples D
– consider the FindS learning algorithm (outputs most
specific hypothesis from the version space V SH,D)
• What would Bayes rule produce as the MAP
hypothesis?
• Does FindS output a MAP hypothesis??

9
Relation to Concept Learning(2/2)
• Assume fixed set of instances <x1,…, xm>
• Assume D is the set of classifications: D = <c(x1),…,c(xm)>
• Choose P(D|h):
– P(D|h) = 1 if h consistent with D
– P(D|h) = 0 otherwise
• Choose P(h) to be uniform distribution
– P(h) = 1/|H| for all h in H
• Then,

10
Evolution of Posterior Probabilities

11
Characterizing Learning Algorithms
by Equivalent MAP Learners

12
Learning A Real Valued Function(1/2)

Consider any real-valued target function f

Training examples <xi, di>, where di is noisy training value
• di = f(xi) + ei
• ei is random variable (noise) drawn independently for each xi
according to some Gaussian distribution with mean=0
Then the maximum likelihood hypothesis hML is the one that minimizes
the sum of squared errors:
13
Learning A Real Valued Function(2/2)

• Maximize natural log of this instead...

14
Learning to Predict Probabilities
• Consider predicting survival probability from patient data
• Training examples <xi, di>, where di is 1 or 0
• Want to train neural network to output a probability given xi
(not a 0 or 1)
• In this case can show

• Weight update rule for a sigmoid unit:

where

15
Minimum Description Length Principle (1/2)
Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes

where LC(x) is the description length of x under encoding C

Example: H = decision trees, D = training data labels

• LC1(h) is # bits to describe tree h
• LC2(D|h) is # bits to describe D given h
– Note LC2(D|h) = 0 if examples classified perfectly by h. Need only
describe exceptions
• Hence hMDL trades off tree size for training errors
16
Minimum Description Length Principle (2/2)

Interesting fact from information theory:

The optimal (shortest expected coding length) code for an event with
probability p is –log2p bits.
So interpret (1):
• –log2P(h) is length of h under optimal code
• –log2P(D|h) is length of D given h under optimal code
 prefer the hypothesis that minimizes
length(h) + length(misclassifications)
17
Most Probable Classification
of New Instances

• So far we’ve sought the most probable hypothesis given

the data D (i.e., hMAP)
• Given new instance x, what is its most probable
classification?
– hMAP(x) is not the most probable classification!
• Consider:
– Three possible hypotheses:
P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
– Given new instance x,
h1(x) = +, h2(x) = , h3(x) = 
– What’s most probable classification of x?
18
Bayes Optimal Classifier

• Bayes optimal classification:

• Example:
P(h1|D) = .4, P(|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(|h3) = 1, P(+|h3) = 0
therefore

and

19
Gibbs Classifier

• Bayes optimal classifier provides best result, but can be

expensive if many hypotheses.
• Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance
• Surprising fact: Assume target concepts are drawn at
random from H according to priors on H. Then:
E[errorGibbs]  2E [errorBayesOptional]
• Suppose correct, uniform prior distribution over H, then
– Pick any hypothesis from VS, with uniform probability
– Its expected error no worse than twice Bayes optimal

20
Naive Bayes Classifier (1/2)

• Along with decision trees, neural networks, nearest

nbr, one of the most practical learning methods.
• When to use
– Moderate or large training set available
– Attributes that describe instances are conditionally
independent given classification
• Successful applications:
– Diagnosis
– Classifying text documents

21
Naive Bayes Classifier (2/2)

• Assume target function f : X  V, where each instance x

described by attributes <a1, a2 … an>.
• Most probable value of f(x) is:

Naive Bayes assumption:

which gives
Naive Bayes classifier:
22
Naive Bayes Algorithm
• Naive Bayes Learn(examples)
For each target value vj
P(vj)  estimate P(vj)
^
For each attribute value ai of each attribute a

^ i |vj)  estimate P(ai |vj)

P(a

• Classify New Instance(x)

23
Naive Bayes: Example
• Consider PlayTennis again, and new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong>
• Want to compute:

P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005

P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
 vNB = n

24
Naive Bayes: Subtleties (1/2)
1. Conditional independence assumption is often
violated

– ...but it works surprisingly well anyway. Note don’t need

estimated posteriors to be correct; need only that

– see [Domingos & Pazzani, 1996] for analysis

– Naive Bayes posteriors often unrealistically close to 1 or 0

25
Naive Bayes: Subtleties (2/2)

2. what if none of the training instances with target value vj

have attribute value ai? Then

Typical solution is Bayesian estimate for

where
– n is number of training examples for which v = vi,
– nc number of examples for which v = vj and a = ai
– p is prior estimate for
– m is weight given to prior (i.e. number of “virtual” examples)

26
Learning to Classify Text (1/4)

• Why?
– Learn which news articles are of interest
– Learn to classify web pages by topic

• Naive Bayes is among most effective algorithms

• What attributes shall we use to represent text
documents??

27
Learning to Classify Text (2/4)
Target concept Interesting? : Document {, }
1. Represent each document by vector of words
– one attribute per word position in document
2. Learning: Use training examples to estimate
– P()  P()
– P(doc|)  P(doc|)
Naive Bayes conditional independence assumption

where P(ai = wk | vj) is probability that word in position i is

wk, given vj
one more assumption:
28
Learning to Classify Text (3/4)

LEARN_NAIVE_BAYES_TEXT (Examples, V)
1. collect all words and other tokens that occur in Examples
• Vocabulary  all distinct words and other tokens in
Examples
2. calculate the required P(vj) and P(wk | vj) probability terms
• For each target value vj in V do
– docsj  subset of Examples for which the target value is vj
–
– Textj  a single document created by concatenating all members
of docsj
29
Learning to Classify Text (4/4)

– n  total number of words in Textj (counting duplicate words

multiple times)
– for each word wk in Vocabulary
* nk  number of times word wk occurs in Textj

CLASSIFY_NAIVE_BAYES_TEXT (Doc)
• positions  all word positions in Doc that contain tokens
found in Vocabulary
• Return vNB where

30
Twenty NewsGroups

• Given 1000 training documents from each group Learn to

classify new documents according to which newsgroup it
came from

comp.graphics misc.forsale alt.atheism sci.space

comp.os.ms-win- rec.autos soc.religion.christian sci.crypt
dows.misc rec.motorcycles talk.religion.misc sci.electronics
comp.sys.ibm.pc.hardware rec.sport.baseball talk.politics.mideast sci.med
comp.sys.mac.hardware rec.sport.hockey talk.politics.misc
comp.windows.x talk.politics.guns

• Naive Bayes: 89% classification accuracy

31
Learning Curve for 20 Newsgroups

• Accuracy vs. Training set size (1/3 withheld for test)

32
Bayesian Belief Networks

Interesting because:
• Naive Bayes assumption of conditional independence too
restrictive
• But it’s intractable without some such assumptions...
• Bayesian Belief networks describe conditional
independence among subsets of variables
 allows combining prior knowledge about
(in)dependencies among variables with observed training
data
(also called Bayes Nets)
33
Conditional Independence

• Definition: X is conditionally independent of Y given Z if

the probability distribution governing X is independent of
the value of Y given the value of Z; that is, if
(xi, yj, zk) P(X= xi|Y= yj, Z= zk) = P(X= xi|Z= zk)
more compactly, we write
P(X|Y, Z) = P(X|Z)

• Example: Thunder is conditionally independent of Rain,

given Lightning
P(Thunder|Rain, Lightning) = P(Thunder|Lightning)

• Naive Bayes uses cond. indep. to justify

P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)
34
Bayesian Belief Network (1/2)

• Network represents a set of conditional independence

assertions:
– Each node is asserted to be conditionally independent
of its nondescendants, given its immediate
predecessors.
– Directed acyclic graph 35
Bayesian Belief Network (2/2)

• Represents joint probability distribution over all

variables
– e.g., P(Storm, BusTourGroup, . . . , ForestFire)
– in general,

where Parents(Yi) denotes immediate predecessors of

Yi in graph
– so, joint distribution is fully defined by graph, plus the
P(yi|Parents(Yi))

36
Inference in Bayesian Networks

• How can one infer the (probabilities of) values of one or

more network variables, given observed values of others?
– Bayes net contains all information needed for this inference
– If only one variable with unknown value, easy to infer it
– In general case, problem is NP hard

• In practice, can succeed in many cases

– Exact inference methods work well for some network structures
– Monte Carlo methods “simulate” the network randomly to calculate
approximate solutions

37
Learning of Bayesian Networks
• Several variants of this learning task
– Network structure might be known or unknown
– Training examples might provide values of all
network variables, or just some

• If structure known and observe all variables

– Then it’s easy as training a Naive Bayes classifier

38
Learning Bayes Nets
• Suppose structure known, variables partially
observable
• e.g., observe ForestFire, Storm, BusTourGroup,
Thunder, but not Lightning, Campfire...
– Similar to training neural network with hidden units
– In fact, can learn network conditional probability tables
using gradient ascent!
– Converge to network h that (locally) maximizes P(D|h)

39
Gradient Ascent for Bayes Nets

• Let wijk denote one entry in the conditional probability

table for variable Yi in the network
wijk = P(Yi = yij|Parents(Yi) = the list uik of values)
• e.g., if Yi = Campfire, then uik might be
<Storm = T, BusTourGroup = F >
• Perform gradient ascent by repeatedly
1. update all wijk using training data D

2. then, renormalize the to wijk assure

– j wijk = 1  0  wijk  1
40
More on Learning Bayes Nets
• EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming
h
2. Calculate new wijk to maximize E[ln P(D|h)] where D now
includes both observed and (calculated probabilities of)
unobserved variables

• When structure unknown...

– Algorithms use greedy search to add/substract edges and
nodes
– Active research topic

41
Summary: Bayesian Belief Networks
• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to
lower the sample complexity
• Active research area
– Extend from boolean to real-valued variables
– Parameterized distributions instead of tables
– Extend to first-order instead of propositional systems
– More effective inference methods
– …

42
Expectation Maximization (EM)
• When to use:
– Data is only partially observable
– Unsupervised clustering (target value unobservable)
– Supervised learning (some instance attributes
unobservable)
• Some uses:
– Train Bayesian Belief Networks
– Unsupervised clustering (AUTOCLASS)
– Learning Hidden Markov Models

43
Generating Data from Mixture of k Gaussians

• Each instance x generated by

1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that
Gaussian
44
EM for Estimating k Means (1/2)
• Given:
– Instances from X generated by mixture of k Gaussian distributions
– Unknown means <1,…,k > of the k Gaussians
– Don’t know which instance xi was generated by which Gaussian
• Determine:
– Maximum likelihood estimates of <1,…,k >

• Think of full description of each instance as

yi = < xi, zi1, zi2> where
– zij is 1 if xi generated by jth Gaussian
– xi observable
– zij unobservable

45
EM for Estimating k Means (2/2)
• EM Algorithm: Pick random initial h = <1, 2> then iterate
E step: Calculate the expected value E[zij] of each
hidden variable zij, assuming the current
hypothesis
h = <1, 2> holds.

M step: Calculate a new maximum likelihood hypothesis

h' = <'1, '2>, assuming the value taken on by each hidden
variable zij is its expected value E[zij] calculated above.
Replace h = <1, 2> by h' = <'1, '2>.

46
EM Algorithm

• Converges to local maximum likelihood h and

provides estimates of hidden variables zij

• In fact, local maximum in E[ln P(Y|h)]

– Y is complete (observable plus unobservable
variables) data
– Expected value is taken over possible values of
unobserved variables in Y

47
General EM Problem

• Given:
– Observed data X = {x1,…, xm}
– Unobserved data Z = {z1,…, zm}
– Parameterized probability distribution P(Y|h), where
• Y = {y1,…, ym} is the full data yi = xi  zi
• h are the parameters
• Determine: h that (locally) maximizes E[ln P(Y|h)]
• Many uses:
– Train Bayesian belief networks
– Unsupervised clustering (e.g., k means)
– Hidden Markov Models

48
General EM Method

• Define likelihood function Q(h'|h) which calculates

– Maximization (M) step: Replace hypothesis h by the hypothesis h' that

maximizes this Q function.

Naive Bayes
No ratings yet
Naive Bayes
60 pages
ML Unit-Iii
No ratings yet
ML Unit-Iii
178 pages
IML Module 3
No ratings yet
IML Module 3
95 pages
Bayesian Learning
No ratings yet
Bayesian Learning
81 pages
Week 7 AI LUMS
No ratings yet
Week 7 AI LUMS
47 pages
Unit 3
No ratings yet
Unit 3
99 pages
ML - Unit-3 Chapter - 6 (Bayes Theorem) - Notes
No ratings yet
ML - Unit-3 Chapter - 6 (Bayes Theorem) - Notes
123 pages
Slide07 Bayes
No ratings yet
Slide07 Bayes
51 pages
Chapter 6 Reviewer
100% (5)
Chapter 6 Reviewer
22 pages
Lecture13 Nbayes
No ratings yet
Lecture13 Nbayes
56 pages
UNIT4 - Part2 Aiml
No ratings yet
UNIT4 - Part2 Aiml
46 pages
UNIT 4 - Bayesian Learning
No ratings yet
UNIT 4 - Bayesian Learning
54 pages
Unit 2 Bayesian Learning
No ratings yet
Unit 2 Bayesian Learning
50 pages
Lecture 4
No ratings yet
Lecture 4
36 pages
Unit 4
No ratings yet
Unit 4
36 pages
Naive Bayes
No ratings yet
Naive Bayes
31 pages
L4 Naive Bayes
No ratings yet
L4 Naive Bayes
31 pages
2024 - Slide2 - BayesML Sub
No ratings yet
2024 - Slide2 - BayesML Sub
40 pages
Chapter 6 Bayesianlearning
No ratings yet
Chapter 6 Bayesianlearning
32 pages
L13 Bayesian Methods
No ratings yet
L13 Bayesian Methods
30 pages
Bayesian Learning
No ratings yet
Bayesian Learning
44 pages
2022 Slide9 BayesML Eng
No ratings yet
2022 Slide9 BayesML Eng
34 pages
Lecture 5 Bayesian
No ratings yet
Lecture 5 Bayesian
37 pages
Irs Unit 4 CH 1
No ratings yet
Irs Unit 4 CH 1
58 pages
ML-Module 4-P1
No ratings yet
ML-Module 4-P1
30 pages
Unit 3
No ratings yet
Unit 3
16 pages
Bayesian Decision Theory and Learning: Jayanta Mukhopadhyay Dept. of Computer Science and Engg
No ratings yet
Bayesian Decision Theory and Learning: Jayanta Mukhopadhyay Dept. of Computer Science and Engg
56 pages
Bayesian Learning Note
No ratings yet
Bayesian Learning Note
20 pages
Naive Bayes Classifiers - Parta
No ratings yet
Naive Bayes Classifiers - Parta
17 pages
Unit III
No ratings yet
Unit III
19 pages
Text Mining - Classification
No ratings yet
Text Mining - Classification
28 pages
NBayes 1 20 2011 Ann
No ratings yet
NBayes 1 20 2011 Ann
21 pages
6.1 Bayesian Learning
No ratings yet
6.1 Bayesian Learning
33 pages
8 ML
No ratings yet
8 ML
22 pages
Lecture 06 Bayesian Networks 07112022 011127pm
No ratings yet
Lecture 06 Bayesian Networks 07112022 011127pm
33 pages
Statistical Tables and Formulae PDF
No ratings yet
Statistical Tables and Formulae PDF
93 pages
Bayesian Learning: Based On "Machine Learning", T. Mitchell, Mcgraw Hill, 1997, Ch. 6
No ratings yet
Bayesian Learning: Based On "Machine Learning", T. Mitchell, Mcgraw Hill, 1997, Ch. 6
54 pages
6 Naive-Bayes
No ratings yet
6 Naive-Bayes
18 pages
Lecture 7
No ratings yet
Lecture 7
15 pages
Unit 6
No ratings yet
Unit 6
19 pages
Lecture - 4.1 - Bayes Classifier
No ratings yet
Lecture - 4.1 - Bayes Classifier
31 pages
Module - 3 - Last Part
No ratings yet
Module - 3 - Last Part
16 pages
Bayes Algorithm
No ratings yet
Bayes Algorithm
26 pages
Classification - Naive Bayes
No ratings yet
Classification - Naive Bayes
17 pages
Wa0002.
No ratings yet
Wa0002.
24 pages
15CS73 Module 4
No ratings yet
15CS73 Module 4
60 pages
Bayesian Learning
No ratings yet
Bayesian Learning
49 pages
Ba Yes Naive
No ratings yet
Ba Yes Naive
15 pages
ML BayesionBeliefNetwork Lect12 14
No ratings yet
ML BayesionBeliefNetwork Lect12 14
99 pages
@vtudeveloper - in ML Mod 4
No ratings yet
@vtudeveloper - in ML Mod 4
11 pages
Bayes Practice Book
No ratings yet
Bayes Practice Book
229 pages
Naïve Bayes Classifier: Dr. Hussain Dawood
No ratings yet
Naïve Bayes Classifier: Dr. Hussain Dawood
20 pages
Machine - Learning (Unit 3)
No ratings yet
Machine - Learning (Unit 3)
9 pages
Unit II Probabilistic Reasoning
No ratings yet
Unit II Probabilistic Reasoning
28 pages
Probabilistic Models in Machine Learning: Unit - III Chapter - 1
No ratings yet
Probabilistic Models in Machine Learning: Unit - III Chapter - 1
18 pages
Naïve Bayes Classifier: April 25, 2006
No ratings yet
Naïve Bayes Classifier: April 25, 2006
19 pages
Data Mining - Module 7
No ratings yet
Data Mining - Module 7
8 pages
Machine Learning: Lecture 6: Bayesian Learning (Based On Chapter 6 of Mitchell T.., Machine Learning, 1997)
No ratings yet
Machine Learning: Lecture 6: Bayesian Learning (Based On Chapter 6 of Mitchell T.., Machine Learning, 1997)
15 pages
Naïve Bayes Classifier: Ke Chen
No ratings yet
Naïve Bayes Classifier: Ke Chen
20 pages
ML Module 4 Chapter 8 RNSIT
No ratings yet
ML Module 4 Chapter 8 RNSIT
5 pages
E-Note 14654 Content Document 20231228101425AM
No ratings yet
E-Note 14654 Content Document 20231228101425AM
10 pages
Data Analytics III-i
No ratings yet
Data Analytics III-i
85 pages
Probabilistic Class I Fiers
No ratings yet
Probabilistic Class I Fiers
5 pages
LMS Content IVth Sem Module 3 PDF
No ratings yet
LMS Content IVth Sem Module 3 PDF
16 pages
Survival Models ANU Lecture 3
No ratings yet
Survival Models ANU Lecture 3
17 pages
21CS54 Module 4 2021 Scheme
No ratings yet
21CS54 Module 4 2021 Scheme
42 pages
R20CS700OE-Data-Structures Material
No ratings yet
R20CS700OE-Data-Structures Material
259 pages
Mathematics-II (Probability & Statistics) - Course Oultine
No ratings yet
Mathematics-II (Probability & Statistics) - Course Oultine
2 pages
AJC JC 2 H2 Maths 2011 Mid Year Exam Question Paper 2
No ratings yet
AJC JC 2 H2 Maths 2011 Mid Year Exam Question Paper 2
3 pages
Bus 172 Chapter 4
No ratings yet
Bus 172 Chapter 4
109 pages
Probility Assignment
No ratings yet
Probility Assignment
20 pages
Introduction To Probability Theory: Rong Jin
No ratings yet
Introduction To Probability Theory: Rong Jin
39 pages
Narrabundah College: Specialist Mathematics AC/IB Unit 7: Specialist Mathematics - 1.0 STD Units
No ratings yet
Narrabundah College: Specialist Mathematics AC/IB Unit 7: Specialist Mathematics - 1.0 STD Units
12 pages
ECE 6337 HW 1-7 and Soln 1-7
No ratings yet
ECE 6337 HW 1-7 and Soln 1-7
19 pages
Marek Kimmel's Plagiarism - Branching Processes in Biology
No ratings yet
Marek Kimmel's Plagiarism - Branching Processes in Biology
55 pages
SS4 Linear Combinations of Random Variables Student Notes
No ratings yet
SS4 Linear Combinations of Random Variables Student Notes
16 pages
Manual de Crystal Ball
No ratings yet
Manual de Crystal Ball
402 pages
Arima
No ratings yet
Arima
14 pages
Failure Analysis and Risk Concepts
No ratings yet
Failure Analysis and Risk Concepts
23 pages
Week 5 Decision Tree - Revised Probability
No ratings yet
Week 5 Decision Tree - Revised Probability
51 pages
Calculation of Grain Size Statistics
No ratings yet
Calculation of Grain Size Statistics
6 pages
Models and Modeling
No ratings yet
Models and Modeling
12 pages
Lecture Slides 12 - Sampling and The Central Limit Theorem
No ratings yet
Lecture Slides 12 - Sampling and The Central Limit Theorem
40 pages
Solution Q2 Summer - 2019
No ratings yet
Solution Q2 Summer - 2019
2 pages
CS 725: Foundations of Machine Learning: Lecture 2. Overview of Probability Theory For ML
No ratings yet
CS 725: Foundations of Machine Learning: Lecture 2. Overview of Probability Theory For ML
23 pages
OOAD Unit 1
No ratings yet
OOAD Unit 1
40 pages
Chapter 13: Policy Gradient Methods: by Richard Sutton and Andrew Barto
No ratings yet
Chapter 13: Policy Gradient Methods: by Richard Sutton and Andrew Barto
35 pages
Probability Tutorial
No ratings yet
Probability Tutorial
8 pages
Discriminating Among Weibull, Log-Normal and LL
No ratings yet
Discriminating Among Weibull, Log-Normal and LL
28 pages
Chapter 8 - Probability
No ratings yet
Chapter 8 - Probability
18 pages
Business Analytics Module 2 Summary
No ratings yet
Business Analytics Module 2 Summary
3 pages
Ece 313: Problem Set 7: Solutions Continuous Rvs and Poisson Processes
No ratings yet
Ece 313: Problem Set 7: Solutions Continuous Rvs and Poisson Processes
4 pages
On The Pricing of European Swaptions
No ratings yet
On The Pricing of European Swaptions
6 pages

Unit 3 Bayesian Learning

Uploaded by

Unit 3 Bayesian Learning

Uploaded by

Machine Learning

• Provides useful conceptual framework

• Generally want the most probable hypothesis given the

• If assume P(hi) = P(hj) then can further simplify, and

• Does patient have cancer or not?

• Product Rule: probability P(A  B) of a conjunction of

1. For each hypothesis h in H, calculate the posterior

2. Output the hypothesis hMAP with the highest

• Consider our usual concept learning task

Consider any real-valued target function f

• Maximize natural log of this instead...

• Weight update rule for a sigmoid unit:

where LC(x) is the description length of x under encoding C

Example: H = decision trees, D = training data labels

Interesting fact from information theory:

• So far we’ve sought the most probable hypothesis given

• Bayes optimal classification:

• Bayes optimal classifier provides best result, but can be

• Along with decision trees, neural networks, nearest

• Assume target function f : X  V, where each instance x

Naive Bayes assumption:

^ i |vj)  estimate P(ai |vj)

• Classify New Instance(x)

P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005

– ...but it works surprisingly well anyway. Note don’t need

– see [Domingos & Pazzani, 1996] for analysis

2. what if none of the training instances with target value vj

Typical solution is Bayesian estimate for

• Naive Bayes is among most effective algorithms

where P(ai = wk | vj) is probability that word in position i is

– n  total number of words in Textj (counting duplicate words

• Given 1000 training documents from each group Learn to

comp.graphics misc.forsale alt.atheism sci.space

• Naive Bayes: 89% classification accuracy

• Accuracy vs. Training set size (1/3 withheld for test)

• Definition: X is conditionally independent of Y given Z if

• Example: Thunder is conditionally independent of Rain,

• Naive Bayes uses cond. indep. to justify

• Network represents a set of conditional independence

• Represents joint probability distribution over all

where Parents(Yi) denotes immediate predecessors of

• How can one infer the (probabilities of) values of one or

• In practice, can succeed in many cases

• If structure known and observe all variables

• Let wijk denote one entry in the conditional probability

2. then, renormalize the to wijk assure

• When structure unknown...

• Each instance x generated by

• Think of full description of each instance as

M step: Calculate a new maximum likelihood hypothesis

• Converges to local maximum likelihood h and

• In fact, local maximum in E[ln P(Y|h)]

• Define likelihood function Q(h'|h) which calculates

– Maximization (M) step: Replace hypothesis h by the hypothesis h' that

You might also like