
6. Bayesian Learning

Introduction
– Bayesian learning algorithms calculate explicit probabilities for hypotheses
– They provide a practical approach to certain learning problems
– They provide a useful perspective for understanding learning algorithms

Real-life applications
• Text-based classification, such as spam or junk-mail filtering, author identification, or topic categorization
• Medical diagnosis, such as estimating the probability that a new patient has a disease given a set of observed symptoms
• Network security, such as detecting illegal intrusions or anomalies in computer networks
Drawbacks:
– Typically requires initial knowledge of many probabilities
– In some cases, significant computational cost is required to determine the Bayes optimal hypothesis (linear in the number of candidate hypotheses)
Bayes Theorem
Best hypothesis ≡ most probable hypothesis
Notation
P(h): prior probability of hypothesis h
P(D): prior probability that training data D will be observed
P(D|h): probability of observing D given that h holds
P(h|D): posterior probability of h given D

• Bayes Theorem
  P(h|D) = P(D|h) P(h) / P(D)

• Maximum a posteriori (MAP) hypothesis
  h_MAP ≡ argmax_{h∈H} P(h|D)
        = argmax_{h∈H} P(D|h) P(h)

• Maximum likelihood (ML) hypothesis
  h_ML = argmax_{h∈H} P(D|h)
       = h_MAP if we assume P(h) = constant
• Example
  P(cancer) = 0.008            P(¬cancer) = 0.992
  P(+|cancer) = 0.98           P(-|cancer) = 0.02
  P(+|¬cancer) = 0.03          P(-|¬cancer) = 0.97

For a new patient the lab test returns a positive result. Should we diagnose cancer or not?
  P(+|cancer) P(cancer) = 0.0078        P(+|¬cancer) P(¬cancer) = 0.0298
⇒ h_MAP = ¬cancer
6.3 Bayes Theorem and Concept Learning
What is the relationship between Bayes theorem and concept learning?

– Brute-Force Bayes Concept Learning
  1. For each hypothesis h ∈ H, calculate P(h|D)
  2. Output h_MAP ≡ argmax_{h∈H} P(h|D)



– We must choose P(h) and P(D|h) from prior knowledge
Let's assume:
  1. The training data D is noise free
  2. The target concept c is contained in H
  3. We consider all hypotheses a priori equally probable
⇒ P(h) = 1/|H|   ∀ h ∈ H


Since the data is assumed noise free:
  P(D|h) = 1 if d_i = h(x_i) ∀ d_i ∈ D
  P(D|h) = 0 otherwise

Brute-force MAP learning
– If h is inconsistent with D:
  P(h|D) = P(D|h)·P(h)/P(D) = 0·P(h)/P(D) = 0

– If h is consistent with D:
  P(h|D) = 1·(1/|H|) / (|VS_H,D| / |H|) = 1/|VS_H,D|
  where VS_H,D is the version space: the subset of hypotheses in H that are consistent with D
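
A minimal sketch of this brute-force procedure in Python, assuming (hypothetically) that each hypothesis is a callable and the data is a list of (x, d) pairs:

def brute_force_map(hypotheses, data):
    # Noise-free data and uniform prior P(h) = 1/|H|;
    # assumes the target concept is in H, so at least one hypothesis is consistent.
    prior = 1.0 / len(hypotheses)
    unnormalized = []
    for h in hypotheses:
        likelihood = 1.0 if all(h(x) == d for x, d in data) else 0.0   # P(D|h)
        unnormalized.append(likelihood * prior)                        # P(D|h) P(h)
    p_data = sum(unnormalized)                         # P(D) = |VS_H,D| / |H|
    posteriors = [u / p_data for u in unnormalized]    # every consistent h gets 1/|VS_H,D|
    best = max(range(len(hypotheses)), key=lambda i: posteriors[i])
    return hypotheses[best], posteriors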

⇒ P(h|D) = 1/|VS_H,D| if h is consistent with D
   P(h|D) = 0 otherwise

⇒ Every consistent hypothesis is a MAP hypothesis

Consistent Learners
– Learning algorithms that output hypotheses committing zero errors over the training examples (consistent hypotheses)


Under the assumed conditions, Find-S is a consistent learner.

The Bayesian framework makes it possible to characterize the behavior of learning algorithms by identifying the P(h) and P(D|h) under which they output optimal (MAP) hypotheses.


6.4 Maximum Likelihood and LSE Hypotheses

Learning a continuous-valued target function (regression or curve fitting)

H = class of real-valued functions defined over X
  h: X → ℝ learns f: X → ℝ
  (x_i, d_i) ∈ D,   d_i = f(x_i) + ε_i,   i = 1,...,m
  f: noise-free target function     ε: white noise, ε ~ N(0, σ)


Under these assumptions, any learning algorithm that minimizes the squared error between the hypothesis predictions and the training data outputs an ML hypothesis:

  h_ML = argmax_{h∈H} p(D|h)
       = argmax_{h∈H} ∏_{i=1..m} p(d_i|h)
       = argmax_{h∈H} ∏_{i=1..m} exp{ -[d_i - h(x_i)]² / 2σ² }
       = argmin_{h∈H} ∑_{i=1..m} [d_i - h(x_i)]²  =  h_LSE
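
A small numerical illustration of this equivalence, under the (hypothetical) assumption that H is the class of straight lines h(x) = w0 + w1·x; the least-squares fit is then the ML hypothesis:

# Least-squares fit as ML hypothesis under Gaussian noise (sketch).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
d = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=x.shape)    # d_i = f(x_i) + e_i

A = np.column_stack([np.ones_like(x), x])                 # design matrix for w0 + w1*x
w, *_ = np.linalg.lstsq(A, d, rcond=None)                 # minimizes sum_i [d_i - h(x_i)]^2
print(w)                                                  # roughly [2.0, 3.0]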


6.5 ML Hypotheses for Predicting Probabilities

– We wish to learn a nondeterministic function
  f: X → {0,1}
  that is, the probabilities that f(x) = 1 and f(x) = 0

– Training data D = {(x_i, d_i)}

– We assume that any particular instance x_i is independent of the hypothesis h


Then
  P(D|h) = ∏_{i=1..m} P(x_i, d_i|h) = ∏_{i=1..m} P(d_i|h, x_i) P(x_i)

  P(d_i|h, x_i) = h(x_i)       if d_i = 1
  P(d_i|h, x_i) = 1 - h(x_i)   if d_i = 0

⇒ P(d_i|h, x_i) = h(x_i)^{d_i} [1 - h(x_i)]^{1-d_i}


h_ML = argmax_{h∈H} ∏_{i=1..m} h(x_i)^{d_i} [1 - h(x_i)]^{1-d_i}
     = argmax_{h∈H} ∑_{i=1..m} d_i log h(x_i) + (1 - d_i) log[1 - h(x_i)]
     = argmin_{h∈H} [Cross Entropy]

Cross Entropy ≡ -∑_{i=1..m} d_i log h(x_i) + (1 - d_i) log[1 - h(x_i)]
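
A direct sketch of the quantity being minimized; h_of_x stands for the hypothesis' predicted probabilities h(x_i), d for the observed labels d_i, and the values in the example call are hypothetical:

import math

def cross_entropy(d, h_of_x):
    # -sum_i d_i log h(x_i) + (1 - d_i) log[1 - h(x_i)]
    return -sum(di * math.log(hi) + (1 - di) * math.log(1.0 - hi)
                for di, hi in zip(d, h_of_x))

print(cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))   # lower value = more probable data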


6.6 Minimum Description Length Principle

  h_MAP = argmax_{h∈H} P(D|h) P(h)
        = argmin_{h∈H} { -log₂ P(D|h) - log₂ P(h) }

⇒ short hypotheses are preferred

Description length L_C(h): number of bits required to encode message h using code C


– -log₂ P(h) = L_{C_H}(h): description length of h under the optimal (most compact) encoding C_H of H
– -log₂ P(D|h) = L_{C_{D|h}}(D|h): description length of the training data D given hypothesis h, under its optimal encoding C_{D|h}

⇒ h_MAP = argmin_{h∈H} { L_{C_H}(h) + L_{C_{D|h}}(D|h) }

MDL Principle:
  Choose h_MDL = argmin_{h∈H} { L_{C1}(h) + L_{C2}(D|h) }
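
A tiny sketch of the MDL choice itself; the candidate hypotheses and their code lengths (in bits) are hypothetical:

# Pick the hypothesis minimizing L_C1(h) + L_C2(D|h).
candidates = [
    ("small tree", 12, 40),   # (name, bits to encode h, bits to encode D given h)
    ("large tree", 55, 3),
]
h_mdl = min(candidates, key=lambda c: c[1] + c[2])
print(h_mdl[0])   # 'small tree' (12 + 40 = 52 bits vs 55 + 3 = 58 bits)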


6.7 Bayes Optimal Classifier

What is the most probable classification of a new instance, given the training data?

Answer:  argmax_{v_j∈V} ∑_{h∈H} P(v_j|h) P(h|D)
  where the v_j ∈ V are the possible classes

⇒ Bayes Optimal Classifier

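A toy sketch with three hypothetical hypotheses and V = {+, -}; it also shows that the Bayes optimal classification can differ from the prediction of the single MAP hypothesis:

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}        # P(h|D); h1 is the MAP hypothesis
class_given_h = {"h1": {"+": 1.0, "-": 0.0},          # P(v_j|h)
                 "h2": {"+": 0.0, "-": 1.0},
                 "h3": {"+": 0.0, "-": 1.0}}

scores = {v: sum(class_given_h[h][v] * posteriors[h] for h in posteriors)
          for v in ("+", "-")}
print(scores)                        # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))   # '-': the Bayes optimal classification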

6.9 Naïve Bayes Classifier

Given an instance x = (a_1, a_2, ..., a_n):
  v_MAP = argmax_{v_j∈V} P(x|v_j) P(v_j)

The Naïve Bayes classifier assumes conditional independence of the attribute values:
  v_NB = argmax_{v_j∈V} P(v_j) ∏_{i=1..n} P(a_i|v_j)
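
A minimal sketch of training and applying such a classifier over discrete attributes; examples (a list of (attribute_tuple, class) pairs) is a hypothetical placeholder, and the plain frequency estimates are unsmoothed (smoothing appears in the text-classification example below):

from collections import Counter, defaultdict

def train_nb(examples):
    # examples: list of (attribute_tuple, class_label)
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)            # (class, attr_index) -> value counts
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            value_counts[(c, i)][a] += 1
    return class_counts, value_counts, len(examples)

def classify_nb(x, model):
    class_counts, value_counts, n = model
    def score(c):
        p = class_counts[c] / n                                  # P(v_j)
        for i, a in enumerate(x):
            p *= value_counts[(c, i)][a] / class_counts[c]       # P(a_i|v_j)
        return p
    return max(class_counts, key=score)                          # v_NB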


6.10 An Example: Learning to Classify Text

Task: "Filter WWW pages that discuss ML topics"
• The instance space X contains all possible text documents
• Training examples are classified as "like" or "dislike"

How do we represent an arbitrary document?
• Define an attribute for each word position
• Define the value of that attribute to be the English word found in that position


v_NB = argmax_{v_j∈V} P(v_j) ∏_{i=1..N_words} P(a_i|v_j)

  V = {like, dislike}     a_i ranges over ~50,000 distinct English words

⇒ We would have to estimate ~ 2 × 50,000 × N_words conditional probabilities P(a_i|v_j)

This can be reduced to 2 × 50,000 terms by assuming that word probabilities are independent of position:
  P(a_i = w_k|v_j) = P(a_m = w_k|v_j)   ∀ i, j, k, m

– How do we choose the conditional probabilities?

m-estimate (with uniform priors):
  P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|)

  n: total number of word positions in the training documents of class v_j
  n_k: number of times word w_k occurs among those positions
  |Vocabulary|: total number of distinct words in the training data

Concrete example: assigning articles to 20 Usenet newsgroups ⇒ accuracy: 89%
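
A sketch of this estimate applied to tokenized documents (lists of words); the function names word_probs and classify_doc are hypothetical:

from collections import Counter
import math

def word_probs(docs_of_class, vocabulary):
    # P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|) over all word positions of class v_j
    counts = Counter(w for doc in docs_of_class for w in doc)
    n = sum(counts.values())
    return {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}

def classify_doc(doc, class_priors, class_word_probs):
    # Summing logs avoids underflow when multiplying many small P(a_i|v_j).
    return max(class_priors,
               key=lambda v: math.log(class_priors[v])
                             + sum(math.log(class_word_probs[v][w])
                                   for w in doc if w in class_word_probs[v]))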

6.11 Bayesian Belief Networks

Bayesian belief networks assume conditional independence only between subsets of the attributes.

– Conditional independence
  • Discrete-valued random variables X, Y, Z
  • X is conditionally independent of Y given Z if
    P(X|Y,Z) = P(X|Z)


Representation
• A Bayesian network represents the joint probability distribution of a set of variables
• Each variable is represented by a node
• Conditional independence assumptions are indicated by a directed acyclic graph
• Each variable is conditionally independent of its nondescendants in the network, given its immediate predecessors (parents)


The joint probabilities are calculated as
  P(Y_1, Y_2, ..., Y_n) = ∏_{i=1..n} P[Y_i | Parents(Y_i)]

The values P[Y_i | Parents(Y_i)] are stored in conditional probability tables associated with the nodes Y_i.

Example:
  P(Campfire=True | Storm=True, BusTourGroup=True) = 0.4
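
A sketch of this factorization for the Storm / BusTourGroup / Campfire fragment; apart from the 0.4 entry quoted above, the priors and table values below are hypothetical:

parents = {"Storm": (), "BusTourGroup": (), "Campfire": ("Storm", "BusTourGroup")}
cpt = {   # P(node = True | parent values); complements give the False entries
    "Storm":        {(): 0.4},
    "BusTourGroup": {(): 0.5},
    "Campfire":     {(True, True): 0.4,     # the entry quoted above
                     (True, False): 0.1,
                     (False, True): 0.8,
                     (False, False): 0.2},
}

def joint(assignment):
    # P(Y1,...,Yn) = prod_i P(Yi | Parents(Yi))
    p = 1.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[q] for q in parents[var])
        p_true = cpt[var][parent_vals]
        p *= p_true if value else (1.0 - p_true)
    return p

print(joint({"Storm": True, "BusTourGroup": True, "Campfire": True}))  # 0.4 * 0.5 * 0.4 = 0.08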


Inference
• We wish to infer the probability distribution for
some variable given observed values for (a subset
of) the other variables
• Exact (and sometimes approximate) inference of
probabilities for an arbitrary BN is NP-hard
• There are numerous methods for probabilistic
inference in BN (for instance, Monte Carlo), which
have been shown to be useful in many cases
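
As one concrete (hypothetical) illustration of such a Monte Carlo approach, the sketch below reuses the parents and cpt tables from the joint-probability sketch above and estimates a conditional probability by rejection sampling:

import random

def sample_once():
    # Sample each variable in topological order, given its parents' sampled values.
    a = {}
    for var in ("Storm", "BusTourGroup", "Campfire"):
        parent_vals = tuple(a[q] for q in parents[var])
        a[var] = random.random() < cpt[var][parent_vals]
    return a

samples = [sample_once() for _ in range(100_000)]
kept = [s for s in samples if s["Campfire"]]          # evidence: Campfire = True
print(sum(s["Storm"] for s in kept) / len(kept))      # estimate of P(Storm | Campfire=True)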


Learning Bayesian Belief Networks

Task: devising effective algorithms for learning BBNs from training data
– A focus of much current research interest
– For a given network structure, gradient ascent can be used to learn the entries of the conditional probability tables
– Learning the structure of a BBN is much more difficult, although successful approaches exist for some particular problems
