Bayesian Learning

Bayesian learning uses probability and Bayes' theorem to model concepts and perform classification. It interprets probability as partial belief to estimate the validity of propositions based on prior estimates and evidence. Bayes' theorem is used to determine the most probable hypothesis given initial knowledge and data. Naive Bayes classification assumes conditional independence between features to simplify computation. Bayesian networks graphically represent conditional independence relationships to efficiently encode joint distributions and perform inference.


Probability for Learning

• Probability for classification and modeling concepts.
• Bayesian probability
  – The notion of probability is interpreted as partial belief.
• Bayesian estimation
  – Calculate the validity of a proposition based on a prior estimate of its probability and new relevant evidence.
Bayes Theorem

• Goal: to determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.

Bayes Rule:

P(h | D) = P(D | h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
An Example

Does the patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

P(cancer) = 0.008,  P(~cancer) = 0.992
P(+ | cancer) = 0.98,  P(- | cancer) = 0.02
P(+ | ~cancer) = 0.03,  P(- | ~cancer) = 0.97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(~cancer | +) = P(+ | ~cancer) P(~cancer) / P(+)

Since P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078 and P(+ | ~cancer) P(~cancer) = 0.03 × 0.992 ≈ 0.0298, the MAP hypothesis is ~cancer, and P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21.
Maximum A Posteriori (MAP) Hypothesis

P(h | D) = P(D | h) P(h) / P(D)

The goal of Bayesian learning: find the most probable hypothesis given the training data (the Maximum A Posteriori hypothesis):

hMAP = argmax h∈H P(h | D)
     = argmax h∈H P(D | h) P(h) / P(D)
     = argmax h∈H P(D | h) P(h)

(P(D) can be dropped in the last step because it does not depend on h.)
Maximum Likelihood (ML) Hypothesis

• If every hypothesis in H is equally probable, we only need to consider the likelihood of the data D given h, P(D|h). Then hMAP becomes the Maximum Likelihood hypothesis:

hML = argmax h∈H P(D | h)
MAP Learner

For each hypothesis h in H, calculate the posterior probability

P(h | D) = P(D | h) P(h) / P(D)

Output the hypothesis hMAP with the highest posterior probability:

hMAP = argmax h∈H P(h | D)
Comments: MAP

• Computationally intensive.
• Provides a standard for judging the performance of learning algorithms.
• Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task.
Bayes Optimal Classifier

Question: Given a new instance x, what is its most probable classification?
• hMAP(x) is not necessarily the most probable classification!

Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3.
Given new data x, we have h1(x) = +, h2(x) = -, h3(x) = -.
What is the most probable classification of x?

Bayes optimal classification:

argmax vj∈V Σ hi∈H P(vj | hi) P(hi | D)

where V is the set of all the values a classification can take and vj is one such possible classification.

Example:
P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0

Σ hi∈H P(+ | hi) P(hi | D) = .4
Σ hi∈H P(- | hi) P(hi | D) = .6

so the Bayes optimal classification of x is "-".
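A minimal sketch of this computation in Python; the posteriors P(hi|D) and label probabilities P(v|hi) are copied from the example, and the dictionary layout is purely illustrative:

```python
# Bayes optimal classification: argmax_v sum_h P(v | h) P(h | D)
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}        # P(h | D)
label_given_h = {                                    # P(v | h)
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(labels=("+", "-")):
    # Weight each hypothesis's vote by its posterior and sum per label.
    scores = {v: sum(label_given_h[h][v] * posterior[h] for h in posterior)
              for v in labels}
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal()
print(scores)   # {'+': 0.4, '-': 0.6}
print(label)    # '-'
```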
Bayes Theorem

P(h | D) = P(D | h) P(h) / P(D)
Naïve Bayes

• Bayes classification:

P(Y | X) ∝ P(X | Y) P(Y) = P(X1, …, Xn | Y) P(Y)

Difficulty: learning the joint probability P(X1, …, Xn | Y).

• Naïve Bayes classification: assume all input features are conditionally independent!

P(X1, X2, …, Xn | Y) = P(X1 | X2, …, Xn, Y) P(X2, …, Xn | Y)
                     = P(X1 | Y) P(X2, …, Xn | Y)
                     = P(X1 | Y) P(X2 | Y) ⋯ P(Xn | Y)
Example: Play Tennis
Example

Learning Phase

Outlook    Play=Yes  Play=No
Sunny      2/9       3/5
Overcast   4/9       0/5
Rain       3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity   Play=Yes  Play=No
High       3/9       4/5
Normal     6/9       1/5

Wind       Play=Yes  Play=No
Strong     3/9       3/5
Weak       6/9       2/5

P(Play=Yes) = 9/14,  P(Play=No) = 5/14
Example

Test Phase
– Given a new instance, predict its label:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase:
P(Outlook=Sunny|Play=Yes) = 2/9       P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9    P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9       P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9         P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                    P(Play=No) = 5/14
– Decision making with the MAP rule:
P(Yes|x') ≈ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x') ≈ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Since P(Yes|x') < P(No|x'), we label x' as "No".
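A minimal sketch of the test-phase arithmetic in Python, with the conditional probabilities copied from the learning-phase tables (the data structures themselves are illustrative):

```python
# Naive Bayes prediction for x' = (Sunny, Cool, High, Strong).
from functools import reduce

p_yes, p_no = 9/14, 5/14
cond = {
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}
x_new = ["Sunny", "Cool", "High", "Strong"]

def score(label, prior):
    # P(label) * product of P(feature value | label)
    return reduce(lambda acc, v: acc * cond[label][v], x_new, prior)

print("P(Yes|x') ~", score("Yes", p_yes))   # ~0.0053
print("P(No |x') ~", score("No", p_no))     # ~0.0206
print("prediction:", "Yes" if score("Yes", p_yes) > score("No", p_no) else "No")
```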
Car example: Confusion Matrix

                     True class
Hypothesized class   Pos        Neg
Yes                  TP         FP
No                   FN         TN
                     P = TP+FN  N = FP+TN

• Accuracy = (TP+TN) / (P+N)
• Precision = TP / (TP+FP)
• Recall / TP rate = TP / P
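A minimal sketch of these metrics in Python; the counts in the usage line are made-up placeholders, not taken from the slides:

```python
# Accuracy, precision, and recall from confusion-matrix counts.
def metrics(tp, fp, fn, tn):
    p, n = tp + fn, fp + tn          # actual positives and negatives
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    recall = tp / p                  # also the true-positive rate
    return accuracy, precision, recall

print(metrics(tp=40, fp=10, fn=5, tn=45))   # (0.85, 0.8, 0.888...)
```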
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions.
• Bayes network represents conditional independence
relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
Bayesian Network

• A graphical model that efficiently encodes the joint probability distribution for a large set of variables
• A Bayesian network for a set of variables (nodes) X = {X1, …, Xn}
• Arcs represent probabilistic dependence among variables
• Lack of an arc denotes a conditional independence
• The network structure is a directed acyclic graph
• Local probability distributions at each node (conditional probability tables)
[Figure: example Bayesian network with nodes Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, and Late for meeting.]
Representation in Bayesian Belief Networks

[Figure: the same example network with nodes Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, and Late for meeting.]

• A conditional probability table associated with each node specifies the conditional distribution for the variable given its immediate parents in the graph.
• Each node is asserted to be conditionally independent of its non-descendants, given its immediate parents.
Inference in Bayesian Networks
• Computes posterior probabilities given evidence
about some nodes
• Exploits probabilistic independence for
efficient computation.
• Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is known
to be NP-hard.
• In theory, approximate techniques (such as Monte
Carlo Methods) can also be NP-hard, though in
practice, many such methods were shown to be
useful.
• Efficient algorithms leverage the structure of the graph.
Applications of Bayesian Networks

• Diagnosis: P(cause | symptom) = ?
• Prediction: P(symptom | cause) = ?
• Classification: P(class | data)
• Decision-making (given a cost function)
Bayesian Networks

• The structure of the graph encodes the conditional independence relations.

In general, p(X1, X2, …, XN) = Π_i p(Xi | parents(Xi)), where the left-hand side is the full joint distribution and the right-hand side is the graph-structured approximation (a short sketch of this factorization follows below).

• Requires that the graph is acyclic (no directed cycles)
• Two components of a Bayesian network:
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
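A minimal sketch of this factorization in Python, using a tiny illustrative two-node network (Rain → Traffic Jam); the structure and the numbers are assumptions made up for the example only:

```python
# Joint probability as the product of each node's CPT entry given its parents.
parents = {"Rain": [], "TrafficJam": ["Rain"]}
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "TrafficJam": {(True,): {True: 0.7, False: 0.3},
                   (False,): {True: 0.1, False: 0.9}},
}

def joint(assignment):
    # p(X1..XN) = prod_i p(Xi | parents(Xi)) read off the CPTs.
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpt[var][parent_values][value]
    return p

print(joint({"Rain": True, "TrafficJam": True}))   # 0.2 * 0.7 = 0.14
```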
Examples

• A  B  C (no edges): marginal independence, p(A,B,C) = p(A) p(B) p(C)

• A → B, A → C: conditionally independent effects, p(A,B,C) = p(B|A) p(C|A) p(A); B and C (e.g., symptoms S1, S2) are conditionally independent given A

• A → C, B → C: independent causes, p(A,B,C) = p(C|A,B) p(A) p(B); "explaining away" (e.g., A = Traffic, B = Late wakeup, C = Late)

• A → B → C: Markov dependence, p(A,B,C) = p(C|B) p(B|A) p(A)
Find the probability that 'P1' is true (P1 has called 'gfg'), 'P2' is true (P2 has called 'gfg'), the alarm 'A' rang, but no burglary 'B' and no fire 'F' have occurred.

Burglary 'B':
• P(B=T) = 0.001 ('B' is true, i.e. a burglary has occurred)
• P(B=F) = 0.999 ('B' is false, i.e. no burglary has occurred)

Fire 'F':
• P(F=T) = 0.002 ('F' is true, i.e. a fire has occurred)
• P(F=F) = 0.998 ('F' is false, i.e. no fire has occurred)

Alarm 'A':
B  F  P(A=T)  P(A=F)
T  T  0.95    0.05
T  F  0.94    0.06
F  T  0.29    0.71
F  F  0.001   0.999

Person P1:
A  P(P1=T)  P(P1=F)
T  0.95     0.05
F  0.05     0.95

Person P2:
A  P(P2=T)  P(P2=F)
T  0.80     0.20
F  0.01     0.99

P(P1, P2, A, ~B, ~F)
= P(P1|A) × P(P2|A) × P(A|~B, ~F) × P(~B) × P(~F)
= 0.95 × 0.80 × 0.001 × 0.999 × 0.998
≈ 0.00075
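A minimal Python sketch of this query, reading the numbers straight from the CPTs above (the dictionary layout is illustrative):

```python
# Joint probability query on the burglary/fire/alarm network.
p_b = {True: 0.001, False: 0.999}            # P(Burglary)
p_f = {True: 0.002, False: 0.998}            # P(Fire)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, F)
p_p1 = {True: 0.95, False: 0.05}             # P(P1=T | A)
p_p2 = {True: 0.80, False: 0.01}             # P(P2=T | A)

# The joint factorizes over the graph:
# P(P1, P2, A, ~B, ~F) = P(P1|A) P(P2|A) P(A|~B,~F) P(~B) P(~F)
b, f, a = False, False, True
joint = p_p1[a] * p_p2[a] * p_a[(b, f)] * p_b[b] * p_f[f]
print(joint)   # ~0.00075
```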
Example: probability that John calls.
Naïve Bayes Model

[Figure: Naïve Bayes as a Bayesian network: a class node C with directed edges to each feature Y1, Y2, Y3, …, Yn.]
Hidden Markov Model (HMM)

[Figure: observed nodes Y1, Y2, Y3, …, Yn; hidden state chain S1 → S2 → S3 → … → Sn, with an edge St → Yt at each step.]

Assumptions:
1. The hidden state sequence is Markov.
2. Observation Yt is conditionally independent of all other variables given St.

• Widely used in sequence learning, e.g., speech recognition, POS tagging.
• Inference is linear in n (see the sketch below).
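A minimal sketch of linear-time HMM inference (the forward algorithm) in Python/NumPy, relying on the two assumptions above; the toy transition and emission matrices and the observation sequence are invented for illustration:

```python
# Forward algorithm: P(Y_1..Y_n) in time linear in n.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state distribution P(S1)
A = np.array([[0.7, 0.3],                 # transition P(S_{t+1} | S_t)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                 # emission P(Y_t | S_t), 2 symbols
              [0.2, 0.8]])
obs = [0, 1, 1, 0]                        # observed symbol indices Y_1..Y_n

# alpha_t(s) = P(Y_1..Y_t, S_t = s); one matrix-vector product per step.
alpha = pi * B[:, obs[0]]
for y in obs[1:]:
    alpha = (alpha @ A) * B[:, y]

print("P(Y_1..Y_n) =", alpha.sum())
```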
Learning Bayesian Belief Networks
1. The network structure is given in advance and all the
variables are fully observable in the training
examples.
– estimate the conditional probabilities.
2. The network structure is given in advance but
only some of the variables are observable in the
training data.
– Similar to learning the weights for the hidden units of
a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance.
– Use a heuristic search or constraint-based technique to search through potential structures.
Estimating Parameters: Y, Xi discrete-valued

If unlucky, our MLE estimate for P(Xi | Y) may be zero (e.g., a feature value that never co-occurs with some class in the training data).

MAP estimates avoid this; the only difference is that "imaginary" examples are added to the counts (see the sketch below).
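A minimal sketch of the "imaginary examples" idea, assuming the standard additive (Laplace / m-estimate) smoothing form; the toy counts and the choice l = 1 are illustrative:

```python
# MLE vs. smoothed estimates of P(Outlook = v | Play = Yes).
from collections import Counter

xs = ["Sunny", "Sunny", "Rain"]            # feature values observed with Y = Yes
num_values = 3                             # Outlook takes Sunny/Overcast/Rain
counts = Counter(xs)
n = len(xs)
l = 1                                      # "imaginary" examples per value

for v in ["Sunny", "Overcast", "Rain"]:
    mle = counts[v] / n
    smoothed = (counts[v] + l) / (n + l * num_values)
    print(v, "MLE:", round(mle, 3), "smoothed:", round(smoothed, 3))
# The MLE for Overcast is 0; the smoothed estimate is not.
```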
Naïve Bayes: Assumptions of Conditional Independence

• Often the Xi are not really conditionally independent.
• We can use Naïve Bayes in many cases anyway:
– it often gives the right classification, even when its probability estimates are not accurate.
Gaussian Naïve Bayes (continuous X)

• Algorithm: continuous-valued features
– The conditional probability is often modeled with the normal (Gaussian) distribution.
– Sometimes we assume the variance is
  • independent of Y (i.e., σi),
  • or independent of Xi (i.e., σk),
  • or both (i.e., σ).
Gaussian Naïve Bayes Algorithm: continuous Xi (but still discrete Y)

• Train Naïve Bayes (examples):
for each value yk
  estimate the class prior P(Y = yk)
  for each attribute Xi, estimate the class-conditional mean μik and variance σik

• Classify (Xnew):
choose the yk that maximizes P(Y = yk) Πi P(Xi_new | Y = yk), where each P(Xi_new | Y = yk) is a Gaussian with parameters μik and σik.
Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates (j indexes the training examples, i the features, k the classes, and δ(z) = 1 if z is true, else 0):

μik = Σj Xij δ(Yj = yk) / Σj δ(Yj = yk)
σik² = Σj (Xij - μik)² δ(Yj = yk) / Σj δ(Yj = yk)
Naïve Bayes

• Example: continuous-valued features
– Temperature is naturally a continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1

– Estimate the mean and variance for each class:
μ = (1/N) Σn xn,  σ² = (1/(N-1)) Σn (xn - μ)²
μ_Yes = 21.64, σ_Yes = 2.35
μ_No = 23.88, σ_No = 7.09

– Learning phase: output two Gaussian models for P(temp | C):
P(x | Yes) = (1 / (2.35 √(2π))) exp(-(x - 21.64)² / (2 × 2.35²))
P(x | No)  = (1 / (7.09 √(2π))) exp(-(x - 23.88)² / (2 × 7.09²))
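A minimal Python sketch that fits the two Gaussians above from the listed temperatures and evaluates the class-conditional densities (the query point x = 22.0 is just an illustration):

```python
# Gaussian Naive Bayes for a single continuous feature (temperature).
import math

yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no = [27.3, 30.1, 17.4, 29.5, 15.1]

def fit(xs):
    # Mean and sample standard deviation (matches the 2.35 / 7.09 on the slide).
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)
    return mu, math.sqrt(var)

def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_yes, sd_yes = fit(yes)   # ~21.64, ~2.35
mu_no, sd_no = fit(no)      # ~23.88, ~7.09
x = 22.0
print("P(x|Yes) ~", gaussian(x, mu_yes, sd_yes))
print("P(x|No)  ~", gaussian(x, mu_no, sd_no))
```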
The independence hypothesis…

• makes computation possible
• yields optimal classifiers when satisfied
• is rarely satisfied in practice, as attributes (variables) are often correlated
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with causal relationships between attributes
