Bayesian Learning
CS 165B
Spring 2012
1
Course outline
• Introduction (Ch. 1)
• Concept learning (Ch. 2)
• Decision trees (Ch. 3)
• Ensemble learning
• Neural Networks (Ch. 4)
• Linear classifiers
• Support Vector Machines
• Bayesian Learning (Ch. 6)
• Instance-based Learning (Ch. 8)
• Clustering
• Genetic Algorithms (Ch. 9)
2
• Computational learning theory (Ch. 7)
Three approaches to classification
4
Basic formulas for probabilities
• Product rule: the probability P(A ∧ B) of a conjunction of two
events A and B:
  P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
• Binomial distribution: random variable X takes values in {0, 1, …, n}, representing
the number of successes in n Bernoulli trials with success probability p:
  P(X = k) = f(n, p, k) = C(n, k) p^k (1 - p)^(n-k)
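A minimal Python sketch of the binomial pmf (standard library only); the n = 10, p = 0.5 values are just for illustration.

```python
from math import comb

def binomial_pmf(n, p, k):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binomial_pmf(10, 0.5, 7))                  # P(7 heads in 10 fair tosses) ~ 0.117
total = sum(binomial_pmf(10, 0.5, k) for k in range(11))
assert abs(total - 1.0) < 1e-12                  # the pmf sums to 1 over k = 0..n
```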
7
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
• P(h): prior probability of hypothesis h
• P(D): prior probability of training data D
• P(h | D): (posterior) probability of h given D
• P(D | h): probability of D given h  /* likelihood */
• Note the proof of the theorem:
  from the definition of conditional probability,
  P(h, D) = P(h | D) P(D) = P(D | h) P(h)
8
Choosing Hypotheses
P(h | D) = P(D | h) P(h) / P(D)
• The goal of Bayesian learning: find the most probable hypothesis given the
training data
• Maximum a posteriori (MAP) hypothesis h_MAP:
  h_MAP = argmax_{h∈H} P(h | D)
        = argmax_{h∈H} P(D | h) P(h) / P(D)
        = argmax_{h∈H} P(D | h) P(h)
• If P(h_i) = P(h_j) for all i, j, this reduces to the Maximum Likelihood (ML) hypothesis:
  h_ML = argmax_{h∈H} P(D | h)
9
Maximum Likelihood Estimate
• Assume that you toss a (p, 1-p) coin m times and get k Heads and m-k Tails.
  What is p? (The model we assumed is binomial; you could assume a different model!)
• If p is the probability of Heads, the probability of the observed data is:
  P(D | p) = p^k (1 - p)^(m-k)
• The log-likelihood:
  L(p) = log P(D | p) = k log(p) + (m - k) log(1 - p)
• Setting dL/dp = 0 gives the maximum likelihood estimate p̂ = k/m.
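A minimal Python sketch of this estimate, assuming an illustrative outcome of k = 7 heads in m = 10 tosses; it evaluates L(p) on a grid and confirms the maximizer is k/m.

```python
from math import log

def log_likelihood(p, k, m):
    """L(p) = k log(p) + (m - k) log(1 - p)."""
    return k * log(p) + (m - k) * log(1 - p)

k, m = 7, 10                                   # e.g., 7 heads in 10 tosses
grid = [i / 1000 for i in range(1, 1000)]      # avoid p = 0 and p = 1
p_hat = max(grid, key=lambda p: log_likelihood(p, k, m))
print(p_hat)                                   # 0.7 = k/m, as found by setting dL/dp = 0
```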
12
Coin toss example
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D)
• A given coin is either fair (h1) or has a 60% bias in favor of Heads (h2).
• Decide what the bias of the coin is. [This is a learning problem!]
• With priors P(h1) = 0.75, P(h2) = 0.25, after observing one Head (D):
  P(h1 | D) = P(D | h1) P(h1) / P(D) = 0.5 × 0.75 / 0.525 = 0.714
  P(h2 | D) = P(D | h2) P(h2) / P(D) = 0.6 × 0.25 / 0.525 = 0.286
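A small Python sketch of exactly this posterior computation, with the 0.75/0.25 priors from the slide; the hypothesis names are arbitrary.

```python
# Hypotheses: h1 = fair coin (P(H) = 0.5), h2 = coin biased 60% toward Heads
priors = {"h1_fair": 0.75, "h2_biased": 0.25}
heads_prob = {"h1_fair": 0.5, "h2_biased": 0.6}

# Observed data D: a single toss came up Heads
joint = {h: heads_prob[h] * priors[h] for h in priors}
P_D = sum(joint.values())                         # P(D) = 0.525
posterior = {h: joint[h] / P_D for h in joint}    # {'h1_fair': 0.714..., 'h2_biased': 0.285...}
print(P_D, posterior)
```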
13
Coin toss example
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D)
• After 1st coin toss is H we still think that the coin is more likely
to be fair
• If we were to use Maximum Likelihood approach (i.e., assume
equal priors) we would think otherwise. The data supports the
biased coin better.
14
Coin toss example
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D)
• Case of 100 coin tosses; 70 heads.
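A sketch of the same comparison for this case, using binomial likelihoods for 70 heads in 100 tosses and assuming the same 0.75/0.25 priors as before; with this much data the biased hypothesis dominates despite its smaller prior.

```python
from math import comb

def binom(n, p, k):
    """P(k successes in n trials) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

priors = {"fair (p=0.5)": 0.75, "biased (p=0.6)": 0.25}
heads_prob = {"fair (p=0.5)": 0.5, "biased (p=0.6)": 0.6}

n, k = 100, 70                                     # 100 tosses, 70 heads
joint = {h: binom(n, heads_prob[h], k) * priors[h] for h in priors}
P_D = sum(joint.values())
posterior = {h: joint[h] / P_D for h in joint}
print(posterior)    # the biased hypothesis now dominates, despite its smaller prior
```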
15
Example: Relation to Concept Learning
16
Relation to Concept Learning
• Assume a uniform prior: P(h) = 1 / |H| for each h ∈ H
• Choose P(D | h) = 1 if h is consistent with D, 0 otherwise
• Compute P(D) = Σ_{h∈H} P(D | h) P(h) = Σ_{h∈VS_{H,D}} 1 · (1 / |H|) = |VS_{H,D}| / |H|
• Now P(h | D) = 1 / |VS_{H,D}| if h is consistent with D, 0 otherwise
18
(Bayesian) Learning a real-valued
function
• Continuous-valued target function
– Goal: learn h: X → R
– Bayesian justification for minimizing SSE
• Assume
  – Observed target values are corrupted by noise: d_i = f(x_i) + e_i
  – The noise e_i is modeled by a probability density function (zero-mean Gaussian, as used below)
19
hML = Minimizing Squared Error
h_ML = argmax_{h∈H} P(D | h)
     = argmax_{h∈H} Π_{i=1}^{m} p(d_i | h)
     = argmax_{h∈H} Π_{i=1}^{m} (1 / √(2πσ²)) exp( -(d_i - h(x_i))² / (2σ²) )
     = argmax_{h∈H} Π_{i=1}^{m} exp( -(d_i - h(x_i))² / (2σ²) )
     = argmax_{h∈H} Σ_{i=1}^{m} -(d_i - h(x_i))² / (2σ²)
     = argmax_{h∈H} Σ_{i=1}^{m} -(d_i - h(x_i))²
     = argmin_{h∈H} Σ_{i=1}^{m} (d_i - h(x_i))²
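To illustrate the equivalence, a small Python sketch under assumed settings (synthetic data, hypotheses restricted to h(x) = w·x over a grid of w values, noise σ = 0.3): the hypothesis maximizing the Gaussian log-likelihood is exactly the one minimizing the sum of squared errors.

```python
import math, random

random.seed(0)
# Synthetic data: d_i = f(x_i) + Gaussian noise, with f(x) = 2x and sigma = 0.3
xs = [i / 10 for i in range(20)]
ds = [2 * x + random.gauss(0, 0.3) for x in xs]

def sse(w):
    """Sum of squared errors for the hypothesis h(x) = w * x."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def gaussian_log_lik(w, sigma=0.3):
    """ln P(D | h) under the Gaussian noise model for h(x) = w * x."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2) for x, d in zip(xs, ds))

candidates = [i / 100 for i in range(100, 301)]   # hypotheses h(x) = w x, w in [1, 3]
w_ml = max(candidates, key=gaussian_log_lik)
w_sse = min(candidates, key=sse)
assert w_ml == w_sse       # the same hypothesis wins under both criteria
print(w_ml)
```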
20
Learning to Predict Probabilities
• Consider predicting survival probability from patient data
  – Training examples ⟨x_i, d_i⟩, where d_i is either 0 or 1
  – Want to learn a probabilistic function (like a coin) that for a
    given input outputs 0/1 with certain probabilities
  – Could train a NN/SVM to learn ratios
• Approach: train a neural network to output a probability given x_i
  – The training examples ⟨x_i, d_i⟩ are for f; the modified target function is
    f'(x) = P(f(x) = 1). Learn f' using ML.
• Maximum likelihood hypothesis: hence need to find P(D | h)
  P(D | h) = Π_{i=1..m} P(x_i, d_i | h)             (independence of each example)
           = Π_{i=1..m} P(d_i | h, x_i) P(x_i | h)  (conditional probabilities)
           = Π_{i=1..m} P(d_i | h, x_i) P(x_i)      (independence of h and x_i)
21
Maximum Likelihood Hypothesis
P(d_i | h, x_i) = h(x_i)      if d_i = 1
               = 1 - h(x_i)  if d_i = 0
(h would output h(x_i) for input x_i: the probability that d_i is 1 is h(x_i), and the
probability that d_i is 0 is 1 - h(x_i).)
Equivalently:
  P(d_i | h, x_i) = h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i}

  P(D | h) = Π_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)

  h_ML = argmax_{h∈H} Π_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)
  h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]

• Go up the gradient of the likelihood function G(h, D) = Σ_{i=1}^{m} d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)):

  ∂G(h, D)/∂w_j = Σ_{i=1}^{m} ∂[ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]/∂h(x_i) · ∂h(x_i)/∂net_i · ∂net_i/∂w_j
               = Σ_{i=1}^{m} [ (d_i - h(x_i)) / (h(x_i)(1 - h(x_i))) ] · h(x_i)(1 - h(x_i)) · x_{ji}
               = Σ_{i=1}^{m} (d_i - h(x_i)) x_{ji}

  (using, for a sigmoid unit, ∂h(x_i)/∂net_i = h(x_i)(1 - h(x_i)) and ∂net_i/∂w_j = x_{ji})
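A minimal Python sketch of this gradient-ascent rule for a single sigmoid unit; the synthetic data, true weights, learning rate, and iteration count are arbitrary illustrative choices.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Synthetic data: x = (bias, x1, x2); labels drawn with P(d = 1 | x) = sigmoid(1.5*x1 - x2)
X = [[1.0, random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(200)]
D = [1 if random.random() < sigmoid(1.5 * x[1] - 1.0 * x[2]) else 0 for x in X]

w = [0.0, 0.0, 0.0]
eta = 0.1
for _ in range(2000):                       # gradient ascent on G(h, D)
    grad = [0.0, 0.0, 0.0]
    for x, d in zip(X, D):
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j in range(3):
            grad[j] += (d - h) * x[j]       # the (d_i - h(x_i)) x_ji term from the slide
    w = [wj + eta * g / len(X) for wj, g in zip(w, grad)]

print(w)    # should head roughly toward [0, 1.5, -1.0], up to sampling noise
```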
23
Information theoretic view of hMAP
h_MAP = argmax_{h∈H} P(D | h) P(h)
      = argmax_{h∈H} [ log₂ P(D | h) + log₂ P(h) ]
      = argmin_{h∈H} [ -log₂ P(D | h) - log₂ P(h) ]
  i.e., h_MAP minimizes the description length of the data given h plus the description length of h
• Bayes optimal classification: v_OB = argmax_{v∈V} Σ_{h∈H} P(v | h) P(h | D)
27
Simpler classification: Naïve Bayes
• Along with decision trees, neural networks, nearest
neighbor, one of the most practical learning methods
• When to use
– Moderate or large training set available
– Attributes that describe instances are conditionally independent
given classification
• Successful applications:
– Diagnosis
– Classifying text documents
28
Naïve Bayes Classifier
• Assume target function f : X → V
each instance x described by attributes a1, …, an
– In simplest case, V has two values (0,1)
• Most probable value of f (x) is:
  v_MAP = argmax_{v∈V} P(v | a1, a2, …, an)
        = argmax_{v∈V} P(a1, a2, …, an | v) P(v) / P(a1, a2, …, an)
        = argmax_{v∈V} P(a1, a2, …, an | v) P(v)
• Naïve Bayes assumption: P(a1, a2, …, an | v) = Π_i P(ai | v)
• Naïve Bayes classifier: v_NB = argmax_{v∈V} P(v) Π_i P(ai | v)
Example
• Consider PlayTennis again:

  D1  Sunny    Hot   High    Weak    No
  D2  Sunny    Hot   High    Strong  No
  D3  Overcast Hot   High    Weak    Yes
  D4  Rain     Mild  High    Weak    Yes
  D5  Rain     Cool  Normal  Weak    Yes
  D6  Rain     Cool  Normal  Strong  No
  D7  Overcast Cool  Normal  Strong  Yes
  D8  Sunny    Mild  High    Weak    No
  D9  Sunny    Cool  Normal  Weak    Yes
  D10 Rain     Mild  Normal  Weak    Yes
  D11 Sunny    Mild  Normal  Strong  Yes
  D12 Overcast Mild  High    Strong  Yes
  D13 Overcast Hot   Normal  Weak    Yes
  D14 Rain     Mild  High    Strong  No

• P(yes) = 9/14, P(no) = 5/14
• P(Sunny | yes) = 2/9, P(Sunny | no) = 3/5
• Classify: (sunny, cool, high, strong)
  v_NB = argmax_{v∈V} P(v) Π_i P(ai | v)
  P(y) P(sunny|y) P(cool|y) P(high|y) P(strong|y) = 0.005
  P(n) P(sunny|n) P(cool|n) P(high|n) P(strong|n) = 0.021
  → v_NB = no
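The same computation as a short Python sketch (raw relative-frequency estimates, no smoothing), reproducing the 0.005 / 0.021 scores.

```python
from collections import Counter, defaultdict

data = [
    ("Sunny","Hot","High","Weak","No"),        ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),    ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),     ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),    ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),  ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),  ("Rain","Mild","High","Strong","No"),
]

label_counts = Counter(row[-1] for row in data)
attr_counts = defaultdict(Counter)            # (attribute index, label) -> Counter of values
for row in data:
    v = row[-1]
    for i, a in enumerate(row[:-1]):
        attr_counts[(i, v)][a] += 1

def score(instance, v):
    """P(v) * prod_i P(a_i | v), using relative-frequency estimates."""
    p = label_counts[v] / len(data)
    for i, a in enumerate(instance):
        p *= attr_counts[(i, v)][a] / label_counts[v]
    return p

x = ("Sunny", "Cool", "High", "Strong")
scores = {v: score(x, v) for v in label_counts}
print(scores)                                 # ~{'No': 0.021, 'Yes': 0.005}
print(max(scores, key=scores.get))            # 'No'
```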
30
Conditional Independence
• The conditional independence assumption
  P(a1, a2, …, an | v) = Π_i P(ai | v)
  is often violated
• ...but it works surprisingly well anyway
• Don't need the estimated posteriors P̂(v | a1, …, an) to be correct; need only that
  argmax_{v∈V} P̂(v) Π_i P̂(ai | v) = argmax_{v∈V} P(v) P(a1, …, an | v)
31
Estimating Probabilities
• What if none of the training instances with target value v
  have attribute value ai?
  Then P̂(ai | v) = 0, and so P̂(v) Π_i P̂(ai | v) = 0
• Typical remedy: smooth the estimates (e.g., an m-estimate / Laplace smoothing),
  as done in the text-learning algorithm later in these slides
32
Classify Text
• Why?
– Learn which news articles are of interest
– Learn to classify web pages by topic
– Junk mail filtering
• Naïve Bayes is among the most effective algorithms
• What attributes shall we use to represent text documents?
33
Learning to Classify Text
34
Position Independence Assumption
P(doc | v) = Π_{i=1}^{length(doc)} P(a_i = w_k | v)
where a_i = w_k denotes that the word in position i is w_k, and its probability is assumed independent of the position i
35
LEARN_Naïve_Bayes_Text (Examples, V)
• collect all words and other tokens that occur in Examples
– Vocabulary ← all distinct words and other tokens in Examples
• calculate probability terms P (v) and P (wk | v)
For each target value v in V do
– docsv ← subset of Examples for which the target value is v
– P(v) ← |docsv| / |Examples|
– Textv ← a single document created by concatenating all
members of docsv
– n ← total number of words in Textv (duplicates counted)
– for each word wk in Vocabulary
nk ← number of times word wk occurs in Textv
P(wk | v) ← (nk + 1) / (n + |Vocabulary|)
36
CLASSIFY_Naïve_Bayes_Text (Doc)
• positions ← all word positions in Doc that contain tokens found in
Vocabulary
• Return v_NB = argmax_{v∈V} P(v) Π_{i∈positions} P(ai | v)
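A compact Python sketch of LEARN_Naïve_Bayes_Text and CLASSIFY_Naïve_Bayes_Text, using the (nk + 1)/(n + |Vocabulary|) estimate above; the mini-corpus and the spam/ham labels are made up for illustration, and documents are tokenized by simple whitespace splitting.

```python
from collections import Counter
from math import log

def learn_naive_bayes_text(examples, labels):
    """examples: list of (doc_string, label) pairs; labels: the target values V."""
    vocabulary = set()
    for doc, _ in examples:
        vocabulary.update(doc.split())
    prior, word_prob = {}, {}
    for v in labels:
        docs_v = [doc for doc, label in examples if label == v]
        prior[v] = len(docs_v) / len(examples)
        text_v = " ".join(docs_v).split()        # concatenation of all docs with label v
        n = len(text_v)
        counts = Counter(text_v)
        word_prob[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, prior, word_prob

def classify_naive_bayes_text(doc, vocabulary, prior, word_prob):
    words = [w for w in doc.split() if w in vocabulary]
    # argmax_v  log P(v) + sum_i log P(w_i | v)
    return max(prior, key=lambda v: log(prior[v]) +
               sum(log(word_prob[v][w]) for w in words))

examples = [("free money now", "spam"), ("meeting agenda attached", "ham"),
            ("win money free prize", "spam"), ("project meeting tomorrow", "ham")]
model = learn_naive_bayes_text(examples, ["spam", "ham"])
print(classify_naive_bayes_text("free prize meeting", *model))
```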
37
Example: 20 Newsgroups
• Given 1000 training documents from each group
• Learn to classify new documents to a newsgroup
– comp.graphics, comp.os.ms-windows.misc,
comp.sys.ibm.pc.hardware, comp.sys.mac.hardware,
comp.windows.x
– misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball,
rec.sport.hockey
– alt.atheism, talk.religion.misc, talk.politics.mideast,
talk.politics.misc, talk.politics.guns
– soc.religion.christian, sci.space sci.crypt, sci.electronics,
sci.med
• Naive Bayes: 89% classification accuracy
38
Conditional Independence
• X is conditionally independent of Y given Z if the probability
distribution governing X is independent of the value of Y given
the value of Z
  (∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
  [or: P(X | Y, Z) = P(X | Z)]
• Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
• Can generalize to X1…Xn, Y1…Ym, Z1…Zk
• Extreme case:
  – Naive Bayes assumes full conditional independence:
    P(X1, …, Xn | Z) = P(X1, …, Xn-1 | Xn, Z) P(Xn | Z)
                     = P(X1, …, Xn-1 | Z) P(Xn | Z)
                     = … = Π_i P(Xi | Z)
39
• Symmetry of conditional independence
– Assume X is conditionally independent of Z given Y
P(X|Y,Z) = P(X|Y)
– Now,
P(Z|X,Y) = P(X|Y,Z) P(Z|Y) / P(X|Y)
– Therefore,
P(Z|X,Y) = P(Z|Y)
40
Bayesian Belief Networks
• Problems with above methods:
– Bayes Optimal Classifier expensive computationally
– Naive Bayes assumption of conditional independence too
restrictive
• For tractability/reliability, need other assumptions
– Model of world intermediate between
Full conditional probabilities
Full conditional independence
• Bayesian Belief networks describe conditional independence among
subsets of variables
– Assume only proper subsets are conditionally independent
– Combines prior knowledge about dependencies among variables
with observed training data
41
Bayesian Belief Networks (a.k.a.
Bayesian Networks)
a.k.a. Probabilistic networks, Belief nets, Bayes nets, etc.
• Belief network
– A data structure (depicted as a graph) that represents the
dependence among variables and allows us to concisely specify
the joint probability distribution
• A belief network is a directed acyclic graph where:
– The nodes represent the set of random variables (one node per
random variable)
– Arcs between nodes represent influence, or dependence
A link from node X to node Y means that X “directly
influences” Y
– Each node has a conditional probability table (CPT) that defines
P(node | parents)
[Figure: example belief network; nodes include Thunder and ForestFire with their parents and CPTs]
43
Example
• Random variables X and Y
  – X: It is raining
  – Y: The grass is wet
• X affects Y (or: Y is a symptom of X)
• Draw two nodes and link them: X → Y
• Define the CPT for each node: P(X) and P(Y | X)
• Typical use: we observe Y and we want to query P(X | Y)
  – Y is an evidence variable
  – X is a query variable
44
Try it…
• What is P(X | Y)?
  – Given that we know the CPTs of each node in the graph (P(X) and P(Y | X)):
    P(X | Y) = P(Y | X) P(X) / P(Y)
             = P(Y | X) P(X) / Σ_X P(X, Y)
             = P(Y | X) P(X) / Σ_X P(Y | X) P(X)
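A tiny Python sketch of this query for the two-node net; the CPT numbers are made up for illustration.

```python
# CPTs for the two-node net X -> Y (illustrative numbers)
P_X = {True: 0.2, False: 0.8}                       # P(X): it is raining
P_Y_given_X = {True: {True: 0.9, False: 0.1},       # P(Y | X): grass wet given rain
               False: {True: 0.2, False: 0.8}}

def posterior_X_given_Y(y):
    """P(X | Y=y) = P(y | X) P(X) / sum_X P(y | X) P(X)."""
    joint = {x: P_Y_given_X[x][y] * P_X[x] for x in P_X}
    z = sum(joint.values())
    return {x: p / z for x, p in joint.items()}

print(posterior_X_given_Y(True))    # P(rain | grass wet) ~= 0.53
```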
45
Belief nets represent joint probability
• The joint probability function can be calculated directly
from the network
– It is the product of the CPTs of all the nodes
– P(var1, …, varN) = Πi P(vari|Parents(vari))
Example: nodes X, Y, Z with CPTs P(X), P(Y | X), P(Z | X, Y), so P(X, Y, Z) = P(X) P(Y | X) P(Z | X, Y)
50
Calculate P(B | J, M)
P(B | J, M) = P(B, J, M) / P(J, M)

By marginalization:
            = [ Σ_i Σ_j P(J, M, A_i, B, E_j) ] / [ Σ_i Σ_j Σ_k P(J, M, A_i, B_j, E_k) ]
            = [ Σ_i Σ_j P(B) P(E_j) P(A_i | B, E_j) P(J | A_i) P(M | A_i) ]
              / [ Σ_i Σ_j Σ_k P(B_j) P(E_k) P(A_i | B_j, E_k) P(J | A_i) P(M | A_i) ]
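A sketch of this computation by brute-force enumeration in Python; the CPT values below are the usual illustrative numbers for the Burglary/Earthquake/Alarm example and are not taken from these slides.

```python
from itertools import product

# Illustrative CPTs for the classic Burglary/Earthquake/Alarm network
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=true | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of the CPTs."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(B=true | J=true, M=true): sum out A, E in the numerator and A, E, B in the denominator
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(num / den)    # ~0.284
```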
51
Example
52
Course outline
53
Class feedback
• Difficult concepts
o PCA
o Fisher's Linear Discriminant
o Backpropagation
o Logistic regression
o SVM?
o Bayesian learning?
54
Class feedback
• Pace
  – Slightly fast
  – Slow down on difficult parts
• Difficulty of homework
  – Slightly hard
• Difficulty of project
  – Need more structure
• Other feedback
  – More depth?
55
Naive Bayes model
• A common situation is when a single cause directly
influences several variables, which are all conditionally
independent, given the cause.
[Figure: cause C (Rain) with conditionally independent effects e1 (Wet grass), e2 (People with umbrellas), e3 (Car accidents); in general, C → e1, e2, …, en]
56
Naive Bayes model
• Typical query for naive Bayes:
– Given some evidence, what’s the probability of the cause?
– P(C | e1) = ?
– P(C | e1, e3) = ?
  P(C | e1) = P(e1 | C) P(C) / P(e1)
            = P(e1 | C) P(C) / Σ_C P(e1 | C) P(C)
[Figure: cause C (Rain) with effects e1 (Wet grass), e2 (People with umbrellas), e3 (Car accidents)]
57
Drawing belief nets
• What would a belief net look like if all the variables were
fully dependent?
X1 X2 X3 X4 X5
P(X1,X2,X3,X4,X5) = P(X1)P(X2|X1)P(X3|X1,X2)P(X4|X1,X2,X3)P(X5|X1,X2,X3,X4)
• But this isn’t the only way to draw the belief net when all
the variables are fully dependent
58
Fully connected belief net
• In fact, there are N! ways of connecting up a fully-
connected belief net
– That is, there are N! ways of ordering the nodes
  – A way to represent the joint probability; it does not really capture causality!
For N = 2:   X1 → X2   or   X2 → X1      P(X1, X2) = ?
For N=5
X1 X2 X3 X4 X5 P(X1,X2,X3,X4,X5) = ?
X1 X2 X3 X4 X5
X1 X2 X3 X4 X5
60
Drawing belief nets (cont.)
What if the variables are all independent?
P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3) P(X4) P(X5)
X1 X2 X3 X4 X5
X1 X2 X3 X4 X5
X1 X2 X3 X4 X5
P(X1, X2, X3, X4, X5) = P(X1) P(X2 | X3) P(X3 | X1) P(X4 | X2) P(X5 | X4)
X1 X3 X2 X4 X5
63
What are belief nets used for?
• Given the structure, we can now pose queries:
– Typically: P(Cause | Symptoms)
– P(X1 | X4, X5)
– P(Earthquake | JohnCalls)
– P(Burglary | JohnCalls, MaryCalls)
64
[Figure: example chain network with CPTs P(X), P(Y | X), P(Z | Y); nodes include "Rained"/"Raining" (X) and "Worm sighting" (Z)]
ASK P(X | Y)
ASK P(X | Z)
65
How to construct a belief net
• Choose the random variables that describe the domain
– These will be the nodes of the graph
X1 X2 X3 X4 X5
Causes Symptoms
66
How to construct a belief net (cont.)
• Draw arcs from left to right to indicate “direct influence” among variables
– May have to reorder some nodes
X1 X2 X3 X4 X5
[Figure: example with nodes F, M, V, S and CPTs P(F), P(M), P(V | F, M), P(S | M)]
70
Conditional Independence
• X and Y are (conditionally) independent given E iff
– P(X | Y, E) = P(X | E)
– P(Y | X, E) = P(Y | E)
(Unconditional independence is the same as conditional independence given empty E.)
Cases 1 and 2: the path is blocked when variable Z is in E
Case 3: the path is blocked when neither Z nor its descendants are in E
72
Path Blockage
Three cases (for a path from X to Y through a third variable E):
– Common cause: X ← E → Y
  Blocked when E is in the evidence set; unblocked (active) otherwise.
[Figure: blocked vs. unblocked diagrams]
73
Path Blockage
Three cases:
– Common cause
– Intermediate cause: X → E → Y
  Blocked when E is in the evidence set; unblocked (active) otherwise.
[Figure: blocked vs. unblocked diagrams]
74
Path Blockage
Three cases:
– Common cause
– Intermediate cause
– Common effect: X → C ← Y
  Blocked when neither C nor any of its descendants (e.g., A) is in the evidence set;
  the path becomes active (unblocked) when C or a descendant is observed.
[Figure: blocked vs. unblocked/active diagrams]
75
Examples
– Chain: Rain (R) → Wet Grass (G) → Worms (W):      P(W | R, G) = P(W | G)
– Common cause: Tired (T) ← Flu (F) → Cough (C):    P(T | C, F) = P(T | F)
– Common effect: Work (W) → Money (M) ← Inherit (I): P(W | I, M) ≠ P(W | M), but P(W | I) = P(W)
76
Examples
• Common cause: Z (rain) with effects X (wet grass) and Y (rainbow):
  P(X, Y) ≠ P(X) P(Y), but P(X | Y, Z) = P(X | Z)
• Common effect: X (rain) → Z (wet grass) ← Y (sprinkler), with Z → W (worms):
  P(X, Y) = P(X) P(Y), but P(X | Y, Z) ≠ P(X | Z) and P(X | Y, W) ≠ P(X | W)
78
Examples
Are X and Y independent?
Are X and Y conditionally independent given Z?
[Two example networks over X (rain), Y (sprinkler), Z (rainbow), W (wet grass), with different arc structures]
80
Conditional Independence
A B C D E
81
Theorems
• A node is conditionally independent of its non-descendants
given its parents.
82
Why does conditional independence matter?
• Helps the developer (or the user) verify the graph structure
– Are these things really independent?
– Do I need more/fewer arcs?
83
Case Study
• Pathfinder system. (Heckerman 1991, Probabilistic
Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
– 60 diseases and 100 symptoms and test-results.
– 14,000 probabilities
– Expert consulted to make net.
– 8 hours to determine variables.
– 35 hours for net topology.
– 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the
links and probabilities.
• Pathfinder is now outperforming world experts.
84
Inference in Bayesian Networks
• How can one infer (probabilities of) values of one/more network
variables, given observed values of others?
– Bayes net contains all information needed for this
– Easy if only one variable with unknown value
– In the general case, the problem is NP-hard
Need to compute sums of probs over unknown values
• In practice, can succeed in many cases
– Exact inference methods work well for some network
structures (polytrees)
– Variable elimination methods reduce the amount of repeated
computation
– Monte Carlo methods “simulate” the network randomly to
calculate approximate solutions
85
Learning Bayesian Networks
• Object of current research
• Several variants of this learning task
– Network structure might be known or unknown
Structure incorporates prior beliefs
86
Learning Bayes Nets
• Suppose structure known, variables partially observable
– e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not
Lightning, Campfire...
• Analogous to learning weights for hidden units of ANN
– Assume know input/output node values
– Do not know values of hidden units
• In fact, can learn network conditional probability tables
using gradient ascent
– Search through hypothesis space corresponding to set of all
possible entries for conditional probability tables
– Maximize P(D | h) (the ML hypothesis for the table entries)
– Converge to the network h that (locally) maximizes P(D | h)
87
Gradient for Bayes Net
• Let wijk denote one entry in the conditional probability table for variable
Yi in the network
  w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of parent values)
  – e.g., if Y_i = Campfire, then u_ik could be ⟨Storm = T, BusTourGroup = F⟩
• Perform gradient ascent repeatedly:
  – Update each w_ijk using the training data D, taking a small step up the gradient of
    ln P(D | h) in w-space:
    w_ijk ← w_ijk + η Σ_{d∈D} P(Y_i = y_ij, U_i = u_ik | d) / w_ijk
  – Calculate the required probabilities P(Y_i = y_ij, U_i = u_ik | d) from the network;
    if they involve variables unobservable in a given d, use inference
  – Renormalize the w_ijk so that Σ_j w_ijk = 1 and 0 ≤ w_ijk ≤ 1
88
Gradient for Bayes Net
• Let wijk denote one entry in the conditional probability
table for variable Yi in the network
  w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of parent values)

∂ln P(D | h)/∂w_ijk = ∂/∂w_ijk ln Π_{d∈D} P(d | h)
                    = Σ_{d∈D} ∂ln P(d | h)/∂w_ijk
                    = Σ_{d∈D} (1 / P(d | h)) ∂P(d | h)/∂w_ijk

Writing P(d | h) = Σ_{j',k'} P(d | y_ij', u_ik', h) P(y_ij', u_ik' | h)
                 = Σ_{j',k'} P(d | y_ij', u_ik', h) P(y_ij' | u_ik', h) P(u_ik' | h)
                 = Σ_{j',k'} P(d | y_ij', u_ik', h) w_ij'k' P(u_ik' | h),

only the term with j' = j, k' = k depends on w_ijk, so

∂ln P(D | h)/∂w_ijk = Σ_{d∈D} (1 / P(d | h)) P(d | y_ij, u_ik, h) P(u_ik | h)
                    = Σ_{d∈D} (1 / P(d | h)) [ P(y_ij, u_ik | d, h) P(d | h) / P(y_ij, u_ik | h) ] P(u_ik | h)
                    = Σ_{d∈D} P(y_ij, u_ik | d, h) P(u_ik | h) / P(y_ij, u_ik | h)
                    = Σ_{d∈D} P(y_ij, u_ik | d, h) / P(y_ij | u_ik, h)
                    = Σ_{d∈D} P(y_ij, u_ik | d, h) / w_ijk
89
Gradient Ascent for Bayes Net
• w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of parent values)
  ∂ln P(D | h)/∂w_ijk = Σ_{d∈D} P(y_ij, u_ik | d, h) / w_ijk
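A rough Python sketch of one gradient-ascent step on the CPT entries, assuming the posteriors P(y_ij, u_ik | d, h) have already been computed by some inference routine (not shown); the nested-list data layout and the step size are arbitrary illustrative choices.

```python
def gradient_ascent_step(w, posteriors, eta=0.1):
    """One update of the CPT entries w[i][j][k] = P(Y_i = y_ij | U_i = u_ik).

    posteriors[d][i][j][k] is assumed to hold P(y_ij, u_ik | d, h) for training
    example d, as produced by some (unspecified) inference routine.
    """
    # Gradient step: w_ijk += eta * sum_d P(y_ij, u_ik | d, h) / w_ijk
    for i in range(len(w)):
        for j in range(len(w[i])):
            for k in range(len(w[i][j])):
                grad = sum(post[i][j][k] / w[i][j][k] for post in posteriors)
                w[i][j][k] += eta * grad
    # Renormalize so that, for each parent configuration k, sum_j w[i][j][k] = 1
    for i in range(len(w)):
        for k in range(len(w[i][0])):
            col = [max(w[i][j][k], 1e-12) for j in range(len(w[i]))]
            total = sum(col)
            for j in range(len(w[i])):
                w[i][j][k] = col[j] / total
    return w

# Tiny usage with made-up numbers: one variable with two values, one parent configuration
w = [[[0.5], [0.5]]]
posteriors = [[[[0.8], [0.2]]], [[[0.3], [0.7]]]]   # "posteriors" for two training examples
print(gradient_ascent_step(w, posteriors))          # ~[[[0.514], [0.486]]]
```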
90
Course outline
91