Bayes Nets
Chapters 13 and 14
Reza Nezami
[These slides were extracted from the course taught by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at
Uncertainty
■ The real world is rife with uncertainty!
■ E.g., if I leave for SFO 60 minutes before my flight, will I be there in time?
■ Problems:
■ partial observability (road state, other drivers’ plans, etc.)
■ noisy sensors (radio traffic reports, Google maps)
■ immense complexity of modelling and predicting traffic, security line, etc.
■ lack of knowledge of world dynamics (will my tire burst? will I get in a crash?)
■ Probabilistic assertions summarize effects of ignorance and laziness
■ Combine probability theory + utility theory -> decision theory
■ Maximize expected utility: a* = argmax_a Σ_s P(s | a) U(s)
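A minimal Python sketch of the MEU computation above; the actions, outcome probabilities, and utilities are made-up numbers for the airport example, not values from the slides:

# Maximum expected utility: a* = argmax_a sum_s P(s | a) U(s)
# Hypothetical numbers: leaving earlier raises the chance of making the flight.

# P(s | a): probability of each outcome s given the chosen departure time a
outcome_probs = {
    "leave_60_min_early":  {"make_flight": 0.70, "miss_flight": 0.30},
    "leave_120_min_early": {"make_flight": 0.95, "miss_flight": 0.05},
    "leave_240_min_early": {"make_flight": 0.99, "miss_flight": 0.01},
}

# U(s): utility of each outcome (again, illustrative values)
utility = {"make_flight": 100.0, "miss_flight": -500.0}

def expected_utility(action):
    """Compute sum_s P(s | a) U(s) for one action."""
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

# a* = argmax_a EU(a)
best_action = max(outcome_probs, key=expected_utility)
for a in outcome_probs:
    print(a, expected_utility(a))
print("best:", best_action)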
Basic laws of probability (discrete)
■ Begin with a set of possible worlds
■ E.g., 6 possible rolls of a die, {1, 2, 3, 4, 5, 6}
■ Each possible world ω has a probability P(ω), with 0 ≤ P(ω) ≤ 1 and Σ_ω P(ω) = 1
Basic laws contd.
■ An event is any subset of the possible worlds
■ The probability of an event is the sum of the probabilities of the worlds it contains: P(E) = Σ_{ω ∈ E} P(ω)
■ E.g., for a fair die each world has probability 1/6, so P(roll is even) = 1/6 + 1/6 + 1/6 = 1/2
Joint and marginal distributions

Joint distribution P(W, T):
            T = hot   T = cold
  sun       0.45      0.15
  rain      0.02      0.08
  fog       0.03      0.27
  meteor    0.00      0.00

Marginal distributions:
  T: hot 0.5, cold 0.5
  W: sun 0.6, rain 0.1, fog 0.3, meteor 0.0
Joint Distributions
• Probabilistic models:
  • (Random) variables with domains
  • Assignments of values to all variables are called outcomes
  • Joint distributions: say whether assignments (outcomes) are likely or not
  • Must obey: 0 ≤ P(outcome) ≤ 1 for every outcome
  • Normalized: probabilities sum to 1.0
  • Ideally: only certain variables directly interact

Distribution over T, W:
  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3
Events
• An event is a set E of outcomes; its probability is the sum of the probabilities of the outcomes it contains
• From the joint distribution we can compute the probability of any event, including the marginals over single variables

Joint P(T, W):          Marginal P(T):     Marginal P(W):
  T     W     P           hot   0.5          sun   0.6
  hot   sun   0.4         cold  0.5          rain  0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3
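A small Python sketch of these definitions, using the P(T, W) table above; the helper names are just for illustration:

# Joint distribution P(T, W) from the table above, stored as a dict of outcomes.
P = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}
assert abs(sum(P.values()) - 1.0) < 1e-9   # normalized: probabilities sum to 1

def prob_event(event):
    """P(E) = sum of P(outcome) over the outcomes in the event E."""
    return sum(p for outcome, p in P.items() if outcome in event)

# Event "it is hot" = {(hot, sun), (hot, rain)}: probability 0.5
print(prob_event({("hot", "sun"), ("hot", "rain")}))

def marginal(index):
    """Marginal over one variable (index 0 = T, index 1 = W), by summing out the other."""
    out = {}
    for outcome, p in P.items():
        out[outcome[index]] = out.get(outcome[index], 0.0) + p
    return out

print(marginal(0))   # {'hot': 0.5, 'cold': 0.5}
print(marginal(1))   # approximately {'sun': 0.6, 'rain': 0.4}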
Conditional Probabilities
• A simple relation between joint and conditional probabilities:
  P(a | b) = P(a, b) / P(b)
• In fact, this is taken as the definition of conditional probability
• Example, using the joint P(T, W) below: P(W = sun | T = cold) = P(sun, cold) / P(cold) = 0.2 / 0.5 = 0.4

  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3
Normalizing a distribution
• To get P(W | T = c) from the joint P(W, T): select the entries consistent with T = c, then normalize them so they sum to 1
  P(W | T = c) = P(W, T = c) / P(T = c)
• Example, using the joint P(W, T) from the earlier slide: the T = cold column is (sun 0.15, rain 0.08, fog 0.27, meteor 0.00); dividing by P(T = cold) = 0.5 gives P(W | T = cold) = (sun 0.30, rain 0.16, fog 0.54, meteor 0.00)
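A short Python sketch of select-then-normalize for P(W | T = cold), using the joint P(W, T) from the earlier weather/temperature table:

# P(W | T=c) = P(W, T=c) / P(T=c): select the T=cold entries, then normalize.
joint = {
    ("sun",    "hot"): 0.45, ("sun",    "cold"): 0.15,
    ("rain",   "hot"): 0.02, ("rain",   "cold"): 0.08,
    ("fog",    "hot"): 0.03, ("fog",    "cold"): 0.27,
    ("meteor", "hot"): 0.00, ("meteor", "cold"): 0.00,
}

def condition_on_temp(t):
    """Return P(W | T=t) by selecting the matching rows and renormalizing."""
    selected = {w: p for (w, temp), p in joint.items() if temp == t}
    z = sum(selected.values())          # this is P(T=t)
    return {w: p / z for w, p in selected.items()}

print(condition_on_temp("cold"))
# approximately {'sun': 0.30, 'rain': 0.16, 'fog': 0.54, 'meteor': 0.0}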
The Product Rule
• P(x, y) = P(x | y) P(y)
• Example: P(D, W) = P(D | W) P(W)

P(D | W):            P(W):           P(D, W):
  D    W     P          W     P         D    W     P
  wet  sun   0.1        sun   0.8       wet  sun   0.08
  dry  sun   0.9        rain  0.2       dry  sun   0.72
  wet  rain  0.7                        wet  rain  0.14
  dry  rain  0.3                        dry  rain  0.06
The Chain Rule
• More generally, any joint distribution can always be written as an incremental product of conditional distributions:
  P(x1, x2, …, xn) = Π_i P(xi | x1, …, xi-1)

Bayes' Rule
• Two ways to factor a joint distribution over two variables: P(x, y) = P(x | y) P(y) = P(y | x) P(x)
• Dividing, we get: P(x | y) = P(y | x) P(x) / P(y)
• Example: M: meningitis, S: stiff neck
  P(m | s) = P(s | m) P(m) / P(s)
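A tiny Python sketch of the meningitis calculation; the three numbers are illustrative assumptions, since the slide does not give them:

# Bayes' rule: P(m | s) = P(s | m) P(m) / P(s)
# Illustrative (assumed) numbers for M = meningitis, S = stiff neck:
p_s_given_m = 0.8     # most meningitis patients have a stiff neck
p_m = 0.0001          # meningitis is rare
p_s = 0.1             # stiff necks are common for many other reasons

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)    # 0.0008 -- still small: the prior dominates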
Independence
• Two variables are independent if: P(x, y) = P(x) P(y) for all x, y
• This says that their joint distribution factors into a product of two simpler distributions
• Another form: P(x | y) = P(x) for all x, y
• We write: X ⊥ Y
• Example: are T and W independent? Compare the observed joint P(T, W) with the product of its marginals P(T) P(W):

P(T):            P(W):
  hot   0.5        sun   0.6
  cold  0.5        rain  0.4

P(T, W) observed:        P(T) P(W):
  T     W     P            T     W     P
  hot   sun   0.4          hot   sun   0.3
  hot   rain  0.1          hot   rain  0.2
  cold  sun   0.2          cold  sun   0.3
  cold  rain  0.3          cold  rain  0.2

• The tables differ, so T and W are not independent
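A quick Python sketch of the independence check above: compute the marginals from the joint and compare the joint against their product:

# T and W are independent iff P(t, w) = P(t) P(w) for every t, w.
joint = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

# Marginals computed from the joint
p_t, p_w = {}, {}
for (t, w), p in joint.items():
    p_t[t] = p_t.get(t, 0.0) + p
    p_w[w] = p_w.get(w, 0.0) + p

independent = all(abs(joint[(t, w)] - p_t[t] * p_w[w]) < 1e-9
                  for (t, w) in joint)
print(independent)  # False: e.g. P(hot, sun) = 0.4 but P(hot) P(sun) = 0.5 * 0.6 = 0.3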
Example: Smoke alarm
Variables:
  F: There is fire
  S: There is smoke
  A: Alarm sounds
Network structure: F → S → A (fire causes smoke, smoke triggers the alarm)
Conditional Independence Examples
• What about this domain: Fire → Smoke → Alarm?
Conditional Independence
• P(Toothache, Cavity, Catch)
• If I have a cavity, the probe catching in it doesn't depend on whether I have a toothache: Catch is conditionally independent of Toothache given Cavity
• Trivial decomposition (chain rule): P(Toothache, Cavity, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity, Toothache)
• With conditional independence: P(Catch | Cavity, Toothache) = P(Catch | Cavity)

Graphical model notation
• Arcs: interactions
  • Similar to CSP constraints
  • Indicate "direct influence" between variables
  • Formally: encode conditional independence (more later)
• Model 1 (independence): R and T are separate, unconnected nodes
• Model 2 (rain causes traffic): R → T
• Why is an agent using model 2 better?
Bayes' Net Examples
Example Bayes' Net: Car won't start!
Example Bayes' Net: Insurance
Bayes' Net Semantics
• A set of nodes, one per variable X
• A directed, acyclic graph
• A conditional distribution for each node given its parents: P(X | Parents(X)), stored as a conditional probability table (CPT)
• Example:
Example: Alarm Network

Structure: Burglary → Alarm ← Earthquake, Alarm → JohnCall, Alarm → MaryCall

P(B):              P(E):
  +b  0.001          +e  0.002
  -b  0.999          -e  0.998

P(A | B, E):
  B    E    A    P(A | B, E)
  +b   +e   +a   0.95
  +b   +e   -a   0.05
  +b   -e   +a   0.94
  +b   -e   -a   0.06
  -b   +e   +a   0.29
  -b   +e   -a   0.71
  -b   -e   +a   0.001
  -b   -e   -a   0.999

P(J | A):              P(M | A):
  A    J    P(J | A)     A    M    P(M | A)
  +a   +j   0.9          +a   +m   0.7
  +a   -j   0.1          +a   -m   0.3
  -a   +j   0.05         -a   +m   0.01
  -a   -j   0.95         -a   -m   0.99
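A small Python sketch of the Bayes net semantics applied to this network: the probability of a full assignment is the product of one CPT entry per node (values copied from the tables above):

# P(b, e, a, j, m) = P(b) P(e) P(a | b, e) P(j | a) P(m | a)
P_B = {"+b": 0.001, "-b": 0.999}
P_E = {"+e": 0.002, "-e": 0.998}
P_A = {("+b", "+e", "+a"): 0.95, ("+b", "+e", "-a"): 0.05,
       ("+b", "-e", "+a"): 0.94, ("+b", "-e", "-a"): 0.06,
       ("-b", "+e", "+a"): 0.29, ("-b", "+e", "-a"): 0.71,
       ("-b", "-e", "+a"): 0.001, ("-b", "-e", "-a"): 0.999}
P_J = {("+a", "+j"): 0.9, ("+a", "-j"): 0.1, ("-a", "+j"): 0.05, ("-a", "-j"): 0.95}
P_M = {("+a", "+m"): 0.7, ("+a", "-m"): 0.3, ("-a", "+m"): 0.01, ("-a", "-m"): 0.99}

def joint(b, e, a, j, m):
    """Product of conditional probabilities, one factor per node."""
    return P_B[b] * P_E[e] * P_A[(b, e, a)] * P_J[(a, j)] * P_M[(a, m)]

# e.g. a burglary with no earthquake sets off the alarm and both neighbors call:
print(joint("+b", "-e", "+a", "+j", "+m"))   # approximately 5.9e-4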
Bayes Net vs Joint Distribution Table
■ Both give you the power to calculate P(X1, X2, …, XN)
■ BNs encode joint distributions as a product of conditional distributions, one per variable:
  P(X1, …, Xn) = Π_i P(Xi | Parents(Xi))
■ How big is a joint distribution over N variables, each with d values? d^N
■ Bayes nets: huge space savings with sparsity, since the number of parents is usually small
■ It's easier to elicit local CPTs
■ BNs are faster to answer queries (coming)
■ Assume (without loss of generality) that X1, …, Xn are sorted in topological order according to the graph (i.e., parents before children), so Parents(Xi) ⊆ {X1, …, Xi-1}
■ So the Bayes net asserts conditional independences P(Xi | X1, …, Xi-1) = P(Xi | Parents(Xi))
■ To ensure these are valid, choose parents for node Xi that "shield" it from the other predecessors
Conditional independence semantics
■ Every variable is conditionally independent of its non-descendants given its parents
■ Conditional independence semantics <=> global semantics
(Figure: node X with parents U1, …, Um, children Y1, …, Yn, and the children's other parents Z1j, …, Znj)
Example: Burglary
■ Structure: Burglary → Alarm ← Earthquake
■ P(B): true 0.001, false 0.999
■ P(E): true 0.002, false 0.998
■ P(A | B, E): one row per combination of B and E (to be filled in)
Causality?
■ When Bayes’ nets reflect the true causal patterns:
■ Often simpler (nodes have fewer parents)
■ Often easier to think about
■ Often easier to elicit from experts
Operation 1: Join Factors
• Combine factors that mention a variable into a single factor over the union of their variables
• Example for the chain R → T → L: join on R, then join on T

P(R):         P(T | R):          P(L | T):
  +r  0.1       +r  +t  0.8        +t  +l  0.3
  -r  0.9       +r  -t  0.2        +t  -l  0.7
                -r  +t  0.1        -t  +l  0.1
                -r  -t  0.9        -t  -l  0.9

Join R: P(R, T) = P(R) P(T | R)
  +r  +t  0.08
  +r  -t  0.02
  -r  +t  0.09
  -r  -t  0.81

Join T: P(R, T, L) = P(R, T) P(L | T)
  +r  +t  +l  0.024
  +r  +t  -l  0.056
  +r  -t  +l  0.002
  +r  -t  -l  0.018
  -r  +t  +l  0.027
  -r  +t  -l  0.063
  -r  -t  +l  0.081
  -r  -t  -l  0.729
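A Python sketch of the join operation on this R → T → L example, with each factor stored as a dict from assignments to probabilities:

# Factors for the chain R -> T -> L, copied from the tables above.
P_R = {("+r",): 0.1, ("-r",): 0.9}
P_T_given_R = {("+r", "+t"): 0.8, ("+r", "-t"): 0.2,
               ("-r", "+t"): 0.1, ("-r", "-t"): 0.9}
P_L_given_T = {("+t", "+l"): 0.3, ("+t", "-l"): 0.7,
               ("-t", "+l"): 0.1, ("-t", "-l"): 0.9}

# Join on R: P(R, T) = P(R) * P(T | R), a pointwise product over rows that agree on R.
P_RT = {(r, t): P_R[(r,)] * p for (r, t), p in P_T_given_R.items()}
print(P_RT)   # approximately {('+r','+t'): 0.08, ('+r','-t'): 0.02, ('-r','+t'): 0.09, ('-r','-t'): 0.81}

# Join on T: P(R, T, L) = P(R, T) * P(L | T); rows must agree on the shared variable T.
P_RTL = {}
for (r, t), p_rt in P_RT.items():
    for (t2, l), p_lt in P_L_given_T.items():
        if t2 == t:
            P_RTL[(r, t, l)] = p_rt * p_lt
print(P_RTL[("+r", "+t", "+l")])   # approximately 0.024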
Operation 2: Eliminate
• Second basic operation: marginalization
• Take a factor and sum out a variable; a projection operation that shrinks the factor
• Example: sum out R from P(R, T) to get P(T)

P(R, T):           P(T):
  +r  +t  0.08       +t  0.17
  +r  -t  0.02       -t  0.83
  -r  +t  0.09
  -r  -t  0.81
Multiple Elimination
• Sum out R, then sum out T: P(R, T, L) → P(T, L) → P(L)

P(R, T, L):              Sum out R → P(T, L):     Sum out T → P(L):
  +r  +t  +l  0.024        +t  +l  0.051            +l  0.134
  +r  +t  -l  0.056        +t  -l  0.119            -l  0.866
  +r  -t  +l  0.002        -t  +l  0.083
  +r  -t  -l  0.018        -t  -l  0.747
  -r  +t  +l  0.027
  -r  +t  -l  0.063
  -r  -t  +l  0.081
  -r  -t  -l  0.729
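A Python sketch of the elimination (sum-out) operation, continuing from the joined factor P(R, T, L) above:

# P(R, T, L), from the join example above.
P_RTL = {("+r", "+t", "+l"): 0.024, ("+r", "+t", "-l"): 0.056,
         ("+r", "-t", "+l"): 0.002, ("+r", "-t", "-l"): 0.018,
         ("-r", "+t", "+l"): 0.027, ("-r", "+t", "-l"): 0.063,
         ("-r", "-t", "+l"): 0.081, ("-r", "-t", "-l"): 0.729}

def sum_out_first(factor):
    """Marginalize out the first variable of a factor: add up rows that agree on the rest."""
    out = {}
    for assignment, p in factor.items():
        rest = assignment[1:]
        out[rest] = out.get(rest, 0.0) + p
    return out

P_TL = sum_out_first(P_RTL)   # sum out R
print(P_TL)                   # approximately {('+t','+l'): 0.051, ('+t','-l'): 0.119, ('-t','+l'): 0.083, ('-t','-l'): 0.747}
P_L = sum_out_first(P_TL)     # sum out T
print(P_L)                    # approximately {('+l',): 0.134, ('-l',): 0.866}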
General Variable Elimination
• Query: P(Q | E1 = e1, …, Ek = ek)
• Start with the initial factors (local CPTs, instantiated with the evidence)
• While there are still hidden variables (not the query or evidence): pick a hidden variable H, join all factors mentioning H, then eliminate (sum out) H
• Join the remaining factors and normalize

Example process (continued): choose A, then choose E, finish with B, then normalize
Marginalizing Early (Variable Elimination)
• Interleave joins and eliminations: join R, sum out R, join T, sum out T
• Factor sizes along the way: R → R, T → T → T, L → L

P(R):          Join R → P(R, T):      Sum out R → P(T):
  +r  0.1        +r  +t  0.08           +t  0.17
  -r  0.9        +r  -t  0.02           -t  0.83
                 -r  +t  0.09
                 -r  -t  0.81

P(T | R):        P(L | T):       Join T → P(T, L):      Sum out T → P(L):
  +r  +t  0.8      +t  +l  0.3     +t  +l  0.051          +l  0.134
  +r  -t  0.2      +t  -l  0.7     +t  -l  0.119          -l  0.866
  -r  +t  0.1      -t  +l  0.1     -t  +l  0.083
  -r  -t  0.9      -t  -l  0.9     -t  -l  0.747
Example 2: P(B | +a)
• Start with P(B) and P(A | B); join on B with the evidence +a selected, then normalize

P(B):          P(A | B):
  +b  0.1        +b  +a  0.8
  -b  0.9        +b  -a  0.2
                 -b  +a  0.1
                 -b  -a  0.9

Join (select +a): P(+a, B):      Normalize: P(B | +a):
  +a  +b  0.08                     +b  8/17
  +a  -b  0.09                     -b  9/17
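The same query as a short Python sketch: join P(B) with the +a rows of P(A | B), then normalize:

# Query P(B | +a): join, then normalize.
P_B = {"+b": 0.1, "-b": 0.9}
P_A_given_B = {("+b", "+a"): 0.8, ("+b", "-a"): 0.2,
               ("-b", "+a"): 0.1, ("-b", "-a"): 0.9}

# Select the evidence +a and join: P(+a, B) = P(B) * P(+a | B)
p_a_and_b = {b: P_B[b] * P_A_given_B[(b, "+a")] for b in P_B}
print(p_a_and_b)          # approximately {'+b': 0.08, '-b': 0.09}

# Normalize: divide by P(+a) = 0.17
z = sum(p_a_and_b.values())
posterior = {b: p / z for b, p in p_a_and_b.items()}
print(posterior)          # {'+b': 8/17 (about 0.47), '-b': 9/17 (about 0.53)}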
Causal Chains
• This configuration is a "causal chain": X → Y → Z
• Guaranteed X independent of Z given Y? Yes!
• Evidence along the chain "blocks" the influence
Common Cause
• This configuration is a "common cause": X ← Y → Z
• Guaranteed X independent of Z? No!
• One example set of CPTs for which X is not independent of Z is sufficient to show this independence is not guaranteed
• Example: the project being due (Y) causes both Canvas being busy (X) and the lab being full (Z)
Common Cause (continued)
• Guaranteed X and Z independent given Y? Yes!
• Observing the cause blocks influence between the effects
• Example: Y: project due, X: forums busy, Z: lab full
Common Effect
• Two causes of one effect (a v-structure): X → Z ← Y
• Are X and Y independent? Yes: the hockey game and the rain cause traffic, but they are not correlated
• Example: X: raining, Y: hockey game, Z: traffic
• Proof: P(x, y) = Σ_z P(x) P(y) P(z | x, y) = P(x) P(y) Σ_z P(z | x, y) = P(x) P(y)
Common Effect (continued)
• Are X and Y independent given Z? No: seeing traffic puts the rain and the hockey game in competition as explanations
• Example: X: raining, Y: hockey game, Z: traffic
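This "explaining away" effect can be checked numerically by brute-force enumeration; the CPT values below are hypothetical, chosen only to illustrate the effect:

# Hypothetical CPTs for the v-structure Rain -> Traffic <- Hockey (illustrative numbers only).
P_rain = {True: 0.1, False: 0.9}
P_hockey = {True: 0.1, False: 0.9}
P_traffic_true = {   # P(traffic = True | rain, hockey)
    (True, True): 0.95, (True, False): 0.8,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(r, h, t):
    """P(rain=r, hockey=h, traffic=t) = P(r) P(h) P(t | r, h)."""
    p_t = P_traffic_true[(r, h)] if t else 1 - P_traffic_true[(r, h)]
    return P_rain[r] * P_hockey[h] * p_t

def prob_rain_given(evidence):
    """P(rain = True | evidence), by enumerating the joint and normalizing."""
    num = den = 0.0
    for r in (True, False):
        for h in (True, False):
            for t in (True, False):
                world = {"rain": r, "hockey": h, "traffic": t}
                if any(world[var] != val for var, val in evidence.items()):
                    continue
                p = joint(r, h, t)
                den += p
                if r:
                    num += p
    return num / den

print(P_rain[True])                                          # prior: 0.1
print(prob_rain_given({"traffic": True}))                    # about 0.36: traffic raises belief in rain
print(prob_rain_given({"traffic": True, "hockey": True}))    # about 0.13: the game explains away the traffic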
Naïve Bayes (towards machine learning fundamentals)
• A general Naïve Bayes model: a single class variable Y with the features F1, F2, …, Fn as its children
• P(Y, F1, …, Fn) = P(Y) Π_i P(Fi | Y)
• |Y| parameters for P(Y), plus one small CPT P(Fi | Y) per feature
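A minimal Python sketch of Naïve Bayes classification under this factorization; the spam/ham domain and all probabilities are illustrative assumptions:

# Naive Bayes: score each class y by P(y) * prod_i P(f_i | y), then normalize.
P_Y = {"spam": 0.4, "ham": 0.6}
P_F_given_Y = {
    # P(word appears | class), one conditionally independent feature per word
    "offer":   {"spam": 0.7,  "ham": 0.1},
    "meeting": {"spam": 0.05, "ham": 0.4},
}

def posterior(features):
    """P(Y | features) under the Naive Bayes factorization."""
    scores = {}
    for y, prior in P_Y.items():
        score = prior
        for word, present in features.items():
            p = P_F_given_Y[word][y]
            score *= p if present else (1 - p)
        scores[y] = score
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(posterior({"offer": True, "meeting": False}))   # strongly favors spam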