
CMPT310: Probability, Bayes Nets
Chp: 13 and 14

Reza Nezami
[These slides were extracted from the course taught by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at …]
Uncertainty
■ The real world is rife with uncertainty!
■ E.g., if I leave for SFO 60 minutes before my flight, will I be there in time?
■ Problems:
■ partial observability (road state, other drivers’ plans, etc.)
■ noisy sensors (radio traffic reports, Google maps)
■ immense complexity of modelling and predicting traffic, security line, etc.
■ lack of knowledge of world dynamics (will a tire burst? will I get in a crash?)
■ Probabilistic assertions summarize the effects of ignorance and laziness
■ Combine probability theory + utility theory -> decision theory
■ Maximize expected utility: a* = argmax_a Σ_s P(s | a) U(s)
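
A minimal Python sketch of the MEU principle above. The actions, outcome probabilities, and utilities are made up for illustration (they are not from the slides); it simply picks the action a maximizing Σ_s P(s | a) U(s).

# Maximum expected utility (MEU) action selection; numbers below are illustrative only.
outcome_probs = {                                    # P(s | a) for each action a
    "leave_60_min_early": {"on_time": 0.7, "late": 0.3},
    "leave_120_min_early": {"on_time": 0.95, "late": 0.05},
}
utility = {"on_time": 100, "late": -200}             # U(s)

def expected_utility(action):
    """Compute sum_s P(s | a) * U(s) for one action."""
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

best_action = max(outcome_probs, key=expected_utility)   # a* = argmax_a EU(a)
print(best_action, expected_utility(best_action))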
Basic laws of probability (discrete)
■ Begin with a set Ω of possible worlds
■ E.g., 6 possible rolls of a die, {1, 2, 3, 4, 5, 6}

■ A probability model assigns a number P(ω) to each world ω
■ E.g., P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

■ These numbers must satisfy:
■ 0 <= P(ω) <= 1
■ Σ_ω P(ω) = 1
Basic laws contd.
■ An event is any subset of Ω
■ E.g., "roll < 4" is the set {1, 2, 3}
■ E.g., "roll is odd" is the set {1, 3, 5}

■ The probability of an event is the sum of probabilities over its worlds
■ P(A) = Σ_{ω in A} P(ω)
■ E.g., P(roll < 4) = P(1) + P(2) + P(3) = 1/2

■ De Finetti (1931): anyone who bets according to probabilities that
violate these laws can be forced to lose money on every set of bets
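
A minimal Python sketch of the die example above: a set of worlds, a probability model over them, and event probabilities computed by summing over worlds.

from fractions import Fraction

# P(ω) for each possible world ω of the die
P = {w: Fraction(1, 6) for w in [1, 2, 3, 4, 5, 6]}

def prob(event):
    """P(A) = sum of P(ω) over the worlds ω in the event A."""
    return sum(P[w] for w in event)

roll_less_than_4 = {1, 2, 3}
roll_is_odd = {1, 3, 5}
print(prob(roll_less_than_4))   # 1/2
print(prob(roll_is_odd))        # 1/2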
Probability Distributions
■ Associate a probability with each value; sums to 1

■ Temperature: P(T)        ■ Weather: P(W)
   T     P                    W       P
   hot   0.5                  sun     0.6
   cold  0.5                  rain    0.1
                              fog     0.3
                              meteor  0.0

■ Joint distribution P(T, W), whose marginal distributions are the tables above:
                 T = hot   T = cold
   W = sun        0.45       0.15
   W = rain       0.02       0.08
   W = fog        0.03       0.27
   W = meteor     0.00       0.00

■ Can't deduce the joint from the marginals
■ Can deduce the marginals from the joint
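
A minimal Python sketch of the "marginals from the joint" point, using the P(T, W) table above: summing out one variable recovers each marginal.

from collections import defaultdict

joint = {  # P(T, W) from the slide
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

P_T, P_W = defaultdict(float), defaultdict(float)
for (t, w), p in joint.items():
    P_T[t] += p   # sum out W
    P_W[w] += p   # sum out T

print(dict(P_T))  # ≈ {'hot': 0.5, 'cold': 0.5}
print(dict(P_W))  # ≈ {'sun': 0.6, 'rain': 0.1, 'fog': 0.3, 'meteor': 0.0}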
Probability Distributions
• Unobserved random variables have distributions

   P(T)            P(W)
   T     P         W       P
   hot   0.5       sun     0.6
   cold  0.5       rain    0.1
                   fog     0.3
                   meteor  0.0

• A distribution is a TABLE of probabilities of values
• A probability (lower case value) is a single number, e.g. P(W = rain) = 0.1
• Must have: P(X = x) >= 0 for every x, and Σ_x P(X = x) = 1
Joint Distributions
• A joint distribution over a set of random variables X1, X2, …, Xn
  specifies a real number for each assignment (or outcome): P(X1 = x1, X2 = x2, …, Xn = xn)

   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

• Must obey: P(x1, …, xn) >= 0 and Σ_{x1,…,xn} P(x1, …, xn) = 1

• Size of the distribution if n variables with domain sizes d?  d^n

• For all but the smallest distributions, impractical to write out!
Probabilistic Models
• A probabilistic model is a joint distribution over a set of random variables

• Probabilistic models:
  • (Random) variables with domains
  • Assignments are called outcomes
  • Joint distribution: says whether assignments (outcomes) are likely or not
  • Normalized: sums to 1.0
  • Ideally: only certain variables directly interact

   Distribution over T, W
   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3
Events
• An event is a set E of outcomes: P(E) = Σ_{(x1,…,xn) in E} P(x1, …, xn)

• From a joint distribution, we can calculate the probability of any event

   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

• Probability that it's hot AND sunny?
• Probability that it's hot?
• Probability that it's hot OR sunny?

• Typically, the events we care about are partial assignments, like P(T=hot)
Marginal Distributions
• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): combine collapsed rows by adding
  P(t) = Σ_w P(t, w)  and  P(w) = Σ_t P(t, w)

   T     W     P          T     P        W     P
   hot   sun   0.4        hot   0.5      sun   0.6
   hot   rain  0.1        cold  0.5      rain  0.4
   cold  sun   0.2
   cold  rain  0.3
Conditional Probabilities
• A simple relation between joint and conditional probabilities
• In fact, this is taken as the definition of a conditional probability:

  P(a | b) = P(a, b) / P(b)

   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

• E.g., P(W = sun | T = cold) = P(W = sun, T = cold) / P(T = cold) = 0.2 / 0.5 = 0.4
Normalizing a distribution
■ (Dictionary) To bring or restore to a normal condition

■ Procedure: make all entries sum to ONE
■ Multiply each entry by α = 1 / (sum over all entries)

■ Example: P(W | T=c) = P(W, T=c) / P(T=c) = α P(W, T=c), with α = 1/0.50 = 2

   P(W, T)                       select T=c     normalize
                T = hot  T = cold  P(W, T=c)     P(W | T=c)
   W = sun       0.45     0.15       0.15          0.30
   W = rain      0.02     0.08       0.08          0.16
   W = fog       0.03     0.27       0.27          0.54
   W = meteor    0.00     0.00       0.00          0.00
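
A minimal Python sketch of the normalization trick above: condition on T = cold by selecting the matching entries of P(W, T) and rescaling them so they sum to 1.

joint = {  # P(W, T) from the slide: (weather, temperature) -> probability
    ("sun", "hot"): 0.45, ("sun", "cold"): 0.15,
    ("rain", "hot"): 0.02, ("rain", "cold"): 0.08,
    ("fog", "hot"): 0.03, ("fog", "cold"): 0.27,
    ("meteor", "hot"): 0.00, ("meteor", "cold"): 0.00,
}

# Step 1: select entries consistent with the evidence T = cold
selected = {w: p for (w, t), p in joint.items() if t == "cold"}

# Step 2: normalize (multiply by alpha = 1 / sum of selected entries)
alpha = 1.0 / sum(selected.values())          # 1 / 0.50 = 2
P_W_given_cold = {w: alpha * p for w, p in selected.items()}

print(P_W_given_cold)  # ≈ {'sun': 0.3, 'rain': 0.16, 'fog': 0.54, 'meteor': 0.0}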
The Product Rule:  P(x, y) = P(x | y) P(y)

• Example:  P(D, W) = P(D | W) P(W)

   P(W)              P(D | W)                P(D, W)
   W     P           D     W     P           D     W     P
   sun   0.8         wet   sun   0.1         wet   sun   0.08
   rain  0.2         dry   sun   0.9         dry   sun   0.72
                     wet   rain  0.7         wet   rain  0.14
                     dry   rain  0.3         dry   rain  0.06
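
A minimal Python sketch of the product-rule table above: building the joint P(D, W) = P(D | W) P(W) entry by entry.

P_W = {"sun": 0.8, "rain": 0.2}                        # P(W)
P_D_given_W = {                                         # P(D | W)
    ("wet", "sun"): 0.1, ("dry", "sun"): 0.9,
    ("wet", "rain"): 0.7, ("dry", "rain"): 0.3,
}

# P(d, w) = P(d | w) * P(w) for every assignment
P_DW = {(d, w): p * P_W[w] for (d, w), p in P_D_given_W.items()}

print(P_DW)
# ≈ {('wet', 'sun'): 0.08, ('dry', 'sun'): 0.72, ('wet', 'rain'): 0.14, ('dry', 'rain'): 0.06}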
The Chain Rule
• More generally, we can always write any joint distribution as an
  incremental product of conditional distributions:

  P(x1, x2, …, xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) … = Π_i P(xi | x1, …, x_{i-1})

• Why is this always true? Each factor is just the definition of a conditional
  probability, so the product telescopes back to the joint.
Probabilistic Inference
• Probabilistic inference: compute a desired
probability from other known probabilities (e.g.
conditional from joint)

• We generally compute conditional probabilities


• P(on time | no reported accidents) = 0.90
• These represent the agent’s beliefs given the evidence

• Probabilities change with new evidence:


• P(on time | no accidents, 5 a.m.) = 0.95
• P(on time | no accidents, 5 a.m., raining) = 0.80
• Observing new evidence causes beliefs to be updated
Inference by Enumeration
• General case:
  • Evidence variables:  E1, …, Ek = e1, …, ek
  • Query* variable:     Q
  • Hidden variables:    H1, …, Hr      (together, all the variables)
  • We want:  P(Q | e1, …, ek)
  * Works fine with multiple query variables, too

• Step 1: Select the entries consistent with the evidence
• Step 2: Sum out H to get the joint of the Query and the evidence:
  P(Q, e1, …, ek) = Σ_{h1,…,hr} P(Q, h1, …, hr, e1, …, ek)
• Step 3: Normalize:  P(Q | e1, …, ek) = P(Q, e1, …, ek) / P(e1, …, ek)
Inference by Enumeration example:

   S       T     W     P
   summer  hot   sun   0.30
   summer  hot   rain  0.05
   summer  cold  sun   0.10
   summer  cold  rain  0.05
   winter  hot   sun   0.10
   winter  hot   rain  0.05
   winter  cold  sun   0.15
   winter  cold  rain  0.20

• P(W)?
• P(W | winter)?
• P(W | winter, hot)?
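
A minimal Python sketch of inference by enumeration on the (S, T, W) table above: select the rows consistent with the evidence, sum out the hidden variables, and normalize. The helper name query_W is just for illustration.

from collections import defaultdict

joint = {  # P(S, T, W) from the slide
    ("summer", "hot", "sun"): 0.30, ("summer", "hot", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "hot", "sun"): 0.10, ("winter", "hot", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
}

def query_W(evidence):
    """P(W | evidence), where evidence maps position 0 (S) or 1 (T) to a value."""
    unnormalized = defaultdict(float)
    for (s, t, w), p in joint.items():
        row = {0: s, 1: t}
        if all(row[i] == v for i, v in evidence.items()):  # Step 1: select
            unnormalized[w] += p                           # Step 2: sum out hidden vars
    z = sum(unnormalized.values())
    return {w: p / z for w, p in unnormalized.items()}     # Step 3: normalize

print(query_W({}))                       # P(W)               ≈ {'sun': 0.65, 'rain': 0.35}
print(query_W({0: "winter"}))            # P(W | winter)      ≈ {'sun': 0.5, 'rain': 0.5}
print(query_W({0: "winter", 1: "hot"}))  # P(W | winter, hot) ≈ {'sun': 0.67, 'rain': 0.33}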
Bayes' Rule
• Two ways to factor a joint distribution over two variables:

  P(x, y) = P(x | y) P(y) = P(y | x) P(x)          That's my rule!

• Dividing, we get:

  P(x | y) = P(y | x) P(x) / P(y)

• Why is this at all helpful?
• Lets us build one conditional from its reverse
• Often one conditional is tricky but the other one is simple
• Foundation of many probabilistic learning systems.
Inference with Bayes' Rule
• Example: diagnostic probability from causal probability:

  P(cause | effect) = P(effect | cause) P(cause) / P(effect)

• Example:
  • M: meningitis, S: stiff neck
  • P(+m | +s) = P(+s | +m) P(+m) / P(+s), computed from the example's givens:
    the causal probability P(+s | +m), the prior P(+m), and the evidence probability P(+s)

• Note: the posterior probability of meningitis is still very small
• Note: you should still get stiff necks checked out! Why?
Example: Tom's test result
• Tom takes a test for leukemia.
• It's known that patients with leukemia test positive 98% of the time.
• If a patient doesn't have leukemia, the result may still be positive 3% of the time.
• It is known that 0.8% of the population has leukemia.
• Tom's test comes back positive!
• What is the likelihood that Tom has leukemia? P(+L | +test)

  Prior: P(+L) = 0.008, P(-L) = 0.992
  P(+test | +L) = 0.98, P(+test | -L) = 0.03

  P(+L | +test) = P(+test | +L) * P(+L) / P(+test)

  P(+test) = P(+test | +L) * P(+L) + P(+test | -L) * P(-L)
           = 0.98 * 0.008 + 0.03 * 0.992 = 0.0078 + 0.0298 = 0.0376

  P(+L | +test) = (0.98 * 0.008) / 0.0376 ≈ 0.21

  so it means P(-L | +test) ≈ 0.79
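
A minimal Python sketch of the Bayes' rule computation above for Tom's test.

p_L = 0.008               # prior P(+L)
p_not_L = 1 - p_L         # P(-L) = 0.992
p_pos_given_L = 0.98      # P(+test | +L)
p_pos_given_not_L = 0.03  # P(+test | -L)

# Total probability of a positive test
p_pos = p_pos_given_L * p_L + p_pos_given_not_L * p_not_L   # ≈ 0.0376

# Bayes' rule
p_L_given_pos = p_pos_given_L * p_L / p_pos
print(round(p_L_given_pos, 2))      # 0.21
print(round(1 - p_L_given_pos, 2))  # 0.79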


Quiz: Bayes' Rule
• Given:

   P(W)              P(D | W)
   W     P           D     W     P
   sun   0.8         wet   sun   0.1
   rain  0.2         dry   sun   0.9
                     wet   rain  0.7
                     dry   rain  0.3

• What is P(W | dry)?
Independence
• Two variables are independent if:

  ∀x, y:  P(x, y) = P(x) P(y)

• This says that their joint distribution factors into a product of two simpler distributions

• Another form:

  ∀x, y:  P(x | y) = P(x)

• We write:  X ⊥ Y

• Independence is a simplifying modeling assumption

• Empirical joint distributions: at best "close" to independent

• What could we assume for {Weather, Traffic, Cavity, Toothache}?
Example: Independence?

   P(T)              P1(T, W)                P2(T, W)
   T     P           T     W     P           T     W     P
   hot   0.5         hot   sun   0.4         hot   sun   0.3
   cold  0.5         hot   rain  0.1         hot   rain  0.2
                     cold  sun   0.2         cold  sun   0.3
   P(W)              cold  rain  0.3         cold  rain  0.2
   W     P
   sun   0.6
   rain  0.4
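
A minimal Python sketch checking which of the two joint tables above factors as P(T) * P(W), i.e. which one makes T and W independent.

import itertools

P_T = {"hot": 0.5, "cold": 0.5}
P_W = {"sun": 0.6, "rain": 0.4}

P1 = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
      ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}
P2 = {("hot", "sun"): 0.3, ("hot", "rain"): 0.2,
      ("cold", "sun"): 0.3, ("cold", "rain"): 0.2}

def independent(joint):
    """True if joint(t, w) == P_T(t) * P_W(w) for every assignment."""
    return all(abs(joint[(t, w)] - P_T[t] * P_W[w]) < 1e-9
               for t, w in itertools.product(P_T, P_W))

print(independent(P1))  # False: e.g. 0.4 != 0.5 * 0.6
print(independent(P2))  # True: every entry equals the product of the marginals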
Example: Smoke alarm
 Variables:
  F: There is fire
  S: There is smoke
  A: Alarm sounds
Conditional Independence Examples
 What about this domain:

 Fire
 Smoke
 Alarm
Conditional Independence
• P(Toothache, Cavity, Catch)

• If I have a cavity, the probability that the probe catches it
  doesn't depend on whether I have a toothache:
  • P(+catch | +toothache, +cavity) = P(+catch | +cavity)

• The same independence holds if I don't have a cavity:
  • P(+catch | +toothache, -cavity) = P(+catch | -cavity)

• Catch is conditionally independent of Toothache given Cavity:


• P(Catch | Toothache, Cavity) = P(Catch | Cavity)
 Equivalent statements:
 P(Toothache | Catch , Cavity) = P(Toothache | Cavity)
 P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
 One can be derived from the other easily
Conditional Independence
• Unconditional (absolute) independence is very rare (why?)

• Conditional independence is our most basic and robust form of
  knowledge about uncertain environments.

• X is conditionally independent of Y given Z:  X ⊥ Y | Z

  if and only if:  ∀x, y, z:  P(x, y | z) = P(x | z) P(y | z)

  or, equivalently, if and only if:  ∀x, y, z:  P(x | z, y) = P(x | z)
Conditional Independence and the Chain Rule
• Chain rule:  P(X, Y, Z) = P(Z) P(Y | Z) P(X | Z, Y)

• Trivial decomposition: the chain rule holds for any ordering of the variables

• With the assumption of conditional independence (X ⊥ Y | Z):
  P(X, Y, Z) = P(Z) P(Y | Z) P(X | Z)

• Bayes' nets / graphical models help us express conditional independence assumptions
Bayes’ Nets: Big Picture
• Problems with using full joint distribution tables as our
probabilistic models:
• Unless there are only a few variables, the joint is WAY too big
to represent explicitly
• Hard to learn (estimate) anything empirically about more than
a few variables at a time

• Bayes’ nets: a technique for describing complex joint


distributions (models) using simple, local distributions
(conditional probabilities)
• More properly called graphical models
• We describe how variables locally interact
• Local interactions chain together to give global, indirect
interactions
• For now, we’ll be vague about how these interactions are
specified
Graphical Model Notation
• Nodes: variables (with domains)
• Can be assigned (observed) or unassigned
(unobserved)

• Arcs: interactions
• Similar to CSP constraints
• Indicate “direct influence” between variables
• Formally: encode conditional independence
(more later)

• For now: imagine that arrows mean direct


causation (in general, they don’t!)
Example: Traffic
• Variables:
  • R: It rains
  • T: There is traffic

• Model 1: independence          (two unconnected nodes:  R    T)
• Model 2: rain causes traffic   (an arc  R -> T)

• Why is an agent using model 2 better?
Bayes' Net Examples
Bayes' Net Examples: Car won't Start!
Example Bayes' Net: Insurance
Bayes' Net Semantics
• A set of nodes, one per variable X

• A directed, acyclic graph

• A conditional distribution for each node X given its parent
  variables A1, …, An in the graph:  P(X | A1, …, An)

• A collection of distributions over X, one for each combination of
  the parents' values

• CPT: conditional probability table

• Description of a noisy "causal" process

A Bayes net = Topology (graph) + Local Conditional Probabilities
Exercise: Alarm Network
• Variables
• B: Burglary
• A: Alarm goes off
• M: Mary calls
• J: John calls
• E: Earthquake!
Example: Bayesian network vs joint distribution parameter count
Probabilities in BNs
• Bayes' nets implicitly encode joint distributions

• As a product of local conditional distributions

• To see what probability a BN gives to a full assignment, multiply all the
  relevant conditionals together:

  P(x1, x2, …, xn) = Π_i P(xi | parents(Xi))

• The chain rule gives the same product once the conditional independence assumptions are applied

• Example (alarm network, next slide):
  P(+b, -e, +a, +j, +m) = P(+b) P(-e) P(+a | +b, -e) P(+j | +a) P(+m | +a)
Example: Alarm Network
   Graph:  Burglary -> Alarm <- Earthquake,  Alarm -> JohnCalls,  Alarm -> MaryCalls

   B    P(B)           E    P(E)
   +b   0.001          +e   0.002
   -b   0.999          -e   0.998

   B    E    A    P(A|B,E)        A    J    P(J|A)       A    M    P(M|A)
   +b   +e   +a   0.95            +a   +j   0.9          +a   +m   0.7
   +b   +e   -a   0.05            +a   -j   0.1          +a   -m   0.3
   +b   -e   +a   0.94            -a   +j   0.05         -a   +m   0.01
   +b   -e   -a   0.06            -a   -j   0.95         -a   -m   0.99
   -b   +e   +a   0.29
   -b   +e   -a   0.71
   -b   -e   +a   0.001
   -b   -e   -a   0.999
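
A minimal Python sketch of the point above: the probability the alarm network assigns to one full assignment, e.g. P(+b, -e, +a, +j, +m), is the product of the local conditionals from the CPTs.

P_B = {"+b": 0.001, "-b": 0.999}
P_E = {"+e": 0.002, "-e": 0.998}
P_A_given_BE = {("+b", "+e", "+a"): 0.95, ("+b", "+e", "-a"): 0.05,
                ("+b", "-e", "+a"): 0.94, ("+b", "-e", "-a"): 0.06,
                ("-b", "+e", "+a"): 0.29, ("-b", "+e", "-a"): 0.71,
                ("-b", "-e", "+a"): 0.001, ("-b", "-e", "-a"): 0.999}
P_J_given_A = {("+a", "+j"): 0.9, ("+a", "-j"): 0.1, ("-a", "+j"): 0.05, ("-a", "-j"): 0.95}
P_M_given_A = {("+a", "+m"): 0.7, ("+a", "-m"): 0.3, ("-a", "+m"): 0.01, ("-a", "-m"): 0.99}

def full_assignment_prob(b, e, a, j, m):
    """P(b, e, a, j, m) = P(b) P(e) P(a | b, e) P(j | a) P(m | a)."""
    return (P_B[b] * P_E[e] * P_A_given_BE[(b, e, a)]
            * P_J_given_A[(a, j)] * P_M_given_A[(a, m)])

# 0.001 * 0.998 * 0.94 * 0.9 * 0.7 ≈ 0.000591
print(full_assignment_prob("+b", "-e", "+a", "+j", "+m"))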
Bayes Net vs Joint Distribution Table
■ Both give you the power to calculate P(X1, X2, …, XN)

■ BNs encode joint distributions as a product of conditional distributions on
  each variable:  P(X1, …, Xn) = Π_i P(Xi | Parents(Xi))

■ How big is a joint distribution over N variables, each with d values?  d^N

■ How big is an N-node BN if nodes have at most k parents?  O(N * d^k)

■ Bayes nets: huge space savings with sparsity, since usually the number of parents is small!
■ It's easier to elicit local CPTs
■ BNs are faster to answer queries (coming)
Conditional independence in BNs
■ Compare the Bayes net global semantics
  P(X1, …, Xn) = Π_i P(Xi | Parents(Xi))

  with the chain rule identity
  P(X1, …, Xn) = Π_i P(Xi | X1, …, Xi-1)

■ Assume (without loss of generality) that X1, …, Xn are sorted in topological order according to
  the graph (i.e., parents before children), so Parents(Xi) ⊆ {X1, …, Xi-1}
■ So the Bayes net asserts the conditional independences P(Xi | X1, …, Xi-1) = P(Xi | Parents(Xi))
■ To ensure these are valid, choose parents for node Xi that "shield" it from other predecessors
Conditional independence semantics
■ Every variable is conditionally independent of its non-descendants given its parents
■ Conditional independence semantics <=> global semantics

  (Figure: a node X with parents U1, …, Um, children Y1, …, Yn, and the children's other parents Z1j, …, Znj)
Example: Burglary

■ Burglary      P(B):  true 0.001, false 0.999
■ Earthquake    P(E):  true 0.002, false 0.998
■ Alarm         P(A | B, E):

   B      E       P(A=true | B, E)   P(A=false | B, E)
   true   true         0.95               0.05
   true   false        0.94               0.06
   false  true         0.29               0.71
   false  false        0.001              0.999
Causality?
■ When Bayes’ nets reflect the true causal patterns:
■ Often simpler (nodes have fewer parents)
■ Often easier to think about
■ Often easier to elicit from experts

■ BNs need not actually be causal


■ Sometimes no causal net exists over the domain
(especially if variables are missing)
■ E.g. consider the variables Traffic and Rain
■ End up with arrows that reflect correlation, not
causation

■ What do the arrows really mean?


■ Topology may happen to encode causal
structure
■ Topology really encodes conditional
independence
Summary
■ Independence and conditional independence are
important forms of probabilistic knowledge

■ Bayes nets encode joint distributions efficiently by
  taking advantage of conditional independence
■ Global joint probability = product of local conditionals

■ Allows for flexible tradeoff between model accuracy


and memory/compute efficiency by picking most
important correlations and ignoring the rest!
(Figure: a spectrum of models over the same variables, from Strict Independence to Naïve Bayes to a Sparse Bayes Net to the full Joint Distribution)


Inference
• Inference: calculating some useful quantity from a joint probability distribution

• Examples:
  • Posterior probability:      P(Q | E1 = e1, …, Ek = ek)
  • Most likely explanation:    argmax_q P(Q = q | E1 = e1, …, Ek = ek)
Operation 1: Joining Factors

   P(R)              P(T | R)             P(L | T)
   +r  0.1           +r  +t  0.8          +t  +l  0.3
   -r  0.9           +r  -t  0.2          +t  -l  0.7
                     -r  +t  0.1          -t  +l  0.1
                     -r  -t  0.9          -t  -l  0.9

   Join R: P(R, T) = P(R) P(T | R)        Join T: P(R, T, L) = P(R, T) P(L | T)
   +r  +t  0.08                           +r  +t  +l  0.024     -r  +t  +l  0.027
   +r  -t  0.02                           +r  +t  -l  0.056     -r  +t  -l  0.063
   -r  +t  0.09                           +r  -t  +l  0.002     -r  -t  +l  0.081
   -r  -t  0.81                           +r  -t  -l  0.018     -r  -t  -l  0.729
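
A minimal Python sketch of the join operation above: multiply factors that share variables, here producing P(R, T, L) = P(R) * P(T | R) * P(L | T).

P_R = {"+r": 0.1, "-r": 0.9}
P_T_given_R = {("+r", "+t"): 0.8, ("+r", "-t"): 0.2, ("-r", "+t"): 0.1, ("-r", "-t"): 0.9}
P_L_given_T = {("+t", "+l"): 0.3, ("+t", "-l"): 0.7, ("-t", "+l"): 0.1, ("-t", "-l"): 0.9}

# Join R: P(R, T)
P_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)] for (r, t) in P_T_given_R}

# Join T: P(R, T, L), matching rows on the shared variable T
P_RTL = {(r, t, l): P_RT[(r, t)] * P_L_given_T[(t2, l)]
         for (r, t) in P_RT for (t2, l) in P_L_given_T if t2 == t}

print(P_RTL[("+r", "+t", "+l")])  # 0.1 * 0.8 * 0.3 ≈ 0.024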
Operation 2: Eliminate
• Second basic operation: marginalization

• Take a table and sum out a variable
• Shrinks a table to a smaller one
• A projection operation

• Example: summing R out of P(R, T) gives P(T)

   P(R, T)             P(T)
   +r  +t  0.08        +t  0.17
   +r  -t  0.02        -t  0.83
   -r  +t  0.09
   -r  -t  0.81
Multiple Elimination

   P(R, T, L)           sum out R     P(T, L)        sum out T     P(L)
   +r  +t  +l  0.024    --------->    +t  +l  0.051  --------->    +l  0.134
   +r  +t  -l  0.056                  +t  -l  0.119                -l  0.866
   +r  -t  +l  0.002                  -t  +l  0.083
   +r  -t  -l  0.018                  -t  -l  0.747
   -r  +t  +l  0.027
   -r  +t  -l  0.063
   -r  -t  +l  0.081
   -r  -t  -l  0.729
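
A minimal Python sketch of summing variables out of the joint P(R, T, L) from the table above, reproducing P(T, L) and then P(L).

from collections import defaultdict

P_RTL = {  # P(R, T, L) from the slide
    ("+r", "+t", "+l"): 0.024, ("+r", "+t", "-l"): 0.056,
    ("+r", "-t", "+l"): 0.002, ("+r", "-t", "-l"): 0.018,
    ("-r", "+t", "+l"): 0.027, ("-r", "+t", "-l"): 0.063,
    ("-r", "-t", "+l"): 0.081, ("-r", "-t", "-l"): 0.729,
}

def sum_out(factor, position):
    """Sum out the variable at the given tuple position (marginalization)."""
    result = defaultdict(float)
    for assignment, p in factor.items():
        reduced = assignment[:position] + assignment[position + 1:]
        result[reduced] += p
    return dict(result)

P_TL = sum_out(P_RTL, 0)   # sum out R -> P(T, L): ≈ {('+t','+l'): 0.051, ...}
P_L = sum_out(P_TL, 0)     # sum out T -> P(L):    ≈ {('+l',): 0.134, ('-l',): 0.866}
print(P_L)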
General Variable Elimination
• Query:  P(Q | E1 = e1, …, Ek = ek)

• Start with initial factors:
  • the local CPTs (but instantiated by the evidence)

• While there are still hidden variables (not Q or evidence):
  • Pick a hidden variable H
  • Join all factors mentioning H
  • Eliminate (sum out) H

• Join all remaining factors and normalize
Example Process

  Choose A: join all factors mentioning A, then sum out A
  Choose E: join all factors mentioning E, then sum out E
  Finish with B
  Normalize
Marginalizing
  Join R, then sum out R, then join T, then sum out T:

   P(R)              P(T | R)             P(L | T)
   +r  0.1           +r  +t  0.8          +t  +l  0.3
   -r  0.9           +r  -t  0.2          +t  -l  0.7
                     -r  +t  0.1          -t  +l  0.1
                     -r  -t  0.9          -t  -l  0.9

   Join R: P(R, T)     Sum out R: P(T)    Join T: P(T, L)     Sum out T: P(L)
   +r  +t  0.08        +t  0.17           +t  +l  0.051       +l  0.134
   +r  -t  0.02        -t  0.83           +t  -l  0.119       -l  0.866
   -r  +t  0.09                           -t  +l  0.083
   -r  -t  0.81                           -t  -l  0.747
Example 2: P(B | +a)

   Start / Select            Join on B            Normalize
   P(B)                      P(+a, B)             P(B | +a)
   +b  0.1                   +a  +b  0.08         +a  +b  8/17
   -b  0.9                   +a  -b  0.09         +a  -b  9/17

   P(A | B)
   +b  +a  0.8
   +b  -a  0.2
   -b  +a  0.1
   -b  -a  0.9
Causal Chains
• This configuration is a "causal chain":  X -> Y -> Z
  X: Low pressure, Y: Rain, Z: Traffic

• Guaranteed X independent of Z given Y?  Yes!
• Evidence along the chain "blocks" the influence
Common Causes
• This configuration is a "common cause":  X <- Y -> Z
  Y: Project due, X: Canvas active, Z: Lab full

• Guaranteed X independent of Z?  No!
• One example set of CPTs for which X is not independent of Z is sufficient
  to show this independence is not guaranteed.

• Example: the project being due causes both Canvas to be busy and the lab to be full
Common Cause
• This configuration is a "common cause":  X <- Y -> Z
  Y: Project due, X: Forums busy, Z: Lab full

• Guaranteed X and Z independent given Y?  Yes!
• Observing the cause blocks influence between the effects.
Common Effect
• Two causes of one effect (v-structure):  X -> Z <- Y
  X: Raining, Y: Hockey game, Z: Traffic

• Are X and Y independent?
• Yes: the hockey game and the rain cause traffic, but they are not correlated
• Proof:  P(x, y) = Σ_z P(x) P(y) P(z | x, y) = P(x) P(y) Σ_z P(z | x, y) = P(x) P(y)
Common Effect
• Two causes of one effect (v-structure):  X -> Z <- Y
  X: Raining, Y: Hockey game, Z: Traffic

• Are X and Y independent given Z?
• No: seeing traffic puts the rain and the hockey game in competition as explanations.

• This is backwards from the other cases
• Observing an effect activates influence between possible causes.
Naïve Bayes (Towards Machine Learning fundamentals)
• A general Naïve Bayes model:

  P(Y, F1, …, Fn) = P(Y) Π_i P(Fi | Y)

  Graph: class Y with features F1, F2, …, Fn as its children
  • Full joint table: |Y| x |F|^n values
  • Naïve Bayes: |Y| parameters for P(Y), plus n x |F| x |Y| parameters for the P(Fi | Y) tables

• We only have to specify how each feature depends on the class
• The total number of parameters is linear in n
• The model is very simplistic, but often works anyway
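
A minimal Python sketch of a tiny Naïve Bayes model P(Y, F1, …, Fn) = P(Y) Π_i P(Fi | Y). The spam/ham classes, the two binary features, and all the numbers are made up for illustration (not from the slides); the posterior P(Y | f1, …, fn) is computed by multiplying the factors and normalizing.

P_Y = {"spam": 0.3, "ham": 0.7}                        # class prior P(Y)
P_F_given_Y = [                                         # one table P(Fi | Y) per feature
    {("spam", True): 0.8, ("spam", False): 0.2, ("ham", True): 0.1, ("ham", False): 0.9},
    {("spam", True): 0.6, ("spam", False): 0.4, ("ham", True): 0.3, ("ham", False): 0.7},
]

def posterior(features):
    """P(Y | f1, ..., fn) via the Naïve Bayes factorization, then normalization."""
    scores = {}
    for y, prior in P_Y.items():
        score = prior
        for table, f in zip(P_F_given_Y, features):
            score *= table[(y, f)]                      # multiply in P(fi | y)
        scores[y] = score
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(posterior([True, True]))   # ≈ {'spam': 0.87, 'ham': 0.13}

Note how the parameter count matches the slide: one prior table plus n small conditional tables, so the total grows linearly in the number of features n.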
