
Uncertainty and Bayesian Networks

吉建民

USTC
[email protected]

April 29, 2023

Used Materials

Disclaimer: These slides draw on S. Russell and P. Norvig's Artificial
Intelligence: A Modern Approach slides, Prof. Xu Linli's course slides and other
online course materials, as well as open-source code from GitHub and material
from several blogs.

Course Outline

! Part 1: Introduction and Agents (chapters 1, 2)
! Part 2: Problem Solving / Search (chapters 3, 4, 5, 6)
! Part 3: Knowledge and Reasoning / Logic (chapters 7, 8, 9, 10, 11, 12)
! Part 4: Uncertain Knowledge and Reasoning / Uncertainty (chapters 13,
14, 15, 16, 17)
! Part 5: Learning (chapters 18, 19, 20, 21)
! Part 6: Application (chapters 22, 23, 24, 25)

Table of Contents

Uncertainty

Probability

Syntax and Semantics

Inference

Independence and Bayes’ Rule

Bayesian network
Graphical models
Bayesian networks
Inference in Bayesian networks

Uncertainty

Let action At = “leave for airport t minutes before flight”


Will At get me there on time?

Problems:
! partial observability (e.g., road state, other drivers' plans, etc.)
! noisy sensors (e.g., traffic reports)
! uncertainty in action outcomes (e.g., flat tire, etc.)
! immense complexity of modeling and predicting traffic

Uncertainty

Hence a purely logical approach either:


! risks falsehood: "A25 will get me there on time", or
! leads to conclusions that are too weak for decision making:
“A25 will get me there on time if there’s no accident on the
bridge and it doesn’t rain and my tires remain intact etc etc.”
(A1440 might reasonably be said to get me there on time but I’d
have to stay overnight in the airport …)

Method for handling uncertainty

Probability
! Models the agent's degree of belief
! Given the available evidence,
A25 will get me there on time with probability 0.04

Table of Contents

Uncertainty

Probability

Syntax and Semantics

Inference

Independence and Bayes’ Rule

Bayesian network
Graphical models
Bayesian networks
Inference in Bayesian networks

Probability

Probability theory provides a way of summarizing the uncertainty that comes
from our laziness and ignorance. Probabilistic assertions summarize the effects of
! Laziness: failure to enumerate exceptions, qualifications, etc.
! Ignorance: lack of relevant facts, initial conditions, etc.
Subjective probability:
! Probabilities relate propositions to the agent's own state of knowledge.
e.g., P(A25 | no reported accidents) = 0.06
These are not assertions about the world.
Probabilities of propositions change with new evidence:
e.g., P(A25 | no reported accidents, 5 a.m.) = 0.15

Making decisions under uncertainty

Suppose I believe the following:


! P(A25 gets me there on time | …) = 0.04
! P(A90 gets me there on time | …) = 0.70
! P(A120 gets me there on time | …) = 0.95
! P(A1440 gets me there on time | …) = 0.9999
Which action to choose?
—Depends on my preferences for missing the flight vs. time spent waiting, etc.
Utility theory is used to represent and infer preferences.
Decision theory = probability theory + utility theory
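As a concrete illustration of "decision theory = probability theory + utility theory", the sketch below picks the departure time with the highest expected utility. The probabilities are the ones listed above; the utility values and the per-minute waiting cost are purely hypothetical numbers chosen for illustration, not values from the course.

```python
# Minimal sketch: choosing a departure time by maximum expected utility.
# The probabilities come from the slide; the utility numbers are made up.
P_on_time = {25: 0.04, 90: 0.70, 120: 0.95, 1440: 0.9999}

U_CATCH = 1000.0   # hypothetical utility of catching the flight
U_MISS = -5000.0   # hypothetical utility of missing it
WAIT_COST = 1.0    # hypothetical cost per minute spent waiting at the airport

def expected_utility(t):
    p = P_on_time[t]
    return p * U_CATCH + (1 - p) * U_MISS - WAIT_COST * t

for t in sorted(P_on_time):
    print(f"A{t}: EU = {expected_utility(t):8.1f}")
print("Best action:", f"A{max(P_on_time, key=expected_utility)}")
```

With these particular utilities A120 comes out best; changing the hypothetical numbers can change the decision, which is exactly the point of separating probabilities from preferences.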

Table of Contents

Uncertainty

Probability

Syntax and Semantics

Inference

Independence and Bayes’ Rule

Bayesian network
Graphical models
Bayesian networks
Inference in Bayesian networks

Syntax

Basic element: random variable

Similar to propositional logic: possible worlds defined by assignment of
values to random variables.
! Boolean random variables
e.g., Cavity (do I have a cavity?)
! Discrete random variables
e.g., Weather is one of ⟨sunny, rainy, cloudy, snow⟩
Domain values must be exhaustive and mutually exclusive
! Continuous random variables
e.g., Temp = 21.6; also allow, e.g., Temp < 22.0

Syntax

Elementary propositions are constructed by assignment of a value to a
random variable:
e.g., Weather = sunny, Cavity = false (abbreviated as ¬cavity)

Complex propositions are formed from elementary propositions and standard
logical connectives,
e.g., Weather = sunny ∨ Cavity = false

Syntax

Atomic event: a complete specification of the state of the world about which
the agent is uncertain,
e.g., if the world consists of only two Boolean variables Cavity and
Toothache, then there are 4 distinct atomic events:
! Cavity = false ∧ Toothache = false
! Cavity = false ∧ Toothache = true
! Cavity = true ∧ Toothache = false
! Cavity = true ∧ Toothache = true
Atomic events are mutually exclusive and exhaustive.

Axioms of probability

For any propositions A, B


! 0 ≤ P(A) ≤ 1
! P(true) = 1 and P(false) = 0
! P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Prior probability

Prior or unconditional probabilities of propositions express the degree of
belief in a proposition in the absence of any other information,
e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72
correspond to belief prior to the arrival of any (new) evidence.

A probability distribution gives values for all possible assignments of a
random variable:
e.g., P(Weather) = <0.72, 0.1, 0.08, 0.1> (normalized, i.e., sums to 1)

Prior probability

The joint probability distribution for a set of random variables gives the
probability of every atomic event on those random variables (i.e., every
sample point),
e.g., P(Weather, Cavity) = a 4 × 2 table of values:

                  Weather = sunny   rainy   cloudy   snow
  Cavity = true        0.144        0.02    0.016    0.02
  Cavity = false       0.576        0.08    0.064    0.08

Every question about a domain can be answered by the joint


distribution because every event is a sum of sample points

Probability for continuous variables
Express the distribution as a parameterized function of value:
P(X = x) = U[18, 26](x) = uniform density between 18 and 26

Here P is a density; it integrates to 1.

P(X = 20.5) = 0.125 means
lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx)/dx = 0.125
Probability for continuous variables

Normal distribution:
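The figure for this slide is not reproduced in this text version; for reference, the univariate normal (Gaussian) density it refers to is

P(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)),

a parameterized density with mean μ and standard deviation σ.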

Conditional probability

Conditional or posterior probabilities P(a|b),
e.g., P(cavity|toothache) = 0.8
i.e., given that toothache is all I know

Notation for conditional distributions:

P(Cavity|Toothache) = a 2 × 2 matrix of values

If we know more, e.g., cavity is also given, then we have


P(cavity|toothache, cavity) = 1

New evidence may be irrelevant, allowing simplification, e.g.,


P(cavity|toothache, sunny) = P(cavity|toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial

Conditional probability

Definition of conditional probability:

P(a|b) = P(a ∧ b)/P(b) if P(b) > 0
The product rule gives an alternative formulation:
P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)
A general version holds for whole distributions, e.g.,
P(Weather, Cavity) = P(Weather|Cavity)P(Cavity)
(View as a set of 4 × 2 equations, not matrix multiplication)
The chain rule is derived by successive application of the product rule:
P(X1, . . . , Xn) = P(X1, . . . , Xn−1) P(Xn|X1, . . . , Xn−1)
= P(X1, . . . , Xn−2) P(Xn−1|X1, . . . , Xn−2) P(Xn|X1, . . . , Xn−1)
= . . .
= ∏_{i=1}^{n} P(Xi|X1, . . . , Xi−1)

Table of Contents

Uncertainty

Probability

Syntax and Semantics

Inference

Independence and Bayes’ Rule

Bayesian network
Graphical models
Bayesian networks
Inference in Bayesian networks

Inference by enumeration

Start with the joint probability distribution:

For any proposition φ, sum the atomic events where it is true:

P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

Inference by enumeration

Start with the joint probability distribution:

For any proposition φ, sum the atomic events where it is true:

P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
P(cavity ∨ toothache) =
0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

Inference by enumeration

Start with the joint probability distribution:

Can also compute conditional probabilities:

P(¬cavity|toothache) = P(¬cavity ∧ toothache) / P(toothache)
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4
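A minimal sketch of these enumeration queries in Python. The full joint table is the standard Toothache/Catch/Cavity table from the AIMA textbook (its entries are consistent with the sums shown above, but the table itself is not reproduced in this text version):

```python
# Full joint distribution P(Toothache, Catch, Cavity).
# Keys are (toothache, catch, cavity) truth values.
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(event):
    """Sum the atomic events (sample points) in which `event` is true."""
    return sum(p for world, p in joint.items() if event(*world))

p_toothache = prob(lambda t, c, cav: t)                            # 0.2
p_cav_or_t = prob(lambda t, c, cav: cav or t)                      # 0.28
p_not_cav_given_t = prob(lambda t, c, cav: t and not cav) / p_toothache  # 0.4
print(p_toothache, p_cav_or_t, round(p_not_cav_given_t, 3))
```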

Normalization

The denominator can be viewed as a normalization constant α:


P(Cavity|toothache) = αP(Cavity, toothache)

= α[P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]


= α[< 0.108, 0.016 > + < 0.012, 0.064 >]
= α < 0.12, 0.08 >=< 0.6, 0.4 >
General idea: compute the distribution on the query variable by fixing the
evidence variables and summing over the hidden (unobserved) variables
Inference by enumeration, contd.

Typically, we are interested in

! the posterior joint distribution of the query variables Y
! given specific values e for the evidence variables E

Let the hidden variables be H = X − Y − E

Then the required summation of joint entries is done by summing out the
hidden variables:
P(Y|E = e) = αP(Y, E = e) = α Σ_h P(Y, E = e, H = h)

The terms in the summation are joint entries because Y, E and H together
exhaust the set of random variables in the domain.
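The general recipe P(Y|E = e) = α Σ_h P(Y, E = e, H = h) as a small Python sketch (the dictionary-based representation and function names are illustrative choices, not part of the slides; the joint table is the dentist example used above):

```python
# Generic enumeration: fix the evidence, sum out the hidden variables, normalize.
names = ("Toothache", "Catch", "Cavity")
joint = {
    (True, True, True): 0.108, (True, False, True): 0.012,
    (False, True, True): 0.072, (False, False, True): 0.008,
    (True, True, False): 0.016, (True, False, False): 0.064,
    (False, True, False): 0.144, (False, False, False): 0.576,
}

def query(target, evidence):
    """Posterior distribution of `target` given the `evidence` dict."""
    dist = {}
    for world, p in joint.items():
        w = dict(zip(names, world))
        if all(w[var] == val for var, val in evidence.items()):
            # Everything that is neither target nor evidence is summed out here.
            dist[w[target]] = dist.get(w[target], 0.0) + p
    alpha = 1.0 / sum(dist.values())          # normalization constant
    return {val: alpha * p for val, p in dist.items()}

print(query("Cavity", {"Toothache": True}))   # -> {True: 0.6, False: 0.4}
```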

Inference by enumeration, contd.

Obvious problems:
! Worst-case time complexity O(d^n), where d is the largest arity
! Space complexity O(d^n) to store the joint distribution
! How to find the numbers for O(d^n) entries?

Table of Contents

Uncertainty

Probability

Syntax and Semantics

Inference

Independence and Bayes’ Rule

Bayesian network
Graphical models
Bayesian networks
Inference in Bayesian networks

Independence
A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A)P(B)

e.g. roll of 2 dice: P(1, 3) = 1/6 * 1/6 = 1/36

P(Toothache, Catch, Cavity, Weather) =

P(Toothache, Catch, Cavity) P(Weather)
32 entries reduced to 12; for n independent biased coins,
O(2^n) → O(n)
Absolute independence is powerful but rare.
Dentistry is a large field with hundreds of variables,
none of which are independent. What to do?
Conditional independence

P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries

If I have a cavity, the probability that the probe catches in it


doesn’t depend on whether I have a toothache:
— P(catch|toothache, cavity) = P(catch|cavity)

The same independence holds if I haven’t got a cavity:


— P(catch|toothache, ¬cavity) = P(catch|¬cavity)

Catch is conditionally independent of Toothache given Cavity:


— P(Catch|Toothache, Cavity) = P(Catch|Cavity)

Equivalent statements:
P(Toothache|Catch, Cavity) = P(Toothache|Cavity)
P(Toothache, Catch|Cavity) =
P(Toothache|Cavity)P(Catch|Cavity)

Conditional independence contd.

Write out full joint distribution using chain rule:


P(Toothache, Catch, Cavity)
= P(Toothache|Catch, Cavity)P(Catch, Cavity)
= P(Toothache|Catch, Cavity)P(Catch|Cavity)P(Cavity)
= P(Toothache|Cavity)P(Catch|Cavity)P(Cavity)
i.e. 2 + 2 + 1 = 5 independent numbers

In most cases, the use of conditional independence reduces the size of the
representation of the joint distribution from exponential in n to linear in n.
Conditional independence is our most basic and robust form of knowledge
about uncertain environments.

Bayes' Rule

Product rule: P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)

⇒ Bayes' rule: P(a|b) = P(b|a)P(a) / P(b)
or in distribution form
P(Y|X) = P(X|Y)P(Y) / P(X) = αP(X|Y)P(Y)

Useful for assessing diagnostic probability from causal probability:
P(Cause|Effect) = P(Effect|Cause)P(Cause) / P(Effect)
e.g., let M be meningitis, S be stiff neck:
P(m|s) = P(s|m)P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
Note: the posterior probability of meningitis is still very small!

Probabilistic Inference Using Bayes’ Rule

H = “having a headache”
F = “coming down with Flu”
! P(H)=1/10
! P(F)=1/40
! P(H|F)=1/2

One day you wake up with a headache. You reason as follows: "since 50% of
flus are associated with headaches, I must have a 50% chance of coming down
with flu."

Is this reasoning correct?

Probabilistic Inference Using Bayes’ Rule

H = “having a headache”
F = “coming down with Flu”
! P(H)=1/10
! P(F)=1/40
! P(H|F)=1/2

The Problem:
P(F|H) = ?

Probabilistic Inference Using Bayes’ Rule

H = “having a headache” F = “coming down with Flu”


! P(H)=1/10
! P(F)=1/40
! P(H|F)=1/2

The Problem:
P(F|H)
= P(H|F)P(F)/P(H)
= 1/8
≠ P(H|F)
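The same calculation as a quick Python check of Bayes' rule with the numbers from the slide:

```python
# Bayes' rule: P(F|H) = P(H|F) P(F) / P(H)
P_H = 1 / 10        # P(having a headache)
P_F = 1 / 40        # P(coming down with flu)
P_H_given_F = 1 / 2

P_F_given_H = P_H_given_F * P_F / P_H
print(P_F_given_H)  # 0.125, i.e. 1/8 -- much less than P(H|F) = 0.5
```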

Bayes’ Rule and conditional independence

P(Cavity|toothache ∧ catch)
= αP(toothache ∧ catch|Cavity)P(Cavity)
= αP(toothache|Cavity)P(catch|Cavity)P(Cavity)

This is an example of a naïve Bayes model:

P(Cause, Effect1, . . . , Effectn) = P(Cause) ∏_i P(Effecti|Cause)

The total number of parameters is linear in n
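A small illustration of the factorization applied to the query above. The conditional probabilities used here (P(cavity) = 0.2, P(toothache|Cavity), P(catch|Cavity)) are derived from the dentist example's joint distribution; they are consistent with the numbers computed earlier but are not listed explicitly on these slides:

```python
# P(Cavity | toothache, catch) via the naive Bayes factorization, then normalize.
P_cavity = 0.2
P_toothache_given = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
P_catch_given = {True: 0.9, False: 0.2}       # P(catch | Cavity)

unnorm = {cav: P_toothache_given[cav] * P_catch_given[cav]
               * (P_cavity if cav else 1 - P_cavity)
          for cav in (True, False)}
alpha = 1.0 / sum(unnorm.values())
print({cav: round(alpha * p, 3) for cav, p in unnorm.items()})
# -> {True: 0.871, False: 0.129}
```

This is the same answer obtained by normalizing the joint entries ⟨0.108, 0.016⟩ directly.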

Where do probability distributions come from?

! Idea One: Human, Domain Experts


! Idea Two: Simpler probability facts and some algebra

! Use chain rule and independence assumptions to compute


joint distribution

Where do probability distributions come from?

! Idea Three: Learn them from data!


! A good chunk of machine learning research is essentially about
various ways of learning various forms of them!

Summary of Uncertainty

! Probability is a rigorous formalism for uncertain knowledge
! The joint probability distribution specifies the probability of every
atomic event (i.e., of every complete assignment to the random variables)
! Queries can be answered by summing over atomic events
! For nontrivial domains, we must find a way to reduce the joint size
! Independence and conditional independence provide the tools
Homework

! 3rd edition: 13.15, 13.18, 13.21, 13.22

Note: the corresponding exercises in the 2nd and 3rd editions differ in
content, but they serve a similar training purpose; the difference is minor.

Table of Contents

Uncertainty

Probability

Syntax and Semantics

Inference

Independence and Bayes’ Rule

Bayesian network
Graphical models
Bayesian networks
Inference in Bayesian networks

Frequentist vs. Bayesian

Objective vs. subjective
Frequentist: probability is the long-run expected frequency of occurrence.
P(A) = n/N, where n is the number of times event A occurs in N opportunities.
"The probability of an event is 0.1" means that 0.1 is the proportion that
would be observed in the limit of infinitely many samples.
But in many settings repeated trials are impossible,
e.g., what is the probability of a third world war?

Bayesian: degree of belief. It is a measure of the plausibility of an event
given incomplete knowledge.

Probability

! Probability is a rigorous formalism for uncertain knowledge
! The joint probability distribution specifies the probability of every
atomic event (i.e., of every complete assignment to the random variables)
! Queries can be answered by summing over atomic events
! For nontrivial domains, we must find a way to reduce the joint size
! Independence and conditional independence provide the tools

Independence/Conditional Independence

A and B are independent iff


P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A)P(B)

A is conditionally independent of B given C:


P(A|B, C) = P(A|C)

In most cases, the use of conditional independence reduces the size of the
representation of the joint distribution from exponential in n to linear in n.
Conditional independence is our most basic and robust form of knowledge
about uncertain environments.

Probability Theory

Probability theory can be expressed in terms of two simple


equations
! Sum Rule
! the probability of a variable is obtained by marginalizing (summing out)
the other variables:
p(a) = Σ_b p(a, b)
! Product Rule
! the joint probability expressed in terms of conditionals:
p(a, b) = p(b|a)p(a)
All probabilistic inference and learning amounts to repeated
application of sum and product rule

Outline

! Graphical models

! Bayesian networks
! Syntax
! Semantics
! Inference in Bayesian networks

What are Graphical Models?

They are diagrammatic representations of probability distributions
—a marriage between probability theory and graph theory

! Also called probabilistic graphical models


! They augment analysis instead of using pure algebra

What is a Graph?

! Consists of nodes (also called vertices) and links (also called


edges or arcs)

! In a probabilistic graphical model


! each node represents a random variable (or group of random
variables)
! Links express probabilistic relationships between variables

Graphical Models in CS

! Natural tool for handling uncertainty and complexity
—which occur throughout applied mathematics and engineering
! Fundamental to the idea of a graphical model is the notion of modularity
—a complex system is built by combining simpler parts.

Why are Graphical Models useful

! Probability theory provides the glue whereby


! the parts are combined, ensuring that the system as a whole is
consistent
! providing ways to interface models to data.
! Graph theoretic side provides:
! Intuitively appealing interface
—by which humans can model highly-interacting sets of
variables
! Data structure
—that lends itself naturally to designing efficient general-purpose algorithms

Graphical models: Unifying Framework

! View classical multivariate probabilistic systems as instances of a common
underlying formalism
! mixture models, factor analysis, hidden Markov models, Kalman filters, etc.
! Encountered in systems engineering, information theory,
pattern recognition and statistical mechanics
! Advantages of View:
! Specialized techniques in one field can be transferred between
communities and exploited
! Provides natural framework for designing new systems

Role of Graphical Models in Machine Learning

! Simple way to visualize the structure of a probabilistic model
! Insights into properties of the model
—conditional independence properties can be read off the graph by inspection
! Complex computations
—required to perform inference and learning, expressed as graphical manipulations

Graph Directionality
! Directed graphical models
—directionality associated with arrows
! Bayesian networks
—express causal relationships between random variables
! More popular in AI and statistics

! Undirected graphical models
—links without arrows
! Markov random fields
—better suited to express soft constraints between variables
! More popular in Vision and physics

Bayesian networks

A simple, graphical data structure for representing dependencies among
variables (conditional independence), giving a concise specification of any
full joint probability distribution.
Syntax:
! a set of nodes, one per variable
! a directed, acyclic graph (link ≈ "directly influences")
! a conditional distribution for each node given its parents:
P(Xi|Parents(Xi)) — quantifying the effect of the parents on the node
In the simplest case, the conditional distribution is represented as a
conditional probability table (CPT) giving the distribution over Xi for each
combination of parent values

Example

The topology of the network encodes conditional independence assertions:

Weather is independent of the other variables


Toothache and Catch are conditionally independent given Cavity

Example

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor
Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a
burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:


! A burglar can set the alarm off
! An earthquake can set the alarm off
! The alarm can cause Mary to call
! The alarm can cause John to call

Example contd.

Compactness

A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows, one for
each combination of parent values.

Each row requires one number p for Xi = true
(the number for Xi = false is just 1 − p).

If each variable has no more than k parents, the complete network requires
O(n · 2^k) numbers,
i.e. it grows linearly with n, vs. O(2^n) for the full joint distribution.
For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)

Global semantics

The full joint distribution is defined as the product of the local conditional
distributions:

P(x1, . . . , xn) = ∏_{i=1}^{n} P(xi | parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j|a)P(m|a)P(a|¬b, ¬e)P(¬b)P(¬e)
= 0.9 ∗ 0.7 ∗ 0.001 ∗ 0.999 ∗ 0.998
≈ 0.00063
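A quick check of this product in Python. P(a|¬b,¬e) = 0.001, P(j|a) = 0.9 and P(m|a) = 0.7 appear in the line above; P(b) = 0.001 and P(e) = 0.002 follow from the P(¬b) = 0.999 and P(¬e) = 0.998 used there (the standard AIMA burglary-network values):

```python
# P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
P_b, P_e = 0.001, 0.002
P_a_given_notb_note = 0.001    # P(a | ¬b, ¬e); other CPT rows not needed here
P_j_given_a, P_m_given_a = 0.90, 0.70

p = P_j_given_a * P_m_given_a * P_a_given_notb_note * (1 - P_b) * (1 - P_e)
print(p)   # ≈ 0.000628
```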

Local semantics

Local semantics: each node is conditionally independent of its
nondescendants given its parents.

Theorem: Local semantics ⇔ global semantics

Constructing Bayesian networks
We need a method such that a series of locally testable assertions of
conditional independence guarantees the required global semantics.

1. Choose an ordering of variables X1, . . . , Xn
2. For i = 1 to n
add Xi to the network
select parents from X1, . . . , Xi−1 such that
P(Xi|Parents(Xi)) = P(Xi|X1, . . . , Xi−1)
This choice of parents guarantees the global semantics:

P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi|X1, . . . , Xi−1)   (chain rule)
                = ∏_{i=1}^{n} P(Xi|Parents(Xi))       (by construction)

Constructing Bayesian networks

The topology of the network must reflect the direct influence of each
variable's appropriate parent set on that variable.

The correct order in which to add nodes is to add the "root causes" first,
then the variables they directly influence, and so on.

Example

Suppose we choose the ordering M, J, A, B, E

! P(J|M) = P(J)?

Example

Suppose we choose the ordering M, J, A, B, E

! P(J|M) = P(J)? No
! P(A|J, M) = P(A|J)? P(A|J, M) = P(A)?

Example
Suppose we choose the ordering M, J, A, B, E

! P(J|M) = P(J)? No
! P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
! P(B|A, J, M) = P(B|A)?
! P(B|A, J, M) = P(B)?

Example
Suppose we choose the ordering M, J, A, B, E

! P(J|M) = P(J)? No
! P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
! P(B|A, J, M) = P(B|A)? Yes
! P(B|A, J, M) = P(B)? No
! P(E|B, A, J, M) = P(E|A)?
! P(E|B, A, J, M) = P(E|A, B)?
Example
Suppose we choose the ordering M, J, A, B, E

! P(J|M) = P(J)? No
! P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
! P(B|A, J, M) = P(B|A)? Yes
! P(B|A, J, M) = P(B)? No
! P(E|B, A, J, M) = P(E|A)? No
! P(E|B, A, J, M) = P(E|A, B)? Yes
Example contd.

Deciding conditional independence is hard in noncausal directions.

(Causal models and conditional independence seem hardwired for


humans!)

Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed


Inference tasks

Simple queries: compute the posterior probability P(Xi|E = e),

e.g., P(NoGas|Gauge = empty, Lights = on, Starts = false)

Conjunctive queries:
P(Xi, Xj|E = e) = P(Xi|E = e) P(Xj|Xi, E = e)

Optimal decisions: decision networks include utility information;


probabilistic inference required for
P(outcome|action, evidence)

Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually
constructing its explicit representation:
P(X|e) = αP(X, e) = α Σ_y P(X, e, y)

Note: in a Bayesian network the full joint distribution can be written as a
product of conditional probabilities:
P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi|Parents(Xi))

So queries can be answered in a Bayesian network by computing sums of
products of conditional probabilities.

Inference by enumeration

Slightly intelligent way to sum out variables from the joint without
actually constructing its explicit representation.

Simple query on the burglary network:


P(B|j, m)
= P(B, j, m)/P(j, m)
= αP(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m)

Rewrite the full joint entries using products of CPT entries:

P(B|j, m)
= α Σ_e Σ_a P(B)P(e)P(a|B, e)P(j|a)P(m|a)
= αP(B) Σ_e P(e) Σ_a P(a|B, e)P(j|a)P(m|a)

Recursive depth-first enumeration: O(n) space, O(d^n) time
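A direct implementation of this nested sum is sketched below. The CPT values are the standard ones for the AIMA burglary network; not all of them appear on these slides, so treat them as assumed inputs:

```python
# Enumeration for P(B | j, m) in the burglary network (standard AIMA CPTs).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,       # P(a = true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                        # P(j = true | A)
P_M = {True: 0.70, False: 0.01}                        # P(m = true | A)

def unnormalized(b):
    """sum_e sum_a P(b) P(e) P(a|b,e) P(j|a) P(m|a)"""
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            total += ((P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
                      * p_a * P_J[a] * P_M[a])
    return total

num = {b: unnormalized(b) for b in (True, False)}
alpha = 1.0 / sum(num.values())
print({b: round(alpha * v, 3) for b, v in num.items()})
```

Normalizing gives P(b|j, m) ≈ 0.284: burglary is still fairly unlikely even when both neighbors call.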

Evaluation tree

Enumeration is inefficient: repeated computations


e.g., computes P(j|a)P(m|a) for each value of e

Inference by variable elimination

Variable elimination: carry out the summations right-to-left, storing
intermediate results (factors) to avoid recomputation
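A minimal sketch of the idea for the burglary query above (not a general variable-elimination implementation): the factor f(A) = P(j|A)P(m|A) is computed once and cached instead of being recomputed for every value of e, which is exactly the repeated work noted on the evaluation-tree slide.

```python
# Same burglary CPTs as in the enumeration sketch (standard AIMA values, assumed).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

# Factor f(A) = P(j|A) P(m|A): computed once, reused while summing over e.
f_A = {a: P_J[a] * P_M[a] for a in (True, False)}

def unnormalized(b):
    total = 0.0
    for e in (True, False):
        p_e = P_E if e else 1 - P_E
        s = sum((P_A[(b, e)] if a else 1 - P_A[(b, e)]) * f_A[a]
                for a in (True, False))
        total += p_e * s
    return (P_B if b else 1 - P_B) * total

num = {b: unnormalized(b) for b in (True, False)}
alpha = 1.0 / sum(num.values())
print({b: round(alpha * v, 3) for b, v in num.items()})   # B = true ≈ 0.284
```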

Complexity of exact inference
Singly connected networks (or polytrees — the underlying undirected graph is
a tree):
! any two nodes are connected by at most one (undirected) path
! the time and space cost of variable elimination is O(d^k n),
i.e. linear in the size of the network
Multiply connected networks:
! can reduce 3SAT to exact inference ⇒ NP-hard
! equivalent to counting 3SAT models ⇒ #P-complete

Example: Naïve Bayes model

There is a single parent variable and a collection of child variables whose
values are conditionally independent of one another given the parent.

P(X1 = x1 , . . . , Xn = xn )
= P(X1 = x1 )P(X2 = x2 |X1 = x1 ) . . . P(Xn = xn |X1 = x1 )

Naïve Bayes model

P(Cause, Effect1, . . . , Effectn) = P(Cause) ∏_i P(Effecti|Cause)
P(Cause|Effect1, . . . , Effectn) = P(Effects, Cause)/P(Effects)
= αP(Cause, Effects) = αP(Cause) ∏_i P(Effecti|Cause)

The total number of parameters is linear in n

Example: Spam detection

Imagine the problem of trying to automatically detect spam e-mail messages.
A simple approach to get started is to look only at the "Subject:" headers
of the e-mail messages and attempt to recognize spam by checking some simple
computable features. The two simple features we will consider are:
! Caps: whether the subject header is entirely capitalized
! Free: whether the subject header contains the word 'free',
either in upper case or lower case
e.g., a message with the subject header "NEW MORTGAGE RATE" is likely to be
spam. Similarly for "Money for Free", "FREE lunch", etc.

Example: Spam detection

The model is based on the following three random variables, Caps,


Free and Spam, each of which take on the values Y (for Yes) or N
(for No)
! Caps = Y if and only if the subject of the message does not
contain lowercase letters
! Free = Y if and only if the word ‘free’ appears in the subject
(letter case is ignored)
! Spam = Y if and only if the message is spam
P(Free, Caps, Spam) = P(Spam) P(Caps|Spam) P(Free|Spam)

Example: Spam detection

P(Free, Caps, Spam) = P(Spam)P(Caps|Spam)P(Free|Spam)

Example: Spam detection

P(Free = Y, Caps = N, Spam = N)


= P(Spam = N)P(Caps = N|Spam = N)P(Free = Y|Spam = N)
≈ 0.53 ∗ 0.9245 ∗ 0.0189
≈ 0.0093
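To classify the message we would compare this term with the corresponding Spam = Y term and normalize. The sketch below uses the Spam = N numbers visible on this slide; the Spam = Y prior and conditionals are made-up placeholders standing in for the CPT slide that is not reproduced in this text version:

```python
# Posterior over Spam for a message with Free = Y and Caps = N.
# Spam = N numbers are from the slide; Spam = Y numbers are hypothetical placeholders.
P_spam        = {"Y": 0.47, "N": 0.53}     # prior P(Spam)
P_capsN_given = {"Y": 0.40, "N": 0.9245}   # P(Caps = N | Spam)
P_freeY_given = {"Y": 0.20, "N": 0.0189}   # P(Free = Y | Spam)

unnorm = {s: P_spam[s] * P_capsN_given[s] * P_freeY_given[s] for s in ("Y", "N")}
alpha = 1.0 / sum(unnorm.values())
print({s: round(alpha * v, 3) for s, v in unnorm.items()})
# The Spam = N term is 0.53 * 0.9245 * 0.0189 ≈ 0.0093, as computed above.
```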

Example: Learning to classify text documents

Text classification is the task of assigning a given document to one of a
fixed set of categories on the basis of the text it contains. Naïve Bayes
models are often used for this task. In these models, the query variable is
the document category, and the "effect" variables are the presence or absence
of each word of the language; we assume that words occur independently in
documents, with frequencies determined by the document category.
! Explain precisely how such a model is constructed, given a set of documents
of known category as "training data".
! Explain precisely how a new document is classified.
! Is the independence assumption reasonable? Discuss.

Example: Learning to classify text documents

The model consists of the prior probability P(Category) and the


conditional probabilities P(word i|Category)
! P(Category=c) is estimated as the fraction of all documents
that are of category c
! P(word i = true|Category=c) is estimated as the fraction of
documents of category c that contain word i
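A small sketch of both steps, training by counting and classifying with the naïve Bayes product (the toy corpus and the add-one smoothing are illustrative choices, not part of the slide):

```python
from collections import Counter, defaultdict
import math

# Toy "training data": (set of words in the document, category).
docs = [({"cheap", "pills", "free"}, "spam"),
        ({"free", "offer"}, "spam"),
        ({"meeting", "schedule"}, "ham"),
        ({"project", "schedule", "free"}, "ham")]

vocab = set().union(*(words for words, _ in docs))
cat_counts = Counter(cat for _, cat in docs)
word_counts = defaultdict(Counter)            # word_counts[category][word]
for words, cat in docs:
    for w in words:
        word_counts[cat][w] += 1

def log_posterior(words, cat):
    # log P(cat) + sum over vocabulary of log P(word present/absent | cat),
    # using add-one smoothing on the word counts.
    lp = math.log(cat_counts[cat] / len(docs))
    for w in vocab:
        p_w = (word_counts[cat][w] + 1) / (cat_counts[cat] + 2)
        lp += math.log(p_w if w in words else 1 - p_w)
    return lp

new_doc = {"free", "schedule"}
print(max(("spam", "ham"), key=lambda c: log_posterior(new_doc, c)))
```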

Twenty Newsgroups

Given 1000 training documents from each group, learn to classify new
documents according to which newsgroup they came from.

Naïve Bayes: 89% classification accuracy

Learning Curve for 20 Newsgroups

TFIDF

! TFIDF (tf-idf)
! Term Frequency:
TF_w = (number of occurrences of term w in a given class) /
(total number of terms in that class)
! Inverse Document Frequency:
IDF_w = log(total number of documents in the corpus /
(number of documents containing term w + 1))
! TFIDF_w = TF_w × IDF_w: a high frequency of a term within a particular
document, combined with a low document frequency of that term across the
whole collection, yields a high TFIDF weight (see the sketch below)
! PRTFIDF (A Probabilistic Classifier Derived from TFIDF)
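A small sketch of these two formulas on a toy corpus. The documents are invented for illustration, and tf is computed per document rather than per class:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of terms (invented for illustration).
docs = [["free", "money", "free"],
        ["meeting", "schedule"],
        ["free", "schedule", "project"]]

def tfidf(term, doc, corpus):
    tf = Counter(doc)[term] / len(doc)        # term frequency within this document
    df = sum(term in d for d in corpus)       # number of documents containing the term
    idf = math.log(len(corpus) / (df + 1))    # inverse document frequency (with +1)
    return tf * idf

print(tfidf("free", docs[0], docs))   # "free" is in 2 of 3 docs: idf = log(3/3) = 0
print(tfidf("money", docs[0], docs))  # rarer in the corpus, so a higher weight
```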

Example: A Digit Recognizer

! Input: pixel grids

! Output: a digit 0-9

Naïve Bayes for Digits

Simple version:
! One feature Fij for each grid position < i, j >
! Possible feature values are on / off, based on whether
intensity is more or less than 0.5 in underlying image
! Each input maps to a feature vector e.g.,
→ (F0,0 = 0, F0,1 = 0, F0,2 = 1, . . . , F15,15 = 0)
! Here: lots of features, each is binary
Naïve Bayes model:
P(Y|F0,0, . . . , F15,15) ∝ P(Y) ∏_{i,j} P(Fi,j|Y)

What do we need to learn?
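We need the class priors P(Y) and one Bernoulli parameter P(F_{i,j} = on | Y) per pixel and class. The sketch below shows how such estimates would be used at prediction time; the parameter arrays here are random placeholders standing in for values learned from data:

```python
import numpy as np

# Hypothetical learned parameters for a 16x16 binary-pixel model.
rng = np.random.default_rng(0)
prior = np.full(10, 0.1)                           # P(Y): uniform placeholder
pixel_on = rng.uniform(0.05, 0.95, (10, 16, 16))   # P(F_ij = on | Y): placeholder

def predict(image):
    """image: 16x16 array of 0/1 features. Returns the most probable digit."""
    log_post = np.log(prior)
    for y in range(10):
        p = pixel_on[y]
        # log P(Y=y) + sum_ij log P(F_ij | Y=y)
        log_post[y] += np.sum(np.where(image == 1, np.log(p), np.log(1 - p)))
    return int(np.argmax(log_post))

print(predict(rng.integers(0, 2, (16, 16))))
```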

Examples: CPTs

Comments on Naïve Bayes

! Makes probabilistic inference tractable by making a strong


assumption of conditional independence.
! Tends to work fairly well despite this strong assumption.
! Experiments show it to be quite competitive with other
classification methods on standard datasets.
! Particularly popular for text categorization, e.g., spam
filtering.

Summary

! Bayesian networks provide a natural representation for


(causally induced) conditional independence
! Topology + CPTs = compact representation of joint
distribution
! Generally easy for domain experts to construct

! Exact inference by variable elimination:


! polytime on polytrees, NP-hard on general graphs
! space = time, very sensitive to topology
! Naïve Bayes model

Homework

! 3rd edition: 14.12, 14.13

