Lecture 19 - Bayes
$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$
– As we will see, we will be Bayesian about other things as well, e.g., the parameters of the model
Basics of Bayesian Learning
• 𝑃(ℎ) - the prior probability of a hypothesis ℎ
Reflects background knowledge before any data is observed; if there is no prior information, use a uniform distribution.
• 𝑃(𝐷) - the probability that this sample of the data is observed, without any knowledge of the hypothesis
• 𝑃(𝐷|ℎ): The probability of observing the sample 𝐷, given that
hypothesis ℎ is the target
• 𝑃(ℎ|𝐷): The posterior probability of ℎ. The probability that ℎ is
the target, given that 𝐷 has been observed.
Bayes Theorem
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
• 𝑃(ℎ|𝐷) increases with 𝑃(ℎ) and with 𝑃(𝐷|ℎ)
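To make the update rule concrete, here is a minimal Python sketch of Bayes theorem over a small finite hypothesis space; the priors and likelihoods are made-up numbers, purely for illustration.

priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}        # P(h): before seeing data
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.5}   # P(D|h): how well h explains D

evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(D)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posteriors)  # probability mass shifts toward hypotheses that explain D well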
Air-Traffic Data
Days Season Fog Rain Class
Weekday Spring None None On Time
Weekday Winter None Slight On Time
Weekday Winter None None On Time
Holiday Winter High Slight Late
Saturday Summer Normal None On Time
Weekday Autumn Normal None Very Late
Holiday Summer High Slight On Time
Sunday Summer Normal None On Time
Weekday Winter High Heavy Very Late
Weekday Summer None Slight On Time
Air-Traffic Data
Continued from the previous slide…
Days Season Fog Rain Class
Saturday Spring High Heavy Cancelled
Weekday Summer High Slight On Time
Weekday Winter Normal None Late
Weekday Summer High None On Time
Weekday Winter Normal Heavy Very Late
Saturday Autumn High Slight On Time
Weekday Autumn None Heavy On Time
Holiday Spring Normal Slight On Time
Weekday Spring Normal None On Time
Weekday Spring Normal Heavy On Time
Air-Traffic Data
In this database there are four attributes,
A = [Day, Season, Fog, Rain],
with 20 tuples.
The class categories are:
C = [On Time, Late, Very Late, Cancelled]
Given this knowledge of the data and classes, we want to find the most likely classification for an unseen instance, for example:
(Day = Weekday, Season = Winter, Fog = High, Rain = Heavy)
Conditional probabilities for the Day attribute:

Day        On Time       Late        Very Late   Cancelled
Weekday    9/14 = 0.64   1/2 = 0.5   3/3 = 1     0/1 = 0
Saturday   2/14 = 0.14   1/2 = 0.5   0/3 = 0     1/1 = 1

Conditional probabilities for the Fog attribute:

Fog        On Time       Late        Very Late   Cancelled
None       5/14 = 0.36   0/2 = 0     0/3 = 0     0/1 = 0
Naïve Bayesian Classifier
Instance: (Day = Weekday, Season = Winter, Fog = High, Rain = Heavy)
Case 1: Class = On Time : 0.70 × 0.64 × 0.14 × 0.29 × 0.14 ≈ 0.0026
Case 2: Class = Late : 0.10 × 0.5 × 1.0 × 0.5 × 0.0 = 0
Case 3: Class = Very Late : 0.15 × 1.0 × 0.67 × 0.33 × 0.67 = 0.0222
Case 4: Class = Cancelled : 0.05 × 0.0 × 0.0 × 1.0 × 1.0 = 0
Case 3 gives the largest score, so the instance is classified as Very Late.
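As a sanity check, here is a short Python sketch (not from the slides) that recomputes these case scores directly from the 20-row table by counting:

from collections import Counter

# (Day, Season, Fog, Rain, Class), transcribed from the table above
data = [
    ("Weekday", "Spring", "None", "None", "On Time"),
    ("Weekday", "Winter", "None", "Slight", "On Time"),
    ("Weekday", "Winter", "None", "None", "On Time"),
    ("Holiday", "Winter", "High", "Slight", "Late"),
    ("Saturday", "Summer", "Normal", "None", "On Time"),
    ("Weekday", "Autumn", "Normal", "None", "Very Late"),
    ("Holiday", "Summer", "High", "Slight", "On Time"),
    ("Sunday", "Summer", "Normal", "None", "On Time"),
    ("Weekday", "Winter", "High", "Heavy", "Very Late"),
    ("Weekday", "Summer", "None", "Slight", "On Time"),
    ("Saturday", "Spring", "High", "Heavy", "Cancelled"),
    ("Weekday", "Summer", "High", "Slight", "On Time"),
    ("Weekday", "Winter", "Normal", "None", "Late"),
    ("Weekday", "Summer", "High", "None", "On Time"),
    ("Weekday", "Winter", "Normal", "Heavy", "Very Late"),
    ("Saturday", "Autumn", "High", "Slight", "On Time"),
    ("Weekday", "Autumn", "None", "Heavy", "On Time"),
    ("Holiday", "Spring", "Normal", "Slight", "On Time"),
    ("Weekday", "Spring", "Normal", "None", "On Time"),
    ("Weekday", "Spring", "Normal", "Heavy", "On Time"),
]

class_counts = Counter(row[-1] for row in data)

def score(instance, cls):
    """P(cls) * prod_i P(x_i | cls), probabilities estimated by counting."""
    s = class_counts[cls] / len(data)
    for i, value in enumerate(instance):
        match = sum(1 for row in data if row[-1] == cls and row[i] == value)
        s *= match / class_counts[cls]
    return s

instance = ("Weekday", "Winter", "High", "Heavy")
for cls in class_counts:
    print(cls, round(score(instance, cls), 4))
# "Very Late" gets the highest score (~0.0222), matching Case 3 above.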
Naive Bayes

$$v_{MAP} = \arg\max_{v} P(x_1, x_2, \ldots, x_n \mid v)\, P(v)$$

By the chain rule:

$$P(x_1, x_2, \ldots, x_n \mid v_j) = P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2 \mid x_3, \ldots, x_n, v_j) \cdots P(x_n \mid v_j)$$

Assuming the feature values are conditionally independent given the target value (the naive Bayes assumption), this simplifies to:

$$P(x_1, x_2, \ldots, x_n \mid v_j) = \prod_{i=1}^{n} P(x_i \mid v_j)$$
Example
• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)
• Predict: PlayTennis = ?
• Priors: P(PlayTennis = yes) = 9/14 = 0.64, P(PlayTennis = no) = 5/14 = 0.36
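The slide stops at the priors. The conditional probabilities needed to finish the example come from the standard PlayTennis training set (Mitchell's textbook data, which the counts 9/14 and 5/14 match); assuming that table, the computation completes as:

$$P(yes)\, P(sunny \mid yes)\, P(cool \mid yes)\, P(high \mid yes)\, P(strong \mid yes) = \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \approx 0.0053$$

$$P(no)\, P(sunny \mid no)\, P(cool \mid no)\, P(high \mid no)\, P(strong \mid no) = \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \approx 0.0206$$

so the classifier predicts PlayTennis = no.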
Naïve Bayes: Two Classes
• In the case of two classes, $v \in \{0, 1\}$, denote
$$p_i = P(x_i = 1 \mid v = 1), \qquad q_i = P(x_i = 1 \mid v = 0)$$
• We predict that $v = 1$ iff:
$$\frac{P(v = 1) \cdot \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(v = 0) \cdot \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}} > 1$$
• Equivalently, rewriting each factor as $(1 - p_i)\,(p_i / (1 - p_i))^{x_i}$:
$$\frac{P(v = 1) \cdot \prod_{i=1}^{n} (1 - p_i)\, \left( p_i / (1 - p_i) \right)^{x_i}}{P(v = 0) \cdot \prod_{i=1}^{n} (1 - q_i)\, \left( q_i / (1 - q_i) \right)^{x_i}} > 1$$
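Although not shown on the extracted slides, the usual next step is to take logarithms of the ratio above, which makes the naive Bayes decision rule linear in the features:

$$\log\frac{P(v = 1)}{P(v = 0)} + \sum_{i=1}^{n}\log\frac{1 - p_i}{1 - q_i} + \sum_{i=1}^{n}\left(\log\frac{p_i}{1 - p_i} - \log\frac{q_i}{1 - q_i}\right)x_i > 0$$

That is, a rule of the form $b + \sum_i w_i x_i > 0$: naive Bayes with Bernoulli features is a linear separator.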
Naïve Bayes: Continuous Features
• $X_i$ can be continuous
• We can still use
$$P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$$
• And
$$P(Y = y \mid X_1, \ldots, X_n) = \frac{P(Y = y) \prod_i P(X_i \mid Y = y)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$$
• Naïve Bayes classifier:
$$Y = \arg\max_y P(Y = y) \prod_i P(X_i \mid Y = y)$$
• Assumption: $P(X_i \mid Y)$ has a Gaussian distribution
The Gaussian Probability Distribution
• The Gaussian probability distribution is also called the normal distribution.
• It is a continuous distribution with pdf:
$$p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
• $\mu$ = mean of the distribution
• $\sigma^2$ = variance of the distribution
• $x$ is a continuous variable ($-\infty < x < \infty$)
• The probability of $x$ being in the range $[a, b]$ cannot be evaluated analytically; it has to be looked up in a table or computed numerically.
[Figure: plot of the Gaussian pdf p(x) against x]
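Since the slide notes that interval probabilities require a table, here is a small Python sketch (standard library only) of both the pdf and the interval probability via the error function, which is what such a table encodes:

import math

def gaussian_pdf(x, mu, sigma):
    """The pdf N(x; mu, sigma^2) from the slide above."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gaussian_prob(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), using the CDF
    Phi(t) = (1 + erf((t - mu) / (sigma * sqrt(2)))) / 2."""
    z = lambda t: (t - mu) / (sigma * math.sqrt(2))
    return 0.5 * (math.erf(z(b)) - math.erf(z(a)))

print(gaussian_prob(-1.0, 1.0, 0.0, 1.0))  # ~0.6827: about 68% within one sigma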
Naïve Bayes: Continuous Features
• $P(X_i \mid Y)$ is Gaussian
• Training: estimate the mean and standard deviation:
$$\mu_i = E[X_i \mid Y = y], \qquad \sigma_i^2 = E[(X_i - \mu_i)^2 \mid Y = y]$$
Note that the following slides abuse notation significantly. Since $P(x) = 0$ for continuous distributions, we think of $P(X = x \mid Y = y)$ not as a classic probability distribution, but just as a function $f(x) = N(x; \mu, \sigma^2)$. $f(x)$ behaves like a probability distribution in the sense that $\forall x,\ f(x) \geq 0$ and the values integrate to 1. Also note that $f(x)$ satisfies Bayes rule, that is:
$$f_Y(y \mid X = x) = f_X(x \mid Y = y)\, f_Y(y) / f_X(x)$$
Naïve Bayes: Continuous Features
• $P(X_i \mid Y)$ is Gaussian
• Training: estimate the mean and standard deviation:
$$\mu_i = E[X_i \mid Y = y], \qquad \sigma_i^2 = E[(X_i - \mu_i)^2 \mid Y = y]$$

X1     X2    X3    Y
2      3     1     1
−1.2   2     0.4   1
1.2    0.3   0     0
2.2    1.1   0     1
Naïve Bayes: Continuous Features
• $P(X_i \mid Y)$ is Gaussian
• Training: estimate the mean and standard deviation from the table above, e.g. for $X_1$ given $Y = 1$:
$$\mu_1 = E[X_1 \mid Y = 1] = \frac{2 + (-1.2) + 2.2}{3} = 1$$
$$\sigma_1^2 = E[(X_1 - \mu_1)^2 \mid Y = 1] = \frac{(2 - 1)^2 + (-1.2 - 1)^2 + (2.2 - 1)^2}{3} = 2.43$$
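For completeness, a minimal Python sketch (not from the slides) that reproduces these maximum-likelihood estimates from the toy table:

# Toy training set from the slide: rows of (x1, x2, x3) with label y
X = [(2.0, 3.0, 1.0), (-1.2, 2.0, 0.4), (1.2, 0.3, 0.0), (2.2, 1.1, 0.0)]
y = [1, 1, 0, 1]

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature (mean, variance) for each class."""
    params = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        stats = []
        for i in range(len(X[0])):
            vals = [r[i] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals)  # ML estimate, as on the slide
            stats.append((mu, var))
        params[label] = (prior, stats)
    return params

params = fit_gaussian_nb(X, y)
print(params[1][1][0])  # (mu, var) for X1 given Y=1 -> (1.0, 2.4266...)
# Note: class 0 has a single row, so its ML variances are 0;
# practical implementations smooth or floor the variance estimates.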