Lecture 19 - Bayes


Supervised Learning – Classification


Naïve Bayes Method
Bayes’ Rule
• The most important formula in probabilistic machine
learning

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
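
As a quick illustration, here is a minimal Python sketch of the rule. The prior, likelihood, and P(B|not A) below are hypothetical numbers chosen only to show the arithmetic, with P(B) expanded by total probability:

```python
# Minimal sketch of Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
# All numbers below are hypothetical, purely to illustrate the formula.

def posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) from the likelihood, the prior, and the evidence."""
    return p_b_given_a * p_a / p_b

p_a = 0.01             # hypothetical prior P(A)
p_b_given_a = 0.9      # hypothetical likelihood P(B|A)
p_b_given_not_a = 0.1  # hypothetical P(B|not A), used to expand P(B)

# P(B) by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

print(posterior(p_b_given_a, p_a, p_b))  # P(A|B) ~ 0.083
```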
Bayes’ Rule for Machine Learning
• Allows us to reason from evidence to hypotheses
• Another way of thinking about Bayes’ rule: it inverts the conditioning, turning the probability of the evidence given a hypothesis into the probability of the hypothesis given the evidence.
Basics of Bayesian Learning
• Goal: find the best hypothesis from some space 𝐻 of hypotheses, given the observed data (evidence) 𝐷.

• Define best to be: the most probable hypothesis in 𝐻.

• In order to do that, we need to assume a probability distribution over the class 𝐻.

• In addition, we need to know something about the relation between the observed data and the hypotheses (e.g., a coin problem).
  – As we will see, we will be Bayesian about other things too, e.g., the parameters of the model.
Basics of Bayesian Learning
• P(h) - the prior probability of a hypothesis h.
  Reflects background knowledge, before data is observed. If there is no information, assume a uniform distribution.
• P(D) - the probability that this sample of the data is observed (with no knowledge of the hypothesis).
• P(D|h) - the probability of observing the sample D, given that hypothesis h is the target.
• P(h|D) - the posterior probability of h: the probability that h is the target, given that D has been observed.
Bayes Theorem

$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$
• 𝑃(ℎ|𝐷) increases with 𝑃(ℎ) and with 𝑃(𝐷|ℎ)

• 𝑃(ℎ|𝐷) decreases with 𝑃(𝐷)


Example: Bayesian Classification
 Example: Air-Traffic Data
 Let us consider a set of observations recorded in a database regarding the arrival of airplanes on routes from any airport to New Delhi under certain conditions.
Air-Traffic Data
Days Season Fog Rain Class
Weekday Spring None None On Time
Weekday Winter None Slight On Time
Weekday Winter None None On Time
Holiday Winter High Slight Late
Saturday Summer Normal None On Time
Weekday Autumn Normal None Very Late
Holiday Summer High Slight On Time
Sunday Summer Normal None On Time
Weekday Winter High Heavy Very Late
Weekday Summer None Slight On Time

Air-Traffic Data
Contd. from previous slide…
Days Season Fog Rain Class
Saturday Spring High Heavy Cancelled
Weekday Summer High Slight On Time
Weekday Winter Normal None Late
Weekday Summer High None On Time
Weekday Winter Normal Heavy Very Late
Saturday Autumn High Slight On Time
Weekday Autumn None Heavy On Time
Holiday Spring Normal Slight On Time
Weekday Spring Normal None On Time
Weekday Spring Normal Heavy On Time

Air-Traffic Data
 In this database, there are four attributes
  A = [Day, Season, Fog, Rain]
  with 20 tuples.
 The categories of classes are:
  C = [On Time, Late, Very Late, Cancelled]

 Given this knowledge of the data and classes, we are to find the most likely classification for any other unseen instance, for example:

  Weekday  Winter  High  None  ???

 The classification technique should eventually map this tuple to an accurate class.
Naïve Bayesian Classifier
 Example - Air Traffic Dataset: Let us tabulate all the probabilities.

Attribute      On Time        Late          Very Late     Cancelled
Day
  Weekday      9/14 = 0.64    1/2 = 0.5     3/3 = 1       0/1 = 0
  Saturday     2/14 = 0.14    1/2 = 0.5     0/3 = 0       1/1 = 1
  Sunday       1/14 = 0.07    0/2 = 0       0/3 = 0       0/1 = 0
  Holiday      2/14 = 0.14    0/2 = 0       0/3 = 0       0/1 = 0
Season
  Spring       4/14 = 0.29    0/2 = 0       0/3 = 0       0/1 = 0
  Summer       6/14 = 0.43    0/2 = 0       0/3 = 0       0/1 = 0
  Autumn       2/14 = 0.14    0/2 = 0       1/3 = 0.33    0/1 = 0
  Winter       2/14 = 0.14    2/2 = 1       2/3 = 0.67    0/1 = 0
Naïve Bayesian Classifier

Attribute      On Time        Late          Very Late     Cancelled
Fog
  None         5/14 = 0.36    0/2 = 0       0/3 = 0       0/1 = 0
  High         4/14 = 0.29    1/2 = 0.5     1/3 = 0.33    1/1 = 1
  Normal       5/14 = 0.36    1/2 = 0.5     2/3 = 0.67    0/1 = 0
Rain
  None         5/14 = 0.36    1/2 = 0.5     1/3 = 0.33    0/1 = 0
  Slight       8/14 = 0.57    0/2 = 0       0/3 = 0       0/1 = 0
  Heavy        1/14 = 0.07    1/2 = 0.5     2/3 = 0.67    1/1 = 1

Prior Probability  14/20 = 0.70   2/20 = 0.10   3/20 = 0.15   1/20 = 0.05
Naïve Bayesian Classifier
Instance:

  Weekday  Winter  High  Heavy  ???

Case 1: Class = On Time   : 0.70 × 0.64 × 0.14 × 0.29 × 0.07 = 0.0013
Case 2: Class = Late      : 0.10 × 0.50 × 1.0 × 0.50 × 0.50 = 0.0125
Case 3: Class = Very Late : 0.15 × 1.0 × 0.67 × 0.33 × 0.67 = 0.0222
Case 4: Class = Cancelled : 0.05 × 0.0 × 0.0 × 1.0 × 1.0 = 0.0000

Case 3 has the highest score; hence the predicted classification is Very Late.
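
For concreteness, here is a minimal Python sketch of this calculation. It hard-codes the rounded probabilities tabulated on the previous two slides (so the products match the rounded values above) and picks the class with the highest score:

```python
# Minimal sketch of the air-traffic classification above, using the rounded
# probabilities from the tables on the previous slides.

priors = {'On Time': 0.70, 'Late': 0.10, 'Very Late': 0.15, 'Cancelled': 0.05}

# conditional[class][value] for the instance (Weekday, Winter, High fog, Heavy rain)
conditional = {
    'On Time':   {'Weekday': 0.64, 'Winter': 0.14, 'High': 0.29, 'Heavy': 0.07},
    'Late':      {'Weekday': 0.50, 'Winter': 1.00, 'High': 0.50, 'Heavy': 0.50},
    'Very Late': {'Weekday': 1.00, 'Winter': 0.67, 'High': 0.33, 'Heavy': 0.67},
    'Cancelled': {'Weekday': 0.00, 'Winter': 0.00, 'High': 1.00, 'Heavy': 1.00},
}

instance = ['Weekday', 'Winter', 'High', 'Heavy']

# Score each class by prior times the product of the conditionals.
scores = {}
for cls, prior in priors.items():
    score = prior
    for value in instance:
        score *= conditional[cls][value]
    scores[cls] = score

print(scores)                       # On Time ~0.0013, Late 0.0125, Very Late ~0.0222, Cancelled 0.0
print(max(scores, key=scores.get))  # 'Very Late'
```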
Naive Bayes
$v_{MAP} = \arg\max_{v} P(x_1, x_2, \ldots, x_n \mid v)\, P(v)$

$P(x_1, x_2, \ldots, x_n \mid v_j) = P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2, \ldots, x_n \mid v_j)$
$\quad = P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2 \mid x_3, \ldots, x_n, v_j)\, P(x_3, \ldots, x_n \mid v_j)$
$\quad = \cdots$
$\quad = P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2 \mid x_3, \ldots, x_n, v_j)\, P(x_3 \mid x_4, \ldots, x_n, v_j) \cdots P(x_n \mid v_j)$
$\quad = \prod_{i=1}^{n} P(x_i \mid v_j)$

• Assumption: feature values are independent given the target value


Naïve Bayes Example
Day Outlook Temperature Humidity Wind PlayTennis

1 Sunny Hot High Weak No


2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Example
• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no) (6 numbers)
• Compute P(Temp = h/mild/cool | PlayTennis = yes/no) (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no) (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no) (4 numbers)

• Given a new instance:
  (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

• Predict: PlayTennis = ?
Example
• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

• P(PlayTennis = yes) = 9/14 = 0.64        P(PlayTennis = no) = 5/14 = 0.36
• P(outlook = sunny | yes) = 2/9           P(outlook = sunny | no) = 3/5
• P(temp = cool | yes) = 3/9               P(temp = cool | no) = 1/5
• P(humidity = hi | yes) = 3/9             P(humidity = hi | no) = 4/5
• P(wind = strong | yes) = 3/9             P(wind = strong | no) = 3/5

• P(yes, …) ~ 0.0053                       P(no, …) ~ 0.0206

• P(no | instance) = 0.0206 / (0.0053 + 0.0206) = 0.795

What if we were asked about Outlook = OC ?
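
A minimal Python sketch of this computation, using the counts read off the PlayTennis table as exact fractions, so the rounded values above are reproduced:

```python
# Minimal sketch of the PlayTennis computation above, using counts from the table.

from fractions import Fraction as F

# Instance: (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong)
score_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)   # ~ 0.0053
score_no  = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)   # ~ 0.0206

print(float(score_yes), float(score_no))
print(float(score_no / (score_yes + score_no)))  # P(no | instance) ~ 0.795
```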
Additional Material
Naïve Bayes: Two Classes (Why do things work?)

$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(x_i \mid v_j)$

• Notice that the naïve Bayes method gives a method for predicting, rather than an explicit classifier.
• In the case of two classes, $v \in \{0, 1\}$, we predict that $v = 1$ iff:

$\frac{P(v_j = 1) \cdot \prod_{i=1}^{n} P(x_i \mid v_j = 1)}{P(v_j = 0) \cdot \prod_{i=1}^{n} P(x_i \mid v_j = 0)} > 1$

Denote $p_i = P(x_i = 1 \mid v = 1)$ and $q_i = P(x_i = 1 \mid v = 0)$. Then the criterion becomes:

$\frac{P(v_j = 1) \cdot \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(v_j = 0) \cdot \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}} > 1$
Naïve Bayes: Two Classes
• In the case of two classes, $v \in \{0, 1\}$, we predict that $v = 1$ iff:

$\frac{P(v_j = 1) \cdot \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(v_j = 0) \cdot \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}} = \frac{P(v_j = 1) \cdot \prod_{i=1}^{n} (1 - p_i) \left[ p_i / (1 - p_i) \right]^{x_i}}{P(v_j = 0) \cdot \prod_{i=1}^{n} (1 - q_i) \left[ q_i / (1 - q_i) \right]^{x_i}} > 1$

• Take logarithms; we predict $v = 1$ iff

$\log \frac{P(v_j = 1)}{P(v_j = 0)} + \sum_i \log \frac{1 - p_i}{1 - q_i} + \sum_i \left( \log \frac{p_i}{1 - p_i} - \log \frac{q_i}{1 - q_i} \right) x_i > 0$

• We get that naïve Bayes is a linear separator with

$w_i = \log \frac{p_i}{1 - p_i} - \log \frac{q_i}{1 - q_i} = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$

• If $p_i = q_i$ then $w_i = 0$ and the feature is irrelevant.
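
To make the linear-separator view concrete, here is a small Python sketch. The values of p_i, q_i and the prior ratio below are hypothetical placeholders; the point is only how the weights w_i and the bias term are assembled into a linear decision rule:

```python
# Minimal sketch of naïve Bayes as a linear separator over binary features.
# The values of p, q, and prior_ratio are hypothetical placeholders.

import math

p = [0.8, 0.5, 0.1]   # hypothetical p_i = P(x_i = 1 | v = 1)
q = [0.3, 0.5, 0.4]   # hypothetical q_i = P(x_i = 1 | v = 0)
prior_ratio = 1.0     # hypothetical P(v = 1) / P(v = 0)

# Per-feature weights: w_i = log(p_i / (1 - p_i)) - log(q_i / (1 - q_i)).
w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]

# Bias term: prior log-ratio plus the sum of log((1 - p_i) / (1 - q_i)).
bias = math.log(prior_ratio) + sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))

def predicts_one(x):
    """Predict v = 1 iff the linear score is positive."""
    return bias + sum(wi * xi for wi, xi in zip(w, x)) > 0

print(w)                        # note w[1] == 0: the feature with p_i == q_i is irrelevant
print(predicts_one([1, 1, 0]))  # True for this example
```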
Naïve Bayes: Two Classes
• In the case of two classes we have that:

$\log \frac{P(v_j = 1 \mid x)}{P(v_j = 0 \mid x)} = \sum_i w_i x_i - b$

• but since

$P(v_j = 1 \mid x) = 1 - P(v_j = 0 \mid x)$

• We get:

$P(v_j = 1 \mid x) = \frac{1}{1 + \exp\left( -\sum_i w_i x_i + b \right)}$

  (Side note: with $A = 1 - B$ and $\log(B/A) = -C$, we have $\exp(-C) = B/A = (1 - A)/A = 1/A - 1$, so $1 + \exp(-C) = 1/A$ and $A = 1/(1 + \exp(-C))$.)

• Which is simply the logistic function.
• The linearity of NB provides a better explanation for why it works.

Naïve Bayes: Continuous Features
• $X_i$ can be continuous
• We can still use

$P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$

• And

$P(Y = y \mid X_1, \ldots, X_n) = \frac{P(Y = y) \prod_i P(X_i \mid Y = y)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$

• Naïve Bayes classifier:

$Y = \arg\max_y P(Y = y) \prod_i P(X_i \mid Y = y)$

• Assumption: $P(X_i \mid Y)$ has a Gaussian distribution
The Gaussian Probability Distribution
• The Gaussian probability distribution is also called the normal distribution.
• It is a continuous distribution with pdf:

$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

• $\mu$ = mean of the distribution
• $\sigma^2$ = variance of the distribution
• $x$ is a continuous variable ($-\infty < x < \infty$)
• The probability of $x$ being in the range $[a, b]$ cannot be evaluated analytically (it has to be looked up in a table).
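
A one-function Python sketch of this density, implementing only the formula above:

```python
# Minimal sketch of the Gaussian pdf: p(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2)).

import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989, the standard normal density at its mean
```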
Naïve Bayes: Continuous Features
• $P(X_i \mid Y)$ is Gaussian
• Training: estimate the mean and variance
  – $\mu_i = E[X_i \mid Y = y]$
  – $\sigma_i^2 = E[(X_i - \mu_i)^2 \mid Y = y]$

Note that the following slides abuse notation significantly. Since $P(x) = 0$ for continuous distributions, we think of $P(X = x \mid Y = y)$ not as a classic probability distribution, but just as a function $f(x) = N(x, \mu, \sigma^2)$. $f(x)$ behaves as a probability density in the sense that $\forall x,\ f(x) \geq 0$ and it integrates to 1. Also, note that $f(x)$ satisfies Bayes’ rule, that is:

$f_Y(y \mid X = x) = f_X(x \mid Y = y) \, f_Y(y) / f_X(x)$
Naïve Bayes: Continuous Features
• $P(X_i \mid Y)$ is Gaussian
• Training: estimate the mean and variance
  – $\mu_i = E[X_i \mid Y = y]$
  – $\sigma_i^2 = E[(X_i - \mu_i)^2 \mid Y = y]$

  X1     X2    X3    Y
  2      3     1     1
  -1.2   2     0.4   1
  1.2    0.3   0     0
  2.2    1.1   0     1

• For example, for $X_1$ given $Y = 1$:
  – $\mu_1 = E[X_1 \mid Y = 1] = \frac{2 + (-1.2) + 2.2}{3} = 1$
  – $\sigma_1^2 = E[(X_1 - \mu_1)^2 \mid Y = 1] = \frac{(2 - 1)^2 + (-1.2 - 1)^2 + (2.2 - 1)^2}{3} = 2.43$
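
A minimal Python sketch reproducing these estimates from the small table above (class-conditional mean and variance of X1 given Y = 1):

```python
# Minimal sketch of the parameter estimation above: for feature X1 restricted
# to the rows with Y = 1, estimate the class-conditional mean and variance.

rows = [  # (X1, X2, X3, Y) from the table above
    (2.0, 3.0, 1.0, 1),
    (-1.2, 2.0, 0.4, 1),
    (1.2, 0.3, 0.0, 0),
    (2.2, 1.1, 0.0, 1),
]

x1_given_y1 = [x1 for (x1, _, _, y) in rows if y == 1]

mu1 = sum(x1_given_y1) / len(x1_given_y1)
var1 = sum((x - mu1) ** 2 for x in x1_given_y1) / len(x1_given_y1)

print(mu1)   # 1.0
print(var1)  # ~2.43
```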
