Week 6 - Naïve Bayes
Probability Basics

Single event probability: P(X), P(Y)

Joint event probability: P(X, Y)

Conditional probability: P(X|Y), P(Y|X)

Joint and conditional relationship:
P(X, Y) = P(Y|X) · P(X) = P(X|Y) · P(Y)
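These identities can be checked numerically. A minimal sketch in Python, using an illustrative 2x2 joint distribution (the numbers are made up):

import numpy as np

# Illustrative 2x2 joint distribution over binary X (rows) and Y (columns)
P_XY = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_X = P_XY.sum(axis=1)              # marginal P(X)
P_Y_given_X = P_XY / P_X[:, None]   # conditional P(Y|X) = P(X, Y) / P(X)

# Joint and conditional relationship: P(X, Y) = P(Y|X) * P(X)
assert np.allclose(P_XY, P_Y_given_X * P_X[:, None])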
Bayes Theorem Derivation

By the joint and conditional relationship:
P(Y|X) · P(X) = P(X|Y) · P(Y)

To invert the conditional probability:
P(Y|X) = P(X|Y) · P(Y) / P(X)

The evidence term P(X) can be obtained by marginalizing over another variable Z:
P(X) = Σ_Z P(X, Z) = Σ_Z P(X|Z) · P(Z)
Bayes Theorem

P(Y|X) = P(X|Y) · P(Y) / P(X)

posterior = likelihood · prior / evidence
Naïve Bayes Classification

Bayes theorem is applied to classification with Y as the class and X as the observed features:

P(Y|X) = P(X|Y) · P(Y) / P(X)

posterior = likelihood · prior / evidence
Training Naïve Bayes

For each class C, calculate the probability of the class given the features X:
P(C|X) ∝ P(X|C) · P(C)
(The evidence P(X) is the same for every class, so it can be dropped when comparing classes.)
Training Naïve Bayes: The Naïve Assumption

It is difficult to calculate the joint probabilities produced by expanding P(X|C) over all features with the chain rule:
P(C|X) ∝ P(X1, X2, …, Xn|C) · P(C)
       = P(X1|X2, …, Xn, C) · P(X2, …, Xn|C) · P(C)
       = …
Assuming the features are conditionally independent given the class removes the need for those joint probabilities. This is the "naïve" assumption:
P(C|X) ∝ P(C) · ∏_{i=1}^{n} P(Xi|C)
The Log Trick

Multiplying many small probabilities together causes computational instability (underflow):
argmax_{k ∈ {1, …, K}}  P(Ck) · ∏_{i=1}^{n} P(Xi|Ck)

Instead, work with log values and sum the results:
argmax_{k ∈ {1, …, K}}  log P(Ck) + Σ_{i=1}^{n} log P(Xi|Ck)
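A minimal NumPy sketch of the log trick; the class priors and per-feature likelihoods below are illustrative values, not estimates from real data:

import numpy as np

# Illustrative values: 2 classes, 4 observed feature values
log_prior = np.log(np.array([0.6, 0.4]))   # log P(C_k)
log_likelihood = np.log(np.array([
    [0.2, 0.5, 0.3, 0.4],                  # P(X_i | C_1) for each feature value
    [0.6, 0.1, 0.4, 0.5],                  # P(X_i | C_2) for each feature value
]))

# Sum log probabilities instead of multiplying raw probabilities,
# which avoids underflow when there are many features
log_scores = log_prior + log_likelihood.sum(axis=1)
predicted_class = int(np.argmax(log_scores))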
Example: Predicting Tennis With Naïve Bayes
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Example: Training Naïve Bayes Tennis Model

P(Play=Yes) = 9/14        P(Play=No) = 5/14

Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny      2/9       3/5            Hot          2/9       2/5
Overcast   4/9       0/5            Mild         4/9       2/5
Rain       3/9       2/5            Cool         3/9       1/5

Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
High       3/9       4/5            Weak         6/9       2/5
Normal     6/9       1/5            Strong       3/9       3/5

(All conditional probabilities are counted directly from the table above.)
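A sketch of reproducing these tables with pandas, assuming the tennis data is loaded into a DataFrame with the columns shown above:

import pandas as pd

df = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                    "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                    "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal",
                    "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":        ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                    "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "PlayTennis":  ["No","No","Yes","Yes","Yes","No","Yes",
                    "No","Yes","Yes","Yes","Yes","Yes","No"],
})

# Priors: P(Play=Yes), P(Play=No)
print(df["PlayTennis"].value_counts(normalize=True))

# Conditional probability table per feature: P(value | class)
for feature in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(pd.crosstab(df[feature], df["PlayTennis"], normalize="columns"), "\n")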
Example: Predicting Tennis With Naïve Bayes

Predict the outcome for the following query:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Feature                      Play=Yes  Play=No
P(Play)                      9/14      5/14
P(Outlook=Sunny|Play)        2/9       3/5
P(Temperature=Cool|Play)     3/9       1/5
P(Humidity=High|Play)        3/9       4/5
P(Wind=Strong|Play)          3/9       3/5

P(Yes) · P(x'|Yes) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.0053
P(No)  · P(x'|No)  = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.0206

Since 0.0206 > 0.0053, the model predicts PlayTennis = No.
Laplace Smoothing

Problem: categories with no entries result in a conditional probability of 0, which zeroes out the entire product:
P(C|X) ∝ P(X1|C) · P(X2|C) · P(C)    (one zero factor makes the whole product 0)

Solution: add 1 to the numerator and the number of categories to the denominator of every estimate:
P(X1|C) = (Count(X1 & C) + 1) / (Count(C) + n)
P(X2|C) = (Count(X2 & C) + 1) / (Count(C) + m)
where n and m are the number of possible categories of X1 and X2.
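In scikit-learn this is controlled by the additive-smoothing parameter alpha (alpha=1.0 is Laplace smoothing). A minimal sketch with CategoricalNB and a tiny illustrative subset of the tennis features:

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Tiny illustrative training set (Outlook, Temperature)
X = [["Sunny", "Hot"], ["Sunny", "Cool"], ["Rain", "Mild"]]
y = ["No", "Yes", "Yes"]

enc = OrdinalEncoder()            # CategoricalNB expects integer-coded categories
X_enc = enc.fit_transform(X)

model = CategoricalNB(alpha=1.0).fit(X_enc, y)   # alpha=1.0 is Laplace smoothing
print(model.predict(enc.transform([["Rain", "Hot"]])))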
Types of Naïve Bayes

Naïve Bayes Model   Data Type
Gaussian            Continuous
Bernoulli           Binary
Multinomial         Counts
Combining Feature Types
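A minimal sketch of one common way to combine feature types, assuming the columns have already been split into continuous and binary arrays: fit one Naïve Bayes model per feature type, then add their log probabilities, subtracting one copy of the log prior so it is not counted twice. The random data below is illustrative only:

import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

# Illustrative data: 2 continuous features and 2 binary features
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(100, 2))
X_bin = rng.integers(0, 2, size=(100, 2))
y = rng.integers(0, 2, size=100)

gnb = GaussianNB().fit(X_cont, y)
bnb = BernoulliNB().fit(X_bin, y)

# Each predict_log_proba includes the prior; subtract one copy of the
# log prior so the combined score counts it only once
log_prior = np.log(gnb.class_prior_)
joint_log = (gnb.predict_log_proba(X_cont)
             + bnb.predict_log_proba(X_bin)
             - log_prior)
y_pred = gnb.classes_[np.argmax(joint_log, axis=1)]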
Distributed Computing with Naïve Bayes

Naïve Bayes is well suited to large datasets and distributed computing: the model has few parameters, and because the log probabilities combine by summation, the required counts can be accumulated in parallel.

Scikit-Learn's implementations provide a "partial_fit" method designed for out-of-core calculations.
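A sketch of out-of-core training with partial_fit, with the data batches simulated here by random binary features; in practice the batches would be streamed from disk or a cluster:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
classes = np.array([0, 1])        # all classes must be declared on the first call
model = BernoulliNB()

for _ in range(10):               # e.g. chunks streamed from disk
    X_batch = rng.integers(0, 2, size=(1000, 5))
    y_batch = rng.integers(0, 2, size=1000)
    model.partial_fit(X_batch, y_batch, classes=classes)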
Naïve Bayes: The Syntax

Import the class containing the classification method:
from sklearn.naive_bayes import BernoulliNB

Create an instance of the class:
BNB = BernoulliNB()

Fit the instance on the data and then predict the expected value:
BNB = BNB.fit(X_train, y_train)
y_predict = BNB.predict(X_test)
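Putting it together, a self-contained sketch with made-up binary data (illustrative only), including class probabilities via predict_proba:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))     # made-up binary features
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

BNB = BernoulliNB().fit(X_train, y_train)
y_predict = BNB.predict(X_test)
print(BNB.predict_proba(X_test)[:3])      # posterior P(C|X) for the first rows
print(BNB.score(X_test, y_test))          # accuracy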
Generalized Hyperparameter Grid Search

Hyperparameter selection for regularization / better models requires cross validation on the training data.

Linear and logistic regression methods have classes devoted to grid search (e.g. LassoCV).

[Figure: candidate hyperparameter values laid out on a grid over Parameter A x Parameter B]
Grid search can be useful for other methods too, so a generalized approach is desirable.

Scikit-learn contains GridSearchCV, which performs a grid search over parameters using cross validation.
Grid Search with Cross Validation: The Syntax

Import the classes containing the estimator (here logistic regression) and the grid search method:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Create an instance of the estimator and of the grid search class:
LR = LogisticRegression(penalty='l2')
GS = GridSearchCV(LR, param_grid={'C': [0.001, 0.01, 0.1]},
                  scoring='accuracy', cv=4)
Fit the instance on the data to find the best model and then predict:
GS = GS.fit(X_train, y_train)
y_predict = GS.predict(X_test)
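After fitting, the winning configuration can be inspected on the fitted GridSearchCV object:

print(GS.best_params_)            # e.g. {'C': 0.1}
print(GS.best_score_)             # mean cross-validated accuracy of the best model
best_model = GS.best_estimator_   # the best model, refit on the full training set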
Optimizing the Rest of the Pipeline

Grid searches enable model hyperparameters to be optimized.

How can this be incorporated with other steps of the process (e.g. feature extraction and transformation)?

Pipelines!
Automating Machine Learning with Pipelines

Machine learning models are often selected empirically, by trying different processing methods and tuning multiple models.

[Figure: example pipeline: Data → Log Transform → Standard Scaler → KNN → Prediction]
Pipelines in Scikit-Learn allow feature transformation steps and models to be chained together.

Each intermediate step performs 'fit' and 'transform' before sending the data to the next step; the final step fits the model and makes predictions.
Pipelines: The Syntax

Import the pipeline class and the estimator classes:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso

Create an instance of the class with a list of named steps (here a feature scaler followed by a lasso model):
estimators = [('scaler', MinMaxScaler()), ('lasso', Lasso())]
Pipe = Pipeline(estimators)
Fit the instance on the data and then predict the expected value:
Pipe = Pipe.fit(X_train, y_train)
y_predict = Pipe.predict(X_test)
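Tying this back to the grid search material, a sketch of searching over a pipeline's hyperparameters; step parameters use the '<step name>__<parameter>' naming convention, and the alpha values here are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Grid search over a whole pipeline: 'lasso__alpha' targets the 'lasso' step
Pipe = Pipeline([('scaler', MinMaxScaler()), ('lasso', Lasso())])
param_grid = {'lasso__alpha': [0.01, 0.1, 1.0]}   # illustrative values
GS = GridSearchCV(Pipe, param_grid=param_grid, cv=4)
GS = GS.fit(X_train, y_train)     # X_train, y_train as in the slides above
y_predict = GS.predict(X_test)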