Week 6 - Naïve Bayes
Probability Basics

Single event probability: P(X), P(Y)

Joint event probability: P(X, Y)

Conditional probability: P(X|Y), P(Y|X)

Joint and conditional relationship:
P(X, Y) = P(Y|X) · P(X) = P(X|Y) · P(Y)
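These identities can be checked numerically. A minimal sketch in Python, using an illustrative 2x2 joint distribution (the numbers are made up):

import numpy as np

# Illustrative 2x2 joint distribution over binary X (rows) and Y (columns)
P_XY = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_X = P_XY.sum(axis=1)              # marginal P(X)
P_Y_given_X = P_XY / P_X[:, None]   # conditional P(Y|X) = P(X, Y) / P(X)

# Joint and conditional relationship: P(X, Y) = P(Y|X) * P(X)
assert np.allclose(P_XY, P_Y_given_X * P_X[:, None])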
Bayes Theorem Derivation

By the joint and conditional relationship:
P(Y|X) · P(X) = P(X|Y) · P(Y)

To invert the conditional probability:
P(Y|X) = P(X|Y) · P(Y) / P(X)

The evidence term P(X) can be obtained by marginalizing over another variable Z:
P(X) = Σ_Z P(X, Z) = Σ_Z P(X|Z) · P(Z)
Bayes Theorem

P(Y|X) = P(X|Y) · P(Y) / P(X)

posterior = likelihood · prior / evidence
Naïve Bayes Classification

Bayes theorem is applied to classification with Y as the class and X as the observed features:

P(Y|X) = P(X|Y) · P(Y) / P(X)

posterior = likelihood · prior / evidence
Training Naïve Bayes

For each class C, calculate the probability of the class given the features X:
P(C|X) ∝ P(X|C) · P(C)
(The evidence P(X) is the same for every class, so it can be dropped when comparing classes.)
Training Naïve Bayes: The Naïve Assumption

It is difficult to calculate the joint probabilities produced by expanding P(X|C) over all features with the chain rule:
P(C|X) ∝ P(X1, X2, …, Xn|C) · P(C)
       = P(X1|X2, …, Xn, C) · P(X2, …, Xn|C) · P(C)
       = …
Assuming the features are conditionally independent given the class removes the need for those joint probabilities. This is the "naïve" assumption:
P(C|X) ∝ P(C) · ∏_{i=1}^{n} P(Xi|C)
The Log Trick

Multiplying many small probabilities together causes computational instability (underflow):
argmax_{k ∈ {1, …, K}}  P(Ck) · ∏_{i=1}^{n} P(Xi|Ck)

Instead, work with log values and sum the results:
argmax_{k ∈ {1, …, K}}  log P(Ck) + Σ_{i=1}^{n} log P(Xi|Ck)
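A minimal NumPy sketch of the log trick; the class priors and per-feature likelihoods below are illustrative values, not estimates from real data:

import numpy as np

# Illustrative values: 2 classes, 4 observed feature values
log_prior = np.log(np.array([0.6, 0.4]))   # log P(C_k)
log_likelihood = np.log(np.array([
    [0.2, 0.5, 0.3, 0.4],                  # P(X_i | C_1) for each feature value
    [0.6, 0.1, 0.4, 0.5],                  # P(X_i | C_2) for each feature value
]))

# Sum log probabilities instead of multiplying raw probabilities,
# which avoids underflow when there are many features
log_scores = log_prior + log_likelihood.sum(axis=1)
predicted_class = int(np.argmax(log_scores))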
Example: Predicting Tennis With Naïve Bayes
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Example: Training Naïve Bayes Tennis Model

P(Play=Yes) = 9/14        P(Play=No) = 5/14

Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny      2/9       3/5            Hot          2/9       2/5
Overcast   4/9       0/5            Mild         4/9       2/5
Rain       3/9       2/5            Cool         3/9       1/5

Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
High       3/9       4/5            Weak         6/9       2/5
Normal     6/9       1/5            Strong       3/9       3/5

(All conditional probabilities are counted directly from the table above.)
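A sketch of reproducing these tables with pandas, assuming the tennis data is loaded into a DataFrame with the columns shown above:

import pandas as pd

df = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                    "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                    "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal",
                    "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":        ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                    "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "PlayTennis":  ["No","No","Yes","Yes","Yes","No","Yes",
                    "No","Yes","Yes","Yes","Yes","Yes","No"],
})

# Priors: P(Play=Yes), P(Play=No)
print(df["PlayTennis"].value_counts(normalize=True))

# Conditional probability table per feature: P(value | class)
for feature in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(pd.crosstab(df[feature], df["PlayTennis"], normalize="columns"), "\n")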
Example: Predicting Tennis With Naïve Bayes

Predict the outcome for the following query:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Feature                      Play=Yes  Play=No
P(Play)                      9/14      5/14
P(Outlook=Sunny|Play)        2/9       3/5
P(Temperature=Cool|Play)     3/9       1/5
P(Humidity=High|Play)        3/9       4/5
P(Wind=Strong|Play)          3/9       3/5

P(Yes) · P(x'|Yes) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.0053
P(No)  · P(x'|No)  = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.0206

Since 0.0206 > 0.0053, the model predicts PlayTennis = No.
Laplace Smoothing

Problem: categories with no entries result in a conditional probability of 0, which zeroes out the entire product:
P(C|X) ∝ P(X1|C) · P(X2|C) · P(C)    (one zero factor makes the whole product 0)

Solution: add 1 to the numerator and the number of categories to the denominator of every estimate:
P(X1|C) = (Count(X1 & C) + 1) / (Count(C) + n)
P(X2|C) = (Count(X2 & C) + 1) / (Count(C) + m)
where n and m are the number of possible categories of X1 and X2.
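In scikit-learn this is controlled by the additive-smoothing parameter alpha (alpha=1.0 is Laplace smoothing). A minimal sketch with CategoricalNB and a tiny illustrative subset of the tennis features:

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Tiny illustrative training set (Outlook, Temperature)
X = [["Sunny", "Hot"], ["Sunny", "Cool"], ["Rain", "Mild"]]
y = ["No", "Yes", "Yes"]

enc = OrdinalEncoder()            # CategoricalNB expects integer-coded categories
X_enc = enc.fit_transform(X)

model = CategoricalNB(alpha=1.0).fit(X_enc, y)   # alpha=1.0 is Laplace smoothing
print(model.predict(enc.transform([["Rain", "Hot"]])))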
Types of Naïve Bayes

Naïve Bayes Model   Data Type
Gaussian            Continuous
Bernoulli           Binary
Multinomial         Counts
Combining Feature Types
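A minimal sketch of one common way to combine feature types, assuming the columns have already been split into continuous and binary arrays: fit one Naïve Bayes model per feature type, then add their log probabilities, subtracting one copy of the log prior so it is not counted twice. The random data below is illustrative only:

import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

# Illustrative data: 2 continuous features and 2 binary features
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(100, 2))
X_bin = rng.integers(0, 2, size=(100, 2))
y = rng.integers(0, 2, size=100)

gnb = GaussianNB().fit(X_cont, y)
bnb = BernoulliNB().fit(X_bin, y)

# Each predict_log_proba includes the prior; subtract one copy of the
# log prior so the combined score counts it only once
log_prior = np.log(gnb.class_prior_)
joint_log = (gnb.predict_log_proba(X_cont)
             + bnb.predict_log_proba(X_bin)
             - log_prior)
y_pred = gnb.classes_[np.argmax(joint_log, axis=1)]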
Distributed Computing with Naïve Bayes

Naïve Bayes is well suited to large datasets and distributed computing: the model has few parameters, and because the log probabilities combine by summation, the required counts can be accumulated in parallel.

Scikit-Learn's implementations provide a "partial_fit" method designed for out-of-core calculations.
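A sketch of out-of-core training with partial_fit, with the data batches simulated here by random binary features; in practice the batches would be streamed from disk or a cluster:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
classes = np.array([0, 1])        # all classes must be declared on the first call
model = BernoulliNB()

for _ in range(10):               # e.g. chunks streamed from disk
    X_batch = rng.integers(0, 2, size=(1000, 5))
    y_batch = rng.integers(0, 2, size=1000)
    model.partial_fit(X_batch, y_batch, classes=classes)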
Naïve Bayes: The Syntax

Import the class containing the classification method:
from sklearn.naive_bayes import BernoulliNB

Create an instance of the class:
BNB = BernoulliNB()

Fit the instance on the data and then predict the expected value:
BNB = BNB.fit(X_train, y_train)
y_predict = BNB.predict(X_test)
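Putting it together, a self-contained sketch with made-up binary data (illustrative only), including class probabilities via predict_proba:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))     # made-up binary features
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

BNB = BernoulliNB().fit(X_train, y_train)
y_predict = BNB.predict(X_test)
print(BNB.predict_proba(X_test)[:3])      # posterior P(C|X) for the first rows
print(BNB.score(X_test, y_test))          # accuracy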
Generalized Hyperparameter Grid Search

Hyperparameter selection for regularization / better models requires cross validation on the training data.

Linear and logistic regression methods have classes devoted to grid search (e.g. LassoCV).

[Figure: candidate hyperparameter values laid out on a grid over Parameter A x Parameter B]
Grid search can be useful for other methods too, so a generalized approach is desirable.

Scikit-learn contains GridSearchCV, which performs a grid search over parameters using cross validation.
Grid Search with Cross Validation: The Syntax

Import the classes containing the estimator (here logistic regression) and the grid search method:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Create an instance of the estimator and of the grid search class:
LR = LogisticRegression(penalty='l2')
GS = GridSearchCV(LR, param_grid={'C': [0.001, 0.01, 0.1]},
                  scoring='accuracy', cv=4)
Fit the instance on the data to find the best model and then predict:
GS = GS.fit(X_train, y_train)
y_predict = GS.predict(X_test)
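After fitting, the winning configuration can be inspected on the fitted GridSearchCV object:

print(GS.best_params_)            # e.g. {'C': 0.1}
print(GS.best_score_)             # mean cross-validated accuracy of the best model
best_model = GS.best_estimator_   # the best model, refit on the full training set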
Optimizing the Rest of the Pipeline

Grid searches enable model hyperparameters to be optimized.

How can this be incorporated with other steps of the process (e.g. feature extraction and transformation)?

Pipelines!
Automating Machine Learning with Pipelines

Machine learning models are often selected empirically, by trying different processing methods and tuning multiple models.

[Figure: example pipeline: Data → Log Transform → Standard Scaler → KNN → Prediction]
Pipelines in Scikit-Learn allow feature transformation steps and models to be chained together.

Each intermediate step performs 'fit' and 'transform' before sending the data to the next step; the final step fits the model and makes predictions.
Pipelines: The Syntax

Import the pipeline class and the estimator classes:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso

Create an instance of the class with a list of named steps (here a feature scaler followed by a lasso model):
estimators = [('scaler', MinMaxScaler()), ('lasso', Lasso())]
Pipe = Pipeline(estimators)
Fit the instance on the data and then predict the expected value:
Pipe = Pipe.fit(X_train, y_train)
y_predict = Pipe.predict(X_test)
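Tying this back to the grid search material, a sketch of searching over a pipeline's hyperparameters; step parameters use the '<step name>__<parameter>' naming convention, and the alpha values here are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Grid search over a whole pipeline: 'lasso__alpha' targets the 'lasso' step
Pipe = Pipeline([('scaler', MinMaxScaler()), ('lasso', Lasso())])
param_grid = {'lasso__alpha': [0.01, 0.1, 1.0]}   # illustrative values
GS = GridSearchCV(Pipe, param_grid=param_grid, cv=4)
GS = GS.fit(X_train, y_train)     # X_train, y_train as in the slides above
y_predict = GS.predict(X_test)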