Machine Learning With Naive Bayes Course Notes 365 Data Science
Table of Contents
Abstract
1. Motivation
Appendix
Abstract
Naïve Bayes algorithms are supervised machine learning algorithms that are
most often used in classification tasks. Python’s sklearn library offers a variety of
such classifiers. They are based on the famous Bayes’ theorem, named after the
English statistician Thomas Bayes.
Naïve Bayes algorithms are known to be fast learners that solve real-time
problems with ease and handle sparse data well. That is why they are preferred
when faced with tasks involving text analysis, such as spam filtering, categorizing
documents, and sentiment analysis.
These course notes give, in a compact and comprehensive manner, the
essentials for understanding Bayes’ theorem and the Naïve Bayes algorithm. They
expand on the theoretical concepts introduced in the course and could also be
used as a quick reference.
1. Motivation
This section summarizes the advantages and disadvantages of Naïve Bayes
classifiers (Table 1), together with a couple of the most common use-cases.

Table 1: Pros and cons of Naïve Bayes classifiers.

Pros:
• Good predictor performance

Cons:
• Dependencies between the features are not considered
• The predicted probabilities cannot be completely trusted

Common use-cases:
• Text analysis:
o Spam filtering
o Categorizing documents
o Sentiment analysis
• Recommendation systems
In 1763, two years after the passing of the British statistician Thomas Bayes,
his friend Richard Price sends a letter 1 to the British physicist John Canton
containing Bayes’ lecture notes on probability theory. In this letter, Price discusses
the importance of his friend’s work. In Section II of the famous lecture notes, Bayes
describes a thought experiment, a version of which is presented below.
Consider a square table divided into smaller squares in a 9-by-9 grid (Figure
1). The rules are as follows. Each cell in the grid can be occupied by a single ball.
A ball cannot occupy two or more cells simultaneously; that is, once it falls onto
the table, it settles into exactly one cell.
Figure 1: A square table divided into 81 squares. Imagine a red ball is tossed
and lands somewhere on that table. Based on the rules and the conditions for
the blue ball given in the text, where would you put the red ball?
1 https://royalsocietypublishing.org/doi/pdf/10.1098/rstl.1763.0053
Figure 2: The orientation of the table (north at the top, east to the right).
Imagine a red ball is tossed and it lands somewhere on the table. The goal of
this exercise is to guess the position of that ball based on the positions of other,
blue, balls relative to it.
Example:
Consider the positions of the red and the blue ball in Figure 1 (not known a
priori). The information you would be given, based on this arrangement, is:
“A blue ball is tossed, and it lands 2 squares north and 3 squares west of the red
ball.”
We are now prepared for the thought experiment. Five blue balls are tossed in a
row, and each lands in a cell on the table. The position of each one with respect to the
red ball is announced:
1. A blue ball is tossed, and it lands 2 squares north and 1 square west of the
red ball.
2. A second blue ball is tossed, and it lands 1 square north and 3 squares east
of the red ball.
3. A third blue ball is tossed, and it lands 4 squares north and 3 squares west
of the red ball.
4. A fourth blue ball is tossed, and it lands 5 squares north and 2 squares east
of the red ball.
5. A fifth blue ball is tossed, and it lands 3 squares south and 5 squares east of
the red ball.
Which cell do you think the red ball has landed in? You can find a discussion of
the solution in the Appendix.
3. Bayes’ Theorem
There are two ways to approach the topic of probability. The first one is the
frequentist approach, in which the probability of an event is estimated by repeating
an experiment a large number of times and recording the fraction of trials that
result in a particular outcome.
Example 1:
Consider a darts player throwing darts at a circular target divided into
layers, with the outermost giving the least number of points and the innermost,
the bull’s eye, giving the most.
The darts player wonders what her chances of hitting the bull’s eye are. She
can resolve this problem by throwing the darts at the target 10 times. Assume that
the dart ends up hitting the center twice. The player, therefore, concludes that 2
out of 10 times, or once every 5 throws, she will be awarded the maximum number
of points. The larger the number of throws, the more data would be obtained,
therefore the better the predictions would be. This example assumes, of course,
that practice does not make the player better at throwing darts.
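The frequentist idea can be sketched in a few lines of Python. The simulation below is purely illustrative (it is not part of the course materials): it assumes the player hits the bull’s eye with a fixed probability of 0.2, matching the 2-out-of-10 example, and estimates that probability from the relative frequency of hits.

```python
import random

random.seed(42)  # make the simulation reproducible

TRUE_HIT_PROBABILITY = 0.2  # assumed, matching the 2-out-of-10 example
n_throws = 100_000

# Count how often a simulated throw hits the bull's eye
hits = sum(random.random() < TRUE_HIT_PROBABILITY for _ in range(n_throws))

estimate = hits / n_throws
print(f"Estimated hit probability: {estimate:.3f}")
```

The larger the number of throws, the closer the relative frequency gets to the true value, which is exactly the frequentist notion of probability.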
The second way is the Bayesian approach, which estimates the probability that a
certain hypothesis is true given past data (evidence). The difference with the
frequentist approach is that the event whose outcome we are trying to predict
need not be repeatable.
Example 2:
Consider the scenario from Example 1. This time, however, the already
experienced player takes part in a darts championship and is wondering what her
chances of winning the game are. Unfortunately, this is not an experiment that can
be conducted repeatedly – such championships are held only once every couple of
years, so sufficient data would be hard to gather. Moreover, the outcome depends
not only on the practice and experience she has gained but also on circumstances
specific to this particular game, which would occur only once. Questions of this
kind are exactly what Bayesian statistics is designed to address.
Consider two sets of events, A and B.
• Marginal probability – the probability of an event from set A occurring,
regardless of any other events. It is denoted by P(A).
Example:
Set A represents the event of drawing a Queen from a standard deck of 52
cards. The event of drawing the Queen of Spades, or the Queen of Clubs, or any
other Queen from the deck, belongs to set A. Set B represents the event of
drawing a Spade. The event of drawing the Queen of Spades, or the King of
Spades, or any other Spade from the deck, belongs to set B.
Since there are 4 Queens and 13 Spades in a deck of 52 cards, we can write
P(A) = 4/52
P(B) = 13/52
• Joint probability – the probability of an event from set A and an event from
set B occurring simultaneously. It is denoted by P(A ∩ B). Note that
P(A ∩ B) = P(B ∩ A)
Example:
The only card that is both a Queen and a Spade is the Queen of Spades.
Therefore,
P(A ∩ B) = 1/52
• Conditional probability – the probability of an event from set A occurring
given that an event from set B has occurred. It is denoted by P(A|B). Note
that, in general,
P(A|B) ≠ P(B|A)
“A given B” would be the event of drawing a Queen given that the set
of cards to choose from are all Spades. There is only one Queen in a set of 13
Spades. Therefore,
P(A|B) = 1/13
“B given A” would be the event of drawing a Spade given that the set
of cards to choose from are all Queens. There is only one Spade in a set of 4
Queens. Therefore,
P(B|A) = 1/4
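These card probabilities can be verified by brute-force enumeration. The sketch below, an illustrative addition to the notes, builds a 52-card deck and counts outcomes using exact fractions:

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["Spades", "Hearts", "Diamonds", "Clubs"]
deck = list(product(ranks, suits))  # all 52 cards

def is_a(card):  # event A: drawing a Queen
    return card[0] == "Q"

def is_b(card):  # event B: drawing a Spade
    return card[1] == "Spades"

p_a = Fraction(sum(map(is_a, deck)), len(deck))                    # 4/52
p_b = Fraction(sum(map(is_b, deck)), len(deck))                    # 13/52
p_ab = Fraction(sum(is_a(c) and is_b(c) for c in deck), len(deck)) # 1/52

p_a_given_b = p_ab / p_b  # 1/13
p_b_given_a = p_ab / p_a  # 1/4
print(p_a_given_b, p_b_given_a)  # 1/13 1/4
```

The conditional probabilities fall out of the joint and marginal counts exactly as in the derivation that follows.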
To derive the product rule, let’s first make the following comment. Marginal
probabilities are estimated by taking the ratio between the favorable outcomes and
all outcomes:
P(A) = (favorable outcomes) / (all outcomes)
Conditional probabilities can be estimated in the same way. In this
case, the favorable outcomes are those entering the intersection between the two
sets A and B. The probability of obtaining a favorable outcome is P(A ∩ B). All
outcomes, on the other hand, are those for which event B takes place, no matter
whether an event from set A occurs as well. Therefore,
P(A|B) = P(A ∩ B) / P(B)
Analogously,
P(B|A) = P(A ∩ B) / P(A)
Rearranging the two expressions gives the product rule,
P(A ∩ B) = P(A|B)P(B)
and analogously
P(A ∩ B) = P(B|A)P(A)
The result from the previous subsection showed that the intersection of two
events can be expressed in two equivalent ways. Equating them, we get
P(A|B)P(B) = P(B|A)P(A)
Upon dividing both sides by P(B), we obtain the celebrated Bayes’ theorem:
P(A|B) = P(B|A)P(A) / P(B)
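As a quick sanity check, plugging in the numbers from the card example above:
P(A|B) = P(B|A)P(A) / P(B) = (1/4 ⋅ 4/52) / (13/52) = (1/52) / (13/52) = 1/13
which matches the value obtained earlier by direct counting.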
The result is spectacular in the sense that it connects the two conditional
probabilities, P(A|B) and P(B|A). In Bayesian inference, the theorem is typically
written in terms of a hypothesis H and evidence E:
P(H|E) = P(E|H)P(H) / P(E)
To find the probability of our hypothesis being true given the evidence, we
need the probability of the evidence being true, given that our hypothesis holds.
We would additionally need to provide the marginal probabilities of the hypothesis
and the evidence being true individually. Each term in the theorem carries a
specific name.
P(H|E) = P(E|H)P(H) / P(E)
• Posterior probability, P(H|E) – the probability of the hypothesis being true
given the evidence.
• Likelihood function, P(E|H) – the probability of the evidence being true given the
hypothesis.
• Prior probability, P(H) – the probability of the hypothesis being true before
any evidence is observed.
• The normalization constant, P(E) – this term makes sure that, once all n
competing hypotheses H₁, …, Hₙ are accounted for, the conditional
probabilities sum up to one:
P(H₁|E) + P(H₂|E) + ⋯ + P(Hₙ|E) = 1
When applying the Naïve Bayes machine learning algorithm, one is typically
interested in the hypothesis with the largest conditional probability rather than the
exact value of that probability. Since the normalization constant P(E) is the same
for every hypothesis, it can be omitted:
P(H|E) ∝ P(E|H)P(H)
The setting in which the Naïve Bayes machine learning algorithm is used the
most is text analysis. The example below is a continuation of the example from
“The ham-or-spam example” video lecture and the discussion that follows it in the
course.
Example:
Suppose the training set consists of 20 ham and 20 spam messages.
The marginal (prior) probabilities of an email being ham or spam are therefore
P(ham) = P(spam) = 1/2
The number of times each word of interest appears in the two classes is given in
Table 2.

Table 2: Word counts in the ham and spam messages.

Word              Ham   Spam
dear                5      3
deadline            3      1
lecture             7     10
notes               9      9
assignment          6      7
student            15      0
Total word count   45     30
According to the table, the word “student” has not appeared in any of the 20
spam messages. Therefore, even if an incoming message that contains the word
“student” is spam, it will not be considered as such by the model, as the conditional
probability P(student|spam) equals zero and nullifies the whole product. To
mitigate this problem, we introduce the so-called smoothing parameter, 𝛼𝛼. Let’s
set its value equal to 1 (Laplace smoothing). The purpose of this parameter is to
increase the count of each word in each class by 𝛼𝛼, so that no count is ever zero.
The updated counts are shown below.
Table 3: Word counts after applying Laplace smoothing (α = 1).

Word              Ham   Spam
dear                6      4
deadline            4      2
lecture             8     11
notes              10     10
assignment          7      8
student            16      1
Total count        51     36
Therefore,
P(ham|dear, deadline, student) ∝ (6/51) ⋅ (4/51) ⋅ (16/51) ⋅ (1/2) ≈ 1.4 × 10⁻³
P(spam|dear, deadline, student) ∝ (4/36) ⋅ (2/36) ⋅ (1/36) ⋅ (1/2) ≈ 8.6 × 10⁻⁵
To substitute the proportionality sign with an equality sign, we need to further divide
each expression by the normalization constant, 1.4 × 10⁻³ + 8.6 × 10⁻⁵ ≈ 1.486 × 10⁻³.
The conditional probabilities for the message to belong to the ham or spam classes
are the following:
P(ham|dear, deadline, student) ≈ (1.4 × 10⁻³) / (1.486 × 10⁻³) ≈ 94%
P(spam|dear, deadline, student) ≈ (8.6 × 10⁻⁵) / (1.486 × 10⁻³) ≈ 5.7%
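The whole smoothed calculation can be reproduced in a few lines of Python. This is a from-scratch sketch of the arithmetic above, not sklearn’s implementation:

```python
# Smoothed word counts from Table 3 (alpha = 1 already applied)
counts = {
    "ham":  {"dear": 6, "deadline": 4, "lecture": 8,  "notes": 10, "assignment": 7, "student": 16},
    "spam": {"dear": 4, "deadline": 2, "lecture": 11, "notes": 10, "assignment": 8, "student": 1},
}
totals = {cls: sum(words.values()) for cls, words in counts.items()}  # ham: 51, spam: 36
prior = {"ham": 0.5, "spam": 0.5}

message = ["dear", "deadline", "student"]

# Unnormalized posteriors: the product of word likelihoods times the prior
score = {}
for cls in counts:
    p = prior[cls]
    for word in message:
        p *= counts[cls][word] / totals[cls]
    score[cls] = p

# Divide by the normalization constant so the posteriors sum to one
norm = sum(score.values())
posterior = {cls: s / norm for cls, s in score.items()}
print(posterior)  # ham ≈ 0.94, spam ≈ 0.06
```

Computing with the exact (unrounded) intermediate values gives roughly 94% and 6%, in agreement with the rounded figures above.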
If the two results are now rounded, their sum would indeed equal 100%. In
this way, not only did we learn which class the message belongs to, but we also
obtained a measure of how certain the model is in its prediction.
4. Creating a machine learning model
In this section, we outline the most important steps that need to be executed
when creating a machine learning model. It is important that these steps are
performed in the order given below.
First and foremost, we need to create a pandas DataFrame where all inputs
and targets are organized. Of course, a pandas DataFrame is not the only way to
store a database, but it proves to be very useful. You are welcome to experiment
with other means, but keep in mind that the train_test_split() method accepts the
inputs and targets in the form of lists, NumPy arrays, SciPy sparse matrices, as
well as pandas DataFrames. Before moving forward to the next step, do check for
any null values in the data. There are various techniques to deal with this issue.
One would be to remove the samples containing missing values altogether. This,
however, can be done only if the number of such samples is much smaller than the
number of all samples in the dataset. As a rule of thumb, if the samples with
missing values constitute no more than 5% of the total number of samples, then
removing them from the database should be safe. If that is not the case, statistical
methods for filling in the missing values, such as replacing them with the mean or
the median, could be applied instead. It is also advisable to search for and remove
any outliers in the data. The presence of samples with obscure values could
prevent the model from learning correctly.
Next, split the data into training and testing sets using, for example,
sklearn’s train_test_split() method. A common choice is to dedicate 80% of the
data to the training set and 20% to the test set. Other splits
such as 90:10, or 70:30 could, of course, be used as well. Use the training data to
fit the model and the test data to evaluate its performance.
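In sklearn this is done with train_test_split(); the pure-Python sketch below mimics its core behavior (shuffle, then slice) purely for illustration:

```python
import random

def simple_train_test_split(inputs, targets, test_size=0.2, seed=42):
    """Shuffle the samples, then slice them into training and test sets."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    split = int(len(indices) * (1 - test_size))
    train_idx, test_idx = indices[:split], indices[split:]
    x_train = [inputs[i] for i in train_idx]
    x_test = [inputs[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# An 80:20 split of 10 samples leaves 8 for training and 2 for testing
x_train, x_test, y_train, y_test = simple_train_test_split(list(range(10)), list(range(10)))
print(len(x_train), len(x_test))  # 8 2
```

Note that inputs and targets are shuffled with the same indices, so each sample stays paired with its label.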
This step of splitting the data is one of the most common ways to avoid
overfitting. A model is overfitted when it fits the training data so well that it also
captures random noise in the data, which affects its predictions. This is why the
model’s performance must always be evaluated on unseen, test data.
5. Data wrangling
In this step, we prepare the data for the classifier. Some classifiers, such as
K-nearest neighbors, perform best with standardized inputs, which usually implies
transforming the data. Others, such as the Multinomial Naïve Bayes classifier,
which is mainly used for text analysis, require the inputs to be organized as word
counts.
Crucially, all such transformations must be fit on the training data only. For
standardization, for example, the knowledge of the mean and standard deviation
must come from the training set; these statistics should never be computed on the
whole dataset, that is, before train-test splitting. Doing so could lead to data leakage.
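The fit-on-training-data-only rule can be illustrated with a minimal bag-of-words vocabulary builder, a toy stand-in for sklearn’s CountVectorizer (the documents below are invented examples):

```python
# Toy training and test messages (hypothetical examples)
train_docs = ["dear student the deadline is near", "lecture notes and assignment"]
test_docs = ["dear student new lecture notes"]

# Fit: the vocabulary is learned from the TRAINING documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def transform(doc, vocab):
    """Count occurrences of each vocabulary word; out-of-vocabulary words are ignored."""
    words = doc.split()
    return [words.count(term) for term in vocab]

x_test = [transform(doc, vocabulary) for doc in test_docs]
# "new" never appeared in the training data, so it gets no column:
print("new" in vocabulary)  # False
```

Fitting the vocabulary on the full dataset instead would leak information about the test set into the model, which is exactly the data-leakage problem described above.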
In this step, the appropriate classifier for the task is chosen, it is fit to the
training data, and its parameters are tuned.
Once the model is created and fine-tuned, it is time to test it on a new dataset.
Metrics such as accuracy, precision, recall, and F1 score are studied in the next
section.
6. Relevant Metrics
In this section, we introduce some of the relevant metrics that could be used
to evaluate the performance of a model on a classification task.
6.1. Confusion matrix
A confusion matrix, C, is constructed such that each entry, C_ij, equals the
number of samples whose true class is i and whose predicted class is j. Consider
first a binary problem in which each sample belongs to only one of two classes. We
denote these two classes by 0 and 1 and, for the time being, define 1 to be the
positive class. This would result in the confusion matrix from Figure 4.
                Predicted label
                 0    1
True label  0   TN   FP
            1   FN   TP

Figure 4: A 2 × 2 confusion matrix denoting the cells representing the true and false positives and negatives.
Here, class 1 is defined as the positive one.
• Top-left cell – true negatives (TN). This is the number of samples whose
true class is 0 and the model has correctly classified them as such.
• Top-right cell – false positives (FP). This is the number of samples whose
true class is 0, but the model has incorrectly classified them as 1.
• Bottom-left cell – false negatives (FN). This is the number of samples whose
true class is 1, but the model has incorrectly classified them as 0.
• Bottom-right cell – true positives (TP). This is the number of samples whose
true class is 1 and the model has correctly classified them as such.
Example:
Figure 5: The figure shows a 2 × 2 confusion matrix in the form of a heatmap. A brighter color designates a
higher number in the cell. We see that there are 167 true negative samples, 23 false positives, 5 false
negatives, and 196 true positives.
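The four cells of the matrix in Figure 5 can be unpacked programmatically. A plain-Python sketch:

```python
# The 2x2 confusion matrix from Figure 5: rows are true labels, columns are predictions
cm = [[167, 23],
      [5, 196]]

# With class 1 as the positive class:
tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp
print(tn, fp, fn, tp, total)  # 167 23 5 196 391
```

(sklearn’s confusion_matrix function produces the same row/column layout: true labels along the rows, predicted labels along the columns.)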
Confusion matrices can be constructed for multiclass problems as well. Consider a
problem with three classes, where class 1 is again defined as the positive class.
This makes classes 0 and 2 negative. The confusion matrix would then take the
form shown in Figure 6:

                Predicted label
                 0    1    2
True label  0   TN   FP   TN
            1   FN   TP   FN
            2   TN   FP   TN

Figure 6: A 3 × 3 confusion matrix denoting the cells representing the true and false positives and negatives.
Here, class 1 is defined as the positive one.
Example:
Figure 7: The figure shows a 3 × 3 confusion matrix in the form of a heatmap. A brighter color designates a
higher number in the cell. Reading the cells off row by row: 195, 5, 0; 5, 186, 9; 3, 34, 163.
6.2. Accuracy
Accuracy measures the fraction of samples that the model has classified correctly:
Accuracy = (TN + TP) / (TN + FP + FN + TP)
Example 1:
Using the confusion matrix from Figure 5, let’s calculate the accuracy of the
model.
Accuracy = (167 + 196) / (167 + 23 + 5 + 196) ≈ 0.93
Example 2:
Using the confusion matrix from Figure 7, let’s calculate the accuracy of the
model. For more than two classes, the numerator contains the sum of the diagonal
entries, that is, all correctly classified samples:
Accuracy = (195 + 186 + 163) / 600 ≈ 0.91
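The diagonal-over-total formulation works for any number of classes. A quick check of both examples:

```python
def accuracy(cm):
    """Correctly classified samples (the diagonal) over all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

cm_2x2 = [[167, 23], [5, 196]]
cm_3x3 = [[195, 5, 0], [5, 186, 9], [3, 34, 163]]  # values read off Figure 7

print(round(accuracy(cm_2x2), 2))  # 0.93
print(round(accuracy(cm_3x3), 2))  # 0.91
```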
6.3. Precision
Precision measures what fraction of the samples classified as positive are truly
positive:
Precision = TP / (TP + FP)
Example 1:
Using the confusion matrix from Figure 5, let’s calculate the precision of the
model for both classes. To calculate the precision for class 0, we define that class as
positive, while class 1 becomes negative. Applying the formula above, we obtain
Precision0 = 167 / (167 + 5) ≈ 0.97
Analogously, to calculate the precision for class 1, we define that class as positive,
while class 0 becomes negative:
Precision1 = 196 / (196 + 23) ≈ 0.89
Example 2:
Using the confusion matrix from Figure 7, let’s calculate the precision of the
model for all three classes. To calculate the precision for class 0, we define that class
as positive, while classes 1 and 2 become negative. Applying the formula, we
obtain
Precision0 = 195 / (195 + 5 + 3) ≈ 0.96
Analogously, for classes 1 and 2,
Precision1 = 186 / (186 + 5 + 34) ≈ 0.83
Precision2 = 163 / (163 + 9 + 0) ≈ 0.95
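In the multiclass case, the precision of class k is the diagonal entry divided by the sum of column k (everything predicted as k). A short sketch:

```python
def precision(cm, k):
    """Precision of class k: true positives over everything predicted as class k."""
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    return cm[k][k] / predicted_as_k

cm = [[195, 5, 0], [5, 186, 9], [3, 34, 163]]  # the matrix behind Figure 7
print([round(precision(cm, k), 2) for k in range(3)])  # [0.96, 0.83, 0.95]
```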
6.4. Recall
Recall measures what fraction of the truly positive samples the model has
classified as positive:
Recall = TP / (TP + FN)
Example 1:
Using the confusion matrix from Figure 5, let’s calculate the recall of the
model for both classes. To calculate the recall for class 0, we define that class as
positive, while class 1 becomes negative. Applying the formula above, we obtain
Recall0 = 167 / (167 + 23) ≈ 0.88
Analogously, to calculate the recall for class 1, we define that class as positive,
while class 0 becomes negative:
Recall1 = 196 / (196 + 5) ≈ 0.98
Example 2:
Using the confusion matrix from Figure 7, let’s calculate the recall of the
model for all three classes. To calculate the recall for class 0, we define that
class as positive, while classes 1 and 2 become negative. Applying the definition,
we obtain
Recall0 = 195 / (195 + 5 + 0) ≈ 0.98
Analogously, for classes 1 and 2,
Recall1 = 186 / (186 + 5 + 9) ≈ 0.93
Recall2 = 163 / (163 + 3 + 34) ≈ 0.82
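Mirroring the precision sketch, the recall of class k is the diagonal entry divided by the sum of row k (all samples that truly belong to class k):

```python
def recall(cm, k):
    """Recall of class k: true positives over all samples that truly belong to k."""
    truly_k = sum(cm[k])  # row sum
    return cm[k][k] / truly_k

cm = [[195, 5, 0], [5, 186, 9], [3, 34, 163]]  # the matrix behind Figure 7
print([round(recall(cm, k), 3) for k in range(3)])  # [0.975, 0.93, 0.815]
```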
6.5. F1 score
The F1 score is the harmonic mean of precision and recall:
F1 = 2 / (1/precision + 1/recall)
The F1 score can be thought of as putting precision and recall into a single
metric. Contrary to taking the simple arithmetic mean of precision and recall, the
F1 score penalizes low values more heavily. That is to say, if either precision or
recall is very low, while the other is high, the F1 score would be significantly lower
than their arithmetic mean.
In the examples below, we calculate the F1 score for class 0 for both the 2 × 2
and the 3 × 3 confusion matrices. Calculating the F1 score for the rest of the classes
is left as an exercise to the reader. The answers are provided in Section 6.6 in the
form of a classification report.
Example 1:
Using the confusion matrix from Figure 5, we calculated the precision and
recall values for class 0 to be 0.97 and 0.88, respectively. Applying the definition, the
F1 score is
F1 = 2 / (1/0.97 + 1/0.88) ≈ 0.923
For comparison, the arithmetic mean of the two values is
Arithmetic mean = (0.97 + 0.88) / 2 = 0.925
We see that the arithmetic mean is slightly larger than the F1 score. The difference
here is not big as both precision and recall are quite high in value. The discrepancy
between the F1 score and the arithmetic mean would be much more apparent if the
two values differed more drastically.
Example 2:
Using the confusion matrix from Figure 7, we calculated the precision and
recall values for class 0 to be 0.96 and 0.98, respectively. Applying the definition, the
F1 score is
F1 = 2 / (1/0.96 + 1/0.98) ≈ 0.970
For comparison,
Arithmetic mean = (0.96 + 0.98) / 2 = 0.97
When precision is almost equal to recall, the F1 score and the arithmetic mean bear
almost identical values.
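The harmonic-versus-arithmetic comparison is easy to verify in code. The second call below uses invented, deliberately lopsided values to show how hard the F1 score punishes a low recall:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)

# Class 0 of the 2x2 example: precision 0.97, recall 0.88
print(round(f1(0.97, 0.88), 3))  # 0.923

# A lopsided (hypothetical) case: a very low recall drags the F1 score
# far below the arithmetic mean
print(round(f1(0.95, 0.10), 3), (0.95 + 0.10) / 2)  # 0.181 0.525
```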
6.6. Summary
Table 4: A classification report for the model that has output the 2 × 2 confusion matrix.

Class   Precision   Recall   F1 score   Accuracy
0            0.97     0.88       0.92       0.93
1            0.89     0.98       0.93
Table 5: A classification report for the model that has output the 3 × 3 confusion matrix.

Class   Precision   Recall   F1 score   Accuracy
0            0.96     0.98       0.97       0.91
1            0.83     0.93       0.88
2            0.95     0.82       0.88
Appendix
With each toss of a blue ball, the number of possible spots for the red ball
becomes smaller and smaller. In Figures 8 to 12 below, you can see the
possible choices going from 81 down to 4. In Figure 13, the original setup of the
problem is revealed.
Figure 8: The possible spots for the red ball after gaining information on the position of the third blue ball with respect to the red ball.
Figure 9: The possible spots for the red ball after gaining information on the position of the second blue ball with respect to the red ball.
Figure 10: The possible spots for the red ball after gaining information on the position of the first blue ball with respect to the red ball.
Figure 11: The possible spots for the red ball after gaining information on the position of the fourth blue ball with respect to the red ball.
Figure 12: The possible spots for the red ball after gaining information on the position of the fifth blue ball with respect to the red ball.
Copyright 2022 365 Data Science Ltd. Reproduction is forbidden unless authorized. All rights reserved.