Bayesian Inference
1 Overview
This project has been broken down into the following steps:
1.0.1 Step 0: Introduction to the Naive Bayes Theorem
Bayes Theorem is one of the earliest probabilistic inference algorithms. It was developed by Reverend Bayes (who used it to try to infer the existence of God, no less), and it still performs extremely well for certain use cases.
It's best to understand this theorem using an example. Let's say you are a member of the Secret Service and you have been deployed to protect the Democratic presidential nominee during one of his/her campaign speeches. Since the event is public and open to all, your job is not easy and you have to be on constant lookout for threats. One place to start is to assign a threat factor to each person. Based on an individual's features, such as age, sex, and smaller factors like whether the person is carrying a bag or looks nervous, you can make a judgment call as to whether that person is a viable threat.
If an individual ticks all the boxes up to a level where it crosses a threshold of doubt in your
mind, you can take action and remove that person from the vicinity. Bayes Theorem works in the
same way, as we are computing the probability of an event (a person being a threat) based on the
probabilities of certain related events (age, sex, presence of bag or not, nervousness of the person,
etc.).
One thing to consider is the independence of these features from each other. For example, if a child looks nervous at the event, the likelihood of that person being a threat is not as high as if it were a nervous grown man. To break this down a bit further, there are two features we are considering here: age AND nervousness. If we looked at these features individually, we could design a model that flags ALL nervous persons as potential threats. However, we would likely have a lot of false positives, as there is a strong chance that minors present at the event will be nervous. By considering the age of a person along with the 'nervousness' feature, we would get a more accurate result as to who is a potential threat and who isn't.
This is the 'naive' part of the theorem: it considers each feature to be independent of the others, which may not always be the case, and that can affect the final judgment.
In short, Bayes Theorem calculates the probability of a certain event happening (in our case,
a message being spam) based on the joint probabilistic distributions of certain other events (in
our case, the appearance of certain words in a message). We will dive into the workings of Bayes
Theorem later in the mission, but first, let us understand the data we are going to work with.
Instructions: * Import the dataset into a pandas dataframe using the read_table method. The file has already been downloaded, and you can access it using the filepath 'smsspamcollection/SMSSpamCollection'. Because this is a tab-separated dataset, we will be using '\t' as the value for the 'sep' argument, which specifies this format. * Also, rename the column names by specifying a list ['label', 'sms_message'] to the 'names' argument of read_table(). * Print the first five values of the dataframe with the new column names.
In [16]: # '!' allows you to run bash commands from jupyter notebook.
print("List all the files in the current directory\n")
!ls
# The required data table could be found under smsspamcollection/SMSSpamCollection
print("\n List all the files inside the smsspamcollection directory\n")
!ls smsspamcollection
Bayesian_Inference.ipynb images
Bayesian_Inference_solution.ipynb smsspamcollection
readme SMSSpamCollection
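Since the cell that actually loads the data into a dataframe does not appear in this export, here is a minimal sketch of what it might look like, following the instructions above; the dataframe is called df only because the later cells use that name:
In [ ]: # Minimal sketch: load the tab-separated SMS collection into a pandas dataframe
import pandas as pd
df = pd.read_table('smsspamcollection/SMSSpamCollection',
                   sep='\t',
                   names=['label', 'sms_message'])
# Print the first five rows with the new column names
df.head()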
Instructions: * Convert the values in the 'label' column to numerical values using the map method as follows: {'ham':0, 'spam':1}. This maps the 'ham' value to 0 and the 'spam' value to 1. * Also, to get an idea of the size of the dataset we are dealing with, print out the number of rows and columns using 'shape'.
In [45]: '''
Solution
'''
df['label'] = df['label'].map({'ham':0, 'spam':1})
print(df.shape)
df.head()
In a later step we will use scikit-learn's CountVectorizer method to build this Bag of Words representation for us. It does the following:
• It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
• It counts the occurrence of each of those tokens.
Please note:
• The CountVectorizer method automatically converts all tokenized words to their lower-case form so that it does not treat words like 'He' and 'he' differently. It does this using the lowercase parameter, which is set to True by default.
• It also ignores all punctuation so that words followed by a punctuation mark (for example: 'hello!') are not treated differently from the same words not prefixed or suffixed by a punctuation mark (for example: 'hello'). It does this using the token_pattern parameter, which has a default regular expression that selects tokens of 2 or more alphanumeric characters.
• The third parameter to take note of is the stop_words parameter. Stop words refer to the most commonly used words in a language. They include words like 'am', 'an', 'and', 'the', etc. By setting this parameter value to 'english', CountVectorizer will automatically ignore all words (from our input text) that are found in the built-in list of English stop words in scikit-learn. This is extremely helpful, as stop words can skew our calculations when we are trying to find certain key words that are indicative of spam.
We will dive into the application of each of these in our model in a later step, but for now it is important to be aware of the preprocessing techniques available to us when dealing with textual data.
Instructions: * Convert all the strings in the document set to their lower case. Save them into a list called 'lower_case_documents'. You can convert strings to lower case in Python by using the lower() method.
In [46]: '''
Solution:
'''
documents = ['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']
lower_case_documents = []
for i in documents:
    lower_case_documents.append(i.lower())
print(lower_case_documents)
['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']
Instructions: Remove all punctuation from the strings in the document set.
Save them into a list called ’sans_punctuation_documents’.
In [47]: '''
Solution:
'''
sans_punctuation_documents = []
import string

# Note: string.punctuation already contains the punctuation characters we want to remove
for i in lower_case_documents:
    sans_punctuation_documents.append(i.translate(str.maketrans('', '', string.punctuation)))
print(sans_punctuation_documents)
['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']
Step 3: Tokenization
Tokenizing a sentence in a document set means splitting up a sentence into individual words using a delimiter. The delimiter specifies what character we will use to identify the beginning and the end of a word (for example, we could use a single space as the delimiter for identifying words in our document set).
In [21]: '''
Solution:
'''
preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(' '))
print(preprocessed_documents)
[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
Step 4: Count frequencies
Now that we have our document set in the required format, we can proceed to counting the
occurrence of each word in each document of the document set. We will use Counter from the Python collections module for this purpose.
Counter counts the occurrence of each item in a list and returns a dictionary, with the key being the item being counted and the corresponding value being the count of that item in the list.
In [22]: '''
Solution
'''
frequency_list = []
import pprint
from collections import Counter

for i in preprocessed_documents:
    frequency_list.append(Counter(i))
pprint.pprint(frequency_list)
Congratulations! You have implemented the Bag of Words process from scratch! As we can
see in our previous output, we have a frequency distribution dictionary which gives a clear view
of the text that we are dealing with.
We should now have a solid understanding of what is happening behind the scenes in scikit-learn's sklearn.feature_extraction.text.CountVectorizer method.
We will now use the sklearn.feature_extraction.text.CountVectorizer method in the next step.
In [23]: '''
Here we will look to create a frequency matrix on a smaller document set to make sure we
understand how the document-term matrix generation happens. We have created a sample
document set 'documents'.
'''
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
Instructions: Import the sklearn.feature_extraction.text.CountVectorizer
method and create an instance of it called ’count_vector’.
In [48]: '''
Solution
'''
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
Data preprocessing with CountVectorizer()
In Step 2.2, we implemented a version of the CountVectorizer() method from scratch that entailed cleaning our data first. This cleaning involved converting all of our data to lower case and removing all punctuation marks. CountVectorizer() has certain parameters which take care of these steps for us. They are:
• lowercase = True
The lowercase parameter has a default value of True, which converts all of our text to its lower-case form.
• token_pattern = (?u)\\b\\w\\w+\\b
The token_pattern parameter has a default regular expression value of (?u)\\b\\w\\w+\\b, which ignores all punctuation marks and treats them as delimiters, while accepting alphanumeric strings of length greater than or equal to 2 as individual tokens or words.
• stop_words
The stop_words parameter, if set to 'english', will remove all words from our document set that match a list of English stop words defined in scikit-learn. Considering the size of our dataset and the fact that we are dealing with SMS messages and not larger text sources like e-mail, we will not be setting this parameter value. (A short sketch with these parameters set explicitly follows below.)
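As a quick illustration (not part of the project solution), here is a minimal sketch that sets these three parameters explicitly; the values shown for lowercase and token_pattern are simply the scikit-learn defaults described above, stop_words='english' is included only to show how it would be enabled, and the name example_vector is purely illustrative:
In [ ]: # Illustration only: CountVectorizer with its preprocessing parameters spelled out
from sklearn.feature_extraction.text import CountVectorizer
example_vector = CountVectorizer(lowercase=True,                  # the default
                                 token_pattern=r'(?u)\b\w\w+\b',  # the default
                                 stop_words='english')            # optional, not used in this project
# On a small sample set this drops stop words such as 'how', 'are' and 'you'
example_vector.fit(['Hello, how are you!', 'Win money, win from home.'])
print(example_vector.get_feature_names())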
You can take a look at all the parameter values of your count_vector object by simply printing
out the object as follows:
In [49]: '''
Practice note:
Print the 'count_vector' object which is an instance of 'CountVectorizer()'
'''
print(count_vector)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Instructions: Fit your document dataset to the CountVectorizer object you have created using fit(), and get the list of words which have been categorized as features using the get_feature_names() method.
In [50]: '''
Solution:
'''
count_vector.fit(documents)
count_vector.get_feature_names()
Out[50]: ['are',
'call',
'from',
'hello',
'home',
'how',
'me',
'money',
'now',
'tomorrow',
'win',
'you']
The get_feature_names() method returns our feature names for this dataset, which is the set
of words that make up our vocabulary for ’documents’.
Instructions: Create a matrix with the rows being each of the 4 documents, and the columns being each word. The corresponding (row, column) value is the frequency of occurrence of that word (in the column) in a particular document (in the row). You can do this using the transform() method and passing in the document data set as the argument. The transform() method returns a matrix of numpy integers; you can convert this to an array using toarray(). Call the array 'doc_array'.
In [51]: '''
Solution
'''
doc_array = count_vector.transform(documents).toarray()
doc_array
Now we have a clean representation of the documents in terms of the frequency distribution of the words in them. To make it easier to understand, our next step is to convert this array into a dataframe and name the columns appropriately.
Instructions: Convert the array we obtained, loaded into 'doc_array', into a dataframe and set the column names to the word names (which you computed earlier using get_feature_names()). Call the dataframe 'frequency_matrix'.
In [28]: '''
Solution
'''
frequency_matrix = pd.DataFrame(doc_array, columns = count_vector.get_feature_names())
frequency_matrix
Out[28]: are call from hello home how me money now tomorrow win you
0 1 0 0 1 0 1 0 0 0 0 0 1
1 0 0 1 0 1 0 0 1 0 0 2 0
2 0 1 0 0 0 0 1 0 1 0 0 0
3 0 1 0 2 0 0 0 0 0 1 0 1
Congratulations! You have successfully implemented a Bag of Words problem for a document
dataset that we created.
One potential issue that can arise from using this method out of the box is that if our dataset of text is extremely large (say, a large collection of news articles or email data), certain values will be more common than others simply due to the structure of the language itself. For example, words like 'is', 'the', 'an', pronouns, grammatical constructs, etc. could skew our matrix and affect our analysis.
There are a couple of ways to mitigate this. One way is to use the stop_words parameter and set its value to 'english'. This will automatically ignore all words (from our input text) that are found in a built-in list of English stop words in scikit-learn.
Another way of mitigating this is by using the TF-IDF (term frequency-inverse document frequency) weighting scheme. This method is out of scope for the context of this lesson, but a brief sketch of what it looks like in scikit-learn follows below.
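Since TF-IDF is only mentioned in passing, the following is just a minimal sketch (not part of the project solution) of how scikit-learn's TfidfVectorizer could be swapped in for CountVectorizer; the variable names here are illustrative only:
In [ ]: # Illustration only: TF-IDF weighting instead of raw counts.
# TfidfVectorizer down-weights words that appear in many documents.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer()
tfidf_array = tfidf_vector.fit_transform(documents).toarray()
# Each row is a document; each column is a word weighted by TF-IDF rather than its raw frequency
print(tfidf_vector.get_feature_names())
print(tfidf_array)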
Instructions: Split the dataset into a training and testing set by using the train_test_split method in sklearn. Split the data using the following variables: * X_train is our training data for the 'sms_message' column. * y_train is our training data for the 'label' column. * X_test is our testing data for the 'sms_message' column. * y_test is our testing data for the 'label' column. Print out the number of rows we have in each of our training and testing data.
In [52]: '''
Solution
'''
# USE from sklearn.model_selection import train_test_split to avoid seeing deprecation warnings
# (sklearn.cross_validation has been removed in newer versions of scikit-learn)
from sklearn.model_selection import train_test_split
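# NOTE: the rest of this cell (the actual split and the row-count printout) is cut off
# in this export. The following is a minimal sketch of what the instructions above describe;
# random_state=1 is an assumption, added only so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))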
• Firstly, we have to fit our training data (X_train) into CountVectorizer() and return the
matrix.
• Secondly, we have to transform our testing data (X_test) to return the matrix.
Note that X_train is our training data for the ’sms_message’ column in our dataset and we
will be using this to train our model.
X_test is our testing data for the 'sms_message' column and this is the data we will be using (after transformation to a matrix) to make predictions on. We will then compare those predictions with y_test in a later step.
For now, we have provided the code that does the matrix transformations for you!
In [ ]: '''
[Practice Note]
The code for this segment is in 2 parts. Firstly, we are learning a vocabulary dictionary
for the training data and then transforming the data into a document-term matrix; secondly,
for the testing data we are only transforming the data into a document-term matrix.
We will provide the transformed data to students in the variables 'training_data' and 'testing_data'.
'''
In [53]: '''
Solution
'''
# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
• P(A) is the prior probability of A occurring independently. In our example this is P(D). This
value is given to us.
• P(B) is the prior probability of B occurring independently. In our example this is P(Pos).
• P(A|B) is the posterior probability that A occurs given B. In our example this is P(D|Pos), that is, the probability of an individual having diabetes given that that individual got a positive test result. This is the value that we are looking to calculate.
• P(B|A) is the likelihood probability of B occurring, given A. In our example this is P(Pos|D).
This value is given to us.
Putting our values into the formula for Bayes theorem we get:
P(D|Pos) = P(D) * P(Pos|D) / P(Pos)
The probability of getting a positive test result P(Pos) can be calculated using the Sensitivity
and Specificity as follows:
P(Pos) = [P(D) * Sensitivity] + [P(~D) * (1 - Specificity)]
In [ ]: '''
Instructions:
Calculate probability of getting a positive test result, P(Pos)
'''
In [33]: '''
Solution (skeleton code will be provided)
'''
# P(D)
p_diabetes = 0.01
# P(~D)
p_no_diabetes = 0.99
# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9
# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9
# P(Pos)
p_pos = (p_diabetes * p_pos_diabetes) + (p_no_diabetes * (1 - p_neg_no_diabetes))
print('The probability of getting a positive test result P(Pos) is: {}'.format(p_pos))
In [ ]: '''
Instructions:
Compute the probability of an individual having diabetes, given that that individual got a positive test result.
In other words, compute P(D|Pos).
'''
In [34]: '''
Solution
'''
# P(D|Pos)
p_diabetes_pos = (p_diabetes * p_pos_diabetes)/p_pos
print('Probability of an individual having diabetes, given that that individual got a positive test result is: {}'.format(p_diabetes_pos))
Probability of an individual having diabetes, given that that individual got a positive test res
In [ ]: '''
Instructions:
Compute the probability of an individual not having diabetes, given that that individual got a positive test result.
In other words, compute P(~D|Pos).
Since the Specificity of the test is P(Neg|~D) = 0.9, the probability of a false positive is its complement.
Therefore:
P(Pos|~D) = p_pos_no_diabetes = 1 - 0.9 = 0.1
'''
In [36]: '''
Solution
'''
# P(Pos|~D)
p_pos_no_diabetes = 0.1
# P(~D|Pos)
p_no_diabetes_pos = (p_no_diabetes*p_pos_no_diabetes )/p_pos
print('Probability of an individual not having diabetes, given that that individual got a positive test result is: {}'.format(p_no_diabetes_pos))
Probability of an individual not having diabetes, given that that individual got a positive test
Congratulations! You have implemented Bayes' theorem from scratch. Your analysis shows that even if you get a positive test result, there is only an 8.33% chance that you actually have diabetes and a 91.67% chance that you do not have diabetes. This, of course, rests on the assumption that only 1% of the entire population has diabetes.
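To see how much that 1% prior drives the result, here is a small sketch (not part of the original notebook) that recomputes P(D|Pos) for a few different prior prevalences; the helper name posterior_given_positive is just an illustrative choice:
In [ ]: # Sketch: how the posterior P(D|Pos) changes with the assumed prior P(D).
# Sensitivity and Specificity are kept at 0.9, as in the example above.
def posterior_given_positive(p_d, sensitivity=0.9, specificity=0.9):
    p_pos = p_d * sensitivity + (1 - p_d) * (1 - specificity)
    return (p_d * sensitivity) / p_pos

for prior in [0.01, 0.05, 0.2]:
    print('P(D) = {} -> P(D|Pos) = {:.4f}'.format(prior, posterior_given_positive(prior)))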
What does the term 'Naive' in 'Naive Bayes' mean?
The term 'Naive' in Naive Bayes comes from the fact that the algorithm treats the features it uses to make predictions as independent of each other, which may not always be the case. In our diabetes example, we are considering only one feature: the test result. Say we added another feature, 'exercise'. Let's say this feature has a binary value of 0 or 1, where 0 signifies that the individual exercises at most 2 days a week and 1 signifies that the individual exercises at least 3 days a week. If we had to use both of these features, namely the test result and the value of the 'exercise' feature, computing the joint likelihood of the two directly becomes much harder. Naive Bayes extends Bayes' theorem by assuming that all the features are independent of each other, so their likelihoods can simply be multiplied together, as sketched below.
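As an illustration only (the exercise probabilities below are made up, not from the lesson), the naive independence assumption lets us score the two-feature diabetes example by multiplying the per-feature likelihoods:
In [ ]: # Sketch with hypothetical numbers: Naive Bayes with two features,
# the test result (Pos) and the exercise indicator (Ex = 1).
p_d, p_no_d = 0.01, 0.99           # priors, as in the example above
p_pos_d, p_pos_no_d = 0.9, 0.1     # P(Pos|D), P(Pos|~D), as above
p_ex_d, p_ex_no_d = 0.3, 0.6       # P(Ex=1|D), P(Ex=1|~D): made-up values

# Naive assumption: P(Pos, Ex|D) = P(Pos|D) * P(Ex|D), and likewise for ~D
score_d = p_d * p_pos_d * p_ex_d
score_no_d = p_no_d * p_pos_no_d * p_ex_no_d

# Normalise so the two posteriors sum to 1
p_d_pos_ex = score_d / (score_d + score_no_d)
print('P(D|Pos, Ex=1) = {:.4f}'.format(p_d_pos_ex))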
Now consider a different example: we are analyzing the campaign speeches of two candidates, Jill Stein and Gary Johnson, and the probabilities of each of them saying the words 'freedom' and 'immigration'. Let us assume that the probability of Jill Stein giving a speech, P(J), is 0.5, and the same for Gary Johnson, P(G) = 0.5.
Given this, what if we had to find the probability of Jill Stein saying the words 'freedom' and 'immigration'? This is where the Naive Bayes theorem comes into play, as we are considering two features, 'freedom' and 'immigration'.
Now we are at a place where we can define the formula for the Naive Bayes theorem:
P(y|x1,...,xn) = P(y) * P(x1|y) * ... * P(xn|y) / P(x1,...,xn)
Here, y is the class variable (in our case the name of the candidate) and x1 through xn are the feature vectors (in our case the individual words). The theorem makes the assumption that each of the feature vectors or words (xi) is independent of the others.
To break this down, we have to compute the following posterior probabilities:
• P(J|F,I): Given that the words 'freedom' and 'immigration' were said, what is the probability that they were said by Jill?
Using the formula and our knowledge of Bayes' theorem, we can compute this as follows: P(J|F,I) = (P(J) * P(F|J) * P(I|J)) / P(F,I). Here P(F,I) is the probability of the words 'freedom' and 'immigration' being said in a speech.
• P(G|F,I): Given that the words 'freedom' and 'immigration' were said, what is the probability that they were said by Gary Johnson?
Using the formula, we can compute this as follows: P(G|F,I) = (P(G) * P(F|G) * P(I|G)) / P(F,I)
In [ ]: '''
Instructions: Compute the probability of the words 'freedom' and 'immigration' being said in a speech,
or P(F,I).
The first step is multiplying the probabilities of Jill Stein giving a speech with her individual
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_j_text.
The second step is multiplying the probabilities of Gary Johnson giving a speech with his individual
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_g_text.
The third step is to add both of these probabilities and you will get P(F,I).
'''
In [1]: '''
Solution: Step 1
'''
# P(J)
p_j = 0.5
# P(F|J)
p_j_f = 0.1
# P(I|J)
p_j_i = 0.1
# P(J) * P(F|J) * P(I|J)
p_j_text = p_j * p_j_f * p_j_i
print(p_j_text)

0.005000000000000001
In [2]: '''
Solution: Step 2
'''
# P(G)
p_g = 0.5
# P(F|G)
p_g_f = 0.7
# P(I|G)
p_g_i = 0.2
# P(G) * P(F|G) * P(I|G)
p_g_text = p_g * p_g_f * p_g_i
print(p_g_text)

0.06999999999999999
In [3]: '''
Solution: Step 3: Compute P(F,I) and store in p_f_i
'''
p_f_i = p_j_text + p_g_text
print('Probability of words freedom and immigration being said are: ', format(p_f_i))
Probability of words freedom and immigration being said are: 0.075
Now we can compute P(J|F,I), that is, the probability of Jill Stein saying the words 'freedom' and 'immigration', and P(G|F,I), that is, the probability of Gary Johnson saying the words 'freedom' and 'immigration'.
In [ ]: '''
Instructions:
Compute P(J|F,I) using the formula P(J|F,I) = (P(J) * P(F|J) * P(I|J)) / P(F,I) and store it in a variable p_j_fi.
'''
In [4]: '''
Solution
'''
p_j_fi = p_j_text / p_f_i
print('The probability of Jill Stein saying the words Freedom and Immigration: {}'.format(p_j_fi))
The probability of Jill Stein saying the words Freedom and Immigration: 0.06666666666666668
In [ ]: '''
Instructions:
Compute P(G|F,I) using the formula P(G|F,I) = (P(G) * P(F|G) * P(I|G)) / P(F,I) and store it in a variable p_g_fi.
'''
In [5]: '''
Solution
'''
p_g_fi = p_g_text / p_f_i
print('The probability of Gary Johnson saying the words Freedom and Immigration: {}'.format(p_g_fi))
The probability of Gary Johnson saying the words Freedom and Immigration: 0.9333333333333332
And as we can see, just as in the Bayes' theorem case, the sum of our posteriors is equal to 1. Congratulations! You have implemented the Naive Bayes theorem from scratch. Our analysis shows that there is only a 6.7% chance that Jill Stein of the Green Party uses the words 'freedom' and 'immigration' in her speech, compared to a 93.3% chance for Gary Johnson of the Libertarian Party.
Another, more generic example of Naive Bayes in action is when we search for the term 'Sacramento Kings' in a search engine. In order for us to get results pertaining to the Sacramento Kings NBA basketball team, the search engine needs to be able to associate the two words together and not treat them individually. If it treated the words independently, we would get results for images tagged with 'Sacramento', like pictures of city landscapes, and images of 'Kings', which could be pictures of crowns or kings from history, when what we are looking for are images of the basketball team. This is a classic case of a search engine treating the words as independent entities and hence being 'naive' in its approach.
Applying this to our problem of classifying messages as spam, the Naive Bayes algorithm looks at each word individually and not as associated entities with any kind of link between them.
In the case of spam detectors, this usually works as there are certain red flag words which can
almost guarantee its classification as spam, for example emails with words like ’viagra’ are usually
classified as spam.
In [ ]: '''
Instructions:
We have loaded the training data into the variable 'training_data' and the testing data into the
variable 'testing_data'.
Import the MultinomialNB classifier and fit the training data into the classifier using fit(). Name your classifier
'naive_bayes'. You will be training the classifier using 'training_data' and 'y_train' from our split earlier.
'''
In [54]: '''
Solution
'''
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
In [ ]: '''
Instructions:
Now that our algorithm has been trained using the training data set, we can make some predictions on the test data
stored in 'testing_data' using predict(). Save your predictions into the 'predictions' variable.
'''
In [55]: '''
Solution
'''
predictions = naive_bayes.predict(testing_data)
Now that predictions have been made on our test set, we need to check the accuracy of our
predictions.
1.0.12 Step 6: Evaluating our model
Now that we have made predictions on our test set, our next goal is to evaluate how well our
model is doing. There are various mechanisms for doing so, but first let's do a quick recap of them.
Accuracy measures how often the classifier makes the correct prediction. It is the ratio of the number of correct predictions to the total number of predictions (the number of test data points).
Precision tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that are actually spam) to all positives (all messages classified as spam, irrespective of whether that was the correct classification); in other words, it is the ratio
[True Positives / (True Positives + False Positives)]
Recall (sensitivity) tells us what proportion of the messages that actually were spam we classified as spam. It is the ratio of true positives (messages classified as spam that are actually spam) to all the messages that were actually spam; in other words, it is the ratio
[True Positives / (True Positives + False Negatives)]
For classification problems whose class distributions are skewed, like ours, accuracy by itself is not a very good metric. For example, if we had 100 text messages and only 2 were spam while the remaining 98 weren't, we could classify 90 messages as not spam (including the 2 that were spam, which would then be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score. For such cases, precision and recall come in very handy. These two metrics can be combined to give the F1 score, which is the harmonic mean of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score. (The short sketch below works through the 100-message example numerically.)
We will be using all 4 metrics to make sure our model does well. For all 4 metrics, whose values can range from 0 to 1, having a score as close to 1 as possible is a good indicator of how well our model is doing.
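As a quick numerical check of the skewed-class example above (not part of the original notebook; the counts below simply encode that scenario):
In [ ]: # Sketch: 100 messages, 2 spam. The classifier predicts 10 spam (all wrong)
# and misses both real spam messages.
tp, fp = 0, 10   # true positives, false positives
fn, tn = 2, 88   # false negatives, true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Accuracy looks decent (0.88) even though precision, recall and F1 are all 0
print(accuracy, precision, recall, f1)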
In [ ]: '''
Instructions:
Compute the accuracy, precision, recall and F1 scores of your model using your test data 'y_test' and the predictions
you made earlier stored in the 'predictions' variable.
'''
In [56]: '''
Solution
'''
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test,predictions)))
print('Precision score: ', format(precision_score(y_test,predictions)))
print('Recall score: ', format(recall_score(y_test,predictions)))
print('F1 score: ', format(f1_score(y_test,predictions)))
1.0.13 Step 7: Conclusion
One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words. It also performs well in the presence of irrelevant features and is relatively unaffected by them. Its other major advantage is its relative simplicity. Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary except in cases where the distribution of the data is known. It rarely overfits the data. Another important advantage is that its model training and prediction times are very fast for the amount of data it can handle. All in all, Naive Bayes really is a gem of an algorithm!
Congratulations! You have successfully designed a model that can efficiently predict if an SMS
message is spam or not!
Thank you for learning with us!