Spam Filtering on Mail
Aayush Shah, 1641065
Nihar Shah, 1641054
Overview
● Spam mail wastes the Internet’s two most precious resources: bandwidth and time
● It can eat up a lot of inbox space and can contain malware and viruses that can compromise
company security and data.
● A very useful theorem of probability lets us classify spam mail: Bayes’ theorem.
Problem Statement
● We have a message m = (w1, w2, w3, … , wn), where (w1, w2, w3, … , wn) is the set of unique words in it
● Assume the occurrence of each word is independent of all other words
Problem Statement (Cont.)
● In order to classify the message, we have to determine which is greater: P(spam | m) or P(ham | m)
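The comparison itself did not survive extraction; in symbols, the classifier decides a message m is spam when P(spam | m) exceeds P(ham | m). By Bayes’ theorem together with the independence assumption above (a reconstruction, not the slide’s exact image):

```latex
P(\text{spam} \mid m) > P(\text{ham} \mid m),
\qquad\text{where}\qquad
P(\text{spam} \mid m) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
```

and symmetrically for ham; the common denominator P(m) cancels in the comparison.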
Loading dependencies
● NLTK for processing the messages
● WordCloud and matplotlib for visualization
● Pandas for loading data
● NumPy for generating random probabilities for train-test split
Loading Data
● We do not require the columns ‘Unnamed: 2’, ‘Unnamed: 3’ and ‘Unnamed: 4’, so we remove
them. We rename the column ‘v1’ as ‘label’ and ‘v2’ as ‘message’. ‘ham’ is replaced by 0 and ‘spam’
is replaced by 1 in the ‘label’ column.
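The cleanup described above can be sketched as follows. A tiny inline DataFrame stands in for the real file (the Kaggle export of the UCI SMS Spam Collection is commonly named spam.csv; the sample rows here are invented for illustration):

```python
import pandas as pd

# Stand-in for pd.read_csv("spam.csv", encoding="latin-1"): columns v1, v2
# plus three unnamed spill-over columns, as described on the slide.
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at lunch", "FREE entry! Text WIN now"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Drop the unused columns and rename the remaining ones.
df = raw.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
df = df.rename(columns={"v1": "label", "v2": "message"})

# Replace ham with 0 and spam with 1 in the label column.
df["label"] = df["label"].map({"ham": 0, "spam": 1})
```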
Loading Data (Cont.)
Data Visualization of Spam and Ham emails:
The UCI ML dataset used here contains 5722 emails,
of which 4120 are ham (legitimate) and the rest are
spam.
Train-Test split
● Use 75% of the dataset for training
and the rest as the test dataset. The
selection is uniformly random.
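One way to realize the split above, drawing one uniform random probability per row with NumPy (a sketch with toy data; the seed is an assumption for reproducibility):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded dataset.
df = pd.DataFrame({"label": [0, 1] * 50,
                   "message": ["see you soon", "FREE prize call now"] * 50})

rng = np.random.default_rng(0)       # seeded so the split is reproducible
mask = rng.random(len(df)) < 0.75    # one uniform draw per row

train = df[mask]                     # ~75% of rows
test = df[~mask]                     # the rest
```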
Visualizing Data
● To see which words are the most frequent in spam messages, we use the WordCloud library
Visualizing Data (Cont.)
● The result for spam mails is as
expected
● The messages contain words
like ‘FREE’, ‘call’, ‘text’, etc.
Visualizing Data (Cont.)
● Result for ham mails.
Training the Model
1. Preprocessing
a) Convert all words to lowercase (‘FREE’ and ‘free’ are the same word).
b) Tokenize each message (split it into pieces and throw away the punctuation).
c) ‘Go’, ‘goes’ and ‘going’ indicate the same activity. Replace all of them with ‘go’ using the Porter
Stemmer algorithm.
Training the Model (Cont.)
d) Remove the stop words (e.g. ‘a’, ‘an’, ‘the’).
e) Count the number of occurrences of each word.
f) TF-IDF (Term frequency - Inverse document frequency)
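Steps a)–e) can be sketched as below. A small inline stop-word list keeps the sketch self-contained (the slides use NLTK’s stop-word corpus, which requires a download); the resulting counts would then feed the TF-IDF step:

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer

# Inline stand-in for nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "and"}
stemmer = PorterStemmer()

def preprocess(message):
    # a) lowercase, b) tokenize and drop punctuation
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    # d) remove stop words, then c) stem (going / goes -> go)
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

# e) count the occurrences of each word
counts = Counter(preprocess("FREE! You are going to WIN, go go!"))
```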
Training the Model (Cont.)
g) The probability of each word is computed as:
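The formula on this slide did not survive extraction; a standard form for the per-class word probability in multinomial Naive Bayes, consistent with the smoothing note that follows, is (a reconstruction, not the slide’s exact image):

```latex
P(w_i \mid \text{spam}) =
\frac{\mathrm{count}(w_i, \text{spam})}{\sum_{w \in V} \mathrm{count}(w, \text{spam})}
```

and symmetrically for ham, where V is the training vocabulary.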
Training the Model (Cont.)
h) If a word appears in the test dataset but not in the training dataset, then P(w) = 0, which
breaks the product of probabilities. Additive smoothing must be applied.
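Additive (Laplace, add-one) smoothing adds a pseudo-count to every word so unseen words get a small non-zero probability. A minimal sketch with toy counts (the counts and vocabulary are assumptions for illustration):

```python
from collections import Counter

# Toy training counts for the spam class.
spam_counts = Counter({"free": 4, "win": 3, "call": 2})
vocab = {"free", "win", "call", "hello"}   # full training vocabulary

alpha = 1  # Laplace (add-one) smoothing
total = sum(spam_counts.values())          # 9 word occurrences in spam

def p_word_given_spam(w):
    # Unseen words get probability alpha / (total + alpha * |V|) instead of 0.
    return (spam_counts[w] + alpha) / (total + alpha * len(vocab))

p_seen = p_word_given_spam("free")     # (4 + 1) / (9 + 4)
p_unseen = p_word_given_spam("hello")  # (0 + 1) / (9 + 4), no longer zero
```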
Classification and Evaluation Results:
1. A Multinomial Naive Bayes classifier is used, and its results are as follows:
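The slides’ results table did not survive extraction and is not reproduced here. Putting the pieces together, a minimal from-scratch multinomial Naive Bayes classifier with add-one smoothing looks like this (toy training data assumed; log-probabilities avoid numerical underflow):

```python
import math
from collections import Counter

# Toy training data: (message, label) with spam = 1, ham = 0.
train = [
    ("free win cash call now", 1),
    ("win free prize text", 1),
    ("see you at lunch", 0),
    ("meeting moved to monday", 0),
]

counts = {0: Counter(), 1: Counter()}   # per-class word counts
n_docs = Counter()                      # per-class document counts
for text, label in train:
    counts[label].update(text.split())
    n_docs[label] += 1

vocab = set(counts[0]) | set(counts[1])
totals = {c: sum(counts[c].values()) for c in (0, 1)}

def log_posterior(c, words):
    # log P(c) + sum of log P(w | c), with add-one smoothing.
    lp = math.log(n_docs[c] / sum(n_docs.values()))
    for w in words:
        lp += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
    return lp

def classify(message):
    words = message.lower().split()
    return 1 if log_posterior(1, words) > log_posterior(0, words) else 0
```

The same decision rule is what libraries such as scikit-learn implement in `MultinomialNB`; evaluation on the held-out 25% test split would then give accuracy and related metrics.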
Thank You