
Aayush Nihar Spam Mail Filtering

dav report

Uploaded by

Nihar Shah

Spam Filtering on Mail

Aayush Shah, 1641065


Nihar Shah, 1641054
Overview
● Spam mail wastes the Internet's two most precious resources: bandwidth and time.
● It can eat up a lot of inbox space and can contain malware and viruses that compromise
company security and data.
● Bayes' theorem, a result from probability theory, is well suited to classifying spam mail.
Problem Statement
● We have a message m = (w1, w2, w3, … , wn), where w1, w2, w3, … , wn are the unique words it contains.

● Assume that the occurrence of each word is independent of all other words.
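
These two statements are the inputs to Bayes' theorem. As a sketch in the slide's notation, with c standing for either class (the symbol c is introduced here only for illustration):

```latex
P(c \mid m) = \frac{P(m \mid c)\,P(c)}{P(m)},
\qquad
P(m \mid c) = \prod_{i=1}^{n} P(w_i \mid c),
\qquad
c \in \{\text{spam}, \text{ham}\}
```

The second equality holds only under the independence assumption, which is what makes the classifier "naive".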


Problem Statement (Cont.)
● To classify a message, we have to determine which of P(spam | m) and P(ham | m) is greater.
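
Because P(m) is the same for both classes, comparing P(spam | m) with P(ham | m) reduces to comparing the numerators of Bayes' theorem. A sketch of the resulting rule, assuming the word-independence assumption stated earlier:

```latex
\text{classify } m \text{ as spam if}\quad
P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
\;>\;
P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})
```

In practice the products are usually computed as sums of logarithms, which avoids floating-point underflow on long messages.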
Loading dependencies
● NLTK for processing the messages
● WordCloud and matplotlib for visualization
● Pandas for loading data
● NumPy for generating random probabilities for train-test split
Loading Data

● We do not require the columns ‘Unnamed: 2’, ‘Unnamed: 3’ and ‘Unnamed: 4’, so we remove
them. We rename the column ‘v1’ as ‘label’ and ‘v2’ as ‘message’. ‘ham’ is replaced by 0 and ‘spam’
is replaced by 1 in the ‘label’ column.
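
The cleaning steps above can be sketched with pandas as follows. The function name `load_messages` and the inline two-row sample are hypothetical, used only so the example runs without the real file; the `latin-1` encoding is likewise an assumption about how the CSV is stored.

```python
import io
import pandas as pd

def load_messages(source):
    """Load the raw CSV and return a frame with 'label' (0/1) and 'message'."""
    # Encoding is an assumption; adjust it to match the actual file.
    df = pd.read_csv(source, encoding="latin-1")
    # Drop the three unused columns if they are present.
    df = df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], errors="ignore")
    df = df.rename(columns={"v1": "label", "v2": "message"})
    # Encode the class as an integer: ham -> 0, spam -> 1.
    df["label"] = df["label"].map({"ham": 0, "spam": 1})
    return df

# Made-up sample with the same column layout as described above.
sample = io.BytesIO(
    b"v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    b"ham,See you at lunch,,,\n"
    b"spam,FREE entry call now,,,\n"
)
mails = load_messages(sample)
```

Passing a file path instead of the in-memory sample should work the same way, since `pd.read_csv` accepts both.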
Loading Data (Cont.)
Data Visualization of Spam and Ham emails:
The dataset, taken from the UCI ML repository, contains 5722 emails, of which 4120 are
ham (legitimate) and the rest are spam.
Train-Test split

● Use 75% of the dataset for training and the rest for testing. Rows are selected uniformly at random.
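
One way to make such a split with NumPy is to draw one uniform random number per row and compare it with the training fraction. This is a minimal sketch; the function name `train_test_mask` and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def train_test_mask(n_rows, train_fraction=0.75, seed=0):
    """Return a boolean mask in which True marks a training row."""
    rng = np.random.default_rng(seed)
    # One uniform draw in [0, 1) per row; rows below the threshold go to training.
    return rng.random(n_rows) < train_fraction

mask = train_test_mask(1000)
train_idx = np.flatnonzero(mask)    # row positions used for training
test_idx = np.flatnonzero(~mask)    # remaining rows used for testing
```

With this approach the training share is 75% only in expectation, so the exact split sizes vary slightly from run to run unless the seed is fixed.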
Visualizing Data
● To see which words occur most often in spam messages, we used the WordCloud library.
Visualizing Data (Cont.)
● The result for spam mails is as expected.
● Spam messages contain words like ‘FREE’, ‘call’, ‘text’, etc.
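
A word cloud sizes words by how often they occur, so the same ranking can be checked numerically. The sketch below uses only the standard library rather than WordCloud; the function name `top_words` and the example messages are made up for illustration.

```python
import re
from collections import Counter

def top_words(messages, k=3):
    """Return the k most frequent lowercase word tokens across all messages."""
    counts = Counter()
    for text in messages:
        # Lowercase first so that 'FREE' and 'free' are counted together.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts.most_common(k)

# Made-up spam-like messages, for illustration only.
spam_examples = [
    "FREE entry! Call now to claim your FREE prize",
    "Text WIN to claim, call today",
]
print(top_words(spam_examples, k=4))
```

On real data, common function words such as "to" tend to dominate such a list until stop words are filtered out.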
Visualizing Data (Cont.)
● Result for ham mails.


Training the Model
1. Preprocessing

a) Convert all words to lowercase (‘FREE’ and ‘free’ are the same word).
b) Tokenize each message (split it into words and discard the punctuation).

c) Words such as ‘go’, ‘goes’ and ‘going’ refer to the same activity. Reduce such inflected forms to a
common stem using the Porter stemmer algorithm.
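
Steps a) to c) can be sketched as a single function. The regular-expression tokenizer is a stand-in chosen to keep the example self-contained (NLTK's `word_tokenize` needs its tokenizer data to be downloaded first), and the name `preprocess` is hypothetical.

```python
import re
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def preprocess(message):
    """Lowercase a message, split it into word tokens, and stem each token."""
    lowered = message.lower()                       # step a
    tokens = re.findall(r"[a-z0-9]+", lowered)      # step b: punctuation is dropped
    return [_stemmer.stem(tok) for tok in tokens]   # step c

tokens = preprocess("FREE prizes, going fast!")
```

A stem is not always a dictionary word, so it is worth inspecting the output on a few real messages before relying on it.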
Training the Model (Cont.)
d) Remove the stop words (for example ‘a’, ‘an’ and ‘the’).
e) Count the number of occurrences of each word.

f) TF-IDF (Term frequency - Inverse document frequency)
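
Steps d) to f) can be sketched as follows. The three-word stop list is only illustrative (NLTK ships a fuller English list), and IDF is taken here as log(N / document frequency), one common variant that is assumed rather than confirmed as the slides' formula. The names `remove_stop_words` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the"}

def remove_stop_words(tokens):
    """Step d: drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(documents):
    """documents: list of token lists. Returns (term counts, idf per term)."""
    tf = Counter()   # step e: total occurrences of each term
    df = Counter()   # number of documents that contain each term
    for doc in documents:
        tf.update(doc)
        df.update(set(doc))
    n_docs = len(documents)
    # Step f: terms found in fewer documents receive a larger weight.
    idf = {term: math.log(n_docs / df[term]) for term in df}
    return tf, idf

# Made-up tokenized messages, for illustration only.
docs = [remove_stop_words(d) for d in (
    ["the", "free", "prize"],
    ["call", "a", "friend"],
    ["free", "call", "free"],
)]
tf, idf = tf_idf(docs)
```

A term that occurs in every document gets an IDF of zero under this definition, which is the intended effect: such a term carries no information about the class.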


Training the Model (Cont.)
g) The probability of each word is computed as:
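
One common way to define this probability, stated here as an assumption rather than as the authors' exact formula, is the relative frequency of the word within a class, where V denotes the vocabulary of the training set:

```latex
P(w \mid \text{spam}) =
\frac{\mathrm{count}(w, \text{spam})}
     {\sum_{x \in V} \mathrm{count}(x, \text{spam})}
```

The same expression with ham counts gives P(w | ham). A TF-IDF weighted variant would replace each count with the corresponding TF-IDF weight.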
Training the Model (Cont.)
h) If a word appears in the test dataset but not in the training dataset, then P(w) = 0, which
makes the whole product zero. Additive smoothing is applied to avoid this.
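
Additive smoothing adds a constant alpha to every word count, so an unseen word receives a small non-zero probability. A minimal sketch, with hypothetical parameter names and made-up numbers:

```python
def smoothed_probability(word_count, total_count, vocab_size, alpha=1.0):
    """Additive smoothing of a word probability (Laplace smoothing when alpha = 1)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# An unseen word (count 0) no longer gets probability zero.
p_unseen = smoothed_probability(0, total_count=100, vocab_size=50)
# A word seen 10 times is adjusted only slightly.
p_seen = smoothed_probability(10, total_count=100, vocab_size=50)
```

Adding alpha times the vocabulary size to the denominator keeps the smoothed probabilities over the vocabulary summing to one.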
Classification and Evaluation Results:
1. Multinomial Naive Bayes Classifier is used and its results are as follows:
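
The classifier itself can be sketched from scratch in a few lines. This is not the authors' implementation: it uses raw word counts rather than TF-IDF weights, additive smoothing with alpha = 1, and made-up training messages; the names `train` and `predict` are hypothetical.

```python
import math
from collections import Counter

def train(messages, labels, alpha=1.0):
    """messages: list of token lists; labels: 0 (ham) or 1 (spam)."""
    counts = {0: Counter(), 1: Counter()}
    for tokens, label in zip(messages, labels):
        counts[label].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    priors = {c: labels.count(c) / len(labels) for c in (0, 1)}
    return counts, vocab, priors, alpha

def predict(model, tokens):
    """Return 1 (spam) or 0 (ham) by comparing log-posterior scores."""
    counts, vocab, priors, alpha = model
    scores = {}
    for c in (0, 1):
        total = sum(counts[c].values())
        score = math.log(priors[c])
        for tok in tokens:
            # Additive smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[c][tok] + alpha) / (total + alpha * len(vocab)))
        scores[c] = score
    return 1 if scores[1] > scores[0] else 0

# Made-up, already tokenized training messages.
model = train(
    [["free", "prize", "call"], ["lunch", "tomorrow"],
     ["free", "call", "now"], ["see", "you", "tomorrow"]],
    [1, 0, 1, 0],
)
label = predict(model, ["free", "call"])
```

Working with log-probabilities turns the product over words into a sum, which avoids underflow on long messages.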
Thank You
