A Study of Supervised Spam Detection Using Artificial Intelligence

This document summarizes a study on using machine learning for supervised spam detection. It discusses techniques spammers use to evade filters, such as obscuring text, embedding content in images, and abusing character encodings. It then presents Naive Bayes as a solution, which learns the probabilities of words occurring in spam versus ham (non-spam) emails. The document lists spam detection algorithms and open-source filters and defines standard evaluation measures such as accuracy, recall, and precision. It concludes that the best-performing algorithms can classify emails as spam or ham with almost 99.9% accuracy.


A Study of Supervised Spam Detection Using Artificial Intelligence

Presented by
Mohit Magare
Class: BE-B-10
PRN No: 71921639H

1
What is Spam?
• Typical legal definition
– Unsolicited commercial email from someone without a pre-existing business relationship

• Most commonly used definition
– Whatever the users think is spam

2
Spam Detection

Ham

Spam

Is this just text categorization?


What are the special challenges?
3
Text classification alone is not enough

• Spammers now often try to obscure text.
• Special features are necessary.
– E.g. subject line vs. body text
– E.g. mail in the middle of the night is more likely to be spam than mail in the middle of the day.

4
Weather Report Guy

• Content in Image

Weather, Sunny, High 82, Low 81, Favorite…

5
Secret Decoder Ring Dude
• Character Encoding
• HTML word breaking
Pharmacy
Prod&#117;c<!LZJ>t<!LG>s
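
One way a filter might undo this kind of obfuscation is to normalize the HTML before tokenizing. The sketch below is an assumption about how that could be done, not a method from the slides: the function name `normalize_html_text` is hypothetical, and it only handles numeric/named character references and `<!...>` style invisible markup like the example above.

```python
import html
import re

# Matches real HTML comments and bogus "<!...>" declarations such as "<!LZJ>",
# which browsers do not render and spammers use to split words.
_INVISIBLE_MARKUP = re.compile(r"<!--.*?-->|<![^>]*>", re.DOTALL)


def normalize_html_text(raw: str) -> str:
    """Remove invisible markup first, then decode character references."""
    without_markup = _INVISIBLE_MARKUP.sub("", raw)
    return html.unescape(without_markup)


if __name__ == "__main__":
    print(normalize_html_text("Prod&#117;c<!LZJ>t<!LG>s"))
```

Stripping the markup before decoding means an encoded `&lt;` cannot turn into a new tag after unescaping.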

6
Diploma Guy
• Word Obscuring

Dlpmoia Pragorm
Caerte a mroe prosoeprus
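
Scrambled words like these keep their first and last letters and shuffle the middle, so one possible countermeasure is to compare words by a key that ignores the order of the interior letters. This is an illustrative sketch only: `scramble_key`, `unscramble`, and the `WATCHLIST` contents are hypothetical names and data chosen for the example, and a real filter would need a much larger vocabulary.

```python
def scramble_key(word: str) -> tuple[str, str, str]:
    """Key that is identical for a word and any interior-shuffled variant."""
    w = word.lower()
    if len(w) <= 3:
        return (w, "", "")
    return (w[0], "".join(sorted(w[1:-1])), w[-1])


# Hypothetical watchlist of words a filter cares about.
WATCHLIST = ["diploma", "program", "create", "more", "prosperous"]
KEY_TO_WORD = {scramble_key(w): w for w in WATCHLIST}


def unscramble(text: str) -> str:
    """Map each token to its watchlist word when the keys match."""
    tokens = text.split()
    return " ".join(KEY_TO_WORD.get(scramble_key(t), t.lower()) for t in tokens)


if __name__ == "__main__":
    print(unscramble("Dlpmoia Pragorm"))
    print(unscramble("Caerte a mroe prosoeprus"))
```

Different words can share a key, so in practice the restored token would be treated as one more feature rather than a certain match.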

7
One Solution to Spam Detection
• Machine Learning
– Learn spam versus good

8
Naïve Bayes
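• Want P(spam | words)
• Use Bayes' rule:

$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam})\, P(\text{spam})}{P(\text{words})}$$

$$P(\text{words}) = P(\text{words} \mid \text{spam})\, P(\text{spam}) + P(\text{words} \mid \text{good})\, P(\text{good})$$

• Assume independence: given the class, the probability of each word is independent of the others

$$P(\text{words} \mid \text{spam}) = P(\text{word}_1 \mid \text{spam}) \times P(\text{word}_2 \mid \text{spam}) \times \dots \times P(\text{word}_n \mid \text{spam})$$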
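
The formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the cited work: the class name `NaiveBayesSpamFilter` is hypothetical, add-one (Laplace) smoothing and log-space arithmetic are additions that the slides do not mention, and the code assumes at least one training message per class.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Word-based Naive Bayes with add-alpha smoothing (an assumption)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.word_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}
        self.vocabulary = set()

    def train(self, words, label):
        """Record one message, given as a list of words, under 'spam' or 'good'."""
        self.word_counts[label].update(words)
        self.message_counts[label] += 1
        self.vocabulary.update(words)

    def _log_joint(self, words, label):
        """log P(words | label) + log P(label), using the independence assumption."""
        total_messages = sum(self.message_counts.values())
        log_p = math.log(self.message_counts[label] / total_messages)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocabulary)
        for w in words:
            log_p += math.log((self.word_counts[label][w] + self.alpha) / denom)
        return log_p

    def spam_probability(self, words):
        """P(spam | words) via Bayes' rule; P(words) is the sum of both joints."""
        log_spam = self._log_joint(words, "spam")
        log_good = self._log_joint(words, "good")
        m = max(log_spam, log_good)  # subtract the max to avoid underflow
        p_spam = math.exp(log_spam - m)
        p_good = math.exp(log_good - m)
        return p_spam / (p_spam + p_good)


if __name__ == "__main__":
    nb = NaiveBayesSpamFilter()
    nb.train(["free", "money"], "spam")
    nb.train(["free", "offer"], "spam")
    nb.train(["meeting", "tomorrow"], "good")
    nb.train(["project", "meeting"], "good")
    print(nb.spam_probability(["free", "money"]))
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.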

9
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz

• One of the first papers on using machine learning to combat spam
• Used Naïve Bayes
• Feature Space: Words, Phrases, Domain-Specific Features
• Evaluation Data: ~1700 Messages, ~88% Spam, from volunteer’s private e-mail

10
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz

• Hand-Crafted Features
– 35 Phrases
• ‘Free Money’
• ‘Only $’
• ‘be over 21’
– 20 Domain Specific Features
• Domain type of sender (.edu, .com, etc)
• Sender name resolutions (internal mail)
• Has attachments
• Time received
• Percent of non-alphanumeric characters in subject
• Best collection of heuristics discussed in literature
– Without them: Spam precision 97.1% Spam recall 94.3%
– With them: Spam precision 100% Spam recall 98.3%
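
To make the hand-crafted features listed above concrete, here is a rough sketch of how a few of them might be computed for a single message. It is not the feature extractor from the 1998 paper: the function `extract_features`, its parameters, and the feature names are hypothetical, only the three phrases quoted above are included, the sender is assumed to be a bare address, and whitespace is assumed not to count as a non-alphanumeric character.

```python
from datetime import datetime

# Only the three example phrases quoted on the slide; the full list of 35 is not given.
PHRASES = ("free money", "only $", "be over 21")


def extract_features(sender: str, subject: str, body: str,
                     received: datetime, has_attachments: bool) -> dict:
    """Build a small dictionary of phrase and domain-specific features."""
    text = f"{subject} {body}".lower()
    features = {f"phrase:{p}": p in text for p in PHRASES}

    # Domain type of sender (.edu, .com, etc.), assuming a bare address.
    domain = sender.rsplit("@", 1)[-1]
    features["sender_tld"] = domain.rsplit(".", 1)[-1].lower()

    features["has_attachments"] = has_attachments
    features["hour_received"] = received.hour

    # Percent of non-alphanumeric characters in the subject (spaces excluded).
    non_alnum = sum(1 for ch in subject if not ch.isalnum() and not ch.isspace())
    features["subject_pct_non_alnum"] = (
        100.0 * non_alnum / len(subject) if subject else 0.0
    )
    return features


if __name__ == "__main__":
    example = extract_features(
        sender="promo@deals.example.com",
        subject="FREE MONEY!!!",
        body="Act now.",
        received=datetime(2024, 1, 1, 3, 15),
        has_attachments=False,
    )
    print(example)
```

Features like these can be fed to the classifier alongside individual words.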
11
Algorithms Used in Spam Detection
[Chart omitted: only the axis ticks (0 to 12) survived extraction]

• Naïve Bayes reported to do very well
• More complex algorithms have some gain

12
Which Algorithm is Best?

• Very difficult to tell
– No consistently-used good data set
– No standard evaluation measures

13
Objectives

• Present several evaluation measures for spam detection
• Compare methods in six open-source spam filters
• Analyze the experimental results

14
Filters
• Some available open-source spam filters
– SpamAssassin
– Bogofilter
– CRM114
– DSPAM
– SpamBayes
– SpamProbe

15
Evaluation Measures (1)
                     Judgement
                     Ham    Spam
Result    Ham        a      b
          Spam       c      d

a: ham correctly classified [true negative]
b: spam misclassified as ham [false negative]
c: ham misclassified as spam [false positive]
d: spam correctly classified [true positive]

• Accuracy: (a+d)/(a+b+c+d)
• Ham misclassification rate: c/(a+c)
• Spam misclassification rate: b/(b+d)
• Spam recall: d/(b+d)
• Spam precision: d/(d+c)

16
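
The measures above follow directly from the four counts. The sketch below is only an illustration: the class name `Confusion` and the example counts are made up, and it treats spam as the positive class, so d counts spam that the filter correctly classified as spam.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    a: int  # ham classified as ham
    b: int  # spam classified as ham
    c: int  # ham classified as spam
    d: int  # spam classified as spam

    def accuracy(self) -> float:
        return (self.a + self.d) / (self.a + self.b + self.c + self.d)

    def ham_misclassification_rate(self) -> float:
        return self.c / (self.a + self.c)

    def spam_misclassification_rate(self) -> float:
        return self.b / (self.b + self.d)

    def spam_recall(self) -> float:
        return self.d / (self.b + self.d)

    def spam_precision(self) -> float:
        return self.d / (self.d + self.c)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    m = Confusion(a=90, b=5, c=10, d=95)
    print(m.accuracy(), m.spam_recall(), m.spam_precision())
```

Spam recall and the spam misclassification rate always sum to 1, since both divide by the total number of spam messages (b+d).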
Conclusion

Using artificial intelligence, emails can be classified as spam or non-spam with almost 99.9% accuracy by the best-performing algorithms.

17
Thank you!

18
