Spam

This document discusses various methods for detecting and filtering spam emails. It begins by outlining the problem of spam and its costs. It then describes different types of spam and some challenges in detecting spam, such as collecting data and the asymmetry of false positives versus false negatives. The document reviews early work using naive Bayes and feature engineering. It also discusses how spammers try to evade detection through techniques like character encoding, hidden text, and image spam. The document explores potential defenses like blacklisting, postage for email, and CAPTCHAs, and notes challenges they face from adaptive spammers and new technologies.

Uploaded by

Rajeev Hatwar


Text Categorization

Moshe Koppel

Lecture 10: Spam Detection

Some slides from Joshua Goodman

Obligatory Scare Slide

There's lots of spam
The proportion of spam is growing; it will soon exceed 100% of all email sent
It costs the world gazillions of dollars
Spam is BAD
(Actually, lately it looks like spam email has been mostly defeated.)

Kinds of spam
Active spam: ads and scams
  email
  chatbots
  commentbots

Passive spam: websites
  link farms for SEO
  AdSense parking lots

Differences between these are increasingly artificial

Special Issues
Spam detection is basically a text cat problem, but there are some special issues:
  Collecting data: non-spam email is private
  Asymmetry: must never classify good mail as spam
  Adversarial: spammers try to defeat filters

Collecting Data
Standard collections
  SpamAssassin Corpus
  TREC corpora

Use your own email
  Might not reflect the world

Gmail has user feedback
  LOTS of examples
  Haphazardly labeled
  How much info do they keep about each email?

Problem of False Positives

False positives are more costly than false negatives

Research must report recall-precision curves; the key operating point is spam precision ~ 1
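
The "precision ~ 1" operating point can be illustrated with a small sketch. It assumes a filter that outputs a spam score per message and flags everything at or above a threshold. The function names, scores and labels below are made up for illustration and are not from the lecture.

```python
def precision_recall(scores, labels, threshold):
    """Spam precision and recall when messages scoring >= threshold are flagged.

    labels use 1 for spam and 0 for good mail.
    """
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_pos = sum(flagged)
    total_spam = sum(labels)
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / total_spam if total_spam else 0.0
    return precision, recall


def best_recall_at_precision(scores, labels, min_precision=1.0):
    """Return (threshold, precision, recall) with the highest recall
    among thresholds whose precision is at least min_precision."""
    best = None
    for threshold in sorted(set(scores)):
        p, r = precision_recall(scores, labels, threshold)
        if p >= min_precision and (best is None or r > best[2]):
            best = (threshold, p, r)
    return best


# Hypothetical filter scores and true labels (1 = spam, 0 = good mail).
scores = [0.99, 0.95, 0.90, 0.80, 0.60, 0.40, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0]

print(best_recall_at_precision(scores, labels))
```

On this toy data the good message scored 0.80 forces the threshold up to 0.90, so one real spam message (scored 0.60) gets through. That is the trade the asymmetry demands.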

Adversarial Problem
Spammers reverse-engineer global filters and use nasty tricks to circumvent them
This is what makes spam detection an interesting problem

Basic Spam
Let's start with some garden-variety spam
This is easily detected by standard text cat tricks

It cost you nothing (Yes! $0) to give Us a call, We will contact You back
Absolutely No exams/Tests/classes/books/Interviews No Pre-School qualification Needed! ----------------------------Inside USA: 1-718-989-5XXX 0utside USA: +1-718-989-5XXX ----------------------------Degree, Bacheelor, masteerMBA, PhDD available in the field of your choice that's Right, You can even become a doctor & receive all the benefits That omes With it! Please Leave Below 3 INFO in voicemail:

1) your Name 2) your Country 3) your Phone No. (with Countrycode)

Call Now! 24 hours a day, 7 Days a week to recieve Your call

Most Honorable Sir,

I am Ehud Olmert, formerly the Prime Minister of Israel. I URGENTLY REQUIRE YOUR ASSISTANCE IN A MOST DISCRETE MATTER. As a result of certain events in my country, it has become necessary for me to transfer a considerable sum of cash to a foreign bank account. I turn to you as a MOST HONORABLE AND TRUSTED PERSON for your discrete assistance. The total amount involved is THIRTY MILLION NEW ISRAELI SHEKELS only [30,000.000.00 NIS] and we wish to transfer this money into safe foreigners account abroad. I am only contacting you as a foreigner because this money cannot be approved to a local person here, but to a foreigner who has information about the account, which I shall give to you upon your positive response. I am revealing this to you with believe in God that you will never let me down in this business, you are the FIRST AND THE ONLY PERSON that I am contacting for this business, so please reply urgently so that I will inform you the next step to take urgently. At the conclusion of this business, you will be given 40% of the total amount, 50% will be for us while 10% will be for the expenses both parties may incurred during this transaction. PLEASE, TREAT THIS PROPOSAL AS TOP SECRET.

Early Work
Learner: Naive Bayes

Sahami et al. 98

Feature Set: Words, Phrases, Structural Features

Feature Selection: top 500 features by information gain

Evaluation Data: ~1700 Messages, ~88% Spam
Results: Spam precision 100%, Spam recall 98.3%
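
For readers who have not seen it, a Naive Bayes spam scorer can be sketched in a few lines. This is a generic multinomial variant with add-one smoothing over whitespace tokens, not a reproduction of the Sahami et al. setup (no phrases, structural features or information-gain selection). The class name and training messages are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption; real filters do more)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_odds(self, text):
        """log P(spam | text) - log P(ham | text); positive means spam."""
        vocab_size = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        totals = {c: sum(self.word_counts[c].values()) for c in ("spam", "ham")}
        score = math.log(self.doc_counts["spam"] / self.doc_counts["ham"])
        for word in tokenize(text):
            p_spam = (self.word_counts["spam"][word] + 1) / (totals["spam"] + vocab_size)
            p_ham = (self.word_counts["ham"][word] + 1) / (totals["ham"] + vocab_size)
            score += math.log(p_spam / p_ham)
        return score


nb = NaiveBayes()
nb.train("free money call now", "spam")
nb.train("cheap degree call now", "spam")
nb.train("meeting moved to thursday", "ham")
nb.train("lecture notes attached", "ham")

print(nb.log_odds("free degree now"))   # positive: leans spam
print(nb.log_odds("lecture moved"))     # negative: leans ham
```

To respect the false-positive asymmetry, a filter would flag a message only when the log odds exceed a threshold well above zero.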

Early Work
Hand-Crafted Features

Sahami et al. 98

35 Phrases: "Free Money", "Only $", "be over 21"
20 Domain-Specific Features:
  Domain type of sender (.edu, .com, etc.)
  Sender name resolutions (internal mail)
  Has attachments
  Time received
  Percent of non-alphanumeric characters in subject
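
A rough sketch of how a few of these hand-crafted features might be computed. The function name, feature keys and sample message are hypothetical. One further assumption is that whitespace is ignored when computing the percentage of non-alphanumeric characters in the subject.

```python
# The three phrases listed on the slide, lowercased for matching.
PHRASES = ["free money", "only $", "be over 21"]


def structural_features(sender, subject, body, has_attachments):
    """Compute a handful of hand-crafted features for one message."""
    domain = sender.rsplit("@", 1)[-1].lower()
    tld = "." + domain.rsplit(".", 1)[-1] if "." in domain else ""

    # Percent of non-alphanumeric characters in the subject, whitespace ignored.
    chars = [ch for ch in subject if not ch.isspace()]
    non_alnum = sum(1 for ch in chars if not ch.isalnum())
    pct = 100.0 * non_alnum / len(chars) if chars else 0.0

    features = {
        "sender_tld": tld,
        "has_attachments": has_attachments,
        "pct_non_alnum_subject": pct,
    }
    text = (subject + " " + body).lower()
    for phrase in PHRASES:
        features["phrase:" + phrase] = phrase in text
    return features


example = structural_features(
    sender="promo@example.com",
    subject="FREE $$$ !!!",
    body="Free money! You must be over 21.",
    has_attachments=False,
)
print(example)
```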

Later Studies
The early work was followed by the usual stream of extended feature sets and fancier learning methods (e.g. SVM)
It is now common to use over 100,000 features
Learning methods for huge data sets must be very efficient (online algorithms)
Methods must be adaptive
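
To make "online" and "adaptive" concrete, here is a minimal sketch of an online learner: logistic regression updated one message at a time by stochastic gradient steps over binary word features. The slides do not name a specific algorithm, so this choice is an assumption, as are the class name, learning rate and toy message stream.

```python
import math


class OnlineLogistic:
    """Logistic regression over binary word features, trained one example at a time."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.learning_rate = learning_rate

    def predict(self, tokens):
        """Estimated probability that the message is spam."""
        z = sum(self.weights.get(t, 0.0) for t in set(tokens))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, tokens, is_spam):
        """One gradient step on a single labeled message; no stored history needed."""
        error = (1.0 if is_spam else 0.0) - self.predict(tokens)
        for t in set(tokens):
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error


# Hypothetical labeled stream, e.g. from user feedback.
stream = [
    ("free money now", True),
    ("meeting notes attached", False),
    ("free degree call now", True),
    ("lunch meeting thursday", False),
]

model = OnlineLogistic()
for _ in range(5):
    for text, is_spam in stream:
        model.update(text.split(), is_spam)

print(model.predict("free money".split()))
print(model.predict("meeting thursday".split()))
```

Each update touches only the features present in the message, which is what keeps this style of learner cheap with 100,000+ features.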

How to Beat an Adaptive Spam Filter
Graham-Cumming 04

Use machine learning to discover words that beat an adaptive filter

Take a message that is near the spam threshold
Send it to the target filter 10,000 times, each time adding 5 random words
Train an "evil" filter to learn which messages beat the target filter
Use the evil filter to modify new spam messages

Found single-word additions that get new spam past the filter

Other Tricks
Fill messages with real text taken from books, sites, etc.
Can even generate real-looking texts using Markovian language models

The Hitchhiker Chaffer

Content Chaff
Random passages from the Hitchhiker's Guide
Footers from valid mail
"This must be Thursday," said Arthur to himself, sinking low over his beer, "I never could get the hang of Thursdays."

Express yourself with MSN Messenger 6.0

Hitchhiker Chaffer's Later Work

"There is nothing fancy about this spam. A spam filter will catch that in its sleep." (anonymous)

Or maybe not

Hitchhiker Chaffer's Later Work

Hidden Text
Content Chaff
URL Spamming
Also included a number of unusual statements made by candidates during, On display? I eventually had to go down to the cellar to find them. http://join.msn.com/?Pag e=features/es

More Tricks
Encoded Text
Distorted Text

Secret Decoder Ring Dude

Another spam that looks easy

Is it?

Secret Decoder Ring Dude

Character Encoding
HTML word breaking
Pharmacy Produc<!LZJ>t<!LG>s
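
One possible countermeasure, sketched under assumptions: render the HTML roughly the way a mail client would before tokenizing, by dropping tags and bogus `<!...>` declarations and decoding character references. The `normalize` helper is hypothetical. It is a different approach from the compression method later in the lecture, which works on the raw HTML source.

```python
import html
import re

# Matches any tag-like span, including bogus declarations such as <!LZJ>.
TAG_RE = re.compile(r"<[^>]*>")


def normalize(html_source):
    """Strip tags first, then decode character references like &#80;."""
    without_tags = TAG_RE.sub("", html_source)
    return html.unescape(without_tags)


print(normalize("Pharmacy Produc<!LZJ>t<!LG>s"))
print(normalize("&#80;harmacy"))
```

Stripping tags before decoding keeps escaped text such as `&lt;b&gt;` from being mistaken for markup.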

Diploma Guy
Word Obscuring

Dplmoia Pragorm

Caerte a mroe prosoeprus

More of Diploma Guy

Diploma Guy is good at what he does

One Pretty Good Text Cat Method

Optimally compress spam training examples
Optimally compress non-spam training examples
Check which compression model better compresses the suspicious message
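
A minimal sketch of the compression idea, with `zlib` standing in for an "optimal" compressor (an assumption; the slides do not name one). The message is appended to each class's training text, and the class whose compressed size grows least wins. The corpora and function names are toy examples.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus, message):
    """How many more compressed bytes the message costs given the corpus."""
    return compressed_size(corpus + " " + message) - compressed_size(corpus)


def classify(message, spam_corpus, ham_corpus):
    """Label the message by whichever corpus compresses it better.

    Ties go to "ham", since false positives are the costlier error.
    """
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy training text; real corpora would be far larger.
SPAM = "buy cheap pills now. free money, call now. cheap degree, no exams, call now. "
HAM = "the meeting is moved to thursday. please see the lecture notes attached. lunch after the lecture? "

print(classify("free money call now", SPAM, HAM))
print(classify("lecture notes attached", SPAM, HAM))
```

Note that zlib only looks back 32 KB, so with large corpora a compressor with a longer memory or an explicit character-level model would be needed.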

Why This Works

Works at the level of character n-grams
Should be applied to HTML source
Captures weird encodings, word distortions

Probably using character n-grams with SVM would also work well
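
For illustration, character n-gram features can be extracted from raw source as below. The function names are hypothetical, and the SVM (or any other learner) that would consume these counts is left out.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, n=3):
    """Bag of character n-grams, usable as a sparse feature vector."""
    return Counter(char_ngrams(text, n))


print(char_ngrams("spam", 3))
print(ngram_counts("Produc<!LZJ>t<!LG>s", 3).most_common(3))
```

Because the n-grams come from the raw source, obfuscation debris such as `<!L` becomes a feature in its own right instead of hiding the word.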

But Spammers Aren't Sitting Around

Embed text in images (can vary non-text parts of image)
Also, just send link to spam site

Text Cat Isn't the Only Trick

Don't display images w/o user okay
Blacklist IPs that spam comes from
  Can harm legitimate senders (zombies, etc.)

Charge postage for email

  Cash
  Puzzles that waste CPU
  Task easy for humans, hard for computers

[Figure: challenge-response exchange between Sender and Recipient, with arrows labeled Message, Challenge and Response]
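
The "puzzles that waste CPU" form of postage can be sketched as a hash-based proof of work in the style of Hashcash. The slides do not name a scheme, so the construction, the function names and the 12-bit difficulty are assumptions. The sender searches for a nonce; the recipient checks it with a single hash.

```python
import hashlib


def leading_zero_bits(digest):
    """Number of zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def puzzle_hash(message, nonce):
    return hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()


def solve(message, difficulty_bits):
    """Sender side: brute-force a nonce. Expected work doubles with each bit."""
    nonce = 0
    while leading_zero_bits(puzzle_hash(message, nonce)) < difficulty_bits:
        nonce += 1
    return nonce


def verify(message, nonce, difficulty_bits):
    """Recipient side: one hash, no search."""
    return leading_zero_bits(puzzle_hash(message, nonce)) >= difficulty_bits


stamp = solve("to:alice@example.com subject:hello", 12)
print(stamp, verify("to:alice@example.com subject:hello", stamp, 12))
```

At 12 bits the sender needs about 4,000 hashes on average, which is negligible for one message but adds up across millions.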

CAPTCHAs
Identify distorted characters
Supposed to be easy for humans, hard for computers
Actually, nowadays computers are better at it than humans

Computers vs. Humans

Slight Variation
Fortunately, for now, humans are still better than computers at identifying character boundaries

New CAPTCHAs

Economics of CAPTCHAs
CAPTCHAs are taken from books Google is trying to OCR. We all work for them for free.
Spammers use Mechanical Turk to solve CAPTCHAs. It's worth paying for.

