Spam
Moshe Koppel
Kinds of spam
Active spam: ads and scams
email, chatbots, commentbots
Special Issues
Spam detection is basically a text categorization problem, but there are some special issues:
Collecting data: non-spam email is private
Asymmetry: must never classify good mail as spam
Adversarial: spammers try to defeat filters
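The asymmetry point can be read as a decision threshold: flag a message only when the estimated spam probability is very high. The sketch below assumes some upstream classifier already yields a probability. The function names and the 999-to-1 cost ratio are illustrative assumptions, not values from these slides.

```python
def threshold_from_costs(cost_false_positive, cost_false_negative):
    """Probability cutoff that minimizes expected cost.

    Flagging is cheaper than not flagging when
    (1 - p) * cost_false_positive < p * cost_false_negative,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def is_spam(p_spam, threshold):
    """Flag as spam only when the probability exceeds the cutoff."""
    return p_spam > threshold


# Hypothetical costs: losing one good mail is treated as 999 times worse
# than letting one spam message through.
cutoff = threshold_from_costs(999.0, 1.0)
```

With these assumed costs the cutoff is 0.999, so a message scored at 0.95 still goes to the inbox.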
Collecting Data
Standard collections
SpamAssassin Corpus
TREC corpora
Adversarial Problem
Spammers reverse-engineer global filters and use nasty tricks to circumvent them.
This is what makes spam detection an interesting problem.
Basic Spam
Let's start with some garden-variety spam.
This is easily detected by standard text categorization tricks.
It cost you nothing (Yes! $0) to give Us a call, We will contact You back
Absolutely No exams/Tests/classes/books/Interviews No Pre-School qualification Needed! ----------------------------Inside USA: 1-718-989-5XXX 0utside USA: +1-718-989-5XXX ----------------------------Degree, Bacheelor, masteerMBA, PhDD available in the field of your choice that's Right, You can even become a doctor & receive all the benefits That omes With it! Please Leave Below 3 INFO in voicemail:
I am Ehud Olmert, formerly the Prime Minister of Israel. I URGENTLY REQUIRE YOUR ASSISTANCE IN A MOST DISCRETE MATTER. As a result of certain events in my country, it has become necessary for me to transfer a considerable sum of cash to a foreign bank account. I turn to you as a MOST HONORABLE AND TRUSTED PERSON for your discrete assistance. The total amount involved is THIRTY MILLION NEW ISRAELI SHEKELS only [30,000.000.00 NIS] and we wish to transfer this money into safe foreigners account abroad. I am only contacting you as a foreigner because this money cannot be approved to a local person here, but to a foreigner who has information about the account, which I shall give to you upon your positive response. I am revealing this to you with believe in God that you will never let me down in this business, you are the FIRST AND THE ONLY PERSON that I am contacting for this business, so please reply urgently so that I will inform you the next step to take urgently. At the conclusion of this business, you will be given 40% of the total amount, 50% will be for us while 10% will be for the expenses both parties may incurred during this transaction. PLEASE, TREAT THIS PROPOSAL AS TOP SECRET.
Early Work
Learner: Naive Bayes
Sahami et al 98
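To make the learner concrete, here is a minimal Naive Bayes sketch. It is a multinomial variant with add-one smoothing over whitespace tokens, which is not necessarily the exact model used in the cited work. All function names and the four toy training messages are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'.

    Assumes both labels occur at least once in the training data.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def log_score(text, label, model):
    """Log prior plus summed log likelihoods of the known words."""
    word_counts, doc_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        if word in vocab:  # words never seen in training are ignored
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
    return score


def classify(text, model):
    return max(("spam", "ham"), key=lambda label: log_score(text, label, model))


# Toy training set, invented for this example.
model = train([
    ("free money now", "spam"),
    ("only free degree", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
```

Calling `classify("free money", model)` picks the label whose log score is higher. Working in log space avoids underflow on long messages.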
Early Work
Hand Crafted Features
Sahami et al 98
35 Phrases:
  Free Money
  Only $
  be over 21
20 Domain Specific Features:
  Domain type of sender (.edu, .com, etc.)
  Sender name resolutions (internal mail)
  Has attachments
  Time received
  Percent of non-alphanumeric characters in subject
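A few of the features listed above could be computed along these lines. This is only a sketch: the function name, the feature keys, and the choice not to count whitespace as non-alphanumeric are assumptions, and only the three phrases shown on the slide are included.

```python
# The three example phrases shown on the slide (the full list has 35).
PHRASES = ("free money", "only $", "be over 21")


def extract_features(sender, subject, body, has_attachment=False):
    """Build a small feature dictionary for one message."""
    text = f"{subject} {body}".lower()
    features = {f"phrase:{p}": (p in text) for p in PHRASES}

    # Domain type of sender: the last dot-separated label of the address.
    domain = sender.rsplit("@", 1)[-1]
    features["sender_tld"] = domain.rsplit(".", 1)[-1].lower()

    features["has_attachment"] = has_attachment

    # Percent of subject characters that are neither alphanumeric nor
    # whitespace (excluding whitespace is a choice made for this sketch).
    odd = sum(1 for ch in subject if not ch.isalnum() and not ch.isspace())
    features["subject_pct_non_alnum"] = (
        100.0 * odd / len(subject) if subject else 0.0
    )
    return features


example = extract_features("bob@example.edu", "FREE!!!", "Get free money today")
```

The resulting dictionary mixes boolean, categorical, and numeric values, so a learner would still need to discretize or encode them.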
Later Studies
The early work was followed by the usual stream of extended feature sets and fancier learning methods (e.g. SVM).
It is now common to use over 100,000 features.
Learning methods for huge data sets must be very efficient (online algorithms).
Methods must be adaptive.
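One way to combine a very large feature space with efficient, adaptive learning is an online perceptron over hashed token features. The sketch below illustrates that general idea and is not a method named in these slides. The class name, the bucket count, and the two toy messages are assumptions.

```python
import zlib

NUM_BUCKETS = 2 ** 18  # hypothetical feature-space size (over 100,000 slots)


def hashed_features(text):
    """Map each token to a bucket index; crc32 is stable across runs."""
    return {
        zlib.crc32(token.encode("utf-8")) % NUM_BUCKETS
        for token in text.lower().split()
    }


class OnlinePerceptron:
    """Mistake-driven linear classifier updated one message at a time."""

    def __init__(self):
        self.weights = {}  # sparse: only buckets that were ever updated
        self.bias = 0.0

    def score(self, text):
        return self.bias + sum(
            self.weights.get(i, 0.0) for i in hashed_features(text)
        )

    def predict(self, text):
        return self.score(text) > 0  # True means spam

    def update(self, text, is_spam):
        # Adjust weights only when the current prediction is wrong.
        if self.predict(text) != is_spam:
            step = 1.0 if is_spam else -1.0
            self.bias += step
            for i in hashed_features(text):
                self.weights[i] = self.weights.get(i, 0.0) + step


# Toy stream, invented for this example; a real stream never ends.
stream = [("free money now", True), ("meeting notes attached", False)]
clf = OnlinePerceptron()
for _ in range(3):
    for text, label in stream:
        clf.update(text, label)
```

Each update touches only the buckets present in one message, so the cost per message is independent of the total number of features, and the model keeps adapting as new labeled mail arrives.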
Other Tricks
Fill messages with real text taken from books, sites, etc.
Can even generate real-looking texts using Markovian language models.
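To see what such generated filler looks like (for example, when building test data for a filter), a first-order word-level Markov chain is enough. This is a minimal sketch. The function names, the fixed seed, and the sample sentence are assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Record, for every word, the words that follow it in the source."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=0):
    """Random walk over the chain; stops early at a word with no followers."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        followers = chain.get(output[-1])
        if not followers:
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Sample source text, invented for this example.
chain = build_chain("the cat sat on the mat and the cat ran")
sample = generate(chain, "the", 8)
```

Every adjacent word pair in the output also occurs in the source, which is why the result looks locally plausible even though it carries no meaning overall.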
Or maybe not
More Tricks
Encoded Text
Distorted Text
Is it?
Diploma Guy
Word Obscuring
Dplmoia Pragorm
Probably using character n-grams with SVM would also work well
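The feature-extraction half of that suggestion can be sketched as follows. The SVM itself is omitted, and the function names and the choice of trigrams are assumptions. The example compares a misspelling that appears in the sample spam earlier in these slides ("masteer") with its intended word.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, with one space of padding on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b, n=3):
    """Share of n-grams the two strings have in common (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


overlap_misspelled = jaccard("master", "masteer")
overlap_unrelated = jaccard("master", "meeting")
```

A word-level feature treats "masteer" and "master" as unrelated tokens, while most of their character trigrams coincide, so a classifier trained on n-gram features still sees familiar evidence.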
Challenge-response diagram: Sender, Recipient; Message, Challenge, Response
CAPTCHAS
Identify distorted characters.
Supposed to be easy for humans, hard for computers.
Actually, nowadays computers are better at it than humans.
Slight Variation
Fortunately, for now, humans are still better than computers at identifying character boundaries
New CAPTCHAS
Economics of CAPTCHAs
CAPTCHAs taken from books Google is trying to OCR. We all work for them for free.
Spammers use Mechanical Turk to solve CAPTCHAs. It's worth paying for.