SMSAssassin: Crowdsourcing Driven Mobile-based System for SMS Spam Filtering
Kuldeep Yadav, Ponnurangam Kumaraguru, Atul Goyal, Ashish Gupta and Vinayak Naik
Indraprastha Institute of Information Technology (IIIT), Delhi, India
Abstract

Due to the increasing use of Short Message Service (SMS) over mobile phones in developing countries, there has been a burst of spam SMSes. Content-based machine learning approaches have been effective in filtering email spam. Researchers have used topical and stylistic features of SMSes to classify spam and ham. Unlike email spam filtering, SMS spam filtering can be strongly influenced by the presence of regional words, abbreviations and idioms. We have tested the feasibility of applying Bayesian learning and SVM-based machine learning techniques, which were reported to be the most effective in email spam filtering. In our ongoing research, as an exploratory step, we have developed a mobile-based system, SMSAssassin, that can filter SMS spam messages based on Bayesian learning and a sender blacklisting mechanism. Since spam SMS keywords and patterns keep changing, SMSAssassin uses crowdsourcing to keep itself updated. Using a dataset that we are collecting from real-world users, we evaluated our approaches and found some interesting results. Through a survey, we also found that the perception of SMSes differed between men and women.

1 Introduction and Motivation

Mobile phones have become ubiquitous and pervasive around the world. The popularity of mobile phones has increased exponentially in the last decade. One of the most popular services on mobile phones is text messaging, the Short Message Service (SMS). In particular, in countries like India, where the usage of mobile phones has exploded (nearly 671.69 million subscribers as of June 2010 [2]), SMS has become the way to communicate among people. SMS is the cheapest option available to reach the masses. Last year, a TRAI (Telecom Regulatory Authority of India) report showed that an average Indian sent approximately 29 SMSes per month [1]. Apart from personal communication, SMS is now widely used for offering value-added services, as an advertisement medium, or as a means of getting consumers involved. Dedicated companies have been started for SMS-based advertising, through which one can reach 1,00,000 people for just USD 80 [3].

The increase in SMS usage has increased SMS spam. Even after various solutions (people, process, regulatory and technology), SMS spam seems to be increasing and causing a lot of annoyance to users. SMS spam is any unsolicited message delivered to a mobile phone through text messaging. By one estimate, more than 100 million spam SMSes are sent per day in India [2]. Governments and many service providers have taken various countermeasures (mostly regulatory) to reduce the number of spam SMSes (e.g. by imposing substantial fines on spammers, blocking specific phone numbers etc.). There are now various web-based solutions for bulk messaging, which make the spammers' job very easy. Like other countries, India has set up an NDNC (National Do Not Call) registry. Mobile subscribers have access to this NDNC registry through their operator. The main role of the NDNC registry is that after registering, you are not supposed to get advertisement calls or promotional SMSes [6]. It takes about 45 days for the registration to be effective. Also, the cost of a phone call is several times that of an SMS [2], making SMS a preferred choice for marketing and advertisement.

One observation is that while NDNC has largely reduced junk calls, it has failed to stop spam SMSes. There could be many reasons for the countermeasures not having an impact on SMS spam: operators may control the SMS spam that they send themselves, but they do not have an automated system in place to filter third-party SMS spam; and only 15 percent [5] of the total mobile subscribers in India are registered with NDNC. There are also other reasons, such as the lack of penalties (e.g. monetary or suspension of the service) for spammers and the absence of a good NDNC system for reporting spam problems [5]. NDNC does provide a number on which one can report an SMS spam sender identified by its mobile number, but this becomes difficult when the SMS arrives from a numeric code like 56789. In Feb 2009, TRAI set a regulation that if a bulk sender wants to use a short numeric code as a sender identification number, the code has to be preceded by an operator code (like "AD" to mean Airtel, Delhi).

As with email spam, there are different stakeholders who should be solving the problem of SMS spam [11]. From a network operator's point of view, SMS spam is nearly the same as email spam. Mobile phone network operators also have a high interest in reducing SMS spam on their networks. SMS spam is usually sent in bulk, so it generates a high volume of traffic that overloads the network, a condition called SMS flooding. SMS flooding can delay the reception of important / valid SMSes. In India, Airtel, one of the largest providers, has stopped the business of sending bulk SMSes. On the other hand, some operators use the bulk SMS business to generate revenue.

In recent times, many media reports have been published on the SMS spam problem [2, 3]. The most important fact here is that end users are helpless in controlling the number of spam SMSes they receive. SMS spam messages are annoying for two reasons: an SMS cannot be deleted without being read, and it triggers a notification to the user once it is received by the mobile phone. Thus SMS spam wastes human attention, which is most precious in the information age. In some countries, you have to pay for receiving an SMS, and while traveling / roaming you pay for every SMS that you get, so even receiving spam SMSes can be very expensive.

Unlike with email spam, the spammed user in the case of SMS has no provision for taking any countermeasure to prevent SMS spam. Due to the proliferation of smart mobile devices, we can now perform spam filtering at the device level, giving hope that spam SMSes can be silently deleted without the user's knowledge (a "silently eliminating the threat" kind of solution [12]). If we personalize the solution for detecting / deleting SMS spam, it may be appreciated by users. The contributions of this paper are:

• We present our detailed analysis of a self-collected, India-centric, real-world SMS spam dataset, which can help the research community to investigate further (footnote 1).

Footnote 1: To the best of our knowledge, this is the first research work on such an India-centric dataset and the first mobile-based solution.
• We present a proof-of-concept SMS inbox with spam filtering and reporting capability.
• Our Bayesian-based spam filtering methodology seems to detect spam and ham effectively. We also use SVM-based filtering for higher accuracy.

The paper is organized as follows. In Section 2, we present the related work on the SMS spam filtering problem. In Section 3, we list the challenges and design goals of a spam filtering system. Section 4 presents the database collection methodology and descriptive statistics about the current SMS database. In Section 5, we present the system description of SMSAssassin and a detailed description of all three parts: Bayesian learning, mobile application and synchronization service. Section 6 presents the performance evaluation and the analysis derived from our results. Section 7 presents user preferences about SMS spam, and we conclude with a discussion in Section 8.

2 Background

Content-based filtering approaches have been remarkably effective in email spam filtering. These approaches are mainly based on machine learning algorithms that operate on hand-engineered features to differentiate spam from ham (legitimate SMS). The whole dataset is divided into a training set and a testing set; machine learning approaches such as Bayesian learning learn from already tagged spam and ham messages in the training set, and the testing set is used to analyze the effectiveness of the techniques. Researchers have tried machine learning based approaches in SMS spam filtering [7, 8, 9]. Due to the short length of an SMS, it is hard to find the features required for machine learning classification, which makes it a different problem from traditional email spam [8].

Gomez et al. [7] explored the use of statistical learning based classifiers trained with lexical features, such as character and word n-grams, for SMS spam filtering. They specifically tested the feasibility of applying a Bayesian classifier to the SMS spam problem. Their paper also discusses the state of SMS spam in Europe and various sources of SMS spam. Cormack et al. [8] discuss the applicability of content-based approaches to short messages, which include SMS, blog comments, web logs and bulletin boards. The difficulty of finding features in a short message restricts the application of content-based classifiers. Some of these works have focused on expanding the feature set for content-based mobile spam classifiers with additional features, such as orthogonal sparse word bigrams [8, 9]. These features were quite effective in feature vector based machine learning algorithms such as Support Vector Machines (SVM) and Orthogonal Sparse Bigrams with confidence Factor (OSBF)-Lua. OSBF-Lua exploits relationships between nearby words. For example, "You have been offered a movie ticket" can yield, among others, the following combinations: You (0) have; offered (0) movie. Character bigrams break a word like ticket down into "ti", "ic", "ck" etc.; character trigrams break ticket down into "tic", "ick" etc.

Dae-Neung Sohn et al. [10] took a different approach, considering stylistic information and arguing that content-based algorithms will not work well in cases where common spam words like "sale", "offer" etc. are also present in legitimate messages. They therefore selected a feature set consisting of average word / message length, special character counts, function word frequencies ("the", "is") etc. to see the effect of stylistic information on SMS spam classification. The evaluation metric used in these works was the Area Under Curve (AUC) of the ROC curve, specifically 1-AUC(%). Due to the lack of a common dataset / benchmark, it is very hard to compare the accuracy of two different works. However, [10] shows that they improved on the result of previous work from 10.7227 to 3.7538 (the lower the 1-AUC(%), the better) using stylistic features.

With the advent of micro-blogging, websites like Twitter have also opened opportunities for spammers. The motivation for spammers on Twitter is mostly to increase website traffic by posting unrelated URLs in tweets containing trending topics, whereas the SMS spammers' motivation is simply a low-cost advertisement solution. In [13], the authors took into consideration user attributes that depend on the user's social network on Twitter to identify spammers. They also considered content attributes of tweets that may help in classification, such as the number of words that are strong indicators of spam, e.g. sale, free etc.

3 Challenges and Design Goals

The SMS spam filtering problem poses various research and engineering challenges that require further research. Due to the limit of 160 characters in an SMS, it is very hard to get sufficient discriminative features for machine learning approaches to give good filtering results. Also, the frequent presence of shorthand notations and regional words in SMSes makes feature engineering more difficult, and is one of the main reasons that make the problem distinct from email spam filtering. Moreover, SMS spam is a comparatively less studied problem than email spam filtering; there is also a lack of a common benchmark / dataset for conducting studies on SMS spam. Due to the presence of regional words in SMSes, we need a separate dataset for each region, and each will require separate feature engineering. Researchers [8, 9, 10] have also tested their proposed approaches on Korean and Spanish datasets.

Here, we present a set of design goals that will act as guidelines for designing a mobile-based SMS spam filtering application:

1. Client side solution: All related work has focused on applying machine learning techniques that are mostly implemented on a server. There is a lack of an end-user-level mobile application for SMS spam filtering.
2. Computationally less intensive: The technique / algorithm used for the spam filtering application should be computationally less intensive so that it can be used on mobile devices. All previous work has used server-side solutions with machine learning techniques that cannot easily be used on mobile devices directly.
3. Real time filtering: Detecting spam SMSes in real time and making a decision on them (flag, delete, etc.) will be very useful for users. With this approach, the user is not even aware of receiving any spam SMS.
4. Minimal resources: It should create minimal extra overhead for the user or the mobile device. The spam filtering should be done in minimal time.
5. Self learning: The system should keep learning from the decisions that it makes while classifying. There should also be a way to report a message as spam, as well as to report a wrongly classified message as ham.
6. Reasonable accuracy: The mobile application should give a reasonable level of accuracy in filtering spam SMSes. Due to the nature of machine learning algorithms, we may have some false positives and false negatives. The system should store spam messages so that the user can check for any misclassification.
7. Blacklisting sender: The mobile application should have provisions to blacklist a sender such that all future SMSes from that sender go to the spam folder or get deleted (depending on the user preference).
8. Personalization: The mobile application should be personalized for its users. The perspective of users on the same SMS can differ: one user can see an SMS as spam whereas another can see the same SMS as useful information. The mobile application should use the user preferences while filtering SMS spam.
9. Privacy-preserving: Since SMS is an integral part of communication nowadays, privacy is one of the foremost requirements for an SMS spam filtering mobile application: no (private) SMS of the user should be revealed to a third party.
10. Platform independent: There are many hardware and software makers for mobile devices. The mobile-based spam filtering solution should be platform independent.

4 Database Collection

The SMS database that we have used for our study contains a total of 4,318 SMSes. For collecting SMS spam data, we ran an incentivized crowdsourcing scheme on our campus. We asked participants to forward spam SMSes to our SMS server. We received nearly 4000 spam SMSes in less than two months. Every fortnight, we awarded a food coupon to the participant who sent the maximum number of unique SMSes to the server. Out of the 4000 spam SMSes received, nearly 50% were duplicates. We therefore observe that most users are getting the same spam SMSes.

We did not collect ham SMSes through crowdsourcing due to privacy reasons. Instead, we collected ham messages from close associates. We are continuing to collect data and we see an increase in the amount of data that we are collecting. We expect the SMS spam database to reach more than 20,000 in the next 2-3 months. After that, we will share the database with the research community. We also plan to conduct a longitudinal study on the collected spam SMSes. Table 1 presents some statistics about the SMS database. Some very interesting patterns emerge if we analyze Table 1: the average number of special characters was higher in hams than in spams due to the presence of jokes; the average length of hams was also higher than that of spams because of the presence of many irregular spaces and long SMSes; the average word length was lower in hams than in spams due to the frequent presence of shorthand abbreviations.

                                        Ham      Spam
Total SMSes                             2195     2123
Average length                          157.6    151.6
Average number of special characters    10.2     8.7
Average number of words                 29.9     23.7
Average word length                     3.9      5.25
Average presence of URL                 0        0.2

Table 1: Descriptive statistics of the collected data

Figure 1 has the tag cloud generated from ham and spam messages. Due to the large influence of regional words in SMSes, an SMS has a combination of Hindi and English words. We found that a large portion of the SMS text in ham is in Hindi (footnote 2). Out of the top 100 most frequent words in ham messages, 44 were Hindi words, whereas among spam words only 16 were from Hindi.

Figure 1: Tag cloud generated from the spam and ham that we collected. Left: tag cloud for spam SMSes; we see the occurrence of words like get, free, noida, apply, bhk. Right: tag cloud for ham SMSes; most of the words here are regional and not English.

5 System Description

In this section, we describe the system architecture and explain each of its parts. Figure 2 represents the architecture of SMSAssassin. There are three major parts of SMSAssassin: the Bayesian filtering algorithm, the mobile application, and the synchronization service that runs on a central server.

5.1 Bayesian Approach for Spam Filtering

Bayesian filtering and SVM are reported to be the most successful techniques for SMS spam filtering [8]. Bayesian filtering has also been reported as one of the best techniques in email spam classification systems. Apart from that, the Bayesian spam filtering approach requires minimal computation for classification, unlike other classification algorithms such as SVM, making it a preferred choice for mobile handheld devices. Like all other supervised machine learning techniques, Bayesian learning needs a seed dataset for training. In the training stage, it computes the occurrence of each word in spam as well as in legitimate SMSes to learn the probability of finding that word in spam / ham. For example, words like "sms", "free", "call" have a higher spam probability due to their frequent occurrence in spam SMSes. These words can also be seen in the SMS tag cloud shown in Figure 1. After training, Bayes' theorem is used to calculate the probability of a message being spam given the words present in that message.

After obtaining the probability of every word's spaminess, the technique computes the combined probability under the basic assumption that all of these are independent events. Finally, the combined probability value is compared with a threshold ρ (footnote 3): if the combined probability is greater than ρ, then the message is likely to be spam; otherwise it is ham. As one can see in Figure 2, SMSAssassin uses the SpamKeywordsFreq list on mobile phones and GlobalSpamKeywordsFreq on the server to keep track of spam keyword frequencies. In the same way, the SenderBlacklist and GlobalSenderBlacklist lists are used to detect spam based on the sender's address (phone number).

5.2 Mobile Application

We designed and implemented an SMS inbox with spam filtering capability as a mobile application. The mobile application is implemented in Python S60 on the Symbian platform. Screenshots of the mobile application running on a phone are given in Figure 3. One tab

Footnote 2: Hindi is the official language of India.
Footnote 3: The designer of the filter can decide on the threshold value.
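The Bayesian filtering and sender blacklisting steps described in Section 5.1 can be sketched as follows. This is only an illustrative sketch, not the SMSAssassin implementation: the class name `BayesianSmsFilter`, the tokenizer, the add-one smoothing, the equal spam / ham priors, the formula used to combine per-word probabilities, and the default threshold of 0.9 are all assumptions introduced here. The two counters play the role that the paper assigns to the SpamKeywordsFreq list, and the sender set plays the role of SenderBlacklist.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the SMS and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BayesianSmsFilter:
    """Hypothetical sketch of a word-frequency Bayesian filter with a sender blacklist."""

    def __init__(self, threshold=0.9, sender_blacklist=None):
        self.threshold = threshold              # rho in the paper; 0.9 is an assumed value
        self.sender_blacklist = set(sender_blacklist or [])
        self.spam_counts = Counter()            # number of spam SMSes containing each word
        self.ham_counts = Counter()             # number of ham SMSes containing each word
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Training stage: count in how many spam / ham messages each word occurs."""
        words = set(tokenize(text))
        if is_spam:
            self.spam_counts.update(words)
            self.n_spam += 1
        else:
            self.ham_counts.update(words)
            self.n_ham += 1

    def word_spaminess(self, word):
        """P(spam | word) via Bayes' theorem, with add-one smoothing and equal priors."""
        p_word_given_spam = (self.spam_counts[word] + 1) / (self.n_spam + 2)
        p_word_given_ham = (self.ham_counts[word] + 1) / (self.n_ham + 2)
        return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

    def spam_probability(self, text):
        """Combine per-word probabilities, treating the words as independent events."""
        log_spam = 0.0
        log_ham = 0.0
        for word in set(tokenize(text)):
            p = self.word_spaminess(word)       # strictly between 0 and 1 due to smoothing
            log_spam += math.log(p)
            log_ham += math.log(1.0 - p)
        diff = log_ham - log_spam
        if diff > 700:                          # avoid overflow in math.exp
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))

    def is_spam(self, sender, text):
        """Blacklisted senders are always spam; otherwise compare with the threshold."""
        if sender in self.sender_blacklist:
            return True
        return self.spam_probability(text) > self.threshold


spam_filter = BayesianSmsFilter(threshold=0.9, sender_blacklist={"56789"})
spam_filter.train("free sms offer call now", is_spam=True)
spam_filter.train("call now for free prize", is_spam=True)
spam_filter.train("see you at lunch", is_spam=False)
spam_filter.train("are you coming home", is_spam=False)
```

Working in log space keeps the product of many small probabilities from underflowing, and classification needs only dictionary lookups and a few additions per word, which is consistent with the paper's point that Bayesian filtering is light enough for handheld devices.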
                     Ham Accuracy    Spam Accuracy
Bayesian Learning    97%             72.5%
SVM                  93%             86%

Table 2: Classification accuracy comparison of machine learning approaches for SMS spam filtering using the same training and testing set.
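For readers who want to reproduce metrics of the kind reported here, the sketch below shows one way to compute per-class (ham and spam) accuracy and the 1-AUC(%) measure used in the related work. The function names and the toy labels and scores are hypothetical; the AUC is computed with the standard pairwise definition (the probability that a randomly chosen spam message scores higher than a randomly chosen ham message, with ties counted as one half), which may differ in detail from the tooling the authors used.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (ham_accuracy, spam_accuracy) as fractions; labels are 'ham' or 'spam'."""
    totals = {"ham": 0, "spam": 0}
    correct = {"ham": 0, "spam": 0}
    for truth, predicted in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return tuple(
        correct[label] / totals[label] if totals[label] else 0.0
        for label in ("ham", "spam")
    )


def one_minus_auc_percent(y_true, spam_scores):
    """1-AUC(%) from classifier scores, where a higher score means 'more spam-like'."""
    spam = [s for t, s in zip(y_true, spam_scores) if t == "spam"]
    ham = [s for t, s in zip(y_true, spam_scores) if t == "ham"]
    if not spam or not ham:
        raise ValueError("need at least one spam and one ham example")
    # Count spam/ham pairs that are ranked correctly; ties contribute one half.
    wins = sum(
        1.0 if s > h else 0.5 if s == h else 0.0
        for s in spam
        for h in ham
    )
    auc = wins / (len(spam) * len(ham))
    return 100.0 * (1.0 - auc)


# Toy example with made-up labels and scores.
labels = ["ham", "ham", "spam", "spam"]
predictions = ["ham", "spam", "spam", "spam"]
scores = [0.1, 0.6, 0.4, 0.9]
ham_acc, spam_acc = per_class_accuracy(labels, predictions)
```

Reporting ham and spam accuracy separately, as Table 2 does, matters because misclassifying a legitimate message is usually more costly to the user than letting a spam message through.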
Feature                                                            F-score
Count of spam words                                                0.141353
Count of '/'                                                       0.138823
Average word length (space as delimiter)                           0.131290
Average word length (space and special characters as delimiter)    0.118367
Presence of URL                                                    0.116232
Count of numeric words                                             0.098747
Count of alpha-numeric words                                       0.098747
Presence of smileys                                                0.096299
Presence of full URL                                               0.096299
Number of words                                                    0.063765

Table 3: Top 10 features used for SVM classification with their F-score.

... frequent occurrence in both spam and ham, like reply, bar etc. Given the good ham classification produced by the Bayesian approach, we can say that there is a very low probability that legitimate (ham) messages will be classified as spam.

Motivated by the low spam classification accuracy of the Bayesian approach, we used a Support Vector Machine (SVM) for classification. SVMs are supervised machine learning algorithms that build a classification model from training data. That classification model is used to predict the category of new examples or testing data. We used the SVM implementation of the SVMLib library for classification on our dataset. For SVM to give good accuracy, we need to find a good feature set that discriminates well between spam and ham. We started with a set of 20 features, which contains keyword-based features such as the presence of spam words and phrases, and stylistic features such as the presence of special characters, average word length etc. We took special care to select features that are less computation intensive so that they can easily be computed in a mobile application. Using SVMLib, we computed the F-score to compare the performance of the various features in classification. The F-score shows the importance of a feature, i.e. a higher F-score means that the feature is a better discriminator of spam and ham. In Table 3, we present the top 10 features with their respective F-scores on our dataset. The following interesting observations came out of our feature engineering on an India-centric dataset:

1. There are some special characters like '/' which are very frequent in spams.
2. Average word length in most spam SMSes is high due to the presence of special characters in place of spaces. Also, average word length is low in most hams due to the frequent use of shorthand abbreviations.
3. We found that the SMSes containing a URL were certainly spam.
4. Spam SMSes have a higher probability of containing numeric words because each spam SMS contains a number to call or SMS.

Based on the 20 different features, SVM improved upon the spam classification accuracy produced by Bayesian learning (see Table 2). Surprisingly, SVM decreased the ham classification accuracy. We noticed that this happened due to many machine-generated ham SMSes, and because discriminating features such as special characters were also present in some ham SMSes (mostly jokes). Even while using a computationally light feature set in SVM, we were able to get comparable accuracy, with a 1-AUC(%) value of 5.61.

7 User Preferences

The decision on whether an SMS is spam or ham may differ from person to person. To find global patterns in user preferences with respect to gender, age and type of spam SMS, we developed an "SMS spam user study" web application. We randomly selected 200 spam SMSes from our SMS spam dataset for this study. Most of these 200 SMSes (initially collected by crowdsourcing as spam SMSes) were marketing messages, and the main aim of this study was to get user perceptions about those SMSes. Each participant was presented with 10 different spam SMSes with four options to choose from: spam, ham, not-sure, skip. There were 39 male and 11 female participants in the study. There were a total of 477 replies for the 200 SMSes. The average age of the female participants was 23.5 years, whereas for the male participants it was 21 years. Two interesting observations came out of this study: females mostly tagged messages related to festival offers, beauty products, discount offers on food / goods etc. as ham; males mostly tagged messages related to sports, businesses (opening of a new shop etc.), jobs, studies, social gatherings and recharge schemes as ham. Further, we observed that most food discount offers were tagged as ham by both males and females.
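To make the lightweight features of Table 3 concrete, the sketch below computes several of them for a single SMS. It is a minimal illustration under stated assumptions, not the authors' feature extractor: the function name `extract_features`, the URL pattern, the tokenization rules and the tiny default spam word list are all hypothetical, and the smiley and "full URL" features are omitted because the paper does not define them.

```python
import re

# Assumed URL pattern: anything starting with http://, https:// or www.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def average_length(tokens):
    """Mean token length, or 0.0 for an empty token list."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def extract_features(text, spam_words=("free", "offer", "call")):
    """Compute a few Table 3 style features; spam_words is a placeholder list."""
    # Words delimited by whitespace only.
    space_tokens = text.split()
    # Words delimited by whitespace and any special (non-alphanumeric) character.
    fine_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return {
        "spam_word_count": sum(1 for t in fine_tokens if t.lower() in spam_words),
        "slash_count": text.count("/"),
        "avg_word_len_space": average_length(space_tokens),
        "avg_word_len_space_special": average_length(fine_tokens),
        "has_url": int(bool(URL_RE.search(text))),
        "numeric_word_count": sum(1 for t in fine_tokens if t.isdigit()),
        "alnum_word_count": sum(
            1 for t in fine_tokens
            if re.search(r"[A-Za-z]", t) and re.search(r"[0-9]", t)
        ),
        "word_count": len(space_tokens),
    }


features = extract_features("FREE 2BHK flat/noida call 9812345678 www.example.com")
```

Every feature here is a single pass of counting or splitting over at most 160 characters, which fits the paper's requirement that feature computation be cheap enough for a phone. Only the resulting numeric values, rather than the message text, would need to leave the device if classification is done elsewhere.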
8 Discussion

SMS spam filtering is an important problem to solve in order to make use of Information and Communication Technologies (ICT) to the fullest. Our Bayesian-based SMS spam filtering solution satisfies most of the design goals with reasonable accuracy. Since data plans on phones are still not used by the masses in India, we have kept the synchronization time user specific and provided the user an option to modify it. Even if the user does not use the synchronization service, the Bayesian learning based SMSAssassin can be used as a standalone application on the phone.

In a developing country like India, there are many programmable mid-price-range phones (USD 100 to USD 400) on which SMSAssassin can be deployed easily. We plan to conduct a user study of SMSAssassin by deploying it on those phones. Since the basic idea of SMSAssassin is not tied to a particular platform, we are also developing mobile applications for Android and Windows Mobile based phones for wider applicability of the system.

In Section 6, we have shown that Bayesian learning gives low spam classification accuracy and that SVM improves upon the spam classification accuracy with a reasonable ham classification accuracy. Due to its high computation requirement, SVM classification cannot be performed at the mobile application level. As ongoing work, we are developing a server-based mobile system to perform SVM classification online. In this system, whenever an SMS arrives on the phone, our application captures that SMS without showing it to the user and computes feature values from the SMS content. For SVM classification, we have used only lightweight features such as word length, presence of special characters etc., which can easily be computed by a mobile application. The mobile application then sends those feature values to the server, which performs the classification. The server runs a pre-trained classifier that can be occasionally updated using user-contributed spam messages. The server takes very little time to classify the SMS and sends the result back to the mobile device. One thing to note here is that the user's privacy is not at stake, because the mobile device sends only feature values (just an integer vector) instead of the SMS content. However, we still have to think of a way to take user preferences into account while performing SVM-based classification.

As we have mentioned before, this is exploratory and ongoing work; we present the following future research directions:

References

[4] http://www.medianama.com/2010/09/223-sms-spam-increase-india/
[5] http://emergic.org/2010/08/09/ending-sms-spam-part-1/
[6] http://ndncregistry.gov.in/ndncregistry/index.jsp
[7] Gómez Hidalgo, J. M., Bringas, G. C., Sánz, E. P., and García, F. C. 2006. Content based SMS spam filtering. In Proceedings of the 2006 ACM Symposium on Document Engineering (Amsterdam, The Netherlands, October 10 - 13, 2006). DocEng '06. ACM, New York, NY, 107-114.
[8] Cormack, G. V., Hidalgo, J. M., and Sánz, E. P. 2007. Feature engineering for mobile (SMS) spam filtering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Amsterdam, The Netherlands, July 23 - 27, 2007). SIGIR '07. ACM, New York, NY, 871-872.
[9] Cormack, G. V., Gómez Hidalgo, J. M., and Puertas Sánz, E. 2007. Spam filtering for short messages. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007).
[10] Sohn, D., Lee, J., and Rim, H. 2009. The contribution of stylistic information to content-based mobile spam filtering. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (Suntec, Singapore, August 04 - 04, 2009). Annual Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 321-324.
[11] Sheng, S., Kumaraguru, P., Acquisti, A., Cranor, L., and Hong, J. Improving phishing countermeasures: An analysis of expert interviews. APWG eCrime Researchers Summit (2009).
[12] Kumaraguru, P., Sheng, S., Acquisti, A., Cranor, L., and Hong, J. Teaching Johnny not to fall for phish. ACM Transactions on Internet Technology (TOIT) 10, 2 (2010).
[13] Benevenuto et al. Detecting Spammers on Twitter. CEAS 2010: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, July 13-14, 2010, Redmond, Washington, US.