Techniques To Detect Spammers in Twitter-A Survey: International Journal of Computer Applications December 2013
International Journal of Computer Applications (0975 – 8887)
Volume 85 – No 10, January 2014
shape to the current research surrounding social network spammer detection. After going through this survey, researchers can easily evaluate what work has been done, in which year, and how the present work can be extended to make spam detection more accurate. Wherever appropriate, we have detailed the methodology followed, the dataset used, the features used for detecting spammers, and the accuracy of the techniques applied by the various authors.

In particular, the papers cover how spammers engage with social network users, the implications of this, and the existing techniques to detect these spammers.

3. SECURITY ISSUES IN OSNs
Online Social Networking sites (OSNs) are vulnerable to security and privacy issues because of the amount of user information processed by these sites each day. Users of social networking sites are exposed to various attacks:
1) Viruses - spammers use social networks as a platform [19] to spread malicious data to users' systems.
2) Phishing attacks - a user's sensitive information is acquired by impersonating a trustworthy third party [30].
3) Spammers - send spam messages to the users of social networks [11].
4) Sybil (fake) attack - an attacker obtains multiple fake identities and pretends to be genuine in the system in order to harm the reputation of honest users in the network [20].
5) Social bots - collections of fake profiles created to gather users' personal data [32].
6) Clone and identity theft attacks - attackers create a profile of an already existing user in the same network or across different networks in order to fool the cloned user's friends [23]. If victims accept the friend requests sent by these cloned identities, the attackers gain access to their information.
These attacks consume extra resources from users and systems.

4. TYPES OF SPAMMERS
Spammers are malicious users who contaminate the information presented by legitimate users and in turn pose a risk to the security and privacy of social networks. Spammers belong to one of the following categories [22]:
1. Phishers: users who behave like normal users in order to acquire the personal data of other genuine users.
2. Fake Users: users who impersonate the profiles of genuine users in order to send spam content to those users' friends or to other users in the network.
3. Promoters: users who send malicious advertisement links or other promotional links to others so as to obtain their personal information.

Motives of spammers:
a) Disseminate pornography
b) Spread viruses
c) Carry out phishing attacks
d) Compromise system reputation

5. TWITTER AS AN OSN
5.1 Introduction
Twitter is a social network service launched on March 21, 2006 [14] that to date has 500 million active users [14] who share information. Twitter uses a chirping bird as its logo, hence the name. Users exchange short messages called 'tweets' of up to 140 characters that anyone can send or read. These tweets are public by default and visible to all those who follow the tweeter. Tweets may contain news, opinions, photos, videos, links, and messages. The following is the standard Twitter terminology relevant to our work:
Tweet [3]: A message on Twitter with a maximum length of 140 characters.
Followers & Followings [3]: Followers are the users who follow a particular user; followings are the users whom a user follows.
Retweet [3]: A tweet that has been reshared with all of a user's followers.
Hashtag [3]: The # symbol, used to tag keywords or topics in a tweet to make it easily identifiable for search purposes.
Mention [3]: Tweets can include replies to and mentions of other users by preceding their usernames with the @ sign.
Lists [3]: A mechanism Twitter provides to group the users you follow.
Direct Message [3]: Also called a DM, this is Twitter's direct messaging system for private communication amongst users.

As per Twitter policy [16], indicators of spam profiles include following a large number of users in a short period of time (according to Twitter policy [17], once an account's followings exceed 2,000, the number of further followings is limited by the number of the account's followers), posts consisting mainly of links, popular hashtags (#) used when posting unrelated information, and repeatedly posting other users' tweets as one's own. Users can report spam profiles to Twitter by posting a tweet to @spam, but the Twitter policy [16] gives no clear indication of whether automated processes look for these conditions or whether the administrators rely on user reporting, although a combination of both is believed to be used.

5.2 Threats on Twitter
1. Spammed tweets [13]: Twitter limits tweets to 140 characters, but cybercriminals have turned this limitation to their advantage by creating short yet compelling tweets with links to promotions for free vouchers, job advertisements, or other offers.
2. Malware downloads [13]: Twitter has been used by cybercriminals to spread posts with links to malware download pages. Examples include FAKEAV and backdoor applications [13], a Twitter worm that sent direct messages, and malware that affected both Windows and Mac operating systems. The most notorious social media malware is KOOBFACE [13], which targeted both Twitter and Facebook.
3. Twitter bots [13]: Cybercriminals use Twitter to manage and control botnets. These botnets control users' accounts and pose a threat to their security and privacy.

6. Social Implications of OSNs
Along with the usual problems such as spamming, phishing attacks, malware infections, social bots, and viruses, the greater challenge that social networking sites present for users is keeping private data secure and confidential.
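The policy indicators described in Section 5.1 reduce to a handful of per-account metrics. The sketch below is a minimal illustration of how such indicators could be computed, not Twitter's actual detection logic; the account fields, the substring check for links, and the duplicate-content measure are all assumptions made for the example.

```python
# Sketch of the spam-profile indicators described by Twitter policy [16],
# expressed as per-account metrics. Field names and checks are illustrative
# assumptions, not Twitter's actual rules.

def spam_indicators(account):
    """Return a dict of policy-style indicator scores for one account."""
    tweets = account["tweets"]                 # list of tweet texts
    days = max(account["age_days"], 1)         # account age in days
    n = max(len(tweets), 1)

    follow_rate = account["following"] / days              # mass-following speed
    link_ratio = sum("http" in t for t in tweets) / n      # posts that are mainly links
    duplicate_ratio = 1 - len(set(tweets)) / n             # repeatedly posting the same content

    return {
        "follow_rate": follow_rate,
        "link_ratio": link_ratio,
        "duplicate_ratio": duplicate_ratio,
        # Per the footnoted policy [17]: beyond 2,000 followings,
        # the limit depends on the account's follower count.
        "over_follow_cap": account["following"] > 2000
                           and account["following"] > account["followers"],
    }

account = {
    "following": 2500, "followers": 100, "age_days": 5,
    "tweets": ["win a free voucher http://x.co"] * 8 + ["hello world", "hi"],
}
print(spam_indicators(account))
```

An account scoring high on several of these metrics at once (fast following, mostly links, heavy duplication) matches the profile the policy describes; none of the metrics alone is conclusive.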
Author | Features used | Techniques compared | Dataset | Results
--- | --- | --- | --- | ---
Benevenuto et. al. [7] | User based and content based | SVM | 1065 manually labelled Twitter users | Accuracy 87.6% (user and content based features); 84.5% (with only user based features)
Gee et. al. [12] | User based | Naive Bayesian, SVM | 450 Twitter users with 200 recent tweets | Accuracy 89.6%
McCord et. al. [24] | User based and content based | Random Forest, SVM, Naive Bayesian, K-NN | 1000 Twitter users with 100 recent tweets | Random Forest giving highest accuracy, 95.7%
Lin et. al. [28] | URL rate, interaction rate | J48 | 400 Twitter users | Precision 86%
Amit A. et. al. [2] | 15 new features introduced | Random Forest, Decision Tree, Decorate, Naive Bayesian | 31,808 Twitter users | Accuracy 93.6%
Chakraborty et. al. [4] | User based, content based | Random Forest, SVM, Naive Bayesian, Decision Tree | Trained on 5000 Twitter users with 200 recent tweets | SVM giving highest accuracy, 89%
Yang et. al. [6] | 18 features (8 existing & 10 newly introduced) | Random Forest, Decision Tree, Decorate, Naive Bayesian | Two datasets: 5000 users, then 3500 users with 40 recent tweets | Bayesian giving highest accuracy, 88.6%

Significant work was done by Alex Hai Wang [1] in 2010, using user based as well as content based features for the detection of spam profiles. A spam detection prototype system was proposed to identify suspicious users on Twitter, along with a directed social graph model to explore the "follower" and "friend" relationships. Based on Twitter's spam policy, content based and user based features were used to facilitate spam detection with a Bayesian classification algorithm. Classic evaluation metrics were used to compare the performance of traditional classification methods such as Decision Tree, Support Vector Machine (SVM), Naive Bayesian, and Neural Networks, and amongst these the Bayesian classifier was judged the best in terms of performance. Over a crawled dataset of 2,000 users and a test dataset of 500 users, the system achieved an accuracy of 93.5% and a precision of 89%. The limitation of this approach is that it was tested on a very small dataset of 500 users, considering only their 20 most recent tweets.

Lee et. al. [22] deployed social honeypots consisting of genuine-looking profiles that detected suspicious users; a bot collected evidence of spam by crawling the profiles of users who sent unwanted friend requests and hyperlinks on MySpace and Twitter. Profile features such as posting behaviour, content, and friend information were used to develop a machine learning classifier for identifying spammers. After analysis, the profiles of users who had sent unsolicited friend requests to these social honeypots on MySpace and Twitter were collected, and the LIBSVM classifier was used for identification of spammers. One good point of the approach is that it was validated on two different compositions of the dataset: once with 10% spammers + 90% non-spammers, and again with 10% non-spammers + 90% spammers. Its limitation is that a small dataset was used for validation.

Benevenuto et. al. [7] detected spammers on the basis of tweet content and user based features. The tweet content attributes used are: number of hashtags per number of words in each tweet, number of URLs per word, number of words in each tweet, number of characters in each tweet, number of URLs in each tweet, number of hashtags in each tweet, number of numeric characters appearing in the text, number of users mentioned in each tweet, and number of times the tweet has been retweeted. The fraction of tweets containing URLs, the fraction of tweets containing spam words, and the average number of words that are hashtags are the characteristics that differentiate spammers from non-spammers. A dataset of 54 million Twitter users was crawled, with 1065 users manually labelled as spammers or non-spammers. A supervised machine learning scheme, an SVM classifier, was used to distinguish between spammers and non-spammers. The detection accuracy of the system is 87.6%, with only 3.6% of non-spammers misclassified.

Twitter lets its users report spam accounts by sending a message to "@spam". Gee et. al. [12] utilized this feature and detected spam profiles using classification techniques. Normal user profiles were collected using the Twitter API, and spam profiles were collected from "@spam" on Twitter. The collected data was represented in JSON and then converted into matrix form in CSV format, with users as rows and features as columns. The CSV files were first trained with the Naive Bayes algorithm, giving a 27% error rate; the SVM algorithm was then used, with an error rate of 10%. Spam profile detection accuracy is 89.3%. The limitation of this approach is that the features used for detection are not very technical and the precision is low (89.3%), so it has been suggested that aggressive deployment of any such system should happen only if precision exceeds 99%.

McCord et. al. [24] used user based features such as number of friends and number of followers, and content based features such as the number of URLs, replies/mentions, retweets, and hashtags in the collected database. The classifiers Random Forest, Support Vector Machine (SVM), Naive Bayesian, and K-Nearest Neighbour were used to identify spam profiles on Twitter. The method was validated on 1000 users, reaching 95.7% precision and 95.7% accuracy with the Random Forest classifier, which gives the best results, followed by the SMO, Naive Bayesian, and K-NN classifiers. The limitations of this approach are that, for the dataset considered, the reputation feature gave wrong results (it could not differentiate spammers from non-spammers); the dataset was unbalanced, which is why Random Forest gives the best results, as this classifier is generally preferred for unbalanced datasets; and the approach was validated on a small dataset.

Lin et. al. [28] detected long-surviving spam accounts on Twitter on the basis of two features: URL rate and interaction rate. Most papers have used many features for the detection of spam accounts, such as number of followers, number of followings, followers/following ratio, tweet content, number of hashtags, and URL links. According to this paper, however, these features are not very effective at detecting spammers, so only the simple yet effective URL rate and interaction rate were used. The URL rate is the number of tweets with a URL divided by the total number of tweets, and the interaction rate is the number of tweets interacting with other users divided by the total number of tweets.
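The two rates Lin et. al. [28] rely on are straightforward to compute from an account's tweet history. Below is a minimal sketch, assuming tweets are available as plain strings and that an embedded "http" or a "@" marks a URL or an interaction respectively; the paper itself does not prescribe this representation.

```python
# Sketch of the two features from Lin et. al. [28]: URL rate and
# interaction rate, each a fraction over an account's tweet history.
# Treating tweets as plain strings is an assumption for illustration.

def url_rate(tweets):
    """Fraction of tweets that contain a URL."""
    if not tweets:
        return 0.0
    return sum(("http://" in t or "https://" in t) for t in tweets) / len(tweets)

def interaction_rate(tweets):
    """Fraction of tweets that interact with other users (@-mentions or replies)."""
    if not tweets:
        return 0.0
    return sum("@" in t for t in tweets) / len(tweets)

tweets = [
    "check this out http://spam.example",   # URL, no interaction
    "@alice thanks for the follow!",        # interaction, no URL
    "just having coffee",                   # neither
    "free gift http://spam.example",       # URL
]
print(url_rate(tweets), interaction_rate(tweets))
```

In [28], long-surviving spam accounts tend toward a high URL rate and a low interaction rate; the J48 classifier is then trained on just these two values per account.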
26,758 accounts were crawled using the Twitter API, and 816 long-surviving accounts were analysed with the J48 classifier, achieving 86% precision. The limitation of the approach is that only two features are used for spam profile detection; if spammers keep a low URL rate and a low interaction rate, the technique will not work as intended.

According to Amit A. et. al. [2], there are two types of spammer detection techniques: user centric techniques, based on features related to the user such as the followers/following ratio, and URL centric techniques, which depend on detecting malicious URLs. The approach in this paper is a hybrid that considers both types of features. 15 new features were proposed to detect spammers, along with an alert system to detect spam tweets; tweet campaigns and the techniques used by spammers were also studied. Two Twitter datasets were used, one with 500K users and another with 110,789 users. The new features are: bait oriented features, which identify the techniques spammers use to lure victims into clicking malicious links (number of mentions, mentions to non-followers, hijacking trends, intersection with famous trends); behavioural features (variance in tweet interval, variance in number of tweets per unit time, ratio of variance in tweet interval to variance in number of tweets per unit time, and tweeting sources); URL features (duplicate URLs, duplicate domain names, IP/domain ratio); content entropy features (dissimilarity of tweet content, similarity between tweets, URL and tweet similarity); and profile features (follower/following ratio, profile description language dissimilarity). These features were collected from malicious as well as benign users and fed to four supervised learning algorithms, Decision Tree, Random Forest, Bayes Network, and Decorate, using the Weka tool. 93.6% of spammers were detected with a false positive rate of 1.8%, the Decorate classifier giving the best results. The technique has been shown to outperform Twitter's own spammer detection policy, but it was tested on only 31,808 users, whereas Twitter has millions of users.

Chakraborty et. al. [4] proposed a system to detect abusive users who post abusive content, including harmful URLs, porn URLs, and phishing links, which drives away regular users and harms the privacy of social networks. The algorithm has two steps: first, check the profile of a user sending a friend request for abusive content; second, check the similarity of the two profiles. After these two steps, the system recommends whether or not the friend request should be accepted. It was tested on a Twitter dataset of 5000 users collected with the REST API. The features considered for differentiating abusive from non-abusive users are profile based, content based, and timing based. Classifiers such as SVM, Decision Tree, Random Forest, and Naive Bayesian were used; SVM outperforms the other classifiers, with the model achieving an accuracy of 89%.

Yang et. al. [6] utilized new features for the detection of spammers on Twitter and discussed various evasion techniques used by spammers. 10 new detection features were proposed, comprising three graph based features, three neighbour based features, three automation based features, and one timing based feature. These features are difficult as well as expensive to dodge, as they are based on behaviours spammers must retain to avoid detection, and evading them requires more money, resources, and time. A total of 18 features (8 existing and 10 newly introduced) were used for detection and tested with classifiers such as Random Forest, Decision Tree, Decorate, and Bayesian Network. The Bayesian classifier performs best, with an accuracy of 88.6%. The limitation of this approach is that very little data was crawled and only a particular type of spammer is detected, with a detection rate that is a lower bound on the spammers present in the dataset.

9. RESEARCH DIRECTIONS
During the survey it became quite apparent that a lot of work has been done on detecting spam profiles in different OSNs. Still, improvements can be made to achieve better detection rates by using different techniques and covering more, and more robust, features as deciding parameters. The following are a few conclusions drawn from the survey:
1. Twitter has millions of active users and this number is constantly increasing, yet almost all authors have used very small testing datasets to evaluate the performance of their approaches. There is therefore a need to evaluate approaches on larger testing datasets.
2. There is a need to develop a multivariate model.
3. There is a need to develop a method that can detect all kinds of spammers.
4. Approaches need to be tested on different combinations of spammers and non-spammers.

10. CONCLUSION
Many methods have been developed and used by various researchers to find spammers in different social networks. From the papers reviewed, it can be concluded that most of the work has been done using classification approaches such as SVM, Decision Tree, Naive Bayesian, and Random Forest. Detection has been performed on the basis of user based features, content based features, or a combination of both; a few authors also introduced new features for detection. All the approaches have been validated on very small datasets and have not been tested with different combinations of spammers and non-spammers. Combining features for the detection of spammers has shown better performance in terms of accuracy, precision, recall, etc. than using only user based or only content based features.

11. REFERENCES
[1] Alex Hai Wang, Don't Follow Me: Spam Detection in Twitter, Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT), Pages 1-10, 26-28 July 2010, IEEE.
[2] Amit A. Amleshwaram, Narasimha Reddy, Sandeep Yadav, Guofei Gu, Chao Yang, CATS: Characterizing Automation of Twitter Spammers, Texas A&M University, 2013, IEEE.
[3] Anshu Malhotra, Luam Totti, Wagner Meira Jr., Ponnurangam Kumaraguru, Virgilio Almeida, Studying User Footprints in Different Online Social Networks, International Conference on Advances in Social Networks Analysis and Mining, 2012, IEEE/ACM.
[4] Ayon Chakraborty, Jyotirmoy Sundi, Som Satapathy, SPAM: A Framework for Social Profile Abuse Monitoring.
[5] Boyd, D. M., Ellison, N. B. (2007), Social network sites: Definition, history, and scholarship, Journal of Computer-Mediated Communication, 13(1), article 11, http://jcmc.indiana.edu/vol13/issue1/boyd.ellison.html
[6] Chao Yang, Robert Chandler Harkreader, Guofei Gu, Die Free or Live Hard? Empirical Evaluation and New Design for Fighting Evolving Twitter Spammers, RAID'11 Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, Pages 318-337, 2011, Springer-Verlag Berlin, Heidelberg, ACM.
[7] Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida, Detecting Spammers on Twitter, CEAS 2010 Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, July 2010, Washington, US.
[8] Fact Sheet 35: Social Networking Privacy: How to be Safe, Secure and Social.
[9] Faraz Ahmed, Muhammad Abulaish, An MCL-Based Approach for Spam Profile Detection in Online Social Networks, IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications, 2012.
[10] Georgios Kontaxis, Iasonas Polakis, Sotiris Ioannidis and Evangelos P. Markatos, Detecting Social Network Profile Cloning, 3rd International Workshop on Security and Social Networking, 2011, IEEE.
[11] Gianluca Stringhini, Christopher Kruegel, Giovanni Vigna, Detecting Spammers on Social Networks, Proceedings of the 26th Annual Computer Security Applications Conference, ACSAC '10, Austin, Texas, USA, Pages 1-9, Dec. 6-10, 2010, ACM.
[12] Grace Gee, Hakson Teh, Twitter Spammer Profile Detection, 2010.
[13] http://about-threats.trendmicro.com/us/webattack - information regarding Twitter threats.
[14] http://en.wikipedia.org/wiki/Twitter - information on Twitter.
[15] http://expandedramblings.com/index.php/march-2013-by-the-numbers-a-few-amazing-twitter-stats - statistics of Twitter.
[16] http://help.twitter.com/forums/26257/entries/1831 - The Twitter Rules.
[17] http://twittnotes.com/2009/03/2000-following-limit-on-twitter.html - The 2000 Following Limit Policy on Twitter.
[18] http://www.spamhaus.org/consumer/definition - Spam definition.
[19] J. Baltazar, J. Costoya, and R. Flores, "The real face of Koobface: The largest Web 2.0 botnet explained," Trend Micro Threat Research, 2009.
[20] J. Douceur, "The Sybil attack," Peer-to-Peer Systems, pp. 251-260, 2002; D. Irani, M. Balduzzi, D. Balzarotti, E. Kirda, and C. Pu, "Reverse social engineering attacks in online social networks," Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 55-74, 2011.
[21] Jonghyuk Song, Sangho Lee and Jong Kim, Spam Filtering in Twitter using Sender-Receiver Relationship, RAID'11 Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, vol. 6961, Pages 301-317, 2011, Springer, Heidelberg, ACM.
[22] Kyumin Lee, James Caverlee, Steve Webb, Uncovering Social Spammers: Social Honeypots + Machine Learning, Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Pages 435-442, ACM, New York, 2010.
[23] Leyla Bilge, Thorsten Strufe, Davide Balzarotti, Engin Kirda, All Your Contacts Are Belong to Us: Automated Identity Theft Attacks on Social Networks, International World Wide Web Conference Committee (IW3C2), WWW 2009, April 20-24, 2009, Madrid, Spain, ACM.
[24] M. McCord, M. Chuah, Spam Detection on Twitter Using Traditional Classifiers, ATC'11, Banff, Canada, Sept 2-4, 2011, IEEE.
[25] Manuel Egele, Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna, COMPA: Detecting Compromised Accounts on Social Networks.
[26] Marcel Flores, Aleksandar Kuzmanovic, Searching for Spam: Detecting Fraudulent Accounts via Web Search, LNCS 7799, pp. 208-217, 2013, Springer-Verlag Berlin Heidelberg.
[27] Mauro Conti, Radha Poovendran, Marco Secchiero, FakeBook: Detecting Fake Profiles in On-line Social Networks, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012.
[28] Po-Ching Lin, Po-Min Huang, A Study of Effective Features for Detecting Long-surviving Twitter Spam Accounts, 15th International Conference on Advanced Communication Technology (ICACT), 27-30 Jan. 2013, IEEE.
[29] Sangho Lee and Jong Kim, WARNINGBIRD: Detecting Suspicious URLs in Twitter Stream, 19th Network and Distributed System Security Symposium (NDSS), San Diego, California, USA, February 5-8, 2012.
[30] T. Jagatic, N. Johnson, M. Jakobsson, and F. Menczer, "Social phishing," Communications of the ACM, vol. 50, no. 10, pp. 94-100, 2007.
[31] Vijay A. Balasubramaniyan, Arjun Maheswaran, Viswanathan Mahalingam, Mustaque Ahamad, H. Venkateswaran, A Crow or a Blackbird? Using True Social Network and Tweeting Behavior to Detect Malicious Entities in Twitter, 2002, ACM.
[32] Y. Boshmaf, I. Muslukhov, K. Beznosov, and M. Ripeanu, "The socialbot network: when bots socialize for fame and money," Proceedings of the 27th Annual Computer Security Applications Conference, ACM, 2011, pp. 93-102.
[33] Yin Zhu, Xiao Wang, Erheng Zhong, Nathan N. Liu, He Li, Qiang Yang, Discovering Spammers in Social Networks, Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence.
[34] Zhi Yang, Christo Wilson, Xiao Wang, Tingting Gao, Ben Y. Zhao, and Yafei Dai, Uncovering Social Network Sybils in the Wild, Proceedings of the 11th ACM/USENIX Internet Measurement Conference (IMC'11), 2011.