Predictive Modeling For Suspicious Content Identification On Twitter
https://doi.org/10.1007/s13278-022-00977-7
ORIGINAL ARTICLE
Received: 9 February 2022 / Revised: 24 August 2022 / Accepted: 17 September 2022 / Published online: 5 October 2022
© The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2022
Abstract
The wide popularity of Twitter as a medium for exchanging activities, entertainment, and information has attracted spammers, who exploit it as a platform to spam users and spread misinformation. This poses a challenge to researchers: identifying malicious content and user profiles on Twitter so that timely action can be taken. Many previous works have used different strategies to overcome this challenge and combat spammer activities on Twitter. In this work, we develop various models that utilize different features, such as profile-based, content-based, and hybrid features, to identify malicious content and classify it as spam or not-spam. In the first step, we collect and label a large dataset from Twitter to create a spam detection corpus. Then, we create a set of rich features by extracting various features from the collected dataset. Further, we apply different machine learning, ensemble, and deep learning techniques to build the prediction models. We performed a comprehensive evaluation of the different techniques over the collected dataset and assessed their performance using accuracy, precision, recall, and f1-score measures. The results showed that the different sets of learning techniques used achieved a high performance for tweet spam classification; in most cases, the values are above 90% for the different performance measures. These results show that using profile, content, user, and hybrid features for suspicious tweet detection helps build better prediction models.
Keywords Suspicious content detection · User-content features · Natural language processing · Machine learning
techniques · Social network
149 Page 2 of 13 Social Network Analysis and Mining (2022) 12:149
of this false information is so huge that even the World Health Organization has declared the onslaught of messages an "infodemic"³. The overabundance of information, some of it correct and some not, makes it difficult for end-users to find virtuous sources and reliable information when they need it. This can negatively impact the psychology of users, driving them to anxiety and depression. It is highly essential to curb the pitfalls of Twitter to make it a more reliable and trustworthy place (Lingam et al 2019).

Therefore, the boon of Twitter is now turning into a curse, as spammers use the platform to spread malicious or irritating information to achieve their malevolent intents. In a tweet, the user can add text, URLs, videos, and images. Further, Twitter supports various functionalities, such as following a user, mentioning a topic or user, hashtags, replies, and retweets (Lee and Kim 2013). A hashtag is used to place a tweet into a particular category, and all tweets related to that category can be read by clicking that tag. When any notable event happens, a large number of users tweet about it and quickly make it a trending topic. These trending topics become the target of spammers, who post tweets containing characteristic phrases of the trending topic together with URL links that lead users to unrelated sites. As tweets usually include shortened URL links, it becomes difficult for users to recognize the content of a URL without loading the site. Spammers can have several motives behind spamming, for example, advertising a product to produce exceptional returns on sales or compromising a user's account (Lingam et al 2019; Barushka and Hajek 2018; Dokuz 2021). Spammers not only contaminate the real-time search environment but also distort tweet statistics. Filtering malicious content is a challenging problem because of URL shorteners, modern and informal languages, and abbreviations used on social networking sites. Spammers influence users to click a particular URL or read content with specific phrases or words (Tingmin et al. 2018; Madisetty and Desarkar 2018).

In their study, Kaur et al. (2016) surveyed research papers published between 2010 and 2015 on malicious tweet and content identification. The authors reported that most of the techniques used for malicious tweet detection fall into four categories. (1) User feature-based techniques: these classify a user as spammer or non-spammer by analyzing the user's account information, such as the number of followers, number of following, number of mentions, and tweet creation times. (2) Content feature-based techniques: these analyze text properties and decide whether tweets are spam or non-spam. Features of the tweet content, such as the number of hashtags in comparison to the total word count, users mentioned in a tweet, number of URLs, and count of numerals, are used. (3) Relation feature-based techniques: these use connection-degree measures, such as whether a person mentioned a direct friend or a mutual friend in a tweet, to identify malicious content. (4) Hybrid feature-based techniques: these derive new features from the user features, such as reputation (the ratio of followers to following), the frequency of tweets, and the rate at which a user follows other users. In 2020, Abkenar et al. (2020) performed a systematic literature review (SLR) on Twitter spam detection and reported that spam detection approaches had used content analysis (15%), user analysis (9%), tweet analysis (9%), network analysis (11%), and hybrid analysis (56%). Furthermore, the authors stated that collecting real-time Twitter data, labeling datasets, spam drifting, and class imbalance are open challenges for Twitter spam detection approaches.

In this paper, we first collect the spam dataset from Twitter by utilizing the Twitter developer API. We fetch the 4000 latest tweets, consisting of information such as timestamp, tweet text, username, hashtags, followers count, following count, number of mentions, word count, retweets, etc. Further, we perform feature engineering and extract different feature sets, such as content-based and user-based features. Additionally, we create hybrid features such as the user's reputation, the frequency of a user's tweets, and the following frequency. Further, we label the dataset as spam or non-spam using hybrid features, blacklisted URLs, and some predefined words in the text. Afterward, we apply different machine learning and deep learning techniques to predict suspicious or malicious tweets. Further, we analyze how the different techniques perform in predicting suspicious content on Twitter. Specifically, we make the following contributions in the presented work.

1. We create a spam dataset to detect suspicious content on Twitter.
2. We extract different features from the collected Twitter dataset. These features are language based, content based, and user based. Further, we create hybrid features to enrich the feature set for building effective prediction models.
3. We apply two different natural language processing (NLP) techniques, bag of words and TF-IDF, to extract different language features.
4. We apply different machine learning and state-of-the-art deep learning techniques and evaluate their performance for suspicious content detection on Twitter.

³ https://www.washingtonpost.com/science/2020/03/17/analysis-millions-coronavirus-tweets-shows-whole-world-is-sad/.

The rest of the paper is organized as follows. Section 2 discusses works related to the techniques used for tweet spam classification. The Twitter spam data collection and feature extraction procedure are presented in Sect. 3. The experimental analysis and results are provided in Sect. 4. Section 5 concludes the paper.

2 Related studies

In this section, we discuss some of the state-of-the-art works related to the proposed work. Kaur et al. (2016) reported a review of various research papers published between 2010 and 2015 and discussed the techniques used in them. The authors stated that researchers had utilized numerous methods for spam detection; most of the works considered tweets' content and profile-based features. Dangkesee and Puntheeranurak (2017) performed an adaptive classification for spam detection. The authors used a spam-word filter and a URL filter based on blacklisted URLs. After labeling and preprocessing the dataset, a Naive Bayes classifier was applied to 50,000 and 10,000 tweets. The results showed that the proposed model outperformed the spam-word filter in terms of accuracy, precision, recall, and f1-score. In the end, the authors suggested the use of safe browsing instead of URL blacklisting for filtering URLs.

Raj et al. (2020) applied multiple machine learning algorithms to classify tweet content. The experimental results showed that, among the used techniques, the KNN (92%), decision tree (90%), random forest (93%), and naive Bayes (69%) classifiers outperformed the other techniques. The authors suggested that a tweet be deleted after it is detected as spam. Song et al. (2011) applied Bagging, SVM, J48, and BayesNet with relation-based features by creating graphs between users. The authors used measures such as distance and connectivity between users. The results showed that Bagging outperformed the other techniques, with a 94.6% true positive rate and a 6.5% false positive rate. The authors also highlighted that if any user created a new account and generated a tweet, it would be added to the spammer category even if it is not spam, due to the earlier classification of that user as malicious.

Alom et al. (2020) applied a CNN to tweet text alone and to tweet text combined with meta-data features for spam classification. The presented approach utilized NLP methods such as word embeddings and n-grams, and converts the text into a matrix before sending it to the CNN. The method combining both feature types produced a better accuracy of around 93.38%, and the presented approach outperformed the other deep learning methods used.

Mateen et al. (2017) proposed a hybrid solution for spam detection that used different combinations of features, such as content-based, graph-based, and user-based features. The authors applied J48, Decorate, and naive Bayes classifiers to the dataset with these features. The results showed that models based on content and graph-based features achieved an accuracy of 90%, and models based on user and graph-based features achieved an accuracy of 92%. The presented work also performed correlation analysis between features and removed features with higher correlations.

Sagar and Manik (2017) applied different machine learning algorithms for Twitter spam detection, with SVM as the principal classifier. The authors introduced a new feature that matches the tweet content with the URL destination content. The experimental dataset consists of a random set of 1000 tweets; of those 1000 tweets, 95–97% were classified correctly. Arushi and Rishabh (2015) proposed an integrated algorithm that combines the benefits of three distinct learning algorithms (namely naive Bayes, clustering, and decision trees). This integrated algorithm classifies a record as spammer/non-spammer with an overall precision of 87.9%. Lin and Huang (2013) analyzed the importance of existing features for recognizing spammers on Twitter and utilized two simple yet effective features (i.e., the URL rate and the interaction rate) to characterize Twitter accounts. This study, based on 26,758 Twitter accounts with 508,403 tweets, shows that the classification achieves precision up to around 0.99 and 0.86 and a higher recall. Willian and Yanqing (2013) proposed a versatile strategy to detect spam on Twitter using content, social, and graph-based data; after various experiments, a threshold-based model is introduced. This new model is compared with SVM and two other existing algorithms using accuracy, precision, and recall. The new classifier, with an accuracy of 79.26%, is superior to SVM with a precision of 69.32%. Wu et al. (2017) used various deep learning techniques, training word vectors and creating various classifiers through ML algorithms. Doc2Vec was used as the word vector training model, and the machine learning algorithms included random forest, naive Bayes, and decision tree. The authors collected 10 days of ground-truth data from Twitter, consisting of 1,376,206 spam tweets and 673,836 non-spam messages, and created four different datasets with varying spam to non-spam ratios. MLP proved to perform the best on all four datasets. Tang et al. (2014) tried a unique approach of extracting features from tweets using deep learning networks in order to capture the syntax of embedded words and labels. However, the machine learning algorithms using these features did not perform that well, as the best f1-score was reported to be 87.61% (<90%).

Previous work on spam detection on Twitter predominantly centers on profile and content-based features. Better utilization of other features in Twitter spam detection is still a major concern (Tingmin et al. 2018). Additionally, there is a need to add hybrid features to the training set for tweet classification. The proposed work uses two different datasets with different feature combinations
Fig. 1 Twitter data collection, feature extraction procedure, and ML/DL model evaluation (Dataset 1 (DS1) and Dataset 2 (DS2): profile and text features, encoding using TF-IDF and BOW, predictive model building using ML techniques)
to analyze different machine learning, ensemble, and deep learning techniques.

3 Twitter spam dataset collection

The overview of the dataset collection, feature extraction, and model evaluation procedure is depicted in Fig. 1.

The proposed work utilizes tweets fetched using the Twitter developer API⁴. Twitter allows its users to fetch Twitter data using the Tweepy library⁵. The Tweepy library requires four user credentials, consumer_key, consumer_secret, access_key, and access_secret, to send requests over the API. We fetched the 4000 latest tweets, consisting of many features such as timestamp, tweet text, username, hashtags, followers count, following count, number of mentions, word count, retweets, etc. All of these features are categorized into content-based features and user-based features. Further, we create various hybrid features such as the user's reputation, the frequency of a user's tweets, and the following frequency. For labeling the dataset as spam or non-spam, we use hybrid features, blacklisted URLs, and some predefined words in the text (Gupta et al. 2018). Finally, the dataset is prepared for analyzing the performance of different machine learning models. Two different datasets are created by combining user-content features, user-relation features, and user-content-relation features.

We collect features of three different categories, as described below.

1. Profile-based features These features concern the profile properties of the users. A user's account includes important information such as the number of followers, the number of following, the number of mentions, and tweet creation times.
2. Content-based features These features concern the text properties of the tweets (Chen et al. 2017). Tweet content carries some crucial information, such as the number of hashtags, total word count, users mentioned in a tweet, the number of URLs, and the count of numerals.
3. Hybrid features These features are derived from the user-based features. Some new features that can be derived are reputation (the ratio of followers to following), frequency of tweets, the rate at which a user follows other users, account age, metric entropy of all textual features, the proportion of similarity between username and screen name, etc.

Table 1 describes the important features that we have extracted from the collected dataset (Feature name, Feature type, Feature description). Specifically, we focus on the following properties to extract the different feature sets.

1. Count of followers and followees Followers are the users who follow a specific user, while followees are the users whom a specific user follows. In general, spammers have a limited number of followers but a large number of followees. Therefore, users with many followees and few followers can be considered spam accounts.
2. URLs URLs are links that direct to some other page in the browser. With the improvement of URL shorteners, it has become simple to post irrelevant links on any OSN. This is because URL shorteners hide the original content of the URL, making it hard for detection algorithms to detect malicious URLs. An excessive number of URLs in a user's tweets is an expected indicator of the user being a spammer.
3. Spam words An account with spam words in almost every tweet can be viewed as a spam account. Consequently, text including spam words can be considered a significant factor for identifying spammers.
4. Replies Since the data or messages sent by a spammer are pointless, people rarely reply to its posts. On the other hand, a spammer replies to an enormous number of posts in order to get seen by many individuals. This pattern can be utilized in the recognition of spammers.
5. Hashtags Hashtags are unique identifiers ("#" followed by the identifier name) used to group similar tweets together under the same name. Spammers use numerous #hashtags in their posts so that their post appears under all the hashtag categories and consequently gets high viewership and is read by others.

The hybrid features are included in the dataset to understand the dynamism of features such as "statuses count, friends count, followers count, favorites count, naming conventions and tweeting patterns." Account age shows the frequency of user activity. Accounts with a very high value of status and friends count, but a low value of favorites count and followers count, are prone to being spam accounts. The username and screen name of a legitimate user are usually similar, and the username is not very lengthy and does not begin with a digit. If these naming conventions are not followed, such users are usually spam accounts. The NameSim and NamesRatio features capture this aspect of the accounts. A suspicious spam account usually posts 12 or more tweets per day, whereas a legitimate account posts on average 4 tweets per day. We have considered these characteristics of the user accounts and calculated the hybrid features. The details of these features are given in Inuwa-Dutse et al. (2018).

3.1 Labeling of spam dataset

Initially, all of the tweets are unlabeled. We perform a data labeling process and assign a spam or non-spam label to each tweet. Concone et al. (2019) presented a labeling technique for Twitter spam accounts. The authors used malicious URLs and recurrent content information to decide whether a tweet is spam or not. In our work, we use the same technique to label the tweets. The labeling technique's first step is defining some criteria that help decide between spam and trustworthy content. The first criterion to consider is the publication of URLs of malicious sites in the tweet; such malicious content is simple to detect. Another criterion is the publication of duplicate content or messages to spread some information; this strategy is often used to disseminate misinformation. The use of vocabulary and other meta-information is also used as a criterion. Based on these characteristics, we design and use the labeling technique. To label a tweet as spam or non-spam, we used a combination of a word-category filter, a URL filter, and some hybrid and profile-based meta-features. They are described as follows.

Ads: Ads, images, banners, Hedberg, RealMedia, img, announcer, popup, offer, adserver, sales, gifs, media, exit, out, adv, splash, pub, pop, graphics
Books: Catalog, book, patterns, weaving, product, sniacademic, news, ebook, educator, library, store, wilecyda
E-commerce: Shop, store, catalog, tickets, art, users, business
Games: Juegos, Jeux, category, game, Xbox, jeunesse, pc, online, Comunidad, consoles, flash, PSP, arcade, Wii, emulator, gratis, Nintendo, PlayStation
Medical: Health, conditions, article, content, diseases, meds, group
News: News, newspapers, media, publications, section, feed, opinion, business, community, archive, papers, profile
Sport: Sport, athletics, team, basketball, football, college, women, track, tennis, soccer, baseball, golf, mens

3. Based on hybrid and profile features There are some important hybrid and profile features on the basis of which we can label a tweet. These features include the ratio of friends count to followers count, the ratio of the status count, and account age. Some profile features are also used for labeling, such as Is_verified and Listed. Is_verified represents whether or not the user is verified, checked via the Twitter security bot. Listed represents how many times the user has been reported. Table 1 lists all hybrid and profile features used in the paper.

⁴ https://developer.twitter.com/en.
⁵ https://www.tweepy.org/.
⁶ https://github.com/ssrathore/Suspicious_Tweets-datasset.
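The filters above (word-category filter, URL filter, and hybrid/profile criteria) can be sketched as a simple rule-based labeler. This is only an illustration of the idea: the thresholds, the tiny word list, and the blacklist entry below are assumptions for the sketch, not the paper's exact values.

```python
# Illustrative sketch of the spam-labeling heuristic: a tweet is flagged when
# it contains a blacklisted URL, matches spam-category words, or its author's
# hybrid/profile features (reputation, tweets per day) look suspicious.
# Word list, blacklist, and thresholds are assumed values for illustration.
SPAM_WORDS = {"offer", "adserver", "popup", "gratis"}   # sample word-category entries
BLACKLISTED_URLS = {"http://malicious.example.com"}              # assumed blacklist entry

def label_tweet(text, urls, followers, following, tweets_per_day):
    """Return 'spam' or 'non-spam' for a single tweet."""
    # URL filter: any link on the blacklist marks the tweet as spam.
    if any(u in BLACKLISTED_URLS for u in urls):
        return "spam"
    # Word-category filter: predefined spam words in the tweet text.
    tokens = {t.strip("#@.,!").lower() for t in text.split()}
    if tokens & SPAM_WORDS:
        return "spam"
    # Hybrid/profile criteria: low reputation (followers/following) combined
    # with a high posting rate (the paper notes spammers post 12+ tweets/day).
    reputation = followers / (following + 1)
    if reputation < 0.1 and tweets_per_day >= 12:
        return "spam"
    return "non-spam"

print(label_tweet("Great gratis offer, click now!", [], 50, 2000, 20))  # → spam
print(label_tweet("Enjoying the conference today", [], 300, 280, 3))    # → non-spam
```

The rules are checked in decreasing order of reliability: an explicit blacklisted URL is taken as conclusive, while the behavioral thresholds only fire in combination.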
experimental purpose, followed by a fully connected layer of 32 neurons and a classification layer.

Table 3 Different ML models with bag of words on DS1 and DS2 (Classifier, Accuracy, Precision, Recall, F1-score)
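The network head described above (a fully connected layer of 32 neurons followed by a classification layer; Table 7 refers to this as the basic ANN with 64- and 32-unit layers) can be sketched as a minimal forward pass. This is not the authors' implementation: the weights are random placeholders rather than trained parameters, and ReLU/sigmoid activations are assumed.

```python
import math
import random

def dense(x, w, b, act):
    """One fully connected layer: act(W.x + b)."""
    return [act(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

relu = lambda v: max(0.0, v)
sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

def forward(features):
    """Forward pass through assumed 64- and 32-unit hidden layers plus a
    sigmoid classification layer; returns a pseudo P(spam) in (0, 1)."""
    random.seed(0)  # deterministic placeholder weights, not learned ones
    dims = [len(features), 64, 32]
    x = features
    for i in range(1, len(dims)):
        w = [[random.uniform(-0.1, 0.1) for _ in range(dims[i - 1])]
             for _ in range(dims[i])]
        x = dense(x, w, [0.0] * dims[i], relu)
    w_out = [[random.uniform(-0.1, 0.1) for _ in range(32)]]
    return dense(x, w_out, [0.0], sigmoid)[0]

print(0.0 < forward([0.5] * 8) < 1.0)  # sigmoid output lies in (0, 1)
```

A trained model would learn the weight matrices from the labeled dataset; the sketch only shows how the layer shapes compose.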
To test the machine learning models, we used TF-IDF and bag of words as the text embedding, whereas to test the deep learning models, we used the pre-trained Paraphrase-distilroberta-base-v1 embedding (Reimers et al. 2019). The experimental package with the Twitter spam dataset and source code can be found here⁷.

⁷ https://github.com/ssrathore/Suspicious_Tweets-datasset.

4.5 Experiment results

We have used five different machine learning techniques and three different ensemble techniques to build and evaluate the prediction models. These techniques have been applied to both the DS1 and DS2 datasets with the bag of words (BOW) and TF-IDF feature extraction methods. The results of the experimental analysis are reported in Tables 3, 4, 5, 6, and 7.

4.5.1 Results of machine learning techniques for the tweet spam classification

Tables 3 and 4 show the results of ML techniques with BOW and TF-IDF on the DS1 and DS2 datasets in terms of accuracy, precision, recall, and f1-score measures. From Table 3, it can be seen that among the used ML techniques, random forest and decision tree produced the best prediction performance across all the measures. The highest achieved values for all the measures are above 90%. The naive Bayes technique produced the lowest performance for all the measures. The performance of the ML techniques improved for DS2, which consists of the profile, user, and hybrid features. Similarly, from Table 4, it can be observed that again the decision tree and random forest techniques are the top performers for all the performance measures. The values of all the measures are greater than 90% for the decision tree, random forest, and logistic regression techniques. Again, the performance of the ML techniques improved for DS2.

4.5.2 Results of ensemble techniques for the tweet spam classification

Tables 5 and 6 show the results of ensemble techniques with BOW and TF-IDF on the DS1 and DS2 datasets in terms of accuracy, precision, recall, and f1-score measures. From Table 5, it can be observed that the bagging technique produced the best prediction performance for all the measures, followed by the boosting technique. The three ensemble techniques achieved values above 90% for all the measures. The stacking technique produced the lowest performance for all the measures. However, the performance of the ensemble techniques decreased for DS2, which consists of the profile, user, and hybrid features. Similarly, from Table 6, it can be seen that again the bagging technique is the best performer, followed by the boosting technique. This holds for both the DS1 and DS2 datasets. The values of all the measures are again greater than 90% for the bagging and boosting techniques. Again, the performance of the ensemble techniques decreased for DS2.

Table 6 Ensemble techniques with TF-IDF on DS1 and DS2

Classifier Accuracy Precision Recall F1-score
Ensemble techniques with TF-IDF on DS1
Bagging 0.94368 0.94283 0.93451 0.93457
Boosting 0.90354 0.90228 0.9134 0.91586
Stacking 0.93765 0.92506 0.92355 0.93679
Ensemble techniques with TF-IDF on DS2
Bagging 0.95242 0.94525 0.95221 0.94625
Boosting 0.93691 0.93542 0.92231 0.92042
Stacking 0.91969 0.90125 0.90589 0.91287

4.5.3 Results of deep learning techniques for the tweet spam classification

Table 7 Deep learning techniques based models on DS1 and DS2

Models Accuracy Precision Recall F1-score
Deep learning techniques on DS1
Basic ANN (64 and 32 layers) 0.979 0.969 0.988 0.978
LSTM 0.673 0.652 0.637 0.646
Single convolution layer 0.979 0.974 0.982 0.978
Two convolution layers 0.986 0.997 0.972 0.985
GRU 0.983 0.978 0.986 0.982
VDCNN 0.938 0.99 0.868 0.929
Convolution + LSTM 0.923 0.88 0.954 0.92
Deep learning techniques on DS2
Basic ANN (64 and 32 layers) 0.928 0.948 0.868 0.916
LSTM 0.612 0.606 0.378 0.464
Single convolution layer 0.906 0.951 0.832 0.88
Two convolution layers 0.87 0.896 0.801 0.846
GRU 0.558 0.559 0.047 0.086
VDCNN 0.829 0.977 0.631 0.767
Convolution + LSTM 0.558 0.682 0.021 0.041

Table 7 shows the results of the different deep learning techniques on the DS1 and DS2 datasets in terms of accuracy, precision, recall, and f1-score measures. The table shows that, except for the LSTM technique, all other used deep learning
Fig. 2 Comparison of the different used ML, ensemble, and deep learning techniques with bag of words (BOW) on the DS1 and DS2 datasets; panels show accuracy, precision, recall, and f1-score (*LR = Logistic Regression, NB = Naive Bayes, KNN = K-nearest neighbors, DT = Decision Tree, RF = Random Forest, SLC = Single layer convolution, TLC = Two layer convolution, GRU = Gated recurrent unit, Cov_LSTM = Convolution + LSTM, VDCNN = Very deep convolution neural network)
Fig. 3 Comparison of the different used ML, ensemble, and deep learning techniques with TF-IDF on DS1 and DS2; panels show accuracy, precision, recall, and f1-score
techniques have produced a higher performance for spam that ensemble learning techniques-based models produced
classification on the DS1 dataset. The values are above 90% equal or better performance than deep learning techniques-
for all the measures in most cases. For the DS2 dataset, based models. The possible reason behind it is that the DS1
the performance of the deep learning techniques has been and DS2 datasets are not large enough to optimally train the
decreased. Here, basic ANN and single convolution layer deep learning-based models. Moreover, no improvement in the
techniques produced a performance greater than 90%. GRU performance of the deep learning models has been recorded
and the Convolution + LSTM have performed relatively when DS2 is used. Therefore, it can be inferred that adding
poorly on DS2. a hybrid does not help with performance improvement. One
exception report has been reported for the Convolution+LSTM
4.5.4 Performance comparison of the used machine model, where the recall value was very low. This issue can be
learning, ensemble, and deep learning techniques further investigated by optimally tuning the hyperparameters
for the tweet spam classification of the technique. Furthermore, it can also be inferred that time-
series models such as LSTM are not an ideal choice for the
Figures 2 and 3 show the performance comparison of the suspicious tweets’ identification.
used different set of techniques for the BOW and TF-IDF on
DS1 and DS2 datasets. The X-axis represents the set of tech-
niques, and Y-axis shows the achieved performance values. 5 Conclusions and future work
From Fig. 2, it is observed that overall, ensemble techniques
and deep learning techniques (except the LSTM technique) This paper focused on detecting suspicious tweets in trend-
have performed better than the machine learning techniques on the DS1 dataset. However, the performance of the decision tree and random forest techniques is comparable to, or better than, that of the ensemble and deep learning techniques on the DS2 dataset. For the recall and f1-score measures, the LSTM, GRU, and convolution+LSTM techniques performed relatively poorly compared to the other techniques on DS2. Overall, the techniques performed better on DS1 and relatively poorly on DS2. Similarly, Fig. 3 shows that the ensemble and deep learning techniques again outperformed the machine learning techniques on DS1, whereas on DS2 the performance of the machine learning and ensemble techniques improved and that of the deep learning techniques decreased.

Overall, the presented experimental analysis shows that the different sets of learning techniques achieved high performance for tweet spam classification; in most cases, the values are above 90% across the performance measures. These results show that using profile, content, user, and hybrid features for suspicious tweet detection helps build better prediction models.

4.6 Discussion of results

This paper aims to develop models for identifying suspicious tweets using different feature sets: profile-based, content-based, and hybrid features. Different machine learning and deep learning techniques have been applied to build the prediction models. We tried two combinations of features and thus created two datasets of 3650 and 9778 tweets, respectively. Dataset-1 (DS-1) includes only the profile-based and content-based features; Dataset-2 (DS-2) includes the profile-based, content-based, and hybrid features. The results showed …

…trending Twitter topics by analyzing the profile, user, and content features and their combinations. First, we crawled and extracted data on Twitter trending topics using the tweepy library. We then extracted different sets of features from the collected Twitter data and labeled the dataset with spam and not-spam labels. Next, we applied different machine learning, ensemble, and deep learning techniques to tweet spam classification and assessed their performance. The results showed that the dataset combining profile, content, and hybrid features improved the performance of the machine learning and ensemble techniques but did not improve that of the deep learning techniques. The learning techniques performed almost equally for both NLP feature extraction methods, BOW and TF-IDF. In most cases, the machine learning techniques achieved 90% or above on the different performance measures. The presented work showed that the hybrid features are the most important for tweet spam classification. In this paper, we used some common behaviors of users and content to label tweets as spam or not-spam and built several models for identifying spam tweets. The idea was to recognize patterns in order to design a method capable of automatically annotating a large-scale dataset. However, there is further scope for improving the filters used for spam labeling of tweets. The experimental analysis presented in this work showed that factors such as feature selection and the filters used for spam labeling greatly influence the performance of the learning techniques. A stable and better-annotated dataset could further improve the models' performance. Future research will present an approach to classify a user as malicious or valid. Further, we would like to investigate the dependence among the features and their significance for malicious bot detection.
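To make the feature sets concrete, the sketch below derives a few illustrative profile, content, and hybrid features from a single tweet record. The field names assume Twitter's v1.1 tweet JSON shape, and the specific features (follower/friend ratio, URLs per word, etc.) are hypothetical examples rather than the paper's exact feature list:

```python
import re

def extract_features(tweet: dict) -> dict:
    """Derive illustrative profile, content, and hybrid features
    from one tweet record (v1.1-style JSON assumed)."""
    user = tweet["user"]
    text = tweet["text"]
    followers = user["followers_count"]
    friends = user["friends_count"]
    # Profile-based features: drawn from the account itself.
    profile = {
        "followers_count": followers,
        "friends_count": friends,
        "follower_friend_ratio": followers / max(friends, 1),
    }
    # Content-based features: drawn from the tweet text.
    content = {
        "num_hashtags": text.count("#"),
        "num_mentions": text.count("@"),
        "num_urls": len(re.findall(r"https?://\S+", text)),
        "num_words": len(text.split()),
    }
    # Hybrid feature: combines content signals with each other
    # (spammy tweets tend to be URL-dense relative to their length).
    hybrid = {
        "urls_per_word": content["num_urls"] / max(content["num_words"], 1),
    }
    return {**profile, **content, **hybrid}
```

A feature vector like this would then be concatenated with the BOW or TF-IDF text representation before training.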
149 Page 12 of 13 Social Network Analysis and Mining (2022) 12:149
Measure    Description
Accuracy   The ratio of correctly predicted examples to the total number of examples. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision  The proportion of correctly predicted positive examples among all examples predicted as positive. Precision = TP / (TP + FP)
Recall     The proportion of correctly predicted positive examples among all positive examples in the actual class. Recall = TP / (TP + FN)
F1-score   The harmonic mean of precision and recall; it accounts for both false positives and false negatives. F1-score = (2 × Precision × Recall) / (Precision + Recall)

*TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative
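The four measures in the table can be computed directly from the confusion-matrix counts. A minimal sketch (not tied to any particular library; the zero-denominator guards are a defensive addition):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1-score
    from true/false positive/negative counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}
```

For example, with tp=80, tn=90, fp=20, fn=10 this gives accuracy 0.85, precision 0.80, and recall ≈ 0.889.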
Appendix A

Appendix B

Term frequency (TF) It counts the number of times a word wi occurs in a document rj, relative to the total number of words. It is defined by Eq. 1.

    tf(wi, rj) = (Number of times wi occurs in rj) / (Total number of words in rj)    (1)

Inverse document frequency (IDF) It highlights terms that appear in only a small number of documents across the corpus, i.e., words with a high IDF score. It is defined by Eq. 2.

    idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )    (2)

where |D| is the total number of documents in the corpus and |{d ∈ D : t ∈ d}| is the number of documents in the corpus that contain the term t.

Term frequency-inverse document frequency (TF-IDF) TF-IDF is the product of TF and IDF. It is defined by Eq. 3.

    tf-idf(t, d, D) = tf(t, d) × idf(t, D)    (3)
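Eqs. 1-3 can be sketched directly in Python. This is a minimal illustration over tokenized documents, not the paper's implementation; it assumes the queried word occurs in at least one document of the corpus:

```python
import math

def tf(word, doc):
    """Eq. 1: frequency of `word` in the tokenized document `doc`."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """Eq. 2: log of (corpus size) over (documents containing `word`).
    Assumes `word` appears in at least one document."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)

def tf_idf(word, doc, corpus):
    """Eq. 3: product of TF and IDF."""
    return tf(word, doc) * idf(word, corpus)
```

For a three-document corpus in which "prize" appears once in a three-word document and nowhere else, tf_idf gives (1/3) × log(3/1).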
Acknowledgments This work is partially supported by a Research Grant under National Supercomputing Mission (India), Grant number: DST/NSM/R&D_HPC_Applications/2021/24.

Declarations

Conflict of interest The authors declare no potential conflict of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.