Using BERT Encoding To Tackle The Mad-Lib Attack in SMS Spam Detection
Sergio Rojas–Galeano
Universidad Distrital Francisco José de Caldas, Bogotá, Colombia
[email protected]
Abstract. One of the strategies used by spammers to evade detection is to substitute vocables with synonyms or similar words that render the message unrecognisable to the detection algorithms. In this paper we investigate whether the recent development of language models sensitive to the semantics and context of words, such as Google's BERT, may be useful to overcome this adversarial attack (called "Mad-lib" after the word-substitution game). Using a dataset of 5,574 SMS messages, we first established a baseline of detection performance using widely known document representation models (BoW and TFIDF) and the novel BERT model, coupled with a variety of classification algorithms (Decision Tree, kNN, SVM, Logistic Regression, Naive Bayes, Multilayer Perceptron). Then, we built a thesaurus of the vocabulary contained in these messages, and set up a Mad-lib attack experiment in which we modified each message of a held-out subset of data (not used in the baseline experiment) with different rates of substitution of original words with synonyms from the thesaurus. Lastly, we evaluated the detection performance of the three representation models (BoW, TFIDF and BERT) coupled with the best classifier from the baseline experiment (SVM). We found that the classic models achieved a 94% Balanced Accuracy (BA) on the original dataset, whereas the BERT model obtained 96%. On the other hand, the Mad-lib attack experiment showed that BERT encodings maintain a similar BA of 96% with an average substitution rate of 1.82 words per message, and of 95% with 3.34 words substituted per message. In contrast, the BA of the BoW and TFIDF encoders dropped to chance. These results hint at the potential advantage of BERT models in combating this type of ingenious attack, offsetting to some extent the inappropriate use of semantic relationships in language.
1 Introduction
Unsolicited email (spam) remains a global burden, accounting for up to 85% of daily message traffic, according to some network security providers1.

1 See https://dataprot.net/statistics/spam-statistics/, last visited: July 13, 2021.
Adversarial attack tactics typically involve carefully crafting the content of the
input data to disrupt the expected behaviour of a prediction model [17]. The
study of adversarial environments attracted attention more than a decade ago when, incidentally, the vulnerabilities of spam filters confronted with this type of manipulation were uncovered [4]. Since then, many adversarial attacks and
defences have been described in a variety of applications such as online abusive
comments and profanity detection [10, 22, 23, 25], classification of medical images
[9] or object identification in computer vision [11, 2], to name a few.
In the case of text classification tasks, attacks are generally performed by
corrupting features or distorting the content of the text sequence [23]. More
particularly, in the field of adversarial attacks on spam filters, several tricks have
been characterised [12, 16, 24]: poisoning, injection of good words, obfuscation of
spam words, change of labels and replacement of synonyms. Our study focuses on
the latter by taking a proactive approach [4], that is, anticipating, modelling and
countering the adversarial strategy. In this sense, our study takes a step forward
by showing the feasibility of addressing Mad-lib adversaries (our second set of experiments, see below), in contrast to the work of [16], where the attack was described but not countered.
Regarding the use of BERT encodings for extracting spam features (our
first set of experiments, see below), a modified Transformer model was recently
proposed to improve the detection performance of spam classifiers [19]. Other
modified models derived from BERT have been proposed for the effective detec-
tion of malicious phishing emails [18], while BERT with increased functionality
has also been applied to filter multilingual spam messages [7] and to block fake COVID tweets [13], with promising results.
1.2 Contributions
2 Methods
2.1 Study roadmap
The study was conducted according to the stages illustrated in the roadmap of
Figure 1, which are described next.
(1) Dataset splitting. We worked with the SMS spam collection dataset from
the UCI repository2 . The dataset is unbalanced, as of the total of 5,574 mes-
sages, 4,827 are labelled as ham and only 747 as spam. The messages are quite
short; with an average length of 14.5 words, they pose an interesting challenge
for content-based filtering algorithms [3]. We used random sampling without replacement to divide the dataset into three subsets: train (60%), test (20%), and hold-out (20%).
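As an illustration, a minimal sketch of this three-way split with scikit-learn; the file name and random seed below are assumptions for the example, not part of the original study:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the UCI SMS Spam Collection (tab-separated: label <TAB> message);
# the file name is a hypothetical placeholder.
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "text"])

# Carve out the 20% hold-out split first, then divide the remaining 80%
# into train (60% of the total) and test (20% of the total).
rest, holdout = train_test_split(df, test_size=0.20, random_state=42)
train, test = train_test_split(rest, test_size=0.25, random_state=42)  # 0.25 * 0.80 = 0.20
```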
(2) Thesaurus creation. We extracted a vocabulary of the 5000 most frequent
terms from the entire dataset and used them as keywords in a thesaurus. For each
keyword, a list of synonyms was automatically scraped from its corresponding entry page on the website www.dictionary.com.
2 The dataset is available at: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
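A sketch of how such a thesaurus could be assembled. It uses the PyDictionary package [5] listed among the implementation libraries below; its synonym() call queries an online thesaurus, so it is a functional stand-in for the dictionary.com scraping performed in the study:

```python
from collections import Counter
from PyDictionary import PyDictionary

def build_thesaurus(messages, vocab_size=5000):
    """Map each of the most frequent terms to a list of scraped synonyms."""
    counts = Counter(w for msg in messages for w in msg.lower().split())
    keywords = [w for w, _ in counts.most_common(vocab_size)]
    dictionary = PyDictionary()
    thesaurus = {}
    for word in keywords:
        synonyms = dictionary.synonym(word)  # list of synonyms, or None
        if synonyms:
            thesaurus[word] = synonyms
    return thesaurus
```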
(3) Document encoding. Messages in each split are represented using two en-
codings commonly used in spam filtering, Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TFIDF) [15], and the recently intro-
duced Bi-directional Encoder Representations from Transformers (BERT) [8].
BoW and TFIDF are simplified representations that map words within a docu-
ment to a vector of frequencies indexed by a vocabulary (the latter normalised
by the fraction of documents that contain the words). These mappings capture
lexical features while ignoring syntax or semantics. For these models, we prepro-
cess the text by removing accents, removing stopwords in English, converting it
to lowercase, and applying stemming and tokenisation.
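A sketch of the two classic encoders with scikit-learn, applying the preprocessing just described (stemming is omitted here for brevity, and the 768-term vocabulary cap anticipates the experimental protocol below; the train split is reused from the earlier sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Shared preprocessing: strip accents, lowercase, drop English stopwords,
# and cap the vocabulary at 768 terms (see the experimental protocol).
opts = dict(strip_accents="unicode", lowercase=True,
            stop_words="english", max_features=768)

bow = CountVectorizer(**opts)      # raw term frequencies
tfidf = TfidfVectorizer(**opts)    # frequencies weighted by document rarity

X_bow = bow.fit_transform(train["text"])
X_tfidf = tfidf.fit_transform(train["text"])
```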
On the other hand, BERT is a language model trained as a deep bidirectional
network conditioned by both the left and right context of the words in the text
input, also considering semantic relationships. One of the outputs at the top
layer of the network is a vector of 768 positions that encodes an embedding
of the entire input sentence. We use this vector as a set of contextual and semantic features of the word sequence that makes up a message, relying on its ability to project spam messages that differ only in lexical variations onto nearby locations of the embedding space, regardless of the actual interpretation of these features. Besides, the text cleanup for this model was
minimal, basically converting to lowercase and applying the BERT tokeniser [8].
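A sketch of obtaining these 768-dimensional message embeddings, loading the pretrained model named in the Appendix via the sentence-transformers package (the study used the SimpleTransformers library [21]; this loader is an assumption for illustration):

```python
from sentence_transformers import SentenceTransformer

# Pretrained sentence encoder (model name taken from the Appendix);
# each message is mapped to a dense 768-dimensional embedding.
encoder = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens")

X_bert = encoder.encode(train["text"].str.lower().tolist())  # shape: (n_messages, 768)
```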
(4) Spam classification. At this stage, a first set of experiments was carried out
to evaluate how well the classification algorithms work on the original messages.
For this purpose, we used the training and test splits, represented with the
three encodings as input features of a variety of classification algorithms that
are regularly used for text classification tasks [15, 1, 14]: Decision Tree, Naive
Bayes, k-Nearest Neighbour (kNN), Support Vector Machine (SVM), Logistic
Regression and Multilayer Perceptron (MLP).
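A sketch of this baseline step for one encoder/classifier pair, using the linear SVM with the Appendix settings; the feature matrices and label vectors are assumed to come from the encoding sketches above:

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

# Linear SVM with the Appendix settings (C=1, squared hinge loss).
clf = LinearSVC(C=1, loss="squared_hinge")
clf.fit(X_train, y_train)      # encoded Train split and its spam/ham labels

y_pred = clf.predict(X_test)   # evaluate on the encoded Test split
print("BA:", balanced_accuracy_score(y_test, y_pred))
```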
(5) Mad-lib attack. Two attacks were carried out on the held-out subset: in each message, an attempt was made to replace 5 or 10 randomly chosen words with synonyms from the previously constructed thesaurus. As a result, two modified Mad-lib subsets were obtained.
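A sketch of the substitution procedure, assuming the thesaurus built in stage (2). Note that a word is replaced only when it has a thesaurus entry, which is why 5 or 10 attempts yield fewer actual substitutions on average:

```python
import random

def madlib_attack(message, thesaurus, attempts=5):
    """Try to replace `attempts` randomly chosen words with synonyms."""
    words = message.split()
    if not words:
        return message
    picks = random.sample(range(len(words)), min(attempts, len(words)))
    for i in picks:
        synonyms = thesaurus.get(words[i].lower())
        if synonyms:  # substitute only if the word has an entry
            words[i] = random.choice(synonyms)
    return " ".join(words)
```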
(6) Mad-lib spam classification. In this second set of experiments, the previously trained classifiers were evaluated on the modified Mad-lib sets, once encoded with the three aforementioned representation models.
The experiments were carried out according to the protocol described in Figure 2.
The dataset is divided into three partitions: Train, Test, and Holdout. The first
experiment was conducted to estimate a baseline of spam detection performance
on the original dataset, for comparison purposes in the subsequent Mad-lib attack experiment.
Initially, the messages in the Train and Test splits were encoded with the
three representation models (BoW, TFIDF, BERT) to obtain vectors of 768 fea-
tures (since this is the inherent size of the dense vectors generated by BERT,
we set the base vocabulary size for BoW and TFIDF accordingly). Then the
obtained feature vectors are fed to the aforementioned classification algorithms.
Each classifier is trained with the encoded vectors of the Train split along with their respective labels; once trained, its performance is evaluated on the Test split, using the metrics of Accuracy (ACC), Precision (PR) and Sensitivity (SE)
[26] and Balanced Accuracy (BA) [6]. The latter was deemed the most appropriate metric for this particular task, given that the dataset is highly unbalanced. These metrics are defined by the following equations:
\[
ACC = \frac{TP + TN}{P + N}, \qquad
BA = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right), \qquad
PR = \frac{TP}{TP + FP}, \qquad
SE = \frac{TP}{TP + FN},
\]
where P and N are the total numbers of spam and ham messages, TP and FP are the messages correctly and wrongly classified as spam, and TN and FN are the messages correctly and wrongly classified as ham, respectively. The results are collected over a total of 30 replications (with different samples of the Train and Test partitions) so as to reduce their variability due to randomness in the sampling procedure.
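For reference, these metrics can be computed directly from the confusion matrix; a minimal sketch, treating spam as the positive class:

```python
from sklearn.metrics import confusion_matrix

def spam_metrics(y_true, y_pred):
    """ACC, BA, PR and SE as defined above (spam = positive class, label 1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)          # P + N = all messages
    ba = 0.5 * (tp / (tp + fn) + tn / (tn + fp))   # mean of sensitivity and specificity
    pr = tp / (tp + fp)
    se = tp / (tp + fn)
    return acc, ba, pr, se
```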
The models and experiments were implemented in Python 3.8.5, using the libraries scikit-learn 0.24.0 [20], PyDictionary [5] and SimpleTransformers [21], and were executed in Google Colab with a GPU accelerator. A repository with code and materials is available at: github.com/Sargaleano/Madlib-Spam-Attack-BERT.
3 Results
3.1 Spam detection experiments
The results for these experiments are summarised in Table 2, where averages
and standard deviations for the performance metrics are reported, grouped by
encoding model and classification algorithm. Preliminary experiments were conducted to calibrate the classifiers (the final set of parameters is reported in the Appendix).
3.2 Mad-lib attack experiments
The results for these experiments are summarised in Table 3, where averages
and standard deviations for the performance metrics are reported, grouped by
number of attempts in the attack and encoding model (the linear SVM was chosen as classifier, since it achieved the best performance across all three encoding models).
In general, the results support the premise that the BERT model is useful for resisting this type of attack. We will focus on examining the BA metric for this analysis. In the first attack with zero substitutions (that is, using the Holdout split without modifying the original messages), the SVM performance is maintained, with a value of 96.6%. On the other hand, for the attacks with 5 and 10 substitution attempts (corresponding on average to 1.82 and 3.34 actual substitutions, as explained above), the BA of the BERT model decreased slightly to 96.2% and 95.2% respectively, about a 1% drop compared to the baseline experiment.
In contrast, these results also show that, in terms of BA, the performance of the BoW and TFIDF encoders degrades to levels close to chance. Curiously, even on the unmodified Hold-out partition the drop is noticeable; examining the SE rate reveals a sharp fall to 21.5%, that is, the detection of features commonly associated with spam-related words is greatly affected by the inclusion of out-of-sample terms, a phenomenon that is accentuated when Mad-lib substitutions are made in each message.
4 Conclusion
This study provided empirical evidence on the promise of BERT encodings in
tackling the Mad-lib spam attack. We reason that this is due to the ability of the model to represent semantic and contextual functions of language. Furthermore, BERT offers other advantages: it requires little pre-processing (cleanup) of text, and its inherent tokenisation method allows it to handle out-of-vocabulary terms. On the computational side, however, BERT is heavier than the simpler BoW encoders, which achieve comparable performance on spam not tampered with by Mad-lib adversaries.
Therefore, we anticipate that a combination of encoding models would be
a realistic configuration at the core of modern spam filters, in order to detect
behavioural changes implying that filter retraining is required (for example, acti-
vating an alert when the performance of BoW and BERT begins to differ widely).
Furthermore, we hope that BERT encodings will help resist not only the adver-
sarial scenario described in this document, but also other related attacks, such as
the inoculation of good words, the obfuscation with homoglyphs, or the disguise of spam-trigger words in other languages. We plan to explore these ideas in our
future work.
References
1. Charu C Aggarwal and ChengXiang Zhai. A survey of text classification algo-
rithms. In Mining text data, pages 163–222. Springer, 2012.
2. Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning
in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.
3. Tiago A Almeida, José María G Hidalgo, and Akebo Yamakami. Contributions to
the study of SMS spam filtering: new collection and results. In Proceedings of the
11th ACM Symposium on Document Engineering, pages 259–262, 2011.
4. Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial
machine learning. Pattern Recognition, 84:317–331, 2018.
5. Pradipta Bora. PyDictionary: A ”Real” Dictionary Module for Python (version
2.0.1), https://github.com/geekpradd/pydictionary, 2021.
6. Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M
Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th
International Conference on Pattern Recognition, pages 3121–3124. IEEE, 2010.
7. Jie Cao and Chengzhe Lai. A Bilingual Multi-type Spam Detection Model Based
on M-BERT. In IEEE Global Communications Conference, pages 1–6. IEEE, 2020.
8. Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. BERT: Pre-training of deep
bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
9. Samuel G Finlayson, John D Bowers, Joichi Ito, Jonathan L Zittrain, Andrew L
Beam, and Isaac S Kohane. Adversarial attacks on medical Machine Learning.
Science, 363(6433):1287–1289, 2019.
10. Hossein Hosseini, Sreeram Kannan, Baosen Zhang, et al. Deceiving Google’s per-
spective API built for detecting toxic comments. arXiv:1702.08138, 2017.
11. Hossein Hosseini, Baicen Xiao, and Radha Poovendran. Google’s cloud vision API
is not robust to noise. In 2017 16th IEEE international conference on machine
learning and applications (ICMLA), pages 101–105. IEEE, 2017.
12. Niddal H Imam and Vassilios G Vassilakis. A survey of attacks against twitter
spam detectors in an adversarial environment. Robotics, 8(3):50, 2019.
13. Debanjana Kar, Mohit Bhardwaj, et al. No Rumours Please! A Multi-Indic-Lingual
Approach for COVID Fake-Tweet Detection. arXiv:2010.06906, 2020.
14. Vandana Korde and C Namrata Mahender. Text classification and classifiers: A
survey. International Journal of Artificial Intelligence & Applications, 3(2):85,
2012.
15. Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, et al. Text clas-
sification algorithms: A survey. Information, 10(4):150, 2019.
16. Bhargav Kuchipudi, Ravi Teja Nannapaneni, and Qi Liao. Adversarial machine
learning for spam filters. In Proceedings of the 15th International Conference on
Availability, Reliability and Security, pages 1–6, 2020.
17. Pavel Laskov and Richard Lippmann. Machine learning in adversarial environ-
ments. Machine Learning, (2):115–119, 2010.
18. Younghoo Lee, Joshua Saxe, and Richard Harang. CATBERT: Context-Aware
Tiny BERT for Detecting Social Engineering Emails. arXiv:2010.03484, 2020.
19. Xiaoxu Liu. A Spam Transformer Model for SMS Spam Detection. Master’s thesis,
Université d’Ottawa/University of Ottawa, 2021.
20. Fabian Pedregosa, Gael Varoquaux, et al. Scikit-learn: Machine Learning in
Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
21. Thilina Rajapakse. Simple Transformers (2021), https://simpletransformers.ai.
22. Nestor Rodriguez and Sergio Rojas-Galeano. Shielding Google's language toxicity model against adversarial attacks. arXiv:1801.01828, 2018.
23. Sergio Rojas-Galeano. On obstructing obscenity obfuscation. ACM Transactions
on the Web (TWEB), 11(2):1–24, 2017.
24. Sergio A Rojas-Galeano. Revealing non-alphabetical guises of spam-trigger voca-
bles. Dyna, 80(182):15–24, 2013.
25. Sara Sood, Judd Antin, et al. Profanity use in online communities. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, 2012.
26. Alaa Tharwat. Classification assessment methods. Applied Computing and Infor-
matics, 17, 2021.
Appendix
Chosen model parameters for algorithms used in experiments are shown below.
Classification algorithms:
Decision Tree: max_depth=10
Naive Bayes: default parameters
kNN: k=15
SVM (linear): C=1, loss='squared_hinge'
Logistic Regression: default parameters
MLP: hidden_layer_sizes=(10,), alpha=1, max_iter=1000
SVM (gaussian): gamma=0.01, C=100

Representation models:
BoW, TFIDF: stemming, lowercase, stop words, max_features=768
BERT: model='xlm-r-bert-base-nli-stsb-mean-tokens'