Using BERT Encoding To Tackle The Mad-Lib Attack in SMS Spam Detection
Sergio Rojas–Galeano
Universidad Distrital Francisco José de Caldas, Bogotá, Colombia
[email protected]
Abstract. One of the strategies used by spammers to evade detection is to substitute vocables with synonyms or similar words that render the message unrecognisable to the detection algorithms. In this paper we investigate whether the recent development of language models sensitive to the semantics and context of words, such as Google's BERT, may be useful to overcome this adversarial attack (called "Mad-lib" after the word-substitution game). Using a dataset of 5,574 SMS messages, we first established a baseline of detection performance using widely known document representation models (BoW and TFIDF) and the novel BERT model, coupled with a variety of classification algorithms (Decision Tree, kNN, SVM, Logistic Regression, Naive Bayes, Multilayer Perceptron). Then, we built a thesaurus of the vocabulary contained in these messages, and set up a Mad-lib attack experiment in which we modified each message of a held-out subset of data (not used in the baseline experiment) with different rates of substitution of original words with synonyms from the thesaurus. Lastly, we evaluated the detection performance of the three representation models (BoW, TFIDF and BERT) coupled with the best classifier from the baseline experiment (SVM). We found that the classic models achieved a 94% Balanced Accuracy (BA) on the original dataset, whereas the BERT model obtained 96%. On the other hand, the Mad-lib attack experiment showed that BERT encodings maintain a similar BA of 96% with an average substitution rate of 1.82 words per message, and of 95% with 3.34 words substituted per message. In contrast, the BA of the BoW and TFIDF encoders dropped to chance. These results hint at the potential advantage of BERT models in combating this type of ingenious attack, offsetting to some extent the inappropriate use of semantic relationships in language.
1 Introduction
Unsolicited email (spam) remains a global burden, accounting for up to 85% of daily message traffic, according to some network security providers1.

1 See https://dataprot.net/statistics/spam-statistics/, last visited: July 13, 2021.
Adversarial attack tactics typically involve carefully crafting the content of the
input data to disrupt the expected behaviour of a prediction model [17]. The
study of adversarial environments attracted attention more than a decade ago when, incidentally, the vulnerabilities of spam filters confronted with this type of manipulation were uncovered [4]. Since then, many adversarial attacks and
defences have been described in a variety of applications such as online abusive
comments and profanity detection [10, 22, 23, 25], classification of medical images
[9] or object identification in computer vision [11, 2], to name a few.
In the case of text classification tasks, attacks are generally performed by
corrupting features or distorting the content of the text sequence [23]. More
particularly, in the field of adversarial attacks on spam filters, several tricks have
been characterised [12, 16, 24]: poisoning, injection of good words, obfuscation of
spam words, change of labels and replacement of synonyms. Our study focuses on
the latter by taking a proactive approach [4], that is, anticipating, modelling and
countering the adversarial strategy. In this sense, our study takes a step forward
by showing the feasibility of addressing Mad-lib adversaries (our second set of experiments, see below), in contrast to the work of [16], where the attack was described but not countered.
Regarding the use of BERT encodings for extracting spam features (our
first set of experiments, see below), a modified Transformer model was recently
proposed to improve the detection performance of spam classifiers [19]. Other
modified models derived from BERT have been proposed for the effective detec-
tion of malicious phishing emails [18], while BERT with increased functionality
has also been applied to filter multilingual spam messages [7] and to block fake COVID tweets [13], with promising results.
1.2 Contributions
2 Methods
2.1 Study roadmap
The study was conducted according to the stages illustrated in the roadmap of
Figure 1, which are described next.
(1) Dataset splitting. We worked with the SMS spam collection dataset from
the UCI repository2 . The dataset is unbalanced, as of the total of 5,574 mes-
sages, 4,827 are labelled as ham and only 747 as spam. The messages are quite
short; with an average length of 14.5 words, they pose an interesting challenge
for content-based filtering algorithms [3]. We used random sampling without replacement to divide the dataset into three subsets: train (60%), test (20%), and hold-out (20%).
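As an illustration, a minimal sketch of this three-way split with scikit-learn; the file name and random seed below are assumptions for the example, not part of the original study:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the UCI SMS Spam Collection (tab-separated: label <TAB> message);
# the file name is a hypothetical placeholder.
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "text"])

# Carve out the 20% hold-out split first, then divide the remaining 80%
# into train (60% of the total) and test (20% of the total).
rest, holdout = train_test_split(df, test_size=0.20, random_state=42)
train, test = train_test_split(rest, test_size=0.25, random_state=42)  # 0.25 * 0.80 = 0.20
```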
(2) Thesaurus creation. We extracted a vocabulary of the 5000 most frequent
terms from the entire dataset and used them as keywords in a thesaurus. For each
keyword, a list of synonyms was automatically scraped from its corresponding entry page on the website www.dictionary.com.
2 The dataset is available at: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
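A sketch of how such a thesaurus could be assembled. It uses the PyDictionary package [5] listed among the implementation libraries below; its synonym() call queries an online thesaurus, so it is a functional stand-in for the dictionary.com scraping performed in the study:

```python
from collections import Counter
from PyDictionary import PyDictionary

def build_thesaurus(messages, vocab_size=5000):
    """Map each of the most frequent terms to a list of scraped synonyms."""
    counts = Counter(w for msg in messages for w in msg.lower().split())
    keywords = [w for w, _ in counts.most_common(vocab_size)]
    dictionary = PyDictionary()
    thesaurus = {}
    for word in keywords:
        synonyms = dictionary.synonym(word)  # list of synonyms, or None
        if synonyms:
            thesaurus[word] = synonyms
    return thesaurus
```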
(3) Document encoding. Messages in each split are represented using two en-
codings commonly used in spam filtering, Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TFIDF) [15], and the recently intro-
duced Bi-directional Encoder Representations from Transformers (BERT) [8].
BoW and TFIDF are simplified representations that map words within a docu-
ment to a vector of frequencies indexed by a vocabulary (the latter normalised
by the fraction of documents that contain the words). These mappings capture
lexical features while ignoring syntax or semantics. For these models, we prepro-
cess the text by removing accents, removing stopwords in English, converting it
to lowercase, and applying stemming and tokenisation.
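A sketch of the two classic encoders with scikit-learn, applying the preprocessing just described (stemming is omitted here for brevity, and the 768-term vocabulary cap anticipates the experimental protocol below; the train split is reused from the earlier sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Shared preprocessing: strip accents, lowercase, drop English stopwords,
# and cap the vocabulary at 768 terms (see the experimental protocol).
opts = dict(strip_accents="unicode", lowercase=True,
            stop_words="english", max_features=768)

bow = CountVectorizer(**opts)      # raw term frequencies
tfidf = TfidfVectorizer(**opts)    # frequencies weighted by document rarity

X_bow = bow.fit_transform(train["text"])
X_tfidf = tfidf.fit_transform(train["text"])
```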
On the other hand, BERT is a language model trained as a deep bidirectional
network conditioned by both the left and right context of the words in the text
input, also considering semantic relationships. One of the outputs at the top
layer of the network is a vector of 768 positions that encodes an embedding
of the entire input sentence. We use this vector as a set of contextual and semantic features of the word sequence that makes up a message, relying on its ability to project spam messages that differ only in lexical variations onto nearby locations of the embedding space, regardless of the actual interpretation of these features. Besides, the text cleanup for this model was
minimal, basically converting to lowercase and applying the BERT tokeniser [8].
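A sketch of obtaining these 768-dimensional message embeddings, loading the pretrained model named in the Appendix via the sentence-transformers package (the study used the SimpleTransformers library [21]; this loader is an assumption for illustration):

```python
from sentence_transformers import SentenceTransformer

# Pretrained sentence encoder (model name taken from the Appendix);
# each message is mapped to a dense 768-dimensional embedding.
encoder = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens")

X_bert = encoder.encode(train["text"].str.lower().tolist())  # shape: (n_messages, 768)
```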
(4) Spam classification. At this stage, a first set of experiments was carried out
to evaluate how well the classification algorithms work on the original messages.
For this purpose, we used the training and test splits, represented with the
three encodings as input features of a variety of classification algorithms that
are regularly used for text classification tasks [15, 1, 14]: Decision Tree, Naive
Bayes, k-Nearest Neighbour (kNN), Support Vector Machine (SVM), Logistic
Regression and Multilayer Perceptron (MLP).
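A sketch of this baseline step for one encoder/classifier pair, using the linear SVM with the Appendix settings; the feature matrices and label vectors are assumed to come from the encoding sketches above:

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

# Linear SVM with the Appendix settings (C=1, squared hinge loss).
clf = LinearSVC(C=1, loss="squared_hinge")
clf.fit(X_train, y_train)      # encoded Train split and its spam/ham labels

y_pred = clf.predict(X_test)   # evaluate on the encoded Test split
print("BA:", balanced_accuracy_score(y_test, y_pred))
```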
(5) Mad-lib attack. Two attacks were carried out on the held-out subset: in each message, an attempt was made to replace 5 or 10 randomly chosen words with synonyms from the previously constructed thesaurus. As a result, two modified Mad-lib subsets were obtained.
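A sketch of the substitution procedure, assuming the thesaurus built in stage (2). Note that a word is replaced only when it has a thesaurus entry, which is why 5 or 10 attempts yield fewer actual substitutions on average:

```python
import random

def madlib_attack(message, thesaurus, attempts=5):
    """Try to replace `attempts` randomly chosen words with synonyms."""
    words = message.split()
    if not words:
        return message
    picks = random.sample(range(len(words)), min(attempts, len(words)))
    for i in picks:
        synonyms = thesaurus.get(words[i].lower())
        if synonyms:  # substitute only if the word has an entry
            words[i] = random.choice(synonyms)
    return " ".join(words)
```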
(6) Mad-lib spam classification. In this second set of experiments, the previously trained classifiers were evaluated on the modified Mad-lib sets, once encoded with the three aforementioned representation models.
The experiments were carried out according to the protocol described in Figure 2.
The dataset is divided into three partitions: Train, Test, and Holdout. The first
experiment was conducted to estimate a baseline of spam detection performance
on the original dataset, for comparison purposes in the subsequent Mad-lib attack experiment.
Initially, the messages in the Train and Test splits were encoded with the
three representation models (BoW, TFIDF, BERT) to obtain vectors of 768 fea-
tures (since this is the inherent size of the dense vectors generated by BERT,
we set the base vocabulary size for BoW and TFIDF accordingly). Then the
obtained feature vectors are fed to the aforementioned classification algorithms.
Each classifier is trained with the encoded vectors of the Train split along with their respective labels; once trained, its performance is evaluated on the Test split, using the metrics of Accuracy (ACC), Precision (PR) and Sensitivity (SE)
[26] and Balanced Accuracy (BA) [6]. The latter was deemed the most appropriate metric for this particular task, given that the dataset is highly unbalanced. These metrics are defined by the following equations:
\[
ACC = \frac{TP + TN}{P + N}, \qquad
BA = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right), \qquad
PR = \frac{TP}{TP + FP}, \qquad
SE = \frac{TP}{TP + FN},
\]
where P and N are the total numbers of spam and ham messages, TP and FP are the messages correctly and wrongly classified as spam, and TN and FN are the messages correctly and wrongly classified as ham, respectively. The results are collected over a total of 30 replications (with different samples of the Train and Test partitions) so as to reduce their variability due to randomness in the sampling procedure.
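For reference, these metrics can be computed directly from the confusion matrix; a minimal sketch, treating spam as the positive class:

```python
from sklearn.metrics import confusion_matrix

def spam_metrics(y_true, y_pred):
    """ACC, BA, PR and SE as defined above (spam = positive class, label 1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)          # P + N = all messages
    ba = 0.5 * (tp / (tp + fn) + tn / (tn + fp))   # mean of sensitivity and specificity
    pr = tp / (tp + fp)
    se = tp / (tp + fn)
    return acc, ba, pr, se
```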
The models and experiments were implemented in Python 3.8.5, using the libraries scikit-learn 0.24.0 [20], PyDictionary [5] and SimpleTransformers [21], and were executed in Google Colab with a GPU accelerator. A repository with code and materials is available at: github.com/Sargaleano/Madlib-Spam-Attack-BERT.
3 Results
3.1 Spam detection experiments
The results for these experiments are summarised in Table 2, where averages
and standard deviations for the performance metrics are reported, grouped by
encoding model and classification algorithm. Preliminary experiments were conducted to calibrate the classifiers (the final set of parameters is reported in the Appendix).
3.2 Mad-lib attack experiments
The results for these experiments are summarised in Table 3, where averages
and standard deviations for the performance metrics are reported, grouped by
number of attempts in the attack and encoding model (the linear SVM was chosen as classifier, since it achieved the best performance across all three encoding models).
In general, the results support the premise that the BERT model is useful for resisting this type of attack. We will focus on examining the BA metric for this analysis. In the first attack with zero substitutions (that is, using the Holdout split without modifying the original messages), the SVM performance is maintained, with a value of 96.6%. On the other hand, for the attacks with 5 and 10 substitution attempts (corresponding on average to 1.82 and 3.34 actual substitutions, as explained above), the BA of the BERT model decreased slightly to 96.2% and 95.2% respectively, about a 1% drop compared to the baseline experiment.
In contrast, these results also show that, in terms of BA, the performance of the BoW and TFIDF encoders degrades to levels close to chance. Curiously, even on the unmodified Hold-out partition the drop is noticeable; examining the SE rate reveals a sharp fall to 21.5%, that is, the detection of features commonly associated with spam-related words is greatly affected by the inclusion of out-of-sample terms, a phenomenon that is accentuated when Mad-lib substitutions are made in each message.
4 Conclusion
This study provided empirical evidence on the promise of BERT encodings in
tackling the Mad-lib spam attack. We reason that this is due to the ability of the model to represent semantic and contextual functions of language. Furthermore, BERT offers other advantages: it requires little pre-processing (cleanup) of text, and its inherent tokenisation method allows it to handle out-of-vocabulary terms. On the computational side, however, BERT is heavier than the simpler BoW encoders, which achieve comparable performance on spam not tampered with by Mad-lib adversaries.
Therefore, we anticipate that a combination of encoding models would be
a realistic configuration at the core of modern spam filters, in order to detect
behavioural changes implying that filter retraining is required (for example, acti-
vating an alert when the performance of BoW and BERT begins to differ widely).
Furthermore, we hope that BERT encodings will help resist not only the adver-
sarial scenario described in this document, but also other related attacks, such as
the inoculation of good words, the obfuscation with homoglyphs, or the disguise of spam-trigger words in other languages. We plan to explore these ideas in our
future work.
References
1. Charu C Aggarwal and ChengXiang Zhai. A survey of text classification algo-
rithms. In Mining text data, pages 163–222. Springer, 2012.
2. Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning
in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.
3. Tiago A Almeida, José María G Hidalgo, and Akebo Yamakami. Contributions to
the study of SMS spam filtering: new collection and results. In Proceedings of the
11th ACM Symposium on Document Engineering, pages 259–262, 2011.
4. Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial
machine learning. Pattern Recognition, 84:317–331, 2018.
5. Pradipta Bora. PyDictionary: A ”Real” Dictionary Module for Python (version
2.0.1), https://github.com/geekpradd/pydictionary, 2021.
6. Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M
Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th
International Conference on Pattern Recognition, pages 3121–3124. IEEE, 2010.
7. Jie Cao and Chengzhe Lai. A Bilingual Multi-type Spam Detection Model Based
on M-BERT. In IEEE Global Communications Conference, pages 1–6. IEEE, 2020.
8. Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. BERT: Pre-training of deep
bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
9. Samuel G Finlayson, John D Bowers, Joichi Ito, Jonathan L Zittrain, Andrew L
Beam, and Isaac S Kohane. Adversarial attacks on medical Machine Learning.
Science, 363(6433):1287–1289, 2019.
10. Hossein Hosseini, Sreeram Kannan, Baosen Zhang, et al. Deceiving Google’s per-
spective API built for detecting toxic comments. arXiv:1702.08138, 2017.
11. Hossein Hosseini, Baicen Xiao, and Radha Poovendran. Google’s cloud vision API
is not robust to noise. In 2017 16th IEEE international conference on machine
learning and applications (ICMLA), pages 101–105. IEEE, 2017.
12. Niddal H Imam and Vassilios G Vassilakis. A survey of attacks against twitter
spam detectors in an adversarial environment. Robotics, 8(3):50, 2019.
13. Debanjana Kar, Mohit Bhardwaj, et al. No Rumours Please! A Multi-Indic-Lingual
Approach for COVID Fake-Tweet Detection. arXiv:2010.06906, 2020.
14. Vandana Korde and C Namrata Mahender. Text classification and classifiers: A
survey. International Journal of Artificial Intelligence & Applications, 3(2):85,
2012.
15. Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, et al. Text clas-
sification algorithms: A survey. Information, 10(4):150, 2019.
16. Bhargav Kuchipudi, Ravi Teja Nannapaneni, and Qi Liao. Adversarial machine
learning for spam filters. In Proceedings of the 15th International Conference on
Availability, Reliability and Security, pages 1–6, 2020.
17. Pavel Laskov and Richard Lippmann. Machine learning in adversarial environ-
ments. Machine Learning, (2):115–119, 2010.
18. Younghoo Lee, Joshua Saxe, and Richard Harang. CATBERT: Context-Aware
Tiny BERT for Detecting Social Engineering Emails. arXiv:2010.03484, 2020.
19. Xiaoxu Liu. A Spam Transformer Model for SMS Spam Detection. Master’s thesis,
Université d’Ottawa/University of Ottawa, 2021.
20. Fabian Pedregosa, Gael Varoquaux, et al. Scikit-learn: Machine Learning in
Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
21. Thilina Rajapakse. Simple Transformers (2021), https://simpletransformers.ai.
22. Nestor Rodriguez and Sergio Rojas-Galeano. Shielding Google's language toxicity model against adversarial attacks. arXiv:1801.01828, 2018.
23. Sergio Rojas-Galeano. On obstructing obscenity obfuscation. ACM Transactions
on the Web (TWEB), 11(2):1–24, 2017.
24. Sergio A Rojas-Galeano. Revealing non-alphabetical guises of spam-trigger voca-
bles. Dyna, 80(182):15–24, 2013.
25. Sara Sood, Judd Antin, et al. Profanity use in online communities. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, 2012.
26. Alaa Tharwat. Classification assessment methods. Applied Computing and Infor-
matics, 17, 2021.
Appendix
Chosen model parameters for algorithms used in experiments are shown below.
Classification algorithms:
Decision Tree: max_depth=10
Naive Bayes: default parameters
kNN: k=15
SVM (linear): C=1, loss='squared_hinge'
Logistic Regression: default parameters
MLP: hidden_layer_sizes=(10,), alpha=1, max_iter=1000
SVM (gaussian): gamma=0.01, C=100

Representation models:
BoW, TFIDF: stemming, lowercase, stop words, max_features=768
BERT: model='xlm-r-bert-base-nli-stsb-mean-tokens'