Anchal Ora
Student ID: x18135846
School of Computing
National College of Ireland
Spam Detection in Short Message Service Using
Natural Language Processing and Machine Learning
Techniques
Anchal Ora
x18135846
Abstract
As the usage of mobile phones increased, the use of Short Message Service (SMS)
increased significantly. Because text messages are cheap to send, people started
using them for promotional purposes and unethical activities. As a result, the
share of spam messages rose rapidly, leading to the loss of personal and
financial data. To prevent data loss, it is crucial to detect spam messages as quickly as
possible. Thus, the research aims to classify spam messages not only efficiently but
also with low latency. Machine learning models such as XGBoost, LightGBM and
Bernoulli Naïve Bayes, which are known to be fast with low time complexity,
were implemented in the research. The length of each message was taken as
an additional feature, and the features were extracted using Unigram, Bigram and
TF-IDF matrices. Chi-Square feature selection was applied to further reduce
the space complexity. The results showed that Bernoulli Naïve Bayes followed
by LightGBM, both with the TF-IDF matrix, achieved the highest accuracies: 96.5% in
0.157 seconds and 95.4% in 1.708 seconds respectively.
Keywords: Spam SMS, Text Classification, Natural Language Processing, Machine
Learning, Bernoulli Naïve Bayes, LightGBM, XGBoost
1 Introduction
Short Message Service (SMS) is widely used for unofficial communication, such as
promoting new products and services, and at times also for official communication, such as
information about a bank transaction or confirmation of an order on an online portal.
Due to advancements in technology, the cost of sending an SMS has fallen
drastically. This has proved to be a boon for some but a bane for many. People
misuse the SMS facility to promote products, services, offers, schemes and so
on. How annoying this has become can be judged from the fact that people have started
ignoring the SMS they receive, because twenty to thirty percent of all SMS received is
spam (Kim et al. (2015)). This menace is growing at a rapid rate. As a result, people
miss genuine informative messages such as bank transaction alerts. At times, ignoring
SMS can prove detrimental, because a fraudulent transaction might have
been performed while the message reporting it went unread. The motive behind this project is to
apply machine learning algorithms to separate spam messages from genuine ones. Machine
learning techniques along with Natural Language Processing techniques were used
to make the process more agile and efficient.
1.1 Motivation and Background
For today's generation, phones are no longer confined to communication but serve an
array of uses such as storing personal information (documents, notes, media),
making financial transactions, shopping, etc. Because a wide range of information is stored
on devices, some of it personal and critical, hacking phones is of utmost interest to
people with unethical intentions. SMS is an easy way to target people because it is used
by people from all walks of life and of all ages, many of whom are not aware of the implications.
When a phone is hacked, attackers gain access to the phone and all its information, and the owner
usually has no idea about it. Consequently, critical data is lost and can be
exploited for illegal purposes. This can be traumatic for the victims, causing psychological
distress and financial losses.
Messages are written not just in English but in various languages, and those in English
can also contain words and abbreviations from other languages. Hence, identifying
spam SMS is challenging, as stated by (Yadav et al. (2012)), since an SMS does not have
a header as an email does. So, the techniques used to identify spam emails cannot
be used as they are for messages. New solutions must be devised for new problems, and thus many
researchers have been working in this area to develop new algorithms and techniques. In
recent research by (Gupta et al. (2018)), a deep learning algorithm was
implemented alongside several traditional algorithms. Through this, the researchers were able
to achieve an accuracy of 99.1 percent, which is the best result achieved so far. According
to (Yue and Elfayoumy (2007)), implementation of deep learning models comes with
challenges. It is an arduous task to apply deep learning models to real datasets since
they become computationally expensive and time-consuming. Even when a model generates
accurate results, it takes time, which is not good enough in the case of SMS because people
tend to open an SMS within seconds of receiving it. This gap in research demands
the creation of a model that achieves high accuracy while consuming very little
time. Hence, the following research question is put forward in this research.
in turn decreases time complexity. To make the model more reliable, Stratified 10-Fold
Cross-Validation technique was used.
To solve the research question, the research objectives are implemented as depicted
in section 1.3 (Table 1).
Table 2: Research Objectives
Objectives Description Evaluation Metrics
Contributions: The major contribution resulting from this research was the implementation
of machine learning models such as XGBoost, LightGBM, Random Forest,
SVM, and Bernoulli Naïve Bayes that are capable of detecting spam SMS with good efficiency
and in less time.
The minor contribution of the research was the comparison of the results of the fully developed
models with those of existing models on evaluation metrics, using visualizations.
These contributions will help mobile phone users, as they address the problem
of spam SMS by detecting it in minimal time.
The rest of the technical report contains the following chapters: Chapter 2 presents the Literature
Review, Chapter 3 consists of Scientific Methodology and Design Specifications,
Chapter 4 covers Data Pre-processing, Implementation, Evaluation, and Results of
Spam Short Message Service Classification Models, Chapter 5 contains Discussion and
Comparison of Results and lastly, Chapter 6 consists of Conclusion and Future Work.
2 Literature Review
2.1 Introduction
SMS spam is a serious problem, and many researchers are therefore keenly motivated to
solve it using different approaches and methods. This section investigates the
problem of spam and various other text mining problems, and how they have been tackled by
implementing different machine learning algorithms and techniques, by reviewing the literature
from the year 2007 to date. The review is divided into the following subsections: (i) A
Critique of Spam Short Message Service Detection (ii) A Critical Review of Algorithms
and Techniques Used in Spam Classification and Identified Gaps (iii) A Critique on Dataset
and Features used in Short Message Service Spam Detection (iv) A Comparison of the
Reviewed Techniques.
2.3 A Critical Review of Algorithms, Techniques Used in Spam
Classification and Identified Gaps
2.3.1 A Review of Short Message Service Spam Classification
An analysis of spam SMS filtering was done on the UCI machine learning dataset in
2015 by (Kim et al. (2015)), who chose the frequency ratio feature selection technique
while implementing algorithms such as Naïve Bayes, Logistic Regression and J-48
Decision Trees, with 10-fold cross-validation applied. It was seen that
Naïve Bayes generated results in the minimum time with the highest accuracy of 94 percent.
In 2018, a similar analysis was conducted by (Gupta et al. (2018)) using
two datasets: the UCI machine learning corpus, which is the same corpus as the one on Kaggle
with a total of 5574 ham and spam messages, and another dataset containing 2000 spam
and ham messages. A TF-IDF matrix was created and then machine learning algorithms
such as Naïve Bayes, Random Forest, SVM, Decision Tree, Convolutional Neural Network (CNN)
and Artificial Neural Network were applied to both datasets. The results obtained
by the CNN were the state of the art in this area, with an accuracy of 99.10%, followed by Naïve
Bayes and SVM.
Research conducted by (Ma et al. (2016)) on spam SMS detection proposes a message
topic model, which is a form of probabilistic topic model. It uses the KNN algorithm to
address the sparsity problem in the messages. Symbol terms and background terms were
considered, and it was found that the model generated better results than standard
LDA. The GentleBoost classifier was used for the first time in the research done by (Akbari
and Sajedi (2015)) on SMS spam detection in 2015. For unbalanced data
and binary classification, boosting algorithms work well. GentleBoost is a combination
of two algorithms, namely AdaBoost and LogitBoost. GentleBoost is well known for its
higher accuracy and lower storage consumption, as it removes unwanted features. It
obtained an accuracy of 98% on a dataset consisting of 5572 text messages.
The author (Agarwal et al. (2016)) states that the short length of the messages and the
use of casual words in text messages prevent the established solutions for email spam
filtering from performing well on SMS. In this research, it can be seen that SVM
followed by Multinomial Naïve Bayes (MNB) shows outstanding results in terms of accuracy,
with 98.23% and 97.87% respectively. MNB took the least execution time of 2.03
seconds. The researcher further suggests that features such as the number of characters in
the message or a definite threshold on the length of the message can increase the performance.
Looking at the above research, it can be stated that the performance of traditional algorithms
like Naïve Bayes and SVM is superior to that of other algorithms. Also, in this research,
the length parameter can be taken as an additional feature to check if it enhances the
performance of the model.
Significant power and memory are needed if the number of features in a classification
problem is excessive. The more features there are, the higher the dimensionality and the greater the
need for power and memory. By removing features that are irrelevant or redundant
and retaining the beneficial ones, performance can be improved, as stated by the paper
(Ergin and Isik (2014)) on email spam filtering. In that work, eliminating stop words,
stemming and normalizing the dataset is followed by the creation of a Document Term
Matrix using Bag of Words. Techniques such as Gini Index, Chi-Square and Information
Gain were used for feature selection with machine learning algorithms such as Artificial Neural
Networks (ANN) and Decision Tree. The best result was generated by the combination
of Bag of Words and Chi-Square on ANN, with an accuracy of 91%.
In the research performed by (Islam et al. (2009)) on spam filtering, techniques
such as Naïve Bayes and Artificial Neural Networks (ANN) were applied. The features
from the header and the body of an email were taken into consideration. It was found that
Naïve Bayes outperformed ANN with higher accuracy, recall, and precision. A paper by
(Lee et al. (2017)) found that Weighted Naïve Bayes is not only computationally efficient
but also very effective at spam detection, even with new destructive campaigns. The
evaluation was carried out on eight datasets sourced from two sites. The accuracy
attained is around 95% for both sources. The researcher (Yue and Elfayoumy (2007))
argues that although neural networks produce good results, they take a huge amount of
time to generate them. Also, once the model is built it stops learning from new
emails, unlike Naïve Bayes, which is adaptive and trains on new emails. The researcher
suggested applying techniques like boosting in the future, which can produce results more
quickly.
The researcher (Yu and ben Xu (2008)) discusses the pitfalls of email spam, as these
messages not only waste the time and energy of end-users but also lead to issues like
high mailbox space and bandwidth usage. The researcher tried to solve this problem by implementing
four machine learning models: Support Vector Machine (SVM), Naïve Bayes, Neural
Network and Relevance Vector Machine (RVM). The performance of the models was computed
using training sets of varying sizes, and the results demonstrated that Neural Networks
are not suitable for spam filtering as they are sensitive to the size of the training data.
SVM and RVM performed better, with SVM taking less training time than RVM.
Looking at this section, it can be inferred that Naïve Bayes along with feature
selection such as the Chi-Square technique shows great results in text classification. The
results produced are better not only in terms of accuracy but also in terms of time
complexity. Also, deep learning techniques like Neural Networks were not found to
be productive in spam detection as they are sensitive to the size of the training data. They take
a large amount of time to train and test, which makes them computationally
inefficient.
URL. The Random Forest technique was implemented and found to produce an accuracy of
94.5%. The researcher (Koray et al. (2019)) detects phishing websites by analyzing
and extracting features from the URL. The Random Word Detection module is used,
which decomposes the URL into small features that are then used to classify whether the websites
are legitimate or not. Seven different machine learning algorithms, namely Naïve Bayes,
KNN, SVM, Random Forest, Decision Tree, Adaboost and K-star, were implemented on a
very large amount of data. It was seen that Random Forest produced the highest
accuracy, around 97.98 percent, among all the techniques applied.
In the year 2018, the researcher (Yuan et al. (2018)) proposed a blend of features
pertaining to the URL and the web page for the detection of phishing websites. Along with
basic features like the length of the URL, unusual characters, etc., statistical features like
mean, median and variance and lexical features like the title and the content of the web page
were also considered. Several algorithms, including KNN, Logistic Regression, Random Forest,
Deep Forest and XGBoost, were applied. It was found that Deep Forest followed by XGBoost
showed high accuracy and low training time.
Nowadays, the success of any business heavily depends on authentic reviews. However,
not all reviews are authentic. The ratio of genuine reviews to sham reviews varies
significantly from case to case, and the more bogus reviews there are, the more the image
of a business is maligned. Implementing sentiment analysis using natural language processing
techniques can effectively detect opinion spam. Earlier, researchers like (Jindal and
Liu (2007)), (Ren and Ji (2017)) used supervised and semi-supervised algorithms for the
detection of spam opinions. These models have a few restrictions such as low flexibility,
high computational time and poor accuracy. These limitations were overcome by
the researcher (Hazim et al. (2018)) using Gradient Boosting models such as XGBoost,
GBM Bernoulli, GBM Adaboost and GBM Gaussian. The opinion spam detection was
performed on multilingual datasets. It was found that XGBoost outperformed the other
models on the English language dataset, generating a high recall percentage, whereas GBM
Gaussian produced good results on the Malay language dataset. The researcher (Prieto
et al. (2016)) detects opinion spam using neural networks, and it was observed that model
complexity increases due to the large set of details provided to the neural network,
which increases the overall computational cost.
In the world of online advertising, fraudulent clicks are one of the most significant issues.
The research done by (Minastireanu and Mesnita (2019)) tackles the problem of fraudulent
clicks by using a recent machine learning technique, LightGBM, on a dataset that
contains millions of clicks. The K-Fold Cross-Validation technique is used as part of feature
engineering, which helps in improving the performance. The accuracy achieved by the
model was 98%, and it was found to be the fastest with respect to computational speed
and low on memory consumption. Looking at the above research papers, it was found
that LightGBM and XGBoost are suitable as they run faster at a lower computational
cost, unlike the deep learning techniques. Also, Random Forest performed well, giving
high accuracy, and hence these algorithms are chosen in this research.
Out of these, 1042 were spam and the rest were legitimate. The length feature was added
as spam messages tend to be longer than ham. The data was pre-processed using
NLP techniques, and then Bag of Words (BoW) and Term Frequency - Inverse Document
Frequency (TF-IDF) were chosen as feature extraction techniques. A content-based spam
message filtering method proposed by (Balli and Karasoy (2018)), using a dataset of 5574
messages of which 747 are spam, exploits the semantic relationship between the SMS words.
Feature extraction is done using the Word2Vec algorithm, which calculates the distance
between the word vectors, and thus features are extracted for each message. The
work presented by the researcher (Aich et al. (2019)) on spam detection with imbalanced
SMS datasets implemented the SMOTE approach, which generated great results in
combination with the SVM algorithm, increasing the accuracy by seven points.
The researcher (Najadat et al. (2014)) investigates spam SMS detection by implementing
12 different types of classifiers on a dataset containing 5574 messages. The
research used down-sampling, where the ham messages were reduced
to the count of spam messages. The main contribution of this research is to examine
the impact of class imbalance on performance, and the researcher found that a balanced
dataset produces more accurate results. The results were compared to previous research
done with the imbalanced dataset, and it was found that performance degrades due to
the under-fitting issue.
It can be seen that datasets with class imbalance can lead to biased
results and poor performance, whereas a balanced dataset can improve the performance.
Also, feature extraction techniques like Bag of Words and TF-IDF work well in the case
of text classification.
Table 3: Comparison of Reviewed Features and Classification Techniques
| Area of Classification | Applied Classifiers | Features Extracted | Results | Author |
|---|---|---|---|---|
| Spam SMS Classification | Naive Bayes, J-48, Logistic Regression | Message-Keyword Matrix (Bag of Words) | Naive Bayes with 94.7% accuracy | (Kim et al. (2015)) |
| Spam SMS Classification | Naive Bayes, SVM, Random Forests, Adaboost, CNN, ANN, Logistic Regression, Decision Tree | TFIDFVectorizer | CNN with 99.1% accuracy | (Gupta et al. (2018)) |
| Email Spam Classification | ANN, Naïve Bayes | Multiple features from header and body like URL, images, links, etc. | Naive Bayes with 92% and lower execution time than Neural Networks | (Islam et al. (2009)) |
| Phishing Classification | Logistic Regression, Decision Tree, GBDT, Deep Forest, Random Forest, XGBoost, KNN | TF-IDF for features extracted from URL and links | Deep Forest with 97.7% and XGBoost with 97.1% | (Yuan et al. (2018)) |
| Phishing Classification | Naïve Bayes, KNN, SVM, Random Forest, Decision Tree, Adaboost, K-star | NLP features, word vectors | Random Forest with 97.98% | (Koray et al. (2019)) |
| Fraud Advertisement Classification | LightGBM, XGBoost, Stochastic Gradient Boosting | IP, OS, channel, device, click time | LightGBM with 98% accuracy | (Minastireanu and Mesnita (2019)) |
2.6 Conclusion
Looking at the related work done in the area of text classification, it is clearly seen
that gradient boosting techniques produce quick and accurate results. As suggested, the
length feature can be a useful parameter in deciding the type of SMS. Hence the same
was implemented in this research. There is an urgent need for developing a spam SMS
detection model by combining the best-reviewed machine learning techniques with the
NLP techniques which can generate good results and can answer the research question
(section 1.2) and the research objectives (section 1.3). The next chapter discusses the
scientific methodology and design specifications chosen to develop the spam detection
model which helps mobile phone users.
3 Scientific Methodology and Design Specifications
In data mining projects, different methodologies are used. The most common ones are
CRISP-DM, KDD, and SEMMA. Knowledge Discovery in Databases (KDD) suits
the research well as the deployment step is not required. KDD is a very precise and
complete approach that focuses not only on the business process but also on implementation
(Shafique and Qaiser (2014)). The design architecture is two-tier, containing
a Presentation Layer and an Application Layer.
Figure 2: Design Process for Short Message Service Spam Detection
Google Colab was used for the implementation of the project.
Colab is a free-to-use platform provided as a service by Google. It is the preferred
platform of a large number of developers for machine learning since it has almost all the
required packages pre-installed and does not require any special installations.
The implementation, evaluation, and results are discussed in detail in the next chapter.
Figure 3: Work Flow Diagram for Short Message Service Spam Detection
Figure 4: Correlation Matrix For SMS Type and Length Feature
Figure 5: Word Cloud for Spam
Figure 6: Word Cloud for Ham
Figure 7: Message Count before and after Down Sampling
4.5 Document Term Matrix using NLP Techniques
As machine learning models work only on numerical data, a matrix was created
which contains each word and its frequency of occurrence. The following two techniques
were used for the creation of the document term matrix.
Bag of Words (BoW): Bag of Words is a way of extracting features from the set of
text messages. In this, a matrix called a bag of words is created (shown in Figure 8) which
describes the text based on the frequency of each word appearing in the document. This
was implemented using CountVectorizer from the sklearn library in Python. CountVectorizer can
extract features in the form of n-grams. In the research, the features were extracted using
Unigram and Bigram matrices. The Unigram matrix is built from single words whereas the
Bigram matrix is built from pairs of consecutive words in a document.
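To make the difference between the Unigram and Bigram matrices concrete, the following is a minimal, dependency-free sketch of a document term matrix built from n-gram counts. It is an illustration only, not the code used in the research, which relied on CountVectorizer. The helper names `ngrams` and `document_term_matrix` and the toy messages are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(messages, n):
    """Build a count matrix: one row per message, one column per n-gram."""
    tokenized = [message.lower().split() for message in messages]
    counts = [Counter(ngrams(tokens, n)) for tokens in tokenized]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


# Hypothetical toy messages, not taken from the dataset used in the research.
messages = ["free prize call now", "call me now", "free free prize"]

unigram_vocab, unigram_matrix = document_term_matrix(messages, 1)
bigram_vocab, bigram_matrix = document_term_matrix(messages, 2)

print(unigram_vocab)
print(unigram_matrix)
print(bigram_vocab)
print(bigram_matrix)
```

In scikit-learn, the corresponding choice is made through the `ngram_range` parameter of CountVectorizer, for example `(1, 1)` for unigrams only and `(2, 2)` for bigrams only.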
independent feature. It checks the deviation of the observed count from the expected
count. The Chi-Square technique performs well in the case of categorical data, and the
Bag-of-Words and TF-IDF matrices consist of categorical data only; therefore, the technique
is preferred in the research. Also, from the literature (Ergin and Isik (2014)) and (Almeida
et al. (2011)) it can be seen that Chi-Square generated excellent outputs in the
field of text analytics. The chi2 function from the sklearn library was used for selecting the
best features.
This accomplished the research objectives Obj-1 to Obj-3 mentioned in Chapter 1,
Section 1.3.
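The idea of comparing observed and expected counts can be sketched as follows. This is the textbook Chi-Square statistic on a 2x2 table of term presence against class, written without dependencies; it is an assumption-laden illustration, and library implementations such as the chi2 function in sklearn may compute the score somewhat differently. The function name `chi_square` and the toy documents are hypothetical.

```python
def chi_square(docs, labels, term):
    """Chi-Square statistic between the presence of `term` and the class label."""
    classes = sorted(set(labels))
    n = len(docs)
    # Observed counts: rows are term present / absent, columns are classes.
    observed = {(present, c): 0 for present in (True, False) for c in classes}
    for tokens, label in zip(docs, labels):
        observed[(term in tokens, label)] += 1
    statistic = 0.0
    for present in (True, False):
        row_total = sum(observed[(present, c)] for c in classes)
        for c in classes:
            col_total = sum(observed[(p, c)] for p in (True, False))
            # Count expected if term and class were independent.
            expected = row_total * col_total / n
            if expected > 0:
                statistic += (observed[(present, c)] - expected) ** 2 / expected
    return statistic


# Hypothetical toy corpus: each document is the set of words it contains.
docs = [{"free", "prize"}, {"free", "call"}, {"lunch", "today"}, {"call", "later"}]
labels = ["spam", "spam", "ham", "ham"]

vocabulary = sorted(set().union(*docs))
scores = {term: chi_square(docs, labels, term) for term in vocabulary}
# Keep the highest-scoring terms; sorted() is stable, so ties keep vocabulary order.
top_terms = sorted(vocabulary, key=lambda term: -scores[term])[:2]
print(scores)
print(top_terms)
```

A term such as "free", which appears only in spam here, deviates strongly from the expected counts and receives a high score, while "call", which appears equally in both classes, scores zero and would be dropped first.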
Accuracy: It measures how close the predicted values are to the actual values. As
per (Aich et al. (2019)), the classification accuracy metric is more transparent when the
classes are balanced. It is formulated as:
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Precision: It is the ratio of the correctly predicted positive results to the total number
of positive results that a model predicts. It can also be interpreted as how
relevant the model's positive predictions are (Gupta et al. (2018)). It is formulated as:
Precision = (TP)/(TP+FP)
Recall = (TP)/(TP+FN)
F1-Score: F1-Score is the harmonic mean of recall and precision. It takes both
false positives and false negatives into account. It is formulated as:
F1-Score = 2*((Precision*Recall)/(Precision+Recall))
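As a worked illustration of the four formulas above, the sketch below computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and are not results from this research; the sketch assumes all denominators are non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * ((precision * recall) / (precision + recall))
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is higher than recall because there are fewer false positives than false negatives, and the F1-Score falls between the two.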
4.9 Experiment 2: Implementation, Evaluation and Results of
LightGBM Model
4.9.1 Implementation
Light Gradient Boosting Machine (LightGBM) is a machine learning model developed in 2017.
It is a boosting technique similar to XGBoost, with advantages like high efficiency,
low memory usage, fast training speed, better accuracy, the capability to handle large datasets
and support for parallelization. As it is new, comparatively little research has been done on
it, but looking at the literature, it can be seen that it performed well in the field of text
analytics (Minastireanu and Mesnita (2019)), (Li et al. (2019)).
The LightGBM classifier was implemented using Stratified 10-Fold Cross-Validation to divide
the dataset into training and test sets. The function used to implement the model
was LGBMClassifier(), which is provided by the lightgbm package and follows the sklearn
estimator interface. The model was implemented on Unigram,
Bigram, and TF-IDF with and without the length feature.
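The splitting scheme can be sketched without any library as follows. This is a minimal illustration of stratification, assuming hypothetical balanced labels; the function name `stratified_k_fold` is not from the research, which would more plausibly have used the StratifiedKFold class from sklearn.model_selection.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs that keep class ratios per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's shuffled indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Hypothetical balanced labels, mimicking a down-sampled dataset.
labels = ["spam"] * 50 + ["ham"] * 50
splits = list(stratified_k_fold(labels, k=10))

for fold_number, (train, test) in enumerate(splits, start=1):
    spam_in_test = sum(labels[i] == "spam" for i in test)
    print(f"fold {fold_number}: train={len(train)} test={len(test)} spam_in_test={spam_in_test}")
```

Each message is used for testing exactly once, and every test fold keeps the same spam-to-ham ratio as the full dataset, so a classifier fitted on each training split is evaluated under comparable conditions across all ten folds.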
and it works well with categorical data (Xu (2018)). This makes the classifier a
suitable choice to implement in the research.
The Bernoulli Naive Bayes classifier was implemented using Stratified 10-Fold Cross-Validation
to divide the dataset into training and test sets. The function used to implement the
model was BernoulliNB() of the sklearn library. The model was implemented on Unigram,
Bigram, and TF-IDF with and without the length feature.
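For readers unfamiliar with the model, the following is a minimal sketch of Bernoulli Naive Bayes over binary term-presence features with Laplace smoothing. It is a simplified stand-in and not the sklearn implementation used in the research; the class name `TinyBernoulliNB`, the feature columns and the toy data are hypothetical.

```python
import math


class TinyBernoulliNB:
    """Bernoulli Naive Bayes over binary (0/1) features with Laplace smoothing."""

    def fit(self, X, y, alpha=1.0):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.feature_prob = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # P(feature = 1 | class), smoothed so it is never exactly 0 or 1.
            self.feature_prob[c] = [
                (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
                for j in range(n_features)
            ]
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            scores = {}
            for c in self.classes:
                score = self.log_prior[c]
                for value, p in zip(x, self.feature_prob[c]):
                    # Absent terms also contribute, through log(1 - p).
                    score += math.log(p) if value else math.log(1 - p)
                scores[c] = score
            predictions.append(max(scores, key=scores.get))
        return predictions


# Hypothetical columns: presence of "free", "prize", "lunch", "call".
X_train = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 1]]
y_train = ["spam", "spam", "ham", "ham"]

model = TinyBernoulliNB().fit(X_train, y_train)
predictions = model.predict([[1, 1, 0, 1], [0, 0, 1, 0]])
print(predictions)
```

Because the model only looks at whether a term is present, count or TF-IDF values have to be reduced to 0/1 first; BernoulliNB in sklearn does this by default through its `binarize` parameter. Training and prediction reduce to counting and summing logarithms, which is consistent with the short execution times reported for this model.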
4.11.2 Evaluation and Results
The SVM model evaluation metrics are shown in Figure 13. A comparison of all the matrices
shows that the Unigram matrix surpassed the others with an Accuracy of 0.963, Recall
of 0.945 and F1-Score of 0.962. The length feature did not contribute much to the model.
Precision for Bigram without the length feature was the highest at 0.993.
Figure 14: Evaluation Metrics for Random Forest Model
4.13 Conclusion
On the basis of the implementation and the generated results, the research objectives up to Obj-4
(section 1.3) along with the research question (section 1.2) have been accomplished. The
models developed will contribute remarkably to mobile phone users in the detection of spam
SMS.
Figure 15: Model-wise Accuracy Comparison
Figure 16: Model-wise Execution Time Comparison
Gupta et al. (2018) | Convolutional Neural Network | 99.10% | Researcher did not handle class imbalance and the hold-out technique was used to split train and test dataset. TF-IDF matrix was used for features.
Kim et al. (2015) | Naive Bayes | 94.70% | The class imbalance was not handled. The frequency ratio technique was used for feature selection. Researcher used the 10-fold Cross Validation technique to split test and train dataset.
Current Research | Bernoulli Naive Bayes | 96.50% | The class imbalance issue was handled, feature selection was implemented and the Stratified 10-Fold Cross Validation technique was used to split the dataset.
and the ratio was extremely uneven. This was handled by applying the down-sampling
technique to the ham class to ensure that the number of records was equal in the ham and spam
classes. This reduction in the size of the ham class could have impacted the training
of the machine learning models used for this research.
2. Due to the down-sampling technique, the effective record size used for training the
machine learning models was very limited. The performance achieved using this approach
may vary when the data size grows exponentially. In the future, this can be tackled by
aggregating different datasets and training the models again for better results.
3. The research in its current form is confined to the English language only as the dataset
used contained English words. Also, the models are trained to identify proper English
language words and not the slang and short forms. The result may vary when non-English
words, slang language or short forms are used.
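The down-sampling step referred to in the first two limitations can be sketched as below. This is a minimal illustration under stated assumptions: the function name `down_sample`, the fixed random seed and the toy data are hypothetical, and the majority class is assumed to have at least as many records as the minority class.

```python
import random


def down_sample(messages, labels, majority="ham", seed=0):
    """Randomly drop majority-class records until both classes are the same size."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    minority_idx = [i for i, label in enumerate(labels) if label != majority]
    # Keep only as many majority records as there are minority records.
    kept = rng.sample(majority_idx, len(minority_idx))
    selected = sorted(kept + minority_idx)
    return [messages[i] for i in selected], [labels[i] for i in selected]


# Hypothetical toy data: 8 ham messages and 2 spam messages.
messages = [f"ham message {i}" for i in range(8)] + ["win a prize", "free offer"]
labels = ["ham"] * 8 + ["spam"] * 2

balanced_messages, balanced_labels = down_sample(messages, labels)
print(balanced_labels.count("ham"), balanced_labels.count("spam"))
```

The sketch makes the trade-off behind the limitations visible: six of the eight ham messages are discarded, so the balanced dataset is only as large as twice the spam class.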
Future Work: In an endeavor to improve on the results, the machine learning models
should be trained using datasets from different sources and datasets with a
large number of records. This will improve the reliability of the models. Millennials
today use slang and short forms in texting, which cannot be handled by the
models at present. This can be improved upon to better separate genuine ham messages
from spurious ones. In-depth research can be conducted on this. Also, non-English
languages can be included for spam SMS detection in the future. Apart from Chi-Square,
other techniques like Information Gain, Gini Index, etc. can be used to evaluate the
impact on performance.
7 Acknowledgment
I would like to extend my heartfelt gratitude towards my mentor Dr. Catherine Mulwa.
This research would not have been possible without her guidance, help, and support. She
always went the extra mile to help me whenever I was in need. I would also like to thank
my family for bestowing their trust in me.
References
Agarwal, S., Kaur, S. and Garhwal, S. (2016). SMS spam detection for Indian messages,
Proceedings on 2015 1st International Conference on Next Generation Computing
Technologies, NGCT 2015 (September): 634–638.
Aich, P., Venugopalan, M. and Gupta, D. (2019). Content based spam detection in short
text messages with emphasis on dealing with imbalanced datasets, Proceedings - 2018
4th International Conference on Computing, Communication Control and Automation,
ICCUBEA 2018 .
Akbari, F. and Sajedi, H. (2015). SMS spam detection using selected text features and
Boosting Classifiers, 2015 7th Conference on Information and Knowledge Technology,
IKT 2015 pp. 1–5.
Almeida, T. A., Almeida, J. and Yamakami, A. (2011). Spam filtering: how the
dimensionality reduction affects the accuracy of Naive Bayes classifiers, Journal of Internet
Services and Applications 1: 183–200.
Balli, S. and Karasoy, O. (2018). Development of content based sms classification
application by using word2vec based feature extraction, IET Software.
Basu, A., Watters, C. and Shepherd, M. (2002). Support Vector Machines for Text
Categorization, Proceedings of the 36th Hawaii International Conference on System
Sciences pp. 1–7.
Ergin, S. and Isik, S. (2014). The assessment of feature selection methods on agglutinative
language for spam email detection: A special case for Turkish, INISTA 2014 - IEEE
International Symposium on Innovations in Intelligent Systems and Applications,
Proceedings pp. 122–125.
Fernandes, D., d. Costa, K. A. P., Almeida, T. A. and Papa, J. P. (2015). Sms spam
filtering through optimum-path forest-based classifiers, pp. 133–137.
Gupta, M., Bakliwal, A., Agarwal, S. and Mehndiratta, P. (2018). A Comparative Study
of Spam SMS Detection Using Machine Learning Classifiers, 2018 11th International
Conference on Contemporary Computing, IC3 2018 pp. 1–7.
Islam, M. S., Khaled, S. M., Farhan, K., Rahman, M. A. and Rahman, J. (2009). Modeling
spammer behavior: Naïve Bayes vs. artificial neural networks, 2009 International
Conference on Information and Multimedia Technology, ICIMT 2009 pp. 52–55.
Jindal, N. and Liu, B. (2007). Analyzing and detecting review spam, pp. 547–552.
Kim, S. E., Jo, J. T. and Choi, S. H. (2015). SMS Spam filterinig using keyword frequency
ratio, International Journal of Security and its Applications 9(1): 329–336.
Koray, O., Buber, E., Demir, O. and Diri, B. (2019). Machine learning based phishing
detection from URLs, Expert Systems With Applications 117: 345–357.
URL: https://doi.org/10.1016/j.eswa.2018.09.029
Lee, C. N., Chen, Y. R. and Tzeng, W. G. (2017). An online subject-based spam filter
using natural language features, 2017 IEEE Conference on Dependable and Secure
Computing pp. 479–484.
Li, Y., Yang, Z., Chen, X., Yuan, H. and Liu, W. (2019). A stacking model using URL and
HTML features for phishing webpage detection, Future Generation Computer Systems
94: 27–39.
URL: https://doi.org/10.1016/j.future.2018.11.004
Liew, S. W., Sani, N. F. M., Abdullah, M. T., Yaakob, R. and Sharum, M. Y. (2019). An
effective security alert mechanism for real-time phishing tweet detection on Twitter,
Computers and Security 83: 201–207.
URL: https://doi.org/10.1016/j.cose.2019.02.004
Longadge, R. and Dongre, S. (2013). Class imbalance problem in data mining review,
Int. J. Comput. Sci. Netw. 2.
Ma, J., Zhang, Y., Liu, J., Yu, K. and Wang, X. (2016). Intelligent SMS spam filtering
using topic model, 2016 International Conference on Intelligent Networking and
Collaborative Systems (INCoS) pp. 380–383.
Minastireanu, E.-A. and Mesnita, G. (2019). Light GBM Machine Learning Algorithm to
Online Click Fraud Detection, Journal of Information Assurance & Cybersecurity
3(April): 1–15.
Najadat, H., Abdulla, N., Abooraig, R. and Nawasrah, S. (2014). Mobile SMS Spam
Filtering based on Mixing Classifiers, International Journal of Advanced Computing
Research 1.
Pham, T.-h. (2016). Content-based Approach for Vietnamese Spam SMS Filtering, 2016
International Conference on Asian Language Processing (IALP) pp. 41–44.
Prieto, A., Prieto, B., Ortigosa, E., Ros, E., Pelayo, F., Ortega, J. and Rojas, I. (2016).
Neural networks: An overview of early research, current frameworks and new chal-
lenges, Neurocomputing 214.
Ren, Y. and Ji, D. (2017). Neural networks for deceptive opinion spam detection: An
empirical study, Information Sciences 385-386: 213–224.
Sethi, P., Bhandari, V. and Kohli, B. (2018). SMS spam detection and comparison of
various machine learning algorithms, 2017 International Conference on Computing and
Communication Technologies for Smart Nation, IC3TSN 2017 2017-October: 28–31.
Shafique, U. and Qaiser, H. (2014). A comparative study of data mining process models
(KDD, CRISP-DM and SEMMA), International Journal of Innovation and Scientific
Research 12: 2351–8014.
Shirani-Mehr, H. (2012). SMS Spam Detection using Machine Learning Approach, tech.
rep., Stanford University pp. 1–4.
Tianqi, C. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System,
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining pp. 785–794.
Vani, K. and Gupta, D. (2014). Using K-means cluster based techniques in external
plagiarism detection, Proceedings of 2014 International Conference on Contemporary
Computing and Informatics, IC3I 2014 pp. 1268–1273.
Xu, S. (2018). Bayesian naïve Bayes classifiers to text classification, Journal of
Information Science 44(1): 48–59.
URL: https://doi.org/10.1177/0165551516677946
Yadav, K., Saha, S. K., Kumaraguru, P. and Kumra, R. (2012). Take control of your
SMSes: Designing an usable spam SMS filtering system, Proceedings - 2012 IEEE 13th
International Conference on Mobile Data Management, MDM 2012 pp. 352–355.
Yu, B. and Xu, Z.-b. (2008). A comparative study for content-based dynamic
spam classification using four machine learning algorithms, Knowledge-Based Systems
21(4): 355–362.
Yuan, H., Chen, X., Li, Y., Yang, Z. and Liu, W. (2018). Detecting Phishing Websites
and Targets Based on URLs and Webpage Links, Proceedings - International
Conference on Pattern Recognition 2018-August: 3669–3674.
Yue, Y. and Elfayoumy, S. (2007). Anti-spam filtering using neural networks and Baysian
classifiers, Proceedings of the 2007 IEEE International Symposium on Computational
Intelligence in Robotics and Automation, CIRA 2007 pp. 272–278.