SMS Spam Detection and Classification Using NLP Thesis
2023
OVERVIEW
• Introduction
• Problem Statement
• Literature Review
• Aim and Objectives
• Methodology
• Dataset
• SMS Spam Detection Phases
• Data Pre-processing with BERT Model
• Feature Extraction and Selection
• SMS Message Spam Classification
• Performance Evaluation
• Expected Result
• References
INTRODUCTION
• Many people now consider their mobile phones to be a kind of devoted companion, and practically everyone has a
mobile phone, whether a smartphone or not, that can send and receive text messages.
• According to Shirani-Mehr (2013), SMS is a text-based medium that enables mobile phone users to
share short text messages (usually limited to 160 7-bit characters) and has become one of the most widely used
methods for individuals to communicate electronically.
• In Nigeria, more people communicate by SMS than by email because SMS does not require an internet
connection and is quick and easy. SMS has become a multi-million service in the telecommunications industry
due to the explosive growth of mobile devices and the millions of people who send messages daily (National
Bureau of Statistics, 2019).
• The downside of the rise in mobile users and the low cost of SMS is that mobile phones are
receiving more unsolicited bulk messages, particularly adverts, which has led to the SMS spam problem.
INTRODUCTION cont’d
• SMS spam endangers mobile users' privacy through phishing and fraud on a daily basis.
• Keyword filters have been the most common strategy for distinguishing spam from non-spam (ham) messages,
alongside methods based on Statistical Learning Theory, Artificial Neural Networks (ANNs), and Support Vector
Machines (SVMs) (Suleiman & Al-Naymat, 2017).
• In numerous experiments with different classification algorithms, some perform well on specific training datasets
but poorly on others, for no obvious reason (Rathi & Pareek, 2013).
• Numerous spam filtering techniques are in use, but no single strategy can be guaranteed
to be 100% effective at eradicating spam. Applying text mining techniques to SMS is expected to
improve the effectiveness of detecting and classifying spam messages to combat telephone abuse (Rathi & Pareek,
2013).
• The objective of this research is to propose an alternative method to address the problem of SMS spam message
identification and classification utilizing Naive Bayes, C4.5(J48), and Frequent Pattern (FP)-Growth Algorithm.
PROBLEM STATEMENT
• An unsolicited message sent to a mobile phone user is usually regarded as spam. The
problem arises when a mobile user does not want to receive a particular text, or texts
from a particular type of sender ID (Joe & Shim, 2010).
• SMS text is less formal than standard document text because of its character limit (at most
160 7-bit characters), which makes it difficult to classify as spam.
• Nigeria’s network providers offer an SMS service called Do Not Disturb (DND) (Nigerian
Communications Commission, NCC). The DND solution is effective, but because it also
prevents ham messages from reaching the target device, it is not a complete answer to
the spam problem.
• Numerous spam filtering models are in use, but these models suffer from
overgeneralization and overfitting.
• "Is the present model good enough to distinguish between SMS spam and non-spam?"
LITERATURE REVIEW
| TOPIC & AUTHOR | SOURCE | CONTRIBUTION TO KNOWLEDGE | SHORTCOMING |
|---|---|---|---|
| Choudhary and Jain, 2009. A novel approach to detect spam and smishing SMS using machine learning techniques. | International Journal of E-Services and Mobile Applications | Explored and analyzed patterns for SMS spam classification. | Limited to only abbreviation patterns in SMS. |
| Nurulhuda Firdaus Mohd Azmi, 2012. Filtering spam message using Term frequency-inverse document frequency (TF-IDF) and Random Forest Algorithm. | International Journal of Computer Science and Information Security | Method for filtering spam messages using TF-IDF and the Random Forest Algorithm. | Performance of the algorithm varies with the features used in the data set. |
| Sethi, G., & Bhootna, V., 2014. Automated SMS classification and spam analysis using topic modeling. | International Journal of Computer Science and Information Technologies (IJCSIT) | Utilized a Bayesian filter in developing an Android application that can detect spam SMS. | The method's accuracy is assessed on only two factors: sensitivity and specificity. |
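The sensitivity and specificity mentioned in the last row can be computed from a confusion matrix. The block below is a minimal sketch, assuming "spam" is the positive class. The function name `sensitivity_specificity` and the toy labels are illustrative and are not taken from any of the reviewed papers.

```python
def sensitivity_specificity(y_true, y_pred, positive="spam"):
    """Return (sensitivity, specificity) for a binary spam/ham labelling."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # spam correctly flagged
            else:
                fn += 1  # spam that slipped through
        else:
            if pred == positive:
                fp += 1  # ham wrongly flagged as spam
            else:
                tn += 1  # ham correctly passed
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy labels, made up for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this toy example two of three spam messages are caught and two of three ham messages are passed, so both measures equal 2/3.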
AIM & OBJECTIVES
AIM
This research aims to use data mining techniques to detect and classify SMS spam in order to combat abuse of the
telephone network.
OBJECTIVES
• To implement an alternative approach to the problem of SMS spam message detection and classification using
Naïve Bayes, C4.5(J48), and Frequent Pattern (FP)-Growth Algorithm.
• To design a model that balances the overfitting and overgeneralization challenges in detecting and classifying
SMS spam.
• To identify, using a variety of datasets, the data mining methods that achieve the highest classification and
prediction accuracy for SMS spam.
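To make the first objective concrete, the block below is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python. It is an illustration under stated assumptions, not the thesis implementation, which is to be carried out in WEKA. The function names, the whitespace tokenizer, and the four training messages are all made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def train_naive_bayes(messages, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for the message."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, for illustration only.
train_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we still meeting for lunch",
    "call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_messages, train_labels)
print(predict(model, "claim your free prize"))
print(predict(model, "see you at lunch"))
```

With this toy training set the first message is scored as spam and the second as ham. A real evaluation would need a held-out test set.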
METHODOLOGY
• The methodology outlines the overall structure of the workflow of this research.
• In this study, data mining techniques and machine learning algorithms are utilized for the analysis, detection,
and classification of the dataset.
METHODOLOGY cont’d
• Dataset
o Data is gathered from numerous sources to create a sizeable dataset of spam and ham text messages,
which will be used as the model's input (SMS messages).
o The spam dataset was obtained from the Knowledge Discovery and Data Mining (KDD) machine learning
repository.
o The dataset contains 50,795 raw English text messages (711 continuous input attributes and 2 nominal
class label target attributes), each labelled as either non-spam (ham) or spam.
METHODOLOGY cont’d
• SMS Spam Detection Phase
o This phase involves preprocessing, pattern extraction and selection, and classification (Han, Kamber,
& Pei, 2013).
o The activities will be carried out using WEKA data mining software.
• The expected outcome is a model that balances overfitting and overgeneralization when detecting
and classifying SMS spam.
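The preprocessing and pattern extraction steps of the detection phase can be sketched as follows. This is a hedged illustration only: it counts the support of single tokens and token pairs by brute force, which is the kind of candidate enumeration that FP-Growth is designed to avoid, so it is not an FP-Growth implementation. The stopword list, function names, and sample messages are assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "the", "to", "your", "you", "for", "is", "now"}


def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, return sorted unique tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({t for t in tokens if t not in STOPWORDS})


def frequent_patterns(transactions, min_support, max_size=2):
    """Count token sets up to max_size and keep those meeting min_support."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            # Items are sorted, so each token set maps to one canonical tuple.
            counts.update(combinations(items, size))
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


# Hypothetical spam messages, for illustration only.
spam_messages = [
    "WIN a FREE prize now!",
    "Free entry: claim your prize",
    "Claim your free ringtone",
]
transactions = [preprocess(m) for m in spam_messages]
patterns = frequent_patterns(transactions, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(pattern, support)
```

Patterns that meet the support threshold, such as ("free", "prize") here, could then serve as candidate features for the classification step.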
REFERENCES
• Chuprat, S., Sarkan, H. M., Yahya, Y., & Sam, S. M. (2019). SMS Spam Message Detection using Term Frequency-Inverse
Document Frequency and Random Forest Algorithm.
• Gupta, M., Bakliwal, A., Agarwal, S., & Mehndiratta, P. (2018). A Comparative Study of Spam SMS Detection Using Machine
Learning Classifiers.
• Han, J., Pei, J., & Yin, Y. (2000). Mining Frequent Patterns Without Candidate Generation.
• Han, J., Kamber, M., & Pei, J. (2013). Data Mining: Concepts and Techniques (3rd ed.).
• Joe, I., & Shim, H. (2010). An SMS Spam Filtering System Using Support Vector Machine.
• Kumar, R. K., Poonkuzhali, G., & Sudhakar, P. (2012). Comparative Study on Email Spam Classifier using Data
Mining Techniques.
• Nagwani, N. K. (2017). A Bi-Level Text Classification Approach for SMS Spam Filtering and Identifying Priority Messages.
• Rathi, M., & Pareek, V. (2013). Spam Mail Detection Through Data Mining: A Comparative Performance Analysis.
• Shirani-Mehr, H. (2013). SMS Spam Detection Using Machine Learning Approach.
REFERENCES cont’d
• Suleiman, D., & Al-Naymat, G. (2017). SMS Spam Detection Using H2O Framework. Procedia Computer Science.
• Qian, Wang, Han Xue, and Wang Xiaoyu. (2009). Studying Of Classifying Junk Messages Based On The Data Mining.
Websites Visited
• https://ics.uci.edu/ml/solutions/spam-messages
• http://www.esp.uem.es/jmgomez/SMSspamcorpus
• https://www.softwaretestinghelp.com/fp-growth-algorithm-data-mining/
• http://abcnews.go.com/blogs/technology/2012/08/69-of-mobile-phone-users-get-text-spam/
• http://archive kdd.org/datasets/download/SMS+Spam+Collection
• https://www.researchgate.net/publication/269651895_Spam_Mail_Detection
• https://medium.com/@easpex/pitfalls-of-using-fp-growth-algorithm-in-weka