Spam Detection in Text Using Machine Learning
MACHINE LEARNING
A Project work submitted to
Department of Computer Science and Engineering
University College of Sciences
Acharya Nagarjuna University
by
CHIKKUDU SRINIVASULU
Regd. No. Y23MC20009
April 2024
DECLARATION
I hereby declare that the entire thesis work entitled " SPAM DETECTION IN TEXT
University, in partial fulfillment of the requirement for the award of the degree of Master
of Computer Applications (MCA) is a bonafide work of my own, carried out under the
I further declare that the Project, either in part or full, has not been submitted earlier by me
CHIKKUDU SRINIVASULU
Reg. No. Y23MC20009
ACHARYA NAGARJUNA UNIVERSITY
NAGARJUNA NAGAR, GUNTUR.
Department of Computer Science & Engineering.
CERTIFICATE
USING MACHINE LEARNING” is a Bonafide record of the project work done and
my guidance.
External Examiner
ACKNOWLEDGEMENTS
Undertaking this Project has been a truly life-changing experience for me and it would not have
been possible to do without the support and guidance that I received from many people.
I would like to first say a very big thank you to my supervisor Dr. R. Vasantha for all the support
and encouragement she gave me. Her friendly guidance and expert advice have been invaluable
throughout all stages of the work. Without her guidance and constant feedback this Project work not
I would also wish to express my gratitude to Prof. K. Gangadhara Rao for extended discussions
and valuable suggestions which have contributed greatly to the improvement of the thesis.
I am thankful for, and fortunate to have received, constant encouragement, support and guidance from
all the teaching staff of the Department, which helped me complete the project work successfully.
I would also like to extend my sincere regards to all the non-teaching staff of the department for
their timely support.
I must also thank my parents and friends for the immense support and help during this project.
Without their help, completing this project would have been very difficult.
ABSTRACT
SMS spam detection using Naive Bayes algorithm is a widely used technique in the field of
text classification. The main aim of this approach is to classify the incoming messages into
spam or ham categories. The Naive Bayes algorithm works by calculating the probability of
a message belonging to a particular class, based on the occurrence of different words in the
message. In this paper, we present an efficient and accurate approach for SMS spam
detection using the Naive Bayes algorithm. The proposed approach utilizes a pre-processing
step for feature extraction, which includes tokenization, stop-word removal, and stemming.
The Naive Bayes algorithm is then trained on a dataset of labeled messages to learn the
probability distributions of different words in spam and ham messages. Finally, the trained
model is used to classify incoming messages into spam or ham categories. The results of our
experiments show that the proposed approach achieves high accuracy in detecting SMS
spam messages.
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
Chapter 1: Introduction
1.1 Statement
1.2 Objective
Chapter 3: REQUIREMENT SPECIFICATIONS
3.1 Functional requirements
3.4 Algorithms
3.4.3 LSTM
3.4.6 K-means
3.4.9 Python
Chapter 4: Methodology
Chapter 6: Results and Discussions
References
LIST OF FIGURES
4.3.4.1 Accuracy
4.3.4.2 Precision
4.3.4.3 Recall
5.1 Architecture
Spam Detection In Text Using Machine Learning
CHAPTER 1
INTRODUCTION
1 INTRODUCTION
SMS has become a popular medium for communication with the widespread use of mobile phones.
However, this convenience has also led to the increase in the number of SMS spam messages,
which can be annoying and potentially harmful. SMS spam messages can be used for phishing
attacks, identity theft, and other malicious activities. Therefore, it is crucial to develop efficient
techniques for detecting and filtering out these spam messages. In recent years, machine learning
algorithms have been extensively used in the field of text classification for spam detection. Among
these algorithms, the Naive Bayes algorithm has gained popularity due to its simplicity and
effectiveness. The Naive Bayes algorithm is a probabilistic algorithm that calculates the probability
of a message belonging to a particular class, based on the occurrence of different words in the
message.
SMS spamming [2][10] is extremely frustrating for users: many important and useful messages can
get lost among spam messages. Spam messages are also used to deceive people or lure them into
purchasing services. As the overall use of mobile phones has grown, a new avenue for junk mail has
opened up for unscrupulous advertisers. These advertisers use text messages (SMS) to target
potential customers with unwanted advertising known as SMS spam. This kind of spam is especially
troublesome because, unlike email spam, many mobile phone users pay a fee for each SMS received.
Developing a classification algorithm [1][11] that filters SMS spam would provide a useful tool for
mobile phone providers. Since naïve Bayes has been used successfully for email spam detection [9],
it seems likely that it could also be used to build an SMS spam classifier [7]. Compared with email
spam [6][8], SMS spam poses additional challenges for automated filters. SMS messages are often
limited to 160 characters, which reduces the amount of text available for deciding whether a message
is ham or spam. People also routinely use shorthand notations and slang, which makes it even more
difficult to distinguish between ham and spam. We will test how well a simple naïve Bayes
We additionally build models to classify messages using the SVM algorithm and the maximum
entropy algorithm [3], and find that SVM gives the most accurate results, with
algorithm. Spam messages can be described as unsolicited messages sent to a large number of people
at once. The rise of spam messages is driven by the following factors: 1) the availability of cheap
bulk SMS plans; 2) reliability (since the message reaches the mobile phone user); 3) a low
possibility of receiving responses from some unaware recipients; and 4) the message can be
As the Internet continues to grow in both size and importance, the quantity and impact of online
reviews continually increases. Reviews can influence people across a broad spectrum of industries,
but are particularly important in the realm of ecommerce, where comments and reviews regarding
products and services are often the most convenient, if not the only, way for a buyer to make a
decision on whether or not to buy them. Online reviews may be generated for a variety of reasons.
Often, in an effort to improve and enhance their businesses, online retailers and service
providers may ask their customers to provide feedback about their experience with the products or
services they have bought, and whether they were satisfied or not. Customers may also feel inclined
to review a product or service if they had an exceptionally good or bad experience with it. While
online reviews can be helpful, blind trust of these reviews is dangerous for both the seller and buyer.
Many look at online reviews before placing any online order; however, the reviews may be
poisoned or faked for profit or gain, thus any decision based on online reviews must be made
cautiously. Furthermore, business owners might give incentives to whoever writes good reviews
about their merchandise, or might pay someone to write bad reviews about their competitor’s
products or services. These fake reviews are considered review spam and can have a great impact in
the online marketplace due to the importance of reviews. Review spam can also negatively impact
businesses due to loss in consumer trust. The issue is severe enough to have attracted the attention
of mainstream media and governments. For example, the BBC and New York Times have reported
that “fake reviews are becoming a common problem on the Web, and a photography company was
recently subjected to hundreds of defamatory consumer reviews” [1]. In 2014, the Canadian
Government issued a warning “encouraging consumers to be wary of fake online endorsements that
give the impression that they have been made by ordinary consumers” and estimated that a third of
all online reviews were fake. As review spam is a pervasive and damaging problem, developing
methods to help businesses and consumers distinguish truthful reviews from fake ones is an
In this project work, we propose an efficient and accurate approach for SMS spam detection using
the Naive Bayes algorithm. Our approach includes a pre-processing step for feature extraction,
which involves tokenization, stop-word removal, and stemming. The Naive Bayes algorithm is then
trained on a labeled dataset of messages to learn the probability distributions of different words in
spam and ham messages. Finally, the trained model is used to classify incoming messages into spam
or ham categories.
1.1 Statement
SMS spam is a real and growing problem, largely due to the availability of very cheap bulk pre-pay
SMS packages and the fact that SMS stimulates higher response rates because it is a trusted and
personal service. The Short Messaging Service (SMS) mobile communication system is attractive to
criminal gangs for a number of reasons: it is an easy-to-use, fast, reliable and affordable
technology (Delany, Buckley, & Greene, 2012). The lack of a unifying model is perceived
as a hindrance to the further development of the field of machine learning, especially in SMS spam
detection. Many proposed approaches, regardless of their effectiveness, focus on a specific aspect or
language, and most of them are neither integrated nor exhaustive.
1.2 Objective
The main objective of this research is to evaluate a machine learning SMS spam detection model.
To develop a spam detection model that can be used to detect spam messages in Kenya.
To demonstrate the use of machine learning in classifying messages as either spam or not.
CHAPTER 2
LITERATURE SURVEY
2 LITERATURE SURVEY
2.1 SMS Spam Detection Based on Long Short-Term Memory and Gated
Recurrent Unit
SMS spam is a message that hackers create and send to people's mobile devices in order to obtain
their important information. If unaware recipients follow the instructions in the message and enter
important information, such as internet banking credentials, in a fake website or application, the
hacker may obtain that information, which can lead to financial loss. Efficient spam detection is
an important tool for helping people determine whether an SMS is spam or not. In this research, we
propose a novel SMS spam detection approach, based on a case study of SMS spam in the English
language, using Natural Language Processing and Deep Learning techniques. To prepare the data for
model development, we use word tokenization, padding, truncation, and word embedding to add
dimensions to the data. This data is then used to develop models based on the Long Short-Term
Memory and Gated Recurrent Unit algorithms. The performance of the proposed models is compared
with that of models based on machine learning algorithms including Support Vector Machine and
Naïve Bayes. The experimental results show that the model built with the Long Short-Term Memory
technique provides the best overall accuracy, as high as 98.18%. In screening spam messages, this
model detects spam with a 90.96% accuracy rate, while the error percentage that it misclassifies
0.74%.
The daily traffic of Short Message Service (SMS) messages keeps increasing. As a result, there has
been a dramatic increase in mobile attacks, such as spammers who plague the service with spam
messages sent to groups of recipients. Mobile spam is a growing problem, as the volume of spam keeps
increasing day by day even with filtering systems in place. Spam is defined as unsolicited bulk
messages in various forms, such as unwanted advertisements, credit opportunities, or fake
lottery-winner notifications. Spam classification has become more challenging due to the complexity
of the messages crafted by spammers. Hence, various methods have been developed to filter spam. In
this study, term frequency-inverse document frequency (TF-IDF) and the Random Forest algorithm are
applied to an SMS spam message data collection. Based on the experiment,
Spam detection is a major issue in mobile message communication, and spam makes such
communication insecure. To tackle this problem, an accurate and precise method is needed to detect
spam in mobile message communication. We propose applying a machine learning-based spam detection
method for accurate detection. In this technique, machine learning classifiers such as logistic
regression (LR), K-nearest neighbor (K-NN), and decision tree (DT) are used to classify ham and
spam messages in mobile device communication. The SMS spam collection data set is used to test the
method. The dataset is split into two subsets for training and testing. The experimental results
demonstrate that the classification performance of LR is high compared with K-NN and DT, and LR
achieved a
high accuracy of 99%. Additionally, the proposed method performance is good as compared with
2.4 SMS Spam Detection using Machine Learning and Deep Learning Techniques
The number of people using mobile devices is increasing day by day. SMS (Short Message Service) is
a text message service available on smartphones as well as basic phones. As a result, SMS traffic
has increased drastically, and so has the number of spam messages. Spammers send spam messages for
financial or business benefit, such as market growth, lottery ticket information, and credit card
information. Spam classification therefore deserves special attention. In this paper, we applied
various machine learning and deep learning techniques to SMS spam detection. We used a dataset from
UCI to build a spam detection model. Our experimental results show that our LSTM model
outperforms previous models in spam detection, with an accuracy of 98.5%. We used Python for all
implementations.
Over recent years, as the popularity of mobile phone devices has increased, Short Message Service
(SMS) has grown into a multi-billion-dollar industry. At the same time, a reduction in the cost of
messaging services has resulted in growth in unsolicited commercial advertisements (spam) being
sent to mobile phones. In parts of Asia, up to 30% of text messages were spam in 2012. The lack of
real databases for SMS spam, the short length of messages and their limited features, and their
informal language are factors that may cause established email filtering algorithms to underperform
in their classification. In this project, a database of real SMS spam from the UCI Machine Learning
repository is used, and after preprocessing and feature extraction, different machine learning
techniques are applied to the database. Finally, the results are compared and the best algorithm for
spam filtering for text messaging is introduced. Final simulation results using 10-fold cross
validation show that the best classifier in this work reduces the overall error rate of the best model in
CHAPTER 3
SYSTEM REQUIREMENTS
3 SYSTEM REQUIREMENTS
Requirements are the basic constraints that are required to develop a system. Requirements are
1. Functional requirements
2. Non-Functional requirements
3. Technical requirements
Hardware requirements
Software requirements
The Functional Requirements section of our SMS spam detection project outlines the fundamental
functionalities crucial for the successful implementation of our system. To achieve our goal of
effectively detecting and filtering spam messages, we rely on a combination of specialized libraries
I. Problem definition
Software Requirements:
Operating System: Windows
Tool: Anaconda with Jupyter Notebook
Hardware requirements:
Processor: Pentium IV/III
Hard disk: minimum 80 GB
RAM: minimum 2 GB
Implements methods to gather a diverse dataset of SMS messages, including both spam and
legitimate (ham) messages.
Utilizes techniques such as web scraping, API integration, or dataset acquisition from reliable
sources.
Ensures the collected dataset is representative of real-world SMS messages and includes a balanced
distribution of spam and ham.
Preprocessing Module:
Cleans and preprocesses the collected SMS messages by removing noise, special characters, and
irrelevant information.
Performs tasks like tokenization, stop-word removal, and stemming to prepare the text data for
analysis.
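The preprocessing steps named above (tokenization, stop-word removal, stemming) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the stop-word list, the suffix list, and the function names are assumptions made for the example, and the crude suffix-stripping stemmer merely stands in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}

# Suffixes stripped by the crude stemmer below, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(message):
    """Lowercase the message and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", message.lower())


def crude_stem(token):
    """Strip at most one common suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(message):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(message) if t not in STOP_WORDS]


print(preprocess("You are WINNING free prizes!!! Call now to claim."))
```

The stems produced this way are not always dictionary words; that is acceptable here because stemming only needs to map related word forms to the same token.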
Extracts relevant features from the preprocessed SMS messages, such as word frequencies, n-grams,
and syntactic features.
Considers additional features such as message length, presence of specific keywords, or linguistic
patterns indicative of spam.
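As a rough illustration of the features listed above, the sketch below computes word frequencies, bigram counts, message length, and a keyword flag for a single message. The function names and the `spam_keywords` list are hypothetical choices for this example, and whitespace splitting is used in place of the full preprocessing step.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(message, spam_keywords=("free", "win", "prize")):
    """Build a small feature dictionary for one SMS message.

    spam_keywords is a hypothetical keyword list chosen for illustration.
    """
    tokens = message.lower().split()
    return {
        "word_counts": Counter(tokens),               # word frequencies
        "bigram_counts": Counter(ngrams(tokens, 2)),  # 2-gram frequencies
        "length": len(message),                       # message length in characters
        "has_keyword": any(t in spam_keywords for t in tokens),
    }


features = extract_features("win a free prize win now")
print(features["word_counts"]["win"], features["length"], features["has_keyword"])
```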
Trains machine learning models, such as Naive Bayes, Support Vector Machines (SVM), or neural
networks, using the extracted features.
Evaluates the trained models using appropriate metrics like accuracy, precision, recall, and
F1-score.
Implements techniques for model selection, hyperparameter tuning, and cross-validation to optimize
performance.
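The evaluation metrics mentioned above can be computed directly from the true and predicted labels. The following is a minimal sketch, assuming a binary task in which "spam" is the positive class; the function name and the example labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for a binary labelling task."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]
print(classification_metrics(y_true, y_pred))
```

Precision and recall matter more than raw accuracy for spam filtering, because spam is usually the minority class and a filter that marks everything as ham can still score a high accuracy.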
Enables continuous learning and adaptation of the model to evolving patterns of spam messages.
Ensures seamless integration of the trained model into the deployment environment, such as a web
application or API.
Provides robust error handling and logging mechanisms to monitor system performance and
troubleshoot issues.
5. Non-Functional Requirements
Non-functional requirements define the quality attributes and constraints of the software system.
Accessibility:
Ensures that the system is accessible to users with disabilities by adhering to accessibility
Availability:
Guarantees high availability of the system, minimizing downtime and ensuring uninterrupted
access to users.
Security:
Implements robust security measures to protect sensitive data and prevent unauthorized access,
Establishes regular data backups and implements disaster recovery procedures to mitigate the
Performance:
Measures the system's performance in terms of speed, responsiveness, and scalability. Ensures
Interoperability:
Ensures seamless integration with existing systems and interoperability with external
6. Performance Requirements
Performance requirements specify the expected behavior and performance metrics of the software
system.
Ensures that the system responds promptly to user interactions, with minimal latency.
6.2 Throughput:
Maintains high throughput to handle concurrent requests and process a large volume of message
data efficiently.
6.3 Scalability:
Scales horizontally and vertically to accommodate increasing data volumes and user traffic.
Optimizes resource consumption, including CPU, memory, and storage, to ensure efficient
7. Feasibility Study
The feasibility study serves as a comprehensive evaluation of the project's feasibility, encompassing
technical, operational, and economic considerations. It examines the project's technical feasibility
by assessing the availability of necessary resources, technology readiness, and compatibility with
existing systems. Furthermore, it delves into operational feasibility, evaluating the project's
readiness for adoption. Lastly, the economic feasibility analysis explores the project's financial
viability, including cost estimation, return on investment projections, and potential revenue streams
Determines whether the proposed technology and infrastructure can support the project
requirements
Evaluates the availability of suitable algorithms, frameworks, and computing resources for text
classification tasks.
Ensures that the system aligns with user expectations and can seamlessly integrate into existing
workflows.
Analyzes the cost-effectiveness of the project, considering factors such as development costs,
sustainability and financial viability of deploying the system in real-world settings. By addressing
these functional, non-functional, performance, and feasibility aspects, the project aims to develop a
robust and effective solution for classifying SMS messages as spam or ham.
3.4 Algorithms
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular
The Naïve Bayes classifier is a popular supervised machine learning algorithm used for
classification tasks such as text classification. It belongs to the family of generative learning
algorithms, which means that it models the distribution of inputs for a given class or category. This
approach is based on the assumption that the features of the input data are conditionally
independent given the class, allowing the algorithm to make predictions quickly and accurately.
In statistics, naive Bayes classifiers are considered as simple probabilistic classifiers that apply
Bayes’ theorem. This theorem is based on the probability of a hypothesis, given the data and some
prior knowledge. The naive Bayes classifier assumes that all features in the input data are
independent of each other, which is often not true in real-world scenarios. However, despite this
simplifying assumption, the naive Bayes classifier is widely used because of its efficiency and good
Moreover, it is worth noting that naive Bayes classifiers are among the simplest Bayesian network
models, yet they can achieve high accuracy levels when coupled with kernel density estimation.
This technique involves using a kernel function to estimate the probability density function of the
input data, allowing the classifier to improve its performance in complex scenarios where the data
distribution is not well-defined. As a result, the naive Bayes classifier is a powerful tool in machine
learning, particularly in text classification, spam filtering, and sentiment analysis, among others.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features, all
of these properties independently contribute to the probability that this fruit is an apple and that is
An NB model is easy to build and particularly useful for very large data sets. Along with simplicity,
Bayes theorem provides a way of computing posterior probability P(c|x) from P(c), P(x) and P(x|c).
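For reference, the standard statement of Bayes' theorem in this notation is given first, followed by the factorisation that the naive independence assumption adds. The predictor indices x_1, ..., x_n are notation introduced here for illustration.

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)},
\qquad
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
```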
Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(x|c) is the likelihood which is the probability of the predictor given class.
• It is straightforward to implement.
• It is relatively insensitive to irrelevant features, and it often performs well even when its
independence assumption does not strictly hold.
• The Naive Bayes Algorithm has trouble with the ‘zero-frequency problem’. It occurs when a value of
a categorical variable appears in the test data but was never seen with a given class in the training
dataset, so the model assigns it zero probability. A smoothing technique, such as Laplace smoothing,
overcomes this problem.
• It assumes that all the attributes are independent, which rarely happens in real life.
• Its probability estimates can be inaccurate, so its probability outputs should not be taken too
literally.
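To make the zero-frequency problem and its smoothing remedy concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-alpha (Laplace) smoothing, written with the standard library only. It is not the project's implementation: the class name, the toy training messages, and the choice to skip words outside the training vocabulary are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes over word counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # number of messages per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()

    def fit(self, messages, labels):
        for text, label in zip(messages, labels):
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_posterior(self, text, label):
        """Unnormalised log P(label | text): log prior plus summed log likelihoods."""
        total_messages = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_messages)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + self.alpha * len(self.vocabulary)
        for token in text.lower().split():
            if token not in self.vocabulary:
                continue  # words never seen in training carry no evidence here
            # Smoothing keeps this strictly positive even if the count is zero.
            score += math.log((self.word_counts[label][token] + self.alpha) / denominator)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda label: self.log_posterior(text, label))


# Toy training data, invented for this example.
messages = ["win a free prize now", "free entry win cash",
            "are we meeting for lunch", "see you at home tonight"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayesSpamFilter().fit(messages, labels)
print(model.predict("free cash prize"), model.predict("meeting at home"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and with `alpha=0` the word "free" would give the ham class a probability of exactly zero, which is the zero-frequency problem described above.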
The Naive Bayes Algorithm is used for various real-world problems like those below: Text
classification: The Naive Bayes Algorithm is used as a probabilistic learning technique for text
classification. It is one of the best-known algorithms used for document classification of one or
many classes.
Sentiment analysis: The Naive Bayes Algorithm is used to analyze sentiments or feelings, whether
is used for building hybrid recommendation systems that help predict whether a user will
Spam filtering: It is also similar to the text classification process. It is popular for helping you
Medical diagnosis: This algorithm is used in medical diagnosis and helps you to predict the patient’s
Weather prediction: You can use this algorithm to predict whether the weather will be good.
Random Forest is a famous machine learning algorithm that uses supervised learning methods. You
can apply it to both classification and regression problems. It is based on ensemble learning, which
integrates multiple classifiers to solve a complex issue and increases the model's performance.
In layman's terms, Random Forest is a classifier that contains several decision trees on various
subsets of a given dataset and takes the average to enhance the predicted accuracy of that dataset.
Instead of relying on a single decision tree, the random forest collects the result from each tree and
The Working of the Random Forest Algorithm is quite intuitive. It is implemented in two phases:
The first is to combine N decision trees with building the random forest, and the second is to make
Step 2: Create decision trees for your chosen data points (Subsets).
Step 4: The final output is based on majority voting for classification, or on averaging for
regression.
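The sampling and aggregation steps can be sketched in a few lines. This is only an illustration of the idea, assuming the individual trees have already produced their predictions; the function names and the example values are hypothetical, and training the trees themselves is omitted.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (one tree's training subset)."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Classification: return the label predicted by the most trees.

    Ties are broken in favour of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
subset = bootstrap_sample(["msg1", "msg2", "msg3", "msg4"], rng)
print(len(subset))                             # same size as the original data
print(majority_vote(["spam", "ham", "spam"]))  # label chosen by most trees
print(average([0.2, 0.4, 0.9]))                # mean of the trees' outputs
```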
Before understanding the working of the random forest algorithm in machine learning, we must
look into the ensemble learning technique. Ensemble simply means combining multiple models.
Thus a collection of models is used to make predictions rather than an individual model. Ensemble
Bagging
It creates a different training subset from sample training data with replacement & the final output is
Boosting
It combines weak learners into strong learners by creating sequential models such that the final
model has the highest accuracy. Examples include AdaBoost and XGBoost.
3.4.3 LSTM
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) that can capture
long-term dependencies in sequential data. LSTMs are able to process and analyze sequential data,
such as time series, text, and speech. They use a memory cell and gates to control the flow of
information, allowing them to selectively retain or discard information as needed and thus avoid the
vanishing gradient problem that plagues traditional RNNs. LSTMs are widely used in various
applications such as natural language processing, speech recognition, and time series forecasting.
There are three types of gates in an LSTM: the input gate, the forget gate, and the output gate.
The input gate controls the flow of information into the memory cell. The forget gate controls
which information is discarded from the memory cell. The output gate controls the
flow of information out of the memory cell. The output gate controls the flow of information out of
All three gates (the input gate, forget gate, and output gate) are implemented using sigmoid functions,
which produce an output between 0 and 1. These gates are trained using a backpropagation
The input gate decides which information to store in the memory cell. It is trained to open when the
The forget gate decides which information to discard from the memory cell. It is trained to open
The output gate is responsible for deciding which information to use for the output of the LSTM. It
is trained to open when the information is important and close when it is not.
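The way the three gates act on the memory cell can be illustrated with a single time step of a one-unit LSTM cell. This sketch follows the standard LSTM update equations with scalar input and state; the function name, the `params` layout, and the hand-picked weights are assumptions for the example, and in practice the weights are learned rather than set by hand.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1); used for all three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell with scalar input and state.

    params maps each of "i", "f", "o", "g" to a (w_x, w_h, b) triple.
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("i"))    # input gate: how much new information to store
    f = sigmoid(pre_activation("f"))    # forget gate: how much old cell state to keep
    o = sigmoid(pre_activation("o"))    # output gate: how much of the cell to expose
    g = math.tanh(pre_activation("g"))  # candidate value to write into the cell
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-picked weights: input gate almost closed, forget and output gates almost open,
# so the cell should carry its previous state (0.7) forward nearly unchanged.
params = {"i": (0.0, 0.0, -10.0), "f": (0.0, 0.0, 10.0),
          "o": (0.0, 0.0, 10.0), "g": (1.0, 0.0, 0.0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
print(round(c, 3), round(h, 3))
```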
The gates in an LSTM are trained to open and close based on the input and the previous hidden
state. This allows the LSTM to selectively retain or discard information, making it more effective at
Structure of LSTM
An LSTM (Long Short-Term Memory) network is a type of recurrent neural network (RNN) that is
capable of handling and processing sequential data. The structure of an LSTM network consists of a
series of LSTM cells, each of which has a set of gates (input, output, and forget gates) that control
the flow of information into and out of the cell. The gates are used to selectively forget or retain
information from the previous time steps, allowing the LSTM to maintain long-term dependencies
The LSTM cell also has a memory cell that stores information from previous time steps and uses it
to influence the output of the cell at the current time step. The output of each LSTM cell is passed
to the next cell in the network, allowing the LSTM to process and analyze sequential data over
Applications of LSTM
Long Short-Term Memory (LSTM) is a highly effective Recurrent Neural Network (RNN) that has
been utilized in various applications. Here are a few well-known LSTM applications:
Language Modeling: LSTMs have been utilized for natural
language processing tasks such as machine translation, language modeling, and text summarization.
By understanding the relationships between words in a sentence, they can be trained to construct
Voice Recognition: LSTMs have been utilized for speech recognition tasks such as speech-to-text
transcription and command recognition. They may be taught to recognize patterns in speech
Sentiment Analysis: LSTMs can be used to classify text sentiment as positive, negative, or neutral
Time Series Prediction: LSTMs can be used to predict future values in a time series by learning the
Video Analysis: LSTMs can be used to analyze video by learning the relationships between frames
This is a classifier in which the weights of the network are found by solving a quadratic
programming problem with linear constraints, rather than by solving a non-convex, unconstrained
minimization problem as in standard neural network training. Other well-known algorithms are
based on the notion of the perceptron (Tapas Kanungo, D. M., 2002). The perceptron algorithm
learns from a batch of training instances by running repeatedly through the training set until it
finds a prediction vector that is correct on all of the training set. This prediction rule is
then used to predict the labels on the test set (Neocleous C., 2002).
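The perceptron training loop described above can be sketched as follows. This is a minimal version assuming labels of +1 and -1 and linearly separable data; the function names, the `max_epochs` safety cap, and the two toy features (counts of the words "free" and "meeting") are made up for illustration.

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Train a perceptron; labels must be +1 or -1.

    Passes repeatedly over the training set and stops once a full pass makes
    no mistakes (or after max_epochs passes). Returns (weights, bias).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def perceptron_predict(weights, bias, x):
    """Return +1 or -1 for one feature vector."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Toy features: [count of "free", count of "meeting"]; +1 = spam, -1 = ham.
samples = [[2, 0], [1, 0], [0, 1], [0, 2]]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(samples, labels)
print([perceptron_predict(weights, bias, x) for x in samples])
```

If the training data is not linearly separable, no prediction vector is correct on the whole training set, which is why the sketch caps the number of passes.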
Support Vector Machines (SVMs) are among the most recent supervised machine learning techniques.
SVM models are closely related to classical multilayer perceptron neural networks. SVMs revolve around
the notion of a “margin”: either side of a hyperplane that separates two data classes. Maximizing the
possible distance between the separating hyperplane and the instances on either side of it has been
3.4.6 K-means: According to Nilsson, N.J. (2005), K-means is one of the simplest unsupervised
learning algorithms that solve the well-known clustering problem. The procedure follows a simple
and easy way to classify a given data set through a certain number of clusters (assume k clusters)
fixed a priori. The K-means algorithm is employed when labeled data is not available.
Boosting is a general method of converting rough rules of thumb into a highly accurate prediction
rule. Given a “weak” learning algorithm that can consistently find classifiers (“rules of thumb”)
at least slightly better than random, say, accuracy ≥ 55%, with sufficient data, a boosting
algorithm can provably construct
Decision Trees (DT) are trees that classify instances by sorting them based on feature values. Each
node in a decision tree represents a feature in an instance to be classified and each branch represents
a value that the node can assume. Instances are classified starting at the root node and sorted based
on their feature values. Decision tree learning, used in data mining and machine learning, uses a
decision tree as a predictive model which maps observations about an item to conclusions about the
item's target value. More descriptive names for such tree models are classification trees or
regression trees. Decision tree classifiers usually employ post-pruning techniques that evaluate the
performance of decision trees, as they are pruned by using a validation set. Any node can be
removed and assigned the most common class of the training instances that are sorted to it.
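As a small illustration of the description above, the sketch below classifies an instance by walking a hand-built tree from the root, and shows the label a pruned node would receive. The tree, the feature names (`contains_link`, `all_caps`), and the helper functions are hypothetical and are not taken from the project.

```python
from collections import Counter

# Internal nodes test one feature, branches are the values that feature can
# take, and leaves hold a class label.
tree = {
    "feature": "contains_link",
    "branches": {
        True: {
            "feature": "all_caps",
            "branches": {True: "spam", False: "ham"},
        },
        False: "ham",
    },
}


def classify(node, instance):
    """Start at the root and follow the branch matching each feature value."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["feature"]]]
    return node


def majority_class(labels):
    """Label assigned to a pruned node: the most common class sorted to it."""
    return Counter(labels).most_common(1)[0][0]
```

For example, `classify(tree, {"contains_link": True, "all_caps": True})` follows two branches and ends at the `"spam"` leaf.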
Neural Networks (NN) can actually perform a number of regression and/or classification tasks
at once, although commonly each network performs only one. In the vast majority of cases,
therefore, the network will have a single output variable, although in the case of many-state
classification problems, this may correspond to a number of output units (the post-processing stage
takes care of the mapping from output units to output variables). An Artificial Neural Network (ANN)
depends upon three fundamental aspects: the input and activation functions of the unit, the network
architecture, and the weight of each input connection. Given that the first two aspects are fixed, the
behavior of the ANN is defined by the current values of the weights. The weights of the net to be
trained are initially set to random values, and then instances of the training set are repeatedly
exposed to the net. The values for the input of an instance are placed on the input units and the
output of the net is compared with the desired output for this instance. Then, all the weights in the
net are adjusted slightly in the direction that would bring the output values of the net closer to the
values for the desired output. There are several algorithms with which a network can be trained
(Lemnaru C., 2012).
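The training procedure described above (random initial weights, repeated exposure to training instances, small weight adjustments toward the desired output) can be sketched for a single sigmoid unit. This is only an illustration under assumed settings: the learning rate, epoch count, seed, and the logical-OR toy data are not from the project.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Output of one unit: weighted sum of the inputs passed through the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(data, epochs=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # The weights are initially set to small random values.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(len(data[0][0]))]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for inputs, target in data:  # expose each training instance to the net
            error = target - neuron_output(weights, bias, inputs)
            # Adjust every weight slightly in the direction that reduces the error.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Logical OR as a toy training set (hypothetical data).
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(or_data)
```

A full network repeats the same idea for every unit and propagates the error backwards through the layers.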
3.4.9 Python
Python is a widely used general-purpose, high level programming language. It was created by
Guido van Rossum in 1991 and further developed by the Python Software Foundation. It was
designed with an emphasis on code readability, and its syntax allows programmers to express their
Python is a programming language that lets you work quickly and integrate systems more
efficiently.
There are two major Python versions: Python 2 and Python 3. Both are quite different.
• There are no separate compilation and execution steps like C and C++.
• Internally, Python converts the source code into an intermediate form called bytecodes which is
Platform Independent
• Python programs can be developed and executed on multiple operating system platforms.
• Python can be used on Linux, Windows, Macintosh, Solaris and many more.
High-level Language
• In Python, there is no need to take care of low-level details such as managing the memory used by
• More emphasis is placed on the solution to the problem than on the syntax.
Embeddable
• Python can be used within a C/C++ program to give scripting capabilities to the program’s users.
Robust:
• Known as the “batteries included” philosophy of Python; it can help do various things involving
regular expressions, documentation generation, unit testing, threading, databases, web browsers,
CGI, email, XML, HTML, WAV files, cryptography, GUI and many more.
• Besides the standard library, there are various other high-quality libraries such as the Python
3.4.4.1 Pandas
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Max value?
Min value?
Pandas is also able to delete rows that are not relevant or contain wrong values, like empty or
3.4.4.2 NumPy
It also has functions for working in the domain of linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it
freely.
In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims
to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray; it provides a lot of supporting functions that make
Arrays are very frequently used in data science, where speed and resources are very important.
NumPy is a Python library and is written partially in Python, but most of the parts that require fast
3.4.4.3 matplotlib
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript
multi-platform data visualization library built on NumPy arrays and designed to work with the
broader SciPy stack. It was introduced by John Hunter in the year 2002. One of the greatest benefits
of visualization is that it allows us visual access to huge amounts of data in easily digestible visuals.
Matplotlib consists of several plots like line, bar, scatter, histogram, etc.
3.4.4.4 seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface
seaborn is a library for making statistical graphics in Python. It provides a high-level interface to
matplotlib and integrates closely with pandas data structures. Functions in the seaborn library
expose a declarative, dataset-oriented API that makes it easy to translate questions about data into
graphics that can answer them. When given a dataset and a specification of the plot to make,
seaborn automatically maps the data values to visual attributes such as color, size, or style,
internally computes statistical transformations, and decorates the plot with informative axis labels
and a legend. Many seaborn functions can generate figures with multiple panels that elicit
producing complete graphics from a single function call with minimal arguments, seaborn
facilitates rapid prototyping and exploratory data analysis. And by offering extensive options for
customization, along with exposing the underlying matplotlib objects, it can be used to create
3.4.4.5 TensorFlow
It was created and is maintained by Google and was released under the Apache 2.0 open source
license. The API is nominally for the Python programming language, although there is access to the
Unlike other numerical libraries intended for use in Deep Learning like Theano, TensorFlow was
designed for use both in research and development and in production systems, not least of which is
It can run on single CPU systems and GPUs, as well as mobile devices and large-scale distributed
3.4.4.6 Keras
Keras runs on top of open source machine learning libraries like TensorFlow, Theano or Cognitive Toolkit
(CNTK). Theano is a Python library used for fast numerical computation tasks. TensorFlow is the
most famous symbolic math library used for creating neural networks and deep learning models.
TensorFlow is very flexible and its primary benefit is distributed computing. CNTK is a deep
learning framework developed by Microsoft. It can be used from languages such as Python, C# and C++, or as a standalone
machine learning toolkit. Theano and TensorFlow are very powerful libraries but difficult to
Keras is based on a minimal structure that provides a clean and easy way to create deep learning
models based on TensorFlow or Theano. Keras is designed to quickly define deep learning models.
Features
• Keras leverages various optimization techniques to make the high-level neural network API easier
and more performant. It supports the following features: a consistent, simple and extensible
API.
Benefits
Keras is a highly powerful and dynamic framework and offers the following advantages:
• Easy to test.
• Keras neural networks are written in Python, which makes things simpler.
• Deep learning models are discrete components, so you can combine them in many ways.
CHAPTER 4
METHODOLOGY
The public dataset of labelled SMS messages is obtained from the UCI Machine Learning Repository.
The dataset considered in the current research is also available on Kaggle, a machine learning repository.
The dataset contains 5,574 labelled messages, of which 4,827 are ham messages and the other 747
are spam messages. The
dataset consists of two named columns, starting with the message labels (ham or spam) followed by
It's time for a data analyst to pick up the baton and lead the way to machine learning
implementation. The job of a data analyst is to find ways and sources of collecting relevant and
comprehensive data, interpreting it, and analyzing results with the help of statistical techniques. The
type of data depends on what you want to predict. There is no exact answer to the question
“How much data is needed?” because each machine learning problem is unique. In turn, the number
of attributes data scientists will use when building a predictive model depends on the
attributes' predictive value.
The “the more, the better” approach is reasonable for this phase. Some data scientists suggest considering
that less than one-third of collected data may be useful. It's difficult to estimate which part of the
data will provide the most accurate results until the model training begins. That's why it's important
to collect and store all data: internal and open, structured and unstructured.
The purpose of preprocessing is to convert raw data into a form that fits machine learning.
Structured and clean data allows a data scientist to get more precise results from an applied machine
learning model. The technique includes data formatting, cleaning, and sampling.
4.2.1 Data formatting: The importance of data formatting grows when data is acquired from
various sources by different people. The first task for a data scientist is to standardize record
formats. A specialist checks whether variables representing each attribute are recorded in the same
way. Titles of products and services, prices, date formats, and addresses are examples of variables.
The principle of data consistency also applies to attributes represented by numeric ranges.
4.2.2 Data cleaning: This set of procedures allows for removing noise and fixing inconsistencies
in data. A data scientist can fill in missing data using imputation techniques, e.g. substituting
missing values with mean attributes. A specialist also detects outliers — observations that deviate
significantly from the rest of distribution. If an outlier indicates erroneous data, a data scientist
This stage also includes removing incomplete and useless data objects.
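Mean imputation, mentioned above as one way to fill in missing data, can be sketched in a few lines. The column of message lengths and the use of `None` to mark a missing value are assumptions made for this example only.

```python
def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate the mean from
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical attribute column with two missing values.
message_lengths = [120, None, 80, 100, None]
filled = impute_with_mean(message_lengths)
```

Here the three observed values average to 100, so both gaps are filled with 100.0.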
4.2.3 Data anonymization: Sometimes a data scientist must anonymize or exclude attributes
representing sensitive information (e.g. when working with healthcare and banking data).
4.2.4 Data sampling: Big datasets require more time and computational power for analysis. If a
dataset is too large, applying data sampling is the way to go. A data scientist uses this technique to
select a smaller but representative data sample to build and run models much faster, and at the same
Pre-processing is the first stage, in which the unstructured data is converted into more structured
data, since keywords in SMS text messages are prone to be replaced by symbols. In this study, a
stop word remover for the English language has been applied to eliminate the stop words in the
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the
The ROC AUC score is simply the value of the area under the ROC curve. It shows how well the classifier
distinguishes positive and negative classes. It can take values from 0 to 1. A higher ROC AUC
indicates better performance. A perfect model would have an AUC of 1, while a random model
would have an AUC of 0.5.
ROC is a probability curve and AUC represents the degree or measure of separability. It tells how
much the model is capable of distinguishing between classes. Higher the AUC, the better the model
We can measure model accuracy by two methods. Accuracy simply means the number of values
correctly predicted.
Confusion Matrix
Classification Measure
The confusion matrix is a fundamental tool for evaluating the performance of classification models.
It provides a comprehensive summary of the model's predictions compared to the actual ground
truth across different classes. The matrix is organized into rows and columns, where each row
represents the actual class labels, and each column represents the predicted class labels.
The following 4 are the basic terminology which will help us in determining the metrics we are
looking for.
True Positives (TP): when the actual value is Positive and the prediction is also Positive.
True Negatives (TN): when the actual value is Negative and the prediction is also Negative.
False Positives (FP): when the actual value is Negative but the prediction is Positive. Also known as the
Type 1 error.
False Negatives (FN): when the actual value is Positive but the prediction is Negative. Also known as the
Type 2 error.
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:
We have a total of 20 cats and dogs and our model predicts whether it is a cat or not.
Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’,
Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’,
True Positive (TP): You predicted positive and it’s true. You predicted that an animal is a cat and it actually is.
True Negative (TN): You predicted negative and it’s true. You predicted that an animal is not a cat and it actually is not (it’s
a dog).
False Positive (FP, Type 1 error): You predicted positive and it’s false. You predicted that an animal is a cat but it actually is not (it’s a
dog).
False Negative (FN, Type 2 error): You predicted negative and it’s false. You predicted that an animal is not a cat but it actually is.
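The four outcomes can be counted directly from paired lists of actual and predicted labels. The sketch below uses a short illustrative list of eight animals rather than the full 20-animal example, and the function name and the choice of "cat" as the positive class are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive="cat"):
    """Count TP, TN, FP and FN for a binary problem with the given positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["TN" if a != positive else "FN"] += 1
    return counts


# Short illustrative sample (hypothetical, eight animals only).
actual_labels = ["dog", "cat", "dog", "cat", "dog", "dog", "cat", "dog"]
predicted_labels = ["dog", "dog", "dog", "cat", "dog", "dog", "cat", "cat"]
counts = confusion_counts(actual_labels, predicted_labels)
```

For this sample the counts are TP = 2, TN = 4, FP = 1 and FN = 1, which together account for all eight animals.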
assessment of model performance in classification tasks. These metrics offer valuable insights into
the strengths and limitations of the model, enabling a deeper understanding and analysis of its
Accuracy: Accuracy measures the overall correctness of the model by calculating the ratio of
correctly predicted instances to the total number of instances. While accuracy provides a general
Precision: Precision quantifies the proportion of true positive predictions among all positive
predictions made by the model. It focuses on the accuracy of positive predictions and helps evaluate
Recall (True Positive Rate, Sensitivity): Recall calculates the proportion of true positive
predictions among all actual positive instances in the dataset. It measures the model's ability to
capture all positive instances and is particularly important in scenarios where missing positive
F1-Score: The F1-Score is the harmonic mean of precision and recall, providing a balanced
measure of a model's performance. It considers both false positives and false negatives and is useful
for evaluating models in situations where there is an uneven class distribution or class imbalance.
False Positive Rate (FPR, Type I Error): FPR measures the proportion of negative instances that
are incorrectly classified as positive by the model. It complements precision by focusing on the rate
of false positives and is essential in applications where minimizing false alarms is critical.
False Negative Rate (FNR, Type II Error): FNR calculates the proportion of positive instances
that are incorrectly classified as negative by the model. It evaluates the model's ability to detect all
positive instances and is particularly relevant in scenarios where missing positive instances can
4.3.4.1 Accuracy:
Accuracy simply measures how often the classifier makes the correct prediction. It’s the ratio
between the number of correct predictions and the total number of predictions. The accuracy metric
is not suited for imbalanced classes. Accuracy has its own disadvantages, for imbalanced data, when
the model predicts that each point belongs to the majority class label, the accuracy will be high. But,
Accuracy is a valid
choice of evaluation for classification problems which are well balanced and not skewed, that is, where there is
no class imbalance.
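To see why accuracy can mislead on imbalanced data, consider a trivial "classifier" that labels every message as ham. The sketch below applies it to the class counts reported for the dataset (4,827 ham and 747 spam); the always-ham rule is a hypothetical baseline, not one of the models used in this project.

```python
ham_count, spam_count = 4827, 747  # class counts reported for the dataset

# Always predicting "ham" gets every ham message right and every spam message wrong.
correct_predictions = ham_count
total_predictions = ham_count + spam_count

accuracy = correct_predictions / total_predictions
spam_recall = 0 / spam_count  # no spam message is ever detected
```

The baseline reaches an accuracy of about 86.6% while detecting no spam at all, which is why precision, recall and F1 are reported alongside accuracy.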
4.3.4.2 Precision:
It is a measure of correctness that is achieved in true prediction. In simple words, it tells us how
many predictions are actually positive out of all the total positive predicted.
Precision is defined as the ratio of the total number of correctly classified positive classes divided
by the total number of predicted positive classes. Or, out of all the predictive positive classes, how
“Precision is a useful metric in cases where False Positive is a higher concern than False Negatives”
Suppose a mail is not spam but the model predicts it as spam: FP (False Positive). We always try to
reduce FP.
Ex 2:- Precision is important in music or video recommendation systems, e-commerce websites, etc.
Wrong results could lead to customer churn and be harmful to the business.
4.3.4.3 Recall:
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
Recall is defined as the ratio of the total number of correctly classified positive classes divided by the
total number of positive classes. Or, out of all the positive classes, how many we have predicted
correctly. Recall should be high (ideally 1). “Recall is a useful metric in cases where False Negative
Ex 1:- Suppose we predict whether a person has cancer or not. He is suffering from cancer but the model predicted as
Ex 2:- Recall is important in medical cases where it doesn’t matter whether we raise a false alarm
Recall would be a better metric because we don’t want to accidentally discharge an infected person
and let them mix with the healthy population, thereby spreading a contagious virus. Now you can
The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall. We use
harmonic mean because it is not sensitive to extremely large values, unlike simple averages.
F1 score sort of maintains a balance between the precision and recall for your classifier. If your
precision is low, the F1 is low and if the recall is low again your F1 score is low. There will be cases
where there is no clear distinction between whether Precision is more important or Recall. We
combine them!
In practice, when we try to increase the precision of our model, the recall goes down and vice-versa.
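The relationship between precision, recall and the F1 score can be checked with a small calculation. The counts below are hypothetical and chosen only to show how the harmonic mean is pulled toward the lower of the two values.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 60 false negatives.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=60)
```

With these counts precision is 0.90 and recall is 0.60. The simple average would be 0.75, but the F1 score is 0.72, closer to the weaker metric.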
CHAPTER 5
SYSTEM DESIGN
5 SYSTEM DESIGN
In the context of our SMS spam detection project, system design plays a crucial role in defining the
architecture and components required to achieve our objective effectively. System design involves
conceptualizing the structure of our SMS spam detection system, including its interface, modules,
In the initial stages of system design, we identify the key components of the system, such as the
message preprocessing module, the Naive Bayes classifier, and the classification output module.
Each component is designed to fulfill specific functionalities necessary for detecting SMS spam
messages accurately.
In the context of our SMS spam detection project, data flow diagrams (DFDs) serve as valuable
tools for visually illustrating how data moves through our system and how different components
A logical data flow diagram for our project would depict the flow of SMS messages through various
stages of preprocessing, classification, and result generation. It would highlight the key processes
involved, such as tokenization, feature extraction, Naive Bayes classification, and the final
A use case diagram would provide a high-level overview of the various functionalities offered by
our system, the actors interacting with the system, and the relationships between these actors and
functionalities.
Actors in our use case diagram would represent different entities that interact with our system.
These could include end users, administrators, and external systems. Each actor has specific roles
and responsibilities within the system. The functionalities provided by our SMS spam detection
system would be represented as use cases. These use cases describe specific actions or services that
the system offers to its users. Examples of use cases in our project may include:
Submit SMS Message: This use case involves users submitting SMS messages to the system for
classification.
View Classification Results: Users can view the classification results of their submitted SMS
messages.
Update Model: Administrators have the ability to update the Naive Bayes classification model with
Generate Reports: The system can generate reports on spam detection accuracy, false positives,
The use case diagram would illustrate the relationships between actors and use cases. For example,
end users may interact with the "Submit SMS Message" and "View Classification Results" use
cases, while administrators may interact with additional use cases such as "Update Model" and
"Generate Reports."
The class diagram represents the static
structure of the system, detailing the classes involved and their relationships. We have two main
Frontend Class:
Attributes:
Username: Represents the username of the user interacting with the system.
Password: Represents the password associated with the user account for authentication.
Other Attributes: These may include additional information related to user preferences,
Methods:
AuthenticateUser(): Method to authenticate the user based on the provided username and
password.
classification.
messages.
Backend Class:
Attributes:
Dataset: Represents the dataset used for training the machine learning model.
ML Model: Represents the trained machine learning model for classifying SMS messages.
Splitting Data: Represents the functionality for splitting the dataset into training and testing
sets.
Other Attributes: These may include additional components or resources used in the
backend processing.
Methods:
TrainModel(): Method to train the machine learning model using the provided dataset.
SplitData(): Method to split the dataset into training and testing sets for model evaluation.
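One possible way to express the Backend class in code is sketched below. It is a hypothetical skeleton that only mirrors the attributes and methods listed above: the method names follow Python naming conventions, the split fraction and seed are assumed defaults, and the training step is a majority-label placeholder rather than the actual classifier.

```python
import random


class Backend:
    """Hypothetical skeleton of the Backend class from the class diagram."""

    def __init__(self, dataset):
        self.dataset = list(dataset)  # (message, label) pairs
        self.ml_model = None          # set by train_model()
        self.train_set = []
        self.test_set = []

    def split_data(self, test_fraction=0.2, seed=42):
        """Shuffle the dataset and hold out a fraction of it for testing."""
        rows = self.dataset[:]
        random.Random(seed).shuffle(rows)
        n_test = int(len(rows) * test_fraction)
        self.test_set, self.train_set = rows[:n_test], rows[n_test:]

    def train_model(self):
        """Placeholder training step: remember the most common training label."""
        labels = [label for _, label in self.train_set]
        if labels:
            self.ml_model = max(set(labels), key=labels.count)


# Hypothetical usage with a tiny made-up dataset.
backend = Backend([("see you soon", "ham")] * 8 + [("free prize", "spam")] * 2)
backend.split_data()
backend.train_model()
```

Keeping the dataset, the split and the model inside one class matches the separation of concerns between Frontend and Backend described in this section.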
The class diagram visually depicts the structure of the system and the interactions between
its components. Frontend and Backend classes encapsulate their respective functionalities
and attributes, providing a clear separation of concerns. The diagram helps developers
understand the architecture of the system, facilitating the implementation and maintenance
of the SMS spam detection application.
messages exchanged during a particular scenario, such as classifying an incoming SMS message.
The sequence diagram begins with the user attempting to log in to the system by providing their
The Frontend class sends an authentication request message to the Backend class.
The Backend class receives the authentication request and verifies the provided credentials
If the credentials are valid, the Backend class sends a confirmation message back to the
Frontend class.
The Frontend class receives the confirmation and allows the user to access the system.
The sequence diagram starts with the Frontend class receiving an incoming SMS message from the
user.
The Frontend class sends the SMS message to the Backend class for classification.
The Backend class receives the SMS message and invokes the ClassifySMSMessage() method to
The ML Model within the Backend class processes the message using the trained machine learning
model.
If training or testing data splitting is required, the sequence diagram may include a sequence for
The Backend class initiates the SplitData() method to divide the dataset into training and testing
sets.
The dataset splitting process occurs internally within the Backend class, and the resulting
The activity diagram illustrates the flow of activities and interactions within the system, depicting
how different components and processes interact to achieve specific functionalities. Here's how the
The diagram starts with the system initialization activity, representing the initialization of the
This activity involves initializing the necessary components, such as loading the machine
learning model, setting up the database connection, and preparing the user interface.
After initialization, the diagram shows the user authentication activity, where users are required
This activity includes the process of entering credentials, verifying them against the user
Upon successful authentication, the system proceeds to the SMS message classification activity.
After classification, the system displays the results of the classification process to the user.
This activity involves presenting the classification results (spam or ham) along with any
The diagram includes the system shutdown activity, representing the graceful shutdown of the
This activity involves closing connections, saving data, and releasing resources before
The diagram begins with the initial state, representing the creation of an object to handle the
At this stage, the system initializes its components and resources to begin processing the SMS
message.
Upon receiving the SMS message, the system transitions to the stock data state.
In this state, the system acquires the necessary data, which includes the content of the SMS message
Here, the system performs data cleaning operations to preprocess the raw SMS message content,
which may involve tasks such as removing special characters, correcting spelling errors, and
Preprocessing State:
The system extracts relevant features from the pre-processed SMS message data, preparing it for
Once preprocessing is complete, the system moves to the data split state.
Here, the system divides the pre-processed SMS message data into training and testing datasets to
After splitting the data, the system transitions to the label encoder state.
In this state, the system encodes categorical labels (e.g., spam and ham) into numerical
Model State:
The system trains a machine learning model, such as the Naive Bayes algorithm, using the labelled
training data to learn patterns and relationships between features and class labels.
Accuracy State:
Upon training the model, the system moves to the accuracy state.
In this state, the system evaluates the performance of the trained model using the testing dataset,
Results State:
After assessing model performance, the system transitions to the results state.
The system generates and presents the results of SMS message classification, indicating whether
each message is classified as spam or ham based on the trained model's predictions.
Termination State:
The process concludes with the termination state, where the system releases resources and
In this state, the system may perform cleanup tasks and return to an idle state, awaiting the arrival of
CHAPTER 6
RESULTS AND DISCUSSIONS
Pie Chart:
A pie chart provides a visual representation of the distribution of spam and ham messages within
the dataset. By examining the proportions of spam and ham messages, we can gauge the imbalance
between the two classes and assess the dataset's suitability for training a machine learning model.
Bar Graph:
A bar graph illustrates the frequency distribution of key features within the SMS messages, such as
word count, character count, or the presence of specific keywords. This visualization helps identify
common patterns and characteristics associated with spam and ham messages, guiding the feature
selection and preprocessing steps in the subsequent stages of the SMS spam detection pipeline.
Observations: 87.4% of the SMSes aren't spam, while only 12.6% are actually spam.
Insights: since the data is imbalanced, we need to take that into consideration while splitting it into training and testing
sets.
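One common way to account for the imbalance when splitting is a stratified split, which keeps the spam/ham proportion the same in the training and testing sets. The sketch below is a minimal pure-Python illustration; the function name, the 20% test fraction, the seed, and the toy label list are assumptions, not the project's actual split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) that preserve each class's proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels with roughly the imbalance observed above.
labels = ["ham"] * 87 + ["spam"] * 13
train_idx, test_idx = stratified_split(labels)
```

With this toy list the test set holds 17 ham and 3 spam messages, so spam stays close to its overall share in both sets.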
Observation: spam SMSes have on average higher sentence/word counts than ham ones, but these
Comparing the count of sentences or words between spam and ham
messages can provide valuable insights into their structural differences.
Understanding the distribution of sentence and word counts in both spam and ham messages is
crucial for identifying distinctive characteristics between the two categories. This comparison
allows us to discern potential patterns or anomalies that could aid in distinguishing spam from
legitimate messages.
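The comparison can be sketched by averaging the number of words per message in each class. The four messages below are invented for illustration and are not taken from the dataset; splitting on whitespace is a simplifying assumption about tokenization.

```python
def average_word_count(messages):
    """Mean number of whitespace-separated tokens per message."""
    return sum(len(message.split()) for message in messages) / len(messages)


# Hypothetical example messages.
spam_messages = [
    "WINNER! Claim your free prize now by calling this number",
    "Free entry in a weekly competition text WIN to enter",
]
ham_messages = [
    "Ok see you later",
    "I am on my way home",
]

spam_average = average_word_count(spam_messages)
ham_average = average_word_count(ham_messages)
```

In this toy sample the spam messages average 10 words and the ham messages 5, the same direction as the observation reported for the real data.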
Data preprocessing plays a pivotal role in extracting meaningful insights from raw SMS message
data. In this section, we discuss various preprocessing steps, including text normalization,
A word cloud is a graphical representation of word frequency in a text corpus, where the size of
each word corresponds to its frequency of occurrence. By generating a word cloud for both spam
and ham messages, we can visually inspect the most frequent words used in each category. This
visualization aids in identifying common themes, keywords, and distinguishing features that
Purpose: TfidfVectorizer is designed to address the issue of word importance. It considers not only
the frequency of a word in a document but also how rare that word is across the entire corpus. Words
that appear in many documents receive lower weights, while words concentrated in a few documents
receive higher weights.
How it works: It computes a TF-IDF score for each term in each
document. TF (Term Frequency) measures the frequency of a term in a document, while IDF
(Inverse Document Frequency) measures the uniqueness of the term across the entire corpus.
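To make the TF-IDF computation concrete, here is a minimal plain-Python sketch. It follows the smoothed IDF formula, ln((1 + N) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as the defaults for TfidfVectorizer, but it is not the library's implementation: the whitespace tokenizer, the function name, and the three-message corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical toy corpus for illustration only.
corpus = ["free prize call now", "call me now", "see you at lunch"]
vectors = tfidf(corpus)
print(vectors[0])
```

In the first message, "free" occurs in only one document while "call" occurs in two, so "free" ends up with the higher weight even though both appear once in that message.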
From the above models, it is evident that the Multinomial Naive Bayes (MNB) algorithm performs
well for spam detection. This algorithm, which is based on the principles of Bayes' theorem and
assumes independence among features, demonstrates strong performance in classifying SMS
messages as spam or ham.
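Since the dataset is imbalanced, classification performance of this kind is usually judged by precision, recall, and F-measure for the spam class rather than by accuracy alone. The sketch below shows how these scores are derived from true and predicted labels; the label lists are hypothetical and do not reproduce the project's reported results.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for illustration only.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

In this toy case accuracy is 0.75 while precision, recall, and F1 for spam are each about 0.67, which illustrates how accuracy can look better than the spam-specific scores when ham dominates.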
The simplicity and effectiveness of the MNB algorithm make it particularly well-suited for text
classification tasks, such as spam detection. By modeling the probability of each word occurring in
spam and ham messages independently, MNB can effectively distinguish between the two classes
based on how often specific keywords or features occur.
Moreover, MNB is computationally efficient and robust, making it suitable for handling large
datasets and real-time applications. Its ability to handle sparse data and its resistance to overfitting
further contribute to its suitability for spam detection tasks.
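The idea behind MNB can be sketched in a few lines of plain Python. This is a simplified illustration that works on raw word counts with add-one (Laplace) smoothing, not on TF-IDF features, and it is not the model trained in this project; the class name, the whitespace tokenizer, and the four training messages are assumptions.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Tiny Multinomial Naive Bayes over word counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.class_counts = Counter(labels)      # number of messages per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def _log_score(self, tokens, label):
        # log P(class) + sum of log P(word | class), words treated as independent.
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocabulary)
        for token in tokens:
            if token in self.vocabulary:  # ignore words never seen in training
                score += math.log((self.word_counts[label][token] + self.alpha) / denom)
        return score

    def predict(self, text):
        tokens = text.lower().split()
        return max(self.class_counts, key=lambda label: self._log_score(tokens, label))


# Hypothetical toy training data for illustration only.
texts = [
    "win a free prize now",
    "free entry call now",
    "are you coming to lunch",
    "call me when you are home",
]
labels = ["spam", "spam", "ham", "ham"]

model = MultinomialNaiveBayes().fit(texts, labels)
print(model.predict("free prize call now"))
print(model.predict("are you home"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was never seen in one class from driving that class's probability to zero.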
CHAPTER 7
CONCLUSION AND FUTURE SCOPE
Conclusion:
The proliferation of SMS spam messages presents a significant challenge globally, with the problem
showing no signs of abating as mobile usage continues to rise. This paper addresses this issue by
presenting a spam filtering technique employing various machine learning algorithms, aimed at
Through experimentation, it was found that the TF-IDF with Naive Bayes classification algorithm
However, relying solely on accuracy may not be sufficient, given the imbalanced nature of the
dataset. Upon closer examination, the Naive Bayes algorithm demonstrated commendable precision
and f-measure scores of 0.98 and 0.97, respectively, underscoring its robustness in identifying spam
Moreover, feature selection has a substantial impact on algorithm performance. Different algorithms
perform differently depending on the features they use, so features should be chosen through careful
experimentation. In this context, incorporating additional features, such as message length, sender
metadata, and semantic attributes, holds promise for improving classifier training and overall
performance.
Looking ahead, the future scope of this project extends beyond algorithmic enhancements to
encompass broader applications in data analysis and predictive modeling. For instance, integrating
neural network architectures with complementary techniques like genetic algorithms and fuzzy
logic could yield further improvements in spam detection accuracy. Additionally, exploring the use
of machine learning algorithms for analyzing public comments and predicting corporate
In conclusion, while the battle against SMS spam messages is ongoing, innovative approaches
further strengthen spam detection systems and mitigate the impact of unsolicited messages on users'
Future Scope:
The future scope of this project entails the inclusion of additional feature parameters, since
considering more informative features can improve accuracy. Furthermore, the algorithms
can be extended to analyze public comments and discern patterns and relationships between
customers and companies. Traditional algorithms and data mining techniques can also be harnessed
Looking ahead, there are plans to integrate neural networks with other methodologies such as
genetic algorithms or fuzzy logic. Genetic algorithms offer the potential to identify optimal network
architectures and training parameters, while fuzzy logic can accommodate uncertainties inherent in
neural network predictions. The synergistic application of these techniques alongside neural
Expanding on this future scope, the integration of additional feature parameters, such as message
metadata and semantic attributes, will contribute to a more nuanced understanding of spam
messages and enhance the accuracy of classification models. Furthermore, leveraging natural
language processing techniques to analyze the content and context of public comments can provide
valuable insights into customer sentiments and preferences, enabling companies to tailor their
In addition to predictive modeling, the application of traditional algorithms and data mining
techniques for forecasting corporate performance structures represents a compelling avenue for
future research. By analyzing historical data and identifying key performance indicators,
organizations can gain actionable insights into market trends, consumer behavior, and competitive
Integrating neural networks with genetic algorithms or fuzzy logic represents a promising direction
for advancing SMS spam prediction. Genetic algorithms can optimize neural network architectures
and hyperparameters, improving model performance and scalability. Meanwhile, fuzzy logic can
handle uncertainty and imprecision in data, enhancing the robustness and reliability of spam
In summary, the future of SMS spam detection lies in the continued exploration of advanced
techniques and interdisciplinary approaches. By harnessing the power of machine learning, natural
language processing, and traditional analytics, we can develop more accurate, efficient, and
adaptable systems for combating unsolicited messages and safeguarding user privacy and security.
REFERENCES
3. Paras Sethi, Vaibhav Bhandari and Bhavna Kohli, "SMS spam detection and comparison of various
machine learning algorithms", International Conference on Computing and Communication
Technologies for Smart Nation (IC3TSN), October, 2017.