
SPAM DETECTION IN TEXT USING

MACHINE LEARNING
A Project work submitted to
Department of Computer Science and Engineering
University College of Sciences
Acharya Nagarjuna University

In partial fulfillment of the requirements for


The award of the degree of

Master of Computer Applications

by

CHIKKUDU SRINIVASULU
Regd. No. Y23MC20009

Under the guidance of

Dr. R. VASANTHA, B.Tech., M.Tech., Ph.D.


Assistant Professor
Department of computer science & engineering
University College of Sciences
Acharya Nagarjuna University

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


UNIVERSITY COLLEGE OF SCIENCES
ACHARYA NAGARJUNA UNIVERSITY
Nagarjuna Nagar, Guntur,
Andhra Pradesh, India

April 2024
DECLARATION

I hereby declare that the entire thesis work entitled "SPAM DETECTION IN TEXT USING MACHINE
LEARNING", submitted to the Department of Computer Science and Engineering, University College of
Sciences, Acharya Nagarjuna University, in partial fulfillment of the requirements for the award of
the degree of Master of Computer Applications (MCA), is a bona fide work of my own, carried out
under the supervision of Dr. R. VASANTHA, Assistant Professor, Department of Computer Science &
Engineering, Acharya Nagarjuna University.

I further declare that the Project, either in part or in full, has not been submitted earlier by me
or others for the award of any degree in any University.

CHIKKUDU SRINIVASULU
Reg. No. Y23MC20009

ACHARYA NAGARJUNA UNIVERSITY
NAGARJUNA NAGAR, GUNTUR.
Department of Computer Science & Engineering.

CERTIFICATE

This is to certify that this project entitled “SPAM DETECTION IN TEXT USING MACHINE LEARNING”
is a bona fide record of the project work done and submitted by CH. SRINIVASULU (Y23MC20009)
during the year 2023 - 2024 in partial fulfillment of the requirements for the award of the degree
of Master of Computer Applications (MCA) in the Department of Computer Science & Engineering.
I certify that he carried out this project as an independent project under my guidance.

Head of the Department Project Guide


(Prof. K. Gangadhara Rao) (Dr. R. Vasantha)

External Examiner

ACKNOWLEDGEMENTS

Undertaking this Project has been a truly life-changing experience for me, and it would not have
been possible without the support and guidance that I received from many people.

I would first like to say a very big thank you to my supervisor, Dr. R. Vasantha, for all the
support and encouragement she gave me. Her friendly guidance and expert advice have been
invaluable throughout all stages of the work. Without her guidance and constant feedback, this
Project work would not have been achievable.

I also wish to express my gratitude to Prof. K. Gangadhara Rao for extended discussions and
valuable suggestions, which have contributed greatly to the improvement of the thesis.

I am thankful and fortunate to have received constant encouragement, support and guidance from all
the teaching staff of the Department, which helped me in successfully completing my project work.
I would also like to extend my sincere regards to all the non-teaching staff of the Department for
their timely support.

I must also thank my parents and friends for their immense support and help during this project.
Without their help, completing this project would have been very difficult.

ABSTRACT

SMS spam detection using the Naive Bayes algorithm is a widely used technique in the field of text
classification. The main aim of this approach is to classify incoming messages into spam or ham
categories. The Naive Bayes algorithm works by calculating the probability of a message belonging
to a particular class, based on the occurrence of different words in the message. In this project
work, we present an efficient and accurate approach for SMS spam detection using the Naive Bayes
algorithm. The proposed approach uses a pre-processing step for feature extraction, which includes
tokenization, stop-word removal, and stemming. The Naive Bayes algorithm is then trained on a
dataset of labeled messages to learn the probability distributions of different words in spam and
ham messages. Finally, the trained model is used to classify incoming messages into spam or ham
categories. The results of our experiments show that the proposed approach achieves high accuracy
in detecting SMS spam messages.

TABLE OF CONTENTS

DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

Chapter 1: Introduction
1.1 Statement
1.2 Objective

Chapter 2: Literature Survey
2.1 SMS Spam Detection Based on Long Short-Term Memory and Gated Recurrent Unit
2.2 SMS Spam Message Detection using Term Frequency-Inverse Document Frequency and Random Forest Algorithm
2.3 Spam Detection Approach for Secure Mobile Message Communication Using Machine Learning Algorithms
2.4 SMS Spam Detection using Machine Learning and Deep Learning Techniques

Chapter 3: Requirement Specifications
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.3 Technical Requirements
3.4 Algorithms
3.4.1 Naïve Bayesian
3.4.2 Random Forest
3.4.3 LSTM
3.4.4 Multi-layer Perceptron
3.4.5 Support Vector Machines
3.4.6 K-means
3.4.7 Decision Trees
3.4.8 Neural Networks
3.4.9 Python

Chapter 4: Methodology
4.1 Data Set
4.2 Data Preprocessing
4.3 Classification Model Evaluation

Chapter 5: System Design
5.1 Architecture Diagram
5.2 Data Flow Diagram
5.3 Use Case Diagram
5.4 Class Diagram
5.5 Sequence Diagram
5.6 Activity Diagram
5.7 State Flow Diagram

Chapter 6: Results And Discussions
6.1 Exploring Data
6.2 Data Processing
6.3 Model Evaluation

Chapter 7: Conclusions And Future Scope

References
LIST OF FIGURES

No. Title of the Figure

3.4.1 Applications of Naive Bayes
3.4.2 Random Forest
4.3.1 ROC Curve
4.3.2 Confusion Matrix
4.3.4.1 Accuracy
4.3.4.2 Precision
4.3.4.3 Recall
4.3.4.4 F-Measure / F1-Score
5.1 Architecture
5.2 Data Flow Diagram
5.3 Use Case Diagram
5.4 Class Diagram
5.5 Sequence Diagram
5.6 Activity Diagram
5.7 State Diagram
6.1 Training Set and Test Set
6.2 Sentences / Words Count
6.2.1 Comparison of Sentences / Word Count
6.3 Word Cloud
6.4 Comparison of Models
6.5 Voting Classifier
Spam Detection In Text Using Machine Learning

CHAPTER 1

INTRODUCTION


1 INTRODUCTION

SMS has become a popular medium for communication with the widespread use of mobile phones.

However, this convenience has also led to the increase in the number of SMS spam messages,

which can be annoying and potentially harmful. SMS spam messages can be used for phishing

attacks, identity theft, and other malicious activities. Therefore, it is crucial to develop efficient

techniques for detecting and filtering out these spam messages. In recent years, machine learning

algorithms have been extensively used in the field of text classification for spam detection. Among

these algorithms, the Naive Bayes algorithm has gained popularity due to its simplicity and

effectiveness. The Naive Bayes algorithm is a probabilistic algorithm that calculates the probability

of a message belonging to a particular class, based on the occurrence of different words in the

message.

SMS spamming [2] [10] is extremely frustrating for users: many important and useful messages can
get lost among spam messages, and spam messages are also used to trick people or lure them into
purchasing services. As the overall use of mobile phones has grown, a new avenue for electronic
junk mail has opened up for disreputable marketers. These advertisers use text messages (SMS) to
target potential customers with unwanted advertising known as SMS spam. This kind of spam is
especially troublesome because, unlike email spam, many mobile phone users pay a fee for each SMS
received. Developing a classification algorithm [1] [11] that filters SMS spam would provide a
useful tool for mobile phone providers. Since naïve Bayes has been used successfully for email spam
detection [9], it seems likely that it could also be used to build an SMS spam classifier [7].
Compared with email spam [6][8], SMS spam poses additional challenges for automated filters. SMS
messages are often limited to 160 characters, which reduces the amount of text that can be used to
determine whether a message is ham or spam. People also regularly use shorthand notation and slang,
which makes it even harder to distinguish between ham and spam. We will test how well a simple
naïve Bayes classifier [4] handles these challenges.

We additionally build models to classify messages using the SVM algorithm and the maximum entropy
algorithm [3], and find that SVM gives the most accurate results, with accuracy up to 98%, followed
by the Naïve Bayes algorithm and then the maximum entropy algorithm. Spam messages can be described
as unwanted messages sent to a large number of people at once. The rise of spam messages is based
on the following factors: 1) the accessibility of cheap bulk SMS plans; 2) reliability (since the
message reaches the mobile phone user); 3) a low possibility of receiving responses from some
unaware recipients; 4) the ability to customize the message; and 5) free services.

As the Internet continues to grow in both size and importance, the quantity and impact of online

reviews continually increases. Reviews can influence people across a broad spectrum of industries,

but are particularly important in the realm of ecommerce, where comments and reviews regarding

products and services are often the most convenient, if not the only, way for a buyer to make a

decision on whether or not to buy them. Online reviews may be generated for a variety of reasons.

Often, in an effort to improve and enhance their businesses, online retailers and service

providers may ask their customers to provide feedback about their experience with the products or

services they have bought, and whether they were satisfied or not. Customers may also feel inclined

to review a product or service if they had an exceptionally good or bad experience with it. While

online reviews can be helpful, blind trust of these reviews is dangerous for both the seller and buyer.

Many look at online reviews before placing any online order; however, the reviews may be

poisoned or faked for profit or gain, thus any decision based on online reviews must be made

cautiously. Furthermore, business owners might give incentives to whoever writes good reviews

about their merchandise, or might pay someone to write bad reviews about their competitor’s

products or services. These fake reviews are considered review spam and can have a great impact in

the online marketplace due to the importance of reviews. Review spam can also negatively impact

businesses due to loss in consumer trust. The issue is severe enough to have attracted the attention

of mainstream media and governments. For example, the BBC and New York Times have reported

that “fake reviews are becoming a common problem on the Web, and a photography company was

recently subjected to hundreds of defamatory consumer reviews” [1]. In 2014, the Canadian

Government issued a warning “encouraging consumers to be wary of fake online endorsements that

give the impression that they have been made by ordinary consumers” and estimated that a third of

all online reviews were fake. As review spam is a pervasive and damaging problem, developing

methods to help businesses and consumers distinguish truthful reviews from fake ones is an

important, but challenging problem.

In this project work, we propose an efficient and accurate approach for SMS spam detection using

the Naive Bayes algorithm. Our approach includes a pre-processing step for feature extraction,

which involves tokenization, stop-word removal, and stemming. The Naive Bayes algorithm is then

trained on a labeled dataset of messages to learn the probability distributions of different words in

spam and ham messages. Finally, the trained model is used to classify incoming messages into spam

or ham categories.

1.1 Statement

SMS spam is a real and growing problem, largely due to the availability of very cheap bulk pre-pay
SMS packages and the fact that SMS stimulates higher response rates because it is a trusted and
personal service. The Short Messaging Service (SMS) mobile communication system is attractive to
criminal gangs for a number of reasons: it is an easy-to-use, fast, reliable and affordable
technology (Delany, Buckley, & Greene, 2012). The lack of a unifying model is perceived as a
hindrance to the further development of the field of machine learning, especially in SMS spam
detection. Many of the approaches proposed, regardless of their effectiveness, focus on a specific
aspect or language; most of them lack an integrated approach and are not exhaustive.

1.2 Objective

The main objective of this research is to evaluate a machine learning SMS spam detection model.

The other objectives are:

• To develop a spam detection model that can be used to detect spam messages in Kenya.

• To demonstrate the use of machine learning in classifying messages as either spam or not.

• To test the machine learning model through the use of a prototype.


CHAPTER 2

LITERATURE SURVEY


2 LITERATURE SURVEY

2.1 SMS Spam Detection Based on Long Short-Term Memory and Gated

Recurrent Unit

An SMS spam is a message that hackers create and send to people via mobile devices in order to
obtain their important information. If unsuspecting people follow the instructions in the message
and enter important information, such as internet banking credentials, into a fake website or
application, the hacker may obtain that information, which may lead to financial loss. Efficient
spam detection is therefore an important tool for helping people decide whether an SMS is spam or
not. In this research, we propose a novel SMS spam detection method based on a case study of SMS
spam in the English language using Natural Language Processing and Deep Learning techniques. To
prepare the data for our model development process, we use word tokenization, padding, truncation
and word embedding to add dimensionality to the data. This data is then used to develop models
based on the Long Short-Term Memory and Gated Recurrent Unit algorithms. The performance of the
proposed models is compared to that of models based on machine learning algorithms including
Support Vector Machine and Naïve Bayes. The experimental results show that the model built with the
Long Short-Term Memory technique provides the best overall accuracy, as high as 98.18%. In
screening spam messages, this model detects spam messages with a 90.96% accuracy rate, while it
misclassifies a normal message as a spam message only 0.74% of the time.


2.2 SMS Spam Message Detection using Term Frequency-Inverse Document

Frequency and Random Forest Algorithm

The daily traffic of the Short Message Service (SMS) keeps increasing. As a result, there has been
a dramatic increase in mobile attacks, such as spammers who plague the service with spam messages
sent to groups of recipients. Mobile spam is a growing problem, as the number of spam messages
keeps increasing day by day even with filtering systems in place. Spam is defined as unsolicited
bulk messages in various forms, such as unwanted advertisements, credit opportunities or fake
lottery winner notifications. Spam classification has become more challenging due to the complexity
of the messages crafted by spammers. Hence, various methods have been developed to filter spam. In
this study, term frequency-inverse document frequency (TF-IDF) and the Random Forest algorithm are
applied to an SMS spam message data collection. Based on the experiment, the Random Forest
algorithm outperforms other algorithms with an accuracy of 97.50%.

2.3 Spam Detection Approach for Secure Mobile Message Communication

Using Machine Learning Algorithms

Spam is a major issue in mobile message communication because it makes that communication
insecure. To tackle this problem, an accurate and precise method is needed to detect spam in mobile
message communication. We propose the application of a machine learning-based spam detection method
for accurate detection. In this technique, machine learning classifiers such as logistic regression
(LR), K-nearest neighbor (K-NN), and decision tree (DT) are used to classify ham and spam messages
in mobile device communication. The SMS spam collection data set is used for testing the method.
The dataset is split into two subsets for training and testing. The results of the experiments
demonstrate that the classification performance of LR is high compared with K-NN and DT, with LR
achieving a high accuracy of 99%. Additionally, the proposed method performs well compared with
existing state-of-the-art methods.

2.4 SMS Spam Detection using Machine Learning and Deep Learning Techniques

The number of people using mobile devices is increasing day by day. SMS (short message service) is
a text message service available on smartphones as well as basic phones. As a result, SMS traffic
has increased drastically, and spam messages have increased with it. Spammers send spam messages
for financial or business benefit, such as market growth, lottery ticket information, credit card
information, etc. Spam classification has therefore received special attention. In this paper, we
applied various machine learning and deep learning techniques for SMS spam detection. We used a
dataset from UCI to build a spam detection model. Our experimental results show that our LSTM model
outperforms previous models in spam detection with an accuracy of 98.5%. We used Python for all
implementations.

2.5 SMS Spam Detection using Machine Learning Approach

Over recent years, as the popularity of mobile phone devices has increased, Short Message Service

(SMS) has grown into a multi-billion dollar industry. At the same time, reduction in the cost of

messaging services has resulted in growth in unsolicited commercial advertisements (spams) being

sent to mobile phones. In parts of Asia, up to 30% of text messages were spam in 2012. Lack of real

databases for SMS spams, short length of messages and limited features, and their informal

language are the factors that may cause the established email filtering algorithms to underperform in

their classification. In this project, a database of real SMS Spams from UCI Machine Learning

repository is used, and after preprocessing and feature extraction, different machine learning

techniques are applied to the database. Finally, the results are compared and the best algorithm for

spam filtering for text messaging is introduced. Final simulation results using 10-fold

cross-validation show that the best classifier in this work reduces the overall error rate of the

best model in the original paper citing this dataset by more than half.


CHAPTER 3
SYSTEM REQUIREMENTS


3 SYSTEM REQUIREMENTS

Requirements are the basic constraints that must be met to develop a system. They are collected
while designing the system.

The following requirements are discussed:

1. Functional requirements

2. Non-Functional requirements

3. Technical requirements

➢ Hardware requirements

➢ Software requirements

3.1 Functional Requirements

The Functional Requirements section of our SMS spam detection project outlines the fundamental

functionalities crucial for the successful implementation of our system. To achieve our goal of

effectively detecting and filtering spam messages, we rely on a combination of specialized libraries

and modules tailored to the unique requirements of text classification tasks.

3.2 Non-Functional Requirements

➢ Process of functional steps:

I. Problem definition

II. Data preparation

III. Algorithm evaluation

IV. Result improvement

V. Result prediction


3.3 Technical Requirements

➢ Software requirements:
✓ Operating System: Windows
✓ Tool: Anaconda with Jupyter Notebook
➢ Hardware requirements:
✓ Processor: Pentium IV/III
✓ Hard disk: minimum 80 GB
✓ RAM: minimum 2 GB

4.4 Functional Requirements:

➢ Data Collection Module:

Implements methods to gather a diverse dataset of SMS messages, including both spam and legitimate
(ham) messages.

Utilizes techniques such as web scraping, API integration, or dataset acquisition from reliable
sources.

Ensures the collected dataset is representative of real-world SMS messages and includes a balanced
distribution of spam and ham.

➢ Preprocessing Module:

Cleans and preprocesses the collected SMS messages by removing noise, special characters, and
irrelevant information.

Performs tasks like tokenization, stop-word removal, and stemming to prepare the text data for
analysis.

Handles common challenges such as misspellings, abbreviations, and variations in message
formatting.
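
The preprocessing tasks listed for this module can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: the stop-word list `STOP_WORDS` is a tiny hypothetical sample, and `simple_stem` is a crude suffix stripper that only stands in for a real stemmer such as the Porter stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "have", "and", "of", "in"}

# Suffixes removed by the crude stemmer below, longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(message):
    """Lower-case the message and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", message.lower())


def simple_stem(token):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(message):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(message)
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("You have WON a free prize! Claiming is easy."))
# ['won', 'free', 'prize', 'claim', 'easy']
```

A real system would normally take its stop-word list and stemmer from an NLP library; the point here is only the order of the three steps.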

➢ Feature Extraction Module:

Extracts relevant features from the preprocessed SMS messages, such as word frequencies, n-grams,
and syntactic features.

Utilizes techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings for
feature representation.

Considers additional features such as message length, presence of specific keywords, or linguistic
patterns indicative of spam.
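
To make the TF-IDF representation concrete, the following sketch computes the weights by hand using the unsmoothed textbook definition (term frequency times log of inverse document frequency). The function name `tf_idf` and the three toy documents are assumptions made for illustration; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant of idf by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(N / document frequency), the unsmoothed textbook form
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized messages, chosen only to exercise the function.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "entry", "win", "prize"],
]
for vector in tf_idf(docs):
    print(vector)
```

A term that occurs in every document gets weight zero under this definition, which is why very common words contribute little to the feature vector.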

➢ Model Training and Classification Module:

Trains machine learning models, such as Naive Bayes, Support Vector Machines (SVM), or neural
networks, using the extracted features.

Evaluates the trained models using appropriate metrics like accuracy, precision, recall, and
F1-score.

Implements techniques for model selection, hyperparameter tuning, and cross-validation to optimize
performance.

Enables continuous learning and adaptation of the model to evolving patterns of spam messages.
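
The evaluation metrics named in this module can be computed directly from the counts of true and false positives and negatives. The sketch below treats "spam" as the positive class; the function name `classification_metrics` and the label lists are hypothetical and are not results from this project.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, chosen only to exercise the function.
y_true = ["spam", "ham", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
print(classification_metrics(y_true, y_pred))
```

For spam filtering, precision matters because a false positive means a legitimate message is hidden from the user, while recall measures how much spam gets through.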

➢ Integration with Deployment Environment:

Ensures seamless integration of the trained model into the deployment environment, such as a web
application or API.

Facilitates real-time or batch processing of incoming SMS messages for classification.

Provides robust error handling and logging mechanisms to monitor system performance and
troubleshoot issues.

5. Non-Functional Requirements

Non-functional requirements define the quality attributes and constraints of the software system.
They encompass aspects such as accessibility, availability, security, and performance.

• Accessibility:

Ensures that the system is accessible to users with disabilities by adhering to accessibility
standards such as WCAG (Web Content Accessibility Guidelines).

• Availability:

Guarantees high availability of the system, minimizing downtime and ensuring uninterrupted access
to users.

• Security:

Implements robust security measures to protect sensitive data and prevent unauthorized access,
ensuring compliance with privacy regulations.

• Backup and Disaster Recovery:

Establishes regular data backups and implements disaster recovery procedures to mitigate the risk
of data loss or system failure.

• Performance:

Measures the system's performance in terms of speed, responsiveness, and scalability. Ensures
efficient resource utilization and optimal performance under varying workloads.

• Interoperability:

Ensures seamless integration with existing systems and interoperability with external applications
and databases.

6. Performance Requirements

Performance requirements specify the expected behavior and performance metrics of the software

system.

• 6.1 Response Time:

Ensures that the system responds promptly to user interactions, with minimal latency.

• 6.2 Throughput:

Maintains high throughput to handle concurrent requests and process a large volume of message data
efficiently.

• 6.3 Scalability:

Scales horizontally and vertically to accommodate increasing data volumes and user traffic.

• 6.4 Resource Consumption:

Optimizes resource consumption, including CPU, memory, and storage, to ensure efficient utilization
and cost-effectiveness.

7. Feasibility Study

The feasibility study serves as a comprehensive evaluation of the project's feasibility, encompassing

technical, operational, and economic considerations. It examines the project's technical feasibility

by assessing the availability of necessary resources, technology readiness, and compatibility with

existing systems. Furthermore, it delves into operational feasibility, evaluating the project's

alignment with organizational objectives, potential impact on workflows, and stakeholders'

readiness for adoption. Lastly, the economic feasibility analysis explores the project's financial

viability, including cost estimation, return on investment projections, and potential revenue streams.

• 7.1 Technical Feasibility:

Determines whether the proposed technology and infrastructure can support the project requirements.

Evaluates the availability of suitable algorithms, frameworks, and computing resources for text
classification tasks.

• 7.2 Operational Feasibility:


Assesses the usability, reliability, and security of the system from an operational standpoint.

Ensures that the system aligns with user expectations and can seamlessly integrate into existing

workflows.

• 7.3 Economic Feasibility:

Analyzes the cost-effectiveness of the project, considering factors such as development costs,
infrastructure requirements, and potential return on investment. Evaluates the long-term
sustainability and financial viability of deploying the system in real-world settings.

By addressing these functional, non-functional, performance, and feasibility aspects, the project
aims to develop a robust and effective solution for detecting SMS spam using machine learning.

3.4 Algorithms

3.4.1 Naïve Bayesian

It is a classification technique based on Bayes’ Theorem with an independence assumption among

predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular

feature in a class is unrelated to the presence of any other feature.

The Naïve Bayes classifier is a popular supervised machine learning algorithm used for

classification tasks such as text classification. It belongs to the family of generative learning

algorithms, which means that it models the distribution of inputs for a given class or category. This


approach is based on the assumption that the features of the input data are conditionally

independent given the class, allowing the algorithm to make predictions quickly and accurately.

In statistics, naive Bayes classifiers are considered as simple probabilistic classifiers that apply

Bayes’ theorem. This theorem is based on the probability of a hypothesis, given the data and some

prior knowledge. The naive Bayes classifier assumes that all features in the input data are

independent of each other, which is often not true in real-world scenarios. However, despite this

simplifying assumption, the naive Bayes classifier is widely used because of its efficiency and good

performance in many real-world applications.

Moreover, it is worth noting that naive Bayes classifiers are among the simplest Bayesian network

models, yet they can achieve high accuracy levels when coupled with kernel density estimation.

This technique involves using a kernel function to estimate the probability density function of the

input data, allowing the classifier to improve its performance in complex scenarios where the data

distribution is not well-defined. As a result, the naive Bayes classifier is a powerful tool in machine

learning, particularly in text classification, spam filtering, and sentiment analysis, among others.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in

diameter. Even if these features depend on each other or upon the existence of the other features, all

of these properties independently contribute to the probability that this fruit is an apple and that is

why it is known as ‘Naive’.

An NB model is easy to build and particularly useful for very large data sets. Along with its simplicity,

Naive Bayes can sometimes match or even outperform more sophisticated classification methods.

Bayes' theorem provides a way of computing the posterior probability P(c|x) from P(c), P(x) and P(x|c):

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

In this equation,

P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).

P(c) is the prior probability of class.

P(x|c) is the likelihood which is the probability of the predictor given class.

P(x) is the prior probability of the predictor.
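
As a worked example of these four quantities, the sketch below applies Bayes' theorem to a single word. All the probabilities are hypothetical numbers chosen for easy arithmetic, not values measured in this project: 20% of messages are spam, the word "free" appears in 60% of spam messages and in 5% of ham messages. The evidence P(x) is obtained with the law of total probability.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x


# Hypothetical probabilities (assumptions for illustration only).
p_spam = 0.2             # P(c): prior probability that a message is spam
p_free_given_spam = 0.6  # P(x|c): "free" appears in a spam message
p_free_given_ham = 0.05  # P(x|not c): "free" appears in a ham message

# P(x): total probability of seeing "free" in any message.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 2))  # 0.75
```

Under these assumed numbers, seeing the word "free" raises the probability of spam from the prior of 0.2 to a posterior of 0.75.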

Advantages of a Naive Bayes Classifier

Here are some advantages of the Naive Bayes Classifier:

• It doesn’t require large amounts of training data.

• It is straightforward to implement.

• It typically converges faster than discriminative models.

• It scales well with the number of data points and predictors.

• It can handle both continuous and categorical data.

• It is relatively insensitive to irrelevant features and often performs well even when its
independence assumption does not strictly hold.

• It is fast enough to be used for real-time predictions.

Disadvantages of a Naive Bayes Classifier

The disadvantages of the Naive Bayes Classifier are as below:

• The Naive Bayes Algorithm has trouble with the ‘zero-frequency problem’. It occurs when a
category (for example, a word) appears in the test data but was never observed with a given class
in the training dataset, so the model assigns it zero probability. A smoothing technique such as
Laplace smoothing overcomes this problem.

• It assumes that all the attributes are independent, which rarely happens in real life. This
limits the application of the algorithm in real-world situations.

• Its probability estimates can be inaccurate, so its probability outputs should not be taken too
literally.
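
The sketch below shows how a word-count Naive Bayes classifier with Laplace (add-one) smoothing can be written in plain Python, including how smoothing avoids the zero-frequency problem. It is an illustrative sketch, not the project's implementation: the class name `NaiveBayesText`, its methods, and the four training messages are all hypothetical. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over token lists with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # alpha = 1.0 is Laplace smoothing; must be > 0

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def word_probability(self, word, label):
        """Smoothed P(word | label); never zero, even for unseen words."""
        total = sum(self.word_counts[label].values())
        count = self.word_counts[label][word]
        return (count + self.alpha) / (total + self.alpha * len(self.vocabulary))

    def predict(self, tokens):
        """Return the label with the highest log posterior score."""
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / n_docs)  # log prior
            for word in tokens:
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log(self.word_probability(word, label))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training messages, already tokenized.
train_docs = [["free", "prize", "call"], ["win", "free", "cash"],
              ["call", "me", "later"], ["see", "you", "later"]]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict(["free", "cash"]))           # spam
print(model.word_probability("free", "ham"))     # 1/15, not zero
```

The word "free" never occurs in a ham message in this toy data, yet its smoothed probability under ham is 1/15 rather than zero, so a single unseen word cannot force the whole product to zero.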

Applications that use Naive Bayes

The Naive Bayes Algorithm is used for various real-world problems like those below:

Text classification: The Naive Bayes Algorithm is used as a probabilistic learning technique for text classification. It is one of the best-known algorithms used for document classification into one or many classes.

Sentiment analysis: The Naive Bayes Algorithm is used to analyze sentiments or feelings, whether

positive, neutral, or negative.


Recommendation system: The Naive Bayes Algorithm can be combined with collaborative filtering to build hybrid recommendation systems that help predict whether a user would like a given resource.

Spam filtering: It is also similar to the text classification process. It is popular for helping you

determine if the mail you receive is spam.

Medical diagnosis: This algorithm is used in medical diagnosis and helps you to predict the patient’s

risk level for certain diseases.

Weather prediction: You can use this algorithm to predict whether the weather will be good.

Face recognition: This helps you identify faces.

3.4.2 Random Forest

Random Forest is a famous machine learning algorithm that uses supervised learning methods. You

can apply it to both classification and regression problems. It is based on ensemble learning, which

integrates multiple classifiers to solve a complex issue and increases the model's performance.

In layman's terms, Random Forest is a classifier that builds several decision trees on various subsets of a given dataset and combines their outputs to improve predictive accuracy. Instead of relying on a single decision tree, the random forest collects the result from each tree and predicts the final output based on the majority vote of those predictions.

Working of Random Forest Algorithm

The working of the Random Forest Algorithm is quite intuitive. It is implemented in two phases: the first is to build the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.

The following steps can be used to demonstrate the working process:


Step 1: Pick M data points at random from the training set.

Step 2: Create decision trees for your chosen data points (Subsets).

Step 3: Each decision tree will produce a result. Analyze it.

Step 4: The final output is based on Majority Voting for classification or Averaging for regression.
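The four steps above can be sketched with the standard library alone. This is a deliberately simplified illustration, not the implementation used in this project: each "tree" is a one-split decision stump on a single feature, the toy data set is assumed, and the names `train_stump`, `train_forest`, and `predict` are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split 'tree': predict 1 when x > threshold, else 0."""
    xs = sorted({x for x, _ in sample})
    candidates = [xs[0] - 1] + xs  # include a threshold below every point

    def errors(t):
        return sum((x > t) != bool(y) for x, y in sample)

    return min(candidates, key=errors)

def train_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: draw data points at random (with replacement).
        sample = [rng.choice(data) for _ in data]
        # Step 2: build one tree on the chosen subset.
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    # Step 3: collect the result of each tree.
    votes = [int(x > t) for t in forest]
    # Step 4: majority voting for classification.
    return Counter(votes).most_common(1)[0][0]

# Toy 1-D data set (assumed): small values are class 0, large values class 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (10, 1), (11, 1), (12, 1), (13, 1)]
forest = train_forest(data)
print(predict(forest, 2), predict(forest, 12))
```

A full random forest would also pick a random subset of features at each split and grow deeper trees; for regression, Step 4 would average the tree outputs instead of voting.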

The flowchart below will help you understand better:

Before understanding the working of the random forest algorithm in machine learning, we must

look into the ensemble learning technique. Ensemble simply means combining multiple models.

Thus a collection of models is used to make predictions rather than an individual model. Ensemble

uses two types of methods:

Bagging

It creates different training subsets by sampling the training data with replacement, and the final output is based on majority voting. For example, Random Forest.


Boosting

It combines weak learners into strong learners by creating sequential models such that the final

model has the highest accuracy. For example, AdaBoost and XGBoost.

3.4.3 LSTM

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that can retain long-term dependencies in sequential data. LSTMs are able to process and analyze sequential data, such as time series, text, and speech. They use a memory cell and gates to control the flow of information, allowing them to selectively retain or discard information as needed and thus mitigate the vanishing gradient problem that plagues traditional RNNs. LSTMs are widely used in various

applications such as natural language processing, speech recognition, and time series forecasting.

Types of Gates in LSTM

There are three types of gates in an LSTM: the input gate, the forget gate, and the output gate.

The input gate controls the flow of information into the memory cell. The forget gate controls which information is discarded from the memory cell. The output gate controls the flow of information out of the LSTM and into the output.

All three gates (input, forget, and output) are implemented using sigmoid functions, which produce an output between 0 and 1. The gates are trained by backpropagating errors through the network.

The input gate decides which information to store in the memory cell. It is trained to open when the

input is important and close when it is not.

The forget gate decides which information to discard from the memory cell. It is trained to output values near 0 for information that is no longer important (discarding it) and values near 1 for information that should be kept.


The output gate is responsible for deciding which information to use for the output of the LSTM. It

is trained to open when the information is important and close when it is not.

The gates in an LSTM are trained to open and close based on the input and the previous hidden

state. This allows the LSTM to selectively retain or discard information, making it more effective at

capturing long-term dependencies.
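The interplay of the three gates can be made concrete with a single time step of a scalar LSTM cell. This is a minimal sketch of the commonly used LSTM update equations, assuming one input value and one hidden unit; the function name `lstm_step`, the parameter layout, and the all-zero weights are illustrative assumptions, not values from a trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    p maps each of 'i', 'f', 'o', 'g' to (w_x, w_h, b): the weight on the
    input, the weight on the previous hidden state, and the bias.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate: how much new content to store
    f = sigmoid(pre("f"))    # forget gate: how much old memory to keep
    o = sigmoid(pre("o"))    # output gate: how much memory to expose
    g = math.tanh(pre("g"))  # candidate content for the memory cell

    c = f * c_prev + i * g   # updated memory cell
    h = o * math.tanh(c)     # new hidden state / output
    return h, c

# Hypothetical parameters: all zeros, so every gate is exactly half open.
params = {name: (0.0, 0.0, 0.0) for name in "ifog"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=params)
print(h, c)
```

Each gate depends on the current input and the previous hidden state, which is what lets training adjust when the cell stores, keeps, or exposes information.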

Structure of LSTM

An LSTM (Long Short-Term Memory) network is a type of recurrent neural network (RNN) that is

capable of handling and processing sequential data. The structure of an LSTM network consists of a

series of LSTM cells, each of which has a set of gates (input, output, and forget gates) that control

the flow of information into and out of the cell. The gates are used to selectively forget or retain

information from the previous time steps, allowing the LSTM to maintain long-term dependencies

in the input data.

The LSTM cell also has a memory cell that stores information from previous time steps and uses it

to influence the output of the cell at the current time step. The output of each LSTM cell is passed

to the next cell in the network, allowing the LSTM to process and analyze sequential data over

multiple time steps.

Applications of LSTM
Long Short-Term Memory (LSTM) is a highly effective Recurrent Neural Network (RNN) that has

been utilized in various applications. Here are a few well-known LSTM applications:

Language Modeling: LSTMs have been utilized for natural

language processing tasks such as machine translation, language modeling, and text summarization.

By understanding the relationships between words in a sentence, they can be trained to construct

meaningful and grammatically correct sentences.


Voice Recognition: LSTMs have been utilized for speech recognition tasks such as speech-to-text transcription and command recognition. They can be trained to recognize patterns in speech and match them to the appropriate text.

Sentiment Analysis: LSTMs can be used to classify text sentiment as positive, negative, or neutral

by learning the relationships between words and their associated sentiments.

Time Series Prediction: LSTMs can be used to predict future values in a time series by learning the

relationships between past values and future values.

Video Analysis: LSTMs can be used to analyze video by learning the relationships between frames

and their associated actions, objects, and scenes.

Handwriting Recognition: LSTMs can be used to recognize handwriting by learning the

relationships between images of handwriting and the corresponding text.

3.4.4 Multi-layer Perceptron:

This is a classifier in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training. Other well-known algorithms are based on the notion of the perceptron (Tapas Kanungo, D. M., 2002). The perceptron algorithm learns from a batch of training instances by running repeatedly through the training set until it finds a prediction vector that is correct on all of the training set. This prediction rule is then used for predicting the labels on the test set (Neocleous C., 2002).
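The perceptron procedure described above, repeated passes over the training set until every instance is classified correctly, can be sketched as follows. This is a minimal single-layer illustration on an assumed toy data set (logical AND); the names `train_perceptron` and `predict` and the `max_epochs` safeguard are hypothetical.

```python
def train_perceptron(samples, max_epochs=100):
    """Repeat passes over the training set until every example is correct."""
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                # Move the weights toward (y=1) or away from (y=0) the input.
                step = y - pred
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Logical AND: a small linearly separable training set (illustrative).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])
```

The loop is only guaranteed to stop with zero mistakes when the classes are linearly separable, which is why the sketch caps the number of passes.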

3.4.5 Support Vector Machines (SVMs):

SVMs are among the more recent supervised machine learning techniques. Support Vector Machine (SVM) models are closely related to classical multilayer perceptron neural networks. SVMs revolve around the notion of a “margin” on either side of a hyperplane that separates two data classes. Maximizing the margin, and thereby creating the largest possible distance between the separating hyperplane and the instances on either side of it, has been proven to reduce an upper bound on the expected generalization error.

3.4.6 K-means:

According to Nilsson, N.J. (2005), K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The K-means algorithm is employed when labeled data is not available.

Boosting, in contrast, is a general method of converting rough rules of thumb into a highly accurate prediction rule. Given a “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55%, with sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%.
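The K-means procedure described in this section, with k fixed in advance, can be sketched on one-dimensional data. This is a minimal illustration under stated assumptions: the first k points are used as initial centroids (real implementations usually choose them more carefully), and the function name `k_means_1d` and the toy points are hypothetical.

```python
def k_means_1d(points, k, max_iter=100):
    """Plain k-means on 1-D points; the first k points seed the centroids."""
    centroids = [float(p) for p in points[:k]]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids

# Toy unlabeled data with two visible groups (assumed for illustration).
points = [1, 2, 3, 10, 11, 12]
centroids = k_means_1d(points, k=2)
print(centroids)
```

No labels are used anywhere in the loop, which is why the method suits cases where labeled data is not available.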

3.4.7 Decision Trees:

Decision Trees (DT) are trees that classify instances by sorting them based on feature values. Each

node in a decision tree represents a feature in an instance to be classified and each branch represents

a value that the node can assume. Instances are classified starting at the root node and sorted based

on their feature values. Decision tree learning, used in data mining and machine learning, uses a

decision tree as a predictive model which maps observations about an item to conclusions about the

item's target value. More descriptive names for such tree models are classification trees or

regression trees. Decision tree classifiers usually employ post-pruning techniques that evaluate the

performance of decision trees, as they are pruned by using a validation set. Any node can be

removed and assigned the most common class of the training instances that are sorted to it.


3.4.8. Neural Networks:

Neural Networks (NN) can actually perform a number of regression and/or classification tasks

at once, although commonly each network performs only one. In the vast majority of cases,

therefore, the network will have a single output variable, although in the case of many-state

classification problems, this may correspond to a number of output units (the post-processing stage

takes care of the mapping from output units to output variables). An Artificial Neural Network (ANN)

depends upon three fundamental aspects, input and activation functions of the unit, network

architecture and the weight of each input connection. Given that the first two aspects are fixed; the

behavior of the ANN is defined by the current values of the weights. The weights of the net to be

trained are initially set to random values, and then instances of the training set are repeatedly

exposed to the net. The values for the input of an instance are placed on the input units and the

output of the net is compared with the desired output for this instance. Then, all the weights in the

net are adjusted slightly in the direction that would bring the output values of the net closer to the

values for the desired output. There are several algorithms with which a network can be trained (Lemnaru C., 2012).

3.4.9 Python

Python is a widely used general-purpose, high level programming language. It was created by

Guido van Rossum in 1991 and further developed by the Python Software Foundation. It was

designed with an emphasis on code readability, and its syntax allows programmers to express their

concepts in fewer lines of code.

Python is a programming language that lets you work quickly and integrate systems more

efficiently.

There are two major Python versions: Python 2 and Python 3. Both are quite different.


Features of Python

Interpreted

• There are no separate compilation and execution steps like C and C++.

• Directly run the program from the source code.

• Internally, Python converts the source code into an intermediate form called bytecode, which is then executed by the Python virtual machine on the specific computer.

• No need to worry about linking and loading with libraries, etc.

Platform Independent

• Python programs can be developed and executed on multiple operating system platforms.

• Python can be used on Linux, Windows, Macintosh, Solaris and many more.

• Free and Open Source; Redistributable

High-level Language

• In Python, there is no need to take care of low-level details such as managing the memory used by the program.

Simple

• Closer to the English language; easy to learn

• More emphasis on the solution to the problem rather than the syntax

Embeddable

• Python can be used within C/C++ program to give scripting capabilities for the program’s users.

Robust:

• Exception handling features

• Built-in memory management techniques


Rich Library Support

• The Python Standard Library is very vast.

• This is known as the “batteries included” philosophy of Python. It can help do various things involving

regular expressions, documentation generation, unit testing, threading, databases, web browsers,

CGI, email, XML, HTML, WAV files, cryptography, GUI and many more.

• Besides the standard library, there are various other high-quality libraries such as the Python

Imaging Library which is an amazingly simple image manipulation library.

3.4.4.1 Pandas

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was

created by Wes McKinney in 2008.

Pandas allows us to analyze big data and make conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Pandas gives you answers about the data. Like:

Is there a correlation between two or more columns?

What is the average value?

Max value?

Min value?

Pandas is also able to delete rows that are not relevant or that contain wrong values, like empty or

NULL values. This is called cleaning the data.


3.4.4.2 Numpy

NumPy is a Python library used for working with arrays.

It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it

freely.

NumPy stands for Numerical Python.

In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims

to provide an array object that is up to 50x faster than traditional Python lists.

The array object in NumPy is called ndarray; it provides a lot of supporting functions that make

working with ndarray very easy.

Arrays are very frequently used in data science, where speed and resources are very important.

NumPy is a Python library and is written partially in Python, but most of the parts that require fast

computation are written in C or C++.

3.4.4.3 matplotlib

Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.

Matplotlib was created by John D. Hunter.

Matplotlib is open source and we can use it freely.

Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript

for Platform compatibility.

Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a

multi-platform data visualization library built on NumPy arrays and designed to work with the

broader SciPy stack. It was introduced by John Hunter in the year 2002. One of the greatest benefits


of visualization is that it allows us visual access to huge amounts of data in easily digestible visuals.

Matplotlib supports several plot types, such as line, bar, scatter, and histogram plots.

3.4.4.4 seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface

for drawing attractive and informative statistical graphics.

seaborn is a library for making statistical graphics in Python. It provides a high-level interface to

matplotlib and integrates closely with pandas data structures. Functions in the seaborn library

expose a declarative, dataset-oriented API that makes it easy to translate questions about data into

graphics that can answer them. When given a dataset and a specification of the plot to make,

seaborn automatically maps the data values to visual attributes such as color, size, or style,

internally computes statistical transformations, and decorates the plot with informative axis labels

and a legend. Many seaborn functions can generate figures with multiple panels that elicit

comparisons between conditional subsets of data or across different pairings of variables in a

dataset. seaborn is designed to be useful throughout the lifecycle of a scientific project. By

producing complete graphics from a single function call with minimal arguments, seaborn

facilitates rapid prototyping and exploratory data analysis. And by offering extensive options for

customization, along with exposing the underlying matplotlib objects, it can be used to create

polished, publication-quality figures.

3.4.4.5 tensorflow

TensorFlow is an open-source library for fast numerical computing.

It was created and is maintained by Google and was released under the Apache 2.0 open source

license. The API is nominally for the Python programming language, although there is access to the

underlying C++ API.


Unlike other numerical libraries intended for use in Deep Learning like Theano, TensorFlow was

designed for use both in research and development and in production systems, not least of which is

RankBrain in Google search and the fun DeepDream project.

It can run on single CPU systems and GPUs, as well as mobile devices and large-scale distributed

systems of hundreds of machines.

3.4.4.6 keras

Keras runs on top of open source machine learning libraries like TensorFlow, Theano or Cognitive Toolkit

(CNTK). Theano is a python library used for fast numerical computation tasks. TensorFlow is the

most famous symbolic math library used for creating neural networks and deep learning models.

TensorFlow is very flexible and its primary benefit is distributed computing. CNTK is a deep learning framework developed by Microsoft. It can be used from languages such as Python, C#, and C++, or as a standalone machine learning toolkit. Theano and TensorFlow are very powerful libraries but can be difficult to use directly for creating neural networks.

Keras is based on minimal structure that provides a clean and easy way to create deep learning

models based on TensorFlow or Theano. Keras is designed to quickly define deep learning models.

Keras is therefore a good choice for deep learning applications.

Features

Keras leverages various optimization techniques to make its high-level neural network API easier to use and more performant. It supports the following features:

• Consistent, simple and extensible API.

• Minimal structure - easy to achieve the result without any frills.

• It supports multiple platforms and backends.

• It is a user-friendly framework which runs on both CPU and GPU.


• High scalability of computation.

Benefits

Keras is a highly powerful and dynamic framework and comes with the following advantages:

• Large community support.

• Easy to test.

• Keras neural networks are written in Python which makes things simpler.

• Keras supports both convolution and recurrent networks.

• Deep learning models are discrete components, so you can combine them in many ways.


CHAPTER 4
METHODOLOGY


4.1 Data set

The public dataset of labelled SMS messages originates from the UCI Machine Learning Repository. The copy used in the current research is available on Kaggle. The dataset contains 5,574 labelled messages, of which 4,827 are ham messages and the other 747 are spam messages. It consists of two named columns, the message label (ham or spam) followed by the text of the message, and three unnamed columns.

It’s time for a data analyst to pick up the baton and lead the way to machine learning implementation. The job of a data analyst is to find ways and sources of collecting relevant and comprehensive data, interpreting it, and analyzing results with the help of statistical techniques. The type of data depends on what you want to predict. There is no exact answer to the question “How much data is needed?” because each machine learning problem is unique. In turn, the number of attributes data scientists will use when building a predictive model depends on the attributes’ predictive value.

The “the more, the better” approach is reasonable for this phase. Some data scientists suggest considering that less than one-third of collected data may be useful. It’s difficult to estimate which part of the data will provide the most accurate results until the model training begins. That’s why it’s important to collect and store all data, internal and open, structured and unstructured.


4.2 Data Preprocessing

The purpose of preprocessing is to convert raw data into a form that fits machine learning.

Structured and clean data allows a data scientist to get more precise results from an applied machine

learning model. The technique includes data formatting, cleaning, and sampling.

 4.2.1 Data formatting: The importance of data formatting grows when data is acquired from

various sources by different people. The first task for a data scientist is to standardize record

formats. A specialist checks whether variables representing each attribute are recorded in the same

way. Titles of products and services, prices, date formats, and addresses are examples of variables.

The principle of data consistency also applies to attributes represented by numeric ranges.

 4.2.2 Data cleaning: This set of procedures allows for removing noise and fixing inconsistencies

in data. A data scientist can fill in missing data using imputation techniques, e.g. substituting

missing values with mean attributes. A specialist also detects outliers — observations that deviate

significantly from the rest of distribution. If an outlier indicates erroneous data, a data scientist

deletes or corrects them if possible.

This stage also includes removing incomplete and useless data objects.

 4.2.3 Data anonymization Sometimes a data scientist must anonymize or exclude attributes

representing sensitive information (i.e. when working with healthcare and banking data).

 4.2.4 Data sampling: Big datasets require more time and computational power for analysis. If a

dataset is too large, applying data sampling is the way to go. A data scientist uses this technique to

select a smaller but representative data sample to build and run models uch faster, and at the same

time to produce accurate outcomes.

Pre-processing is the first stage, in which the unstructured data is converted into more structured data; this matters because keywords in SMS text messages are prone to be replaced by symbols. In this study, a stop-word remover for the English language has been applied to eliminate the stop words in the SMS text messages.
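Stop-word removal of the kind described here can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the short stop-word list, the tokenization rule, and the function name `remove_stop_words` are assumptions made for the example.

```python
import re

# A small hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "have", "for", "of", "and"}

def remove_stop_words(message):
    """Lower-case the message, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("You have WON a prize! Call now to claim the reward."))
```

Lower-casing and stripping punctuation before filtering means that "WON" and "won" are treated as the same token, which keeps the vocabulary smaller for the later classification step.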

4.3 Classification model evaluation

4.3.1 ROC Curve

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the

diagnostic ability of a binary classifier system as its discrimination threshold is varied.

4.3.2 ROC Score

It's simply the value of the area under the ROC curve. The ROC AUC score shows how well the classifier

distinguishes positive and negative classes. It can take values from 0 to 1. A higher ROC AUC

indicates better performance. A perfect model would have an AUC of 1, while a random model

would have an AUC of 0.5.


The ROC is a probability curve and the AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1.
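One way to see what the AUC measures is to compute it directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses this pairwise definition on assumed toy labels and scores; the function name `roc_auc` is hypothetical, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (1 = spam) and classifier scores, assumed for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

A classifier that scores every spam message above every ham message gets an AUC of 1, while one that assigns the same score to everything gets 0.5, matching the interpretation given above.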

We can measure model accuracy by two methods. Accuracy simply means the number of values

correctly predicted.

• Confusion Matrix

• Classification Measure

4.3.3 Confusion Matrix

The confusion matrix is a fundamental tool for evaluating the performance of classification models.

It provides a comprehensive summary of the model's predictions compared to the actual ground

truth across different classes. The matrix is organized into rows and columns, with one axis representing the actual class labels and the other the predicted class labels.

The following 4 are the basic terminology which will help us in determining the metrics we are

looking for.

• True Positives (TP): when the actual value is Positive and the prediction is also Positive.

• True Negatives (TN): when the actual value is Negative and the prediction is also Negative.

• False Positives (FP): when the actual value is Negative but the prediction is Positive. Also known as the Type 1 error.

• False Negatives (FN): when the actual value is Positive but the prediction is Negative. Also known as the Type 2 error.

For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:


Confusion Matrix for the Binary Classification

The target variable has two values: Positive or Negative

The columns represent the actual values of the target variable

The rows represent the predicted values of the target variable

Let’s take an example:

We have a total of 20 cats and dogs and our model predicts whether it is a cat or not.

Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’,

‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]

Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’,

‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]


True Positive (TP) = 6

You predicted positive and it’s true. You predicted that an animal is a cat and it actually is.

True Negative (TN) = 11

You predicted negative and it’s true. You predicted that animal is not a cat and it actually is not (it’s

a dog).

False Positive (Type 1 Error) (FP) = 2

You predicted positive and it’s false. You predicted that animal is a cat but it actually is not (it’s a

dog).

False Negative (Type 2 Error) (FN) = 1

You predicted negative and it’s false. You predicted that animal is not a cat but it actually is.
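The four counts in this example can be reproduced with a short script. The sketch below uses the actual and predicted lists given above and treats "cat" as the positive class; the function name `confusion_counts` is hypothetical.

```python
actual = ["dog", "cat", "dog", "cat", "dog", "dog", "cat", "dog", "cat", "dog",
          "dog", "dog", "dog", "cat", "dog", "dog", "cat", "dog", "dog", "cat"]
predicted = ["dog", "dog", "dog", "cat", "dog", "dog", "cat", "cat", "cat", "cat",
             "dog", "dog", "dog", "cat", "dog", "dog", "cat", "dog", "dog", "cat"]

def confusion_counts(actual, predicted, positive="cat"):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted cat, really a cat
        elif p != positive and a != positive:
            tn += 1   # predicted not a cat, really a dog
        elif p == positive:
            fp += 1   # predicted cat, really a dog (Type 1 error)
        else:
            fn += 1   # predicted not a cat, really a cat (Type 2 error)
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)
```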


4.3.4. Classification Measure:

In addition to the confusion matrix, several performance metrics provide a comprehensive

assessment of model performance in classification tasks. These metrics offer valuable insights into

the strengths and limitations of the model, enabling a deeper understanding and analysis of its

effectiveness. Here's a brief overview of each metric:

• Accuracy: Accuracy measures the overall correctness of the model by calculating the ratio of

correctly predicted instances to the total number of instances. While accuracy provides a general

overview of model performance, it may not be suitable for imbalanced datasets.

• Precision: Precision quantifies the proportion of true positive predictions among all positive

predictions made by the model. It focuses on the accuracy of positive predictions and helps evaluate

the model's ability to minimize false positives.

• Recall (True Positive Rate, Sensitivity): Recall calculates the proportion of true positive

predictions among all actual positive instances in the dataset. It measures the model's ability to

capture all positive instances and is particularly important in scenarios where missing positive

instances can have significant consequences.

• F1-Score: The F1-Score is the harmonic mean of precision and recall, providing a balanced

measure of a model's performance. It considers both false positives and false negatives and is useful

for evaluating models in situations where there is an uneven class distribution or class imbalance.

• False Positive Rate (FPR, Type I Error): FPR measures the proportion of negative instances that

are incorrectly classified as positive by the model. It complements precision by focusing on the rate

of false positives and is essential in applications where minimizing false alarms is critical.

• False Negative Rate (FNR, Type II Error): FNR calculates the proportion of positive instances

that are incorrectly classified as negative by the model. It evaluates the model's ability to detect all


positive instances and is particularly relevant in scenarios where missing positive instances can

have serious consequences.

 4.3.4.1 Accuracy:

Accuracy simply measures how often the classifier makes the correct prediction. It is the ratio between the number of correct predictions and the total number of predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The accuracy metric is not suited for imbalanced classes. When a model predicts that every point belongs to the majority class, the accuracy will be high even though the model is not useful. Accuracy is a valid choice of evaluation metric for classification problems that are well balanced, with no skew or class imbalance.


4.3.4.2 Precision:
It is a measure of correctness that is achieved in true prediction. In simple words, it tells us how

many predictions are actually positive out of all the total positive predicted.

Precision is defined as the ratio of the number of correctly classified positive instances to the total number of predicted positive instances. In other words, out of all the instances predicted as positive, how many we predicted correctly. Precision should be high (ideally 1).

$$\text{Precision} = \frac{TP}{TP + FP}$$

“Precision is a useful metric in cases where False Positive is a higher concern than False Negatives”

Ex 1:- In spam detection, we need to focus on precision.

Suppose a mail is not spam but the model predicts it as spam: this is a False Positive (FP). We always try to reduce FP.

Ex 2:- Precision is important in music or video recommendation systems, e-commerce websites, etc.

Wrong results could lead to customer churn and be harmful to the business.


4.3.4.3. Recall:

Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.

Recall is defined as the ratio of the number of correctly classified positive instances to the total number of actual positive instances. In other words, out of all the actual positives, how many we have predicted correctly. Recall should be high (ideally 1). “Recall is a useful metric in cases where False Negative trumps False Positive”

$$\text{Recall} = \frac{TP}{TP + FN}$$

Ex 1:- Consider predicting whether a person has cancer or not. A False Negative occurs when the person is suffering from cancer but the model predicts that they are not.

Ex 2:- Recall is important in medical cases where it doesn’t matter whether we raise a false alarm

but the actual positive cases should not go undetected!

Recall would be a better metric because we don’t want to accidentally discharge an infected person

and let them mix with the healthy population thereby spreading contagious virus. Now you can

understand why accuracy was a bad metric for our model.

Trick to remember : Precision has Predictive Results in the denominator.


4.3.4.4 F-measure / F1-Score

The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall). We use the harmonic mean because, unlike a simple average, it is pulled toward the smaller of the two values, so a very high precision cannot mask a very low recall (or vice versa).

The F1 score maintains a balance between the precision and recall of the classifier. If the precision is low, the F1 score is low, and if the recall is low, the F1 score is again low. In cases where it is not clear whether precision or recall is more important, we combine them.

In practice, when we try to increase the precision of our model, the recall often goes down, and vice versa. The F1 score captures both trends in a single value.
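As a worked illustration of the three metrics, the following sketch computes precision, recall, and F1 from confusion-matrix counts for the spam class. The counts (90 true positives, 10 false positives, 30 false negatives) are made up purely for illustration and are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive (spam) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
tp, fp, fn = 90, 10, 30
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Here precision is 90/100 = 0.90 and recall is 90/120 = 0.75, so F1 is about 0.818, slightly below the arithmetic mean of 0.825 because the harmonic mean leans toward the lower value.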


CHAPTER 5
SYSTEM DESIGN


5 SYSTEM DESIGN

In the context of our SMS spam detection project, system design plays a crucial role in defining the

architecture and components required to achieve our objective effectively. System design involves

conceptualizing the structure of our SMS spam detection system, including its interface, modules,

and data flow.

In the initial stages of system design, we identify the key components of the system, such as the

message preprocessing module, the Naive Bayes classifier, and the classification output module.

Each component is designed to fulfill specific functionalities necessary for detecting SMS spam

messages accurately.

5.1 ARCHITECTURE DIAGRAM:


5.2 DATA FLOW DIAGRAM:

In the context of our SMS spam detection project, data flow diagrams (DFDs) serve as valuable

tools for visually illustrating how data moves through our system and how different components

interact with each other.

A logical data flow diagram for our project would depict the flow of SMS messages through various

stages of preprocessing, classification, and result generation. It would highlight the key processes

involved, such as tokenization, feature extraction, Naive Bayes classification, and the final

categorization of messages as spam or ham.


5.3 USE CASE DIAGRAM:

A use case diagram would provide a high-level overview of the various functionalities offered by

our system, the actors interacting with the system, and the relationships between these actors and

functionalities.

Actors in our use case diagram would represent different entities that interact with our system.

These could include end users, administrators, and external systems. Each actor has specific roles

and responsibilities within the system.

The functionalities provided by our SMS spam detection system would be represented as use cases. These use cases describe specific actions or services that the system offers to its users. Examples of use cases in our project may include:

• Submit SMS Message: Users submit SMS messages to the system for classification.

• View Classification Results: Users can view the classification results of their submitted SMS messages.

• Update Model: Administrators can update the Naive Bayes classification model with new training data.

• Generate Reports: The system can generate reports on spam detection accuracy, false positives, false negatives, etc.

The use case diagram would illustrate the relationships between actors and use cases. For example, end users may interact with the "Submit SMS Message" and "View Classification Results" use cases, while administrators may interact with additional use cases such as "Update Model" and "Generate Reports."

5.4 CLASS DIAGRAM:


The class diagram represents the static structure of the system, detailing the classes involved and their relationships. We have two main classes: Frontend and Backend.

Frontend Class:

• Attributes:

  ▪ Username: Represents the username of the user interacting with the system.

  ▪ Password: Represents the password associated with the user account for authentication.

  ▪ Other Attributes: These may include additional information related to user preferences, settings, or session data.

• Methods:

  ▪ AuthenticateUser(): Method to authenticate the user based on the provided username and password.

  ▪ SubmitSMSMessage(message): Method to submit an SMS message to the system for classification.

  ▪ ViewClassificationResults(): Method to view the classification results of submitted SMS messages.

Backend Class:

• Attributes:

  ▪ Dataset: Represents the dataset used for training the machine learning model.

  ▪ ML Model: Represents the trained machine learning model for classifying SMS messages.

  ▪ Splitting Data: Represents the functionality for splitting the dataset into training and testing sets.

  ▪ Other Attributes: These may include additional components or resources used in the backend processing.

• Methods:

  ▪ TrainModel(): Method to train the machine learning model using the provided dataset.

  ▪ ClassifySMSMessage(message): Method to classify an incoming SMS message as spam or ham using the trained model.

  ▪ SplitData(): Method to split the dataset into training and testing sets for model evaluation.

The class diagram visually depicts the structure of the system and the interactions between its components. The Frontend and Backend classes encapsulate their respective functionalities and attributes, providing a clear separation of concerns. The diagram helps developers understand the architecture of the system, facilitating the implementation and maintenance of the SMS spam detection application.
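The separation between the Frontend and Backend classes can be sketched in Python as follows. This is only a structural illustration under stated assumptions: method names are written in Python's snake_case style rather than as in the diagram, credentials are held in a plain dictionary, and `keyword_rule` is a hypothetical placeholder standing in for the trained model, not the classifier used in this project.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


def keyword_rule(message: str) -> str:
    """Hypothetical placeholder model: flags messages containing 'free'."""
    return "spam" if "free" in message.lower() else "ham"


@dataclass
class Backend:
    """Owns the model and the stored user data."""
    model: Callable[[str], str]                           # stands in for the ML Model attribute
    users: Dict[str, str] = field(default_factory=dict)   # username -> password (sketch only)

    def verify_credentials(self, username: str, password: str) -> bool:
        return self.users.get(username) == password

    def classify_sms_message(self, message: str) -> str:
        return self.model(message)


@dataclass
class Frontend:
    """Handles login and forwards messages to the Backend."""
    backend: Backend
    username: Optional[str] = None
    results: List[Tuple[str, str]] = field(default_factory=list)

    def authenticate_user(self, username: str, password: str) -> bool:
        if self.backend.verify_credentials(username, password):
            self.username = username
            return True
        return False

    def submit_sms_message(self, message: str) -> str:
        if self.username is None:
            raise PermissionError("log in before submitting messages")
        label = self.backend.classify_sms_message(message)
        self.results.append((message, label))
        return label

    def view_classification_results(self) -> List[Tuple[str, str]]:
        return list(self.results)


backend = Backend(model=keyword_rule, users={"alice": "secret"})
frontend = Frontend(backend)
frontend.authenticate_user("alice", "secret")
print(frontend.submit_sms_message("FREE entry, win a prize now"))
```

Passing the model in as a callable keeps the Frontend unaware of how classification is done, so the placeholder rule could be swapped for a trained Naive Bayes model without changing the Frontend.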

5.5 SEQUENCE DIAGRAM:


The sequence diagram illustrates the interaction between different entities and the sequence of

messages exchanged during a particular scenario, such as classifying an incoming SMS message.

Here's how the sequence diagram would look:

User Authentication Sequence:

• The sequence diagram begins with the user attempting to log in to the system by providing their username and password.

• The Frontend class sends an authentication request message to the Backend class.

• The Backend class receives the authentication request and verifies the provided credentials against the stored user data.

• If the credentials are valid, the Backend class sends a confirmation message back to the Frontend class.

• The Frontend class receives the confirmation and allows the user to access the system.

SMS Message Classification Sequence:

• The sequence starts with the Frontend class receiving an incoming SMS message from the user.

• The Frontend class sends the SMS message to the Backend class for classification.

• The Backend class receives the SMS message and invokes the ClassifySMSMessage() method to classify it as spam or ham.

• The trained ML Model within the Backend class processes the message.

Data Splitting Sequence (Optional):

• If training or testing data splitting is required, the sequence diagram may include a sequence for splitting the dataset.

• The Backend class initiates the SplitData() method to divide the dataset into training and testing sets.

• The dataset splitting process occurs internally within the Backend class, and the resulting datasets are used for training and evaluation.

5.6 ACTIVITY DIAGRAM:

The activity diagram illustrates the flow of activities and interactions within the system, depicting

how different components and processes interact to achieve specific functionalities. Here's how the

activity diagram would depict the functionality of our system:

System Initialization Activity:

• The diagram starts with the system initialization activity, representing the initialization of the SMS spam detection system.

• This activity involves initializing the necessary components, such as loading the machine learning model, setting up the database connection, and preparing the user interface.

User Authentication Activity:

• After initialization, the diagram shows the user authentication activity, where users are required to log in to the system.

• This activity includes the process of entering credentials, verifying them against the user database, and granting access upon successful authentication.

SMS Message Classification Activity:

• Upon successful authentication, the system proceeds to the SMS message classification activity.

Display Results Activity:

• After classification, the system displays the results of the classification process to the user.

• This activity involves presenting the classification results (spam or ham) along with any additional information, such as confidence scores or probabilities.

System Shutdown Activity:

• The diagram includes the system shutdown activity, representing the graceful shutdown of the SMS spam detection system.

• This activity involves closing connections, saving data, and releasing resources before terminating the system.


5.7 STATE FLOW DIAGRAM:


The state flow diagram contains the following states: stock data, data cleaning, preprocessing, data split, label encoder, model, accuracy, and results.

• Initial State (Creation of Object):

The diagram begins with the initial state, representing the creation of an object to handle the incoming SMS message. At this stage, the system initializes its components and resources to begin processing the SMS message.

• Stock Data State:

Upon receiving the SMS message, the system transitions to the stock data state. In this state, the system acquires the necessary data, which includes the content of the SMS message and any additional metadata associated with it.

• Data Cleaning State:

After obtaining the SMS message data, the system proceeds to the data cleaning state. Here, the system cleans the raw SMS message content, which may involve tasks such as removing special characters, correcting spelling errors, and handling formatting issues.

• Preprocessing State:

The system extracts relevant features from the cleaned SMS message data, preparing it for further analysis and classification.

• Data Split State:

Once preprocessing is complete, the system moves to the data split state. Here, the system divides the pre-processed SMS message data into training and testing datasets to facilitate model training and evaluation.

• Label Encoder State:

After splitting the data, the system transitions to the label encoder state. In this state, the system encodes categorical labels (e.g., spam and ham) into numerical representations suitable for machine learning algorithms.

• Model State:

The system trains a machine learning model, such as the Naive Bayes algorithm, using the labelled training data to learn patterns and relationships between features and class labels.

• Accuracy State:

Upon training the model, the system moves to the accuracy state. In this state, the system evaluates the performance of the trained model using the testing dataset, measuring metrics such as accuracy, precision, recall, and F1-score.

• Results State:

After assessing model performance, the system transitions to the results state. The system generates and presents the results of SMS message classification, indicating whether each message is classified as spam or ham based on the trained model's predictions.

• Termination State:

The process concludes with the termination state, where the system releases resources and concludes its processing cycle. In this state, the system may perform cleanup tasks and return to an idle state, awaiting the arrival of new SMS messages for processing.
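Two of the states above, data cleaning and label encoding, lend themselves to a short sketch. The cleaning rules here (lowercasing, replacing special characters with spaces, collapsing whitespace) and the ham = 0 / spam = 1 mapping are assumptions made for illustration; spelling correction is not attempted, and the function names are hypothetical.

```python
import re

# Assumed mapping; any consistent mapping of labels to integers would work.
LABEL_TO_ID = {"ham": 0, "spam": 1}


def clean_message(text: str) -> str:
    """Lowercase the text, drop special characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace special characters with a space
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text


def encode_labels(labels):
    """Map categorical labels to their integer codes."""
    return [LABEL_TO_ID[label] for label in labels]


cleaned = clean_message("WINNER!! Claim your FREE prize now:   call 0800-123")
encoded = encode_labels(["ham", "spam", "ham"])
print(cleaned)
print(encoded)
```

Note that stripping every special character is a design choice: symbols such as currency signs can themselves be informative for spam detection, so a real pipeline may prefer to keep some of them.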


CHAPTER 6
RESULTS AND DISCUSSIONS


6.1 Exploring Data (Visualization):


This exploration phase involves visualizing the data using various statistical plots and charts to uncover meaningful insights.

• Pie Chart:

A pie chart provides a visual representation of the distribution of spam and ham messages within the dataset. By examining the proportions of spam and ham messages, we can gauge the imbalance between the two classes and assess the dataset's suitability for training a machine learning model.

• Bar Graph:

A bar graph illustrates the frequency distribution of key features within the SMS messages, such as word count, character count, or the presence of specific keywords. This visualization helps identify common patterns and characteristics associated with spam and ham messages, guiding the feature selection and preprocessing steps in the subsequent stages of the SMS spam detection pipeline.

Figure 6.1: Training set and test set

Observation: 87.4% of the SMS messages are not spam, while only 12.6% are spam.

Insight: Since the data is imbalanced, we need to take this into account when splitting the data into training and testing sets.
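One common way to account for class imbalance when splitting is a stratified split, in which each class contributes the same fraction of its messages to the test set. The sketch below is a plain-Python illustration under assumed parameters (a 20% test fraction, a fixed random seed, and toy data with the 87.4% / 12.6% ratio reported above); it is not necessarily the splitting procedure used in this project.

```python
import random
from collections import defaultdict


def stratified_split(messages, labels, test_fraction=0.2, seed=42):
    """Split so that each class keeps roughly the same share in both subsets."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(messages[i], labels[i]) for i in train_idx]
    test = [(messages[i], labels[i]) for i in test_idx]
    return train, test


# Hypothetical toy data: 874 ham and 126 spam messages.
labels = ["ham"] * 874 + ["spam"] * 126
messages = [f"message {i}" for i in range(len(labels))]

train, test = stratified_split(messages, labels)
spam_share_test = sum(label == "spam" for _, label in test) / len(test)
print(len(train), len(test), round(spam_share_test, 3))
```

With these inputs the test set holds 200 messages, 25 of them spam, so the spam share in the test set (12.5%) stays close to the share in the full data.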


Figure 6.2: Sentences/words count

Observation: On average, spam messages contain more sentences and words than ham messages, although some ham outliers exceed the spam messages.

Comparing the count of sentences or words between spam and ham messages can provide valuable insights into their structural differences.

• Comparison of Sentence and Word Count:

Understanding the distribution of sentence and word counts in both spam and ham messages is crucial for identifying distinctive characteristics between the two categories. This comparison allows us to discern potential patterns or anomalies that could aid in distinguishing spam from legitimate messages.
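The counts compared here can be computed per message with a few lines of code. The sketch below assumes whitespace-separated words and a naive sentence split on '.', '!' and '?'; the function name `count_features` is hypothetical and the splitting rules are a simplification.

```python
import re


def count_features(message: str) -> dict:
    """Return character, word, and sentence counts for one message."""
    words = message.split()
    # Naive sentence split; adequate for a sketch, not for abbreviations or URLs.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    return {
        "characters": len(message),
        "words": len(words),
        "sentences": len(sentences),
    }


example = "Congratulations! You won a prize. Call now."
features = count_features(example)
print(features)
```

Applying this function to every message and grouping the results by label gives the per-class distributions that the comparison relies on.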


Figure 6.2: comparison of sentences/words count


6.2 Data Preprocessing:

Data preprocessing plays a pivotal role in extracting meaningful insights from raw SMS message data. In this section, we discuss various preprocessing steps, including text normalization, tokenization, and feature extraction.

• Word Cloud Visualization:

A word cloud is a graphical representation of word frequency in a text corpus, where the size of each word corresponds to its frequency of occurrence. By generating a word cloud for both spam and ham messages, we can visually inspect the most frequent words used in each category. This visualization aids in identifying common themes, keywords, and distinguishing features that characterize spam and ham messages.

Figure 6.3: Word cloud
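The word frequencies that a word cloud visualizes can be obtained with a simple counter. The sketch below only computes the counts and does not draw the cloud; the example messages are invented, and no stop-word removal is applied.

```python
from collections import Counter


def top_words(messages, n=3):
    """Return the n most frequent lowercase words across the given messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts.most_common(n)


# Invented examples, one small list per class.
spam_messages = ["free prize call now", "call now to claim free cash"]
ham_messages = ["are we still meeting for lunch", "call me after lunch"]

print("spam:", top_words(spam_messages))
print("ham:", top_words(ham_messages, n=1))
```

In a word cloud, the words with the highest counts in each class would be drawn largest, which makes class-specific vocabulary easy to spot.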


6.4 Model Evaluation:

6.4.1 TfidfVectorizer (Term Frequency-Inverse Document Frequency):

Purpose: TfidfVectorizer is designed to address the issue of word importance. It considers not only the frequency of words in a document but also how unique they are across the entire corpus. Words that are common in many documents receive lower weights, while words that are unique to a document receive higher weights.

How it works: It computes a TF-IDF score for each term in each document. TF (Term Frequency) measures the frequency of a term in a document, while IDF (Inverse Document Frequency) measures the uniqueness of the term across the entire corpus. The TF-IDF score of a term is the product of the two.
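The weighting idea can be shown with a small pure-Python sketch that uses the textbook formula tf × log(N / df), where N is the number of documents and df is the number of documents containing the term. This is an illustration of the concept only: scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ from the ones computed here. The three example documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = ["free prize call now", "call me now", "free lunch now"]
weights = tf_idf(docs)
for term, score in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In the first document, "prize" appears in no other document and receives the highest weight, while "now" appears in every document and receives a weight of zero.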

Figure 6.4: Comparison of models

Figure 6.5: Voting classifier

From the above models, it is evident that the Multinomial Naive Bayes (MNB) algorithm performs well for spam detection. This algorithm, which is based on Bayes' theorem and assumes independence among features, demonstrates strong performance in classifying SMS messages as spam or ham.

The simplicity and effectiveness of the MNB algorithm make it particularly well suited to text classification tasks such as spam detection. By modeling the probability of each word occurring in spam and in ham messages independently, MNB can distinguish between the two classes based on how often specific keywords or features occur.

Moreover, MNB is computationally efficient, making it suitable for handling large datasets and real-time applications. Its ability to handle sparse data and its relatively low tendency to overfit further contribute to its suitability for spam detection tasks.
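To make the MNB decision rule concrete, the following is a minimal from-scratch sketch: each class is scored by its log prior plus the sum of log word likelihoods with Laplace smoothing, and the class with the highest score wins. It uses raw word counts and four invented training messages; it is a didactic illustration, not the scikit-learn model evaluated in this chapter, and the class name `MultinomialNaiveBayes` is chosen here for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for message, label in zip(messages, labels):
            self.word_counts[label].update(message.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, message):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior of the class
            for word in message.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen during training
                score += math.log((counts[word] + self.alpha)
                                  / (total_words + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, message):
        scores = self.log_scores(message)
        return max(scores, key=scores.get)


# Invented training data, far too small for real use.
train_messages = [
    "win a free prize now",
    "free cash call now",
    "are you coming to lunch",
    "call me when you are home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = MultinomialNaiveBayes().fit(train_messages, train_labels)
print(model.predict("free prize"))
print(model.predict("are you home"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing constant keeps a word that was never seen in one class from forcing that class's probability to zero.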

CHAPTER 7
CONCLUSION AND FUTURE SCOPE


7 CONCLUSION AND FUTURE SCOPE

Conclusion:

The proliferation of SMS spam messages presents a significant challenge globally, and the problem shows no sign of abating as mobile usage continues to rise. This project addresses the issue by presenting a spam filtering technique that employs various machine learning algorithms to distinguish effectively between legitimate and unsolicited messages.

Through experimentation, it was found that the TF-IDF with Naive Bayes classification algorithm

consistently outperforms other algorithms, including LSTM, in terms of accuracy percentage.

However, relying solely on accuracy may not be sufficient, given the imbalanced nature of the

dataset. Upon closer examination, the Naive Bayes algorithm demonstrated commendable precision

and f-measure scores of 0.98 and 0.97, respectively, underscoring its robustness in identifying spam

messages while minimizing false positives.

Moreover, it's imperative to recognize the multifaceted nature of feature selection and its impact on

algorithm performance. Different algorithms yield varying performances based on the features they

leverage, emphasizing the need for careful consideration and experimentation. In this context, the

incorporation of additional features, such as message lengths, sender metadata, and semantic

attributes, holds promise for enhancing classifier training and overall performance.

Looking ahead, the future scope of this project extends beyond algorithmic enhancements to

encompass broader applications in data analysis and predictive modeling. For instance, integrating

neural network architectures with complementary techniques like genetic algorithms and fuzzy

logic could yield further improvements in spam detection accuracy. Additionally, exploring the use

of machine learning algorithms for analyzing public comments and predicting corporate

performance structures opens up new avenues for research and application.


In conclusion, while the battle against SMS spam messages is ongoing, innovative approaches

leveraging machine learning algorithms offer promising solutions. By continuously refining

techniques, incorporating additional features, and exploring interdisciplinary collaborations, we can

further strengthen spam detection systems and mitigate the impact of unsolicited messages on users'

privacy and security.

Future Scope:

The future scope of this project includes adding further feature parameters, since considering more informative parameters may improve accuracy. Furthermore, the algorithms can be extended to analyze public comments in order to discern patterns and relationships between customers and companies. Traditional algorithms and data mining techniques can also be harnessed to forecast corporate performance structures.

Looking ahead, there are plans to integrate neural networks with other methodologies such as

genetic algorithms or fuzzy logic. Genetic algorithms offer the potential to identify optimal network

architectures and training parameters, while fuzzy logic can accommodate uncertainties inherent in

neural network predictions. The synergistic application of these techniques alongside neural

networks holds promise for enhancing SMS spam prediction.

Expanding on this future scope, the integration of additional feature parameters, such as message

metadata and semantic attributes, will contribute to a more nuanced understanding of spam

messages and enhance the accuracy of classification models. Furthermore, leveraging natural

language processing techniques to analyze the content and context of public comments can provide

valuable insights into customer sentiments and preferences, enabling companies to tailor their

products and services more effectively.


In addition to predictive modeling, the application of traditional algorithms and data mining

techniques for forecasting corporate performance structures represents a compelling avenue for

future research. By analyzing historical data and identifying key performance indicators,

organizations can gain actionable insights into market trends, consumer behavior, and competitive

dynamics, facilitating strategic decision-making and resource allocation.

Integrating neural networks with genetic algorithms or fuzzy logic represents a promising direction

for advancing SMS spam prediction. Genetic algorithms can optimize neural network architectures

and hyperparameters, improving model performance and scalability. Meanwhile, fuzzy logic can

handle uncertainty and imprecision in data, enhancing the robustness and reliability of spam

detection systems in real-world scenarios.

In summary, the future of SMS spam detection lies in the continued exploration of advanced

techniques and interdisciplinary approaches. By harnessing the power of machine learning, natural

language processing, and traditional analytics, we can develop more accurate, efficient, and

adaptable systems for combating unsolicited messages and safeguarding user privacy and security.


