
Final Document

The project report titled 'SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH' presents a study conducted by a group of students from Annamacharya Institute of Technology and Sciences for their Bachelor of Technology degree in Computer Science and Engineering. The report focuses on utilizing machine learning algorithms, specifically SVM, XG Boost, Naïve Bayes, and Ada Boost, to effectively identify and filter spam messages. The project aims to enhance SMS communication by reducing unwanted spam, thereby improving user experience.

A Project report on

“SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH”
Submitted in partial fulfillment of the requirement
for the award of the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
By
S.VENKATA PADMA MEGHANA 19701A05H9
D.SUSHMA 19701A05F2
P.VENKATA GIRIDHAR 19701A05H7
T.VENKATA SRINIVASULU 19701A05I3
A.C.VENKATESHWARA REDDY 19701A05I6
Under the esteemed guidance of

Mr. SHAIK MAHAMMAD RAFI M.Tech,(Ph.D)


Assistant Professor in CSE, AITS.

Submitted to

Department of Computer Science and Engineering


Annamacharya Institute of Technology and Sciences
(An Autonomous Institution)
(Approved by AICTE, New-Delhi and affiliated to J.N.T.U, Anantapur)
(Accredited by NBA & NAAC)
New Boyanapalli, Rajampet, Annamaiah (Dt), A.P-516 126
2022-2023
Department of Computer Science and Engineering
Annamacharya Institute of Technology and Sciences
(An Autonomous Institution)
(Approved by AICTE, New-Delhi and affiliated to J.N.T.U, Anantapur)
( Accredited by NBA & NAAC )
New Boyanapalli, Rajampet ,Annamaiah (Dt), A.P-516 126

CERTIFICATE
This is to certify that the project report entitled “SPAM MESSAGE
IDENTIFICATION USING MACHINE LEARNING APPROACH” is submitted by

S.VENKATA PADMA MEGHANA 19701A05H9


D.SUSHMA 19701A05F2
P.VENKATA GIRIDHAR 19701A05H7
T. VENKATA SRINIVASULU 19701A05I3
A.C.VENKATESHWARA REDDY 19701A05I6

in partial fulfillment of the requirements for the award of Degree of Bachelor of Technology
in “Computer Science and Engineering” for the academic year 2022-23.

Signature of Guide:
Mr. Shaik Mahammad Rafi, M.Tech., (Ph.D)
Assistant Professor in CSE, AITS, Rajampet.

Signature of HOD:
Dr. M. Subba Rao, Ph.D.
Professor & Head, Dept. of CSE, Dean of Student Affairs, AITS, Rajampet.
Department of Computer Science and Engineering
Annamacharya Institute of Technology and Sciences
(An Autonomous Institution)
(Approved by AICTE, New-Delhi and affiliated to J.N.T.U, Anantapur)
(Accredited by NBA & NAAC)
New Boyanapalli, Rajampet , Annamaiah (Dt), A.P-516 126

CERTIFICATE
This is to certify that the project report entitled “SPAM MESSAGE
IDENTIFICATION USING MACHINE LEARNING APPROACH” is submitted by

S.VENKATA PADMA MEGHANA 19701A05H9


D.SUSHMA 19701A05F2
P.VENKATA GIRIDHAR 19701A05H7
T.VENKATA SRINIVASULU 19701A05I3
A.C.VENKATESHWARA REDDY 19701A05I6
in partial fulfillment of the requirements for the award of Degree of Bachelor of Technology
in “Computer Science and Engineering” and is a record of bonafide work carried out by them
during the academic year 2022-23.

Project viva-voce held on :

Internal Examiner External Examiner


Department of Computer Science and Engineering
Annamacharya Institute of Technology and Sciences
(An Autonomous Institution)
(Approved by AICTE, New-Delhi and affiliated to J.N.T.U, Anantapur)
(Accredited by NBA & NAAC)
New Boyanapalli, Rajampet , Annamaiah (Dt), A.P-516 126

ANTI-PLAGIARISM CERTIFICATE
This is to certify that the project report entitled “SPAM MESSAGE
IDENTIFICATION USING MACHINE LEARNING APPROACH” is submitted by

S.VENKATA PADMA MEGHANA 19701A05H9


D.SUSHMA 19701A05F2
P.VENKATA GIRIDHAR 19701A05H7
T.VENKATA SRINIVASULU 19701A05I3
A.C.VENKATESHWARA REDDY 19701A05I6
in partial fulfillment of the requirements for the award of Degree of Bachelor of Technology
in “Computer Science and Engineering”. Course contains the plagiarism of % which is
within the acceptable limits.

Date: Dean / Coordinator


Research & Development Cell Date:
DECLARATION

We hereby declare that the project report entitled “SPAM MESSAGE IDENTIFICATION USING
MACHINE LEARNING APPROACH”, carried out under the guidance of Mr. Shaik Mahammad Rafi
M.Tech,(Ph.D), Assistant Professor in CSE, Department of Computer Science and Engineering,
is submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering.

This is a record of bonafide work carried out by us, and the results embodied in this
project report have not been reproduced or copied from any source. The results embodied in
this project report have not been submitted to any other University or institute for the
award of any other Degree or Diploma.

PROJECT ASSOCIATES

S.VENKATA PADMA MEGHANA


D.SUSHMA
P.VENKATA GIRIDHAR
T.VENKATA SRINIVASULU
A.C.VENKATESHWARA REDDY
ACKNOWLEDGMENT

An endeavor of a long period can be successful only with the advice of many
well-wishers. We take this opportunity to express our deep gratitude and appreciation
to all those who encouraged us for the successful completion of the project work.

Our heartfelt thanks to our Guide, Mr. Shaik Mahammad Rafi M.Tech,(Ph.D), Assistant
Professor in the Department of Computer Science and Engineering, Annamacharya Institute
of Technology and Sciences, Rajampet, for his valuable guidance and suggestions in
analyzing and testing throughout the period, till the end of the project work completion.

We wish to express sincere thanks and gratitude to Dr. M. Subba Rao, Head of the
Department of Computer Science and Engineering, for his encouragement and the facilities
that were offered to us for carrying out this project.

We take this opportunity to offer gratefulness to our Principal, Dr. S.M.V. Narayana,
for providing all sorts of help during the project work.

We are very much thankful to Sri. C. Gangi Reddy, Honorary Secretary of the
Annamacharya Educational Trust, for his help in providing good facilities in our college.

We would like to express our sincere thanks to all faculty members of the Computer
Science and Engineering Department, batch-mates, friends and lab-technicians, who have
helped us to complete the project work successfully.

Finally, we express our sincere thanks to our parents, who have provided their
heartfelt support and encouragement to complete this project successfully.
PROJECT ASSOCIATES
S.VENKATA PADMA MEGHANA
D.SUSHMA
P.VENKATA GIRIDHAR
T.VENKATA SRINIVASULU
A.C.VENKATESHWARA REDDY
CONTENTS

CHAPTER PAGE NO

ABSTRACT
1. INTRODUCTION 1-4

2. LITERATURE SURVEY 5-15

3. SYSTEM ANALYSIS 16-24


3.1. Existing System 16
3.1.1. Disadvantages 16
3.2. Proposed System 16
3.2.1. Advantages 16
3.3. Modules in Proposed System 17
3.3.1. System 17
3.3.2. User 17
3.3.3. Algorithms Used 18-23
3.3.3.1. Support Vector Machine 18-19
3.3.3.2. XG Boost 19
3.3.3.3. Naïve Bayes 19-21
3.3.3.4. Ada Boost 21-23
3.4. Performance Evaluation and Predicting Result 23-24
4. SYSTEM REQUIREMENTS SPECIFICATIONS 25-27
4.1. Software Requirements 25
4.2. Hardware Requirements 25
4.3. Feasibility Study 25-26
4.3.1. Economic Feasibility 25
4.3.2. Technical Feasibility 26
4.3.3. Behavioral Feasibility 26
4.3.4. Benefits of Doing Feasibility Study 26
4.4. Functional and Non-Functional Requirements 26-27
4.4.1. Functional Requirements 27
4.4.2. Non-Functional Requirements 27
5. SYSTEM DESIGN 28-40
5.1. Architecture Design 28
5.2. Introduction to UML Diagrams 28-29
5.2.1. Goals 29
5.3 UML Notations 29-30
5.4 UML Diagrams 31-38
5.4.1 Use Case Diagram 31
5.4.2 Class Diagram 32
5.4.3 Sequence Diagram 33
5.4.4 Collaboration Diagram 34
5.4.5 Deployment Diagram 35
5.4.6 Activity Diagram 36
5.4.7 Component Diagram 37
5.4.8 State Chart Diagram 38
5.5 ER Diagram 39
5.6 Data Flow Diagrams 39-40
5.6.1 Context Level Diagram 40
5.6.2 Level 1 Diagram 40
6. SYSTEM CODING AND IMPLEMENTATION 41-57
6.1. Introduction to Python Programming Language 41-44
6.1.1. Benefits of Python 43-44
6.2. Python Libraries Used in Python 45
6.3. Sample Code 46-57
7. SYSTEM TESTING 58-63
7.1 Software Testing Techniques 58-59
7.1.1. Testing Objectives 58
7.1.2. Test Case Design 58
7.1.3. White Box Testing 58
7.1.4. Black Box Testing 59
7.2. Software Testing Strategies 60-63
7.2.1. Unit Testing 60
7.2.2. Integration Testing 60-61
7.2.3. Validation Testing 61
7.2.4. System Testing 62
7.2.5. Security Testing 62
7.2.6. Performance Testing 62-63
8. RESULTS 64-69

9. CONCLUSION AND FUTURE ENHANCEMENTS 70

BIBLIOGRAPHY

PLAGIARISM REPORT

JOURNAL PUBLICATION
LIST OF FIGURES & TABLES

Fig. No. Figures Page No.


1.1. SMS Spam Detection 2
1.2. Supervised Learning 4
3.1. Block diagram of Proposed Method 17
3.2. Hyperplane is used to categorise two distinct categories 18
3.3. General Confusion Matrix 24
5.1. Architecture Diagram 28
5.2. Use Case Diagram 31
5.3. Class Diagram 32
5.4. Sequence Diagram 33
5.5. Collaboration Diagram 34
5.6. Deployment Diagram 35
5.7. Activity Diagram 36
5.8. Component Diagram 37
5.9. State Chart Diagram 38
5.10. ER Diagram 39
5.11. Context Level Diagram 40
5.12. Level 1 Diagram 40
6.1. Working Of Python Program 42
6.2. Implementation of Python Program 43
8.1. Home Page 64
8.2. About Page 65
8.3. Upload Page 66
8.4. View Page 66
8.5. Preprocessing Page 67
8.6. Model Training Page 67
8.7. Prediction Page 68
8.8. Graph 69
ABSTRACT

We use various means of communication to convey messages digitally. Digital tools allow
two or more persons to coordinate with each other. This communication can be textual,
visual, audio, or written. Smart devices, including cell phones, are the major means of
communication these days. Intensive communication through SMS is giving rise to spamming
as well. Unwanted text messages are defined as junk information that we receive on our
devices. Many companies promote their products or services by sending spam texts, which
are unwelcome. In general, spam emails often outnumber actual messages. In this project,
we use text classification techniques to describe SMS and spam filtering briefly and to
segregate messages accordingly. We compare four machine learning algorithms, namely SVM,
XG Boost, Naïve Bayes and Ada Boost, by applying each of them to the dataset, and SVM,
which has the best accuracy, is selected for the detection of spam messages.

Keywords: Spam Messages, Classification, Spam Filtering, Comparison.


CHAPTER-1

INTRODUCTION
SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

1. INTRODUCTION

One of the most efficient methods of communication is SMS. It is based on cellular
communication networks; however, in order to send or receive a message, a cell phone must
be within a network service region. The majority of people use this service for
communication. Banks, government institutions, and various businesses use SMS to
communicate with their customers. Also, a lot of companies use this service for
advertising. Spam is electronic communication that is uninvited, unwelcome, and
potentially malicious. Spam emails are sent and received via the Internet, whereas spam SMS
is usually sent over a mobile network. The people who send spam are called spammers. SMS
is tempting for unethical exploitation because sending SMS texts is often fairly affordable
(if not free) for the customer. This is made worse by the fact that users typically view
SMS as a more secure and reliable method of communication than other channels, such as
email.

In just five years, there will be 3.8 billion mobile phone (or "smartphone") users.
China, India, and the US are the top three countries in terms of mobile usage. Short
Messaging Service, sometimes known as SMS, is a text messaging service that has been
around for a while. You can use SMS services even without an internet connection. SMS
service is thus accessible on both smartphones and low-end mobile devices. Although there
are numerous text messaging apps on smartphones, such as WhatsApp, these services can only
be used online. SMS, by contrast, is available at all times. As a result, SMS service
traffic is growing daily.

According to a recent poll by Local Circles, a prominent community social media
platform, 68% of mobile customers receive four or more promotional or spam SMSs on
average every day. Not a single one of the 11,094 respondents said that they received no
such messages in a day. The survey included respondents from 373 districts, 62% of them
men and 38% women. Of the respondents, 32% stated that they received 1-3 such SMS every
day, 36% received between 4 and 7, and 32% received 8 or more.

The majority of these consumers had signed up for the Do Not Call Registry run by
the Telecom Regulatory Authority of India (TRAI). These calls and messages are typically
placed by telemarketers, spammers, vendors, banks, vehicle dealers, insurance brokers, real
estate agents, and other fraudsters.

Department of CSE, AITS, Rajampet 1



Figure 1.1: SMS Spam Detection

The system derives its knowledge mostly from data. Numerous methods are available for
that purpose, including classification, clustering, and many others. SMS stands for Short
Message Service. A single SMS message is limited to 160 characters, and longer messages
must be broken up into several smaller messages. Short text messages are exchanged
between cell phones using standardised communication protocols. The government intends to
keep up with the rapid pace of technological progress. Text messaging has increased in
recent years.
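The 160-character limit mentioned above can be illustrated with a short sketch. It is a simplification and not part of the project's code: the function name `split_sms` is hypothetical, and real networks typically reserve part of each segment for a concatenation header, so the usable length per segment is smaller in practice.

```python
def split_sms(text, limit=160):
    """Split text into consecutive chunks of at most `limit` characters.

    Simplified illustration: no concatenation header is reserved,
    and character-encoding differences are ignored.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [text[i:i + limit] for i in range(0, len(text), limit)]


# A hypothetical 350-character message needs three segments.
parts = split_sms("x" * 350)
print([len(p) for p in parts])  # [160, 160, 30]
```

Joining the returned segments in order reproduces the original text, which is what the receiving handset does when it reassembles a long message.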

SMS spam is a complicated problem in several respects. The low cost of SMS has allowed
customers and service providers to ignore the issue, and cell phone spam filtering
software has limited availability. Spam on SMS is less prevalent than spam on email, even
though it accounts for about 30% of text messages sent in parts of Asia and about 1% of
text messages sent in the United States. Under the Telephone Consumer Protection Act, SMS
spam has been prohibited in the United States since 2004. Those who receive spam SMS can
take the senders to a small claims court. In 2009, China's three main mobile phone
operators agreed to a joint effort to combat mobile spam by setting limits on the number
of text messages sent over a period of time.

We demonstrate a few classification techniques in this study. Classification techniques
can determine whether a text message is spam or not. They require a training set
containing labelled items. In this study, we employ text classification to assign each SMS
either to the spam class or to the human (ham) class. Texts sent by people from their
mobile phones are considered to belong to the human class, whereas spam messages are
typically sent by businesses and organisations to advertise their goods. Because messaging
is so commonly used, mobile phones and smartphones are primarily communication devices and
are used by a wide range of people.

Far more datasets are available for email spam filtering than for SMS spam, and SMS spam
datasets are typically small. Because SMS messages are short, the spam filtering methods
used for email cannot be applied directly to SMS. Spamming via email is less common than
SMS spam in some countries, including Korea. In western regions the opposite holds, with
email spam being more prevalent because it costs less than SMS spam, which is more
expensive and infrequent. About 50% of the SMS messages received on mobile devices are
flagged as spam. An SMS filtering system should therefore be able to run within the
limited resources of cell phone hardware. We used genuine ham and spam messages in our
analysis. We use a number of different classification algorithms, some of which have been
used in earlier research and some of which are brand-new.

Machine learning is a technology that allows computers to learn from past data and
make predictions about the future. Many real-world problems can now be addressed using
machine learning and deep learning in fields including health, security, and market
analysis. Machine learning can be divided roughly into two types: supervised learning and
unsupervised learning. Supervised learning is one of the important subcategories of machine
learning. Also called predictive modelling, it is the process of producing predictions from
data. There are no sizable datasets of spam SMS that are publicly accessible. Even if there
were, there is no guarantee that training on those datasets would perform well in our
situation. As a result, the only option is to use the real data streaming into the system
to create a bespoke dataset. Classification and regression are two kinds of supervised
learning. For classification problems, the training data set is pre-labelled; for
regression problems, the function values are known. Once training is complete and the model
has minimised the cost function on the training data set, we move on to scoring, where we
predict values for new data.

Classification: It determines which group an item belongs to. We use it when we want our
system to decide which label applies to each of many events, where every event is defined
by input parameters and might be labelled in a variety of ways.

Regression: Regression is a form of function interpolation and approximation that may
involve several dimensions. A regression problem seeks the approximation of a function
with the least error deviation, that is, the one that minimises the cost function. To put
it another way, the regression technique aims only to predict a numerical dependence, such
as a function value, from a set of data. Figure 1.2 illustrates how supervised learning is
used to solve problems.

Figure 1.2: Supervised Learning
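The regression idea described above, finding the approximation with the least error deviation, can be sketched with ordinary least squares for a straight line. This is a minimal illustration and not part of the project's code; the function names `fit_line` and `mean_squared_error` and the sample points are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Cost function: average squared deviation of the line from the data."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training points that lie exactly on y = 2x + 1.
xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
print(mean_squared_error(xs, ys, slope, intercept))  # 0.0
```

Here the mean squared error plays the role of the cost function: the fitted line is the one for which this value is smallest on the training points.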

As an example of supervised learning, suppose a system contains a data set of emails.
The aim of supervised learning is to determine whether each email is spam or not (ham).
Because there is a predetermined result (spam or ham), this is supervised learning.

We applied a variety of supervised learning techniques for SMS spam detection using
a dataset from UCI that had labels.
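To make the spam/ham example concrete, the following is a minimal sketch of one supervised technique, a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is not the project's implementation: the function names and the four training messages are hypothetical stand-ins, and the real experiments would use the labelled UCI dataset instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(messages):
    """Count classes and per-class words from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in messages:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training set (illustration only).
TRAIN = [
    ("win a free prize now", "spam"),
    ("free cash offer claim now", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("call me when you reach home", "ham"),
]
model = train_naive_bayes(TRAIN)
print(predict(model, "claim your free prize"))  # spam
print(predict(model, "see you at lunch"))       # ham
```

The labels in `TRAIN` are what make this supervised: the classifier only learns which words are typical of each class because every training message already carries its answer.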



CHAPTER-2

LITERATURE SURVEY

2. LITERATURE SURVEY
1. Sridevi Gadde (2021), S. Satyanarana, A. Lakshmanarao: Spam detection using machine
learning algorithms is nothing new. In the past, a number of researchers have used machine
learning techniques to identify and classify SMS spam. With a random forest classifier
and the TF-IDF approach, Nilam Nur Amir Sjarif et al. attained an accuracy of 97.5%. The
TF-IDF method quantifies the words in a document using two metrics, Term Frequency and
Inverse Document Frequency. For email spam filtering, A. Lakshmanarao et al. used four
machine learning classifiers: Decision Trees, Naive Bayes, Logistic Regression, and Random
Forest. The random forest classifier had an accuracy of 97%.
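The TF-IDF weighting mentioned above can be sketched as follows. Several variants of the formula are in use; this sketch assumes the plain form, term frequency multiplied by the logarithm of (number of documents / number of documents containing the term), which is not necessarily the variant used in the cited work. The function name `tf_idf` and the sample messages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length,
    idf = log(N / number of documents containing the term).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical messages (illustration only).
docs = ["free prize free", "lunch at noon", "free lunch"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", {term: round(value, 3) for term, value in w.items()})
```

A term that occurs in every document receives a weight of zero under this variant, because log(N / N) = 0; rarer terms such as "prize" are weighted more heavily than common ones such as "free".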

Pavas Navaney et al. evaluated a number of machine learning techniques and attained an
accuracy of 97.4% using support vector machines. Luo GuangJun et al. compared a variety of
shallow machine learning techniques and attained a high accuracy rate with a logistic
regression classifier. The Hidden Markov Model was suggested by Tian Xia et al. for the
identification of SMS spam. Their model addressed problems with low term frequency by
making use of word order information. With their suggested HMM model, they were able to
obtain an accuracy of 98%. With a deep neural network, M. Nivaashini et al. were able to
detect SMS spam with an accuracy of 98%. The effectiveness of the DNN was also compared
with that of NB, Random Forest, Support Vector Machine, and KNN. Mehul Gupta et al.
compared various machine learning models for spam detection with deep learning models and
demonstrated that the latter had a high rate of success in detecting SMS spam.

In a comparison of different machine learning techniques for short message service
spam identification, Gomatham Sai Sravya et al. found that the Naive Bayes classification
model had the highest accuracy. With the aid of a support vector machine, M. Rubin Julis et
al. achieved an accuracy of 97%. Recurrent Neural Networks were proposed by K. Sree Ram
Murthy et al. for the identification of SMS spam and achieved a good accuracy rate. S.
Sheikh developed feature selection and a neural network model for SMS spam identification
and reported a high accuracy rate. With a support vector machine classifier, Adem Tekerek
et al. were able to detect spam in Short Messaging Service messages with an accuracy of
97%.

2. Abhishek Patel, Priya Jhariya, Sudalagunta Bharath, Ankita Wadhawan, “SMS Spam
Detection using Machine Learning Approach”:


According to Hidalgo (2002), spam is "unrestricted mass email" that contains "info
made to be delivered to numerous recipients, notwithstanding their longings." In 2007,
Cormack described spam as content with an enticing or compelling message that is
distributed by mass mailing. Given the various media that spammers use, including email
and SMS, such spam may nonetheless be easy to identify. Spammers flood Short Message
Service systems and send end users large volumes of unsolicited SMS. From a commercial
standpoint, Short Messaging Service users must spend time deleting spam messages, which
can result in lost profit and cause problems for organisations. Hence, identifying Short
Messaging Service spam accurately and efficiently becomes an important problem.

Data mining will be employed in this study, applying machine learning through various
classifiers for training and testing as well as filters for data preprocessing and feature
selection. The study intends to evaluate various metrics and, more precisely, to identify
the best hybrid model. Many evaluation studies have already been carried out using data
mining techniques.

Overall, a lot of effort has focused on a single classifier. Meanwhile, spammers keep
evolving their evasion techniques. In light of this, this analysis focuses on the whole
framework for managing SMS spam using data mining techniques. Experiments will be used to
determine, for example, whether a hybrid model provides higher precision than any single
classifier used for email spam.

3. SHAFI’I MUHAMMAD ABDULHAMID (Member, IEEE), MUHAMMAD SHAFIE ABD LATIFF, HARUNA
CHIROMA (Member, IEEE), OLUWAFEMI OSHO, GADDAFI ABDUL-SALAAM, ADAMU I.

This strategy is used to emphasise the issues that still need to be resolved and the
differences from our existing analysis. Delany et al. conducted a survey on mobile SMS
spam filtering and its developments. The authors discussed the difficulties in gathering a
research dataset and making it accessible, and the publication encouraged further
investigation in this area. The results of a subsequent early benchmark experiment
revealed a lack of agreement regarding the most effective strategies for mobile SMS spam
detection and filtering. The survey also described the text classification techniques used
in SMS filtering. Nevertheless, features specific to SMS were not taken into account.


In general, the project's methodologies were straightforward. For the preliminary
benchmark experiment, the authors compared the fingerprint of each newly received SMS to
the fingerprints of all recognised spam texts. A message was categorised as spam if its
fingerprint matched any of the previously found spam fingerprints. Nevertheless, because
the study could only include works published before 2012, more recent techniques and
benchmark datasets were not included. Bantukul and Marsico studied various techniques and
software solutions to identify and filter spam in a telecommunications network. Their
results showed that the original message would be delivered to its intended recipient if
it made it past the spam filtering.
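The fingerprint comparison described above can be sketched as a set lookup. The cited work's actual fingerprinting scheme is not described here, so this example assumes a stand-in: a SHA-256 hash of the message after lower-casing and stripping punctuation. The names `fingerprint` and `FingerprintFilter` and the sample messages are hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """Hash a normalised form of the message (assumed scheme, for illustration)."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FingerprintFilter:
    """Flags a message as spam when its fingerprint matches a known spam text."""

    def __init__(self, known_spam):
        self._known = {fingerprint(message) for message in known_spam}

    def is_spam(self, message):
        return fingerprint(message) in self._known


# Hypothetical list of recognised spam texts.
spam_filter = FingerprintFilter(["WIN a FREE prize!!!"])
print(spam_filter.is_spam("win a free prize"))    # True
print(spam_filter.is_spam("win a free holiday"))  # False
```

An exact-hash scheme like this only catches messages that are identical after normalisation; a spammer who changes a single word produces a different fingerprint, which is one reason learned classifiers are compared against such baselines.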

The report, however, focused more on methods for detecting spam in emails and left
out artificial immune systems and other mobile SMS spam approaches. An overview of the
legal frameworks for spam in the mobile SMS industry in Switzerland, the EU, and the USA
was provided by Camponovo and Cerutti.

The initiative also looked into the conclusions that may be drawn for the commercial
mobile sector. In a study on SMS anti-spam systems, Wang et al. integrated temporal or
spectral testing with behavior-based social network analysis to identify spam with extremely
high recall and accuracy. The authors tackled the scalability issue of social networks by
outlining the classification architecture and presenting a reasonably precise neighbourhood
index solution. Chou and Lien investigated the "mobile teaser ads" by running two distinct
studies on brand awareness, representative friendliness, and representative expertise, as well
as the ways in which they could influence brand interest in subscribers with various SMS
mind-sets. The results showed that a likeable and well-known representative decreased
consumers' curiosity about teaser advertising for higher-awareness products.

Jindal and Liu examined spam and spam recognition in blogs that feature product
advertisements. Unfortunately, SMS spam was not mentioned in the review; it only covered
spam associated with blogs that posted product advertisements. Similarly, a review of Web
spam evaluated numerous algorithms for filtering suspicious behaviour over a ten-year
period and divided them into four classes: classic spam, fake reviews, social spam, and
link farming. This evaluation, however, did not fully analyse SMS spam; it addressed only
e-mail spam, fake blog reviews, and social media spam. Chan et al. presented a good word
attack strategy that aims to mislead the classifier with the fewest inserted characters by
combining weight values with word length in the SMS.


A feature reweighting method was proposed together with a novel rescaling function that
reduces the weight of features representing short words, so that more inserted characters
are required for a successful evasion. This strategy was evaluated empirically using SMS
and comment spam datasets. The results of the experiment demonstrated that word length is
a crucial factor in the resistance of SMS spam filtering to good word attacks. We outline
the procedures used to review the previous studies in the next section.

4. Inwhee Joe and Hyetaek Shim, Division of Computer Science and Engineering,
Hanyang University, Seoul, 133-791 South Korea:

This project describes an SVM (Support Vector Machine) and thesaurus-based spam
filtering system for SMS (Short Messaging Service). The system uses a pre-processing tool to
recognize words from sample data, combines their meanings with the help of a thesaurus,
generates features of integrated words using chi-square statistics, and then examines these
traits. The system is implemented within the Windows environment, and its effectiveness has
been empirically verified.

Spam filtering is a field of automatic document classification that determines whether
a document is spam or not. Automated document categorization entails grouping together
related documents and assigning each one to the appropriate category using a
classification scheme. The classification process comprises two stages.

The first phase is feature selection: after a large number of documents have been
indexed, the features necessary for classification are extracted. The second phase is the
decision-making process, which determines the appropriate category from the results of the
first phase. Through a machine learning process, automatic document classification is able
to assign the correct category automatically.

For this process, specific words are associated with a set of learnt documents. Words
represent the document, and feature extraction denotes a batch operation that chooses
words from the learnt documents' words. However, if every word in the learnt documents is
chosen as a feature, classification wastes time and loses decision-making ability. To
avoid this issue, the information carried by each word is assessed before feature words
are chosen for automatic classification. In text categorization, the feature space to
consider is very large. We therefore require a feature selection technique. Document
frequency thresholding (DF), the χ² statistic (CHI), term strength (TS), information gain
(IG), and mutual information are the most widely used feature selection techniques.
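As an illustration of one of the listed techniques, the χ² statistic for a term and a class can be computed from a 2x2 contingency table of document counts: χ² = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), where A and B count documents containing the term inside and outside the class, and C and D count documents without the term inside and outside the class. The sketch below is not the cited system's code; the function name `chi_square` and the sample documents are hypothetical.

```python
def chi_square(term, label, documents):
    """Chi-square score of `term` for class `label` over (text, label) pairs."""
    a = b = c = d = 0
    for text, doc_label in documents:
        has_term = term in text.lower().split()
        in_class = doc_label == label
        if has_term and in_class:
            a += 1          # term present, document in class
        elif has_term:
            b += 1          # term present, document outside class
        elif in_class:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document outside class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0          # term or class never (or always) occurs
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical labelled documents (illustration only).
DOCS = [
    ("free prize", "spam"),
    ("free cash", "spam"),
    ("lunch today", "ham"),
    ("call home", "ham"),
]
print(chi_square("free", "spam", DOCS))  # 4.0
```

Terms with the highest scores are the ones whose presence depends most strongly on the class, so a feature selector would keep the top-scoring terms and discard the rest.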

5. Tiago Almeida, José María Gómez, Akebo Yamakami, University of Campinas, Sao Paulo,
Brazil:

Due to the rise in mobile phone usage, there has been a rapid increase in SMS spam
messages. In practice, it is difficult to combat mobile phone spam because of the low cost
of SMS, which has allowed many consumers and service providers to ignore the issue, and
because of the limited availability of mobile phone spam-filtering software. On the other
hand, the lack of publicly accessible datasets for SMS spam, which are essential for
testing and comparing different classifiers, is a significant disadvantage in academic
environments. Also, because SMS messages are fairly short, content-based spam filters may
perform worse.

With this project, we present the largest authentic, accessible, and unencrypted SMS
spam collection we are aware of. Also, we contrast the results of various tried-and-true
machine learning techniques. The findings show that Support Vector Machine performs
better than other assessed classifiers, making it a useful foundation for further comparison.

6. Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification”:

This project proposes a dual-filtering method for messages. First, the KNN classification
algorithm is combined with rough sets to separate spam messages from other messages. Then,
to avoid losing precision through the reduction, some messages are re-filtered using the KNN
classification algorithm. Based on the KNN classification algorithm and rough sets, this
method not only increases classification speed but also maintains high accuracy.

7. B. G. Becker. Visualizing Decision Table Classifiers. Pages 102-105, IEEE (1998):

Decision trees, decision networks, and decision tables are all classification models used
for prediction, and all are produced by machine learning algorithms. A decision table is
made up of a hierarchy of tables in which each entry is broken down by the values of two
additional attributes to form a new table. The structure is comparable to dimensional
stacking. A visualization technique is presented that enables even non-experts in machine
learning to understand a model built on many attributes. A variety of interactions make this
representation more practical than other static designs.


8. Draško Radovanović, Božo Krstajić, Member, IEEE “Review Spam Detection using
Machine Learning”:

People typically inform themselves before making a purchase by reading online reviews.
Sellers frequently try to fake user experiences to increase their profits. Recognizing and
eliminating bogus reviews is crucial because customers are being deceived in this manner.
This project examines machine learning-based spam detection techniques and gives an overview
and findings. According to prior work, review spam can be split into three categories:
(1) false reviews, (2) brand-only reviews, and (3) non-reviews. False reviews are
intentionally false opinions. Brand-only reviews are not concerned with products at all, but
with brands or producers. Ads and other unrelated reviews without opinions are considered
non-reviews. Types two and three do not address specific products, but they are not
dishonest. These spam subtypes are also simple to identify manually, and conventional
classification techniques have little trouble identifying them. Detecting false reviews, by
contrast, has been shown to be substantially more difficult for both a machine and a human
observer. For these reasons, this project considers only this kind of spam.

9. Amani Alzahrani and Danda B. Rawat “Comparative Study Of Machine Learning
Algorithms for SMS Spam Detection”

The Short Messaging Service (SMS) gained popularity after it was first made available as a
service in the second-generation (2G) terrestrial mobile network architecture, the Global
System for Mobile Communication (GSM). Owing to technological developments and an increase
in content-based advertising, SMS usage on phones has expanded to such a degree that devices
are occasionally inundated with large numbers of spam SMS. Loss of private data is another
risk posed by these spam messages. Numerous content-based machine learning methods have been
used successfully to filter spam emails. Contemporary studies have classified text messages
as spam or ham using certain stylistic characteristics.

The ability to detect SMS spam can be significantly affected by the use of well-known
terms, phrases, abbreviations, and idioms. This project compares various classification
methods on datasets gathered from earlier research projects and evaluates them according to
their accuracy, precision, recall, and CAP curve. Deep learning approaches and conventional
machine learning techniques are compared.


10. Milivoje Popovac, Mirjana Karanovic, Srdjan Sladojevic, Marko Arsenovic, Andras
Anderla “Convolutional Neural Network Based SMS Spam Detection”:

Spam message detection is a classification problem that can be addressed with various data
analysis techniques. The papers that apply artificial intelligence approaches are reviewed
below, with reference to the methodology selected for this study. The creators of Tiago's
dataset tested a number of machine learning methods and laid the groundwork for further
study. They combined classifiers with two tokenizers (designed to recognize domains and to
preserve symbols that help distinguish spam from ham messages); the best performance came
from SVM, Boosted NB, Boosted C4.5, and PART. SVM performed best, catching 83.10% of spam
while blocking just 0.18% of non-spam messages, with an accuracy above 97.5%. Another
project used the SMS Spam Corpus and SMS Spam Collection datasets, both separately and
combined. Both approaches it employed, the FP-Growth algorithm and the Naive Bayes
classifier, achieved an accuracy above 90%.

The best average accuracy (98.5%) was obtained when the FP-Growth method was applied to
Tiago's dataset. A further project also used Tiago's dataset; in its experiment Naive Bayes
performed better than Random Forest and Logistic Regression, with NB reaching an accuracy of
about 98.5%. Another project made use of the same dataset, and its authors concluded that
the best results came from boosting the Random Forest and SVM algorithms. Results can be
improved by using both Linguistic Inquiry and Word Count (LIWC) features and SMS-specific
content-based features.

The authors of another project evaluated a number of algorithms and selected the Gentle
Boost classifier as the best method for Tiago's dataset. This algorithm, which combines the
AdaBoostM1 and Logit Boost techniques, is well suited to binary classification and
unbalanced data, and it produced an accuracy of more than 98.3%. The authors of a further
project used NB, SVM, k-NN, RF, and AdaBoost. All of these algorithms achieved an accuracy
above 97%; the best classifiers were multinomial naive Bayes with Laplace smoothing and SVM
with a linear kernel.


Another study using Tiago's dataset was presented in a separate paper. In addition to
pre-processing and classification with several classifiers, the experiment involved
clustering with either the K-Means technique or the NMF model. Using these steps, an SMS
thread identification solution was put forward. The study concluded that the combination of
NMF and SVM produced good results in thread identification and that SVM performed better at
classifying SMS messages. The authors of another paper compared the performance of several
methods on four spam datasets. They proposed a novel spam filter (DBB-RDNN-ReL) that
incorporates n-gram tf-idf feature selection, a modified distribution-based balancing
algorithm, and a regularised deep multi-layer perceptron NN model with rectified linear
units.

On Tiago's dataset, the proposed model achieved an accuracy of 98.5%, an FP rate of 0.0024,
and an AUC of 0.961. Similar techniques are also applied to text categorization, sentiment
analysis, and email spam detection. The authors of one paper trained a CNN to classify
sentences; their experiment demonstrates that even a basic CNN with just one convolution
layer can produce remarkable results.

Another study evaluated the use of a CNN-RNN model for multi-label text categorization. The
CNN was used for feature extraction, and the RNN was used to extract local semantic
information and model label correlation. The study shows that the performance of the model
is significantly affected by the size of the dataset: large datasets can produce impressive
performance, whereas small datasets may result in overfitting. The present study was carried
out to assess how well a CNN applies to the problem under consideration.

11. Arijit Chandra, Sunil Kumar Khatri “Spam SMS Filtering using Recurrent Neural
Network and Long Short Term Memory”

SMS stands for Short Messaging Service, the standard mobile-device protocol for sharing
information through brief text messages. Nowadays, SMS messages are a quick, affordable, and
widely accepted alternative to phone calls for communication. Spam is defined as unsolicited
messages distributed widely without the recipient's consent. Consumers still have to deal
with spammers who use SMS to advertise fraudulent claims and invade user privacy. With the
development of the Internet, spammers have tried to reach everywhere, including emails,
social media sites, reviews, and even Twitter. Spammers frequently use these various forms
of spam, including comments, emails, search results, and personal messages, to make money.


Several machine learning methods, including neural networks, have been used to distinguish
between spam and legitimate SMS texts. In contrast to conventional methods, where features
are chosen for classification after analysis, these techniques can automatically learn
high-level features from raw data. This project offers a novel approach for detecting spam
and ham in the "SMS Spam Collection" dataset, which is available at the UCI machine learning
repository, employing recurrent neural networks (RNN) and long short-term memory (LSTM)
built as Keras models with a TensorFlow backend. Tokenization, TF-IDF vectorization, and the
removal of stop words were essential preparation steps for the dataset. A total accuracy of
98% is reached, which is an improvement over previous machine learning spam detection
techniques.

12. Sakshi Agarwal, Sanmeet Kaur, Sunita Garhwal “SMS Spam Detection for Indian
Messages”

The number of mobile phone users is rising, which has caused a sharp rise in SMS spam. While
most of the world still views mobile messaging as "clean" and reliable, recent surveys show
that the amount of mobile phone spam is rising drastically year over year. It is a growing
problem, particularly in Asia and the Middle East. SMS spam filtering is a relatively new
task that addresses this issue. Several of its problems and solutions carry over from email
spam filtering, but it also raises some concerns of its own. By adding Indian messages to
the global SMS dataset, this work encourages further research on classifying mobile messages
for Indian users as ham or spam. In the project, various machine learning classifiers are
analysed using a large corpus of SMS texts from Indian citizens.

In a landmark study on identifying mobile phone spam, Gomez Hidalgo et al. evaluated a
number of Bayesian-based classifiers. The authors introduced Spanish and English test
databases, the first two well-known SMS spam datasets. On those two datasets they evaluated
various machine learning and message representation techniques, and they concluded that
Bayesian filtering can be used effectively to classify SMS spam. According to Cormack et
al., content-based spam filtering can also be applied to short text messages in three
different contexts: SMS, blog comments, and email summary information.

Their article concluded that SMS messages contain too few words to properly support word or
word-bigram based spam classifiers. Performance was therefore improved by expanding the
feature set to include orthogonal sparse word bigrams, character bigrams, and trigrams.
Nuruzzaman et al. examined the effectiveness of using text classification techniques to
filter spam messages on standalone mobile phones.

The training, filtering, and updating processes were all carried out on a standalone mobile
device. Their results show that the proposed model succeeded in separating spam from ham
messages with a moderate level of efficiency, using less memory, and functioning properly
without the support of a computer.

Coskun and Giura proposed a network-based online detection technique for identifying SMS
spam campaigns, which counts the number of messages with similar content that are
transmitted in a single network over a brief period of time. They suggested using Bloom
filters to maintain an approximate count of message content occurrences. Sarah Jane Delany
et al. used an SMS corpus in a clustering experiment.

They gathered 1353 spam messages and attempted to use them as a dataset free of duplicates.

They used k-way spectral clustering with orthogonal initialization. Applying spectral
clustering to their own dataset produced a small number of clusters (ten in total), each
with its top 8 terms and an assumed annotation. Tiago A. Almeida et al. presented the
details of a new SMS spam collection that is authentic, open, and unencrypted, and that
contains the largest number of messages available. The collection consists of 4,827 mobile
ham messages and 747 mobile spam messages. The authors also ran a number of well-known
machine learning algorithms on their dataset and concluded that SVM is a good baseline for
further comparison. Houshmand Shirani-Mehr applied various machine learning algorithms to
the problem of classifying SMS spam, compared the results to gain insight and investigate
the issue further, and created a program based on one of these methods that can accurately
filter SMS spam. A database of 5574 text messages was used.

13. Lu CAO, Guihua NIE, Pingfeng LIU “Ontology-based Spam Detection Filtering
System”

To address the major challenges of mobile spam, this study examines strategies for
controlling spam messages while preserving genuine messages that benefit users. Building on
existing methods of spam message identification, it describes a systematic framework model
of an ontology-based mobile phone spam detection system, together with the relevant key
technologies.

Spam can be detected and filtered using a variety of methods. Depending on whether the
detection process takes part in the transmission of information, detection is either
real-time or non-real-time. Depending on the detection point, spam detection and filtering
is performed either at the originating end or at the terminating end.



CHAPTER-3

SYSTEM ANALYSIS

3. SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

Because data visualization is poorly understood, deploying machine learning algorithms in
the existing system is somewhat difficult. The existing method builds its models through
mathematical calculations with the Naive Bayes algorithm, which can be time-consuming and
inaccurate.

3.1.1 Disadvantages
• Low accuracy
• High processing time

3.2 PROPOSED SYSTEM


Machine learning algorithms are a robust technique for identifying spam messages. Several
machine learning algorithms, namely SVM, XGBoost, Naïve Bayes, and AdaBoost, are applied to
the dataset and compared, and the algorithm with the best accuracy is selected for
detecting spam messages.

3.2.1 Advantages
• High accuracy
• Reduced processing time


3.3 MODULES IN PROPOSED SYSTEM

Figure 3.1: Block diagram of Proposed Method


3.3.1 SYSTEM:
• Pre-processing: Data pre-processing is an essential task for an ML application. The raw
data is cleaned and formatted using data mining techniques, because analysis requires a
clean, noise-free dataset. Most of the dataset contained incomplete and missing values,
which are filled in before ML processing.
• Model training: The dataset is split into two subsets, a training set and a testing set.
The algorithms are fitted on the training set and then used to detect spam messages in the
testing set. 30% of the data is reserved for the testing set, so that the remaining 70% is
available for the model to learn from.
• Prediction: The model's output is displayed, showing each message as either spam or
normal.
• Generate results: The system generates results after prediction.
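The 70/30 split described under model training can be sketched in plain Python as follows. The helper name `train_test_split`, the fixed `seed`, and the ten placeholder records are assumptions made for illustration; the report does not state how the split was implemented.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical (message, label) pairs standing in for the real dataset.
dataset = [(f"message {i}", "spam" if i % 4 == 0 else "ham") for i in range(10)]
train_set, test_set = train_test_split(dataset)
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, and the fixed seed keeps the split reproducible between runs.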

3.3.2 USER:

• Upload data: The user can upload message data into the system.
• View data: The user can view the uploaded data.
• View results: The user can view the predicted results.


3.3.3 ALGORITHMS USED:

The following machine learning algorithms are used to train and test on the sample dataset.

3.3.3.1 SUPPORT VECTOR MACHINE


Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and
is used for both classification and regression problems, although in machine learning it is
employed mainly for classification. The goal of the SVM algorithm is to find the best line
or decision boundary that divides n-dimensional space into classes, so that new data points
can be classified quickly. This optimal decision boundary is called a hyperplane. SVM
selects the extreme points (vectors) that help create the hyperplane. These extreme cases
are called support vectors, and they give the algorithm its name. SVM can be used for face
detection, image classification, text categorization, and similar tasks.

Figure 3.2. Hyperplane is used to categorise two distinct categories


Types of SVM

SVM can be of two types:

Linear SVM: Linear SVM is used for linearly separable data, that is, data that can be
divided into two classes using a single straight line. The classifier used for such data is
known as a linear SVM classifier.

Non-linear SVM: Non-linear SVM is used for non-linearly separable data, that is, data that
cannot be classified using a straight line. The classifier used for such data is known as a
non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: In n-dimensional space there may be many lines or decision boundaries that
separate the classes, but we must find the decision boundary that best classifies the data
points. This optimal boundary is known as the hyperplane of SVM.

The dimension of the hyperplane depends on the number of features in the dataset. If there
are just two features (as in the example image), the hyperplane is a straight line; if there
are three features, the hyperplane is a two-dimensional plane. We always build the
hyperplane with the maximum margin, that is, the greatest possible distance between the
hyperplane and the nearest data points of each class.
Support Vectors:

The data points or vectors that lie closest to the hyperplane and affect its position are
called support vectors. They are so named because they support the hyperplane.
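To make the role of the hyperplane concrete, the sketch below classifies points by the sign of w·x + b and computes the margin width 2/||w|| of a linear SVM. The weights `w`, the bias `b`, and the mapping of the positive side to "spam" are hypothetical values chosen for illustration; they are not learned from the project's dataset.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Label a point by the side of the hyperplane it falls on (assumed mapping)."""
    return "spam" if decision_value(w, b, x) >= 0 else "ham"


def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical two-feature hyperplane: x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0
label_a = classify(w, b, [3.0, 2.0])   # score = 2.0
label_b = classify(w, b, [0.5, 1.0])   # score = -1.5
width = margin_width(w)                # 2 / sqrt(2)
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while the training points stay on their correct sides.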

3.3.3.2 XGBOOST

Recently, XGBoost has dominated applied machine learning and Kaggle competitions for
structured or tabular data. XGBoost is an implementation of gradient boosted decision trees
designed for speed and performance.

XGBoost is a decision-tree-based ensemble machine learning method that uses a gradient
boosting framework. In prediction problems involving unstructured data (images, text, etc.),
artificial neural networks tend to outperform all other algorithms or frameworks. For small
to medium-sized structured/tabular data, however, decision-tree-based algorithms are
currently considered best in class.

Bagging: Consider a decision made by a panel of interviewers, each of whom has a vote,
rather than by a single interviewer. Bagging, or bootstrap aggregating, combines the input
from every interviewer through a democratic voting procedure to determine the final outcome.

XGBoost and Gradient Boosting Machines (GBMs) are both ensemble tree methods that apply the
principle of boosting weak learners (CARTs in general) using the gradient descent
architecture. XGBoost, however, improves on the basic GBM framework through system
optimization and algorithmic enhancements.
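The boosting idea shared by GBMs and XGBoost can be illustrated with a deliberately small sketch: decision stumps are fitted, one after another, to the residuals of the current prediction under squared loss. This is plain gradient boosting, not XGBoost itself; it omits regularization, second-order gradients, and all system optimizations, and every name and the toy data are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on 1-D inputs that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=3, learning_rate=0.5):
    """Boost stumps on squared loss; return the predictions after every round."""
    predictions = [sum(ys) / len(ys)] * len(ys)   # start from the mean
    history = [list(predictions)]
    for _ in range(rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(list(predictions))
    return history


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy one-feature data (hypothetical).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]
history = gradient_boost(xs, ys)
errors = [mse(ys, p) for p in history]
```

On this toy data the error shrinks with every round, which is the behaviour boosting relies on: each weak learner corrects part of what the previous ones got wrong.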

3.3.3.3 NAIVE BAYES

A Naive Bayes classifier is a probabilistic machine learning model used for classification
tasks. The classifier is based on the Bayes theorem:

P(A | B) = P(B | A) * P(A) / P(B)

Using the Bayes theorem, we can find the probability of A happening given that B has already
occurred. Here, A is the hypothesis and B is the evidence. The assumption made is that the
predictors (features) are independent, that is, the presence of one feature does not affect
another. This is why the classifier is called "naive".

Let's understand it with an example. Below is a training set of weather data and the
corresponding target variable, "Play" (indicating whether a game is played). We need to
classify whether players will play based on the weather condition, following the steps
below.

Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, for example Overcast
probability = 0.29 and probability of playing = 0.64.

Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the posterior probability method discussed above.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny |Yes) = 3/9 = 0.33, P (Sunny) = 5/14 = 0.36, P (Yes) = 9/14 = 0.64


Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability, so the
prediction is that players will play when the weather is sunny.
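The same calculation can be reproduced exactly with fractions, which avoids the rounding in the decimal values above. The counts are the ones used in the worked example (14 days, 9 "Yes" days, 5 sunny days, 3 of them "Yes"); the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the worked example: 14 days in total, 9 "Yes" days,
# 5 sunny days, 3 of which are also "Yes" days.
total_days = 14
yes_days = 9
sunny_days = 5
sunny_and_yes = 3

p_sunny_given_yes = Fraction(sunny_and_yes, yes_days)   # 3/9
p_yes = Fraction(yes_days, total_days)                   # 9/14
p_sunny = Fraction(sunny_days, total_days)               # 5/14

# Bayes theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

# The class with the larger posterior is the prediction.
prediction = "Yes" if p_yes_given_sunny > p_no_given_sunny else "No"
```

The posterior comes out as exactly 3/5, so "Yes" is predicted for a sunny day.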

Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and in problems
with multiple classes.

• It is easy and fast to predict the class of a test data set. It also performs well in
multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better than
other models such as logistic regression, and it needs less training data.
• It performs well with categorical input variables compared to numerical variables. For
numerical variables, a normal distribution is assumed (a bell curve, which is a strong
assumption).

Applications of Naive Bayes Algorithms

Real time Prediction: Naive Bayes is an eager learning classifier and it is fast. Thus, it
can be used for making predictions in real time.

Multi class Prediction: This algorithm is also well known for its multi-class prediction
capability. Here, we can predict the probability of multiple classes of the target variable.

Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers are
frequently employed in text classification, where they have a higher success rate than other
methods because they perform well in multi-class problems and rely on the independence
assumption. As a result, the algorithm is widely used in Spam Filtering (to identify spam
e-mail) and Sentiment Analysis (in social media analysis, to identify positive and negative
customer sentiments).

3.3.3.4 ADA BOOST

AdaBoost, short for Adaptive Boosting, is a boosting technique used as an ensemble method
in machine learning. It is called adaptive boosting because the weights are reassigned to
each instance, with higher weights given to incorrectly classified instances. Boosting is
used in supervised learning to reduce bias and variance. It works on the principle that
learners are grown sequentially: except for the first, each learner is built from the
previously grown learner. In simple terms, weak learners are converted into strong ones.
AdaBoost follows the same principle as boosting, with a slight difference, which is
discussed in more detail below.


First, consider how boosting works. During the training phase, 'n' decision trees are
created. When the first decision tree or model is built, the records it classifies
incorrectly are given priority, and only these records are sent as input to the second
model. The process continues until the specified number of base learners has been created.
Note that repetition of records is allowed in all boosting techniques.

The first model is built, and the algorithm notes its errors. The incorrectly classified
records are used as input for the next model. This process is repeated until the specified
condition is met. As the figure shows, 'n' models are created, each using the errors of the
previous one. This is how boosting works. Models 1, 2, 3, ..., N are individual models,
namely decision trees. All boosting schemes share this basic working principle.

With the boosting principle understood, the AdaBoost algorithm is easy to follow. When a
random forest is used, the algorithm creates 'n' trees. These are proper trees, each with a
start node and several leaf nodes. Some trees may be larger than others, but a random forest
has no fixed depth. AdaBoost, by contrast, creates only stumps: trees with just one node and
two leaves. These stumps are weak learners, which is what boosting techniques prefer. In
AdaBoost, the order of the stumps matters a great deal: the error of the first stump
influences how the subsequent stumps are built.

Consider a sample dataset that has only three attributes and a categorical output; the
dataset is shown in the image. Because the output is binary/categorical, this is a
classification problem. In practice the dataset may contain any number of features and
records; for the purposes of explanation, consider 5 records. The output is categorical and
is given as Yes or No. Each record is assigned a sample weight using the formula W = 1/N,
where N is the total number of records. Since this dataset has only 5 records, every record
initially has the same sample weight, 1/5.
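The initial weighting, and the way weights shift toward misclassified records after one stump, can be sketched as follows. The update rule shown is the standard discrete AdaBoost formulation, which is assumed here because the report does not spell it out; the function names and the choice of which record is misclassified are hypothetical.

```python
import math


def initial_weights(n_records):
    """Every record starts with the same weight, W = 1/N."""
    return [1.0 / n_records] * n_records


def update_weights(weights, misclassified):
    """One AdaBoost round: raise weights of misclassified records, renormalise.

    `misclassified[i]` is True when the current stump got record i wrong.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # "Amount of say" of the stump; assumes 0 < error < 1.
    alpha = 0.5 * math.log((1.0 - error) / error)
    scaled = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return [w / total for w in scaled], alpha


weights = initial_weights(5)   # five records, each with weight 1/5
# Hypothetical outcome: the first stump misclassifies only the last record.
new_weights, alpha = update_weights(weights, [False, False, False, False, True])
```

After the update the misclassified record carries half of the total weight, so the next stump is pushed to classify it correctly.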

Learn AdaBoost Model from Data

AdaBoost is best used to boost the performance of decision trees on binary classification
problems.


The authors originally called the algorithm AdaBoost.M1. More recently it has been referred
to as discrete AdaBoost, because it is used for classification rather than regression.

AdaBoost can be used to improve the performance of any machine learning algorithm. It works
best with weak learners.

3.4 PERFORMANCE EVALUATION AND PREDICTING RESULTS:

Spam message detection was evaluated through feature analysis and data analysis carried out
on the selected dataset. The confusion matrix tabulates the model's predictions against the
actual classifications in the dataset. Accuracy, calculated from the confusion matrix, was
used for performance evaluation. The confusion matrix uses a specific table layout to
present the performance.
ACCURACY: Accuracy is the proportion of correct predictions made on the test data. It is
calculated by dividing the number of correct predictions by the total number of predictions.
It is the most common metric for model evaluation, but it does not always give a reliable
picture of how well the model is doing; it is least informative when the classes are
imbalanced.

CONFUSION MATRIX
A confusion matrix is an N x N matrix, where N is the number of target classes, used to
assess the effectiveness of classification models. The matrix compares the actual target
values with the values predicted by the machine learning model. This gives a comprehensive
picture of how the classification model is performing and the kinds of mistakes it is
making.

True Positive (TP): Actual spam messages that were correctly predicted as spam messages.

False Negative (FN): Actual spam messages that were wrongly classified as legitimate
messages.

False Positive (FP): Actual legitimate messages that were wrongly classified as spam
messages.


True Negative (TN): Actual legitimate messages that were correctly predicted as legitimate
messages, so the actual class and the predicted class are the same.

Figure 3.3. General confusion matrix
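As a small sketch of how the four counts and the accuracy are obtained, the Python below tallies TP, FN, FP, and TN from lists of actual and predicted labels, treating spam as the positive class. The label lists are hypothetical and only illustrate the calculation; they are not results from this project.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, FN, FP, TN, treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1      # spam predicted as spam
        elif a == positive:
            fn += 1      # spam predicted as legitimate
        elif p == positive:
            fp += 1      # legitimate predicted as spam
        else:
            tn += 1      # legitimate predicted as legitimate
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical labels for eight test messages.
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham", "ham", "spam", "ham"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
acc = accuracy(tp, fn, fp, tn)
```

Keeping the four counts separate, rather than reporting accuracy alone, shows whether the errors are missed spam (FN) or blocked legitimate messages (FP).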



CHAPTER-4
SYSTEM REQUIREMENTS SPECIFICATION

4. SYSTEM REQUIREMENTS SPECIFICATION


A software requirements specification (SRS), also known as a software system requirements
specification, offers a comprehensive description of the functions that a system must
perform. The use cases in this section describe how the software interacts with its users.
In addition to the use cases, the SRS contains non-functional requirements, which are
criteria that constrain design or implementation (such as performance engineering
requirements, quality standards, or design constraints).

4.1. SOFTWARE REQUIREMENTS

• Operating System : Windows 10

• Server-side Script : Python 3.6

• IDE : PyCharm

• Libraries Used : Flask, NumPy, io, os, Keras, TensorFlow


4.2. HARDWARE REQUIREMENTS

• Processor : I3/Intel Processor

• RAM : 8GB

• Hard Disk : 516 GB


4.3. FEASIBILITY STUDY

The goal of a feasibility study is to find the best solution that meets the performance
requirements. It includes an identification description, an assessment of potential
candidate systems, and the choice of the best candidate.

• Economic Feasibility
• Technical Feasibility
• Behavioral Feasibility

4.3.1. Economic Feasibility:

Economic analysis is the most widely used method for evaluating the effectiveness of a
candidate system. The process, more commonly known as cost/benefit analysis, entails
calculating the savings and benefits to see whether they outweigh the costs. If they do, the
decision is made to design and implement the system. If they do not, further justification
or changes to the proposed system are needed before it can be approved.

4.3.2. Technical Feasibility:

Technical analysis centres on the existing computer system (hardware, software, etc.) and
the extent to which it can support the proposed addition. Accommodating technical
enhancements involves financial considerations. If the budget is a serious constraint, the
project is judged not feasible.

4.3.3. Behavioral Feasibility:

An estimate should be made of how strongly the user staff are likely to resist the
development of a computerised system. Introducing a candidate system requires special
effort to educate, persuade, and train the staff in new ways of conducting business. It is
well known that computer installations change how staff work, so some resistance is
understandable.

4.3.4. Benefits of Doing a Feasibility Study:

The following list summarises some of the benefits of doing a feasibility study.

• As the first phase of the software development life cycle, the feasibility study supports
a thorough analysis of the system requirements.
• It helps identify the risk factors involved in developing and deploying the system.
• It helps in planning for risk analysis.
• It enables cost/benefit analysis, which helps the system and the organisation run
efficiently.
• It helps in planning the training of the developers who will implement the system.

4.4. FUNCTIONAL AND NON-FUNCTIONAL REQUIREMENTS

Requirements analysis is a vital step in determining whether a system or software
project will be successful. Functional requirements and non-functional requirements are the
two main categories of requirements.


4.4.1. Functional Requirements:


These are the features that the system must provide to meet the end user's basic
needs. All of these functionalities must be built into the system as part of the contract.
They are expressed as an input to be given to the system, an operation to be performed, and
an expected output. Unlike non-functional requirements, they are essentially the user-stated
requirements that can be seen directly in the finished product.
Examples of functional requirements are:

1) Whenever a user logs into the system, they must authenticate themselves.
2) In the event of a cyberattack, the system shuts down.
3) When a user registers for the first time on a software system, a verification email is
automatically sent to them.
4.4.2. Non-functional Requirements:

These are essentially the quality constraints that the system must satisfy according to
the project contract. The priority given to each of them, and the extent to which each is
implemented, varies from one project to another. They are also known as non-behavioral
requirements.

They primarily address concerns such as portability, security, maintainability, reliability,
scalability, performance, reusability, and flexibility.

Examples of non-functional requirements include:

1) Emails should be sent no later than 12 hours after the activity that triggers them.

2) Each request should be processed in less than ten seconds.

3) The website should load within 3 seconds when there are more than 10,000
simultaneous users.
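
A non-functional requirement such as the ten-second processing limit above can be checked with a simple timing wrapper. The following is a minimal sketch using only the standard library; the helper within_time_limit and the stand-in handler classify_stub are hypothetical names introduced for illustration and do not appear in the project code.

```python
import time


def within_time_limit(handler, limit_seconds, *args):
    """Run handler(*args) and report whether it finished within the limit."""
    start = time.perf_counter()
    result = handler(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed < limit_seconds


def classify_stub(message):
    # Placeholder for real request processing (hypothetical keyword rule)
    return "spam" if "win" in message.lower() else "ham"


result, elapsed, ok = within_time_limit(classify_stub, 10.0, "WIN a free prize")
print(result, ok)
```

A check of this kind only measures a single request on one machine; a load-related requirement such as the 10,000-user example would need a dedicated load test.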

CHAPTER-5

SYSTEM DESIGN

5. SYSTEM DESIGN

5.1. ARCHITECTURE DESIGN

Figure 5.1: Architecture diagram

5.2. INTRODUCTION TO UML DIAGRAMS


As the strategic importance of software grows, the industry looks for techniques to
automate software development, improve quality, reduce cost, and shorten time-to-market.
These techniques include component technology, visual programming, patterns, and
frameworks. As businesses grow, they also look for ways to manage the scope and scale of
their systems and reduce their complexity. In particular, they recognise the need to solve
recurring architectural problems such as physical distribution, concurrency, replication, load
balancing, and fault tolerance. The Internet has simplified some tasks while making many of
these architectural problems worse. The Unified Modeling Language (UML) was created to
meet these needs. Put simply, systems design is the process of defining a system's
architecture, components, modules, interfaces, and data to meet specified requirements, and
UML diagrams allow this to be done quickly. Eight fundamental UML diagrams are
described in this project:

• Use Case Diagram
• Class Diagram
• Activity Diagram
• Sequence Diagram
• Collaboration Diagram
• State Chart Diagram
• Component Diagram
• Deployment Diagram
5.2.1. GOALS
The primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
2. Provide extensibility and specialisation mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the market for OO tools.
6. Support higher-level development concepts such as collaborations, frameworks, patterns,
and components.
7. Integrate best practices.
5.3. UML NOTATIONS

S.NO   SYMBOL NAME        DESCRIPTION

1.     Initial Activity   Depicts the starting point of the flow or activity.
2.     Final Activity     A bull's-eye icon marks the end of the activity diagram.
3.     Activity           Represented by a rectangle with rounded edges.
4.     Decision           A point in the flow that requires a decision.
5.     Use Case           Describes how a user and the system communicate.
6.     Actor              A role that a user plays in relation to the system.
7.     Object             A real-time entity.
8.     Message            Communication between the lifelines of objects.
9.     State              Depicts events that occur during an object's lifetime.
10.    Initial State      Represents the object's initial state.
11.    Final State        Represents the object's final state.
12.    Transition         Labelled with the event that triggered it and the action
                          that results from it.
13.    Class              A group of items with similar structures and behaviours.
14.    Association        Relationship between classes.
15.    Generalization     Relationship between a more general class and a more
                          specific class.


5.4. UML DIAGRAMS

5.4.1. USE CASE DIAGRAM


A use case diagram is a type of behavioural diagram in the Unified Modeling Language
(UML) that is created from a use-case analysis. Its purpose is to show the actors, their goals
(represented as use cases), and any dependencies among those use cases. The main goal of a
use case diagram is to show which system functions are performed for which actor, and the
roles of the actors in the system can be depicted. Use cases are used during requirements
elicitation and analysis to represent the functionality of the system, and they describe the
behaviour of the system as seen from outside. Actors are outside the system boundary,
whereas use cases are inside it. A use case diagram therefore shows a set of actors and a set
of use cases enclosed by a system boundary. The diagram is needed to understand the
behaviour of the system.

1. Use cases emphasise the behaviour of the system as seen from outside.
2. This covers both the actor's role and the system's role.
3. Actors can represent people or external entities.

[Use case diagram: actors User and System; use cases Upload Data, View Data, Preprocess,
Model Training, Prediction, Generating Result, View Results]

Figure 5.2: Use Case Diagram


5.4.2. CLASS DIAGRAM


A class diagram in the Unified Modeling Language (UML) is a type of static structural
diagram used in software engineering to show the classes, properties, and relationships
between the classes that make up a system.

It is used in analysis to show the details of the system. During design, the class
diagram is examined to see whether any class has too many functions and, if so, whether it
should be split. The relationships between the classes are established. Developers use the
class diagram to build the classes. A class is a group of related objects that share the same
attributes, operations, relationships, and semantics. A system typically contains a large
number of classes.

Figure 5.3: Class Diagram

In the Unified Modeling Language, a class diagram is a type of static structure
diagram that depicts a system's structure by showing its classes, their operations, and the
relationships between them. The class diagram is the cornerstone of object-oriented
modeling. Image, build dataset, pre-processing, segmentation, and classification are the
classes represented in Figure 5.3, together with the corresponding properties, operations,
and relationships between those classes.


5.4.3. SEQUENCE DIAGRAM


In the Unified Modelling Language (UML), a sequence diagram is a type of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a message sequence chart. Sequence diagrams are sometimes called event
diagrams, event scenarios, or timing diagrams. They show how a system's components
interact with one another, and they are widely used by business professionals and software
engineers to document and understand the requirements of new and existing systems.

A sequence diagram is an interaction diagram that emphasises the time ordering of
messages. Objects taking part in an interaction are shown along with their lifelines and the
messages they send or receive, arranged in time order.

[Sequence diagram: lifelines USER and SYSTEM; messages in order: upload data,
pre-processing, model training, prediction, generating results, view result]

Figure 5.4: Sequence Diagram
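
The message order in Figure 5.4 (upload data, pre-processing, model training, prediction, generating results, view result) can be sketched as a chain of function calls. This is a simplified, standard-library illustration under stated assumptions: every function name is hypothetical, the two-row dataset is invented, and train_model is a toy keyword rule standing in for the real classifiers, not the project's training step. The column names Category and Message follow those used in app.py.

```python
import csv
import io


def upload_data(csv_text):
    """Parse uploaded CSV text into (category, message) rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Category"], row["Message"]) for row in reader]


def preprocess(rows):
    """Strip whitespace and lowercase each message."""
    return [(label, msg.strip().lower()) for label, msg in rows]


def train_model(rows):
    """Toy stand-in for model training: collect words seen only in spam."""
    spam_words, ham_words = set(), set()
    for label, msg in rows:
        (spam_words if label == "spam" else ham_words).update(msg.split())
    return spam_words - ham_words


def predict(model, message):
    """Flag a message as spam if it contains any spam-only word."""
    words = set(message.strip().lower().split())
    return "spam" if words & model else "ham"


def generate_result(label):
    """Format the prediction for display to the user."""
    return f"This is a {label.capitalize()} Message"


# Hypothetical two-row dataset, for illustration only
data = "Category,Message\nham,See you at lunch\nspam,Win a free prize now\n"
model = train_model(preprocess(upload_data(data)))
print(generate_result(predict(model, "Free prize inside")))
```

Each function corresponds to one message in the diagram, so the call order in the last two lines mirrors the top-to-bottom order of the lifeline.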


5.4.4. COLLABORATION DIAGRAM


In a collaboration diagram, the sequence of method calls is indicated by a numbering
scheme: the number shows the order in which the methods are called. The collaboration
diagram describes the same system as the sequence diagram, and the method calls are
similar. The difference is that the sequence diagram does not describe the organisation of the
objects, whereas the collaboration diagram does.

Figure 5.5: Collaboration Diagram


5.4.5. DEPLOYMENT DIAGRAM


Deployment diagrams describe the hardware components on which software
components are deployed. Component diagrams and deployment diagrams are closely
related: component diagrams describe the components, and deployment diagrams show how
those components are deployed on hardware.

UML is mainly designed to focus on the software artefacts of a system, but these two
diagrams are special in that they focus on both software and hardware components. Most
UML diagrams are used to handle logical components, whereas deployment diagrams focus
on the hardware topology of a system. Deployment diagrams are used by system engineers.
Their purpose can be described as:

1. Visualise the hardware topology of a system.
2. Describe the hardware components used to deploy software components.
3. Describe the runtime processing nodes.

[Deployment diagram: nodes System and User]

Figure 5.6: Deployment Diagram


5.4.6. ACTIVITY DIAGRAM


Activity diagrams depict workflows of stepwise activities and actions, with support
for choice, iteration, and concurrency. In the Unified Modeling Language, activity diagrams
can describe the business and operational step-by-step workflows of the components in a
system.

An activity diagram shows the overall flow of control. It resembles a flowchart with
specific states. With an activity diagram, you can follow the sequence of actions taking
place in the system. Activities look like states but are slightly more rounded. They are
stateless, because they take place and then move on automatically to the next state. The
diamond-shaped conditional branch, which selects the next activity based on a condition, is
also stateless. An activity diagram includes:

1. Action states.
2. Transitions.
3. Objects.
4. Fork, join, and branching relations, along with flowchart symbols.

[Activity diagram: upload, pre-processing, view data, model training, prediction,
generating, view result]

Figure 5.7: Activity Diagram


5.4.7. COMPONENT DIAGRAM


A component diagram is a special kind of UML diagram, and its purpose differs from
that of the diagrams discussed so far. It does not describe the functionality of the system as
a whole; it describes the components used to provide that functionality.

From that perspective, component diagrams represent the physical parts of a system,
such as files and libraries. A component diagram can also be described as a static
implementation view of a system, where static implementation means the arrangement of
the components at a particular moment. A single component diagram cannot represent the
entire system, so a collection of diagrams is used. The purpose of the component diagram
can be summarised as follows:

1. Visualise the components of a system.
2. Construct executables by using forward and reverse engineering.
3. Describe the organisation and relationships of the components.

[Component diagram: components System and User]

Figure 5.8: Component Diagram


5.4.8. STATE CHART DIAGRAM

A state chart diagram shows the flow of control from one state to another. A state is
a condition in which an object exists, and it changes when an event is triggered. The main
purpose of a state chart diagram is to model the lifetime of an object from creation to
termination. State chart diagrams are also used for forward and reverse engineering of a
system, but their primary purpose is to model reactive systems.

Figure 5.9: State Chart Diagram


5.5. ER DIAGRAM:

An entity-relationship model (ER model) describes the structure of a database with
the help of a diagram, known as an entity-relationship diagram (ER diagram). An ER model
is a design or blueprint of a database that can later be implemented as a database. The entity
set and the relationship set are the two main components of the ER model.

An ER diagram shows the relationships among entity sets. An entity set is a group of
similar entities, and these entities can have attributes. In DBMS terms, an entity is a table or
an attribute of a table, so by showing the relationships among tables and their attributes, the
ER diagram shows the complete logical structure of a database.

Figure 5.10: ER Diagram

5.6. DFD DIAGRAM:

A data flow diagram (DFD) is the traditional way to visualise the information flows
within a system. A neat and clear DFD can depict a good part of the system requirements
graphically. The processes shown can be manual, automated, or a combination of both. A
DFD shows how information enters and leaves the system, what changes the information,
and where information is stored.
The purpose of a DFD is to show the scope and boundaries of a system as a whole. It
can be used as a communication tool between a systems analyst and any person who plays a
part in the system, and it serves as the starting point for redesigning a system.


5.6.1 Context Level Diagram:

Figure 5.11: Context Level Diagram

5.6.2 Level 1 Diagram:

Figure 5.12: Level 1 Diagram

CHAPTER 6

SYSTEM CODING AND IMPLEMENTATION


6. SYSTEM CODING AND IMPLEMENTATION

6.1. Introduction to Python Programming Language:

Technically speaking, Python is a high-level, object-oriented, interpreted programming
language with dynamic semantics that is widely used for web and application development.
It is particularly attractive for rapid application development because it offers dynamic
typing and dynamic binding. Python's syntax emphasises readability, which makes it
straightforward to learn and reasonably easy to use. Developers can often read and
understand Python code more easily than code written in other languages. This in turn
reduces the cost of programme maintenance and development, because teams can
collaborate without significant language and experience barriers.

Features:

• Simple
• Easy
• Portable
• Object oriented
• High Level
• Open Source and Free
• Support for GUI
• Interpreted
• Dynamic
• Readable


Figure 6.1: Working of a Python program

Python is an object-oriented programming language, much like Java. It is an
interpreted language: programmes are run by an interpreter. Python favours simplicity and
modularity to improve readability. CPython, the default implementation of Python, is
written in C. Python first compiles the source code into a sequence of bytecode instructions.
The processor cannot execute bytecode directly, so an intermediary is required: the Python
virtual machine (PVM), an interpreter that executes the bytecode.
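
The compilation of source code to bytecode described above can be observed with the standard-library dis module. The following is a minimal sketch; the function add is an arbitrary example, and the exact instruction names printed depend on the Python version, so no particular output is assumed.

```python
import dis


def add(a, b):
    return a + b


# The compiled bytecode is stored on the function's code object
raw_bytecode = add.__code__.co_code  # a bytes object

# dis turns the raw bytes into readable instructions for the virtual machine
instruction_names = [ins.opname for ins in dis.get_instructions(add)]
print(instruction_names)
```

The list that is printed is what the Python virtual machine steps through when add is called; the processor never sees these instructions directly.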

CPython is the reference implementation of Python. An "implementation" is a
programme or environment that supports the execution of programmes written in Python.
There have been, and continue to be, several software packages that provide what we know
as Python, although some of them are better described as distributions or variants of
existing implementations than as entirely new implementations.


Figure 6.2: Implementation of a Python program

Python is a general-purpose programming language, which means it can be applied to
almost anything. Most importantly, it is an interpreted language: the written code is
translated into a form the computer can execute at run time, whereas most compiled
languages perform this conversion before the programme is run. This kind of language is
often called a "scripting language" because it was originally intended for simple projects.

Since Python's early days, the notion of a "scripting language" has changed
considerably, because Python is now used to build large, sophisticated systems rather than
only simple ones. This reliance on Python grew as the internet became more widely used.
Many web platforms and applications use Python, including Google's search engine,
YouTube, and the web-based transaction system of the New York Stock Exchange (NYSE).
A language that powers a stock exchange system must be a serious one.

Python can also be used to solve mathematical problems, display numbers or graphics,
process text, and save data. In essence, it is used in the background to process many elements
that you might need or encounter on your device(s), including mobile.

6.1.1. BENEFITS OF PYTHON

1) Python can be used to create prototypes, and because it is so simple to use and read,
prototyping can be done rapidly.

2) Many platforms for automation, data mining, and big data rely on Python.

3) Compared to large languages like C# and Java, Python offers a more productive coding
environment. Experienced programmers tend to stay more organised and productive when
using Python.

4) Python is simple to read, even for those who are not experienced programmers. Anyone
can start using the language; all it needs is some perseverance and plenty of practice. This
also makes it a good choice for large development teams and teams with multiple
programmers.

5) Django is a full, open-source web application framework that is powered by Python.
Frameworks of this kind (Ruby on Rails plays the same role for Ruby) can make the process
of developing software simpler.

6) Because it is open source and built by its community, Python has a huge user base. Millions
of like-minded programmers use the language every day and keep improving its core
features. The latest version of Python continues to receive updates and improvements over
time. This is also a good way to connect with other developers.

6.2. Libraries used in Python

Python supports modules and packages, which allows programmes to be designed in a
modular way and code to be reused across projects. Once a module or package has been
created, it can be imported into other applications and scaled for use in them.
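
As a small illustration of module reuse, the sketch below writes a one-function module to a temporary directory and then imports it, as any other programme with access to that directory could. The module name textutils and the function normalise are hypothetical and are not part of the project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a small module to a temporary directory (stand-in for a shared package)
module_dir = tempfile.mkdtemp()
Path(module_dir, "textutils.py").write_text(
    "def normalise(message):\n"
    "    return message.strip().lower()\n"
)

# Any programme that can see the directory can import and reuse the function
sys.path.insert(0, module_dir)
textutils = importlib.import_module("textutils")
print(textutils.normalise("  FREE Prize  "))
```

In a real project the module would simply live in the source tree and be loaded with an ordinary import statement; the temporary directory is used here only to keep the example self-contained.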

Pandas: Pandas is a well-known open-source Python library for data science, data analysis, and
machine learning. It is built on top of NumPy, another Python library that supports
multidimensional arrays.

NumPy: NumPy is a Python module for working with arrays. It also provides Fourier
transforms and linear algebra functions.

Sklearn: Scikit-learn, originally known as scikits.learn, is a free machine learning library for
Python. It features classification, regression, and clustering algorithms, including
support-vector machines, random forests, gradient boosting, k-means, and DBSCAN, and it
offers a range of supervised and unsupervised learning techniques through a consistent
Python interface.

It is licenced under a permissive simplified BSD licence and is distributed with numerous
Linux distributions, which encourages both academic and commercial use. The library is
built on SciPy (Scientific Python), which must be installed before scikit-learn can be used.
Extensions or modules for SciPy are conventionally named SciKits; because this module
provides learning algorithms, it is named scikit-learn. The library aims for the robustness
and support required for use in production systems, which means a strong focus on usability,
code quality, collaboration, documentation, and performance.

Pandas: Used for the organisation and analysis of data.

Pymysql: PyMySQL is a Python library that connects to a MySQL server, allowing Python
programmes to communicate with it. Connection settings such as the port are supplied as
Python parameters. PyMySQL is a MySQL driver written entirely in Python; it was
originally created as a rough port of the MySQL-Python driver. It is fully open source,
hosted on GitHub, distributed via PyPI, and actively maintained, so it meets the usual
requirements for a driver. Because it is written entirely in Python, it is compatible with
Python 3 and with eventlet monkey-patching.

PyCharm

PyCharm is one of the most popular Python IDEs. There are many reasons for this,
one of which is that it was created by JetBrains, the company that also created the well-known
IntelliJ IDEA IDE, one of the "big 3" Java IDEs, and WebStorm, the "smartest
JavaScript IDE." Another good reason is its support for Django web development.

PyCharm was developed primarily for Python programming and runs on different
operating systems, including Windows, Linux, and macOS. The IDE includes version control
options, a debugger, testing tools, and tools for code analysis. It also helps programmers
create Python plugins with the aid of the many available APIs. The IDE lets us work directly
with a number of databases without integrating them with other programmes. Although it is
made specifically for Python, it also supports the creation of HTML, CSS, and JavaScript
files. Its user interface can be customised with plugins to suit the user's needs.


6.3. Sample Code:

app.py

# Importing necessary libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from flask import Flask, render_template, request

webapp = Flask(__name__)


@webapp.route('/')
def index():
    return render_template('index.html')


@webapp.route('/about')
def about():
    return render_template('about.html')


@webapp.route('/load', methods=["GET", "POST"])
def load():
    global df, dataset
    if request.method == "POST":
        # Read the uploaded CSV and keep the first 100 rows for display
        data = request.files['data']
        df = pd.read_csv(data)
        dataset = df.head(100)
        msg = 'Data Loaded Successfully'
        return render_template('load.html', msg=msg)
    return render_template('load.html')


@webapp.route('/view')
def view():
    return render_template('view.html', columns=dataset.columns.values,
                           rows=dataset.values.tolist())


def preprocess_data(df):
    # Strip surrounding whitespace and convert text to lowercase
    df['Message'] = df['Message'].str.strip().str.lower()
    return df


@webapp.route('/preprocess', methods=['POST', 'GET'])
def preprocess():
    global x, y, x_train, x_test, y_train, y_test, vec
    if request.method == "POST":
        # Test-set share is entered on the form as a percentage
        size = int(request.form['split']) / 100
        df = pd.read_csv("spam (1).csv", encoding='latin-1')
        df = preprocess_data(df)
        # Split into training and testing data
        x = df['Message']
        y = df['Category']
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, stratify=y, test_size=size, random_state=42)
        # Vectorize message text into word counts: fit the vocabulary on
        # the training split only, then apply it to the test split
        vec = CountVectorizer(stop_words='english')
        x_train = vec.fit_transform(x_train).toarray()
        x_test = vec.transform(x_test).toarray()
        return render_template('preprocess.html',
                               msg='Data Preprocessed and Trained Successfully')
    return render_template('preprocess.html')


@webapp.route('/model', methods=['POST', 'GET'])
def model():
    if request.method == "POST":
        s = int(request.form['algo'])
        if s == 0:
            return render_template('model.html', msg='Please Choose an Algorithm to Train')
        elif s == 1:
            multinomialnb = MultinomialNB()
            multinomialnb.fit(x_train, y_train)
            # Accuracy on the held-out test set, as a percentage
            acc_nb = multinomialnb.score(x_test, y_test) * 100
            msg = 'The accuracy obtained by Naive Bayes Classifier is ' + str(acc_nb) + '%'
            return render_template('model.html', msg=msg)
        elif s == 2:
            linearsvc = LinearSVC()
            linearsvc.fit(x_train, y_train)
            acc_svc = linearsvc.score(x_test, y_test) * 100
            msg = 'The accuracy obtained by Support Vector Classifier is ' + str(acc_svc) + '%'
            return render_template('model.html', msg=msg)
    return render_template('model.html')


@webapp.route('/prediction', methods=['POST', 'GET'])
def prediction():
    if request.method == "POST":
        f1 = request.form['text']
        multinomialnb = MultinomialNB()
        multinomialnb.fit(x_train, y_train)
        # Reuse the vectorizer fitted in preprocess() so that the features
        # of the new message match the training features
        result = multinomialnb.predict(vec.transform([f1]).toarray())
        # Category labels may be 'ham'/'spam' strings or 0/1, depending on the CSV
        if str(result[0]).lower() in ('ham', '0'):
            msg = 'This is a Ham Message'
        else:
            msg = 'This is a Spam Message'
        return render_template('prediction.html', msg=msg)
    return render_template('prediction.html')


@webapp.route('/news')
def news():
    return render_template('news.html')


if __name__ == '__main__':
    webapp.run(debug=True)
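
To make the vectorisation step in app.py easier to follow, the sketch below reimplements the bag-of-words idea in plain Python: a vocabulary is learned from the training messages, and any later message is turned into a vector of counts over that same vocabulary. It is a simplified illustration, not scikit-learn's implementation. The function names and sample messages are hypothetical, the tokenisation imitates CountVectorizer's default pattern of word tokens with two or more characters, and the English stop-word removal used in app.py is omitted.

```python
import re


def tokenize(text):
    """Lowercase and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(messages):
    """Map each distinct token in the training messages to a column index."""
    vocab = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(vocabulary, message):
    """Count occurrences of known tokens; unseen words are ignored."""
    counts = [0] * len(vocabulary)
    for tok in tokenize(message):
        if tok in vocabulary:
            counts[vocabulary[tok]] += 1
    return counts


# Hypothetical training messages, for illustration only
train = ["Win a free prize", "free entry win win"]
vocabulary = fit_vocabulary(train)
vector = transform(vocabulary, "Win a FREE holiday, win big")
print(vocabulary, vector)
```

Because the classifier learns one weight per vocabulary column, a message submitted for prediction has to be transformed with the vocabulary learned during training; building a fresh vocabulary at prediction time would produce columns that no longer line up with the trained model.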

prediction.html

<!DOCTYPE html>
<html lang="en">
<head>
<!-- basic -->
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<!-- mobile metas -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="viewport" content="initial-scale=1, maximum-scale=1">
<!-- site metas -->
<title>Spam Message Classification</title>
<link rel="icon" href="static/images/images.jpg" type="image/icon type">
<meta name="keywords" content="">
<meta name="description" content="">
<meta name="author" content="">
<!-- site icons -->
<link rel="icon" href="static/images/fevicon/logo.jpg" type="image/png" />
<!-- bootstrap css -->
<link rel="stylesheet" href="static/css/bootstrap.min.css" />


<!-- site css -->


<link rel="stylesheet" href="static/css/style.css" />
<!-- responsive css -->
<link rel="stylesheet" href="static/css/responsive.css" />
<!-- colors css -->
<link rel="stylesheet" href="static/css/colors.css" />
<!-- wow animation css -->
<link rel="stylesheet" href="static/css/animate.css" />
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body id="default_theme" class="home_page1">
<!-- header -->
<header class="header header_style1">
<div class="container">
<div class="row">
<div class="col-md-9 col-lg-10">
<div class="logo"><a href="{{url_for('index')}}"><img
src="static/images/LOGO5.png" alt="#" /></a></div>
<div class="main_menu float-right">
<div class="menu">
<ul class="clearfix">
<li class="active"><a href="{{url_for('index')}}">Home</a></li>
<li><a href="{{url_for('about')}}">About</a></li>
<li><a href="{{url_for('load')}}">Upload</a></li>
<li><a href="{{url_for('view')}}">View</a></li>
<li><a href="{{url_for('preprocess')}}">Preprocessing</a></li>
<li><a href="{{url_for('model')}}">Model Training </a></li>
<li class="active"><a href="{{url_for('news')}}">EDA Section</a></li>
</ul>
</div>


</div>
</div>
<div class="col-md-3 col-lg-2">
<div class="right_bt"><a class="bt_main"
href="{{url_for('prediction')}}">Prediction</a> </div>
</div>
</div>
</div>
</header>

<section id="banner_parallax" class="slide_banner1">


<div class="container">
<div class="row">
<div class="col-md-12">
<div class="full">
<div class="slide_cont">
<h2 style="right: 150px;bottom: 74px;">Predict the SMS</h2>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- end header -->

<div class="overlay"></div>
<div class="gtco-container">
<div class="row">
<div class="col-md-12 col-md-offset-0 text-center">
<div class="display-t">
<div class="display-tc animate-box" data-animate-effect="fadeIn">
<center><h3 style="bottom: 151px;color:rgb(11, 203, 236);top: -222;">{{msg}}</h3></center>
<h3 style="color:rgb(11, 203, 236);bottom: 115px;">Spam Message Identification </h3>

<h3 style="color:rgb(11, 203, 236);bottom: 100px;">With the help of Machine Learning</h3>
<p>
<form action="{{url_for('prediction')}}" method="post" enctype="multipart/form-data">
<center><input type="text" name="text" placeholder="Enter The Message To Classify" style="height: 53.99306px;width: 503.99306px; color:rgb(11, 203, 236); border-color:rgb(11, 203, 236);"><br>
</center>

<input class="white_bt bt_main" type="submit" value="Submit" style="margin-left: 300px;margin-top: 33px;">
</form>
<!-- <a href="#" class="btn btn-white btn-outline btn-lg"></a></p>-->
<!-- <img src="static/images/business_img.jpg" alt=”Image” height="1000"
width="1200"> -->
</div>
</div>
</div>
</div>
</div>
<!-- section -->
<section class="layout_padding gradiant_bg cross_layout">
<div class="container">
<div class="row">
<div class="col-sm-12">
<div class="full text_align_center white_fonts">
<div class="heading_main center_head_border heading_style_1">


<!-- <h2>Easy <span>Steps</span></h2> -->


</div>
</div>
</div>
</div>
<div class="row step_section">
<div class="offset-xl-1 col-xl-10 col-md-12">
<div class="row">
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog arrow_right_step">
<div class="step_inner">
<!-- <i class="fa fa-diamond"></i><br> -->
<!-- <p>Go app store</p> -->
</div>
</div>
</div>
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog">
<div class="step_inner">
<!-- <i class="fa fa-user"></i><br> -->
<!-- <p>Create an Account</p> -->
</div>
</div>
</div>
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog">
<div class="step_inner">
<!-- <i class="fa fa-download"></i><br> -->
<!-- <p>Download & Install</p> -->
</div>
</div>
</div>
<div class="col-lg-3 col-md-6 col-sm-12 col-xs-12">
<div class="step_blog">


<div class="step_inner">
<!-- <i class="fa fa-thumbs-up"></i><br> -->
<!-- <p>Enjoy & Rate us!</p> -->
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- end section -->
<!-- footer -->
<footer class="footer_style_2">
<div class="footer_top">
<div class="container">
<div class="row">
<div class="col-xs-12 col-sm-6 col-md-6 col-lg-4 margin_bottom_30">
<div class="full width_9" style="margin-bottom:25px;"> <a
href="index.html"><img class="img-responsive" width="250"
src="static/images/LOGO5.png" alt="#"></a> </div>
<div class="full width_9">
<p align="justify" style="width: 600px;">Spam messages are messages sent to a large group of
recipients without their prior consent, typically advertising goods and services or business
opportunities.

In the recent period, the percentage of scam messages amongst spam has increased sharply.
Scam messages typically trick people into giving away money or personal details by offering
an attractive or false deal. Based on statistics from the Singapore Police Force, from
January to June 2020, the amount cheated through scams increased by more than S$8 million!

Spam message classification is a step towards building a tool for scam message
identification and early scam detection.

</p>
</div>
<div class="full width_9">
<!-- <p>the vero eos et accusamus et iusto odio dignissimos ducimus qui
blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias
excepturi sint occaecati..</p> -->
</div>
</div>
<!-- <div class="col-xs-12 col-sm-6 col-md-6 col-lg-3 margin_bottom_30">
<div class="full">
<div class="footer_blog_2 width_9">
<h3>Twitter Feed</h3>
<p><i class="fa fa-twitter"></i> Creative_Talent - 26 mins
Te invitamos a seguir la cta. de WEntrepreneur_ ¡Atrévete!
#Emprendimiento #PyMES #Economía #Bussines #Negocios https://t.co/Y7tZMmxGHn
</p>
<p><i class="fa fa-twitter"></i> Creative_Talent - 26 mins
Te invitamos a seguir la cta. de WEntrepreneur_ ¡Atrévete!
#Emprendimiento #PyMES #Economía #Bussines #Negocios https://t.co/Y7tZMmxGHn
</p>
</div>
</div>
</div> -->
<div class="col-xs-12 col-sm-6 col-md-6 col-lg-2 margin_bottom_30" style="left: 0px;margin-left: 290px;">
<div class="full">
<div class="footer_blog_2">
<h3>Social</h3>
</div>
</div>
<div class="full">


<ul class="footer-links">
<li><a href="#"><i class="fa fa-facebook"></i> 256 Likes</a></li>
<li><a href="#"><i class="fa fa-github"></i> 57+ Projects</a></li>
<li><a href="#"><i class="fa fa-twitter"></i> 1,258 Followers</a></li>
<li><a href="#"><i class="fa fa-pinterest"></i> 2538+ Pins</a></li>
</ul>
</div>
</div>
<div class="col-xs-12 col-sm-6 col-md-6 col-lg-3 margin_bottom_30">
<div class="full">
<div class="footer_blog_2 width_9">
<h3>Blog</h3>
</div>
<div class="blog_post_footer">
<div class="blog_post_img"> <img width="80" height="80"
src="static/images/scr1.png" alt="#"> </div>
<div class="blog_post_cont">
<p class="date">July 22, 2015</p>
<p class="post_head">Round and round like a carousel</p>
</div>
</div>
<div class="blog_post_footer">
<div class="blog_post_img"> <img width="80" height="80"
src="static/images/scr2.png" alt="#"> </div>
<div class="blog_post_cont">
<p class="date">July 22, 2015</p>
<p class="post_head">Round and round like a carousel</p>
</div>
</div>
<div class="blog_post_footer">
<div class="blog_post_img"> <img width="80" height="80"
src="static/images/scr3.png" alt="#"> </div>
<div class="blog_post_cont">


<p class="date">July 22, 2015</p>


<p class="post_head">Round and round like a carousel</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- footer bottom -->
<div class="footer_bottom">
<!-- <p>Dessigned and developed by <strong>html.design</strong></p> -->
</div>
</footer>
<!-- end footer -->
<!--=========== js section ===========-->
<!-- jQuery (necessary for Bootstrap's JavaScript) -->
<script src="static/js/jquery.min.js"></script>
<script src="static/js/popper.min.js"></script>
<script src="static/js/bootstrap.min.js"></script>
<!-- wow animation -->
<script src="static/js/wow.js"></script>
<!-- custom js -->
<script src="static/js/custom.js"></script>
<!-- google map js -->
<script src="https://maps.googleapis.com/maps/api/js?key=AIzaSyA8eaHt9Dh5H57Zh0xVTqxVdB FCvFMqFjQ&callback=initMap"></script>
<!-- end google map js -->
</body>
</html>



CHAPTER-7

SYSTEM TESTING

7. SYSTEM TESTING

7.1. SOFTWARE TESTING TECHNIQUES

Software testing is a method for evaluating the quality of software products and
identifying defects so that they can be rectified. Testing strives to meet its objectives
but works within significant constraints, so it is effective only when it stays committed to
the objectives set for it.

7.1.1. Testing Objectives

1. To evaluate the work products, such as user stories, designs, specifications, and code.
2. To ensure that all specified requirements are satisfied.
3. To ensure that the test object is complete and meets the expectations of users and
stakeholders.

7.1.2. Test Case Design

Every engineering product can be tested in one of two ways: white box testing or black box
testing.

7.1.3. White Box Testing

Black box testing and white box testing are the two main software testing
methodologies. White box testing, also known as structural, clear box, open box, or
transparent box testing, is covered in this section. It evaluates the software's internal
code and infrastructure against given inputs and the expected outputs. It emphasises
analysis of the internal structure and is concerned with a program's internal workings. Its
fundamental goal is to verify the flow of inputs and outputs through the code while also
assuring the software's security. The names "clear box," "white box," and "transparent box"
all allude to being able to see through the outer shell of the software. White box testing
is carried out by developers, who at this stage test every line of the program's code.
Before handing the software to the testing team, the developers run white box tests to
ensure that it conforms to the requirements and to identify any mistakes.

The developer fixes the issues found and performs one more round of white box
testing before releasing the project to the testing team. Here, fixing a problem means
removing the defect and restoring the affected functionality of the application. Test
engineers do not help to fix the problems, for the following reasons:

o Fixing one problem might break other features. Developers should therefore keep improving
the code while test engineers keep looking for faults.
o If test engineers spend most of their time fixing problems, they might not be able to find
new flaws in the programme.

The following tests are part of white box testing:

o Path testing
o Loop testing
o Condition testing
o Testing from the memory perspective
o Testing the performance of the programme
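
As a rough illustration of path, loop, and condition testing, the sketch below uses a hypothetical rule-based helper, `flag_message`, which is an assumption made for this example and not part of the project code. The assertions are chosen with knowledge of the function body, so that every branch is executed at least once.

```python
def flag_message(text, max_len=160):
    """Hypothetical helper with several decision points, used only to illustrate path testing."""
    if not text.strip():                  # path 1: empty or whitespace-only input
        return "invalid"
    if len(text) > max_len:               # path 2: message longer than the limit
        return "too_long"
    for word in text.lower().split():     # loop: runs once per word
        if word in {"free", "winner", "prize"}:   # condition inside the loop
            return "suspicious"           # path 3: a trigger word was found
    return "ok"                           # path 4: loop finished without a match


# One test per path, so every statement and branch is exercised.
assert flag_message("   ") == "invalid"
assert flag_message("a" * 161) == "too_long"
assert flag_message("You are a WINNER") == "suspicious"
assert flag_message("See you at lunch") == "ok"
```

The boundary value `max_len` is worth a test of its own, since a message of exactly 160 characters should take the "ok" path rather than the "too_long" path.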
7.1.4. Black Box Testing

Black box testing means testing the functionality of a software application without
access to its internal code structure, implementation details, or internal paths. It is
concerned solely with the inputs and outputs of the software and is driven entirely by the
software's requirements and specifications.

Any software system can be treated as a black box: an Oracle database, the Google
website, the Windows operating system, or even your own custom programme. Under black box
testing, you test these applications by focusing only on their inputs and outputs, without
any knowledge of how the underlying code is implemented.

This method searches for errors in the following areas:

1. Incorrect or missing functions

2. Interface errors

3. Errors in data structures

4. Behaviour or performance errors

5. Initialization and termination errors
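
A minimal sketch of a black box test run follows. The names `run_black_box_cases` and `toy_predictor` are hypothetical and stand in for the real prediction function; the test cases are written from the expected behaviour only, without reading the body of the system under test.

```python
def run_black_box_cases(system, cases):
    """Feed each input to `system` and collect the cases whose output differs from the expectation."""
    failures = []
    for given, expected in cases:
        try:
            actual = system(given)
        except Exception as exc:          # a crash also counts as a failed case
            actual = f"error: {type(exc).__name__}"
        if actual != expected:
            failures.append((given, expected, actual))
    return failures


# Stand-in for the real predictor; the tester never looks inside it.
def toy_predictor(message):
    return "spam" if "prize" in message.lower() else "ham"


# Cases written from the requirements only: (input, expected output).
cases = [
    ("Claim your PRIZE now", "spam"),     # typical spam wording
    ("Are we meeting at 5?", "ham"),      # ordinary message
    ("", "ham"),                          # boundary: empty input
]
print(run_black_box_cases(toy_predictor, cases))   # prints []
```

An empty list means every case matched its expected output; each failure is reported as an (input, expected, actual) triple.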


7.2. STRATEGIES FOR SOFTWARE TESTING

o Unit testing
o Integration testing
o Validation testing
o System testing
o Security testing
o Performance evaluation

7.2.1. Unit Testing

Unit testing focuses on the module, the smallest unit of software design. Using the
procedural design description as a guide, important control paths are tested within the
boundary of the module. The smallest testable parts of a programme, called units, are
examined separately and independently to confirm that they operate properly. Software
engineers, and occasionally QA staff, apply this process throughout the development phase.
The main objective of unit testing is to isolate written code and verify that it operates as
intended.

When done correctly, unit testing can detect coding flaws that would otherwise be
difficult to locate. Unit testing is one element of test-driven development (TDD), a
practical technique that builds the product through continual testing and revision. Unit
testing is the first level of software testing and comes before integration testing and the
other types of testing. Each unit is tested in isolation from external code and
functionality. Unit tests are usually automated, although manual testing remains an option.
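
The following sketch shows what an automated unit test could look like with Python's built-in `unittest` module. The function `clean_text` is a hypothetical preprocessing unit assumed for illustration; the project's actual preprocessing code may differ.

```python
import re
import unittest


def clean_text(message):
    """Hypothetical preprocessing unit: lowercase, drop non-letters, collapse spaces."""
    lowered = message.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return " ".join(letters_only.split())


class CleanTextTest(unittest.TestCase):
    def test_lowercases_input(self):
        self.assertEqual(clean_text("FREE Entry"), "free entry")

    def test_removes_digits_and_punctuation(self):
        self.assertEqual(clean_text("Win $1000 now!!!"), "win now")

    def test_empty_message_stays_empty(self):
        self.assertEqual(clean_text(""), "")


# Run the test case in-process and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanTextTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method checks one behaviour of the unit in isolation, so a failure points directly at the behaviour that broke.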

7.2.2. Integration Testing

Integration testing is a systematic technique for constructing the program structure
while running tests to uncover interface problems. The aim is to take unit-tested modules
and build the program structure dictated by the design. In integration testing, software
modules are logically combined and tested as a group. A typical software project consists of
several modules written by different programmers, and the goal of this level of testing is
to find defects in the way these modules interact once they are integrated. Integration
testing is also known as "string testing" or "thread testing."

Top-Down Integration:

Top-down integration is an incremental approach to building and testing the program
structure. Modules are integrated by moving downward through the control hierarchy, starting
with the main control (home or index) programme. Modules subordinate to the main programme
are then incorporated into the structure in either a depth-first or a breadth-first manner.

Bottom-up Integration:

Bottom-up integration begins construction and testing with atomic modules, the
components at the lowest level of the program structure. Because modules are integrated from
the bottom up, the processing required by modules subordinate to a given level is always
available and there is no need for stubs.
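
To make the role of a stub concrete, here is a small sketch of top-down integration. All three functions (`clean_text`, `model_stub`, `predict`) are hypothetical names assumed for this example: the top-level `predict` module is tested first, while the trained classifier is replaced by a stub.

```python
def clean_text(message):
    """Lower-level module (hypothetical): normalise the raw message."""
    return " ".join(message.lower().split())


def model_stub(cleaned):
    """Stub standing in for the trained classifier during top-down integration."""
    return "spam" if "free" in cleaned else "ham"


def predict(message, preprocess=clean_text, model=model_stub):
    """Top-level module: wires preprocessing and the model together."""
    if not message:
        return "Please enter a message"
    return model(preprocess(message))


# Top-down: test the controller first, with the model replaced by a stub.
assert predict("FREE   entry today") == "spam"
assert predict("") == "Please enter a message"

# Interface check: the model must receive the cleaned text, not the raw input.
seen = []
predict("Hello   THERE", model=lambda text: seen.append(text) or "ham")
assert seen == ["hello there"]
```

In a bottom-up run the order is reversed: `clean_text` and the real model are tested first, and `predict` is added once they are known to work, so no stub is required.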

7.2.3. Validation testing

Validation testing assures that the software that has been built and tested satisfies
the needs of the client or user. The business requirement logic and scenarios must be tested
thoroughly, and every significant component of the application must be covered. As a tester,
you must always be able to validate the business logic or scenarios given to you. The
validation process is one method that supports such a detailed examination of functionality.


7.2.4. System Testing

The main goal of system testing is to exercise the computer-based system fully.
Although each test has a distinct purpose, all of them verify that the system elements have
been properly integrated and perform their allocated functions. System testing examines the
completely integrated software system. Software is only one element of a computer system:
the programme is built from modules that, together with other software and hardware, form
the complete system. In other words, a computer system consists of several software
programmes that perform different tasks, but software alone cannot carry out these tasks.

System testing must therefore be run on the appropriate hardware. It is a series of
tests that verify the overall behaviour of a computer system built around the integrated
software. It examines the end-to-end flow of an application from the viewpoint of a user:
every module the application needs is exercised, and the product is tested as a whole to
ensure that the final features and functionality work as planned. It is also called
end-to-end testing, and the testing environment mirrors the production environment.

7.2.5. Security testing

Security testing is an essential component of software testing because it allows us to
identify vulnerabilities, threats, and risks in a software application and to protect the
programme from malicious outsiders. Its primary objective is to find all of the potential
loopholes and weaknesses in the programme so that the application keeps operating. Security
testing uncovers potential security risks and helps the programmer to resolve them. It is a
method for ensuring data security while preserving software usability.

7.2.6. Performance Evaluation

Performance testing is a technique for assessing a system's responsiveness and
stability under changing workloads. It also assesses the dependability, scalability, and
resource use of the system.


Performance Evaluation Method:

Load testing is the simplest way to evaluate how a system behaves under a specific
expected load. Its results show how key business transactions perform and how much load is
placed on the application server, the database, and other systems. Stress testing is carried
out to find the upper limit of the system's capacity and to see how it behaves when the load
exceeds the expected maximum.

Soak tests, often called endurance tests, evaluate a system's performance under a
sustained load. During soak testing, memory usage is monitored to detect problems such as
memory leaks; the main objective is to observe how the system behaves over time. In spike
testing, the number of users is increased suddenly and the system's behaviour is observed
immediately; the main objective is to assess whether the system can cope with abrupt changes
in workload.
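
A very small load-measurement sketch is given below, using only the Python standard library. The function `toy_handler` is an assumed stand-in for the prediction endpoint, and the 95th percentile is computed with a simple index rule rather than a statistical library, so the numbers are indicative only.

```python
import statistics
import time


def measure_latency(handler, messages, repeats=3):
    """Call `handler` on every message `repeats` times and summarise per-call latency in seconds."""
    samples = []
    for _ in range(repeats):
        for message in messages:
            start = time.perf_counter()
            handler(message)
            samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = max(0, int(0.95 * len(samples)) - 1)   # rough 95th percentile position
    return {
        "calls": len(samples),
        "mean": statistics.mean(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


def toy_handler(message):
    """Stand-in for the prediction endpoint; an assumption, not the project's code."""
    return "spam" if "free" in message.lower() else "ham"


# Load test: a fixed, expected workload. Stress test: keep raising the workload.
for load in (100, 1_000, 10_000):
    report = measure_latency(toy_handler, ["Free entry"] * load, repeats=1)
    print(load, f"mean={report['mean']:.2e}s", f"p95={report['p95']:.2e}s")
```

For a soak test the same measurement would be repeated over a long period at constant load, and for a spike test the load would jump abruptly between two consecutive runs.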



CHAPTER-8
RESULTS

8. RESULTS

OUTPUT SCREEN SHOTS WITH DESCRIPTION:

HOME PAGE:

Here the user views the home page of Spam Message Identification Using Machine Learning
Approach.

Figure 8.1: Home Page

ABOUT PAGE:

Here we can read about the project.


Figure 8.2: About Page

UPLOAD PAGE:

In the upload page, users can upload the spam dataset.

Figure 8.3: Upload Page

VIEW PAGE:

Here we can see the uploaded dataset.

Figure 8.4: View Page


PREPROCESSING PAGE:

Here we prepare the data in a form that the system can understand, i.e., we make the data
noise free.

Figure 8.5: Preprocessing Page

MODEL TRAINING PAGE:

Here we can train models on our data using different algorithms.

Figure 8.6: Model Training Page


PREDICTION PAGE:

This page shows the SPAM or HAM prediction result for the entered message.

Figure 8.7.1: Prediction Page

Figure 8.7.2: Prediction Page


Graph:
After applying the machine learning algorithms, the results are shown in the graph below.

[Bar chart: one bar each for Naive Bayes, SVM, XGBoost and ADA Boost; vertical axis from 94.00% to 100.00%]

Figure 8.8: Graph



CHAPTER 9

CONCLUSION

AND

FUTURE ENHANCEMENTS

9. CONCLUSION AND FUTURE ENHANCEMENT

We presented spam message categorization using several algorithms: Naive Bayes,
Support Vector Machine, XGBoost, and AdaBoost. For the assessment of the spambase datasets
with the Weka tool, two evaluation options in Weka are employed: cross validation and
training set. With the training set option, the same data is used for both training and
testing. With cross validation, the training data is split into several folds. Following
implementation and experimental analysis, accuracy was obtained for each classifier with the
training set option. Consequently, Support Vector Machine is the approach that produces the
best results for spam message categorization.
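
The difference between the two evaluation options can be sketched in plain Python. The example below does not use Weka or the project's models: `MemorisingClassifier` is a deliberately naive toy model, and the six messages are made-up data, both assumed only to show why accuracy measured on the training set tends to be more optimistic than cross-validated accuracy.

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


class MemorisingClassifier:
    """Toy model: remembers training messages, predicts the majority label otherwise."""

    def fit(self, texts, labels):
        self.seen = dict(zip(texts, labels))
        # sorted() makes the tie-break deterministic when classes are balanced
        self.default = max(sorted(set(labels)), key=labels.count)
        return self

    def predict(self, text):
        return self.seen.get(text, self.default)


def accuracy(model, texts, labels):
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(texts)


def cross_val_accuracy(texts, labels, k):
    """Average held-out accuracy over k folds."""
    scores = []
    for fold in k_fold_indices(len(texts), k):
        held_out = set(fold)
        train_x = [t for i, t in enumerate(texts) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = MemorisingClassifier().fit(train_x, train_y)
        scores.append(accuracy(model, [texts[i] for i in fold], [labels[i] for i in fold]))
    return sum(scores) / k


texts = ["win cash", "lunch?", "free prize", "call me", "claim reward", "see you"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

train_set_score = accuracy(MemorisingClassifier().fit(texts, labels), texts, labels)
cv_score = cross_val_accuracy(texts, labels, k=3)
print(train_set_score, cv_score)   # prints 1.0 0.5
```

The toy model scores perfectly on data it has already seen but only at chance level on held-out folds, which is why cross validation is usually the safer basis for comparing classifiers.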

This study concludes that most of the proposed spam message identification methods
are based on supervised machine learning techniques. Preparing a labelled dataset for
supervised model training is a crucial and time-consuming task. We tested this model on only
one dataset; in future we will test it on several datasets. The study provides comprehensive
insights into these algorithms and some future research directions for spam message
identification. In future we will use additional models to predict whether a message is spam
or not.



BIBLIOGRAPHY

[1] J. Han, M. Kamber. Data Mining Concepts and Techniques. by Elsevier inc., Ed: 2nd,
2006
[2] Tiago A. Almeida, José María Gómez Hidalgo, Akebo Yamakami. Contributions to the Study
of SMS Spam Filtering. University of Campinas, Sao Paulo, Brazil.
[3] M. Bilal Junaid, Muddassar Farooq. Using Evolutionary Learning Classifiers To Do
Mobile Spam (SMS) Filtering. National University of Computer & Emerging Sciences
(NUCES) Islamabad, Pakistan.
[4] Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul,
133-791 South Korea.
[5] Xu, Qian, Evan Wei Xiang, Qiang Yang, Jiachun Du, and Jieping Zhong. "Sms spam
detection using noncontent features." IEEE Intelligent Systems 27, no. 6 (2012): 44-51.
Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A., and Naik, V. "SMSAssassin:
Crowdsourcing driven mobile-based system for SMS spam filtering," Proceedings of the
12th Workshop on Mobile Computing Systems and Applications, ACM, 2011, pp. 1-6.
[6] Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification” 2009
First International Workshop on Education Technology and Computer Science, 168-171.
[7] Weka The University of Waikato, Weka 3: Data Mining Software in Java, viewed on
2011 September 14.
[8] Mccallum, A., & Nigam, K. (1998). “A comparison of event models for naive Bayes text
classification”. AAAI-98 Workshop on 'Learning for Text Categorization'
[9] Bayesian Network Classifiers in Weka, viewed on 2011 September 14.
[10] Llora, Xavier, and Josep M. Garrell (2001) Evolution of decision trees, edn., Forth
Catalan Conference on Artificial Intelligence (CCIA2001).
[11] B. G. Becker. Visualizing Decision Table Classifiers. Pages 102- 105, IEEE (1998).
Dogo Rangsang Research Journal UGC Care Group I Journal
ISSN : 2347-7180 Vol-13, Issue-1, No. 2, January 2023

SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

S.Mahammad Rafi Assistant Professor in Department of Computer Science and Engineering, Annamacharya
Institute of Technology and Sciences(Autonomous), Rajampet, Andhra Pradesh, India.

S.Venkata Padma Meghana, D.Sushma, P.Venkata Giridhar, T.Venkata Srinivasulu, A.C.Venkateshwara Reddy

Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences (Autonomous), Rajampet, Andhra Pradesh, India.

ABSTRACT:

We use various means of communication to convey messages digitally. Digital tools allow two
or more persons to coordinate with each other. This communication can be textual, visual,
audio, or written. Smart devices, including cell phones, are the major channels of
communication these days. Intensive communication through SMS is also giving rise to
spamming. Unwanted text messages are defined as junk information received on our devices.
Many companies promote their products or services by sending spam texts, which are
unwelcome. In general, spam messages often outnumber genuine messages. In this paper, we use
text classification techniques to give a short overview of SMS spam filtering, which
segregates messages accordingly. We apply several classification methods together with
machine learning algorithms to identify whether an SMS is spam or not. To that end, we
compare different classification methods on a dataset collection, with the work done using
the Weka tool.

Index Terms: Spam Messages, Classification, Spam Filtering, Comparison

I. INTRODUCTION:

In five years, there will be 3.8 billion mobile phone (smartphone) users, up from 1 billion.
China, India, and the US are the top three countries in terms of mobile usage. Short Message
Service, commonly known as SMS, is a text messaging service that has been around for a long
time. SMS can be used even without an internet connection, so it is accessible on both
smartphones and low-end mobile devices. Although smartphones offer numerous text messaging
apps, such as WhatsApp, these can only be used online, whereas SMS is available at all
times. Consequently, the demand for SMS services is growing daily.

Page | 192 Copyright @ 2023 Authors



Figure 1: Spam Message Detection


Additionally, the system mostly derives knowledge from the data. Numerous methods are
available for that purpose, including classification, clustering, and many others. SMS
stands for Short Message Service. A single SMS is limited to 160 characters, so lengthy
messages must be broken up into several smaller messages. Short text messages can be
exchanged between cell phones using the same communication protocols. The government intends
to keep up with the rapid pace of technological progress. In previous years, the rate of
text messaging climbed.
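
The 160-character limit can be illustrated with a few lines of Python. The helper `split_sms` is a hypothetical name, and the sketch is simplified: real networks typically reserve part of each segment for reassembly information, which is ignored here.

```python
def split_sms(text, limit=160):
    """Break `text` into consecutive parts of at most `limit` characters."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [text[i:i + limit] for i in range(0, len(text), limit)]


parts = split_sms("x" * 350)
print([len(p) for p in parts])   # prints [160, 160, 30]
```

Joining the parts in order reproduces the original message, which is what the receiving phone does when it displays a long text as one message.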
In some ways, SMS spam is sophisticated. Low SMS rates have allowed users and service
providers to disregard the issue, and spam-filtering apps for mobile devices are in limited
supply. Email spam is less common than SMS spam. The share of spam also varies by region:
about 30% of text messages in parts of Asia were spam, against about 1% in the United
States. In 2004, the Telephone Consumer Protection Act made SMS spam unlawful in the United
States. Anyone who receives unsolicited SMS can take the sender to small claims court. In
China, the top three mobile phone operators agreed in 2009 to a joint plan to combat mobile
spam by limiting the number of text messages that can be sent.
In this paper, we demonstrate a few classification techniques. Using classification, we can
determine whether a text message is spam or not, which requires a training set of labelled
items. In this study we employ text classification to assign each SMS to either the spam
class or the human (ham) class: texts sent by people belong to the human class, while spam
messages are typically sent by businesses and organisations to advertise their goods. Text
messages are frequently used as a communication tool, since mobile phones and smartphones
are used by a wide spectrum of people to make calls and send messages. SMS spam datasets are
typically small, so more datasets are available for email spam filtering than for SMS spam.
The filtering techniques of email spam systems cannot be applied directly to SMS because
spam SMS messages are so short. In certain nations, including Korea, email spam is less
common than SMS spam. The opposite holds in western regions, where email spam predominates
over SMS spam because email is cheaper and more prevalent there. About 50% of the SMS
messages received on mobile devices are flagged as spam. An SMS filtering system should
therefore be able to run within the limited resources of mobile phone hardware. We used real
ham and spam messages in our analysis. We apply a number of different classification
algorithms, some of which have been used in earlier research and some of which are new.
Machine learning is a technology that allows computers to learn from past data and make
predictions about the future. Many real-world problems can now be addressed with machine
learning and deep learning in fields such as health, security, and market analysis. Machine
learning can be divided roughly into three types: supervised, unsupervised, and
semi-supervised learning.
One of the key subcategories of machine learning is supervised learning. Supervised
learning, also called predictive modelling, is the process of producing predictions from
data. Classification and regression are two instances of supervised learning. For
classification problems the training data set carries pre-assigned labels, and for
regression problems the function values are known. Once training is complete and the model
has reached a minimum of the cost function on the training data set, we switch to scoring,
in which the model forecasts values for fresh data.
Given a data set of emails, one supervised learning task is to determine whether
each email is spam or not (ham). The fact that there is a predetermined outcome, spam or
ham, is what makes this supervised learning. We applied multiple supervised learning
techniques to SMS spam detection using a labelled dataset from UCI.
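
As a hedged illustration of supervised classification on labelled messages, the sketch below implements a minimal multinomial Naive Bayes classifier with add-one smoothing in plain Python. It is not the Weka implementation used in this work; the class name `NaiveBayesText` and the four training messages are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for message, label in zip(messages, labels):
            self.word_counts[label].update(message.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, message):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)   # log prior
            for word in message.lower().split():
                if word in self.vocab:                                # skip unseen words
                    score += math.log((counts[word] + 1) / denom)     # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_messages = [
    "win a free prize now",
    "free cash offer call now",
    "are we still meeting for lunch",
    "call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesText().fit(train_messages, train_labels)
print(model.predict("free prize offer"))    # prints spam
print(model.predict("see you at lunch"))    # prints ham
```

Training corresponds to counting words per class, and scoring corresponds to comparing the log-probabilities of a new message under each class.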

II. LITERATURE SURVEY:

Tiago A. Almeida, José María Gómez Hidalgo, Akebo Yamakami. Contributions to the Study of SMS
Spam Filtering. University of Campinas, Sao Paulo, Brazil.

The number of mobile phone users has increased, which has resulted in a sharp rise in SMS
spam messages. In practice, mobile phone spam is hard to combat because low SMS rates have
allowed many users and service providers to disregard the problem and because mobile phone
spam-filtering software is in limited supply. In academic settings, a significant handicap
is the scarcity of publicly available SMS spam datasets, which are crucial for validating
and comparing classifiers. Additionally, since SMS messages are typically brief,
content-based spam filters may perform worse.

In that work, the authors present what they describe as the largest real, public, and non-encoded
SMS spam collection known to them. They also compare the results of several established machine
learning techniques. The findings show that Support Vector Machine performs better than the other
assessed classifiers, making it a good baseline for future comparison.

Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification” 2009 First
International Workshop on Education Technology and Computer Science, 168-171.

This paper proposes a dual-filtering method for messages. The KNN classification algorithm and
rough set theory are first combined to separate spam messages from other messages. Because
rough-set reduction can lower precision, some messages are then re-filtered using the KNN
classification algorithm. By combining rough sets with KNN classification, the method increases
classification speed while maintaining high accuracy.

Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul, 133-791
South Korea.

This paper presents an SMS (Short Messaging Service) spam filtering system based on SVM (Support
Vector Machine) and a thesaurus. The system identifies words from sample data using a
pre-processing tool, integrates their meanings using a thesaurus, derives features of the integrated
words using chi-square statistics, and then analyses these features. The system has been tested
and works well in a Windows environment.

B. G. Becker. Visualizing Decision Table Classifiers. Pages 102- 105, IEEE (1998).

Decision trees, decision networks, and decision tables are all classification models used for
prediction. They are induced by machine learning algorithms. A decision table is made up of a
hierarchy of tables in which each entry in a higher-level table is broken down by the values of a
pair of additional attributes to form a new table. The structure is comparable to dimensional
stacking [4]. Here, a visualisation technique is presented that enables even non-experts in machine
learning to understand a model built on many attributes. Various forms of interaction make this
visualisation more useful than other static designs.

Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul, 133-791
South Korea:

This project describes a powerful and adaptive spam filtering system for SMS (Short
Messaging Service) that uses SVM (Support Vector Machine) and a thesaurus. The system isolates
words from sample data using a pre-processing device and integrates meanings of isolated words
using a thesaurus, generates features of integrated words through chi-square statistics, and studies
these features. The system is realized in a Windows environment and its performance is
experimentally confirmed.

Spam filtering is a special case of automatic document classification in which the task is to
decide whether a document is spam or not. Automatic document classification groups similar
documents by assigning each document to the appropriate category through a classification system.
The classification consists of two phases.

The first phase is feature selection: after a collection of documents has been indexed, the
features needed for classification are extracted. The second phase is the decision process, which
chooses the right category based on the result of the first phase. An automatic document classifier
acquires the ability to assign the right category automatically through a machine learning process.

For this process, specific words are tagged in the collection of training documents. These words
represent the documents, and feature extraction is the batch job of selecting words that appear in
the training documents. However, if every word in the training documents is selected as a feature,
classification takes too much time and loses discriminating power. To prevent this problem, an
information weight is calculated for each word, and the highest-weighted words are selected as
features for automatic classification. In text categorization we are dealing with a huge feature
space, which is why we need a feature selection mechanism. The most popular feature selection
methods are document frequency thresholding (DF), the χ² statistic (CHI), term strength (TS),
information gain (IG), and mutual information.

Abhishek Patel, Priya Jhariya, Sudalagunta Bharath, Ankita Wadhawan, “SMS Spam Detection using
Machine Learning Approach”:

Spam has been defined as unsolicited bulk email (Hidalgo, 2002), that is, information crafted to
be delivered to a large number of recipients regardless of their wishes. Cormack (2007) described
spam as promotional or coercive content distributed by mass mailing. Such spam can be
distinguished by the medium used, for example email spam or SMS spam. Spammers flood Short
Message Service servers and send large volumes of unsolicited messages to end users [16]. From a
business point of view, SMS users have to spend time deleting the spam they receive, which leads
to reduced profit and can cause difficulty for organisations. How to detect SMS spam accurately
and efficiently has therefore become an important topic of study.

In this study, data mining is used to perform machine learning, with various classifiers for
training and testing and filters for data preprocessing and feature selection. The aim is to find
the best combined model, judged by accuracy or by other evaluation metrics. At present, various
studies have been carried out using data mining techniques, for example data mining by
classification.

Most of this effort concentrates on a single classifier. However, spammers keep changing their
techniques to evade spam detection [18]. In this study we therefore focus on an overall framework
for handling SMS spam using data mining techniques. Questions such as whether a hybrid model
gives better accuracy than any single classifier used for email spam identification will be
examined through experimentation.

III. METHODOLOGY:

Using several supervised-learning techniques, the proposed method focuses on increasing the
precision of spam message detection. The data was obtained from Kaggle. The dataset is then
partitioned according to entropy, and accuracy is observed on the partitioned dataset. Using
correlation and a working model, the optimal attributes for each leaf node are determined, and
the model is tuned according to the best attributes for each partition. ML-based solutions have
an advantage over blacklists because, like heuristic checks, they can reduce the impact of
zero-hour attacks. ML approaches can also build their own classification models by examining
vast amounts of data, so manually creating heuristic tests is no longer necessary.

Figure 2 shows the framework and module structure of the proposed system.

[Figure 2 blocks: Spam Message Dataset; Extraction of features from dataset for training and
testing; classifiers (Naive Bayes, Support Vector Machine, XG Boost, Ada Boost); Performance
analysis and comparison of all classifiers; Choose the best classifier for prediction; Input text;
Classify; Predicting whether the message is Spam or Ham]

Fig 2: Block diagram for proposed system

A. Dataset
For this model, we combined data sets that we created ourselves with spam datasets obtained
from several online resources, including Kaggle. We train our model using 70% of the Kaggle
spam dataset and test it using the remaining 30%. The dataset contains both spam and genuine
(ham) messages.

B. Data preprocessing

The steps involved in data preprocessing include cleaning, instance selection, feature extraction,
normalization, transformation, etc. The end product of data preprocessing is the final training
dataset. How the data is pre-processed can affect how the final results are interpreted. Data
cleaning involves filling in missing values, reducing noise, identifying and eliminating outliers,
and resolving inconsistencies. Data integration combines multiple databases or data sets. Data
transformation aggregates and normalizes the data to a common scale. Data reduction produces a
much more compact representation of the dataset that still allows the analysis to yield a
consistent result.
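
The cleaning step can be sketched for SMS text as follows. This is a minimal illustration and not the exact procedure used in this work: the function name clean_message, the sample messages, and the decision to strip digits and punctuation are assumptions. In practice digits and symbols can themselves be useful spam indicators, so this choice should be validated on the data.

```python
import re
import string


def clean_message(text: str) -> str:
    """Lowercase a message, remove punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw messages, including one exact duplicate
raw = [
    "WINNER!! Claim your FREE prize now, call 0800",
    "  Ok, see you   at 5?  ",
    "WINNER!! Claim your FREE prize now, call 0800",
]

cleaned = [clean_message(m) for m in raw]

# Remove exact duplicates while preserving the original order
deduped = list(dict.fromkeys(cleaned))
print(deduped)
```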

C. Train-test split

The dataset is divided into two subsets, a training set and a testing set, so that a model fitted
on the training set can be evaluated on how well it detects spam messages in the testing set. 30%
of the data is held out for the testing set, leaving 70% for the model to train on.
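
A 70/30 split of this kind can be sketched with scikit-learn's train_test_split. The synthetic messages, the 20% spam ratio, the use of stratification, and the random seed below are assumptions for illustration; the source does not state whether the split was stratified.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 messages, of which 20 are spam (1) and 80 are ham (0)
messages = [f"message {i}" for i in range(100)]
labels = [1] * 20 + [0] * 80

# Hold out 30% for testing; stratify so both subsets keep the same spam ratio
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.30, stratify=labels, random_state=42
)

print(len(X_train), len(X_test), sum(y_test))
```

Stratifying on the label keeps the proportion of spam the same in both subsets, which matters when spam is the minority class.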

IV. ALGORITHMS:

Support Vector Machine:

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both
classification and regression, although it is mostly used for classification. Each data item is
plotted as a point in n-dimensional space, where n is the number of features and the value of each
feature is the value of the corresponding coordinate. We then find the hyperplane that best
separates the two classes. An SVM model thus represents the training data as points in space,
separated into categories by a gap that is as wide as possible. SVM is effective in
high-dimensional spaces and uses only a subset of the training points (the support vectors) in the
decision function, which makes it memory efficient. The algorithm does not directly provide
probability estimates; these are calculated using five-fold cross-validation.
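
The following sketch shows one way to obtain probability estimates from a linear SVM using five-fold cross-validation. It uses scikit-learn's CalibratedClassifierCV with sigmoid (Platt) scaling as an explicit stand-in for that step; the toy messages and all parameter values are assumptions, not the configuration used in our experiments.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): 10 spam and 10 ham messages
spam_msgs = [
    "win a free prize now", "free cash claim now", "urgent claim your free reward",
    "you won a free ticket", "free entry win cash", "claim your prize today",
    "win cash now urgent", "free reward claim today", "urgent you won cash",
    "prize winner claim free ticket",
]
ham_msgs = [
    "are we meeting for lunch", "call me when you get home", "see you at the office",
    "can you send the report", "lunch at home today", "i will call you later",
    "meeting moved to monday", "please send me the notes", "see you later at lunch",
    "did you get the report",
]
texts = spam_msgs + ham_msgs
labels = [1] * len(spam_msgs) + [0] * len(ham_msgs)

vectorizer = HashingVectorizer(n_features=2**10)
X = vectorizer.transform(texts)

# The SVM itself only outputs a signed distance to the hyperplane; the wrapper
# fits a sigmoid on held-out folds (cv=5) to turn distances into probabilities.
model = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
model.fit(X, labels)

proba = model.predict_proba(vectorizer.transform(["claim your free prize"]))
print(model.classes_, proba)
```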

Naive Bayes :

Naive Bayes is a classification technique based on Bayes’ Theorem with the assumption of
independence among predictors. It is used to predict the class of a data point and can perform
multi-class prediction. If the assumption of independence holds, Naive Bayes can perform better
than other algorithms such as logistic regression, and it requires less training data for
classification. The Naive Bayes classifier works efficiently in real-world situations such as
document classification and spam filtering, although it is known to be a poor probability
estimator. It is an easy and quick technique. Bayes’ Theorem states:

P(A/B) = [P(B/A) × P(A)] / P(B)

where:

P(A/B) is the posterior probability of class (target) given predictor (attribute).

P(A) is the prior probability of class.

P(B/A) is the likelihood which is the probability of predictor given class.

P(B) is the prior probability of predictor.
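
As a worked example of the quantities defined above, the sketch below computes the posterior probability that a message is spam given that it contains the word "free". All the probabilities are made-up illustrative values, not measurements from our dataset.

```python
def posterior(prior_a: float, likelihood_b_given_a: float, evidence_b: float) -> float:
    """Bayes' Theorem: P(A/B) = P(B/A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical values for illustration only
p_spam = 0.2               # P(A): prior probability that a message is spam
p_free_given_spam = 0.5    # P(B/A): probability that a spam message contains "free"
p_free_given_ham = 0.05    # probability that a ham message contains "free"

# P(B): overall probability of seeing "free", by the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 3))  # 0.714
```

With these values the evidence is P(B) = 0.10 + 0.04 = 0.14, so the posterior is 0.10 / 0.14, or about 0.714: seeing "free" raises the spam probability from 20% to about 71%.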

XGBOOST:

XG Boost stands for eXtreme Gradient Boosting. It is an implementation of gradient-boosted
decision trees designed for speed and performance. Boosting is an ensemble learning method in
which new models are added to correct the errors made by the existing models. Models are added
sequentially until no further improvement can be made. When adding new models, a gradient
descent technique is used to minimize the loss. The implementation is engineered for efficient use
of computation time and memory; the design goal was to make the best use of the available
resources to train the model. Execution speed and model performance are the two main reasons to
work with XG Boost. The approach supports both classification and regression models.
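
The idea of adding models sequentially, each fitted to the gradient of the loss, can be sketched in a few lines. The code below is a bare-bones gradient boosting loop for binary classification built on scikit-learn regression trees. It is not XGBoost itself (it has no regularisation, second-order terms, or system optimisations), and the synthetic data and hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, raw_score):
    """Mean binary cross-entropy for labels in {0, 1} and raw (logit) scores."""
    p = np.clip(sigmoid(raw_score), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_gradient_boosting(X, y, n_rounds=20, learning_rate=0.3, max_depth=2):
    """Add small trees one at a time, each fitted to the current residuals."""
    raw_score = np.zeros(len(y))
    trees, losses = [], [log_loss(y, raw_score)]
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to the raw score
        residual = y - sigmoid(raw_score)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Take a small step in the direction the new tree suggests
        raw_score += learning_rate * tree.predict(X)
        trees.append(tree)
        losses.append(log_loss(y, raw_score))
    return trees, losses


# Synthetic data: the label depends on the first two of five features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

trees, losses = fit_gradient_boosting(X, y)
print(f"loss before boosting: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```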

ADABOOST:

AdaBoost, short for Adaptive Boosting, is a boosting technique used as an ensemble method in
machine learning. It is called adaptive because the weights are re-assigned to each instance after
every round, with higher weights assigned to incorrectly classified instances. Boosting is used to
reduce bias as well as variance in supervised learning. It works on the principle of growing
learners sequentially: except for the first, each learner is grown from the previously grown
learners. In simple words, weak learners are combined into a strong one. AdaBoost follows the same
principle as boosting in general, with the slight difference that it re-weights the training
instances between rounds.
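
The re-weighting step can be illustrated with a single round of the classic (discrete) AdaBoost update for labels in {-1, +1}. The function name adaboost_reweight and the five-sample example are assumptions made for this sketch.

```python
import numpy as np


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the learner weight alpha and the new instance weights.

    Assumes labels in {-1, +1} and a weighted error strictly between 0 and 1.
    """
    weights = np.asarray(weights, dtype=float)
    miss = y_true != y_pred
    error = np.sum(weights[miss]) / np.sum(weights)
    alpha = 0.5 * np.log((1.0 - error) / error)
    # Correct predictions (y_true * y_pred = +1) shrink, mistakes (-1) grow
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_weights / new_weights.sum()


y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])   # the weak learner gets the last sample wrong
weights = np.full(5, 0.2)               # start from uniform weights

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

Here the weighted error is 0.2, so alpha = 0.5 ln 4 ≈ 0.693; after normalisation the misclassified instance carries weight 0.5 and each correctly classified instance 0.125, so the next learner concentrates on the earlier mistake.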

V. EXPERIMENTATION AND RESULTS:

In this paper we applied the Hashing Vectorizer text vectorization technique and then applied four
classification algorithms to identify whether a message is ham or spam. The results of the
experiments are shown below. We achieved the best accuracy with SVM.
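
The experimental setup can be sketched as follows, assuming scikit-learn. This is an illustrative outline and not our exact code: the toy corpus is made up, scikit-learn's GradientBoostingClassifier is used as a stand-in for XGBoost so that the example needs no extra dependency, and accuracies on such a tiny corpus are not meaningful.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): 10 spam and 10 ham messages
spam_msgs = [
    "win a free prize now", "free cash claim now", "urgent claim your free reward",
    "you won a free ticket", "free entry win cash", "claim your prize today",
    "win cash now urgent", "free reward claim today", "urgent you won cash",
    "prize winner claim free ticket",
]
ham_msgs = [
    "are we meeting for lunch", "call me when you get home", "see you at the office",
    "can you send the report", "lunch at home today", "i will call you later",
    "meeting moved to monday", "please send me the notes", "see you later at lunch",
    "did you get the report",
]
texts = spam_msgs + ham_msgs
labels = [1] * len(spam_msgs) + [0] * len(ham_msgs)

# alternate_sign=False keeps features non-negative, as MultinomialNB requires
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
X = vectorizer.transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, stratify=labels, random_state=0
)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Gradient boosting (stand-in for XGBoost)": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Fit each classifier on the training split and score it on the held-out split
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.2%}")
```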

[Figure: bar chart of classification accuracy (y-axis from 94.00% to 100.00%) for Naive Bayes,
SVM, XGBoost, and ADA Boost]

VI. CONCLUSION:
We presented a spam message classification approach using several algorithms: Naive Bayes, Support
Vector Machine, XGBoost, and AdaBoost. For the assessment of spam base datasets using the Weka
tool, two evaluation options are available in Weka: cross validation and training set. With the
training-set option, the same data is used for both training and testing. With cross validation,
the training data is split into several folds. Following implementation and experimental analysis,
we obtained the classifier accuracies using the training-set option. Support Vector Machine
produced the best results for spam message classification. We tested this model on just one
dataset; tests on other datasets are planned as future work.


VII. REFERENCES

[1] J. Han, M. Kamber. Data Mining Concepts and Techniques. Elsevier Inc., 2nd ed., 2006.
[2] Tiago A. Almeida, José María Gómez Hidalgo, Akebo Yamakami. Contributions to the Study of SMS
Spam Filtering. University of Campinas, Sao Paulo, Brazil.
[3] M. Bilal Junaid, Muddassar Farooq. Using Evolutionary Learning Classifiers To Do Mobile
Spam (SMS) Filtering. National University of Computer & Emerging Sciences (NUCES)
Islamabad, Pakistan.
[4] Inwhee Joe and Hyetaek Shim, "An SMS Spam Filtering System Using Support Vector
Machine," Division of Computer Science and Engineering, Hanyang University, Seoul, 133-791
South Korea.
[5] Xu, Qian, Evan Wei Xiang, Qiang Yang, Jiachun Du, and Jieping Zhong. "Sms spam detection
using noncontent features." IEEE Intelligent Systems 27, no. 6 (2012): 44-51.
[6] Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A., and Naik, V. "SMSAssassin: Crowdsourcing
driven mobile-based system for SMS spam filtering," Proceedings of the 12th Workshop on
Mobile Computing Systems and Applications, ACM, 2011, pp. 1-6.
[7] Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification” 2009 First
International Workshop on Education Technology and Computer Science, 168-171.
[8] Weka The University of Waikato, Weka 3: Data Mining Software in Java, viewed on 2011
September 14.
[9] McCallum, A., & Nigam, K. (1998). “A comparison of event models for naive Bayes text
classification”. AAAI-98 Workshop on 'Learning for Text Categorization'.
[10] Bayesian Network Classifiers in Weka, viewed on 2011 September 14.
[11] Llora, Xavier, and Josep M. Garrell (2001) Evolution of decision trees, Fourth Catalan
Conference on Artificial Intelligence (CCIA2001).
[12] B. G. Becker. Visualizing Decision Table Classifiers. Pages 102-105, IEEE (1998). A. Bantukul
and P. J. Marsico, “Methods, systems, and computer program products for short message service
(SMS) spam filtering using E-mail spam filtering resources,” U.S. Patent 7 751 836 B2, Jul. 6,
2010.
[13] H.-Y. Chou and N.-H. Lien, “Effects of SMS teaser ads on product curiosity,” Int. J. Mobile
Commun., vol. 12, no. 4, pp. 328–345, Jul. 2014.
[14] N. Jindal and B. Liu, “Review spam detection,” in Proc. 16th Int. Conf. World Wide Web, 2007,
pp. 1189–1190.
[15] M. Jiang, P. Cui, and C. Faloutsos, “Suspicious behavior detection: Current trends and future
directions,” IEEE Intell. Syst., vol. 31, no. 1, pp. 31–39, Jan./Feb. 2016.

[16] C. Wang et al., “A behavior-based SMS antispam system,” IBM J. Res. Develop., vol. 54, no. 6,
pp. 3:1–3:16, Nov./Dec. 2010.



This is to certify that the article entitled

SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

Authored By

S.Mahammad Rafi
Assistant Professor in Department of Computer Science and Engineering, Annamacharya Institute
of Technology and Sciences(Autonomous), Rajampet, Andhra Pradesh, India.

Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled

SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

Authored By

S.Venkata Padma Meghana,


Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.

Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled

SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

Authored By

D.Sushma,
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.

Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled

SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

Authored By

P.Venkata Giridhar,
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.

Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled

SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

Authored By

T.Venkata Srinivasulu,
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.

Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
This is to certify that the article entitled

SPAM MESSAGE IDENTIFICATION USING MACHINE LEARNING APPROACH

Authored By

A.C.Venkateshwara Reddy
Department of Computer Science and Engineering, Annamacharya Institute of Technology and
Sciences(Autonomous), Rajampet, Andhra Pradesh, India.

Published in
Dogo Rangsang Research Journal : ISSN 2347-7180 with IF=5.127
Vol. 13, Issue. 1, No. 02, January : 2023
UGC Care Approved, Group I, Peer Reviewed, Bilingual and Referred Journal
