
Kriti Final Report


CANDIDATE’S DECLARATION

I/We hereby declare that the work presented in this report entitled “SMS Spam
Detection Using Machine Learning” was carried out by me/us. I/We have not submitted
the matter embodied in this report for the award of any other degree or diploma of any
other University or Institute.

I/We have given due credit to the original authors/sources for all the words, ideas,
diagrams, graphics, computer programs, experiments, and results that are not my/our
original contribution. I/We have used quotation marks to identify verbatim sentences
and given credit to the original authors/sources.

I/We affirm that no portion of my/our work is plagiarized, and the experiments and
results reported in the report are not manipulated. In the event of a complaint of
plagiarism and the manipulation of the experiments and results, I/We shall be fully
responsible and answerable.

ARCHANA Roll No. - 2005250100011

KRITI YADAV Roll No. - 2005250100028

AKANKSHA DUBEY Roll No. - 2105250109002

ANJALI SHARMA Roll No. - 2105250109003

(Candidates’ Signature)

Date:

Department of Computer Science & Engineering

CERTIFICATE

This is to certify that the Project Report entitled “SMS Spam Detection Using Machine
Learning”, submitted by Archana, Kriti Yadav, Akanksha Dubey and Anjali Sharma in
partial fulfilment of the requirement for the award of Bachelor of Technology in
Computer Science & Engineering from Buddha Institute of Technology, Gorakhpur,
affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh,
represents the work carried out by the students under my supervision. The project
embodies the results of original work and studies carried out by the students themselves, and
the contents of the project do not form the basis for the award of any other degree to the
candidates or to anybody else.

Ms. Pallavi Dixit
(Assistant Professor)
Department of Computer Science & Engineering
Buddha Institute of Technology, GIDA, Gorakhpur

Dr. Abhinandan Tripathi


(Associate Professor & Head)
Department of Computer Science & Engineering
Buddha Institute of Technology, GIDA, Gorakhpur
Date:

ABSTRACT

SMS spam detection is a crucial task in ensuring user privacy and security in mobile
communication. This paper proposes a machine learning approach for detecting SMS spam
messages. Leveraging a dataset of labelled SMS messages, we employ various feature
extraction techniques and machine learning algorithms, including but not limited to, Naive
Bayes, Support Vector Machines (SVM), and Random Forest. We explore the
effectiveness of different feature representations such as bag-of-words, TF-IDF, and word
embeddings. Additionally, we investigate the impact of different preprocessing steps such
as text normalization and stemming. Experimental results demonstrate the effectiveness of
our proposed approach in accurately identifying spam messages while minimizing false
positives. This research contributes to the ongoing efforts in combating SMS spam and
enhances user experience and security in mobile communication.

Keywords: SMS spam, machine learning, classification, feature extraction, supervised
learning, mobile communication.

ACKNOWLEDGEMENT

First and foremost, we extend our heartfelt gratitude to God Almighty for providing us
with the strength, resources, and perseverance to complete this project successfully.
Without His divine guidance, our efforts would not have come to fruition.

We are deeply grateful to Dr. Abhinandan Tripathi, Head of the Department of
Computer Science & Engineering at Buddha Institute of Technology, Gorakhpur, for
his invaluable advice, motivation, and continuous support throughout this project.

Our sincere thanks go to our project guide, Ms. Pallavi Dixit, Assistant Professor in the
Department of Computer Science & Engineering, for her unwavering support,
insightful guidance, and constant encouragement. Her expertise and dedication have
been crucial to our success.

We also express our gratitude to all the faculty members and technical experts whose
help and encouragement have been instrumental in the completion of this project. Their
knowledge and support have been invaluable.

ARCHANA Roll No. - 2005250100011

KRITI YADAV Roll No. - 2005250100028

AKANKSHA DUBEY Roll No. - 2105250109002

ANJALI SHARMA Roll No. - 2105250109003

TABLE OF CONTENTS

Page no.
Candidate’s Declaration i
Certificate ii
Abstract iii
Acknowledgement iv
List of Figures ix
List of Tables xi

Chapter 1 INTRODUCTION 1-9

1.1 INTRODUCTION 1

1.2 OBJECTIVES AND MOTIVATION 2

1.3 LITERATURE SURVEY 2

1.4 PROBLEM STATEMENT 2

1.5 OBJECTIVE STATEMENT 2

1.6 SCOPE OF THE PROJECT 3

1.7 EXISTING SYSTEM 3

1.8 EXISTING TECHNIQUE 4

1.8.1 K-Nearest Neighbors (KNN) 4

1.8.2 Random Forest (RF) 5

1.8.3 Disadvantages Of Existing System 5

1.9 PROPOSED SYSTEM 9

1.9.1 Proposed System Advantages 9

Chapter 2 ANALYSIS AND DESIGN 10-26

2.1 DATASET 10

2.2 LIBRARIES USED 11

2.3 DESIGN 12

2.3.1 Front End Module Diagrams 13

2.3.2 Back End Module Diagrams 14

2.4 SYSTEM SPECIFICATION 14

2.4.1 Hardware Requirements 14

2.4.2 Software Requirements 14

2.5 MODULE DESCRIPTION 15

2.5.1 Data Collection 15

2.5.2 Data Formatting 16

2.5.3 Data Cleaning 16

2.5.4 Data Anonymization 16

2.5.5 Data Sampling 16

2.5.6 Natural Language Processing (NLP) 17

2.5.7 Featurization 18

2.5.8 Splitting Of Data 19

2.5.9 Modeling 19

2.6 SYSTEM DESIGN 22

2.6.1 Architecture Diagram 23

2.6.2 Data Flow Diagram 23

2.6.3 Usecase Diagram 24

2.6.4 Class Diagram 24

2.6.5 Sequence Diagram 25

2.6.6 Activity Diagram 25

2.6.7 State Flow Diagram 26

Chapter 3 SOFTWARE SPECIFICATION 27-33

3.1 GENERAL 27

3.2 PYTHON 31

3.3 APPLICATION OF PYTHON 32

3.3.1 Scientific And Numeric Computing 32

3.3.2 Creating Software Prototypes 33

3.3.3 Good Language To Teach Programming 33

Chapter 4 IMPLEMENTATION 34-42

4.1 PRE-PROCESSING OF DATASET 34

4.2 FEATURE EXTRACTION 34

4.3 CLASSIFICATION ALGORITHMS 35

4.3.1 K-Nearest Neighbor 36

4.3.2 Multinomial Naïve Bayes 37

4.3.3 Support Vector Machine 38

4.3.4 Decision Tree 39

4.3.5 Random Forest 40

4.4 SCREENSHOTS 41

Chapter 5 TESTING AND RESULTS 43-45

5.1 TESTING 43

5.2 RESULTS 44

5.3 VERIFICATION 45

Chapter 6 CONCLUSION 46

6.1 CONCLUSION 46

6.2 FUTURE SCOPE 46

REFERENCES 47-49

LIST OF FIGURES

Figure No. Description Page No.

2.1 Dataset details 10

2.1.1 Graphical representation of dataset 10

2.2 Frequency of all SMS 11

2.2.1 Frequency of ham and spam SMS 12

2.3.1 Use Case Diagram 12

2.3.2 Sequence diagram of SMS spam detection 13

2.3.3 Front End Module Diagram 13

2.3.4 BackEnd Module Diagram 14

3.1 Anaconda Navigator 28

4.1 Pre-processing of dataset 34

4.2 Training Dataset 35

4.3.1 KNN Algorithm 37

4.3.2 Multinomial Naive Bayes algorithm 38

4.3.3 SVM algorithm 39

4.3.4 Decision tree algorithm 40

4.3.5 Random forest algorithm 40

5.1 Formulae used for calculating accuracy, precision, recall, f1-score 43

5.1.1 Importing confusion matrix and classification report 43

5.2 Graphical representation of the results 44
LIST OF TABLES

Table No. Description Page No.

1 Comparing performance of Algorithms 44

CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

Today, the phone has become a vital product in everyone’s life. Almost every
person uses a smartphone for connecting with family, teachers and friends, ordering
from online platforms, using social media, and so on. For these activities,
a phone needs a SIM card with internet connectivity. The SIM in our phones
receives many text messages, called SMS, of which only a small percentage are important.
The remaining messages may carry unwanted content that tries to obtain
our confidential information such as bank details and passwords. Some of these messages
claim that the recipient has won a large sum of money or an expensive product
by answering a trivial question.

SMS is a multi-million-dollar commercial industry that reportedly accounted for 11 to 13
percent of the Gross National Income (GNI) of developing countries in 2013, and the share
has risen since then. Many people prefer SMS for communication because a message costs
less than US$0.001. More than 30 percent of spam messages are reported to be sent by
scammers in Asian countries. Many people fall prey to these spam messages and are
cheated by the scammers. Whenever a method is found to reduce spam messages,
spammers find another way to scam people. These messages are also a considerable
irritation. Several ways of preventing this are discussed in the following
sections.

1.2 OBJECTIVE AND MOTIVATION

Researchers around the world have proposed many methods to reduce unwanted text
messages. One of these is to use machine learning techniques. Our
objective is to classify an SMS as ham or spam using machine learning algorithms, with
CountVectorizer() for feature extraction. We use the supervised machine learning
algorithms K-nearest neighbor (KNN), Multinomial Naïve Bayes, Support
Vector Machine (SVM), decision tree and random forest on a dataset taken from the
Kaggle platform to classify whether an SMS is ham or spam and to block spam
messages, which reduces the threat to a person and their confidential details.

1.3 LITERATURE SURVEY

Nilam Nur Amir Sharif et al. [1] built a model to check whether an SMS is ham or spam
using supervised machine learning algorithms such as K-Nearest Neighbor (KNN),
Multinomial Naive Bayes, Support Vector Machine (SVM), decision tree and random
forest, along with term frequency-inverse document frequency (TF-IDF) for feature
extraction. Their dataset is taken from the UCI Machine Learning Repository and contains
5574 instances. After feature extraction with the TF-IDF
vectorizer, the dataset is split into train and test data. Each algorithm is
fitted on the train data and evaluated by calling its predict function
on the test data, after which a classification report is produced. The reported values
show that random forest with the TF-IDF vectorizer outperforms the other
algorithms, giving 97.5% accuracy, a precision of 0.98 and an f1-score of 0.97. The SVM
algorithm performed worst, giving 87.49% accuracy, a precision of 0.77 and an f1-score of
0.82.
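
To make the TF-IDF weighting mentioned above concrete, the following is a minimal sketch in plain Python. The toy messages and the function name tf_idf are illustrative assumptions, and the formula is the textbook tf × log(N/df) variant; library vectorizers usually add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, not taken from the project dataset.
messages = ["win a free prize now", "are you free now", "call me now"]
scores = tf_idf(messages)
```

A word that occurs in every message (here "now") receives a weight of zero, while a word confined to one message (here "prize") receives the highest weight in that message.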

1.4 PROBLEM STATEMENT

A number of major differences exist between spam filtering in text messages and in
emails. Unlike emails, for which a variety of large datasets are available, real databases
of SMS spam are very limited. Additionally, due to the short length of text messages,
the number of features that can be used for their classification is far smaller than the
corresponding number in emails, and text messages have no header. Text
messages are also full of abbreviations and use much less formal language than one
would expect from emails. All of these factors may result in serious degradation in the
performance of major email spam filtering algorithms when applied to short text messages.

1.5 OBJECTIVE STATEMENT

Prediction of SMS spam has been an important area of research for a long time. The
goal is to apply different machine learning algorithms to the SMS spam classification
problem, compare their performance to gain insight and further explore the problem,
and design an application based on one of these algorithms that can filter SMS spam
with high accuracy. The current work proposes a range of machine learning and deep
learning-based predictive models for accurately predicting SMS spam.
The predictive power of the models is further enhanced by introducing the powerful
deep learning-based long short-term memory (LSTM) network into the predictive
framework.

1.6 SCOPE OF THE PROJECT

1. The proposed model is based on the study of SMS text data and technical
indicators. The algorithm selects the best combination of free parameters for the LSTM to
avoid over-fitting and local minima problems and to improve prediction accuracy.
2. Our dataset consists of one large text file in which each line corresponds to a text
message. Therefore, preprocessing of the data, extraction of features, and
tokenization of each message are required. After feature extraction, an initial
analysis of the data is done using a label encoder, and models such as the naive
Bayes (NB) algorithm and LSTM are then used in the next steps for prediction.
3. Two methods are used to predict spam messages: fundamental and
technical analyses.

1.7 EXISTING SYSTEM

The Random Forest (RF) algorithm is used for classification of ham or spam during this
phase. RF is an averaging ensemble learning method that can be used for classification
problems. The algorithm combines various decision tree models in order to reduce
the over-fitting problem of individual decision trees. In the RF algorithm, each tree
provides its own prediction, which may differ from those of the other trees. As a result,
each tree performs differently, and the predictions of the trees are averaged to give the
overall result. During the training phase, a set of decision trees is
constructed, each operating on randomly selected features. RF can
work well with a large dataset with a variety of feature types, such as binary,
categorical and numerical. The algorithm works as follows: for each tree in
the forest, a bootstrap sample is selected from the training set S, where S(i) represents
the ith bootstrap. A decision tree is then learned using a modified decision-tree learning
algorithm.

The algorithm is modified as follows: at each node of the tree, instead of examining all
possible feature splits, a subset of the features f ⊆ F is selected randomly,
where F is the set of spam features. The node then splits on the best feature in f rather
than F. In practice, f is much smaller than F. Deciding which feature to split on
is often the most computationally expensive aspect of decision tree learning. By
narrowing the set of features, the learning of the tree is sped up
drastically.
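
The two randomization steps described above (a bootstrap sample per tree, and a random feature subset f drawn from F at each node) can be sketched as follows. This is only an illustration under stated assumptions: the binary features has_url, has_digits and all_caps, the toy rows, and all function names are hypothetical, and Gini impurity is assumed as the split criterion since the text does not name one.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after splitting on one binary feature."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def bootstrap(rows, labels, rng):
    """Draw a bootstrap sample S(i): same size as S, with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def choose_split(rows, labels, all_features, subset_size, rng):
    """Pick the best feature within a random subset f of the full set F."""
    f = rng.sample(all_features, subset_size)
    best = min(f, key=lambda feat: split_impurity(rows, labels, feat))
    return best, f

# Hypothetical binary features per message.
rows = [
    {"has_url": 1, "has_digits": 1, "all_caps": 0},
    {"has_url": 1, "has_digits": 0, "all_caps": 1},
    {"has_url": 0, "has_digits": 1, "all_caps": 0},
    {"has_url": 0, "has_digits": 0, "all_caps": 0},
]
labels = ["spam", "spam", "ham", "ham"]

rng = random.Random(0)
sample_rows, sample_labels = bootstrap(rows, labels, rng)
feature, candidates = choose_split(rows, labels, list(rows[0]), 2, rng)
```

Because only the features in the random subset are scored at each node, the cost per split falls roughly in proportion to the size of f relative to F.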

1.8 EXISTING TECHNIQUE

KNN and Random Forest.

1.8.1 K-Nearest Neighbors (KNN)

The KNN algorithm classifies new data based on the class of the k nearest neighbors.
This paper uses the value of k as 6. The distance from neighbors can be calculated
using various distance metrics, such as Euclidean distance, Manhattan distance (used
in this paper), Minkowski distance, etc. The class of the new data may be decided by
majority vote or by weighting each neighbor in inverse proportion to its distance. KNN is a
non-generalizing method, since the algorithm keeps all of its training data in memory,
possibly transformed into a fast-indexing structure such as a ball tree or a KD tree.
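
As a rough illustration of the procedure just described, the sketch below implements KNN with k = 6, Manhattan distance and a majority vote in plain Python. The two-dimensional feature vectors are hypothetical word counts invented for the example, and the brute-force search stands in for the ball tree or KD tree that a library would use.

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=6):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: manhattan(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors: [count of "free", count of "call"].
train_X = [[3, 1], [2, 2], [3, 0], [2, 1], [0, 1], [0, 0], [1, 0], [0, 2]]
train_y = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

print(knn_predict(train_X, train_y, [2, 1]))
print(knn_predict(train_X, train_y, [0, 1]))
```

Every prediction scans the whole training set, which is why the method is described as keeping all training data in memory.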

1.8.2 Random Forest (RF):

Random forest is a supervised learning algorithm developed to solve
regression and classification problems. A main advantage of random forest
is that it can handle both numerical and categorical data. Like other conventional
algorithms, the decision tree algorithm creates a training model that is
used to predict the value or class of the target variable, here by
learning decision rules inferred from the training dataset. The algorithm makes
use of a tree structure in which each internal node, also known as a decision node, refers to
an attribute and branches into two or more child nodes, while each leaf node corresponds to a
class label. The topmost node, known as the root node, corresponds to the best predictor, i.e.
the best attribute of the dataset. The algorithm splits the whole data frame into
subsets while the trees of the random forest are grown, and the end result for each is a
tree with leaf nodes, internal nodes and a root node. As the tree becomes deeper
and more complex, the model fits the training data more closely.

1.8.3 DISADVANTAGES OF EXISTING SYSTEM

• Accuracy is low.

• Dataset selection is not correct

• Feature extraction is not accurate

• Complex structure, not easy to understand.

LITERATURE SURVEY

TITLE: SMS Spam Detection Based on Long Short-Term Memory and Gated
Recurrent Unit

Author: Pumrapee Poomka, Wattana Pongsena, Nittaya Kerdprasop, and Kittisak
Kerdprasop

YEAR: - 2019

Abstract:

An SMS spam is the message that hackers develop and send to people via mobile
devices targeting to get their important information. For people who are ignorant, if
they follow the instruction in the message and fill their important information, such as
internet banking account in a faked website or application, the hacker may get the
information. This may lead to loss their wealth. The efficient spam detection is an
important tool in order to help people to classify whether it is a spam SMS or not. In
this research, we propose a novel SMS spam detection based on the case study of the
SMS spams in English language using Natural Language Process and Deep Learning
techniques. To prepare the data for our model development process, we use word
tokenization, padding data, truncating data and word embedding to make more
dimension in data. Then, this data is used to develop the model based on Long Short-
Term Memory and Gated Recurrent Unit algorithms. The performance of the proposed
models is compared to the models based on machine learning algorithms including
Support Vector Machine and Naïve Bayes. The experimental results show that the
model built from the Long Short-Term Memory technique provides the best overall
accuracy as high as 98.18%. On accurately screening spam messages, this model
shows the ability that it can detect spam messages with the 90.96% accuracy rate,
while the error percentage that it misclassifies a normal message as a spam message is
only 0.74%.

TITLE: SMS Spam Message Detection using Term Frequency-Inverse Document
Frequency and Random Forest Algorithm

Author: Haslina Md Sarkan, Yazriwati Yahya, Suriani Mohd Sam

YEAR: - 2019

Abstract:
The daily traffic of Short Message Service (SMS) keeps increasing. As a result, it leads
to dramatic increase in mobile attacks such as spammers who plague the service with
spam messages sent to the groups of recipients. Mobile spams are a growing problem
as the number of spams keep increasing day by day even with the filtering systems.
Spams are defined as unsolicited bulk messages in various forms such as unwanted
advertisements, credit opportunities or fake lottery winner notifications. Spam
classification has become more challenging due to complexities of the messages
imposed by spammers. Hence, various methods have been developed in order to filter
spams. In this study, methods of term frequency-inverse document frequency (TF-
IDF) and Random Forest Algorithm will be applied on SMS spam message data
collection. Based on the experiment, Random Forest algorithm outperforms other
algorithms with an accuracy of 97.50%

TITLE: Spam Detection Approach for Secure Mobile Message Communication
Using Machine Learning Algorithms

Author: Shah Nazir, Habib Ullah Khan, and Amin Ul Haq

YEAR: - 2020

Abstract:

The spam detection is a big issue in mobile message communication due to which
mobile message communication is insecure. In order to tackle this problem, an
accurate and precise method is needed to detect the spam in mobile message
communication. We proposed the applications of the machine learning-based spam
detection method for accurate detection. In this technique, machine learning classifiers
such as Logistic regression (LR), K-nearest neighbor (K-NN), and decision tree (DT)
are used for classification of ham and spam messages in mobile device communication.
The SMS spam collection data set is used for testing the method. The dataset is split
into two categories for training and testing the research. The results of the experiments
demonstrated that the classification performance of LR is high as compared with K-
NN and DT, and the LR achieved a high accuracy of 99%. Additionally, the proposed
method performance is good as compared with the existing state-of-the-art methods.

TITLE: SMS Spam Detection using Machine Learning and Deep Learning
Techniques

Author: Sridevi Gadde

YEAR: - 2021

Abstract:

The number of people using mobile devices is increasing day by day. SMS (short
message service) is a text message service available in smartphones as well as basic
phones. So, the traffic of SMS increased drastically. The spam messages also increased.
The spammers try to send spam messages for their financial or business benefits like
market growth, lottery ticket information, credit card information, etc. So, spam
classification has special attention. In this paper, we applied various machine learning
and deep learning techniques for SMS spam detection. We used a dataset from UCI and
built a spam detection model. Our experimental results have shown that our LSTM
model outperforms previous models in spam detection with an accuracy of 98.5%. We
used Python for all implementations.

TITLE: SMS Spam Detection using Machine Learning Approach

Author: Houshmand Shirani-Mehr

YEAR: - 2019

Abstract:

Over recent years, as the popularity of mobile phone devices has increased, Short
Message Service (SMS) has grown into a multi-billion dollars industry. At the same
time, reduction in the cost of messaging services has resulted in growth in unsolicited
commercial advertisements (spams) being sent to mobile phones. In parts of Asia, up
to 30% of text messages were spam in 2012. Lack of real databases for SMS spams,
short length of messages and limited features, and their informal language are the
factors that may cause the established email filtering algorithms to underperform in
their classification. In this project, a database of real SMS Spams from UCI Machine
Learning repository is used, and after preprocessing and feature extraction, different
machine learning techniques are applied to the database. Finally, the results are
compared and the best algorithm for spam filtering for text messaging is introduced.
Final simulation results using 10-fold cross validation shows the best classifier in this
work reduces the overall error rate of best model in original paper citing this dataset by
more than half.

1.9 PROPOSED SYSTEM

The Naive Bayes (NB) algorithm is applied to the dataset using the extracted features with
different training set sizes. For the learning curve, performance is evaluated by splitting
the dataset into a 70% training set and a 30% test set. The NB algorithm shows good overall
accuracy.
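
For readers unfamiliar with the classifier, the following is a minimal sketch of multinomial Naive Bayes over word counts, written in plain Python. The training messages are invented for illustration, the function names are hypothetical, and add-one (Laplace) smoothing is an assumption, since the text does not state which smoothing was used.

```python
import math
from collections import Counter, defaultdict

def train_nb(messages, labels):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages, not taken from the project dataset.
train_messages = ["free prize call now", "win free cash",
                  "see you at lunch", "call me later"]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_nb(train_messages, train_labels)

print(predict_nb(model, "free cash prize"))
print(predict_nb(model, "see you later"))
```

Working with log-probabilities avoids numerical underflow when many word probabilities are multiplied together.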

We notice that the length of the text message (number of characters used) is a very
good feature for the classification of spams. Sorting features based on their mutual
information (MI) criteria shows that this feature has the highest MI with target labels.
Additionally, going through the misclassified samples, we notice that text messages
with length below a certain threshold are usually hams, yet because of the tokens
corresponding to the alphabetic words or numeric strings in the message they might be
classified as spams.
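
The mutual information criterion mentioned above can be computed for a discrete feature as in the sketch below. The message lengths, the threshold of 100 characters and the function name are illustrative assumptions; the code only shows how a thresholded length feature would be scored against the labels, not the values obtained in this project.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical message lengths (in characters) and their labels.
lengths = [120, 140, 30, 25]
labels = ["spam", "spam", "ham", "ham"]

# Binary feature: is the message longer than an assumed threshold?
is_long = [int(length > 100) for length in lengths]
mi_length = mutual_information(is_long, labels)

# A feature unrelated to the labels carries no information.
unrelated = [1, 0, 1, 0]
mi_unrelated = mutual_information(unrelated, labels)
```

Sorting candidate features by this score in descending order gives the ranking described in the paragraph above.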

The learning curve shows that once NB is trained on the extracted features,
the training set error and test set error are close to each other. Therefore, we
do not have a problem of high variance, and gathering more data may not result in
much improvement in the performance of the learning algorithm. As a result, we
should try reducing bias to improve this classifier. This means that adding more
meaningful features to the list of tokens can decrease the error rate, which is the option
explored next.

1.9.1 PROPOSED SYSTEM ADVANTAGES

• Complexity is lower than in the previous process.

• Ability to learn and extract complex features.

• Accuracy is good.

• With its simplicity and fast processing time, the proposed algorithm gives
better execution time.

• Both machine learning and deep learning techniques are applied to predict the
value effectively.

• Prediction is accurate.

CHAPTER 2

ANALYSIS AND DESIGN

2.1 DATASET

The dataset used consists of 5572 instances, of which 4825 are labelled as
ham and 747 are labelled as spam, as shown in Fig2.1. Each message in the
dataset occupies a single line. The dataset was assembled from several sources: a subset
of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), which is a
dataset of about 10,000 legitimate messages collected for research at the Department
of Computer Science at the National University of Singapore; 450 SMS ham
messages collected from Caroline Tagg's PhD thesis; 1,002 ham messages and 322 spam
messages from SMS Spam Corpus v.0.1 Big; and 425 spam messages manually extracted
from the Grumbletext website. A graphical representation of the ham and spam SMS counts
is shown in Fig2.1.1.

Fig2.1. Dataset details

Fig2.1.1 Graphical representation of dataset.

2.2 LIBRARIES USED

The models are executed in the Jupyter Notebook platform by importing all the necessary
packages. The important libraries required are:

• Pandas

• NumPy

• Matplotlib

• Seaborn

• NLTK

Pandas is used for analysing data and performing operations such as reading,
cleaning and pre-processing the dataset. NumPy provides n-dimensional arrays and helps
in working with arrays, linear algebra, matrices and Fourier transforms. Matplotlib and
Seaborn are used mainly for graphical representation of the data. The NLTK library is mainly
used when the dataset contains categorical values, text and string values.

Graphs of the frequency (number of characters in a message) of all
the text messages in the dataset and of the ham and spam messages separately are
shown in Fig2.2 and Fig2.2.1 respectively. According to the graphs, most
ham messages have 700-800 words on average, whereas spam
messages have fewer than 150 words. The longest
ham SMS has a frequency of 1500, whereas the longest spam SMS has a frequency of 150.

Fig2.2 Frequency of all SMS

Fig2.2.1 Frequency of ham and spam SMS

2.3 DESIGN

The dataset containing ham and spam SMS is given as input and pre-processed by
dropping the null columns and renaming the columns v1 and v2 to
label and SMS. Feature extraction is then performed using CountVectorizer(). After
that, the dataset is split into train and test data and classification algorithms are applied
to it. The model is trained using the train data and tested using the test data. The
whole process of classifying SMS as ham or spam is represented in pictorial form
in Fig2.3.1.

Fig2.3.1 Use case diagram
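
The design steps described in this section (dropping the empty columns, renaming v1 and v2, counting words, and splitting into train and test data) can be sketched in plain Python as below. This is an approximation for illustration only: the three-row CSV excerpt is invented, the 2:1 split ratio is an assumption, and the hand-written vectorize function merely mimics the idea of scikit-learn's CountVectorizer(), whose tokenization rules differ.

```python
import csv
import io
import random
from collections import Counter

# Hypothetical excerpt in the v1/v2 layout, with one empty unnamed column.
raw = io.StringIO(
    "v1,v2,Unnamed: 2\n"
    "ham,see you at lunch,\n"
    "spam,free prize call now,\n"
    "ham,call me at lunch,\n"
)

# Pre-processing: keep only v1 and v2, renamed to label and SMS.
rows = [{"label": r["v1"], "SMS": r["v2"]} for r in csv.DictReader(raw)]

# Feature extraction: bag-of-words counts over a sorted vocabulary.
vocabulary = sorted({word for r in rows for word in r["SMS"].split()})

def vectorize(text):
    """Return the count of each vocabulary word in the given text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

X = [vectorize(r["SMS"]) for r in rows]
y = [r["label"] for r in rows]

# Train/test split: shuffle the indices and hold out the last third.
indices = list(range(len(rows)))
random.Random(42).shuffle(indices)
cut = len(indices) - len(indices) // 3
train_idx, test_idx = indices[:cut], indices[cut:]
```

Each row of X has one entry per vocabulary word, so the feature matrix is wide and mostly zeros for a realistic dataset.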

A network acts as an interface between sender and receiver. The message from the sender
is encrypted and sent to the network with some confidential details. The message then
reaches the receiver after the user is confirmed and the message is decrypted. After the
message is received, it undergoes a spam check to detect whether the SMS is ham or
spam. If it is spam, the message is moved to the spam folder. This is mainly done in email
services and in some phones. A pictorial representation is shown in Fig2.3.2.

Fig2.3.2 Sequence diagram of SMS spam detection.

2.3.1 Front End Module Diagrams:

Fig2.3.3 Front End Module Diagrams


2.3.2 Back End Module Diagrams:

Fig2.3.4 BackEnd Module Diagrams

2.4 SYSTEM SPECIFICATION

2.4.1 HARDWARE REQUIREMENTS

The hardware requirements may serve as the basis for a contract for the
implementation of the system and should therefore be a complete and consistent
specification of the whole system. They are used by software engineers as the starting
point for the system design. It shows what the system does and not how it should be
implemented.

PROCESSOR : Intel I5

RAM : 4GB

HARD DISK : 40 GB

2.4.2 SOFTWARE REQUIREMENTS

The software requirements document is the specification of the system. It should
include both a definition and a specification of requirements. It states what the
system should do rather than how it should do it. The software requirements provide a
basis for creating the software requirements specification. It is useful in estimating
cost, planning team activities, performing tasks and tracking the team’s
progress throughout the development activity.

PYTHON IDE : Anaconda Jupyter Notebook

PROGRAMMING LANGUAGE: Python

2.5 MODULE DESCRIPTION

2.5.1 DATA COLLECTION

The public dataset of labelled SMS messages is obtained from the UCI Machine Learning
Repository. The dataset considered in the current research is also available on Kaggle, a
machine learning platform. The dataset contains only 5,574 labelled
messages, of which 4,827 are ham messages and the
other 747 are spam messages. The dataset consists of two
named columns, the message label (ham or spam) followed by the text
of the message, and three unnamed columns.

It’s time for a data analyst to pick up the baton and lead the way to machine learning
implementation. The job of a data analyst is to find ways and sources of collecting
relevant and comprehensive data, interpreting it, and analyzing results with the help of
statistical techniques. The type of data depends on what you want to predict.

There is no exact answer to the question “How much data is needed?” because each
machine learning problem is unique. In turn, the number of attributes data scientists
will use when building a predictive model depends on the attributes’ predictive value.
A “the more, the better” approach is reasonable for this phase. Some data scientists suggest
considering that less than one-third of collected data may be useful. It’s difficult to
estimate which part of the data will provide the most accurate results until the model
training begins. That’s why it’s important to collect and store all data, internal and
open, structured and unstructured.

The purpose of preprocessing is to convert raw data into a form that fits machine
learning. Structured and clean data allows a data scientist to get more precise results
from an applied machine learning model. The technique includes data formatting,
cleaning, and sampling.
2.5.2 DATA FORMATTING

The importance of data formatting grows when data is acquired from various sources
by different people. The first task for a data scientist is to standardize record formats.
A specialist checks whether variables representing each attribute are recorded in the
same way. Titles of products and services, prices, date formats, and addresses are
examples of variables. The principle of data consistency also applies to attributes
represented by numeric ranges.

2.5.3 DATA CLEANING

This set of procedures allows for removing noise and fixing inconsistencies in data. A
data scientist can fill in missing data using imputation techniques, e.g. substituting
missing values with the mean of the attribute. A specialist also detects outliers, observations
that deviate significantly from the rest of the distribution. If an outlier indicates erroneous
data, a data scientist deletes or corrects it if possible. This stage also includes
removing incomplete and useless data objects.
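
The two cleaning operations named above, mean imputation and outlier detection, can be sketched as follows. The list of message lengths is invented, the function names are hypothetical, and the z-score rule with a threshold of 2 standard deviations is an assumed convention rather than something specified in the text.

```python
from statistics import mean, stdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def find_outliers(values, z_threshold=2.0):
    """Return values more than z_threshold sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > z_threshold]

# Hypothetical message lengths with one missing entry and one extreme value.
lengths = [40, 55, None, 48, 52, 45, 50, 900]

filled = impute_mean(lengths)
outliers = find_outliers(filled)
```

Note that the extreme value also inflates the imputed mean, so in practice it can be preferable to handle outliers before imputing or to impute with the median.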

2.5.4 DATA ANONYMIZATION

Sometimes a data scientist must anonymize or exclude attributes representing sensitive
information (e.g. when working with healthcare and banking data).

2.5.5 DATA SAMPLING

Big datasets require more time and computational power for analysis. If a dataset is too
large, applying data sampling is the way to go. A data scientist uses this technique to
select a smaller but representative data sample to build and run models much faster,
and at the same time to produce accurate outcomes.
Pre-processing is the first stage, in which the unstructured data is converted into more
structured data. Keywords in SMS text messages are prone to being replaced by
symbols. In this study, an English stop-word list has been
applied to eliminate the stop words in the SMS text messages.
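
The stop-word removal step described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the stop-word set is a deliberately tiny, hypothetical list, and the function name remove_stop_words is an assumption introduced here.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real English list would contain a few hundred words.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "in", "of", "and", "you", "your"}

def remove_stop_words(message):
    """Lowercase the message, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = remove_stop_words("You are the winner of a FREE prize, call to claim")
print(cleaned)  # ['winner', 'free', 'prize', 'call', 'claim']
```

Only the content-bearing words survive, which is what the later feature extraction step works on.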

2.5.6 NATURAL LANGUAGE PROCESSING ( NLP )

The input given by the user is processed through a number of stages to understand
what the user is trying to say. Natural language processing (NLP) is the ability of a
program to take the natural language spoken by a human and comprehend its
meaning. NLP is the study of the computational treatment of natural (human) language.
The development of NLP is challenging because computers are used to receiving
highly structured input, whereas natural language is highly complex and ambiguous,
with different linguistic structures and complex variables. NLP has various stages as
follows:

• Tokenization (lexical analysis), also referred to as segmentation, involves breaking
up a sentence or paragraph into tokens, i.e. individual words, numbers or
meaningful phrases. A token can be thought of as a small part of a larger unit: a word is a
token in a sentence and a sentence is a token in a paragraph. The words are
separated with the help of word boundaries. English is space delimited; hence,
word boundaries are the spaces between the end of one word and the start of the
next one. Example: "I am suffering with fever!" The output after tokenization
would be: ['I', 'am', 'suffering', 'with', 'fever']

• Syntactic analysis involves analysing words for grammar and arranging them in a
manner that shows their relationship. This can be done with a data
structure such as a parse tree or syntax tree. The tree is constructed with the
rules of grammar of the language. If the input can be produced using the syntax
tree, the input is found to have correct syntax. For example, the string "I pick
that have to" will be considered to have incorrect syntax.

• Semantic analysis picks up the dictionary meaning of words and tries to
understand the actual meaning of the sentence. It is the process of mapping
syntactic structures to the actual or text-independent meaning of the words.
Strings like "hot winter" will be disregarded.

• Pragmatic analysis deals with outside-world knowledge, that is, knowledge
external to the documents and queries. It reinterprets what was said in terms of
what was actually meant, covering the aspects of language that require real-world
knowledge.
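
The tokenization stage from the list above can be sketched in a few lines. This is a minimal example using a regular expression; the function name tokenize is an assumption, and real systems often use a dedicated NLP library instead.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    # \w+ matches runs of letters, digits and underscores, so "fever!" yields "fever".
    return re.findall(r"\w+", sentence)

tokens = tokenize("I am suffering with fever!")
print(tokens)  # ['I', 'am', 'suffering', 'with', 'fever']
```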

2.5.7 FEATURIZATION

Featurization is a way to change some form of data (text data, graph data, time-series
data…) into a numerical vector.
Featurization is different from feature engineering. Feature engineering means
transforming numerical features so that the machine learning models
work well; the features are already in numerical form. In featurization, by
contrast, the data need not already be in the form of a numerical vector.
A machine learning model cannot work with raw text data directly. In the end,
machine learning models work with numerical (categorical, real…) features. So it is
important to change such data into a numerical vector so that we can leverage the
whole power of linear algebra (making the decision boundary between data points) and
statistics tools with other types of data as well.
Feature extraction and selection are important for discriminating between ham and spam in
SMS text messages. For this phase TF-IDF will be used. TF-IDF is a commonly used weighting
method in the Vector Space Model, particularly in the information retrieval (IR) domain,
including text mining. It is a statistical method for measuring the importance of a word in
a document relative to the whole corpus. The term frequency is calculated in proportion to
the number of occurrences of a word in the document and is usually normalized to a value
between 0 and 1 to eliminate bias towards lengthy documents. To construct
the index of terms in TF-IDF, punctuation is removed and all text is lowercased during
tokenization. TF, or term frequency, reflects how frequently a term
occurs in a document. Therefore, the higher the TF, the more
significant the term is estimated to be in the respective document. IDF, or
Inverse Document Frequency, measures how infrequent a word or term is across the
documents. This weight is estimated using the whole training dataset. The idea
of IDF is that a word is not considered a good candidate to represent a document
if it occurs frequently in the whole dataset, as it might be a stop word or
a generic common word. Hence only words that are infrequent across the entire
dataset are relevant for a given document. TF-IDF not only assesses the importance of
words in a document but also evaluates the importance of words in the document
database or corpus. In this sense, the frequency of a word in the document increases the
weight of the word proportionally, but this is then offset by the word's frequency in the corpus.
This key characteristic of TF-IDF accounts for the fact that some words appear
more often than others in documents in general.
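
To make the weighting concrete, here is a small sketch of one common TF-IDF variant: tf = count / document length and idf = log(N / df). This is an assumption chosen for clarity; library implementations such as scikit-learn's TfidfVectorizer use a smoothed formula and give different numbers. The function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # tf is normalized by document length; idf = log(N / df).
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["free prize call now", "call me later", "are you free later"]
doc_weights = tf_idf(docs)
for doc, w in zip(docs, doc_weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

In the first message, "prize" occurs in only one document and therefore receives a higher weight than "call", which occurs in two. A word that occurs in every document gets a weight of zero under this variant.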

2.5.8 SPLITTING OF DATA

After cleaning, the data is normalized and split for training and testing the model. Once the
data is split, we train the algorithm on the training data set and keep the test data set
aside. This training process produces the trained model based on the logic of the
algorithm and the values of the features in the training data. The aim of
normalization is to bring all the values onto the same scale.

A dataset used for machine learning should be partitioned into three subsets: training,
test, and validation sets.

TRAINING SET:- A data scientist uses a training set to train a model and define its
optimal parameters, i.e. the parameters it has to learn from data.

TEST SET:- A test set is needed to evaluate the trained model and its
capability for generalization. The latter means a model's ability to identify patterns in
new, unseen data after having been trained on the training data. It's crucial to use
different subsets for training and testing to avoid model overfitting, which is the
incapacity for generalization mentioned above.
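
A hold-out split of this kind can be sketched with the standard library alone. The function name split_dataset, the 30% test fraction and the fixed seed are assumptions made for this illustration.

```python
import random

def split_dataset(samples, test_fraction=0.3, seed=1):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (text, label) pairs standing in for labelled SMS messages.
messages = [("msg %d" % i, i % 2) for i in range(10)]
train, test = split_dataset(messages)
print(len(train), len(test))  # 7 3
```

Shuffling before splitting avoids a test set that contains only the last rows of the file, which matters when the data is ordered by class or by date.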

2.5.9 MODELING

During this stage, a data scientist trains numerous models to define which one of them
provides the most accurate predictions.
Model Testing:
The goal of this step is to develop the simplest model able to formulate a target value
fast and well enough. A data scientist can achieve this goal through model tuning,
that is, the optimization of model parameters to achieve an algorithm's best
performance.

One of the more efficient methods for model evaluation and tuning is cross-validation.

CROSS-VALIDATION: - Cross-validation is the most commonly used tuning
method. It entails splitting a training dataset into ten equal parts (folds). A given model
is trained on nine folds and then tested on the tenth one (the one previously left
out). Training continues until every fold has been left aside and used for testing. From these
performance measurements, a specialist calculates a cross-validated score for each set
of hyperparameters. A data scientist trains models with different sets of
hyperparameters to determine which model has the highest prediction accuracy. The
cross-validated score indicates average model performance across the ten hold-out folds. Then a
data science specialist tests the model with the set of hyperparameter values that received
the best cross-validated score. There are various error metrics for machine learning
tasks.
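
The fold rotation described above can be sketched as follows. To keep the example self-contained, the "model" is a toy majority-class predictor, which is an assumption made purely for illustration; the function names and the label list are hypothetical, and the data is not shuffled.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validated_score(labels, k=10):
    """Average accuracy of a toy majority-class model over k folds."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        train_labels = [labels[i] for i in train]
        majority = max(set(train_labels), key=train_labels.count)  # "training"
        correct = sum(1 for i in test if labels[i] == majority)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

labels = [0] * 80 + [1] * 20  # hypothetical labels: 0 = ham, 1 = spam
score = cross_validated_score(labels)
print(score)
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out evaluation does.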

NAIVE BAYES:

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem (or
Bayes' rule) with strong independence (naive) assumptions. Parameter estimation for
Naive Bayes models uses maximum likelihood estimation. It takes only one pass
over the training set and is computationally very fast.

BAYES RULE

A conditional probability is the likelihood of some conclusion, C, given some
evidence/observation, D, where a dependence relationship exists between C and D.
This probability is denoted as P(C|D), where

$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)}$$

NB CLASSIFIER

The Naive Bayes classifier is one of the approaches with a high detection rate for
learning to classify text documents. Given a set of classified training samples, an
application can learn from these samples so as to predict the class of an unseen
sample.
The features (n1, n2, n3, n4) which are present in an SMS are assumed to be independent of
each other. Every feature n_i (1 ≤ i ≤ 4) takes a binary value showing whether the particular
property occurs in the SMS. The probability that the given SMS belongs to a
class r_i (r1: spam and r2: ham) is calculated as follows:

$$P(r_i \mid N) = \frac{P(r_i)\,P(N \mid r_i)}{P(N)}$$
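
As a worked illustration of this rule, the sketch below computes the posterior for four binary features under the independence assumption. All priors and likelihoods are hypothetical numbers chosen for the example, and the function name naive_bayes_posterior is an assumption.

```python
def naive_bayes_posterior(features, priors, likelihoods):
    """Return P(class | features) for binary features under the naive assumption."""
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for present, p_feature in zip(features, likelihoods[cls]):
            # P(n_i = 1 | class) if the feature is present, else its complement.
            p *= p_feature if present else (1.0 - p_feature)
        joint[cls] = p                      # P(class) * P(N | class)
    evidence = sum(joint.values())          # P(N)
    return {cls: p / evidence for cls, p in joint.items()}

# Hypothetical numbers, for illustration only.
priors = {"spam": 0.2, "ham": 0.8}
likelihoods = {"spam": [0.8, 0.6, 0.7, 0.5], "ham": [0.1, 0.2, 0.1, 0.4]}
posterior = naive_bayes_posterior([1, 1, 0, 1], priors, likelihoods)
print(posterior)
```

With these numbers the joint scores are 0.0144 for spam and 0.00576 for ham, so the message is assigned to spam with a posterior of about 0.71 even though the prior favours ham.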

PERFORMANCE METRICS:

The data was divided into two portions, training data and testing data,
consisting of 70% and 30% of the data respectively. Both algorithms were applied to the
same dataset using Enthought Canopy and the results were obtained.

Prediction accuracy is the main evaluation parameter that we used in this work.
Accuracy is the overall success rate of the algorithm and is defined in terms of
the confusion matrix.

CONFUSION MATRIX:

It is the most commonly used evaluation tool in predictive analysis, mainly because
it is very easy to understand and it can be used to compute other essential metrics such
as accuracy, recall and precision. It is an N×N matrix that describes the overall
performance of a model when used on some dataset, where N is the number of class
labels in the classification problem.

Accuracy is the number of true positive and true negative predictions divided by the
total number of positive and negative samples:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

True Positive (TP), True Negative (TN), False Negative (FN) and False Positive (FP)
predicted by all algorithms are presented in the table.

True positive (TP) indicates that a positive sample is predicted as positive; it is
the number of positive samples correctly predicted by the model.
False negative (FN) indicates that a positive sample is predicted as negative; it is
the number of positive samples wrongly predicted as negative by the model.
False positive (FP) indicates that a negative sample is predicted as positive; it is
the number of negative samples wrongly predicted as positive by the model.
True negative (TN) indicates that a negative sample is predicted as negative; it is
the number of negative samples correctly predicted by the model.
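
The four counts and the metrics derived from them can be computed directly from the true and predicted labels. The sketch below is a minimal from-scratch version with hypothetical labels; it does not guard against division by zero when a class is never predicted.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical labels: 1 = spam (positive class), 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)            # (2, 4, 1, 1)
print(metrics(*counts))
```

Here accuracy is 0.75 while precision and recall are both 2/3, which shows why accuracy alone can look better than the spam-specific scores on an imbalanced dataset.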

2.6 SYSTEM DESIGN

System design is the process of defining the interfaces, modules
and data of a system so that it satisfies the specified requirements. System design can be
seen as the application of systems theory. The main aim of designing a system is to develop
the system architecture by providing the data and information necessary for the
implementation of the system.

2.6.1 ARCHITECTURE DIAGRAM:

2.6.2 DATA FLOW DIAGRAM:

Data flow diagrams are used to graphically represent the flow of data in a business
information system. DFD describes the processes that are involved in a system to
transfer data from the input to the file storage and reports generation. Data flow
diagrams can be divided into logical and physical. The logical data flow diagram
describes flow of data through a system to perform certain functionality of a business.
The physical data flow diagram describes the implementation of the logical data flow.

2.6.3 USECASE DIAGRAM:

Use case diagrams identify the functionalities provided by the use cases, the actors who
interact with the system and the associations between the actors and the functionalities.

2.6.4 CLASS DIAGRAM:

The class diagram is a static diagram. It represents the static view of an application.
Class diagrams are used not only for visualizing, describing and documenting different
aspects of a system but also for constructing executable code of the software
application.

2.6.5 SEQUENCE DIAGRAM:

The sequence diagram of a system shows the interactions between entities arranged in
time order. It depicts the classes and objects involved in a scenario and
the sequence of messages exchanged between the objects that is needed to carry
out the functionality of that scenario.

2.6.6 ACTIVITY DIAGRAM:

The activity diagram is effective for modeling the functionality of the system.
It reflects the activities, the types of flows between these activities
and finally the response of objects to these activities.

2.6.7 STATE FLOW DIAGRAM:

The state chart diagram below describes the flow of control from one state to another,
driven by events, from the creation of an object to its termination.

CHAPTER 3

SOFTWARE SPECIFICATION

3.1 GENERAL

ANACONDA

It is a free and open-source distribution of the Python and R programming languages
for scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.) that aims to simplify package management and
deployment.

Anaconda distribution comes with more than 1,500 packages as well as the Conda
package and virtual environment manager. It also includes a GUI, Anaconda Navigator,
as a graphical alternative to the Command Line Interface (CLI).

The big difference between Conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science
and the reason Conda exists. Pip installs all Python package dependencies required,
whether or not those conflict with other packages you installed previously.

So your working installation of, for example, Google Tensorflow, can suddenly stop
working when you pip install a different package that needs a different version of the
Numpy library. More insidiously, everything might still appear to work but now you
get different results from your data science, or you are unable to reproduce the same
results elsewhere because you didn't pip install in the same order.

Conda analyzes your current environment, everything you have installed, any version
limitations you specify (e.g. you only want tensorflow >= 2.0) and figures out how to
install compatible dependencies. Or it will tell you that what you want can't be done.
Pip, by contrast, will just install the thing you wanted and any dependencies, even if
that breaks other things.

Open source packages can be individually installed from the
Anaconda repository, Anaconda Cloud (anaconda.org), or your own private repository
or mirror, using the conda install command. Anaconda Inc compiles and builds all the
packages in the Anaconda repository itself, and provides binaries for Windows 32/64
bit, Linux 64 bit and MacOS 64-bit. You can also install anything on PyPI into a
Conda environment using pip, and Conda knows what it has installed and what pip has
installed. Custom packages can be made using the conda build command, and can be
shared with others by uploading them to Anaconda Cloud, PyPI or other
repositories.

The default installation of Anaconda2 includes Python 2.7 and Anaconda3
includes Python 3.7. However, you can create new environments that include any
version of Python packaged with conda.

Fig 3.1 Anaconda Navigator

Anaconda Navigator is a desktop Graphical User Interface (GUI) included in the
Anaconda distribution that allows users to launch applications and manage conda
packages, environments and channels without using command-line commands.
Navigator can search for packages on Anaconda Cloud or in a local Anaconda
Repository, install them in an environment, run the packages and update them. It is
available for Windows, macOS and Linux.

The following applications are available by default in Navigator:

• JupyterLab

• Jupyter Notebook

• Console

• Spyder

• Glueviz

• Orange

• RStudio

• Visual Studio Code

Microsoft .NET is a set of Microsoft software technologies for rapidly building and
integrating XML Web services, Microsoft Windows-based applications, and Web
solutions. The .NET Framework is a language-neutral platform for writing programs
that can easily and securely interoperate. There's no language barrier with .NET: there
are numerous languages available to the developer including Managed C++, C#,
Visual Basic and Java Script. The .NET framework provides the foundation for
components to interact seamlessly, whether locally or remotely on different platforms.
It standardizes common data types and communications protocols so that components
created in different languages can easily interoperate.

".NET" is also the collective name given to various software components built upon
the .NET platform. These will be both products (Visual Studio.NET and
Windows.NET Server, for instance) and services (like Passport, .NET My Services,
and so on).

Microsoft VISUAL STUDIO is an Integrated Development Environment (IDE) from
Microsoft. It is used to develop computer programs, as well as websites, web apps,
web services and mobile apps.

Python is a powerful multi-purpose programming language created by Guido van
Rossum. It has simple easy-to-use syntax, making it the perfect language for someone
trying to learn computer programming for the first time. Python features are:

• Easy to code
• Free and Open Source
• Object-Oriented Language
• GUI Programming Support
• High-Level Language
• Extensible feature
• Python is Portable language
• Python is Integrated language
• Interpreted
• Large Standard Library
• Dynamically Typed Language

3.2 PYTHON:

Features of Python:

1. Easy to code:
Python is a high-level programming language. It is very easy to learn
compared to other languages such as C, C#, JavaScript and Java. It is very easy to code in
Python, and anybody can learn the basics of Python in a few hours or days. It is also
a developer-friendly language.

2. Free and Open Source:
Python is freely available from its official website.
Since it is open source, the source code is also available to the public, so
you can download it, use it and share it.

3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python supports
object-oriented concepts such as classes, objects and encapsulation.

4. GUI Programming Support:
Graphical user interfaces can be made using a module such as PyQt5, PyQt4,
wxPython or Tk in Python.
PyQt5 is the most popular option for creating graphical apps with Python.

5. High-Level Language:
Python is a high-level language. When we write programs in Python, we do not need to
remember the system architecture, nor do we need to manage the memory.

6. Extensible feature:
Python is an extensible language. We can write some of our Python code in C or C++
and compile that code with a C/C++ compiler.

7. Python is Portable language:
Python is also a portable language. For example, if we have Python code for
Windows and want to run it on another platform such as Linux, Unix or
Mac, we do not need to change it; we can run this code on any platform.

8. Python is Integrated language:
Python is also an integrated language because we can easily integrate Python with
other languages such as C and C++.

9. Interpreted Language:
Python is an interpreted language because Python code is executed line by line.
Unlike languages such as C, C++ and Java, there is no need to compile Python code, which
makes it easier to debug. The source code of Python is converted into an
intermediate form called bytecode.

10. Large Standard Library:
Python has a large standard library which provides a rich set of modules and functions, so
you do not have to write your own code for every single thing. There are many libraries
in Python for tasks such as regular expressions, unit testing and web browsers.

11. Dynamically Typed Language:
Python is a dynamically typed language. That means the type (for example int, float or
str) of a variable is decided at run time, not in advance. Because of this feature we
don't need to specify the type of a variable.
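
Dynamic typing can be seen in a two-line experiment: the same name is bound first to an integer and then to a string without any declaration. The variable names below are chosen only for this illustration.

```python
value = 42                         # value currently refers to an int
first_type = type(value).__name__
value = "forty-two"                # the same name can later refer to a str
second_type = type(value).__name__
print(first_type, second_type)     # int str
```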

3.3 APPLICATIONS OF PYTHON :

WEB APPLICATIONS

• You can create scalable Web Apps using frameworks and CMS (Content
Management System) that are built on Python. Some of the popular platforms
for creating Web Apps are: Django, Flask, Pyramid, Plone, Django CMS.
• Sites like Mozilla, Reddit, Instagram and PBS are written in Python.

3.3.1 Scientific And Numeric Computing

• There are numerous libraries available in Python for scientific and numeric
computing. There are libraries like: SciPy and NumPy that are used in general
purpose computing. And, there are specific libraries like: EarthPy for earth
science, AstroPy for Astronomy and so on.
• Also, the language is heavily used in machine learning, data mining and deep
learning.

3.3.2 Creating Software Prototypes

• Python is slow compared to compiled languages like C++ and Java. It might
not be a good choice if resources are limited and efficiency is a must.
• However, Python is a great language for creating prototypes. For example: You
can use Pygame (library for creating games) to create your game's prototype
first. If you like the prototype, you can use language like C++ to create the
actual game.

3.3.3 Good Language To Teach Programming


• Python is used by many companies to teach programming to kids. It is a good
language with a lot of features and capabilities. Yet, it's one of the easiest
languages to learn because of its simple, easy-to-use syntax.

CHAPTER 4

IMPLEMENTATION

4.1 PRE-PROCESSING OF DATASET

First the table is pre-processed by dropping all the null columns and renaming the
headings of the table. The label for each SMS is denoted in binary values where ‘0’
indicates ‘ham’ SMS and ‘1’ indicates ‘spam’ SMS. Length of every SMS in terms of
characters is calculated and added as a new column to the existing table as shown in
Fig4.1. Now the dataset is ready for undergoing feature extraction.

Fig4.1. Pre-processing of dataset
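
The pre-processing steps described above might look like the following sketch. It assumes the table is held in a pandas DataFrame, which the report does not state explicitly; the raw column names (v1, v2, "Unnamed: 2") and the new names (label, message, length) are assumptions, and a tiny in-memory table stands in for the real dataset.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the raw table; column names are assumptions.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at 5", "WINNER! Claim your free prize now", "Ok"],
    "Unnamed: 2": [None, None, None],   # an all-null column
})

data = raw.dropna(axis=1, how="all")                           # drop all-null columns
data = data.rename(columns={"v1": "label", "v2": "message"})   # rename the headings
data["label"] = data["label"].map({"ham": 0, "spam": 1})       # 0 = ham, 1 = spam
data["length"] = data["message"].str.len()                     # length in characters
print(data)
```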

4.2 FEATURE EXTRACTION

Here the text messages are converted to a matrix of token counts. This is
done by importing CountVectorizer from the sklearn.feature_extraction.text module.
After this the dataset is split into X_train, y_train and X_test, y_test data by importing
train_test_split from the sklearn.model_selection module. The train data has 75% of the
dataset and the remaining 25% forms the test data. Now X_train undergoes
feature extraction by applying the fit_transform function of CountVectorizer. The
fit_transform function takes the raw data, learns the vocabulary in the data and returns a
document-term matrix. In the same way X_test also undergoes feature extraction by
applying the transform function of CountVectorizer, which
transforms the data to a document-term matrix using the vocabulary already learned. At
this point, the SMS in the dataset have
been transformed into a matrix of token counts. The train dataset is used for training
the model and the test data is used for testing the model. Train and test data are shown
in pictorial format in the figure below. Now machine learning algorithms will be
applied over the datasets to test their performance.

Fig. 4.2 Training Dataset.
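
A minimal sketch of the split and vectorization steps is given below. The toy messages and the random_state value are assumptions made for illustration; only the 75/25 split and the fit_transform/transform usage come from the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; 1 = spam, 0 = ham.
messages = ["free prize call now", "call me later", "win a free prize",
            "are you home", "free entry win cash", "see you later",
            "claim your prize now", "lunch at noon"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 75% of the messages go to training and 25% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=1)

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)  # learn the vocabulary, then count
X_test_counts = vectorizer.transform(X_test)        # reuse the learned vocabulary
print(X_train_counts.shape, X_test_counts.shape)
```

Calling transform rather than fit_transform on the test messages keeps test-set words out of the vocabulary, so both matrices have the same number of columns.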

4.3 CLASSIFICATION ALGORITHMS

This is the main part of the project: training and testing the model and checking how it works.
Classification algorithms are supervised machine learning algorithms that
use the train data to learn and then try to predict the category the data falls in
during testing. Here I'm using five algorithms over the dataset.

• K-nearest neighbor
• Multinomial Naive Bayes
• Support Vector Machine
• Decision tree
• Random forest

After applying these algorithms, their performance is tested based on the accuracy
which is calculated by generating confusion matrix.

Accuracy, precision score, f1 score and recall score are calculated using the confusion
matrix and are used to check how the model has classified the data in the test dataset
during testing. True positive and true negative values are the data which are correctly
predicted and classified. A false positive means that negative data (ham) is classified as
positive (spam), and a false negative means that positive data (spam) is classified as
negative (ham).

4.3.1 K-Nearest Neighbor

K-nearest neighbor algorithm is a pattern recognition supervised learning algorithm for
classification. It assigns a class label to a new input by comparing it with the
inputs in the training set. The performance of k-NN is not quite up to par. Let (x, y) be
a training observation and h: X → Y be the learning function, so that for an observation x,
h(x) can determine the value of y. [4]

Here prediction is made using the n nearest neighbors, i.e. the n training data values
closest to the input. This algorithm stores the training dataset and performs its
computation on the train data at testing time to classify the test data. A data point is
categorized by using Euclidean distance as the metric and finding the training points at
minimum distance from it in the data space [2].

KNeighborsClassifier is imported from the sklearn.neighbors module. I've given the n-value
as 3, used 'minkowski' as the metric and assigned the value 2 to 'p', which is
equivalent to selecting Euclidean distance as the distance metric. I've assigned a
classifier with these values to a variable and used it to fit the train data and y_train dataset.
After this, prediction is made by using the classifier on the test data and the
results are assigned to a variable. Training and prediction are done by using the predefined
fit() and predict() functions. Code snippet is shown below.

Fig4.3.1 KNN algorithm
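
To show what the classifier does internally, the sketch below implements the k-nearest-neighbor vote from scratch rather than calling KNeighborsClassifier. It uses k = 3 and Euclidean distance as in the settings above; the function name knn_predict and the 2-D points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)   # Euclidean distance (Minkowski with p = 2)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D feature vectors; real inputs would be rows of token counts.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
prediction = knn_predict(points, labels, (4, 5))
print(prediction)  # spam
```

All the work happens at prediction time, which is why the algorithm is described as storing the training set instead of building a model during training.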
4.3.2 Multinomial Naive Bayes

Multinomial naive Bayes is a variant of the naive Bayes algorithm, which is based
on Bayes' theorem. Here each value in the dataset is treated as an independent
entity [2].
This algorithm calculates the posterior probability using the formula below:

$$P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)}$$

where P(a|b) denotes the posterior probability, P(b|a) denotes the likelihood,
P(a) denotes the class prior probability and P(b) denotes the predictor prior probability.

Using 10-fold cross-validation and naive Bayes with the multinomial event model and
Laplace smoothing on the dataset, the overall error is 1.12 percent, the spam caught (SC)
rate is 94.5 percent, and the blocked ham (BH) rate is 0.51 percent. [3]

Using data priors and applying Bayesian naive Bayes for the same event model
reduces SC (93.7%) and BH (0.44%) by a small amount, but the overall error remains the
same. This is to be expected given the Bayesian model. When there is a lot of
variation, the algorithm gets better.

These operations are done automatically by using predefined functions.
First MultinomialNB is imported from the sklearn.naive_bayes module. I've assigned
'1.0' to alpha and set the fit_prior value to 'True'. An instance of this class is assigned to
a variable which is used for fitting the train data and y_train dataset using the fit() function,
after which prediction is done over the test dataset using the predict() function. Code
snippet for fitting train data and predicting test data is shown in Fig4.3.2.

Fig4.3.2. Multinomial Naive Bayes algorithm
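
A minimal sketch of this step, using the alpha and fit_prior values stated above, is shown here. The four training messages and two test messages are hypothetical stand-ins for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; 1 = spam, 0 = ham.
train_messages = ["free prize call now", "win a free prize",
                  "call me later", "see you at lunch"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

classifier = MultinomialNB(alpha=1.0, fit_prior=True)  # alpha=1.0 is Laplace smoothing
classifier.fit(X_train, train_labels)

X_test = vectorizer.transform(["free prize", "see you later"])
predictions = classifier.predict(X_test)
print(predictions)  # [1 0]
```

Laplace smoothing matters here because "later" never occurs in a spam training message; without it the spam likelihood of the second test message would be exactly zero.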

4.3.3 Support Vector Machine

Support Vector Machine, SVM for short, is an algorithm which can perform
both classification and regression. SVM mainly uses kernels, which are
mathematical functions for expanding the dimension of the space in which data points are
plotted and classified using a hyper-plane [2].

The hyper-plane is placed at a point where the classes are classified correctly. There
are many types of kernels in SVM and I've used 'linear' as the kernel type for
classification of ham and spam SMS. For training and testing the dataset, SVC is
imported from the sklearn.svm module and an instance of the class is assigned to a variable.
The train data and y_train data are fitted using fit() and prediction is made over the test
data using the predict() function. Code snippet of this algorithm is shown below in Fig4.3.3.

Fig4.3.3. SVM algorithm
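
The linear-kernel classifier described above can be sketched as follows. Only kernel='linear' is taken from the text; the toy messages are hypothetical and all other SVC parameters are left at their defaults, which is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy data; 1 = spam, 0 = ham.
train_messages = ["free prize call now", "win a free prize",
                  "call me later", "see you at lunch"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

classifier = SVC(kernel="linear")      # separating hyper-plane in token-count space
classifier.fit(X_train, train_labels)

X_test = vectorizer.transform(["free prize", "see you later"])
predictions = classifier.predict(X_test)
print(predictions)  # [1 0]
```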

However, increasing the degree of the polynomial from two to three indicates an
increase in error rates by using the polynomial kernel. As the degree is raised, the
error rate does not change. Furthermore, another kernel is used called the radial basis
function (RBF).[3]

4.3.4 Decision Tree

Decision tree is a tree-structured algorithm used for classification and regression.
This algorithm works with categorical variables [2]. The decision tree (DT) technique is
easily understandable and simple for making decisions. A DT contains external and
internal nodes interlinked with each other. Decisions are made at the
internal nodes, and each child node is reached from its preceding node. A leaf node
has no children and is associated with a label [4].

This algorithm reduces the entropy by breaking the dataset into smaller subsets and
developing a decision tree. This tree is very easy to explain and understand since it
visualizes a sequence of Yes or No questions. To apply this algorithm over the
dataset, DecisionTreeClassifier is imported from the sklearn.tree module. I've used
'entropy' as the criterion with random_state as 1. Then an instance of this class is assigned
to a variable which is used with the fit() function to fit the train data and y_train data, and
used with the predict() function to classify data in the test data. Code snippet of how this
algorithm is applied is shown in Fig4.3.4 below.

Fig4.3.4. Decision tree algorithm
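
The 'entropy' criterion can be illustrated without the library: a split is good when it lowers the weighted entropy of the resulting subsets. The sketch below is a from-scratch illustration, not the DecisionTreeClassifier code; the function names and the example split are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Hypothetical split on the question "does the message contain 'free'?"
parent = ["spam"] * 4 + ["ham"] * 4
yes_branch = ["spam", "spam", "spam", "ham"]
no_branch = ["spam", "ham", "ham", "ham"]
gain = information_gain(parent, yes_branch, no_branch)
print(round(entropy(parent), 3), round(gain, 3))  # 1.0 0.189
```

A tree builder evaluates candidate questions in this way at each node and keeps the one with the largest gain.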

4.3.5 Random Forest

Random forest algorithm is similar to decision tree and is used for classification
and regression. As a forest contains many trees, this algorithm is an ensemble of
many decision trees which merges the trees for better accuracy. It also provides
additional randomness in the selection of features from the dataset to form trees. Random
forest's performance is similar to and, in some cases, better than that of a
decision tree. I've used 'entropy' as the criterion and random_state as 1. To use this
algorithm, RandomForestClassifier is imported from the sklearn.ensemble module and an
instance is assigned to a variable which will be used with the fit() and predict() functions.
The train and y_train dataset is fitted using the fit() function and the predict() function is
applied on the test data to test its classification. Code snippet which explains working of
the algorithm is given below.

Fig4.3.5 Random forest algorithm
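
A minimal sketch of this step with the stated criterion and random_state is given below. The toy messages are hypothetical, and the number of trees is left at the library default, which is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; 1 = spam, 0 = ham.
train_messages = ["free prize call now", "win a free prize",
                  "call me later", "see you at lunch"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# Each tree is grown on a bootstrap sample with a random subset of features.
classifier = RandomForestClassifier(criterion="entropy", random_state=1)
classifier.fit(X_train, train_labels)

X_test = vectorizer.transform(["free prize", "see you later"])
predictions = classifier.predict(X_test)
print(predictions)
```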

The random forest implementation in the scikit-learn Python library is used in this
work, which averages the probabilistic predictions. For this procedure, two different
numbers of estimators are simulated. With ten estimators, the overall error is 2.16 percent,
SC is 87.7 percent, and BH is just 0.73 percent. With 100 estimators, the overall error is
1.41 percent, SC is 92.2 percent, and BH is 0.51 percent. [3]

Now that the model is trained and tested using the dataset by the required algorithms,
their performance is checked in the classification report which has the accuracy score,
precision score, f1 score and recall score needed. This part is explained elaborately in
the next section.

4.4 SCREENSHOTS

CHAPTER 5

TESTING AND RESULTS

5.1 TESTING

The performance of all the algorithms is tested by generating a confusion matrix and
classification report for each of them. The required functions, such as accuracy_score,
precision_score, recall_score, f1_score, confusion_matrix and classification_report, are
imported from the sklearn.metrics module. The values are calculated by using data from
the confusion matrix and the formulae shown in Fig5.1.

Fig 5.1. Formulae used for calculating accuracy, precision, recall, f1-score
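
As the figure itself is not reproduced here, the sketch below applies the standard definitions of these four measures, which are assumed to match those in Fig 5.1. The function name classification_metrics and the counts passed to it are hypothetical and are not the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    # Accuracy: share of all messages that were classified correctly.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision: share of messages flagged as spam that really are spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual spam messages that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts chosen for illustration only.
accuracy, precision, recall, f1 = classification_metrics(tp=50, fp=0, fn=50, tn=900)
print(accuracy, precision, recall, round(f1, 3))
```

With these counts, no ham is misclassified, so precision is perfect, but half of the spam is missed, which pulls recall and F1 down even though accuracy stays high.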

The code snippet that imports the modules and generates the confusion matrix is shown
in Fig 5.1.1 below.

Fig 5.1.1 Importing confusion matrix and classification report
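
A minimal sketch of this step, using the sklearn.metrics functions named above, could look like the following. The label lists y_test and y_pred are hypothetical stand-ins for the real test labels and model predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration only; 1 = spam, 0 = ham.
y_test = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Rows are true classes and columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

# Individual scores, computed for the positive (spam) class.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(cm)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
```

The classification report prints precision, recall and f1-score per class, while the individual score functions return a single number for the spam class by default.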

5.2 RESULTS

I’ve displayed all the values generated from the confusion matrix and classification
report in tabular format in Table1. All the algorithms are combined with the
CountVectorizer() function to give the required values. The accuracies are compared
and represented as a graph in Fig5.2.

Algorithm                  Accuracy   Precision   Recall   F1-Score

K-nearest neighbor         0.935      1.0         0.5      0.666
Multinomial Naive Bayes    0.985      0.954       0.933    0.943
Support Vector Machine     0.987      0.993       0.905    0.947
Decision tree              0.965      0.854       0.883    0.868
Random forest              0.982      1.0         0.861    0.925

Table1. Comparing performance of Algorithms

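[Bar chart: accuracy (A) of each algorithm, labelled K (K-nearest neighbor), M
(Multinomial Naive Bayes), S (Support Vector Machine), D (Decision tree) and R (Random
forest)]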

Fig5.2. Graphical representation of the results

All the values fall in the range between 0 and 1. Looking at the accuracy, the
Multinomial Naive Bayes, SVM and Random Forest algorithms all reach an accuracy of
about 98%. On closer inspection, SVM performs best with 98.7%, which rounds to 99%.
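
The comparison can be checked directly from the figures in Table1. The sketch below copies those values into a dictionary (the names results, ranking and recomputed_f1 are chosen for the example), ranks the algorithms by accuracy, and recomputes each F1-score from the reported precision and recall as a consistency check.

```python
# Values copied from Table1: (accuracy, precision, recall, f1).
results = {
    "K-nearest neighbor": (0.935, 1.0, 0.5, 0.666),
    "Multinomial Naive Bayes": (0.985, 0.954, 0.933, 0.943),
    "Support Vector Machine": (0.987, 0.993, 0.905, 0.947),
    "Decision tree": (0.965, 0.854, 0.883, 0.868),
    "Random forest": (0.982, 1.0, 0.861, 0.925),
}

# Rank the algorithms by accuracy, highest first.
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)
best = ranking[0]

# Recompute F1 as the harmonic mean of the reported precision and recall.
recomputed_f1 = {
    name: 2 * p * r / (p + r) for name, (_, p, r, _) in results.items()
}
max_f1_gap = max(abs(recomputed_f1[name] - results[name][3]) for name in results)

for name in ranking:
    print(f"{name}: {results[name][0] * 100:.1f}%")
```

The recomputed F1-scores agree with the tabulated ones to within rounding, which suggests the table is internally consistent.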

5.3 VERIFICATION

This model also performs better than the model proposed by Nilam Nur Amir Sjarif et
al[1], which uses term frequency-inverse document frequency (TF-IDF) for feature
extraction. In their model, random forest gave an accuracy of 97.5%. These results
suggest that, for this task, the CountVectorizer function performs better than the
TF-IDF vectorizer. Both vectorizers are applied to datasets whose values are text
strings.
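
The difference between the two vectorizers can be seen on a tiny example. The three messages below are made up for illustration, and the sketch only shows how the features differ; it does not say which vectorizer gives better accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy messages for illustration only.
messages = [
    "free prize free entry",
    "call me later",
    "free call now",
]

count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

# Both produce one row per message and one column per vocabulary word.
count_matrix = count_vectorizer.fit_transform(messages)
tfidf_matrix = tfidf_vectorizer.fit_transform(messages)

# Look up the column that holds the word "free" in each matrix.
free_count = count_matrix[0, count_vectorizer.vocabulary_["free"]]
free_tfidf = tfidf_matrix[0, tfidf_vectorizer.vocabulary_["free"]]

print(sorted(count_vectorizer.vocabulary_))
print(free_count, round(float(free_tfidf), 3))
```

CountVectorizer stores the raw number of occurrences of each word, whereas TfidfVectorizer rescales the counts so that words appearing in many messages carry less weight and each row is normalised.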

CHAPTER 6

CONCLUSION

6.1 CONCLUSION

Spam detection is critical for secure message and e-mail communication. The effective
identification of spam is a major concern, and numerous researchers have suggested a
variety of detection methods. These techniques, however, still fall short of detecting
spam effectively and reliably. To address this issue, I experimented with different
spam detection approaches based on machine learning predictive models. According to
the simulation results, the best classifiers for SMS are multinomial naive Bayes with
Laplace smoothing and SVM with a linear kernel. In the original paper cited, the best
classifier used SVM as the learning algorithm, which resulted in an overall accuracy
of 97.64 percent.

6.2 FUTURE SCOPE

The future scope of this project involves adding more feature parameters. Taking more
parameters into account is expected to improve the accuracy. The algorithms can also be
applied to analyzing the contents of public comments and thus determining
patterns/relationships between the customer and the company. The use of traditional
algorithms and data mining techniques can also help predict the performance structure
of a corporation as a whole. In the future, we plan to integrate a neural network with
other techniques such as genetic algorithms or fuzzy logic. A genetic algorithm can
be used to identify the optimal network architecture and training parameters. Fuzzy
logic provides the ability to account for some of the uncertainty in the neural
network's predictions. Used in conjunction with a neural network, these techniques
could improve SMS spam prediction.

APPLICATIONS:

• It can be used by companies to protect users from fake links.

• It can help prevent hacking attempts.

REFERENCES

[1] Modupe, A., O. O. Olugbara, and S. O. Ojo. (2014) “Filtering of Mobile Short
Messaging Communication Using Latent Dirichlet Allocation with Social Network
Analysis”, in Transactions on Engineering Technologies: Special Volume of the World
Congress on Engineering 2013, G.-C. Yang, S.-I. Ao, and L. Gelman, Eds. Springer
Science & Business. pp. 671–686.

[2] Shirani-Mehr, H. (2013) “SMS Spam Detection using Machine Learning Approach.”
p. 4.

[3] Abdulhamid, S. M. et al., (2017) “A Review on Mobile SMS Spam Filtering
Techniques.” IEEE Access 5: 15650–15666.

[4] Aski, A. S., and N. K. Sourati. (2016) “Proposed Efficient Algorithm to Filter
Spam Using Machine Learning Techniques.” Pac. Sci. Rev. Nat. Sci. Eng. 18 (2):
145–149.

[5] Narayan, A., and P. Saxena. (2013) “The Curse of 140 Characters: Evaluating The
Efficacy of SMS Spam Detection on Android.” p. 33–42.

[6] Almeida, T. A., J. M. Gómez, and A. Yamakami. (2011) “Contributions to the
Study of SMS Spam Filtering: New Collection and Results.” p. 4.

[7] Mujtaba, D. G., and M. Yasin. (2014) “SMS Spam Detection Using Simple
Message Content Features.” J. Basic Appl. Sci. Res. 4 (4): 5.

[8] Gudkova, D., M. Vergelis, T. Shcherbakova, and N. Demidova. (2017) “Spam and
Phishing in Q3 2017.” Securelist - Kaspersky Lab’s Cyberthreat Research and Reports.
Available from: https://securelist.com/spam-and-phishing-in-q3-2017/82901/.
[Accessed: 10th April 2018].
[9] Choudhary, N., and A. K. Jain. (2017) “Towards Filtering of SMS Spam Messages
Using Machine Learning Based Technique”, in Advanced Informatics for Computing
Research 712: 18–30.

[10] Safie, W., N.N.A. Sjarif, N.F.M. Azmi, S.S. Yuhaniz, R.C. Mohd, and S.Y. Yusof.
(2018) “SMS Spam Classification using Vector Space Model and Artificial Neural
Network.” International Journal of Advances in Soft Computing & Its Applications 10
(3): 129–141.

[11] Fawagreh, Khaled, Mohamed Medhat Gaber, and Eyad Elyan. (2014) “Random
Forests: From Early Developments to Recent Advancements.” Systems Science &
Control Engineering: An Open Access Journal 2 (1): 602–609.

[12] Sajedi, H., G. Z. Parast, and F. Akbari. (2016) “SMS Spam Filtering Using
Machine Learning Techniques: A Survey.” Machine Learning, 1 (1): 14.

[13] Xu, Q., E. W. Xiang, Q. Yang, J. Du, and J. Zhong. (2012) “SMS Spam Detection
Using Noncontent Features.” IEEE Intell. Syst. 27 (6): 44–51.

[14] Sethi, G., and V. Bhootna. (2014) SMS Spam Filtering Application Using Android.

[15] Nagwani, N. K. (2017) “A Bi-Level Text Classification Approach for SMS Spam
Filtering and Identifying Priority Messages.” 14 (4): 8.

[16] Delany, S. J., M. Buckley, and D. Greene. (2012) “SMS Spam Filtering: Methods
and Data.” Expert Syst. Appl. 39 (10): 9899–9908.

[17] Chan, P. P. K., C. Yang, D. S. Yeung, and W. W. Y. Ng. (2015) “Spam Filtering
for Short Messages in Adversarial Environment.” Neurocomputing 155: 167–176.

[18] Sethi, P., V. Bhandari, and B. Kohli. (2017) “SMS Spam Detection and
Comparison of Various Machine Learning Algorithms”, in 2017 International
Conference on Computing and Communication Technologies for Smart Nation
(IC3TSN). pp. 28–31.

[19] Warade, S. J., P. A. Tijare, and S. N. Sawalkar. (2014) “An Approach for SMS
Spam Detection.” Int. J. Res. Advent Technol. 2 (2): 4.

[20] Almeida, T. A., and J. M. G. Hidalgo. (2018) “SMS Spam Collection.”
Available from: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. [Accessed:
11th April 2018].
