Kriti Final Report
I/We hereby declare that the work presented in this report entitled “SMS Spam
Detection Using Machine Learning” was carried out by me/us. I/We have not submitted
the matter embodied in this report for the award of any other degree or diploma of any
other University or Institute.
I/We have given due credit to the original authors/sources for all the words, ideas,
diagrams, graphics, computer programs, experiments, and results that are not my/our
original contribution. I/We have used quotation marks to identify verbatim sentences
and given credit to the original authors/sources.
I/We affirm that no portion of my/our work is plagiarized, and the experiments and
results reported in the report are not manipulated. In the event of a complaint of
plagiarism and the manipulation of the experiments and results, I/We shall be fully
responsible and answerable.
(Candidates’ Signature)
Date:
CERTIFICATE
This is to certify that the Project Report entitled “SMS Spam Detection Using Machine
Learning” submitted by Archana, Kriti Yadav, Akanksha Dubey, Anjali Sharma in
partial fulfilment of the requirement for the award of Bachelor of Technology in
Computer Science & Engineering from Buddha Institute of Technology, Gorakhpur
affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh
represents the work carried out by the students under my supervision. The project
embodies the results of original work and studies carried out by the students themselves, and
the contents of the project do not form the basis for the award of any other degree to the
candidates or to anybody else.
Ms. Pallavi Dixit
(Assistant Professor)
Department of Computer Science & Engineering
Buddha Institute of Technology, GIDA, Gorakhpur
ABSTRACT
SMS spam detection is a crucial task in ensuring user privacy and security in mobile
communication. This paper proposes a machine learning approach for detecting SMS spam
messages. Leveraging a dataset of labelled SMS messages, we employ various feature
extraction techniques and machine learning algorithms, including but not limited to, Naive
Bayes, Support Vector Machines (SVM), and Random Forest. We explore the
effectiveness of different feature representations such as bag-of-words, TF-IDF, and word
embeddings. Additionally, we investigate the impact of different preprocessing steps such
as text normalization and stemming. Experimental results demonstrate the effectiveness of
our proposed approach in accurately identifying spam messages while minimizing false
positives. This research contributes to the ongoing efforts in combating SMS spam and
enhances user experience and security in mobile communication.
ACKNOWLEDGEMENT
First and foremost, we extend our heartfelt gratitude to God Almighty for providing us
with the strength, resources, and perseverance to complete this project successfully.
Without His divine guidance, our efforts would not have come to fruition.
Our sincere thanks go to our project guide, Ms. Pallavi Dixit, Assistant Professor in the
Department of Computer Science & Engineering, for her unwavering support,
insightful guidance, and constant encouragement. Your expertise and dedication have
been crucial to our success.
We also express our gratitude to all the faculty members and technical experts whose
help and encouragement have been instrumental in the completion of this project. Your
knowledge and support have been invaluable.
TABLE OF CONTENTS
Candidate’s Declaration
Certificate
Acknowledgement
Abstract
List of Figures
List of Tables
1.1 INTRODUCTION
2.1 DATASET
2.3 DESIGN
2.5.7 Featurization
2.5.9 Modeling
2.6.4 Class Diagram
3.1 GENERAL
3.2 PYTHON
4.2 FEATURE EXTRACTION
4.3 CLASSIFICATION ALGORITHMS
4.3.1 K-Nearest Neighbor
4.3.2 Multinomial Naïve Bayes
4.3.3 Support Vector Machine
4.3.4 Decision Tree
4.3.5 Random Forest
4.4 SCREENSHOTS
5.1 TESTING
5.2 RESULTS
5.3 VERIFICATION
Chapter 6 CONCLUSION
6.1 CONCLUSION
REFERENCES
LIST OF FIGURES
5.2 Graphical representation of the results
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Today the phone has become a vital product in everyone’s life. Almost every
person uses a smartphone to connect with family, teachers and friends, to order
from online platforms, to use social media, and so on. To do these activities, a
phone needs a SIM card with internet connectivity. The SIM in our phones
receives many text messages (SMS), of which only a small percentage are important.
The rest may carry unwanted information or try to obtain
our confidential information such as bank details and passwords. Some of these messages
claim that the recipient has won a large sum of money or an expensive product
by answering a silly question.
Researchers around the world have found many methods to reduce unwanted text
messages. One of these methods is to use machine learning techniques. Our
objective is to classify an SMS as ham or spam using machine learning algorithms,
with CountVectorizer() for feature extraction. We used the supervised machine learning
algorithms K-Nearest Neighbor (KNN), Multinomial Naïve Bayes, Support
Vector Machine (SVM), Decision Tree and Random Forest over a dataset taken from
the Kaggle platform to classify whether an SMS is ham or spam and to block spam
messages, which reduces the threat to a person and their confidential details.
Nilam Nur Amir Sharif et al. [1] built a model to check whether an SMS is ham or spam
using supervised machine learning algorithms such as K-Nearest Neighbor (KNN),
Multinomial Naive Bayes, Support Vector Machine (SVM), decision tree and random
forest, along with term frequency-inverse document frequency (TF-IDF) for feature
extraction. Their dataset is taken from the UCI Machine Learning Repository and contains
5574 instances. After feature extraction with the TF-IDF
vectorizer, the dataset is split into train and test data. Each algorithm is fitted on
the train data and evaluated by calling the predict function
on the test data, after which a classification report is produced. The reported values
show that random forest with the TF-IDF vectorizer outperforms the other
algorithms, giving 97.5% accuracy, a precision of 0.98 and an F1-score of 0.97. SVM
performed worst, giving 87.49% accuracy, a precision of 0.77 and an F1-score of
0.82.
Prediction of SMS spam has been an important area of research for a long time. The
goal is to apply different machine learning algorithms to the SMS spam classification
problem, compare their performance to gain insight and further explore the problem,
and design an application based on one of these algorithms that can filter SMS spam
with high accuracy. The current work proposes a range of machine learning and deep
learning-based predictive models for accurately predicting SMS spam.
The predictive power of the models is further enhanced by introducing the
deep learning-based long short-term memory (LSTM) network into the predictive
framework.
1. The proposed model is based on the study of SMS text data and technical
indicators. The algorithm selects the best combination of free parameters for the LSTM to
avoid over-fitting and local minima problems and to improve prediction accuracy.
2. Our dataset consists of one large text file in which each line corresponds to a text
message. Therefore, preprocessing of the data, extraction of features, and
tokenization of each message are required. After feature extraction, an initial
analysis of the data is done using a label encoder, and then models such as the naive
Bayes (NB) algorithm and LSTM are used in the next steps for prediction.
3. The two methods used to predict spam messages are fundamental and
technical analyses.
The Random Forest (RF) algorithm will be used for classification of ham or spam during this
phase. RF is an averaging ensemble learning method that can be used for classification
problems. The algorithm combines several decision tree models in order to reduce
the overfitting problem of individual decision trees. In the RF algorithm, each tree
provides its own prediction, which may differ from those of the other trees. The
predictions of the individual trees are then combined, by averaging or voting, into a
single generalized result. During the training phase, a set of decision trees is
constructed, each operating on randomly selected features. RF can also
work well with a large dataset containing a variety of feature types, such as binary,
categorical and numerical. The algorithm works as follows: for each tree in
the forest, a bootstrap sample is selected from the training set S, where S(i) denotes the ith bootstrap sample.
A decision tree is then learned using a modified decision-tree learning algorithm.
The algorithm is modified as follows: at each node of the tree, instead of examining all
possible feature splits, a subset of the features f ⊆ F is selected randomly,
where F is the set of spam features. The node then splits on the best feature in f rather
than F. In practice, f is much smaller than F. Deciding which feature to split on
is often the most computationally expensive aspect of decision tree learning. By
narrowing the set of features, the learning of the tree is sped up
drastically.
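The two randomized steps described above, drawing a bootstrap sample for each tree and restricting each node to a random feature subset f, can be sketched in plain Python. This is a minimal illustration and not the implementation used in this project; the function names, the seeded random generator and the use of majority voting to combine the tree outputs are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one bootstrap sample S(i)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, subset_size, rng):
    """Choose the candidate features f (a random subset of F) for one node."""
    return rng.sample(range(n_features), subset_size)


def majority_vote(tree_predictions):
    """Combine per-tree labels into one forest prediction (assumed rule)."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

A common heuristic sets subset_size close to the square root of the number of features; the value used in this project is not stated.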
The KNN algorithm classifies new data based on the class of the k nearest neighbors.
This paper uses k = 6. The distance to the neighbors can be calculated
using various distance metrics, such as Euclidean distance, Manhattan distance (used
in this paper), or Minkowski distance. The class of the new data may be decided by
majority vote or by weighting each neighbor in inverse proportion to its distance. KNN is a
non-generalizing method, since the algorithm keeps all of its training data in memory,
possibly transformed into a fast-indexing structure such as a ball tree or a KD tree.
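As an illustration of the rule described above, the following sketch classifies a query vector by majority vote among its k = 6 nearest neighbors under Manhattan distance. It is a brute-force version written for clarity, not the code used in this project; the function names and the toy feature vectors in the usage note are assumptions.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_vectors, train_labels, query, k=6):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: manhattan(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For example, with five "ham" vectors near the origin and five "spam" vectors near (9, 9), a query at (8, 8) receives five spam votes and one ham vote. Because every training vector is scanned for each query, this sketch does not use the ball tree or KD tree indexing mentioned above.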
1.8.2 Random Forest (RF):
• Accuracy is low.
LITERATURE SURVEY
TITLE: SMS Spam Detection Based on Long Short-Term Memory and Gated
Recurrent Unit
Abstract:
An SMS spam is the message that hackers develop and send to people via mobile
devices targeting to get their important information. For people who are ignorant, if
they follow the instruction in the message and fill their important information, such as
internet banking account in a faked website or application, the hacker may get the
information. This may lead to loss their wealth. The efficient spam detection is an
important tool in order to help people to classify whether it is a spam SMS or not. In
this research, we propose a novel SMS spam detection based on the case study of the
SMS spams in English language using Natural Language Process and Deep Learning
techniques. To prepare the data for our model development process, we use word
tokenization, padding data, truncating data and word embedding to make more
dimension in data. Then, this data is used to develop the model based on Long Short-
Term Memory and Gated Recurrent Unit algorithms. The performance of the proposed
models is compared to the models based on machine learning algorithms including
Support Vector Machine and Naïve Bayes. The experimental results show that the
model built from the Long Short-Term Memory technique provides the best overall
accuracy as high as 98.18%. On accurately screening spam messages, this model
shows the ability that it can detect spam messages with the 90.96% accuracy rate,
while the error percentage that it misclassifies a normal message as a spam message is
only 0.74%.
YEAR:-2019
Abstract:
The daily traffic of Short Message Service (SMS) keeps increasing. As a result, it leads
to dramatic increase in mobile attacks such as spammers who plague the service with
spam messages sent to the groups of recipients. Mobile spams are a growing problem
as the number of spams keep increasing day by day even with the filtering systems.
Spams are defined as unsolicited bulk messages in various forms such as unwanted
advertisements, credit opportunities or fake lottery winner notifications. Spam
classification has become more challenging due to complexities of the messages
imposed by spammers. Hence, various methods have been developed in order to filter
spams. In this study, methods of term frequency-inverse document frequency (TF-
IDF) and Random Forest Algorithm will be applied on SMS spam message data
collection. Based on the experiment, Random Forest algorithm outperforms other
algorithms with an accuracy of 97.50%
YEAR: - 2020
Abstract:
The spam detection is a big issue in mobile message communication due to which
mobile message communication is insecure. In order to tackle this problem, an
accurate and precise method is needed to detect the spam in mobile message
communication. We proposed the applications of the machine learning-based spam
detection method for accurate detection. In this technique, machine learning classifiers
such as Logistic regression (LR), K-nearest neighbor (K-NN), and decision tree (DT)
are used for classification of ham and spam messages in mobile device communication.
The SMS spam collection data set is used for testing the method. The dataset is split
into two categories for training and testing the research. The results of the experiments
demonstrated that the classification performance of LR is high as compared with K-
NN and DT, and the LR achieved a high accuracy of 99%. Additionally, the proposed
method performance is good as compared with the existing state-of-the-art methods.
TITLE: SMS Spam Detection using Machine Learning and Deep Learning
Techniques
YEAR: - 2021
Abstract:
The number of people using mobile devices is increasing day by day. SMS (short
message service) is a text message service available in smartphones as well as basic
phones. So, the traffic of SMS increased drastically. The spam messages also increased.
The spammers try to send spam messages for their financial or business benefits like
market growth, lottery ticket information, credit card information, etc. So, spam
classification has special attention. In this paper, we applied various machine learning
and deep learning techniques for SMS spam detection. We used a dataset from UCI and
built a spam detection model. Our experimental results have shown that our LSTM
model outperforms previous models in spam detection with an accuracy of 98.5%. We
used Python for all implementations.
YEAR: - 2019
Abstract:
Over recent years, as the popularity of mobile phone devices has increased, Short
Message Service (SMS) has grown into a multi-billion dollars industry. At the same
time, reduction in the cost of messaging services has resulted in growth in unsolicited
commercial advertisements (spams) being sent to mobile phones. In parts of Asia, up
to 30% of text messages were spam in 2012. Lack of real databases for SMS spams,
short length of messages and limited features, and their informal language are the
factors that may cause the established email filtering algorithms to underperform in
their classification. In this project, a database of real SMS Spams from UCI Machine
Learning repository is used, and after preprocessing and feature extraction, different
machine learning techniques are applied to the database. Finally, the results are
compared and the best algorithm for spam filtering for text messaging is introduced.
Final simulation results using 10-fold cross validation shows the best classifier in this
work reduces the overall error rate of best model in original paper citing this dataset by
more than half.
1.9 PROPOSED SYSTEM
The NB algorithm is applied to the dataset using the extracted features with different training
set sizes. The performance in the learning curve is evaluated by splitting the dataset into
a 70% training set and a 30% test set. The NB algorithm shows good overall accuracy.
We notice that the length of the text message (number of characters used) is a very
good feature for the classification of spams. Sorting features based on their mutual
information (MI) criteria shows that this feature has the highest MI with target labels.
Additionally, going through the misclassified samples, we notice that text messages
with length below a certain threshold are usually hams, yet because of the tokens
corresponding to the alphabetic words or numeric strings in the message they might be
classified as spams.
By looking at the learning curve, we see that once the NB is trained on features
extracted, the training set error and test set error are close to each other. Therefore, we
do not have a problem of high variance, and gathering more data may not result in
much improvement in the performance of the learning algorithm. As a result, we
should try reducing bias to improve this classifier. This means adding more
meaningful features to the list of tokens can decrease the error rate, and is the option
that is explored next.
• With its simplicity and fast processing time, the proposed algorithm gives
better execution time.
• Both machine learning and deep learning techniques are applied to predict the
class effectively.
• Prediction is accurate.
CHAPTER 2
2.1 DATASET
The dataset used consists of 5572 instances, of which 4825 are labelled as
ham and 747 are labelled as spam, as shown in fig2.1. Each message in the
dataset occupies a single line. The dataset is populated by taking a subset of 3,375
randomly chosen ham messages from the NUS SMS Corpus (NSC), which is a
dataset of about 10,000 legitimate messages collected for research at the Department
of Computer Science at the National University of Singapore. A list of 450 SMS ham
messages is collected from Caroline Tag's PhD Thesis. From SMS Spam Corpus
v.0.1 Big, 1002 ham messages and 322 spam messages are added to the dataset. A
graphical representation of the ham and spam SMS counts is shown in fig2.1.1.
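The class counts quoted above imply a clearly imbalanced dataset. The short calculation below, which uses only the figures stated in this section, shows the share of spam and the accuracy a trivial classifier would reach by labelling every message as ham. This baseline is added here as a point of comparison and is an assumption of this sketch, not a result reported in the project.

```python
# Counts taken from the dataset description in this section.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Fraction of messages that are spam.
spam_share = spam_count / total

# Accuracy of a classifier that always predicts the majority class (ham).
majority_baseline = ham_count / total

print(f"total={total}, spam share={spam_share:.3f}, "
      f"always-ham accuracy={majority_baseline:.3f}")
```

Since roughly 87% of the messages are ham, accuracy figures are easier to interpret alongside precision and recall for the spam class.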
2.2 LIBRARIES USED
The models are executed in the Jupyter Notebook platform by importing all the necessary
packages. The important libraries required are:
Pandas
NumPy
Matplotlib
Seaborn
NLTK
Pandas is used for analysing data and performing operations such as reading,
cleaning and pre-processing the dataset. NumPy provides an n-dimensional array and helps
in working with arrays, linear algebra, matrices and Fourier transforms. Matplotlib and
Seaborn are used mainly for graphical representation of the data. The NLTK library is mainly
used when the dataset contains categorical, text and string values.
Fig2.2.1. Frequency of ham and spam SMS
2.3 DESIGN
The dataset containing ham and spam SMS is given as input and pre-processed by
dropping the null columns and renaming the columns v1 and v2 to
label and SMS. It then undergoes feature extraction using CountVectorizer(). After
that, the dataset is split into train and test data and the classification algorithms are applied
to it. Each model is trained on the train data and tested on the test data. The
whole process of classifying SMS as ham or spam is represented in pictorial form
in Fig2.3.1.
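To make the feature-extraction step concrete, the sketch below builds a bag-of-words count vector in plain Python. It imitates the idea behind CountVectorizer() but is not scikit-learn's implementation: the tokenizer, the helper names and the decision to keep single-character tokens are simplifying assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_vocabulary(messages):
    """Map every distinct token in the training messages to a column index."""
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(message, vocabulary):
    """Count how often each vocabulary token occurs in one message."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(message)).items():
        if tok in vocabulary:  # tokens unseen during fitting are ignored
            vector[vocabulary[tok]] = count
    return vector
```

The vocabulary is fitted on the training messages only, so that words appearing solely in the test data do not influence the feature space.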
A network acts as an interface between sender and receiver. The message from the sender
is encrypted and sent to the network with some confidential details. The message then
reaches the receiver after the user is confirmed and the message is decrypted. After the
message is received, it undergoes a spam check that detects whether the SMS is ham or
spam. If it is spam, the message is moved to the spam folder. This is mainly done in email
services and in some phones. A pictorial representation is shown in fig2.3.2.
The hardware requirements may serve as the basis for a contract for the
implementation of the system and should therefore be a complete and consistent
specification of the whole system. They are used by software engineers as the starting
point for the system design. It shows what the system does and not how it should be
implemented.
PROCESSOR : Intel I5
RAM : 4GB
HARD DISK : 40 GB
cost, planning team activities, performing tasks and tracking the team’s
progress throughout the development activity.
The public dataset of labelled SMS messages is obtained from the UCI Machine Learning
Repository. The dataset considered in the current research is available on Kaggle, a
machine learning repository. This study finds that there are only 5,574 labelled
messages in the dataset, of which 4827 are ham messages and the
other 747 are spam messages. The dataset consists of two
named columns, the message label (ham or spam) followed by the text of the
message, and three unnamed columns.
It’s time for a data analyst to pick up the baton and lead the way to machine learning
implementation. The job of a data analyst is to find ways and sources of collecting
relevant and comprehensive data, interpreting it, and analyzing results with the help of
statistical techniques.
The type of data depends on what you want to predict.
There is no exact answer to the question “How much data is needed?” because each
machine learning problem is unique. In turn, the number of attributes data scientists
will use when building a predictive model depends on the attributes’ predictive value.
“The more, the better” is a reasonable approach for this phase. Some data scientists suggest
considering that less than one-third of collected data may be useful. It’s difficult to
estimate which part of the data will provide the most accurate results until the model
training begins. That’s why it’s important to collect and store all data: internal and
open, structured and unstructured.
The purpose of preprocessing is to convert raw data into a form that fits machine
learning. Structured and clean data allows a data scientist to get more precise results
from an applied machine learning model. The technique includes data formatting,
cleaning, and sampling.
2.5.2 DATA FORMATTING
The importance of data formatting grows when data is acquired from various sources
by different people. The first task for a data scientist is to standardize record formats.
A specialist checks whether variables representing each attribute are recorded in the
same way. Titles of products and services, prices, date formats, and addresses are
examples of variables. The principle of data consistency also applies to attributes
represented by numeric ranges.
This set of procedures allows for removing noise and fixing inconsistencies in data. A
data scientist can fill in missing data using imputation techniques, e.g. substituting
missing values with mean attributes. A specialist also detects outliers — observations
that deviate significantly from the rest of distribution. If an outlier indicates erroneous
data, a data scientist deletes or corrects them if possible. This stage also includes
removing incomplete and useless data objects.
Big datasets require more time and computational power for analysis. If a dataset is too
large, applying data sampling is the way to go. A data scientist uses this technique to
select a smaller but representative data sample to build and run models much faster,
and at the same time to produce accurate outcomes.
Pre-processing is the first stage, in which the unstructured data is converted into more
structured data, since keywords in SMS text messages are prone to be replaced by
symbols. In this study, a stop word remover for the English language has been
applied to eliminate the stop words in the SMS text messages.
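The pre-processing described above can be sketched as lowercasing, stripping symbols and removing stop words. The stop word list below is a small hand-picked subset used only for illustration (the project's actual list is not given here), and the function name is an assumption.

```python
import re

# Illustrative subset of English stop words; a real list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of",
              "and", "in", "you", "your", "for"}


def preprocess(message):
    """Lowercase a message, drop punctuation/symbols, remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

Applied to "You are the WINNER of a FREE prize!", this keeps only the tokens winner, free and prize.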
The input given by the user is processed through a number of stages to understand
what the user is trying to say. Natural language processing (NLP) is the ability of a
program to make use of the natural language spoken by a human and comprehend its
meaning. NLP is the study of the computational treatment of natural (human) language.
The development of NLP is challenging because computers are used to receiving
highly structured input, whereas natural language is highly complex and ambiguous,
with different linguistic structures and complex variables. NLP has various stages, as
follows:
Pragmatic analysis: Pragmatic analysis deals with outside-world knowledge,
that is, knowledge external to the documents and queries.
It reinterprets what was described in terms of
what was actually meant, deriving the aspects of language that require real-world
knowledge.
2.5.7 FEATURIZATION
Featurization is a way to change some form of data (text data, graph data, time-series
data…) into a numerical vector.
Featurization is different from feature engineering. Feature engineering
transforms numerical features so that the machine learning models
work well; in feature engineering, the features are already in numerical form. In
featurization, by contrast, the data need not be in the form of a numerical vector.
A machine learning model cannot work with raw text data directly. In the end,
machine learning models work with numerical (categorical, real…) features. So it is
important to change such data into a numerical vector so that we can leverage the
whole power of linear algebra (making the decision boundary between data points) and
statistical tools with other types of data as well.
Feature extraction and selection are important for discriminating between ham and spam in
SMS text messages. For this phase, TF-IDF will be used. TF-IDF is a weighting
method often used in the Vector Space Model, particularly in the information retrieval (IR) domain, including text
mining. It is a statistical method to measure the importance of a word in a document relative to
the whole corpus. The term frequency is calculated in proportion to the number
of occurrences of a word in the document and is usually normalized to a value
between 0 and 1 to eliminate bias towards lengthy documents. To construct
the index of terms in TF-IDF, punctuation is removed and all text is lowercased during
tokenization. The first two letters, TF or term frequency, refer to how
frequently a term occurs in a document. A higher TF therefore suggests that the
term is more significant in the respective document. IDF, or
Inverse Document Frequency, measures how infrequent a word or term is across the
documents. The weighted value is estimated using the whole training dataset. The idea
of IDF is that a word is not a good candidate to represent a document
if it occurs frequently in the whole dataset, as it might be a stop word or a
generic common word. Hence only words that are infrequent across the entire
dataset are relevant for a given document. TF-IDF does not only assess the importance of
words in a document; it also evaluates the importance of words in the document
database or corpus. In this sense, the frequency of a word in the document increases its
weight proportionally, but this is then offset by the word's frequency in the corpus.
This key characteristic of TF-IDF accounts for the fact that some words appear
more often than others in documents in general.
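The weighting described above can be written out directly. The sketch below uses a length-normalized term frequency (so that each TF lies between 0 and 1) and the plain IDF log(N / df); both choices are assumptions made for illustration, and library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant that gives different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF  = occurrences of the term in the document / document length
    IDF = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

With this definition a term that occurs in every document receives a weight of zero, which matches the intuition that stop words carry no discriminating information.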
After cleaning, the data is normalized for training and testing the model. When the
data is split, we train the algorithm on the training data set and keep the test data set
aside. This training process produces the trained model based on the logic and
algorithms and the values of the features in the training data. The aim of
normalization is to bring all the values onto the same scale.
A dataset used for machine learning should be partitioned into three subsets: training,
test, and validation sets.
TRAINING SET:- A data scientist uses a training set to train a model and define its
optimal parameters, that is, the parameters it has to learn from data.
TEST SET:- A test set is needed for an evaluation of the trained model and its
capability for generalization. The latter means a model’s ability to identify patterns in
new unseen data after having been trained on the training data. It’s crucial to use
different subsets for training and testing to avoid model overfitting, which is the
incapacity for generalization mentioned above.
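A simple way to obtain disjoint training and test subsets is to shuffle the samples once and cut the list. The sketch below assumes the 70%/30% proportion mentioned elsewhere in this report; the function name, the default seed and the absence of stratification by class are assumptions.

```python
import random


def train_test_split(samples, test_fraction=0.3, seed=42):
    """Shuffle the samples reproducibly and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed makes the split repeatable, so different classifiers can be compared on exactly the same test messages.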
2.5.9 MODELING
During this stage, a data scientist trains numerous models to define which one of them
provides the most accurate predictions.
Model Testing:
The goal of this step is to develop the simplest model able to formulate a target value
fast and well enough. A data scientist can achieve this goal through model tuning,
that is, the optimization of model parameters to achieve an algorithm’s best
performance.
One of the more efficient methods for model evaluation and tuning is cross-validation.
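Cross-validation can be illustrated by generating the index sets for each fold. The sketch below is a plain k-fold scheme without shuffling; the function name is an assumption, and in practice the data would normally be shuffled (or stratified by class) before folding.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    # Distribute samples as evenly as possible: the first n_samples % k
    # folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size

    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation
```

Each sample is used for validation exactly once and for training in the remaining k - 1 rounds; the k validation scores are then averaged to compare models or parameter settings.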
NAIVE BAYES:
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem (or
Bayes's rule) with strong independence (naive) assumptions. Parameter estimation for
Naïve Bayes models uses maximum likelihood estimation. It takes only one pass
over the training set and is computationally very fast.
BAYES RULE
NB CLASSIFIER
The Naïve Bayes classifier is one of the most effective approaches for learning
to classify text documents. Given a set of classified training samples, an
application can learn from these samples so as to predict the class of unseen
samples.
The features (n1, n2, n3, n4) present in an SMS are assumed to be independent of each
other. Every feature ni (1 ≤ i ≤ 4) takes a binary value showing whether the particular
property occurs in the SMS. The probability that the given SMS, with feature vector N, belongs to a
class ri (r1: spam and r2: ham) is calculated as follows:
\[ P(r_i \mid N) = \frac{P(r_i)\, P(N \mid r_i)}{P(N)} \]
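The formula above can be turned into a small classifier for binary features. The sketch below is a Bernoulli-style Naive Bayes written for illustration only: the function names are assumptions, Laplace (add-one) smoothing is added to avoid zero probabilities, and the denominator P(N) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math


def train_bernoulli_nb(feature_rows, labels):
    """Estimate P(r) and P(n_j = 1 | r) for each class r from binary rows."""
    model = {}
    n_features = len(feature_rows[0])
    for cls in sorted(set(labels)):
        rows = [row for row, lab in zip(feature_rows, labels) if lab == cls]
        prior = len(rows) / len(labels)
        # Laplace smoothing (an assumption of this sketch).
        likelihoods = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[cls] = (prior, likelihoods)
    return model


def predict(model, row):
    """Return the class with the largest log P(r) + sum_j log P(n_j | r)."""
    scores = {}
    for cls, (prior, likelihoods) in model.items():
        log_score = math.log(prior)
        for value, p in zip(row, likelihoods):
            log_score += math.log(p if value else 1.0 - p)
        scores[cls] = log_score
    return max(scores, key=scores.get)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when the number of features is large.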
PERFORMANCE METRICS:
The data was divided into two portions, training data and testing data,
consisting of 70% and 30% of the data respectively. Both algorithms were applied to the
same dataset using Enthought Canopy and the results were obtained.
Prediction accuracy is the main evaluation parameter used in this work.
Accuracy is the overall success rate of the algorithm and is defined from the counts
of correct and incorrect predictions.
CONFUSION MATRIX:
The confusion matrix is the most commonly used evaluation tool in predictive analysis, mainly because
it is very easy to understand and it can be used to compute other essential metrics such
as accuracy, recall and precision. It is an N×N matrix that describes the overall
performance of a model when used on some dataset, where N is the number of class
labels in the classification problem.
Accuracy is the number of predicted true positives and true negatives divided by the
number of all positive and negative samples:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
The True Positive (TP), True Negative (TN), False Negative (FN) and False Positive (FP)
counts predicted by all algorithms are presented in the table.
True positive (TP) indicates that the positive class is predicted as the positive class; it is
the number of positive samples that the model correctly predicted as positive.
False negative (FN) indicates that the positive class is predicted as the negative class; it is
the number of positive samples that the model wrongly predicted as negative.
False positive (FP) indicates that the negative class is predicted as the positive class; it is
the number of negative samples that the model wrongly predicted as positive. True
negative (TN) indicates that the negative class is predicted as the negative class; it is the
number of negative samples that the model correctly predicted as negative.
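The four counts defined above, and the metrics derived from them, can be computed as in the following sketch. Treating "spam" as the positive class and the function names are assumptions made for illustration; the labels in the usage note are made-up toy values, not results from this project.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # positive predicted as positive
            else:
                fp += 1  # negative predicted as positive
        else:
            if actual == positive:
                fn += 1  # positive predicted as negative
            else:
                tn += 1  # negative predicted as negative
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For six toy messages with true labels spam, spam, ham, ham, ham, spam and predictions spam, ham, ham, spam, ham, spam, the counts are TP = 2, FP = 1, FN = 1 and TN = 2.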
System design is the process of defining the interfaces, modules
and data of a system so that it satisfies the specified requirements. System design can be seen as the
application of systems theory. The main aim of designing a system is to develop
the system architecture by providing the data and information necessary for the
implementation of the system.
2.6.1 ARCHITECTURE DIAGRAM:
2.6.2 DATA FLOW DIAGRAM:
Data flow diagrams (DFDs) are used to graphically represent the flow of data in a
business information system. A DFD describes the processes involved in a system to
transfer data from the input to file storage and report generation. Data flow
diagrams can be divided into logical and physical. The logical data flow diagram
describes the flow of data through a system to perform a certain business function.
The physical data flow diagram describes the implementation of the logical data flow.
2.6.3 USECASE DIAGRAM:
Use case diagrams identify the functionalities provided by the use cases, the actors
who interact with the system and the associations between the actors and the
functionalities.
2.6.4 CLASS DIAGRAM:
The class diagram is a static diagram. It represents the static view of an application.
A class diagram is used not only for visualizing, describing and documenting different
aspects of a system but also for constructing executable code of the software
application.
2.6.5 SEQUENCE DIAGRAM:
The sequence diagram of a system shows the interactions between entities arranged in
time order. It depicts the classes and objects involved in a scenario and the sequence
of messages exchanged between them to carry out the functionality of that scenario.
2.6.6 ACTIVITY DIAGRAM:
The activity diagram is effective for modeling the functionality of the system. It
reflects the activities, the types of flows between these activities and the response
of objects to these activities.
2.6.7 STATE FLOW DIAGRAM:
The state chart diagram below describes the flow of control from one state to another,
driven by events, from the creation of an object to its termination.
CHAPTER 3
SOFTWARE SPECIFICATION
3.1 GENERAL
ANACONDA
Anaconda distribution comes with more than 1,500 packages as well as the Conda
package and virtual environment manager. It also includes a GUI, Anaconda Navigator,
as a graphical alternative to the Command Line Interface (CLI).
The big difference between Conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science
and the reason Conda exists. Pip installs all Python package dependencies required,
whether or not those conflict with other packages you installed previously.
So your working installation of, for example, Google Tensorflow, can suddenly stop
working when you pip install a different package that needs a different version of the
Numpy library. More insidiously, everything might still appear to work but now you
get different results from your data science, or you are unable to reproduce the same
results elsewhere because you didn't pip install in the same order.
Conda analyzes your current environment, everything you have installed, any version
limitations you specify (e.g. you only want tensorflow >= 2.0) and figures out how to
install compatible dependencies. Or it will tell you that what you want can't be done.
Pip, by contrast, will just install the thing you wanted and any dependencies, even if
that breaks other things.
Open source packages can be individually installed from the Anaconda repository,
Anaconda Cloud (anaconda.org), or your own private repository or mirror, using the
conda install command. Anaconda Inc compiles and builds all the packages in the
Anaconda repository itself, and provides binaries for Windows 32/64 bit, Linux 64 bit
and MacOS 64-bit. You can also install anything on PyPI into a Conda environment
using pip, and Conda knows what it has installed and what pip has installed.
Custom packages can be made using the conda build command, and can be shared with
others by uploading them to Anaconda Cloud, PyPI or other repositories.
The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes
Python 3.7. However, you can create new environments that include any version of
Python packaged with conda.
The following applications are available by default in Navigator:
• JupyterLab
• Jupyter Notebook
• Console
• Spyder
• Glueviz
• Orange
• RStudio
Microsoft .NET is a set of Microsoft software technologies for rapidly building and
integrating XML Web services, Microsoft Windows-based applications, and Web
solutions. The .NET Framework is a language-neutral platform for writing programs
that can easily and securely interoperate. There's no language barrier with .NET: there
are numerous languages available to the developer including Managed C++, C#,
Visual Basic and Java Script. The .NET framework provides the foundation for
components to interact seamlessly, whether locally or remotely on different platforms.
It standardizes common data types and communications protocols so that components
created in different languages can easily interoperate.
“.NET” is also the collective name given to various software components built upon
the .NET platform. These will be both products (Visual Studio.NET and
Windows.NET Server, for instance) and services (like Passport, .NET My Services,
and so on).
Python is a powerful multi-purpose programming language created by Guido van
Rossum. It has a simple, easy-to-use syntax, making it well suited to someone
trying to learn computer programming for the first time. Python's features are:
• Easy to code
• Free and Open Source
• Object-Oriented Language
• GUI Programming Support
• High-Level Language
• Extensible feature
• Python is Portable language
• Python is Integrated language
• Interpreted
• Large Standard Library
• Dynamically Typed Language
3.2 PYTHON:
1. Easy to code:
Python is a high-level programming language. It is easy to learn compared with
languages such as C, C#, JavaScript and Java. It is easy to write code in Python, and
the basics can be learned in a few hours or days. It is also a developer-friendly
language.
3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python supports
object-oriented concepts such as classes, objects and encapsulation.
5. High-Level Language:
Python is a high-level language. When we write programs in Python, we do not need to
remember the system architecture, nor do we need to manage memory.
6. Extensible feature:
Python is an extensible language. Parts of a Python program can be written in C or
C++ and compiled in that language.
7. Portable Language:
Python is also a portable language. For example, if we have Python code written for
Windows and want to run it on another platform such as Linux, Unix or Mac, we do not
need to change it; we can run this code on any platform.
8. Interpreted Language:
Python is an interpreted language because Python code is executed line by line.
Unlike languages such as C, C++ and Java, there is no separate step to compile Python
code, which makes it easier to debug. The source code of Python is converted into an
intermediate form called bytecode.
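The conversion to bytecode can be observed with the standard-library dis module. The sketch below is only an illustration; the function is_long_message and its 160-character limit are hypothetical and are not part of this project.

```python
import dis


def is_long_message(text, limit=160):
    """Return True when the text exceeds the given character limit."""
    return len(text) > limit


# The compiled bytecode is stored on the function's code object as bytes.
print(type(is_long_message.__code__.co_code))

# dis.dis prints a human-readable listing of the bytecode instructions.
dis.dis(is_long_message)
```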
WEB APPLICATIONS
• You can create scalable Web Apps using frameworks and CMS (Content
Management System) that are built on Python. Some of the popular platforms
for creating Web Apps are: Django, Flask, Pyramid, Plone, Django CMS.
• Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
• There are numerous libraries available in Python for scientific and numeric
computing. Libraries such as SciPy and NumPy are used in general-purpose
computing, and there are specialized libraries such as EarthPy for earth
science and AstroPy for astronomy.
• The language is also heavily used in machine learning, data mining and deep
learning.
• Python is slow compared to compiled languages like C++ and Java. It might
not be a good choice if resources are limited and efficiency is a must.
• However, Python is a great language for creating prototypes. For example, you
can use Pygame (a library for creating games) to create your game's prototype
first. If you like the prototype, you can use a language like C++ to create the
actual game.
CHAPTER 4
IMPLEMENTATION
First, the table is pre-processed by dropping all the null columns and renaming the
column headings. The label of each SMS is encoded as a binary value, where ‘0’
indicates a ‘ham’ SMS and ‘1’ indicates a ‘spam’ SMS. The length of every SMS in
characters is calculated and added as a new column to the existing table, as shown in
Fig 4.1. The dataset is now ready for feature extraction.
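The pre-processing steps described above might look roughly like the following pandas sketch. It is an illustration under stated assumptions, not the project's own code: the three sample rows and the raw column names "v1", "v2" and "Unnamed: 2" are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the raw SMS table; the column names
# are assumptions about the layout of the raw file.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at 5", "WIN a free prize now", "Ok"],
    "Unnamed: 2": [None, None, None],
})

# Drop the columns that contain only null values.
df = raw.dropna(axis=1, how="all")

# Rename the headings to descriptive names.
df = df.rename(columns={"v1": "label", "v2": "message"})

# Encode the label as a binary value: 0 = ham, 1 = spam.
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

# Add the length of each SMS in characters as a new column.
df["length"] = df["message"].str.len()

print(df)
```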
Here the text messages are converted to a matrix of token counts. This is done by
importing CountVectorizer from the sklearn.feature_extraction.text module. After
this, the dataset is split into X_train, y_train and X_test, y_test data by importing
train_test_split from the sklearn.model_selection module. The train data holds 75% of
the dataset and the remaining 25% forms the test data. X_train then undergoes feature
extraction by applying the fit_transform function of CountVectorizer. The
fit_transform function takes the raw data, learns its vocabulary and returns a
document-term matrix. In the same way, X_test undergoes feature extraction by
applying the transform function of CountVectorizer, which converts the data to a
document-term matrix using the vocabulary already learned. At this point, the SMS in
the dataset have been transformed into a matrix of token counts. The train dataset is
used for training the model and the test data is used for testing it. Train and test
data are shown in pictorial format in the figure below. Machine learning algorithms
are now applied to the datasets to test their performance.
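A minimal sketch of the split and feature-extraction steps is given here for illustration. The eight sample messages and the random_state value are assumptions; only the 75/25 split and the use of fit_transform on the train data and transform on the test data follow the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical messages and labels (0 = ham, 1 = spam).
messages = [
    "free prize call now", "see you at lunch", "win cash prize now",
    "are you coming home", "free entry win cash", "call me when free",
    "urgent claim your prize", "lunch at home today",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 75% of the data for training and 25% for testing.
# random_state=1 is an assumed value, used only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=1
)

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training messages only
# and returns their document-term matrix.
X_train_counts = vectorizer.fit_transform(X_train)
# transform reuses that vocabulary for the test messages.
X_test_counts = vectorizer.transform(X_test)

print(X_train_counts.shape, X_test_counts.shape)
```

Fitting the vectorizer on the train data only keeps words that occur solely in the test messages out of the learned vocabulary.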
This is the main part of the project: training and testing the model and checking how
it works. Classification algorithms belong to supervised machine learning: they learn
from the train data and try to predict the category of the data in the testing part.
Here I’m using five algorithms on the dataset.
K-nearest neighbor
Multinomial Naive bayes
Support Vector Machine
Decision tree
Random forest
After applying these algorithms, their performance is tested based on accuracy,
which is calculated by generating a confusion matrix.
Accuracy, precision score, f1 score and recall score are calculated using the
confusion matrix and are used to check how the model has classified the data in the
test dataset during testing. True positive and true negative values are the data which
are correctly predicted and classified. A false positive means that negative data is
classified as positive, and a false negative means that positive data is classified
as negative.
Here a prediction is made using the n-neighbors value: the training data points
closest to a new data point are found and used to classify it. This algorithm stores
the training dataset and defers computation until testing time, when it uses the
stored train data to classify the test data. A data point is categorized using
Euclidean distance as the metric, by finding the training points at minimum distance
from it in the data space [2].
Fig4.3.1 KNN algorithm
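The idea can be sketched with scikit-learn's KNeighborsClassifier. This is an illustrative example rather than the project's code: the sample messages are hypothetical and n_neighbors=3 is an assumed value, since the report does not state the value used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training messages (1 = spam, 0 = ham).
train_messages = [
    "free prize call now", "win cash prize now", "free entry win cash",
    "see you at lunch", "are you coming home", "lunch at home today",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# n_neighbors=3 is an assumption. The default metric (Minkowski with p=2)
# is the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, train_labels)  # stores the training data

# Each new message is labelled by a vote among its 3 nearest neighbours.
X_new = vectorizer.transform(["win a free prize", "coming home for lunch"])
predictions = knn.predict(X_new)
print(predictions)
```

In this toy example the first message shares words only with the spam messages and the second only with the ham messages, so they are classified as spam (1) and ham (0) respectively.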
4.3.2 Multinomial Naive Bayes
Multinomial naive Bayes is a variant of the naive Bayes algorithm, which is based on
Bayes' theorem. Here each value in the dataset is treated as an independent
entity [2].
This algorithm calculates the posterior probability using the formula
P(a|b) = P(b|a) P(a) / P(b)
where P(a|b) is the posterior probability of class a given predictor b, P(b|a) is the
likelihood of the predictor given the class, P(a) denotes the class prior probability
and P(b) denotes the predictor prior probability.
Using 10-fold cross-validation and naive Bayes with the multinomial event model and
Laplace smoothing on the dataset, the overall error is 1.12 percent, the spam caught
(SC) rate is 94.5 percent and blocked hams (BH) make up 0.51 percent of the total [3].
Using data priors and applying Bayesian naive Bayes for the same event model
reduces SC (93.7%) and BH (0.44%) by a small amount, but the overall error remains
the same. This is to be expected, given the Bayesian model, which helps most when
there is a lot of variation.
These operations are carried out automatically by predefined functions. First,
MultinomialNB is imported from the sklearn.naive_bayes module. I’ve assigned ‘1.0’ to
alpha and set fit_prior to ‘True’. An instance of this class is assigned to a
variable, which is used to fit the train data and y_train data using the fit()
function, after which prediction is done over the test dataset using the predict()
function. The code snippet for fitting the train data and predicting the test data is
shown in fig8.
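For readers without access to the figure, a sketch along the same lines is shown here. The alpha=1.0 and fit_prior=True settings follow the text; the sample messages and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages (1 = spam, 0 = ham).
train_messages = [
    "free prize call now", "win cash prize now", "free entry win cash",
    "see you at lunch", "are you coming home", "lunch at home today",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# alpha=1.0 applies Laplace smoothing; fit_prior=True learns the class
# prior probabilities from the training labels.
nb = MultinomialNB(alpha=1.0, fit_prior=True)
nb.fit(X_train, train_labels)

X_new = vectorizer.transform(["win a free prize", "coming home for lunch"])
predictions = nb.predict(X_new)
print(predictions)
```

With smoothing, a word that never occurred in one class still receives a small non-zero probability instead of forcing the whole product to zero.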
Support Vector Machine, called SVM for short, is an algorithm that can perform
both classification and regression. SVM mainly uses kernels, which are mathematical
functions that expand the dimension of the space in which the data points are plotted
and classified using a hyper-plane [2].
The hyper-plane is placed at a point where the classes are separated correctly. There
are many types of kernels in SVM, and I’ve used ‘linear’ as the kernel type for
classifying ham and spam SMS. For training and testing on the dataset, SVC is
imported from the sklearn.svm module and an instance of it is assigned to a variable.
The train data and y_train data are fitted using fit(), and prediction is made over
the test data using the predict() function. The code snippet of this algorithm is
shown below in fig9.
With the polynomial kernel, increasing the degree of the polynomial from two to three
is reported to increase the error rate, and raising the degree further does not
change it. Another kernel that can be used is the radial basis function (RBF) [3].
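The linear-kernel classifier can be sketched as follows. This is an illustration on hypothetical sample messages, not the project's code; only the use of SVC with kernel="linear", fit() and predict() follows the text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical training messages (1 = spam, 0 = ham).
train_messages = [
    "free prize call now", "win cash prize now", "free entry win cash",
    "see you at lunch", "are you coming home", "lunch at home today",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# kernel="linear" separates the classes with a hyper-plane in the
# original token-count space.
svm = SVC(kernel="linear")
svm.fit(X_train, train_labels)

X_new = vectorizer.transform(["win a free prize", "coming home for lunch"])
predictions = svm.predict(X_new)
print(predictions)
```

Other kernels can be tried by changing the argument, for example SVC(kernel="poly", degree=3) or SVC(kernel="rbf").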
This algorithm reduces entropy by breaking the dataset into smaller subsets and
developing a decision tree. The tree is easy to explain and understand since it can
be visualized as a sequence of Yes or No questions. To apply this algorithm to the
dataset, DecisionTreeClassifier is imported from the sklearn.tree module. I’ve used
‘entropy’ as the criterion with random_state set to 1. An instance of this class is
assigned to a variable, which is used with the fit() function to fit the train data
and y_train data, and with the predict() function to classify the test data. The code
snippet showing how this algorithm is applied is shown in fig10 below.
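A sketch of these steps is shown here for illustration. The criterion and random_state values follow the text; the sample messages and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training messages (1 = spam, 0 = ham).
train_messages = [
    "free prize call now", "win cash prize now", "free entry win cash",
    "see you at lunch", "are you coming home", "lunch at home today",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# Each split is chosen to reduce entropy; random_state=1 makes tie-breaking
# between equally good splits repeatable.
tree = DecisionTreeClassifier(criterion="entropy", random_state=1)
tree.fit(X_train, train_labels)

# An unpruned tree keeps splitting until every training message is
# classified correctly.
train_predictions = tree.predict(X_train)
print(train_predictions)
print(tree.get_depth())

X_new = vectorizer.transform(["win a free prize", "coming home for lunch"])
print(tree.predict(X_new))
```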
The random forest algorithm is related to the decision tree and is likewise used for
classification and regression. Just as a forest contains many trees, this algorithm
is an ensemble of many decision trees whose predictions are merged for better
accuracy. It also introduces additional randomness in the selection of features from
the dataset used to form the trees. Random forest's performance is similar to, and in
some cases better than, that of a decision tree. I’ve used ‘entropy’ as the criterion
and set random_state to 1. To use this algorithm, RandomForestClassifier is imported
from the sklearn.ensemble module and an instance of it is assigned to a variable,
which is used with the fit() and predict() functions. The train and y_train datasets
are fitted using the fit() function, and the predict() function is applied to the test
data to test its classification. The code snippet that shows the working of the
algorithm is given below.
The random forest implementation in the scikit-learn Python library is used in this
work, which averages the probabilistic predictions. For this procedure, two different
numbers of estimators are simulated. With ten estimators, the overall error is 2.16
percent, SC is 87.7 percent and BH is just 0.73 percent. With 100 estimators, the
overall error is 1.41 percent, SC is 92.2 percent and BH is 0.51 percent [3].
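A sketch of the random forest classifier with ten and with 100 estimators is given here for illustration. The criterion and random_state values follow the text; the sample messages are hypothetical, so the printed values are not the figures reported above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training messages (1 = spam, 0 = ham).
train_messages = [
    "free prize call now", "win cash prize now", "free entry win cash",
    "see you at lunch", "are you coming home", "lunch at home today",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)
X_new = vectorizer.transform(["win a free prize", "coming home for lunch"])

for n_estimators in (10, 100):
    forest = RandomForestClassifier(
        n_estimators=n_estimators, criterion="entropy", random_state=1
    )
    forest.fit(X_train, train_labels)
    # predict_proba averages the class probabilities predicted by the trees.
    probabilities = forest.predict_proba(X_new)
    print(n_estimators, forest.predict(X_new), probabilities.round(2))
```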
Now that the models have been trained and tested on the dataset with the required
algorithms, their performance is checked in the classification report, which contains
the accuracy score, precision score, f1 score and recall score. This part is explained
in detail in the next section.
4.4 SCREENSHOTS
CHAPTER 5
5.1 TESTING
The performance of all the algorithms is tested by generating a confusion matrix and
a classification report for each of them. The needed functions, such as
accuracy_score, precision_score, recall_score, f1_score, confusion_matrix and
classification_report, are imported from the sklearn.metrics module. The values are
calculated using data from the confusion matrix and the formulae shown in Fig 5.1.
Fig 5.1. Formulae used for calculating accuracy, precision, recall, f1-score
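For reference, the standard definitions of these four measures in terms of the confusion matrix counts TP, TN, FP and FN are reproduced below. They are the usual textbook formulae and are assumed, not confirmed, to match those in the figure.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```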
The code snippet importing modules and generating confusion matrix is shown in
fig5.1.1 below.
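The following sketch shows how these functions from sklearn.metrics can be used together. The y_test and y_pred values are hypothetical and stand in for the real test labels and model predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = spam, 0 = ham).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes and columns are predicted classes, so for the
# labels [0, 1] the matrix is laid out as [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(acc, prec, rec, f1)

# The classification report lists precision, recall and f1-score per class.
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
```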
5.2 RESULTS
I’ve displayed all the values generated from the confusion matrix and classification
report in tabular format in Table1. All the algorithms are combined with the
CountVectorizer() function to give the required values. A comparison of the
accuracies is represented as a graph in fig13.
[Figure: bar chart of accuracy for each algorithm (K-nearest neighbor, Multinomial
Naive Bayes, SVM, Decision tree, Random forest)]
All the values fall in the range between 0 and 1. From the accuracy values, the
Multinomial Naive Bayes, SVC and Random Forest algorithms each reach an accuracy of
about 98%. On closer inspection, SVM performs best with 98.7%, which rounds to 99%.
5.3 VERIFICATION
This model also performs better than the model proposed by Nilam Nur Amir Sjarif et
al[1], which uses term frequency-inverse document frequency (TF-IDF) for feature
extraction. In their model, random forest gave an accuracy of 97.5%. These results
suggest that CountVectorizer performs better than the TF-IDF vectorizer on this task.
Both these vectorizers are used on datasets having text and strings as values.
CHAPTER 6
CONCLUSION
6.1 CONCLUSION
Spam detection is critical for secure message and e-mail communication. The effective
identification of spam is a major concern, and numerous researchers have suggested a
variety of detection methods. These techniques, on the other hand, are incapable of
effectively and reliably detecting spam. To address this issue, I experimented with
different spam detection approaches based on machine learning predictive models.
According to simulation performance, the best classifiers for SMS are multinomial
naive Bayes with Laplace smoothing and SVM with a linear kernel. In the original paper
cited, the best classifier used SVM as the learning algorithm, resulting in a 97.64
percent overall accuracy.
The future scope of this project involves adding more feature parameters. Taking more
parameters into account is expected to improve accuracy. The algorithms can also be
applied to analyzing the contents of public comments and thus determining
patterns/relationships between the customer and the company. The use of traditional
algorithms and data mining techniques can also help predict the corporation's
performance structure as a whole. In the future, we plan to integrate neural networks
with other techniques such as genetic algorithms or fuzzy logic. A genetic algorithm
can be used to identify the optimal network architecture and training parameters.
Fuzzy logic provides the ability to account for some of the uncertainty in the neural
network predictions. Used in conjunction with a neural network, they could improve
SMS spam prediction.
APPLICATION:
REFERENCES
[1] Modupe, A., O. O. Olugbara, and S. O. Ojo. (2014) “Filtering of Mobile Short
Messaging Communication Using Latent Dirichlet Allocation with Social Network
Analysis”, in Transactions on Engineering Technologies: Special Volume of the World
Congress on Engineering 2013, G.-C. Yang, S.-I. Ao, and L. Gelman, Eds. Springer
Science & Business. pp. 671–686.
[4] Aski, A. S., and N. K. Sourati. (2016) “Proposed Efficient Algorithm to Filter
Spam Using Machine Learning Techniques.” Pac. Sci. Rev. Nat. Sci. Eng. 18 (2):
145–149.
[5] Narayan, A., and P. Saxena. (2013) “The Curse of 140 Characters: Evaluating The
Efficacy of SMS Spam Detection on Android.” pp. 33–42.
[7] Mujtaba, D. G., and M. Yasin. (2014) “SMS Spam Detection Using Simple
Message Content Features.” J. Basic Appl. Sci. Res. 4 (4): 5.
[8] Gudkova, D., M. Vergelis, T. Shcherbakova, and N. Demidova. (2017) “Spam and
Phishing in Q3 2017.” Securelist - Kaspersky Lab's Cyberthreat Research and Reports.
Available from: https://securelist.com/spam-and-phishing-in-q3-2017/82901/.
[Accessed: 10th April 2018].
[9] Choudhary, N., and A. K. Jain. (2017) “Towards Filtering of SMS Spam Messages
Using Machine Learning Based Technique”, in Advanced Informatics for Computing
Research 712: 18–30.
[10] Safie, W., N.N.A. Sjarif, N.F.M. Azmi, S.S. Yuhaniz, R.C. Mohd, and S.Y. Yusof.
(2018) “SMS Spam Classification using Vector Space Model and Artificial Neural
Network.” International Journal of Advances in Soft Computing & Its Applications 10
(3): 129–141.
[11] Fawagreh, Khaled, Mohamed Medhat Gaber, and Eyad Elyan. (2014) “Random
Forests: From Early Developments to Recent Advancements, Systems Science &
Control Engineering.” An Open Access Journal 2 (1): 602–609.
[12] Sajedi, H., G. Z. Parast, and F. Akbari. (2016) “SMS Spam Filtering Using
Machine Learning Techniques: A Survey.” Machine Learning, 1 (1): 14.
[13] Q. Xu, E., W. Xiang, Q. Yang, J. Du, and J. Zhong. (2012) “SMS Spam Detection
Using Noncontent Features.” IEEE Intell. Syst. 27 (6): 44–51.
[14] Sethi, G., and V. Bhootna. (2014) SMS Spam Filtering Application Using Android.
[15] Nagwani, N. K. (2017) “A Bi-Level Text Classification Approach for SMS Spam
Filtering and Identifying Priority Messages.” 14 (4): 8.
[16] Delany, S. J., M. Buckley, and D. Greene. (2012) “SMS Spam Filtering: Methods
and Data.” Expert Syst. Appl. 39 (10): 9899–9908.
[17] Chan, P. P. K., C. Yang, D. S. Yeung, and W. W. Y. Ng. (2015) “Spam Filtering
for Short Messages in Adversarial Environment.” Neurocomputing 155: 167–176.
[18] Sethi, P., V. Bhandari, and B. Kohli. (2017) “SMS Spam Detection and
Comparison of Various Machine Learning Algorithms”, in 2017 International
Conference on Computing and Communication Technologies for Smart Nation
(IC3TSN). pp. 28–31.
[19] Warade, S. J., P. A. Tijare, and S. N. Sawalkar. (2014) “An Approach for SMS
Spam Detection.” Int. J. Res. Advent Technol. 2 (2): 4.