1822 B.E Cse Batchno 109
By
PRADEEP K B (38119012)
SCHOOL OF COMPUTING
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
Internal Guide
Dr.L.LAKSHMANAN M.E.,Ph.D.,
Dr.S.VIGNESHWARI M.E.,Ph.D.,
DECLARATION
DATE:
ACKNOWLEDGEMENT
ABSTRACT
The number of people using mobile devices is increasing day by day. SMS (short
message service) is a text message service available in smartphones as well as basic
phones. So, the traffic of SMS has increased drastically, and spam messages have also
increased. Hackers try to send spam messages for their financial or business
benefits, such as market growth, lottery ticket information, credit card information, etc. So,
spam classification deserves special attention. In this paper, we applied various machine
learning and deep learning techniques for SMS spam detection. We used a dataset to
train machine learning and deep learning models such as LSTM and NB. The SMS
spam collection data set is used for testing the method, and the dataset is split into two
categories for training and testing the research. Our experimental results have shown
that our NB model outperforms previous models in spam detection with good accuracy.
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1 GENERAL
1.1.1 THE MACHINE LEARNING SYSTEM
1.1.2 FUNDAMENTAL
1.2 JUPYTER
1.3 MACHINE LEARNING
1.4 CLASSIFICATION TECHNIQUES
1.4.1 NEURAL NETWORKS AND DEEP LEARNING
1.4.2 METHODOLOGIES - GIVEN INPUT AND EXPECTED OUTPUT
1.5 OBJECTIVE AND SCOPE OF THE PROJECT
1.6 EXISTING SYSTEM
1.6.1 DISADVANTAGES OF EXISTING SYSTEM
1.6.2 LITERATURE SURVEY
1.7 PROPOSED SYSTEM
1.7.1 PROPOSED SYSTEM ADVANTAGES
CHAPTER 1
INTRODUCTION
1.2 Jupyter
We will now move on to the task of machine learning itself. In the following sections
we will describe how to use some basic algorithms, and perform regression,
classification, and clustering on some freely available medical datasets concerning
breast cancer and diabetes, and we will also take a look at a DNA microarray dataset.
SciKit-Learn
SciKit-Learn provides a standardised interface to many of the most commonly
used machine learning algorithms, and is the most popular and frequently used machine
learning library for Python. As well as providing many learning algorithms, SciKit-Learn
has a large number of convenience functions for common preprocessing tasks
(for example, normalisation or k-fold cross validation).
SciKit-Learn is a very large software library.
Clustering
Clustering algorithms focus on ordering data together into groups. In general,
clustering algorithms are unsupervised: they require no y response variable as input.
That is to say, they attempt to find groups or clusters within data where you do not know
the label for each sample. SciKit-Learn has many clustering algorithms, but in this
section we will demonstrate hierarchical clustering on a DNA expression microarray
dataset using an algorithm from the SciPy library.
We will plot a visualisation of the clustering using what is known as a dendrogram, also
using the SciPy library.
The goal is to cluster the data properly in logical groups, in this case into the cancer
types represented by each sample's expression data. We do this using agglomerative
hierarchical clustering, using Ward's linkage method:
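A minimal sketch of this step, assuming the expression data is already in a NumPy array X (the synthetic array below merely stands in for the real microarray data):

# Sketch: agglomerative hierarchical clustering with Ward's linkage,
# visualised as a dendrogram. X stands in for a (samples x features)
# expression matrix; here it is random placeholder data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(20, 100)      # placeholder for the real expression matrix
Z = linkage(X, method='ward')    # compute the cluster hierarchy
dendrogram(Z)                    # plot the tree of merges
plt.show()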
1.4 Classification
Previously we analysed data that was unlabelled: we did not know to what class a sample
belonged (known as unsupervised learning). In contrast to this, a supervised problem
deals with labelled data, where we are aware of the discrete classes to which each sample
belongs. When we wish to predict which class a sample belongs to, we call this a
classification problem. SciKit-Learn has a number of algorithms for classification; in this
section we will look at the Support Vector Machine.
We will work on the Wisconsin breast cancer dataset, split it into a training set and a
test set, train a Support Vector Machine with a linear kernel, and test the trained model
on an unseen dataset. The Support Vector Machine model should be able to predict if a
new sample is malignant or benign based on the features of a new, unseen sample:
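A minimal sketch of this workflow, using SciKit-Learn's bundled copy of the Wisconsin dataset (the 70/30 split here is an illustrative choice):

# Sketch: linear-kernel SVM trained on one split of the Wisconsin
# breast cancer data and evaluated on the held-out split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)        # learn from the training set only
y_pred = svm.predict(X_test)     # classify unseen samples
print(classification_report(y_test, y_pred))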
You will notice that the SVM model performed very well at predicting the malignancy of
new, unseen samples from the test set. This can be quantified nicely by printing a
number of metrics using the classification report function. Here, the precision, recall,
and F1 score (F1 = 2 · (precision · recall) / (precision + recall)) for each class are shown.
The support column is a count of the number of samples for each class.
Support Vector Machines are a very powerful tool for classification. They work well in
high dimensional spaces, even when the number of features is higher than the number
of samples. However, their running time is quadratic in the number of samples, so large
datasets can become difficult to train. Quadratic means that if you increase a dataset in
size by 10 times, it will take 100 times longer to train.
Last, you will notice that the breast cancer dataset consisted of 30 features. This makes
it difficult to visualize or plot the data. To aid in visualization of highly dimensional data,
we can apply a technique called dimensionality reduction.
Dimensionality Reduction
Another important method in machine learning, and data science in general, is
dimensionality reduction. For this example, we will look at the Wisconsin breast cancer
dataset once again. The dataset consists of over 500 samples, where each sample has
30 features. The features relate to images of a fine needle aspirate of breast tissue, and
the features describe the characteristics of the cells present in the images. All features
are real values. The target variable is a discrete value (either malignant or benign) and
is therefore a classification dataset.
You will recall from the Iris example in Sect. 7.3 that we plotted a scatter matrix of the
data, where each feature was plotted against every other feature in the dataset to look
for potential correlations (Fig. 3). By examining this plot you could probably find features
which would separate the dataset into groups. Because the dataset only had 4 features
we were able to plot each feature against each other relatively easily. However, as the
number of features grows, this becomes less and less feasible, especially if you
consider the gene expression example in Sect. 9.4 which had over 6000 features.
One method that is used to handle data that is highly dimensional is Principal
Component Analysis, or PCA. PCA is an unsupervised algorithm for reducing the
number of dimensions of a dataset. For example, for plotting purposes you might want
to reduce your data down to 2 or 3 dimensions, and PCA allows
you to do this by generating components, which are combinations of the original
features, that you can then use to plot your data.
PCA is an unsupervised algorithm. You supply it with your data, X, and you specify the
number of components you wish to reduce its dimensionality to. This is known as
transforming the data:
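As a sketch, reducing the 30-feature dataset to two components for plotting might look like this:

# Sketch: project the 30-dimensional breast cancer data onto its two
# principal components so that it can be drawn on a plane.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
pca = PCA(n_components=2)        # target dimensionality
X_2d = pca.fit_transform(X)      # this is the "transform" step
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.show()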
Again, you would not use this model for new data. In a real world scenario, you would,
for example, perform a 10-fold cross validation on the dataset, choosing the model
parameters that perform best on the cross validation. This model would be much more
likely to perform well on new data. At the very least, you would randomly select a
subset, say 30% of the data, as a test set and train the model on the remaining 70% of
the dataset. You would evaluate the model based on the score on the test set and not
on the training set.
1.4.1 NEURAL NETWORKS AND DEEP LEARNING
While a proper description of neural networks and deep learning is far beyond the scope
of this chapter, we will however discuss an example use case of one of the most
popular frameworks for deep learning: Keras.
In this section we will use Keras to build a simple neural network to classify
the Wisconsin breast cancer dataset that was described earlier. Often, deep learning
algorithms and neural networks are used to classify images; convolutional neural
networks are especially used for image related classification. However,
they can of course be used for text or tabular-based data as well. In this section we will
build a standard feed-forward, densely connected neural network and classify a text-based
cancer dataset in order to demonstrate the framework's usage.
In this example we are once again using the Wisconsin breast cancer dataset, which
consists of 30 features and 569 individual samples. To make it more challenging for the
neural network, we will use a training set consisting of only 50% of the entire dataset,
and test our neural network on the remaining 50% of the data.
Note: Keras is not installed as part of the Anaconda distribution; to install it, use pip.
Keras also relies on a backend such as Theano or TensorFlow, and these have
dependencies that must be installed first. Refer to the Theano and TensorFlow
documentation for more information [12].
Keras is a modular API. It allows you to create neural networks by building a stack of
modules, from the input of the neural network, to the output of the neural network, piece
by piece until you have a complete network. Also, Keras can be configured to use your
Graphics Processing Unit, or GPU. This makes training neural networks far faster than if
we were to use a CPU. We begin by importing Keras:
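A minimal sketch of such a network follows; the layer sizes and activations are illustrative choices, not the report's tuned configuration:

# Sketch: small feed-forward network for the 30-feature dataset.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(30,)))  # hidden layer 1
model.add(Dense(8, activation='relu'))                      # hidden layer 2
model.add(Dense(1, activation='sigmoid'))                   # malignant/benign output
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])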
We may want to view the network's accuracy on the test set (or its loss on the training set)
over time (measured at each epoch), to get a better idea of how well it is learning. An
epoch is one complete cycle through the training data.
Fortunately, this is quite easy to plot, as Keras' fit function returns a history object which
we can use to do exactly this:
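A sketch of that plotting step, assuming the model above and the 50/50 split (note that older Keras versions name these metrics 'acc' and 'val_acc' rather than 'accuracy' and 'val_accuracy'):

# Sketch: fit returns a History object; its .history dict maps each
# metric name to a list of per-epoch values.
import matplotlib.pyplot as plt

history = model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test))
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='test accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()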
This will result in a plot similar to that shown. Often you will also want to plot the loss on
the test set and training set, and the accuracy on the test set and training set.
Plotting the loss and accuracy can be used to see if you are overfitting (you experience
tiny loss on the training set, but large loss on the test set) and to see when your training
has plateaued.
PROBLEM STATEMENT:
Prediction of SMS spam has been an important area of research for a long time. The
goal is to apply different machine learning algorithms to the SMS spam classification
problem, compare their performance to gain insight and further explore the problem,
and design an application based on one of these algorithms that can filter SMS spam
with high accuracy. The current work proposes a gamut of machine learning and deep
learning-based predictive models for accurately detecting SMS spam. The predictive
power of the models is further enhanced by introducing the powerful deep learning-based
long short-term memory (LSTM) network into the predictive framework.
1. The proposed model is based on the study of SMS text data and technical
indicators. The algorithm selects the best free-parameter combination for LSTM to avoid
over-fitting and local minima problems and improve prediction accuracy.
2. Our dataset consists of one large text file in which each line corresponds to a text
message. Therefore, preprocessing of the data, extraction of features, and
tokenization of each message are required. After the feature extraction, an initial
analysis of the data is done using a label encoder, and then models like the naive
Bayes (NB) algorithm and LSTM are used in the next steps for prediction.
3. The two methods used to predict spam messages are fundamental and
technical analyses.
The Random Forest (RF) algorithm will be used for classification of ham or spam during this
phase. RF is an averaging ensemble learning method that can be used for classification
problems. This algorithm combines various decision tree models in order to eliminate the
over-fitting problem in decision trees. In the RF algorithm, each tree is capable of providing
its own prediction results, different from the others. As a result, each tree gives a different
performance, and the average of their performances will be generalized and
calculated. During the training phase, a set of decision trees is constructed before
they can operate on randomly selected features. Regardless, RF can work well with a
large dataset with a variety of feature types, such as binary, categorical and numerical.
The algorithm works as follows: for each tree in the forest, a bootstrap sample
is selected from S, where S(i) represents the ith bootstrap. A decision tree is then learned
using a modified decision-tree learning algorithm.
The algorithm is modified as follows: at each node of the tree, instead of examining all
possible feature-splits, a subset of the features f ⊆ F is selected randomly,
where F is the set of spam features. The node then splits on the best feature in f rather
than F. In practice, f is much, much smaller than F. Deciding on which feature to split is
oftentimes the most computationally expensive aspect of decision tree learning. By
narrowing the set of features, the learning of the tree is sped up drastically.
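A minimal sketch of this classifier with SciKit-Learn, assuming X holds the extracted features and y the ham/spam labels (max_features='sqrt' gives the random feature subset f described above):

# Sketch: Random Forest for ham/spam classification.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=100,      # number of trees in the forest
                            max_features='sqrt')   # random subset f of features per split
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))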
EXISTING TECHNIQUE:
TECHNIQUE DEFINITION:
Accuracy is low.
Dataset selection is not correct.
Feature extraction is not accurate.
Complex structure, not easy to understand.
TITLE: SMS Spam Detection Based on Long Short-Term Memory and Gated Recurrent
Unit
YEAR: 2019
Abstract:
An SMS spam is a message that hackers develop and send to people via mobile
devices, aiming to get their important information. For people who are unaware, if they
follow the instructions in the message and fill in their important information, such as an
internet banking account, in a faked website or application, the hacker may get the
information. This may lead to the loss of their wealth. Efficient spam detection is an
important tool to help people classify whether a message is a spam SMS or not. In this
research, we propose a novel SMS spam detection based on a case study of SMS
spams in the English language using Natural Language Processing and Deep Learning
techniques. To prepare the data for our model development process, we use word
tokenization, padding data, truncating data and word embedding to add more
dimensions to the data. Then, this data is used to develop models based on the Long
Short-Term Memory and Gated Recurrent Unit algorithms. The performance of the
proposed models is compared to models based on machine learning algorithms
including Support Vector Machine and Naïve Bayes. The experimental results show
that the model built from the Long Short-Term Memory technique provides the best
overall accuracy, as high as 98.18%. On accurately screening spam messages, this
model can detect spam messages with a 90.96% accuracy rate, while the error
percentage of misclassifying a normal message as a spam message is only 0.74%.
YEAR: 2019
Abstract:
The daily traffic of Short Message Service (SMS) keeps increasing. As a result, it leads to a
dramatic increase in mobile attacks such as spammers who plague the service with spam
messages sent to groups of recipients. Mobile spam is a growing problem, as the number of
spams keeps increasing day by day even with filtering systems in place. Spams are defined as
unsolicited bulk messages in various forms such as unwanted advertisements, credit
opportunities or fake lottery winner notifications. Spam classification has become more
challenging due to the complexities of the messages imposed by spammers. Hence, various
methods have been developed in order to filter spam. In this study, the methods of term
frequency-inverse document frequency (TF-IDF) and the Random Forest algorithm are applied
to an SMS spam message data collection. Based on the experiment, the Random Forest
algorithm outperforms other algorithms with an accuracy of 97.50%.
TITLE: Spam Detection Approach for Secure Mobile Message Communication Using
Machine Learning Algorithms
YEAR: 2020
Abstract:
Spam detection is a big issue in mobile message communication, due to which
mobile message communication is insecure. In order to tackle this problem, an accurate
and precise method is needed to detect spam in mobile message communication.
We propose the application of a machine learning-based spam detection method
for accurate detection. In this technique, machine learning classifiers such as logistic
regression (LR), K-nearest neighbor (K-NN), and decision tree (DT) are used for
classification of ham and spam messages in mobile device communication. The SMS
spam collection data set is used for testing the method. The dataset is split into two
categories for training and testing the research. The results of the experiments
demonstrated that the classification performance of LR is high as compared with K-NN
and DT, and LR achieved a high accuracy of 99%. Additionally, the proposed
method's performance is good as compared with the existing state-of-the-art methods.
TITLE: SMS Spam Detection using Machine Learning and Deep Learning
Techniques
YEAR: 2021
Abstract:
The number of people using mobile devices is increasing day by day. SMS (short message
service) is a text message service available in smartphones as well as basic phones.
So, the traffic of SMS has increased drastically, and spam messages have also increased.
The spammers try to send spam messages for their financial or business benefits like
market growth, lottery ticket information, credit card information, etc. So, spam
classification has received special attention. In this paper, we applied various machine
learning and deep learning techniques for SMS spam detection. We used a dataset from
UCI and built a spam detection model. Our experimental results have shown that our LSTM
model outperforms previous models in spam detection with an accuracy of 98.5%. We
used Python for all implementations.
YEAR: 2019
Abstract:
Over recent years, as the popularity of mobile phone devices has increased, Short
Message Service (SMS) has grown into a multi-billion-dollar industry. At the same
time, reduction in the cost of messaging services has resulted in growth in unsolicited
commercial advertisements (spams) being sent to mobile phones. In parts of Asia, up to
30% of text messages were spam in 2012. The lack of real databases for SMS spams,
the short length of messages and limited features, and their informal language are
factors that may cause established email filtering algorithms to underperform in their
classification. In this project, a database of real SMS spams from the UCI Machine
Learning repository is used, and after preprocessing and feature extraction, different
machine learning techniques are applied to the database. Finally, the results are
compared and the best algorithm for spam filtering for text messaging is introduced.
Final simulation results using 10-fold cross validation show that the best classifier in this
work reduces the overall error rate of the best model in the original paper citing this
dataset by more than half.
Applying the NB algorithm to the dataset using extracted features with different training set
sizes, the performance in the learning curve is evaluated by splitting the dataset into a 70%
training set and a 30% test set. The NB algorithm shows good overall accuracy.
We notice that the length of the text message (number of characters used) is a very
good feature for the classification of spam. Sorting features based on their mutual
information (MI) criteria shows that this feature has the highest MI with the target labels.
Additionally, going through the misclassified samples, we notice that text messages with
length below a certain threshold are usually ham, yet because of the tokens
corresponding to the alphabetic words or numeric strings in the message they might be
classified as spam.
By looking at the learning curve, we see that once the NB is trained on the extracted
features, the training set error and test set error are close to each other. Therefore, we
do not have a problem of high variance, and gathering more data may not result in
much improvement in the performance of the learning algorithm. As a result, we
should try reducing bias to improve this classifier. This means adding more meaningful
features to the list of tokens can decrease the error rate, and this is the option that is
explored next.
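A sketch of how such a learning curve could be produced with SciKit-Learn, assuming the feature matrix X and labels y are already prepared:

# Sketch: NB learning curve over growing training-set sizes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB

sizes, train_scores, test_scores = learning_curve(
    MultinomialNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
plt.plot(sizes, train_scores.mean(axis=1), label='training score')
plt.plot(sizes, test_scores.mean(axis=1), label='cross-validation score')
plt.xlabel('training set size')
plt.legend()
plt.show()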
Algorithms:
LSTM (Long Short-Term Memory):
A special type of RNN, which can learn long-term dependencies, is called Long Short-Term
Memory (LSTM). LSTM enables an RNN to remember inputs over long periods. It contains
information in memory, similar to computer memory, and is able to read, write and delete
information in its memory. This memory can be seen as a gated cell, where the cell
decides whether to store or delete information. In LSTM, there are three
gates: the input, forget and output gates. These gates determine whether new input
should be allowed in (input gate), data deleted because it is not important (forget gate),
or allowed to affect the output at the current time step (output gate).
LSTMs are a special subset of RNNs that can capture context-specific temporal
dependencies for long periods of time. Each LSTM neuron is a memory cell that can
store other information, i.e., it maintains its own cell state. While neurons in normal
RNNs merely take in their previous hidden state and the current input to output a new
hidden state, an LSTM neuron also takes in its old cell state and outputs its new cell
state.
1. Forget gate: The forget gate determines which parts of the cell state are to be
replaced with more recent information. It outputs values close to 1 for parts of the cell
state to be kept, and close to 0 for values to be ignored.
2. Input gate: Based on the input (e.g., the previous output o(t-1), the input x(t), and the
previous cell state c(t-1)), this gate determines the conditions under which
any information should be stored (or updated) in the cell state.
3. Output gate: Depending on the input and the cell state, this component determines
which information is forwarded to the next location in the network.
Thus, LSTM networks are ideal for exploring how variation in one unwanted SMS can
affect users over a long period of time. They can also decide (in a dynamic fashion)
for how long information about specific past trends in spam SMS needs to be
retained in order to more accurately predict future trends in the variation of spam SMS.
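A minimal sketch of such a model in Keras; the vocabulary size, sequence length and layer widths below are illustrative assumptions, not the tuned values:

# Sketch: LSTM-based spam classifier over padded token sequences.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=32, input_length=100))  # token embeddings
model.add(LSTM(64))                        # 64 LSTM memory cells
model.add(Dense(1, activation='sigmoid'))  # spam vs. ham probability
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])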
Advantages of LSTM:
The main advantage of LSTM is its ability to learn long-term context. Each unit
remembers details for a long or short period without explicitly utilizing the activation
function within the recurrent components. An important fact is that any cell state is
multiplied only by the output of the forget gate, which varies between 0 and 1. That is
to say, the forget gate in the LSTM cell is responsible for both the weighting
and the activation of the cell state. Thus, the data from the previous cell can
pass through the cell unchanged instead of explicitly increasing or decreasing at each
step or layer, and the weights can converge to their appropriate values over a limited
time. This allows LSTM to solve the vanishing gradient problem: because the value
stored in the memory cell is not transformed in a recurrent manner, the gradient does not
vanish when trained with backpropagation.
CHAPTER 2
2.1 INTRODUCTION
Mobile phones have become one of the attached companions for many
individuals. With the explosive penetration of mobile devices and millions of people
sending messages every day, Short Message Service (SMS) has become a multi-
million-dollar commercial industry, with a value between 11.3 and 24.7 percent of the
developing countries' Gross National Income (GNI) in the early part of 2013.
As the utilization of mobile phone devices has become commonplace, Short Message
Service (SMS) has grown into a multi-billion-dollar commercial industry [2]. SMS is a
text communication platform that allows mobile phone users to exchange short text
messages (usually less than 160 seven-bit characters). It is the most widely used data
application, with an estimated 3.5 billion active users, or about 80% of all mobile phone
subscribers, at the end of 2010 [3]. As the popularity of the platform has increased, we
have seen a surge in the number of unsolicited commercial advertisements sent to
mobile phones using text messaging. SMS spam is still not as common as email spam,
where in 2010 around 90% of emails were spam, and in North America it is still not a
major problem, contributing to less than 1% of text messages exchanged as of
December 2012.
Spam has increased these days due to more mobile devices being deployed in the
environment for e-mail and message communication. Currently, 85% of mails and messages
received by mobile users are spam. The cost of mails and messages is very low for
senders but high for recipients of these messages. The cost is sometimes paid by service
providers, and the cost of spam can be measured in the loss of human time and the loss of
important messages or mails. Due to these spam mails and messages, valuable
e-mails and messages are affected, because each user has limited Internet service,
short time, and memory.
The dataset is a large text file, in which each line starts with the label of the message,
followed by the text message string. After preprocessing of the data and extraction of
features, machine learning techniques such as naive Bayes, SVM, and other methods
are applied to the samples, and their performances are compared. Finally, the
performance of the best classifier from the project is compared against the performance of
the classifiers applied in the original paper citing this dataset.
2.3 SYSTEM SPECIFICATION:
The hardware requirements may serve as the basis for a contract for the
implementation of the system and should therefore be a complete and consistent
specification of the whole system. They are used by software engineers as the starting
point for the system design. The specification shows what the system does, not how it
should be implemented.
PROCESSOR : Intel I5
RAM : 4GB
HARD DISK : 40 GB
The requirements document forms the basis for creating the software requirements
specification. It is useful in estimating cost, planning team activities, performing tasks
and tracking the team's progress throughout the development activity.
There is no exact answer to the question "How much data is needed?" because each
machine learning problem is unique. In turn, the number of attributes data scientists will
use when building a predictive model depends on the attributes' predictive value.
'The more, the better' approach is reasonable for this phase. Some data scientists
suggest considering that less than one-third of collected data may be useful. It's difficult
to estimate which part of the data will provide the most accurate results until the model
training begins. That's why it's important to collect and store all data: internal and
open, structured and unstructured.
The purpose of preprocessing is to convert raw data into a form that fits machine
learning. Structured and clean data allows a data scientist to get more precise results from
an applied machine learning model. The technique includes data formatting, cleaning,
and sampling.
Data formatting: The importance of data formatting grows when data is acquired from
various sources by different people. The first task for a data scientist is to standardize
record formats. A specialist checks whether variables representing each attribute are
recorded in the same way. Titles of products and services, prices, date formats, and
addresses are examples of variables. The principle of data consistency also applies to
attributes represented by numeric ranges.
Data cleaning: This set of procedures allows for removing noise and fixing
inconsistencies in data. A data scientist can fill in missing data using imputation
techniques, e.g. substituting missing values with mean attribute values. A specialist also
detects outliers, observations that deviate significantly from the rest of the distribution. If an
outlier indicates erroneous data, a data scientist deletes or corrects it if possible. This
stage also includes removing incomplete and useless data objects.
Data sampling: Big datasets require more time and computational power for analysis. If
a dataset is too large, applying data sampling is the way to go. A data scientist uses this
technique to select a smaller but representative data sample to build and run models
much faster, and at the same time produce accurate outcomes.
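As a sketch, these cleaning and sampling steps might look as follows in pandas; the DataFrame df and the 'length' column are hypothetical stand-ins:

# Sketch: duplicate removal, mean imputation and sampling with pandas.
import pandas as pd

df = df.drop_duplicates()                                # data cleaning
df['length'] = df['length'].fillna(df['length'].mean())  # mean imputation (hypothetical column)
sample = df.sample(frac=0.1, random_state=1)             # 10% representative sample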
Pre-processing is the first stage, in which the unstructured data is converted into more
structured data, since keywords in SMS text messages are prone to be replaced by
symbols. In this study, a stop word list remover for the English language has been
applied to eliminate the stop words in the SMS text messages.
• Syntactic analysis involves analysing words for grammar and putting the words
together in a manner which can show their relationship. This can be done with a data
structure such as a parse tree or syntax tree. The tree is constructed with the rules of
grammar of the language. If the input can be produced using the syntax tree, the input is
found to have correct syntax. For example, the string "I pick that have to" will be
considered incorrect syntax.
• Semantic analysis picks up the dictionary meaning of words and tries to understand
the actual meaning of the sentence. It is the process of mapping syntactic structures
to the actual or text-independent meaning of the words. Strings like "hot winter" will be
disregarded.
Featurization:
Featurization is a way to change some form of data (text data, graph data, time-series
data, etc.) into a numerical vector.
Featurization is different from feature engineering. Feature engineering is just
transforming the numerical features somehow so that the machine learning models
work well; in feature engineering, the features are already in numerical form. In
featurization, the data need not already be in the form of a numerical vector.
A machine learning model cannot work with raw text data directly. In the end,
machine learning models work with numerical (categorical, real, etc.) features. So it is
important to change other types of data into numerical vectors so that we can leverage the
whole power of linear algebra (making the decision boundary between data points) and
statistical tools with other types of data as well.
Feature extraction and selection are important for the discrimination of ham and spam in
SMS text messages. For this phase, TF-IDF will be used. TF-IDF is a weighting
method often used in the Vector Space Model, particularly in the IR domain, including text
mining. It is a statistical method to measure the importance of a word in a document relative
to the whole corpus. The term frequency is simply calculated in proportion to the number
of occurrences of a word in the document, and is usually normalized into the positive
quadrant between 0 and 1 to eliminate bias towards lengthy documents. To construct
the index of terms in TF-IDF, punctuation is removed, and all text is lowercased during
tokenization. The first part, TF or term frequency, treats a term as more important if it occurs
more frequently in a document. Therefore, a higher TF suggests that the term is more
significant in the respective document. Additionally, IDF or Inverse Document Frequency
is calculated based on how infrequent a word or term is across the documents.
The weighted value is estimated using the whole training dataset. The idea of IDF is
that a word is not considered a good candidate to represent the document if it
occurs frequently in the whole dataset, as it might be a stop word or a generic common
word. Hence, only words that are infrequent across the entire dataset are
relevant for a document. TF-IDF does not only assess the importance of words in
the documents, but also evaluates the importance of words in the document database or
corpus. In this sense, the word frequency in the document increases the weight of
words proportionally, but is then offset by the corpus's word frequency. This key
characteristic of TF-IDF accounts for the fact that several words appear more often
than others in documents in general.
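In the standard formulation (individual libraries differ in smoothing and normalization details), the weight of a term t in document d over a corpus of N documents is:

\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the (normalized) frequency of t in d and df(t) is the number of documents containing t.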
Splitting of data
After cleaning the data, the data is normalized for training and testing the model. When
the data is split, we train the algorithm on the training data set and keep the test data set
aside. This training process produces the training model based on the logic and algorithms
and the values of the features in the training data. Basically, the aim of this scaling is to
bring all the values onto the same scale.
A dataset used for machine learning should be partitioned into three subsets: training,
test, and validation sets.
Training set: A data scientist uses a training set to train a model and define its optimal
parameters, the parameters it has to learn from data.
Test set: A test set is needed for an evaluation of the trained model and its capability
for generalization. The latter means a model's ability to identify patterns in new unseen
data after having been trained on training data. It's crucial to use different subsets
for training and testing to avoid model overfitting, which is the incapacity for
generalization we mentioned above.
Modeling:
During this stage, a data scientist trains numerous models to define which one of them
provides the most accurate predictions.
Model Testing:
The goal of this step is to develop the simplest model able to formulate a target value
fast and well enough. A data scientist can achieve this goal through model tuning.
That's the optimization of model parameters to achieve an algorithm's best
performance. One of the more efficient methods for model evaluation and tuning is
cross-validation: a specialist tests models with a set of hyperparameter values and
selects those that received the best cross-validated score. There are various error
metrics for machine learning tasks.
NAIVE BAYES:
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem (or
Bayes' rule) with strong (naive) independence assumptions. Parameter estimation for
Naive Bayes models uses maximum likelihood estimation. It takes only one pass
over the training set and is computationally very fast.
Bayes Rule
A conditional probability is the likelihood of some conclusion, C, given some
evidence/observation, D, where a dependence relationship exists between C and D.
This probability is denoted as P(C|D), where
P(C|D) = P(D|C) P(C) / P(D)
NB Classifier
The Naive Bayes classifier is one of the most effective learning approaches for the
classification of text documents. Given a set of classified training samples, an application
can learn from these samples so as to predict the class of an unseen sample.
The features (n1, n2, n3, n4) which are present in an SMS are independent of each
other. Every feature ni (1 ≤ i ≤ 4) takes a binary value showing whether the particular
property appears in the SMS. The probability that a given message N belongs to a class r
(r1: spam and r2: ham) is calculated as follows:
P(r1|N) = P(r1) P(N|r1) / P(N)
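For illustration only, with made-up numbers: if 30% of messages are spam, and the token "free" appears in 40% of spam and 5% of ham, then

P(spam | "free") = (0.4 × 0.3) / (0.4 × 0.3 + 0.05 × 0.7) = 0.12 / 0.155 ≈ 0.77,

so a message containing "free" would lean towards being classified as spam.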
PERFORMANCE METRICS:
The data was divided into two portions, training data and testing data, consisting of 70%
and 30% of the data respectively. Both algorithms were applied on the same dataset
using Enthought Canopy, and results were obtained.
Prediction accuracy is the main evaluation parameter that we used in this work.
Accuracy can be defined using an equation: it is the overall success rate of the
algorithm.
CONFUSION MATRIX:
It is the most commonly used evaluation metric in predictive analysis, mainly because it
is very easy to understand and it can be used to compute other essential metrics such
as accuracy, recall, precision, etc. It is an NxN matrix that describes the overall
performance of a model when used on some dataset, where N is the number of class
labels in the classification problem.
Accuracy is all predicted true positives and true negatives divided by all positives and
negatives. The True Positive (TP), True Negative (TN), False Negative (FN) and False
Positive (FP) counts predicted by all algorithms are presented in the table.
True positive (TP) indicates that a positive sample is correctly predicted as the positive
class.
False negative (FN) indicates that a positive sample is wrongly predicted as the negative
class.
False positive (FP) indicates that a negative sample is wrongly predicted as the positive
class.
True negative (TN) indicates that a negative sample is correctly predicted as the negative
class.
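Putting these counts together, accuracy is computed as:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}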
Designing a system is the process used to define the interface, modules and data for a
system so as to satisfy specified requirements. System design can be seen as the
application of systems theory. The main aim of designing a system is to develop
the system architecture by giving the data and information that are necessary for the
implementation of the system.
2.5.3 USE CASE DIAGRAM:
Use case diagrams identify the functionalities provided by the use cases, the actors who
interact with the system, and the associations between the actors and the functionalities.
2.5.6 ACTIVITY DIAGRAM:
The activity diagram is effective for modeling the functionality of the system.
Hence this diagram reflects the activities, the types of flows between these activities,
and finally the response of objects to these activities.
2.5.7 STATE FLOW DIAGRAM:
The state chart diagram below describes the flow of control from one state to another
(event) in the flow of events from the creation of an object to its termination.
CHAPTER 3
SOFTWARE SPECIFICATION
3.1 GENERAL
ANACONDA
It is a free and open-source distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.), that aims to simplify package management and
deployment.
The big difference between Conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science
and the reason Conda exists. Pip installs all Python package dependencies required,
whether or not those conflict with other packages you installed previously.
So your working installation of, for example, Google Tensorflow, can suddenly stop
working when you pip install a different package that needs a different version of the
Numpy library. More insidiously, everything might still appear to work but now you get
different results from your data science, or you are unable to reproduce the same
results elsewhere because you didn't pip install in the same order.
Conda analyzes your current environment, everything you have installed, any version
limitations you specify (e.g. you only want tensorflow>= 2.0) and figures out how to
install compatible dependencies. Or it will tell you that what you want can't be done. Pip,
by contrast, will just install the thing you wanted and any dependencies, even if that
breaks other things. Open source packages can be individually installed from the
Anaconda repository, Anaconda Cloud (anaconda.org), or your own private repository
or mirror, using the conda install command. Anaconda Inc compiles and builds all the
packages in the Anaconda repository itself, and provides binaries for Windows 32/64 bit,
Linux 64 bit and MacOS 64-bit. You can also install anything on PyPI into a Conda
environment using pip, and Conda knows what it has installed and what pip has
installed. Custom packages can be made using the conda build command, and can be
shared with others by uploading them to Anaconda Cloud, PyPI or other
repositories. The default installation of Anaconda2 includes Python 2.7 and Anaconda3
includes Python 3.7. However, you can create new environments that include any
version of Python packaged with conda.
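For example, standard conda usage looks like this:

conda install scikit-learn          # install a package and its dependencies
conda create -n py37 python=3.7     # create an environment with a chosen Python
conda activate py37                 # switch into that environment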
JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glueviz
Orange
Rstudio
Visual Studio Code
Microsoft .NET is a set of Microsoft software technologies for rapidly building and
integrating XML Web services, Microsoft Windows-based applications, and Web
solutions. The .NET Framework is a language-neutral platform for writing programs that
can easily and securely interoperate. There's no language barrier with .NET: there are
numerous languages available to the developer, including Managed C++, C#, Visual
Basic and JavaScript. The .NET framework provides the foundation for components to
interact seamlessly, whether locally or remotely on different platforms. It standardizes
common data types and communications protocols so that components created in
different languages can easily interoperate.
".NET" is also the collective name given to various software components built upon the
.NET platform. These will be both products (Visual Studio .NET and Windows .NET
Server, for instance) and services (like Passport, .NET My Services, and so on).
Easy to code
Free and Open Source
Object-Oriented Language
GUI Programming Support
High-Level Language
Extensible feature
Python is Portable language
Python is Integrated language
Interpreted
Large Standard Library
Dynamically Typed Language
3.3 PYTHON:
Python is a powerful multi-purpose programming language created by Guido van
Rossum.
It has a simple, easy-to-use syntax, making it the perfect language for someone
trying to learn computer programming for the first time.
Features of Python:
1. Easy to code:
Python is a high-level programming language. Python is a very easy language to learn
compared to other languages like C, C#, JavaScript, Java, etc. It is very easy to code in
the Python language and anybody can learn Python basics in a few hours or days. It is
also a developer-friendly language.
3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python supports
object-oriented concepts such as classes, objects, encapsulation, etc.
5. High-Level Language:
Python is a high-level language. When we write programs in Python, we do not need to
remember the system architecture, nor do we need to manage the memory.
6. Extensible feature:
Python is an extensible language: we can write some of our Python code in the C or C++
language, compile it, and use it from Python.
9. Interpreted Language:
Python is an interpreted language, because Python code is executed line by line.
Unlike other languages such as C, C++, Java, etc., there is no need to compile Python
code, which makes it easier to debug. The source code of Python is converted into an
intermediate form called bytecode.
APPLICATIONS OF PYTHON:
WEB APPLICATIONS
You can create scalable web apps using frameworks and CMSs (Content
Management Systems) that are built on Python. Some of the popular platforms for
creating web apps are Django, Flask, Pyramid, Plone and Django CMS.
Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
3.2.2 CREATING SOFTWARE PROTOTYPES
Python is slow compared to compiled languages like C++ and Java. It might not
be a good choice if resources are limited and efficiency is a must.
However, Python is a great language for creating prototypes. For example, you
can use Pygame (a library for creating games) to create your game's prototype
first. If you like the prototype, you can use a language like C++ to create the actual
game.
CHAPTER 4
IMPLEMENTATION
4.1 GENERAL
Python is a general-purpose programming language with an extensive ecosystem for
scientific computing, and it is used to implement numerical algorithms for a wide range
of applications.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import keras
from sklearn.model_selection import train_test_split
import seaborn as sns
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score,confusion_matrix
nltk.download('punkt')
d=pd.read_csv(r'C:\Users\kbpra\Music\spam_detection\Dataset\spam.csv',
encoding='latin-1')
d.head()
d
df = d.iloc[:,0:2].values
df = pd.DataFrame(df)
df.columns=['Class','Text']
df.head()
df.info()
df.isna().sum()
df.describe()
Visualization
ns = df["Class"].isin(['ham']).sum(axis=0)
s = df["Class"].isin(['spam']).sum(axis=0)
label=['Spam','Not Spam']
a = [s,ns]
plt.pie(x=a,labels=label,autopct='%1.1f%%')
plt.legend()
plt.show()
plt.figure(figsize=(12,8))
ax = sns.barplot(x=label, y=a)
for p in ax.patches:
    x, y = p.get_xy()
    # annotate each bar with its count (reconstructed; the original line was lost)
    ax.annotate(f'{p.get_height():.0f}', (x + p.get_width()/2, p.get_height()), ha='center', va='bottom')
plt.show()
Preprocessing
def clean_data(text):
    # lowercase the message and normalise whitespace; returning a string
    # keeps the pipeline compatible with nltk.word_tokenize below
    out = text.lower()
    out = out.split()
    return ' '.join(out)

def tokenize_word(text):
    return nltk.word_tokenize(text)

def remove_stopwords(text):
    stop_words = set(stopwords.words("english") + ['u', 'ur', 'r', 'n'])
    # keep only tokens that are not stop words (reconstructed filter)
    filtered_text = [w for w in text if w not in stop_words]
    return filtered_text

def lemmatize_word(text):
    lemmatizer = WordNetLemmatizer()
    # reduce each token to its lemma (reconstructed loop)
    lemmas = [lemmatizer.lemmatize(w) for w in text]
    return lemmas

def get_processed_tokens(text):
    text = clean_data(text)
    text = tokenize_word(text)
    text = remove_stopwords(text)
    text = lemmatize_word(text)
    return text
nltk.download('stopwords')
nltk.download('wordnet')
df['processed_text'] = df['Text'].apply(get_processed_tokens)
df.head()
corpus = []
for i in df["processed_text"]:
    msg = ' '.join(i)       # join the token list back into a string (reconstructed)
    corpus.append(msg)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()
X.shape
Word Cloud
text = ' '.join(corpus)
wc = WordCloud(background_color='black').generate(text)
plt.figure(figsize=[15,20])
plt.title("WORD CLOUD")
plt.axis("off")
plt.imshow(wc,interpolation='bilinear')
spam_corpus = []
for i in df[df['Class']=='spam']["processed_text"]:
    msg = ' '.join(i)       # reconstructed join, as above
    spam_corpus.append(msg)
text1 = ' '.join(spam_corpus)   # reconstructed: text for the spam word cloud
wc = WordCloud(background_color='black').generate(text1)
plt.figure(figsize=[15,20])
plt.axis("off")
plt.imshow(wc,interpolation='bilinear')
not_spam_corpus = []
for i in df[df['Class']=='ham']["processed_text"]:
    msg = ' '.join(i)
    not_spam_corpus.append(msg)
text2 = ' '.join(not_spam_corpus)  # reconstructed: text for the ham word cloud
wc = WordCloud(background_color='black').generate(text2)
plt.figure(figsize=[15,20])
plt.axis("off")
plt.imshow(wc,interpolation='bilinear')
X
X.shape
y = df.iloc[:,0:1]
print(y)
print(X.shape,y.shape)
Data Splitting
x_train,x_test,y_train,y_test= train_test_split(X,y,test_size=0.2,random_state=42)
x_train.shape
x_test.shape
y_train.shape
y_train
Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(fit_prior=True)
nb.fit(x_train, y_train)
y_nb = nb.predict(x_test)
cm_nb = confusion_matrix(y_test, y_nb)
sns.heatmap(cm_nb, annot=True, robust=True)
ac_nb = accuracy_score(y_test, y_nb)   # reconstructed accuracy computation
print(ac_nb)
K Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_knn = knn.predict(x_test)
cm_knn = confusion_matrix(y_test, y_knn)
sns.heatmap(cm_knn, annot=True, fmt='.2f')
ac_knn = accuracy_score(y_test, y_knn)  # reconstructed accuracy computation
print(ac_knn)
"""### Visualizing and Comparing the Accuracy Scores of the different Models"""
ac = [ac_nb, ac_knn]
model_labels = ['Naive Bayes', 'KNN']   # labels for the two models (reconstructed)
plt.figure(figsize=(12,8))
ax = sns.barplot(x=model_labels, y=ac)
plt.title('Accuracy comparison among different models', fontweight='bold')
for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    # annotate each bar with its accuracy (reconstructed; the original line was lost)
    ax.annotate(f'{height:.3f}', (x + width/2, height), ha='center', va='bottom')
plt.show()
#!/usr/bin/env python
"""Django's command-line utility for administrative tasks."""
import os
import sys
def main():
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'new_project.settings')
    try:
        from django.core.management import execute_from_command_line
    except ImportError as exc:
        raise ImportError(
            "Couldn't import Django. Are you sure it's installed and "
            "available on your PYTHONPATH environment variable? Did you "
            "forget to activate a virtual environment?"
        ) from exc
    execute_from_command_line(sys.argv)

if __name__ == '__main__':
    main()
SCREENSHOT:
CHAPTER 5
5.1 CONCLUSION AND REFERENCES
The SMS spam message problem plagues almost every country and keeps
increasing without any sign of slowing down, as the number of mobile users increases
and the rates of SMS services remain cheap. Therefore, this paper presents a spam
filtering technique using various machine learning algorithms. Based on the experiment,
TF-IDF with the Naive Bayes classification algorithm performs well compared to other
algorithms like LSTM in terms of accuracy percentage. However, it is not enough to
evaluate the performance based on accuracy alone, since the dataset is imbalanced.
After some examination, the NB algorithm still manages to provide good precision and
f-measure, with 0.98 precision and 0.97 f-measure. Different algorithms will
provide different performances and results based on the features used. For future
work, adding more features such as message lengths might help the classifiers to train
on the data better and give better performance.
FUTURE SCOPE:
The future scope of this project will involve adding more feature parameters; the more
parameters taken into account, the higher the accuracy. The algorithms can also
be applied to analyze the contents of public comments and thus determine
patterns/relationships between the customer and the company. The use of traditional
algorithms and data mining techniques can also help predict the corporation's
performance structure as a whole. In the future, we plan to integrate neural networks with
some other techniques such as genetic algorithms or fuzzy logic. A genetic algorithm can
be used to identify optimal network architecture and training parameters. Fuzzy logic
provides the ability to account for some uncertainty produced by the neural network
predictions. Their use in conjunction with neural networks could provide an
improvement for SMS spam prediction.
APPLICATION:
It can be used by companies to prevent users from using fake links.
Hacking can be prevented.
REFERENCES:
[1] Modupe, A., O. O. Olugbara, and S. O. Ojo. (2014) "Filtering of Mobile Short
Messaging Communication Using Latent Dirichlet Allocation with Social Network
Analysis", in Transactions on Engineering Technologies: Special Volume of the World
Congress on Engineering 2013, G.-C. Yang, S.-I. Ao, and L. Gelman, Eds. Springer
Science & Business. pp. 671–686.
[2] Shirani-Mehr, H. (2013) "SMS Spam Detection using Machine Learning Approach."
p. 4.
[4] Aski, A. S., and N. K. Sourati. (2016) "Proposed Efficient Algorithm to Filter Spam
Using Machine Learning Techniques." Pac. Sci. Rev. Nat. Sci. Eng. 18 (2): 145–149.
[5] Narayan, A., and P. Saxena. (2013) "The Curse of 140 Characters: Evaluating the
Efficacy of SMS Spam Detection on Android." pp. 33–42.
[6] Almeida, T. A., J. M. Gómez, and A. Yamakami. (2011) "Contributions to the Study
of SMS Spam Filtering: New Collection and Results." p. 4.
[7] Mujtaba, D. G., and M. Yasin. (2014) "SMS Spam Detection Using Simple Message
Content Features." J. Basic Appl. Sci. Res. 4 (4): 5.
[8] Gudkova, D., M. Vergelis, T. Shcherbakova, and N. Demidova. (2017) "Spam and
Phishing in Q3 2017." Securelist - Kaspersky Lab's Cyberthreat Research and Reports.
Available from: https://securelist.com/spam-and-phishing-in-q3-2017/82901/. [Accessed:
10th April 2018].
[9] Choudhary, N., and A. K. Jain. (2017) "Towards Filtering of SMS Spam Messages
Using Machine Learning Based Technique", in Advanced Informatics for Computing
Research 712: 18–30.
[10] Safie, W., N.N.A. Sjarif, N.F.M. Azmi, S.S. Yuhaniz, R.C. Mohd, and S.Y. Yusof.
(2018) "SMS Spam Classification using Vector Space Model and Artificial Neural
Network." International Journal of Advances in Soft Computing & Its Applications 10 (3):
129–141.
[11] Fawagreh, Khaled, Mohamed Medhat Gaber, and Eyad Elyan. (2014) "Random
Forests: From Early Developments to Recent Advancements." Systems Science &
Control Engineering: An Open Access Journal 2 (1): 602–609.
[12] Sajedi, H., G. Z. Parast, and F. Akbari. (2016) "SMS Spam Filtering Using Machine
Learning Techniques: A Survey." Machine Learning 1 (1): 14.
[13] Xu, Q., E. W. Xiang, Q. Yang, J. Du, and J. Zhong. (2012) "SMS Spam Detection
Using Noncontent Features." IEEE Intell. Syst. 27 (6): 44–51.
[14] Sethi, G., and V. Bhootna. (2014) "SMS Spam Filtering Application Using Android."
[15] Nagwani, N. K. (2017) "A Bi-Level Text Classification Approach for SMS Spam
Filtering and Identifying Priority Messages." 14 (4): 8.
[16] Delany, S. J., M. Buckley, and D. Greene. (2012) "SMS Spam Filtering: Methods
and Data." Expert Syst. Appl. 39 (10): 9899–9908.
[17] Chan, P. P. K., C. Yang, D. S. Yeung, and W. W. Y. Ng. (2015) "Spam Filtering for
Short Messages in Adversarial Environment." Neurocomputing 155: 167–176.
[18] Sethi, P., V. Bhandari, and B. Kohli. (2017) "SMS Spam Detection and Comparison
of Various Machine Learning Algorithms", in 2017 International Conference on
Computing and Communication Technologies for Smart Nation (IC3TSN). pp. 28–31.
[19] Warade, S. J., P. A. Tijare, and S. N. Sawalkar. (2014) "An Approach for SMS
Spam Detection." Int. J. Res. Advent Technol. 2 (2): 4.
[20] Almeida, T. A., and J. M. G. Hidalgo. (2018) "SMS Spam Collection." Available
from: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. [Accessed: 11th April
2018].