Sms Spam Detection Project Final
ON
“SMS SPAM DETECTION USING MACHINE LEARNING”
Project work submitted in partial fulfilment for the award of the degree of
Submitted by
JAINAPURAM RAGHUVARMA
Enrollment No:101120862022
This is to certify that this project entitled “SMS Spam Detection By Using
Machine Learning With GUI” is a bonafide work carried out by
Jainapuram Raghuvarma bearing Hall Ticket No: 101120862022 in
University College Of Science, Saifabad and submitted to Osmania
University in partial fulfillment of the requirements for the award of Master
of Computer Applications.
(JAINAPURAM RAGHUVARMA)
SMS SPAM DETECTION
USING
MACHINE LEARNING
ABSTRACT
Project title: SMS Spam Detection By Using Machine Learning.
The number of people who use mobile devices is increasing every day. SMS
(short message service) is a text messaging service that can be used on both
smartphones and regular phones. As a result, SMS traffic skyrocketed. The
number of spam texts has also increased. Spammers attempt to send spam
communications for financial or commercial gain, such as market growth,
lottery ticket information, credit card information, and so on. As a result, spam
classification receives special attention.
1. INTRODUCTION
2. LITERATURE SURVEY
3. PROPOSED SYSTEM
4. SYSTEM REQUIREMENT SPECIFICATION DOCUMENT.
a. System Architecture Block Diagram.
b. System Requirements
i) Software Requirements.
ii) Hardware Requirements .
c. Disadvantage
d. Modules Description
5. SYSTEM DESIGN
a. UML Diagrams
6. DESCRIPTION OF TECHNOLOGIES
a. Machine Learning
b. Python
7. CODE
8. TESTING
9. RESULT
10. CONCLUSION
11. REFERENCE
Introduction:-
In just five years, the number of smartphone users has risen from 1 billion
to 3.8 billion [1]. China, India, and the United States are the top three mobile
phone users. SMS, or Short Message Service, is a text messaging service that
has been around for a while. It is also possible to use SMS without having
access to the internet. As a result, SMS is supported by both smartphones and
basic mobile phones. Although smartphones come with a variety of text
messaging apps such as WhatsApp, these apps work only over the
internet. SMS, on the other hand, can be sent at any time. As a result,
SMS traffic is steadily expanding. Spammers send unsolicited messages in
large quantities for the benefit of their organizations or for personal gain.
Messages of this kind are called spam. Despite the availability of numerous
SMS spam filtering solutions [2], more sophisticated strategies are still
required to deal with this problem.
Spam messages on mobile devices can be irritating. SMS spam and email spam
are two different types of spam; in this report, "spam" and "SMS spam"
refer to the same thing. Spammers use these messages to promote their
services or businesses. Users may sometimes suffer financial losses as a result of
spam messages. Machine Learning is a technology that allows machines to learn
from past data and anticipate future data. Machine learning and deep learning
can now be used to tackle most real-world problems in a variety of fields,
including health, security, and market analysis. Machine learning approaches
include supervised learning, unsupervised learning, semi-supervised learning,
and others. The dataset in supervised learning has output labels, whereas
datasets without labels are dealt with in unsupervised learning. We used a UCI
dataset with labels and employed multiple supervised learning techniques to
detect SMS spam.
Literature survey:-
Using machine learning and deep learning techniques to detect spam is not
new. A number of researchers have previously used ML approaches to classify
SMS spam. Nilam Nur Amir Sjarif [3] et al. combined the TF-IDF
technique with a random forest classifier and reached a 97.5 percent accuracy.
The TF-IDF approach uses two measurements, Term Frequency and Inverse
Document Frequency, to quantify the words in a document. For email spam
filtering, A. Lakshmanarao et al. used four machine learning classifiers:
Decision Trees, Naive Bayes, Logistic Regression, and Random Forest, with the
random forest classifier achieving a 97 percent accuracy. Pavas Navaney[5] et
al. suggested several machine learning techniques and used support vector
machines to obtain a 97.4 percent accuracy. Luo GuangJun [6] et al. used a
variety of shallow machine learning techniques and found that the logistic
regression classifier had a high accuracy rate. For the detection of SMS spam,
Tian Xia [7] et al. presented a Hidden Markov Model. Their model used
information about the order of words, thereby solving issues with low term
frequency.
M. Nivaashini [8] et al. applied a deep neural network for SMS
spam detection and achieved an accuracy of 98%. They also compared DNN
performance with NB, Random Forest, SVM, and KNN. Mehul Gupta [9] et al.
compared various spam detection machine learning models with deep learning
models and showed that deep learning models achieved a high accuracy rate in
SMS spam detection. Gomatham Sai Sravya [10] et al. compared various
machine learning algorithms for SMS spam detection and achieved the best
accuracy with the Naive Bayes classification model. M. Rubin Julis [11] et al.
applied various machine learning classifiers and achieved an accuracy of 97%
with a support vector machine. K. Sree Ram Murthy [12] et al. proposed
Recurrent Neural Networks for SMS spam detection and achieved a good
accuracy rate. S. Sheikh [13] proposed SMS spam detection using feature
selection and the Neural Network model and achieved a good accuracy rate.
Adem Tekerek [14] et al. applied various machine learning classification models
for SMS spam detection and achieved an accuracy of 97% with a support vector
machine classifier.
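Several of the surveyed works rely on TF-IDF weighting, so a small illustration may help. The sketch below is not taken from any of the cited papers; it assumes whitespace tokenization and the plain, unsmoothed formula (term frequency multiplied by the logarithm of the inverse document frequency). The function name `tf_idf` and the three sample messages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (plain, unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # term frequency * inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample messages, for illustration only.
docs = ["free prize now", "call me now", "free free entry"]
weights = tf_idf(docs)
```

In this toy corpus, "prize" occurs in only one message and therefore receives a higher weight in the first message than "now", which occurs in two. Library implementations often use smoothed variants of the formula, so their numbers may differ.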
Proposed System
The prediction method will employ three machine learning algorithms:
Linear Regression, Random Forest Regressor and Decision Tree Regressor.
● STEPS for Proposed Approach-
Step 1:- Initialize the dataset containing the wholesale price index
training data
Step 2:- Select all the rows and column 1 from the dataset as “x”, which is
the independent variable
Step 3:- Select all the rows and column 2 from the dataset as “y”, which
is the dependent variable
Step 4:- Fit DTR/SVR/LR to the dataset
Step 5:- Predict the new value
Step 6:- Visualize the result and check the accuracy
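The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses ordinary least squares as a stand-in for the LR option in Step 4, the four data rows are made up, and the helper names `fit_line`, `predict` and `r_squared` are hypothetical. Visualization (part of Step 6) is left out.

```python
def fit_line(x, y):
    """Step 4: fit y = slope * x + intercept by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x_new):
    """Step 5: predict a new value from the fitted line."""
    slope, intercept = model
    return slope * x_new + intercept


def r_squared(model, x, y):
    """Step 6: check the fit with the coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Step 1: a made-up two-column dataset (hypothetical values).
rows = [(1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)]
x = [row[0] for row in rows]  # Step 2: column 1, independent variable
y = [row[1] for row in rows]  # Step 3: column 2, dependent variable

model = fit_line(x, y)
next_value = predict(model, 5)
score = r_squared(model, x, y)
```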
a. System Architecture Block Diagram.
b. System Requirements:
i) Software Requirements:
1. Anaconda Navigator
2. ML - NLP
ii) Hardware Requirements:
2. RAM 4GB
c. Disadvantages:
● Time complexity was high
● Prediction accuracy was not so high
d. System Modules
1. Data Ingestion:
Data ingestion is the transportation of data from assorted sources to a storage
medium where it can be accessed, used, and analyzed by an organization. The
destination is typically a data warehouse, data mart, database, or a document
store. Sources may be almost anything – including SaaS data, in-house apps,
databases, spreadsheets, or even information scraped from the internet. The data
ingestion layer is the backbone of any analytics architecture. Downstream
reporting and analytics systems rely on consistent and accessible data. There are
different ways of ingesting data, and the design of a particular data ingestion
layer can be based on various models or architectures.
2. Data Preprocessing:
Data preprocessing is a data mining technique used to transform raw data
into a useful and efficient format. The data here goes through two stages:
1. Data Cleaning: It is very important for data to be error free and free of
unwanted data. So, the data is cleansed before performing the next steps.
Cleansing of data includes checking for missing values, duplicate records and
invalid formatting and removing them.
2. Data Transformation: Data transformation is the mathematical
transformation of the datasets; data is transformed into forms suitable for
the data mining process. This allows us to understand the data better by
arranging the hundreds of records in an orderly way.
Transformation includes Normalization, Standardization and Attribute Selection.
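The two stages can be illustrated with a minimal sketch. It assumes each record is a (label, text) pair and uses message length as the numeric attribute to transform; the function names and the sample records are hypothetical and do not come from the project's notebook.

```python
from statistics import mean, pstdev


def clean(records):
    """Data cleaning: drop records with missing or blank fields and exact duplicates."""
    seen = set()
    cleaned = []
    for label, text in records:
        if label is None or text is None or not text.strip():
            continue  # missing value or invalid (blank) text
        key = (label, text.strip())
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append(key)
    return cleaned


def min_max_normalize(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: rescale values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw records, for illustration only.
raw = [
    ("ham", "See you at 5"),
    ("spam", "WIN a prize now"),
    ("spam", "WIN a prize now"),  # duplicate
    ("ham", None),                # missing text
    ("ham", "   "),               # blank text
]
records = clean(raw)
lengths = [len(text) for _, text in records]
normalized = min_max_normalize(lengths)
standardized = standardize(lengths)
```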
MACHINE LEARNING
Machine Learning is a system that can learn from examples through
self-improvement, without being explicitly coded by a programmer. The
breakthrough comes with the idea that a machine can learn on its own from the
data (i.e., examples) to produce accurate results.
Machine learning combines data with statistical tools to predict an output. This
output is then used by businesses to derive actionable insights. Machine learning
is closely related to data mining and Bayesian predictive modeling. The
machine receives data as input and uses an algorithm to formulate answers.
Machine learning is also used for a variety of tasks such as fraud detection,
predictive maintenance, portfolio optimization, task automation and so on.
[Figure: Machine Learning. Diagram labels: Data, Rules, Computer]
How does Machine learning work?
Machine learning is the brain where all the learning takes place. The way the
machine learns is similar to the way a human being learns. Humans learn from
experience. The more we know, the more easily we can predict. By analogy, when
we face an unknown situation, the likelihood of success is lower than in a known
situation. Machines are trained in the same way. To make an accurate prediction,
the machine is first shown examples. When we give the machine a similar example,
it can figure out the outcome. However, like a human, if it is fed a previously
unseen kind of example, the machine has difficulty predicting.
The core objectives of machine learning are learning and inference. First of
all, the machine learns through the discovery of patterns. This discovery is made
thanks to the data. One crucial task of the data scientist is to choose carefully
which data to provide to the machine. The list of attributes used to solve a
problem is called a feature vector. You can think of a feature vector as a subset
of data that is used to tackle a problem.
The machine uses algorithms to simplify reality and transform
this discovery into a model. Therefore, the learning stage is used to describe the
data and summarize it into a model.
For instance, suppose the machine is trying to understand the relationship
between the wage of an individual and the likelihood of going to a fancy
restaurant. It turns out the machine finds a positive relationship between wage
and going to a high-end restaurant: this is the model.
Inferring
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop 4-7 until the results are satisfying
9. Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies that
knowledge to new sets of data.
Machine learning Algorithms and where they are used?
Machine learning can be grouped into two broad learning tasks: Supervised and
Unsupervised. There are many other algorithms.
Supervised learning
An algorithm uses training data and feedback from humans to learn the
relationship of given inputs to a given output. For instance, a practitioner can
use marketing expense and weather forecast as input data to predict the sales of
cans.
You can use supervised learning when the output data is known. The algorithm
will predict new data.
● Classification task
● Regression task
Classification
Imagine you want to predict the gender of a customer for a commercial. You
will start gathering data on the height, weight, job, salary, purchasing basket, etc.
from your customer database. You know the gender of each of your customers; in
this example it can only be male or female. The objective of the classifier will
be to assign a probability of being male or female (i.e., the label) based on the
information (i.e., the features you have collected). Once the model has learned
how to recognize male or female, you can use new data to make a prediction. For
instance, you just got new information from an unknown customer, and you want to
know if the customer is male or female. If the classifier predicts male = 70%, it
means the algorithm is 70% sure that this customer is male, and 30% sure that the
customer is female.
The label can be of two or more classes. The above example has only two
classes, but if a classifier needs to predict objects, it may have dozens of
classes (e.g., glass, table, shoes, etc.; each object represents a class).
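Applied to this project, the two classes are spam and ham. The sketch below shows how a classifier can output a probability per class, using a small multinomial Naive Bayes model with add-one smoothing. It is an illustration only, not the project's actual code; the four training messages and the function names are hypothetical, and words never seen in training are simply ignored.

```python
import math
from collections import Counter


def train(messages):
    """Count classes and per-class word occurrences from (label, text) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in messages:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_proba(model, text):
    """Return {label: probability} for a new message."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # class prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # assumption: unseen words are ignored
            # add-one (Laplace) smoothing avoids zero probabilities
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


# Hypothetical training messages, for illustration only.
training = [
    ("spam", "win cash prize now"),
    ("spam", "free prize claim now"),
    ("ham", "are we meeting now"),
    ("ham", "see you at lunch"),
]
model = train(training)
probs = predict_proba(model, "claim free cash prize")
```

Here `probs["spam"]` is much larger than `probs["ham"]`, which is read the same way as the 70% / 30% example above.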
Regression
When the output is a continuous value, the task is a regression. For instance, a
financial analyst may need to forecast the value of a stock based on a range of
features such as equity, previous stock performance and macroeconomic indices.
The system will be trained to estimate the price of the stocks with the lowest
possible error.
K-means clustering (type: Clustering): Puts data into some groups (k) that each
contain data with similar characteristics (as determined by the model, not in
advance by humans)
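A minimal sketch of K-means on one-dimensional data shows the idea of grouping without labels. It assumes the starting centroids are supplied by the caller so that the result is reproducible; the function name `kmeans_1d` and the sample message lengths are hypothetical.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Group numbers into len(centroids) clusters by nearest centroid."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put every value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical message lengths, for illustration only.
lengths = [20, 22, 25, 140, 150, 155]
centroids, clusters = kmeans_1d(lengths, centroids=[20, 155])
```

The short and long messages end up in separate groups without any label being provided.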
Automation:
Finance Industry
Government organization
Healthcare industry
● Healthcare was one of the first industries to use machine learning with
image detection.
Marketing
Unsupervised learning can quickly search for comparable patterns in a diverse
dataset. In turn, the machine can perform quality inspection throughout the
logistics hub, detecting shipments with damage and wear.
For instance, IBM's Watson platform can determine shipping container damage.
Watson combines visual and systems-based data to track, report and make
recommendations in real-time.
In the past, stock managers relied extensively on basic methods to evaluate
and forecast inventory. By combining big data and machine learning,
better forecasting techniques have been implemented (an improvement of 20 to
30% over traditional forecasting tools). In terms of sales, this means an
increase of 2 to 3% due to the potential reduction in inventory costs.
Deep Learning
Deep learning is a type of computer software that mimics the network of neurons
in a brain. It is a subset of machine learning and is called deep learning
because it makes use of deep neural networks. The machine uses different layers
to learn from the data. The depth of the model is represented by the number of
layers in the model. Deep learning is the new state of the art in terms of AI.
In deep learning, the learning phase is done through a neural network.
Reinforcement Learning
Reinforcement learning is a subfield of machine learning in which systems are
trained by receiving virtual "rewards" or "punishments," essentially learning by
trial and error. Google's DeepMind has used reinforcement learning to beat a
human champion at the game of Go. Reinforcement learning is also used in video
games to improve the gaming experience by providing smarter bots.
Some of the most famous algorithms are:
● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
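As an illustration of learning from rewards, the sketch below applies a single tabular Q-learning update. It assumes a Q-table stored as a dictionary of dictionaries; the state and action names, the learning rate `alpha` and the discount factor `gamma` are hypothetical values chosen for the example.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update and return the new Q-value.

    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    next_values = q_table.get(next_state)
    best_next = max(next_values.values()) if next_values else 0.0
    old_value = q_table[state][action]
    q_table[state][action] = old_value + alpha * (reward + gamma * best_next - old_value)
    return q_table[state][action]


# Hypothetical two-state table, for illustration only.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}
new_value = q_update(q, "s0", "right", reward=0.0, next_state="s1")
```

Even with no immediate reward, the value of moving right from `s0` rises because the next state already has a rewarding action; repeating such updates is the trial-and-error learning described above.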
[Figure: Artificial Intelligence, Machine learning, Deep learning]
With machine learning, you need less data to train the algorithm than with deep
learning. Deep learning requires an extensive and diverse set of data to identify
the underlying structure. Besides, machine learning provides a faster-trained
model. The most advanced deep learning architectures can take days to a week to
train. The advantage of deep learning over machine learning is that it can be
highly accurate. You do not need to understand what features are the best
representation of the data; the neural network learns how to select critical
features. In machine learning, you need to choose for yourself what features to
include in the model.
TensorFlow
The most famous deep learning library in the world is Google's TensorFlow.
Google uses machine learning in all of its products to improve the
search engine, translation, image captioning or recommendations.
To give a concrete example, Google users can experience a faster and more
refined search with AI. If the user types a keyword in the search bar, Google
provides a recommendation about what could be the next word.
Google wants to use machine learning to take advantage of their massive
datasets to give users the best experience. Three different groups use machine
learning:
● Researchers
● Data scientists
● Programmers.
They can all use the same toolset to collaborate with each other and improve
their efficiency.
Google does not just have any data; they have the world's most massive
computer, so TensorFlow was built to scale. TensorFlow is a library developed
by the Google Brain Team to accelerate machine learning and deep neural
network research.
It was built to run on multiple CPUs or GPUs and even mobile operating
systems, and it has several wrappers in several languages like Python, C++ or
Java.
You can train it on multiple machines and then run it on a different machine
once you have the trained model.
The model can be trained and used on GPUs as well as CPUs. GPUs were
initially designed for video games. In late 2010, Stanford researchers found that
GPUs were also very good at matrix operations and algebra, which makes them
very fast for these kinds of calculations. Deep learning relies on a lot of
matrix multiplication. TensorFlow is very fast at computing matrix
multiplication because it is written in C++. Although it is implemented in C++,
TensorFlow can be accessed and controlled from other languages, mainly Python.
Finally, a significant feature of TensorFlow is TensorBoard. TensorBoard
makes it possible to monitor graphically and visually what TensorFlow is doing.
List of Prominent Algorithms supported by TensorFlow
History of Python
Python was developed by Guido van Rossum in the late eighties and early
nineties at the National Research Institute for Mathematics and Computer
Science in the Netherlands.
Python is copyrighted. Python source code is available under the Python
Software Foundation License, an open-source license that is compatible with the
GNU General Public License (GPL).
Easy-to-read: Python code is clearly defined and easy on the eyes.
A broad standard library: The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode: Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
Portable: Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.
Extendable: You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.
GUI Programming: Python supports GUI applications that can be created and
ported to many system calls, libraries, and windows systems, such as Windows
MFC, Macintosh, and the X Window system of Unix.
Scalable: Python provides a better structure and support for large programs
than shell scripting.
Apart from the above-mentioned features, Python has a big list of good features,
few are listed below:
ANACONDA NAVIGATOR
Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution that allows you to launch applications and easily manage
conda packages, environments and channels without using command-line
commands. Navigator can search for packages on Anaconda Cloud or in a local
Anaconda Repository. It is available for Windows, macOS and Linux.
Why use Navigator?
In order to run, many scientific packages depend on specific versions of
other packages. Data scientists often use multiple versions of many
packages, and use multiple environments to separate these different versions.
The command line program conda is both a package manager and an
environment manager, to help data scientists ensure that each version of each
package has all the dependencies it requires and works correctly.
Navigator is an easy, point-and-click way to work with packages and
environments without needing to type conda commands in a terminal window.
You can use it to find the packages you want, install them in an environment,
run the packages and update them, all inside Navigator.
WHAT APPLICATIONS CAN I ACCESS USING NAVIGATOR?
The following applications are available by default in Navigator:
● Jupyter Lab
● Jupyter Notebook
● Qt Console
● Spyder
● VS Code
● Glueviz
● Orange 3 App
● Rodeo
● RStudio
Advanced conda users can also build their own Navigator applications.
How can I run code with Navigator?
The simplest way is with Spyder. From the Navigator Home tab, click Spyder,
and write and execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an
increasingly popular system that combines your code, descriptive text, output,
images and interactive interfaces into a single notebook file that is edited,
viewed and used in a web browser.
What’s new in 1.9?
● Add support for Offline Mode for all environment related actions.
● Add support for custom configuration of main windows links.
● Numerous bug fixes and performance enhancements.
TESTING
Software testing is an investigation conducted to provide stakeholders
with information about the quality of the product or service under test.
Software Testing also provides an objective, independent view of the
software to allow the business to appreciate and understand the risks at
implementation of the software. Test techniques include, but are not
limited to, the process of executing a program or application with the
intent of finding software bugs.
Software Testing can also be stated as the process of validating and
verifying that a software program/application/product:
● Meets the business and technical requirements that guided its
design and development.
● Works as expected and can be implemented with the same
characteristics.
TESTING METHODS
Functional Testing
Integration Testing
Software integration testing is the incremental integration testing of two
or more integrated software components on a single platform to produce
failures caused by interface defects.
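A minimal sketch of an integration test for this kind of project is shown below. It assumes two small components, a text normalizer and a classifier, and checks that they work together through one entry point. The keyword rule stands in for the trained model, and all names and the keyword list are hypothetical.

```python
# Hypothetical keyword list standing in for a trained model.
SPAM_WORDS = {"free", "prize", "win"}


def normalize(text):
    """Component 1: lowercase the message and split it into tokens."""
    return text.lower().split()


def is_spam(tokens, threshold=2):
    """Component 2: flag a message that contains enough spam keywords."""
    return sum(1 for token in tokens if token in SPAM_WORDS) >= threshold


def classify_sms(text):
    """Integrated entry point: normalizer output feeds the classifier."""
    return "spam" if is_spam(normalize(text)) else "ham"


def test_integration():
    """Exercise both components together to expose interface defects."""
    # Upper-case input only matches if normalize() runs before is_spam().
    assert classify_sms("WIN a FREE prize") == "spam"
    assert classify_sms("See you at lunch") == "ham"


test_integration()
```

The first assertion would fail if the classifier received raw text instead of normalized tokens, which is the kind of interface defect integration testing is meant to expose.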