Nisha Internship3


SMS SPAM DETECTION

CHAPTER 1
COMPANY PROFILE
1.1 About The Company:
Pantech E Learning is a subsidiary of the Pantech Group. Pantech is a think tank with a
keen interest in sharing technical knowledge and expertise with the student and staff
community through On-Campus Courses, In-House Courses, Faculty Development Programs,
Hands-on Sessions, Workshops and Seminars. Domains of expertise: Python Programming,
Arduino Programming, Embedded Systems, PCB Design, Android App Development, IoT
Applications, VHDL Programming, Verilog Programming, Core Java and Advanced Java
Programming, Simulink Design using MATLAB, Power Electronics using MATLAB, IoT using
Arduino, Robotics, MATLAB Programming, Cloud Computing using Java, Data Mining & Its
Programming, Machine Learning using MATLAB, Blockchain, OpenCV & Image Processing, IoT
using Raspberry Pi and the Cloud Interface, Deep Learning using Python, NS2 Programming,
Machine Learning using Python, Computer Vision and Machine Learning, Power Systems using
MATLAB, Image Processing using MATLAB, Renewable Energy using MATLAB, Virtual Reality,
Electric Vehicle Design, Augmented Reality, Computer Vision (CV) Robots.

MACHINE LEARNING ARTIFICIAL INTELLIGENCE

Fig 1.1: Machine Learning and artificial intelligence

DEPT OF CSE, LAEC BIDAR Page 1



EMBEDDED SYSTEM

Fig 1.1: Embedded System

CYBERSECURITY

Fig 1.1: Cybersecurity

At PANTECH life is all about delivering the highest quality to customers. Reduced costs,
quicker time-to-market, huge value adds and enhanced productivity are our way of life. The
very cornerstone of our success has been our unerring path to ensuring that QA processes
and procedures are met with unwavering dedication.

1.2 Mission:

"To help our customers in achieving their time-to-market objective by being their
dependable technology partners and delivering our commitments on time and every time
with quality."

1.3 Vision:

"Pantech e learning solutions will become the market leader in Embedded system
development, firmware & manpower outsourcing focusing on specific application areas in
Communications, Automotive and Consumer electronics."

1.4 Services:

Direct Staffing
Contract Staffing
Out Sourcing
Corporate Training

Direct Staffing Services provide technical resourcing in selective technologies / domains.
We have been successful in locating resources with specific and rare skills to meet the exact
requirements of our clients across the globe.

Contract Staffing Services provide skilled resources to clients to meet their requirements
for defined periods and to overcome the lengthy selection process by absorbing consultants
based on their performance during the deputation.

Corporate Training: KNOWX is a bridge between the IT/Electronics industry and the
student community.
We have a broad range of course offerings to equip you and your organization with the right
skills, at precisely the right time and at the right cost.


CHAPTER 2
INTRODUCTION
In just five years, the number of smartphone users has risen from 1 billion to 3.8 billion. China,
India, and the United States are the top three countries by number of mobile phone users. SMS,
or Short Message Service, is a long-established text messaging service. SMS works without
internet access, so it is supported by both smartphones and basic mobile phones. Although
smartphones come with a variety of text messaging apps such as WhatsApp, these apps work only
over the internet, whereas an SMS can be sent at any time. As a result, SMS traffic is steadily
expanding. Spammers send unsolicited messages, bombarding people with a large quantity of them
for the benefit of their organisations or for personal gain. Such messages are called spam.
Despite the availability of numerous SMS spam filtering solutions, more sophisticated
strategies are still required to deal with this problem. Spam messages on mobile devices can
be irritating. SMS spam and email spam are two different types of spam; in this report, the
terms "spam" and "SMS spam" refer to the same thing. Spammers use these messages to promote
their utilities or businesses, and users may sometimes suffer financial losses as a result.

Machine learning is a technology that allows machines to learn from past data and make
predictions about future data. Machine learning and deep learning can now be used to tackle
many real-world problems in a variety of fields, including health, security, and market
analysis. Machine learning approaches include supervised learning, unsupervised learning,
semi-supervised learning, and others. The dataset in supervised learning has output labels,
whereas datasets without labels are dealt with in unsupervised learning. We used a UCI dataset
with labels and employed multiple supervised learning techniques to detect SMS spam.

For example, if a data set describes the characteristics and purchasing behavior of shoppers
at grocery stores, the unsupervised learning task might be to segment the shoppers into groups
or “clusters” that exhibit similar behaviors. Such learning methods might find that college
students, parents with young children, and older adults have characteristic shopping
behaviors that are similar within each group but dissimilar from the others. This is an
unsupervised learning task because there is no right or wrong answer about how many clusters
can be found in the data, which people belong in which cluster, or even how to describe each
cluster. With this understanding of machine learning, we have used it to generate the rules
that identify, based on the inputs, whether a message is SPAM or HAM. To process the message
content, we have used TF-IDF vectorization for generating the Word Cloud. How TF-IDF
vectorization works is briefly described next.


TF-IDF stands for Term Frequency-Inverse Document Frequency. It is used in machine learning
and text mining as a weighting factor for identifying word features. The weight increases the
more times a term occurs in a document, but it is offset by how often the word appears across
the entire data set. This offset removes importance from very common words such as 'the' or
'a', which appear often across all documents. TF-IDF is used very often in relevance ranking
and scoring, and to remove stop words from an ML model, since stop words do not give any
relevant information about a particular document type or class. Figure 3 represents the
TF-IDF mathematical formula.
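
The weighting described above can be sketched in a few lines of Python. This is a minimal illustration that assumes one common variant, tf-idf(t, d) = tf(t, d) x ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t; the exact formula in the report's figure may differ. The function name tf_idf and the three toy messages are hypothetical and are not taken from the UCI dataset.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy messages, used only for illustration.
messages = [
    "win a free prize now",
    "call me when you are free",
    "a meeting at noon",
]
scores = tf_idf(messages)
```

In the first message, "win" occurs in only one document and therefore receives a higher weight than "free" or "a", which each occur in two of the three documents.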


CHAPTER 3

SYSTEM ARCHITECTURE

3.1 Literature survey:

Using machine learning and deep learning techniques to detect spam is not new. A number of
researchers have previously used ML approaches to classify SMS spam.
Nilam Nur Amir Sjarif et al. combined the TF-IDF technique with a random forest classifier
and reached 97.5 percent accuracy. The TF-IDF approach uses two measurements, Term
Frequency and Inverse Document Frequency, to quantify the words in a document. For
email spam filtering, A. Lakshmana Rao et al. used four machine learning
classifiers: Decision Trees, Naive Bayes, Logistic Regression, and Random Forest, with the
random forest classifier achieving 97 percent accuracy. Pavas Navaney et al. suggested
several machine learning techniques and used support vector machines to obtain 97.4
percent accuracy. Luo GuangJun et al. used a variety of shallow machine learning
techniques and found that the logistic regression classifier had a high accuracy rate. For the
detection of SMS spam, Tian Xia et al. presented a Hidden Markov Model. Their model
used information about the order of words, thereby solving issues with low term
frequency, for example when m0ney or mo.ney is written for the word money. Therefore,
instead of words, they focused on spam detection at the character level, such as letters in
English.

M. Nivaashini et al. applied a deep neural network for SMS spam detection and
achieved an accuracy of 98%. They also compared DNN performance with NB, Random
Forest, SVM, and KNN. Mehul Gupta et al. compared various spam detection machine
learning models with deep learning models and showed that deep learning models achieved
a high accuracy rate in SMS spam detection. Gomatham Sai Sravya et al. compared various
machine learning algorithms for SMS spam detection and achieved the best accuracy with
the Naive Bayes classification model. M. Rubin Julis et al. applied various machine learning
classifiers and achieved an accuracy of 97% with a support vector machine. K. Sree Ram
Murthy et al. proposed Recurrent Neural Networks for SMS spam detection and achieved a
good accuracy rate. S. Sheikh proposed SMS spam detection using feature selection and a
neural network model and achieved a good accuracy rate. Adem Tekerek et al. applied
various machine learning classification models for SMS spam detection and achieved an
accuracy of 97% with a support vector machine classifier.


3.2 BLOCK DIAGRAM

3.3 SYSTEM REQUIREMENTS

Hardware:

1. Windows 7,8,10 64 bit

2. RAM - 4GB

Software:

1. Anaconda Navigator

2. ML - NLP


Disadvantages:
● Time complexity was high
● Prediction accuracy was not very high

3.4 Proposed System

The prediction method employs three machine learning algorithms: support vector machine
and two naive Bayes variants (multinomial and Gaussian NB).
Steps of the proposed approach:
Step 1: Initialize the dataset containing the training data.
Step 2: Select all the rows and column 1 from the dataset as “x”, the independent variable.
Step 3: Select all the rows and column 2 from the dataset as “y”, the dependent variable.
Step 4: Fit the SVM and naive Bayes classifiers to the dataset.
Step 5: Predict the new value.
Step 6: Visualize the result and check the accuracy.
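
As an illustration of steps 1 to 6, the following is a minimal pure-Python sketch of one of the named algorithms, multinomial naive Bayes with add-one smoothing. It is not the report's implementation: the function names, the four toy messages and the whitespace tokenizer are assumptions made for this example, and the SVM and Gaussian NB variants and the visualization step are not shown.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(texts, labels):
    """Count documents per class and words per class (the 'fit' step)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log posterior for `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: x holds the message text, y holds the label.
x = ["win a free prize now", "free cash offer call now",
     "are we meeting for lunch", "see you at home tonight"]
y = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(x, y)                      # step 4: fit
predictions = [predict(model, message) for message in x]  # step 5: predict
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)  # step 6
```

Accuracy is measured here on the training messages only to keep the sketch short; a real evaluation would use a held-out test split.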

3.5 System Modules

1.Data Ingestion:
Data ingestion is the transportation of data from assorted sources to a storage medium
where it can be accessed, used and analyzed by an organization. The destination is typically
a data warehouse, data mart, database, or a document store. Sources may be almost
anything – including SaaS data, in-house apps, databases, spreadsheets, or even
information scraped from the internet. The data ingestion layer is the backbone of any
analytics architecture. Downstream reporting and analytics systems rely on consistent and
accessible data. There are different ways of ingesting data, and the design of a particular
data ingestion layer can be based on various models or architectures.

2. Data Preprocessing:
Data preprocessing is a data mining technique used to transform raw data into a useful
and efficient format. The data here goes through two stages. 1. Data Cleaning: It is very
important for data to be error-free and free of unwanted records, so the data is cleansed
before performing the next steps. Cleansing includes checking for missing values,
duplicate records and invalid formatting, and removing them. 2. Data Transformation: The
datasets are transformed mathematically into forms suitable for the data mining process.
This allows us to understand the data more keenly by arranging the hundreds of records in
an orderly way. Transformation includes Normalization, Standardization and Attribute
Selection.
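
The two stages can be sketched as follows. This is a minimal example under stated assumptions: records are (label, text) pairs, valid labels are "ham" and "spam", and min-max scaling stands in for the normalization step. The helper names clean_records and min_max_normalize and the sample rows are hypothetical.

```python
def clean_records(records):
    """Drop rows with missing text, invalid labels, or duplicate text."""
    seen = set()
    cleaned = []
    for label, text in records:
        if label not in ("ham", "spam") or not text or not text.strip():
            continue  # missing value or invalid formatting
        normalized = " ".join(text.lower().split())
        if normalized in seen:
            continue  # duplicate record
        seen.add(normalized)
        cleaned.append((label, normalized))
    return cleaned


def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw rows, used only for illustration.
raw = [
    ("ham", "See you at  home"),
    ("spam", "WIN a FREE prize"),
    ("ham", "see you at home"),   # duplicate after lowercasing
    ("spam", None),               # missing text
    ("unknown", "hello"),         # invalid label
]
cleaned = clean_records(raw)
lengths = min_max_normalize([len(text) for _, text in cleaned])
```

Only the first two rows survive cleaning; the message length is then rescaled as an example of a transformed numeric attribute.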


3. Exploratory Data Analysis:
Exploratory data analysis (EDA) is an approach to understanding datasets more keenly by
means of visual elements like scatter plots, bar plots, etc. This allows us to identify the
trends in the data more accurately and to perform analysis accordingly. From the yearly
trend graphs it is observed that US exports depend on and follow the areas planted and
harvested annually. A sudden drop in China's exports in the year 2009 is observed, and in
the meantime its imports kept increasing over the last 12 years regardless of the global
yield, which implies that China has a huge and lasting demand for the soybean crop but now
relies on the global supply to meet its needs.

4. Feature Extraction : Correlations


Finally, let's take a look at the relationships between numeric features. Correlation is a
value between -1 and 1 that represents how closely the values of two separate features move
in unison. Positive correlation means that as one feature increases, the other increases,
e.g. a child's age and her height. Negative correlation means that as one feature increases,
the other decreases, e.g. hours spent studying and number of parties attended. Correlations
near -1 or 1 indicate a strong relationship, those closer to 0 indicate a weak relationship,
and 0 indicates no linear relationship.
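
A short sketch of the Pearson correlation coefficient makes the two examples concrete. The function name and the sample values (ages, heights, study hours, parties) are made up for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical values: height rises with age, parties fall with study hours.
ages = [2, 4, 6, 8, 10]
heights = [86, 102, 115, 128, 138]
study_hours = [1, 2, 3, 4, 5]
parties = [10, 8, 6, 4, 2]

r_positive = pearson_correlation(ages, heights)
r_negative = pearson_correlation(study_hours, parties)
```

With these values r_positive is close to 1 and r_negative is -1 up to floating-point rounding, because the second pair is exactly linear.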

Evaluation Metric

Modelling of data involves creating a data model for the data to be stored in the database.
The process of modeling means training a Machine Learning Algorithm to predict the
labels from the features, tuning it for business need, and validating it on the hold out data.
The output from modeling is a trained model that can be used for inference, making
predictions on new data points. Modeling is independent of the previous steps in the
Machine Learning process and has standardized inputs which means we can alter the
prediction problem without needing to rewrite all our code. If the business requirements
change, we can generate new label times, build corresponding features, and input them into
the model. Models are implemented and later evaluated for their accuracies using root mean
square error.
Regressors used for prediction purpose -
● Random Forest Regressor- regression method
● Support Vector Regression (SVR) – uses kernel functions
● Linear Regression – regression method
● Decision Tree Regression – regression method
● R2 score - The r2-score of a regression is the percentage of the test set tuples
● Root Mean Square Error: The Root Mean Square Error is evaluated for model
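
For reference, the two regression metrics named above can be computed as in the following sketch. It uses the standard definitions, RMSE = sqrt(mean((actual - predicted)^2)) and R2 = 1 - SS_res / SS_tot; the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, used only for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

error = rmse(actual, predicted)
fit = r2_score(actual, predicted)
```

A lower RMSE and an R2 closer to 1 both indicate predictions that track the actual values more closely.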


UML DIAGRAMS

The Unified Modeling Language (UML) is used to specify, visualize, modify, construct
and document the artifacts of an object-oriented software intensive system under
development. UML offers a standard way to visualize a system's architectural blueprints,
including elements such as:
● actors
● business processes
● (logical) components
● activities
● programming language statements
● database schemas, and
● Reusable software components.

UML combines best techniques from data modeling (entity relationship diagrams), business
modeling (work flows), object modeling, and component modeling. It can be used with all
processes, throughout the software development life cycle, and across different
implementation technologies. UML has synthesized the notations of the Booch method, the
Object-modeling technique (OMT) and Object-oriented software engineering (OOSE) by
fusing them into a single, common and widely usable modeling language. UML aims to be
a standard modeling language which can model concurrent and distributed systems.


Sequence Diagram:
Sequence Diagrams represent the objects participating in the interaction horizontally and
time vertically. A Use Case is a kind of behavioral classifier that represents a declaration of
an offered behavior. Each use case specifies some behavior, possibly including variants that
the subject can perform in collaboration with one or more actors. Use cases define the offered
behavior of the subject without reference to its internal structure. These behaviors involving
interactions between the actor and the subject, may result in changes to the state of the
subject and communications with its environment. A use case can include possible variations
of its basic behavior, including exceptional behavior and error handling.


Activity Diagram:
Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modeling
Language, activity diagrams can be used to describe the business and operational
step-by-step workflows of components in a system. An activity diagram shows the overall
flow of control.


Use Case diagram:


● UML is a standard language for specifying, visualizing, constructing, and
documenting the artifacts of software systems.
● UML was created by the Object Management Group (OMG), and the UML 1.0 specification
draft was proposed to the OMG in January 1997.
● OMG is continuously putting in effort to make it a true industry standard.
● UML stands for Unified Modeling Language.
● UML is a pictorial language used to make software blueprints.

Use Case Diagram


Class diagram
The class diagram is the main building block of object-oriented modeling. It is used for
general conceptual modeling of the system of the application, and for detailed modeling
translating the models into programming code. Class diagrams can also be used for data
modeling. The classes in a class diagram represent both the main elements and interactions
in the application and the classes to be programmed. In the diagram, classes are represented
with boxes that contain three compartments:
1. The top compartment contains the name of the class. It is printed in bold and centered,
and the first letter is capitalized.
2. The middle compartment contains the attributes of the class. They are left-aligned and
the first letter is lowercase.
3. The bottom compartment contains the operations the class can execute. They are also
left-aligned and the first letter is lowercase.

CLASS DIAGRAM


CHAPTER 4

DOMAIN SPECIFICATION

4.1 MACHINE LEARNING

Machine learning is a system that can learn from examples through self-improvement,
without being explicitly coded by programmers. The breakthrough comes with the idea
that a machine can learn on its own from the data (i.e., examples) to produce accurate
results.
Machine learning combines data with statistical tools to predict an output. This output is
then used by corporations to derive actionable insights. Machine learning is closely related
to data mining and Bayesian predictive modeling. The machine receives data as input and
uses an algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who have a
Netflix account, all recommendations of movies or series are based on the user's historical
data. Tech companies are using unsupervised learning to improve the user experience by
personalizing recommendations.
Machine learning is also used for a variety of tasks like fraud detection, predictive
maintenance, portfolio optimization, task automation and so on.

4.2 Machine Learning vs. Traditional Programming

Traditional programming differs significantly from machine learning. In traditional
programming, a programmer codes all the rules in consultation with an expert in the
industry for which software is being developed. Each rule is based on a logical
foundation; the machine will execute an output following the logical statement. When the
system grows complex, more rules need to be written. It can quickly become
unsustainable to maintain.

Machine Learning

How does Machine learning work?


Machine learning is the brain where all the learning takes place. The way the machine
learns is similar to the way a human being learns. Humans learn from experience. The more
we know, the more easily we can predict. By analogy, when we face an unknown situation, the
likelihood of success is lower than in a known situation. Machines are trained in the same
way. To make an accurate prediction, the machine sees examples. When we give the machine
a similar example, it can figure out the outcome. However, like a human, if it is fed a
previously unseen example, the machine has difficulty predicting.


The core objective of machine learning is learning and inference.

First of all, the machine learns through the discovery of patterns. This discovery is made
thanks to the data. One crucial task of the data scientist is to choose carefully which data to
provide to the machine. The list of attributes used to solve a problem is called a feature
vector. You can think of a feature vector as a subset of data that is used to tackle a problem.
The machine uses some fancy algorithms to simplify the reality and transform this discovery
into a model. Therefore, the learning stage is used to describe the data and summarize it into
a model.

For instance, the machine is trying to understand the relationship between the wage of an
individual and the likelihood of going to a fancy restaurant. It turns out the machine finds a
positive relationship between wage and going to a high-end restaurant: this is the model.

Inferring

When the model is built, it is possible to test how powerful it is on never-seen-before
data. The new data are transformed into a feature vector, go through the model and
give a prediction. This is the beautiful part of machine learning. There is no need to update
the rules or train the model again. You can use the previously trained model to make
inferences on new data.


The life of Machine Learning programs is straightforward and can be summarized in the
following points:

1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop 4-7 until the results are satisfying
9. Use the model to make a prediction

Once the algorithm gets good at drawing the right conclusions, it applies that knowledge to
new sets of data.

Machine learning Algorithms and where they are used?

Machine learning can be grouped into two broad learning tasks: supervised and
unsupervised. There are many other algorithms.


4.2.1 Supervised learning

An algorithm uses training data and feedback from humans to learn the relationship of given
inputs to a given output. For instance, a practitioner can use marketing expense and weather
forecast as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm will predict
new data.
There are two categories of supervised learning:
● Classification task
● Regression Task

4.2.2 Classification task

Imagine you want to predict the gender of a customer for a commercial. You will start
gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers; in this example it is
recorded as either male or female. The objective of the classifier is to assign a
probability of being male or female (i.e., the label) based on the information you have
collected (i.e., the features). Once the model has learned how to recognize male or female,
you can use new data to make a prediction. For instance, you have just received new
information from an unknown customer, and you want to know whether the customer is male or
female. If the classifier predicts male = 70%, it means the algorithm is 70% sure that this
customer is male and 30% sure that the customer is female.
The label can have two or more classes. The above example has only two classes, but if
a classifier needs to predict objects, it may have dozens of classes (e.g., glass, table,
shoes, etc., where each object represents a class).
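
The idea of a classifier that outputs a probability for each class can be illustrated with a small logistic regression trained by gradient descent. This is a sketch under stated assumptions: a single hypothetical numeric feature, two classes coded as 0 and 1, and arbitrary training settings. It is not the customer example itself, and logistic regression is only one of many classifiers that produce probabilities.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit a weight and a bias by batch gradient descent on the log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


# Hypothetical single feature centred around zero; label 1 is the positive class.
features = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
labels = [0, 0, 0, 1, 1, 1]
weight, bias = train_logistic_regression(features, labels)

probability = sigmoid(weight * 2.5 + bias)  # P(class 1 | x = 2.5)
predicted_class = 1 if probability >= 0.5 else 0
```

The probability of the other class is 1 - probability, so a prediction of 0.7 for class 1 corresponds to 0.3 for class 0, as in the 70% / 30% example above.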

4.2.3 Regression task

When the output is a continuous value, the task is a regression. For instance, a financial
analyst may need to forecast the value of a stock based on a range of features like equity,
previous stock performances, macroeconomics index. The system will be trained to
estimate the price of the stocks with the lowest possible error.
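
A minimal regression sketch, assuming a single input feature and an ordinary least-squares line fit. The variable names and the perfectly linear toy prices are hypothetical; real stock data would not be this regular.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy series in which price = 2 * day + 1.
days = [1, 2, 3, 4, 5]
prices = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(days, prices)
forecast = slope * 6 + intercept  # continuous prediction for day 6
```

Unlike a classifier, the model returns a continuous value rather than a class label.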

4.2.4 Unsupervised learning

In unsupervised learning, an algorithm explores input data without being given an explicit
output variable (e.g., explores customer demographic data to identify patterns)
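
As an illustration, the sketch below groups customers by age using k-means clustering, one common unsupervised method. The ages, the two starting centroids and the function name k_means_1d are assumptions made for this example; no output labels are supplied to the algorithm.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values with k-means, starting from the given centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical customer ages; no labels are given.
ages = [19, 21, 22, 38, 41, 43]
centroids, assignments = k_means_1d(ages, centroids=[20.0, 40.0])
```

The algorithm separates the younger and the older customers into two groups purely from the structure of the data.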


4.3 Application of Machine learning


Augmentation:

● Machine learning that assists humans with their day-to-day tasks, personally or
commercially, without having complete control of the output. It is used in different ways,
such as virtual assistants, data analysis and software solutions. Its primary use is to
reduce errors due to human bias.

Automation:

● Machine learning that works entirely autonomously in any field without the need for
any human intervention. For example, robots performing the essential process steps in
manufacturing plants.

Finance Industry

● Machine learning is growing in popularity in the finance industry. Banks are mainly
using ML to find patterns inside the data but also to prevent fraud.

Government organization

● The government makes use of ML to manage public safety and utilities. Take the
example of China with its massive face recognition. The government uses artificial
intelligence to prevent jaywalking.

Healthcare industry

● Healthcare was one of the first industries to use machine learning with image detection.

Marketing

● Broad use of AI is made in marketing thanks to abundant access to data. Before the age
of mass data, researchers developed mathematical tools like Bayesian analysis to
estimate the value of a customer. With the boom of data, the marketing department relies
on AI to optimize customer relationships and marketing campaigns.


Example of application of Machine Learning in Supply Chain


Machine learning gives terrific results for visual pattern recognition, opening up many
potential applications in physical inspection and maintenance across the entire supply chain
network. Unsupervised learning can quickly search for comparable patterns in a diverse
dataset. In turn, the machine can perform quality inspection throughout the logistics hub,
detecting shipments with damage and wear. For instance, IBM's Watson platform can determine
shipping container damage. Watson combines visual and systems-based data to track, report
and make recommendations in real time. In past years, stock managers relied
extensively on primary methods to evaluate and forecast inventory.

By combining big data and machine learning, better forecasting techniques have been
implemented (an improvement of 20 to 30% over traditional forecasting tools). In terms
of sales, this means an increase of 2 to 3% due to the potential reduction in inventory costs.

Example of Machine Learning Google Car


For example, everybody knows the Google car. The car has lasers on the roof which
tell it where it is relative to the surrounding area. It has radar in the front, which
informs the car of the speed and motion of all the cars around it. It uses all of that data
not only to figure out how to drive the car but also to predict what the drivers around the
car are going to do. What's impressive is that the car is processing almost a gigabyte of
data a second.

4.4 Deep Learning

Deep learning is a computer software that mimics the network of neurons in the brain. It is a
subset of machine learning and is called deep learning because it makes use of deep neural
networks. The machine uses different layers to learn from the data. The depth of the model
is represented by the number of layers in the model. Deep learning is the new state of the art
in terms of AI. In deep learning, the learning phase is done through a neural network.

4.5 Reinforcement Learning

Reinforcement learning is a subfield of machine learning in which systems are trained by
receiving virtual "rewards" or "punishments," essentially learning by trial and error. Google's
DeepMind has used reinforcement learning to beat a human champion at the game of Go.
Reinforcement learning is also used in video games to improve the gaming experience by
providing smarter bots.
Some of the most well-known algorithms are:

● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
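
To show how rewards and punishments drive learning, here is a minimal sketch of the tabular Q-learning update, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max Q(s', a') - Q(s, a)). The two-state table, the action names and the parameter values are hypothetical, and a full agent would also need an exploration policy and an environment loop.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      learning_rate=0.5, discount=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    best_next = max(q_table[next_state].values())
    target = reward + discount * best_next
    q_table[state][action] += learning_rate * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical two-state problem with all action values starting at zero.
q_table = {
    "start": {"left": 0.0, "right": 0.0},
    "goal": {"left": 0.0, "right": 0.0},
}

rewarded = q_learning_update(q_table, "start", "right",
                             reward=1.0, next_state="goal")
punished = q_learning_update(q_table, "start", "left",
                             reward=-1.0, next_state="goal")
```

After one rewarded and one punished step, the value of "right" in the start state rises and the value of "left" falls, so a greedy agent would now prefer "right".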


4.6 Applications/ Examples of deep learning applications

AI in Finance:

The financial technology sector has already started using AI to save time, reduce costs, and add
value. Deep learning is changing the lending industry by using more robust credit scoring. Credit
decision-makers can use AI for robust credit lending applications to achieve faster, more accurate
risk assessment, using machine intelligence to factor in the character and capacity of applicants.
Underwrite.ai is a fintech company providing an AI solution for lenders. It uses AI to detect
which applicants are more likely to pay back a loan. Their approach radically outperforms
traditional methods.

AI in HR:

Under Armour, a sportswear company, revolutionized hiring and modernized the candidate
experience with the help of AI. In fact, Under Armour reduced hiring time for its retail stores by
35%. Under Armour faced growing popularity back in 2012. They received, on average,
30,000 resumes a month. Reading all of those applications and starting the screening and
interview process was taking too long. The lengthy process of getting people hired and on-boarded
impacted Under Armour's ability to have its retail stores fully staffed, ramped and ready to
operate.
At that time, Under Armour had all of the 'must have' HR technology in place, such as transactional
solutions for sourcing, applying, tracking and onboarding, but those tools weren't useful enough.
Under Armour chose HireVue, an AI provider of HR solutions, for both on-demand and live
interviews. The results were striking: they managed to decrease the time to fill by 35%. In return,
they hired higher-quality staff.

AI in Marketing:

AI is a valuable tool for customer service management and personalization challenges.
Improved speech recognition in call-center management and call routing as a result of the
application of AI techniques allows a more seamless experience for customers.
For example, deep-learning analysis of audio allows systems to assess a customer's emotional
tone. If the customer is responding poorly to the AI chatbot, the system can reroute the
conversation to real, human operators that take over the issue.
Apart from the three examples above, AI is widely used in other sectors/industries.


Artificial Intelligence

With machine learning, you need less data to train the algorithm than with deep learning. Deep
learning requires an extensive and diverse set of data to identify the underlying structure. Besides,
machine learning provides a faster-trained model. The most advanced deep learning architectures
can take days to a week to train. The advantage of deep learning over machine learning is that it
can be highly accurate. You do not need to understand what features are the best representation of
the data; the neural network learns how to select critical features. In machine learning, you need
to choose for yourself what features to include in the model.


CHAPTER 5

TensorFlow

One of the most widely used deep learning libraries is Google's TensorFlow. Google uses
machine learning across many of its products, for example to improve search, translation,
image captioning and recommendations.
To give a concrete example, Google users can experience a faster and more refined search
with AI. If the user types a keyword in the search bar, Google provides a recommendation
about what could be the next word.
Google wants to use machine learning to take advantage of their massive datasets to give
users the best experience. Three different groups use machine learning:

● Researchers
● Data scientists
● Programmers.

They can all use the same toolset to collaborate with each other and improve their efficiency.
Google has not only large amounts of data but also very large computing infrastructure, so
TensorFlow was built to scale. TensorFlow is a library developed by the Google Brain Team
to accelerate machine learning and deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has
wrappers in several languages such as Python, C++ and Java.

5.1 TensorFlow Architecture

The TensorFlow workflow has three parts:

● Preprocess the data
● Build the model
● Train and evaluate the model

It is called TensorFlow because it takes its input as multidimensional arrays, also known
as tensors. You construct a sort of flowchart of operations (called a graph) that you
want to perform on that input. The tensor goes in at one end, flows through this
system of operations, and comes out the other end as output.
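The idea of a tensor flowing through a graph of operations can be illustrated with a minimal sketch. This example deliberately uses NumPy rather than TensorFlow itself; the operation names and the `run_graph` helper are hypothetical and only mimic the concept.

```python
import numpy as np

# A hypothetical "graph": an ordered list of (name, operation) pairs.
# This only mimics the idea of data flowing through operations;
# it is not TensorFlow's API.
graph = [
    ("scale", lambda t: t * 2.0),
    ("shift", lambda t: t + 1.0),
    ("sum_rows", lambda t: t.sum(axis=1)),
]


def run_graph(tensor, operations):
    """Pass the input tensor through every operation in order."""
    for _name, op in operations:
        tensor = op(tensor)
    return tensor


x = np.array([[1.0, 2.0], [3.0, 4.0]])  # a 2-D tensor (matrix) goes in
y = run_graph(x, graph)                  # a 1-D tensor comes out
print(y)
```

In TensorFlow the graph is built from the library's own operations and can be executed on CPUs or GPUs, but the input-to-output flow is the same in spirit.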


Where can TensorFlow run?

TensorFlow's hardware and software requirements can be classified into two phases:
Development Phase: This is when you train the model. Training is usually done on your
desktop or laptop.
Run Phase or Inference Phase: Once training is done, TensorFlow can be run on many
different platforms. You can run it on

● Desktop running Windows, macOS or Linux
● Cloud as a web service
● Mobile devices like iOS and Android

You can train a model on multiple machines and then, once you have the trained model,
run it on a different machine.
The model can be trained and used on GPUs as well as CPUs. GPUs were initially designed
for video games. In late 2010, Stanford researchers found that GPUs were also very good at
matrix operations and linear algebra, which makes them very fast for these kinds of
calculations. Deep learning relies on a lot of matrix multiplication. TensorFlow computes
matrix multiplications quickly because its core is written in C++. Although it is implemented
in C++, TensorFlow can be accessed and controlled from other languages, mainly Python.
Finally, a significant feature of TensorFlow is TensorBoard. TensorBoard enables
you to monitor graphically and visually what TensorFlow is doing.

5.2 List of Prominent Algorithms supported by TensorFlow

● Linear regression: tf.estimator.LinearRegressor
● Classification: tf.estimator.LinearClassifier
● Deep learning classification: tf.estimator.DNNClassifier
● Boosted tree regression: tf.estimator.BoostedTreesRegressor
● Boosted tree classification: tf.estimator.BoostedTreesClassifier

5.3 PYTHON OVERVIEW

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python
is designed to be highly readable. It uses English keywords frequently whereas other languages
use punctuation, and it has fewer syntactic constructions than other languages.

● Python is Interpreted: Python is processed at runtime by the interpreter. You do
not need to compile your program before executing it. This is similar to Perl and PHP.


● Python is Interactive: You can actually sit at a Python prompt and interact with
the interpreter directly to write your programs.

● Python is Object-Oriented: Python supports the object-oriented style or technique
of programming that encapsulates code within objects.

● Python is a Beginner's Language: Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from
simple text processing to WWW browsers to games.

5.3.1 History of Python

Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.

Python is derived from many other languages, including ABC, Modula-3, C, C++, ALGOL 68,
Smalltalk, the Unix shell, and other scripting languages.

Python is copyrighted. Its source code is available under the Python Software Foundation
License, an open-source license that is compatible with the GNU General Public License (GPL).

Python is now maintained by a core development team under the Python Software Foundation.
Guido van Rossum led the project for many years and stepped down as its lead in 2018.

5.3.2 Python Features


Python's features include:

Easy-to-learn: Python has few keywords, simple structure, and a clearly defined syntax.
This allows the student to pick up the language quickly.

Easy-to-read: Python code is clearly structured and easy on the eyes.

Easy-to-maintain: Python's source code is fairly easy to maintain.

A broad standard library: The bulk of Python's library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.

Interactive Mode: Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.


Portable: Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.

Extendable: You can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.

Databases: Python provides interfaces to all major commercial databases.

GUI Programming: Python supports GUI applications that can be created and ported to
many system calls, libraries, and windows systems, such as Windows MFC, Macintosh, and
the X Window system of Unix.

Scalable: Python provides a better structure and support for large programs than shell
scripting.

Apart from the above-mentioned features, Python has a long list of other good features. A few
are listed below:

● It supports functional and structured programming methods as well as OOP.

● It can be used as a scripting language or can be compiled to byte-code.

● It provides very high-level dynamic data types and supports dynamic type checking.

● It supports automatic garbage collection.

● It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.



5.4 Python Environment:

Python is available on a wide variety of platforms including Linux and Mac OS X. Let's
understand how to set up our Python environment.

Third-party Python libraries and topics covered:

● pandas

● NumPy

● scikit-learn (sklearn)

● seaborn

● Matplotlib

● Importing datasets


CHAPTER 6

Pandas

Pandas is quite a game changer when it comes to analyzing data with Python, and it is one
of the most preferred and widely used tools for data munging/wrangling, if not the most
used one. Pandas is an open-source library. What's cool about Pandas is that it takes data
(like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns
called a data frame, which looks very similar to a table in statistical software (think Excel or
SPSS, for example; people who are familiar with R will see similarities to R too). This is
much easier to work with than lists and/or dictionaries processed through
for loops or list comprehensions.

Installation and Getting Started

In order to "get" Pandas you need to install it. You also need a working Python installation
as a prerequisite (current Pandas releases require Python 3). Pandas also depends on other
libraries (like NumPy) and has optional dependencies (like Matplotlib for plotting). Therefore,
I think that the easiest way to get Pandas set up is to install it through a package like the
Anaconda distribution, "a cross platform distribution for data analysis and scientific computing."
In order to use Pandas in your Python IDE (Integrated Development Environment) like Jupyter
Notebook or Spyder (both of them come with Anaconda by default), you need to import the
Pandas library first. Importing a library means loading it into memory so that it is there
for you to work with. In order to import Pandas, all you have to do is run the following code:

● import pandas as pd
● import numpy as np

Usually you add the second part ('as pd') so you can access Pandas with
'pd.command' instead of needing to write 'pandas.command' every time you use it.
You import NumPy as well, because it is a very useful library for scientific
computing with Python. Now Pandas is ready for use! Remember, you need to do this
every time you start a new Jupyter Notebook, Spyder file etc.


6.1 Working with Pandas

Loading and Saving Data with Pandas


When you want to use Pandas for data analysis, you'll usually use it in one of three
different ways:
● Convert a Python list, dictionary or NumPy array to a Pandas data frame

● Open a local file using Pandas, usually a CSV file, but it could also be a delimited text
file (like TSV), Excel, etc.

● Open a remote file or database, like a CSV or a JSON file on a website through a URL, or
read from a SQL table/database

There are different commands for each of these options, but when you open a file, they
look like this:

● pd.read_filetype()

As I mentioned before, there are different file types Pandas can work with, so you
replace "filetype" with the actual file type (like csv). You give the path,
filename etc. inside the parentheses. Inside the parentheses you can also pass different
arguments that relate to how to open the file. There are numerous arguments, and in order
to know all of them you have to read the documentation (for example, the
documentation for pd.read_csv() contains all the arguments you can pass in this
Pandas command).

In order to convert a certain Python object (dictionary, lists etc) the basic command is:

● pd.DataFrame()

Inside the parentheses you specify the object(s) you are creating the data frame from.
This command also accepts additional arguments.
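As a minimal sketch of pd.DataFrame(), the following builds one data frame from a dictionary and one from a NumPy array. The column names and values are hypothetical sample data, used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column.
data = {
    "label": ["ham", "spam", "ham"],
    "length": [23, 148, 61],
}
df_from_dict = pd.DataFrame(data)

# A NumPy array has no column names, so they are passed explicitly.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=["a", "b"])

print(df_from_dict)
print(df_from_array.shape)
```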


You can also save a data frame you’re working with/on to different kinds of files (like CSV,
Excel, JSON and SQL tables). The general code for that is:

● df.to_filetype(filename)
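A short round trip shows saving and loading together. This is only a sketch: the file name messages.csv and the two sample rows are assumptions, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["see you soon", "win a prize now"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "messages.csv")  # hypothetical file name
    df.to_csv(path, index=False)   # save without writing the index column
    loaded = pd.read_csv(path)     # load the same file back

print(loaded)
```

Passing index=False avoids an extra unnamed column appearing when the file is read back.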

Viewing and Inspecting Data

Now that you've loaded your data, it's time to take a look. How does the data frame look?
Running the name of the data frame gives you the entire table, but you can also get
the first n rows with df.head(n) or the last n rows with df.tail(n). df.shape gives you
the number of rows and columns. df.info() gives you the index, datatype and
memory information. The command s.value_counts(dropna=False) lets you
view unique values and counts for a series (like a column). A very useful
command is df.describe(), which outputs summary statistics for numerical columns. It is
also possible to get statistics on the entire data frame or a series (a column etc.):

● df.mean() Returns the mean of each column
● df.corr() Returns the correlation between columns in a data frame
● df.count() Returns the number of non-null values in each data frame column
● df.max() Returns the highest value in each column
● df.min() Returns the lowest value in each column
● df.median() Returns the median of each column
● df.std() Returns the standard deviation of each column
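The inspection commands above can be tried on a small data frame. The sample values below are hypothetical and chosen only to keep the numbers easy to check.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "length": [20, 150, 35, 45],
})

first_two = df.head(2)                                 # first 2 rows
n_rows, n_cols = df.shape                              # (rows, columns)
label_counts = df["label"].value_counts(dropna=False)  # counts per unique value
mean_length = df["length"].mean()                      # mean of one column
summary = df.describe()                                # summary of numeric columns

print(n_rows, n_cols)
print(label_counts)
print(summary)
```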

Selection of Data
One of the things that is so much easier in Pandas is selecting the data you want, in
comparison to selecting a value from a list or a dictionary. You can select a column (df[col]),
which returns the column with label col as a Series, or a few columns (df[[col1, col2]]), which
returns the columns as a new DataFrame. You can select by position (s.iloc[0]) or by index
label (s.loc['index_one']). In order to select the first row you can use df.iloc[0,:], and in order
to select the first element of the first column you would run df.iloc[0,0]. These can also be
used in different combinations, so I hope it gives you an idea of the different selection and
indexing operations you can perform in Pandas.
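A minimal sketch of these selection commands follows. The index labels m1 to m3 and the column values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with explicit index labels.
df = pd.DataFrame(
    {"label": ["ham", "spam", "ham"], "length": [20, 150, 35]},
    index=["m1", "m2", "m3"],
)

col = df["length"]              # one column, returned as a Series
sub = df[["label", "length"]]   # several columns, returned as a DataFrame
first_row = df.iloc[0, :]       # first row, selected by position
first_cell = df.iloc[0, 0]      # first element of the first column
by_label = df.loc["m2"]         # row selected by its index label

print(first_cell)
print(by_label)
```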


Filter, Sort and Groupby


You can use different conditions to filter rows. For example, df[df['year'] > 1984]
gives you only the rows where the year column is greater than 1984. You can use & (and) or
| (or) to add different conditions to your filtering. This is also called boolean filtering.
It is possible to sort values by a certain column in ascending order using
df.sort_values(col1), and in descending order using df.sort_values(col2, ascending=False).
Furthermore, it's possible to sort values by col1 in ascending order and then col2 in
descending order by using df.sort_values([col1, col2], ascending=[True, False]). The last
command in this section is groupby. It involves splitting the data into groups based on some
criteria, applying a function to each group independently and combining the results into a data
structure. df.groupby(col) returns a groupby object for values from one column, while
df.groupby([col1, col2]) returns a groupby object for values from multiple columns.
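Filtering, sorting and grouping can be sketched together on one small data frame. The columns year, genre and sales are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "year": [1980, 1990, 2000, 1990],
    "genre": ["a", "b", "a", "b"],
    "sales": [10, 30, 20, 40],
})

# Boolean filtering: each condition goes in parentheses when combined.
recent = df[df["year"] > 1984]
recent_a = df[(df["year"] > 1984) & (df["genre"] == "a")]

# Sort by year ascending, then by sales descending within each year.
ordered = df.sort_values(["year", "sales"], ascending=[True, False])

# Split by genre, apply sum to each group, combine into one Series.
totals = df.groupby("genre")["sales"].sum()

print(ordered)
print(totals)
```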

Data Cleaning
Data cleaning is a very important step in data analysis. For example, we always check for
missing values in the data by running df.isnull() (or pd.isnull(df)), which checks for null
values and returns a boolean array (true for missing values and false for non-missing values).
In order to get the number of null/missing values per column, run df.isnull().sum().
df.notnull() is the opposite of df.isnull(). After you find the missing values you can get rid
of them by using df.dropna() to drop the rows or df.dropna(axis=1) to drop the columns. A
different approach would be to fill the missing values with other values by using df.fillna(x),
which fills the missing values with x (you can put there whatever you want), or
s.fillna(s.mean()) to replace all null values with the mean (mean can be replaced with almost
any function from the statistics section). It is sometimes necessary to replace values with
different values. For example, s.replace(1,'one') would replace all values equal to 1 with 'one'.
It's possible to do it for multiple values: s.replace([1,3],['one','three']) would replace all 1
with 'one' and all 3 with 'three'. You can also rename specific columns with
df.rename(columns={'old_name': 'new_name'}) or use df.set_index('column_one') to change the
index of the data frame.
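The cleaning steps can be sketched as follows. The data frame, its column names and the missing entries are hypothetical and exist only to show each call.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "old_name": ["x", "y", None],
})

missing_per_column = df.isnull().sum()            # count of nulls per column
dropped = df.dropna()                             # keep only complete rows
filled = df["score"].fillna(df["score"].mean())   # fill nulls with the mean
renamed = df.rename(columns={"old_name": "new_name"})

# Replace several values at once in a Series.
s = pd.Series([1, 2, 3])
replaced = s.replace([1, 3], ["one", "three"])

print(missing_per_column)
print(filled.tolist())
print(replaced.tolist())
```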

Join/Combine
The last set of basic Pandas commands are for joining or combining data frames or
rows/columns. The three commands are:

● pd.concat([df1, df2]): add the rows of df2 to the end of df1 (columns should be
identical). Older Pandas versions also offered df1.append(df2), which was removed in
Pandas 2.0.
● pd.concat([df1, df2], axis=1): add the columns of df2 to the right of df1 (rows should
be identical)
● df1.join(df2, on=col1, how='inner'): SQL-style join of the columns in df1 with the
columns in df2, where the values of col1 in df1 match the index of df2. how can be one of
'left', 'right', 'outer', 'inner'
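A minimal sketch of the three ways of combining data frames is given below. All frames and column names are hypothetical. Note that join with on= matches a column of the calling frame against the index of the other frame.

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"key": ["c", "d"], "x": [3, 4]})

# Stack rows: the columns of both frames are identical.
rows = pd.concat([df1, df2], ignore_index=True)

# Place columns side by side: both frames have the same row index.
extra = pd.DataFrame({"y": [10, 20]})
cols = pd.concat([df1, extra], axis=1)

# SQL-style inner join: df1["key"] is matched against the index of lookup.
lookup = pd.DataFrame({"z": [100, 300]}, index=["a", "c"])
joined = df1.join(lookup, on="key", how="inner")

print(rows)
print(cols)
print(joined)
```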


CHAPTER 7
Numpy
NumPy is a powerful library for array processing, with a large collection of
high-level mathematical functions to operate on these arrays. These functions fall into
categories like linear algebra, trigonometry, statistics, matrix manipulation, etc.

7.1 Getting NumPy

NumPy's main object is a homogeneous multidimensional array. Unlike Python's array class,
which only handles one-dimensional arrays, NumPy's ndarray class can handle
multidimensional arrays and provides more functionality. NumPy's dimensions are known
as axes. For example, a matrix has 2 dimensions, or 2 axes, namely rows and columns.
The number of dimensions is sometimes also called the rank of that particular array or matrix.
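As a small illustration of axes, the sketch below creates a 2-dimensional ndarray and applies functions along each axis. The values are arbitrary.

```python
import numpy as np

# A 2-D array: axis 0 runs down the rows, axis 1 runs across the columns.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

n_dims = matrix.ndim             # number of axes
shape = matrix.shape             # (rows, columns)
col_sums = matrix.sum(axis=0)    # collapse the rows: one sum per column
row_means = matrix.mean(axis=1)  # collapse the columns: one mean per row

print(n_dims, shape)
print(col_sums, row_means)
```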

7.2 Importing NumPy

NumPy is imported using the following command:

● import numpy as np

Note that np is the conventional alias, so that we don't need to write numpy every time.
NumPy is the basic library for scientific computations in Python. Understanding NumPy is
the first major step in the journey of machine learning and deep learning.

7.3 Sklearn

In Python, the scikit-learn library provides pre-built preprocessing functionality under
sklearn.preprocessing. The next step is feature extraction. Feature extraction is an attribute
reduction process. Unlike feature selection, which ranks the existing attributes according to
their predictive significance, feature extraction actually transforms the attributes. The
transformed attributes, or features, are linear combinations of the original attributes. Finally,
our models are trained using classifier algorithms. We use the nltk.classify module of the
Natural Language Toolkit (NLTK) library for Python. Part of the labelled dataset we gathered is
used to train the models; the rest of our labelled data is used to evaluate them. Several machine
learning algorithms were used to classify the preprocessed data. The chosen classifiers were
Decision Tree, Support Vector Machine and Random Forest. These algorithms are very popular in
text classification tasks.
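The three classifiers named above can be sketched with scikit-learn as follows. This is not the project's actual code: the toy messages are hypothetical, TF-IDF is assumed as the feature extraction step, and LinearSVC is used as one possible support vector machine.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; a real run would use the labelled SMS dataset.
train_texts = [
    "Win a free prize now, click here",
    "Free entry in a weekly cash draw, text WIN",
    "Urgent! Your account won a reward, call now",
    "Are we still meeting for lunch today?",
    "Can you send me the notes from class?",
    "I will call you when I reach home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["Claim your free cash prize now", "See you at lunch"]

classifiers = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, clf in classifiers.items():
    # TF-IDF (an assumption here) turns each message into a numeric vector.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(test_texts).tolist()

print(predictions)
```

With only six training messages the predictions are not meaningful; the point is the shape of the pipeline, where the same vectorizer feeds each classifier in turn.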


7.4 Seaborn
Data Visualization in Python

Data visualization is the discipline of trying to understand data by placing it in a visual
context, so that patterns, trends and correlations that might not otherwise be detected can
be exposed.

Python offers multiple great graphing libraries that come packed with lots of different
features. Whether you want to create interactive, live or highly customized plots, Python
has an excellent library for you.

To get a little overview here are a few popular plotting libraries:

● Matplotlib: low level, provides lots of freedom

● Pandas Visualization: easy to use interface, built on Matplotlib

● Seaborn: high-level interface, great default styles

● ggplot: based on R’s ggplot2, uses Grammar of Graphics

● Plotly: can create interactive plots

In this section, we will learn how to create basic plots using Matplotlib, Pandas visualization
and Seaborn as well as how to use some specific features of each library. This section
focuses on the syntax and not on interpreting the graphs.

7.5 Matplotlib

Matplotlib is one of the most popular Python plotting libraries. It is a low-level library with
a MATLAB-like interface which offers lots of freedom at the cost of having to write more code.

To install Matplotlib, either pip or conda can be used:

● pip install matplotlib
● conda install matplotlib


Matplotlib is especially good for creating basic graphs like line charts, bar charts,
histograms and many more. It can be imported by typing:

● import matplotlib.pyplot as plt

Line Chart
In Matplotlib we can create a line chart by calling the plot method. We can also plot multiple
columns in one graph, by looping through the columns we want, and plotting each column
on the same axis.
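A line chart with several columns on the same axis might look like the sketch below. The daily message counts are hypothetical, and the Agg backend is selected so the code also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "day": [1, 2, 3, 4],
    "ham": [30, 42, 38, 50],
    "spam": [5, 9, 7, 12],
})

fig, ax = plt.subplots()
# Loop through the columns and plot each one on the same axis.
for column in ["ham", "spam"]:
    ax.plot(df["day"], df[column], label=column)

ax.set_xlabel("day")
ax.set_ylabel("messages")
ax.set_title("Messages per day")
ax.legend()
```

In an interactive session, plt.show() displays the figure.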

Line Chart

Histogram
In Matplotlib we can create a Histogram using the hist method. If we pass it categorical data
like the points column from the wine-review dataset it will automatically calculate how
often each class occurs.
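A minimal histogram sketch follows. Because the wine-review dataset is not reproduced here, the example uses a short list of hypothetical message lengths instead.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical message lengths.
lengths = [20, 25, 31, 40, 44, 52, 150, 160]

fig, ax = plt.subplots()
# hist returns the count in each bin and the bin edges it computed.
counts, bin_edges, _patches = ax.hist(lengths, bins=4)
ax.set_xlabel("message length")
ax.set_ylabel("frequency")

print(counts)
print(bin_edges)
```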


Histogram

Bar Chart
A bar chart can be created using the bar method. The bar method does not automatically
calculate the frequency of a category, so we use the pandas value_counts function
to do this. The bar chart is useful for categorical data that doesn't have a lot of different
categories (fewer than 30), because otherwise it can get quite messy.
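The combination of value_counts and bar can be sketched like this. The label values are hypothetical sample data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data.
labels = pd.Series(["ham", "spam", "ham", "ham", "spam", "ham"])

# value_counts computes the frequency of each category.
counts = labels.value_counts()

fig, ax = plt.subplots()
bars = ax.bar(list(counts.index), counts.values)
ax.set_xlabel("label")
ax.set_ylabel("frequency")

print(counts)
```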

Bar-Chart


Pandas Visualization
Pandas is an open-source, high-performance, easy-to-use library providing data structures,
such as data frames, and data analysis tools like the visualization tools we use in this
section. Pandas Visualization makes it really easy to create plots out of a pandas data frame
or series. It also has a higher-level API than Matplotlib, and therefore we need less code
for the same results.

Pandas can be installed using either pip or conda:

● pip install pandas
● conda install pandas

Heatmap
A Heatmap is a graphical representation of data where the individual values contained in a
matrix are represented as colors. Heatmaps are perfect for exploring the correlation of
features in a dataset.
To get the correlation of the features inside a dataset we can call <dataset>.corr(), which is
a Pandas data frame method. This will give us the correlation matrix.

We can now use either Matplotlib or Seaborn to create the heatmap.

Matplotlib:
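One way to draw such a heatmap with Matplotlib is sketched below, using imshow on the correlation matrix. The feature names and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric features.
df = pd.DataFrame({
    "length": [20, 150, 35, 160],
    "digits": [0, 12, 1, 9],
    "links": [0, 2, 0, 1],
})

corr = df.corr()  # correlation matrix as a data frame

fig, ax = plt.subplots()
# Each cell of the matrix is drawn as a colour between -1 and 1.
image = ax.imshow(corr.values, cmap="viridis", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax)

print(corr)
```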


Heatmap without annotations


Data visualization is the discipline of trying to understand data by placing it in a visual
context, so that patterns, trends and correlations that might not otherwise be detected can be
exposed. Python offers multiple great graphing libraries that come packed with lots of
different features. In this section we looked at Matplotlib, Pandas visualization and Seaborn.

7.6 ANACONDA NAVIGATOR

Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda
distribution that allows you to launch applications and easily manage conda packages,
environments and channels without using command-line commands. Navigator can search
for packages on Anaconda Cloud or in a local Anaconda Repository. It is available for
Windows, macOS and Linux.

Why use Navigator?


In order to run, many scientific packages depend on specific versions of other packages.
Data scientists often use multiple versions of many packages, and use multiple environments
to separate these different versions.
The command line program conda is both a package manager and an environment manager,
to help data scientists ensure that each version of each package has all the dependencies it
requires and works correctly.
Navigator is an easy, point-and-click way to work with packages and environments without
needing to type conda commands in a terminal window. You can use it to find the packages
you want, install them in an environment, run the packages and update them, all inside
Navigator.

WHAT APPLICATIONS CAN I ACCESS USING NAVIGATOR?


The following applications are available by default in Navigator:
● JupyterLab
● Jupyter Notebook
● Qt Console
● Spyder
● VS Code
● Glueviz
● Orange 3 App
● Rodeo
● RStudio
Advanced conda users can also build their own Navigator applications.


How can I run code with Navigator?


The simplest way is with Spyder. From the Navigator Home tab, click Spyder, and write and
execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an increasingly
popular system that combines your code, descriptive text, output, images and interactive
interfaces into a single notebook file that is edited, viewed and used in a web browser.
What’s new in 1.9?
● Add support for Offline Mode for all environment related actions.
● Add support for custom configuration of main windows links.
● Numerous bug fixes and performance enhancements.


CHAPTER 8
TESTING

Software testing is an investigation conducted to provide stakeholders with information
about the quality of the product or service under test. Software testing also provides an
objective, independent view of the software to allow the business to appreciate and
understand the risks of software implementation. Test techniques include, but are not
limited to, the process of executing a program or application with the intent of finding
software bugs. Software testing can also be stated as the process of validating and verifying
that a software program/application/product:
● Meets the business and technical requirements that guided its design and
development.
● Works as expected and can be implemented with the same characteristics.

8.1 TESTING METHODS


➔ Functional Testing

Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user
manuals.
Functional testing is centered on the following items:
● Functions: Identified functions must be exercised.
● Output: Identified classes of software outputs must be exercised.
● Systems/Procedures: system should work properly

➔ Integration Testing

Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.

Test Case for Excel Sheet Verification:

In this project the dataset is stored in Excel sheet format, so any test case first needs to
check the Excel file. Classification then works on the respective columns of the dataset.
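Such a check on the dataset could be sketched as below. The column names label and message, the allowed label values and the verify_dataset helper are all assumptions for illustration; a real Excel file would first be loaded with pd.read_excel, while the sketch builds the data frame in memory.

```python
import pandas as pd

REQUIRED_COLUMNS = ["label", "message"]  # hypothetical column names
ALLOWED_LABELS = {"ham", "spam"}         # hypothetical label values


def verify_dataset(df):
    """Return a list of problems found; an empty list means the sheet passed."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    if df["message"].isnull().any():
        problems.append("empty message cells")
    unexpected = set(df["label"].dropna().unique()) - ALLOWED_LABELS
    if unexpected:
        problems.append(f"unexpected labels: {sorted(unexpected)}")
    return problems


# Hypothetical sheet contents that pass every check.
good = pd.DataFrame({"label": ["ham", "spam"], "message": ["hi", "win now"]})
print(verify_dataset(good))
```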


CHAPTER 9

RESULTS

Data mining is a process to extract knowledge from existing data. It is used as a tool in
banking and finance, in general, to discover useful information from the operational and
historical data to enable better decision-making. It is an interdisciplinary field, the confluence
of Statistics, Database technology, Information science, Machine learning, and Visualization.
It involves steps that include data selection, data integration, data transformation, data
mining, pattern evaluation, knowledge presentation.


CHAPTER 10
OUTPUT

Fig 10.1: Home page

Fig 10.2: Enter SMS

Fig 10.3: Ham Detect SMS

Fig 10.4: Spam Detect SMS


CONCLUSION

The research aims at predicting whether messages are spam or legitimate (ham), using efficient
machine learning algorithms and technologies with good accuracy. The training datasets
obtained provide enough insight for classifying the messages. Thus, the system
helps users identify whether their messages are spam or legitimate
with reasonably accurate predictions.



While the Internet has brought unprecedented convenience to many people for managing
their finances and investments, it also provides opportunities for conducting fraud on a
massive scale with little cost to the fraudsters. Fraudsters can manipulate users instead of
hardware/software systems, where barriers to technological compromise have increased
significantly. Phishing is one of the most widely practised Internet frauds. It focuses on the
theft of sensitive personal information such as passwords and credit card details. Phishing
attacks take two forms:

• attempts to deceive victims into revealing their secrets by pretending to be
trustworthy entities with a real need for such information
• attempts to obtain secrets by planting malware onto victims' machines.

The specific malware used in phishing attacks is a subject of research by the virus and
malware community and is not addressed in this thesis. Phishing attacks that proceed by
deceiving users are the research focus of this thesis, and the term 'phishing attack' will be
used to refer to this type of attack.

1.1 Objective:
The main objective of this paper is to detect benign, malicious and malware URLs with
the use of machine learning.

1.2 Motivation:
The motivation behind this system is to take precautions that protect users from these harmful
sites. It will make people more aware, in addition to building strong security mechanisms
which are able to detect phishing URLs and prevent them from reaching the user.

1.3 Existing System:


A poorly structured NN model may cause the model to underfit the training dataset. On the
other hand, excessive restructuring of the system to suit every single item in the training
dataset may cause the system to overfit. One possible solution to the underfitting
problem is to restructure the NN model by tuning some parameters, adding new
neurons to the hidden layer or sometimes adding a new layer to the network. A NN with a
small number of hidden neurons may not have sufficient representational power to
model the complexity and diversity inherent in the data. On the other hand, networks with
too many hidden neurons could overfit the data. However, at a certain stage the model can
no longer be improved, and the structuring process should therefore be terminated. Hence, an
acceptable error rate should be specified when creating any NN model, which is itself
a problem since it is difficult to determine the acceptable error rate a priori. For
instance, the model designer may set the acceptable error rate to a value that is unreachable,
which causes the model to get stuck in local minima, or to a value that could be improved
further.
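
The termination rule described above can be illustrated with a small sketch. It is only an
illustration under stated assumptions: the function name stopping_epoch, the patience
parameter and the error values are hypothetical and are not taken from the existing system.
Training stops either when the acceptable error rate is reached or when the error has not
improved for a number of consecutive epochs.

```python
def stopping_epoch(validation_errors, acceptable_error, patience=3):
    """Return (epoch_index, reason) describing when training would stop.

    validation_errors: error rate measured after each epoch (hypothetical values).
    acceptable_error:  target error rate chosen by the model designer.
    patience:          assumed number of non-improving epochs tolerated.
    """
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error <= acceptable_error:
            return epoch, "acceptable error reached"
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, "no further improvement"
    return len(validation_errors) - 1, "epoch budget exhausted"


# Hypothetical error curve that flattens out around 0.25.
errors = [0.40, 0.30, 0.25, 0.26, 0.27, 0.28]
unreachable_target = stopping_epoch(errors, acceptable_error=0.10)
loose_target = stopping_epoch(errors, acceptable_error=0.30)
```

With an unreachable target the loop only ends through the patience rule, while a loose target
ends training before the best error in the curve is reached, which are the two failure cases
named in the paragraph.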

Disadvantages:
1. It takes time to load the entire dataset.
2. The process is not accurate.
3. The analysis is slow.

1.4 Proposed System:


Lexical features are based on the observation that the URLs of many illegal sites look
different, compared with legitimate sites. Analysing lexical features enables us to capture
the property for classification purposes. We first distinguish the two parts of a URL: the
host name and the path, from which we extract bag-of-words (strings delimited by ‘/’, ‘?’,
‘.’, ‘=’, ‘-’ and ‘’).

We find that phishing websites tend to have longer URLs, more levels (delimited
by dots), more tokens in the domain and path, and longer tokens. Besides, phishing and malware
websites may pretend to be benign by including popular brand names as tokens
outside the second-level domain. Phishing and malware websites may also use an IP address
directly in order to hide a suspicious URL, which is very rare in benign cases. In addition,
phishing URLs are found to contain several suggestive word tokens
(confirm, account, banking, secure, ebayisapi, webscr, login, signin); we check for the presence
of these security-sensitive words and include the result as a binary value in our features.
Intuitively, malicious sites are less popular than benign ones. For this reason, site popularity can
be considered an important feature. The traffic rank feature is acquired from Alexa.com.
Host-based features are based on the observation that malicious sites are often registered
in less reputable hosting centres or regions.
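
The lexical features described above could be extracted along the following lines. This is a
minimal sketch that uses only the Python standard library; the function names, the feature
names and the simple IPv4 pattern are assumptions made for illustration, and the underscore
is added to the delimiter set as an assumption. Popularity and host-based features are not
covered because they need external data.

```python
import re
from urllib.parse import urlparse

# Delimiters named in the text; the underscore is an added assumption.
TOKEN_DELIMITERS = r"[/?.=\-_]"
# Suggestive word tokens listed in the text.
SENSITIVE_WORDS = ("confirm", "account", "banking", "secure",
                   "ebayisapi", "webscr", "login", "signin")
# Simplified check for a dotted IPv4 host (does not validate octet ranges).
IPV4_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def tokenize(text):
    """Split a URL part into non-empty tokens on the delimiters above."""
    return [tok for tok in re.split(TOKEN_DELIMITERS, text) if tok]


def lexical_features(url):
    """Return a dict of simple lexical features for one URL."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    path = parsed.path + ("?" + parsed.query if parsed.query else "")
    host_tokens = tokenize(host)
    path_tokens = tokenize(path)
    all_tokens = host_tokens + path_tokens
    lowered = url.lower()
    return {
        "url_length": len(url),
        "dot_count": host.count("."),
        "host_token_count": len(host_tokens),
        "path_token_count": len(path_tokens),
        "longest_token_length": max((len(t) for t in all_tokens), default=0),
        "host_is_ip": int(bool(IPV4_PATTERN.match(host))),
        "has_sensitive_word": int(any(w in lowered for w in SENSITIVE_WORDS)),
    }


# Hypothetical URLs used only to exercise the function.
suspicious = lexical_features("http://192.168.0.1/secure/login.php?id=1")
ordinary = lexical_features("https://www.example.com/index.html")
```

Each URL becomes one row of numeric columns, which is the form a classifier expects.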

Advantages:
1. All URLs in the dataset are labelled.
2. We train two supervised learning algorithms, random forest and support vector
machine, using the scikit-learn library.

CHAPTER 3
LITERATURE SURVEY

Title: Large-Scale Automatic Classification of Phishing Pages.

Author: Colin Whittaker, Brian Ryner, Marria Nazif.

Abstract:
Phishing websites, fraudulent sites that impersonate a trusted third party to gain access to
private data, continue to cost Internet users over a billion dollars each year. In this paper,
we describe the design and performance characteristics of a scalable machine learning
classifier we developed to detect phishing websites. We use this classifier to maintain
Google’s phishing blacklist automatically. Our classifier analyses millions of pages a day,
examining the URL and the contents of a page to determine whether or not a page is
phishing. Unlike previous work in this field, we train the classifier on a noisy dataset
consisting of millions of samples from previously collected live classification data. Despite
the noise in the training data, our classifier learns a robust model for identifying phishing
pages which correctly classifies more than 90% of phishing pages several weeks after
training concludes.

Title: “RUSBoost: Improving Classification Performance when Training Data is Skewed”.

Author: Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, Amri Napolitano.

Abstract:

Constructing classification models using skewed training data can be a challenging task.
We present RUSBoost, a new algorithm for alleviating the problem of class imbalance.
RUSBoost combines data sampling and boosting, providing a simple and efficient method
for improving classification performance when training data is imbalanced. In addition to
performing favorably when compared to SMOTEBoost (another hybrid sampling/boosting
algorithm), RUSBoost is computationally less expensive than SMOTEBoost and results
in significantly shorter model training times. This combination of simplicity, speed and
performance makes RUSBoost an excellent technique for learning from imbalanced data.

Title: Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within
Misuse Detection Context.

Author: Maheshkumar Sabhnani, Gursel Serpen.

Abstract:

A small subset of machine learning algorithms, mostly inductive learning based, applied to
the KDD 1999 Cup intrusion detection dataset resulted in dismal performance for the
user-to-root and remote-to-local attack categories, as reported in the recent literature. The
uncertainty over whether other machine learning algorithms can demonstrate better
performance than the ones already employed constitutes the motivation for the
study reported herein. Specifically, whether certain algorithms perform better
for certain attack classes and, consequently, whether a multi-expert classifier design can
deliver the desired performance is of high interest. This paper evaluates the performance of
a comprehensive set of pattern recognition and machine learning algorithms on the four
attack categories found in the KDD 1999 Cup intrusion detection dataset. Results
of the simulation study implemented to that effect indicate that certain classification
algorithms perform better for certain attack categories.

Title : Learning Fast Classifiers for Image Spam.


Author : Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach.
Abstract :
Recently, spammers have proliferated “image spam”, emails which contain the text of the
spam message in a human readable image instead of the message body, making detection
by conventional content filters difficult. New techniques are needed to filter these messages.
Our goal is to automatically classify an image directly as being spam or ham. We present
features that focus on simple properties of the image, making classification as fast as
possible. Our evaluation shows that they accurately classify spam images in excess of 90% and
up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm
that selects features for classification based on their speed as well as predictive power. This
technique produces an accurate system that runs in a tiny fraction of the time. Finally, we
introduce Just in Time (JIT) feature extraction, which creates features at classification time
as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further
increases system speed. This paper makes image spam classification practical by providing
both high accuracy features and a method to learn fast classifiers.

Title : Using Syntactic Features for Phishing Detection.


Author : Gilchan Park, Julia M. Taylor.
Abstract :
This paper reports on the comparison of the subject and object of verbs in their usage
between phishing emails and legitimate emails. The purpose of this research is to explore
whether the syntactic structures and subjects and objects of verbs can be distinguishable
features for phishing detection. To achieve the objective, we have conducted two series of
experiments: the syntactic similarity for sentences, and the subject and object of verb
comparison. The results of the experiments indicated that both features can be used for
some verbs, but more work has to be done for others.

Title: Detecting Phishing Emails the Natural Language Way.


Author: Rakesh Verma, Narasimha Shashidhar, and Nabil Hossain.
Abstract:
Phishing causes billions of dollars in damage every year and poses a serious threat to the
Internet economy. Email is still the most commonly used medium to launch phishing
attacks. In this paper, we present a comprehensive natural language based scheme to detect
phishing emails using features that are invariant and fundamentally characterize phishing.
Our scheme utilizes all the information present in an email, namely, the header, the links
and the text in the body. Although it is obvious that a phishing email is designed to elicit an
action from the intended victim, none of the existing detection schemes use this fact to
identify phishing emails. Our detection protocol is designed specifically to distinguish
between “actionable” and “informational” emails. To this end, we incorporate natural
language techniques in phishing detection. We also utilize contextual information, when
available, to detect phishing: we study the problem of phishing detection within the
contextual confines of the user’s email box and demonstrate that context plays an important
role in detection. To the best of our knowledge, this is the first scheme that utilizes natural
language techniques and contextual information to detect phishing. We show that our
scheme outperforms existing phishing detection schemes. Finally, our protocol detects
phishing at the email level rather than detecting masqueraded websites. This is crucial to
prevent the victim from clicking any harmful links in the email. Our implementation, called
PhishNet-NLP, operates between a user’s mail transfer agent (MTA) and mail user agent
(MUA) and processes each arriving email for phishing attacks even before it reaches the
inbox.

Title : iTrustPage: A User-Assisted Anti-Phishing Tool.


Author : Troy Ronda, Stefan Saroiu, Alec Wolman.
Abstract :
Despite the many solutions proposed by industry and the research community to address
phishing attacks, this problem continues to cause enormous damage. Because of our
inability to deter phishing attacks, the research community needs to develop new
approaches to anti-phishing solutions. Most of today’s anti-phishing technologies focus on
automatically detecting and preventing phishing attacks. While automation makes
anti-phishing tools user-friendly, automation also makes them suffer from false positives, false
negatives, and various practical hurdles. As a result, attackers often find simple ways to
escape automatic detection. This paper presents iTrustPage, an anti-phishing tool that
does not rely completely on automation to detect phishing. Instead, iTrustPage relies on
user input and external repositories of information to prevent users from filling out phishing
Web forms. With iTrustPage, users help to decide whether or not a Web page is legitimate.
Because iTrustPage is user-assisted, iTrustPage avoids the false positives and the false
negatives associated with automatic phishing detection. We implemented iTrustPage as a
downloadable extension to FireFox. After being featured on the Mozilla website for FireFox
extensions, iTrustPage was downloaded by more than 5,000 users in a two week period.
We present an analysis of our tool’s effectiveness and ease of use based on our examination
of usage logs collected from the 2,050 users who used iTrustPage for more than two weeks.
Based on these logs, we find that iTrustPage disrupts users on fewer than 2% of the pages
they visit, and the number of disruptions decreases over time.

Title : Phishing Environments, Techniques, and Countermeasures: A Survey.

Author : Ahmed Aleroud, Lina Zhou.
Abstract:
Phishing has become an increasing threat in online space, largely driven by the evolving
web, mobile, and social networking technologies. Previous phishing taxonomies have
mainly focused on the underlying mechanisms of phishing but ignored the emerging
attacking techniques, targeted environments, and countermeasures for mitigating new
phishing types. This survey investigates phishing attacks and anti-phishing techniques
developed not only in traditional environments such as e-mails and websites, but also in
new environments such as mobile and social networking sites. Taking an integrated view of
phishing, we propose a taxonomy that involves attacking techniques, countermeasures,
targeted environments and communication media. The taxonomy will not only provide
guidance for the design of effective techniques for phishing detection and prevention in
various types of environments, but also facilitate practitioners in evaluating and selecting
tools, methods, and features for handling specific types of phishing problem.

Title : Phishing Detection: A Literature Survey.


Author : Mahmoud Khonji, Youssef Iraqi.
Abstract :
This article surveys the literature on the detection of phishing attacks. Phishing attacks target
vulnerabilities that exist in systems due to the human factor. Many cyber attacks are spread
via mechanisms that exploit weaknesses found in end users, which makes users the weakest
element in the security chain. The phishing problem is broad and no single silver-bullet
solution exists to mitigate all the vulnerabilities effectively, thus multiple techniques are
often implemented to mitigate specific attacks. This paper aims at surveying many of the
recently proposed phishing mitigation techniques. A high-level overview of various
categories of phishing mitigation techniques is also presented, such as detection, offensive
defense, correction, and prevention, which we believe is critical to present where the phishing
detection techniques fit in the overall mitigation process.



Detecting Phishing Attacks

Title : Online Phishing Classification Using Adversarial Data Mining and Signaling Games.

Author : Gaston L’Huillier, Richard Weber, Nicolas Figueroa.


Abstract :
In adversarial systems, the performance of a classifier decreases after it is deployed, as the
adversary learns to defeat it. Recently, adversarial data mining was introduced, where the
classification problem is viewed as a game mechanism between an adversary and an intelligent
and adaptive classifier. Over the last years, phishing fraud through malicious email messages
has been a serious threat that affects global security and economy, where traditional spam
filtering techniques have shown to be ineffective. In this domain, using dynamic games of
incomplete information, a game theoretic data mining framework is proposed in order to build
an adversary-aware classifier for phishing fraud detection. To build the classifier, an online
version of the Weighted Margin Support Vector Machines with a game theoretic prior
knowledge function is proposed. In this paper, a new content based feature extraction technique
for phishing filtering is described. Experiments show that the proposed classifier is highly
competitive compared with previously proposed online classification algorithms in this
adversarial environment, machine learning techniques over extracted features.

CHAPTER 4
SYSTEM ARCHITECTURE

4.1 Hardware and Software Requirements:


Hardware:
1. Windows 7, 8 or 10 (64-bit)
2. RAM: 4 GB

Software:
1. Data Set
2. Python

4.2 Methodology
• Data Collection
• Data Pre-Processing
• Feature Extraction
• Evaluation Model

4.2.1 Data Collection:

The data used in this paper is a set of records. This step is concerned with selecting the subset of
all available data that you will be working with. ML problems start with data, preferably lots
of data (examples or observations) for which you already know the target answer. Data for
which you already know the target answer is called labelled data.

4.2.2 Data Pre-Processing:

Organize your selected data by formatting, cleaning and sampling from it.

Three common data pre-processing steps are:

1. Formatting

2. Cleaning

3. Sampling

Formatting: The data you have selected may not be in a format that is suitable for you to work
with. The data may be in a relational database and you would like it in a flat file, or the data
may be in a proprietary file format and you would like it in a relational database or a text file.

Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances
that are incomplete and do not carry the data you believe you need to address the problem.
These instances may need to be removed. Additionally, there may be sensitive information in
some of the attributes and these attributes may need to be anonymized or removed from the
data entirely.

Sampling: There may be far more selected data available than you need to work with. More
data can result in much longer running times for algorithms and larger computational and
memory requirements. You can take a smaller representative sample of the selected data that
may be much faster for exploring and prototyping solutions before considering the whole
dataset.
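
The three pre-processing steps can be sketched with the Python standard library as follows.
This is only an illustration: the in-memory CSV text, the column names url and label, and the
helper names are assumptions and do not describe the project's actual dataset.

```python
import csv
import io
import random

# Hypothetical raw data; two rows are incomplete on purpose.
RAW_CSV = """url,label
http://example.com/index.html,benign
http://203.0.113.7/login,phishing
,phishing
http://example.org/about,
http://example.net/secure/account,phishing
"""


def load_rows(text):
    """Formatting: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows, required=("url", "label")):
    """Cleaning: drop rows where any required field is missing or blank."""
    return [r for r in rows if all((r.get(f) or "").strip() for f in required)]


def sample_rows(rows, k, seed=0):
    """Sampling: draw up to k rows without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


rows = load_rows(RAW_CSV)
cleaned = clean_rows(rows)
subset = sample_rows(cleaned, k=2)
```

Fixing the random seed keeps the sample reproducible, which helps when prototyping on a
subset before moving to the whole dataset.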

4.2.3 Feature Extraction:

The next step is feature extraction, an attribute extension step in which we create additional
columns from the URLs. Finally, our models are trained using a classifier algorithm on part of
the labelled dataset gathered. The rest of our labelled data is used to evaluate the models. A
machine learning algorithm is used to classify the pre-processed data. The chosen classifier is
Random Forest.
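
Splitting the labelled data into a training part and an evaluation part can be sketched as
below. The function name holdout_split, the 20% test fraction and the seed are assumptions
chosen for illustration; the report does not state the split that was actually used.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle the records and return (train, test) lists.

    test_fraction and seed are illustrative defaults, not values from the report.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labelled records: (feature_row, label) pairs would go here.
records = list(range(10))
train, test = holdout_split(records)
```

The model is fitted on train only, so that test plays the role of data the model has not seen.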

4.2.4 Evaluation Model:

Model Evaluation is an integral part of the model development process. It helps to find the best
model that represents our data and how well the chosen model will work in the future.

To avoid overfitting, a test set (not seen by the model) is used to evaluate model
performance. The performance of each classification model is estimated based on its averaged
results, and the results are presented in visual form, with the classified data shown as graphs.
Accuracy is defined as the percentage of correct predictions for the test data. It can be
calculated easily by dividing the number of correct predictions by the number of total
predictions. We compare the actual and predicted outputs and calculate accuracy
as:

Accuracy = (number of correct predictions) / (total number of predictions)
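
The accuracy calculation described above can be written in a few lines. The label values in
the example are hypothetical and only show how the actual and predicted outputs are compared.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical test-set labels and model outputs.
actual = ["benign", "phishing", "benign", "malware", "phishing"]
predicted = ["benign", "phishing", "phishing", "malware", "phishing"]
score = accuracy(actual, predicted)  # 4 correct out of 5 predictions
```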

4.3 UML Diagrams


4.3.1 Use Case Diagram:

4.3.2 Class Diagram:

4.3.3 Sequence Diagram:

4.3.4 Activity Diagram:

CHAPTER 5
ALGORITHM

5.1 Random Forest:

Random forest is a type of supervised machine learning algorithm based on ensemble learning.
Ensemble learning is a type of learning where you join different types of algorithms, or the same
algorithm multiple times, to form a more powerful prediction model. The random forest
algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting
in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used
for both regression and classification tasks.

5.2 How Random Forest Works:

The following are the basic steps involved in performing the random forest algorithm:

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. For a classification problem, each tree in the forest predicts the category to which the new
record belongs. Finally, the new record is assigned to the category that wins the majority
vote.
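
The four steps can be sketched in plain Python as follows. This is a deliberately simplified
illustration and not the scikit-learn implementation: each "tree" is a one-split stump on a
single randomly chosen feature, whereas a real random forest grows deeper trees. The function
names, the toy data (URL length and dot count) and the default of 25 trees are assumptions.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, rng):
    """Step 2 (simplified): fit a one-split tree on one random feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the feature is constant in this sample
        label = majority(labels)
        return (feature, rows[0][feature], label, label)
    return (feature,) + best[1:]


def predict_stump(stump, row):
    """Predict the label of one row with a single stump."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Steps 1 to 3: build n_trees stumps, each on its own random sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: pick N random records (sampling with replacement).
        picks = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in picks]
        sample_labels = [labels[i] for i in picks]
        forest.append(fit_stump(sample_rows, sample_labels, rng))
    return forest


def predict_forest(forest, row):
    """Step 4: every tree votes and the majority label wins."""
    return majority([predict_stump(stump, row) for stump in forest])


# Toy data: (url_length, dot_count) pairs with hypothetical labels.
rows = [(20, 1), (25, 1), (30, 2), (22, 1), (80, 4), (95, 5), (70, 4), (88, 6)]
labels = ["benign"] * 4 + ["phishing"] * 4
forest = fit_forest(rows, labels)
short_url_label = predict_forest(forest, (15, 1))
long_url_label = predict_forest(forest, (120, 7))
```

Because every stump sees a different random sample, an unusual sample affects only a few
votes, and the majority vote remains stable.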

5.3 Advantages Of Using Random Forest:


Pros of using random forest for classification and regression:

1. The random forest algorithm is less biased than a single tree, since there are multiple
trees and each tree is trained on a subset of the data. Basically, the random forest algorithm
relies on the power of "the crowd"; therefore, the overall biasedness of the algorithm is reduced.
2. This algorithm is very stable. Even if a new data point is introduced in the dataset, the
overall algorithm is not affected much, since new data may impact one tree, but it is very
hard for it to impact all the trees.
3. The random forest algorithm works well when you have both categorical and numerical
features.
4. The random forest algorithm also works well when the data has missing values or has not
been scaled well.

5.5 Domain Specification:

Motion of all the cars around it. It uses all of that data to figure out not only how to drive the
car but also to figure out and predict what potential drivers around the car are going to do.
What's impressive is that the car is processing almost a gigabyte a second of data.

Deep Learning:
Deep learning is computer software that mimics the network of neurons in a brain. It is a
subset of machine learning and is called deep learning because it makes use of deep neural
networks. The machine uses different layers to learn from the data. The depth of the model is
represented by the number of layers in the model. Deep learning is the new state of the art in
terms of AI. In deep learning, the learning phase is done through a neural network.

Reinforcement Learning:

Reinforcement learning is a subfield of machine learning in which systems are trained by
receiving virtual "rewards" or "punishments," essentially learning by trial and error. Google's
DeepMind has used reinforcement learning to beat a human champion at the game of Go.
Reinforcement learning is also used in video games to improve the gaming experience by
providing smarter bots.

Some of the most famous algorithms are:

● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)

Applications/ Examples of deep learning applications

AI in Finance: The financial technology sector has already started using AI to save time,
reduce costs, and add value. Deep learning is changing the lending industry by using more
robust credit scoring. Credit decision-makers can use AI for robust credit lending applications
to achieve faster, more accurate risk assessment, using machine intelligence to factor in the
character and capacity of applicants.
Underwrite.ai is a fintech company providing an AI solution for credit decision-makers.
Underwrite.ai uses AI to detect which applicants are more likely to pay back a loan. Their
approach radically outperforms traditional methods.

AI in HR: Under Armour, a sportswear company, revolutionized hiring and modernized the
candidate experience with the help of AI. In fact, Under Armour reduced hiring time for its
retail stores by 35%. Under Armour faced growing popularity back in 2012. They
received, on average, 30,000 resumes a month. Reading all of those applications and starting
the screening and interview process was taking too long. The lengthy process to get people
hired and on-boarded impacted Under Armour's ability to have their retail stores fully staffed,
ramped and ready to operate.

At that time, Under Armour had all of the 'must have' HR technology in place, such as
transactional solutions for sourcing, applying, tracking and onboarding, but those tools weren't
useful enough. Under Armour chose HireVue, an AI provider of HR solutions, for both
on-demand and live interviews. The results were impressive; they managed to decrease
the time to fill by 35%. In return, they hired higher-quality staff.

AI in Marketing: AI is a valuable tool for customer service management and
personalization challenges. Improved speech recognition in call-center management and call
routing as a result of the application of AI techniques allows a more seamless experience for
customers.

For example, deep-learning analysis of audio allows systems to assess a customer's emotional
tone. If the customer is responding poorly to the AI chatbot, the system can reroute the
conversation to real, human operators who take over the issue.

Apart from the three examples above, AI is widely used in other sectors/industries.

Fig: Artificial Intelligence, Machine Learning and Deep Learning

Difference between Machine Learning and Deep Learning

Data dependencies
    Machine Learning: Excellent performance on a small/medium dataset.
    Deep Learning: Excellent performance on a big dataset.

Hardware dependencies
    Machine Learning: Works on a low-end machine.
    Deep Learning: Requires a powerful machine, preferably with a GPU: DL performs a
    significant amount of matrix multiplication.

Feature engineering
    Machine Learning: Need to understand the features that represent the data.
    Deep Learning: No need to understand the best feature that represents the data.

Execution time
    Machine Learning: From a few minutes to hours.
    Deep Learning: Up to weeks. A neural network needs to compute a significant
    number of weights.

Interpretability
    Machine Learning: Some algorithms are easy to interpret (logistic, decision tree), some are
    almost impossible (SVM, XGBoost).
    Deep Learning: Difficult to impossible.

When to use ML or DL?


In the table below, we summarize the difference between machine learning and deep learning.

                        Machine learning    Deep learning
Training dataset        Small               Large
Choose features         Yes                 No
Number of algorithms    Many                Few
Training time           Short               Long

With machine learning, you need less data to train the algorithm than with deep learning. Deep
learning requires an extensive and diverse set of data to identify the underlying structure.
Besides, machine learning provides a faster-trained model. The most advanced deep learning
architectures can take days to a week to train. The advantage of deep learning over machine
learning is that it is highly accurate. You do not need to understand what features are the best
representation of the data; the neural network learns how to select critical features. In machine
learning, you need to choose for yourself what features to include in the model.

TensorFlow

The most famous deep learning library in the world is Google's TensorFlow. Google
uses machine learning in all of its products to improve the search engine, translation, image
captioning or recommendations.
To give a concrete example, Google users can experience a faster and more refined search
with AI. If the user types a keyword in the search bar, Google provides a recommendation about
what could be the next word.

Google wants to use machine learning to take advantage of their massive datasets to give users
the best experience. Three different groups use machine learning:

● Researchers
● Data scientists
● Programmers

They can all use the same toolset to collaborate with each other and improve their efficiency.

Google does not just have any data; they have the world's most massive computer, so
TensorFlow was built to scale. TensorFlow is a library developed by the Google Brain Team to
accelerate machine learning and deep neural network research.

It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has
several wrappers in several languages like Python, C++ or Java.


TensorFlow Architecture

TensorFlow architecture works in three parts:

● Pre-processing the data
● Build the model
● Train and estimate the model

It is called TensorFlow because it takes input as a multi-dimensional array, also known as a
tensor. You can construct a sort of flowchart of operations (called a Graph) that you want to
perform on that input. The input goes in at one end, flows through this system of
multiple operations and comes out the other end as output.

This is why it is called TensorFlow: the tensor goes in, flows through a list of
operations, and then comes out the other side.
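
The idea of a tensor flowing through a list of operations can be illustrated without
TensorFlow itself. The sketch below uses plain nested Python lists as a stand-in for a
two-dimensional tensor; the operation names and the run_graph helper are assumptions made for
illustration and are not part of the TensorFlow API.

```python
def multiply(factor):
    """Operation: scale every element of the tensor."""
    return lambda tensor: [[x * factor for x in row] for row in tensor]


def add_bias(bias):
    """Operation: add a constant to every element of the tensor."""
    return lambda tensor: [[x + bias for x in row] for row in tensor]


def relu():
    """Operation: replace negative elements with zero."""
    return lambda tensor: [[max(0.0, x) for x in row] for row in tensor]


def run_graph(operations, tensor):
    """Feed the tensor in at one end and pass it through each operation in order."""
    for operation in operations:
        tensor = operation(tensor)
    return tensor


# A tiny "graph" of three operations applied to a 2x2 input.
graph = [multiply(2.0), add_bias(-3.0), relu()]
output = run_graph(graph, [[1.0, 2.0], [3.0, -1.0]])
```

Here the graph is a simple chain; in TensorFlow the graph can branch and merge, but the
principle of input flowing through operations to an output is the same.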

Where can TensorFlow run?

TensorFlow hardware and software requirements can be classified into:

Development Phase: This is when you train the model. Training is usually done on your desktop
or laptop.

Run Phase or Inference Phase: Once training is done TensorFlow can be run on many different
platforms. You can run it on
● Desktop running Windows, macOS or Linux
● Cloud as a web service
● Mobile devices like iOS and Android

You can train it on multiple machines then you can run it on a different machine, once you have
the trained model.

The model can be trained and used on GPUs as well as CPUs. GPUs were initially designed
for video games. In late 2010, Stanford researchers found that GPUs were also very good at
matrix operations and algebra, which makes them very fast for these kinds of
calculations. Deep learning relies on a lot of matrix multiplication. TensorFlow is very fast at
computing matrix multiplication because it is written in C++. Although it is implemented
in C++, TensorFlow can be accessed and controlled from other languages, mainly Python.

Finally, a significant feature of TensorFlow is TensorBoard, which makes it possible to
monitor graphically and visually what TensorFlow is doing.

List of Prominent Algorithms supported by TensorFlow

● Linear regression: tf.estimator.LinearRegressor
● Classification: tf.estimator.LinearClassifier
● Deep learning classification: tf.estimator.DNNClassifier
● Deep learning wide and deep: tf.estimator.DNNLinearCombinedClassifier
● Boosted tree regression: tf.estimator.BoostedTreesRegressor
● Boosted tree classification: tf.estimator.BoostedTreesClassifier

CHAPTER 6

REQUIREMENTS ANALYSIS

6.1 Software Requirements

● Anaconda Navigator
● Python
● Python libraries
    o NumPy
    o Pandas
    o Matplotlib
    o Scikit-learn (sklearn)
    o Seaborn

6.1.1 Anaconda Navigator:


Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda
distribution that allows you to launch applications and easily manage conda packages,
environments and channels without using command-line commands. Navigator can search for
packages on Anaconda Cloud or in a local Anaconda Repository. It is available for Windows,
macOS and Linux.

Why use Navigator?

In order to run, many scientific packages depend on specific versions of other packages.
Data scientists often use multiple versions of many packages, and use multiple environments
to separate these different versions.
The command line program conda is both a package manager and an environment manager, to
help data scientists ensure that each version of each package has all the dependencies it requires
and works correctly.
Navigator is an easy, point-and-click way to work with packages and environments without
needing to type conda commands in a terminal window. You can use it to find the packages you
want, install them in an environment, run the packages and update them, all inside Navigator.

What Applications Can I Access Using Navigator?


The following applications are available by default in Navigator:
● JupyterLab
● Jupyter Notebook
● QTConsole
● Spyder
● VSCode
● Glueviz
● Orange 3 App
● Rodeo
● RStudio
Advanced conda users can also build their own Navigator applications.

How can I run code with Navigator?

The simplest way is with Spyder. From the Navigator Home tab, click Spyder, and write and
execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an increasingly
popular system that combines your code, descriptive text, output, images and interactive
interfaces into a single notebook file that is edited, viewed and used in a web browser.

What’s new in 1.9?

● Add support for Offline Mode for all environment related actions.
● Add support for custom configuration of main windows links.
● Numerous bug fixes and performance enhancements.

6.1.2 Python Overview:


Python is a high-level, interpreted, interactive and object-oriented scripting language. Python
is designed to be highly readable. It frequently uses English keywords where other languages
use punctuation, and it has fewer syntactical constructions than other languages.

Python is Interpreted: Python is processed at runtime by the interpreter. You do not
need to compile your program before executing it. This is similar to PERL and PHP.

Python is Interactive: You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.

Python is Object-Oriented: Python supports the Object-Oriented style or technique of
programming that encapsulates code within objects.

Python is a Beginner's Language: Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from simple
text processing to WWW browsers to games.

6.1.2.1 History of Python:

Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.

Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, Unix shell, and other scripting languages.

Python is copyrighted. Its source code is available under the Python Software Foundation
License, an open-source license that is compatible with the GNU General Public License (GPL).

Python is now maintained by a core development team under the Python Software Foundation.
Guido van Rossum, who long held a vital role in directing its progress, stepped down as
project lead in 2018.

6.1.2.2 Python Features:


Python's features include:

Easy-to-learn: Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.

Easy-to-read: Python code is more clearly defined and visible to the eyes.

Easy-to-maintain: Python's source code is fairly easy-to-maintain.

A broad standard library: The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.

Interactive Mode: Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.

Portable: Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.

Extendable: You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.

Databases: Python provides interfaces to all major commercial databases.

GUI Programming: Python supports GUI applications that can be created and ported
to many system calls, libraries, and windows systems, such as Windows MFC,
Macintosh, and the X Window system of Unix.

Scalable: Python provides a better structure and support for large programs than shell
scripting.

Apart from the above-mentioned features, Python has a long list of good features, a few of
which are listed below:

It supports functional and structured programming methods as well as OOP.

It can be used as a scripting language or can be compiled to byte-code for building
large applications.

It provides very high-level dynamic data types and supports dynamic type checking.

It supports automatic garbage collection.

It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

6.3 Python Environment:

Python is available on a wide variety of platforms including Linux and Mac OS X. Let's understand
how to set up our Python environment.

Third-party Python libraries used in this project


● Pandas

● Numpy

● Sklearn

● seaborn

● matplotlib

● Importing Datasets

6.3.1 Pandas

Pandas is quite a game changer when it comes to analyzing data with Python and it is one of
the most preferred and widely used tools in data munging/wrangling, if not THE most used one.
Pandas is an open-source library. What’s cool about Pandas is that it takes data (like a CSV or
TSV file, or a SQL database) and creates a Python object with rows and columns called a data
frame that looks very similar to a table in statistical software (think Excel or SPSS, for example;
people who are familiar with R will see similarities to R too). This is so much easier to work with
in comparison to working with lists and/or dictionaries through for loops or list comprehension.

Installation and Getting Started


In order to “get” Pandas you would need to install it. You would also need a recent Python 3
release as a prerequisite, since current Pandas versions no longer support Python 2.7. Pandas
depends on other libraries (like NumPy) and has optional dependencies (like Matplotlib for
plotting). Therefore, I think that the easiest way to get Pandas set up is to install it through a
package like the Anaconda distribution, “a cross platform distribution for data analysis and
scientific computing.”

In order to use Pandas in your Python IDE (Integrated Development Environment) like Jupyter
Notebook or Spyder (both of them come with Anaconda by default), you need to import the
Pandas library first. Importing a library means loading it into the memory and then it’s there
for you to work with. In order to import Pandas all you have to do is run the following code:

● import pandas as pd
● import numpy as np

Usually you would add the second part (‘as pd’) so you can access Pandas with ‘pd.command’
instead of needing to write ‘pandas.command’ every time you need to use it. Also, you would
import numpy as well, because it is a very useful library for scientific computing with Python.
Now Pandas is ready for use! Remember, you would need to do it every time you start a new
Jupyter Notebook, Spyder file etc.

Working with Pandas

Loading and Saving Data with Pandas


When you want to use Pandas for data analysis, you’ll usually use it in one of three different
ways:
● Convert a Python list, dictionary or NumPy array to a Pandas data frame

● Open a local file using Pandas, usually a CSV file, but could also be a delimited text file
(like TSV), Excel, etc.

● Open a remote file or database like a CSV or a JSON on a website through a URL or read
from a SQL table/database

There are different commands to each of these options, but when you open a file, they would
look like this:

● pd.read_filetype()

As I mentioned before, there are different filetypes Pandas can work with, so you would replace
“filetype” with the actual filetype (like CSV). You would give the path, filename etc.
inside the parentheses. Inside the parentheses you can also pass different arguments that relate
to how to open the file. There are numerous arguments and in order to know all of them, you
would have to read the documentation (for example, the documentation for pd.read_csv()
contains all the arguments you can pass in this Pandas command).

In order to convert a certain Python object (dictionary, lists etc) the basic command is:

● pd.DataFrame()

Inside the parentheses you would specify the object(s) you’re creating the data frame from. This
command also has different arguments.

You can also save a data frame you’re working with/on to different kinds of files (like CSV,
Excel, JSON and SQL tables). The general code for that is:

● df.to_filetype(filename)
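
The loading and saving commands above can be sketched roughly as follows. This is a minimal illustration only: the column names and values are made up, and an in-memory text buffer stands in for a real file path so the example does not touch the disk.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
data = {"label": ["ham", "spam", "ham"], "length": [20, 155, 48]}

# Convert a Python dictionary to a data frame.
df = pd.DataFrame(data)

# Save the data frame as CSV. A StringIO buffer stands in for a filename here.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the CSV text back into a new data frame.
buffer.seek(0)
df_loaded = pd.read_csv(buffer)
print(df_loaded)
```

With a real file, the buffer would simply be replaced by a path such as a CSV filename in both calls.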

Viewing and Inspecting Data


Now that you’ve loaded your data, it’s time to take a look. How does the data frame look?
Running the name of the data frame would give you the entire table, but you can also get the
first n rows with df.head(n) or the last n rows with df.tail(n). df.shape would give you the
number of rows and columns. df.info() would give you the index, datatype and memory
information. The command s.value_counts(dropna=False) would allow you to view unique
values and counts for a series (like a column). A very useful command is
df.describe(), which outputs summary statistics for numerical columns. It is also possible to get
statistics on the entire data frame or a series (a column etc.):

● df.mean() Returns the mean of all columns
● df.corr() Returns the correlation between columns in a data frame
● df.count() Returns the number of non-null values in each data frame column
● df.max() Returns the highest value in each column
● df.min() Returns the lowest value in each column
● df.median() Returns the median of each column
● df.std() Returns the standard deviation of each column
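
As a small sketch of these inspection commands, the example below applies a few of them to a made-up data frame. The column names and values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "length": [20, 155, 48, 33],
})

first_two = df.head(2)                                   # first 2 rows
n_rows, n_cols = df.shape                                # number of rows and columns
label_counts = df["label"].value_counts(dropna=False)    # counts per unique value
summary = df.describe()                                  # statistics for numeric columns only
mean_length = df["length"].mean()                        # mean of a single column

print(first_two)
print(n_rows, n_cols)
print(label_counts)
print(summary)
print(mean_length)
```

Statistics are taken on the numeric column here because averaging a text column is not meaningful.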

Selection of Data
One of the things that is so much easier in Pandas is selecting the data you want, in comparison
to selecting a value from a list or a dictionary. You can select a single column (df[col]), which
returns the column with label col as a Series, or a few columns (df[[col1, col2]]), which returns
them as a new DataFrame. You can select by position (s.iloc[0]) or by index label
(s.loc['index_one']). In order to select the first row you can use df.iloc[0,:] and in order to select
the first element of the first column you would run df.iloc[0,0]. These can also be used in different
combinations, so I hope it gives you an idea of the different selection and indexing you can perform
in Pandas.
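
The selection commands described above might look like this in practice. The data frame, its index labels and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with string index labels.
df = pd.DataFrame(
    {"label": ["ham", "spam", "ham"], "length": [20, 155, 48]},
    index=["a", "b", "c"],
)

lengths = df["length"]              # one column, returned as a Series
subset = df[["label", "length"]]    # list of columns, returned as a DataFrame
by_position = lengths.iloc[0]       # first element by position
by_label = lengths.loc["b"]         # element whose index label is "b"
first_row = df.iloc[0, :]           # first row as a Series
first_cell = df.iloc[0, 0]          # first row, first column

print(by_position, by_label, first_cell)
```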

Filter, Sort and Groupby

You can use different conditions to filter rows. For example, df[df['year'] > 1984] would give
you only the rows where the column year is greater than 1984. You can use & (and) or | (or) to
combine different conditions in your filtering, wrapping each condition in parentheses. This is
also called boolean filtering.
It is possible to sort values by a certain column in ascending order using df.sort_values(col1),
and in descending order using df.sort_values(col2,ascending=False). Furthermore, it’s possible to
sort values by col1 in ascending order then col2 in descending order by using
df.sort_values([col1,col2],ascending=[True,False]).
The last command in this section is groupby. It involves splitting the data into groups based on
some criteria, applying a function to each group independently and combining the results into
a data structure. df.groupby(col) returns a groupby object for values from one column while
df.groupby([col1,col2]) returns a groupby object for values from multiple columns.
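
A minimal sketch of filtering, sorting and grouping is given below. The columns year, genre and score are invented for the example and are not taken from the project dataset.

```python
import pandas as pd

# Hypothetical data used only to demonstrate the commands.
df = pd.DataFrame({
    "year": [1980, 1985, 1990, 1990],
    "genre": ["a", "b", "a", "b"],
    "score": [5, 7, 9, 6],
})

# Boolean filtering: keep rows where year is greater than 1984.
after_1984 = df[df["year"] > 1984]

# Combined conditions: each condition needs its own parentheses.
filtered = df[(df["year"] > 1984) & (df["score"] >= 7)]

# Sort by year ascending, then by score descending.
sorted_df = df.sort_values(["year", "score"], ascending=[True, False])

# Split by genre, apply the mean to each group, combine into a Series.
mean_by_genre = df.groupby("genre")["score"].mean()

print(after_1984)
print(filtered)
print(sorted_df)
print(mean_by_genre)
```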

Data Cleaning
Data cleaning is a very important step in data analysis. For example, we always check for
missing values in the data by running df.isnull(), which checks for null values and returns a
boolean data frame (True for missing values and False for non-missing values). In order
to get the number of null/missing values in each column, run df.isnull().sum(). df.notnull() is
the opposite of df.isnull(). After you locate the missing values you can get rid of them by using
df.dropna() to drop the rows or df.dropna(axis=1) to drop the columns that contain them. A
different approach would be to fill the missing values with other values by using df.fillna(x),
which fills the missing values with x (you can put there whatever you want), or
s.fillna(s.mean()) to replace all null values with the mean (mean can be replaced with almost
any function from the statistics section).
It is sometimes necessary to replace values with different values. For example,
s.replace(1,'one') would replace all values equal to 1 with 'one'. It’s possible to do it for
multiple values: s.replace([1,3],['one','three']) would replace all 1 with 'one' and 3 with
'three'. You can also rename specific columns by running
df.rename(columns={'old_name': 'new_name'}) or use df.set_index('column_one') to change
the index of the data frame.
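
The cleaning steps can be sketched as follows, assuming a small made-up data frame with one missing value. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the "length" column.
df = pd.DataFrame({
    "length": [20.0, np.nan, 48.0],
    "flag": [1, 3, 1],
})

missing_per_column = df.isnull().sum()                      # count of nulls per column
rows_dropped = df.dropna()                                  # drop rows containing nulls
cols_dropped = df.dropna(axis=1)                            # drop columns containing nulls
filled = df["length"].fillna(df["length"].mean())           # fill nulls with the column mean
replaced = df["flag"].replace([1, 3], ["one", "three"])     # replace several values at once
renamed = df.rename(columns={"flag": "flag_code"})          # rename a column

print(missing_per_column)
print(filled)
print(replaced)
```

Each call returns a new object and leaves df unchanged, so the results have to be assigned to a name to be kept.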

Join/Combine
The last set of basic Pandas commands are for joining or combining data frames or
rows/columns. The three commands are:

● pd.concat([df1, df2]): add the rows of df2 to the end of df1 (columns should be
identical). Older code uses df1.append(df2) for this, but DataFrame.append was removed
in Pandas 2.0.
● pd.concat([df1, df2],axis=1): add the columns of df2 to the end of df1 (rows should be
identical)
● df1.join(df2,on=col1,how='inner'): SQL-style join of the columns in df1 with the columns
in df2 where the values of col1 in df1 match the index of df2. how can be equal to one
of: 'left', 'right', 'outer', 'inner'
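
A short sketch of combining data frames is shown below. It uses pd.concat for both row-wise and column-wise combination, and the data frame names, columns and values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data frames for illustration.
df1 = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"key": ["b", "c"], "x": [3, 4]})
df3 = pd.DataFrame({"y": [10, 20]})
lookup = pd.DataFrame({"z": [100, 200]}, index=["b", "c"])

# Rows of df2 placed under the rows of df1 (same columns).
stacked = pd.concat([df1, df2], ignore_index=True)

# Columns of df3 placed beside the columns of df1 (same row index).
side_by_side = pd.concat([df1, df3], axis=1)

# SQL-style inner join: df1["key"] is matched against the index of lookup.
joined = df1.join(lookup, on="key", how="inner")

print(stacked)
print(side_by_side)
print(joined)
```

Only the row with key "b" appears in the joined result, because it is the only key present on both sides.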

6.3.2 Numpy
NumPy is a powerful library for array processing, along with a large collection of high-level
mathematical functions to operate on these arrays. These functions fall into categories
like Linear Algebra, Trigonometry, Statistics, Matrix manipulation, etc.

Getting NumPy

NumPy’s main object is a homogeneous multidimensional array. Unlike Python’s array class,
which only handles one-dimensional arrays, NumPy’s ndarray class can handle
multidimensional arrays and provides more functionality. NumPy’s dimensions are known as
axes. For example, a two-dimensional array has 2 dimensions or 2 axes, namely rows and columns.
The number of dimensions is sometimes also called the rank of that particular array or matrix.

Importing NumPy
NumPy is imported using the following command. Note that np is the conventional
alias, so that we don't need to write numpy every time.

● import numpy as np

NumPy is the basic library for scientific computing in Python. Understanding NumPy is the
first major step in the journey of machine learning and deep learning.
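
To make the idea of axes concrete, the sketch below builds a small two-dimensional array and applies a few common operations along each axis. The values are arbitrary.

```python
import numpy as np

# A 2-dimensional array: 2 rows (axis 0) and 3 columns (axis 1).
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.ndim)    # number of axes
print(matrix.shape)   # size along each axis

column_sums = matrix.sum(axis=0)   # collapse the rows, one sum per column
row_means = matrix.mean(axis=1)    # collapse the columns, one mean per row
product = matrix @ matrix.T        # matrix product with the transpose

print(column_sums)
print(row_means)
print(product)
```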

6.3.3 Sklearn

In Python, the scikit-learn library provides pre-built functionality under sklearn.preprocessing.

The next step is feature extraction. Feature extraction is an attribute reduction process. Unlike
feature selection, which ranks the existing attributes according to their predictive significance,
feature extraction actually transforms the attributes. The transformed attributes, or features, are
linear combinations of the original attributes. Finally, our models are trained using a classifier
algorithm. We use the nltk.classify module of the Natural Language Toolkit library for Python.
We use part of the labelled dataset gathered to train the models. The rest of our labelled data is
used to evaluate the models. Several machine learning algorithms were used to classify the
preprocessed data. The chosen classifiers were Decision tree, Support Vector Machines and
Random forest. These algorithms are very popular in text classification tasks.
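
The train-and-evaluate workflow described above can be sketched as follows. This is not the project's actual code: it uses scikit-learn classifiers rather than nltk.classify, a simple bag-of-words count as the feature extraction step, and a tiny invented set of messages, so the resulting scores are not meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset, for illustration only.
texts = [
    "win a free prize now", "free cash click here",
    "claim your free reward", "urgent prize waiting for you",
    "see you at lunch", "meeting moved to monday",
    "can you call me later", "thanks for the notes",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Keep part of the labelled data for training and the rest for evaluation.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Feature extraction: word counts learned from the training texts only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifiers = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(kernel="linear"),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Train each classifier and record its accuracy on the held-out data.
scores = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    scores[name] = classifier.score(X_test, y_test)

print(scores)
```

Fitting the vectorizer on the training texts alone keeps the evaluation data unseen until scoring.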

6.3.4 Seaborn

Data Visualization in Python

Data visualization is the discipline of trying to understand data by placing it in a visual context,
so that patterns, trends and correlations that might not otherwise be detected can be exposed.
Python offers multiple great graphing libraries that come packed with lots of different features.
No matter if you want to create interactive, live or highly customized plots, Python has an
excellent library for you.

To get a little overview here are a few popular plotting libraries:

● Matplotlib: low level, provides lots of freedom

● Pandas Visualization: easy to use interface, built on Matplotlib

● Seaborn: high-level interface, great default styles

● ggplot: based on R’s ggplot2, uses Grammar of Graphics

● Plotly: can create interactive plots

In this article, we will learn how to create basic plots using Matplotlib, Pandas visualization
and Seaborn as well as how to use some specific features of each library. This article will focus
on the syntax and not on interpreting the graphs.

6.3.5 Matplotlib

Matplotlib is the most popular Python plotting library. It is a low-level library with a
MATLAB-like interface which offers lots of freedom at the cost of having to write more code.

To install Matplotlib, either pip or conda can be used:

● pip install matplotlib
● conda install matplotlib

Matplotlib is particularly good for creating basic graphs like line charts, bar charts, histograms
and many more. It can be imported by typing:

● import matplotlib.pyplot as plt


Line Chart
In Matplotlib we can create a line chart by calling the plot method. We can also plot multiple
columns in one graph, by looping through the columns we want, and plotting each column on
the same axis.
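
A line chart with several columns on the same axis could be drawn roughly like this. The data frame and its column names are invented, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: message counts per day.
df = pd.DataFrame({
    "day": [1, 2, 3, 4],
    "ham": [10, 12, 9, 14],
    "spam": [2, 3, 5, 4],
})

fig, ax = plt.subplots()

# Loop through the columns we want and plot each one on the same axis.
for column in ["ham", "spam"]:
    ax.plot(df["day"], df[column], label=column)

ax.set_xlabel("day")
ax.set_ylabel("messages")
ax.set_title("Line Chart")
ax.legend()
```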

Line Chart

Histogram
In Matplotlib we can create a Histogram using the hist method. If we pass it categorical data
like the points column from the wine-review dataset it will automatically calculate how often
each class occurs.
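
As the wine-review dataset is not reproduced here, the sketch below uses a short made-up list of point scores to show the hist method. The number of bins is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical scores standing in for the points column.
points = [80, 82, 85, 85, 88, 90, 90, 90, 95, 100]

fig, ax = plt.subplots()

# hist counts how many values fall into each bin and draws the bars.
counts, bin_edges, patches = ax.hist(points, bins=4)

ax.set_xlabel("points")
ax.set_ylabel("frequency")
ax.set_title("Histogram")
```

The returned counts array holds the frequency per bin, and bin_edges holds the boundaries between bins.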

Histogram
Bar Chart
A bar chart can be created using the bar method. The bar method does not automatically
calculate the frequency of a category, so we are going to use the pandas value_counts function
to do this. The bar chart is useful for categorical data that doesn’t have a lot of different
categories (fewer than 30), because otherwise it can get quite messy.
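
The combination of value_counts and bar might be sketched like this, using an invented series of category labels.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data.
labels = pd.Series(["ham", "spam", "ham", "ham", "spam", "ham"])

# value_counts computes the frequency of each category.
counts = labels.value_counts()

fig, ax = plt.subplots()

# One bar per category, with the frequency as the bar height.
ax.bar(list(counts.index), counts.to_numpy())

ax.set_xlabel("category")
ax.set_ylabel("frequency")
ax.set_title("Bar Chart")
```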

Bar-Chart

Pandas Visualization

Pandas is an open-source, high-performance, easy-to-use library providing data structures, such
as dataframes, and data analysis tools like the visualization tools we will use in this article.

Pandas Visualization makes it really easy to create plots out of a pandas dataframe and series.
It also has a higher level API than Matplotlib and therefore we need less code for the same
results.

Pandas can be installed using either pip or conda:

● pip install pandas
● conda install pandas


Heatmap
A Heatmap is a graphical representation of data where the individual values contained in a
matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features
in a dataset.
To get the correlation of the features inside a dataset we can call <dataset>.corr(),
which is a Pandas dataframe method. This will give us the correlation matrix.

We can now use either Matplotlib or Seaborn to create the heatmap.

Matplotlib:

Heatmap without annotations
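
A heatmap without annotations can be sketched with Matplotlib's imshow as follows. The three numeric columns are invented so that the correlations are easy to predict, and the colour map is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric features: b rises with a, c falls as a rises.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

# Correlation matrix of the features.
corr = df.corr()

fig, ax = plt.subplots()

# Each cell of the matrix is drawn as a colour between -1 and 1.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

# Label the rows and columns with the feature names.
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(list(corr.columns))
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(list(corr.index))

fig.colorbar(image, ax=ax)
```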

Python offers multiple great graphing libraries that come packed with lots of different features.
In this article we looked at Matplotlib, Pandas visualization and Seaborn.

CHAPTER 7
TESTING
Software testing is an investigation conducted to provide stakeholders with information about
the quality of the product or service under test. Software Testing also provides an objective,
independent view of the software to allow the business to appreciate and understand the risks
at implementation of the software. Test techniques include, but are not limited to, the process
of executing a program or application with the intent of finding software bugs.
Software Testing can also be stated as the process of validating and verifying that a software
program/application/product:
● Meets the business and technical requirements that guided its design and development.
● Works as expected and can be implemented with the same characteristics.

7.1 Testing Methods

● Functional Testing

Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
● Functions: Identified functions must be exercised.
● Output: Identified classes of software outputs must be exercised.
● Systems/Procedures: system should work properly

7.2 Integration Testing

Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.

Test Case for Excel Sheet Verification:

In this machine learning project the dataset is stored in Excel sheet format, so any test case
must first verify the Excel file. Classification then operates on the respective columns of the
dataset.
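
One possible shape for such a check is sketched below. The function name, the required column names and the sample data are all hypothetical. In practice the data frame would be loaded from the Excel file with pd.read_excel, which needs an Excel engine such as openpyxl installed; here the data frames are built in memory so the sketch stays self-contained.

```python
import pandas as pd

# Hypothetical column names; the real dataset may use different ones.
REQUIRED_COLUMNS = ["url", "label"]


def verify_dataset(df, required_columns):
    """Return a list of problems found in the dataset (empty if none)."""
    problems = []

    # Every column the classifier relies on must be present.
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # The sheet must contain at least one row.
    if df.empty:
        problems.append("dataset has no rows")

    # Required columns that are present must not contain empty cells.
    present = [c for c in required_columns if c in df.columns]
    null_counts = df[present].isnull().sum()
    for column, count in null_counts.items():
        if count > 0:
            problems.append(f"{count} missing value(s) in column '{column}'")

    return problems


good = pd.DataFrame({"url": ["http://example.com"], "label": [0]})
bad = pd.DataFrame({"url": ["http://example.com", None]})

print(verify_dataset(good, REQUIRED_COLUMNS))
print(verify_dataset(bad, REQUIRED_COLUMNS))
```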

Test Case 1 :

CHAPTER 8
RESULTS

8.1 Confusion Matrix

CHAPTER 9
OUTPUT

Fig 9.1: Home Page

Fig 9.2: Adding Detail

Fig 9.3: Website

CONCLUSION
In this paper, we describe our large-scale system for automatically classifying phishing pages
which maintains a false positive rate below 0.1%. Our classification system examines millions
of potential phishing pages daily in a fraction of the time of a manual review process. By
automatically updating our blacklist with our classifier, we minimize the amount of time that
phishing pages can remain active before we protect our users from them. Even with a perfect
classifier and a robust system, we recognize that our blacklist approach keeps us perpetually a
step behind the phishers. Our approach distinguishes phishing URLs from normal URLs using a
machine learning algorithm, and the results are reported in terms of the accuracy metric.

