0% found this document useful (0 votes)
183 views64 pages

Fake Url

Uploaded by

Chaitan Bruce
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

DETECTION OF URL BASED PHISHING WEBSITES

USING MACHINE LEARNING WITH PYTHON

Abstract:

Phishing is a fraudulent attempt to obtain sensitive personal information, such as usernames,
passwords, and bank details like credit/debit card numbers, by masquerading as a reliable
organization in electronic communication. A phishing website looks the same as the legitimate
website and directs the user to a page where personal details are entered on the fake site.
Machine learning algorithms can improve the accuracy of phishing prediction. The proposed method
uses uniform resource locator (URL) features: we identified features that phishing site URLs
contain and employ them to predict URL-based phishing websites with high accuracy. We discuss
various machine learning algorithms that can help in decision making and prediction, and use one
of them to improve prediction accuracy.

Keywords:- Phishing, Algorithm, Legitimate, Prediction.

1. INTRODUCTION

1.1 INTRODUCTION TO PROJECT:

Phishing emails imitate the characteristics and features of genuine emails so that they look the
same as the originals and appear to come from a legitimate source. The user believes the email
has come from a genuine company or organization, and is lured into visiting the phishing website
through the links given in the email. These phishing websites are made to mimic the appearance of
the original organization's website. The phishers pressure the user into filling in personal
information by sending alarming messages, account validation messages and so on, and then misuse
the information submitted. They frame the situation so that the user feels there is no option but
to visit the spoofed website.

Phishing is a cyber crime. Phishers commit it because it is very easy to do, costs almost nothing
and is effective. Email addresses are easy to find nowadays, and sending an email to anyone in the
world is free, so a phisher can reach any person with little effort. These attackers spend very
little cost and effort to obtain valuable data quickly and easily. Phishing fraud leads to malware
infections, loss of data, identity theft, etc. The data these cyber criminals are interested in is
a user's crucial information, such as passwords, OTPs, credit/debit card numbers, CVVs, sensitive
business data, medical data and other confidential data.

Sometimes these criminals also gather information that gives them direct access to the user's
social media accounts and email.

Many software tools, approaches and algorithms are used for phishing detection, at both academic
and commercial levels. A phishing URL and its corresponding page have many features that differ
from those of a legitimate URL. For example, to hide the original domain name, a phishing attacker
can select a very long and confusing domain name, which is easy to spot. Sometimes attackers use
an IP address instead of a domain name. Alternatively, they can use a shorter domain name that is
unrelated to the original legitimate website. Apart from URL-based features, many other features
can be used to detect phishing websites, namely Domain-Based Features, Page-Based Features and
Content-Based Features.

In the training phase, we should use labeled data containing both phishing samples and legitimate
samples. With such labeled data, detecting a phishing domain can be treated as a classification
problem. To build a working detection model, it is crucial to train on a dataset whose sample
classes are known to us: samples labeled as phishing should be detected only as phishing, and
samples labeled as legitimate should be detected as legitimate URLs. The dataset used for machine
learning must actually contain these features. There are many machine learning algorithms, and
each has its own working mechanism, as seen in the previous chapter. The existing system uses any
one suitable machine learning algorithm to detect phishing URLs and reports its accuracy. The
existing system has good accuracy, but it is still not the best; because phishing attacks are so
critical, we have to find a better solution. In the existing system only one machine learning
algorithm is used for prediction, and relying on a single algorithm is not a good way to improve
prediction accuracy. Each of the algorithms explained in the earlier chapter has some
disadvantages, hence it is not recommended to rely on one machine learning algorithm alone to
further improve the accuracy.

1.2 METHODOLOGY:

In this section we describe the various classifiers used in machine learning to predict phishing,
and we explain our proposed methodology for detecting phishing websites. Section A covers the
classifiers and methods that can be used to distinguish phishing websites from legitimate ones.
Section B explains our proposed system.

Machine learning classifiers and methods to detect the phishing website

Detecting and identifying phishing websites is a complex and dynamic problem. Machine learning
has been widely used in many areas to create automated solutions. Phishing attacks can be carried
out in many ways, such as email, website, malware, SMS and voice. In this work, we concentrate on
detecting website phishing (URL), which is achieved by making use of the Hybrid Algorithm
Approach: a mixture of different classifiers working together, which gives a good prediction rate
and improves the accuracy of the system.

Depending on the application and the nature of the dataset, we can use any of the classification
algorithms mentioned below. Because applications differ, we cannot say in general which algorithm
is superior. Each classifier has its own way of working and classifying. Let us discuss each of
them in detail.

Naive Bayes Classifier: This classifier is also known as a Generative Learning Model. The
classification is based on Bayes' Theorem and assumes independent predictors. In simple words,
the classifier assumes that the presence of a specific feature in a class is unrelated to the
presence of any other feature. Even if features depend on each other or on the presence of other
features, each is treated as contributing independently to the probability of the output. This
classification algorithm is easy to use and is particularly useful for large datasets.
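
The independence assumption described above can be illustrated with a small sketch. It is only an illustration, not the project's code: it assumes binary (0/1) features, applies Laplace smoothing, and the function names train_naive_bayes and predict_naive_bayes are hypothetical.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Estimate P(class) and P(feature_i = 1 | class) from binary feature rows."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_ones = {}
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        counts = feature_ones.setdefault(label, [0] * n_features)
        for i, value in enumerate(row):
            counts[i] += value
    model = {}
    for label, count in class_counts.items():
        prior = count / len(rows)
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        likelihoods = [(ones + 1) / (count + 2) for ones in feature_ones[label]]
        model[label] = (prior, likelihoods)
    return model


def predict_naive_bayes(model, row):
    """Return the class with the highest log-posterior for one feature row."""
    best_label, best_score = None, -math.inf
    for label, (prior, likelihoods) in model.items():
        # Each feature contributes an independent factor (a sum in log space).
        score = math.log(prior)
        for value, p in zip(row, likelihoods):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: 1 = phishing, 0 = legitimate (made-up values for illustration).
rows = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0]
model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, [1, 0, 1]))
```

Because each feature's factor is multiplied in separately, correlations between features are ignored, which is exactly the "naive" part of the method.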

Random Forest: This classification algorithm is an ensemble learning method for classification,
regression and other tasks. It works by building a group of decision trees at training time and
outputting the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees. This classifier corrects for the decision trees' habit of
overfitting the training data set.

Support vector machine (SVM): This is a supervised classification algorithm that is easy to use.
It can be used for both classification and regression applications, but it is better known for
classification. In this algorithm each data item is plotted as a point in an n-dimensional space,
where n is the number of features of the data. Classification is done by finding the hyperplane
that best separates the classes, whose data points lie on different sides of it.

XGBoost: More recently, researchers have adopted the XGBoost algorithm, which is very useful for
machine learning classification. It is fast and performs well because it is an implementation of
boosted decision trees, designed to improve both the performance and the speed of the model.

Once the model is trained, it is important to evaluate the chosen classifier and validate its
capability. The section above reviewed the available classifiers and their advantages and
disadvantages. We propose to use one classifier, Random Forest, to further improve the prediction
accuracy. After applying the classifier, results are generated and the URLs are classified into
phishing and legitimate URLs. The phishing URLs are blacklisted in the database and the
legitimate URLs are whitelisted in the database.

1.3 EXISTING SYSTEM

The existing anti-phishing approaches use blacklist methods or feature-based machine learning
techniques. Blacklist methods fail to detect new phishing attacks and produce a high false
positive rate. Moreover, existing machine learning based methods extract features from third
parties, search engines, etc. They are therefore complicated, slow, and not fit for a real-time
environment. To solve this problem, this paper presents a novel machine learning based
anti-phishing approach that extracts features from the client side only. We have examined the
various attributes of phishing and legitimate websites in depth and identified 5 new outstanding
features to distinguish phishing websites from legitimate ones.
1.4 PROPOSED SYSTEM:

The dataset of phishing and legitimate URLs is given to the system and pre-processed so that the
data is in a usable format for analysis. The feature set covers around 300 characteristics of
phishing websites, which are used to differentiate them from legitimate ones.

Each category has its own phishing attributes, and values are defined for them. The specified
characteristics are extracted for each URL and valid ranges of inputs are identified. These
values are then assigned to each phishing website risk. For each input the values range from 0 to
10, while the output ranges from 0 to 100. The phishing attribute values are represented with the
binary numbers 0 and 1, which indicate whether the attribute is present or not.

After the data is prepared, we apply a relevant machine learning algorithm to the dataset; the
algorithms are explained in the previous section. We then use a Random Forest classifier to
detect phishing URLs and measure the prediction accuracy, which gives our desired result. This is
also called a random approach to testing the data; our method uses the classifier mentioned above.

We shall then test the data and evaluate the prediction accuracy, which is expected to be higher
than that of the existing system. We shall now see the different classifiers and discuss the
hybrid combination used for our proposed system.

Fig. 1: Proposed System block diagram

2. REQUIREMENT AND SPECIFICATIONS

2.1 Functional Requirements:

Graphical User interface with the User.

2.2 Software Requirements:

For developing the application the following are the Software Requirements:

Operating system : Windows 7, Windows 8 , Windows 10

Coding Language : Python (3.7.4)

Technologies : Anaconda Navigator.

IDE : Jupyter Notebook(6.0.3), Python IDLE (3.7.4)

2.3 Hardware Requirements:

For developing the application the following are the Hardware Requirements:

Processor : Pentium IV or higher

RAM : 4 GB

Space on Hard Disk : minimum 20GB

ANACONDA NAVIGATOR

Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda®
distribution that allows you to launch applications and easily manage conda packages,
environments, and channels without using command-line commands. To get Navigator, get the
Navigator Cheat Sheet and install Anaconda.

Why use Navigator?

In order to run, many scientific packages depend on specific versions of other packages. Data
scientists often use multiple versions of many packages and use multiple environments to
separate these different versions.

The command-line program conda is both a package manager and an environment manager. This
helps data scientists ensure that each version of each package has all the dependencies it requires
and works correctly.

Navigator is an easy, point-and-click way to work with packages and environments without
needing to type conda commands in a terminal window. You can use it to find the packages you
want, install them in an environment, run the packages, and update them – all inside Navigator.

What applications can I access using Navigator?

The following applications are available by default in Navigator:

JupyterLab

Jupyter Notebook

Spyder

VSCode

Glueviz

Orange 3 App

RStudio

Anaconda Prompt (Windows only)

Anaconda PowerShell (Windows only)

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations and narrative text. Uses include: data
cleaning and transformation, numerical simulation, statistical modeling, data visualization,
machine learning, and much more.

Python IDLE

IDLE (Integrated Development and Learning Environment) is an integrated development environment
(IDE) for Python. The Python installer for Windows contains the IDLE module by default.

IDLE is not available by default in Python distributions for Linux. It needs to be installed using
the respective package manager. For example, on distributions that use apt-get:

$ sudo apt-get install idle

IDLE can be used to execute a single statement just like Python Shell and also to create,
modify and execute Python scripts. IDLE provides a fully-featured text editor to create Python
scripts that includes features like syntax highlighting, autocompletion and smart indent. It also
has a debugger with stepping and breakpoints features.

To start IDLE interactive shell, search for the IDLE icon in the start menu and double click on
it.

Python IDLE
This will open IDLE, where you can write Python code and execute it as shown below.

Python IDLE

Now, you can execute Python statements same as in Python Shell as shown below.

Python IDLE

To execute a Python script, create a new file by selecting File -> New File from the menu.

Enter multiple statements and save the file with extension .py using File -> Save. For example,
save the following code as hello.py.

Python Script in IDLE
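
The script referred to here appeared as a screenshot that did not survive extraction. A minimal stand-in, assuming any short script will do (the function name greet is hypothetical), could look like this:

```python
# hello.py: a tiny script to try out in the IDLE editor window.

def greet(name):
    """Build a greeting for the given name."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    # Runs only when the file is executed directly (for example with F5 in IDLE).
    print(greet("World"))
```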

Now, press F5 to run the script in the editor window. The IDLE shell will show the output. Thus,
it is easy to write, test and run Python scripts in IDLE.

3. FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis, the
feasibility study of the proposed system is carried out to ensure that the proposed system is not
a burden to the company. For feasibility analysis, some understanding of the major requirements
for the system is essential.

Three key considerations involved in the feasibility analysis are:

• ECONOMICAL FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY

3.1 ECONOMICAL FEASIBILITY

This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and development of
the system is limited, so the expenditures must be justified. The developed system is well within
the budget, which was achieved because most of the technologies used are freely available. Only
the customized products had to be purchased.

3.2 TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the technical requirements
of the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system
must have modest requirements, as only minimal or no changes are required for implementing this
system.

3.3 SOCIAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the user. This includes
the process of training the user to use the system efficiently. The user must not feel threatened
by the system, but must instead accept it as a necessity. The level of acceptance by the users
depends solely on the methods employed to educate users about the system and to make them familiar
with it. Their level of confidence must be raised so that they are also able to offer constructive
criticism, which is welcomed, as they are the final users of the system.

4. SYSTEM ANALYSIS

4.1 STUDY OF THE SYSTEM:

To provide flexibility to the users, the interfaces have been developed so that they are
accessible through a GUI application. The GUIs at the top level have been categorized as:

1. The operational or generic user interface

The ‘operational or generic user interface’ helps the end users of the system in transactions
through the existing data and required services. The operational user interface also helps
ordinary users manage their own information in a customized manner, as per the included
flexibilities.

4.2 INPUT & OUTPUT REPRESENTATION:

Input design is a part of overall system design. The main objectives during input design are
given below:

• To produce a cost-effective method of input.
• To achieve the highest possible level of accuracy.
• To ensure that the input is acceptable and understood by the user.

INPUT AND OUTPUT DESIGN

INPUT DESIGN:

The input design is the link between the information system and the user. It comprises developing
the specifications and procedures for data preparation, that is, the steps necessary to put
transaction data into a usable form for processing. This can be achieved by instructing the
computer to read data from a written or printed document, or by having people key the data
directly into the system. The design of input focuses on controlling the amount of input
required, controlling errors, avoiding delay, avoiding extra steps and keeping the process simple.
The input is designed so that it provides security and ease of use while retaining privacy. Input
design considers the following things:

• What data should be given as input?
• How should the data be arranged or coded?
• The dialog to guide the operating personnel in providing input.
• Methods for preparing input validations and steps to follow when errors occur.

OBJECTIVES:

1. Input Design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important to avoid errors in the data input process and to
show the correct direction to the management for getting correct information from the
computerized system.

2. It is achieved by creating user-friendly screens for data entry that can handle large volumes
of data. The goal of designing input is to make data entry easier and free from errors. The data
entry screen is designed in such a way that all data manipulations can be performed. It also
provides record viewing facilities.

3. When data is entered, it is checked for validity. Data can be entered with the help of
screens. Appropriate messages are provided as and when needed so that the user is not left in a
maze at any instant. Thus the objective of input design is to create an input layout that is easy
to follow.

OUTPUT DESIGN:

A quality output is one which meets the requirements of the end user and presents the information
clearly. In any system, the results of processing are communicated to the users and to other
systems through outputs. In output design it is determined how the information is to be displayed
for immediate need, and also as hard copy output. It is the most important and direct source of
information to the user. Efficient and intelligent output design improves the system's
relationship with the user and helps decision-making.

1. Designing computer output should proceed in an organized, well thought out manner; the right
output must be developed while ensuring that each output element is designed so that people will
find the system easy and effective to use. When analysts design computer output, they should
identify the specific output that is needed to meet the requirements.

2. Select methods for presenting information.

3. Create documents, reports, or other formats that contain information produced by the system.

The output form of an information system should accomplish one or more of the following
objectives:

• Convey information about past activities, current status or projections of the future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.

4.3 PROCESS MODEL USED WITH JUSTIFICATION:

1. Define Project Objectives

• Specify business problem
• Acquire subject matter expertise
• Define unit of analysis and prediction target
• Prioritize modeling criteria
• Consider risks and success criteria
• Decide whether to continue

2. Acquire & Explore Data

• Find appropriate data
• Merge data into single table
• Conduct exploratory data analysis
• Find and remove any target leakage
• Feature engineering

3. Model Data

• Variable selection
• Build candidate models
• Model validation and selection

4. Interpret & Communicate

• Interpret model
• Communicate model insights

5. Implement, Document & Maintain

• Set up batch or API prediction system
• Document modeling process for reproducibility
• Create model monitoring and maintenance

4.4 SYSTEM ARCHITECTURE:

Architecture flow:

The architecture diagram below mainly represents the flow from the training phase to the
detection phase. First, the data needs to be pre-processed and features extracted using different
feature sets; then we train on this dataset with the corresponding algorithms, and the output is
displayed.

5. SYSTEM DESIGN

5.1 INTRODUCTION:

Systems design is the process or art of defining the architecture, components, modules,
interfaces, and data for a system to satisfy specified requirements. One could see it as the
application of systems theory to product development. There is some overlap and synergy with the
disciplines of systems analysis, systems architecture and systems engineering.

5.2 DATA FLOW DIAGRAM:

Dataset Collection → Pre-processing → Feature Extraction → Trained & Testing dataset →
Algorithm and classification → Prediction and Result
5.3 UML DIAGRAMS:

Unified Modeling Language:

The “Unified Modeling Language” allows the software engineer to express an analysis model using a
modeling notation that is governed by a set of syntactic, semantic and pragmatic rules.

A UML system is represented using five different views that describe the system from distinctly
different perspectives. Each view is defined by a set of diagrams, as follows.

• User Model View
i. This view represents the system from the user's perspective.
ii. The analysis representation describes a usage scenario from the end-user's perspective.

• Structural Model View
i. In this model the data and functionality are viewed from inside the system.
ii. This model view models the static structures.

• Behavioral Model View
It represents the dynamic or behavioral aspects of the system, depicting the interactions between
the various structural elements described in the user model and structural model views.

• Implementation Model View
In this view the structural and behavioral aspects of the system are represented as they are to
be built.

• Environmental Model View
In this view the structural and behavioral aspects of the environment in which the system is to
be implemented are represented.

UML is specifically constructed through two different domains:

• UML analysis modeling, which focuses on the user model and structural model views of the system.
• UML design modeling, which focuses on the behavioral modeling, implementation modeling and
environmental model views.

USECASE DIAGRAM:

Use case Diagrams represent the functionality of the system from a user’s point of view. Use
cases are used during requirements elicitation and analysis to represent the functionality of the
system. Use cases focus on the behavior of the system from external point of view.

ACTORS:

Actors are external entities that interact with the system. Examples of actors include users such
as an administrator or a bank customer, or another system such as a central database.

1. USE CASE DIAGRAM:

a. User

2. COMPONENT DIAGRAM:

3. CLASS DIAGRAM

4. ACTIVITY DIAGRAM:

User

5. SEQUENCE DIAGRAM

6. SYSTEM OVERVIEW

System design is used for understanding the construction of the system. In this section we
explain the flow of our system and the software used in it.

System Flow

Fig. 2 shows the flow chart of the system design; each component of the flow chart is explained
in the sections below. To obtain structured data, we perform feature generation at the
pre-processing stage. We have used the Random Forest classifier to detect phishing and legitimate
websites.

Fig 2:- Flow of the System

ALGORITHM:

RANDOM FOREST:

Random forest is a supervised learning algorithm which is used for both classification and
regression; however, it is mainly used for classification problems. Just as a forest is made up
of trees, and more trees mean a more robust forest, the random forest algorithm creates decision
trees on data samples, gets a prediction from each of them, and finally selects the best solution
by means of voting. It is an ensemble method which is better than a single decision tree because
it reduces over-fitting by averaging the result.

Working of Random Forest Algorithm

We can understand the working of Random Forest algorithm with the help of following steps −

Step 1 − First, start with the selection of random samples from a given dataset.

Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the
prediction result from every decision tree.

Step 3 − In this step, voting will be performed for every predicted result.

Step 4 − At last, select the most voted prediction result as the final prediction result
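
The four steps above can be sketched in a few lines. This is a simplified illustration, not the project's implementation: it assumes binary (0/1) features, uses one-level decision stumps in place of full decision trees, and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Fit a one-level tree: split on the binary feature with the fewest errors."""
    fallback = Counter(labels).most_common(1)[0][0]
    best = None
    for feature in range(len(rows[0])):
        # Count the labels that fall on each side of the split.
        sides = {0: Counter(), 1: Counter()}
        for row, label in zip(rows, labels):
            sides[row[feature]][label] += 1
        # Each leaf predicts its majority label (or the overall majority if empty).
        leaf = {v: (c.most_common(1)[0][0] if c else fallback) for v, c in sides.items()}
        errors = sum(leaf[row[feature]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, leaf)
    return best[1], best[2]


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        # Step 1: draw a random sample (with replacement) from the dataset.
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: build one tree for this sample.
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    # Step 3: every tree votes. Step 4: the most voted label wins.
    votes = Counter(leaf[row[feature]] for feature, leaf in forest)
    return votes.most_common(1)[0][0]


# Toy data: 1 = phishing, 0 = legitimate (made-up values for illustration).
rows = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
labels = [1, 1, 1, 0, 0, 0]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [1, 0]), predict_forest(forest, [0, 1]))
```

A full random forest additionally grows deeper trees and considers a random subset of features at each split; the sampling and voting structure is the same.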

Data set: The URL data is obtained from the PhishTank website, an anti-phishing site. It
contains 2905 URLs in unstructured form. Our main objective is to detect whether a URL is
phishing or legitimate.

Step 1: Data preprocessing

This dataset contains website links (some of them are legitimate websites and a few are fake
websites).

Before building a model, we pre-process the data and extract features from it based on certain
conditions.

We need to split the data according to the parts of the URL.

A typical URL could have the form http://www.example.com/index.html, which indicates a protocol
(http), a hostname (www.example.com), and a file name (index.html).
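
As an illustration of these parts, the standard library's urllib.parse can split a URL as follows. This is a sketch only; the helper name split_url and the dictionary keys are hypothetical.

```python
from urllib.parse import urlparse


def split_url(url):
    """Split a URL into its protocol, hostname, path and query string."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.netloc,
        "path": parts.path.lstrip("/"),  # drop the leading slash
        "query": parts.query,
    }


print(split_url("http://www.example.com/index.html"))
```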

The URL column is split with split(separator, number of splits according to the separator
(delimiter), expand), and the columns of the resulting data frame are renamed to domain name and
address.

The data frames with protocol, domain name and address are then concatenated.

We need to add a new column to the data frame with the name “is phished”.

The domain name column can be further subdivided into domain_names and sub_domain_names.

Similarly, the address column can be further subdivided into path, query_string and file.
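
The splitting described above can be sketched with plain string operations. The exact separators the project used are not shown in the text, so the choices below ("://" for the protocol, the first "/" for the address, "?" for the query string) are assumptions, and the function names are hypothetical. With pandas, the first-level split is typically written with Series.str.split(pat, n=..., expand=True).

```python
def split_row(url, is_phished):
    """Split one URL into protocol, domain_name and address columns."""
    protocol, _, rest = url.partition("://")
    domain_name, _, address = rest.partition("/")
    return {
        "protocol": protocol,
        "domain_name": domain_name,
        "address": address,
        "is_phished": is_phished,
    }


def split_address(address):
    """Subdivide the address into (path, file, query_string)."""
    path_and_file, _, query_string = address.partition("?")
    # Everything before the last "/" is the path; the remainder is the file.
    path, _, file = path_and_file.rpartition("/")
    return path, file, query_string


row = split_row("http://www.7-zip.org/download.html", "no")
print(row)
print(split_address("wp-includes/js/jquery/ini.php"))
```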

Features Extraction

Feature-1

1. Long URL to Hide the Suspicious Part

If the length of the URL is greater than or equal to 54 characters, then the URL is classified as
phishing.

0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious
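
A possible encoding of this feature is sketched below. The text gives only the 54-character threshold; the 75-character upper bound used here to separate "suspicious" from "phishing" is an assumption made for illustration, and the function name long_url simply mirrors the column name used in the dataset.

```python
LEGITIMATE, PHISHING, SUSPICIOUS = 0, 1, 2


def long_url(url, suspicious_from=54, phishing_above=75):
    """Encode URL length as 0 (legitimate), 2 (suspicious) or 1 (phishing)."""
    length = len(url)
    if length < suspicious_from:
        return LEGITIMATE
    if length <= phishing_above:  # assumed upper bound, not given in the text
        return SUSPICIOUS
    return PHISHING


print(long_url("http://ebay.com"))
```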

Feature-2

2. URLs having “@” Symbol

Using the “@” symbol in a URL leads the browser to ignore everything preceding the “@” symbol,
and the real address often follows the “@” symbol.

IF {URL Having @ Symbol → Phishing

Otherwise → Legitimate }

0 --- indicates legitimate

1 --- indicates Phishing
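
A minimal sketch of this rule follows. The URL in the example is made up, and the helper names are hypothetical; urlparse is used only to show which host a URL containing “@” actually points to.

```python
from urllib.parse import urlparse


def having_at_symbol(url):
    """Return 1 (phishing) if the URL contains '@', otherwise 0 (legitimate)."""
    return 1 if "@" in url else 0


def real_host(url):
    """Return the host that is actually contacted, ignoring any text before '@'."""
    return urlparse(url).hostname


url = "http://www.legitimate.com@phishing.example/login"
print(having_at_symbol(url), real_host(url))
```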

Feature-3

3. Redirecting using “//”

The existence of “//” within the URL path means that the user will be redirected to another
website. An example of such a URL is: “http://www.legitimate.com//http://www.phishing.com”.
We examine the location where the “//” appears. If the URL starts with “HTTP”, the “//” should
appear in the sixth position. However, if the URL employs “HTTPS”, the “//” should appear in the
seventh position.

IF {The Position of the Last Occurrence of "//" in the URL > 7 → Phishing

Otherwise → Legitimate }

0 --- indicates legitimate

1 --- indicates Phishing
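
The position rule can be sketched as follows. Python's str.rfind is 0-based while the rule counts positions from 1, so the sketch adds one before comparing; the function name is hypothetical.

```python
def redirection_symbol(url):
    """Return 1 (phishing) if the last '//' starts beyond position 7, else 0."""
    # rfind returns -1 when '//' is absent, which yields position 0 (legitimate).
    last_position = url.rfind("//") + 1
    return 1 if last_position > 7 else 0


print(redirection_symbol("http://www.legitimate.com//http://www.phishing.com"))
```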

Feature-4

4. Adding Prefix or Suffix Separated by (-) to the Domain

The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes
separated by (-) to the domain name so that users feel that they are dealing with a legitimate
webpage.

For example http://www.Confirme-paypal.com/.

IF {Domain Name Part Includes (−) Symbol → Phishing

Otherwise → Legitimate }

0 --- indicates legitimate

1 --- indicates Phishing
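
A short sketch of this check, assuming the domain part is taken from the parsed hostname so that dashes in the path are ignored (the function name is hypothetical):

```python
from urllib.parse import urlparse


def prefix_suffix_separation(url):
    """Return 1 (phishing) if the domain part contains '-', otherwise 0."""
    hostname = urlparse(url).hostname or ""
    return 1 if "-" in hostname else 0


print(prefix_suffix_separation("http://www.Confirme-paypal.com/"))
```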

Feature-5

5. Sub-Domain and Multi Sub-Domains

A legitimate URL is expected to contain at most two dots (the “www.” prefix can be ignored). If
the number of dots is equal to three, the URL is classified as “Suspicious” since it has one
sub-domain.

However, if the number of dots is greater than three, it is classified as “Phishy” since it has
multiple sub-domains.

0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious
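
The dot-counting rule can be sketched as below. The text does not say which part of the URL is counted; this sketch assumes the dots are counted in the hostname only, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def sub_domains(url):
    """Encode the dot count as 0 (legitimate), 2 (suspicious) or 1 (phishing)."""
    hostname = urlparse(url).hostname or ""
    dots = hostname.count(".")
    if dots < 3:
        return 0  # legitimate
    if dots == 3:
        return 2  # suspicious: one sub-domain
    return 1      # phishing: multiple sub-domains


print(sub_domains("http://www.example.com/index.html"))
```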

    protocol  domain_name                                  Address                                            is_phished  long_url  having_@_symbol  redirection_//_symbol  prefix_suffix_seperation  sub_domains
 0  https     locking-app-adverds.000webhostapp.com        payment-update-0.html?fb_source=bookmark_apps&...  yes         1         0                0                      1                         0
 1  http      www.myhealthcarepharmacy.ca                  wp-includes/js/jquery/ini.php                      yes         2         0                0                      0                         0
 2  http      code.google.com                              p/pylevenshtein/                                   no          0         0                0                      0                         0
 3  http      linkedin.com                                                                                    no          0         0                0                      0                         0
 4  http      imageshack.com                               f/219/cadir2yr3.jpg                                no          0         0                0                      0                         0
 5  https     doc-08-bc-apps-viewer.googleusercontent.com  viewer/secure/pdf/oeseuikt874lsab1nd951o2hb6vb...  no          1         0                0                      1                         0
 6  http      www.7-zip.org                                download.html                                      no          0         0                0                      1                         0
 7  http      ebay.com                                                                                        no          0         0                0                      0                         0
 8  https     login.yahoo.com                              config/login?.done=https%3a%2f%2fapi.login.yah...  no          1         0                0                      0                         0
 9  http      mobile-free.metulwx.com                      mobile/impaye/                                     yes         0         0                0                      1                         0
10  https     www.gulfshield.com                           uploads/gallery/mankind/secure/fiscal/manual/m...  yes         1         0                0                      0                         0
11  http      rodritol.com                                 components/kttm/dropbox/                           yes         0         0                0                      0                         0
12  https     ecommerce-platforms.com                      articles/top-6-ecommerce-platform-reviews-2012...  no          1         0                0                      1                         0
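
The protocol, domain_name and Address columns of the table above can be obtained by splitting each URL. A minimal sketch using the standard urllib.parse module is shown below. The function name split_url is hypothetical, and how the original dataset was actually split is not stated here, so this is only one plausible way to do it.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into the three columns used in the dataset table."""
    parts = urlsplit(url)
    # Address is the path without its leading '/', plus the query string if any.
    address = parts.path.lstrip("/")
    if parts.query:
        address += "?" + parts.query
    return {
        "protocol": parts.scheme,
        "domain_name": parts.netloc,
        "Address": address,
    }


print(split_url("http://www.7-zip.org/download.html"))
print(split_url("http://ebay.com"))
```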

Classification of URLs using Random forest

Evaluating classifier

Sr. No  Feature name            Description

1       Length of URL           Long length of the URL
2       Suspicious character    Whether URL has ‘@’
3       Prefix or suffix        Whether URL has “-” or “_”
4       Length of sub-domain    Length of sub-domain
5       Redirection using ‘//’  Redirection of website using “//”

URL Features: Referring to Table 1 above, Features 1 to 4 are associated with suspicious URL
patterns and characters. Characters such as ‘@’ and ‘//’ rarely appear in a URL. Feature 5 is
known for recognizing newly created phishing sites with the proposed methodology. Currently, to
prevent a user from identifying that a site is not legitimate, phishing sites typically hide the
primary domain; the URLs of these phishing sites have unusually long sub-domains.

Feature 3 is another new feature that reflects current phishing trends. This feature includes
seven words that are predefined as phishing terms: secure, websrc, ebaysapi, signin, banking,
confirm and login. We identified these seven phishing terms through experiments and we employ
them in our phishing detection technique. We have already discussed the different classifiers
in the above sections.
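
A simple way to look for the seven phishing terms listed above is sketched below. We assume a case-insensitive substring match over the whole URL, which is our own choice and not something the text specifies. The function name is hypothetical. A match is only one feature, not a verdict on its own, since legitimate URLs can also contain these words.

```python
PHISHING_TERMS = ("secure", "websrc", "ebaysapi", "signin",
                  "banking", "confirm", "login")


def has_phishing_term(url: str) -> int:
    """Return 1 if any predefined phishing term occurs in the URL, else 0."""
    lowered = url.lower()
    return 1 if any(term in lowered for term in PHISHING_TERMS) else 0


print(has_phishing_term("https://login.yahoo.com/config/login"))
print(has_phishing_term("http://mobile-free.metulwx.com/mobile/impaye/"))
```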

7. IMPLEMENTATION & SAMPLE CODE

This section describes the implementation environment and explains the actual
steps taken to build the dataset and to predict phishing with better accuracy
using a combination of different classifiers.

Implementation steps

In this section we discuss the actual steps that were implemented while doing
the experiment. We explain the stepwise procedure used to analyse the data and
to predict phishing. We have used unstructured data which consists only of URLs.
There are 2905 URLs obtained from the Phishtank website, consisting of both
phishing and legitimate URLs, where most of the URLs obtained are phishing. The
system consists of the following main steps.

We have collected unstructured data of URLs from the Phishtank website.

In preprocessing, feature generation is done, where 5 features are generated from
the unstructured data. These features are length of URL, has prefix/suffix,
number of dots, number of slashes and length of sub-domain.

After this, a structured dataset is created in which each feature contains a
binary value (0, 1), which is then passed to the different classifiers.

Next, we train the Random Forest classifier and evaluate its performance on the
basis of accuracy.

Then the classifier classifies the given URL based on the training data; that is,
it reports whether the site is a spoofed website as yes or no.
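
The training and prediction steps above can be sketched with scikit-learn's RandomForestClassifier, assuming scikit-learn is the library used. This is only an illustration: the thirteen rows are the sample rows from the dataset table shown earlier (not the full set of 2905 URLs), and n_estimators=100 and random_state=0 are our own choices, not values reported for this project.

```python
from sklearn.ensemble import RandomForestClassifier

# Columns: long_url, having_@_symbol, redirection_//_symbol,
# prefix_suffix_seperation, sub_domains (sample rows 0 to 12 of the table).
X = [
    [1, 0, 0, 1, 0], [2, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0], [1, 0, 0, 1, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
]
# is_phished column: 1 for "yes", 0 for "no".
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Assumed hyperparameters, chosen only to make the sketch reproducible.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)


def spoofed_label(features: list) -> str:
    """Map the classifier's prediction for one feature row to Yes/No."""
    return "Yes" if clf.predict([features])[0] == 1 else "No"


print("Spoofed webpage:", spoofed_label([2, 0, 0, 0, 0]))
```

With so few rows the prediction itself is not meaningful; the sketch only shows how the structured dataset is passed to the classifier and how the yes/no output is produced.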

7.1 SAMPLE CODE:

PYTHON CODE

from PyQt5 import QtCore, QtGui, QtWidgets

import feature_extractor


class Ui_Spam_detector(object):

    def setupUi(self, Spam_detector):
        # Main window and its central widget
        Spam_detector.setObjectName("Spam_detector")
        Spam_detector.resize(521, 389)
        self.centralwidget = QtWidgets.QWidget(Spam_detector)
        self.centralwidget.setObjectName("centralwidget")

        # Check button, connected to the button_click handler
        self.check_button = QtWidgets.QPushButton(self.centralwidget)
        self.check_button.setGeometry(QtCore.QRect(210, 170, 93, 28))
        self.check_button.setObjectName("check_button")
        self.check_button.clicked.connect(self.button_click)

        # URL input field and its label
        self.url_input = QtWidgets.QLineEdit(self.centralwidget)
        self.url_input.setGeometry(QtCore.QRect(70, 111, 431, 31))
        self.url_input.setObjectName("url_input")
        self.label = QtWidgets.QLabel(self.centralwidget)
        self.label.setGeometry(QtCore.QRect(20, 110, 81, 31))
        self.label.setObjectName("label")

        # Output message area and the title label
        self.output_text = QtWidgets.QTextEdit(self.centralwidget)
        self.output_text.setGeometry(QtCore.QRect(30, 241, 461, 121))
        self.output_text.setObjectName("output_text")
        self.label_2 = QtWidgets.QLabel(self.centralwidget)
        self.label_2.setGeometry(QtCore.QRect(110, 10, 311, 41))
        self.label_2.setObjectName("label_2")

        Spam_detector.setCentralWidget(self.centralwidget)
        self.statusbar = QtWidgets.QStatusBar(Spam_detector)
        self.statusbar.setObjectName("statusbar")
        Spam_detector.setStatusBar(self.statusbar)

        self.retranslateUi(Spam_detector)
        QtCore.QMetaObject.connectSlotsByName(Spam_detector)

    def retranslateUi(self, Spam_detector):
        # Set the visible texts of the window, button and labels
        _translate = QtCore.QCoreApplication.translate
        Spam_detector.setWindowTitle(_translate("Spam_detector", "MainWindow"))
        self.check_button.setText(_translate("Spam_detector", "Check "))
        self.label.setText(_translate(
            "Spam_detector",
            "<html><head/><body><p><span style=\" font-size:10pt;\">"
            "URL :</span></p></body></html>"))
        self.label_2.setText(_translate(
            "Spam_detector",
            "<html><head/><body><p align=\"center\">"
            "<span style=\" font-size:16pt;\">Spam URL Detector"
            "</span></p></body></html>"))

    def button_click(self):
        # Read the URL, extract its features and show the two result strings
        text = self.url_input.text()
        obj = feature_extractor.feature_extractor(text)
        str1, str2 = obj.extract()
        self.output_text.append("{} \n{}\n\n".format(str1, str2))


if __name__ == "__main__":
    import sys

    app = QtWidgets.QApplication(sys.argv)
    Spam_detector = QtWidgets.QMainWindow()
    ui = Ui_Spam_detector()
    ui.setupUi(Spam_detector)
    Spam_detector.show()
    sys.exit(app.exec_())

8. TESTING & SCREEN SHOTS

The purpose of testing is to discover errors. Testing is the process of trying to
discover every conceivable fault or weakness in a work product. It provides a way
to check the functionality of components, sub-assemblies, assemblies and/or a
finished product. It is the process of exercising software with the intent of
ensuring that the software system meets its requirements and user expectations
and does not fail in an unacceptable manner. There are various types of tests.
Each test type addresses a specific testing requirement.

TYPES OF TESTS

Unit testing:
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is the
testing of individual software units of the application. It is done after the
completion of an individual unit and before integration. This is structural testing
that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of
a business process performs accurately to the documented specifications and contains
clearly defined inputs and expected results.
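
As an illustration of validating every decision branch, the sketch below uses Python's standard unittest module on a small stand-in unit. The function classify_dot_count is hypothetical; it only encodes the sub-domain rule from Feature 5 and is not taken from the project's code.

```python
import unittest


def classify_dot_count(dots: int) -> int:
    """Hypothetical unit under test: 0 legitimate, 2 suspicious, 1 phishing."""
    if dots > 3:
        return 1
    if dots == 3:
        return 2
    return 0


class ClassifyDotCountTest(unittest.TestCase):
    # One test per decision branch of the unit.
    def test_two_dots_is_legitimate(self):
        self.assertEqual(classify_dot_count(2), 0)

    def test_three_dots_is_suspicious(self):
        self.assertEqual(classify_dot_count(3), 2)

    def test_more_than_three_dots_is_phishing(self):
        self.assertEqual(classify_dot_count(4), 1)


# Run the test case explicitly and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClassifyDotCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```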

Integration testing:
Integration tests are designed to test integrated software components to
determine if they actually run as one program. Testing is event driven and is more
concerned with the basic outcome of screens or fields. Integration tests
demonstrate that, although the components were individually satisfactory, as shown
by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.

Functional testing:
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage pertaining to
identifying business process flows, data fields, predefined processes, and successive
processes must be considered for testing. Before functional testing is complete,
additional tests are identified and the effective value of current tests is determined.

LEVELS OF TESTING:
Code testing:
This examines the logic of the program. For example, the logic for updating
various sample data, together with the sample files and directories, was tested and
verified.
Specification Testing:
This examines the specification stating what the program should do and how it
should perform under various conditions. Test cases for various situations and
combinations of conditions in all the modules are tested.
Unit testing:
In unit testing we test each module individually and then integrate it with the
overall system. Unit testing focuses verification efforts on the smallest unit of
software design in the module. This is also known as module testing. Each module
of the system is tested separately. This testing is carried out during the
programming stage itself. In this testing step, each module is found to work
satisfactorily as regards the expected output from the module. There are also some
validation checks for fields. For example, a validation check is done on the input
given by the user to verify the validity of the data entered. This makes it very
easy to find errors and debug the system.

Each Module can be tested using the following two Strategies:


• Black Box Testing
• White Box Testing
BLACK BOX TESTING
What is Black Box Testing?

Black box testing is a software testing technique in which the functionality of
the software under test (SUT) is tested without looking at the internal code
structure, implementation details and knowledge of internal paths of the software.
This type of testing is based entirely on the software requirements and
specifications.
In black box testing we just focus on the inputs and outputs of the software system
without bothering about internal knowledge of the software program.

The black box can be any software system you want to test. For example: an
operating system like Windows, a website like Google, a database like Oracle or
even your own custom application. Under black box testing, you can test these
applications by just focusing on the inputs and outputs without knowing their
internal code implementation.
Black box testing - Steps
Here are the generic steps followed to carry out any type of Black Box Testing.
• Initially requirements and specifications of the system are examined.
• Tester chooses valid inputs (positive test scenario) to check whether SUT
processes them correctly. Also some invalid inputs (negative test scenario)
are chosen to verify that the SUT is able to detect them.
• Tester determines expected outputs for all those inputs.
• Software tester constructs test cases with the selected inputs.
• The test cases are executed.
• Software tester compares the actual outputs with the expected outputs.
• Defects if any are fixed and re-tested.
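
The steps above can be sketched as a small table-driven check. The function check_url is a hypothetical stand-in for the SUT (it applies only the “@” rule), and the test inputs are made up for illustration. The runner treats the SUT as opaque and only compares actual outputs with expected ones.

```python
def check_url(url):
    """Hypothetical SUT: reports a spoofed webpage as 'Yes' or 'No'."""
    if not isinstance(url, str) or not url.strip():
        raise ValueError("URL must be a non-empty string")
    return "Yes" if "@" in url else "No"


# (input, expected output); an exception type marks a negative test scenario.
CASES = [
    ("http://www.legitimate.com/", "No"),
    ("http://www.legitimate.com@phishing.example/", "Yes"),
    ("", ValueError),
]


def run_black_box(sut, cases):
    """Execute every case and return the list of defects found."""
    defects = []
    for given, expected in cases:
        try:
            actual = sut(given)
        except Exception as exc:
            actual = type(exc)
        if actual != expected:
            defects.append((given, expected, actual))
    return defects


defects = run_black_box(check_url, CASES)
print("defects:", defects)
```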
Types of Black Box Testing
There are many types of Black Box Testing but following are the prominent
ones -
• Functional testing – This black box testing type is related to functional
requirements of a system; it is done by software testers.

• Non-functional testing – This type of black box testing is not related to testing
of a specific functionality, but to non-functional requirements such as
performance, scalability, usability.
• Regression testing – Regression testing is done after code fixes, upgrades or
any other system maintenance to check that the new code has not affected the
existing code.
WHITE BOX TESTING
White Box Testing is the testing of a software solution's internal coding and
infrastructure. It focuses primarily on strengthening security, the flow of inputs and
outputs through the application, and improving design and usability. White box
testing is also known as clear, open, structural, and glass box testing.
It is one of two parts of the "box testing" approach of software testing. Its
counterpart, black box testing, involves testing from an external or end-user type
perspective. White box testing, on the other hand, is based on the inner workings of
an application and revolves around internal testing. The term "white box" was used
because of the see-through box concept. The clear box or white box name
symbolizes the ability to see through the software's outer shell (or "box") into its
inner workings. Likewise, the "black box" in "black box testing" symbolizes not
being able to see the inner workings of the software so that only the end-user
experience can be tested.

What do you verify in White Box Testing ?


White box testing involves the testing of the software code for the following:
• Internal security holes
• Broken or poorly structured paths in the coding processes
• The flow of specific inputs through the code
• Expected output
• The functionality of conditional loops
• Testing of each statement, object and function on an individual basis

How do you perform White Box Testing?
To give a simplified explanation of white box testing, we have divided it
into two basic steps. This is what testers do when testing an application using the
white box testing technique:

Step 1) UNDERSTAND THE SOURCE CODE

The first thing a tester will often do is learn and understand the source code
of the application. Since white box testing involves the testing of the inner
workings of an application, the tester must be very knowledgeable in the
programming languages used in the applications they are testing. Also, the tester
must be highly aware of secure coding practices. Security is often one of
the primary objectives of testing software. The tester should be able to find
security issues and prevent attacks from hackers and naive users who might inject
malicious code into the application either knowingly or unknowingly.

Step 2) CREATE TEST CASES AND EXECUTE

The second basic step of white box testing involves testing the application’s
source code for proper flow and structure. One way is by writing more code to test
the application’s source code. The tester will develop little tests for each process or
series of processes in the application. This method requires that the tester
have intimate knowledge of the code and is often done by the developer. Other
methods include manual testing, trial and error testing and the use of testing tools.

Below are the screen shots for the implementation process.

We have the test screen:

Testing Screen

We will now test a legitimate website by entering its URL on the test screen. The
output is shown in a new window, reporting the spoofed webpage as: No

Testing the legitimate site

We will now test a phishing website. The result opens in a new window, reporting
the spoofed webpage as: Yes

Testing the Phishing website

An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds.
This curve plots two parameters: True Positive Rate and False Positive Rate.
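
To make the two parameters concrete, the sketch below computes the (False Positive Rate, True Positive Rate) points of an ROC curve and the area under it in plain Python. The labels and scores are made-up illustration values, not results from this project, and the function names are hypothetical. It assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct score used as threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = phishing) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_points(y_true, scores)
print(points)
print(area_under_curve(points))
```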

ROC Curve for Random Forest Classifier

9. OBSERVATIONS AND RESULT

Observation:-

As discussed in the earlier sections, we have used one classifier to predict and
detect if the website is phishing or legitimate.

Classifier                Precision  Recall  F1    AUC   Accuracy (%)

Random Forest Classifier  0.90       0.80    0.85  0.87  85.6
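
For reference, the metrics in the table are related as in the sketch below. The function names are our own. The last line checks that the reported precision and recall are consistent with the reported F1 score; the confusion-matrix counts the project obtained are not given here, so none are assumed.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the table's metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Precision 0.90 and recall 0.80 from the table give F1 of about 0.85.
print(round(f1_score(0.90, 0.80), 2))
```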

Result:-

We have obtained the desired results in testing whether a site is phishing or
not by using the Random Forest classifier. Refer to the graphs below for the exact
results. The graph in Fig. 15 shows the AUC, precision, recall and F1 score
obtained by using the classifier. The graph in Fig. 16 shows the accuracy obtained,
as a histogram.

[Fig. 15. Result: graph of AUC, Precision, Recall and F1 Score]

[Fig. 16. Accuracy]

10. CONCLUSION AND FUTURE SCOPE

Conclusion

Phishing attacks are a critical threat, and it is important to have a mechanism
to detect them. As very important and personal information of the user can be
leaked through phishing websites, it becomes all the more critical to take care of
this issue. This problem can be solved by using a machine learning algorithm with
a classifier. We already have classifiers which give a good prediction rate for
phishing, but our survey suggests that it would be better to use a hybrid approach
for prediction to further improve the accuracy of predicting phishing websites. We
have seen that the existing system gives lower accuracy, so we proposed a new
phishing detection method that employs URL-based features, and we also generated
classifiers through several machine learning algorithms. We have found that our
system provides 85.6% accuracy with the Random Forest Classifier. The proposed
technique is more secure, as it detects both new and previously known phishing
sites.

Future scope

In future, if we get a structured dataset of phishing URLs, we can perform phishing
detection much faster than with any other technique. We can also use a combination
of two or more classifiers to get maximum accuracy. We also plan to explore
various phishing detection techniques that use lexical features, network-based
features, content-based features, webpage-based features, and HTML and JavaScript
features of web pages, which can improve the performance of the system. In
particular, we extract features from URLs and pass them through the various
classifiers.

