Fake URL
Abstract:
Phishing is a fraudulent attempt to obtain sensitive personal information such as usernames, passwords, and bank details (for example credit/debit card numbers) by masquerading as a reliable organization in electronic communication. A phishing website appears identical to the legitimate website and directs the user to a page where personal details are entered on the fake site. Machine learning algorithms can improve the accuracy of detecting such sites. The proposed method predicts URL-based phishing websites from uniform resource locator (URL) features: we identify features that phishing-site URLs contain and employ them for detection, aiming for maximum prediction accuracy. We discuss several machine learning algorithms that can help in decision making and prediction, and use one of them to obtain better prediction accuracy.
1. INTRODUCTION
Phishing imitates the characteristics and features of legitimate emails so that a message looks the same as one from the original source. The user believes the email has come from a genuine company or organization and is pushed into visiting the phishing website through the links given in the phishing email. These phishing websites are made to mimic the appearance of an original organization's website. The phishers pressure the user into filling in personal information by sending alarming messages, "validate your account" requests and the like, so that the victim supplies the required information, which the attackers can then misuse. They engineer the situation so that the user is left with no option but to visit their spoofed website.
Phishing is a cyber crime. Attackers commit it because it is very easy to do, costs almost nothing and is effective: email addresses are easy to find nowadays, and sending an email to anyone anywhere in the world is free. These attackers invest very little cost and effort to obtain valuable data quickly and easily. Phishing frauds lead to malware infections, loss of data, identity theft and so on. The data these cyber criminals are interested in is the crucial information of a user, such as passwords, OTPs, credit/debit card numbers and CVVs, sensitive business data, medical data and other confidential data. Sometimes these criminals also gather information that gives them direct access to the victim's social media accounts or email.
A lot of software, approaches and algorithms are used for phishing detection, both in academic and in commercial organizations. A phishing URL and its corresponding page have many features that differ from those of a legitimate URL. For example, to hide the original domain name a phishing attacker can choose a very long and confusing domain name, which is easily visible. Sometimes they use an IP address instead of a domain name. They may also use a shorter domain name that bears no relation to the legitimate website. Apart from URL-based features, many other features can be used to detect phishing websites, namely Domain-Based Features, Page-Based Features and Content-Based Features.
In the training phase we should use labeled data containing samples of both classes, phishing and legitimate. If we do this, classification is straightforward for detecting phishing domains. To build a working detection model it is crucial to use such a data set in the training phase. We should use samples whose classes are known to us: the samples we label as phishing should be detected only as phishing, and similarly the samples labeled as legitimate should be detected as legitimate URLs. The dataset used for machine learning must actually contain these features. There are many machine learning algorithms, each with its own working mechanism, as discussed in the following section.
The existing system uses any one suitable machine learning algorithm to detect phishing URLs and reports its accuracy. This accuracy is good but still not the best; since phishing attacks are so critical, we have to find the best possible solution to eliminate them. In the currently existing system only one machine learning algorithm is used, and relying on a single algorithm is not a good approach for improving prediction accuracy. Each of the algorithms explained later has some disadvantages, hence it is not recommended to rely on one machine learning algorithm alone to further improve the accuracy.
1.2 METHODOLOGY:
In this section we describe the various classifiers used in machine learning to predict phishing and explain our proposed methodology to detect phishing websites. In section A we explain the classifiers and methods that can be used to distinguish phishing from legitimate websites. In section B we explain our proposed system.
Detecting and identifying phishing websites is a complex and dynamic problem. Machine learning has been widely used in many areas to create automated solutions. Phishing attacks can be carried out in many ways, such as email, website, malware, SMS and voice. In this work we concentrate on detecting website (URL) phishing, which is achieved by making use of a Hybrid Algorithm Approach: a mixture of different classifiers working together that gives a good prediction rate and improves the accuracy of the system. Depending on the application and the nature of the dataset, any of the classification algorithms mentioned below can be used. Because applications differ, we cannot say that one algorithm is universally superior; each classifier has its own way of working and classifying. Let us discuss each of them in detail.
Naive Bayes Classifier: This classifier is also known as a generative learning model. Classification is based on Bayes' theorem and assumes independent predictors. In simple words, the classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if the features depend on each other or on other features, each of them is treated as contributing independently to the probability of the output. This classification algorithm is very useful for large datasets and is very easy to use.
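As an illustration only (not part of the original implementation), a minimal Naive Bayes sketch with scikit-learn could look as follows; the binary feature matrix and labels here are randomly generated placeholders standing in for the URL features described later.

# Hypothetical sketch: Naive Bayes on binary URL-style features (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))   # placeholder 0/1 feature matrix (one row per URL)
y = rng.integers(0, 2, size=200)        # placeholder labels (1 = phishing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
nb = BernoulliNB()            # treats every feature as an independent binary predictor
nb.fit(X_train, y_train)      # estimates per-class feature probabilities from the training data
print("Naive Bayes accuracy:", nb.score(X_test, y_test))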
Random Forest: This is an ensemble learning method for classification, regression and other tasks. It works by building a group of decision trees at training time and outputting the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees. By averaging over many trees, this classifier corrects the tendency of individual decision trees to overfit the training data set.
Support vector machine (SVM): This is a supervised classification algorithm that is easy to use. It can be applied to both classification and regression problems, but it is better known for classification. In this algorithm each data item is plotted as a point in an n-dimensional space, where n is the number of features of the data. Classification is performed by finding the boundary that best separates the classes, i.e. the groups of data points lying in different regions of that space.
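In the same spirit, a hedged scikit-learn sketch of SVM classification (again on placeholder data rather than the project's real dataset) might be:

# Hypothetical sketch: SVM on URL-style feature vectors (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)  # each URL becomes a point in 5-dimensional space
y = rng.integers(0, 2, size=200)                     # placeholder labels (1 = phishing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
svm = SVC(kernel="rbf", C=1.0)    # finds the boundary that best separates the two classes
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))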
XGBoost: Recently, researchers have adopted the XGBoost algorithm, which is very useful for machine learning classification. It is fast and performs well because it is an implementation of gradient-boosted decision trees. This classification model is used to improve both the performance and the speed of the model.
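A comparable XGBoost sketch, assuming the xgboost package is installed (the data is again a placeholder):

# Hypothetical sketch: XGBoost, an implementation of gradient-boosted decision trees.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))   # placeholder 0/1 feature matrix
y = rng.integers(0, 2, size=200)        # placeholder labels (1 = phishing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
xgb = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, eval_metric="logloss")
xgb.fit(X_train, y_train)               # each new tree corrects the errors of the previous ones
print("XGBoost accuracy:", xgb.score(X_test, y_test))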
Once the model is trained it is very important to evaluate the chosen classifier and validate its capability. Having considered the advantages and disadvantages of the available classifiers above, we propose to use one classifier, Random Forest, to further improve the prediction accuracy. After classification the results are generated and the URLs are separated into phishing and legitimate URLs. The phishing URLs are blacklisted in the database and the legitimate ones are whitelisted in the database.
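A minimal sketch (our illustration, not the project's database code) of how the classified URLs could be sorted into a blacklist and a whitelist:

# Hypothetical sketch: partition classified URLs into blacklist/whitelist stores.
def update_lists(urls, predictions, blacklist, whitelist):
    # predictions[i] == 1 means the classifier flagged urls[i] as phishing
    for url, label in zip(urls, predictions):
        (blacklist if label == 1 else whitelist).add(url)

blacklist, whitelist = set(), set()
update_lists(["http://phish.example.com/login", "https://www.example.com"], [1, 0],
             blacklist, whitelist)
print("blacklist:", blacklist)
print("whitelist:", whitelist)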
The existing anti-phishing approaches use blacklist methods or feature-based machine learning techniques. Blacklist methods fail to detect new phishing attacks and produce a high false positive rate. Moreover, existing machine learning based methods extract features from third parties, search engines, etc., and are therefore complicated, slow in nature, and not fit for a real-time environment. To solve this problem, this work presents a machine learning based anti-phishing approach that extracts features from the client side only. We have examined the various attributes of phishing and legitimate websites in depth and identified 5 new outstanding features to distinguish phishing websites from legitimate ones.
1.4 PROPOSED SYSTEM:
The dataset of phishing and legitimate URLs is given to the system and is then pre-processed so that the data is in a usable format for analysis. The feature set covers around 300 characteristics of phishing websites that are used to differentiate them from legitimate ones. Each category has its own characteristic phishing attributes and defined values. The specified characteristics are extracted for each URL and valid input ranges are identified; these values are then mapped to each phishing website's risk. For each input the values range from 0 to 10, while the output ranges from 0 to 100. The phishing attribute values are represented with the binary numbers 0 and 1, indicating whether the attribute is present or not.
After the data is prepared we train it by applying a relevant machine learning algorithm to the dataset (the algorithms were explained in the previous section). We then use the Random Forest classifier to predict the accuracy of phishing URL detection and obtain our desired result; as mentioned above, this is the classifier we propose to use to test the data. We then test the data and evaluate the prediction accuracy, which should be higher than that of the existing system. The following sections look at the different classifiers and the combination used in our proposed system.
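The flow described above can be sketched end to end as follows; this is an illustrative outline only, and the file name and label column are assumptions rather than the project's actual artifacts.

# Hypothetical sketch of the proposed pipeline: load binary features, train Random Forest, report accuracy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("phishing_features.csv")    # assumed file of pre-processed 0/1 feature columns
X = df.drop(columns=["is_phished"])          # assumed label column name
y = df["is_phished"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)                     # training phase
print("Prediction accuracy:", accuracy_score(y_test, rf.predict(X_test)))   # testing phase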
Fig. 1: Proposed System block diagram
2. REQUIREMENT AND SPECIFICATIONS
For developing the application the following are the Software Requirements:
For developing the application the following are the Hardware Requirements:
RAM : 4 GB
ANACONDA NAVIGATOR
Why use Navigator?
In order to run, many scientific packages depend on specific versions of other packages. Data
scientists often use multiple versions of many packages and use multiple environments to
separate these different versions.
The command-line program conda is both a package manager and an environment manager. This
helps data scientists ensure that each version of each package has all the dependencies it requires
and works correctly.
Navigator is an easy, point-and-click way to work with packages and environments without
needing to type conda commands in a terminal window. You can use it to find the packages you
want, install them in an environment, run the packages, and update them – all inside Navigator.
JupyterLab
Jupyter Notebook
Spyder
VSCode
Glueviz
Orange 3 App
Rstudio
Jupyter Notebook
The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations and narrative text. Uses include: data
cleaning and transformation, numerical simulation, statistical modeling, data visualization,
machine learning, and much more.
Python IDLE
IDLE is not available by default in Python distributions for Linux. It needs to be installed using the distribution's package manager (for example, the idle package on Ubuntu).
IDLE can be used to execute a single statement just like Python Shell and also to create,
modify and execute Python scripts. IDLE provides a fully-featured text editor to create Python
scripts that includes features like syntax highlighting, autocompletion and smart indent. It also
has a debugger with stepping and breakpoints features.
To start IDLE interactive shell, search for the IDLE icon in the start menu and double click on
it.
Python IDLE
This will open IDLE, where you can write Python code and execute it as shown below.
Python IDLE
Now, you can execute Python statements same as in Python Shell as shown below.
Python IDLE
To execute a Python script, create a new file by selecting File -> New File from the menu.
Enter multiple statements and save the file with the extension .py using File -> Save. For example, save a short script such as print("Hello World!") as hello.py. Now press F5 to run the script in the editor window; the IDLE shell will show the output. Thus, it is easy to write, test and run Python scripts in IDLE.
3. FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential. Three key considerations involved in the feasibility analysis are:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, so that only minimal or no changes are required to implement it.
SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the users. It includes the process of training the users to use the system efficiently. The users must not feel threatened by the system, but must accept it as a necessity. The level of acceptance by the users depends on the methods employed to educate them about the system and to make them familiar with it. Their confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.
4. SYSTEM ANALYSIS
To provide flexibility to the users, interfaces have been developed that are accessible through a GUI application, with the GUIs categorized at the top level accordingly. Input design is a part of the overall system design; the main objectives during input design are given below.
INPUT DESIGN:
The input design is the link between the information system and the user. It comprises developing the specifications and procedures for data preparation, the steps necessary to put transaction data into a usable form for processing. This can be achieved by having the computer read data from a written or printed document, or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps and keeping the process simple. The input is designed so that it provides security and ease of use while retaining privacy. Input design considered the following things:
OBJECTIVES:
2. It is achieved by creating user-friendly screens for data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed so that all required data manipulations can be performed; it also provides record viewing facilities.
3. When data is entered it is checked for validity. Data can be entered with the help of screens, and appropriate messages are provided when needed so that the user is never left confused. Thus the objective of input design is to create an input layout that is easy to follow.
OUTPUT DESIGN:
A quality output is one which meets the requirements of the end user and presents the information clearly. In any system the results of processing are communicated to the users and to other systems through outputs. In output design it is determined how the information is to be displayed for immediate need, as well as the hard copy output. It is the most important and direct source of information for the user. Efficient and intelligent output design improves the system's relationship with the user and supports decision-making.
The output of an information system should accomplish one or more of the following objectives.
1. Define Project Objectives
3. Model Data
Variable selection
Build candidate models
Model validation and selection
Create model monitoring and maintenance
Architecture flow:
The architecture diagram below mainly represents the flow from the training phase to the detection phase. First the data is pre-processed and features are extracted using different feature sets; this dataset is then used to train the corresponding algorithms, and the output is displayed.
5. SYSTEM DESIGN
5.1 INTRODUCTION:
Systems design is the process of defining the architecture, modules, interfaces, and data for a system to satisfy specified requirements. One could see it as the application of systems theory to product development. There is some overlap and synergy with the disciplines of systems analysis, systems architecture and systems engineering.
[System block diagram: Dataset Collection → Pre-processing → Feature Extraction → Trained & Testing dataset → Algorithm and classification → Prediction and Result]
5.3 UML DIAGRAMS:
The Unified Modeling Language (UML) allows the software engineer to express an analysis model using a modeling notation that is governed by a set of syntactic, semantic and pragmatic rules. A UML system is represented using five different views that describe the system from distinctly different perspectives. Each view is defined by a set of diagrams, as follows.
In this, the structural and behavioral aspects of the environment in which the system is to be implemented are represented.
UML analysis modeling focuses on the user model and structural model views of the system.
UML design modeling focuses on the behavioral modeling, implementation modeling and environmental model views.
USE CASE DIAGRAM:
Use case diagrams represent the functionality of the system from a user's point of view. Use cases are used during requirements elicitation and analysis to represent the functionality of the system, focusing on the behavior of the system from an external point of view.
ACTORS:
Actors are external entities that interact with the system. Examples of actors include users such as an administrator or a bank customer, or another system such as a central database.
1. USE CASE DIAGRAM:
a. User
2. COMPONENT DIAGRAM:
3. CLASS DIAGRAM
4. ACTIVITY DIAGRAM:
User
5. SEQUENCE DIAGRAM
6. SYSTEM OVERVIEW
System design is used for understanding the construction of the system. In this section we explain the flow of our system and the software used in it.
System Flow
Fig. 2 shows the flow chart of the system design; each component of the flow chart is explained in the sections below. To get structured data we perform feature generation at the pre-processing stage. We have used the Random Forest classifier to detect phishing and legitimate websites.
Fig. 2: Flow of the System
ALGORITHM:
RANDOM FOREST:
Random forest is a supervised learning algorithm that is used for both classification and regression; however, it is mainly used for classification problems. Just as a forest is made up of trees, and more trees make a more robust forest, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them and finally selects the best solution by means of voting. It is an ensemble method that is better than a single decision tree because it reduces over-fitting by averaging the result.
We can understand the working of Random Forest algorithm with the help of following steps −
Step 1 − First, start with the selection of random samples from a given dataset.
Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the
prediction result from every decision tree.
Step 3 − In this step, voting will be performed for every predicted result.
Step 4 − At last, select the most voted prediction result as the final prediction result.
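To make steps 1-4 concrete, the sketch below (an illustration, not the project's code) inspects the individual trees of a scikit-learn random forest and reproduces the vote by hand on synthetic data.

# Hypothetical sketch: majority voting across the individual decision trees of a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)   # placeholder data
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)    # Steps 1-2: samples + trees

sample = X[:1]                                                      # one feature vector (one URL)
votes = [int(tree.predict(sample)[0]) for tree in rf.estimators_]   # Step 2: prediction of every tree
majority = np.bincount(votes).argmax()                              # Steps 3-4: vote and pick the winner
# Note: scikit-learn actually averages the trees' class probabilities,
# which usually agrees with this hard majority vote.
print("tree votes:", votes)
print("majority vote:", majority, "| forest prediction:", int(rf.predict(sample)[0]))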
Data set: The URL data is obtained from the PhishTank website, an anti-phishing site. It contains 2905 URLs in unstructured form. Our main objective is to detect whether a URL is phishing or legitimate.
Step 1: Data preprocessing
This dataset contains website links, some of which are legitimate websites and some of which are fake. We pre-process the data before building a model and extract features from it based on certain conditions.
The URLs are split with split(separator, n = number of splits according to the separator (delimiter), expand), and the resulting columns of the data frame are renamed as domain name and address. We also need to add a new column to the data frame named "is phished".
The domain name column can be further sub-divided into domain_names as well as sub_domain_names. Similarly, the address column can be further sub-divided into path, query_string and file.
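A rough pandas sketch of this preprocessing, under the assumption that the raw file simply holds a URL column and a yes/no label (the column and file names are illustrative):

# Hypothetical preprocessing sketch: split raw URLs into protocol / domain_name / address columns.
import pandas as pd

df = pd.read_csv("phishtank_urls.csv", names=["url", "is_phished"])   # assumed raw file layout

# str.split(separator, n = number of splits, expand=True), as described above
parts = df["url"].str.split("://", n=1, expand=True)
df["protocol"] = parts[0]
rest = parts[1].str.split("/", n=1, expand=True).reindex(columns=[0, 1])
df["domain_name"], df["address"] = rest[0], rest[1]

# domain_name can be further sub-divided into sub_domain_name and domain parts,
# and address into path and query_string (and file, if present)
df["sub_domain_name"] = df["domain_name"].str.split(".", n=1, expand=True)[0]
addr = df["address"].str.split("?", n=1, expand=True).reindex(columns=[0, 1])
df["path"], df["query_string"] = addr[0], addr[1]
print(df.head())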
Feature Extraction
Feature 1
If the length of the URL is greater than or equal to 54 characters, the URL is classified as phishing.
IF {URL length >= 54 → Phishing; Otherwise → Legitimate}
Feature 2
Using the "@" symbol in the URL leads the browser to ignore everything preceding the "@" symbol, and the real address often follows the "@" symbol.
IF {URL contains "@" → Phishing; Otherwise → Legitimate}
Feature 3
The existence of "//" within the URL path means that the user will be redirected to another website. An example of such a URL is "http://www.legitimate.com//http://www.phishing.com". We examine the location where the "//" appears: if the URL starts with "HTTP", the "//" should appear in the sixth position, whereas if the URL employs "HTTPS" the "//" should appear in the seventh position.
IF {The position of the last occurrence of "//" in the URL > 7 → Phishing; Otherwise → Legitimate}
Feature 4
The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by "-" to the domain name so that users feel they are dealing with a legitimate webpage, for example http://www.Confirme-paypal.com/.
IF {Domain name includes "-" → Phishing; Otherwise → Legitimate}
Feature 5
A legitimate URL typically has two dots in the URL, since the "www." part can be ignored. If the number of dots equals three, the URL is classified as "Suspicious", since it has one sub-domain. If there are more than three dots it is classified as "Phishy", since it will have multiple sub-domains.
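The five rules above could be implemented roughly as follows. This is an illustrative sketch: the function name is ours, the thresholds follow the text, and the "Suspicious" middle case of Feature 5 is collapsed into the legitimate side for a purely binary output.

# Hypothetical sketch of the rule-based URL features described above (1 = phishing indicator).
from urllib.parse import urlparse

def extract_features(url):
    domain = urlparse(url).netloc
    return {
        "long_url": 1 if len(url) >= 54 else 0,                    # Feature 1: length >= 54 characters
        "having_@_symbol": 1 if "@" in url else 0,                 # Feature 2: "@" present in the URL
        "redirection_//_symbol": 1 if url.rfind("//") > 6 else 0,  # Feature 3: last "//" beyond the protocol prefix
        "prefix_suffix_separation": 1 if "-" in domain else 0,     # Feature 4: "-" in the domain name
        "sub_domains": 1 if url.count(".") > 3 else 0,             # Feature 5: more than three dots
    }

print(extract_features("http://www.Confirme-paypal.com/"))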
A preview of the resulting structured data frame (columns: protocol | domain_name | address | is_phished | long_url | having_@_symbol | redirection_//_symbol | prefix_suffix_separation | sub_domains):

0  | https | adverds.000webhostapp.com | locking-app-payment-update-0.html?fb_source=bookmark_apps&... | yes | 1 | 0 | 0 | 1 | 0
1  | http  | www.myhealthcarepharmacy.ca | wp-includes/js/jquery/ini.php | yes | 2 | 0 | 0 | 0 | 0
2  | http  | code.google.com | p/pylevenshtein/ | no | 0 | 0 | 0 | 0 | 0
3  | http  | linkedin.com |  | no | 0 | 0 | 0 | 0 | 0
5  | https | doc-08-bc-apps-viewer.googleusercontent.com | viewer/secure/pdf/oeseuikt874lsab1nd951o2hb6vb... | no | 1 | 0 | 0 | 1 | 0
6  | http  | www.7-zip.org | download.html | no | 0 | 0 | 0 | 1 | 0
7  | http  | ebay.com |  | no | 0 | 0 | 0 | 0 | 0
8  | https | login.yahoo.com | config/login?.done=https%3a%2f%2fapi.login.yah... | no | 1 | 0 | 0 | 0 | 0
9  | http  | mobile-free.metulwx.com | mobile/impaye/ | yes | 0 | 0 | 0 | 1 | 0
10 | https | www.gulfshield.com | uploads/gallery/mankind/secure/fiscal/manual/m... | yes | 1 | 0 | 0 | 0 | 0
11 | http  | rodritol.com | components/kttm/dropbox/ | yes | 0 | 0 | 0 | 0 | 0
Evaluating classifier
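The evaluation could be carried out with scikit-learn's standard metrics; the sketch below is illustrative (placeholder data rather than the project's dataset) but uses the same metric names reported in the results chapter.

# Hypothetical evaluation sketch: accuracy, precision, recall, F1 and ROC AUC for the trained classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=1)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
pred = rf.predict(X_test)
score = rf.predict_proba(X_test)[:, 1]   # probability of the positive (phishing) class

print("Accuracy :", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall   :", recall_score(y_test, pred))
print("F1 score :", f1_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, score))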
Table 1: URL features (columns: Sr. No, Feature name, Description).
URL Features: Referring to Table 1 above, Features 1 to 4 are associated with suspicious URL patterns and characters. Characters such as "@" and "//" rarely appear in a URL. Feature 5 is useful for recognizing newly created phishing sites with the proposed methodology: to prevent a user from noticing that a site is not legitimate, phishing sites typically hide the primary domain, so the URLs of these phishing sites have unusually long sub-domains.
Feature 3 is another new feature that reflects current phishing trends. This feature includes seven words that are predefined as phishing terms: secure, websrc, ebaysapi, signin, banking, confirm and login. Thus, through experiments, we identified seven new phishing terms and we employ them in our phishing detection technique. We have already discussed the different classifiers in the sections above.
7. IMPLEMENTATION & SAMPLE CODE
Implementation steps
In this section we discuss the actual steps that were implemented while carrying out the experiment, and explain the stepwise procedure used to analyse the data and to predict phishing. The system consists of the following main steps. We have used unstructured data consisting only of URLs: 2905 URLs obtained from the PhishTank website, containing both phishing and legitimate URLs, with most of them being phishing.
After this a structured dataset is created in which each feature contains a binary value (0, 1), which is then passed to the classifiers.
Next we train the Random Forest classifier and measure its performance on the basis of accuracy.
The classifier then evaluates a given URL based on the training data: if the site is phishing, it reports the spoofed website as yes, otherwise as no.
7.1 SAMPLE CODE:
PYTHON CODE
from PyQt5 import QtCore, QtWidgets

import feature_extractor


class Ui_Spam_detector(object):
    def setupUi(self, Spam_detector):
        Spam_detector.setObjectName("Spam_detector")
        Spam_detector.resize(521, 389)
        self.centralwidget = QtWidgets.QWidget(Spam_detector)
        self.centralwidget.setObjectName("centralwidget")
        # (geometry calls for the individual widgets are omitted here)

        # Button that triggers the URL check
        self.check_button = QtWidgets.QPushButton(self.centralwidget)
        self.check_button.setObjectName("check_button")
        self.check_button.clicked.connect(self.button_click)

        # Text box where the user types the URL
        self.url_input = QtWidgets.QLineEdit(self.centralwidget)
        self.url_input.setObjectName("url_input")

        self.label = QtWidgets.QLabel(self.centralwidget)
        self.label.setObjectName("label")

        # Output message area
        self.output_text = QtWidgets.QTextEdit(self.centralwidget)
        self.output_text.setObjectName("output_text")

        self.label_2 = QtWidgets.QLabel(self.centralwidget)
        self.label_2.setObjectName("label_2")

        Spam_detector.setCentralWidget(self.centralwidget)
        self.statusbar = QtWidgets.QStatusBar(Spam_detector)
        self.statusbar.setObjectName("statusbar")
        Spam_detector.setStatusBar(self.statusbar)

        self.retranslateUi(Spam_detector)
        QtCore.QMetaObject.connectSlotsByName(Spam_detector)

    def retranslateUi(self, Spam_detector):
        _translate = QtCore.QCoreApplication.translate
        Spam_detector.setWindowTitle(_translate("Spam_detector", "MainWindow"))
        self.label.setText(_translate("Spam_detector",
            "<html><head/><body><p><span style=\" font-size:10pt;\">URL :</span></p></body></html>"))
        self.label_2.setText(_translate("Spam_detector",
            "<html><head/><body><p align=\"center\"><span style=\" font-size:16pt;\">Spam URL Detector</span></p></body></html>"))

    def button_click(self):
        # Extract features from the entered URL and append the verdict to the output box
        text = self.url_input.text()
        obj = feature_extractor.feature_extractor(text)
        str1, str2 = obj.extract()
        self.output_text.append("{} \n{}\n\n".format(str1, str2))


if __name__ == "__main__":
    import sys

    app = QtWidgets.QApplication(sys.argv)
    Spam_detector = QtWidgets.QMainWindow()
    ui = Ui_Spam_detector()
    ui.setupUi(Spam_detector)
    Spam_detector.show()
    sys.exit(app.exec_())
8. TESTING & SCREEN SHOTS
The purpose of testing is to verify that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test, and each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing:
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application and is done after the completion of an individual unit, before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
Integration testing:
Integration tests are designed to test integrated software components to determine whether they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.
Functional testing:
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
LEVELS OF TESTING:
Code testing:
This examines the logic of the program. For example, the logic for updating various sample data with the sample files and directories was tested and verified.
Specification Testing:
This executes the specification stating what the program should do and how it should perform under various conditions. Test cases for various situations and combinations of conditions in all the modules are tested.
Unit testing:
In unit testing we test each module individually and then integrate it with the overall system. Unit testing focuses verification efforts on the smallest unit of software design, the module; this is also known as module testing. Each module of the system is tested separately. This testing is carried out during the programming stage itself. In the testing step each module is found to work satisfactorily with regard to the expected output from the module. There are also validation checks for fields; for example, a validation check is applied to the user input to verify the validity of the data entered. This makes it very easy to find errors and debug the system.
BLACK BOX TESTING
This type of testing is based entirely on the software requirements and specifications. In black box testing we focus only on the inputs and outputs of the software system, without concerning ourselves with internal knowledge of the software program.
The black box can be any software system you want to test, for example an operating system like Windows, a website like Google, a database like Oracle, or even your own custom application. Under black box testing you can test these applications by focusing only on the inputs and outputs, without knowing their internal code implementation.
Black box testing - Steps
Here are the generic steps followed to carry out any type of Black Box Testing.
• Initially requirements and specifications of the system are examined.
• Tester chooses valid inputs (positive test scenario) to check whether SUT
processes them correctly. Also some invalid inputs (negative test scenario)
are chosen to verify that the SUT is able to detect them.
• Tester determines expected outputs for all those inputs.
• Software tester constructs test cases with the selected inputs.
• The test cases are executed.
• Software tester compares the actual outputs with the expected outputs.
• Defects if any are fixed and re-tested.
Types of Black Box Testing
There are many types of Black Box Testing but following are the prominent
ones -
• Functional testing – This black box testing type is related to functional
requirements of a system; it is done by software testers.
• Non-functional testing – This type of black box testing is not related to testing
of a specific functionality, but non-functional requirements such as
performance, scalability, usability.
• Regression testing – Regression testing is done after code fixes, upgrades or any other system maintenance to check that the new code has not affected the existing code.
WHITE BOX TESTING
White Box Testing is the testing of a software solution's internal coding and
infrastructure. It focuses primarily on strengthening security, the flow of inputs and
outputs through the application, and improving design and usability. White box
testing is also known as clear, open, structural, and glass box testing.
It is one of two parts of the "box testing" approach of software testing. Its
counter-part, blackbox testing, involves testing from an external or end-user type
perspective. On the other hand, Whitebox testing is based on the inner workings of
an application and revolves around internal testing. The term "whitebox" was used
because of the see-through box concept. The clear box or whitebox name
symbolizes the ability to see through the software's outer shell (or "box") into its
inner workings. Likewise, the "black box" in "black box testing" symbolizes not
being able to see the inner workings of the software so that only the end-user
experience can be tested
How do you perform White Box Testing?
To give you a simplified explanation of white box testing, we have divided it
into two basic steps. This is what testers do when testing an application using the
white box testing technique:
Below are the screen shots for the implementation process.
Testing Screen
We will now test a legitimate website by entering its URL on the test screen. The output is shown in a new window, reporting Spoofed webpage: No.
We will now test a phishing website; the result opens in a new window reporting Spoofed webpage: Yes.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the True Positive Rate and the False Positive Rate.
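One way to plot such a curve is with scikit-learn and matplotlib; the sketch below is illustrative, using placeholder data and mirroring the evaluation sketch shown earlier.

# Hypothetical sketch: plotting the ROC curve from the classifier's scores.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=1)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

fpr, tpr, _ = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])   # FPR/TPR at every threshold
plt.plot(fpr, tpr, label="ROC curve (AUC = %.2f)" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--")                          # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()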
9. OBSERVATIONS AND RESULT
Observation:
As discussed in the earlier sections, we have used a single classifier to predict and detect whether a website is phishing or legitimate.
Result:
The result shows whether a given website is phishing or not, obtained using the Random Forest classifier. Refer to the graphs below for the exact results. The graph in Fig. 15 shows the AUC, precision, recall and F1 score obtained using the classifier, and the graph in Fig. 16 shows the accuracy obtained using the different classifiers as a histogram.
[Fig. 15: Bar graph of the AUC, Precision, Recall and F1 Score obtained by the classifier (values lie roughly between 0.76 and 0.92).]
[Fig. 16: Histogram of the accuracy obtained by the different classifiers (approximate values 0.87, 0.9, 0.8 and 0.85).]
10. CONCLUSION AND FUTURE SCOPE
Conclusion
Phishing attacks are a serious threat, and it is important to have a mechanism to detect them. Because very important and personal user information can be leaked through phishing websites, it becomes critical to address this issue. The problem can be tackled by using a machine learning algorithm with a suitable classifier. Existing classifiers already give a good prediction rate for phishing, but our survey suggests that a hybrid approach could further improve the prediction accuracy for phishing websites. Since the existing system gives lower accuracy, we proposed a new phishing detection method that employs URL-based features, and we generated classifiers using several machine learning algorithms. We found that our system provides 85.6% accuracy with the Random Forest classifier. The proposed technique is more robust, as it detects both new and previously known phishing sites.
Future scope