Detection of Phishing Website
1.3 OBJECTIVE
2. It is achieved by creating user-friendly screens for data entry that can handle large volumes of data. The goal of input design is to make data entry easier and free from errors. The data-entry screens are designed so that all data manipulations can be performed; they also provide record-viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the help of screens, and appropriate messages are provided when needed, so that the user is never left in a maze. Thus the objective of input design is to create an input layout that is easy to follow.
CHAPTER 2
LITERATURE SURVEY
2.1 EXISTING AND PROPOSED SYSTEM
EXISTING SYSTEM:
H. Huang et al. (2009) proposed a framework that detects phishing using page-segment similarity: it analyzes URL tokens to improve prediction accuracy, exploiting the fact that phishing pages usually keep a CSS style similar to that of their target pages.
S. Marchal et al. (2017) proposed a technique to detect phishing websites based on the analysis of legitimate website server-log data. Their application for the identification of phishing websites, Off-the-Hook, is free and exhibits several notable properties, including high accuracy, full client-side autonomy, language independence, fast decision-making, resilience to dynamic phishing, and adaptability to evolving phishing tactics.
Mustafa Aydin et al. proposed a classification algorithm for phishing-website detection that extracts features from website URLs and analyzes subset-based feature-selection methods. It implements feature extraction and selection for the detection of phishing websites. The extracted URL features composing the feature matrix are categorized into five analyses: Alphanumeric Character Analysis, Keyword Analysis, Security Analysis, Domain Identity Analysis, and Rank Based Analysis. Most of these features are textual properties of the URL itself; the others are based on third-party services.
The existing system compares three machine learning methods: Logistic Regression, Multinomial Naive Bayes, and XGBoost.

DISADVANTAGES OF EXISTING SYSTEM:
The existing models suffer from high latency.
Existing systems do not have a specific user interface.
The existing model fails to predict a continuous outcome; it works only when the dependent (outcome) variable is dichotomous.
The existing system model may not be accurate if the sample size is too small.
The existing system may lead to overfitting.
PROPOSED SYSTEM:

We have developed our project using a website as the platform for all users. It is an interactive, responsive website that is used to detect whether a website is legitimate or phishing. The website is built with several web technologies, including HTML, CSS, JavaScript, and the Flask framework in Python. The basic structure of the website is made with HTML, while CSS adds styling to make it more attractive and user-friendly. Since the website is created for all users, it must be easy to operate, and no user should face any difficulty while using it.
The proposed system is trained on a dataset consisting of different features; note that the dataset does not contain any website URLs. Instead, it consists of the features that are taken into consideration while determining whether a website URL is legitimate or phishing.
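To give an idea of how the Flask side could look, here is a minimal sketch. The route, the model file name model.pkl, the index.html template, and the extract_features helper are illustrative assumptions, not the project's actual code, and only two of the dataset's features are computed here for brevity:

    import pickle
    import re

    from flask import Flask, render_template, request

    app = Flask(__name__)

    # Assumed artifact: a classifier previously trained on the dataset's features.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    def extract_features(url):
        # Illustrative stand-in computing two of the dataset's features
        # (thresholds assumed); the real pipeline derives the full vector.
        using_ip = 1 if re.search(r"\d{1,3}(\.\d{1,3}){3}", url) else -1
        long_url = -1 if len(url) >= 75 else (0 if len(url) >= 54 else 1)
        return [using_ip, long_url]

    @app.route("/", methods=["GET", "POST"])
    def index():
        verdict = None
        if request.method == "POST":
            features = extract_features(request.form["url"])
            # Assuming the dataset's convention: 1 = legitimate, -1 = phishing.
            verdict = "Legitimate" if model.predict([features])[0] == 1 else "Phishing"
        return render_template("index.html", verdict=verdict)  # assumed template

    if __name__ == "__main__":
        app.run(debug=True)

The HTML form posts the URL to the same route, and the verdict is rendered back into the page.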
Develop a standalone application or a browser extension that users can install. This
tool can scan URLs or emails in real-time and alert users if they are potentially
accessing a phishing website or clicking on a malicious link.
Create an email client plugin or standalone tool that integrates with popular email
services. This tool can scan incoming emails for phishing indicators such as
suspicious links or content, helping users identify and avoid phishing attempts.
Phishing Simulation Platform: Build a platform that allows organizations to
simulate phishing attacks on their employees. This can help in training employees
to recognize phishing attempts and improve overall cybersecurity awareness within
the organization.
Develop a system where users can report suspected phishing websites or emails.
Implement machine learning models to automatically analyze reported cases and
identify new phishing patterns for further investigation.
Create a dashboard that provides real-time insights into phishing trends and
activities. This can be useful for security analysts and organizations to monitor and
respond quickly to emerging phishing threats.
Build an API that developers can integrate into their applications to add phishing
detection capabilities. This can be useful for online platforms, browsers, and
security software to enhance their security features.
CHAPTER 3
METHODOLOGY
3.1 GRADIENT BOOSTING

Step 1: The first step in gradient boosting is to build a base model to predict the observations in the training dataset. For simplicity, we take the average of the target column and assume that to be the initial predicted value.
Why do we take the average of the target column? There is math behind this. Mathematically, the first step can be written as:

F0(x) = argmin_γ Σ_{i=1..n} L(yi, γ)
Looking at this may give you a headache, but don’t worry we will try to
understand what is written here.
Here L is our loss function
Gamma is our predicted value
argmin means we have to find a predicted value/gamma for which the loss function
is minimum.
Since the target column is continuous, our loss function will be the squared error:

L = (1/2) Σ_{i=1..n} (yi − γ)^2
Here yi is the observed value and γ (gamma) is the predicted value.
Now we need to find the value of γ for which this loss function is minimum. We all studied how to find minima and maxima in 12th grade: we differentiate the function and set the derivative equal to 0. We will do the same here.
Let's see how to do this with the help of our example. Remember that yi is our observed value and γ is our predicted value. Differentiating the loss with respect to γ and setting it to zero gives:

∂L/∂γ = −Σ_{i=1..n} (yi − γ) = 0, so γ = (1/n) Σ_{i=1..n} yi

That is, γ is simply the average of the observed values, which in our example is 14500. Hence for γ = 14500 the loss function is minimum, so this value becomes the prediction of our base model.
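To make this concrete, here is a quick numerical check. Since the original data table is not reproduced here, the target values below are hypothetical, chosen only so that their mean is 14500:

    import numpy as np

    # Hypothetical target column; values chosen so that their mean is 14500.
    y = np.array([12000.0, 13500.0, 15500.0, 17000.0])

    # Evaluate the squared-error loss over a grid of candidate gammas.
    gammas = np.arange(10000, 20001, 100)
    losses = [0.5 * np.sum((y - g) ** 2) for g in gammas]

    print(gammas[int(np.argmin(losses))])  # 14500 -- same as the mean
    print(y.mean())                        # 14500.0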
Step 2: The next step is to calculate the pseudo-residuals, which are (observed value − predicted value).

Again the question comes: why only observed − predicted? This is mathematically justified; let's see where the formula comes from. This step can be written as:

rim = −[∂L(yi, F(xi)) / ∂F(xi)], evaluated at F(x) = Fm−1(x)
Here F(xi) is the previous model and m is the number of decision trees (DTs) built so far. We are just taking the derivative of the loss function with respect to the predicted value, and we have already calculated this derivative:

∂L/∂F(xi) = −(yi − F(xi))

In the formula of the residuals above, this derivative is multiplied by a negative sign, so we get:

rim = yi − Fm−1(xi)
The predicted value here is the prediction made by the previous model. In our example, the prediction made by the previous model (the initial base-model prediction) is 14500, so to calculate the residuals our formula becomes:

ri1 = yi − 14500
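In code, step 2 is a one-liner. Using the same hypothetical target values as in the earlier snippet (note that the first residual comes out as −2500, matching the leaf output computed below):

    import numpy as np

    y = np.array([12000.0, 13500.0, 15500.0, 17000.0])  # hypothetical targets
    f0 = y.mean()                                       # base prediction, 14500.0

    residuals = y - f0                                  # observed - predicted
    print(residuals)                                    # [-2500. -1000.  1000.  2500.]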
Next we build a decision tree on these residuals and compute the output value for each of its leaves. Let's see why we take the average of all the numbers in a leaf. Mathematically, this step can be represented as:

γm = argmin_γ Σ_{i=1..n} L(yi, Fm−1(xi) + γ · hm(xi))

Here hm(xi) is the DT made on the residuals and m is the index of the DT: when m = 1 we are talking about the 1st DT, and when it is M we are talking about the last DT.
The output value for a leaf is the value of γ that minimizes this loss function. On the left-hand side, γ is the output value of a particular leaf. The right-hand side, [Fm−1(xi) + γ · hm(xi)], is similar to step 1, but here the difference is that we take the previous predictions into account, whereas earlier there was no previous prediction.
Let’s understand this even better with the help of an example. Suppose this is our
regressor tree:
(Figure: the example regression tree built on the residuals; not reproduced here.)
We see that the 1st residual goes into R1,1, the 2nd and 3rd residuals go into R2,1, and the 4th residual goes into R3,1.

Let's calculate the output for the first leaf, R1,1:

γ1,1 = argmin_γ (1/2) (y1 − (F0(x1) + γ))^2
Now we need to find the value of γ for which this function is minimum, so we take the derivative of this equation with respect to γ and set it equal to 0:

−(y1 − F0(x1) − γ) = 0, so γ = y1 − F0(x1)

which is exactly the (single) residual in that leaf. Hence the leaf R1,1 has an output value of −2500. Now let's solve for R2,1:

γ2,1 = argmin_γ (1/2) [(y2 − (F0(x2) + γ))^2 + (y3 − (F0(x3) + γ))^2]
Taking the derivative and setting it to 0 to get the value of γ for which this function is minimum, we obtain:

γ = ((y2 − F0(x2)) + (y3 − F0(x3))) / 2

We end up with the average of the residuals in the leaf R2,1. Hence, whenever a leaf contains more than one residual, we can simply take the average of the residuals in that leaf, and that will be its final output.
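A tiny sketch of the leaf-output rule, using the residual-to-leaf assignment described above (the residual values are the hypothetical ones from the earlier snippets):

    # Each leaf's output is the average of the residuals assigned to it.
    leaves = {
        "R1,1": [-2500.0],           # 1st residual
        "R2,1": [-1000.0, 1000.0],   # 2nd and 3rd residuals
        "R3,1": [2500.0],            # 4th residual
    }

    for name, rs in leaves.items():
        print(name, sum(rs) / len(rs))   # -2500.0, 0.0, 2500.0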
Now, after calculating the outputs of all the leaves, we update the model:

Fm(x) = Fm−1(x) + ν · hm(x)

Here Fm−1(x) is the previous prediction; since F1−1 = F0 is our base model, the previous prediction is 14500.

ν (nu) is the learning rate, usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, and this improves accuracy in the long run. Let's take ν = 0.1 in this example.

hm(x) is the most recent DT, the one made on the residuals.
Let's calculate the new prediction now:

F1(x) = F0(x) + ν · h1(x) = 14500 + 0.1 · h1(x)

I am taking a hypothetical example here just to make you understand how this predicts for a new data point: if a new data point, say height = 1.40, comes in, it will go through all the trees and then give the prediction. Here we have only 2 trees, so the data point will go through both trees and the final output will be F2(x):

F2(x) = F0(x) + ν · h1(x) + ν · h2(x)
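Putting the whole regression procedure together, here is a hedged end-to-end sketch. X, y, and the tree depth are hypothetical, and scikit-learn's DecisionTreeRegressor stands in for the hand-drawn trees (for squared error its leaf values are exactly the residual averages derived above):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical training data: one feature (e.g. height) and the
    # target values used in the earlier snippets.
    X = np.array([[1.2], [1.4], [1.6], [1.8]])
    y = np.array([12000.0, 13500.0, 15500.0, 17000.0])

    nu, M = 0.1, 2                        # learning rate, number of trees
    f0 = y.mean()                         # step 1: F0(x) = 14500.0
    pred = np.full(len(y), f0)
    trees = []

    for m in range(M):
        residuals = y - pred                                         # step 2
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # fit hm
        trees.append(tree)
        pred += nu * tree.predict(X)      # update: Fm(x) = Fm-1(x) + nu*hm(x)

    # Prediction for a new data point, e.g. height = 1.40:
    x_new = np.array([[1.40]])
    f2 = f0 + nu * sum(t.predict(x_new)[0] for t in trees)
    print(f2)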
What is a Gradient Boosting Classifier?

The loss function for the classification problem is the log loss, given below:

L = −Σ_{i=1..n} [yi · log(pi) + (1 − yi) · log(1 − pi)]
Our first step in the gradient boosting algorithm was to initialize the model with some constant value. There we used the average of the target column, but here we'll use log(odds) to get that constant value. Why log(odds)? When we differentiate this loss function, we get a function of log(odds), and we then need to find the value of log(odds) for which the loss function is minimum.

Confused? Okay, let's see how it works. Let's first transform this loss function so that it is a function of log(odds); I'll tell you later why we did this transformation:

L = Σ_{i=1..n} [log(1 + e^{log(odds)}) − yi · log(odds)]
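A small numerical sketch of the classifier's initialization. The binary labels are hypothetical (1 could denote phishing and 0 legitimate); it shows that the constant log(odds), pushed through the sigmoid, gives the probability that minimizes the log loss:

    import numpy as np

    y = np.array([1.0, 1.0, 1.0, 0.0])      # hypothetical binary labels

    # Initial constant: log(odds) of the positive class.
    log_odds = np.log(y.sum() / (len(y) - y.sum()))   # log(3/1)
    p = 1 / (1 + np.exp(-log_odds))                   # sigmoid -> 0.75 = mean(y)

    def log_loss(y, p):
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    # p = mean(y) gives a lower loss than nearby constant probabilities:
    print(log_loss(y, p), log_loss(y, 0.5), log_loss(y, 0.9))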
3.2 ARCHITECTURE
3.3 DEVELOPMENT
Data Collection:
In the first module we develop the data collection process. This is the first real step towards the development of a machine learning model: collecting data. It is a critical step that determines how good the model will be; the more and better data we get, the better our model will perform.
There are several techniques to collect data, such as web scraping and manual intervention. Our dataset is taken from the popular dataset repository Kaggle. The following is the dataset link for the Detection of Phishing Websites Using Machine Learning.
Dataset:
The dataset consists of 11054 individual records and 32 columns, some of which are described below.

Index: index id
UsingIP (categorical, signed numeric): { -1, 1 }
LongURL (categorical, signed numeric): { 1, 0, -1 }
ShortURL (categorical, signed numeric): { 1, -1 }
Symbol@ (categorical, signed numeric): { 1, -1 }
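A short sketch of loading and inspecting the dataset. The file name phishing.csv is an assumption; use whatever name the downloaded Kaggle file has:

    import pandas as pd

    # Assumed file name for the downloaded Kaggle dataset.
    df = pd.read_csv("phishing.csv")

    print(df.shape)                  # expected: (11054, 32)
    print(df["UsingIP"].unique())    # expected: [-1, 1]
    print(df["LongURL"].unique())    # expected: [1, 0, -1]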
Data Preparation:
Wrangle the data and prepare it for training. Clean whatever requires it (remove duplicates, correct errors, deal with missing values, normalize, convert data types, etc.).

Randomize the data, which erases the effects of the particular order in which we collected and/or otherwise prepared it.
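As a sketch of this preparation step (the column names Index and class are assumptions based on the dataset description above):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("phishing.csv")                 # assumed file name

    # Cleaning: drop duplicates and rows with missing values.
    df = df.drop_duplicates().dropna()

    # Randomize to erase any ordering effects from collection.
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    # Split features from the label ("class" is the assumed label column).
    X = df.drop(columns=["Index", "class"])
    y = df["class"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)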
CONCLUSION

REFERENCES
[1] Steve Sheng, Brad Wardman, Gary Warner, Lorrie Faith Cranor, Jason Hong, Chengshan Zhang. An Empirical Analysis of Phishing Blacklists. In: Proceedings of the 6th Conference on Email and Anti-Spam (CEAS 2009), Mountain View, California, USA, July 16-17, 2009.
[2] Mahmoud Khonji, Youssef Iraqi, Andrew Jones. Phishing Detection: A Literature Survey. IEEE Communications Surveys and Tutorials, vol. 15, no. 4, pp. 2091-2121, 2013.
[7] Cheng Huang, Shuang Hao, Luca Invernizzi, Yong Fang, Christopher Kruegel, Giovanni Vigna. Gossip: Automatically Identifying Malicious Domains from Mailing List Discussions. In: Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security (ASIA CCS 2017), Abu Dhabi, United Arab Emirates, April 2-6, 2017, pp. 494-505.
[8] Frank Vanhoenshoven, Gonzalo Nápoles, Rafael Falcon, Koen Vanhoof, Mario Köppen. Detecting Malicious URLs Using Machine Learning Techniques. In: Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI 2016), December 6-9, 2016.
[9] Joshua Saxe, Richard Harang, Cody Wild, Hillary Sanders. A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content. In: Proceedings of the 2018 IEEE Symposium on Security and Privacy Workshops (SPW 2018), San Francisco, CA, USA, 2018, pp. 8-14.