Detection of Phishing Website


SYNOPSIS

OBJECTIVE

Phishing is a common strategy used to mislead unsuspecting people into providing
private data through fraudulent websites. Phishing website URLs are designed to
collect personal information such as user names, passwords, and online financial
transactions. Phishers use websites that are visually and linguistically similar to
authentic websites. Because phishing techniques evolve quickly as technology
advances, anti-phishing strategies for detecting phishing are vital. Machine
learning is an effective method for preventing phishing attacks. Attackers typically
use phishing because it is easy to trick a victim into clicking a malicious link that
appears real.
ABSTRACT
Criminals seeking sensitive information construct illegal clones of real websites
and e-mail accounts. The e-mails carry genuine company logos and slogans.
When a user clicks on a link provided by these attackers, the attackers gain access
to all of the user's private information, including bank account details, personal
login passwords, and images. Random Forest and Decision Tree algorithms are
heavily employed in present systems, and their accuracy needs to be improved.
The existing models suffer from high latency, lack a dedicated user interface, and
do not compare different algorithms. When the e-mails or the links they contain
are opened, consumers are led to a faked website that appears to belong to the
authentic company. The models are used to detect phishing websites based on
URL significance features, and to find and implement the optimal machine
learning model. Logistic Regression, Multinomial Naive Bayes, and XGBoost are
the machine learning methods compared, and Logistic Regression outperforms
the other two.
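A sketch of such a comparison with scikit-learn. Synthetic data stands in for the URL-feature matrix, and sklearn's GradientBoostingClassifier is used as a stand-in for XGBoost; the exact dataset and settings are not those of the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the URL-feature matrix.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X = X - X.min(axis=0)  # MultinomialNB requires non-negative features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gradient Boosting (XGBoost stand-in)": GradientBoostingClassifier(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```

On the real dataset, the same loop would report which of the three models performs best.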
MODULES
 Data Collection
 Dataset
 Data Preparation
 Model Selection
 Analysis and Prediction
 Accuracy on test set
 Saving the Trained Model
SYSTEM SPECIFICATION

HARDWARE REQUIREMENTS:

 System : Pentium i3 Processor.


 Hard Disk : 500 GB.
 Monitor : 15’’ LED
 Input Devices : Keyboard, Mouse
 Ram : 4 GB

SOFTWARE REQUIREMENTS:

 Operating system : Windows 10.


 Coding Language : Python.
CONCLUSION
The suggested system is trained on a dataset with various attributes; note that the
dataset does not contain the website URLs themselves. The dataset includes many
criteria that should be considered when deciding whether a website URL is
genuine or fraudulent.
CHAPTER 1
INTRODUCTION
1.1 PROJECT OVERVIEW
Consumers lose billions of dollars each year as a result of phishing operations.
Phishing refers to the tricks thieves use to obtain private information from
unwitting Internet users. Fraudsters obtain personal and financial account
information, such as usernames and passwords, using fake e-mail and phishing
software to steal sensitive data. This project examines strategies for detecting
phishing websites using machine learning techniques to analyse various aspects of
benign and phishing URLs. It investigates how linguistic cues, host features, and
page significance attributes can be used to identify phishing sites. Fine-tuned
parameters aid in selecting the most appropriate machine learning method for
distinguishing between phishing and benign sites. Criminals who seek to steal
sensitive information first establish illegal duplicates of legitimate websites and
e-mail accounts, frequently those of financial institutions or other companies that
handle financial data. The e-mails carry genuine company logos and slogans. One
reason phishing grows as rapidly as the internet itself is that the internet, as a
means of communication, allows the misuse of trademarks, brand names, and
other corporate identifiers that consumers rely on as verification cues. "Spoof"
e-mails are sent to many people in order to draw them into the deception. When
these e-mails are opened, or a link in them is clicked, consumers are directed to a
fraudulent website that appears to come from the real company.
1.2 PROBLEM DESCRIPTION
The current models have high latency. Existing systems lack a dedicated user
interface. The current system model cannot forecast a continuous outcome; it
works only when the dependent (outcome) variable is binary. The present model
may be inaccurate if the sample size is too small, and it may suffer from an
overfitting issue.

1.3 OBJECTIVE

1. Input design is the process of converting a user-oriented description of the input
into a computer-based system. This design is important to avoid errors in the
data-input process and to show management the correct way to get accurate
information from the computerized system.

2. It is achieved by creating user-friendly screens for data entry that can handle
large volumes of data. The goal of input design is to make data entry easier and
error-free. The data-entry screen is designed so that all data manipulations can be
performed; it also provides record-viewing facilities.

3. When data is entered, it is checked for validity. Data can be entered with the
help of screens, and appropriate messages are shown when needed so that the user
is never left in a maze of confusion. The objective of input design is thus to create
an input layout that is easy to follow.
CHAPTER 2
LITERATURE SURVEY
2.1 EXISTING AND PROPOSED SYSTEM

EXISTING SYSTEM:

 H. Huang et al. (2009) proposed frameworks that identify phishing using
page-segment similarity, analysing URL tokens to improve prediction accuracy;
phishing pages usually keep a CSS style similar to that of their target pages.
 S. Marchal et al. (2017) proposed a technique to identify phishing websites
based on the analysis of legitimate website server-log data. Their application,
Off-the-Hook, detects phishing websites and exhibits several notable properties,
including high accuracy, full autonomy, language independence, fast decisions,
resilience to dynamic phish, and resilience to evolving phishing tactics.
 Mustafa Aydin et al. proposed a classification algorithm for phishing-website
detection that extracts websites' URL features and analyses subset-based
feature-selection methods. It implements feature extraction and selection for the
detection of phishing websites. The extracted URL features of the pages,
composed into a feature matrix, are categorized into five analyses: Alphanumeric
Character Analysis, Keyword Analysis, Security Analysis, Domain Identity
Analysis, and Rank-Based Analysis. Most of these features are textual properties
of the URL itself; others are based on third-party services.
 The existing system compares Logistic Regression, Multinomial Naive Bayes,
and XGBoost as machine learning methods.

DISADVANTAGES OF EXISTING SYSTEM:
 The existing models have high latency.
 Existing systems do not have a dedicated user interface.
 The existing system model fails to predict a continuous outcome; it only works
when the dependent (outcome) variable is dichotomous.
 The existing system model may not be accurate if the sample size is too small.
 The existing model may lead to an overfitting problem.

2.2 PROPOSED SYSTEM:

 We have developed our project using a website as the platform for all users.
This is an interactive and responsive website that is used to detect whether a
website is legitimate or phishing. It is built with several web technologies:
HTML, CSS, JavaScript, and the Flask framework in Python. The basic structure
of the website is built with HTML; CSS adds effects and makes it more attractive
and user-friendly. Since the website is created for all users, it must be easy to
operate, and no user should face any difficulty while using it.
 The proposed system is trained with a dataset consisting of different features;
note that the dataset does not contain any website URLs. The dataset consists of
the different features that are to be taken into consideration when determining
whether a website URL is legitimate or phishing.
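A minimal sketch of the Flask side of such a website. The route name, form field, and the placeholder heuristic are illustrative assumptions; in the real project the trained classifier would produce the verdict.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/check", methods=["POST"])
def check():
    """Accept a URL from the front-end form and return a verdict as JSON."""
    url = request.form.get("url", "")
    # Placeholder heuristic; the real project would extract the URL's
    # features here and call the trained model instead.
    verdict = "legitimate" if url.startswith("https://") else "suspicious"
    return {"url": url, "verdict": verdict}
```

The HTML/CSS/JavaScript front end would POST the entered URL to this endpoint and render the returned verdict.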

ADVANTAGES OF PROPOSED SYSTEM:

 A user interface is provided
 The model is trained using many features
 High level of accuracy
 The proposed system is generally more accurate compared to other models.
 The proposed system trains faster, especially on larger datasets.
 Most of the underlying models support handling categorical features.
 Some of them handle missing values natively.
2.3 APPLICATIONS

Phishing Website Detection Tool:

Develop a standalone application or a browser extension that users can install. This
tool can scan URLs or emails in real-time and alert users if they are potentially
accessing a phishing website or clicking on a malicious link.

Email Phishing Protection:

Create an email client plugin or standalone tool that integrates with popular email
services. This tool can scan incoming emails for phishing indicators such as
suspicious links or content, helping users identify and avoid phishing attempts.
Phishing Simulation Platform:

Build a platform that allows organizations to simulate phishing attacks on their
employees. This can help train employees to recognize phishing attempts and
improve overall cybersecurity awareness within the organization.

Phishing Reporting System:

Develop a system where users can report suspected phishing websites or emails.
Implement machine learning models to automatically analyze reported cases and
identify new phishing patterns for further investigation.

Real-time Phishing Dashboard:

Create a dashboard that provides real-time insights into phishing trends and
activities. This can be useful for security analysts and organizations to monitor and
respond quickly to emerging phishing threats.

Phishing Awareness Training App:


Develop an interactive application or gamified platform that educates users about
phishing techniques and how to spot and avoid them. Include quizzes, simulations,
and educational content to engage users and improve their cybersecurity
knowledge.

API for Phishing Detection:

Build an API that developers can integrate into their applications to add phishing
detection capabilities. This can be useful for online platforms, browsers, and
security software to enhance their security features.
CHAPTER 3

METHODOLOGY

We used the Gradient Boosting Classifier machine learning algorithm. We
obtained a training accuracy of 98.9%, so we implemented this algorithm.

Step 1: The first step in gradient boosting is to build a base model to predict the
observations in the training dataset. For simplicity we take the average of the
target column and treat that average as the predicted value.
Why do we take the average of the target column? There is math behind this.
Mathematically, the first step can be written as:

F_0(x) = argmin_gamma Σ_i L(y_i, gamma)

Looking at this may give you a headache, but don't worry, we will work through
what is written here.
Here L is our loss function,
gamma is our predicted value, and
argmin means we have to find the value of gamma for which the loss function is
minimum.
Since the target column is continuous, our loss function is the squared error:

L = (1/2) Σ_i (y_i - gamma)^2

Here y_i is the observed value and gamma is the predicted value.
Now we need to find the value of gamma for which this loss function is
minimum. As we all studied in 12th grade, to find minima and maxima we
differentiate the function and set the derivative equal to 0, and we do the same
here.

Let's see how to do this with the help of our example. Remember that y_i is our
observed value and gamma is our predicted value. Setting the derivative to zero,

dL/dgamma = -Σ_i (y_i - gamma) = 0,

gives gamma = average of the y_i. Hence for gamma = 14500 the loss function is
minimum, so this value becomes our prediction for the base model.
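A quick numeric check of this step. The four target values here are hypothetical, chosen so that their mean is the 14500 used in the text:

```python
# Hypothetical target column; its mean is 14500, the base-model prediction
# used in the text.
y = [12000, 13000, 15000, 18000]

def squared_loss(gamma):
    """Sum of squared errors for a constant prediction gamma."""
    return sum((yi - gamma) ** 2 for yi in y)

gamma = sum(y) / len(y)  # the mean: 14500.0
# The mean beats nearby candidate predictions, as the derivative argument shows.
assert squared_loss(gamma) < squared_loss(gamma + 100)
assert squared_loss(gamma) < squared_loss(gamma - 100)
print(gamma)  # 14500.0
```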
Step 2: The next step is to calculate the pseudo-residuals, which are (observed
value - predicted value).
Again the question comes: why only observed minus predicted? Everything is
mathematically grounded; let's see where this formula comes from. This step can
be written as:

r_im = -[ dL(y_i, F(x_i)) / dF(x_i) ], evaluated at F = F_(m-1)

Here F(x_i) is the previous model's prediction and m is the number of decision
trees (DTs) made. We are just taking the derivative of the loss function with
respect to the predicted value, and we have already calculated this derivative:

dL/dF(x_i) = -(y_i - F(x_i))

If you look at the formula for the residuals above, the derivative of the loss
function is multiplied by a negative sign, so we get:

r_i = y_i - F(x_i), i.e. observed value - predicted value.

The predicted value here is the prediction made by the previous model. In our
example the prediction made by the previous model (the initial base-model
prediction) is 14500, so to calculate the residuals our formula becomes
r_i = y_i - 14500.

Step 3: Next we build a decision tree on these residuals, and we will see why we
take the average of the residuals in each leaf. Mathematically this step can be
represented as fitting a tree h_m(x) to the residuals. Here h_m(x_i) is the DT made
on the residuals and m is the index of the DT: when m = 1 we are talking about
the 1st DT, and when it is M we are talking about the last DT.

Step 4: The output value for each leaf is the value of gamma that minimizes the
loss function:

gamma = argmin_gamma Σ_(x_i in leaf) L(y_i, F_(m-1)(x_i) + gamma * h_m(x_i))

The left-hand gamma is the output value of a particular leaf. The right-hand side,
[F_(m-1)(x_i) + gamma * h_m(x_i)], is similar to Step 1, but here the difference is
that we start from the previous predictions, whereas earlier there was no previous
prediction.
Let's understand this even better with the help of an example. Suppose our
regression tree splits the four residuals into three leaves: the 1st residual goes into
leaf R1,1, the 2nd and 3rd residuals go into R2,1, and the 4th residual goes into
R3,1.
Let's calculate the output for the first leaf, R1,1. We need to find the value of
gamma for which this function is minimum, so we take the derivative of the
equation with respect to gamma and set it equal to 0. Hence the leaf R1,1 has an
output value of -2500. Now let's solve for R2,1: taking the derivative to get the
value of gamma for which this function is minimum, we end up with the average
of the residuals in leaf R2,1. Hence, if we get any leaf with more than one
residual, we can simply take the average of that leaf's residuals, and that will be
our final output.
Now after calculating the output of all the leaves, we get:



Step 5: This is finally the last step, where we update the predictions of the
previous model:

F_m(x) = F_(m-1)(x) + nu * h_m(x)

where m is the number of decision trees made.


Since we have just started building our model, m = 1. Now, to make a new DT,
our new predictions will be:

F_1(x) = F_0(x) + nu * h_1(x)

Here F_(m-1)(x) is the prediction of the base model (the previous prediction);
since F_(1-1) = F_0 is our base model, the previous prediction is 14500.
nu is the learning rate, usually selected between 0 and 1. It reduces the effect each
tree has on the final prediction, and this improves accuracy in the long run. Let's
take nu = 0.1 in this example.
h_m(x) is the most recent DT made on the residuals.
Let's calculate the new prediction now. I am taking a hypothetical example here
just to show how this predicts for a new dataset:


If a new data point with, say, height = 1.40 comes in, it will go through all the
trees and then give the prediction. Here we have only 2 trees, so the data point
will go through these 2 trees and the final output will be F_2(x).
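The whole regression procedure above (base prediction = mean, fit a tree to the pseudo-residuals, update with nu = 0.1) can be sketched with scikit-learn's DecisionTreeRegressor. The data here is synthetic, not the example from the text; only the procedure is the same.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1.2, 1.9, size=(50, 1))           # e.g. heights, as in the text
y = 10000.0 * X[:, 0] + rng.normal(0.0, 500.0, size=50)

nu = 0.1                                          # learning rate from the text
F = np.full_like(y, y.mean())                     # Step 1: base model F0 = mean
for m in range(20):                               # Steps 2-5, repeated
    residuals = y - F                             # Step 2: pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # Steps 3-4
    F = F + nu * tree.predict(X)                  # Step 5: F_m = F_(m-1) + nu*h_m

print("MSE of base model:", np.mean((y - y.mean()) ** 2))
print("MSE after boosting:", np.mean((y - F) ** 2))
```

Each iteration shrinks the residuals a little; the learning rate keeps any single tree from dominating the final prediction.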
What is the Gradient Boosting Classifier?
The loss function for the classification problem is the log loss:

L = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

Our first step in the gradient boosting algorithm was to initialize the model with
some constant value; there we used the average of the target column, but here we
use log(odds) to get that constant value. Why log(odds)? When we differentiate
this loss function, we get a function of log(odds), and we then need to find the
value of log(odds) for which the loss function is minimum. Confused? Okay, let's
see how it works.
First we transform this loss function so that it is a function of log(odds); I'll
explain later why we did this transformation.
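A standard form of this transformation, reconstructed here since the original equations did not survive (p_i is the predicted probability and y_i the 0/1 label):

```latex
% Log-loss as written above:
L = -\sum_i \big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\big]
% With p_i = e^{\ell_i}/(1 + e^{\ell_i}), where \ell_i = \log(\mathrm{odds}_i),
% so that \log p_i - \log(1-p_i) = \ell_i and \log(1-p_i) = -\log(1+e^{\ell_i}),
% the loss becomes a function of the log(odds):
L = -\sum_i \big[\, y_i\,\ell_i - \log\!\big(1 + e^{\ell_i}\big) \,\big]
```

Differentiating this form with respect to the log(odds) and setting it to zero is what yields the constant initial value for the classifier.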

3.2 ARCHITECTURE
3.3 DEVELOPMENT

Data Collection:

In the first module we develop the data-collection process. This is the first real
step towards developing a machine learning model: collecting data. It is a critical
step that determines how good the model will be; the more and better data we
get, the better our model will perform.

There are several techniques for collecting data, such as web scraping and manual
intervention. The dataset is taken from the popular dataset repository Kaggle; it is
the Kaggle dataset for the Detection of Phishing Websites Using Machine
Learning.

Dataset:

The dataset consists of 11,054 records. There are 32 columns in the dataset, a few
of which are described below.
Index: index id
UsingIP: (categorical - signed numeric) : { -1,1 }
LongURL: (categorical - signed numeric) : { 1,0,-1 }
ShortURL: (categorical - signed numeric) : { 1,-1 }
Symbol@: (categorical - signed numeric) : { 1,-1 }
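A sketch of how such a dataset can be split into features and labels with pandas. A tiny stand-in frame is used here, and the label column name "class" is an assumption, not taken from the report:

```python
import pandas as pd

# Tiny stand-in using the same -1/0/1 coding convention as the real columns.
df = pd.DataFrame({
    "Index":    [0, 1, 2, 3],
    "UsingIP":  [-1, 1, 1, -1],
    "LongURL":  [1, 0, -1, 1],
    "ShortURL": [1, -1, 1, -1],
    "Symbol@":  [1, -1, 1, -1],
    "class":    [-1, 1, 1, -1],  # assumed label column: -1 phishing, 1 legitimate
})
X = df.drop(columns=["Index", "class"])  # feature matrix
y = df["class"]                          # labels
print(X.shape, y.shape)
```

With the real file downloaded from Kaggle, `pd.read_csv` on the CSV would replace the hand-built frame.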

Data Preparation:

Wrangle the data and prepare it for training. Clean whatever requires it (remove
duplicates, correct errors, handle missing values, normalization, data-type
conversions, etc.).

Randomize data, which erases the effects of the particular order in which we
collected and/or otherwise prepared our data.

Visualize data to help detect relevant relationships between variables or class
imbalances (bias alert!), or perform other exploratory analysis.

Split into training and evaluation sets.
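The randomize-and-split steps above can be sketched with scikit-learn. The 80/20 ratio is an assumption, as the report does not state the split used:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and labels standing in for the prepared dataset.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# shuffle=True randomizes the order; stratify keeps class proportions equal
# in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)
print(len(X_train), len(X_test))  # 40 10
```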

Accuracy on test set:

We got an accuracy of 97.6% on test set.
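How such a test-set accuracy is computed with the Gradient Boosting Classifier. Synthetic data stands in for the URL-feature dataset, so this sketch only shows the mechanics; it will not reproduce the 97.6% figure:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

gbc = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
print("test accuracy:", gbc.score(X_test, y_test))
```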

Saving the Trained Model:


Once you're confident enough to take your trained and tested model into a
production-ready environment, the first step is to save it to a file, for example a
.pkl file created with the pickle library (Keras models are instead saved to .h5
files).
Make sure pickle is available in your environment (it ships with Python's
standard library).
Next, let's import the module and dump the model into a .pkl file.
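A sketch of the dump-and-reload round trip. The small stand-in model and the file name "model.pkl" are illustrative; in the project this would be the trained classifier:

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Small stand-in model; in the project this would be the trained classifier.
model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)          # serialize the trained model

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)      # deserialize it later, e.g. in the web app

# The restored model makes the same predictions as the original.
assert (restored.predict([[2], [3]]) == model.predict([[2], [3]])).all()
```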
CHAPTER 4

CONCLUSION

It is worth noting that a good anti-phishing system should be able to predict
phishing attacks within a reasonable amount of time, and having such a system
available in time is also necessary for expanding the scope of phishing-site
detection. The current system detects phishing websites using the Gradient
Boosting Classifier. We achieved 97% detection accuracy with the Gradient
Boosting Classifier, with the lowest false-positive rate.
BIBLIOGRAPHY

[1] S. Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, C. Zhang. "An
Empirical Analysis of Phishing Blacklists." In: Proceedings of the 6th Conference
on Email and Anti-Spam (CEAS 2009), Mountain View, California, USA, July
16-17, 2009.

[2] M. Khonji, Y. Iraqi, A. Jones. "Phishing Detection: A Literature Survey."
IEEE Communications Surveys and Tutorials, vol. 15, no. 4, pp. 2091-2121,
2013.

[3] A. Acquisti, I. Adjerid, R. Balebako, L. Brandimarte, L. F. Cranor, S.
Komanduri, P. G. Leon, N. Sadeh, F. Schaub, et al. "Nudges for Privacy and
Security: Understanding and Assisting Users' Choices Online." ACM Computing
Surveys, vol. 50, no. 3, Article 44, 2017.

[4] M. M. Moreno-Fernández, F. Blanco, P. Garaizar, H. Matute. "Fishing for
phishers: Improving Internet users' sensitivity to visual deception cues to prevent
electronic fraud." Computers in Human Behavior, vol. 69, pp. 421-436, 2017.

[5] M. Junger, L. Montoya, F. J. Overink. "Priming and warnings are not
effective to prevent social engineering attacks." Computers in Human Behavior,
vol. 66, pp. 75-87, 2017.

[6] E.-S. M. El-Alfy. "Detection of Phishing Websites Based on Probabilistic
Neural Networks and K-Medoids Clustering." The Computer Journal, vol. 60, no.
12, pp. 1745-1759, 2017.

[7] C. Huang, S. Hao, L. Invernizzi, Y. Fang, C. Kruegel, G. Vigna. "Gossip:
Automatically Identifying Malicious Domains from Mailing List Discussions."
In: Proceedings of the 2017 ACM Asia Conference on Computer and
Communications Security (ASIA CCS 2017), Abu Dhabi, United Arab Emirates,
April 2-6, 2017, pp. 494-505.

[8] F. Vanhoenshoven, G. Nápoles, R. Falcon, K. Vanhoof, M. Köppen.
"Detecting Malicious URLs Using Machine Learning Techniques." In: 2016
IEEE Symposium Series on Computational Intelligence (SSCI 2016), December
6-9, 2016.

[9] J. Saxe, R. Harang, C. Wild, H. Sanders. "A Deep Learning Approach to Fast,
Format-Agnostic Detection of Malicious Web Content." In: 2018 IEEE Security
and Privacy Workshops (SPW 2018), San Francisco, CA, USA, 2018, pp. 8-14.
