Phishing URL Detection Using ML: Project Report
PROJECT REPORT
Submitted by
Prof. Manikandan. K
TABLE OF CONTENTS
1 Abstract
2 Introduction
3 Literature Review
4 Problem Formulation
6 Methodology
7 Flow Diagrams
8 Implementation
11 Code Snippets
12 Appendix
13 References
1. Abstract :
2. Introduction:
3. Literature Review :
2. Detection and Prevention of Phishing Websites using Machine

In this paper, they discussed three approaches for detecting phishing
websites. First is by analyzing various features of the URL, second is by
checking legitimacy of a website by knowing where the website is being
hosted and who is managing it, the third approach uses visual appearance
based analysis for checking genuineness of the website. We make use of
Machine Learning techniques and algorithms for evaluation of these
different features of URL and websites.

The drawback of this system is detecting some minimal false positive and
false negative results. These drawbacks can be eliminated by introducing
much richer features to feed to the machine learning algorithm that would
result in much higher accuracy.
3. Phishing Detection: A Recent Intelligent Machine Learning Comparison
based on Models Content and Features.

In this paper, they critically analysed recent studies related to phishing
in the research literature based on ML techniques. We show how these ML
approaches derive the classification models and their advantages and
disadvantages. More importantly, we investigate in-depth eight ML
techniques on real datasets related to phishing and perform thorough
comparisons of these techniques. The aim of the comparisons is to
determine a suitable approach that may serve as an anti phishing tool,
based on the model content as well as the detection rate of phishing
activities.

Decision trees, Bayes Net, and SVM achieved good detection rates. However,
models extracted by decision trees showed very large amounts of
information which may overwhelm novice users and security experts, and
thus will be hard to manage or understand. Moreover, Bayes Net and SVM
showed good performance with respect to accuracy, yet their models are
hard to understand by end-users.
4. Phishing Website Detection Using URL-Assisted Brand Name Weighting
System

Weight to the extracted words from website content. To form the final
weights, the URL weighting system computes further weight to be added up
with the initial weight of the words. Based on the final weights, a few of
the words are selected as the brand name. The brand name is then submitted
to the search engine to retrieve the domain name with the highest number
of occurrences among the search results.

Finally, WHOIS lookups are performed to obtain domain name owners. A
successful match of the domain name owner will conclude the query website
as a legitimate website. Otherwise, the query website will be labelled as
a phishing website. A detailed explanation of each component will be
discussed in the following subsection. This section discusses the steps to
extract plain text and URLs from the HTML source code. When a web page is
loaded, the browser creates a Document Object Model (DOM) of the page. DOM
defines the HTML content in structured nodes.
... them in their phishing website. In order to detect phishing websites,
the first question to ask is: how to differentiate a phishing website from
a legitimate website given the fact that they look identical? If we can
somehow determine the real identity of a query website (if the query
website is a phishing website, the real identity will be the identity of
the targeted website), we can then differentiate them.

... website, especially the logo in their phishing websites, this
motivates us to propose an anti-phishing method based on the
identification of website identity through the logo. This is rational as
the logo is usually representing the identity of a legitimate website. In
this paper, the proposed method involves two main processes: logo
segmentation and website identity identification.
4. Problem Formulation :
the value of a target variable by learning simple decision rules inferred
from the data features. A tree can be seen as a piecewise constant
approximation.
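To make the idea of "simple decision rules" concrete, the following is a minimal sketch of a one-level tree (a decision stump) in plain Python. The toy data, the single feature (URL length) and the function names are assumptions made for illustration; they are not taken from the project code, and the stump only tries rules of the form "feature greater than threshold means phishing".

```python
# Hypothetical toy data: one feature (URL length) and a label
# (1 = phishing, 0 = legitimate). Values are illustrative only.
samples = [(12, 0), (18, 0), (25, 0), (54, 1), (61, 1), (75, 1)]


def fit_stump(data):
    """Find the threshold on a single feature that misclassifies fewest samples."""
    values = sorted({x for x, _ in data})
    # Candidate thresholds are midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(data) + 1
    for t in candidates:
        # Rule under test: predict 1 if the feature exceeds the threshold, else 0.
        errors = sum((x > t) != bool(y) for x, y in data)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def predict(threshold, x):
    """Piecewise-constant prediction: one constant on each side of the threshold."""
    return 1 if x > threshold else 0


threshold = fit_stump(samples)
print(threshold, predict(threshold, 20), predict(threshold, 70))
```

A full decision tree repeats this kind of split recursively on each side of the threshold and over all features, which is why its output is constant within each region of the feature space.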
Merits :
● Simple to understand and to interpret. Trees can be visualised.
● Requires little data preparation. Other techniques often require data
normalisation, dummy variables need to be created and blank
values to be removed. Note however that this module does not
support missing values.
● DT can handle both numerical and categorical data.
● Decision trees provide a clear indication of which fields are most
important for prediction or classification.
Demerits :
Logistic regression is one of the most popular Machine Learning
algorithms and belongs to the Supervised Learning family. It is used to
predict a categorical dependent variable from a given set of independent
variables, so the outcome must be a categorical or discrete value such as
Yes or No, 0 or 1, True or False. Instead of giving the exact value 0 or
1, however, it gives a probability that lies between 0 and 1.
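The probabilistic output described above comes from passing a weighted sum of the features through the logistic (sigmoid) function. The sketch below shows that step only; the weights, bias and feature values are hypothetical and are not fitted to the project's dataset.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two features (not fitted to any real data).
weights, bias = [0.8, -1.2], 0.1
p = predict_proba(weights, bias, [2.0, 0.5])
# A common convention is to predict the positive class when p >= 0.5.
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

Training a logistic regression model amounts to choosing the weights and bias so that these probabilities match the labelled examples as closely as possible.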
The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane. SVM chooses the
extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is
termed as Support Vector Machine.
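As an illustration of the hyperplane and support vectors, here is a small sketch in plain Python. The hyperplane coefficients and the four points are assumed values chosen by hand; nothing is learned here, so it shows how a fitted linear SVM classifies points rather than how it is trained.

```python
import math

# Hypothetical hyperplane w.x + b = 0 in two dimensions (not learned here).
w = [1.0, 1.0]
b = -3.0

# Toy points with labels +1 / -1, chosen for illustration.
points = [([0.0, 1.0], -1), ([1.0, 1.0], -1), ([2.0, 2.0], 1), ([4.0, 3.0], 1)]


def decision(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x):
    """Assign the class according to the side of the hyperplane."""
    return 1 if decision(x) >= 0 else -1


def distance(x):
    """Perpendicular distance from x to the hyperplane."""
    return abs(decision(x)) / math.sqrt(sum(wi * wi for wi in w))


# The points closest to the boundary play the role of support vectors.
min_dist = min(distance(x) for x, _ in points)
support = [x for x, _ in points if math.isclose(distance(x), min_dist)]
print([classify(x) for x, _ in points], support)
```

In this toy setting the two points nearest the line are the ones that would constrain the boundary; moving any of the other points slightly would not change it.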
using a straight line, then such data is termed as non-linear data and
predictive accuracy of that dataset. Instead of relying on one decision
tree, the random forest takes the prediction from each tree and, based on
the majority vote of those predictions, predicts the final output.
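The majority-vote step can be sketched as follows. The three "trees" are hypothetical hand-written rules standing in for trained decision trees, and the feature names and thresholds are assumptions for illustration. A real random forest would also train each tree on a different random sample of the data and features, which this sketch omits.

```python
from collections import Counter


# Hypothetical stand-ins for trained trees: each is a rule on a feature dict.
def tree_a(f):
    return 1 if f["url_length"] > 54 else 0


def tree_b(f):
    return 1 if f["digit_count"] > 3 else 0


def tree_c(f):
    return 1 if f["subdomains"] > 2 else 0


forest = [tree_a, tree_b, tree_c]


def forest_predict(features):
    """Collect one vote per tree and return the most common class."""
    votes = [tree(features) for tree in forest]
    return Counter(votes).most_common(1)[0][0]


sample = {"url_length": 80, "digit_count": 5, "subdomains": 1}
print(forest_predict(sample))
```

Here two of the three rules vote for class 1, so the forest outputs 1 even though one tree disagrees.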
Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct
output, while others may not. But together, all the trees predict the correct
output. Therefore, below are two assumptions for a better Random forest
classifier:
dataset so that the classifier can predict accurate results rather than
a guessed result.
● The predictions from each tree must have very low correlations.
Regression tasks.
● It enhances the accuracy of the model and prevents the overfitting issue.
● Although random forest can be used for both classification and regression
6. Methodology :
1. URL-Based Features
2. Domain-Based Features
3. Page-Based Features
4. Content-Based Features
URL-Based Features
The URL is the first thing to analyse when deciding whether a website is
a phishing site or not. As we mentioned before, URLs of phishing domains
have some distinctive points. Features related to these points are
obtained when the URL is processed.
Some of the URL-based features are given below.
1. Digit count in the URL
2. Total length of URL
3. Checking whether the URL is Typosquatting or not. (google.com
→ goggle.com)
4. Checking whether it includes a legitimate brand name or not
(apple-icloud-login.com)
5. Number of subdomains in URL
6. Whether the Top Level Domain (TLD) is one of the commonly used ones
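The six features listed above could be computed along the following lines. This is a sketch under stated assumptions, not the project's extraction code: the TLD and brand lists are tiny placeholders, typosquatting is approximated by an edit distance of 1 to a known brand, and the host name is split naively (the URL must include a scheme such as http://, and multi-part suffixes such as co.uk are not handled).

```python
from urllib.parse import urlparse

# Assumed lists for illustration; a real system would use larger, curated sets.
COMMON_TLDS = {"com", "org", "net", "edu", "gov"}
BRANDS = {"apple", "google", "paypal"}


def edit_distance(a, b):
    """Levenshtein distance, used here as a simple typosquatting signal."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def url_features(url):
    """Return the six URL-based features as a dictionary."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Naive split: the last label is treated as the TLD and the one before it
    # as the registered name.
    name = labels[-2] if len(labels) >= 2 else host
    return {
        "digit_count": sum(ch.isdigit() for ch in url),
        "url_length": len(url),
        # One edit away from a known brand, e.g. goggle vs google.
        "typosquatting": any(0 < edit_distance(name, b) <= 1 for b in BRANDS),
        # Brand appears in the host, but the registered name is not the brand.
        "has_brand_name": any(b in host and name != b for b in BRANDS),
        "subdomain_count": max(len(labels) - 2, 0),
        "common_tld": labels[-1] in COMMON_TLDS,
    }


print(url_features("http://apple-icloud-login.com/verify?id=42"))
```

Each dictionary produced this way corresponds to one row of numeric or boolean inputs that a classifier can consume.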
machine learning algorithms and each algorithm has its own working
mechanism. In this project, we have explained the Decision Tree
algorithm, because we consider it simple and powerful.
Modules included:
1. Data training : We have used the Random Forest algorithm to train
our data set.
2. FrontEnd and Server maintenance : A localhost server is created
and all the required HTML files are hosted there. This module
takes care of the flow of data among the program files.
3. Extracting URL features : This module takes the URL, passes it
through various filters and extracts features such as domain,
protocol, sub-domain, SSL certificate, etc.
4. Predicting the URL type : This module takes the output of the
previous module, processes it and assigns a flag value to the
URL, which later helps in identifying its safety.
7. Flow Diagrams :
8. Implementation :
9. Results and Discussion :
● System info :
○ Hardware specifications :
- Intel core processor ( i5 - recommended )
- Memory : 2GB ( 4GB - recommended )
- Disk space : 1GB ( >1GB - recommended )
○ Software specification :
- Windows OS
- Python3 : with required modules
- Jupyter notebook ( Google Colab - recommended )
- IDE to code the front end( VS Code-recommended)
- Browser ( Chrome - recommended )
- Modules supporting server maintenance
● Dataset :
- Phishcoop.csv ( taken from UCI-repo )
- Contains 11055 entries, each with 32 attributes
- No null entries
- 6157 positive examples, 4898 negative examples
● Input type :
- URL of a site to be verified
○ Decision Tree
○ Logistic Regression
○ Support-Vector Machine
○ Random Forest
Of these, Random Forest gives the highest accuracy score, 98.6%.
So we generated finalised_model.pkl, which is used to predict the class
of input URLs.
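Saving a model to a .pkl file and loading it back for prediction can be sketched with Python's pickle module. The "model" below is a hypothetical stand-in (a single threshold stored in a dictionary), not the project's trained Random Forest, and the file is written to a temporary directory rather than the project folder.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a single learned threshold.
model = {"feature": "url_length", "threshold": 54}


def predict(model, features):
    """Flag a URL as phishing (1) when the chosen feature exceeds the threshold."""
    return 1 if features[model["feature"]] > model["threshold"] else 0


# Serialise the model to disk, then load it back as a separate object.
path = os.path.join(tempfile.mkdtemp(), "finalised_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    loaded = pickle.load(fh)

print(predict(loaded, {"url_length": 80}), predict(loaded, {"url_length": 20}))
```

Pickle files can execute arbitrary code when loaded, so a saved model should only be loaded from a source that is trusted.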
Because of the threat posed by phishing attacks, more research
should still be carried out to add to the existing knowledge and
solutions. Hackers are still creating new ways to exploit human trust.
A more adequate technique for model testing should also be considered,
so that a model can be validated better before its deployment in the
real world.
Future work :
Our project has some limitations in checking multiple domains and IP
addresses, and we plan to overcome them in the future. We will then try
to package it as a Chrome extension and deploy it for real use.
Fig4. Decision tree rules generation
Fig6. Logistic regression correlation among features
Fig8. SVM accuracy score
Fig10. Random Forest accuracy score
Fig12. Accuracy comparison graph
12. Appendix :
● Colab file :
https://colab.research.google.com/drive/1ehQDur3iPhPpa2r2GArdtF5Qv6DtRPjR?usp=sharing
● Project files :
https://github.com/Krishnachaitanya-learn/Phishing_detection4QgbIchQf
xkmOCllw4CX4X_GV?usp=sharing
13. References :
3. Abdelhamid, N., Thabtah, F., & Abdel-jaber, H. (2017, July).
Phishing detection: A recent intelligent machine learning comparison
based on models content and features. In 2017 IEEE international
conference on intelligence and security informatics (ISI) (pp. 72-77).
IEEE.
4. Bhat, T., & Godse, S. P. (2018, August). In 2018 Fourth international
conference on computing communication control and automation
(ICCUBEA) (pp. 1-5). IEEE.
5. Thomas, C. (2019, December). Detection of phishing URLs
using machine learning techniques. recent intelligent machine
learning comparison based on models content and features. In
2020 IEEE international conference on intelligence and security
informatics (ISI) (pp. 71-87).