Volume 8, Issue 7, July – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Enhanced Phishing Website Detection: Leveraging Random Forest and XGBoost Algorithms with Hybrid Features
Prof. Ashwini Bhavsar1, Adarsh Waikar2, Ayush Petkar3, Seema Mane4, Vishwatej Sarwale5
1 Prof. Ashwini Bhavsar, Dept. of Computer Engineering, PCCOER, Maharashtra, India
2 Adarsh Waikar, Dept. of Computer Engineering, PCCOER, Maharashtra, India
3 Ayush Petkar, Dept. of Computer Engineering, PCCOER, Maharashtra, India
4 Seema Mane, Dept. of Computer Engineering, PCCOER, Maharashtra, India
5 Vishwatej Sarwale, Dept. of Computer Engineering, PCCOER, Maharashtra, India

Abstract:- Phishing is a technique used by hackers or attackers to scam people on the internet into giving up private details such as login credentials for various profiles, social security numbers (SSNs), banking information, etc. Attackers disguise a webpage as an official, legitimate website. Blacklist or whitelist, heuristic, and visual similarity-based anti-phishing solutions are unable to detect zero-hour phishing attacks or newly created websites. Older methods are more complex and not suitable for day-to-day scenarios since they rely on external sources such as search engines. As a result, finding newly constructed phishing websites in a real-time context is a significant hurdle in the field of cybersecurity. This paper presents a hybrid feature-based anti-phishing approach that addresses these problems by extracting features only from URL and hyperlink data available on the client side. Also, a new dataset is created for experiments employing widely used machine-learning classification techniques. Our experimental findings indicate that the presented random forest-based phishing website detection approach is more effective and, blended with the XGBoost technique, achieves a higher accuracy of 96.81%.

Keywords:- Cybersecurity, Phishing Detection, Machine Learning, Hyperlink Feature, URL Feature, Anti-Phishing, XGBoost, Hybrid Feature.

I. INTRODUCTION

In 2022 alone, about 69% of the world's population actively used the internet, and the number of internet users is expected to keep increasing. In the field of cybersecurity, phishing is currently one of the most serious and dangerous online threats [1]. The rapid advancement of Internet technology has greatly boosted the use of social media, online banking, e-commerce services, and other similar services. In 2022, 166,187,118 harmful email attachments were stopped by Kaspersky Mail Anti-Virus. Attempts to click on phishing URLs were blocked 507,851,735 times by Kaspersky's anti-phishing system, and 378,496 attempts to click on phishing URLs were related to the takeover of Telegram accounts. According to "A Digital Report in 2021" data from We Are Social (Global Overview Report 2021) [2], there are 4.66 billion internet users worldwide, up 7.3 percent (316 million additional users) from January 2020. Internet penetration currently stands at 59.5 percent, which gives phishing attackers the chance to profit by extorting and stealing private data from online users [3]. The attacker creates a fake website and distributes links via emails, Facebook, Twitter, and other social media applications. When a user unknowingly opens the link and changes or fills in any sensitive and private credentials, attackers obtain access to the user's information, such as financial information, personal information, login credentials, and so on. Cybercriminals use stolen information for a range of illicit actions, including blackmailing victims. Consumers fall prey to phishing mainly for the following reasons:
• Users' understanding of URLs is generally poor.
• Visitors do not know which websites to trust.
• Redirected, shortened, or hidden URLs prevent users from seeing the full address of the web page.
• Users do not have much time to check a URL, or they reach certain online pages unintentionally.
• Consumers lack the ability to discern the difference between trustworthy and counterfeit websites.

Phishing attacks are now being used to distribute dangerous software such as ransomware. So, in this work, we concentrate on efficiently identifying phishing websites to prevent unaware internet users from falling victim to phishers and thereby lessen the emotional and financial damage. Today, everything in our day-to-day lives is digitally stored as data, and the actionable insights that can be extracted from it are the basis for intelligent solutions. "Data science" has recently become a trending topic in the computing world. Such data-driven solutions may be used to create an effective model as well as an intelligent decision-making system in a variety of real-world application domains, such as business, financial analysis, cybersecurity, IoT applications, and many more. As a result, the goal of this article is to provide an effective data-driven solution that uses machine learning techniques to evaluate whether a website is phishing. The majority of machine learning-based phishing detection algorithms gather features from the URL, search engines, third-party services, online traffic, DNS, and so on.

IJISRT23JUL307 www.ijisrt.com 615

Because of the difficulties and time limits, these methods may not be suited for real-time phishing detection. Phishing sites have a typical life cycle of less than nine hours, and half of them are removed in less than a day. Yet, most phishing pages that use hacked domains remain online for longer than a day. As a result, the research question addressed in this work is, "How can we design an efficient and intelligent phishing detection model while taking into consideration the challenges listed above?"

In this study, we develop a hybrid feature-based phishing detection method to address this research question. This method successfully detects phishing websites and addresses the aforementioned issues. We employ URL-based features to identify fraudulent websites.

Our feature extraction technique is independent of any search engine or third-party services. From the website's source code, we examine and extract hyperlink features. We get distinct types of data from the hyperlink information and different attributes from the URL. To train our classification model, we combine all the features into a hybrid feature set. The main contributions of our paper are as follows:

• We start by gathering legitimate and phishing website URLs from open-source platforms to construct a dataset. PhishTank is one such platform for phishing website information.
• We provide a method for accurately detecting phishing that dynamically harvests hybrid features and makes extensive use of them.
• Our suggested machine learning-based solution accurately and efficiently recognizes zero-hour phishing attacks.

II. LITERATURE SURVEY

In 2022, Sumitra Das Guptta, Khandaker Tayef Shahriar, and Hamed Alqahtani [1] proposed a machine learning model and system that helps detect phishing websites by analyzing URLs and their hyperlink-based complex features to get higher accuracy without using third-party applications. The classification algorithms used in the research are Logistic Regression, Random Forest, Decision Tree, SVM, and XGBoost. An accuracy of 99.17% is achieved, exceeding traditional approaches.

In 2022, Adarsh Mandadi, Saikiran Boppana, Vishnu Ravella, and R. Kavitha [2] used mainly two classifiers, Random Forest and Decision Tree. Their approach correctly identifies phishing and legitimate URLs with 87.0% and 82.4% accuracy, respectively.

In 2020, Jian Feng, Lianyang Zou, Ou Ye, and Jingzhou Han [3] proposed a multidimensional algorithm whose major components are automated representation learning from multi-aspect features and feature extraction using a hybrid deep learning network. This approach makes use of CNN-LSTM and NLP.

In 2020, Poonam Kumari, Apoorva H R Gowda, Bhandhavya K, Bhavya M U, and Spurthi M N [4] proposed a hybrid model to address the problems brought on by phishing websites. By merging several models, a hybrid-based model is created, which increases the accuracy of phishing attack detection.

In 2021, Sinduja. S, Monisha. S, Priya Dharshini. S, Sneha. K, and Vaishnavi. R [5] proposed a machine learning algorithm for efficient detection of phishing websites using hyperlink features and a random forest classifier.

In 2021, Om Sapate, Sumit Kolhe, Shantanu Taro, and Vishal Kumar Kashyap [6] proposed a framework for phishing website detection on mobile devices that uses various web browser plug-ins as well as a machine learning-based engine to detect zero-hour phishing websites.

In 2020, Mehmet Korkmaz, Ozgur Koray Sahingoz, and Banu Diri [7] proposed a machine learning-based phishing detection system that analyses URLs using eight distinct algorithms and three different datasets to compare the findings to existing research.

In 2019, Mohammad Mehdi Yadollahi, Farzaneh Shoeleh, Elham Serkani, Afsaneh Madani, and Hossein Gharaee [8] proposed a machine learning algorithm that uses a large number of features to distinguish between legitimate and fraudulent websites online. The suggested method is entirely client-side and does not call for any third-party services because it extracts various kinds of discriminative information from URLs and webpage source code.

In 2021, Youness Mourtaji, Mohammed Bouhorma, Daniyal Alghazzawi, Ghadah Aldabbagh, and Abdullah Alghamdi [9] proposed a new approach that merged the scores of different features selected using various feature selection methods, hence increasing the dependency of the selected feature sets.

In 2020, Xuqiao Yu [10] proposed a hybrid model combining a Deep Belief Network with Support Vector Machines to increase the accuracy of detection. The hybrid model also compensates for the weaknesses of the other model used in the research.

III. METHODOLOGIES

Existing Methods: These methods assess a website's credibility by examining its URL structure. An illicit domain or phishing website looks suspicious for reasons such as misspelled words, a false top-level domain, a fraudulent URL, a young domain age, a significantly lower page rank, or a long URL.

List-based techniques use whitelisted and blacklisted websites stored in universal website databases such as PhishTank. If the domain of the website in question matches one present in the blacklisted sites, then it is termed a phishing website, and if the domain matches one present in the whitelisted sites, then it is identified as a legitimate website.

Fig 1 Existing Methodology

This method has a primary flaw: it cannot identify newly created websites whose domains are absent from the listed database, i.e., it is not useful for dealing with zero-hour phishing attacks.

• Proposed Method
List-based, visual similarity-based, and machine learning-based approaches help us identify whether a website's URL is valid or not. The features can be divided into four main categories:

A. Address Bar-Based Features
These features are compiled directly from the URL, such as whether the URL length is greater than 54 characters, whether an IP address is present in the URL, whether a URL shortening service (tinyurl.com or bit.ly) was used, or whether redirection is used. Additional features include the following:
• Addition of a suffix or prefix separated by (-) in the domain
• Presence of sub-domains and domain
• Existence of HTTPS
• Domain registration age
• Favicon loading from a different domain
• Use of a non-standard port

B. Abnormal Features: These include:
• Images loaded in the body from a different URL
• Little or minimal use of meta tags
• Use of a Server Form Handler (SFH)
• Submitting information to an email address
• An abnormal URL

C. HTML and JavaScript-Based Features: These include characteristics such as:
• Website forwarding
• Page source code, photos, and textual content used in the website
• HTML code, CSS (Cascading Style Sheets), website logo, and so on
• Status bar customization, where a fake URL is displayed using various JavaScript methods
• Disabling the right-click feature to prevent users from inspecting or viewing the source code
• Use of pop-up windows
• iFrame redirection

D. Domain-Based Features: These include:
• Doubtful young domains
• Suspicious DNS records
• Low website traffic
• PageRank, since it is observed that most phishing websites have no PageRank
• Whether or not the site is indexed by Google

Fig 2 Proposed System

The above characteristics are based on the URL and hyperlink features of a website.

Building a machine learning model is the next step, which helps us detect zero-hour phishing websites.

Given all the criteria that can help us detect phishing URLs, we can use a machine learning algorithm, such as a random forest classifier or a decision tree classifier, to help us decide whether a URL is valid or not.

A machine learning-based approach is used wherein a dataset is created with the extracted features. A classification algorithm is then trained on the URL and hyperlink characteristics of phishing websites. When a machine learning model is trained on heuristic features, it can also be used to detect zero-hour phishing websites. Overall, of all the phishing website detection approaches available, the machine learning approach is the best suited.

• Abbreviations and Acronyms
A. CART
A Classification and Regression Tree (CART) is a special type of decision tree that describes how the values of a target variable can be predicted based on the values of feature variables.
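As a minimal illustrative sketch (not the authors' implementation), the Python below shows the core step a CART-style tree performs: choosing the feature whose split gives the lowest weighted Gini impurity. The input features are a few of the address bar-based checks described under the proposed method. The function names, the sample URLs, and their labels are hypothetical and exist only to make the example runnable.

```python
from urllib.parse import urlparse
import ipaddress

SHORTENERS = {"tinyurl.com", "bit.ly"}  # the two services named in the text


def url_features(url):
    """Return a few binary address-bar features for one URL (1 = suspicious)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError if host is not an IP
        has_ip = 1
    except ValueError:
        has_ip = 0
    return {
        "long_url": int(len(url) > 54),
        "has_ip": has_ip,
        "shortener": int(host in SHORTENERS),
        "hyphen_in_domain": int("-" in host),
        "no_https": int(parsed.scheme != "https"),
    }


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Pick the binary feature with the lowest weighted Gini impurity."""
    best_name, best_score = None, float("inf")
    for name in rows[0]:
        left = [y for r, y in zip(rows, labels) if r[name] == 0]
        right = [y for r, y in zip(rows, labels) if r[name] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
SAMPLES = [
    ("https://www.example.com/", 0),
    ("https://docs.example.org/help", 0),
    ("http://192.0.2.10/login", 1),
    ("http://secure-example.test/verify/account/update/session/confirm", 1),
]
rows = [url_features(u) for u, _ in SAMPLES]
labels = [y for _, y in SAMPLES]
feature, impurity = best_split(rows, labels)
print(feature, impurity)  # prints: no_https 0.0
```

On this toy data the HTTPS check happens to separate the two classes perfectly, which is an artifact of the four hand-picked samples and not a finding. A full tree would repeat the same split selection recursively on each branch, and a random forest would combine many such trees.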

B. URL
A Uniform Resource Locator (URL) is a unique identifier used to locate a resource on the Internet. It is also known as a web address.

C. SL
Supervised Learning (SL) is a machine learning technique in which models are trained on labeled data. Examples of supervised learning algorithms are Logistic Regression, Linear Regression, and Naive Bayes.

D. RAM
Random-Access Memory (RAM) is a type of computer memory that stores the data currently being processed.

IV. CONCLUSIONS

From our study of various papers, we found that, among the available methods for detecting phishing websites, the machine learning approach coupled with URL and hyperlink features gives the best results. Such a model is also capable of identifying zero-hour phishing websites, which are often misclassified by existing methods.

REFERENCES

[1] K. T. S. Sumitra Das Guptta, "Modeling Hybrid Feature-Based Phishing Websites Detection Using Machine Learning Techniques," Annals of Data Science, 2022.
[2] Ammar Jamil Odeh et al., "Machine Learning Techniques for Detection of Website Phishing: A Review for Promises and Challenges," 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), 2021.
[3] M. S. Sinduja. S, "Efficient Phishing Website Detection using Machine Learning Algorithm," International Journal for Research in Applied Science Engineering Technology, 2021.
[4] Rishikesh Mahajan, "Phishing Website Detection using Machine Learning Algorithms," International Journal of Computer Applications, Volume 181, No. 23, October 2020.
[5] Steve Sheng, Ponnurangam Kumaraguru, "Improving Phishing Countermeasures: An Analysis of Expert Interviews," International Journal of Computer Applications, Volume 181, No. 23, October 2020.
[6] Mohammad Alsharaiah, Ahmad Adel, "A new phishing website detection framework using Ensemble classification and clustering," International Journal of Data and Network Science, 2023.
[7] Poonam Kumari, Apoorva H R Gowda, Bhandhavya K, Bhavya M U, Spurthi M N, "Detecting Phishing Sites using Hybrid Model," International Journal of Engineering Research Technology (IJERT), 2020.
[8] Om Sapate, Sumit Kolhe, Shantanu Taro, Vishal Kumar Kashyap, "Preventing Phishing Attacks in Real-Time Using Machine Learning," International Journal of Advanced Research in Science, Communication and Technology (IJARSCT), 2021.
[9] Jian Feng, Lianyang Zou, Ou Ye, Jingzhou Han, "Phishing Webpage Detection Method Based on Multidimensional Features Driven by Deep Learning," IEEE Access, 2020.
[10] Arathi Krishna V, Anusree A, Blessy Jose, Karthika Anilkumar, Ojus Thomas Lee, "Phishing Detection using Machine Learning based URL Analysis: A Survey," International Journal of Engineering Research Technology (IJERT), NCREIS, 2021.