Enhanced Phishing Website Detection: Leveraging Random Forest and XGBoost Algorithms With Hybrid Features
ISSN No:-2456-2165
Abstract:- Phishing is a technique used by hackers or attackers to scam people on the internet into giving up private details such as login credentials for various profiles, social security numbers (SSNs), banking information, etc. Attackers disguise a webpage as an official, legitimate website. Blacklist or whitelist, heuristic, and visual similarity-based anti-phishing solutions are unable to detect zero-hour phishing attacks or newly created websites. Older methods are more complex and not suitable for day-to-day scenarios since they rely on external sources such as search engines. As a result, finding newly constructed phishing websites in a real-time context is a significant hurdle in the field of cybersecurity. This paper presents a hybrid feature-based anti-phishing approach that addresses these problems by extracting characteristics solely from URL and hyperlink data available on the client side. Also, a brand-new dataset is created for experiments employing popular machine-learning classification techniques. Our experimental findings indicate that the presented random forest-based phishing website detection approach is more effective and achieves a higher accuracy of 96.81% when blended with the XGBoost technique.

Keywords:- Cybersecurity, Phishing Detection, Machine Learning, Hyperlink Feature, URL Feature, Anti-Phishing, XGBoost, Hybrid Feature.

I. INTRODUCTION

In 2022 alone, about 69% of the world's population actively used the internet. This suggests that the number of internet users will keep increasing in the coming years. In the field of cybersecurity, phishing is currently one of the most serious and dangerous online threats [1]. The rapid advancement of Internet technology has greatly boosted the use of social media, online banking, e-commerce services, and other similar services. In 2022, 166,187,118 harmful email attachments were stopped by Kaspersky Mail Anti-Virus. Attempts to click on phishing URLs were blocked 507,851,735 times by Kaspersky's anti-phishing system, and 378,496 attempts to click on phishing URLs were related to the takeover of Telegram accounts. According to "A Digital Report in 2021" data from We Are Social (Global Overview Report 2021) [2], there were 4.66 billion internet users worldwide, up 7.3 percent (316 million additional users) from January 2020. Internet penetration stood at 59.5 percent, which gives phishing attackers ample opportunity to profit by extorting and stealing private data from online users [3]. The attacker creates a fake website and distributes links via emails, Facebook, Twitter, and other social media applications. When a user unknowingly opens the link and changes or fills in any sensitive and private credentials, attackers obtain access to the user's information, such as financial information, personal information, login credentials, and so on. Cybercriminals use stolen information for a range of illicit actions, including blackmailing victims. Consumers fall prey to phishing mainly for the following reasons:

Users' understanding of URLs is generally poor.
Visitors do not know which websites to trust.
Redirected, shortened, or hidden URLs prevent users from seeing the full address of the web page.
Users do not have much time to check a URL, or they reach certain web pages unintentionally.
Consumers lack the ability to tell trustworthy websites from counterfeit ones.

Phishing attacks are now also being used to distribute dangerous software such as ransomware. In this work, we therefore concentrate on efficiently identifying phishing websites to prevent unaware internet users from falling victim to phishers and thereby lessen the emotional and financial damage. Today, nearly everything in our day-to-day lives is stored digitally as data, and the actionable insights that can be extracted from it are the basis for intelligent solutions. "Data science" has recently become a trending topic in the computing world. Such data-driven solutions may be used to create an effective model as well as an intelligent decision-making system in a variety of real-world application domains, such as business, financial analysis, cybersecurity, IoT applications, and many more. As a result, the goal of this article is to provide an effective data-driven solution that uses machine learning techniques to evaluate whether a website is a phishing website. The majority of machine learning-based phishing detection algorithms gather characteristics from the URL, search engines, third parties, online traffic, DNS, and so on.
Proposed Method

List-based, visual similarity-based, and machine learning-based approaches help us identify whether a website's URL is legitimate or not. The features can be divided into four main categories:

A. Address Bar-Based Features
These features are compiled directly from the URL, such as whether the URL length is greater than 54 characters, whether an IP address is present in the URL, whether a URL shortening service (tinyurl.com or bit.ly) was used, or whether redirection is used. Additional features include the following:
Addition of a suffix or prefix separated by (-) in the domain
Presence of sub-domains and domain
Existence of HTTPS
Domain registration age
Favicon loading from a different domain
Using a non-standard port

B. Abnormal Features: These include
Images loaded in the body from a different URL
Little or minimal use of meta tags
Use of Server Form Handler (SFH)
Submitting information to an email address
An abnormal URL

C. HTML and JavaScript-Based Features: These include characteristics such as
Website forwarding
Page source code, photos, textual content used in the website

Fig 2 Proposed System

The above characteristics are based on the URL and hyperlink features of a website.

Building a machine learning model is the next step, which helps us detect zero-hour phishing websites.

Given all the criteria that can help us detect phishing URLs, we can use a machine learning algorithm, such as a random forest classifier or a decision tree classifier, to help us decide whether a URL is legitimate or not.

A machine learning-based approach is used in which a dataset is created from the extracted features. A classification algorithm is then trained on the URL and hyperlink characteristics of phishing websites. When a machine learning model is trained on heuristic features, it can also be used to detect zero-hour phishing websites. Overall, of the available phishing website detection approaches, the machine learning approach is the best suited.

Abbreviations and Acronyms

A. CART
A Classification and Regression Tree (CART) is a special type of decision tree that describes how the values of a target variable can be predicted based on the values of feature variables.
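To make the CART definition more concrete, the following minimal sketch shows how a single CART-style node could choose a binary split. It assumes Gini impurity as the split criterion, which is the usual choice for CART classification trees but is not stated in this paper. The function names (`gini`, `best_split`) and the toy data are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split.

    Candidate thresholds are the midpoints between consecutive distinct
    values of each feature; samples with value <= threshold go left.
    """
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical toy data: [url_length, has_ip_address]; label 1 = phishing.
rows = [[30, 0], [42, 0], [50, 0], [60, 0], [75, 1], [90, 0]]
labels = [0, 0, 0, 1, 1, 1]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full tree would apply `best_split` recursively to the left and right subsets until a stopping rule is met. A random forest, as used in this work, combines many such trees trained on resampled data.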
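Returning to the address bar-based features described under the proposed method, the sketch below illustrates how a few of them might be computed on the client side from the URL string alone. It is an illustration under stated assumptions, not the paper's implementation: the function name `address_bar_features`, the feature keys, and the two-entry shortener list are hypothetical, and only the 54-character threshold and the two shortening services come from the text. Features that need external lookups or page content (domain registration age, favicon origin) are left out.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: a tiny illustrative list; the text names only these two services.
SHORTENERS = {"tinyurl.com", "bit.ly"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_is_ip(host):
    """True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url):
    """Compute a few address bar-based features from the URL string only."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    port = parts.port  # None when the URL carries no explicit port
    is_ip = host_is_ip(host)
    return {
        "long_url": len(url) > 54,             # threshold taken from the text
        "has_ip_address": is_ip,
        "uses_shortener": host in SHORTENERS,
        "hyphen_in_domain": "-" in host,       # prefix/suffix separated by "-"
        # Rough proxy for sub-domain depth (assumption, not the paper's rule).
        "dot_count_in_host": 0 if is_ip else host.count("."),
        "uses_https": parts.scheme == "https",
        "non_standard_port": port is not None
        and port != DEFAULT_PORTS.get(parts.scheme),
    }


# Hypothetical example URLs, for illustration only.
for example in (
    "http://192.168.10.5:8080/login",
    "https://bit.ly/3xYz",
    "https://secure-login.example.com.account-verify.example.net/signin?session=12345",
):
    print(example, address_bar_features(example))
```

Each dictionary can be turned into one row of a feature table, which is the kind of input a random forest or decision tree classifier expects.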
IV. CONCLUSIONS
REFERENCES