Research Paper
Abstract: Phishing is an online threat in which an attacker impersonates an authentic and trustworthy organization to obtain sensitive information from a victim. It has long been considered a problem. However, recent advances in phishing detection, such as machine learning-based methods, have helped combat these attacks. This paper therefore develops and compares four models to investigate how efficiently machine learning can detect phishing domains. It also compares the most accurate of the four models with existing solutions in the literature. The work carried out in this study updates previous systematic literature surveys, with more focus on the latest trends in phishing detection techniques. This study enhances readers' understanding of the different types of phishing website detection techniques, the data sets used, and the comparative performance of the algorithms used. Our findings show that the model based on K-means clustering is the most accurate of the four techniques and outperforms other solutions in the literature.

Keywords: phishing detection, machine learning, phishing domains, artificial neural networks, support vector machine, decision tree, random forest.

I. INTRODUCTION

The rapid evolution of technology has brought unprecedented convenience to our lives, but it has also given rise to a significant threat: phishing attacks. Social engineering attacks are a common security threat used to reveal private and confidential information by tricking the user without being detected. Phishing attacks are fraudulent emails, text messages, phone calls, or websites designed to trick users into downloading malware or sharing sensitive information and personal data. Personal data can include bank account details, card numbers, social media IDs, or login credentials.

Phishing is the most common type of social engineering attack, that is, the practice of deceiving, pressuring, or manipulating people into sending information or assets to the wrong people. Social engineering attacks rely on human error and pressure tactics for their success. The attacker typically masquerades as a person or organization the victim trusts (e.g., a coworker, a boss, or a company the victim or the victim's employer does business with) and creates a sense of urgency that drives the victim to act rashly. Hackers and fraudsters use these tactics because it is easier and less expensive to trick people than to hack into a computer or network.

Typically, a phishing attack exploits social engineering to lure the victim by sending a spoofed link that redirects the victim to a fake web page. The spoofed link is placed on popular web pages or sent to the victim via email. The fake web page is made to look like the legitimate one. Thus, rather than reaching the real web server, the victim's request is directed to the attacker's server. Current solutions such as antivirus software, firewalls, and designated software do not fully prevent web spoofing attacks.

The use of Secure Sockets Layer (SSL) and digital certificates issued by a certificate authority (CA) also does not protect web users against such attacks. In a web spoofing attack, the attacker diverts the request to a fake web server. In fact, certain types of SSL connections and certificates can be forged while everything appears to be legitimate. A secure browsing connection is reported to do virtually nothing to protect users, especially from attackers who know how "secure" connections actually work. This paper develops an anti-web spoofing solution
based on inspecting the URLs of fake web pages. The solution applies a series of steps that check the characteristics of a website's Uniform Resource Locator (URL).

Our phishing detection website project is a proactive response to the escalating cyber threats that exploit human vulnerability. The website is designed to combat phishing attempts by employing advanced algorithms, machine learning, and real-time data analysis. By leveraging these technologies, our platform will empower users to identify and thwart phishing attacks effectively, thereby safeguarding their sensitive information from falling into the wrong hands.

II. BACKGROUND

Several machine learning algorithms are currently in use and have proven efficient in phishing domain detection. Some of these are:

1. Random Forest
Random forest is an ensemble of supervised learning models for classification and regression used in predictive modeling and machine learning [1]. Random forest has attracted attention due to its speed and high accuracy. It aggregates the predictions of many decision trees to select the best result: the majority class (the most common prediction among the trees) for classification, or the average prediction for regression. Random forest divides the data set into two parts: training and testing. It then randomly draws many samples from the training set. For each sample, a decision tree is built that splits each node into two children using the optimal split. This step is repeated for every tree, each tree votes for a prediction, and the prediction with the most votes is chosen as the final result. The main hyperparameters in random forest are used either to increase the predictive power of the model or to make the model faster [2]. More trees can improve performance and make predictions more stable, but they also increase processing time. Tuning the maximum number of features considered at each split, together with the minimum number of samples per leaf, can also improve the performance of the algorithm. Once the training step is completed, the model can be applied to test data. This allows results to be predicted and then compared to the expected results [3]. Figure 2 shows how each tree is responsible for producing a different output when given an independent random sample.

Random forest is valued for its low generalization error, and its accuracy improves as the forest grows in size. With the features picked at random, the accuracy depends largely on the correlation between the trees. The characteristics of the forest can be assessed by tracking the error and the correlation between trees. As a consequence, the relevance of each variable can be measured.

Figure 2. A comparison of DT and RF [4].
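The bagging, random feature selection, and voting steps described for random forest can be illustrated with a minimal sketch. This is not the implementation used in this paper: it assumes one-split trees (stumps) for brevity, a small synthetic data set in which only the first feature is informative, and hypothetical helper names (`fit_stump`, `fit_forest`, `predict`).

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Return the (feature, threshold, left_label, right_label) split with the fewest errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample rows with replacement
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature_ids))
    return forest


def predict(forest, row):
    """Each tree votes; the class with the most votes is the final result."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Synthetic data (assumption): feature 0 tracks the label, features 1 and 2 are noise.
data_rng = random.Random(42)
labels = [i % 2 for i in range(200)]
rows = [[y + data_rng.gauss(0, 0.3), data_rng.random(), data_rng.random()] for y in labels]

train_rows, test_rows = rows[:160], rows[160:]        # 80/20 split
train_labels, test_labels = labels[:160], labels[160:]

forest = fit_forest(train_rows, train_labels)
accuracy = sum(predict(forest, r) == y for r, y in zip(test_rows, test_labels)) / len(test_rows)
print(f"test accuracy: {accuracy:.2f}")
```

Increasing `n_trees` makes the vote more stable at the cost of longer training, which mirrors the trade-off noted above. Production libraries grow full trees rather than stumps.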
2. Support Vector Machine
SVM is a supervised learning method for pattern recognition and regression. Statistical learning theory can identify the key factors needed to learn successfully with specific, simple algorithms; most real-world applications, however, need more complex models and algorithms (such as neural networks), which are much harder to analyze theoretically. SVM lies at the intersection of learning theory and practice. The models it creates are complex (for example, they include a large class of neural networks) and yet simple enough to be analyzed mathematically. This is because SVM is a linear algorithm in a high-dimensional space [5]. As shown in Figure 3, SVM predicts labels by constructing a decision boundary (a hyperplane) that separates the two groups. The hyperplane is positioned by the data points nearest to it, the support vectors, and the distance between the data points and the hyperplane is used to classify each group.

Previous research has demonstrated that the hyperplane with the greatest margin of separation between the two classes offers the highest generalization performance [6]. The best hyperplane is found by solving a convex optimization problem that minimizes a quadratic function under linear inequality constraints. The solution may be expressed in terms of support vectors, which are a subset of the training instances. Support vectors contain all the information required to solve a classification problem, since the result remains the same even if all other vectors are removed.

Figure 3. Support vector machine [2].

3. Gradient Boosting
Gradient boosting algorithms have emerged as a focal point in machine learning research owing to their strong performance across a wide range of predictive tasks. Research papers frequently examine gradient boosting for its ability to improve predictive accuracy, particularly on large and intricate datasets. Proposed enhancements include novel loss functions, regularization methods, and optimization strategies that improve performance or address specific challenges such as overfitting. The applicability of gradient boosting across domains such as finance, healthcare, natural language processing, and computer vision is also a common subject of investigation, with researchers comparing its efficacy against other machine learning techniques and tailoring its implementation to specific data characteristics or tasks.

Because the sequential nature of gradient boosting makes scalability a concern, research papers frequently explore ways to improve efficiency, including parallelization, distributed computing, and hardware acceleration. Efforts to improve the interpretability of gradient boosting models are also prevalent, with techniques such as feature importance analysis, partial dependence plots, and model visualization used to explain the inner workings of these complex algorithms. Through benchmarking and comparative studies, researchers aim to clarify the strengths and weaknesses of gradient boosting, thus contributing to the advancement of machine learning methodologies and applications.

Gradient boosted trees are a variant whose base learner is CART (Classification and Regression Trees). The diagram below explains how gradient-boosted trees are trained for regression problems; the resulting prediction is

$$ y_{\text{pred}} = y_1 + \eta r_1 + \eta r_2 + \dots + \eta r_N $$

where y_1 is the initial prediction, \eta is the learning rate, and r_i is the residual predicted by the i-th tree.

4. Logistic Regression
Logistic regression uses a logistic function, called the sigmoid function, to map predictions to probabilities. The sigmoid function is an S-shaped curve that converts any real value to a value between 0 and 1.
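Two formulas from this section can be made concrete: the additive prediction of gradient-boosted trees and the sigmoid function. The sketch below is illustrative only. It assumes a one-dimensional regression problem, one-split regression trees (stumps) as base learners, and hypothetical function names; it is not the model evaluated in this paper.

```python
import math


def fit_regression_stump(xs, residuals):
    """Best single-threshold split of a 1-D input, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_trees=20, eta=0.3):
    """Fit each new stump to the residuals left by the current ensemble."""
    y1 = sum(ys) / len(ys)  # initial prediction: the mean target
    predictions = [y1] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + eta * stump_predict(stump, x) for p, x in zip(predictions, xs)]
    return y1, eta, stumps


def gb_predict(model, x):
    """y_pred = y1 + eta*r1 + eta*r2 + ... + eta*rN"""
    y1, eta, stumps = model
    return y1 + sum(eta * stump_predict(stump, x) for stump in stumps)


def sigmoid(z):
    """Map any real value into (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mse(targets, preds):
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


# Toy regression target (assumption): one period of a sine curve.
xs = [i / 10 for i in range(50)]
ys = [math.sin(x) for x in xs]

model = fit_gradient_boosting(xs, ys)
baseline_mse = mse(ys, [model[0]] * len(ys))
boosted_mse = mse(ys, [gb_predict(model, x) for x in xs])
print(f"baseline MSE {baseline_mse:.4f} -> boosted MSE {boosted_mse:.4f}")
print(f"sigmoid(0) = {sigmoid(0)}")
```

A smaller `eta` shrinks each tree's contribution, so more trees are needed to reach the same fit. This is the usual trade-off behind the learning rate.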
III. RELATED WORK

In general, users ignore website URLs. This increases their chances of falling for phishing domains, which can be prevented by determining whether the URL is genuine. Unfortunately, modern methods for detecting phishing attacks have limited accuracy and detect only 20% of attempts. Machine learning techniques for phishing detection can produce better results, but they are time-consuming and do not scale, even with small databases. Additionally, heuristic-based phishing detection has a high false positive rate. Previous research on anti-phishing models has focused on strategies to improve performance.

However, the use of feature reduction and ensemble models can increase the accuracy of these models. Machine learning algorithms for phishing domain detection are popular, and detection is commonly treated as a simple classification problem. To build an ML detection model, the input data must contain features related to the phishing and legitimate website classes. Previous studies have shown that detection accuracy is high when using robust machine learning. Various feature selection strategies are used to reduce the number of features. To train a machine learning model to predict phishing attacks and legitimate traffic, a dataset needs to be provided as input.

When features are reduced, dataset visualization becomes more efficient and easier to understand. DT, C4.5, k-NN, and SVM are the most prominent algorithms; they have been used in many research projects and have detected phishing attacks with the most accurate and effective results. As empirical tests show, manual parameter tuning, long training periods, and poor detection accuracy are prevalent problems.

Despite these benefits, researchers have noted the limits of their studies. Many pointed out that ensemble learning techniques have not been applied and that feature selection and reduction have not been performed. A range of strategies has been applied to combat phishing attacks. One paper [7] used different classifiers, such as naive Bayes and SVM. Similarly, the authors in [8] utilized random forest to differentiate phishing attacks from normal websites.

Table 1. Comparison table of the latest research focusing on machine learning phishing detection techniques

IV. METHODOLOGY

Using the Kaggle dataset, four phishing detection models were developed using K-means clustering algorithms. Normalization was employed as a preprocessing strategy to improve the models' accuracy. The proposed models were able to detect different types of attacks from the UCI dataset. The following subsections discuss the dataset used and the implemented algorithms (Sections IV.1 and IV.2, respectively).

1. Dataset Used
The dataset is taken from Kaggle, https://www.kaggle.com/eswarchandt/phishing-website-detector. It is a collection of website URLs for 11000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1). Overall, the dataset has 11054 samples with 32 features.

2. Implemented Algorithm
To increase accuracy, this paper uses MinMax normalization as a preprocessing step in each proposed model. Normalization is a useful strategy for improving the accuracy of machine learning models, and it is required for some models to work properly. The MinMax
normalization technique in the suggested model compresses the data to the range [0, 1], which improves the quality of the model's training input (see Equations (1) and (2)).

$$ X_{\text{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1} $$

$$ X_{\text{scaled}} = X_{\text{std}} \times (\max - \min) + \min \tag{2} $$

Here X_min and X_max are the smallest and largest observed values of a feature, and min and max are the bounds of the target range (0 and 1 in this case).

To improve model performance and manage complexity, we used a data normalization strategy, as shown in Table 2. The algorithm selects the features of the initial dataset that most influence the prediction outcome by filtering among the 30 features. The UCI dataset is split 80/20 into training and testing sets, respectively, using 5-fold cross-validation, which presented the best performance in the latest research. The prediction model is then trained using machine learning, which employs various learning models. This is particularly useful for making predictions, as using many models ensures that the results are not biased toward a single model. To account for this, we present the results of all the models combined and totaled to establish their maximum accuracies. If most of the models indicate that a domain is phishing, then the combined prediction confirms that the domain is a phishing attempt.

V. MODEL'S FLOWCHART

Phishing is a concern to many individuals. However, existing methods, such as browser security indicators, cannot detect phishing websites. Due to the limits of current technology, users must evaluate on their own whether a URL is phishing or not. As a result, an automated technique for phishing website identification should be explored for increased cyber safety. This study shows how an implemented feature extraction approach and a prediction model based on a random forest classifier help increase the likelihood that a user will correctly identify a phishing website.

Each of the developed models, as shown in Figure 7, employs a feature selection technique to increase its accuracy. The heat map from the data analysis is used to pick the features that most affect the predicted result, filtering the most informative features out of the original dataset. As a result, irrelevant features have no effect on the model's efficiency or prediction.
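The preprocessing, splitting, and voting steps described in Section IV.2 can be sketched as follows. This is a minimal illustration rather than the code used in this study. The function names are hypothetical, and the sketch assumes that the label 1 marks a phishing site and -1 a legitimate one, which the dataset description does not state explicitly.

```python
def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Apply Equations (1) and (2) to one feature column."""
    x_min, x_max = min(column), max(column)
    new_min, new_max = feature_range
    span = x_max - x_min
    scaled = []
    for x in column:
        x_std = (x - x_min) / span if span else 0.0           # Equation (1)
        scaled.append(x_std * (new_max - new_min) + new_min)  # Equation (2)
    return scaled


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; with k=5 every fold is an 80/20 split."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for fold in range(k):
        start = fold * fold_size
        stop = start + fold_size if fold < k - 1 else n_samples
        yield indices[:start] + indices[stop:], indices[start:stop]


def majority_vote(predictions):
    """Combine per-model labels (assumed: 1 = phishing, -1 = legitimate).

    A domain is flagged only if most models say phishing; a tie is not a majority.
    """
    return 1 if sum(predictions) > 0 else -1


scaled = min_max_scale([2, 4, 6])
folds = list(k_fold_indices(10, k=5))
verdict = majority_vote([1, 1, -1, 1])

print(scaled)
print([(len(train), len(test)) for train, test in folds])
print(verdict)
```

With four models a 2-2 tie is possible. Treating a tie as legitimate follows the "most of the models" wording above, but it is a design choice that could reasonably go the other way.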
• Detecting zero-day phishing attacks: Zero-day phishing attacks are new and unknown attacks that have not been seen before. Researchers can develop machine learning algorithms that detect zero-day phishing attacks by analyzing the behavior of users and the network.

• Detecting phishing attacks on mobile devices: With the increasing use of mobile devices, phishing attacks on mobile devices are becoming more common. Researchers can develop machine learning algorithms that detect phishing attacks on mobile devices by analyzing the user's behavior and the characteristics of the mobile device.

• Developing real-time phishing detection systems: Real-time phishing detection systems can detect phishing attacks as they happen, allowing users to take immediate action to protect themselves. Researchers can develop machine learning algorithms that detect phishing attacks in real time by analyzing network traffic and user behavior.

The most important way to protect users from phishing attacks is education and awareness. Internet users must be aware of the security tips given by experts. Every user should also be trained not to blindly follow links to websites where they have to enter sensitive information. It is essential to check the URL before entering the website. In the future, the system can be upgraded to detect web pages automatically and to improve the compatibility of the application with web browsers. Additional work can also be done by adding other characteristics that distinguish fake web pages from legitimate ones. The PhishChecker application can also be upgraded into a mobile web application for detecting phishing on mobile platforms.

There are many features of this work that can be improved to address other issues. The heuristics can be further developed to detect phishing attacks in the presence of embedded objects such as Flash. Identity extraction is an important operation, and it was improved with an Optical Character Recognition (OCR) system to extract text and images. More effective inference rules for identifying a given suspicious web page, and strategies for discovering whether it is a phishing target, should be designed to further improve the overall performance of this system.
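As an illustration of the URL checks and additional distinguishing characteristics mentioned above, the following sketch extracts a few lexical features from a URL. The feature set and the name `url_features` are assumptions chosen for illustration; they are not the PhishChecker rules or the features used in this study, and the example URLs use reserved documentation domains.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url):
    """Extract simple lexical indicators from a URL (illustrative feature set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common warning sign.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
        "has_at_symbol": "@" in url,       # text before '@' is not the real host
        "has_hyphen_in_host": "-" in host,
        "subdomain_count": 0 if host_is_ip else max(host.count(".") - 1, 0),
        "url_length": len(url),
    }


ip_url = url_features("http://192.168.0.1/login")
plain_url = url_features("https://www.example.com/")
at_url = url_features("http://example.com@evil-site.test/")

for name, features in (("ip", ip_url), ("plain", plain_url), ("at", at_url)):
    print(name, features)
```

Features of this kind would normally be fed to a trained classifier rather than used as hard rules, since each indicator on its own also occurs on legitimate sites.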