
FINALREPORT

The document is a seminar report on detecting phishing attacks using machine learning. It was written by Yanamala Yamuna, a student at the Mahatma Gandhi Institute of Technology, for their Bachelor of Technology degree in Computer Science and Engineering. The report develops and compares machine learning models for detecting phishing domains, including artificial neural networks, support vector machines, decision trees, and random forests. It finds that the random forest technique is the most accurate model and outperforms other solutions in literature for phishing detection.


A SEMINAR (CS705PC) REPORT

On

“PHISHING ATTACKS DETECTION USING ML”

By

YANAMALA YAMUNA

H.T. No: 20261A0560

Under the Guidance of

Mr. A. RATNA RAJU

(Assistant Professor)

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

MAHATMA GANDHI INSTITUTE OF TECHNOLOGY

(Affiliated to Jawaharlal Nehru Technological University Hyderabad)

GANDIPET, HYDERABAD-500075, Telangana (INDIA)


(MGIT)

MAHATMA GANDHI INSTITUTE OF TECHNOLOGY


(Estd. in 1997 by Chaitanya Bharathi Educational Society) (Affiliated to JNTUH, Hyderabad;
Accredited by NBA, AICTE, New Delhi) Kokapet (Village & Gram Panchayat), Gandipet
(Mandal), Ranga Reddy (Dist.), Chaitanya Bharathi P.O., Hyderabad-500075

CERTIFICATE

This is to certify that the Seminar (CS705PC) entitled “PHISHING ATTACKS DETECTION
USING ML” being submitted by YANAMALA YAMUNA bearing Roll No: 20261A0560 in
partial fulfillment of the requirements for the Award of the Degree of Bachelor of Technology
in Computer Science and Engineering is a record of bonafide work carried out by her.

COORDINATORS: Ms. M. Mamatha, Dr. K. Sreekala

HEAD OF DEPARTMENT: Dr. C. R. K. Reddy, Professor, Dept. of CSE

PHISHING ATTACKS DETECTION


Yanamala Yamuna
Department of Computer Science & Engineering
Mahatma Gandhi Institute of Technology
Hyderabad, India
Abstract

Phishing is an online threat where an attacker impersonates an authentic and trustworthy organization to obtain sensitive information from a victim. One example of such a threat is trolling, which has long been considered a problem. However, recent advances in phishing detection, such as machine learning-based methods, have assisted in combatting these attacks. Therefore, this paper develops and compares four models for investigating the efficiency of using machine learning to detect phishing domains. It also compares the most accurate model of the four with existing solutions in the literature. These models were developed using artificial neural networks (ANNs), support vector machines (SVMs), decision trees (DTs), and random forest (RF) techniques. Moreover, the UCI phishing domains dataset of uniform resource locators (URLs) is used as a benchmark to evaluate the models. Our findings show that the model based on the random forest technique is the most accurate of the four techniques and outperforms other solutions in the literature.

1. Introduction

Phishing is an online crime that tries to trick unsuspecting users into exposing their sensitive (and valuable) personal information. This can include usernames, passwords, financial account details, login credentials, personal addresses, and social relationships, which the attacker then uses for malicious purposes, such as identity theft. Phishing is usually perpetrated by a hacker disguising themselves as a trustworthy entity, an effect achieved by combining both social engineering and technical tricks.

Phishing domains are one type of attack. These domains obtain sensitive information without authorization, either through blackmail or by directing users to a fake website that looks similar to a real one. Both then request personal information. Security breaches occur when users enter their private data into these sites, as the assailant now has personal information that may be used to commit identity theft.

Most financial and government institutions have improved their direct internet offerings to combat potential security breaches such as phishing domains. However, as services on the Internet continue to grow, so has the public's reliance on online services. Despite the risks that phishing attacks pose, online shopping, banking, and bill payment have all become popular in the United States and developed European countries. Successful phishing attempts have had an impact on global finances, raising the risk for both clients and businesses and further increasing the need for protection online. The process of defending cyberspace against threats such as phishing is known as cybersecurity. Protecting internet-connected resources from cyber-attacks is cybersecurity's main goal.

Cybersecurity is becoming increasingly complicated as cyber-attacks become more complex and more frequent, making it difficult to recognize, assess, and handle significant risk events. The Anti-Phishing Working Group (APWG) discovered more than 51,000 distinct phishing websites. According to the Rivest–Shamir–Adleman (RSA) analysis, phishing attacks cost global enterprises $9 billion in 2016. Over one million phishing attacks were listed in 2016, a 65% increase from the previous year. The frequency of these attacks erodes consumers' trust in social works, such as webpages.

There are various types of web fraud, and phishing websites are a common entry point for online social engineering attempts. To start, the hacker creates a webpage by impersonating a reputable website. They then send these fraudulent URLs to potential victims via spam chats, messages, or social media sites, hoping that unsuspecting users will believe each one is a real URL. If users enter their personal information (bank account numbers, government savings numbers, and so on) at the link sent by the hacker, that data will be compromised.

There are many strategies to combat phishing. Artificial intelligence (AI) has had a huge impact on almost every industry, including cybersecurity, because AI can detect spam, phishing, spear phishing, and other assaults using past attacks in the form of datasets.

This paper develops and compares the effectiveness of machine learning (ML) classification models in detecting phishing domains. The goal is to improve detection by using the most accurate model of the four to predict whether a webpage is phishing or legitimate. A phished domain is difficult to analyze and comprehend since it involves social and technical issues for which there is no one-size-fits-all analysis. As a result, all phishing domain causes and features were analyzed quantitatively and qualitatively to determine where to focus the model to better decrease the danger arising from a visit to a phishing website, particularly regarding consumer trust.

2. Background

Common machine learning classification techniques have proven efficient in phishing domain detection, including the following:
2.1. Decision Tree

A decision tree helps individuals make better decisions via a tree-like graph or modeling of alternatives and their possible implications, such as likely outcomes, resource costs, and utility. It is one strategy of many to demonstrate an algorithm completely made up of conditional control statements. Decision trees are frequently used to analyze the underlying relationships in big datasets. The decision tree's goal is to observe a process; by doing this, researchers can utilize its attributes, allowing it to be assigned to a certain class, as shown in Figure 1, which shows a training algorithm that creates the structure of such a decision tree. After its construction, a decision tree may be used to assess further samples with variable degrees of success, depending on how well it represents the dataset. The success rate is determined by several aspects, including the size of the dataset used to create the tree, the class-wise overlapping of variable observations, the algorithm used to build the tree, and the usage of extra methods to enhance tree development.

Figure 1. Decision tree algorithm.

The root node is the starting point of the decision tree (also known as the parent node). It represents the full dataset, which is then split into two or more homogeneous groups, also called child nodes. These eventually lead to the leaf nodes, which are the tree's final output, after which no further splits are possible. Splitting is the process of separating the root node into sub-nodes based on the conditions specified. A branch/sub-tree is a tree that has been created by splitting a larger tree. Pruning is the removal of unwanted branches.

2.2. Random Forest

Random forest is an ensemble of supervised learning algorithms for classification and regression used in predictive modeling and machine learning techniques. The random forest has attracted the attention of academics because of its speed and accuracy in categorization. It gathers the results and predictions of several decision trees to choose the best output: the mode of the classes (the value that appears most often in the decision tree results) or mean prediction. Random forest splits the dataset into two sections: the training set and the test set. It then randomly selects multiple samples from the training set. Next, the researcher uses the decision tree for each sample, which divides each selection into two daughter nodes using the best split. Thereafter, users must repeat the last step to vote for each prediction result and select the most voted prediction as the final result. The main hyperparameters in the random forest are used to either increase the predictive power of the model or to make the model faster. In this context, a higher number of trees can increase the performance as well as make the predictions more stable, but it also increases the processing time. The employment of a maximum number of features, in addition to a minimum number of leaves, may improve algorithm performance.

Once the training step is completed, the model can be applied to a test dataset. This procedure allows for the estimation of predictions and then for the comparison of the results against the expected values. Figure 2 shows how each tree is responsible for producing a distinct output after being fed an independent random sample vector.

The random forest is used for its error generalization technique, and the random forest's accuracy improves as the forest grows in size. After randomly picking the features for the error rate, the accuracy is entirely dependent on the correlation between the trees. The random forest's characteristics might be created by tracking the error and correlation between nodes. As a consequence, the relevance of a variable can be measured [17].

Figure 2. A comparison of DT and RF.

2.3. Support Vector Machine

SVM is a supervised learning method based on statistical learning theory utilized for pattern identification and regression. Statistical learning theory can pinpoint the factors needed to successfully learn specific, easy algorithms; real-world applications frequently require more complicated tools and algorithms (such as neural networks), which are much more difficult to analyze theoretically. SVMs are the meeting point of learning theory and practice. They create models that are both complicated (including a huge class of neural networks, for example) and simple enough to be mathematically examined. This is because an
SVM is a linear algorithm in a high-dimensional space. As shown in Figure 3, SVM predicts labels by generating a decision boundary, such as a hyperplane, between two specified classes with a minimum of one label. The data points and support vectors are handled by the hyperplane. It takes advantage of the distance between data points to categorize each class independently.

Figure 3. Support vector machine.

Previous research has demonstrated that the hyperplane with the greatest margin of separation between the two classes offers the highest generalization performance. The best hyperplane is found by solving a convex optimization problem involving the minimization of a quadratic function under linear inequality constraints. The answer may be expressed in terms of support vectors, which are a subset of the training instances. Support vectors include all the information required to solve a classification issue since the result will remain the same even if all other vectors are removed.

2.4. Ensemble Classification Techniques

Building a fair model from a dataset is one of the main goals of machine learning algorithms. Learning, or training, is the process of developing models from data, and the learned model is referred to as a hypothesis or learner. Ensemble methods learn algorithms that create a set of classifiers and then use their predictions to put new data points into categories.

Ensembles are often more accurate than the individual classifiers that make them up. Ensemble methods, also known as committee-based learning or learning multiple classifier systems, are used to train numerous hypotheses to solve a problem. Random forest trees are a common form of ensemble modeling in which many decision trees are utilized to predict outcomes. Figure 4 shows a general ensemble architecture.

An ensemble is made up of numerous hypotheses or learners that are produced from training data using a basic learning method. Most ensemble methods produce homogeneous base learners or homogeneous ensembles using a single base learning algorithm, but some approaches use multiple learning algorithms to build heterogeneous ensembles. The ability of ensemble approaches to enhance weak learners is well established. Below are three major types of meta-algorithms that are regularly used in ensemble approaches.

2.4.1. Bagging

Bagging, or bootstrap aggregation, is a powerful, effective, and simple ensemble method [24]. The method uses bootstrapping to sample several copies of a training set. It may be applied with any form of classification or regression model, as demonstrated in Figure 4. Bagging works well with nonlinear models that are unstable.

Figure 4. Ensemble learning (bagging) [25].

2.4.2. Boosting

Boosting is a meta-algorithm that can be thought of as a method of model averaging. It is the most popular ensemble approach, as well as one of the most effective learning concepts. This method was created for classification, but it can also be applied to regression. The original boosting algorithm created a strong learner by combining three weak learners.

2.4.3. Stacking

Stacking is the process of integrating numerous classifiers created by various learning algorithms into a single dataset of feature vector pairs and their classifications. A set of base-level classifiers is constructed in the first phase, and a meta-level classifier is trained in the second phase, as shown in Figure 5.

Figure 5. Ensemble learning, bagging, and boosting machine learning techniques.

2.5. Artificial Neural Network

A neural network (NN) is a mathematical model that mimics the behaviour of biological neurons and the nervous system. ANNs utilize technological solutions to imitate the architecture and functions of the neural system of human brains
[29]. They use neural network topologies to represent physical systems in this way. McCulloch and Pitts introduced the ANN theory for the first time in [30]. ANNs are appropriate for addressing the mapping issue from one dataset to another when they have strong nonlinear mapping capabilities [31]. ANNs can be categorized into two types of signal transmission modes: feedforward and feedback neural networks, each of which has a distinct framework. Feedback neural networks play a significant role in AI; however, they have only been used in a few applications due to solid waste concerns. In the application of biosorption capacity, several researchers compared the models of feedforward neural networks, such as multilayer perceptron ANNs, and feedback neural networks, and found that feedforward neural networks had lower prediction errors than feedback neural networks.

In a multilayer feedforward neural network, neurons in one layer communicate with those in the next layer through various weighted linkages. There are three kinds of neuron layers: input, hidden, and output. The neurons in the input layer receive external data, such as from sensory receivers; the neurons in the hidden layer imitate a biological neural network to transmit that data; and the neurons in the output layer offer a judgment output. Although several hidden layers are feasible, typically only one hidden layer is employed, especially with small sample sizes. Neurons only link between layers, not inside them. In a feedforward neural network, signals can only go in one direction, from input to output. ANNs have been extensively employed in numerous activities, including environmental difficulties and even solid waste-related issues, due to these simplifications.

Complex systems and correlations in labeled data are recognized using these models. Deep neural networks (DNNs) are more complicated neural networks with hidden layers that conduct much more complex functions than basic sigmoid or ReLU activations. The architecture of a deep learning model is shown in Figure 6.

Figure 6. A deep neural network used for phishing detection.

3. Related Work

Users, overall, tend to overlook a website's URL. This makes them more likely to fall prey to a phishing domain, which might otherwise be avoided by determining whether a URL is authentic. Unfortunately, traditional methods for detecting phishing attacks have limited accuracy and can only detect roughly 20% of attempts. ML techniques for phishing detection produce better results, but they are time-consuming, even on small databases, and they are not scalable. Furthermore, heuristics-based phishing detection has a significant false-positive rate. Previous research on anti-phishing models has concentrated on strategies to modify efficiency. Even so, feature reduction and the use of an ensemble model can improve these models' accuracy even further.

For phishing domain detection, machine learning algorithms are prevalent, and using them has become a straightforward categorization problem. The data at hand must have properties relevant to phishing and legitimate website classes to build an ML-based detection model. Previous works have shown that when robust machine-learning approaches are utilized, detection accuracy is high. To reduce features, a variety of feature selection strategies are applied.

To train a machine learning model to predict phishing attacks versus legitimate traffic, a batch of data is given as the input. Dataset visualization becomes more efficient and intelligible when characteristics are reduced. The DT, C4.5, k-NN, and SVM algorithms are the most important classifiers; they have been utilized in numerous research projects, and they have detected phishing attacks with the greatest accuracy and efficiency. According to the empirical experiment's findings, manual parameter adjustment, protracted training periods, and poor detection accuracy are prevalent problems with modern deep learning systems.

Despite these benefits, researchers have noted the limits of their studies. Many pointed out that ensemble learning techniques have not been applied and that feature selection and reduction have not been performed. A range of strategies has been applied to combat phishing attacks. One paper used different classifiers, such as naive Bayes and SVM. Similarly, another study utilized random forest to differentiate phishing attacks from normal websites.

Subasi et al. reported that their proposed classifiers were extremely effective at classifying phishing websites. They reported that random forest was the most accurate classifier, at 97.26%.

Another paper concentrated on feature selection in phishing websites. Its authors sorted the characteristics into six groups using the UCI dataset, which has more than 11,000 URLs and 30 characteristics. They chose three groups and decided that these were the best solutions for detecting phishing attacks accurately.
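Filter-style feature selection, one family of the strategies mentioned in this section, can be illustrated with a small sketch. The code below ranks features by the absolute Pearson correlation between each feature column and the class label and keeps the top k. It is only an illustration under stated assumptions, not the method of any work cited here: the helper names (`pearson`, `rank_features`, `select_top_k`), the feature names, and the toy values are all hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_features(rows, labels, names):
    """Return (name, |correlation with label|) pairs, strongest first."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((name, abs(pearson(column, labels))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def select_top_k(rows, labels, names, k):
    """Keep the names of the k highest-scoring features."""
    return [name for name, _ in rank_features(rows, labels, names)[:k]]


# Toy data, invented for illustration: three ternary-style features per URL.
names = ["url_length", "shortening_service", "redirect"]
rows = [
    [1, 1, 0],
    [1, -1, 0],
    [-1, 1, 0],
    [-1, -1, 0],
    [1, 1, 0],
    [-1, -1, 0],
]
labels = [1, 1, -1, -1, 1, -1]  # hypothetical class labels, one per row

selected = select_top_k(rows, labels, names, 2)
```

In this toy set the first feature tracks the label exactly, the second only weakly, and the third is constant, so a filter of this kind would keep the first two. Real studies typically use richer criteria (information gain, Relief-style scores, wrapper methods); the correlation score here is only the simplest stand-in.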
Patil et al. suggested three strategies for detecting phishing websites. The first entailed assessing various URL attributes; the second determined the validity of the website by determining where it was hosted and who managed it; and the third method determined the authenticity of the website through visual, appearance-based analysis. They used ML methodologies and algorithms to assess the numerous aspects of the URLs and websites.

Joshi et al. used a binary classifier based on an RF algorithm and a feature selection algorithm based on the Relief algorithm. They utilized data from the Mendeley domain as the source for their feature selection algorithm. They then used the selected features to train an RF algorithm to predict phishing attacks.

The work of Ubing et al. employed three ensemble learning strategies: bagging, boosting, and stacking. Their dataset had 30 characteristics and 5126 records in the result column. The data comes from UCI, which is open to the public. They integrated their classifiers to achieve the highest level of accuracy possible from a DT.

Another study suggested a new method based on both URLs as inputs and HTML-related data. After the features were extracted, a stacking strategy merged the learners. The researchers then ran tests on a variety of datasets, including 2000 webpages taken from Phishtank (1000 legitimate and 1000 phishing sites). The second dataset came from Alexa and contained nearly 50,000 websites. To improve their accuracy, they used SVMs, NNs, DT, and RF, which they combined through stacking. This study obtained a high level of accuracy using a variety of classifiers.

Other authors looked at how stacking techniques could be used to identify phishing websites. The goal of these tests was to enhance precision metrics using PCA and stacking the most efficient classifiers. Other classifiers using proposed features N1 and N2 outperformed stacking (RF, NN, bagging). The tests were carried out using datasets from phishing websites. With 11,055 web pages, the dataset had 32 preprocessed characteristics.

Another strategy is the extra-tree base classifier, which its authors used to classify several meta models: AdaBoost, bagging, rotation forest, and LogitBoost-Extra Tree. The suggested models outperformed current ML-based phishing attack detection models, and, as a result, the authors recommended using meta-algorithms to create phishing attack detection models.

To improve the detection of phishing websites, one group of authors suggested a phishing detection model based on a particle swarm optimization (PSO) algorithm. Their proposed method used PSO to weigh distinct websites, resulting in increased accuracy for classifying abnormal phishing websites. PSO weighting distinguishes different aspects of a website, considering how important they are in detecting phishing from legitimate websites. According to the findings, their proposed PSO-based component weighting improved the ML model's ability to recognize and monitor both phishing and legitimate websites individually.

Another group employed an evolutionary neuro-fuzzy intelligence system-based resilient approach with integrated features to identify and guard against phishing attacks.

A further work introduced the PhishBench benchmarking structure, which permits researchers to evaluate the characteristics of phishing attacks and fully comprehend different evaluation circumstances, unified framework specifications, data, machine learning algorithms, and evaluation metrics. When the ratio of phishing to authentic traffic fell from 1:1 to 1:10, classification performance was reduced. In terms of the F1 score, the drop in performance ranged from 5.9% to 42%.

An intelligent phishing website identification method was proposed by Subasi and Kremic. They used proprietary machine learning approaches to differentiate phishing websites. Several classifier approaches were applied to create a reliable and intelligent phishing detection system. The performances of their ML approaches were evaluated using ROC area, F-measure, and AUC. With a 97.61% accuracy, AdaBoost with SVM outperformed all other classification approaches.

Alternatively, Mao et al. developed a learning-based technique for determining page design comparability, which might be utilized to identify phishing attack pages. They built a phishing classifier using two ML algorithms, a support vector machine and a decision tree, for effective page layout aspects. They used genuine website page testing from phishtank.com and alexa.com to validate their methodology.

Tyagi et al. employed a dataset from the University of California at Irvine's machine learning repository, which had 2456 unique URLs and more than 11,000 URLs, with 6157 phishing and 4898 normal URLs. They took 30 characteristics from the URLs and utilized them to forecast attacks. They employed DT, RF, gradient boosting, generalized linear, and PCA as machine learning techniques.

Chen and Chen employed the SMOTE approach to increase their model's detection coverage. They trained machine learning models such as bagging, RF, and XGBoost. The XGBoost approach, which they proposed, yielded the maximum accuracy. They utilized the Phishtank database, which contained over 24,000 phishing and 4000 legitimate websites.
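The SMOTE approach mentioned above balances a dataset by creating synthetic minority-class samples instead of duplicating existing ones. A minimal sketch of the core idea follows: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the implementation used by Chen and Chen; the function name `smote_like`, its parameters, and the toy points are assumptions, and the sketch treats features as continuous values.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating between minority neighbours.

    minority: list of feature vectors (lists of floats), at least two of them.
    k: how many nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class, invented for illustration: four points in the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5, k=2, seed=1)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region the minority class already occupies, which is the property that distinguishes this kind of oversampling from adding random noise.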
Alternatively, Abdelhamid et al. developed a model based on content and feature comparison to detect attacks. They used a PhishTank dataset with approximately 11,000 samples. They utilized a technique called enhanced dynamic rule induction, which they said was the first machine learning and deep learning algorithm to be used as an anti-phishing tool. With two major threshold frequencies and rule strength, this algorithm passed datasets. Only "strong" characteristics were stored in the training dataset, and these features became part of the rule, while others were eliminated.

A study by Jain and Gupta tested two databases. Their model was more accurate on Phishtank, which has over 1500 phishing URLs, followed by Openphish, which has over 600 phishing URLs and 1600 real URLs, as well as 66 valid URLs and 252 legal URLs. They enhanced phishing detection accuracy using machine learning methods such as RF, SVM, NN, logistic regression (LR), and NB. On the client side, they employed a successful feature extraction approach.

Lakshmi et al. suggested a novel method for detecting phishing websites by looking for hyperlinks in the source code of the corresponding website's HTML page. The suggested method employed a feature vector with 30 parameters to detect malicious online pages. These characteristics were used to train a supervised DNN model with an Adam optimizer to distinguish between fraudulent and legitimate websites. To do so, the model employed a listwise process. When compared to other traditional ML algorithms such as SVM, AdaBoost, and AdaRank, the proposed model outperformed the others, with a 96% accuracy rate.

Table 1 presents the summary of ML approaches for phishing website detection. The table shows that some studies provide highly efficient results using ML for phishing attack detection.

Table 1. Comparison table of the latest research focusing on machine learning phishing detection techniques.

4. Methodology

Utilizing the UCI dataset, four phishing detection models were developed using ANN, SVM, DT, and RF algorithms. The MinMax normalization feature was employed as a preprocessing strategy to improve the models' accuracy. The proposed models were able to detect different types of attacks from the UCI dataset. The following subsections discuss the dataset used and implemented algorithms; Section 4.1 and Section 4.2, respectively.

4.1. Dataset Used: UCI Phishing Websites

Standard datasets already exist for the development of phishing website detection algorithms. Other studies classified websites to establish a list of legitimate and phishing sites for further consideration. This work, on the other hand, utilizes the freely accessible phishing dataset from the UCI machine learning repository. This dataset was created to build machine learning-based phishing website detection algorithms. It comprises extensive properties that span four distinct categories. Its creators designed and extracted characteristics from the following categories: Address Bar, HTML and JavaScript, Abnormal, and Domain. This study was performed using a phishing domain dataset with 31 attributes that can each take either a binary or a ternary value. This dataset has 11,055 records, and each record includes 31 characteristics. The characteristics of the collection are identified by names, such as URL Length, Submitting to Email, Shortening Service, Abnormal URL, Having an At Symbol, and Redirect.

4.2. Implemented Algorithm

To increase accuracy, this paper utilized the MinMax normalization feature as a preprocessing step in each proposed model. Normalization is a useful strategy for improving the accuracy of machine learning models, and it is required for some models to work properly. The MinMax normalization technique in the suggested model compresses the data to a domain of [0, 1], which improves the model training input quality (see Equations (1) and (2)).

X_std = (X − X_min) / (X_max − X_min)    (1)

X_scaled = X_std × (max − min) + min    (2)

Here X_min and X_max are the smallest and largest observed values of a feature, and min and max are the bounds of the target range (0 and 1 in this paper).

To enhance the model performance and complexities, we used a data normalization strategy, as shown in Table 2. The algorithm selects significant aspects from the initial dataset by determining the prediction outcome, which is performed by filtering it through 30 features. The UCI dataset is split 80/20 into training and testing sets, respectively, by using 5-fold cross-validation, which presented the best performance in the latest research. The prediction model is then taught using machine learning, which employs various learning models. This is particularly useful for making predictions, as utilizing many models ensures that the results are not biased toward a single model. To account for this, we present the results of all the models combined and totaled to establish their maximum accuracies. If most of the models
indicate that a domain is phishing, then the domain is classified as a phishing attempt.

Table 2. The performance results before and after using the normalization technique.

5. Model’s Flowchart

Phishing is a concern to many individuals. However, existing methods, such as browser security indicators, cannot detect phishing websites. Due to the limits of current technology, users must evaluate on their own whether a URL is phishing. As a result, an automated technique for phishing website identification should be explored for increased cyber safety. This study shows how a feature extraction approach and a prediction model based on a random forest classifier help increase the likelihood that a user will correctly identify a phishing website.

Each of the developed models, as shown in Figure 7, employs a feature selection technique to increase its accuracy. The data analysis heat map filters the original dataset and keeps the features that most affect the predicted result. As a result, irrelevant features have no effect on the model’s efficiency or prediction.

Figure 7. Model’s flowchart.

A summary of the model’s flowchart steps follows:
1. Read the UCI phishing websites dataset of URLs.
2. Check the data features.
3. Check the proposed data types.
4. Clean missing values from the data.
5. Split the data into training and testing sets.
6. Train the model using four machine-learning techniques: RF, SVM, DT, and ANN.
7. Evaluate the model’s performance and calculate the accuracy results.
8. Select the best model as the final model.

6. Findings and Analysis

To identify the most accurate machine learning model for detecting phishing domains, this paper employed an experimental approach using four ML techniques: SVM, ANN, RF, and DT. The UCI dataset, with a total of 11,055 data instances, was utilized for experimentation. Thirty features were used for evaluating the dataset, and the 31st feature was used as the output. Table 3 displays the outcomes of the simulation with the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR). A five-fold cross-validation method was employed for the classification procedure, and a 10-fold cross-validation run was also used to look for higher accuracy. Cross-validation is a model evaluation technique that checks how well a machine-learning algorithm predicts new data on which it has not been trained. The classification results are assessed on the basis of the confusion matrix.

Table 3. Evaluation results and parameters used of the proposed classifiers.

The results are shown in Table 4. The RF model provided the highest detection accuracy rate at 97%, followed by DT at 96%, ANNs at 95%, and SVM at 94%. Figure 8 depicts these results. Finally, Table 5 compares the RF model to the state-of-the-art results in the literature.
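As a minimal sketch of the evaluation arithmetic described in this report, the Python snippet below shows how a 5-fold split of the 11,055 instances yields the 80/20 train/test ratio, and how TPR, FPR, TNR, FNR, and accuracy follow from confusion-matrix counts. The function names, the toy labels, the unshuffled contiguous folds, and the choice of 1 as the phishing label are illustrative assumptions, not details taken from the report.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) for k contiguous folds (no shuffling; assumption)."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Dict[str, int]:
    """Count TP, FP, TN, FN; `positive` is the label treated as phishing (assumption)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        truth_pos, pred_pos = truth == positive, pred == positive
        if truth_pos and pred_pos:
            counts["tp"] += 1
        elif pred_pos:
            counts["fp"] += 1
        elif truth_pos:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive TPR, FNR, FPR, TNR, and accuracy from confusion-matrix counts."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "tpr": ratio(tp, tp + fn),
        "fnr": ratio(fn, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "tnr": ratio(tn, fp + tn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Five folds over 11,055 instances: each fold tests on 20% and trains on 80%.
folds = list(k_fold_indices(11055, 5))

# Toy labels (hypothetical): 1 = phishing, -1 = legitimate.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, 1, -1]
result = rates(confusion_counts(y_true, y_pred, positive=1))
```

In a cross-validated run, the same rates would be computed on the test indices of each fold and then averaged; which label marks phishing in the UCI file should be checked before setting `positive`.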
Figure 8. Proposal evaluation results.

Table 4. Evaluation results in (%).

Table 5. Examining existing phishing domain detection models.

Table 5 lists other research dealing with phishing attacks and crucial information about different machine-learning techniques. Three solutions based on ensemble learning, including the bagging, boosting, and stacking methods, were developed by Ubing et al. They combined their classifiers to attain a 95.4% accuracy rate. Lakshmi et al. proposed a new method for detecting phishing websites by scanning the source code of the related website’s HTML page for links. They achieved a 96% accuracy rate. Alsariera et al. [71] suggested three meta-learner models using ForestPA; according to their experimental data, the suggested meta-learners are efficient, with the lowest accuracy at 97.4%. The accuracy values in this comparison vary from 95% to 97%, except for Alsariera et al. [71], who obtained 97.4%, but their model takes longer to train and implement than RF and DT classifiers.

7. Conclusions and Future Works

In this work, we investigated the practicality and the efficiency of using machine learning for phishing detection. We developed four machine learning models based on artificial neural networks (ANNs), support vector machines (SVMs), decision trees (DTs), and random forest (RF) techniques. We then selected the best-performing model of the four and compared its performance with other solutions in the literature. The overall results show that the random forest (RF) model achieved the highest performance and outperforms other schemes in the literature.

Future work includes examining more machine learning techniques for detecting phishing domains.

XII. REFERENCES
1. F. Salahdine and N. Kaabouch, "Social Engineering Attacks: A Survey", Future Internet J, vol. 11, no. 89, pp. 1-17, 2019.
2. R. Mohammad, F. Thabtah and L. McCluskey, "Intelligent rule-based phishing websites classification", IET Inf. Secur., pp. 153-160, 2014.
3. F. Salahdine and N. Kaabouch, "Security threats detection and countermeasures for physical layer in cognitive radio networks: A survey", Physical Commun. J., 2020.
4. J. He and Y. Zhu, "Social engineering/phishing", Encycl. Soc. Netw. Anal. Min., pp. 1777-1783, 2014.
5. M. Moghimi and A. Varjani, "New rule-based phishing detection method", Expert Syst. Appl., vol. 53, pp. 231-242, 2016.
6. B. Gupta, N. Arachchilage and K. Psannis, "Defending against phishing attacks: Taxonomy of methods current issues and future directions", Telecommun. Syst., vol. 67, pp. 247-267, 2018.
7. J. Hong, T. Kim and S. Kim, "Phishing URL detection with lexical features and blacklisted domains", Adaptive Auton. Secur. Cyber Syst., pp. 253-267, 2020.
8. Y. Huang, Q. Yang, J. Qin and W. Wen, "Phishing URL Detection via CNN and Attention-Based Hierarchical RNN", IEEE Int. Conf. Trust Security Privacy Comput. Commun., pp. 112-119, 2019.
9. M. Moghimi and A. Y. Varjani, "New rule-based phishing detection method", Expert Systems with Applications, vol. 1, no. 53, pp. 231-242, 2016.
10. G. Ramesh, I. Krishnamurthi and K. Kumar, "An efficacious method for detecting phishing webpages through target domain identification", Decision Support Systems, vol. 61, pp. 12-22, 2014.
11. Y. Suga, "SSL/TLS servers status survey about enabling forward secrecy", Int. Conf. Network-Based Information Systems, pp. 501-505, 2014.
12. A. Albarqi, E. Alzaid, F. Ghamdi, S. Asiri and J. Kar, "Public key infrastructure: A survey", J. Inf. Secur., vol. 06, no. 01, pp. 31-37, 2015.
13. S. Krishnamurthy and A. Ve, "Information retrieval models: Trends and techniques", Web Semant. Textual Vis. Inf. Retr., pp. 17-42, 2017.
14. A. Kharraz, W. Robertson and E. Kirda, "Surveylance: Automatically detecting online survey scams", IEEE Symp. Secur. Privacy, pp. 723-739, 2018.

15. Y. Reddy and N. Varma, "Review on supervised learning techniques", Emerg. Res. Data Eng. Syst. Comput. Commun. J., pp. 577-587, 2020.
16. C. Bircano and N. Arıca, "A comparison of activation functions in artificial neural networks", Signal Proc. Commun. App. Conf, pp. 1-4, 2018.
17. Y. Arjoune, F. Salahdine, Md. Islam, E. Ghribi and N. Kaabouch, "A novel jamming attacks detection approach based on machine learning for wireless communication", Int. Conf. Inf. Netw, pp. 1-6, 2020.
18. F. Salahdine and N. Kaabouch, "Social Engineering Attacks: A Survey", Future Internet J, 11, 89, pp. 1-17, 2022.
19. R. Mohammad, F. Thabtah and L. McCluskey, "Intelligent rule-based phishing websites classification", IET Inf. Secur., pp. 153-160, 2021.
20. F. Salahdine and N. Kaabouch, "Security threats, detection, and countermeasures for physical layer in cognitive radio networks: A survey", Physical Commun. J., 2020.
