FINAL REPORT
On
By
YANAMALA YAMUNA
(Assistant Professor)
BACHELOR OF TECHNOLOGY
IN
CERTIFICATE
This is to certify that the Seminar (CS705PC) entitled “PHISHING ATTACKS DETECTION
USING ML” being submitted by YANAMALA YAMUNA bearing Roll No: 20261A0560 in
partial fulfillment of the requirements for the Award of the Degree of Bachelor of Technology
in Computer Science and Engineering is a record of bonafide work carried out by her.
Building a fair model from a dataset is one of the main goals of machine learning algorithms. Learning, or training, is the process of developing models from data, and the learned model is referred to as a hypothesis or learner. Ensemble methods are learning algorithms that create a set of classifiers and then use their predictions to put new data points into categories.
Ensembles are often far more accurate than the individual classifiers that make them up. Ensemble methods, also known as committee-based learning or multiple classifier systems, train numerous hypotheses to solve a problem. Random forests are a common form of ensemble modeling in which many decision trees are utilized to predict outcomes. Figure 4 shows a general ensemble architecture.
An ensemble is made up of numerous hypotheses or learners that are produced from training data using a base learning method. Most ensemble methods produce homogeneous base learners, or homogeneous ensembles, using a single base learning algorithm, but some approaches use multiple learning algorithms to build heterogeneous ensembles. The ability of ensemble approaches to [...]
Stacking is the process of integrating numerous classifiers created by various learning algorithms into a single dataset of feature vector pairs and their classifications. A set of base-level classifiers is constructed in the first phase, and a meta-level classifier is trained in the second phase, as shown in Figure 5.
Figure 5. Ensemble learning, bagging, and boosting machine learning techniques.
2.5. Ensemble Classification Techniques
A neural network (NN) is a mathematical model that mimics the behaviour of biological neurons and the nervous system. ANNs utilize technological solutions to imitate the architecture and functions of the neural system of human brains
[29]. They use neural network topologies to represent physical systems in this way. McCulloch and Pitts introduced the ANN theory for the first time in [30]. ANNs are appropriate for addressing the problem of mapping one dataset to another because they have strong nonlinear mapping capabilities [31]. ANNs can be categorized by signal transmission mode into two types, feedforward and feedback neural networks, each of which has a distinct framework. Feedback neural networks play a significant role in AI; however, they have only been used in a few applications due to solid waste concerns. In the application of biosorption capacity, several researchers compared feedforward neural network models, such as multilayer perceptron ANNs, with feedback neural networks and found that feedforward neural networks had lower prediction errors than feedback neural networks.
In a multilayer feedforward neural network, neurons in one layer communicate with those in the next layer through weighted links. There are three kinds of neuron layers: input, hidden, and output. The neurons in the input layer receive external data, such as from sensory receivers; the neurons in the hidden layer imitate a biological neural network to transmit that data; and the neurons in the output layer produce the decision output. Although several hidden layers are feasible, typically only one hidden layer is employed, especially with small sample sizes. Neurons only link between layers, not inside them. In a feedforward neural network, signals can only go in one direction, from input to output. Due to these simplifications, ANNs have been extensively employed in numerous activities, including environmental difficulties and even solid waste-related issues.
Complex systems and correlations in labeled data are recognized using these models. Deep neural networks (DNNs) are more complicated neural networks with hidden layers that conduct much more complex functions than basic sigmoid or ReLU activations. The architecture of a deep learning model is shown in Figure 6.
Figure 6. A deep neural network used for phishing detection.
3. Related Work
Users, overall, tend to overlook a website’s URL. This makes them more likely to fall prey to a phishing domain, which might otherwise be avoided by determining whether a URL is authentic. Unfortunately, traditional methods for detecting phishing attacks have limited accuracy and can only detect roughly 20% of attempts. ML techniques for phishing detection produce better results, but they are time-consuming, even on small databases, and they are not scalable. Furthermore, heuristics-based phishing detection has a significant false-positive rate. Previous research on anti-phishing models has concentrated on strategies to improve efficiency. Even so, feature reduction and the use of an ensemble model can improve these models’ accuracy even further.
For phishing domain detection, machine learning algorithms are prevalent, and using them has turned detection into a straightforward classification problem. To build an ML-based detection model, the data at hand must have properties relevant to the phishing and legitimate website classes. Previous works have shown that when robust machine-learning approaches are utilized, detection accuracy is high. To reduce features, a variety of feature selection strategies are applied.
To train a machine learning model to predict phishing attacks versus legitimate traffic, a batch of data is given as the input. Dataset visualization becomes more efficient and intelligible when characteristics are reduced. The DT, C4.5, k-NN, and SVM algorithms are the most important classifiers; they have been utilized in numerous research projects, and they have detected phishing attacks with the greatest accuracy and efficiency. According to the empirical experiment’s findings, manual parameter adjustment, protracted training periods, and poor detection accuracy are prevalent problems with modern deep learning systems.
Despite these benefits, researchers have noted the limits of their studies. Many pointed out that ensemble learning techniques have not been applied and that feature selection and reduction have not been performed. A range of strategies has been applied to combat phishing attacks. One paper used different classifiers, such as naive Bayes and SVM. Similarly, another study utilized random forest to differentiate phishing attacks from normal websites.
Subasi et al. reported that their proposed classifiers were extremely effective at classifying phishing websites. They reported that random forest was the most accurate classifier, at 97.26%.
Another study concentrated on feature selection in phishing websites. Its authors sorted the characteristics into six groups using the UCI dataset, which has more than 11,000 URLs and 30 characteristics. They chose three groups and decided that these were the best solutions for detecting phishing attacks accurately.
Patil et al. suggested three strategies for detecting phishing websites. The first entailed assessing various URL attributes; the second determined the validity of the website by determining where it was hosted and who managed it; and the third determined the authenticity of the website through visual, appearance-based analysis. They used ML methodologies and algorithms to assess the numerous aspects of the URLs and websites.
Joshi et al. used a binary classifier based on an RF algorithm and a feature selection algorithm based on the relief algorithm. They utilized data from the Mendeley domain as the source for their feature selection algorithm. They then used the selected features to train an RF algorithm to predict phishing attacks.
The work of Ubing et al. employed three ensemble learning strategies: bagging, boosting, and stacking. Their dataset had 30 characteristics and 5126 records in the result column. The data comes from UCI, which is open to the public. They integrated their classifiers to achieve the highest level of accuracy possible from a DT.
Another study suggested a new method that takes both URLs and HTML-related data as inputs. After the features were extracted, a stacking strategy merged the learners. The researchers then ran tests on a variety of datasets, including 2000 webpages taken from Phishtank (1000 legitimate and 1000 phishing sites). The second dataset came from Alexa and contained nearly 50,000 websites. To improve their accuracy, they used SVMs, NNs, DT, and RF, which they combined through stacking. This study obtained a high level of accuracy using a variety of classifiers.
Other authors looked at how stacking techniques could be used to identify phishing websites. The goal of these tests was to enhance precision metrics using PCA and stacking the most efficient classifiers. Other classifiers using proposed features N1 and N2 outperformed stacking (RF, NN, bagging). The tests were carried out using datasets from phishing websites. With 11,055 web pages, the dataset had 32 preprocessed characteristics.
Another strategy is the extra-tree base classifier, which one group of authors used to build several meta models: AdaBoost, bagging, rotation forest, and LogitBoost-Extra Tree. The suggested models outperformed current ML-based phishing attack detection models, and, as a result, the authors recommended using meta-algorithms to create phishing attack detection models.
To improve the detection of phishing websites, another group suggested a phishing detection model based on a particle swarm optimization (PSO) algorithm. Their proposed method used PSO to weight different website features, resulting in increased accuracy for classifying abnormal phishing websites. PSO weighting distinguishes different aspects of a website, considering how important they are in separating phishing from legitimate websites. According to the findings, their proposed PSO-based component weighting improved the ML model’s ability to recognize and monitor both phishing and legitimate websites individually.
Another study employed an evolutionary neuro-fuzzy intelligence system-based resilient approach with integrated features to identify and guard against phishing attacks.
The PhishBench benchmarking framework was introduced to let researchers evaluate the characteristics of phishing attacks and fully comprehend different evaluation circumstances, unified framework specifications, data, machine learning algorithms, and evaluation metrics. When the ratio of phishing to legitimate traffic fell from 1:1 to 1:10, classification performance decreased. In terms of the F1 score, the drop in performance ranged from 5.9% to 42%.
An intelligent phishing website identification method was proposed by Subasi and Kremic. They used proprietary machine learning approaches to differentiate phishing websites. Several classifier approaches were applied to create a reliable and intelligent phishing detection system. The performances of their ML approaches were evaluated using ROC area, F-measure, and AUC. With a 97.61% accuracy, AdaBoost with SVM outperformed all other classification approaches.
Alternatively, Mao et al. developed a learning-based technique for determining page design comparability, which might be utilized to identify phishing attack pages. They built a phishing classifier for effective page layout aspects using two ML algorithms, a support vector machine and a decision tree. They used genuine website page samples from phishtank.com and alexa.com to validate their methodology.
Tyagi et al. employed a dataset from the University of California at Irvine’s machine learning repository, which had 2456 unique URLs and more than 11,000 URLs, with 6157 phishing and 4898 normal URLs. They took 30 characteristics from the URLs and utilized them to forecast attacks. They employed DT, RF, gradient boosting, generalized linear, and PCA as machine learning techniques.
Chen and Chen employed the SMOTE approach to increase their model’s detection coverage. They trained machine learning models such as bagging, RF, and XGBoost. The XGBoost approach, which they proposed, yielded the maximum accuracy. They utilized the Phishtank database, which contained over 24,000 phishing and 4000 legitimate websites.
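To make the SMOTE oversampling mentioned above more concrete, the following is a minimal pure-Python sketch of the core idea: synthetic minority-class samples are created by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy data are hypothetical and chosen only for illustration; this is not the implementation used by Chen and Chen.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples from a list of minority-class vectors.

    Illustrative only: each synthetic point lies on the line segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (squared Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, closest first.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class: four points on the corners of the unit square.
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_sketch(minority_samples, n_new=5, k=2, seed=1)
```

Because every synthetic point is an interpolation between two existing minority samples, the new samples stay inside the region already covered by the minority class.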
Alternatively, Abdelhamid et al. developed a model based on content and feature comparison to detect attacks. They used a PhishTank dataset with approximately 11,000 samples. They utilized a technique called enhanced dynamic rule induction, which they said was the first machine learning and deep learning algorithm to be used as an anti-phishing tool. This algorithm passed over the datasets using two main thresholds, frequency and rule strength. Only “strong” characteristics were stored in the training dataset, and these features became part of the rule, while others were eliminated.
A study by Jain and Gupta tested two databases. Their model was more accurate on Phishtank, which has over 1500 phishing URLs, followed by Openphish, which has over 600 phishing URLs and 1600 real URLs, as well as 66 valid URLs and 252 legal URLs. They enhanced phishing detection accuracy using machine learning methods such as RF, SVM, NN, logistic regression (LR), and NB. On the client side, they employed a successful feature extraction approach.
Lakshmi et al. suggested a novel method for detecting phishing websites by looking for hyperlinks in the source code of the corresponding website’s HTML page. The suggested method employed a feature vector with 30 parameters to detect malicious online pages. These characteristics were used to train a supervised DNN model with an Adam optimizer to distinguish between fraudulent and legitimate websites. To do so, the model employed a listwise process. When compared to other traditional ML algorithms such as SVM, AdaBoost, and AdaRank, the proposed model outperformed the others, with a 96% accuracy rate.
Table 1 presents the summary of ML approaches for phishing website detection. It shows that some studies provide highly efficient results using ML for phishing attack detection.
Table 1. Comparison table of the latest research focusing on machine learning phishing detection techniques.
4. Methodology
Utilizing the UCI dataset, four phishing detection models were developed using ANN, SVM, DT, and RF algorithms. MinMax normalization was employed as a preprocessing strategy to improve the models’ accuracy. The proposed models were able to detect different types of attacks from the UCI dataset. The following subsections discuss the dataset used and the implemented algorithms in Section 4.1 and Section 4.2, respectively.
4.1. Dataset Used: UCI Phishing Websites
Standard datasets already exist for the development of phishing website detection algorithms. Other studies classified websites to establish a list of legitimate and phishing sites for further consideration. This work, on the other hand, utilizes the freely accessible phishing dataset from the UCI machine learning repository. This dataset was created to build machine learning-based phishing website detection algorithms. It comprises extensive properties that span four distinct categories. Its creators designed and extracted characteristics from the following categories: Address Bar, HTML and JavaScript, Abnormal, and Domain. This study was performed using a phishing domain dataset with 31 attributes, each of which takes either a binary or a ternary value. This dataset has 11,055 records, and each record includes 31 characteristics. The characteristics of the collection are identified by names, such as URL Length, Submitting to Email, Shortening Service, Abnormal URL, Having an At Symbol, and Redirect.
4.2. Implemented Algorithm
To increase accuracy, this paper utilized MinMax normalization as a preprocessing step in each proposed model. Normalization is a useful strategy for improving the accuracy of machine learning models, and it is required for some models to work properly. The MinMax normalization technique in the suggested model compresses the data to the range [0, 1], which improves the quality of the model training input (see Equations (1) and (2)).
$$X_{\mathrm{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$
$$X_{\mathrm{scaled}} = X_{\mathrm{std}} \times (\max - \min) + \min \qquad (2)$$
Here $X_{\min}$ and $X_{\max}$ are the minimum and maximum of the feature, and $\min$ and $\max$ are the bounds of the target range (0 and 1 in this work).
To improve the models’ performance and manage their complexity, we used a data normalization strategy, as shown in Table 2. The algorithm selects significant aspects from the initial dataset by determining the prediction outcome, which is performed by filtering it through 30 features. The UCI dataset is split 80/20 into training and testing sets, respectively, by using 5-fold cross-validation, which presented the best performance in the latest research. The prediction model is then trained using machine learning, which employs various learning models. This is particularly useful for making predictions, as utilizing many models ensures that the results are not biased toward a single model. To account for this, we present the results of all the models combined and totaled to establish their maximum accuracies. If most of the models
indicate that a domain is phishing, then the combined prediction is that the domain is a phishing attempt.
Table 2. The performance results before and after using the normalization technique.
5. Model’s Flowchart
Phishing is a concern to many individuals. However, existing methods, such as browser security indicators, cannot detect phishing websites. Due to the limits of current technology, users must evaluate on their own whether a URL is phishing or not. As a result, an automated technique for phishing website identification should be explored for increased cyber safety. This study shows how an implemented feature extraction approach and a prediction model based on a random forest classifier help increase the likelihood that a user will correctly identify a phishing website.
Each of the developed models, as shown in Figure 7, employs a feature selection technique to increase its accuracy. A heat map from the data analysis is used to pick the features that most affect the predicted result, filtering the most relevant features out of the original dataset. As a result, irrelevant features have no effect on the model’s efficiency or prediction.
2. Check the data features.
3. Check the proposed data types.
4. Clean missing values from the data.
5. Split the data into training and testing sets.
6. Train the model using four machine-learning techniques: RF, SVM, DT, and ANN.
7. Evaluate the model’s performance to estimate the accuracy and calculate the accuracy results.
8. Select the best model as the final model.
6. Findings and Analysis
To identify the most accurate machine learning model for detecting phishing domains, this paper employed an experimental approach using four ML techniques: SVM, ANN, RF, and DT. The UCI dataset, with a total of 11,055 data instances, was utilized for experimentation. Thirty features were used for evaluating the dataset, and the 31st feature was used as the output. Table 3 displays the outcomes of the simulation with the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR). Moreover, a five-fold cross-validation method was employed for the classification procedure. A 10-fold cross-validation approach was also used to check whether it yielded higher accuracy. Cross-validation is a model evaluation technique used to check a machine-learning algorithm’s performance in generating predictions on new data on which it has not been trained. The classification results are assessed on the basis of the confusion matrix.
Table 3. Evaluation results and parameters used of the proposed classifiers.
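As a rough illustration of the preprocessing, voting, and evaluation steps described in Sections 4.2 and 6, the following pure-Python sketch applies the MinMax scaling of Equations (1) and (2), combines model outputs by majority vote, and derives TPR, FPR, TNR, and FNR from a set of predictions. The function names, the 0/1 label encoding (1 = phishing), and the tie-breaking rule are assumptions made for this example; they are not taken from the report's implementation.

```python
def min_max_scale(column, lo=0.0, hi=1.0):
    """Rescale one feature column to [lo, hi] following Equations (1) and (2)."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:
        # Constant column: avoid division by zero (assumed convention).
        return [lo for _ in column]
    return [(x - c_min) / (c_max - c_min) * (hi - lo) + lo for x in column]


def majority_vote(predictions):
    """Return 1 (phishing) if more than half of the models predict 1.

    Ties are resolved as legitimate (0); this rule is an assumption.
    """
    return 1 if sum(predictions) > len(predictions) / 2 else 0


def confusion_rates(y_true, y_pred):
    """Compute TPR, FPR, TNR, FNR and accuracy with phishing (1) as positive.

    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
        "FNR": fn / (tp + fn),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy example: a ternary feature, four model votes, and four test labels.
scaled = min_max_scale([-1, 0, 1])
vote = majority_vote([1, 1, 0, 1])
rates = confusion_rates([1, 1, 0, 0], [1, 0, 0, 0])
```

For a ternary UCI feature taking the values -1, 0, and 1, `min_max_scale([-1, 0, 1])` gives `[0.0, 0.5, 1.0]`, so all features end up on the same [0, 1] scale before training.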
Figure 8. Proposal evaluation results.
Table 4. Evaluation results in (%).
Table 5. Examining existing phishing domain detection model.
Table 5 lists other research dealing with phishing attacks and crucial information about different machine-learning techniques. Three solutions based on ensemble learning, including the bagging, boosting, and stacking methods, were developed by Ubing et al. They combined their classifiers to attain a 95.4% accuracy rate in their results. Lakshmi et al. proposed a new method for detecting phishing websites by scanning the source code of the related website’s HTML page for hyperlinks. They achieved a 96% accuracy rate. Another group of researchers suggested three meta-learner models using ForestPA; the suggested meta-learners are efficient, according to their experimental data, with the lowest accuracy at 97.4%. The accuracy values in this paper vary from 95% to 97%, except for Alsariera et al. [71], who got 97.4%, but this model takes longer to train and implement than RF and DT classifiers.
7. Conclusions and Future Works
In this work, we investigated the practicality and the efficiency of using machine learning for phishing detection. We developed four machine learning models based on artificial neural networks (ANNs), support vector machines (SVMs), decision trees (DTs), and random forest (RF) techniques. We then selected the best-performing of the four models and compared its performance with other solutions in the literature. The overall results show random forest (RF) model [...]
XII. REFERENCES
1. F. Salahdine and N. Kaabouch, "Social Engineering Attacks: A Survey", Future Internet J, vol. 11, no. 89, pp. 1-17, 2019.
2. R. Mohammad, F. Thabtah and L. McCluskey, "Intelligent rule-based phishing websites classification", IET Inf. Secur., pp. 153-160, 2014.
3. F. Salahdine and N. Kaabouch, "Security threats detection and countermeasures for physical layer in cognitive radio networks: A survey", Physical Commun. J., 2020.
4. J. He and Y. Zhu, "Social engineering/phishing", Encycl. Soc. Netw. Anal. Min., pp. 1777-1783, 2014.
5. M. Moghimi and A. Varjani, "New rule-based phishing detection method", Expert Syst. Appl., vol. 53, pp. 231-242, 2016.
6. B. Gupta, N. Arachchilage and K. Psannis, "Defending against phishing attacks: Taxonomy of methods, current issues and future directions", Telecommun. Syst., vol. 67, pp. 247-267, 2018.
7. J. Hong, T. Kim and S. Kim, "Phishing URL detection with lexical features and blacklisted domains", Adaptive Auton. Secur. Cyber Syst., pp. 253-267, 2020.
8. Y. Huang, Q. Yang, J. Qin and W. Wen, "Phishing URL Detection via CNN and Attention-Based Hierarchical RNN", IEEE Int. Conf. Trust Security Privacy Comput. Commun., pp. 112-119, 2019.
9. M. Moghimi and A. Y. Varjani, "New rule-based phishing detection method", Expert Systems with Applications, vol. 1, no. 53, pp. 231-42, 2016.
10. G. Ramesh, I. Krishnamurthi and K. Kumar, "An efficacious method for detecting phishing webpages through target domain identification", Decision Support Systems, vol. 61, pp. 12-22, 2014.
11. Y. Suga, "SSL/TLS servers status survey about enabling forward secrecy", Int. Conf. Network-Based Information Systems, pp. 501-505, 2014.
12. A. Albarqi, E. Alzaid, F. Ghamdi, S. Asiri and J. Kar, "Public key infrastructure: A survey", J. Inf. Secur., vol. 06, no. 01, pp. 31-37, 2015.
13. S. Krishnamurthy and A. Ve, "Information retrieval models: Trends and techniques", Web Semant. Textual Vis. Inf. Retr., pp. 17-42, 2017.
14. A. Kharraz, W. Robertson and E. Kirda, "Surveylance: Automatically detecting online survey scams", IEEE Symp. Secur. Privacy, pp. 723-739, 2018.