
Appl. Math. Inf. Sci. 18, No. 6, 1481-1493 (2024)

Applied Mathematics & Information Sciences
An International Journal

http://dx.doi.org/10.18576/amis/180624

Enhanced Phishing Detection: An Ensemble Stacking Model with DT-RFECV and SMOTE
Mangayarkarasi Ramaiah1, Vanmathi Chandrasekaran1, Vikash Chand1, Asokan Vasudevan2,3, Suleiman Ibrahim
Mohamma4,5,∗, Eddie Eu Hui Soon2, Qusai Shambour6, and Muhammad Turki Alshurideh7
1 School of Computer Science Engineering and Information Systems, Vellore Institute of Technology, 632014 Vellore, India
2 Faculty of Business and Communications, INTI International University, Persiaran Perdana BBN Putra Nilai, 71800 Nilai, Negeri
Sembilan, Malaysia
3 Wekerle Business School, Budapest, Jázmin u. 10, 1083 Hungary
4 Electronic Marketing and Social Media, Economic and Administrative Sciences Zarqa University, 13110 Zarqa, Jordan
5 INTI International University, 71800 Negeri Sembilan, Malaysia
6 Software Engineering Department, Hourani Center for Applied Scientific Research, Al-Ahliyya Amman University, 19111 Amman,
Jordan
7 Department of Marketing, School of Business, The University of Jordan, Amman 11942, Jordan

Received: 15 Aug. 2024, Revised: 5 Oct. 2024, Accepted: 10 Oct. 2024

Published online: 1 Nov. 2024

Abstract: Phishing websites are a significant threat, constantly evolving to deceive users into revealing sensitive information. While
current anti-phishing systems rely on URLs, website content, and third-party data, they often struggle to keep pace with these dynamic
scams. This study addresses these challenges by introducing a novel approach that analyzes the effectiveness of URL-based features,
JavaScript characteristics, and anomaly-based indicators in detecting malicious web links. To overcome the issues of data imbalance
and feature selection, our approach incorporates SMOTE oversampling and a Decision Tree-Recursive Feature Elimination cross-
validation (DT-RFECV) wrapper method. The selected features are then used to train an ensemble stacking model that combines
Decision Trees, Random Forests, and Bagging. The framework was rigorously evaluated on two benchmarking datasets and achieved
impressive accuracy rates of 97.7% on Dataset-1 and 97.5% on Dataset-2 using ten features, underscoring the effectiveness of our
approach. Our proposed framework significantly contributes to the internet community’s defense against phishing scams with its unique
features, ensemble model construction, and promising results.

Keywords: URLs, DT-RFECV, Machine learning, ensemble stacking model, phishing scam, financial inclusion

∗ Corresponding author e-mail: dr [email protected]
© 2024 NSP
Natural Sciences Publishing Cor.

1 Introduction

The Internet has become a hostile environment where attacks can be launched rapidly and are difficult to prevent, detect, and trace. Safeguarding the core security principles (privacy, integrity, and accessibility) is challenging. Mutual trust, once a cornerstone of the Internet's decentralized nature, has eroded due to the prevalence of malicious activities. However, the importance of Internet security is undeniable, especially for the growth of e-commerce. Phishing is a prominent network attack that capitalizes on human trust by mimicking legitimate websites. These fraudulent sites often replicate the appearance of reputable companies like eBay, Facebook, Amazon, and Microsoft to deceive users. By combining social engineering and technical expertise, attackers extract sensitive personal information. Phishing campaigns are typically initiated through deceptive emails, SMS messages, or social media posts, enticing victims to click on malicious links. Despite having existed for over three decades, phishing remains a widespread threat, resulting in significant annual financial losses. Phishing attacks are the leading cause of online security breaches, and their frequency is increasing. According to Astra Security, phishing emails constitute approximately 1.2% of all emails sent, equating to 3.4 billion daily. The similarity between phishing and legitimate websites makes detection difficult for users, who often overlook URL details. Consequently, many phishing crimes go unreported. Anti-phishing tools primarily fall into four
categories: whitelist/blacklist, deep learning, machine learning, and heuristics [1,2,3,4,5].

Although blacklisting and whitelisting are the most widely used anti-phishing techniques, their ability to withstand zero-day attacks is uncertain because they rely on a centralized database to verify the legitimacy of the website [6,7,8,9]. Heuristic-based anti-phishing solutions rely on a third party to assess the website's validity. Although the web page's content, page ranking, and other aspects are included in the heuristic-based approach, the reliability of the data, which is taken from a third party, is controversial [10,11,12]. To combat new phishing attacks and alleviate these technical hindrances, machine learning-based anti-phishing solutions have been presented in various venues [13,14,15]. The main advantage of using AI techniques for this task is their ability to learn hidden patterns and so detect previously unseen fake information in phishing web links. Over the past few years, ransomware has become the most prevalent type of cybercrime, with phishing being the most widely employed distribution method. Even though an ML-based anti-phishing solution might lessen the impact of a zero-day attack, it requires well-designed, up-to-date features from both legitimate and malicious URLs. An XGB-based anti-phishing solution has been built upon URL character-order features, hyperlink-specific features, and TF-IDF plaintext and noisy-character features of HTML. Innovative rule-based techniques have also been presented in [16,17,18] for detecting phishing scams in online banking. The candidate SVM-based phish detector has been built upon different features to detect fake information. The prospective phish detector, however, works independently of other sources such as search engines, browser histories, and blacklists. Additionally, the features are language-dependent because they were taken from the webpage's content. Effective features maximize the detection rate of phishing crimes. Filter-based feature selection methods [19,20,21] demonstrate encouraging outcomes [22,23,24].

Filter-based metrics [21,25] use statistical tools that require less computing power. However, there is some uncertainty over their ability to identify suitable features dynamically. In contrast to filter tools, wrapper-based methods take advantage of the machine learning model's capacity to identify compelling features. Attempts to extract highly influential features [21,26,27,28] introduce a novel feature selection method based on the wrapper method and yield superior results. When the number of features is huge, it takes longer to determine them, but in the end this improves the classifier's performance. The phishing scam detection work presented in [29,30,31] used diverse feature categories: URL, domain, HTML and JavaScript, and abnormal features. Sixteen machine learning models were trained using two datasets. One dataset had balanced classes (equal numbers of benign and malignant samples), while the other was imbalanced. The top ten significant features were extracted to train various machine learning models. In terms of accuracy, the XGBoost and Random Forest (RF) models performed better than the others. The challenges in developing ML models are obtaining sufficient samples for each output class label and choosing a suitable feature selection methodology. The work presented in this paper demonstrates the ML model's efficiency for phishing website detection using suitable data sampling techniques and efficient feature selection methods. To reduce response time, the language-independent phishing webpage detection mechanism operates without relying on external data. The presented approach utilizes features from multiple sources, including URL, address domain, JavaScript, URL file, and directory attributes. To optimize efficiency, a minimal set of features from diverse categories is used to train various machine learning models. Feature significance was determined using DT-RFECV (Decision Tree-Recursive Feature Elimination with cross-validation).

Decision Tree-Recursive Feature Elimination offers a valuable tool for feature selection, providing benefits such as feature ranking, improved model performance, computational efficiency, versatility, and enhanced interpretability. The selected features were then used to develop an ensemble stacking model. The key contributions of this research are summarized below:

–Predominant feature sets are computed using DT-RFECV.
–The significance of these derived features is analyzed through various ML models.
–An ensemble stacking model is designed to mitigate cyber-threats posed by phishing scams.
–The model's resilience is evaluated using two datasets.
–Results are compared to a recently released phishing detection framework to assess its competence.

The rest of the manuscript is organized in the following way. Section 2 examines earlier research and methods for identifying phishing scams. Section 3 depicts the information about the dataset used and features selection, Section 4 elaborates on the methodology applied in this research. The results are reported in Section 5, and the study's conclusions are presented in Section 6.

2 Related Works

This section analyses various aspects of the ML-based phishing detection frameworks demonstrated in multiple venues. An ML-based anti-phishing solution was presented in [32,33,34] to mitigate phishing scams. Experimentation was conducted using samples collected from various sources. Nineteen significant features were selected using Pearson correlation analysis. Various ML models were trained on features extracted from URLs, login forms, hyperlinks, CSS, and web identity. The detailed results
demonstrate the effectiveness of different feature types for the problem. Response time is low since the model is entirely built on client-side features. Similarly, Al-Shanableh et al. [35] and Jain & Gupta [36] utilized client-side features, mainly hyperlink information. ML models were constructed based on twelve features extracted from hyperlinks. Consequently, a positive aspect of the presented solution is its applicability to websites in any human language. Conversely, its effectiveness is contingent on the website being designed using HTML. To strengthen phishing scam detection, a two-phase framework was built based on URL and source code features [37,38,39]. In the first phase, similarity-based attributes are used to generate a fingerprint, which is then compared to stored fingerprints to identify potential malicious websites. In the second phase, approximately 21 features extracted from URLs and source code are used to train an ensemble model employing Random Forest, XGBoost, and Extra Trees classifiers. Since no webpage exists independently on the internet, every webpage is connected to various resources, such as forwarding pages. In most cases, phishing attacks may neglect to conceal this information.

The framework presented in [40] uses heterogeneous information networks (HIN) to capture the semantic and syntactic relationships among the objects that constitute a web page and computes a phish score for the nodes; the node attributes [39] are used to train ML models. A deep learning-based phishing detection framework was presented in [41] based on the merits of character-level and word-level embedding of the input URL information. The study did not focus on data imbalance, which might lead to overfitting issues.

Recent research indicates that employing optimization techniques for hyperparameter tuning [42,43,44,45] and feature selection (Ramaiah et al., 2024) significantly enhances the performance of machine learning models. Similarly, in the context of phishing detection, the authors in [46] utilize Genetic Algorithms (GA) to optimize hyperparameters for various machine learning models. The study employs three datasets to demonstrate its robustness against evolving phishing attacks. Although the inclusion of GA improves results, the iterative nature of the process increases computational time. To gain deeper insight into the syntactic and semantic information in the text extracted from source code, the authors in [47] employ various word embedding algorithms to enhance detection accuracy. The resulting word embeddings are used to generate feature vectors, which are subsequently used to design ensemble and multimodal phishing detectors. The phishing framework presented in [48] proposes eight features extracted from the URL, one from the plaintext character level, and six from hyperlinks. An additional seven features from the literature were incorporated, resulting in 15 features used by the authors to compile the customized dataset. To mitigate performance degradation over time, a vanilla neural network (VNN) model was trained using the continual learning (CL) algorithms Learning Without Forgetting (LWF) and Elastic Weight Consolidation (EWC). These CL algorithms enable the VNN to acquire new information while preserving previously learned knowledge.

To enhance the robustness of phishing website detection, the authors in [49] considered a dataset comprising 112 attributes. The study explores various scenarios to assess the resilience of the phishing detection framework. Data imbalance was addressed using SMOTEENN, and 13 constant features were identified and removed to improve response time. Subsequently, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) were employed to reduce feature dimensionality. Based on the results, ML-based phishing detection models were unaffected by PCA and LDA, but removing constant features significantly improved detection accuracy. Similarly, the authors in [50] also employed PCA for dimensionality reduction on a balanced dataset. SVM and DNN models were trained and evaluated. It would be beneficial to include details about the PCA parameters, such as the number of principal components retained, and to consider exploring additional ML models for a more comprehensive comparison. To address the technical challenges of small datasets, a large dataset containing more samples for both phishing and legitimate categories was presented in [51]. An Optimal Feature Vectorization Algorithm (OFVA) was introduced to extract 41 features, including 10 novel ones, for effectively detecting phishing scams. Content-related features were excluded to reduce response time. The authors in [52] developed an SVM-based phishing detection model utilizing URL-based features. A chi-square metric was employed to select nine significant features from an initial set of sixteen. The SVM model with a polynomial kernel function performed better than its radial basis function counterpart.

The literature offers numerous robust solutions for detecting phishing websites using ML models (Table 1), but there is a need to cultivate the comprehensive nature of such models. This involves enabling ML models to understand phishing websites deeply and to develop resilience against emerging threats. One of the challenges associated with the candidate problem is insufficient labeled samples. Very few publicly available datasets [50] maintain equal samples for benign and malign labels. In the cited literature, the work presented in [40,48], and [49] used datasets where the number of benign samples is higher than the number of malign samples. Conversely, Ejaz et al. [53], Bahaghighat et al. [49], Tamal et al. [51], and Shombot et al. [52] utilize datasets with more malign samples than benign samples.

When training a machine learning model to detect phishing websites, an imbalanced dataset with significantly more benign than malicious samples can pose challenges. The model might become overly focused on recognizing benign patterns, leading to false negatives where phishing websites are incorrectly classified as safe. With fewer malicious samples, the
model might struggle to learn subtle phishing characteristics, reducing its ability to identify them accurately. Conversely, if the dataset has too many malicious samples, the model might become overly sensitive to malicious patterns, leading to false positives where benign websites are incorrectly flagged as phishing. In both cases, a high accuracy score might not accurately reflect the model's performance, as the model could predict everything as benign or malicious to achieve high accuracy. The presented phishing detection framework employs appropriate data sampling techniques to ensure the generalization of machine learning models. Contrary to the literature that utilizes statistical tools [34,52] for feature selection, the studies presented in [36,39] rely on domain expertise to identify the most significant features. In contrast, the research in [40,41] employs all features from the dataset, potentially leading to increased prediction time. Hence, to mitigate the mentioned technical hindrances, the presented ML model is built upon the features derived through the DT-RFECV algorithm.

Table 1: Cutting-edge phishing detection modules

| Approach | Description | Feature Selection | Dataset | Limitations |
|---|---|---|---|---|
| Jain AK and Gupta BB [34] | Uses the URL and source code features. | Statistical tool | 1918 benign and 2141 phishing websites. | Model can detect websites built exclusively with HTML. |
| Jain and Gupta [36] | ML models use the hyperlink features. | Uses domain expertise. | 1116 benign and 1428 phishing websites. | Model can detect websites built solely with HTML. |
| Rao and Pais [39] | Two-phase framework including a similarity-based and an ensemble model. | Uses domain expertise | 4097 phishing websites and 5438 benign websites. | Relies on source code. |
| Guo et al. [40] | HinPhish utilizes link information from webpage objects to construct a HIN. | Not used | 30,649 benign samples and 29,496 phish samples. | Webpages with more links may hinder the performance of the model. |
| Zheng et al. [41] | HDP-CNN identifies phishing URLs by leveraging both character and word-level representations. | Not used | 71,556 phishing URLs and 344,794 benign URLs. | The model's performance on large datasets is likely compromised due to severe class imbalance, potentially leading to overfitting. |
| Sarem et al. [46] | Ensemble models' hyperparameters are computed using GA. | Not used | UCI and two datasets from Mendeley | Response time is high. |
| Rao et al. [47] | Word embedding method outputs are used to train the ensemble and multimodal enabled phishing detector. | Not used | 5076 benign, 5438 phishing | Language dependent. Word embedding demands more storage. |
| Aljofey and Qingshan [48] | The ML model employs a rich feature set including URL, hyperlink, and character-level TF-IDF features extracted from HTML's plain text. | Not used | 32,972 safe sites and 27,280 phishing URLs along with HTML codes. | To unveil the language-specific plain text content of a webpage, accessing its underlying HTML source code is essential. |
| Ejaz et al. [53] | VNN with CL has been used upon the samples collected from 2018-2020. | Not used | 99,504 phishing and 81,082 benign | As new information is learned, the size of the learning parameters increases. |
| Bahaghighat et al. [49] | ML models built upon PCA and LDA were presented. | Constant features are removed using a statistical tool | 58,000 benign and 30,647 phishing | Without feature extraction, the number of features is high. |
| Elumalai [50] | SVM and DNN models trained upon the public dataset. | PCA | 5000 benign, 5000 malign samples | Exploration of other ML models could be established. |
| Tamal et al. [51] | Intra-URL features are used to train 15 ML models. | Optimal feature vectorization algorithm | 247,950 instances, of which 128,541 are from phishing URLs and 119,409 are from legitimate URLs. | Content-related features were discarded. |
| Shombot et al. [52] | SVM-based anti-phishing framework built upon the URL-based features. | Chi-Square | 548 benign, 805 phishing URLs | Dataset size is small. |

3 Proposed Methodology

This section provides a detailed description of the presented anti-phishing solution. Exploratory Data Analysis (EDA) techniques are applied in the pre-processing phase to analyze the features. Data sampling is employed to balance the class distribution. DT-RFECV is used to select the most essential features. Next, machine learning models are trained on 80% of the data. The trained models are then evaluated on the remaining 20% of the data. After assessing the performance of the individual models, an ensemble stacking model is built that combines Decision Trees (DT), Random Forests (RF), and Bagging Classifiers (BC) for improved accuracy. The architecture of the proposed framework is shown in Figure 1.

3.1 Dataset Description

The proposed framework is evaluated on two datasets.
https://data.mendeley.com/datasets/h3cgnj8hft/1: DS-1 is
a compilation of characteristics extracted from both phishing and legitimate websites. It provides 48 features and 10,000 samples covering both the phishing and legitimate labels. The graphical representation is displayed in Figure 2. Four categories of features constitute the dataset, offering better insight into web pages: sixteen address-based features, four domain-based features, twenty-one abnormal-based features, and six JavaScript-based features.

Fig. 1: The architecture of the Proposed Anti-phishing Model.

Fig. 2: Sample distribution statistics for dataset-1 (DS-1).

https://data.mendeley.com/datasets/72ptz43s9v/1: DS-2 is an extensive collection of examples of both authentic and phishing websites. The collection contains 88,647 cases: 58,000 instances of legitimate websites and 30,647 instances of phishing websites. Sample distribution statistics of dataset-2 can be found in Figure 3. 111 distinct features are provided to differentiate the anomaly entities uniquely. Nineteen URL properties, twenty-one domain properties, eighteen URL file properties, eighteen URL directory properties, and fifteen more properties make up the distinct feature categories in the dataset.

3.2 Pre-processing

The dataset DS-1 has an equal number of phishing and legitimate samples. It has been ensured that the dataset is devoid of null values, meaning that each instance's features have valid values and no data is missing. The absence of null values enables continuous analysis and modeling without the need for imputation or other handling of missing values. Each instance in the dataset is unique, so no duplicate records are present. The uniqueness of tuples ensures that each instance contributes independently to the machine learning process, preventing any duplication bias that could distort the results. A box plot analysis was conducted on the dataset to check for outliers, and it revealed none. This suggests that the data points do not contain extreme values that would skew the analysis or affect the model's functionality, and instead fall within a reasonable range.

In DS-2, all instances with null values have been eliminated. This procedure ensures that the remaining data is complete and contains no missing values or redundant records. DS-2 had an imbalanced distribution of legitimate and phishing instances, with 58,000 legitimate instances and 30,647 phishing instances, respectively. The Synthetic Minority Over-Sampling Technique (SMOTE) was applied to address this class imbalance and provide a more balanced dataset. To match the number of instances in the majority (legitimate) class, SMOTE creates synthetic instances of the minority class, which in this case is the phishing class. Consequently, the dataset was rebalanced to contain 58,000 instances each of the phishing and legitimate classes. The sample statistics before and after applying SMOTE can be found in Figures 3(a) and 3(b). This phase ensures the quality and integrity of the dataset prior to analysis.

3.3 Features Selection

Feature selection is a crucial step before training the model. Redundant or highly correlated independent features in extremely high-dimensional data usually cause problems for models. They can increase the training time needed for the machine learning model and worsen overfitting. This section describes selecting the most
important features from the data. Recursive feature elimination (RFE) is used along with a decision tree (DT) model and cross-validation (CV) to evaluate feature importance. DT-RFECV utilized 10-fold cross-validation with StratifiedKFold. This method ensures each data split maintains the original class distribution, which is crucial for classification tasks to avoid biases caused by class imbalance. During the RFE process, features are iteratively eliminated, and accuracy metrics are computed at each step to measure the impact on model performance. This aids in understanding the contribution of individual features, and the optimal feature set is chosen based on the highest overall accuracy. Subsequently, the selected feature set undergoes evaluation using 10-fold cross-validation. Here, the DT model is trained and evaluated ten times, each time with a different fold as the validation set. The average and standard deviation of accuracy across these iterations provide insights into model robustness and generalization to unseen data. Throughout the cross-validation, internal accuracy metrics from the RFE process are used to assess feature selection effectiveness in classifying attack and non-attack classes. Notably, the accuracy values determining the best model within each training sample set are based on final assessments using the test dataset, not internal metrics from RFE or cross-validation iterations. The top ten features derived using DT-RFECV upon dataset-1 and dataset-2 are furnished in Table 2 and Table 3.

Fig. 3: (a) Sample Distribution Statistics of DS-2, (b) Sample Distribution of DS-2 after Applying SMOTE.

Table 2: Top Ten features from Dataset-1 (DS-1)

Feature set: PctExtHyperlinks(AF6), PctExtNullSelfRedirectHyperlinksRT(JF6), FrequentDomainNameMismatch(AF14), InsecureForms(AF9), PctNullSelfRedirectHyperlinks(AF13), NumDash(AdF5), PctExtResourceUrls(AF7), SubmitInfoToEmail(AF18), PathLevel(AdF3), IframeOrFrame(AF19)

Table 3: Top Ten features from Dataset-2 (DS-2)

Feature set: qty_dot_directory(UDF1), time_domain_activation(OF5), directory_length(UDF18), asn_ip(OF4), time_response(OF2), length_url(UF19), qty_dot_domain(DF1), ttl_hostname(OF10), time_domain_expiration(OF6), qty_nameservers(OF6)

3.4 Machine Learning Models

Once the dominant features were selected, various machine learning models were trained upon them. To test their efficacy, all the trained ML models were evaluated on a held-out test dataset. This checks that they can generalize to unseen threats, which is essential as phishing attacks keep changing. The ML models' performance was then analysed to select the promising ones. After analyzing the models' results, an ensemble stacking model was designed using DT, RF, and Bagging.

3.4.1 Decision Tree (DT)

Decision trees are a nonparametric supervised learning technique used for both regression and classification. By learning decision rules derived from the features of the data, a decision tree classifier creates a model that predicts the value of the target variable. Decision tree algorithms are built on if-then-else decision rules; the deeper the tree, the more complex the decision rules and the more closely the model fits the data. The classifier is constructed as a tree-like structure. The method separates the dataset into smaller, more manageable subsets while incrementally growing the associated decision tree. The end product is a tree with decision nodes and leaf nodes. A decision node is any node that has two or more branches, and a leaf node conveys a classification or decision.
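The DT-RFECV step of Section 3.3, which wraps a decision tree in recursive feature elimination with stratified 10-fold cross-validation, can be sketched with scikit-learn as follows. This is a minimal illustration rather than the authors' code: the synthetic data from `make_classification`, the random seeds, the one-feature-per-step elimination, and the way a fixed top-ten list is read off the ranking are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a phishing dataset (assumption: 20 features, 5 informative).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)

# Stratified 10-fold CV keeps the class ratio of y in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# RFECV drops the least important feature at each step (step=1) and
# scores every candidate subset size by cross-validated accuracy.
selector = RFECV(estimator=DecisionTreeClassifier(random_state=0),
                 step=1, cv=cv, scoring="accuracy", min_features_to_select=1)
selector.fit(X, y)

# Indices of the features kept in the best-scoring subset.
selected = [i for i, keep in enumerate(selector.support_) if keep]
print("features kept:", selector.n_features_, selected)

# ranking_ is 1 for selected features and grows for earlier eliminations;
# sorting by rank yields an ordered list from which a fixed top-k can be cut.
top_k = sorted(range(X.shape[1]), key=lambda i: selector.ranking_[i])[:10]
print("top ten by rank:", top_k)
```

The number of features RFECV keeps depends on the data, so a fixed-size list such as the ten features in Tables 2 and 3 would need an extra cut like the `top_k` line, or a plain `RFE` with `n_features_to_select=10`.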

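The SMOTE rebalancing described in Section 3.2 creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates that idea in plain Python on toy two-dimensional points. The function name `smote_sketch`, the toy data, `k=3`, and the seed are all assumptions for illustration; a maintained implementation such as the one in the imbalanced-learn package would normally be used instead.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (itself excluded)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy imbalanced data: 8 "legitimate" points versus 4 "phishing" points.
legitimate = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4),
              (0.4, 0.2), (0.1, 0.5), (0.5, 0.1), (0.3, 0.5)]
phishing = [(0.8, 0.9), (0.9, 0.8), (0.7, 0.7), (0.9, 0.9)]

# Oversample the minority class until both classes have the same size.
new_points = smote_sketch(phishing, n_new=len(legitimate) - len(phishing))
balanced_phishing = phishing + new_points
print(len(legitimate), len(balanced_phishing))
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class. Applied to DS-2, the same procedure would add 27,353 synthetic phishing instances to reach 58,000 per class.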

3.4.2 Random Forest (RF)

Using bootstrap aggregation or bagging, several

classification and regression trees (CART) are combined
to create the supervised learning technique known as
random forest (RF). Several regression trees referred to as
‘ntree’ were built, and random subsets of independent
variables (referred to as “mtry”) were used for each split
of a tree. The dependent variable, phishing or
legitimate,is predicted using the average of all the trees.
The out-of-bag Samples not included in the bootstrap set
were used for an internal cross-validation accuracy and
variable significance evaluation.

3.4.3 Bagging Classifier (BC)

Bagging is an ensemble learning method that combines multiple base classifiers, typically decision trees, by training each on a distinct bootstrap sample drawn from the original dataset. The final classification is determined by aggregating the predictions of these base classifiers. By introducing diversity among the base models, bagging reduces overfitting and improves generalization.

3.4.4 Proposed Ensemble Stacking Model

Ensemble techniques are one way to build classifiers that outperform conventional ML classifiers in classification accuracy. Stacking is a powerful ensemble learning method that combines the predictions of several individual models into a final prediction that is more robust and accurate. The stacked model is trained at two levels. The first level consists of the base learners, which are individual models trained on the original data. This experiment uses three base learners: DT, RF, and Bagging. The second level is the meta-learner. Once the base learners have been trained, their predictions are used as input to train the meta-learner, which learns how to combine the base learners' predictions into a final prediction. Elastic Net (ELNet) is used as the meta-learner in this work. ELNet, also known as Elastic Net regularization, is a regression and classification technique that combines L1 and L2 regularization. Combining these two penalties can lead to better performance than using either alone. ELNet regularization can help reduce the final model's variance, making it more robust to noise and outliers. Because ELNet models produce coefficients indicating the relative relevance of each feature, they are easier to interpret than other ensemble models such as random forests. The steps of the presented ensemble stacking model are given in Figure 4.

Fig. 4: Algorithm for Ensemble Stacked Classifier.

4 Experimental Results and Discussion

The experiments were conducted on a machine running the Windows 11 operating system with an Intel(R) Xeon(R) CPU @ 2.20GHz and 31 GB of RAM. The proposed framework is implemented in Python 3.10.10, using the Pandas, NumPy, and Scikit-Learn (sklearn) libraries to build the models. The samples are split 80:20 between training and testing. Recall, accuracy, precision, F1 score, TPR, and FPR are used to assess the proposed anti-phishing solution. The accuracy score (A) is the most commonly used metric for evaluating model performance in binary and multi-class classification problems; its expression is given in Equation 1. Precision (P) measures how many of the samples predicted as positive are truly positive, and recall (R), also called sensitivity, measures how many of the actual positives the model identifies; their expressions are given in Equations 2 and 3. The F1 score (Equation 4) captures the trade-off between recall and precision and therefore accounts for both FP and FN.

$$A = \frac{T^{+} + T^{-}}{T^{+} + T^{-} + F^{+} + F^{-}} \tag{1}$$

$$P = \frac{T^{+}}{T^{+} + F^{+}} \tag{2}$$

$$R = \frac{T^{+}}{T^{+} + F^{-}} \tag{3}$$

$$F1_{score} = 2\,\frac{P \cdot R}{P + R} \tag{4}$$
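As a minimal sketch of Equations 1 to 4, the metrics can be computed directly from the four outcome counts. The names `ConfusionCounts` and `count_outcomes` are illustrative and do not come from the paper, and the choice of label 1 for the phishing (positive) class is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary classifier (hypothetical helper type)."""
    tp: int  # T+ : positives predicted as positive
    tn: int  # T- : negatives predicted as negative
    fp: int  # F+ : negatives predicted as positive
    fn: int  # F- : positives predicted as negative


def count_outcomes(y_true, y_pred, positive=1):
    """Tally T+, T-, F+ and F-; `positive` is the assumed phishing label."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp, tn, fp, fn)


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def accuracy(c):
    # Equation 1
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def precision(c):
    # Equation 2
    return _ratio(c.tp, c.tp + c.fp)


def recall(c):
    # Equation 3
    return _ratio(c.tp, c.tp + c.fn)


def f1_score(c):
    # Equation 4
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


if __name__ == "__main__":
    counts = count_outcomes([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
    print(counts, accuracy(counts), precision(counts), recall(counts), f1_score(counts))
```

Returning 0.0 for an empty denominator is a design choice of this sketch; a library implementation may warn or raise instead.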

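The two-level design of Section 3.4.4 (DT, RF, and Bagging as base learners with an elastic-net meta-learner) could be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' implementation: the synthetic data, the hyperparameter values, and the use of an elastic-net-penalised `LogisticRegression` as the ELNet meta-learner are all assumptions, since the paper does not name the estimator class it used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def build_stacking_model(random_state=0):
    """Level 0: DT, RF and Bagging. Level 1: elastic-net meta-learner (assumed form)."""
    base_learners = [
        ("dt", DecisionTreeClassifier(random_state=random_state)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
        # BaggingClassifier uses decision trees as its default base estimator.
        ("bc", BaggingClassifier(n_estimators=20, random_state=random_state)),
    ]
    # The elastic-net penalty mixes L1 and L2; l1_ratio=0.5 is an arbitrary choice.
    meta_learner = LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
    )
    # cv=5: the meta-learner is trained on out-of-fold base-learner predictions.
    return StackingClassifier(estimators=base_learners, final_estimator=meta_learner, cv=5)


def run_demo(random_state=0):
    """Fit on synthetic data with ten features and an 80:20 split."""
    X, y = make_classification(
        n_samples=400, n_features=10, n_informative=5, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    model = build_stacking_model(random_state).fit(X_train, y_train)
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    fitted, test_accuracy = run_demo()
    print(f"held-out accuracy: {test_accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions keeps it from seeing base-learner outputs for samples those learners were fitted on, which is the usual way to limit leakage in stacking.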

Table 4: Significant features through DT-RFECV from Dataset-1 (DS-1)

Type of features    Feature number
Address based       F3, F5
Abnormal-based      F6, F7, F9, F13, F14, F18, F19
JavaScript-based    F6

Table 5: Tested results of various ML models on DS-1

Methods      A      P      R      F1 Score
DT           0.971  0.963  0.981  0.972
RF           0.976  0.973  0.980  0.976
LR           0.911  0.903  0.923  0.913
GRB          0.970  0.965  0.975  0.970
ADB          0.962  0.960  0.964  0.962
SVM          0.911  0.901  0.925  0.913
KNN          0.940  0.943  0.937  0.940
GNB          0.794  0.917  0.652  0.762
BC           0.976  0.974  0.978  0.976
P-Stacking   0.977  0.975  0.979  0.977

Table 6: Tested results compared with the cutting-edge methods' results (DS-1)

Methods                       A      P      R
[11] RF                       0.965  0.962  0.974
[11] BC                       0.963  0.962  0.973
[11] DT                       0.957  0.960  0.962
[19] GA-ADB                   0.972  0.971  0.972
[19] GA-BC                    0.975  0.972  0.978
[24] DNN                      0.997  0.998  0.996
[24] SVM                      0.999  1.0    0.998
Table 4 DT-RFECV P-Stacking   0.977  0.975  0.979

True positives (TP, written T+ in the equations) are cases in which the model correctly predicts a positive outcome. True negatives (TN, T−) are cases in which the model correctly predicts a negative outcome. False positives (FP, F+) are cases in which the model incorrectly predicts a positive outcome, and false negatives (FN, F−) are cases in which it incorrectly predicts a negative outcome. The metrics TPR, FPR, and AUC are also computed to evaluate the binary classification models. The expressions for TPR (True Positive Rate) and FPR (False Positive Rate) are given in Equations 5 and 6. The ROC curve shows the relationship between TPR and FPR: a lower FPR and a higher TPR are preferred. AUC is the area under the ROC curve.

$$TPR = \frac{T^{+}}{T^{+} + F^{-}} \tag{5}$$

$$FPR = \frac{F^{+}}{F^{+} + T^{-}} \tag{6}$$

The proposed framework is evaluated on two Mendeley datasets. Experiment 1 considered the 48 features of Dataset-1. DT-RFECV identified the top ten features, which are listed by number in Table 4 and by name in Table 2. Various machine learning (ML) models were then trained on these selected features, and the results are reported in Table 5. As shown in Table 5, the decision tree (DT), random forest (RF), and bagging classifier (BC) models performed better than the others. Notably, the ensemble stacking model achieved higher accuracy and F1-score than the other models. This experiment suggests that the features selected by DT-RFECV significantly improve the ability of ML models to accurately classify phishing web links. The ROC curves obtained with the different ML models are shown in Figure 5.

Fig. 5: Comparison of AUC values of various ML models.

Figure 5 shows that the RF and stacking models achieve comparable results, both better than the other models. Earlier studies also evaluated anti-phishing solutions on the same datasets used in this experiment. According to Table 5, the frameworks in [54] used all 48 features for their machine learning models. In Almomani et al. [54] and Al-Sarem et al. [46], sixteen classifiers were evaluated, with the RF, BC, and DT models performing best. The anti-phishing solution in [46] employed genetic optimization to fine-tune hyperparameters, achieving the best results with GA-ADB (AdaBoost), GA-BC (Bagging), and GA-enabled stacking models. Notably, the proposed P-Stacking model outperforms the solutions in [45] despite using only the top ten features from Dataset-1.

From a security standpoint, the candidate framework is resistant to the iframe attack (IframeOrFrame). Although phishing scams are typically


triggered by the presence of an external link or an empty link, the presented work includes features such as PctExtHyperlinks, PctExtNullSelfRedirectHyperlinksRT, and PctNullSelfRedirectHyperlinks that strengthen its defences against malicious resources and the effects of self-redirection. The FrequentDomainNameMismatch feature can help detect man-in-the-middle attacks by hackers.

In Experiment 2, Dataset-2 has 111 features. The significant features derived through DT-RFECV are provided in Table 7, and the corresponding feature names are given in Table 3. Table 8 displays the results obtained with the conventional ML models and the presented ensemble stacking model on these features. As Table 8 shows, the presented model performs as well as or better than the baseline models. The ROC curves obtained with the various ML models are shown in Figure 6.

Table 7: Significant features through DT FS from Dataset-2 (DS-2)

Type of features (DS-2)   Feature number
URL directory             F1, F18
URL features              F19
Domain features           F1
Others                    F2, F4, F5, F6, F8, F10

Table 8: Tested results of various ML models on DS-2

Methods      A      P      R      F1 Score
DT           0.96   0.96   0.960  0.96
RF           0.975  0.969  0.981  0.975
LR           0.910  0.922  0.894  0.908
GRB          0.954  0.942  0.966  0.954
ADB          0.935  0.925  0.945  0.935
SVM          0.762  0.733  0.816  0.772
KNN          0.893  0.875  0.914  0.894
GNB          0.829  0.927  0.710  0.804
BC           0.971  0.971  0.971  0.971
P-Stacking   0.975  0.971  0.978  0.975

Fig. 6: Comparison of AUC values of various machine learning models.

To demonstrate the merits of the presented work, the experimental results are compared with the results in [46, 49] in Table 9. For the candidate dataset, the frameworks in [46] apply optimization techniques to derive the hyper-parameters of various machine learning models. Notably, the anti-phishing frameworks in [46] utilize all 111 features, whereas the presented solution achieves better results using only ten features. The phishing scam detection method proposed in [49] employs SMOTEENN to address the class imbalance issue. To reduce dimensionality, it uses statistical methods to eliminate 13 constant features and then applies PCA. The results obtained in [49] were then compared with the proposed ensemble stacking model. Compared to XGBoost, the proposed method is better in terms of the number of features and the method used. A graphical representation of the comparison is shown in Figure 7.

Table 9: Comparative results with the cutting-edge methods (DS-2)

Methods       A      P      R
[19] GA-ADB   0.940  0.909  0.920
[19] GA-RF    0.964  0.946  0.950
[19] GA-XGB   0.973  0.962  0.961
[19] GA-BC    0.969  0.953  0.959
[23] LR       0.953  0.941  0.968
[23] NB       0.930  0.909  0.961
[23] XGB      0.992  0.991  0.994
P-Stacking    0.975  0.971  0.978

In terms of security analysis, the component with the biggest impact on the outcome is the set of features selected by DT-RFECV and listed in Table 7. A few of the ten semantic features from DS-2 relate to URLs, URL directories, and domain features. The highly significant features from the six other categories, time_domain_activation (OF5), asn_ip (OF4), time_response (OF2), ttl_hostname (OF10), time_domain_expiration (OF6), and qty_nameservers (OF6), account for much of the accuracy achieved in phishing scam detection. The proposed fine-tuned stacking ensemble method achieves superior performance on Dataset 1 compared to recent works [46,49]. This was evident in both accuracy and recall metrics. Additionally, for Dataset 2, our method surpassed the solution presented in [46,49].

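The DT-RFECV selection step used in both experiments could be sketched with scikit-learn's `RFECV` wrapped around a decision tree. This is a hedged illustration: the synthetic data, the five-fold stratified split, and the accuracy scoring are assumptions, and `RFECV` picks the number of features by cross-validation rather than being told to keep exactly ten.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def select_features(X, y, min_features=1, random_state=0):
    """Recursive feature elimination with cross-validation around a decision tree."""
    selector = RFECV(
        estimator=DecisionTreeClassifier(random_state=random_state),
        step=1,  # drop one feature per elimination round
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state),
        scoring="accuracy",
        min_features_to_select=min_features,
    )
    return selector.fit(X, y)


def run_selection_demo(random_state=0):
    """Run the selector on synthetic data with 20 candidate features (assumed)."""
    X, y = make_classification(
        n_samples=300, n_features=20, n_informative=5, random_state=random_state
    )
    selector = select_features(X, y, random_state=random_state)
    kept = [index for index, keep in enumerate(selector.support_) if keep]
    return selector, kept


if __name__ == "__main__":
    fitted_selector, kept_columns = run_selection_demo()
    print("selected feature indices:", kept_columns)
```

The retained column indices play the role of the feature numbers listed in Tables 4 and 7; downstream models would then be trained on `selector.transform(X)`.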

Fig. 7: Tested results using DS-2 along with its counterpart methods.

5 Conclusion

Phishing, a pervasive and harmful cyberattack, employs deception to obtain sensitive information. This research leverages machine learning to address the evolving nature of phishing threats. To overcome the challenges of sample imbalance and feature selection, the proposed approach incorporates SMOTE oversampling and a Decision Tree-Recursive Feature Elimination (DT-RFECV) wrapper method. DT-RFECV calculates the importance of features and utilizes cross-validation to prevent overfitting. The study identified two promising feature subsets for each dataset using DT-RFECV. To evaluate their effectiveness, various ML models are trained and tested. Random forests, decision trees, and bagging models demonstrated reliable predictive capabilities. Subsequently, a stacking ensemble model is developed to improve performance further. The proposed model's results are compared with recent anti-phishing solutions, demonstrating superior performance using fewer features according to quantitative metrics. Furthermore, the framework is designed to resist common cyberattacks in the IoT environment, including iframe attacks, man-in-the-middle attacks, and vulnerabilities related to null and external links. As future work, the solution can be extended to mitigate phishing scams on blockchain platforms. Additionally, we will establish generalizability metrics for the models to ensure their adaptability to new contexts.

Acknowledgement

The authors thank all the respondents who provided valuable responses and support for the survey. They offer special gratitude to INTI International for publishing the research work, particularly to INTI International University for funding its publication, and acknowledge the partial funding support provided by the Electronic Marketing and Social Media Department, Economic and Administrative Sciences, Zarqa University.

Funding

The authors offer special gratitude to INTI International University for the opportunity to conduct research and publish the research work. In particular, the authors would like to thank INTI International University for funding the publication of this research work. Also, we extend our heartfelt gratitude to all research participants for their valuable contributions, which have been integral to the success of this study.

Conflict of Interest

The authors have no conflict of interest to declare.

References

[1] A.M. Al-Adamat, M.K. Alserhan, L.S. Mohammad, D. Singh, S.I.S. Al-Hawary, A.A. Mohammad, M.F. Hunitie, The Impact of Digital Marketing Tools on Customer Loyalty of Jordanian Islamic Banks. In Emerging Trends and Innovation in Business and Finance (pp. 105-118). Singapore: Springer Nature Singapore (2023).
[2] M.S. Al-Batah, E.R. Al-Kwaldeh, M. Abdel Wahed, M. Alzyoud, N. Al-Shanableh, Enhancement over DBSCAN Satellite Spatial Data Clustering. Journal of Electrical and Computer Engineering, 2024, 2330624 (2024).
[3] M.S. Al-Batah, M.S. Alzboon, M. Alzyoud, N. Al-Shanableh, Enhancing Image Cryptography Performance with Block Left Rotation Operations. Applied Computational Intelligence and Soft Computing, 2024, 3641927 (2024).
[4] M.M. Alani, H. Tawfik, Phishnot: A cloud-based machine-learning approach to phishing url detection. Computer Networks, 218, 109407 (2022).
[5] A. Adwan, M. Alsoud, The impact of brand's effectiveness on navigating issues related to diversity equity and inclusion. Uncertain Supply Chain Management, 12, 2101-2112 (2024).
[6] F.M. Aldaihani, A.A. Mohammad, H. AlChahadat, S.I.S. Al-Hawary, M.F. Almaaitah, N.A. Al-Husban, A. Mohammad, Customers' perception of the social responsibility in the private hospitals in Greater Amman. In The effect of information technology on business and marketing intelligence systems (pp. 2177-2191). Cham: Springer International Publishing (2023).
[7] F.A. Al-Fakeh, M.S. Al-Shaikh, S.I.S. Al-Hawary, L.S. Mohammad, D. Singh, A.A. Mohammad, M.H. Al-Safadi, The Impact of Integrated Marketing Communications Tools on Achieving Competitive Advantage in Jordanian Universities. In Emerging Trends and Innovation in Business and Finance (pp. 149-165). Singapore: Springer Nature Singapore (2023).
[8] A.S. Al-Adwan, H. Berger, Exploring physicians' behavioural intention toward the adoption of electronic health records: an empirical study from Jordan. International Journal of Healthcare Technology and Management, 15, 89-111 (2015).


[9] S. Purkait, Examining the effectiveness of phishing filters against DNS based phishing attacks. Information & Computer Security, 23, 333-346 (2015).
[10] R. Rao, S.T. Ali, Phishshield: a desktop application to detect phishing webpages through heuristic approach. Procedia Computer Science, 54, 147-156 (2015).
[11] Q.Y. Shambour, M.M. Abualhaj, A. Abu-Shareha, A.H. Hussein, Q.M. Kharma, Mitigating Healthcare Information Overload: a Trust-aware Multi-Criteria Collaborative Filtering Model. Journal of Applied Data Sciences, 5, 1134-1146 (2024).
[12] R. Al Khouri, M. Al Fauri, The Impact of Working Capital Management on the Profitability of Jordanian Companies Listed on the Amman Stock Exchange. Al-Balqa Journal for Research and Studies, 26, 77-97 (2023).
[13] D.A. Al-Husban, S.I.S. Al-Hawary, I.R. AlTaweel, N.A. Al-Husban, M.F. Almaaitah, F.M. Aldaihani, D.I. Mohammad, The impact of intellectual capital on competitive capabilities: evidence from firms listed in ASE. In The effect of information technology on business and marketing intelligence systems (pp. 1707-1723). Cham: Springer International Publishing (2023).
[14] M.I. Alkhawaldeh, F.M. Aldaihani, B.A. Al-Zyoud, S.I.S. Al-Hawary, N.A. Shamaileh, A.A. Mohammad, O.A. Al-Adamat, Impact of internal marketing practices on intention to stay in commercial banks in Jordan. In The effect of information technology on business and marketing intelligence systems (pp. 2231-2247). Cham: Springer International Publishing (2023).
[15] R. Yang, K. Zheng, B. Wu, C. Wu, X. Wang, Phishing website detection based on deep convolutional neural network and random forest ensemble learning. Sensors, 21, 8281 (2021).
[16] M.S. Alshura, S.S. Tayeh, Y.S. Melhem, F.N. Al-Shaikh, H.M. Almomani, F.L. Aityassine, A.A. Mohammad, Authentic leadership and its impact on sustainable performance: the mediating role of knowledge ability in Jordan customs department. In The effect of information technology on business and marketing intelligence systems (pp. 1437-1454). Cham: Springer International Publishing (2023).
[17] A.A. Mohammad, I.A. Khanfar, B. Al Oraini, A. Vasudevan, I.M. Suleiman, M. Ala'a, User acceptance of health information technologies (HIT): an application of the theory of planned behavior. Data and Metadata, 3, 394-394 (2024).
[18] M. Moghimi, A.Y. Varjani, New rule-based phishing detection method. Expert systems with applications, 53, 231-242 (2016).
[19] N. Al-shanableh, M. Alzyoud, R.Y. Al-husban, N.M. Alshanableh, A. Al-Oun, M.S. Al-Batah, S. Alzboon, Advanced Ensemble Machine Learning Techniques for Optimizing Diabetes Mellitus Prognostication: A Detailed Examination of Hospital Data. Data and Metadata, 3, 363-363 (2024).
[20] N. Al-shanableh, M.S. Alzyoud, E. Nashnush, Enhancing Email Spam Detection Through Ensemble Machine Learning: A Comprehensive Evaluation Of Model Integration And Performance. Communications of the IIMA, 22, 2 (2024).
[21] M. Ramaiah, V. Chandrasekaran, V. Ravi, N. Kumar, An intrusion detection system using optimized deep neural network architecture. Transactions on Emerging Telecommunications Technologies, 32, e4221 (2021).
[22] A.S. Al-Adwan, M. Alsoud, N. Li, T.E. Majali, J. Smedley, A. Habibi, Unlocking future learning: Exploring higher education students' intention to adopt meta-education. Heliyon, 10, e29544 (2024).
[23] M. Alsharaiah, M. Abualhaj, L. Baniata, A. Al-saaidah, Q. Kharma, M. Al-Zyoud, An innovative network intrusion detection system (NIDS): Hierarchical deep learning model based on Unsw-Nb15 dataset. International Journal of Data and Network Science, 8, 709-722 (2024).
[24] R. Mangayarkarasi, C. Vanmathi, V. Ravi, A robust malware traffic classifier to combat security breaches in industry 4.0 applications. Concurrency and Computation: Practice and Experience, e7772 (2023).
[25] A.A. Mohammad, I.A. Khanfar, B. Al-Oraini, A. Vasudevan, I.M. Suleiman, Z. Fei, Predictive analytics on artificial intelligence in supply chain optimization. Data and Metadata, 3, 395-395 (2024).
[26] S. Abusaleh, M. Arabasy, M. Abukeshek, T. Qarem, Impacts of E-learning on the Efficiency of Interior Design Education (A comparative study about the efficiency of interior design education before and during the novel Coronavirus (COVID-19) pandemic). Al-Balqa Journal for Research and Studies, 27, 47-63 (2024).
[27] H. Hmoud, A.S. Al-Adwan, O. Horani, H. Yaseen, J. Al Zoubi, Factors influencing business intelligence adoption by higher education institutions. Journal of Open Innovation: Technology, Market, and Complexity, 9, 100111 (2023).
[28] A.A. Mohammad, F.L. Aityassine, Z.N. Al-fugaha, M. Alshurideh, N.S. Alajarmeh, A.A. Al-Momani, A.M. Al-Adamat, The Impact of Influencer Marketing on Brand Perception: A Study of Jordanian Customers Influenced on Social Media Platforms. In Business Analytical Capabilities and Artificial Intelligence-Enabled Analytics: Applications and Challenges in the Digital Era (pp. 363-376). Cham: Springer Nature Switzerland (2024).
[29] A.A. Mohammad, M.Y. Barghouth, N.A. Al-Husban, F.M. Aldaihani, D.A. Al-Husban, A.A. Lemoun, S.I.S. Al-Hawary, Does Social Media Marketing Affect Marketing Performance. In Emerging Trends and Innovation in Business and Finance (pp. 21-34). Singapore: Springer Nature Singapore (2023).
[30] M.M. Abualhaj, Q.Y. Shambour, A. Alsaaidah, A. Abu-Shareha, S. Al-Khatib, M.O. Hiari, Enhancing Spam Detection Using Hybrid of Harris Hawks and Firefly Optimization Algorithms. Journal of Applied Data Sciences, 5, 901-911 (2024).
[31] R. Ghoneim, M. Arabasy, The Role of Artworks of Architectural Design in Emphasizing the Arab Identity. Al-Balqa Journal for Research and Studies, 27, 1-14 (2024).
[32] A.A. Mohammad, M.M. Al-Qasem, S.M. Khodeer, F.M. Aldaihani, A.F. Alserhan, A.A. Haija, S.I.S. Al-Hawary, Effect of Green Branding on Customers Green Consciousness Toward Green Technology. In Emerging Trends and Innovation in Business and Finance (pp. 35-48). Singapore: Springer Nature Singapore (2023).
[33] M. Odeh, S.S. Badrakhan, N. Flayyih, M.O. Sabri, Z. Abdijabar, H. Alsabatin, S. Hammad, Quantifying the Impact of the COVID-19 Pandemic on Quality Assurance Practice. Appl. Math., 18, 989-996 (2024).


[34] A.K. Jain, B.B. Gupta, Towards detection of phishing websites on client-side using machine learning based approach. Telecommunication Systems, 68, 687-700 (2018).
[35] N. Al-Shanableh, M. Al-Zyoud, R.Y. Al-Husban, N. Al-Shdayfat, J.F. Alkhawaldeh, N.S. Alajarmeh, S.I.S. Al-Hawary, Data Mining to Reveal Factors Associated with Quality of life among Jordanian Women with Breast Cancer. Appl. Math., 18, 403-408 (2024).
[36] A.K. Jain, B.B. Gupta, A machine learning based approach for phishing detection using hyperlinks information. Journal of Ambient Intelligence and Humanized Computing, 10, 2015-2028 (2019).
[37] L. Mobaideen, A. Adaileh, The Impact Of Organizational Culture On Improving Institutional Performance In Aqaba Special Economic Zone Authority In Jordan. Al-Balqa Journal for Research and Studies, 27, 1-21 (2024).
[38] A.S. Al-Adwan, M.M. Al-Debei, The determinants of Gen Z's metaverse adoption decisions in higher education: integrating UTAUT2 with personal innovativeness in IT. Education and Information Technologies, 29, 7413-7445 (2024).
[39] R.S. Rao, A.R. Pais, Two level filtering mechanism to detect phishing sites using lightweight visual similarity approach. Journal of Ambient Intelligence and Humanized Computing, 11, 3853-3872 (2020).
[40] B. Guo, Y. Zhang, C. Xu, F. Shi, Y. Li, M. Zhang, HinPhish: An effective phishing detection approach based on heterogeneous information networks. Applied Sciences, 11, 9733 (2021).
[41] F. Zheng, Q. Yan, C.M. Victor, F. Leung, Y.U. Richard, Z. Ming, HDP-CNN: High way deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection. Computers & Security, 114, 102584 (2022).
[42] N. Al-shanableh, S. Anagreh, A.A. Haija, M. Alzyoud, M. Azzam, H.M. Maabreh, S.I.S. Al-Hawary, The Adoption of RegTech in Enhancing Tax Compliance: Evidence from Telecommunication Companies in Jordan. In Business Analytical Capabilities and Artificial Intelligence-enabled Analytics: Applications and Challenges in the Digital Era (pp. 181-195). Cham: Springer Nature Switzerland (2024).
[43] F.Y. Al-Kasassbeh, S.M. Awaisheh, M.A. Odeibat, S.M. Awaesheh, L. Al-Khalaileh, M. Al-Braizat, Digital Human Rights in Jordanian Legislation and International Agreement. International Journal of Cyber Criminology, 18, 37-57 (2024).
[44] N. Al-Dabbas, The Scope and Procedures of the Expert Recusal in the Arbitration Case: A Fundamental Analytical Study in Accordance with Jordanian Law. Al-Balqa Journal for Research and Studies, 27, 291-306 (2024).
[45] A.M. Vincent, P. Jidesh, An improved hyperparameter optimization framework for AutoML systems using evolutionary algorithms. Scientific Reports, 13, 4737 (2023).
[46] M. Al-Sarem, F. Saeed, Z.G. Al-Mekhlafi, B.A. Mohammed, T. Al-Hadhrami, M.T. Alshammari, T.S. Alshammari, An optimized stacking ensemble model for phishing websites detection. Electronics, 10, 1285 (2021).
[47] R.S. Rao, A. Umarekar, A.R. Pais, Application of word embedding and machine learning in detecting phishing websites. Telecommunication Systems, 79, 33-45 (2022).
[48] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, Y. Wang, An effective detection approach for phishing websites using URL and HTML features. Scientific Reports, 12, 8842 (2022).
[49] M. Bahaghighat, M. Ghasemi, F. Ozen, A high-accuracy phishing website detection method based on machine learning. Journal of Information Security and Applications, 77, 103553 (2023).
[50] K. Elumalai, D. Bose, Advancement of Phishing Attack Detection Using Machine Learning. Journal of Electrical Systems, 20, 1208-1213 (2024).
[51] M.A. Tamal, M.K. Islam, T. Bhuiyan, A. Sattar, N. Prince, Unveiling suspicious phishing attacks: enhancing detection with an optimal feature vectorization algorithm and supervised machine learning. Frontiers in Computer Science, 6, 1428013 (2024).
[52] E.S. Shombot, G. Dusserre, R. Bestak, N.B. Ahmed, An application for predicting phishing attacks: A case of implementing a support vector machine learning model. Cyber Security and Applications, 2, 100036 (2024).
[53] A. Ejaz, A.N. Mian, S. Manzoor, Life-long phishing attack detection using continual learning. Scientific Reports, 13, 11488 (2023).
[54] A. Almomani, M. Alauthman, M.T. Shatnawi, M. Alweshah, A. Alrosan, W. Alomoush, B.B. Gupta, Phishing website detection with semantic features based on machine learning classifiers: a comparative study. International Journal on Semantic Web and Information Systems, 18, 1-24 (2022).

M. Ramaiah received her Ph.D. Degree in Information Technology and Engineering from Vellore Institute of Technology, M.E. in Computer Science from Anna University. She is working as a Professor in the School of Computer Science Engineering and Information Systems at VIT University, Vellore, India. She has attended many national and international conferences and published articles in reputed journals. Her research interest includes cyber-security, Blockchain, Image Processing, Machine Learning, and Artificial Intelligence. Her Orcid ID is: https://orcid.org/0000-0003-3088-6001.

V. Chandrasekaran is a Senior Professor at the School of Computer Science Engineering and Information Systems, Vellore Institute of Technology (VIT), Vellore Campus, India. She holds a Ph.D. in Information Technology and Engineering from VIT University, a Master's


degree in Information Technology from Sathyabama University, and a Bachelor's degree in Computer Science from the University of Madras. With 21 years of experience in teaching and research, her expertise spans Image Processing, Deep Learning, Computer Vision, Blockchain, Cyber-Physical Systems, and IoT. She is also an active member of the Computer Society of India and the Soft Computing Research Society. Her Orcid ID is: https://orcid.org/0000-0001-5833-8803.

Vikash Chand is pursuing an M.Tech in Software Engineering at Vellore Institute of Technology. He is currently working as a cloud engineer intern at Signify. During his course tenure, he has participated in many technical events and presented technical papers at international conferences. His Orcid ID is orcid.org/0007-4835-1036.

Asokan Vasudevan is a distinguished academic at INTI International University, Malaysia. He holds multiple degrees, including a PhD in Management from UNITEN, Malaysia, and has held key roles such as Lecturer, Department Chair, and Program Director. His research, published in esteemed journals, focuses on business management, ethics, and leadership. Dr. Vasudevan has received several awards, including the Best Lecturer Award from Infrastructure University Kuala Lumpur and the Teaching Excellence Award from INTI International University. His ORCID ID is orcid.org/0000-0002-9866-4045.

Suleiman Ibrahim Mohammad is a Professor of Business Management at Al al-Bayt University, Jordan (currently at Zarqa University, Jordan), with more than 17 years of teaching experience. He has published over 100 research papers in prestigious journals. He holds a PhD in Financial Management and an MCom from Rajasthan University, India, and a Bachelor's in Commerce from Yarmouk University, Jordan. His research interests focus on supply chain management, Marketing, and total quality (TQ). His ORCID ID is orcid.org/0000-0001-6156-9063.

Eddie Eu Hui Soon is a Senior Lecturer at INTI International University with over 20 years of experience in academia and the animation industry. Before academia, he worked as a Technical Director in Malaysian production houses, contributing to TV commercials, series, feature films, and corporate videos. He continues to consult in the animation and gaming industry, specializing in 3D cinematic design. His research spans transdisciplinary topics, including Graph Theory, Systems Design, and digital frameworks. Dr. Soon is also involved in prototyping and visualization at the university's fabrication lab and supports research initiatives through journal and website management.

Qusai Shambour is affiliated with the Laboratory of Decision Systems and e-Service Intelligence, within the Centre for Quantum Computation and Intelligent Systems at the University of Technology Sydney. He is part of the School of Software in the Faculty of Engineering and Information Technology. His research primarily focuses on recommender systems, collaborative filtering, multi-criteria decision-making, and fuzzy logic. Dr. Shambour explores topics such as recommendation accuracy, semantic similarity, user preferences, and the cold-start problem in recommendation approaches. His work is pivotal in addressing issues like information overload and enhancing the quality of personalized recommendations in online services and social networks. His ORCID ID is orcid.org/0000-0002-3026-845X.

Muhammad Turki Alshurideh is a faculty member at the School of Business at the University of Jordan and the College of Business Administration, at the University of Sharjah, UAE. He teaches a variety of Marketing and Business courses to both undergraduate and postgraduate students. With over 170 published papers, his research focuses primarily on Customer Relationship Management (CRM) and customer retention. His ORCID ID is orcid.org/0000-0002-7336-381X.

No ratings yet
AI/ML Dual Approach For Phishing Domain Detection: URL and Image Analysis
11 pages
Phishing URL Detection Using LSTM Based Ensemble Learning Approaches
No ratings yet
Phishing URL Detection Using LSTM Based Ensemble Learning Approaches
17 pages
Real Time Phishing Website Detectionusing ML
No ratings yet
Real Time Phishing Website Detectionusing ML
4 pages
SAS - Session - 8 - Research 2
No ratings yet
SAS - Session - 8 - Research 2
4 pages
Web-Based Machine Learning Framework For Phishing URL Detection and Analysis
No ratings yet
Web-Based Machine Learning Framework For Phishing URL Detection and Analysis
7 pages
CyberSec Review3 Team10
No ratings yet
CyberSec Review3 Team10
28 pages
Batch-5 Journal-6 ECE-D New
No ratings yet
Batch-5 Journal-6 ECE-D New
6 pages
Leveraging Advanced Machine Learning Techniques For Phishing Website Detection
No ratings yet
Leveraging Advanced Machine Learning Techniques For Phishing Website Detection
6 pages
Paper 2
No ratings yet
Paper 2
10 pages
Automated Phishing Detection Through URL Analysis and Machine Learning
No ratings yet
Automated Phishing Detection Through URL Analysis and Machine Learning
9 pages
Machine Learning-Driven Phishing Detection: A Robust Browser Extension Solution
No ratings yet
Machine Learning-Driven Phishing Detection: A Robust Browser Extension Solution
4 pages
128 Submission
No ratings yet
128 Submission
7 pages
Enhanced Phishing Website Detection: Leveraging Random Forest and XGBoost Algorithms With Hybrid Features
No ratings yet
Enhanced Phishing Website Detection: Leveraging Random Forest and XGBoost Algorithms With Hybrid Features
4 pages
Detecting Phishing Websites Using Machine Learning
No ratings yet
Detecting Phishing Websites Using Machine Learning
16 pages
Detection of Phishing Website
No ratings yet
Detection of Phishing Website
12 pages
Phishing Website Detection Using ML 2-1
No ratings yet
Phishing Website Detection Using ML 2-1
20 pages
Format CC Ratio PDF
No ratings yet
Format CC Ratio PDF
141 pages
Basic Principles of Analytical Method Validation
No ratings yet
Basic Principles of Analytical Method Validation
34 pages
Ad3002 Health Care Analytics
No ratings yet
Ad3002 Health Care Analytics
76 pages
VPH Bits
No ratings yet
VPH Bits
63 pages
Dyslexia Prediction Using Machine Learning
No ratings yet
Dyslexia Prediction Using Machine Learning
9 pages
Evidence Based Medicine Beginners Handbook
No ratings yet
Evidence Based Medicine Beginners Handbook
40 pages
Predictive Machine Learning Applying Cross Industry Standard Process For Data Mining For The Diagnosis of Diabetes Mellitus Type 2
No ratings yet
Predictive Machine Learning Applying Cross Industry Standard Process For Data Mining For The Diagnosis of Diabetes Mellitus Type 2
14 pages
Test of Motor Proficiency Second Edition (BOT-2) : Compatibility of The Complete and Short Form and Its Usefulness For Middle-Age School Children
No ratings yet
Test of Motor Proficiency Second Edition (BOT-2) : Compatibility of The Complete and Short Form and Its Usefulness For Middle-Age School Children
8 pages
Prediction of Floods in Kerala Using Hybrid Model of CNN and LSTM
No ratings yet
Prediction of Floods in Kerala Using Hybrid Model of CNN and LSTM
7 pages
〈85〉 Bacterial Endotoxins Test
No ratings yet
〈85〉 Bacterial Endotoxins Test
6 pages
HSG Vs Sonohysterography
No ratings yet
HSG Vs Sonohysterography
4 pages
Tutorial 1
0% (1)
Tutorial 1
5 pages
Big Data Assignment #1: Submitted To/ Eng. Eman Hossam
No ratings yet
Big Data Assignment #1: Submitted To/ Eng. Eman Hossam
16 pages
Overcoming The Challenges That Delay Development of Your Lateral Flow Assay
No ratings yet
Overcoming The Challenges That Delay Development of Your Lateral Flow Assay
6 pages
TMT Oral
No ratings yet
TMT Oral
7 pages
SWYC Manual v101 Web Format 33016
No ratings yet
SWYC Manual v101 Web Format 33016
157 pages
PDF History Physical Examination Laboratorytesting and Emergency Department DD
No ratings yet
PDF History Physical Examination Laboratorytesting and Emergency Department DD
17 pages
Neutrophil-To-Lymphocyte Ratio For The Diagnosis of Pediatric Acute Appendicitis: A Systematic Review and Meta-Analysis
No ratings yet
Neutrophil-To-Lymphocyte Ratio For The Diagnosis of Pediatric Acute Appendicitis: A Systematic Review and Meta-Analysis
11 pages
AD3S
No ratings yet
AD3S
6 pages
Titanic - Machine Learning From Disaster - Kaggle
No ratings yet
Titanic - Machine Learning From Disaster - Kaggle
19 pages
Reticulocyte Hemoglobin
No ratings yet
Reticulocyte Hemoglobin
6 pages
Identification and Prediction of Chronic Diseases Using Machine
No ratings yet
Identification and Prediction of Chronic Diseases Using Machine
9 pages
Screening in Public Health
No ratings yet
Screening in Public Health
8 pages
Acute Medicine Surgery - 2022 - Okazaki - Diagnostic Accuracy of Pelvic Imaging For Acute Pelvic Inflammatory Disease in
No ratings yet
Acute Medicine Surgery - 2022 - Okazaki - Diagnostic Accuracy of Pelvic Imaging For Acute Pelvic Inflammatory Disease in
11 pages
160PC Differential Pressure Sensor
No ratings yet
160PC Differential Pressure Sensor
4 pages
Yolk Sac Shape Outcome
No ratings yet
Yolk Sac Shape Outcome
11 pages
Proficiency of Pallor in Predicting Anemia: Original Research
No ratings yet
Proficiency of Pallor in Predicting Anemia: Original Research
4 pages
Honeypot Systems and Techniques: Definitive Reference for Developers and Engineers
From Everand
Honeypot Systems and Techniques: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

