An Application For Predicting Phishing Attacks A Case of Implementing
Article info

Keywords: Phishing; SVM; Cybersecurity; Healthcare; Machine learning; AI; Cyberattack

Abstract

The imminent threat that phishing websites pose is a major concern for internet users worldwide. These fraudulent websites are crafted by cyber attackers to appear trustworthy and deceive vulnerable users into divulging confidential data such as medical health records, credit card details, passwords, and personally identifiable information (PII). To bait their victims, cybercriminals employ tactics such as social engineering, spear-phishing attacks, and email phishing scams. As a result, unsuspecting individuals may be enticed to visit these websites, putting their sensitive information at risk. This work presents an application designed to predict phishing attacks after comparing the polynomial and radial basis function kernels of the support vector machine (SVM). The proposed application leverages a dataset of known legitimate, suspicious and phishing attacks stored in a database and employs an SVM algorithm for classification based on user input. The application provides a user-friendly graphical user interface (GUI) that allows reporting of new phishing incidents based on the features that have a strong relationship with whether a website is phishing or not. The proposed application uses the inherent scalability of database technology to expand its records whenever a user initiates a phishing prediction, making it suitable for use in a wide range of organizational settings.
1. Introduction

The world has experienced a paradigm shift in the modus operandi of cybercriminals since the COVID-19 outbreak, as more than 150 countries experienced partial or complete movement restrictions alongside significant changes in the way economic activities are conducted [1,2]. Cybercrime is an illegal action aimed at computer systems or networks, encompassing a wide spectrum of potentially criminal activities [3]. A cyberattack can be directed at vulnerable computer networks, or it can rely on the victim’s implicit participation in the attacker’s criminal scheme to succeed, as in the case of social engineering attacks. As defined by [4] and [5], a social engineering attack is the art of psychologically manipulating an individual through persuasion to reveal sensitive or confidential information. Among the popular types of social engineering attacks, such as trojan horse, shoulder surfing and dumpster diving [6], phishing stands out as the most frequently employed technique [2,3].

Phishing is a method of cyberattack whereby cybercriminals attempt to get hold of people’s personally identifiable information by misleading them using psychological trickery [7]. Other authors, such as Merwe et al. [8], consider phishing as “a fraudulent activity that involves the creation of a replica of an existing web page to fool a user into submitting personal, financial, or password data.” Although definitions of phishing attacks can be fragmented by focusing on the social engineering aspect and the theft of PII, a more robust definition by Alkhalil et al. [9] suggests “phishing as a socio-technical attack, in which the attacker targets specific valuables by exploiting an existing vulnerability to pass a specific threat via a selected medium into the victim’s system, utilizing social engineering tricks or some other techniques to convince the victim into taking a specific action that causes various types of damages.” These threats can range from malicious web links and attachments to fraudulent data entry forms. The criticality of this attack method cannot be overemphasized: Sánchez-Paniagua [10] mentions that phishing is the most challenging social engineering attack to curb, because the large number of people currently engaging in online activities makes it more challenging to detect and prevent. A recent Proofpoint study reported by Techopedia in 2023 reveals that 83 % of companies fall victim to phishing attacks annually, highlighting the pervasive nature of this cyber threat. The alarming trend is further underscored by a substantial 345 % surge in unique phishing sites observed between 2020 and 2021. The FBI’s Internet Crime Complaint Center (IC3) reports a significant escalation in phishing incidents, with a staggering
https://doi.org/10.1016/j.csa.2024.100036
Received 21 August 2023; Received in revised form 16 November 2023; Accepted 8 January 2024
Available online 17 January 2024
2772-9184/© 2024 The Authors. Publishing Services by Elsevier B.V. on behalf of KeAi Communications Co., Ltd. This is an open access article under the CC BY
license (http://creativecommons.org/licenses/by/4.0/)
E.S. Shombot, G. Dusserre, R. Bestak et al. Cyber Security and Applications 2 (2024) 100036
800,944 reports and losses exceeding $10.3 billion in 2022 alone. The financial repercussions are substantial, as each phishing attack carries an average cost of $4.91 million for corporations. These statistics underscore the pressing need for robust cybersecurity measures and heightened awareness to counter the growing sophistication and prevalence of phishing attacks in the digital landscape [11]. Phishing attacks mostly include fraudulent emails [12], websites [13], phone calls [14], and text messages [15] that appear to be from legitimate sources such as banks, social media platforms, or government agencies, and the successful outcome in some cases is the installation of dangerous malware [16]. Due to the widespread use of online transactions and services, phishing attacks have become a major concern for individuals and organizations alike. As shown in Fig. 1, the number of unique phishing sites detected worldwide from the 3rd quarter of 2013 to the 3rd quarter of 2022 has been on the rise, with a dramatic surge in 2020, which can most likely be attributed to the COVID-19 pandemic that caused major disorientation in the general conduct of economic activities.

This work focuses primarily on using website features to predict phishing websites with the SVM machine learning algorithm because, as mentioned by Cui et al. 2020 [17], the features required to detect phishing attacks differ depending on the attack vector used. Therefore, the key contributions of this work to knowledge are:

1. Implement a machine learning algorithm that will predict phishing websites based on their features.
2. Develop a web application that will assimilate the phishing detection algorithm in an interactive way and provide a platform for users to make online predictions based on the identified features.
3. Capture each user prediction interaction session as a new record in a database for the purpose of expanding the record in the database and improving the performance of the machine learning algorithm.

2. Literature review

Generally, the practice of using ML in cybersecurity is still in its infancy or experimental stages, demonstrating a substantial gap between research and practice. As a result, the use of machine learning in cybersecurity is presented in a very disjointed manner, which makes it difficult to deploy in practice [19,20]. The promise of artificial intelligence and machine learning, as presented in most literature in terms of what they can achieve in cybersecurity, can at best be considered speculative [21], largely because AI and ML are exclusively data-driven and, currently, such data is either lacking or so specific to a given use case that it cannot be reproduced for other identified contexts [22]. In terms of the practical application of AI and ML as implemented in most use cases presented in some of the most recent literature [23], a clearer picture can be painted as to the extent of research and what aspects of cybersecurity can be achieved by leveraging AI and ML methods [24]. Of the three successful applications of ML in cybersecurity articulated by Apruzzese et al. [19], namely machine learning in network intrusion detection, machine learning in malware detection, and machine learning in phishing detection, this work focuses on phishing detection using machine learning, and this section explores the relevant literature on the subject matter.

2.1. Systematic literature review of phishing

A systematic literature survey was conducted by Safi and Singh, 2023 [25], intended to provide a comprehensive analysis of the techniques used in detecting phishing websites. Their work finds that, despite an extensive literature search, a comprehensive overview of all the significant approaches employed in this domain was lacking. Additionally, there was a lack of a systematic resource that aggregates the methodologies, data sets, and algorithms utilized in phishing website de-
tection. Thus, there was a requirement for an authoritative review to study this field and provide a comprehensive summary. The main contribution of Safi and Singh, 2023 [25] is, firstly, to unearth the most effective techniques for detecting phishing websites, so that security managers can readily choose the most effective method from a range of anti-phishing approaches for their security systems and, secondly, to provide a systematic literature review that succinctly captures the current state of techniques, data sets and algorithms used to deal with the phishing problem.

Another review paper by Qabajeh et al., 2018 [26], titled “recent review of conventional vs. automated cybersecurity anti-phishing techniques”, provides another dimension of review. Their work explores integrating a classification system with intelligent machine learning technology in the browser as a promising anti-phishing approach that detects and alerts users to phishing activities. Their paper is designed to be broad in scope, as it reviews and analyzes legal, training, educational, and intelligent anti-phishing approaches and highlights their similarities, differences, and positive and negative aspects from user and performance perspectives. The study also identifies ways to combat phishing through intelligent and conventional methods, making it beneficial for computer security experts, web security researchers, and business owners.

Another comprehensive survey was carried out by Basit et al. 2021 [27], where the authors explored AI-enabled phishing attack detection techniques and concluded that most phishing attack detection methods fall into four categories. The first is deep learning: recent progress in deep learning methodologies suggests that the categorization of phishing websites using deep neural networks (NN) could surpass conventional machine learning (ML) algorithms. Nonetheless, the outcomes of employing deep NN rely significantly on the configuration of various learning parameters [28]. The second is the machine learning method, considered popular because most phishing attack types appear to be classification problems. The degree of accuracy is relatively high when using this detection method, but that depends largely on the dataset and the features therein [29,30]. The third is the scenario-based phishing attack detection method, which is predicated on different scenarios; however, these scenarios yield different outcomes based on the methods used [27]. Some examples of this method include Begum and Badugu, 2019 [31], which relies on the consolidation of techniques such as machine learning (ML) based approaches, non-machine-learning-based approaches, neural-network-based approaches, and behavior-based detection approaches for the detection of phishing attacks. Other authors such as Fatima et al., 2019 [32] presented PhishI for security training based on gaming, and Chiew et al., 2018 [13] focused on phishing attack detection based on features, medium and vectors. Lastly, hybrid learning (HL) based phishing attack detection suggests the most recent future direction for phishing attack detection, which could be based on leveraging more than one machine learning model, as in the case of Pandey et al., 2020 [33], who proposed random forest and support vector machine algorithms as a hybrid model for phishing detection.

2.2. Conventional phishing detection methods

According to Hong J. 2012, there are three main ways to combat phishing attacks [16]: by implementing invisible protections that require no action from the user, by creating better user interfaces, and by providing effective training [16]. There are currently over 500 toolkits available for phishing attack [34], some of which are designed to trick the phisher into providing false information. Criminals and security professionals are engaged in a constant competition to outsmart each other [16].

Phishers use various techniques, such as fast flux, which involves using a pool of proxies and domain names to hide the location of the phishing website [16]. This technique can extend the average lifespan of a phishing website to 196 hours, compared to the average of 62 hours before the location is taken down [35]. Due to the growing threat of phishing and the negative impact it has on the economy and reputation of businesses, it is critical to have effective countermeasures in place, such as filters, machine learning, blacklists, and active and passive indicators that alert users. While combining these countermeasures with staff training enhances the ability to deal with phishing, it is essential to acknowledge that training users may not always be effective. This is often due to factors such as low motivation or unwillingness to engage with training materials [16].

In a similar study, Minocha and Singh, 2022 [36] identify two categories of automatic detection techniques for phishing sites: (i) list-based systems, and (ii) machine learning-based techniques. List-based systems use safelists and denylists to categorize websites as legitimate or phishing, respectively. Additionally, the article points out that the traditional methods of detection only provide around 20 % success.

Tan et al. [37] propose a phishing detection method that is based on comparing the actual and target identities of a webpage. The proposed method, PhishWHO, comprises three phases:

1. Extracting identity keywords from the website’s textual content, employing a unique weighted URL tokens system based on the N-gram model.
2. Identifying the target domain name through a search engine, then selecting the domain based on identity-relevant characteristics.
3. Proposing a 3-tier identity matching system to assess the authenticity of the queried webpage.

The data mining approach proposed by Abdelhamid et al. [38] considers phishing a typical classification problem, in which the objective of the classification task is to categorize a new website into one of the predefined classes, such as phishing, legitimate or suspicious. After a website is loaded in the browser, a set of feature values is extracted, which play a crucial role in determining the website type by utilizing the rules derived from historical data. Their work, alongside that of [39,40] and [41], is based on associative classification, which is a data mining technique that combines classification and association rule mining.

The content-based approach is another method of detection articulated by Nguyen et al. [42], Zhang et al. [43], and Jha et al. [44]. This method classifies websites as either phishing or non-phishing. This approach relies on analyzing the contents of the site to determine its classification. The "Term Frequency/Inverse Document Frequency" (TF-IDF) algorithm is commonly used for this type of content analysis. Zhang et al. used a similar content-based approach in their research, which they called CANTINA. Their results showed that this technique had a high accuracy rate of 97 % in identifying phishing websites. However, to reduce false positives, heuristics were applied, resulting in a decrease in accuracy to around 90 %. Other sources such as [45–48] also discussed and implemented other variants of the content-based approach to phishing detection to corroborate the works of Zhang et al. [43].

Various approaches have been explored in the literature to enhance phishing detection [49], each accompanied by its own set of drawbacks. One method involves specifying weights for words extracted from URLs and HTML contents, focusing on elements like brand names, with a dependency on a third-party server, Yahoo Search, resulting in an accuracy rate of 98.20 % [50]. However, a drawback of this approach is its reliance on an external server and overdependence on textual content. Another strategy utilizes logo image analysis to identify web page authenticity, matching real and fake webpages, but with a dependency on Google Image Search and an accuracy rate of 93.40 % [51]. The drawback here is the reliance on a third-party server and the restriction of this method to images only. In another method, URL heuristics and website rank are used for detection, achieving an accuracy rate of 97.16 %, but the drawback lies in the time-consuming process of feature extraction and website rank examination [52].

Rao and Ali implemented an advanced version of this technique using a desktop application called "Phish Shield." They used novel heuris-
tics based on the URL to reduce false positives to 0.035 % and achieve an accuracy of 96.57 %. Their approach included null footer links, maximum frequency domains, copyrights, and whitelists for the detection process. However, their approach had a limitation in terms of response time, which could be improved with newer methodologies such as genetic algorithms and neural networks.

2.3. Machine learning approach

The work of Alani and Tawfik, 2022 [15] pivots on building a machine-learning-based phishing detection system using only the URL, as they argue that their approach provides better network protection by reducing the attack surface. They also applied recursive feature elimination (RFE), a feature selection method that reduces the number of features to the most important and critical ones. Their work leveraged a pipeline of five machine learning classifiers (random forest, logistic regression, decision tree, Gaussian naïve Bayes and multi-layer perceptron) using a 75 % training and 25 % testing split. Overall, random forest performed better than the other models they tested. Finally, in contrast to other locally hosted phishing solutions, their system can be deployed to the cloud as an API to be integrated as a browser plugin.

Sahoo et al. [53] discuss various malicious attacks, as well as different types of machine learning and features for detecting malicious URLs. The paper primarily focuses on the identification of features used for classifying malicious websites, grouped into five categories, and highlights the design and limitations of some of these features. The authors also provide examples of machine learning algorithms and their application in detecting malicious websites.

Ensemble learning classification is another method popularized for phishing detection, which is based on using multiple classifiers or algorithms to solve a classification problem [54]. The work carried out by Al-Sarem et al., 2021 [55] proposed an optimized stacking ensemble model for phishing website detection. Their approach includes three stages of training, ranking and testing. The classifiers, namely random forests, AdaBoost, XGBoost, Bagging, GradientBoost, and LightGBM, were initially trained without utilizing any optimization method. Subsequently, the genetic algorithm was employed to optimize these classifiers by determining the most favorable parameter values for various ensemble models. This process enabled the selection of the optimal parameters for these classifiers. Another work by Abawajy and Kelarev, 2012 [56] used a multi-tier ensemble construction of classifiers for phishing; however, their work only focused on email detection and filtering. Similarly, Bountakas and Xenakis, 2023 [57] proposed hybrid ensemble learning PHishing email detection based on stacking and soft voting.

3. Proposed system

The development of the proposed system starts by using a dataset of website characteristics labeled as either legitimate, suspicious or phishing. Based on the datapoints, the SVM algorithm is utilized and optimized using the polynomial or radial basis function kernel to determine which kernel provides better model accuracy with minimal errors. The trained model is then used to create a web application that runs on a server. This application takes features of a website as input and outputs a classification label based on the trained model. To ensure that the SVM model remains accurate over time, a database is created to store new user interactions with websites. These interactions can be analyzed and fed back into the SVM model to improve its accuracy. This continuous improvement process is intended to keep the model up-to-date with the latest trends and techniques used by phishing websites. In summary, creating a machine learning model to identify and differentiate between legitimate, suspicious, and phishing websites involves training the model using SVM, creating a web application to implement the model, creating a database to store user interactions, and continuously improving the model based on new data. By following these steps, the accuracy and reliability of the model can be maximized, and users can be better protected against phishing attacks (Fig. 2).

4. Methodology

As depicted in Fig. 3, the methodology for this paper begins by identifying the phishing dataset by Abdelhamid et al. [38] and preprocessing it to assess its suitability for the intended task. Next, we identified the independent features (input variables) and the class label (output variable). Clearly defining these elements is fundamental to the training of a machine learning model. Subsequently, we determined the nature of the phishing problem we aim to solve based on the dataset’s characteristics. Utilizing a pair plot library in Python, we identified the relationships between pairs of variables in the dataset. From the outcomes of the pair plot, we established that we are dealing with a classification problem. Therefore, we progressed to identify and test several supervised learning classification algorithms such as Random Forest, gradient boosting, decision trees, and SVM. We then proceeded to build and train the model, selecting SVM as it exhibited the highest accuracy in our case. Additionally, we implemented two optimization kernels of SVM, namely RBF and
4.1. Justification for algorithm used

Different algorithms are better suited for different types of data and different problems. The choice of algorithm for this work therefore depends on the specific characteristics of the data, the problem we are solving, and the trade-offs between model complexity, interpretability, and computational efficiency [58]. For instance, logistic regression is used in classification problems where the datapoints are linearly separable; however, that will not be applicable in this case given that our data is a three-label multiclass classification problem [59]. Therefore, we experimented with classification algorithms that are better suited for multiclass classification problems, such as SVM, Random Forest, Gradient boosting and Decision trees. From the results shown in Fig. 4, the SVM model performed relatively better than the other models. There are also theoretical dimensions that are abundantly articulated in the literature on the use of SVMs in solving multiclass classification problems [60–63]. Some of these inherent advantages include;

For the dataset used in the development of the application, this article leverages the work of Abdelhamid et al. [38]. The dataset is titled “website phishing”, was made available on 11/1/2016, has multivariate characteristics, is designed for classification tasks and consists of integer-type features [64]. Several features related to legitimate and phishing websites were identified, and a dataset comprising 1353 websites from diverse sources was collected. The Phishtank data archive (www.phishtank.com), a community website that allows users to submit, verify, track and share phishing data, was the source of phishing websites. Legitimate websites were sourced from Yahoo and starting point directories using a PHP web script. The PHP script was integrated with a browser, enabling the authors to collect 548 legitimate websites out of the 1353 total websites. The dataset also contains 702 phishing URLs and 103 URLs classified as suspicious. When a website is deemed SUSPICIOUS, it displays features that are characteristic of both legitimate and phishing websites, implying that the website has both genuine and fraudulent attributes.
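The class counts reported for the dataset can be checked with a few lines of Python. This is only a sketch built from the figures quoted in the text (548 legitimate, 702 phishing, 103 suspicious); the variable names are illustrative and are not taken from the authors' code.

```python
# Class counts as reported for the "website phishing" dataset.
class_counts = {"legitimate": 548, "phishing": 702, "suspicious": 103}

total = sum(class_counts.values())

# Share of each class, to make the class imbalance visible.
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"{label:<11} {class_counts[label]:>4}  {share:.1%}")
print(f"{'total':<11} {total:>4}")
```

The three counts add up to the 1353 websites stated for the dataset, and the suspicious class accounts for under 8 % of the records, so per-class metrics may be more informative than overall accuracy alone.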
Furthermore, a notable strength of the dataset is that its authors applied the Chi-Square measure [65], which is a statistical method used to determine the degree of association or independence between two categorical variables. It involves calculating the difference between observed and expected frequencies of occurrence in a contingency table, and then comparing these differences to their expected values under the assumption of independence. By doing this, it was possible to arrive at a concise set of features and to avoid noise from features that may not be important in making accurate predictions. After the application of the Chi-square test, 9 features had the most correlation with the class attribute among the 16 initially identified features. These features are Request URL, Age of Domain, HTTPS and SSL, Website Traffic, Long URL, SFH, Pop-Up window, URL of Anchor, Redirect URL and Using the IP Address.

4.4. Exploratory data analysis

As captured in Fig. 5, the EDA shows the features of the dataset and their distribution across the class label [-1 = phishing, 0 = suspicious, 1 = legitimate] (Fig. 7).

4.5. Support vector machine

Support Vector Machines (SVMs) are a family of generalized linear classification methods used for both classification and regression tasks in supervised learning [67]. They have a special property of simultaneously minimizing the empirical classification error and maximizing the geometric margin, which has earned them the nickname of Maximum Margin Classifiers. SVMs are based on the Structural Risk Minimization
(SRM) principle and work by mapping the input vectors into a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed on each side of the separating hyperplane to separate the data. The hyperplane that maximizes the distance between these parallel hyperplanes is chosen as the separating hyperplane. It is assumed that a larger margin or distance between the parallel hyperplanes leads to a lower generalization error of the classifier [67,68].

In this work, the SVM algorithm utilizes a hyperplane to distinguish between different data elements and classify them as phishing, legitimate or suspicious based on the dataset. The hyperplane is chosen to give the best separation of the data in feature space. New inputs are then mapped into the same space, and the SVM predicts the category based on which side of the gap the input falls on. Furthermore, in implementing SVM, two kernels, the radial basis function and the polynomial kernel, were tested on the dataset to determine which one works best with it.

4.5.1. Radial basis function kernel

The radial basis function (RBF) kernel is one of the kernel functions used in support vector machines. The mathematical equation for the RBF kernel function is [68]:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

where $x_i$ and $x_j$ are input vectors, $\lVert \cdot \rVert$ denotes the Euclidean norm (so $\lVert x_i - x_j \rVert$ is the Euclidean distance between $x_i$ and $x_j$), and $\gamma$ is a hyperparameter that controls the width of the kernel.

This kernel function maps the input data into a higher-dimensional feature space where it becomes separable by a linear decision boundary. The RBF kernel is widely used in support vector machines due to its ability to handle complex, nonlinear decision boundaries in the data [68].

4.5.2. Polynomial kernel

A polynomial kernel is a type of kernel that can be used in SVMs. It is defined as follows [68]:

$$K(x_i, x_j) = \left(\gamma\, x_i^{T} x_j + r\right)^{d}$$

This is the equation for the polynomial kernel function used in support vector machines (SVMs). The kernel function calculates the similarity between two data points, $x_i$ and $x_j$, by computing the dot product of their feature vectors, scaling it by $\gamma$, and raising the result to the power of $d$ after adding a
constant $r$. The value of $\gamma$ scales the dot product and affects the smoothness of the decision boundary. The polynomial kernel is used to transform the input space to a higher dimensional space to achieve a better separation of data points [68].

when initiated by an actor or user. Finally, the appearance of a graphical user interface (GUI) as implemented for the user to interact with to predict a phishing website attack is captured in Fig. 10.

Table 1
Features and definition.

| No. | Feature | Definition |
|---|---|---|
| 1 | Server Form Handler (SFH) | When a user submits information, it is transferred to a server for processing, typically on the same domain, but phishers may use an empty server form handler or transfer the information to a different domain. |
| 2 | Pop-up Windows | A high prevalence of pop-up windows prompting users to enter their personally identifiable information is more likely to be associated with phishing sites. |
| 3 | SSL Certificate | The presence of the HTTPS protocol indicates a legitimate website, but phishers may use fake HTTPS, hence verify HTTPS by trusted issuers (e.g., GeoTrust, GoDaddy, VeriSign). |
| 4 | Web Redirect | Phishers often use link redirection to deceive users into submitting their information to a fraudulent site, making it difficult for users to detect the real link they are being directed to. |
| 5 | @ Symbol | The "@" symbol leads the browser to ignore everything prior to it and redirects the user to the link typed after it. |
| 6 | Web Traffic | Phishing websites have low web traffic and a short life, whereas legitimate websites have high traffic and a lower rank, typically less than or equal to 150,000 according to the Alexa database. |
| 7 | Long URL | Phishers may hide parts of the URL to redirect user information or upload pages to suspicious domains. There is no reliable length that distinguishes phishing from legitimate URLs, but a length greater than 54 characters may indicate a phishing URL [66]. |
| 8 | Age of domain | Websites with less than 1 year of online presence may be deemed risky. |
| 9 | IP address | Using an IP address in the domain name of the URL is an indicator that someone is trying to access personal information. This trick involves links that may begin with an IP address that most companies no longer commonly use. |
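Some of the features in Table 1 can be derived from the URL string alone. The following is a minimal sketch of that idea, not the authors' implementation: the function name `url_features`, the boolean output format and the example URL are assumptions made for illustration. Only the 54-character threshold, the "@" rule and the IP-address rule are taken from Table 1.

```python
import ipaddress
from urllib.parse import urlsplit

LONG_URL_THRESHOLD = 54  # characters, per the "Long URL" row of Table 1


def url_features(url: str) -> dict:
    """Return three boolean phishing indicators computed from the URL text."""
    host = urlsplit(url).hostname or ""
    try:
        # Succeeds only when the host part is a literal IPv4/IPv6 address.
        ipaddress.ip_address(host)
        uses_ip = True
    except ValueError:
        uses_ip = False
    return {
        "long_url": len(url) > LONG_URL_THRESHOLD,
        "has_at_symbol": "@" in url,
        "uses_ip_address": uses_ip,
    }


# Hypothetical URL used only to exercise the function.
print(url_features("http://192.168.0.1/login"))
```

The remaining features in Table 1 (for example web traffic rank or domain age) need external lookups and are outside the scope of this sketch.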
7. Conclusion
Fig. 10. Graphical user interface.

of a desired result. Depending on the parameters entered by the user, a result of legitimate, suspicious, or phishing will be returned.

For the future scope of this work, it is worth mentioning that phishing attacks are very dynamic in nature: as soon as cybercriminals are contained in their traps, they exploit other methods to work around the problem and gain an advantage. This work relies heavily on known intrinsic attributes found on websites and selects key features that are prevalent and correlate substantially with the legitimate or illegitimate status of unknown websites passed through the ML algorithm. Given the explosion of web technologies such as Django, Meteor JS, Yii, and Motion UI, these nine selected features may not be sufficient to accommodate the newer trends in websites used by cybercriminals. Therefore, future work will explore newer trends in website phishing and continually evolve our web application and database to capture such peculiarities.

6. Limitation

The scope of this work primarily focuses on website phishing, even though other forms of phishing attacks, such as spear-phishing and email phishing, can be potentially devastating. The SVM prediction model is based on a dataset of about 1400 records, which may have affected the model’s accuracy; however, this work mitigates that by providing a function that allows the dataset to grow through a database, which will enable the machine learning model to correct itself consistently and, in turn, improve its accuracy.

References
[1] A.K. Jain, N. Debnath, A.K. Jain, APuML: an efficient approach to detect mobile phishing webpages using machine learning, Wirel. Pers. Commun. 125 (4) (2022) 3227–3248 Aug, doi:10.1007/s11277-022-09707-w.
[2] A. Yasin, R. Fatima, L. Liu, J. Wanga, R. Ali, Z. Wei, Counteracting social engineering attacks, Comput. Fraud Secur. 2021 (10) (2021) 15–19 Oct, doi:10.1016/S1361-3723(21)00108-1.
[3] A. Mughaid, S. AlZu’bi, A. Hnaif, S. Taamneh, A. Alnajjar, E.A. Elsoud, An intelligent cyber security phishing detection system using deep learning techniques, Clust. Comput. 25 (6) (2022) 3819–3828 Dec, doi:10.1007/s10586-022-03604-4.
[4] P. Suresh, et al., Chapter 10 - contemporary survey on effectiveness of machine and deep learning techniques for cyber security, in: Machine Learning for Biometrics, Academic Press, 2022, pp. 177–200, doi:10.1016/B978-0-323-85209-8.00007-9. P. P. Sarangi, M. Panda, S. Mishra, B. S. P. Mishra, and B. Majhi, in Cognitive Data Science in Sustainable Computing.
[5] K.C. Bourne, Chapter 15 - security, in: Application Administrators Handbook, Morgan Kaufmann, Boston, 2014, pp. 242–267, doi:10.1016/B978-0-12-398545-3.00015-7. K. C. Bourne.
[6] S.D. Applegate, Social engineering: hacking the wetware!, Inf. Secur. J. Glob. Perspect. 18 (1) (2009) 40–46 Feb, doi:10.1080/19393550802623214.
[7] K. Chetioui, B. Bah, A.O. Alami, A. Bahnasse, Overview of social engineering attacks on social networks, Procedia Comput. Sci. 198 (2022) 656–661 Jan, doi:10.1016/j.procs.2021.12.302.
[8] A. van der Merwe, M. Loock, M. Dabrowski, Characteristics and responsibilities involved in a phishing attack, in: Proceedings of the 4th International Symposium on Information and Communication Technologies, in WISICT ’05, Cape Town, South Africa, Trinity College Dublin, 2005, pp. 249–254. Jan.
[9] Z. Alkhalil, C. Hewage, L. Nawaf, I. Khan, Phishing attacks: a recent comprehensive study and a new anatomy, Front. Comput. Sci. 3 (2021) Accessed: Nov. 13, 2023. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fcomp.2021.563060.
[10] M. Sánchez-Paniagua, E. Fidalgo, E. Alegre, R. Alaiz-Rodríguez, Phishing websites detection using a novel multipurpose dataset and web technologies features, Expert Syst. Appl. 207 (2022) 118010 Nov, doi:10.1016/j.eswa.2022.118010.
[11] J. Rushton, “50+ phishing statistics you need to know – where, Who & What is Targeted,” Techopedia. Accessed: Nov. 13, 2023. [Online]. Available: https://www.techopedia.com/phishing-statistics.
[12] R.M. Mohammad, F. Thabtah, L. McCluskey, Tutorial and critical analysis of phishing websites methods, Comput. Sci. Rev. 17 (2015) 1–24 Aug, doi:10.1016/j.cosrev.2015.04.001.
[13] K.L. Chiew, K.S.C. Yong, C.L. Tan, A survey of phishing attacks: their types, vectors and technical approaches, Expert Syst. Appl. 106 (2018) 1–20 Sep, doi:10.1016/j.eswa.2018.03.050.
[14] B.B. Gupta, N.A.G. Arachchilage, K.E. Psannis, Defending against phishing attacks: taxonomy of methods, current issues and future directions, Telecommun. Syst. 67 (2) (2018) 247–267 Feb, doi:10.1007/s11235-017-0334-z.
[15] M.M. Alani, H. Tawfik, PhishNot: a cloud-based machine-learning approach to phishing URL detection, Comput. Netw. 218 (2022) 109407 Dec, doi:10.1016/j.comnet.2022.109407.
[16] J. Hong, The state of phishing attacks, Commun. ACM 55 (1) (2012) 74–81 Jan, doi:10.1145/2063176.2063197.
[17] Q. Cui, G.V. Jourdan, G.v. Bochmann, I.V. Onut, SemanticPhish: a semantic-based scanning system for early detection of phishing attacks, in: Proceedings of the 2020 APWG Symposium on Electronic Crime Research (eCrime), 2020, pp. 1–12, doi:10.1109/eCrime51433.2020.9493252. Nov.
[18] “Number of global phishing sites 2022,” Statista. Accessed: May 03, 2023. [Online]. Available: https://www.statista.com/statistics/266155/number-of-phishing-domain-names-worldwide/.
[19] G. Apruzzese, et al., The role of machine learning in cybersecurity, Digit. Threats Res. Pract. 4 (1) (2023) 8:1–8:38 Mar, doi:10.1145/3545574.
[20] A. Parisi, Hands-on Artificial Intelligence for Cybersecurity: Implement Smart AI Systems for Preventing Cyber Attacks and Detecting Threats and Network Anomalies, Packt Publishing, Birmingham, UK, 2019.
[21] H. Karimipour, F. Derakhshan, AI-Enabled Threat Detection and Security Analysis for Industrial IoT, Springer Nature, 2021.
[22] R. Sen, G. Heim, Q. Zhu (Information and Operations Management Department, Texas A&M University), Artificial intelligence and machine learning in cybersecurity: applications, challenges, and opportunities for MIS academics, Commun. Assoc. Inf. Syst. 51 (1) (2022) 179–209, doi:10.17705/1CAIS.05109.
[23] M. Wazid, A.K. Das, V. Chamola, Y. Park, Uniting cyber security and machine learning: advantages, challenges and future research, ICT Express 8 (3) (2022) 313–321 Sep, doi:10.1016/j.icte.2022.04.007.
[24] R. Kaur, D. Gabrijelčič, T. Klobučar, Artificial intelligence for cybersecurity: literature review and future research directions, Inf. Fusion 97 (2023) 101804 Sep, doi:10.1016/j.inffus.2023.101804.
[25] A. Safi, S. Singh, A systematic literature review on phishing website detection techniques, J. King Saud Univ. Comput. Inf. Sci. 35 (2) (2023) 590–611 Feb, doi:10.1016/j.jksuci.2023.01.004.
[26] I. Qabajeh, F. Thabtah, F. Chiclana, A recent review of conventional vs. automated cybersecurity anti-phishing techniques, Comput. Sci. Rev. 29 (2018) 44–55 Aug, doi:10.1016/j.cosrev.2018.05.003.
[27] A. Basit, M. Zafar, X. Liu, A.R. Javed, Z. Jalil, K. Kifayat, A comprehensive survey of AI-enabled phishing attacks detection techniques, Telecommun. Syst. 76 (1) (2021) 139–154 Jan, doi:10.1007/s11235-020-00733-2.
[28] G. Vrbančič, I. Fister jr, and V. Podgorelec, “Swarm intelligence approaches for parameter setting of deep learning neural network: case study on phishing websites classification,” Jun. 2018, pp. 1–8.
[29] J. James, S. L., and C. Thomas, “Detection of phishing URLs using machine learning techniques,” Dec. 2013, pp. 304–309.
[30] S.W. Liew, N.F.M. Sani, Mohd.T. Abdullah, R. Yaakob, M.Y. Sharum, An effective security alert mechanism for real-time phishing tweet detection on Twitter, Comput. Secur. 83 (2019) 201–207 Jun, doi:10.1016/j.cose.2019.02.004.
[31] A. Begum, S. Badugu, A study of malicious URL detection using machine learning and heuristic approaches, Learn. Anal. Intell. Syst. (2019) 587.
[32] R. Fatima, A. Yasin, L. Liu, J. Wang, How persuasive is a phishing email? A phishing game for phishing awareness, J. Comput. Secur. 27 (6) (2019) 581–612 Jan, doi:10.3233/JCS-181253.
[33] A. Pandey, N. Gill, K. Sai Prasad Nadendla, I.S. Thaseen, Identification of phishing attack in websites using random forest-SVM hybrid model, in: Intelligent Systems Design and Applications, Springer International Publishing, Cham, 2020, pp. 120–128, doi:10.1007/978-3-030-16660-1_12. A. Abraham, A. K. Cherukuri, P. Melin, and N. Gandhi, in Advances in Intelligent Systems and Computing, vol. 941.
[34] M. Cova, C. Kruegel, G. Vigna, There is no free phish: an analysis of ‘free’ and live phishing kits, in: Proceedings of the 2nd Conference on USENIX Workshop on Offensive Technologies, in WOOT’08, USA, USENIX Association, 2008, pp. 1–8. Jul.
[35] T. Moore, R. Clayton, Examining the impact of website take-down on phishing, in: Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, in eCrime ’07, New York, NY, USA, Association for Computing Machinery, 2007, pp. 1–13, doi:10.1145/1299015.1299016. Oct.
[36] S. Minocha, B. Singh, A novel phishing detection system using binary modified equilibrium optimizer for feature selection, Comput. Electr. Eng. 98 (2022) 107689 Mar, doi:10.1016/j.compeleceng.2022.107689.
[37] C.L. Tan, K.L. Chiew, K. Wong, S.N. Sze, PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder, Decis. Support Syst. 88 (2016) 18–27, doi:10.1016/j.dss.2016.05.005.
[38] N. Abdelhamid, A. Ayesh, F. Thabtah, Phishing detection based associative classification data mining, Expert Syst. Appl. 41 (13) (2014) 5948–5959 Oct, doi:10.1016/j.eswa.2014.03.019.
[39] M.A. Jabbar, B.L. Deekshatulu, P. Chandra, Knowledge discovery using associative classification for heart disease prediction, Adv. Intell. Syst. Comput. 182 (2013) 29–39 AISC, doi:10.1007/978-3-642-32063-7_4.
[40] F. Thabtah, P. Cowling, Y. Peng, MCAR: multi-class classification based on association rule, in: Proceedings of the 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005, 2005, p. 33, doi:10.1109/AICCSA.2005.1387030. Jan.
[41] G. Costa, R. Ortale, E. Ritacco, X-Class: Associative classification of XML documents by structure, ACM Trans. Inf. Syst. 31 (1) (2013), doi:10.1145/2414782.2414785.
[42] L.A.T. Nguyen, B.L. To, H.K. Nguyen, M.H. Nguyen, Detecting phishing web sites: a heuristic URL-based approach, in: Proceedings of the 2013 International Conference on Advanced Technologies for Communications ATC 2013, 2013, pp. 597–602, doi:10.1109/ATC.2013.6698185. Oct.
[43] Y. Zhang, J.I. Hong, L.F. Cranor, Cantina: a content-based approach to detecting phishing web sites, in: Proceedings of the 16th International Conference on World Wide Web, in WWW ’07, New York, NY, USA, Association for Computing Machinery, 2007, pp. 639–648, doi:10.1145/1242572.1242659. May.
[44] A.K. Jha, R. Muthalagu, P.M. Pawar, Intelligent phishing website detection using machine learning, Multimed. Tools Appl. (2023) Feb, doi:10.1007/s11042-023-14731-4.
[45] A.K. Jain, S. Parashar, P. Katare, I. Sharma, PhishSKaPe: a content based approach to escape phishing attacks, Procedia Comput. Sci. 171 (2020) 1102–1109 Jan, doi:10.1016/j.procs.2020.04.118.
[46] B. Wardman, T. Stallings, G. Warner, A. Skjellum, High-performance content-based phishing attack detection, in: 2011 eCrime Researchers Summit, IEEE, San Diego, USA, 2011, pp. 1–9, doi:10.1109/eCrime.2011.6151977. Nov.
[47] K. Komiyama, T. Seko, Y. Ichinose, K. Kato, K. Kawano, H. Yoshiura, In-depth evaluation of content-based phishing detection to clarify its strengths and limitations, in: U- and E-Service, Science and Technology, Springer, Berlin, Heidelberg, 2010, pp. 95–106, doi:10.1007/978-3-642-17644-9_11. T. Kim, J. Ma, W. Fang, B. Park, B.H. Kang, and D. Ślęzak, in Communications in Computer and Information Science.
[48] S. Afroz, R. Greenstadt, PhishZoo: detecting phishing websites by looking at them, in: Proceedings of the 2011 IEEE 5th International Conference on Semantic Computing, Palo Alto, CA, USA, IEEE, 2011, pp. 368–375, doi:10.1109/ICSC.2011.52. Sep.
[49] A. Abuzuraiq, M. Alkasassbeh, M. Almseidin, Intelligent methods for accurately detecting phishing websites, in: Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, IEEE, 2020, pp. 085–090, doi:10.1109/ICICS49469.2020.239509. Apr.
[50] P.M. Al-kasassbeh, Intelligent methods for accurately detecting phishing websites, in: Proceedings of the 2020 11th International Conference on Information and Communication Systems ICICS, 2020 Jan. Accessed: Nov. 13, 2023. [Online]. Available:.
[51] K.L. Chiew, E. Chang, S. Sze, W. Tiong, Available online utilisation of website logo for phishing detection, Comput. Secur. 54 (2015) Aug, doi:10.1016/j.cose.2015.07.006.
[52] H.Y.A. Abutair, A. Belghith, Using case-based reasoning for phishing detection, Procedia Comput. Sci. 109 (2017) 281–288 Jan, doi:10.1016/j.procs.2017.05.352.
[53] D. Sahoo, C. Liu, and S. C. H. Hoi, “Malicious URL detection using machine learning: a survey.” arXiv, Aug. 21, 2019.
[54] Z.H. Zhou, Ensemble learning, in: Machine Learning, Springer, Singapore, 2021, pp. 181–210, doi:10.1007/978-981-15-1967-3_8. Z.H. Zhou.
[55] M. Al-Sarem, et al., An optimized stacking ensemble model for phishing websites detection, Electronics 10 (11) (2021) Art. no. 11, Jan, doi:10.3390/electronics10111285.
[56] J. Abawajy, A. Kelarev, A multi-tier ensemble construction of classifiers for phishing email detection and filtering, in: Cyberspace Safety and Security, Springer, Berlin, Heidelberg, 2012, pp. 48–56, doi:10.1007/978-3-642-35362-8_5. Y. Xiang, J. Lopez, C.C. J. Kuo, and W. Zhou, in Lecture Notes in Computer Science.
[57] P. Bountakas, C. Xenakis, HELPHED: hybrid ensemble learning phishing email detection, J. Netw. Comput. Appl. 210 (2023) 103545 Jan, doi:10.1016/j.jnca.2022.103545.
[58] “Choosing the right estimator,” scikit-learn. Accessed: Nov. 14, 2023. [Online]. Available: https://scikit-learn/stable/tutorial/machine_learning_map/index.html.
[59] “1.1. Linear Models,” scikit-learn. Accessed: Nov. 14, 2023. [Online]. Available: https://scikit-learn/stable/modules/linear_model.html.
[60] D. Anguita, A. Ghio, N. Greco, L. Oneto, S. Ridella, Model selection for support vector machines: advantages and disadvantages of the machine learning theory, in: Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), 2010, pp. 1–8, doi:10.1109/IJCNN.2010.5596450. Jul.
[61] “4. Supervised learning: models and concepts - machine learning and data science blueprints for finance [Book].” Accessed: Nov. 14, 2023. [Online]. Available: https://www.oreilly.com/library/view/machine-learning-and/9781492073048/ch04.html.
[62] E.A. Zanaty, Support vector machines (SVMs) versus multilayer perception (MLP) in data classification, Egypt. Inform. J. 13 (3) (2012) 177–183 Nov, doi:10.1016/j.eij.2012.08.002.
[63] C.W. Hsu, C.J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425, doi:10.1109/72.991427.
[64] N. Abdelhamid, “Website phishing.” UCI Machine Learning Repository, 2014.
[65] I.H. Witten, E. Frank, Data mining: practical machine learning tools and techniques with Java implementations, ACM SIGMOD Rec. 31 (1) (2002) 76–77 Mar, doi:10.1145/507338.507355.
[66] R.M. Mohammad, F. Thabtah, L. McCluskey, An assessment of features related to phishing websites using an automated technique, in: Proceedings of the 2012 International Conference for Internet Technology and Secured Transactions, 2012, pp. 492–497. Dec.
[67] S.N. Wan Ahmad, Comparative performance of machine learning methods for classification on phishing attack detection, Int. J. Adv. Trends Comput. Sci. Eng. 9 (1.5) (2020) 349–354 Sep, doi:10.30534/ijatcse/2020/4991.52020.
[68] D. K. Srivastava and L. Bhambhu, “Data classification using support vector machine,” 2005.
[69] D. Wahyudi, M. Niswar, A.A.P. Alimuddin, Website phising detection application using support vector machine (SVM), J. Inf. Technol. Its Util. 5 (1) (2022) 18–24 Jun, doi:10.56873/jitu.5.1.4836.
[70] M. Nabet, L. George, Phishing attacks detection by using support vector machine, J. Al-Qadisiyah Comput. Sci. Math. 15 (2023) Sep, doi:10.29304/jqcm.2023.15.2.1242.
[71] D. Aksu, A. Abdulwakil, and M. A. Aydin, “Detecting phishing websites using support vector machine algorithm,” presented at the Pressacademia, Jun. 2017, pp. 139–142. doi:10.17261/Pressacademia.2017.582.
[72] A. Altaher, Phishing websites classification using hybrid SVM and KNN approach, Int. J. Adv. Comput. Sci. Appl. 8 (6) (2017), doi:10.14569/IJACSA.2017.080611.
[73] R. Karnik and D. G. M. Bhandari, “Support vector machine based malware and phishing website detection,” 2016. Accessed: Nov. 14, 2023. [Online]. Available: https://www.semanticscholar.org/paper/Support-Vector-Machine-Based-Malware-and-Phishing-Karnik-Bhandari/ffea603ec9f33931c9de630ba1a6ac71924f1539.
[74] A. Mandadi, S. Boppana, V. Ravella, R. Kavitha, Phishing website detection using machine learning, in: Proceedings of the 2022 IEEE 7th International Conference for Convergence in Technology (I2CT), 2022, pp. 1–4, doi:10.1109/I2CT54291.2022.9824801. Apr.
[75] A.K. Dutta, Detecting phishing websites using machine learning technique, PLoS ONE 16 (10) (2021) e0258361 Oct, doi:10.1371/journal.pone.0258361.
[76] S. Alnemari, M. Alshammari, Detecting phishing domains using machine learning, Appl. Sci. 13 (8) (2023) Art. no. 8, Jan, doi:10.3390/app13084649.
[77] Z. Alshingiti, R. Alaqel, J. Al-Muhtadi, Q.E.U. Haq, K. Saleem, M.H. Faheem, A deep learning-based phishing detection system using CNN, LSTM, and LSTM-CNN, Electronics 12 (1) (2023) Art. no. 1, Jan, doi:10.3390/electronics12010232.
[78] Md.A.A. Siddiq, M. Arifuzzaman, M.S. Islam, Phishing website detection using deep learning, in: Proceedings of the 2nd International Conference on Computing Advancements, in ICCA ’22, New York, NY, USA, Association for Computing Machinery, 2022, pp. 83–88, doi:10.1145/3542954.3542967. Aug.