Phishing Detection Using Clustering and Machine Learning
Corresponding Author:
Yahia Hasan Jazyah
Faculty of Computer Studies, Arab Open University
St. Mohammed Nazzal Al-Moassab, Ardiya, Kuwait
Email: [email protected]
1. INTRODUCTION
Phishing is a pervasive and insidious form of cybercrime that preys on human psychology and
technical vulnerabilities. It involves the use of deceptive techniques to trick individuals or organizations into
divulging sensitive information, such as login credentials, financial details, or personal data. Phishing attacks
are often the initial entry point for broader cyber threats, including identity theft, fraud, and malware infections.
To combat this growing menace, effective phishing detection methods have become indispensable.
The sophistication of phishing attacks continues to evolve, making it a challenging task to thwart these
threats. Cybercriminals utilize a variety of tactics, including misleading emails, fraudulent websites, and social
engineering strategies that exploit human trust and curiosity. The dynamic nature of these attacks means that
traditional, static security measures are often ineffective. This has led to the development of advanced and
adaptive techniques for detecting and mitigating phishing attempts.
Phishing detection involves the identification and prevention of deceptive or malicious content within
emails, websites, or other digital communication channels. It encompasses a broad spectrum of methods,
ranging from rule-based filters and signature-based systems to more advanced approaches that leverage
artificial intelligence (AI), machine learning (ML), and behavioral analysis. As cybercriminals constantly
refine their tactics to bypass conventional defenses, the need for innovative and responsive detection
mechanisms has become increasingly pressing.
In this context, this paper explores the landscape of phishing detection, addressing both the existing
challenges and the latest advancements in the field. It also proposes a hybrid algorithm that combines
clustering and classification using deep learning (DL) and decision tree (DT), evaluated on two different datasets (DSs).
The main contributions of this work are threefold:
− A robust hybrid algorithm for detecting phishing attacks is proposed. It combines clustering, classification,
and the stability-correlation and correlation (ScC) feature selection method, with speed and simplicity in mind.
− A thorough statistical analysis of the proposed method is carried out using well-known phishing datasets.
− The proposed method is compared with other well-known methods for detecting phishing websites
from the literature, with methods that use DL and DT together with ScC, rough set (RS), and
principal component analysis (PCA), and with methods that use DL and DT together with RS-K-mean and
PCA-K-mean.
The remainder of this article is organized as follows: section 2 compares phishing
detection algorithms, section 3 gives preliminaries on feature selection methods, section 4 presents the
proposed algorithm, section 5 discusses the complexity of the proposed algorithm, and section 6 concludes
the paper.
3. PRELIMINARIES
3.1. Feature selection methods
Feature selection methods for reducing the size of datasets are classified into three groups: filter,
wrapper, and embedded [1], [2]. A filter method calculates a score for each feature and selects all features
whose scores exceed a predefined threshold. Wrapper methods, on the other hand, use a classifier
to evaluate the effectiveness of various reducts and choose the best of them; they are more powerful than filter
methods but also more complex. In contrast to wrapper methods, embedded methods perform feature
selection as part of the training procedure. In this work, filter methods were applied to select the most
representative attributes, that is, those carrying the most information for distinguishing between classes. The
applied methods are explained next. Several simple ML algorithms may also work jointly (a hybrid approach) to
complement and enhance each other. Table 3 presents a comparison of different hybrid
phishing website detection methods in terms of accuracy.
where Dependency(SI, A) is the dependency score of feature A with respect to the set of instances SI, SI is the
set of instances for which the feature is evaluated, A is the feature for which the dependency is measured, |SI|
is the number of instances in SI, |U| is the total number of instances in the dataset, and |SI|_A is the number of
distinct values of feature A in SI. The dependency score measures the significance of feature A in
discriminating between different classes or values of the target variable within the set of instances SI. A higher
dependency score indicates a stronger dependency, and therefore the feature is considered more important
for classification.
relevant features. The first method is stability-correlation (Sc) feature selection, which combines
stability-based and correlation-based criteria to select features: it aims to identify features that are stable across
different subsets of the data and highly correlated with the target variable or class labels. The
stability-correlation score is calculated using (2).
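\[ S = \frac{\mathrm{mode}(X_i)}{n} \qquad (2) \]
\[ St_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \qquad (4) \]
\[ St_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \qquad (5) \]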
where n is the number of pairs of data used, Σ denotes summation, x̄ is the mean of all
x-values, ȳ is the mean of all y-values, Stx is the standard deviation of variable x, and Sty is the standard
deviation of variable y.
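As an illustration of how these quantities could drive a filter-style selection, the following Python sketch computes the stability score in (2), the sample standard deviations in (4) and (5), and a correlation score. Several points are assumptions rather than statements from this paper: mode(X_i) is read as the frequency of the most common value of feature X_i, the correlation is taken to be the standard Pearson coefficient (the corresponding equation does not appear in this excerpt), and the feature names and the threshold value are hypothetical.

```python
from collections import Counter
from math import sqrt


def stability(values):
    """Stability score as in (2): count of the most frequent value divided by n.

    Assumption: mode(X_i) is interpreted as the frequency of the modal value.
    """
    mode_count = Counter(values).most_common(1)[0][1]
    return mode_count / len(values)


def sample_std(values):
    """Sample standard deviation with the n - 1 denominator, as in (4) and (5)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson(x, y):
    """Standard Pearson correlation (assumed form), built on sample_std."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    std_x, std_y = sample_std(x), sample_std(y)
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column carries no linear relationship
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return covariance / (std_x * std_y)


def filter_features(columns, target, threshold):
    """Filter method: keep features whose |correlation| with the target exceeds the threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) > threshold]


# Hypothetical toy data: two binary features and a binary class label.
columns = {
    "has_ip": [1, 1, 0, 0, 1, 0],
    "url_len_flag": [1, 0, 1, 0, 1, 0],
}
target = [1, 1, 0, 0, 1, 0]
print(filter_features(columns, target, threshold=0.5))
```

On this toy data, only the feature that tracks the class label exactly passes the threshold; how the stability and correlation scores are combined into a single ranking is not specified in this excerpt and is left out of the sketch.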
where: EV(PC_k) is the proportion of the total variance explained by the kth principal component (PC),
eigenvalue_k is the eigenvalue associated with the kth principal component, and total eigenvalues is the sum
of all eigenvalues obtained from PCA.
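The definition above amounts to dividing each eigenvalue by the sum of all eigenvalues. A minimal Python sketch of that ratio follows; the eigenvalues are made-up numbers, and the cumulative-variance target used to pick the number of components is an assumption for illustration, not a setting reported in this paper.

```python
def explained_variance_ratios(eigenvalues):
    """EV(PC_k) = eigenvalue_k / sum of all eigenvalues, for every k."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_needed(eigenvalues, target):
    """Smallest number of leading components whose cumulative EV reaches the target.

    The target (for example 0.75) is a hypothetical retention level.
    """
    ratios = explained_variance_ratios(sorted(eigenvalues, reverse=True))
    cumulative = 0.0
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return k
    return len(ratios)


# Hypothetical eigenvalues from a PCA on four features.
eigenvalues = [4.0, 2.0, 1.0, 1.0]
print(explained_variance_ratios(eigenvalues))  # [0.5, 0.25, 0.125, 0.125]
print(components_needed(eigenvalues, target=0.75))  # 2
```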
datasets, making them a powerful tool for various ML and AI tasks. DL models, such as convolutional and recurrent
neural networks, can analyse email content and patterns in network traffic to detect phishing. However, they can be
computationally intensive and require substantial amounts of labeled data for training. Proper architecture
selection, hyperparameter tuning, and data preprocessing are crucial for the successful deployment of DL
models.
Where: true positives (TP) is the number of phishing emails or websites correctly identified as phishing, true
negatives (TN) is the number of legitimate (non-phishing) emails or websites correctly identified as
non-phishing, false positives (FP) is the number of legitimate emails or websites incorrectly identified as
phishing (a type of error), false negatives (FN) is the number of phishing emails or websites incorrectly
identified as legitimate (another type of error).
− Area under curve (AUC): it is one of the best-known measurements in the ML area and is used to assess the
performance of binary classification models. It quantifies the ability of a model to distinguish between two
classes (positive and negative) by measuring the area under the receiver operating characteristic (ROC) curve.
− Precision: it measures how accurately a model identifies positive instances, that is, the fraction of the
instances it has classified as positive that are true positives (TP). It reflects the ability of the model to avoid
FP. Precision is calculated by (8).
\[ \mathit{Precision} = \frac{TP}{TP + FP} \qquad (8) \]
− Recall: it is another vital metric for evaluating ML algorithms, also known as sensitivity or true positive
rate. It assesses the performance of a binary classification model by measuring the ability of the
model to correctly identify all relevant instances of the positive class. It is calculated using (9).
\[ \mathit{Recall} = \frac{TP}{TP + FN} \qquad (9) \]
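The measurements above can be sketched in a few lines of Python. Precision and recall follow (8) and (9). The formulas for accuracy and F-measure do not appear in this excerpt, so their standard definitions are assumed here, and AUC is computed through its pairwise-ranking interpretation (the probability that a randomly chosen positive instance is scored above a randomly chosen negative one). The confusion-matrix counts and scores are made-up numbers, not results from this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision (8), recall (9), and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores.

    Each (positive, negative) pair counts 1 if the positive is scored higher,
    0.5 on a tie, and 0 otherwise. Labels are 1 (phishing) and 0 (legitimate).
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical counts: 90 phishing sites caught, 10 missed, 15 false alarms.
print(classification_metrics(tp=90, tn=85, fp=15, fn=10))
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

The pairwise form of AUC is quadratic in the number of instances; it is used here only because it makes the definition explicit.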
For the two reduced datasets, the proposed method achieved the highest accuracy with both DT and DL
compared with the feature selection methods ScC, RS, and PCA applied alone. Applying K-mean
clustering to the filter feature selection methods (ScC, RS, and PCA) significantly improved detection accuracy
on DS1. With the DL classifier, accuracy rose from 92.15%, 94.33%, and 93.86% to 99.37%, 99.53%, and 99.53%
for ScC-K-mean, RS-K-mean, and PCA-K-mean, respectively. With the DT classifier, it rose from 91.61%,
92.82%, and 86.36% to 100%, 98.16%, and 97.75%, respectively. A comparison
was also made for DS2, where the improvement was likewise significant.
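To make the size of these DS1 improvements explicit, the short Python sketch below computes the gain in percentage points for each method and classifier. The accuracy values are the ones reported above; the dictionary layout and function name are only illustrative.

```python
# DS1 accuracies (%) reported in the text, before and after K-mean clustering.
before = {
    "DL": {"ScC": 92.15, "RS": 94.33, "PCA": 93.86},
    "DT": {"ScC": 91.61, "RS": 92.82, "PCA": 86.36},
}
after = {
    "DL": {"ScC": 99.37, "RS": 99.53, "PCA": 99.53},
    "DT": {"ScC": 100.00, "RS": 98.16, "PCA": 97.75},
}


def accuracy_gains(before, after):
    """Gain in percentage points for every (classifier, feature selection) pair."""
    return {clf: {fs: round(after[clf][fs] - before[clf][fs], 2)
                  for fs in before[clf]}
            for clf in before}


for clf, gains in accuracy_gains(before, after).items():
    print(clf, gains)
```

The largest gain on DS1 is for PCA with the DT classifier (11.39 percentage points), and the smallest is for RS with the DL classifier (5.20 percentage points).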
Comparing the three suggested hybrid methods (ScC-K-mean, RS-K-mean, and PCA-K-mean), the
ScC-K-mean method using DT achieved the best accuracy, with 100% for DS1 and 97.2% for
DS2. For AUC and precision, the ScC-K-mean method shows higher performance (mostly 100%) in most of
the readings than RS-K-mean and PCA-K-mean, and it also attains very high values for recall and F-measure. The
results clearly show that applying K-mean clustering after the feature selection methods (ScC-K-mean,
RS-K-mean, and PCA-K-mean) achieves higher performance in terms of accuracy, AUC, precision, recall, and
F-measure for the selected DSs, and that ScC-K-mean performs best among them.
In this work, K-mean clustering proved effective: applying it after feature selection with RS,
ScC, and PCA performs better than feature selection alone (RS, ScC, and PCA). This benefit, however, depends
on the specific problem being solved and the nature of the data, since each of these techniques serves different
purposes and excels in distinct scenarios.
The performance of the proposed approach was compared with other hybrid approaches used in
previous work on detecting phishing websites. The accuracy of the hybrid approach using a genetic
algorithm (GA) was 91.13% [23], while it was 95.76% using a features-based ANN and K-medoids clustering
algorithm [24]. The study of Suleman and Awan [25], using yet another generating genetic algorithm (YAGGA),
reached an accuracy of 95%. Meanwhile, the study of Vrbančič et al. [26], using the bat algorithm and hybrid
bat algorithm, gave an accuracy of 96.5%. For the work that used DPI, SDN, and ANN, the accuracy was
98.39% [27]. The accuracy of the method that uses ScC and forward feature selection methods was 92.56%
[31]. Since the highest accuracy of our proposed method on the UCI phishing websites dataset was 100% using the DT
classifier and 99.37% using the DL classifier, we conclude that our approach achieves the highest accuracy
among the compared approaches.
Figures 3(a) and 3(b) compare the measurements before and after applying K-mean clustering
for RS, Figures 4(a) and 4(b) compare them for ScC, and Figures 5(a) and 5(b) compare them
for PCA.
Figure 3. Comparisons between measurements before and after K-mean clustering – RS for (a) DS1 and (b) DS2
Figure 4. Comparisons between measurements before and after clustering – ScC for (a) DS1 and (b) DS2
Figure 5. Comparisons between measurements before and after clustering – PCA for (a) DS1 and (b) DS2
to assign data points to clusters and update cluster centroids. The time complexity of K-mean is typically
O(n*k*I*d), where n is the number of data points (instances), k is the number of clusters, I is the number of
iterations, and d is the number of features (dimensions).
The number of iterations (I) can vary. K-mean typically converges relatively quickly, but it is not
guaranteed to converge to a global optimum. Clustering itself does not directly affect the time complexity of
feature selection methods. However, there can be indirect relationships between clustering and feature selection
that may impact the overall computational complexity of an ML pipeline, such as preprocessing, feature
importance, data size, and parallelization.
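The O(n*k*I*d) bound can be read directly off a plain implementation of the assignment and update steps. The Python sketch below is a generic K-mean loop, not the implementation used in this work: it initializes the centroids with the first k points for the sake of determinism (practical implementations usually use random or k-means++ initialization), and it counts distance evaluations, each of which costs d operations, to show where the n*k*I factor comes from. The toy points are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance; costs d operations for d-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Plain K-mean loop. Returns (labels, centroids, iterations, distance_evals)."""
    centroids = [list(p) for p in points[:k]]  # simplistic deterministic init
    labels = None
    iterations = 0
    distance_evals = 0
    for _ in range(max_iter):
        iterations += 1
        # Assignment step: n points times k centroids distance evaluations.
        new_labels = []
        for p in points:
            dists = [squared_distance(p, c) for c in centroids]
            distance_evals += k
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, iterations, distance_evals


# Hypothetical 2-D data with two well-separated groups (n = 6, d = 2).
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centroids, iterations, distance_evals = kmeans(points, k=2)
print(labels)          # [0, 1, 0, 0, 1, 1]
print(iterations)      # 2
print(distance_evals)  # 24, i.e. n * k * I = 6 * 2 * 2
```

Because the result depends on the initial centroids, the loop may settle in a local optimum, which is the caveat noted above.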
6. CONCLUSION
The choice among RS, ScC, PCA, K-mean, DT, and DL depends on the specific problem, data, and
goals. Each of these techniques has its strengths and weaknesses, and the right choice should be based on the
characteristics of the analysis. This research proposes a hybrid method (ScC-K-mean) that combines traditional
feature selection with K-mean clustering and classification using DL and DT, and evaluates it on two different DSs
of phishing detection data. Simulation results show that the proposed algorithm outperforms the
traditional tested methods that use DL and DT together with ScC, RS, and PCA. It also outperforms the other
proposed hybrid methods that use DL and DT together with RS-K-mean and PCA-K-mean, as well as the other hybrid
methods explained earlier in section 4. Future studies will include comparing the proposed algorithm with other
ML algorithms and investigating prospects for developing tools to improve the performance of the proposed
algorithm. The proposed method could also be tested against further high-dimensional phishing datasets,
semi-structured and unstructured phishing datasets, and other types of attacks such as spam and malware.
REFERENCES
[1] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–
1182, 2003.
[2] A. K. Das, S. Sengupta, and S. Bhattacharyya, “A group incremental feature selection for classification using rough set theory based
genetic algorithm,” Applied Soft Computing Journal, vol. 65, pp. 400–411, 2018, doi: 10.1016/j.asoc.2018.01.040.
[3] A. K. Jain, S. Parashar, P. Katare, and I. Sharma, “PhishSKaPe: A content based approach to escape phishing attacks,” Procedia
Computer Science, vol. 171, pp. 1102–1109, 2020, doi: 10.1016/j.procs.2020.04.118.
[4] A. Anhari, “Alexa dataset,” Kaggle, 2023. Accessed: May 1, 2023. [Online]. Available:
https://www.kaggle.com/datasets/aanhari/alexa-dataset
[5] OpenPhish, “OpenPhish - Phishing Intelligence,” Open Phish, 2021. Accessed: May 1, 2023. [Online]. Available:
https://openphish.com/
[6] “PhishTank | Join the fight against phishing,” Phish Tank. Accessed: October 1, 2023. [Online]. Available:
https://www.phishtank.com/index.php
[7] G. Sonowal and K. S. Kuppusamy, “PhiDMA – A phishing detection model with multi-filter approach,” Journal of King Saud
University - Computer and Information Sciences, vol. 32, no. 1, pp. 99–112, 2020, doi: 10.1016/j.jksuci.2017.07.005.
[8] “Phishload dataset,” Phish Load, 2023. Accessed: May 1, 2023. [Online]. Available:
https://www.medien.ifi.lmu.de/team/max.maurer/files/phishload/download.html
[9] R. S. Rao, A. R. Pais, and P. Anand, “A heuristic technique to detect phishing websites using TWSVM classifier,” Neural
Computing and Applications, vol. 33, no. 11, pp. 5733–5752, 2021, doi: 10.1007/s00521-020-05354-z.
[10] M. Babagoli, M. P. Aghababa, and V. Solouk, “Heuristic nonlinear regression strategy for detecting phishing websites,” Soft
Computing, vol. 23, no. 12, pp. 4315–4327, 2019, doi: 10.1007/s00500-018-3084-2.
[11] R. M. Mohammad, F. Thabtah, and L. Mccluskey, “Phishing websites features,” School of Computing and Engineering, University
of Huddersfield, pp. 1–7, 2015.
[12] R. Mohammad and L. McCluskey, “Phishing websites,” UCI Machine Learning Repository, 2012, doi: 10.24432/C51W2X.
[13] K. L. Chiew, C. L. Tan, K. S. Wong, K. S. C. Yong, and W. K. Tiong, “A new hybrid ensemble feature selection framework for
machine learning-based phishing detection system,” Information Sciences, vol. 484, pp. 153–166, 2019, doi:
10.1016/j.ins.2019.01.064.
[14] M. M. Yadollahi, F. Shoeleh, E. Serkani, A. Madani, and H. Gharaee, “An adaptive machine learning based approach for phishing
detection using hybrid features,” 2019 5th International Conference on Web Research, ICWR 2019, pp. 281–286, 2019, doi:
10.1109/ICWR.2019.8765265.
[15] S. Smadi, N. Aslam, and L. Zhang, “Detection of online phishing email using dynamic evolving neural network based on
reinforcement learning,” Decision Support Systems, vol. 107, pp. 88–102, 2018, doi: 10.1016/j.dss.2018.01.001.
[16] J. Nazario, “Index of /~jose/phishing,” Monkey. Accessed: May 1, 2023. [Online]. Available: https://monkey.org/~jose/phishing/
[17] “Index of /old/publiccorpus,” Spam Assassin, 2019. Accessed: May 1, 2023. [Online]. Available:
https://spamassassin.apache.org/old/publiccorpus/
[18] W. Wei, Q. Ke, J. Nowak, M. Korytkowski, R. Scherer, and M. Woźniak, “Accurate and fast URL phishing detector: A
convolutional neural network approach,” Computer Networks, vol. 178, 2020, doi: 10.1016/j.comnet.2020.107275.
[19] “Common crawl-open repository of web crawl data,” Common Crawl. Accessed: May 1, 2023. [Online]. Available:
http://commoncrawl.org/
[20] S. Smadi, N. Aslam, L. Zhang, R. Alasem, and M. A. Hossain, “Detection of phishing emails using data mining algorithms,” SKIMA
2015 - 9th International Conference on Software, Knowledge, Information Management and Applications, 2016, doi:
10.1109/SKIMA.2015.7399985.
[21] A. Subasi, E. Molah, F. Almkallawi, and T. J. Chaudhery, “Intelligent phishing website detection using random forest classifier,”
2017 International Conference on Electrical and Computing Technologies and Applications, ICECTA 2017, vol. 2018, pp. 1–5,
2017, doi: 10.1109/ICECTA.2017.8252051.
[22] “WEKA dataset,” Waikato GitHub, 2023. Accessed: May 1, 2023. [Online]. Available:
https://waikato.github.io/weka-wiki/datasets/
[23] W. Ali and A. A. Ahmed, “Hybrid intelligent phishing website prediction using deep neural networks with genetic algorithm-based
feature selection and weighting,” IET Information Security, vol. 13, no. 6, pp. 659–669, 2019, doi: 10.1049/iet-ifs.2019.0006.
[24] E. Zhu, Y. Ju, Z. Chen, F. Liu, and X. Fang, “DTOF-ANN: An artificial neural network phishing detection model based on decision
tree and optimal features,” Applied Soft Computing Journal, vol. 95, 2020, doi: 10.1016/j.asoc.2020.106505.
[25] M. T. Suleman and S. M. Awan, “Optimization of URL-based phishing websites detection through genetic algorithms,” Automatic
Control and Computer Sciences, vol. 53, no. 4, pp. 333–341, 2019, doi: 10.3103/S0146411619040102.
[26] G. Vrbančič, I. Fister, and V. Podgorelec, “Swarm intelligence approaches for parameter setting of deep learning neural network:
Case study on phishing websites classification,” ACM International Conference Proceeding Series, 2018, doi:
10.1145/3227609.3227655.
[27] T. Chin, K. Xiong, and C. Hu, “Phishlimiter: A phishing detection and mitigation approach using software-defined networking,”
IEEE Access, vol. 6, pp. 42513–42531, 2018, doi: 10.1109/ACCESS.2018.2837889.
[28] W. Chen, X. A. Wang, W. Zhang, and C. Xu, “Phishing detection research based on pso-bp neural network,” Advances in Internet,
Data & Web Technologies, vol. 17, pp. 990–998, 2018, doi: 10.1007/978-3-319-75928-9_91.
[29] A. Kumar, S. S. Roy, S. Saxena, and S. S. S. Rawat, “Phishing detection by determining reliability factor using rough set theory,”
Proceedings - 2013 International Conference on Machine Intelligence Research and Advancement, ICMIRA 2013, pp. 236–240,
2014, doi: 10.1109/ICMIRA.2013.51.
[30] L. Al-Shalabi, “New feature selection algorithm based on feature stability and correlation,” IEEE Access, vol. 10, pp. 4699–4713,
2022, doi: 10.1109/ACCESS.2022.3140209.
[31] M. N. Alam, D. Sarma, F. F. Lima, I. Saha, R. E. Ulfath, and S. Hossain, “Phishing attacks detection using machine learning
approach,” Proceedings of the 3rd International Conference on Smart Systems and Inventive Technology, ICSSIT 2020, pp. 1173–
1179, 2020, doi: 10.1109/ICSSIT48917.2020.9214225.
[32] “UCI machine learning repository,” 2017. [Online]. Available: https://archive.ics.uci.edu
[33] K. Althobaiti, M. K. Wolters, N. Alsufyani, and K. Vaniea, “Using clustering algorithms to automatically identify phishing
campaigns,” IEEE Access, vol. 11, pp. 96502–96513, 2023, doi: 10.1109/ACCESS.2023.3310810.
[34] S. Mondal, D. Maheshwari, N. Pai, and A. Biwalkar, “A review on detecting phishing URLs using clustering algorithms,” 2019 6th
IEEE International Conference on Advances in Computing, Communication and Control, 2019, doi:
10.1109/ICAC347590.2019.9036837.
[35] M. Miškuf and I. Zolotová, “Comparison between multi-class classifiers and deep learning with focus on industry 4.0,” 2016
Cybernetics & Informatics (K&I), Levoca, Slovakia, 2016, pp. 1-5, doi: 10.1109/CYBERI.2016.7438633.
[36] S. H. Liu, Q. J. Sheng, B. Wu, Z. Z. Shi, and F. Hu, “Research on efficient algorithms for rough set methods,” Jisuanji
Xuebao/Chinese Journal of Computers, vol. 26, no. 5, pp. 524–529, 2003.
[37] N. Q. Do, A. Selamat, O. Krejcar, T. Yokoi, and H. Fujita, “Phishing webpage classification via deep learning‐based algorithms:
An empirical study,” Applied Sciences, vol. 11, no. 19, 2021, doi: 10.3390/app11199210.
[38] M. Zareapoor and K. R. Seeja, “Feature extraction or feature selection for text classification: a case study on phishing email
detection,” International Journal of Information Engineering and Electronic Business, vol. 7, no. 2, pp. 60–65, 2015, doi:
10.5815/ijieeb.2015.02.08.
BIOGRAPHIES OF AUTHORS
Yahia Hasan Jazyah received the B.S. degree in Communications and Electronics
Engineering from Applied Science University, Jordan, in 2000, the M.Sc. degree in Computer
Science from Amman Arab University, Jordan, in 2005, and the Ph.D. degree in Data
Telecommunications and Networks from the University of Salford, UK, in 2011. Since 2019, he
has been an Associate Professor of Information Technology and Computing at Arab Open
University, Kuwait. He is the author of many journal articles and conference proceedings. His
research interests include wireless routing protocols for UWB MANET, 5G, WSN, and BGP.
He is an academic reviewer for several international journals. He received the excellence
award in scientific research from the Arab Open University in Kuwait for the academic year
2018/2019. He can be contacted at email: [email protected] or [email protected].