Phishing Detection Based On Machine Learning and Feature Selection Methods
Nidal Alnidami
National Information Technology Center, Amman, Jordan
1 Introduction
The Internet is everywhere today, and society uses web services for a range of
activities such as sharing knowledge, social communication, and financial
transactions, including buying, selling, and transferring money. Malicious
websites are a severe threat to Internet users, and unaware users can become
victims of malicious URLs that host undesirable content such as spam, phishing,
drive-by downloads, and drive-by exploits. Phishing is a conventional attack on
the Internet, defined as the social engineering process of luring users to
fraudulent websites to obtain their personal or sensitive information, such as
user names, passwords, addresses, credit card details, social security
172 http://www.i-jim.org
Paper—Phishing Detection Based on Machine Learning and Feature Selection Methods
compares different machine learning algorithms to find which one is the most
efficient.
2 Related Works
This section discusses recent works that apply machine learning algorithms to
phishing detection.
Following a content-based approach, [9] proposes a novel method that uses a logo
image to determine the identity of a web page by matching real and fake web pages.
The approach consists of two phases: logo extraction and identity verification.
In the first phase, machine learning algorithms are used to detect the right logo
image. In the second phase, Google image search is used to retrieve the identity
associated with the logo, which is then used for verification. Because the relation
between a logo and its domain name is unique, the domain name is treated as the
identity of the logo. Comparing the domain name retrieved by Google with the one
from the web page query therefore distinguishes phishing from legitimate web pages.
The experimental results show that the logo extraction phase improved phishing
detection accuracy and is more useful than extraction phases based on textual
features. The system was evaluated using two datasets consisting of 1140 phishing
web pages obtained from Phish-Tank and legitimate web pages obtained from Alexa.
The authors selected only the eight most sensitive features out of 23, justifying
feature selection on the grounds that using all 23 features would be
time-consuming. The accuracy of the proposed system is 93.4%.
Other studies combine a heuristic-based approach with a machine learning
algorithm to improve the classification of web pages. Machine learning needs clear
features and an effective algorithm to produce an accurate classifier model that
distinguishes phishing from legitimate web pages. The work in [10] suggests a
heuristic-based phishing detection method for recognizing phishing sites. The
system first extracts URL-based features and then applies them to a machine
learning algorithm, which decides whether the web page is phishing or legitimate.
The system uses 10 features of the input URL dataset and implements feature
extraction from URL inputs using a .NET script. The outputs are categorized as
either Legitimate or Phishing. A Support Vector Machine is applied to the
extracted features to obtain the FP, TP, FN, and TN values, from which the
F1-measure and the accuracy, reported as 96%, are calculated. The URL dataset was
collected from Phish-Tank and the Yahoo directory and contains 200 legitimate and
phishing web page URLs.
Likewise, [11] implements a heuristic-based phishing detection approach that
feeds URL features to machine learning algorithms. The proposed method extracts
URL features of the web pages requested by the user and applies them to decide
whether a requested web page is phishing or not. To choose the classifier that is
most effective with URL-based features, the following machine learning techniques
are compared: support vector machine (SVM), naive Bayes, decision tree, k-nearest
neighbour (KNN), random tree, and random forest. The classifiers are trained and
evaluated on a dataset of 3,000 phishing web pages collected from Phish-Tank and
3,000 legitimate web pages from DMOZ, from which 26 URL-based features are
extracted. The experimental results show that Random Forest (RF) achieved the
best performance, with an accuracy of 98.23%.
Additionally, the authors of [12] propose a heuristic-based method to detect
phishing URLs using URL features. The system is evaluated on datasets consisting
of more than 16,000 phishing and 31,000 non-phishing URLs. A set of 138 features
is used to detect phishing URLs. The features are categorized into four groups:
lexical-based, keyword-based, reputation-based, and search engine-based features.
Seven classifiers are implemented: Support Vector Machine (SVM) with RBF kernel,
SVM with linear kernel, Multilayer Perceptron (MLP), Random Forest (RF), Naive
Bayes (NB), Logistic Regression (LR), and C4.5. According to the experimental
results, Random Forest (RF) achieved a higher accuracy rate and a lower error rate
than the other classifiers.
In the previous works, a heuristic-based approach is implemented with machine
learning algorithms. Each work has its own datasets, employs different features,
and applies several machine learning algorithms, but in both [11] and [12] Random
Forest achieved the most effective classification of web pages. Likewise, our work
uses a different dataset, different features, and different machine learning
algorithms, in addition to different feature selection techniques, and Random
Forest again shows the best results. The next two studies demonstrate hybrid
machine learning approaches that benefit from the strengths of each algorithm
while offsetting its weaknesses, because more effective techniques are needed to
limit the fast evolution of phishing attacks.
The study in [4] proposes a method that combines two algorithms: K-nearest
neighbors (KNN), which is effective against noisy data, and Support Vector Machine
(SVM), which is a robust classifier. The combination is done in two phases: KNN is
applied first, and SVM is then employed as the classification tool. The dataset
used for the experiment is taken from related work and contains more than 1353
samples gathered from various sources. Each sample consists of nine features and a
class label, which is Phishing, Legitimate, or Suspicious. In this way, the
simplicity of KNN is integrated with the effectiveness of SVM, avoiding the
disadvantages each has when used individually. The accuracy of the proposed method
is 90.04%. In [13], the authors propose a fast and accurate phishing detection
method that combines Naive Bayes (NB) and Support Vector Machine (SVM), using
features of URLs and web page contents. NB is used first to classify web pages. If
a web page cannot be classified confidently and remains suspicious, SVM is
employed to reclassify it. The learning dataset, generated from Phish-Tank,
contains 600 phishing web pages and 400 legitimate ones; 100 legitimate and 100
phishing web pages are used as the training set, and the rest as the testing set.
Experimental results show that the proposed approach achieved high detection
accuracy and a lower detection time.
The dataset used in this study is provided by Chiew et al. [14] and is composed of
48 features extracted from 5000 phishing web pages and 5000 legitimate web pages.
The phishing web pages were collected from Phish-Tank and Open-Phish, while the
legitimate web pages were collected from Alexa and Common Crawl. These web pages
were downloaded in two distinct sessions, from January to May 2015 and from May to
June 2017. A browser automation framework is employed for feature extraction,
which is more accurate and robust than parsing techniques based on regular
expressions. The features in this dataset are classified into three groups:
address bar-based, abnormal-based, and HTML/JavaScript-based features. Address
bar-based features are taken from the URL of the web page, such as the URL's
length and port number; abnormal-based features capture abnormal actions on the
web page, such as downloading objects from external domains; and
HTML/JavaScript-based features describe HTML and JavaScript methods placed in the
source code of the web page [15]. We chose this dataset because it is the most
recent dataset in this field.
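To make the address bar-based group concrete, the following minimal sketch derives the two features named in the text, URL length and port number, from a URL string. It uses Python's standard `urllib.parse` module; the function name `address_bar_features` and the example URL are illustrative assumptions, and this is not the browser automation pipeline used to build the dataset.

```python
from urllib.parse import urlsplit


def address_bar_features(url: str) -> dict:
    """Return two illustrative address bar-based features for a URL.

    Hypothetical helper for illustration only; the dataset's own
    extraction was done with a browser automation framework.
    """
    parts = urlsplit(url)
    return {
        "url_length": len(url),  # total number of characters in the URL
        "port": parts.port,      # explicit port number, or None if absent
    }


# Example URL chosen for illustration (not taken from the dataset).
example_url = "http://example.com:8080/login"
features = address_bar_features(example_url)
no_port = address_bar_features("http://example.com/login")
```

A URL without an explicit port yields `None` for the port feature, which a real pipeline would have to encode numerically before training.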
layer, and output layer each has its own functionality (Fig. 3): the input layer
receives the signal, the output layer produces a decision about the input, and at
least one hidden layer acts as the computational engine of the MLP. It is usually
applied to supervised learning problems: it is trained on a set of input-output
pairs and learns the correlations and dependencies among them [20].
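The layer roles described above can be sketched as a forward pass through a tiny network. This is only an illustration under stated assumptions: the sigmoid activation, the layer sizes, and all weight values are hypothetical, and training on input-output pairs is not shown.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = dense(inputs, hidden_w, hidden_b)  # hidden layer does the computation
    return dense(hidden, out_w, out_b)          # output layer yields the decision score


# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.2], [0.1, 0.4]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]

score = mlp_forward([1.0, 0.0], hidden_w, hidden_b, out_w, out_b)[0]
```

In a binary setting, the output score could be thresholded (for example at 0.5) to choose between the two classes.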
5 Feature Selection
Feature selection is employed to decrease the size of the data in order to enhance
the model's performance and reduce the computation time. Simply put, feature
selection keeps the most important fields and eliminates unimportant ones, while
still giving useful and robust results. In this work, different feature selection
methods are used to enhance the phishing detection method by increasing the
accuracy rate and decreasing the time taken to build the model.
used to assess the best subsets. Hybrid methods obtained high accuracy and high
efficiency rates [21].
The aim of this study is to assess different feature selection techniques in terms
of accuracy and computation time. Out of the 48 features used in phishing
detection, some are not essential for detecting phishing web pages. Therefore, the
essential features that are particularly effective in phishing detection are
extracted from the original dataset, as discussed in the results section.
Experiments were carried out on different filter methods of feature selection,
such as InfoGain, ReliefF, PCA, and attribute. InfoGain and ReliefF were chosen
for our work because they attain better accuracy rates than the remaining
techniques.
• InfoGain: It shows the significance of the features and determines which of
them is the most helpful for distinguishing among the classes. The value of
InfoGain is calculated on the training dataset. It is used in decision tree
algorithms because it helps in deciding the best split: a high value indicates a
good split, and a low value indicates that the split is not good enough. Equation
(1) is used to estimate the value of an attribute by calculating the information
gain with respect to the class [17].
• ReliefF: As a filter-based feature selection method, Relief evaluates the
quality of every feature in the context of the other features and the relevance
of the feature to the given target concept [22]. The algorithm produces a value
between -1 and 1 for every feature, with larger positive numbers designating more
significant attributes. The weight of an attribute is updated iteratively, and it
has a probabilistic interpretation. The fundamental principle of Relief is that
important attributes take similar values for nearby instances of the same class
and different values for nearby instances of different classes.
InfoGain(Class,Attribute) = H(Class)−H(Class|Attribute) (1)
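Equation (1) can be worked through on a small example. The sketch below computes H(Class) and H(Class|Attribute) with Shannon entropy in bits and subtracts them. It assumes discrete attribute values, and the toy labels and attribute columns are made up for illustration rather than taken from the dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Class) in bits for a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(attribute_values, labels):
    """InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # H(Class | Attribute): entropy of each group, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): four pages, two per class.
labels = ["phishing", "phishing", "legitimate", "legitimate"]
separating_attribute = [1, 1, 0, 0]     # splits the classes perfectly
uninformative_attribute = [1, 0, 1, 0]  # says nothing about the class

gain_separating = info_gain(separating_attribute, labels)
gain_uninformative = info_gain(uninformative_attribute, labels)
```

On this toy data the perfectly separating attribute recovers the full one bit of class entropy, while the uninformative attribute recovers none, which is the ordering a feature ranking relies on.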
6 Model Evaluation
There are many tools for evaluating models. We choose to evaluate our model using
accuracy because the dataset used is binary and balanced, so accuracy rates are a
sufficient and efficient measure. Before applying the accuracy formula, note that
there are two kinds of classification according to the number of classes: binary
classification, which has only two classes, and multi-class classification, which
has more than two. In the binary case (Fig. 4), assume we have two classes, P for
the positive class and N for the negative class [23].
• True Positive (TP): a positive sample correctly classified as positive. The
predicted value is positive, and the actual value is also positive
• False Positive (FP): a negative sample incorrectly classified as positive
• True Negative (TN): a negative sample correctly classified as negative. The
predicted value is negative, and the actual value is also negative
• False Negative (FN): a positive sample incorrectly classified as negative
Accuracy refers to the ratio of correctly classified instances to all instances.
It is the most widely used metric for evaluating the performance of binary
classification models. Accuracy is calculated using equation (2).
Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)
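As a minimal sketch of how the four counts lead to accuracy as the ratio of correctly classified instances, the code below tallies TP, FP, TN, and FN from paired actual and predicted labels. The label strings and the five example predictions are hypothetical.

```python
def confusion_counts(actual, predicted, positive="phishing"):
    """Count TP, FP, TN, FN for a binary problem with the given positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Correctly classified instances divided by all instances."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical predictions for five pages.
actual = ["phishing", "phishing", "legitimate", "legitimate", "legitimate"]
predicted = ["phishing", "legitimate", "legitimate", "legitimate", "phishing"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, fp, tn, fn)
```

Here three of the five pages are classified correctly (one TP and two TN), so the accuracy is 0.6.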
In this study, the dataset described in Section 3 was employed, which contains 48
different features. Weka 3.8.3 was used to analyse and compare the classifiers.
Weka is a collection of machine learning algorithms for data mining tasks such as
data preparation, classification, regression, clustering, association rule mining,
and visualization. Two feature selection algorithms were used in this study:
InfoGain and ReliefF. The top 15 features selected by each algorithm are described
in Table 1.
For the experiments, 10-fold cross-validation is used to test the models because
it reduces the variance of the estimate. With this technique, the dataset is
divided into 10 subsets; in each of the 10 repetitions, one subset is held out for
testing while the model is trained on the remaining nine, so that every subset is
used for testing exactly once. Tables 2, 3, and 4 show the performance of the
three selected algorithms (J48, RF, and MLP) using the InfoGain and ReliefF
feature selection methods with the top 5, top 10, and top 15 features.
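The 10-fold procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the Weka implementation: the folds are shuffled but not stratified, the helper names are hypothetical, and a trivial majority-class model stands in for J48, RF, or MLP.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_fn, k=10, seed=0):
    """Hold out each fold once, train on the other k-1, return pooled accuracy."""
    folds = k_fold_indices(len(samples), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(model(samples[i]) == labels[i] for i in test_idx)
    return correct / len(samples)


def train_majority(train_samples, train_labels):
    """Placeholder learner: always predicts the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority


# Hypothetical data: 100 samples, 70 phishing and 30 legitimate.
samples = list(range(100))
labels = ["phishing"] * 70 + ["legitimate"] * 30

folds = k_fold_indices(len(samples))
cv_accuracy = cross_validate(samples, labels, train_majority)
```

Because every sample lands in exactly one test fold, the pooled accuracy is computed over the whole dataset while no model is ever tested on data it was trained on.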
Furthermore, two other experiments were performed to obtain the best accuracy and
the least time to build the model. The first uses the intersection of the top 15
features from InfoGain and ReliefF, which yields 10 features: 27, 28, 48, 34, 14,
35, 47, 39, 30, and 3 (see Table 5). The second uses the union of the top 15
features from InfoGain and ReliefF, which yields 20 features. These features
include 27, 28, 48, 34, 14, 35, 47, 39, 30, 3, 5, 1, 22, 25, and 23 (see Table 5).
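The feature counts in these two experiments follow from set arithmetic: two top-15 lists that share 10 features have a union of 15 + 15 - 10 = 20 features. The sketch below checks this with Python sets. The ten shared indices are the ones reported in the text; the indices 101 to 105 and 201 to 205 are placeholders standing in for the features unique to each method and are not the paper's actual rankings.

```python
# Ten features reported in the text as common to both top-15 lists.
shared = {27, 28, 48, 34, 14, 35, 47, 39, 30, 3}

# Placeholder indices (assumptions, deliberately outside the 1-48 range)
# for the five features unique to each method.
infogain_only = {101, 102, 103, 104, 105}
relieff_only = {201, 202, 203, 204, 205}

infogain_top15 = shared | infogain_only
relieff_top15 = shared | relieff_only

intersection = infogain_top15 & relieff_top15  # first experiment
union = infogain_top15 | relieff_top15         # second experiment

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
expected_union_size = len(infogain_top15) + len(relieff_top15) - len(intersection)
```

With the real rankings substituted for the placeholders, the same two set operations would reproduce the feature subsets used in the experiments.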
The experimental results show that using the 20 features from the union of the top
15 InfoGain and ReliefF features gives accuracy rates very close to those obtained
with all 48 features. In addition, it takes much less time to build the model.
8 Conclusion
9 References
[1] Anti-Phishing Working Group. Phishing activity trends report, 4th quarter.
https://docs.apwg.org, 2018.
[2] Microsoft Security Intelligence Report, volume 24.
https://www.microsoft.com/security, 2019.
[3] Hossein Shirazi. Unbiased phishing detection using domain name based features. PhD the-
sis, Colorado State University. Libraries.
[4] Altyeb Altaher. Phishing websites classification using hybrid svm and knn approach.
International Journal of Advanced Computer Science and Applications, 8(6):90–95, 2017.
https://doi.org/10.14569/ijacsa.2017.080611
[5] Yi-Shin Chen, Yi-Hsuan Yu, Huei-Sin Liu, and Pang-Chieh Wang. Detect phishing by
checking content consistency. In Proceedings of the 2014 IEEE 15th International
Conference on Information Reuse and Integration (IEEE IRI 2014), pages 109–119. IEEE, 2014.
https://doi.org/10.1109/iri.2014.7051880
[6] Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah. Phishing detection based associative
classification data mining. Expert Systems with Applications, 41(13):5948–5959, 2014.
https://doi.org/10.1016/j.eswa.2014.03.019
[7] Mahmood Moghimi and Ali Yazdian Varjani. New rule-based phishing detection method.
Expert systems with applications, 53:231–242, 2016.
https://doi.org/10.1016/j.eswa.2016.01.028
[8] Maher Aburrous, M Alamgir Hossain, Keshav Dahal, and Fadi Thabtah. Intelligent
phishing detection system for e-banking using fuzzy data mining. Expert systems with
applications, 37(12):7913–7921, 2010. https://doi.org/10.1016/j.eswa.2010.04.044
[9] Kang Leng Chiew, Ee Hung Chang, Wei King Tiong, et al. Utilisation of website logo for
phishing detection. Computers & Security, 54:16–26, 2015.
https://doi.org/10.1016/j.cose.2015.07.006
[10] Jaydeep Solanki and Rupesh G Vaishnav. Website phishing detection using heuristic based
approach. In Proceedings of the third international conference on advances in computing,
electronics and electrical technology, 2015.
[11] Jin-Lee Lee, Dong-Hyun Kim, and Lee Chang-Hoon. Heuristic-based approach for
phishing site detection using url features. In Proc. of the Third Intl. Conf. on Advances
in Computing, Electronics and Electrical Technology-CEET, pages 131–135, 2015.
https://doi.org/10.15224/978-1-63248-056-9-84
[12] Ram B Basnet and Tenzin Doleck. Towards developing a tool to detect phishing urls: a
machine learning approach. In 2015 IEEE International Conference on Computational
Intelligence & Communication Technology, pages 220–223. IEEE, 2015.
https://doi.org/10.1109/cict.2015.63
[13] Xiaoqing Gu, Hongyuan Wang, and Tongguang Ni. An efficient approach to detecting
phishing web. Journal of Computational Information Systems, 9(14):5553–5560, 2013.
[14] Kang Leng Chiew, Choon Lin Tan, KokSheik Wong, Kelvin SC Yong, and Wei King
Tiong. A new hybrid ensemble feature selection framework for machine learning-based
phishing detection system. Information Sciences, 484:153–166, 2019.
https://doi.org/10.1016/j.ins.2019.01.064
[15] Mahdieh Zabihimayvan and Derek Doran. Fuzzy rough set feature selection to enhance
phishing attack detection. arXiv preprint arXiv:1903.05675, 2019.
https://doi.org/10.1109/fuzz-ieee.2019.8858884
[16] Adwan Yasin and Abdelmunem Abuhasan. An intelligent classification model for phishing
email detection. arXiv preprint arXiv:1608.02196, 2016.
[17] Mohammad Almseidin, Maen Alzubi, Szilveszter Kovacs, and Mouhammd Alkasassbeh.
Evaluation of machine learning algorithms for intrusion detection system. In 2017 IEEE
15th International Symposium on Intelligent Systems and Informatics (SISY), pages
000277–000282. IEEE, 2017. https://doi.org/10.1109/sisy.2017.8080566
[18] Mouhammad Alkasassbeh and Mohammad Almseidin. Machine learning methods for net-
work intrusion detection. Icccnt 2018 - The 20TH International Conference On Compu-
ting, Communication And Networking Technologies, 2018.
[19] Ibrahim Obeidat, Nabhan Hamadneh, Mouhammd Alkasassbeh, Mohammad Almseidin,
and Mazen AlZubi. Intensive pre-processing of kdd cup 99 for network intrusion
classification using machine learning techniques. 2019.
https://doi.org/10.3991/ijim.v13i01.9679
[20] Mouhammd Alkasassbeh, Ghazi Al-Naymat, AB Hassanat, and Mohammad Almseidin.
Detecting distributed denial of service attacks using data mining techniques. International
Journal of Advanced Computer Science and Applications, 7(1):436–445, 2016.
https://doi.org/10.14569/ijacsa.2016.070159
[21] Mouhammad Alkasassbeh. An empirical evaluation for the intrusion detection features
based on machine learning and feature selection methods. Journal of Theoretical and Ap-
plied Information Technology, 95(22), 2017.
[22] Ryan J Urbanowicz, Melissa Meeker, William La Cava, Randal S Olson, and Jason H
Moore. Relief-based feature selection: introduction and review. Journal of biomedical
informatics, 2018. https://doi.org/10.1016/j.jbi.2018.07.014
[23] Alaa Tharwat. Classification assessment methods. Applied Computing and Informatics,
2018.
10 Authors
Article submitted 2019-07-31. Resubmitted 2019-09-15. Final acceptance 2019-09-21. Final version
published as submitted by the authors.