Phishing Attacks Detection

A Machine Learning-Based Approach


Fatima Salahdine (1,2), Zakaria El Mrabet (1), Naima Kaabouch (1)
(1) School of Electrical Engineering & Computer Sciences, University of North Dakota, Grand Forks, ND-58203, USA
(2) Department of Electrical and Computer Engineering, the University of North Carolina at Charlotte, Charlotte, NC-28223, USA
{fatima.salahdine, zakaria.elmrabet, naima.kaabouch}@und.edu

Abstract- Phishing attacks are one of the most common social engineering attacks targeting users' emails to fraudulently steal confidential and sensitive information. They can be used as part of larger attacks launched to gain a foothold in corporate or government networks. Over the last decade, a number of anti-phishing techniques have been proposed to detect and mitigate these attacks. However, they are still inefficient and inaccurate. Thus, there is a great need for efficient and accurate detection techniques to cope with these attacks. In this paper, we propose a phishing attack detection technique based on machine learning. We collected and analyzed more than 4000 phishing emails targeting the email service of the University of North Dakota. We modeled these attacks by selecting 10 relevant features and building a large dataset. This dataset was used to train, validate, and test the machine learning algorithms. For performance evaluation, four metrics were used, namely probability of detection, probability of miss-detection, probability of false alarm, and accuracy. The experimental results show that better detection can be achieved using an artificial neural network.

Keywords- Security; Phishing attacks; Machine learning

978-1-6654-0690-1/21/$31.00 ©2021 IEEE

I. INTRODUCTION

With more than 7 billion email accounts worldwide in 2021 and over 3 million emails sent per second, email services have become indispensable for personal and professional transactions. However, the massive use of email services has drawn the attention of attackers as a potential field for launching successful attacks. Compromising an email account directly has become challenging, or almost impossible, since email service providers offer secure end-to-end (E2E) communication. Thus, attackers opt for social engineering strategies to compromise email accounts by manipulating human intelligence to obtain critical and confidential information [1].

Phishing attacks are performed by sending forged emails that appear to come from an authentic entity to a victim or a group of victims [2][3]. They aim at obtaining users' confidential data or installing malware on their machines. For instance, the attackers send an email with a redirection link to a malicious website where the user is requested to provide some sensitive data, including a bank account number or a login and password. The attacker can also attach a file to the fake email to be downloaded by the victim, which can automatically trigger the execution of embedded malware.

To cope with phishing attacks and mitigate their potential risks, a number of techniques have been proposed. These techniques can be classified into four categories: rule-based, whitelist and blacklist, heuristic, and hybrid. The rule-based approach consists of using data mining techniques to train a model on a specific dataset with a certain number of features, and then extracting phishing attack rules. For instance, a rule-based phishing detection approach was proposed for banking services in which several features were selected, including IP address, SSL certificate, web address length, number of dots in the URL, and blacklist keywords. In [4], the authors proposed a data mining tool called Multi-label Classifier Associative Classification in which 16 features were selected, including IP address, long URL, URLs containing the @ symbol, prefix and suffix, and DNS record. In [5], a rule-based technique was described in which 17 features were selected and different classifiers were used, namely C4.5, RIPPER, PRISM, and CBA. The results show that C4.5 outperforms the other algorithms in terms of detection rate and accuracy. Rule-based approaches are easy to implement; however, they have some shortcomings, including a low accuracy rate.

Other techniques are based on whitelist and blacklist approaches [6][7]. In [6], a whitelist-based approach was proposed in which a number of features related to legitimate websites were recorded, such as URL, IP address, and login user interface. When the user visits a website that does not match any entry in this list, the requested website is classified as malicious. In [7], a blacklist-based approach was proposed in which the URL of the suspicious webpage is divided into several parts and compared to a list of phishing websites. The list of suspicious websites is gathered from several sources, including spam traps and open phishing email databases. Whitelist and blacklist approaches are inefficient in dealing with new webpages that are not included in those pre-established lists. In addition, these lists require frequent updates, which can be computationally expensive.

For the heuristic techniques, feature sets are selected, and the impact of each set on the detection likelihood is investigated. The tested feature sets can range from the URL and IP address to the HTML DOM of the webpage. For instance, a heuristic-based technique was proposed in which 20 heuristic features were selected [8]. The results show that the URL-based and HTML-based heuristics are effective, and that they outperform the blacklist-based approach. In [9], a heuristic-based approach called CANTINA+ was proposed to extract the most frequent words in the webpage and search for them on a search engine. The webpage is classified as legitimate if it appears in the first
results of the search, since the top-ranked webpages are the most visited and their likelihood of being legitimate is high. However, the attacker can manipulate these entries and make malicious webpages appear in the first search results.

A number of hybrid detection techniques have been proposed that combine the fuzzy-logic approach with other data mining techniques [10][11][12]. In [10], a hybrid approach was proposed that has an accuracy of 98.5% with 288 features. It requires a considerable number of features, which makes its implementation complex. In [11], a hybrid approach was proposed reaching an accuracy of 86.38% with 27 features. However, it was not clear how the features were extracted. A target identification algorithm was designed to identify phishing webpages [12]. It relies on third-party services to investigate in depth the content of the suspicious link and verify its source, which may result in more processing time.

In this paper, we investigate the efficiency of machine learning approaches in detecting phishing emails. After understanding the research problem's requirements and analyzing the training dataset, we selected three models among others, namely support vector machine (SVM), logistic regression (LR), and artificial neural network (ANN). We explored other variations with different kernel types and different architectures. The dataset used to train and test the classifiers came from real attacks launched against the email service of the University of North Dakota.

The rest of this paper is organized as follows. Section II describes the methodology of the proposed approach. Section III discusses and compares the simulation results. Finally, a conclusion is given at the end.

II. METHODOLOGY

A. Features selection

A typical email is composed of a header and a body. The email header has a specific structure consisting of several pieces of information related to the sender and the receiver, including their IP addresses, the subject, and the date. The email body has no specific format; it can be customized and differ from one email to another. However, there are some items that can be found in any typical email, such as text, a link to a website, attached files, and the email's signature. Since not all of the email content is relevant in distinguishing legitimate emails from malicious ones, it is important to select and extract only those specific features that are used in phishing emails. In this paper, we used ten relevant features, of which eight are extracted from the email body and the remaining two from the email header. These features are: sender email address, attached file extension, blacklist keywords, secure socket layer (SSL) certificate, certificate authority (CA), redirection URL, hiding links, clear IP address, website traffic, and webpage age. Individual features may not reveal the legitimacy of an email, but combining several features increases the likelihood of detecting potential phishing emails.

1) SSL certificate

When a user is requested to enter confidential data on legitimate websites, the data exchanged between the server and the end-user is encrypted, which can be achieved through the HTTP protocol with an additional secure socket layer [13]. However, most phishing emails include HTTP links without any supplementary secure layer, exposing the data to potential unauthorized access and loss. Thus, if an email includes an HTTPS link, then it is legitimate; otherwise, it is malicious.

$$ \text{Feature 1: if } \begin{cases} \text{HTTPS link} \rightarrow \text{legitimate email} \\ \text{Otherwise} \rightarrow \text{suspicious email} \end{cases} $$

2) Certificate authority

Not every HTTPS link can guarantee a secure connection to the server and keep the sensitive data undisclosed to a third party, since the SSL certificate can be delivered by an unauthentic entity or be self-signed. An SSL certificate alone is insufficient to decide whether an HTTPS link is secure. Investigating the identity of the entity that issued the certificate is crucial in verifying the email's legitimacy [14]. Thus, if an SSL certificate is not delivered by a trusted and credible authority such as GoDaddy, Comodo, or Symantec, then the email is suspicious.

$$ \text{Feature 2: if } \begin{cases} \text{Authentic CA} \rightarrow \text{legitimate SSL} \\ \text{Otherwise} \rightarrow \text{suspicious SSL certificate} \end{cases} $$

3) Blacklist keywords

Phishing emails share some keywords and short phrases. These keywords convey a sense of urgency, including "Click Now", "Verify Now", "Valid in 24h", and "Update Now". The presence of such keywords in the email body provides clues about the illegitimacy of the email. In this paper, we established a list of several suspicious keywords used by attackers to grab the attention of the victim [15]. If the email includes one or more blacklist words, then it is malicious.

$$ \text{Feature 3: if } \begin{cases} \text{Email word} \in \{\text{blacklist}\} \rightarrow \text{suspicious email} \\ \text{Otherwise} \rightarrow \text{legitimate email} \end{cases} $$

4) Redirection URL

Some phishing emails include a link that implicitly redirects the user to a hidden server, such as a proxy server, before reaching the requested website. This server handles the communication between the user, the malicious website, and the legitimate website [16]. A GET request of the HTTP protocol is used to verify the legitimacy of a URL.

$$ \text{Feature 4: if } \begin{cases} \text{GET}(\text{link}_{\text{URL}}) \neq \text{link}_{\text{URL}} \rightarrow \text{suspicious email} \\ \text{Otherwise} \rightarrow \text{legitimate email} \end{cases} $$

5) Hiding links

An alternative way to hide the actual website URL is to use hiding links, which rely on two techniques: URL shorteners and customized HTML emails. In the former, the attacker wraps the
real URL in a short one such as "goo.gl" or "j.mp". In the latter, the attacker forges an HTML email with Cascading Style Sheets and JavaScript scripts to customize the webpage link with a personalized clickable text or image. Thus, an email is suspicious if it includes a short URL.

$$ \text{Feature 5: if } \begin{cases} \text{link}_{\text{URL}} \text{ is short}_{\text{URL}} \text{ or link}_{\text{URL}} \text{ is hidden inside (image or "clicked text")} \rightarrow \text{suspicious email} \\ \text{Otherwise} \rightarrow \text{legitimate email} \end{cases} $$

6) Clear IP address

Some phishing emails include links with a clear IP address. "https://50.10.125.26/index.php" is an example that indicates the illegitimacy of the email. Attackers use an IP address instead of a domain name because malicious webpage links last for less than three days, and attackers do not buy a domain name for such a short period of time. Thus, if a link includes a clear IP address, then it is suspicious.

$$ \text{Feature 6: if } \begin{cases} \text{link}_{\text{URL}} \text{ includes IP address} \rightarrow \text{suspicious email} \\ \text{Otherwise} \rightarrow \text{legitimate email} \end{cases} $$

7) Website traffic

Legitimate websites receive a number of requests with a specific traffic rate per day. A legitimate website has a rank less than or equal to 150,000 in the Alexa database. However, phishing websites are not often visited as they have a short lifetime, and their traffic is low.

$$ \text{Feature 7: if } \begin{cases} \text{traffic} < 150000 \rightarrow \text{legitimate email} \\ \text{Otherwise} \rightarrow \text{suspicious email} \end{cases} $$

8) Age of the webpage

Since most phishing webpages have a short lifetime, the age of a webpage can provide information about its legitimacy. The age of an authentic website is usually more than one year. Thus, if the email includes a link to a webpage that is less than one year old, then it is suspicious.

$$ \text{Feature 8: if } \begin{cases} \text{webpage age} > 1 \text{ year} \rightarrow \text{legitimate email} \\ \text{Otherwise} \rightarrow \text{suspicious email} \end{cases} $$

9) Sender's email address

In some phishing emails, there is an inconsistency between the email subject and the address of the sender. For instance, some malicious emails seem to be emitted by an authentic entity, such as Microsoft or Dropbox, since the email's subject states something similar to "the user X has shared some files with you" or "Reinitialize the password." However, the sender's email address includes a strange domain name such as "@sharing.dboxfile.com" or "@dropbox.com." Such an inconsistency can be relevant in detecting malicious senders. Thus, if a domain name does not belong to the list of credible domain names, then the email is suspicious.

$$ \text{Feature 9: if } \begin{cases} \text{@domain name} \in \{\text{list of credible domain names}\} \rightarrow \text{legitimate email} \\ \text{Otherwise} \rightarrow \text{suspicious email} \end{cases} $$

10) Attached file extension

This feature is used to increase the likelihood of detecting phishing emails. Some phishing emails include an attached file with an embedded payload. This payload can be an executable shell script giving the attacker the privileges to execute commands on the user's machine. One of the known tools used by attackers to forge phishing emails is the Social-Engineer Toolkit (SET), installed by default on Kali Linux. It generates a file including the payload with an ".exe" or ".dll" extension. If the attached file has an ".exe" extension, then the email is suspicious.

$$ \text{Feature 10: if } \begin{cases} \text{file\_name.exe} \rightarrow \text{suspicious email} \\ \text{Otherwise} \rightarrow \text{legitimate email} \end{cases} $$

B. Classification techniques

In this work, we compared the performance of three classifiers, namely support vector machine (SVM), logistic regression (LR), and artificial neural network (ANN). SVM is a machine learning algorithm used for solving classification and regression problems [17]. It is based on a hyperplane classifier that separates two distinct classes while maximizing the margin between them. Let the dataset, D, be given as {(x1, y1), (x2, y2), …, (xN, yN)}, where xi is a training tuple with the associated class label yi. Each yi can take one of two values, either +1 or -1, corresponding respectively in our case to the class 'phishing email' or 'legitimate email'. SVM finds the best decision boundary to separate these two classes using a hyperplane, h, which can be defined as

$$ h(x) = W \cdot X + b = \sum_{i=1}^{N} \alpha_i y_i \langle x_i, x \rangle + b \tag{1} $$

where W is the weight vector, b is the bias, N is the number of training tuples in the dataset, x_i is the i-th training tuple, and \alpha_i is the Lagrange multiplier. In the case of non-linear data, one can first transform the data through a non-linear mapping to a higher-dimensional space and then use a linear model to separate the data. The mapping is performed by a kernel function K, and equation (1) can be rewritten as

$$ h(x) = W \cdot X + b = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \tag{2} $$

where K(x_i, x) is the kernel function. In this paper, we used the polynomial function as a kernel. SVM classifies a new email based on its position with respect to that hyperplane. If an email's features lie on or above the hyperplane, then it belongs to the phishing email class.

LR is a supervised machine learning technique used for predicting discrete output classes, in particular for binary classification [3]. It is based on different hypothesis functions for predicting a binary-valued output. In this paper, the sigmoid function is considered as the hypothesis function. It is given by
$$ h_w(x^{(i)}) = \frac{1}{1 + e^{-\sum_{j=1}^{N} w_j x_j^{(i)}}} \tag{3} $$

where w_j is the weight associated with each input x_j and N is the number of features. In this paper, we opted for gradient descent (GD) as an optimization technique to find the weights that minimize the prediction error.

ANN is a supervised machine learning algorithm used for classification and regression. It is composed of an input layer, one or multiple hidden layers, and an output layer, where each layer is composed of several neurons. A neuron is a computation unit that takes a set of inputs associated with weights and predicts the output using an activation function. There are several activation functions, including the sigmoid function, the hyperbolic tangent function, and the rectified linear unit function [15]. Training an ANN model involves forward propagation and backward propagation. For each instance in the dataset, forward propagation is used to compute the predicted output, compare it with the actual one, and calculate the error between these two values. To minimize the error, backward propagation updates the weights associated with each input using gradient descent. Forward and backward propagation are repeated until the ANN reaches a minimum error value. The output of the j-th neuron of the l-th layer is given by

$$ a_j^{(l)} = g_j^{(l)}\left( \sum_{i=1}^{m} w_{ji}^{(l)} a_i^{(l-1)} + b_{ji}^{(l)} \right) \tag{4} $$

The activation function of the output layer of an ANN with one neuron is given as follows:

$$ h_w(x) = g\left( \sum_{i=1}^{m} w_i^{(l)} a_i^{(l)} \right) \tag{5} $$

ANNs learn their weights and biases using the GD technique. Given a training set {(x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)})}, the cross-entropy cost function J(W) is given by

$$ J(W) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left( h_w(x^{(i)})_k \right) + \left( 1 - y_k^{(i)} \right) \log\left( 1 - h_w(x^{(i)})_k \right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( w_{j,i}^{(l)} \right)^2 \tag{6} $$

where \lambda is the regularization parameter, m is the training data size, K is the number of output classes, h_w is the hypothesis function, and w_{j,i}^{(l)} is the weight assigned to the link between the i-th and j-th neurons of the l-th layer. In the summation limits, L is the number of layers and s_l is the number of neurons in layer l.

The process consists of minimizing the cross-entropy cost function J(W). Backpropagation aims at updating all the weights simultaneously to minimize the cost function. The hypothesis here is a sigmoid function, given as:

$$ g(z) = \frac{1}{1 + e^{-z}} \tag{7} $$

where z is the weighted sum of the features x. In this binary classification, there are two cases based on the value of z: (i) if z ≫ 0, the hypothesis function satisfies h_w(x) > 0.5, which corresponds to the presence of the attack (y = 1); (ii) if z ≪ 0, the hypothesis function satisfies h_w(x) < 0.5, which corresponds to the absence of the attack (y = 0).

III. RESULTS AND DISCUSSION

To train, validate, and test the models, we built a dataset consisting of 4000 real phishing emails. These emails were collected from the University of North Dakota email system from May 22, 2017, to June 20, 2018. The collected data include some redundant emails because some attackers sent the same forged email to multiple users, or used it to conduct the same attack several times. Thus, we analyzed and improved the dataset by removing the duplicated and redundant emails, reducing the number of instances to 2000 phishing emails. The legitimate emails were collected from legitimate accounts and emitted by an authentic entity. To keep the numbers of phishing and legitimate emails equally distributed in the dataset and to avoid bias towards either of these two classes, 2000 legitimate emails were selected. Thus, the final dataset contains 4000 instances with legitimate and phishing emails, as presented in Table I.

TABLE I. COLLECTED PHISHING DATASET

| Item | Count |
|---|---|
| Total samples | 4000 |
| Total phishing emails | 2000 |
| Total legitimate emails | 2000 |
| Total training samples | 2800 |
| Total testing samples | 1200 |
| Total number of features | 10 |

Since some classifiers cannot be trained on categorical data, the dataset went through a pre-processing step in which all the nominal values were converted into numerical values. The same conversion model was used to map the nominal data to numerical data in the entire dataset. In addition, the dataset went through a feature scaling step that standardizes the data to zero mean and a standard deviation of 1. These steps can reduce the processing time for some classifiers and avoid the divergence issues that could arise. The performance evaluation of the algorithms was conducted using several metrics: Pd, Pfa, Pmd, and accuracy. Pd is the likelihood of detecting suspicious emails when they are suspicious. Pfa is the likelihood of detecting an email as suspicious while it is legitimate. Pmd is the likelihood of classifying an email as legitimate when this email is suspicious. The accuracy is the likelihood that a classifier attributes a legitimate email to the class of "legitimate email". These metrics are expressed as

$$ P_d = \frac{\text{Number of detected suspicious emails}}{\text{Number of suspicious emails}} \tag{8} $$

$$ P_{fa} = \frac{\text{Number of false detected suspicious emails}}{\text{Number of suspicious emails}} \tag{9} $$

$$ P_{md} = \frac{\text{Number of miss detected suspicious emails}}{\text{Number of suspicious emails}} \tag{10} $$
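As a minimal sketch of the pre-processing and evaluation steps described above, the following Python code standardizes a feature column to zero mean and unit standard deviation and computes Pd, Pfa, and Pmd following Eqs. (8)-(10). The function names, the label convention (1 = suspicious, 0 = legitimate), and the toy data are illustrative assumptions and are not taken from the paper's dataset. Since no equation is given for the accuracy, it is assumed here to be the fraction of all emails that are classified correctly.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


def standardize(column):
    """Rescale one numeric feature column to zero mean and unit standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # A constant column cannot be rescaled; map it to zeros instead.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


@dataclass
class DetectionMetrics:
    pd: float        # probability of detection, Eq. (8)
    pfa: float       # probability of false alarm, Eq. (9)
    pmd: float       # probability of miss-detection, Eq. (10)
    accuracy: float  # assumed: fraction of emails classified correctly


def detection_metrics(y_true, y_pred):
    """Compute the metrics for labels 1 = suspicious and 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    detected = sum(1 for t, p in pairs if t == 1 and p == 1)
    missed = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_alarms = sum(1 for t, p in pairs if t == 0 and p == 1)
    correct = sum(1 for t, p in pairs if t == p)
    suspicious = detected + missed
    if suspicious == 0:
        raise ValueError("y_true must contain at least one suspicious email")
    return DetectionMetrics(
        pd=detected / suspicious,
        # Eq. (9) as written divides by the number of suspicious emails. With a
        # balanced test set this equals the usual false-alarm rate, which
        # divides by the number of legitimate emails.
        pfa=false_alarms / suspicious,
        pmd=missed / suspicious,
        accuracy=correct / len(pairs),
    )


# Hypothetical toy data, chosen only to exercise the functions.
scaled = standardize([2, 4, 4, 4, 5, 5, 7, 9])
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
metrics = detection_metrics(y_true, y_pred)
print(scaled)
print(metrics)
```

Because Pd and Pmd share the same denominator, they always sum to 1, which is consistent with the values reported in the result tables.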
Examples of results are presented in Table II through Table IV. ANN performance is affected by many parameters, including the number of hidden layers, the number of hidden neurons in each layer, and the activation function. To find the set of parameters that maximizes the ANN performance, we conducted several experiments using the generated dataset with different combinations of these parameters.

Examples of results are presented in Table II. As can be seen, the ANN with two hidden layers of 100 neurons each and the ReLU activation function has the best performance, as it achieves the highest Pd of 90.3%, the lowest Pmd of 9.7%, and the highest accuracy of 94.5%.

TABLE II. ANN PERFORMANCE EVALUATION

| Algorithm | Pd | Pfa | Pmd | Accuracy |
|---|---|---|---|---|
| (100) / ReLU function | 90.10% | 1.40% | 9.90% | 94.40% |
| (100,100) / ReLU function | 90.30% | 1.50% | 9.70% | 94.50% |
| (100) / tanh function | 90% | 1.40% | 10% | 94.30% |
| (100,100) / tanh function | 90.10% | 1.50% | 9.90% | 94.30% |
| (100) / sigmoid | 88.90% | 1.40% | 11.10% | 93.80% |
| (100,100) / sigmoid | 88.70% | 1.40% | 11.30% | 93.70% |

TABLE III. SVM'S PERFORMANCE WITH SEVERAL KERNELS

| Algorithm | Pd | Pfa | Pmd | Accuracy |
|---|---|---|---|---|
| Linear SVM | 29.8% | 44.8% | 70.2% | 42.6% |
| Cubic SVM | 63.4% | 54.5% | 36.6% | 54.4% |
| RBF SVM | 82.3% | 27.7% | 17.7% | 77.3% |
| Sigmoid SVM | 43.3% | 24.7% | 56.7% | 59.4% |

As the performance of SVM depends on the kernel used for email classification, four kernels were considered, namely linear, polynomial, radial basis function (RBF), and sigmoid kernels. Examples of results are presented in Table III. Comparing the performance of the SVM variants, we can conclude that SVM with the RBF kernel achieves a Pd of 82.3%, a Pfa of 27.7%, a Pmd of 17.7%, and an overall accuracy of 77.3%. Thus, it provides better results than the other kernels.

To investigate the impact of the regularization parameter on the LR performance, several experiments were performed using different values of this parameter. Examples of results are given in Figs. 1 through 4. Fig. 1 represents Pd against the regularization parameter. It can be seen that Pd increases with the regularization parameter, reaching its maximum at 0.006 with 87.2%. For values higher than 0.06, Pd decreases slightly and then remains constant at 87.1%.

Fig. 1. Pd as a function of the regularization parameter.

Fig. 2 represents Pfa as a function of the regularization parameter. As one can see, Pfa has three different regimes. For the range [0, 0.1], Pfa is constant with an average equal to 6.5%. For the range [0.1, 0.4], Pfa decreases as the regularization parameter increases, reaching its lowest value of 1.4% at 0.4. For values higher than 0.4, Pfa remains constant at 1.4%.

Fig. 2. Pfa versus the regularization parameter of logistic regression.

Fig. 3 represents Pmd against the regularization parameter of LR. It can be seen that for the range [0, 0.08], Pmd decreases as the regularization parameter increases, reaching its minimum of 12.8% at 0.08. However, for values higher than 0.08, increasing the regularization parameter does not have any impact on Pmd, as it remains constant at around 12.8%.
Fig. 3. Pmd versus the regularization parameter of logistic regression.

Fig. 4 represents the accuracy as a function of the regularization parameter. One can see that the accuracy increases when the regularization parameter is less than 1, while it is constant for values higher than 1. It reaches its maximum value of 92.9% when the regularization parameter is 0.4. Thus, LR performs better with a regularization parameter higher than 0.7.

Fig. 4. Accuracy versus the regularization parameter of logistic regression.

TABLE IV. COMPARISON BETWEEN ANN, SVM, AND LR

| Algorithm | Pd | Pfa | Pmd | Accuracy |
|---|---|---|---|---|
| ANN (100,100), ReLU function | 90.3% | 1.5% | 9.7% | 94.5% |
| SVM, Gaussian radial basis function | 82.3% | 27.7% | 17.7% | 77.3% |
| LR, regularization parameter = 0.7 | 87.1% | 1.4% | 12.9% | 92.9% |

Table IV evaluates the performance of the three classifiers based on the four metrics. For ANN, we selected two hidden layers with 100 neurons each and the ReLU activation function, since this configuration produces the best results compared to the other activation functions. Regarding SVM, the Gaussian radial basis kernel is selected since it produces better results in terms of Pd, Pfa, Pmd, and accuracy. For LR, a regularization parameter equal to 0.7 produces the best results. Based on the best performance of each classifier, a performance comparison between these algorithms is given in Table IV. As one can see, the ANN with two hidden layers and the ReLU function has the highest Pd and accuracy and the lowest Pmd compared to LR and SVM, while its Pfa of 1.5% is close to the lowest value of 1.4%, obtained by LR.

CONCLUSION

In this paper, we proposed a phishing attack detection technique using machine learning. Three classifiers were trained and tested on the dataset. For each classifier, a parametric study was conducted, and the best results were reported for evaluation. For SVM, the highest accuracy is obtained with the Gaussian radial basis function kernel. For LR, the highest accuracy is given by a regularization parameter of 0.4. For ANN, the highest accuracy is achieved with two hidden layers of 100 neurons each and the ReLU activation function. Therefore, the proposed model allows fast and accurate detection of phishing attacks.

REFERENCES

[1] F. Salahdine and N. Kaabouch, "Social Engineering Attacks: A Survey," Future Internet J., 11, 89, pp. 1-17, 2019.
[2] R. Mohammad, F. Thabtah, and L. McCluskey, "Intelligent rule-based phishing websites classification," IET Inf. Secur., pp. 153–160, 2014.
[3] F. Salahdine and N. Kaabouch, "Security threats, detection, and countermeasures for physical layer in cognitive radio networks: A survey," Physical Commun. J., 2020.
[4] J. He and Y. Zhu, "Social engineering/phishing," Encycl. Soc. Netw. Anal. Min., pp. 1777–1783, 2014.
[5] M. Moghimi and A. Varjani, "New rule-based phishing detection method," Expert Syst. Appl., vol. 53, pp. 231–242, 2016.
[6] B. Gupta, N. Arachchilage, and K. Psannis, "Defending against phishing attacks: Taxonomy of methods, current issues and future directions," Telecommun. Syst., 67, pp. 247–267, 2018.
[7] J. Hong, T. Kim, and S. Kim, "Phishing URL detection with lexical features and blacklisted domains," Adaptive Auton. Secur. Cyber Syst., pp. 253-267, 2020.
[8] Y. Huang, Q. Yang, J. Qin, and W. Wen, "Phishing URL Detection via CNN and Attention-Based Hierarchical RNN," IEEE Int. Conf. Trust, Security, Privacy Comput. Commun., pp. 112-119, 2019.
[9] M. Moghimi and A. Y. Varjani, "New rule-based phishing detection method," Expert Syst. Appl., vol. 53, pp. 231–242, 2016.
[10] G. Ramesh, I. Krishnamurthi, and K. Kumar, "An efficacious method for detecting phishing webpages through target domain identification," Decision Support Systems, vol. 61, pp. 12–22, 2014.
[11] Y. Suga, "SSL/TLS servers status survey about enabling forward secrecy," Int. Conf. Network-Based Information Systems, pp. 501–505, 2014.
[12] A. Albarqi, E. Alzaid, F. Ghamdi, S. Asiri, and J. Kar, "Public key infrastructure: A survey," J. Inf. Secur., vol. 06, no. 01, pp. 31–37, 2015.
[13] S. Krishnamurthy and A. Ve, "Information retrieval models: Trends and techniques," Web Semant. Textual Vis. Inf. Retr., pp. 17–42, 2017.
[14] A. Kharraz, W. Robertson, and E. Kirda, "Surveylance: Automatically detecting online survey scams," IEEE Symp. Secur. Privacy, pp. 723–739, 2018.
[15] Y. Reddy and N. Varma, "Review on supervised learning techniques," Emerg. Res. Data Eng. Syst. Comput. Commun. J., pp. 577-587, 2020.
[16] C. Bircano and N. Arıca, "A comparison of activation functions in artificial neural networks," Signal Proc. Commun. App. Conf., pp. 1-4, 2018.
[17] Y. Arjoune, F. Salahdine, Md. Islam, E. Ghribi, and N. Kaabouch, "A novel jamming attacks detection approach based on machine learning for wireless communication," Int. Conf. Inf. Netw., pp. 1–6, 2020.
