Phishing Attacks Detection: A Machine Learning-Based Approach
Abstract- Phishing attacks are one of the most common social engineering attacks targeting
users’ emails to fraudulently steal confidential and sensitive information. They can be used
as a part of more massive attacks launched to gain a foothold in corporate or government
networks. Over the last decade, a number of anti-phishing techniques have been proposed to
detect and mitigate these attacks. However, they are still inefficient and inaccurate. Thus,
there is a great need for efficient and accurate detection techniques to cope with these
attacks. In this paper, we propose a phishing attack detection technique based on machine
learning. We collected and analyzed more than 4000 phishing emails targeting the email
service of the University of North Dakota. We modeled these attacks by selecting 10 relevant
features and building a large dataset. This dataset was used to train, validate, and test the
machine learning algorithms. For performance evaluation, four metrics have been used, namely
probability of detection, probability of miss-detection, probability of false alarm, and
accuracy. The experimental results show that better detection can be achieved using an
artificial neural network.

Keywords- Security; Phishing attacks; Machine learning

I. INTRODUCTION

With more than 7 billion email accounts worldwide in 2021 and over 3 million emails sent per
second, email services have become an indispensable way for personal and professional
transactions. However, the massive use of email services has grabbed the attention of
attackers as a potential field for launching successful attacks. Compromising an email
account directly has become challenging or almost impossible since the email service
providers offer secure end-to-end (E2E) communication. Thus, the attackers opt for using
social engineering strategies to compromise email accounts by manipulating human intelligence
to obtain critical and confidential information [1].

Phishing attacks are performed by sending forged emails, made to look as if they come from a
legitimate and authentic entity, to a victim or a group of victims [2][3]. They aim at
obtaining users’ confidential data or uploading malware onto their machines. For instance,
the attackers send an email with a redirection link to a malicious website where the user is
requested to provide some sensitive data, including a bank account number or a login and
password. The attacker can also attach a file to the fake email to be downloaded by the
victim, which can automatically trigger the execution of embedded malware.

To cope with phishing attacks and mitigate their potential risks, a number of techniques have
been proposed. These techniques can be classified into four categories: rule-based, whitelist
and blacklist, heuristic, and hybrid. The rule-based approach consists of using data mining
techniques to train the model based on a specific dataset with a certain number of features,
then extract phishing attack rules. For instance, a rule-based phishing attack detection
approach was proposed for the banking service in which several features were selected,
including IP address, SSL certificate, web address length, number of dots in URL, and
blacklist keywords. In [4], the authors proposed a data mining tool called Multi-label
Classifier Associative Classification in which 16 features were selected, including IP
address, long URL, URLs having an @ symbol, prefix and suffix, and DNS record. In [5], a
rule-based technique was described, in which 17 features were selected and different
classifiers were used, namely C4.5, RIPPER, PRISM, and CBA. The results show that C4.5
outperforms the other algorithms in terms of detection rate and accuracy. Rule-based
approaches are easy to implement; however, they have some shortcomings, including a low
accuracy rate.

Other techniques are based on whitelist and blacklist approaches [6][7]. In [6], a
whitelist-based approach was proposed in which a number of features related to the legitimate
websites were recorded, such as URL, IP address, and Login User Interface. When the user
visits a website that does not match any entry in this list, the requested website is
classified as malicious. In [7], a blacklist-based approach was proposed in which the URL of
the suspicious webpage is divided into several parts and compared to a list of phishing
websites. The list of suspicious websites is gathered from several sources, including spam
traps and open phishing email databases. Whitelist and blacklist approaches are inefficient
in dealing with new webpages that are not included in those pre-established lists. In
addition, these lists require frequent updates, which can be computationally expensive.

For the heuristic techniques, feature sets are selected, and the impact of each set in
increasing the detection likelihood is investigated. The tested feature sets can range from
the URL and IP address to the HTML DOM of the webpage. For instance, a heuristic-based
technique was proposed in which 20 heuristic features were selected [8]. The results show
that the URL-based and HTML-based heuristics are effective, and they outperform the
blacklist-based approach. In [9], a heuristic-based approach called CANTINA+ was proposed to
extract the most frequent words in the webpage and search for them on a search engine.
The webpage is classified as legitimate if it appears in the first