Phishing Attacks Detection Using Machine Learning Approach
Abstract- Evolving digital transformation has exacerbated cybersecurity threats globally. Digitization opens the doors wider to cybercriminals. Cyber threats often begin in the form of phishing to steal confidential user credentials. Usually, hackers influence users through phishing in order to gain access to an organization's digital assets and networks. After a security breach, cybercriminals execute ransomware attacks, gain unauthorized access, shut down systems, and even demand a ransom for restoring access. Phishers use evasion tactics to circumvent anti-phishing software and techniques. Though threat intelligence and behavioural analytics systems help organizations spot unusual traffic patterns, the best practice for preventing phishing attacks remains defense in depth. In this perspective, the proposed research work has developed a model to detect phishing attacks using machine learning (ML) algorithms such as random forest (RF) and decision tree (DT). A standard dataset of phishing attacks from Kaggle was used for ML processing. To analyze the attributes of the dataset, the proposed model used feature selection algorithms such as principal component analysis (PCA). Finally, a maximum accuracy of 97% was achieved with the random forest algorithm.
Keywords- Phishing attack; phishing attack detection; artificial intelligence; machine learning; deep learning; convolutional neural network
I. INTRODUCTION
Phishing attacks have become a major concern for the cyber world. They cause enormous privacy and financial problems for internet users. Scammers, known as phishers, create false websites [1, 2] that look and feel genuine in order to deceive people. They spoof emails to steal the identity of legitimate users, and they gather confidential personal information, passwords, account information, and credit card details used for transactions. Phishers constantly change their attack strategies. Social engineering [3-6] is one of the essential techniques phishers use. Using this technique, they gather personal credentials by exploiting a person's trust. Phishers create false websites and spoofed emails that closely resemble, and sometimes look identical to, those of a real company, so that they appear to come from a legitimate source. Sometimes the attackers pose as a real source and pressure users into updating the system.
Moreover, they threaten to suspend the customer's account and demand a ransom. Email spoofing is another technique used for phishing fraud [7]. Customers are usually misled into disclosing private information such as passwords and credit card numbers. Thus, phishing is mainly used to steal valuable information such as bank account details, passwords, and credit card details [8]. This type of scam is increasing rapidly, and individuals and businesspeople are losing their trust in online business. As a result, clients have formed a negative impression of online business as they lose faith in online transactions. Even though encryption software is used to protect information in computer storage, such systems are also vulnerable to attacks [9]. In this paper, phishing detection is performed using ML.
II. BACKGROUND AND RELATED WORK
There are various types of phishing attacks used to cheat users. In addition, various phishing detection techniques and tools are available to defend against phishing attacks. Classification is one of the techniques used to detect website
Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on December 13,2021 at 12:36:58 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Smart Systems and Inventive Technology (ICSSIT 2020)
IEEE Xplore Part Number: CFP20P17-ART; ISBN: 978-1-7281-5821-1
phishing [10, 11, 29]. Common types of phishing attacks and classification techniques are described below:
party with attachment or link. They request to send an updated version of the original [13].
III. METHODOLOGY
This section explains the methodology used to detect phishing attacks with ML and describes the proposed framework. The experiment was carried out using ML approaches, which can be applied in two ways: supervised learning and unsupervised learning. Feature selection is crucial for ML algorithms because it removes redundant, irrelevant, or unnecessary data from the data sets. Another statistical method, principal component analysis (PCA), was used to identify the components of the variables and classify the datasets. The proposed model is presented in Figure 1. To experiment with phishing websites, the dataset was selected first. The attributes were then analyzed using feature selection algorithms. The proposed model used the REF, Relief-F, IG, and GR algorithms for feature selection. The features were then ranked from weakest to strongest. Finally, PCA was applied for analysis.
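As an illustration of how attributes can be ranked from weakest to strongest, the following is a minimal sketch of one such ranking criterion. It assumes that "IG" in the list above stands for information gain, which is not spelled out in the text. The feature names (`has_ip_address`, `uses_https`), the four toy rows, and the label encoding (1 = phishing, 0 = legitimate) are hypothetical and are not taken from the Kaggle dataset.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups: Dict[int, List[int]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(
    rows: Sequence[Sequence[int]], labels: Sequence[int], names: Sequence[str]
) -> List[Tuple[str, float]]:
    """Return (feature name, information gain) pairs, strongest feature first."""
    gains = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: each row is one URL, each column one binary attribute.
names = ["has_ip_address", "uses_https"]
rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]  # 1 = phishing, 0 = legitimate (assumed encoding)

for name, gain in rank_features(rows, labels, names):
    print(f"{name}: {gain:.3f}")
```

In this toy data the first attribute separates the two classes perfectly and the second carries no information, so the ranking places the first attribute on top. On a real dataset the gains would fall between these two extremes.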
A. Data Acquisition
Data acquisition is essential for data analysis. The datasets for our research were obtained from kaggle.com.
B. Data Preprocessing
Data pre-processing is an essential task in ML applications. It was performed on the raw data, which was formatted using data mining techniques. A clean and noise-free
3) Random Forest
Decision tree (DT) is one of the most popular machine learning algorithms for binary classification. It reaches a decision very quickly by building a small tree and makes predictions based on the training dataset. As its name implies, it has a tree structure: each internal node denotes a test on an attribute, each branch is an outcome of the test, and each terminal node, called a leaf, holds a classification label. Determining the best attribute is the most important step in this algorithm. Ross Quinlan developed the ID3 decision tree algorithm. It was primarily used in data mining and information theory and is now used in machine learning and natural language processing. The proposed model used the ID3 algorithm to classify each website as either an official (legitimate) or a phishing website. The following steps are followed to obtain the classification:
1. Start with the training data set, named 'S', which should have attributes and classifications.
2. Determine the best attribute of the data set.
3. Divide 'S' into subsets, one for each value of the best attribute.
4. Build a decision tree node which holds the best attribute.
IV. RESULTS AND DISCUSSIONS
Phishing attack detection based on feature analysis and data analysis of the selected dataset was carried out in this paper. The confusion matrix tabulates the model's predictions against the actual classifications in the dataset. Accuracy, precision, recall, and F1 score, all calculated from the confusion matrix, were used for performance evaluation. The confusion matrix uses a specific table layout to present performance, as shown in Table 3, and the metrics were computed according to the following equations:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
True Positive (TP): phishing URLs that were correctly predicted as phishing URLs.
False Negative (FN): actual phishing URLs that were misclassified as legitimate URLs.
False Positive (FP): actual legitimate URLs that were misclassified as phishing URLs.
True Negative (TN): The actual class and the predicted class was the same as it showed here the actual
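The precision, recall, accuracy, and F1 score equations given in this section can be sketched in a few lines of Python. The function name `classification_metrics` and the counts in the example are hypothetical; they are chosen only to exercise the formulas and are not results from this paper. Returning 0.0 when a denominator is zero is a convention assumed here, not something the text specifies.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts.

    A metric whose denominator is zero is reported as 0.0 (assumed convention).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
example = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for metric, value in example.items():
    print(f"{metric}: {value:.4f}")
```

Here phishing is treated as the positive class, matching the TP, FN, FP, and TN definitions above.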