
Detection of Phishing Websites Using Machine Learning Techniques

This document summarizes a research paper that proposes using machine learning techniques to detect phishing websites. It discusses how phishing attacks have increased with more online activity. The paper aims to implement an effective phishing detection system using supervised classification models like K-Nearest Neighbor, Kernel Support Vector Machine, Decision Tree, and Random Forest Classifier. It finds that the Random Forest Classifier achieved the highest accuracy of 96.82% on the test dataset. Previous related work on phishing detection using techniques like feature selection, neural networks, and visual similarity are also summarized.
International Journal of Computer Science and Information Security (IJCSIS),

Vol. 18, No. 7, July 2020

Detection of phishing websites using Machine Learning Techniques


Bhagyashree A V
M.Tech Scholar, Dept. of Computer Science & Engineering,
BMS Institute of Technology & Management, Yelahanka, Bengaluru, India
[email protected]

Anjan K Koundinya
Associate Professor and PG Coordinator, Dept. of Computer Science & Engineering,
BMS Institute of Technology & Management, Yelahanka, Bengaluru, India
[email protected]

Abstract- With the growing integration of the Internet into public life, the Internet is changing how people learn and work, but it also exposes us to increasingly serious security threats. How to recognize different network attacks, especially attacks not seen before, is a key problem that needs to be solved urgently. The aim of phishing website URLs is to collect personal data such as user names, passwords and online banking transactions. Phishers use websites that are visually and semantically similar to genuine ones. Since most users go online to access the services provided by government and financial institutions, there has been a significant increase in phishing attacks in the last few years. Machine learning is a useful tool for countering phishing attacks, and there are several approaches to identifying phishing sites. The main aim of this paper is to implement a system with high efficiency and accuracy in a cost-effective way. The work is implemented using four supervised machine learning classification models: K-Nearest Neighbor, Kernel Support Vector Machine, Decision Tree and Random Forest classifier. The Random Forest classifier was found to be the most accurate for the chosen dataset, with an accuracy score of 96.82%.

Keywords- Machine Learning, classification, Cyber security, Phishing, KNN, Kernel SVM, Decision Tree, Random Forest Classifier

I. INTRODUCTION

As technology continues to develop, phishing techniques have advanced rapidly, and this must be countered by using anti-phishing mechanisms to detect phishing. Several anti-phishing tools are available, and each has its own disadvantages. This paper concentrates on basic supervised machine learning classification techniques to find a solution to phishing attacks.

Supervised classification uses a labeled dataset to train the model. All four algorithms used, namely K-Nearest Neighbor, kernel support vector machine, decision tree and random forest classifier, are classification models. The chosen dataset has thirty features that are considered to classify a website as phishing or legitimate.

II. RELATED WORK

This section concentrates on previous work carried out to detect phishing websites using machine learning.

A phishing detection model based on optimal feature selection and a neural network was proposed [1]. This work introduces the feature validity value, a new index that measures the impact of each feature on the detection of phishing websites. Based on this index, an algorithm is created to select the optimal features from phishing websites.

https://sites.google.com/site/ijcsis/
ISSN 1947-5500

Fuzzy rough set theory [2] is applied as a technique to select the most impactful features from a few standard datasets. The selected features are then fed to classifiers for the detection of phishing.

A three-phase phishing attack detection model [3] called Web Crawler based Phishing Attack Detector was proposed. It takes the web content, traffic and URL as input features. Based on those features, a website is classified as phishing or non-phishing.

A detection system was proposed [4] that adapts to the dynamic environment of phishing websites. This is purely a client-side solution and does not require any third-party support.

Parse tree validation is another technique proposed to detect phishing websites [5]. The approach makes use of the hyperlinks of the current page, obtained using the Google API, and builds a parse tree from the intercepted hyperlinks. Parsing starts from the root and follows the depth-first search algorithm, checking whether any intermediate or leaf node has the same value as the root.

Feature engineering plays a vital role in the detection of phishing websites, although accuracy depends heavily on knowledge of the features. Extracting features from different dimensions is very useful but also time consuming. To address this drawback, a fast phishing detection method based on multidimensional features [6] was proposed.

A method that combines the collection, validation and detection of phishing websites into an online tool was proposed. It monitors PhishTank's blacklist [7] and detects phishing websites in real time.

A strong relationship was established between the identified heuristics and the legitimacy of a website by analyzing training sets of websites (both phishing and legitimate); within the system, new patterns are analyzed and findings are reported. A system called Phishing-Detective [8] is introduced that detects phishing websites based on existing and newly discovered heuristics.

Among the several phishing detection schemes, the scheme using visual similarity is attracting attention. It takes a screenshot of a website and stores it in a database. If the input website's screenshot is nearly identical to one in the database, the website is predicted to be phishing. However, if several similar websites exist, the first input website is regarded as legitimate. As a result, the scheme cannot reliably identify the legitimate website, and recognizing the phishing target becomes difficult. A visual similarity based phishing detection scheme [9] using image and CSS with a target website finder is proposed.

III. PROPOSED SYSTEM

The architecture of the proposed system is shown in figure 3.1. The URLs that are to be classified as either phishing or legitimate are the input to the classifier. The dataset is split into training and testing sets. The training part of the dataset is used to train the classifier, which learns the pattern from the training data. The testing part of the dataset is used to test the classifier. The classifier then predicts any URL as either phishing or legitimate based on the pattern learnt.

Fig: 3.1 System architecture

To give a detailed view of the system, figure 3.2 gives a level 2 data flow diagram in Gane-Sarson notation.
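The workflow described for the proposed system (a URL goes in, feature values come out, and a trained classifier labels it phishing or legitimate) can be sketched with the Python standard library. This is a minimal illustration, not the authors' implementation: it uses only three of the roughly thirty features, a hand-written K-Nearest Neighbor vote standing in for the four models, and a handful of hypothetical labeled URLs. The names `extract_features`, `knn_predict` and `LABELED_URLS`, the value k=3, and the scaling caps are all assumptions made for the example.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlparse


def extract_features(url):
    """Map a URL to [has_ip, scaled_length, scaled_subdomain_count]."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        has_ip = 1.0
    except ValueError:
        has_ip = 0.0
    # Cap and rescale the counts to [0, 1] so no feature dominates the distance.
    length = min(len(url), 100) / 100.0
    # Rough subdomain count: labels beyond "domain.tld" (ignores multi-part TLDs).
    extra_labels = 0 if has_ip else max(host.count(".") - 1, 0)
    subdomains = min(extra_labels, 5) / 5.0
    return [has_ip, length, subdomains]


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples; the paper's dataset is not reproduced here.
LABELED_URLS = [
    ("https://example.com/login", "legitimate"),
    ("https://www.example.org/", "legitimate"),
    ("https://docs.example.net/guide", "legitimate"),
    ("http://192.0.2.10/secure/update/account/verify.php?id=1234567890", "phishing"),
    ("http://login.secure.account.example-bank.verify.example.test/session/confirm/identity", "phishing"),
    ("http://198.51.100.7/bank/signin/confirm/credentials/reset.html", "phishing"),
]

# "Training" for k-NN just means storing the labeled feature vectors.
train_X = [extract_features(url) for url, _ in LABELED_URLS]
train_y = [label for _, label in LABELED_URLS]

# Classify two unseen, equally hypothetical URLs.
for url in ("http://203.0.113.5/bank/login/verify/account.php", "https://example.com/about"):
    print(url, "->", knn_predict(train_X, train_y, extract_features(url)))
```

With a real dataset the same shape holds: the feature extractor would cover all of the dataset's features, and any of the four classifiers could replace the nearest-neighbor vote.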


Fig: 3.2. Level 2 DFD of the system

The above figure shows the system divided into three stages: preprocessing, feature scaling and classification. The details are depicted graphically and give an idea of how the system works. Four classifiers are used: K-Nearest Neighbor, Kernel SVM, Decision tree and Random forest classifier.

The implementation involves the following steps:

Splitting of dataset: The dataset is divided into training and testing sets. 75% of the dataset is used for training and 25% for testing.
Preprocessing: This step involves filling in or removing missing fields to obtain a clean dataset.
Feature scaling: Feature scaling is also part of preprocessing. It is a process in which the independent variables in the dataset are normalized within a fixed range so that no feature dominates the others in the prediction.
Feature extraction: Feature values are extracted using Python modules. The IP address, the length of the URL, the domain name and the subdomains are some of the features considered. These values are extracted and stored in a list.

IV. EXPERIMENTAL ANALYSIS AND RESULT

The performance of the four algorithms considered, i.e. K-nearest neighbor, kernel support vector machine, decision tree and random forest classifier, was evaluated using four measures: the accuracy, recall score, precision and F1 score were calculated for each algorithm. In their standard form, with TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, the equations are as follows:

Accuracy:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Recall score:
\[ \text{Recall} = \frac{TP}{TP + FN} \]

Precision:
\[ \text{Precision} = \frac{TP}{TP + FP} \]

F1 score:
\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

The comparative plots of these four performance metrics are given in figures 4.1 to 4.4.
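As a rough illustration of the 75/25 split, the feature scaling step and the four evaluation scores, the following standard-library Python sketch may help. It makes several assumptions that the paper does not state: the split is a seeded random shuffle, the "fixed range" scaling is min-max scaling to [0, 1] fitted on the training set only, "phishing" is treated as the positive class, and the scores use the standard confusion-matrix definitions. The function names are illustrative, and the labels in the usage example are made up.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_min_max(train_X):
    """Learn each feature's minimum and maximum from the training rows only."""
    columns = list(zip(*train_X))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(X, mins, maxs):
    """Rescale with the training range; training rows land in [0, 1].

    Rows outside the training range (e.g. test rows) may fall outside [0, 1].
    Constant features are mapped to 0.0 to avoid dividing by zero.
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in X
    ]


def scores(y_true, y_pred, positive="phishing"):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Usage with made-up data: 100 row indices split 75/25.
train_rows, test_rows = train_test_split(list(range(100)))
print(len(train_rows), len(test_rows))

# Scaling a tiny two-feature training set.
mins, maxs = fit_min_max([[0, 10], [5, 20], [10, 30]])
print(apply_min_max([[5, 20]], mins, maxs))

# Scoring made-up predictions against made-up true labels.
y_true = ["phishing"] * 4 + ["legitimate"] * 4
y_pred = ["phishing", "phishing", "phishing", "legitimate",
          "legitimate", "legitimate", "legitimate", "phishing"]
print(scores(y_true, y_pred))
```

Fitting the scaler on the training rows alone keeps information from the test set out of training, which is why `fit_min_max` and `apply_min_max` are separate functions.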


Accuracy:

Fig: 4.1. Comparative plot of accuracy score

Fig 4.1 shows that the Random forest classifier has the highest accuracy of the four models for the dataset considered.

Recall score:

Fig: 4.2. Comparative plot of recall score

Fig 4.2 shows that RFC has a recall score of 98.28%, which is higher than that of the other three models.

Precision:

Fig: 4.3. Comparative plot of precision score

Fig 4.3 shows that the precision score of the Random forest classifier is the highest of all four, at 96%.

F1 score:

Fig: 4.4. Comparative plot of F1 score

Fig 4.4 shows the comparative plot of the F1 scores of the four algorithms; RFC again has the highest F1 score.

To summarize the results, Table 4.1 lists the scores of the four algorithms for the four performance metrics.

Table 4.1: Performance metrics value of the algorithms


From the above table, it is clear that the Random forest classifier is the best of the four classifiers considered for the chosen dataset.

V. CONCLUSION

Phishing is becoming an advanced threat in this rapidly growing world of technology. This paper aims to explore this area by presenting a use case of detecting phishing websites using ML. It set out to build a phishing detection method using ML tools and techniques. The proposed system made use of four classification models, namely KNN, kernel SVM, decision tree and random forest classifier. The random forest classifier, being an ensemble classifier, gave the best accuracy score of 96.82% for the chosen dataset, which uses about 30 features for the prediction.

REFERENCES

[1] E. Zhu, Y. Chen, C. Ye, X. Li, and F. Liu. OFS-NN: An effective phishing websites detection model based on optimal feature selection and neural network. IEEE Access, 7:73271-73284, 2019.
[2] M. Zabihimayvan and D. Doran. Fuzzy rough set feature selection to enhance phishing attack detection, 03 2019.
[3] T. Nathezhtha, D. Sangeetha, and V. Vaidehi. WCPAD: Web crawling based phishing attack detection. In 2019 International Carnahan Conference on Security Technology (ICCST), pages 1-6, 2019.
[4] M. M. Yadollahi, F. Shoeleh, E. Serkani, A. Madani, and H. Gharaee. An adaptive machine learning based approach for phishing detection using hybrid features. In 2019 5th International Conference on Web Research (ICWR), pages 281-286, 2019.
[5] C. E. Shyni, A. D. Sundar, and G. S. E. Ebby. Phishing detection in websites using parse tree validation. In 2018 Recent Advances on Engineering, Technology and Computational Sciences (RAETCS), pages 1-4, 2018.
[6] P. Yang, G. Zhao, and P. Zeng. Phishing website detection based on multidimensional features driven by deep learning. IEEE Access, 7:15196-15209, 2019.
[7] J. Li and S. Wang. Phishbox: An approach for phishing validation and detection. In 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pages 557-564, 2017.
[8] A. J. Park, R. N. Quadari, and H. H. Tsang. Phishing website detection framework through web scraping and data mining. In 2017 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pages 680-684, 2017.
[9] S. Haruta, H. Asahina, and I. Sasase. Visual similarity-based phishing detection scheme using image and CSS with target website finder. In GLOBECOM 2017 - 2017 IEEE Global Communications Conference, pages 1-6, 2017.
[10] Y. Xin, L. Kong, Z. Liu, Y. Chen, Y. Li, H. Zhu, M. Gao, H. Hou, and C. Wang. Machine learning and deep learning methods for cybersecurity. IEEE Access, 6:35365-35381, 2018.
[11] N. Agrawal and S. Singh. Origin (dynamic blacklisting) based spammer detection and spam mail filtering approach. In 2016 Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC), pages 99-104, 2016.
[12] S. Patil and S. Dhage. A methodical overview on phishing detection along with an organized way to construct an anti-phishing framework. In 2019 5th International Conference on Advanced Computing Communication Systems (ICACCS), pages 588-593, 2019.
