CSE3502 Information Security Management: Phishing Detection Using Data Mining Techniques


CSE3502-INFORMATION SECURITY MANAGEMENT

Phishing Detection using Data Mining Techniques


Final Review

SUBHANU SANKAR ROY 20BIT0151

SANJAY NITHIN S 20BIT0150

SHREYAS CHITRANSH 20BIT0202

Submitted to

Dr. ANBARASA KUMAR A

School of Information Technology and Engineering


1) Objectives:

● Apply data mining techniques in cybersecurity.
● Detect phishing based on URL string and IP address analysis.
● Explore different data preprocessing and modeling techniques.
2) Literature Survey:

No. | Name of Paper | Authors | Methodologies | Advantages/Output | Challenges

1. A Comparative Analysis of Machine Learning-Based Website Phishing Detection Using URL Information
   Authors: Md. Milon Uddin, Kazi Arfatul Islam, Muntasir Mamun, Vivek Kumar Tiwari, Jounsup Park
   Methodologies: The authors used a dataset of legitimate and phishing URLs collected from various sources, such as PhishTank, OpenPhish, and Google Safe Browsing. They extracted URL features such as the length of the URL, number of slashes, presence of certain keywords, and the domain age, and used these features to train and evaluate different machine learning algorithms, including Support Vector Machines (SVM), Random Forest, K-Nearest Neighbor (KNN), and Artificial Neural Networks (ANN).
   Advantages/Output: The authors also compared the results of their machine learning models with a baseline model that uses a rule-based approach for detecting phishing websites.
   Challenges: Availability and quality of data, the selection of appropriate features and algorithms, and the trade-off between detection accuracy and false positives.

2. Detection of Cyber Attacks: XSS, SQLI, Phishing Attacks and Detecting Intrusion Using Machine Learning Algorithms
   Authors: Aashutosh Bhardwaj; Saheb Singh Chandok; Aniket Bagnawar; Shubham Mishra; Deepak Uplaonkar
   Methodologies: XSS attacks are detected using a CNN approach, SQLI attacks using a Logistic Regression approach, and phishing using an SVM approach. In addition to these attacks, DTC, BNB, and KNN approaches are employed to detect intrusion in the system.
   Advantages/Output: The CNN approach yields 98.59% accuracy for detecting XSS attacks, the Logistic Regression approach yields 92.85% accuracy for SQLI, and the SVM approach yields 85.62% accuracy for phishing attacks. DTC, BNB, and KNN yield accuracies of 99.47%, 90.67% and 99.16% respectively for detecting intrusions.
   Challenges: Their main objective is to demonstrate how fundamentally different the intrusion detection problem is from these other applications, making it far more challenging for the intrusion detection community to utilize machine learning effectively.

3. Phishing Attacks Detection using Machine Learning and Deep Learning Models
   Authors: Aljabri, Malak; Mirza, Samiha
   Methodologies: They began by examining the datasets to determine their features, sizes, and shortcomings. The datasets were then preprocessed, where the class imbalance issue was solved. Then, the most correlated features were selected. Finally, the classification models were applied, and the results were evaluated.
   Advantages/Output: In the first dataset, the RF and SVM models outperformed the others with an accuracy of 100% in detecting phishing URLs. In the second dataset as well, RF outperformed the other models, achieving an accuracy of 92.83%.
   Challenges: In traditional methods, when a new URL is received, it is compared against a signature list. If a match is found, the URL is labeled as malicious. Because of the reliance on pre-defined signatures, attackers can easily evade them, and systems that follow this approach will not be able to identify new harmful URLs.

4. Phishing Website Detection Using Machine Learning
   Authors: Adarsh Mandadi; Saikiran Boppana; Vishnu Ravella; R Kavitha
   Methodologies: The algorithm is first trained with a base dataset, which is used as training data. Data taken from web traffic acts as input for feature extraction, which is done mainly on three types of features: URL-based, domain-based, and HTML/JS-based. The extracted feature data acts as testing data. The machine learning model is exposed through an API, the prediction is made, and the output is generated as phishing or legitimate.
   Advantages/Output: After training, the accuracy of Random Forest is 87.0% and the accuracy of the Decision Tree is 82.4%.
   Challenges: The conventional method detects phishing websites by adding blacklisted URLs and Internet Protocol (IP) addresses to the antivirus database, which is referred to as the blacklist method. The major disadvantage of this approach is that it cannot detect zero-hour phishing attacks.

5. Phishing Attacks Detection using Machine Learning Approach
   Authors: Mohammad Nazmul Alam; Dhiman Sarma; Farzana Firoz Lima; Ishita Saha; Rubaiath-E-Ulfath; Sohrab Hossain
   Methodologies: The proposed research work developed a model to detect phishing attacks using machine learning (ML) algorithms such as random forest (RF) and decision tree (DT). A standard dataset of phishing attacks from Kaggle was used for ML processing. To analyze the attributes of the dataset, the proposed model used feature selection algorithms such as principal component analysis (PCA).
   Advantages/Output: A maximum accuracy of 97% was achieved through the random forest algorithm.
   Challenges: RF had less variance, and it could handle the over-fitting problem. The random forest achieved an accuracy of 97%. In the authors' future work, phishing attacks will be predicted from the logged dataset of attacks by using a convolutional neural network (CNN).

6. PWDGAN: Generating Adversarial Malicious URL Examples for Deceiving Black-Box Phishing Website Detector using GANs
   Authors: Trinh Nguyen Bac; Phan The Duy; Van-Hau Pham
   Methodologies: In this article, they build a model based on a generative adversarial network (GAN), a deep learning-based framework, to conduct black-box attacks using the Phishtank and Alexa datasets that try to evade and bypass ML-based phishing detectors.
   Advantages/Output: To evaluate the performance of the proposed model, several machine learning algorithms are used as the black-box phishing detector, including Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and Multi-layer Perceptron (MLP).
   Challenges: Initially, the classifiers are capable of good detection, with the TPR for phishing URL detection reaching 100% on both the training set and the testing set, except for the RF and LR classifiers with a TPR of 99%. At the 100th epoch, the classifiers detect malicious samples at a lower rate; even the DT classifier could not distinguish these malicious samples, with a TPR of 0%.

7. Phishing Detection Using Machine Learning Algorithm
   Authors: Aliyu Alhaji Abubakar, Abubakar Adamu, and Halima Sadia Iliyasu
   Methodologies: The authors used a dataset of 1,500 phishing and non-phishing emails to train and test six machine learning algorithms: KNN, SVM, Random Forest, Decision Tree, Naïve Bayes, and Logistic Regression. The performance of each algorithm was evaluated using accuracy, precision, recall, and F1-score metrics.
   Advantages/Output: The use of machine learning algorithms can improve the accuracy and efficiency of phishing detection. It can also reduce the need for manual analysis and increase the speed of detection.
   Challenges: The effectiveness of the machine learning model is highly dependent on the quality and quantity of the training dataset. Phishing attacks are constantly evolving, and the model may become less effective over time if it is not updated with new data.

8. Phishing Websites Detection using Machine Learning
   Authors: Manal AlGhamdi, Ahmed AlEroud, and Ahmed Alghamdi
   Methodologies: The authors used a dataset of legitimate and phishing websites to train and test various machine learning algorithms. They extracted features from the websites using a combination of HTML parsing and web page rendering. The authors evaluated the performance of the algorithms using metrics such as accuracy, precision, recall, and F1 score.
   Advantages/Output: Machine learning can detect previously unknown phishing attacks. It can analyze many websites in a short amount of time.
   Challenges: Need for a large, diverse dataset for training. Need to select appropriate features and algorithms.

9. Phishing website detection using machine learning and deep learning techniques
   Authors: Muhammed Salih Özdemir and Hakan Koç
   Methodologies: The paper proposes the use of machine learning and deep learning techniques for detecting phishing websites. The dataset used in the study comprises 1100 phishing websites and 2000 legitimate websites. Three different feature extraction techniques were used to extract features from the website URLs. These features were then fed into three different classifiers, namely K-Nearest Neighbors (KNN), Random Forest, and Artificial Neural Networks (ANN), to classify the websites as phishing or legitimate.
   Advantages/Output: The proposed approach can be applied to many websites, making it suitable for real-world applications. The use of multiple feature extraction techniques and classifiers improves the performance of the system.
   Challenges: The accuracy of the system can be affected by the quality of the dataset used for training. The proposed approach may not be effective against sophisticated phishing attacks that use advanced social engineering techniques.

10. Phishing Detection in E-mails using Machine Learning
   Authors: Srishti Rawal, Bhuvan Rawal, Aakhila Shaheen, Shubham Malik
   Methodologies: The authors used a dataset of 1,000 legitimate emails and 1,000 phishing emails to train and test their machine learning models. They extracted features from the emails such as sender address, subject line, body text, and embedded links. They then used various machine learning algorithms such as decision trees, random forests, and support vector machines to classify the emails as legitimate or phishing.
   Advantages/Output: The use of machine learning algorithms can help improve the accuracy of phishing detection compared to traditional rule-based approaches. Machine learning models can also adapt to new phishing techniques and patterns, making them more robust.
   Challenges: One of the challenges of using machine learning for phishing detection is the need for a large and diverse dataset of both legitimate and phishing emails to train the models. Another challenge is the possibility of false positives or false negatives, which can affect the effectiveness of the models.

11. Phishing Website Detection using Machine Learning Algorithms
   Authors: D. Yogesh and A. Ramachandran
   Methodologies: The authors tried to detect phishing attacks by using machine learning algorithms (Logistic Regression, K-Nearest Neighbor, Decision Tree, Random Forest).
   Advantages/Output: Achieved high accuracy in detecting phishing websites (up to 98.7%). Can be used in real time to detect phishing websites as they are created.
   Challenges: Limited feature selection and extraction may affect accuracy. Limited to detecting known types of phishing websites; may miss newly created ones.

12. Content-Based Phishing Detection with Machine Learning
   Authors: Muhammad Imran Sarwar, Mohammad Ahmad, Adil Mehmood Khan, Muhammad Naeem, Syed Ali Abbas, Muhammad Awais Shibli
   Methodologies: Content-based phishing detection using machine learning techniques. The authors collected a dataset consisting of legitimate and phishing emails and used several machine learning algorithms, including Naïve Bayes, Random Forest, and Support Vector Machines (SVM), to train and test their models. The authors also used feature selection techniques to select the most relevant features for their models.
   Advantages/Output: The proposed methodology achieved high accuracy in detecting phishing emails. The methodology can be applied to different types of emails, such as phishing emails targeting social media or banking websites.
   Challenges: The performance of the proposed methodology may be affected by the quality of the training dataset. Phishing attacks are becoming more sophisticated, and new phishing techniques may not be detected by the proposed methodology.

13. Real-time detection of phishing websites using public key certificates
   Authors: T. Holz, M. Engelberth, F. Freiling, and E. Gerhards-Padilla
   Methodologies: The authors propose a system to detect phishing websites in real time using public key certificates. They collect a large number of legitimate and phishing websites and extract the certificates from them. The certificates are then analyzed using several features such as the certificate authority, the certificate chain, and the hostname. Machine learning algorithms are trained on these features to classify websites as legitimate or phishing.
   Advantages/Output: Real-time detection of phishing websites. Relies on the analysis of public key certificates, which are widely used and trusted.
   Challenges: Phishing websites may use stolen or fake certificates, making it difficult to rely solely on certificate analysis. The system may generate false positives if a legitimate website uses an unusual certificate configuration.

14. Performance Analysis of Machine Learning Algorithms Used for Web Based Phishing Detection
   Authors: Saurabh Singh, Sarika Jain, and Manju Khari
   Methodologies: The study evaluates the performance of five machine learning algorithms in detecting web-based phishing attacks. The authors collected a dataset of legitimate and phishing URLs and extracted 30 features for each URL. They then trained and tested the algorithms using various performance metrics, including accuracy and F1 score.
   Advantages/Output: The study provides a comprehensive analysis of different machine learning algorithms for web-based phishing detection. It includes a large dataset with a diverse set of phishing attacks, allowing for a thorough evaluation of the algorithms.
   Challenges: The dataset used in the study may not be representative of all types of phishing attacks. The performance of the algorithms may vary depending on the specific features and metrics used for evaluation.

15. Detection of Phishing Websites from URLs by using Classification Techniques on WEKA
   Authors: Buket Geyik, Kübra Erensoy, Emre Kocyigit
   Methodologies: Data collection from PhishTank and OpenPhish; feature extraction using six features; implementation of four classification algorithms; implementation of the model using WEKA software; cross-validation technique for model evaluation.
   Advantages/Output: Accurate classification of URLs; high detection rate and low false positive rate; identification of the most effective algorithm; ability to compare the performance of different algorithms.
   Challenges: Limited dataset availability; limited number of features for classification; dependence on the selected algorithms.
16. User Behaviour Based Phishing Websites Detection
   Authors: Xun Dong, John A. Clark, Jeremy L. Jacob
   Methodologies: The authors propose a system for detecting phishing websites based on user behavior. The system uses machine learning algorithms to analyze user behavior data and identify patterns that indicate the likelihood of a website being a phishing site.
   Advantages/Output: The system does not rely on static features of websites that can be easily spoofed by attackers. The system is based on user behavior, which is difficult for attackers to mimic.
   Challenges: The system requires access to a large amount of user behavior data, which can be difficult to obtain. The system may produce false positives if users have unusual behavior patterns.

17. Web Phishing Detection using Classifier Ensemble
   Authors: Nisheeth Joshi, Ajay Kumar
   Methodologies: The authors used a classifier ensemble approach to detect phishing websites. They collected a dataset of both legitimate and phishing websites and extracted features such as URL length, domain age, and SSL certificate information. They then trained multiple classifiers, including decision trees, naive Bayes, and random forests, on the extracted features. Finally, they combined the results of these classifiers to make a final decision on whether a website is legitimate or phishing.
   Advantages/Output: The use of a classifier ensemble approach allows for more accurate detection of phishing websites compared to using a single classifier. The authors used a diverse set of features to train their classifiers, which helps to capture different aspects of phishing websites.
   Challenges: The dataset used by the authors may not be representative of all phishing websites and may not generalize well to new, previously unseen phishing attacks. The accuracy of the approach may decrease if the features used to train the classifiers are not well-suited to the specific phishing attack being detected.

18. Ensemble Phishing Attacks Detection using Machine Learning Algorithm
   Authors: Mohammad Khaledur Rahman, Shadman Sakib, Tanjila Farah, and Muhammad Al-Hashimi
   Methodologies: Ensemble approach using six machine learning algorithms: Random Forest, Decision Tree, AdaBoost, Gradient Boosting, Logistic Regression, and Naive Bayes. The approach involved pre-processing the dataset, feature extraction, feature selection, and training and testing of the models.
   Advantages/Output: The use of an ensemble approach increases the accuracy and reliability of the phishing detection system. The inclusion of multiple machine learning algorithms ensures that the system can detect a wide range of phishing attacks.
   Challenges: The accuracy of the model is dependent on the quality and quantity of the dataset used. The model may be susceptible to false positives and false negatives, which can result in legitimate websites being blocked or phishing websites being allowed through.

19. Phishing Attacks Detection using Deep Learning Approach
   Authors: Sohrab Hossain, Dhiman Sarma, Rana Joyti Chakma
   Methodologies: The authors try to detect phishing attacks using deep learning.
   Advantages/Output: High accuracy in detecting phishing attacks; ability to identify new and evolving phishing attacks; automated and real-time detection; reduced false positive rates.
   Challenges: A large amount of labeled training data is required. The model may not generalize well to different types of phishing attacks. Adversarial attacks can be used to evade detection.

20. Machine Learning Based Phishing Web Sites Detection
   Authors: Hamza M. El-Said, Tarek M. Mahmoud, and M. F. Tolba
   Methodologies: Machine learning-based approach using decision trees and feature extraction techniques to classify phishing websites based on their URLs and webpage content. The authors extracted features from the URL such as length, domain name, and TLD, and from webpage content such as hyperlinks and HTML tags. They used these features to train and test their decision tree model.
   Advantages/Output: A machine learning-based approach can learn and adapt to new and evolving phishing techniques. The combination of URL and webpage content features provides a more comprehensive approach to phishing detection.
   Challenges: The features used may not be sufficient for detecting all types of phishing attacks. The model may not generalize well to new and unseen phishing attacks or websites. The dataset used for training and testing the model may not be representative of all possible phishing attacks.
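
Several of the surveyed papers extract lexical features from the URL string, such as its length, the number of slashes, and the presence of certain keywords. The following is a minimal sketch of that idea using only the Python standard library. The function name, the feature names, and the keyword list are illustrative assumptions and are not taken from any of the papers or from this project's code.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative keyword list (an assumption, not from the surveyed papers).
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def host_is_ip(host: str) -> bool:
    """Return True when the host part of a URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL string."""
    host = urlparse(url).hostname or ""
    lowered = url.lower()
    return {
        "length_url": len(url),
        "qty_slash": url.count("/"),
        "qty_dot": url.count("."),
        "host_is_ip": host_is_ip(host),
        "has_keyword": any(k in lowered for k in SUSPICIOUS_KEYWORDS),
    }


example = url_features("http://192.168.0.1/secure/login.php")
print(example)
```

Each dictionary can become one row of a feature table, which is the form in which classifiers like those in the survey consume URL data.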

3) Dataset Description:

Link: https://gregavrbancic.github.io/Phishing-Dataset/

● The dataset has 111 attributes and 1 target attribute.
● The task is an asymmetric binary classification.
● No null values were found in the dataset.
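
The checks above (attribute count, class balance, null values) can be sketched with pandas as follows. The tiny dataframe is a synthetic stand-in for the real CSV, and the column names, including the target name `phishing`, are assumptions made for illustration.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame, target: str) -> dict:
    """Return basic facts about a labelled dataset."""
    return {
        "n_attributes": df.shape[1] - 1,          # every column except the target
        "n_rows": df.shape[0],
        "n_nulls": int(df.isnull().sum().sum()),  # total number of missing cells
        "class_counts": df[target].value_counts().to_dict(),
    }


# Synthetic stand-in for the real data (hypothetical column names).
toy = pd.DataFrame(
    {
        "qty_dot_url": [1, 2, 3, 2],
        "length_url": [20, 35, 80, 41],
        "phishing": [0, 0, 1, 0],
    }
)
summary = summarize_dataset(toy, target="phishing")
print(summary)
```

On the real data, the same function would be called on the dataframe loaded from the dataset's CSV file.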

4) Data Preprocessing:

● Attributes that contain only zeroes are dropped.
● In total, 20 such attributes were removed.
● Min-max scaling is done manually to keep the data in a dataframe, because the pre-defined functions return a NumPy array.
● Feature selection was initially attempted by correlation analysis, but the attributes were strongly interrelated, which made it hard to obtain a dataset free of redundant attributes.
● So LassoCV was applied to the dataset instead; it identifies the important attributes, namely those whose coefficients are not zero.
● In total, 38 attributes were selected by LassoCV.
● To reduce dimensionality further, Principal Component Analysis was performed and 20 components were generated.
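
The first two preprocessing steps can be sketched as below: dropping all-zero attributes, then min-max scaling with plain pandas arithmetic so that the result stays a dataframe. This is a minimal illustration under assumptions. The helper names and the toy columns are hypothetical and do not come from the project's repository.

```python
import pandas as pd


def drop_all_zero_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns that contain at least one non-zero value."""
    keep = (df != 0).any(axis=0)
    return df.loc[:, keep]


def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to [0, 1]; the result is still a DataFrame."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column has zero range; use 1 instead to avoid dividing by zero.
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range


# Toy feature table with hypothetical column names.
features = pd.DataFrame(
    {"length_url": [10, 20, 30], "all_zero": [0, 0, 0], "qty_slash": [1, 3, 5]}
)
cleaned = drop_all_zero_columns(features)
scaled = min_max_scale(cleaned)
print(scaled)
```

Because the scaling uses pandas operations only, column names and the index survive, so later steps can still refer to attributes by name.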
Screenshots:

1) Correlation analysis on the dataset.
2) Correlation analysis on the features that have correlation > 0.5 with the target attribute.
3) Feature importance evaluated by LassoCV.
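
Feature selection with LassoCV followed by PCA can be sketched with scikit-learn as shown here. The data is synthetic, and the number of components (3) is chosen only to fit the toy example; the report itself selected 38 attributes and generated 20 components. All variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 10 features; only the first three drive the label.
X = pd.DataFrame(rng.normal(size=(200, 10)), columns=[f"f{i}" for i in range(10)])
y = (X["f0"] + 2 * X["f1"] - X["f2"] > 0).astype(int)

# LassoCV picks the regularisation strength by cross-validation. Features whose
# coefficient is shrunk to exactly zero are treated as unimportant.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = X.columns[lasso.coef_ != 0].tolist()

# Project the surviving features onto a smaller number of principal components.
n_components = min(3, len(selected))
components = PCA(n_components=n_components).fit_transform(X[selected])

print("selected features:", selected)
print("component matrix shape:", components.shape)
```

The absolute values of `lasso.coef_` can also be plotted per attribute to obtain a feature-importance chart similar to the third screenshot.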
5) Techniques/Models used:

● Support Vector Machine, Logistic Regression, Random Forest Classifier, and AdaBoost models were trained on the preprocessed data.
● Further, a Voting Classifier ensemble model was developed by aggregating the trained models above.
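
A voting ensemble over these four model types can be sketched with scikit-learn as follows. The data is synthetic and the hyperparameters are defaults. The report does not state whether hard or soft voting was used, so hard (majority) voting here is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the preprocessed dataset.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

estimators = [
    ("svm", SVC(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("ada", AdaBoostClassifier(random_state=0)),
]

# Hard voting: each fitted model casts one vote and the majority label wins.
voting = VotingClassifier(estimators=estimators, voting="hard")
voting.fit(X_train, y_train)

# Test accuracy of each constituent model and of the ensemble.
scores = {name: est.score(X_test, y_test) for name, est in voting.named_estimators_.items()}
scores["voting"] = voting.score(X_test, y_test)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Printing the constituent scores next to the ensemble score gives the same kind of comparison as the report's results section.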

6) Results:

1) Confusion Matrix for the Voting model.

2) Error rate for the final model.


3) Classification Report of the Voting Classifier

4) Comparison of Voting Classifier Model with its constituents.
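
For reference, the quantities named above (confusion matrix, error rate, and the precision, recall, and F1 values of a classification report) relate to each other as in this small pure-Python sketch. The labels are made up for illustration and are not the project's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def summary_metrics(c):
    """Derive error rate, precision, recall and F1 from confusion counts."""
    total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "error_rate": (c["fp"] + c["fn"]) / total,  # share of misclassified samples
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = summary_metrics(counts)
print(counts, metrics)
```

The error rate is simply one minus the accuracy, so a model with a near-zero error rate has almost all of its mass on the diagonal of the confusion matrix.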


7) Conclusion:

The voting classifier model performs better than three of its constituent
models but slightly worse than the Random Forest Classifier. The error
rate of the model is almost negligible.

Code:
https://github.com/SanjayNithin2002/Phising-Detection-Using-Data-Mining-Techniques

