CSE3502 - Information Security Management: Phishing Detection Using Data Mining Techniques
Submitted to
1. A Comparative Analysis of Machine Learning-Based Website Phishing Detection Using URL Information
Authors: Md. Milon Uddin, Kazi Arfatul Islam, Muntasir Mamun, Vivek Kumar Tiwari, Jounsup Park
Methodology: The authors have used a dataset consisting of legitimate and phishing URLs, which were collected from various sources, such as PhishTank, OpenPhish, and Google Safe Browsing. They have extracted URL features such as the length of the URL, number of slashes, presence of certain keywords, and the domain age, and used these features to train and evaluate the performance of different machine learning algorithms, including Support Vector Machines (SVM), Random Forest, K-Nearest Neighbor (KNN), and Artificial Neural Networks (ANN).
Findings: The authors have also compared the results of their machine learning models with a baseline model that uses a rule-based approach for detecting phishing websites.
Remarks: availability and quality of data, the selection of appropriate features and algorithms, and the trade-off between detection accuracy and false positives.

2. Detection of Cyber Attacks: XSS, SQLI, Phishing Attacks and Detecting Intrusion Using Machine Learning Algorithms
Authors: Aashutosh Bhardwaj; Saheb Singh Chandok; Aniket Bagnawar; Shubham Mishra; Deepak Uplaonkar
Methodology: XSS attacks are detected using a CNN approach, SQLI attacks using a Logistic Regression approach, and phishing using an SVM approach. In addition to the attacks specified above, DTC, BNB, and KNN approaches are employed to detect intrusion in the system.
Findings: The CNN approach yields 98.59% accuracy for detecting XSS attacks, the Logistic Regression approach yields 92.85% accuracy for SQLI, and the SVM approach yields 85.62% accuracy for phishing attacks. The DTC, BNB, and KNN approaches yield accuracies of 99.47%, 90.67%, and 99.16% respectively for detecting intrusions.
Remarks: Their main objective is to demonstrate how fundamentally different the intrusion detection problem is from these other applications, making it far more challenging for the intrusion detection community to utilize machine learning effectively.

3. Phishing Attacks Detection using Machine Learning and Deep Learning Models
Authors: Aljabri, Malak; Mirza, Samiha
Methodology: They began by examining the datasets to determine their features, sizes, and shortcomings. The datasets were then preprocessed, where the class imbalance issue was solved. Then, the most correlated features were selected. Finally, the classification models were applied, and the results were evaluated.
Findings: In the first dataset, the RF and SVM models outperformed the others with an accuracy of 100% in detecting phishing URLs. In the second dataset as well, RF outperformed the other models, achieving an accuracy of 92.83%.
Remarks: In traditional methods, when a new URL is received, it is compared against a signature list. If a match is found, the URL is labeled as malicious. Because these methods rely on pre-defined signatures, attackers can easily evade them, and systems that follow this approach will not be able to identify new harmful URLs.

4. Phishing Website Detection Using Machine Learning
Authors: Adarsh Mandadi; Saikiran Boppana; Vishnu Ravella; R Kavitha
Methodology: In this work, the first algorithm is trained with a base dataset, which is used as training data. Data taken from web traffic acts as input for feature extraction, which is done mainly on three types of features: URL-based, domain-based, and HTML/JS-based. The feature-extracted data acts as testing data. The machine learning model is exposed to an API, the prediction is made, and the output is generated as phishing or legitimate.
Findings: After training, the accuracy of Random Forest is 87.0% and the accuracy of the Decision Tree is 82.4%.
Remarks: The overall method detects phishing websites by updating the antivirus database with blacklisted URLs and Internet Protocol (IP) addresses, which is also referred to as the blacklist method. The major disadvantage of this approach is that it cannot detect zero-hour phishing attacks.

5. Phishing Attacks Detection using Machine Learning Approach
Authors: Mohammad Nazmul Alam; Dhiman Sarma; Farzana Firoz Lima; Ishita Saha; Rubaiath-E-Ulfath; Sohrab Hossain
Methodology: The proposed research work has developed a model to detect phishing attacks using machine learning (ML) algorithms like random forest (RF) and decision tree (DT). A standard dataset of phishing attacks from Kaggle was used for ML processing. To analyze the attributes of the dataset, the proposed model has used feature selection algorithms like principal component analysis (PCA).
Findings: A maximum accuracy of 97% was achieved through the random forest algorithm.
Remarks: RF had less variance, and it could handle the over-fitting problem. The random forest achieved an accuracy of 97%. In future work, the authors plan to predict phishing attacks from a logged dataset of attacks by using a convolutional neural network (CNN).

6. PWDGAN: Generating Adversarial Malicious URL Examples for Deceiving Black-Box Phishing Website Detector using GANs
Authors: Trinh Nguyen Bac; Phan The Duy; Van-Hau Pham
Methodology: In this article, they build a model based on the generative adversarial network (GAN), a deep learning-based framework, to conduct black-box attacks using the Phishtank and Alexa datasets that try to evade and bypass ML-based phishing detectors.
Findings: To evaluate the performance of the proposed model, several machine learning algorithms are used as the black-box phishing detector, including Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and Multi-layer Perceptron (MLP).
Remarks: Initially, the classifiers are capable of good detection: the TPR for phishing URL detection reached 100% on both the training set and the testing set, except for the RF and LR classifiers, which had a TPR of 99%. At the 100th epoch, the classifiers' rate of detecting malicious samples decreases; even the DT classifier could not distinguish these malicious samples, with a TPR of 0%.

7. Phishing Detection Using Machine Learning Algorithm
Authors: Aliyu Alhaji Abubakar, Abubakar Adamu, and Halima Sadia Iliyasu
Methodology: The authors used a dataset of 1,500 phishing and non-phishing emails to train and test six machine learning algorithms: KNN, SVM, Random Forest, Decision Tree, Naïve Bayes, and Logistic Regression. The performance of each algorithm was evaluated using accuracy, precision, recall, and F1-score metrics.
Findings: The use of machine learning algorithms can improve the accuracy and efficiency of phishing detection. It can also reduce the need for manual analysis and increase the speed of detection.
Remarks: The effectiveness of the machine learning model is highly dependent on the quality and quantity of the training dataset. Phishing attacks are constantly evolving, and the model may become less effective over time if it is not updated with new data.

3) Dataset Description:
Link: https://gregavrbancic.github.io/Phishing-Dataset/
4) Data Preprocessing:
● Attributes that contain no value other than zero are ignored.
● In total, 20 such attributes were removed.
● Min-max scaling is done manually to preserve the dataframe, because the
pre-defined functions return a NumPy array.
● Feature selection was initially attempted using correlation analysis, but the
attributes were interrelated, which made it hard to remove redundancy from
the dataset.
● So the LassoCV technique is applied to the dataset; it returns the list of
important attributes, that is, those whose coefficient is not zero.
● In total, 38 attributes were selected by LassoCV.
● To reduce dimensionality further, Principal Component Analysis (PCA)
was performed and 20 components were generated.
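
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (`drop_all_zero_columns`, `min_max_scale`, `select_with_lasso`, `preprocess`), the target column name `phishing`, the `cv=5` setting, and the synthetic demo data are all assumptions made for the example. On the real dataset, `n_components` would be 20.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV


def drop_all_zero_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop attributes whose every value is zero."""
    return df.loc[:, (df != 0).any(axis=0)]


def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to [0, 1] by hand so the result stays a DataFrame."""
    col_min = df.min()
    # A constant column has range 0; use 1 instead so it maps to 0, not NaN.
    col_range = (df.max() - col_min).replace(0, 1)
    return (df - col_min) / col_range


def select_with_lasso(X: pd.DataFrame, y: pd.Series) -> list:
    """Return the names of attributes whose LassoCV coefficient is non-zero."""
    lasso = LassoCV(cv=5, random_state=0).fit(X, y)  # cv=5 is an assumption
    return [name for name, coef in zip(X.columns, lasso.coef_) if coef != 0]


def preprocess(df: pd.DataFrame, target: str, n_components: int):
    """Zero-column removal -> min-max scaling -> LassoCV selection -> PCA."""
    y = df[target]
    X = min_max_scale(drop_all_zero_columns(df.drop(columns=[target])))
    selected = select_with_lasso(X, y)
    n_components = min(n_components, len(selected))
    components = PCA(n_components=n_components).fit_transform(X[selected])
    return pd.DataFrame(components, index=df.index), y, selected


# Synthetic stand-in for the real dataset (hypothetical column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((200, 6)), columns=[f"f{i}" for i in range(6)])
demo["empty"] = 0  # an all-zero attribute that should be dropped
demo["phishing"] = (demo["f0"] + demo["f1"] > 1.0).astype(int)

features, labels, kept = preprocess(demo, target="phishing", n_components=2)
print("attributes kept by LassoCV:", kept)
print("shape after PCA:", features.shape)
```

Scaling with plain pandas arithmetic keeps the column names, which is what allows the LassoCV step to report the selected attributes by name.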
6) Results:
It is observed that the voting classifier performs better than three
of its constituent models but slightly worse than the Random Forest
Classifier. The error rate of the model is almost negligible.
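
A comparison of this kind could be set up along the following lines. This is only a sketch under stated assumptions: the report confirms that Random Forest is one constituent of the voting classifier, but the other three models used here (logistic regression, decision tree, KNN), the hard-voting setting, and the synthetic data from `make_classification` are illustrative choices, so the printed numbers will not match the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed phishing data (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Constituent models; only Random Forest is named in the report,
# the other three are assumed for illustration.
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]
# Hard voting: each constituent casts one vote per sample.
voting = VotingClassifier(estimators=estimators, voting="hard")

# Fit every constituent and the ensemble, then record test accuracy.
scores = {}
for name, model in estimators + [("voting", voting)]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, accuracy in scores.items():
    print(f"{name:>6}: accuracy={accuracy:.3f}  error rate={1 - accuracy:.3f}")
```

Printing the constituents next to the ensemble makes it easy to see whether a single strong model, such as the random forest, outperforms the vote.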
Code:
https://github.com/SanjayNithin2002/Phising-Detection-Using-Data-Mining-Techniques
8) References: