Final

The document presents a project on detecting phishing sites using machine learning techniques, detailing the motivations, objectives, and methodologies employed. It highlights various types of phishing attacks, their economic impact, and the challenges in current detection systems, while proposing a new system that utilizes multiple machine learning models for improved accuracy. The results indicate that the XGBoost model achieved the highest accuracy of 87% in detecting phishing websites.

Uploaded by

Faizan Ali

B.Tech External Project Evaluation, VIIIth Sem

Detection of Phishing Sites Using Machine Learning Techniques


Presented by:
Vishal Pandey (2020490363)
Ayan Mahmood (2020401758)
Rohit Raj (2020561626)
Under the Supervision of:
Dr. Gauri Shankar Mishra

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


SHARDA SCHOOL OF ENGINEERING AND TECHNOLOGY
April, 2024
Approval from Guide for the Evaluation
Screenshot of the Approval of the Certificate of the Project Report
Agenda
• Approval from guide for the evaluation
• Introduction
• Motivation
• Objectives
• Literature Survey
• Proposed System
• Methodology
• Machine Learning Models
• Dataset
• Model used
• Result
• References
• Proof of the Outcome
Introduction
• Phishing is a form of cyber attack where attackers impersonate legitimate
entities (such as companies, government agencies, or financial institutions) to
trick individuals into disclosing sensitive information, such as passwords, credit
card numbers, or personal data.
• The primary goal of phishing is to deceive individuals into unknowingly
revealing sensitive data, making it a prevalent method employed by attackers to
compromise the security of valuable information.

Phishing Scenario

• In a typical phishing scenario, attackers create deceptive emails that
convincingly imitate genuine communications.
• The objective is to entice recipients to click on links within these emails, leading
them to counterfeit websites designed to look like legitimate platforms.
• Once on these fake websites, individuals are coerced into providing confidential
information such as login details, passwords, and financial data.
Types of Phishing:

• Spear Phishing: Targeted phishing aimed at specific individuals or organizations, using
personalized information to increase believability and success rates.
• Whaling: Targeted phishing that focuses on high-profile individuals within
organizations, such as executives or senior managers, with the goal of obtaining valuable
information or funds.
• Vishing: Phishing conducted via phone calls or voice messages, where attackers pose as trusted
entities to deceive victims into divulging sensitive information or performing actions.
• Smishing: Phishing conducted via text messages, where attackers send fraudulent SMS messages
containing malicious links or instructions to trick recipients into providing personal information
or taking harmful actions.
• Email Phishing: The most common form of phishing, where attackers send deceptive emails
impersonating legitimate entities to trick recipients into clicking on malicious links, downloading
attachments, or providing sensitive information.

Economic Impact

• Phishing has a substantial economic impact on a global scale.
• Phishing efforts often target payment companies and webmail services, as evidenced
by research conducted by the Anti-Phishing Working Group (APWG).
• The concentration of attacks in these areas contributes significantly to economic losses.
Characteristics of Phishing Websites

• Phishing websites are designed to closely resemble legitimate ones, creating an almost
identical appearance to deceive users.
• The structure of phishing sites is intentionally kept simple, allowing for quick creation and
deployment compared to the more time-consuming development process of legitimate
websites.
• Existing phishing detection technology primarily analyzes individual webpage features,
often overlooking the broader topology and structure.

Technology Shift Impact

• Global advancements in networking and communication technology have led to a
significant shift in users' daily activities.
• Essential functions such as electronic banking, social networking, and e-commerce have
migrated to the digital realm.
• However, this migration has created a fertile ground for cyberattacks, introducing security
vulnerabilities that impact both inexperienced and skilled users.
Challenges in Protection

• Despite their experience, skilled users remain difficult to protect from
falling victim to phishing scams.
• Cybercriminals often exploit the individual characteristics of skilled users in
their attempts to deceive.
• The challenges in shielding against phishing extend beyond inexperienced
users, impacting entire networks and highlighting the evolving tactics of cyber
attackers.
Motivation
• The rise of phishing attacks has led to an increase in financial losses and data breaches for
individuals and organizations alike.
• Solving phishing is motivated by the need to protect individuals and organizations from the
harmful effects of these attacks.
• One motivation for solving phishing is to protect individuals' personal information and
financial assets, as phishing attacks can result in identity theft, unauthorized access to bank
accounts, and credit card fraud.
• Solving phishing can help prevent these types of attacks, safeguarding individuals' sensitive
information and financial well-being.
• Another motivation for solving phishing is to protect organizations from financial losses and
reputational damage.
• Phishing attacks can result in loss of intellectual property, trade secrets, and other sensitive
information, causing financial and reputational damage.
• Solving phishing can help organizations maintain their competitive edge by protecting their
intellectual property and safeguarding their reputation.
Objectives
• To conduct research exploring machine learning approaches for classifying web URLs as malicious or
non-malicious.
• To investigate the efficacy of algorithms (Decision Trees, Random Forests, XGBoost, SVM, and MLP) in
accurately detecting malicious URLs.
• To enhance cybersecurity measures by identifying key features with the algorithms' built-in
techniques and by optimizing detection accuracy with GridSearchCV.
• To contribute novel insights to URL-based threat detection methodologies and improve the interpretability of
classification models.
Literature Survey
Comparative Analysis
Each entry lists: S.no, Research, Pros, Cons, Dataset.

1. “A Machine Learning Approach for Detecting Phishing Websites” (2020)
Pros: The paper contains experimental results showing that the proposed approach detects phishing websites with an F1 score of 0.98.
Cons: The dataset may not represent all types of phishing websites.
Dataset: https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls

2. “Detecting Phishing Websites” (2020)
Pros: Various models were proposed in this research for detecting phishing sites.
Cons: The paper does not describe in detail the features used to detect a phishing site.
Dataset: https://www.kaggle.com/datasets/eswarchandt/phishing-website-detector

3. “Phishing Website Detection” (2020)
Pros: The model can potentially improve the accuracy and efficiency of phishing detection compared to traditional rule-based approaches.
Cons: The approach is based on supervised learning, which requires a labeled dataset for training; preparing it can be time consuming.
Dataset: https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset

4. “Detecting Phishing Websites using Machine Learning Techniques” (2020)
Pros: The paper compares the performance of the logistic regression algorithm with other machine learning algorithms, which can help researchers and students choose the most appropriate algorithm for their use case.
Cons: The paper does not elaborate on the limitations or drawbacks of the proposed work, such as false positives or the possibility of attackers adapting their tactics to escape detection.
Dataset: https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls

5. “A Machine Learning based Approach” (2020)
Pros: Can be used in various domains such as health care, finance, and cybersecurity.
Cons: Machine learning algorithms can be sensitive to the quality and quantity of data used for training, which can lead to biased or inaccurate results.
Dataset: https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset

6. “Detection Of Phishing Websites” (2021)
Pros: (none listed)
Cons: A labeled dataset was used, which can be time consuming to evaluate for the model.
Dataset: https://www.kaggle.com/datasets/eswarchandt/phishing-website-detector

7. “Phishing Detection” (2021)
Pros: The paper analyzes the performance of gradient boosting, which can be informative for researchers and innovators.
Cons: Machine learning models can be susceptible to adversarial attacks, where malicious actors intentionally manipulate the input data to deceive the model.
Dataset: https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset

8. “A Comprehensive Study On Phishing Websites”
Pros: Machine learning algorithms can potentially discover patterns and insights that may be difficult or impossible to detect using traditional statistical methods.
Cons: The paper does not elaborate on the limitations or drawbacks of the proposed work, such as false positives or the possibility of attackers adapting their tactics to escape detection.
Dataset: https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls

9. “Phishing Website using Improved Random Forest Algorithm”
Pros: The paper contains experimental results showing that the proposed approach detects phishing websites with an F1 score of 0.98.
Cons: The approach is based on supervised learning, which requires a labeled dataset for training; preparing it can be time consuming.
Dataset: https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls

10. “Phishing Website Detection using Deep Learning and Graph Convolutional Networks”
Pros: A deep learning model can learn through its hidden layers and can produce accurate results.
Cons: If a hidden layer fails, the whole system degrades badly.
Dataset: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset

Paper ID: 2208 Category: UG / PG / RS / Faculty
Challenges in Previous works

• The datasets on which the models are trained are very small and may not
represent all types of phishing websites.
• Too few features are available to train the models and determine the result.
• The challenges in shielding against phishing extend beyond inexperienced
users, impacting entire networks and highlighting the evolving tactics of cyber
attackers.
Improvement:

• The dataset is balanced, with around one thousand rows, and represents all types
of phishing websites.
• Around 17 features are used to train the models, enhancing their ability to detect phishing.
• The system takes into account the dynamic nature of websites and the constantly evolving
tactics used to bypass detection systems, so that phishing websites are detected accurately.
• We have used five different models: Support Vector Machine, Random Forest, Decision
Tree, XGBoost, and Multilayer Perceptron.
Proposed System

Screenshot of the Approval of the
Certificate of the Project Report
Methodology
Collection Of Data
• Gather a dataset of websites labelled as phishing or legitimate,
with attributes informative enough to support both good accuracy
and good precision. Extract impactful features such as the URL,
domain age, and IP address. Fig 1 describes the classification of
the attributes.
• Pre-process the data by handling missing values, encoding
categorical features, and scaling numerical features.
• Based on the dataset, select features from three groups:
address bar, HTML & JavaScript, and domain. Each group can be
expanded further; for example, address-bar features include the URL,
special symbols, the length of the domain, redirection with ‘//’, http,
URL shortening, and prefix and suffix.
•Domain Features are:
•DNS Record
•Traffic on Website
•Domain Age
•Duration of Domain
Fig1: Classification of attributes
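The address-bar features listed above can be sketched in a few lines. This is a minimal illustration, not the project's actual feature extractor: the function name, the feature names, the short list of URL-shortening hosts, and the thresholds are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical, non-exhaustive list of URL-shortening hosts (an assumption).
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}


def address_bar_features(url: str) -> dict:
    """Return a few address-bar features for one URL (1 = suspicious, 0 = not)."""
    host = urlparse(url).hostname or ""
    # A '//' that appears after the scheme separator suggests a redirection.
    last_double_slash = url.rfind("//")
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "has_at_symbol": int("@" in url),
        "redirection": int(last_double_slash > 7),
        "http_in_domain": int("http" in host),
        "uses_shortener": int(host in SHORTENERS),
        "prefix_suffix": int("-" in host),  # '-' separates a prefix or suffix
    }


features = address_bar_features("http://secure-login.example.com//verify@account")
print(features["redirection"], features["has_at_symbol"], features["prefix_suffix"])
```

Each URL becomes one row of numeric features, which is the form the classifiers in the later slides expect. Domain features such as DNS record or domain age need external lookups and are left out of this sketch.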
Data Splitting
Split the dataset into training and testing subsets; a validation subset can
additionally be held out from the training data. A common split is 80% for
training and 20% for testing.
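An 80/20 split can be sketched with the standard library alone. The helper name `split_train_test`, the fixed seed, and the toy rows are assumptions for illustration; a library routine would normally be used in practice.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: (feature_vector, label) pairs, label 1 = phishing, 0 = legitimate.
rows = [([i % 2, i % 3], i % 2) for i in range(1000)]
train, test = split_train_test(rows)
print(len(train), len(test))  # 800 200
```

Shuffling before splitting matters when the rows are stored grouped by label; otherwise the test subset could contain only one class.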
Feature Selection
Use techniques such as feature importance from Random Forest or
XGBoost to select the most relevant features, reducing
dimensionality and potentially improving model performance.
Fig 2 describes the relation between the variables (attributes).
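To make the idea of feature importance concrete, the sketch below ranks binary features by the Gini impurity decrease of a single split on each feature. This is a simplified stand-in, not Random Forest or XGBoost importance itself; impurity-based importances in tree ensembles are built from quantities of this kind, accumulated over many splits and trees. The feature names and toy data are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def impurity_decrease(values, labels):
    """Gini decrease obtained by splitting on one binary (0/1) feature."""
    left = [y for v, y in zip(values, labels) if v == 0]
    right = [y for v, y in zip(values, labels) if v == 1]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


def rank_features(X, y, names):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = [
        (name, impurity_decrease([row[j] for row in X], y))
        for j, name in enumerate(names)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data: the first feature determines the label, the second is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
ranked = rank_features(X, y, ["has_at_symbol", "noise"])
print(ranked[0][0])  # has_at_symbol
```

Features with a score near zero are candidates for removal, which is how the ranking reduces dimensionality.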

Model Selection:
Choose the five models we have mentioned (SVM, MLP, Random
Forest, XGBoost, Decision Trees) for building classifiers.

Model Training:
Train each model on the features most relevant to it, which may
result in good accuracy.

Model Evaluation:
Evaluate each model's performance on the test dataset using
appropriate metrics. Common metrics include precision and accuracy
on both the training data and the test data.
Assess the models' ability to correctly classify phishing websites
while minimizing false positives.
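The metrics named above follow directly from the four confusion-matrix counts. The sketch below assumes labels encoded as 1 for phishing and 0 for legitimate; the function name and the toy predictions are illustrative only.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and false positive rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = evaluate(y_true, y_pred)
print(metrics["accuracy"], metrics["false_positive_rate"])  # 0.75 0.25
```

Computing the same metrics on the training data and on the test data makes overfitting visible as a gap between the two.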
Model Deployment:
Once satisfied with the models' performance, deploy them in a
production environment for real-time or batch processing of new
website URLs.
Implement a monitoring system to continuously evaluate model
performance and update them as needed.
Education and Awareness:
Educate users and organizations about the risks of phishing attacks
and encourage best practices for safe online behavior.
Remember that the effectiveness of each model may vary depending
on the characteristics of the dataset and the specific features chosen.
Regularly updating and re-evaluating the models is crucial for
maintaining their accuracy in detecting evolving phishing threats.
Machine Learning Models
This is a supervised machine learning task. There are two major types of supervised machine
learning problems, called classification and regression.
The dataset poses a classification problem, as each input URL is classified as
phishing (1) or legitimate (0). The machine learning models considered for training on
the dataset are:
• Decision Tree
• Random Forest
• Multilayer Perceptron (MLP)
• XGBoost
• Support Vector Machine (SVM)
| Criteria | MLP | SVM | Decision Tree | Random Forest | XGBoost |
|---|---|---|---|---|---|
| Type of Algorithm | Neural Network | Supervised Learning | Supervised Learning | Ensemble Learning | Ensemble Learning |
| Flexibility | Highly flexible | Moderately flexible | Highly flexible | Highly flexible | Highly flexible |
| Interpretability | Complex model | Moderate | Easy to understand | Moderate | Moderate |
| Handling Non-linearity | Yes | Yes | Yes | Yes | Yes |
| Robust to Outliers | Sensitive | Sensitive | Sensitive | Robust | Robust |
| Handling Categorical Features | Requires encoding | Requires encoding | Handles naturally | Handles naturally | Handles naturally |
| Training Time | Depends on architecture | Depends on the kernel | Fast training | Slower than DT | Faster than RF |
| Memory Usage | Moderate | Moderate | Low | Higher | Moderate |
| Scalability | Scales with data | Scales with data | Scales with data | Scales with data | Scales with data |
| Ensemble Capabilities | No | No | No | Yes | Yes |
| Hyperparameter Sensitivity | Moderate | High | Moderate | High | High |
| Handling Missing Values | Requires imputation | Not handled directly | Not handled directly | Handles naturally | Handles naturally |
| Use Cases | General-purpose, complex tasks | Classification, regression | Classification, regression | Classification, regression | Classification, regression |
Accuracy Of Models
Dataset:
We have used a dataset from the UCI Machine Learning Repository.

Link for the dataset:

https://archive.ics.uci.edu/dataset/327/phishing+websites
Model used: XGBoost (Extreme Gradient Boosting)
Why XGBoost?
• High Accuracy: XGBoost is known for its high accuracy and efficiency in
classification tasks. It can effectively handle complex data and find patterns that may
indicate whether a website is legitimate or phishing.
• Gradient Boosting: XGBoost is an implementation of the gradient boosting algorithm, which builds
multiple decision trees sequentially, each one correcting the errors of its predecessor.
This iterative approach tends to produce highly accurate models.
• Feature Importance: XGBoost reports how much each feature contributes to its
predictions. This insight can be valuable for further analysis and improvement of the model.
• Regularization: XGBoost incorporates L1 and L2 regularization techniques to prevent
overfitting, which is crucial when dealing with datasets that may have noise or
irrelevant features.
• Flexibility: XGBoost is highly flexible and can be easily tuned to achieve optimal
performance for specific datasets and objectives. This flexibility allows researchers
and practitioners to experiment with different hyperparameters and configurations to
improve the model's performance.
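The sequential error-correcting idea behind gradient boosting can be shown with a small sketch. This is plain gradient boosting on squared error with one-split trees (stumps), not XGBoost itself, which adds regularization and other refinements; the function names, learning rate, number of rounds, and toy data are assumptions.

```python
def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = sum((r - left_mean) ** 2 for r in left)
            error += sum((r - right_mean) ** 2 for r in right)
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return None if best is None else best[1:]


def stump_value(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value


def fit_boosting(X, y, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(y) / len(y)
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [target - score for target, score in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break  # no valid split exists
        stumps.append(stump)
        scores = [s + learning_rate * stump_value(stump, row)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict_label(model, row):
    base, learning_rate, stumps = model
    score = base + learning_rate * sum(stump_value(s, row) for s in stumps)
    return int(score >= 0.5)  # 1 = phishing, 0 = legitimate


# Toy data: a URL is phishing if either of two binary warning signs is present.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
model = fit_boosting(X, y)
print([predict_label(model, row) for row in X])  # [0, 1, 1, 1]
```

The learning rate scales down each stump's correction, so later stumps still have errors to fix; this is one of the hyperparameters that tuning would explore.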
RESULT
The results produced by the models are shown in Table 1. The best results are
produced by XGBoost, which reaches an accuracy of about 87% (86.7% on the
training data and 86.2% on the test data).

| S.no | Model | Training Accuracy | Testing Accuracy |
|---|---|---|---|
| 1. | Decision Tree | 81.2% | 81.7% |
| 2. | Random Forest | 81.7% | 82.3% |
| 3. | MLP | 82.8% | 83.1% |
| 4. | SVM | 80% | 80.7% |
| 5. | XGBoost | 86.7% | 86.2% |

Table 1. Results Produced by Models

Fig 8. Predicted Results
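The comparison in Table 1 can be checked with a few lines of arithmetic. The accuracy values below are copied from the table; the variable names are illustrative.

```python
# (training accuracy %, testing accuracy %) per model, copied from Table 1.
results = {
    "Decision Tree": (81.2, 81.7),
    "Random Forest": (81.7, 82.3),
    "MLP": (82.8, 83.1),
    "SVM": (80.0, 80.7),
    "XGBoost": (86.7, 86.2),
}

# Pick the model with the highest testing accuracy.
best = max(results, key=lambda name: results[name][1])

# Training minus testing accuracy; a large positive gap would hint at overfitting.
gaps = {name: round(train - test, 1) for name, (train, test) in results.items()}

print(best)  # XGBoost
```

All gaps are below one percentage point, so none of the five models shows a large difference between training and testing accuracy.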


REFERENCES
1. Deshpande, Atharva, Omkar Pedamkar, Nachiket Chaudhary, and Swapna Borde. "Detection
of phishing websites using Machine Learning." International Journal of Engineering
Research & Technology (IJERT) 10, no. 05 (2021).
2. Odeh, Ammar, Ismail Keshta, and Eman Abdelfattah. "Machine learning techniques for
detection of website phishing: A review for promises and challenges." In 2021 IEEE 11th
Annual Computing and Communication Workshop and Conference (CCWC), pp. 0813-0818.
IEEE, 2021.
3. Alkawaz, Mohammed Hazim, Stephanie Joanne Steven, and Asif Iqbal Hajamydeen.
"Detecting phishing website using machine learning." In 2020 16th IEEE International
Colloquium on Signal Processing & Its Applications (CSPA), pp. 111-114. IEEE, 2020.
4. Ubing, Alyssa Anne, Syukrina Kamilia Binti Jasmi, Azween Abdullah, N. Z. Jhanjhi, and
Mahadevan Supramaniam. "Phishing website detection: An improved accuracy through
feature selection and ensemble learning." International Journal of Advanced Computer
Science and Applications 10, no. 1 (2019).
5. Kulkarni, Arun D., and Leonard L. Brown III. "Phishing websites detection using machine
learning." (2019).
6. Alswailem, Amani, Bashayr Alabdullah, Norah Alrumayh, and Aram Alsedrani. "Detecting
phishing websites using machine learning." In 2019 2nd International Conference on
Computer Applications & Information Security (ICCAIS), pp. 1-6. IEEE, 2019.
7. Yang, Peng, Guangzhen Zhao, and Peng Zeng. "Phishing website detection based on
multidimensional features driven by deep learning." IEEE Access 7 (2019): 15196-15209.
8. Patil, Vaibhav, Pritesh Thakkar, Chirag Shah, Tushar Bhat, and S. P. Godse. "Detection and
prevention of phishing websites using machine learning approach." In 2018 Fourth
International Conference on Computing Communication Control and Automation (ICCUBEA),
pp. 1-5. IEEE, 2018.
9. Islam, Mazharul, and Nihad Karim Chowdhury. "Phishing websites detection using machine
learning based classification techniques." In International Conference on Advanced
Information and Communication Technology, Chittagong, Bangladesh. 2016.
Proof Of Outcome
• Link: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=detection+of+phishing+2584-2137&btnG=#d=gs_qabs&t=1710223875636&u=%23p%3DJjBlSVKK7s8J
Thank You
