Final
Phishing Scenario
Economic Impact
• Phishing websites are designed to closely resemble legitimate ones, creating an almost
identical appearance to deceive users.
• The structure of phishing sites is intentionally kept simple, allowing for quick creation and
deployment compared to the more time-consuming development process of legitimate
websites.
• Existing phishing detection technology primarily analyzes individual webpage features,
often overlooking the broader topology and structure.
• The datasets on which existing models are trained are very small and may not
represent all types of phishing websites.
• There is a lack of features with which to train the models and determine the result.
• The challenges in shielding against phishing extend beyond inexperienced
users, impacting entire networks and highlighting the evolving tactics of cyber
attackers.
Improvement:
• The dataset is balanced, with around one thousand rows, and represents all types
of phishing websites.
• Around 17 features are used to train the models, enhancing their discriminative ability.
• The dynamic nature of websites and the constantly evolving tactics used to bypass
detection systems are taken into account so that phishing websites are detected accurately.
• We have used different models: Support Vector Machine, Random Forest, Decision
Tree, XGBoost, and Multilayer Perceptron.
Proposed System
Screenshot of the Approval of the
Certificate of the Project Report
Methodology
Collection Of Data
•Gather a dataset of websites labelled as phishing or legitimate, with attributes that
carry more information than a simple good/bad verdict, so that the resulting models
achieve good accuracy as well as good precision. Extract impactful features such as
the URL, domain age, and IP address, as shown in Fig. 1, which describes the
attribute counts.
•Pre-process the data by handling missing values, encoding
categorical features, and scaling numerical features.
•Based on the dataset, features can be selected from three groups: address bar,
HTML & JavaScript, and domain. Each group expands further; the address-bar group,
for example, covers the URL, special symbols in the site address, domain length,
'//' redirection, use of http, URL-shortening services, and prefix/suffix patterns
(a hedged extraction sketch follows Fig. 1 below).
•Domain Features are:
•DNS Record
•Traffic on Website
•Domain Age
•Duration of Domain
Fig. 1: Classification of attributes
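As a hedged illustration of the address-bar group, the sketch below computes a few such features (URL length, embedded IP address, '//' redirection, prefix/suffix hyphen, HTTPS use, known shortening services) from a raw URL string; the exact feature definitions and the shortener list are assumptions, not the project's reference implementation.

```python
# Minimal sketch of address-bar feature extraction (assumed definitions).
import re
from urllib.parse import urlparse

SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}  # illustrative list

def address_bar_features(url: str) -> dict:
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    return {
        "url_length": len(url),                                   # long URLs are suspicious
        "has_ip": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", domain.split(":")[0]))),
        "double_slash_redirect": int(url.rfind("//") > 7),        # '//' appearing after the scheme
        "prefix_suffix": int("-" in domain),                      # hyphenated domain names
        "uses_https": int(parsed.scheme == "https"),
        "is_shortened": int(domain in SHORTENERS),
        "dots_in_domain": domain.count("."),                      # proxy for sub-domain depth
    }

print(address_bar_features("http://example.com//login-secure/a"))
```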
Methodology
Data Splitting
Split the dataset into training and testing subsets (a separate validation set can
also be held out). A common split is 80% for training and 20% for testing.
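A minimal sketch of the 80/20 split with scikit-learn, using synthetic stand-in data shaped like the report's dataset (about 1,000 rows and 17 features); the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real feature matrix and labels (assumed shapes).
X = np.random.rand(1000, 17)             # ~1,000 rows, ~17 features, as in the report
y = np.random.randint(0, 2, size=1000)   # 1 = phishing, 0 = legitimate

# 80% training / 20% testing, stratified to preserve the phishing/legitimate ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```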
Feature Selection
Use techniques such as feature importance from Random Forest or XGBoost to
select the most relevant and informative features, reducing dimensionality and
potentially improving model performance. Fig. 2 describes the relation between
the variables (attributes).
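A hedged sketch of importance-based selection with a Random Forest, on a synthetic stand-in dataset; the threshold and estimator settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the 17-feature phishing dataset (assumed shape).
X, y = make_classification(n_samples=1000, n_features=17, n_informative=8, random_state=42)

# Rank features by Random Forest importance and keep the above-average ones.
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print("kept", X_selected.shape[1], "of", X.shape[1], "features")
```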
Model Selection:
Choose the five models we have mentioned (SVM, MLP, Random
Forest, XGBoost, Decision Trees) for building classifiers.
Model Training:
Train each model on the features most relevant to it, which may result in
better accuracy.
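A hedged sketch of this step: train all five classifiers on a common train/test split. The hyperparameters and the synthetic stand-in data are illustrative assumptions, not the project's tuned configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=17, random_state=42)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),                # SVM/MLP benefit from scaling
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```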
Methodology
Model Evaluation:
Evaluate each model's performance on the test dataset using appropriate
metrics. Common metrics include precision and accuracy on both the training
and test data.
Assess the models' ability to correctly classify phishing websites while
minimizing false positives.
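A minimal sketch of this evaluation step, using a synthetic stand-in dataset and a single model for brevity; the metric calls are standard scikit-learn functions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=17, random_state=42)   # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy :", accuracy_score(y_test, y_pred))
print("precision     :", precision_score(y_test, y_pred))   # penalises false positives
print("recall        :", recall_score(y_test, y_pred))      # penalises missed phishing sites
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```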
Model Deployment:
Once satisfied with the models' performance, deploy them in a
production environment for real-time or batch processing of new
website URLs.
Implement a monitoring system to continuously evaluate model
performance and update them as needed.
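As a hedged illustration of deployment, the sketch below persists a trained model with joblib and reloads it to score the feature vector of a new URL; the file name `phishing_model.joblib`, the stand-in data, and the placeholder feature vector are assumptions.

```python
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train and persist the model once (stand-in data and an assumed file name).
X, y = make_classification(n_samples=1000, n_features=17, random_state=42)
joblib.dump(RandomForestClassifier(random_state=42).fit(X, y), "phishing_model.joblib")

# At serving time: load the model and classify the 17 features extracted from a new URL.
model = joblib.load("phishing_model.joblib")
new_features = np.random.rand(1, 17)          # placeholder for a real extracted feature vector
label = model.predict(new_features)[0]
print("phishing" if label == 1 else "legitimate")
```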
Education and Awareness:
Educate users and organizations about the risks of phishing attacks
and encourage best practices for safe online behavior.
Remember that the effectiveness of each model may vary depending
on the characteristics of the dataset and the specific features chosen.
Regularly updating and re-evaluating the models is crucial for
maintaining their accuracy in detecting evolving phishing threats.
Machine Learning Models
This is a supervised machine learning task. There are two major types of supervised machine
learning problems, called classification and regression.
The dataset falls under a classification problem, as each input URL is classified as
phishing (1) or legitimate (0). The machine learning models considered for training
on the dataset are:
• Decision Tree
• Random Forest
• Multilayer Perceptron
• XGBoost
• Support Vector Machines
Criteria | MLP | SVM | Decision Tree | Random Forest | XGBoost
Type of Algorithm | Neural Network | Supervised Learning | Supervised Learning | Ensemble Learning | Ensemble Learning
Flexibility | Highly flexible | Moderately flexible | Highly flexible | Highly flexible | Highly flexible
Interpretability | Complex model | Moderate | Easy to understand | Moderate | Moderate
Handling Categorical Features | Requires encoding | Requires encoding | Handles naturally | Handles naturally | Handles naturally
Training Time | Depends on architecture | Depends on the kernel | Fast training | Slower than DT | Faster than RF
Memory Usage | Moderate | Moderate | Low | Higher | Moderate
Scalability | Scales with data | Scales with data | Scales with data | Scales with data | Scales with data
Ensemble Capabilities | No | No | No | Yes | Yes
Hyperparameter Sensitivity | Moderate | High | Moderate | High | High
Handling Missing Values | Requires imputation | Not handled directly | Not handled directly | Handles naturally | Handles naturally
Dataset: https://archive.ics.uci.edu/dataset/327/phishing+websites
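One hedged way to load this dataset in Python is via the `ucimlrepo` helper published alongside the UCI repository (assuming the package is installed and dataset ID 327 still resolves); the variable names are illustrative.

```python
# pip install ucimlrepo   (assumed helper package for the UCI repository)
from ucimlrepo import fetch_ucirepo

phishing = fetch_ucirepo(id=327)       # "Phishing Websites" dataset
X = phishing.data.features             # feature columns as a pandas DataFrame
y = phishing.data.targets              # target column (phishing vs. legitimate)
print(X.shape, y.value_counts())
```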
Model used: XGBoost (Extreme Gradient Boosting)
Why XGBoost?
• High Accuracy: XGBoost is known for its high accuracy and efficiency in
classification tasks. It can effectively handle complex data and find patterns that may
indicate whether a website is legitimate or phishing.
• XGBoost is an implementation of the gradient boosting algorithm, which builds
multiple decision trees sequentially, each one correcting the errors of its predecessor.
This iterative approach tends to produce highly accurate models.
• Feature Importance: XGBoost reports how much each feature contributes to its
predictions; this insight can be valuable for further analysis and improvement of the model.
• Regularization: XGBoost incorporates L1 and L2 regularization techniques to prevent
overfitting, which is crucial when dealing with datasets that may have noise or
irrelevant features.
• Flexibility: XGBoost is highly flexible and can be easily tuned to achieve optimal
performance for specific datasets and objectives. This flexibility allows researchers
and practitioners to experiment with different hyperparameters and configurations to
improve the model's performance.
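To tie these points together, here is a hedged configuration sketch: an XGBoost classifier with explicit L1/L2 regularization whose feature importances are inspected afterwards. The hyperparameter values and the synthetic data are illustrative assumptions, not the tuned settings used in this project.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=17, random_state=42)  # stand-in data

model = XGBClassifier(
    n_estimators=300,        # number of sequentially boosted trees
    learning_rate=0.1,
    max_depth=4,
    reg_alpha=0.1,           # L1 regularization
    reg_lambda=1.0,          # L2 regularization
    eval_metric="logloss",
)
model.fit(X, y)

# Feature importance: which attributes the boosted trees rely on most.
for i, score in sorted(enumerate(model.feature_importances_), key=lambda t: -t[1])[:5]:
    print(f"feature {i}: importance {score:.3f}")
```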
The results produced by the models are shown in
Proof Of Outcome
• Link: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=detection+of+phishing+2584-2137&btnG=#d=gs_qabs&t=1710223875636&u=%23p%3DJjBlSVKK7s8J
Thank You