PHISH CATCHER AND WEB SPOOFING

ATTACK USING MACHINE LEARNING

PROJECT REPORT

PHASE I

Submitted by

ABINAYA D [21CS062]
AKSHAYA E [21CS067]
INIKA R K [21CS092]
NANDHINI S [21CS116]

in partial fulfillment for the award of the degree


of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

MUTHAYAMMAL ENGINEERING COLLEGE

(AUTONOMOUS)

RASIPURAM – 637 408

ANNA UNIVERSITY::CHENNAI- 600 025

DECEMBER 2024
MUTHAYAMMAL ENGINEERING COLLEGE
(AUTONOMOUS)
RASIPURAM

BONAFIDE CERTIFICATE

Certified that this Report “PHISH CATCHER AND WEB SPOOFING ATTACK
USING MACHINE LEARNING” is the bonafide work of “ABINAYA D [21CS062],
AKSHAYA E [21CS067], INIKA R K [21CS092], NANDHINI S
[21CS116]” who carried out the work under my supervision.

SIGNATURE
Dr.G.KAVITHA, M.S (By Research), Ph.D.,
PROFESSOR
HEAD OF THE DEPARTMENT
Department of Computer Science and Engineering,
Muthayammal Engineering College (Autonomous), Rasipuram-637 408.

SIGNATURE
Mrs.S.NAZEEMA, M.E.,
ASSISTANT PROFESSOR
SUPERVISOR
Department of Computer Science and Engineering,
Muthayammal Engineering College (Autonomous), Rasipuram-637 408.

Submitted for the Project Work Phase-I Viva-Voce examination held on ___________

INTERNAL EXAMINER EXTERNAL EXAMINER


ACKNOWLEDGEMENT

We would like to thank our College Chairman Shri.R.KANDASAMY and our
Secretary Dr.K.GUNASEKARAN, M.E., Ph.D., F.I.E., who encourage us in all
activities.

We would like to record our deep sense of gratitude to our beloved Principal
Dr.M.MADHESWARAN, M.E., Ph.D., MBA., for providing us with the facilities
required to complete our project successfully.

We extend our sincere thanks and gratitude to our Head of the Department
Dr.G.KAVITHA, M.S(By Research), Ph.D., Department of Computer Science and
Engineering, for her valuable suggestions throughout the project.

It is a pleasure to acknowledge the contribution made by our Project Coordinator
Dr.N.NAVEENKUMAR., M.E., Ph.D., Associate Professor, Department of Computer
Science and Engineering, for his efforts to complete our project successfully.

We gratefully acknowledge the support provided by our Project Guide
Mrs.S.NAZEEMA., M.E., Assistant Professor, Department of Computer Science and
Engineering, for her guidance to complete our project successfully.

We are very thankful to our Parents, Friends and all Faculty Members of the
Department of Computer Science and Engineering, who helped us in the successful
completion of the project.

Vision of the Institute
To be a Centre of excellence in Engineering, Technology and Management on par with
International standards
Mission of the Institute
• To prepare the students with high professional skills and ethical values
• To impart knowledge through best practices
• To instill spirit of innovation through training, research and development
• To undertake continuous assessment and remedial measures
• To achieve academic excellence through intellectual, emotional and social
stimulation

Vision of the Department


To produce the Computer Science and Engineering graduates with the Innovative and
Entrepreneur skills to face the challenges ahead
Mission of the Department
M1: To impart knowledge in the state of art technologies in Computer Science and
Engineering
M2: To inculcate the analytical and logical skills in the field of Computer Science and
Engineering
M3: To prepare the graduates with Ethical values to become successful Entrepreneurs
Program Educational Objectives (PEOs)
PEO1: Graduates will be able to Practice as an IT Professional in Multinational
Companies
PEO2: Graduates will be able to Gain necessary skills and to pursue higher education
for career growth
PEO3: Graduates will be able to Exhibit the leadership skills and ethical values in the
day to day life

Program Outcomes (POs)
PO1 - Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of complex
engineering problems.
PO2 - Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences.
PO3 - Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs with
appropriate consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
PO4 - Conduct investigations of complex problems: Use research-based knowledge
and research methods including design of experiments, analysis and interpretation of
data, and synthesis of the information to provide valid conclusions.
PO5 - Modern tool usage: Create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
PO6 - The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO7 - Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
PO8 - Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO9 - Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.

PO10 - Communication: Communicate effectively on complex engineering activities
with the engineering community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
PO11 - Project management and finance: Demonstrate knowledge and understanding
of the engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 - Life-long learning: Recognize the need for, and have the preparation and ability
to engage in independent and life-long learning in the broadest context of technological
change.
Program Specific Outcomes (PSOs)
PSO1: Graduates should be able to design and analyze the algorithms to develop an
Intelligent Systems
PSO2: Graduates should be able to apply the acquired skills to provide efficient
solutions for real time problems
PSO3: Graduates should be able to exhibit an understanding of System Architecture,
Networking and Information Security.

COURSE OUTCOMES:
At the end of the course, the student will be able to
21CSP01.CO1 Understand the technical concepts of project area.
21CSP01.CO2 Identify the problem and formulation
21CSP01.CO3 Design the Problem Statement
21CSP01.CO4 Formulate the algorithm by using the design
21CSP01.CO5 Develop the Module

PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔

SIGNATURE OF STUDENTS SIGNATURE OF GUIDE

INDEX

CHAPTER NO. TABLE OF CONTENT PAGE NO.

ABSTRACT x
LIST OF FIGURES xi
LIST OF ABBREVIATIONS xii

1 INTRODUCTION 1
1.1 PROJECT OVERVIEW 1
1.2 OBJECTIVES 3
1.3 MACHINE LEARNING 3
1.4 ADVANTAGES 4
2 LITERATURE SURVEY 3
3 SYSTEM ANALYSIS 6
3.1 EXISTING SYSTEM 6
3.1.1 LIMITATIONS 6
3.2 PROPOSED SYSTEM 7
3.2.1 ADVANTAGES 7
4 SYSTEM REQUIREMENTS 8
4.1 HARDWARE REQUIREMENTS 8
4.2 SOFTWARE REQUIREMENTS 8
4.3 SOFTWARE DESCRIPTION 9
5 PROJECT DESIGN 12
5.1 BLOCK DIAGRAM 12
5.2 DATASET 13
5.3 PREPROCESSING 13
5.4 FEATURE EXTRACTION 15
5.5 MODEL IMPLEMENTATION 15
6 CONCLUSION AND FUTURE ENHANCEMENT 19
6.1 CONCLUSION 19
6.2 FUTURE ENHANCEMENT 19
REFERENCE 20

ABSTRACT

Phishing websites are a significant security challenge, presenting a direct
threat to the confidentiality, integrity, and availability of data belonging to both
companies and consumers. They often serve as the starting point for numerous
cyberattacks, such as credential theft, malware distribution, and financial fraud.
Despite the widespread awareness of phishing threats, they remain effective due
to their evolving techniques and ability to mimic legitimate entities convincingly.
Over the years, researchers have worked extensively to develop automatic
detection systems for phishing websites. An important aspect of phishing website
detection is the analysis of the web pages hosted at suspicious URLs. By
automating feature extraction and continuously learning from new data, ML-based
systems can adapt to evolving threats, reducing the dependence on manual
intervention. While advanced detection methods have demonstrated considerable
success, they often rely heavily on manual feature engineering, which involves
identifying and defining specific characteristics of phishing websites. This reliance
makes them less adaptable to novel and sophisticated phishing strategies,
particularly zero-day phishing attacks that exploit previously unknown
vulnerabilities. This research aims to address the challenges of phishing website
detection by leveraging machine learning and deep learning techniques. By
analyzing and extracting relevant features from datasets, ML models and deep
neural networks can be trained to recognize patterns indicative of phishing activity.
These models not only enhance detection accuracy but also improve
responsiveness to emerging threats. The proposed method represents a step
forward in combating phishing attacks, with the potential to provide robust and
scalable solutions for real-world applications.

LIST OF FIGURES

FIG. NO. FIGURE NAME PAGE NO.

4.1 Working of python interpreter 10

5.1 System Architecture 12


5.2 Random Forest Architecture 17

LIST OF ABBREVIATIONS

TERM ABBREVIATIONS

AI Artificial Intelligence
CSV Comma Separated Values
DT Decision Tree
DL Deep Learning
DNN Deep Neural Network
EDA Exploratory Data Analysis
HTML Hyper Text Markup Language
HTTP Hyper Text Transfer Protocol
KNN K-nearest Neighbors Algorithm
LSTM Long Short-Term Memory
ML Machine Learning
PSO Particle Swarm Optimization
SSL Secure Sockets Layer
SVM Support Vector Machine
URL Uniform Resource Locator
VGA Video Graphics Array

CHAPTER 1
INTRODUCTION

1.1 PROJECT OVERVIEW


Phishing has become a serious problem, harming individuals,
corporations, and even entire countries. The availability of multiple services such
as online banking, entertainment, education, software downloading, and social
networking has accelerated the Web's evolution in recent years. As a result, a
massive amount of data is constantly downloaded from and uploaded to the
Internet. Phishing relies on both social engineering and technical tricks. In
social engineering, spoofed emails pretending to be from reputable businesses
and agencies direct consumers to fake websites that deceive them into giving up
sensitive information such as usernames and passwords. Technical tricks involve
the installation of malicious software on computers to steal credentials directly,
often by intercepting users' online account usernames and passwords.

Deceptive Phishing: This is the most frequent type of phishing attack, in which a
cybercriminal impersonates a well-known institution, domain, or organization to
acquire sensitive personal information from the victim, such as login credentials,
passwords, bank account information, credit card information, and so on. Because
there is no personalization or customization for the recipients, this form of attack
lacks sophistication.

Spear Phishing: In this type of phishing, emails containing malicious URLs carry
a lot of personalized information about the potential victim. The recipient's name,
company name, designation, friends, co-workers, and other social information
may be included in the email.

1.2 OBJECTIVE
The objective of phishing website detection is to identify and classify
fraudulent websites that mimic legitimate ones to deceive users into providing
sensitive information. This is typically achieved using machine learning
techniques that analyze various features of URLs, content, and user behavior to
differentiate between legitimate and malicious sites. The goal is to enhance online
security by accurately predicting phishing attempts and reducing the risk of user
data compromise.

1.3 MACHINE LEARNING


Machine learning techniques for phishing website detection use algorithms
that analyse features of URLs and website content to differentiate between
legitimate and fraudulent websites. These approaches are highly effective
due to their ability to process large datasets, adapt to evolving phishing tactics,
and provide automated real-time analysis. Commonly used techniques include:
• Random Forest: An ensemble method that builds multiple decision trees and
merges their results for improved accuracy.
• Support Vector Machine (SVM): A classification technique that finds the
optimal hyperplane to separate phishing and legitimate URLs.
• Recurrent neural networks: Long Short-Term Memory (LSTM) networks in
particular are used to capture sequential patterns in URL data.
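
The idea of merging the results of several decision trees can be illustrated with a minimal majority-vote sketch. The three "trees" below are hand-written rules standing in for trained trees, and the feature names and the length threshold of 75 are assumptions made only for illustration; a real Random Forest would learn each tree from a bootstrap sample of the training data.

```python
from collections import Counter

# Hypothetical stand-ins for trained trees: each maps a feature dict to a label.
def tree_long_url(features):
    return "phishing" if features["url_length"] > 75 else "legitimate"

def tree_at_symbol(features):
    return "phishing" if features["has_at_symbol"] else "legitimate"

def tree_https(features):
    return "legitimate" if features["uses_https"] else "phishing"

def majority_vote(trees, features):
    """Return the label predicted by most trees (the ensemble's merged result)."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

TREES = [tree_long_url, tree_at_symbol, tree_https]
sample = {"url_length": 90, "has_at_symbol": False, "uses_https": False}
prediction = majority_vote(TREES, sample)  # two of the three rules vote "phishing"
```

An odd number of trees is used so that a two-class vote cannot end in a tie.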

1.4 ADVANTAGES
• Automated Detection
• Adaptability

• High Accuracy

• Feature Extraction

• Efficient Optimization

• Faster Decision Making

CHAPTER 2
LITERATURE SURVEY

Title : Phish Catcher : Client-side defence against Web Spoofing Attack
using Machine Learning
Year : 2023
Author : Muzammil Ahamed, Wilayat Khan, Zawar Hussain Khan
Phish Catcher is a client-side defence system designed to combat web
spoofing attacks, leveraging machine learning to provide real-time protection
for users. Web spoofing attacks, such as phishing, trick users into interacting with
malicious websites that mimic legitimate ones to steal sensitive data. Using
machine learning, the system evaluates website attributes such as URL patterns,
SSL certificates, visual similarities, and behaviour to identify phishing attempts
with high accuracy, even for previously unseen threats.
The system employs a trained machine learning model that runs locally
on the user's device, ensuring privacy and low latency. The result is a lightweight,
responsive, and effective tool that empowers users to navigate the web securely,
reducing the risk of end-users and organizations falling victim to phishing attacks
while safeguarding sensitive information in an increasingly digital world.

Title : Particle Swarm Optimization-Based Feature Weighting For
Improving Intelligent Phishing Website Detection
Year : 2020
Author : Sharaf Melabary, Waleed Ali
The project focuses on enhancing the detection of phishing websites using
a machine learning approach that incorporates Particle Swarm Optimization
(PSO) for feature weighting. Phishing websites are malicious sites
designed to steal sensitive user information like login credentials and financial
details. To address this, the project aims to improve the performance of machine
learning classifiers. By assigning appropriate weights to features, the project aims
to better distinguish between legitimate and phishing websites, thereby improving
classification accuracy.
In this approach, website attributes such as URL structure, HTML content,
and server behaviour are analyzed and treated as input features for a machine
learning classifier. PSO is used to identify the optimal weight for each feature by
minimizing a fitness function that reflects classification errors. The result is a
robust, intelligent system capable of real-time detection of phishing websites,
contributing significantly to cybersecurity by mitigating online fraud risks and
protecting users from digital threats.
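
As a rough illustration of PSO-based feature weighting, the sketch below searches for per-feature weights that minimize a classification-error fitness function. Everything in it is an assumption made for illustration: the six-row toy dataset, the weighted-score decision rule, and the PSO parameters (inertia 0.7, acceleration coefficients 1.5) are not taken from the cited work.

```python
import random

# Toy data (hypothetical): (binary features, label) with label 1 = phishing.
DATA = [
    ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 0, 0), 1),
    ((0, 1, 1), 0), ((0, 0, 1), 0), ((0, 1, 0), 0),
]

def error_rate(weights):
    """Fitness: fraction of rows misclassified by a weighted-score rule."""
    total = sum(weights) or 1e-9  # avoid division by zero if all weights are 0
    wrong = 0
    for features, label in DATA:
        score = sum(w * x for w, x in zip(weights, features)) / total
        wrong += int((score >= 0.5) != bool(label))
    return wrong / len(DATA)

def pso(fitness, dim, particles=10, iters=30, inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over weights clamped to [0, 1]."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(particles)]
    pos[0] = [1.0] * dim  # one particle starts at equal weights as a baseline
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]           # each particle's best position
    best_fit = [fitness(p) for p in pos]     # and the fitness found there
    g = min(range(particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit < best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit < g_fit:
                    g_fit, g_pos = fit, pos[i][:]
    return g_pos, g_fit

weights, err = pso(error_rate, dim=3)
```

Because one particle starts at equal weights, the returned error can never be worse than the unweighted baseline; whether it reaches zero depends on the random search.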

Title : Detection of phishing website using Machine Learning
Year : 2020
Author : Abdul Razaque, Dauren Sabyrov, Mohamed Ben Haj Frej.
The project aims to detect phishing websites using machine learning
techniques, offering a proactive solution to mitigate online fraud. Phishing
websites mimic legitimate sites to steal sensitive information like passwords,
credit card details, and personal data. This system leverages machine learning
models to analyze various features of websites, such as URL structure, domain
properties, page content, and server behavior. By training these models on datasets
containing both phishing and legitimate websites, the system can identify subtle
patterns and anomalies that distinguish malicious sites from authentic ones,
providing accurate and automated detection.
Key features of websites are extracted and used to train machine learning
classifiers such as Decision Trees, Support Vector Machines (SVM), or ensemble
models like Random Forest. These classifiers analyze input data to
predict whether a website is phishing or legitimate. The approach is efficient,
scalable, and adaptable, enabling the detection of newly emerging phishing sites
that traditional rule-based methods might miss.

Title : Prediction of Phishing website using ML
Year : 2020
Author : Asif Iqbal, Mohammed Hazim, Stephanie Joanne Steven
The project focuses on predicting phishing websites using machine
learning to safeguard users against online fraud and cyberattacks. Phishing
websites are crafted to appear like legitimate sites to deceive users and steal
sensitive information, such as login credentials or financial details. By utilizing
machine learning models, the system learns patterns and anomalies in website
features—such as URL structures, domain information, page content, and
security indicators. This enables the system to predict whether a given website
is legitimate or a phishing attempt, providing proactive and automated protection.
Machine learning algorithms like Random Forest, Support Vector
Machines (SVM), or Neural Networks are trained on datasets containing both
legitimate and phishing websites. These models analyze key attributes of a
website to classify it accurately. The predictive system is designed to work
efficiently in real-time, ensuring that users are warned about phishing threats as
they browse. This approach adapts to emerging phishing techniques, offering a
more robust and scalable solution compared to traditional methods. By enhancing
early detection capabilities, the project contributes significantly to cybersecurity,
reducing the risk of data breaches and identity theft.

CHAPTER 3
SYSTEM ANALYSIS

3.1 EXISTING SYSTEM


Anti-phishing strategies encompass both educating internet users about
potential threats and implementing technical defences to counter phishing attacks.
This paper focuses primarily on reviewing the technical defence methodologies
that have been proposed in recent years. Among these, identifying phishing
websites has proven to be a highly effective approach in mitigating the risk of
users being deceived and their sensitive information being compromised. With the
rapid advancement of machine learning technologies, numerous methodologies
leveraging these techniques have been developed to enhance the accuracy and
efficiency of phishing detection systems. Machine learning-based approaches
analyse various features of websites, such as URL patterns, domain attributes, and
website content, to recognize phishing attempts with improved prediction
performance. The main objective of this paper is to provide a comprehensive
survey of these methodologies and explore effective techniques to prevent
phishing attacks in real-time environments. This review aims to shed light on the
current state of technical defences and guide future developments in combating
phishing threats.

3.1.1 LIMITATIONS
• Phishing attacks are constantly evolving, making it difficult for machine
learning models to keep up.
• Attackers often use sophisticated methods, such as URL obfuscation and
social engineering, to evade detection.
• Phishing websites may share characteristics with legitimate sites,
complicating the feature extraction process.
• Phishing detection needs to occur in real-time to be effective, which can
strain computational resources.
• Machine learning models often rely on historical data for training, which
may not account for new phishing strategies.

3.2 PROPOSED SYSTEM


Phishing is one of the most prevalent forms of cyberattacks, where a
cybercriminal masquerades as a trusted institution, domain, or organization to
deceive victims and extract sensitive personal information. The targeted data often
includes login credentials, passwords, bank account details, credit card
information, and other confidential records. A common tactic in these attacks is
the use of phishing emails containing malicious URLs. These emails are carefully
crafted with significant personalization, incorporating information about the
victim to make the deception more convincing. One advanced variant of phishing,
known as "spear phishing," focuses on specific individuals and involves a high
level of customization. When the target is a senior corporate leader, such as a CEO
or other top-level executive, this method is referred to as "whaling." In these cases,
attackers exploit embedded malicious URLs to compromise the target’s account
or system, gaining access to critical organizational data. This emphasizes the need
for heightened awareness and advanced security protocols to counter such
sophisticated attacks.

3.2.1 ADVANTAGES
• Enhanced Detection Accuracy
• Real-time Mitigation
• Adaptability to Evolving Threats
• Reduction of False Positives

CHAPTER 4
SYSTEM REQUIREMENTS

4.1 HARDWARE REQUIREMENTS

• System : HP IV 2.4 GHz

• Hard Disk : 40 GB.

• Monitor : 15 inch VGA Color.

• Mouse : Logitech Mouse.

• RAM : 512 MB.

• Keyboard : Standard Keyboard.

4.2 SOFTWARE REQUIREMENTS

• Operating System : Windows 7 or later

• Platform : Python technology.

• Tool : Python 3.10.5, Flask

• Front End : HTML, CSS, JavaScript

• Back End : Python, Pandas

4.3 SOFTWARE DESCRIPTION

• Python

• Python Features
Python
Python is a high-level, general-purpose programming language created by
Guido van Rossum in the early 1990s. It has since become one of the most popular
and widely used programming languages due to its simplicity, readability, and
versatility. Python is known for its clear syntax, which makes it easy to learn and
use, even for beginners. It also has a vast and active community that contributes
to its development and support. Python is a versatile language that can be used for
a wide range of tasks, including web development, data science, machine learning,
and scientific computing. It is also a popular choice for scripting and automation
tasks. Python's popularity is due in part to its large and comprehensive standard
library, which includes modules for a variety of tasks, such as file I/O,
networking, and web scraping. Python is a dynamic and interpreted language,
which means that it does not need to be compiled before it can be run. This makes
Python quick and easy to develop with, as you can write and run code without
having to wait for it to compile. Python is also a memory-managed language,
which means that you don't have to worry about manually allocating and
deallocating memory. Python can be used to:
• Develop web applications: Python is a popular choice for web development
due to its ease of use and powerful frameworks like Django and Flask.
• Analyze data: Python is a popular choice for data science due to its powerful
libraries like NumPy and Pandas.
• Build machine learning models: Python is a popular choice for machine
learning due to its powerful libraries like scikit-learn and TensorFlow.

Figure 4.1 Working of Python Interpreter

Python is a powerful and versatile programming language and a popular
choice for programmers of all levels of experience. Its clear and concise syntax
makes it easy to learn and use, even for beginners, so it is a great language for
anyone who wants to get started with programming. It can be used to solve a wide
range of problems in web development, data science, machine learning, and
scientific computing. Python has a large and active community that contributes to
its development and support, and its standard library includes modules for a
variety of tasks, such as file I/O, networking, and web scraping.
Pandas
Pandas is a powerful Python library for data manipulation and analysis,
making it invaluable for phishing website detection systems. It efficiently handles
structured data, allowing developers to preprocess, analyze, and
transform data for machine learning models. Pandas simplifies tasks like loading
datasets from CSV or databases, extracting features such as URL length or special
characters, and cleaning data by handling duplicates or missing values. It supports
exploratory data analysis (EDA) to identify patterns and trends and prepares data
for machine learning by encoding labels or normalizing features. Post-prediction,
Pandas aids in result analysis, such as identifying misclassified URLs, and
integrates with visualization libraries like Matplotlib for reporting. Its seamless
integration with other Python tools and scalability make Pandas essential for
creating robust phishing detection workflows.
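
The Pandas tasks described above can be sketched on a small in-memory table. The column names (`url`, `label`) and the four sample rows are hypothetical; an actual dataset would normally be loaded with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical sample rows, including one duplicate and one missing URL.
df = pd.DataFrame({
    "url": [
        "http://example.com/login",
        "http://example.com/login",
        "https://secure-pay.example.net/@verify?id=1",
        None,
    ],
    "label": ["legitimate", "legitimate", "phishing", "phishing"],
})

# Cleaning: drop rows without a URL, then drop exact duplicates.
df = df.dropna(subset=["url"]).drop_duplicates().reset_index(drop=True)

# Feature extraction: URL length and a count of selected special characters.
df["url_length"] = df["url"].str.len()
df["special_chars"] = df["url"].str.count(r"[@?=\-_%]")

# Label encoding: map class names to integers for the model.
df["target"] = df["label"].map({"legitimate": 0, "phishing": 1})

# Min-max normalization of the length feature to the range [0, 1].
span = df["url_length"].max() - df["url_length"].min()
df["url_length_norm"] = (df["url_length"] - df["url_length"].min()) / span
```

The set of special characters counted here is an arbitrary choice for the sketch; the report does not specify which characters were used.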
Flask
Flask is a lightweight and versatile Python web framework, making it ideal
for building phishing website detection systems. Its simplicity and scalability
enable seamless integration of machine learning models, API development, and
interactive user interfaces. Flask allows easy deployment of ML models as
RESTful APIs, enabling workflows where URLs are input, preprocessed, and
classified as phishing or legitimate. It supports building APIs for integration with
mobile or web applications and serves HTML templates via Jinja2 for basic user
interfaces. Flask also handles feature extraction and preprocessing, such as
analyzing URL length or special characters, and supports real-time phishing
detection through fast request processing. It integrates with databases like
MySQL or MongoDB to store results and log queries while being scalable enough
to fit into microservices-based architectures. Additionally, its error handling and
feedback capabilities enhance user experience and system reliability.
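
A minimal sketch of such an API is shown below. The `/predict` route name, the JSON field `url`, and the `classify_url` rule are assumptions for illustration; in a real deployment `classify_url` would call the trained model rather than a fixed rule.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify_url(url):
    """Placeholder for a trained model: flags long URLs or URLs containing '@'."""
    return "phishing" if len(url) > 75 or "@" in url else "legitimate"

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"url": "http://example.com"}.
    payload = request.get_json(silent=True) or {}
    url = payload.get("url")
    if not url:
        # Basic error handling: report a missing field to the caller.
        return jsonify(error="missing 'url' field"), 400
    return jsonify(url=url, prediction=classify_url(url))
```

During development the application can be exercised without a running server through `app.test_client()`, for example `app.test_client().post("/predict", json={"url": "http://example.com"})`.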

CHAPTER 5
PROJECT DESIGN

5.1 BLOCK DIAGRAM

Figure 5.1 System Architecture

5.2 DATASET

For phishing detection, the dataset was sourced from an open-source
platform called PhishTank, which provided data in CSV format. The dataset
initially contained 18 columns, and it underwent transformation through a series
of data preprocessing techniques. To better understand the features, several data
frame methods were employed for exploration and familiarization with the data
structure. Visualization techniques, such as generating plots and graphs, were
utilized to analyse data distribution and relationships between features.

Upon inspection, it was observed that the Domain column was irrelevant
for training the machine learning model. After removing this column, the dataset
was refined to 16 features alongside a target column. The features from both
legitimate and phishing URL datasets were concatenated in the feature extraction
stage without shuffling, which could potentially introduce bias. To address this,
the data was shuffled to balance its distribution before splitting it into training and
testing sets. Shuffling the dataset helps ensure a fair representation of classes in
both subsets, preventing biases and reducing the risk of overfitting during model
training.
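
The effect of shuffling before splitting can be sketched as follows. The 50/50 toy dataset, the 80/20 split ratio, and the seed are assumptions; the report does not state the actual split that was used.

```python
import random

def shuffle_and_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then split into training and testing sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) - round(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]

# Concatenated without shuffling: all legitimate rows first, then all phishing.
legit = [({"url_length": 20 + i}, 0) for i in range(50)]
phish = [({"url_length": 80 + i}, 1) for i in range(50)]
dataset = legit + phish

# Splitting the unshuffled data would leave only phishing rows in the test set.
unshuffled_test = dataset[80:]

train, test = shuffle_and_split(dataset)
```

A fixed seed keeps the split reproducible between runs, which makes results easier to compare.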

5.3 PREPROCESSING
Preprocessing is a crucial step in phishing website detection as it involves
transforming raw data into a structured format that can be effectively analyzed by
machine learning models. The primary goal of preprocessing is to enhance the
accuracy and efficiency of detection systems by ensuring that the data used is
clean, relevant, and capable of revealing patterns that distinguish phishing
websites from legitimate ones. One of the key tasks in preprocessing is feature
extraction from URLs, which involves identifying important characteristics such
as the length of the URL, the presence of suspicious
keywords, and the structure of the domain name. Features like subdomains, the
use of HTTPS, and domain name patterns are also crucial indicators that need to
be extracted and analyzed. In addition to URL-based features, analyzing the
HTML content of a website is essential. The HTML structure can contain valuable
information, such as the use of specific meta tags, embedded links, and JavaScript,
all of which may point to the presence of phishing attempts.
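
A minimal sketch of URL-based feature extraction using the standard library is given below. The keyword list is an assumption, and the subdomain count naively treats the last two host labels as the registered domain, which is wrong for suffixes such as `co.uk` and for raw IP addresses.

```python
from urllib.parse import urlparse

# Assumed keyword list, for illustration only.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")

def url_features(url):
    """Extract simple lexical features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        # Naive: everything before the last two labels counts as a subdomain.
        "subdomain_count": max(len(labels) - 2, 0),
        "has_suspicious_keyword": any(k in url.lower() for k in SUSPICIOUS_KEYWORDS),
        "host_has_hyphen": "-" in host,
    }

features = url_features("http://paypal.secure-login.example.com/verify")
```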

Data cleaning is another critical aspect of preprocessing, which involves
removing errors, handling missing values, and eliminating irrelevant or duplicate
data entries. This ensures that the dataset used for training the machine learning
models is consistent and of high quality, preventing the model from learning from
noisy or irrelevant information. Once the data is cleaned, transformation
techniques such as tokenization, normalization, and encoding are applied.
Tokenization breaks down the URL or HTML content into smaller, meaningful
units like words or characters, which makes it easier to identify patterns within the
data. Normalization ensures that all features are on a comparable scale, preventing
any one feature from dominating the learning process. Encoding is applied to
convert categorical data, such as specific keywords or domain names, into
numerical values that machine learning algorithms can process.
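
The three transformations can be sketched with plain Python. The function names and the choice of min-max scaling and integer label encoding are assumptions; the report does not state which variants were used.

```python
import re

def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]

def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]

def label_encode(categories):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping

tokens = tokenize("https://secure-login.example.com/verify?id=42")
scaled = min_max_normalize([10, 20, 30])
codes, mapping = label_encode(["com", "net", "com", "org"])
```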

Together, these preprocessing techniques enable the detection system to
identify key patterns and anomalies in the data that can indicate phishing
activities. By transforming raw data into a clean, structured format, preprocessing
makes it easier for machine learning models to learn the distinguishing features
of phishing websites, leading to more accurate and reliable detection.

5.4 FEATURE EXTRACTION
Feature extraction plays a critical role in detecting phishing websites
and web spoofing attacks using machine learning. Key features extracted from
URLs include length, presence of suspicious keywords, domain patterns, and
subdomains, all of which can signal phishing attempts. Additionally, analyzing
HTML content helps identify malicious elements like iFrames, forms designed
to capture user credentials, and suspicious scripts. Server characteristics such as
domain age, IP address location, and SSL certificate validity are also important
indicators, as phishing sites often use newly registered domains or invalid SSL
certificates. Content-based features like text analysis help detect phishing, as these
sites often contain spelling errors or aggressive language to prompt user action.
Behavioural features, such as mouse movements or click patterns, and time spent
on the website, can provide additional insights into suspicious activity. Lastly,
contextual features like reputation data or referral information further assist in
identifying phishing websites.
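
A few of the URL-based features mentioned above can be computed with the Python standard library alone. The sketch below is illustrative: the function name, the feature names, and the keyword list are assumptions made for this example and are not taken from the project's implementation.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative keyword list (an assumption, not the project's actual list).
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def extract_url_features(url):
    """Return a dictionary of simple lexical features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common warning sign.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    lowered = url.lower()
    return {
        "url_length": len(url),
        "num_suspicious_keywords": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
        "host_is_ip": host_is_ip,
        "has_at_symbol": "@" in url,
        "hyphens_in_host": host.count("-"),
        "uses_https": parsed.scheme == "https",
    }


print(extract_url_features("http://192.168.0.1/secure-login/verify.php"))
```

HTML, server, behavioural, and contextual features would need additional data sources (page content, WHOIS records, certificates, user sessions) and are outside the scope of this sketch.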

5.5 MODEL IMPLEMENTATION


Decision Tree Classifier:
Decision trees are widely used models for classification and regression
tasks. They learn a hierarchy of if/else questions that leads to a decision.
Learning a decision tree means finding the sequence of if/else questions that
reaches the correct answer most quickly. To build a tree, the algorithm searches
over all possible tests and selects the one that is most informative about the
target variable. Because decision trees involve little mathematics and are easy
to use and interpret, they are among the most widely used and practical methods
in machine learning, with applications across many areas.

As the name suggests, a decision tree uses a flowchart-like tree structure to
show the predictions that result from a series of feature-based splits. It starts
with a root node and ends with a decision made at the leaves. The main terms are:

• Root node: the node at the beginning of a decision tree; from this node the
population starts dividing according to various features.

• Decision nodes: the nodes obtained after splitting the root node.

• Leaf nodes: the nodes where further splitting is not possible, also called
terminal nodes.

• Sub-tree: a subsection of the decision tree, just as a small portion of a
graph is called a sub-graph.

• Pruning: removing some nodes to reduce overfitting.

This raises the question of when to stop growing the tree. Real-world datasets
usually have a large number of features, which results in a large number of
splits and, in turn, a very large tree.

Such trees take time to build and can lead to overfitting: the tree achieves
very good accuracy on the training dataset but poor accuracy on test data.
There are several ways to tackle this problem through hyperparameter tuning.
The maximum depth of the decision tree can be set using the max_depth
parameter. The larger the value of max_depth, the more complex the tree will
be. The training error will of course decrease as max_depth increases, but
accuracy on the test data can then become very poor. Hence a value is needed
that neither overfits nor underfits the data, and GridSearchCV can be used to
find it.
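
The tuning step described above can be illustrated with scikit-learn. The following is a minimal sketch under stated assumptions: it uses a synthetic dataset from make_classification in place of the phishing dataset, and the candidate depths and the five-fold cross-validation setting are arbitrary choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 samples, 10 features, 2 classes.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate depths to compare; None lets the tree grow until leaves are pure.
param_grid = {"max_depth": [2, 4, 6, 8, 10, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best tree
print("best max_depth:", best_depth)
print("test accuracy:", round(test_accuracy, 3))
```

Because the depth is selected by cross-validation on the training set only, the held-out test accuracy remains an honest estimate of how the chosen tree generalizes.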

Random Forest Classifier:


Random forests are among the most widely used machine learning approaches
for regression and classification. A random forest is a collection of decision
trees, each somewhat different from the others. The idea behind random forests
is that while each tree may do a decent job of predicting, it will almost
certainly overfit on part of the data; building many trees that overfit in
different ways and combining their results reduces the overall overfitting.
Random forests are very powerful, often work well without much parameter
tuning, and do not require feature scaling. Random forest is a supervised
machine learning algorithm: it builds decision trees on different samples of
the data and takes their majority vote for classification and their average
for regression.

Figure 5.2 Random Forest Architecture
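
A random forest of this kind can be trained with scikit-learn's RandomForestClassifier. The sketch below is illustrative only: it assumes a synthetic dataset in place of the phishing data and an arbitrary choice of 100 trees. Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than by counting hard votes, which is a close variant of the majority vote described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 samples, 10 features, 2 classes.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each of the 100 trees is fitted on a bootstrap sample of the training set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
print("number of trees:", len(forest.estimators_))
print("test accuracy:", round(forest_accuracy, 3))
```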

One of the most important features of the Random Forest algorithm is
that it can handle datasets containing continuous variables, as in the case of
regression, and categorical variables, as in the case of classification. It
generally gives better results for classification problems. A real-life analogy
helps to illustrate the concept. A student named X wants to choose a course
after his 10+2 and is unsure which course suits his skill set. So he decides to
consult various people, such as his cousins, teachers, parents, degree students,
and working people. He asks them varied questions, such as why he should choose
a particular course, the job opportunities it offers, and the course fee.
Finally, after consulting them, he decides to take the course suggested by most
of the people. Ensemble learning uses two types of methods:

Bagging – It creates different training subsets by sampling the training data
with replacement, and the final output is based on majority voting. Random
Forest is an example.

Boosting – It combines weak learners into a strong learner by building
models sequentially, so that each model tries to correct the errors of the
previous ones and the final model achieves higher accuracy.
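
The bagging procedure can be sketched in plain Python to show its two ingredients, bootstrap sampling and majority voting. This is a toy illustration only: the data points are made up, the two features (URL length and number of subdomains) are assumed for the example, and a one-rule "decision stump" stands in for the full decision trees that a random forest would use.

```python
import random
from collections import Counter

# Made-up toy data: ((url_length, num_subdomains), label), label 1 = phishing.
DATA = [
    ((20, 1), 0), ((25, 1), 0), ((30, 2), 0), ((22, 1), 0), ((28, 1), 0),
    ((75, 4), 1), ((90, 3), 1), ((60, 5), 1), ((82, 4), 1), ((70, 3), 1),
]


def fit_stump(samples):
    """Pick the (feature, threshold) rule with the fewest training errors."""
    best = None
    for f in range(len(samples[0][0])):
        for threshold in sorted({x[f] for x, _ in samples}):
            errors = sum((x[f] > threshold) != bool(y) for x, y in samples)
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    _, f, threshold = best
    return lambda x: int(x[f] > threshold)


def bagging_fit(samples, n_models, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        boot = [rng.choice(samples) for _ in samples]
        models.append(fit_stump(boot))
    return models


def bagging_predict(models, x):
    """Return the label chosen by the majority of the models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


models = bagging_fit(DATA, n_models=15)
print(bagging_predict(models, (120, 6)), bagging_predict(models, (15, 1)))
```

Boosting differs in that the models are trained one after another on reweighted data instead of independently on bootstrap samples; it is not shown here.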

ADVANTAGES

• Accuracy: Random Forest is less prone to overfitting compared to a
single decision tree, making it highly accurate.

• Handles Non-linear Relationships: Phishing websites may exhibit
complex and non-linear patterns, and Random Forest can capture these
relationships effectively.

• Robustness: Even if some features are noisy or missing, Random Forest
can still make accurate predictions due to its averaging of many trees.

CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENT

6.1 CONCLUSION
This survey reviewed various algorithms and approaches proposed by
researchers for detecting phishing websites using machine learning techniques.
The analysis of the literature revealed that most researchers utilized well-known
machine learning algorithms such as Naïve Bayes, Support Vector Machines
(SVM), Decision Trees, and Random Forests to build their detection models.
These algorithms were chosen for their proven reliability and effectiveness in
handling classification problems, especially in identifying phishing websites. The
survey also summarized experimentally successful techniques in identifying
phishing URLs. As phishing attacks continue to grow in complexity and
frequency, the survey emphasized the need to continuously update detection
systems by incorporating new features or replacing outdated ones. This
adaptability ensures that machine learning models remain effective in combating
the evolving nature of phishing threats. By exploring and integrating these
advancements, the field of phishing detection can further enhance its resilience
against sophisticated cyberattacks.

6.2 FUTURE ENHANCEMENT


Future enhancement of the phishing website detection system focuses on
feature enhancement, that is, improving and augmenting the features used in the
model to increase its predictive power and accuracy. One such feature is the
number of subdomains in the URL, since phishing sites may use multiple
subdomains to appear legitimate.
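
The subdomain count could be computed along the following lines. This is a rough sketch: the function name is hypothetical, and it assumes that the registered domain is always the last two labels of the host name. That assumption fails for suffixes such as "co.uk" and for hosts given as IP addresses, so a practical version would need a public suffix list.

```python
from urllib.parse import urlparse


def count_subdomains(url):
    """Count host labels to the left of the (assumed) two-label domain."""
    host = urlparse(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    # Everything before the last two labels is treated as a subdomain.
    return max(len(labels) - 2, 0)


print(count_subdomains("http://paypal.com.secure.login.example.com/"))
```

A URL such as the one in the example places a well-known brand name in the subdomain part so that the address looks legitimate at first glance, which is why a high count is treated as suspicious.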

REFERENCES

1. Abdulrahman Alreshidi, Ahamed B. Altamimi, Muzammil Ahamed, Wilayat
Khan, Zawar Hussain Khan, “PhishCatcher: Client-side defence against
Web Spoofing Attack using Machine Learning,” IEEE Access - 2023.
2. Abdul Razaque, Aidana Shaikhyn, Dauren Sabyrov, Mohamed Ben Haj Fej,
“Detection of phishing website using Machine Learning” – 2020
3. Abdullateef O. Balogun, Ammar K. Alazzawi, Victor Elijah Adeyemo and
Yazan A. Al-Sarieral, “PSO based Phishing detection Website” – 2022
4. W. Ali, ‘‘Phishing website detection based on supervised machine learning
with wrapper features selection,’’ Int. J. Adv. Comput. Sci. Appl., vol. 8,
no. 9, pp. 72–78 - 2017
5. Asif Iqbal Hajamydeen, Mohammed Hazim Alkawaz, Stephanie Joanne
Steven “Prediction of Phishing website using ML” – 2020.
6. Castaño, E. Fidalgo-Fernández, and F. Janez-Martino, Creation of a Phishing
Kit Dataset for Phishing Websites Identification. León, Spain: TFM, Univ.
León, 2022.
7. Q. Cui, G.-V. Jourdan, G. V. Bochmann, and I.-V. ‘‘Proactive detection of
phishing kit traffic,’’ in Proc. Int. Conf. Appl. Cryptography. Netw. Secur.
Cham, Switzerland: Springer, 2021.
8. A. K. Jain and B. B. Gupta, ‘‘A machine learning based approach for phishing
detection using hyperlinks information,’’ J. Ambient Intell. Humanized
Comput., vol. 10, no. 5, pp. 2015–2028, May 2019.
9. W. Khan, A. Ahmad, A. Qamar, M. Kamran, and M. Altaf, ‘‘SpoofCatch: A
client-side protection tool against phishing attacks,’’ IT Prof., vol. 23, no.
2, pp. 65–74, Mar. 2021.

10. J. Mao, W. Tian, P. Li, T. Wei, and Z. Liang, ‘‘Phishing-alarm: Robust and
efficient phishing detection via page component similarity,’’ IEEE Access,
vol.5, pp. 17020–17030 - 2017.
11. P. Rao, J. Gyani, and G. Narsimha, ‘‘Fake profiles identification in online
social networks using machine learning and NLP,’’ Int. J. Appl. Eng. Res.,
vol. 13, no. 6, pp. 973–4562 – 2018
12. D. Sahoo, C. Liu, and S. C. H. Hoi, ‘‘Malicious URL detection using
machine learning: A survey,’’ 2017, arXiv:1701.07179.
13. M. Sanchez-Paniagua, E. F. Fernandez, E. Alegre, W. Al-Nabki, and V.
Gonzalez-Castro, ‘‘Phishing URL detection: A real-case scenario through
login URLs,’’ IEEE Access, vol. 10, pp. 42949–42960 – 2022
14. E.Sharafam alebary , Waleed Ali (Member, IEEE) “Particle Swarm
Optimization-Based Feature Weighting for Improving Intelligent Phishing
Website Detection.” – 2020.
15. K. Yu, L. Tan, S. Mumtaz, S. Al-Rubaye, A. Al-Dulaimi, A. K. Bashir, and
A. Khan, ‘‘Securing critical infrastructures: Deep-learning-based threat
detection in IIoT,’’ IEEE Commun. Mag., vol. 59, no. 10, pp. 76–82, Oct.
2021.
