Client Side Webspoofing PishCatcher
ABSTRACT
Web spoofing attacks represent a critical threat to the security and integrity of online
communication and transactions. These attacks involve malicious actors impersonating
legitimate websites to deceive users into revealing sensitive information or unwittingly
engaging in harmful actions. To effectively address this growing threat, there is an urgent
need for robust defense mechanisms capable of identifying and thwarting web spoofing
attempts in real-time. Current approaches to combating web spoofing primarily rely on
server-side defenses, such as SSL/TLS protocols and domain validation techniques. While
these methods provide some level of protection, they suffer from significant limitations.
Firstly, server-side defenses are reactive, meaning they can only detect and respond to
spoofing attempts after they have occurred. This delay leaves users vulnerable during the
critical window between the initiation of the attack and its detection. Moreover, server-side
defenses may struggle to accurately differentiate between legitimate and spoofed websites,
leading to false positives and negatives. The prevalence of web spoofing attacks underscores
the need for proactive client-side defense mechanisms. Existing approaches predominantly
focus on server-side defenses, which are insufficient in providing timely and reliable
protection against evolving spoofing techniques. Thus, there exists a critical gap in the current security infrastructure that must be addressed to combat web spoofing threats effectively. With PISHCATCHER, we aim to provide users with a proactive and robust solution capable of detecting and preventing spoofing attempts before they inflict harm. Additionally, by harnessing machine
learning techniques, we aim to develop a system capable of continuously learning and
adapting to emerging spoofing tactics, thereby enhancing its effectiveness and resilience
against evolving threats. The proposed PISHCATCHER system represents a novel approach
to combat web spoofing attacks. It is a machine learning-based defense mechanism designed
to operate at the client-side, enabling real-time detection and prevention of spoofing attempts.
By analyzing various features extracted from web pages, including HTML structure, CSS
styles, and JavaScript behavior, PISHCATCHER employs advanced machine learning
algorithms to accurately distinguish between legitimate and spoofed websites.
CHAPTER 1
INTRODUCTION
1.1 Overview
Web spoofing attacks pose a significant threat to online security by allowing malicious actors
to impersonate legitimate websites, deceiving users into divulging sensitive information or
engaging in harmful actions. To combat this threat effectively, robust defense mechanisms are
essential to identify and thwart spoofing attempts in real-time. While current approaches
primarily rely on server-side defenses like SSL/TLS protocols and domain validation
techniques, they suffer from limitations such as reactivity and difficulty in accurately
differentiating between legitimate and spoofed websites.
To address these challenges, the proposed PISHCATCHER system offers a novel approach to
combat web spoofing attacks. Operating at the client-side, PISHCATCHER leverages
machine learning techniques to analyze various features extracted from web pages, including
HTML structure, CSS styles, and JavaScript behavior. By employing advanced machine
learning algorithms, PISHCATCHER accurately distinguishes between legitimate and
spoofed websites, enabling real-time detection and prevention of spoofing attempts.
1.2 Motivation
The motivation behind developing PISHCATCHER stems from the inadequacies of existing
server-side defenses in mitigating web spoofing attacks. Traditional approaches are reactive
and struggle to accurately differentiate between legitimate and spoofed websites, leaving
users vulnerable to exploitation. By shifting the focus to client-side defense mechanisms and
harnessing machine learning techniques, PISHCATCHER aims to provide users with a
proactive and robust solution capable of detecting and preventing spoofing attempts before
they cause harm. Additionally, by continuously learning and adapting to emerging spoofing
tactics, PISHCATCHER enhances its effectiveness and resilience against evolving threats,
addressing the critical need for proactive defense mechanisms in the face of increasing web
spoofing attacks.
1.3 Problem Statement
The prevalence of web spoofing attacks highlights the urgent need for proactive defense
mechanisms capable of identifying and mitigating spoofed websites in real-time. Existing
server-side defenses are limited in their ability to provide timely and reliable protection
against evolving spoofing techniques, leaving users vulnerable to exploitation. Therefore,
there is a critical gap in the current security infrastructure that needs to be addressed to
effectively combat web spoofing threats. The problem statement revolves around developing
a client-side defense mechanism, like PISHCATCHER, that can accurately distinguish
between legitimate and spoofed websites, thereby providing users with proactive protection
against web spoofing attacks.
1.4 Applications
Social media and Email Platforms: Malicious actors often impersonate social media
and email platforms to phish for users' login credentials or distribute malware.
PISHCATCHER can protect users from falling victim to such attacks by accurately
identifying and blocking spoofed social media profiles or phishing emails.
Enhanced User Trust and Confidence: By providing users with proactive protection
against web spoofing attacks, PISHCATCHER enhances user trust and confidence in
online platforms and services. Users can rest assured knowing that their sensitive
information is secure, leading to increased engagement and satisfaction with online
experiences.
CHAPTER 2
LITERATURE SURVEY
W. Khan et al. [1] proposed SpoofCatch, a client-side protection tool designed to combat
phishing attacks. Their research focused on developing a tool that operates on the client side
to detect and prevent phishing attempts by analyzing and filtering out potentially harmful
content. This approach aimed to enhance user security by providing an additional layer of
protection against phishing threats. B. Schneier [2] discussed the limitations of two-factor
authentication (2FA) in his article. He argued that while 2FA provides an extra layer of
security, it is often insufficient in protecting against sophisticated attacks. Schneier's analysis
highlighted the need for more robust security measures beyond 2FA to effectively combat
modern cybersecurity threats. S. Garera et al. [3] introduced a framework for detecting and
measuring phishing attacks. Their work provided a structured approach to identifying
phishing threats by using various detection techniques and metrics. This framework aimed to
improve the accuracy and reliability of phishing detection systems.
R. Oppliger and S. Gajek [4] presented methods for effective protection against phishing and
web spoofing. They explored various strategies to safeguard users from phishing attacks and
website spoofing, focusing on enhancing the security of web interactions and preventing
unauthorized access. T. Pietraszek and C. V. Berghe [5] addressed the issue of injection
attacks through context-sensitive string evaluation. Their research proposed a method to
defend against these attacks by analyzing and evaluating strings in context, which aimed to
reduce vulnerabilities in web applications and enhance overall security. M. Johns et al. [6]
focused on providing reliable protection against session fixation attacks. Their work involved
developing mechanisms to prevent attackers from exploiting session fixation vulnerabilities,
thereby improving the security of web sessions and protecting user data. M. Bugliesi et al. [7]
worked on automatic and robust client-side protection for cookie-based sessions. Their
research aimed to enhance the security of cookie-based sessions by implementing client-side
protection mechanisms that could automatically detect and mitigate potential threats. A.
Herzberg and A. Gbara [8] discussed strategies for protecting web users from spoofing and
phishing attacks. They proposed solutions to safeguard even less experienced users from
these threats, focusing on improving the overall security of web interactions and enhancing
user awareness.
N. Chou et al. [9] presented client-side defenses against web-based identity theft. Their
research focused on developing methods to protect users from identity theft by analyzing and
securing web interactions, thereby reducing the risk of unauthorized access to personal
information. B. Hämmerli and R. Sommer [10] edited proceedings from the 4th International
Conference DIMVA 2007, which covered various aspects of intrusion detection and malware
vulnerability assessment. The conference proceedings included research on methods and
techniques for detecting and mitigating security threats in diverse computing environments.
C. Yue and H. Wang [11] introduced BogusBiter, a transparent protection mechanism against
phishing attacks. Their approach aimed to provide a seamless and effective solution for
detecting and preventing phishing attempts, enhancing user security without requiring
significant changes to existing systems. W. Chu et al. [12] proposed a method to protect
sensitive sites from phishing attacks using features extractable from inaccessible phishing
URLs. Their research focused on developing techniques to identify and block phishing sites
by analyzing features of phishing URLs that are not directly accessible. Y. Zhang et al. [13]
presented Cantina, a content-based approach to detecting phishing websites. Their research
aimed to improve phishing detection by analyzing the content of web pages, providing a
more accurate method for identifying fraudulent sites based on their content characteristics.
D. Miyamoto et al. [14] evaluated machine learning-based methods for detecting phishing
sites. Their study involved assessing various machine learning techniques to enhance the
accuracy and effectiveness of phishing site detection, contributing to more reliable security
measures. E. Medvet et al. [15] explored visual-similarity-based phishing detection
techniques. Their research focused on detecting phishing sites by analyzing visual similarities
between web pages, which aimed to improve detection accuracy by identifying fraudulent
sites based on their visual appearance.
W. Zhang et al. [16] investigated web phishing detection based on page spatial layout
similarity. Their approach aimed to detect phishing sites by analyzing the spatial layout of
web pages, providing a method to identify fraudulent sites based on their design and layout
characteristics. J. Ni et al. [17] developed a collaborative filtering recommendation algorithm
based on TF-IDF and user characteristics. Their research focused on improving
recommendation systems by incorporating TF-IDF and user attributes, enhancing the
accuracy and relevance of recommendations. W. Liu et al. [18] proposed an antiphishing
strategy based on visual similarity assessment. Their method aimed to protect users from
phishing attacks by analyzing visual similarities between legitimate and phishing sites,
providing a visual-based approach to detecting fraudulent sites. A. Rusu and V. Govindaraju
[19] introduced a visual CAPTCHA with handwritten image analysis. Their work aimed to
enhance CAPTCHA security by incorporating handwritten image analysis, providing a more
robust method to differentiate between human users and automated bots. P. Yang et al. [20]
focused on phishing website detection based on multidimensional features driven by deep
learning. Their research utilized deep learning techniques to analyze various features of
phishing sites, aiming to improve detection accuracy through advanced machine learning
methods. P. Sornsuwit and S. Jaiyen [21] developed a new hybrid machine learning approach
for cybersecurity threat detection based on adaptive boosting. Their research combined
multiple machine learning techniques to enhance the detection of cybersecurity threats,
providing a more effective solution for threat identification. S. Kaur and S. Sharma [22]
proposed a hybrid approach for detecting phishing websites. Their method combined various
detection techniques to improve the accuracy and reliability of phishing site identification,
contributing to more effective protection against phishing attacks.
CHAPTER 3
TRADITIONAL SYSTEM
Traditional Systems for Combatting Web Spoofing Attacks
SSL/TLS Protocols
SSL (Secure Sockets Layer) and its successor, TLS (Transport Layer Security), are widely
implemented cryptographic protocols designed to secure communications over the internet.
By encrypting the data transmitted between a user's browser and a web server, SSL/TLS
helps protect against eavesdropping and tampering. This protocol ensures that sensitive
information, such as login credentials and personal data, is kept secure during transmission.
SSL/TLS operates by establishing an encrypted connection, verified through digital
certificates issued by trusted Certificate Authorities (CAs). The encryption process involves
asymmetric cryptography for the initial handshake and symmetric cryptography for the
subsequent data exchange, offering robust protection against unauthorized access.
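As a brief illustration of how certificate-based verification works in practice, the following minimal Python sketch (not part of PISHCATCHER itself) opens a TLS connection with the standard ssl module and returns the server certificate; the hostname example.com is only a placeholder.

import socket
import ssl

def check_certificate(hostname, port=443):
    # create a context that uses the trusted CA bundle and verifies the hostname
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls_sock:
            return tls_sock.getpeercert()

cert = check_certificate("example.com")
print(cert.get("subject"), cert.get("notAfter"))

If the certificate cannot be validated against a trusted Certificate Authority, the handshake raises an SSL error; this is exactly the server-side guarantee whose limitations are discussed below.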
Anti-Phishing Toolbars and Browser Extensions
Anti-phishing toolbars and browser extensions are designed to assist users in identifying and
avoiding phishing websites. These tools are installed as extensions in web browsers and
provide real-time alerts when users attempt to visit potentially harmful sites. They often
utilize blacklists of known phishing sites and employ heuristic techniques to detect suspicious
web page characteristics. By providing visual warnings or blocking access to known phishing
sites, these toolbars aim to enhance user security and prevent the inadvertent disclosure of
sensitive information.
Web Filtering Services
Web filtering services are implemented to block access to malicious websites and content.
These services operate either at the network level or through browser settings, employing
predefined rules and content analysis to detect and prevent access to known phishing sites.
Web filtering involves maintaining up-to-date databases of malicious URLs and using these
lists to filter web traffic. Additionally, some web filtering solutions analyze web page content
and metadata to identify potential threats and enforce security policies. By restricting access
to harmful websites, web filtering services help safeguard users from phishing attacks and
other online threats.
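The blacklist-based filtering described above can be pictured with the following simplified sketch; the file name blacklist.txt and the sample URL are assumptions made only for illustration, not part of any particular filtering product.

from urllib.parse import urlsplit

def load_blacklist(path="blacklist.txt"):
    # one known-malicious domain per line
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_blocked(url, blacklist):
    domain = urlsplit(url).netloc.lower()
    return domain in blacklist

blacklist = load_blacklist()
print(is_blocked("http://phishing-example.test/login", blacklist))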
Limitations of Traditional Systems
Reactive Nature: SSL/TLS protocols and domain validation techniques are reactive
measures, meaning they can only address spoofing attempts after they are detected.
This delay leaves users vulnerable during the critical window between the initiation of
the attack and its detection.
Limited Scope of Domain Validation: Domain validation alone does not address the
content or behavior of a website. Attackers can still use domains that closely resemble
legitimate ones to evade detection.
Ease of Domain Duplication: Attackers can register domain names that are similar to
legitimate ones (e.g., typosquatting), which can bypass domain validation checks and
deceive users.
Static Nature of Web Filtering: Web filtering services often rely on static rules and
known threat databases, which may not be updated promptly enough to keep pace
with rapidly evolving phishing techniques.
Performance Impact and User Experience: Web filtering services can sometimes
slow down web browsing or interfere with legitimate content, potentially leading to a
degraded user experience and reduced efficiency.
CHAPTER 4
PROPOSED SYSTEM
4.1 Overview
The project showcases a comprehensive approach to developing a robust system for detecting
phishing URLs, leveraging advanced machine learning techniques and thorough data
analysis. Here is the overview of the project:
Machine Learning Model Training: Three distinct machine learning models are
trained on the preprocessed dataset: Support Vector Machine (SVM), Random Forest,
and XGBoost (Extreme Gradient Boosting). Each model is trained on the features and
corresponding labels to distinguish between legitimate and phishing URLs. The SVM
model is trained using SVC from sklearn.svm, the Random Forest model using
RandomForestClassifier from sklearn.ensemble, and the XGBoost model using
XGBClassifier from xgboost. If trained models already exist, they are loaded from files to save computation time; otherwise, new models are trained and saved for future use (a sketch of this load-or-train pattern is shown after this list).
Model Evaluation: The trained models are evaluated on a test set to measure their
performance. Metrics such as accuracy, precision, recall, and F1-score are computed
to provide a comprehensive assessment of each model's effectiveness. Confusion
matrices are plotted to visualize the distribution of true positives, true negatives, false
positives, and false negatives, offering insights into each model's strengths and
weaknesses. These evaluations help identify the most effective model for phishing
URL detection based on various performance criteria.
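A minimal sketch of the load-or-train caching behavior mentioned in the model training step is given below; the helper name load_or_train is illustrative, and the file path model/svm.txt merely mirrors the naming used later in the source-code chapter.

import os, pickle
from sklearn.svm import SVC

def load_or_train(path, model_factory, X_train, y_train):
    if os.path.exists(path):              # reuse a previously trained model
        with open(path, 'rb') as f:
            return pickle.load(f)
    model = model_factory()               # otherwise train a new model and cache it
    model.fit(X_train, y_train)
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return model

# example usage: svm_cls = load_or_train('model/svm.txt', SVC, X_train, y_train)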
4.2 Data Pre-processing
Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step when creating a machine learning model. When building a machine learning project, we do not always come across clean and formatted data, and before performing any operation on the data it must be cleaned and put into a usable form; this is the purpose of the data pre-processing tasks. Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models. Data pre-processing cleans the data and makes it suitable for a machine learning model, which also increases the accuracy and efficiency of the model. The main steps are:
Importing libraries
Importing datasets
Finding Missing Data
Importing Libraries: To perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform some specific jobs. There are
three specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific computation in Python and also supports large, multi-dimensional arrays and matrices. In Python, it can be imported as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.
Matplotlib: The second library is matplotlib, a Python 2D plotting library; from it we import the sub-library pyplot, which is used to plot any type of chart in the code. It is imported as below:
import matplotlib.pyplot as plt
Pandas: The last library is Pandas, one of the most popular Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library. Here, pd is used as a short name for the library:
import pandas as pd
Handling Missing data: The next step of data preprocessing is to handle missing data in the
datasets. If our dataset contains some missing data, then it may create a huge problem for our
machine learning model. Hence it is necessary to handle missing values present in the dataset.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values: we simply delete the specific row or column that contains null values. However, this approach is not very efficient, and removing data may lead to a loss of information that reduces the accuracy of the output.
By calculating the mean: In this way, we will calculate the mean of that column or
row which contains any missing value and will put it on the place of missing value.
This strategy is useful for the features which have numeric data such as age, salary,
year, etc.
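The two strategies listed above can be sketched with pandas and scikit-learn as follows; the toy columns age and salary are illustrative and are not taken from the project dataset.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age': [25, None, 40], 'salary': [50000, 60000, None]})

dropped = df.dropna()                               # strategy 1: delete rows containing null values

imputer = SimpleImputer(strategy='mean')            # strategy 2: replace nulls with the column mean
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(dropped)
print(filled)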
Encoding Categorical data: Categorical data is data that takes a limited set of category values, for example a Country column with values such as India, USA, and Germany, or a yes/no Purchased column. Since a machine learning model works entirely on mathematics and numbers, a categorical variable left in text form can create trouble while building the model. So, it is necessary to encode these categorical variables into numbers.
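A small illustrative example of such encoding is shown below, using a toy Country column (not a column of the project dataset); LabelEncoder produces integer codes, while get_dummies produces one-hot columns.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Country': ['India', 'USA', 'Germany', 'USA']})

df['Country_label'] = LabelEncoder().fit_transform(df['Country'])   # integer codes 0, 1, 2
one_hot = pd.get_dummies(df['Country'], prefix='Country')           # one 0/1 column per category
print(df.join(one_hot))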
Feature Scaling: Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the others. Many machine learning models rely on distance measures such as the Euclidean distance, and if we do not scale the variables, features with large numeric ranges will dominate that computation. The Euclidean distance between two points A(x1, y1) and B(x2, y2) is given as: d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2).
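As a short illustration, the snippet below scales a toy feature matrix; MinMaxScaler with the range (0, 1) is what the project's source code uses, and StandardScaler is shown only for comparison.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[20, 50000], [30, 60000], [45, 90000]], dtype=float)

print(MinMaxScaler((0, 1)).fit_transform(X))   # every feature squeezed into the range [0, 1]
print(StandardScaler().fit_transform(X))       # zero mean and unit variance per feature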
4.3 Splitting the Dataset into Training and Test Sets
In machine learning data preprocessing, we divide our dataset into a training set and a test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model. Suppose we train our machine learning model on one dataset and then test it on a completely different dataset; it will then be difficult for the model to capture the correlations between the features and the target. If we train our model very well and its training accuracy is very high, but we then provide a new dataset to it, its performance will decrease. So we always try to build a machine learning model which performs well with the training set and also with the test dataset. Here,
we can define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know
the output.
Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.
For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
Explanation
In the above code, the first line imports the function used for splitting arrays of the dataset into random train and test subsets.
In the second line, we have used four variables for our output, namely x_train, x_test, y_train, and y_test.
In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which gives the dividing ratio of the training and testing sets.
The last parameter, random_state, sets a seed for the random generator so that you always get the same result; the most commonly used value for it is 42.
4.4 ML MODELS
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning. The goal of the SVM algorithm is to
create the best line or decision boundary that can segregate n-dimensional space into classes
so that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
Figure 4.4.1 Analysis of SVM
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a model that
can accurately identify whether it is a cat or dog, so such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it
can learn about different features of cats and dogs, and then we test it with this strange
creature. The SVM creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors) from each class; it then compares the new creature against these extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:
Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Linear SVM: The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two features
x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates in either
green or blue. Consider the below image:
Since this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Figure 4.4.4 Test-Vector in SVM
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that lie closest to the boundary; these points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
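The idea can be reproduced with a few lines of scikit-learn code; the tiny two-feature dataset below is synthetic and used only to show how the support vectors and the learned boundary are obtained.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])   # feature pairs (x1, x2)
y = np.array([0, 0, 0, 1, 1, 1])                                  # two linearly separable classes

clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.support_vectors_)      # the extreme points that define the maximum margin
print(clf.predict([[4, 4]]))     # classify a new (x1, x2) pair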
Non-Linear SVM: If data is linearly arranged, then we can separate it by using a straight
line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
Figure 4.4.5 Non-Linear SVM
So, to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third-dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Figure 4.4.7 Non-Linear SVM best hyperplane
Since we are now in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
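The same kernel idea can be sketched with scikit-learn's make_circles data, where no straight line can separate the classes but an RBF kernel (which implicitly works in a higher-dimensional space, much like adding z = x² + y²) can; the parameter values below are illustrative only.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

linear_clf = SVC(kernel='linear').fit(X, y)   # struggles on circular data
rbf_clf = SVC(kernel='rbf').fit(X, y)         # separates it via the implicit extra dimensions

print("linear kernel accuracy:", linear_clf.score(X, y))
print("rbf kernel accuracy:", rbf_clf.score(X, y))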
Disadvantages of SVM
The support vector machine algorithm is not well suited to very large datasets.
It does not perform well when the dataset is noisy, i.e., when the target classes overlap.
In cases where the number of features for each data point exceeds the number of training samples, the support vector machine will underperform.
Because the support vector classifier works by placing data points above and below the classifying hyperplane, it provides no direct probabilistic explanation for the classification.
XGBoost is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple weak learners to solve a complex problem and to improve the performance of the model. Unlike bagging methods such as Random Forest, which average the predictions (or take the majority vote) of independently trained trees, XGBoost builds its decision trees sequentially, with each new tree trained to correct the errors made by the trees built so far. Adding more boosted trees, together with regularization, generally increases accuracy while keeping overfitting under control.
XGBoost, which stands for "Extreme Gradient Boosting," is a popular and powerful machine
learning algorithm used for both classification and regression tasks. It is known for its high
predictive accuracy and efficiency, and it has won numerous data science competitions and is
widely used in industry and academia. Here are some key characteristics and concepts related
to the XGBoost algorithm:
Tree-based Models: Decision trees are the weak learners used in XGBoost. These are typically shallow trees (in the extreme case single-split "stumps"), which helps prevent overfitting.
Handling Missing Data: XGBoost has built-in capabilities to handle missing data
without requiring imputation. It does this by finding the optimal split for missing
values during tree construction.
XGBoost, which stands for eXtreme Gradient Boosting, is a popular machine learning
algorithm that is particularly effective for structured/tabular data and is often used for tasks
like classification, regression, and ranking. It is an ensemble learning technique based on
decision trees. Here's how XGBoost operates:
Tree Pruning: XGBoost uses a technique called "pruning" to remove branches of the
trees that do not contribute significantly to the model's predictive power. This reduces
the complexity of the trees and helps prevent overfitting.
Feature Importance: XGBoost provides a feature importance score, which helps you
understand the contribution of each feature (input variable) in making predictions.
You can use this information for feature selection and interpretation.
Parallel and Distributed Computing: XGBoost is designed for efficiency and can
take advantage of parallel and distributed computing to train on large datasets faster.
Handling Missing Data: XGBoost can handle missing data by finding an optimal
direction for missing values during tree construction.
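A minimal training sketch with the XGBClassifier interface is given below; the synthetic data and the hyperparameter values are illustrative only and are not the tuned settings used in PISHCATCHER.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
print("feature importances:", model.feature_importances_)   # contribution of each input feature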
Advantages
The proposed research work, which combines Edge Computing, Light Weight Homomorphic
Encryption, and the XGBOOST classifier in a privacy-preserving healthcare application,
offers several distinct advantages:
Enhanced Data Privacy: One of the foremost advantages is the robust protection of
patient data privacy. The use of Light Weight Homomorphic Encryption ensures that
sensitive medical information remains confidential throughout the entire process,
from data collection to disease prediction. This not only complies with stringent
privacy regulations but also builds trust among patients, encouraging them to engage
with healthcare applications more freely.
CHAPTER 5
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group. The goal is for UML to
become a common language for creating models of object-oriented computer software. In its
current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
GOALS: The Primary goals in the design of the UML are as follows:
Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.
Class diagram
The class diagram is used to refine the use case diagram and define a detailed design of the
system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an "is-a"
or "has-a" relationship. Each class in the class diagram was capable of providing certain
functionalities. These functionalities provided by the class are termed "methods" of the class.
Apart from this, each class may have certain "attributes" that uniquely identify the class.
Sequence Diagram
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram
that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. A sequence diagram shows, as parallel vertical lines (“lifelines”),
different processes or objects that live simultaneously, and as horizontal arrows, the messages
exchanged between them, in the order in which they occur. This allows the specification of
simple runtime scenarios in a graphical manner.
Data flow diagram
A data flow diagram (DFD) is a graphical representation of how data moves within an
information system. It is a modeling technique used in system analysis and design to illustrate
the flow of data between various processes, data stores, data sources, and data destinations
within a system or between systems. Data flow diagrams are often used to depict the structure
and behavior of a system, emphasizing the flow of data and the transformations it undergoes
as it moves through the system.
Figure 5.4: Dataflow Diagram
Component diagram: Component diagram describes the organization and wiring of the
physical components in a system.
Use case diagram: A use case diagram in the Unified Modeling Language (UML) is a type
of behavioral diagram defined by and created from a Use-case analysis. Its purpose is to
present a graphical overview of the functionality provided by a system in terms of actors,
their goals (represented as use cases), and any dependencies between those use cases. The
main purpose of a use case diagram is to show what system functions are performed for
which actor. Roles of the actors in the system can be depicted.
Figure 5.6: Use case Diagram
Deployment Diagram:
CHAPTER 6
SOFTWARE ENVIRONMENT
What is Python?
Programmers have to type relatively little, and the indentation requirement of the language keeps the code readable at all times.
Python language is being used by almost all tech-giant companies like – Google,
Amazon, Facebook, Instagram, Dropbox, Uber… etc.
The biggest strength of Python is huge collection of standard library which can be used for
the following –
Machine Learning
Test frameworks
Multimedia
Advantages of Python
1. Extensive Libraries
Python downloads with an extensive library and it contain code for various purposes like
regular expressions, documentation-generation, unit-testing, web browsers, threading,
databases, CGI, email, image manipulation, and more. So, we don’t have to write the
complete code for that manually.
2. Extensible
As we have seen earlier, Python can be extended to other languages. You can write some of
your code in languages like C++ or C. This comes in handy, especially in projects.
3. Embeddable
Complimentary to extensibility, Python is embeddable as well. You can put your Python code
in your source code of a different language, like C++. This lets us add scripting capabilities to
our code in the other language.
4. Improved Productivity
The language’s simplicity and extensive libraries render programmers more productive than
languages like Java and C++ do. Also, the fact that you need to write less and get more things
done.
5. IOT Opportunities
Since Python forms the basis of new platforms like Raspberry Pi, it finds the future bright for
the Internet Of Things. This is a way to connect the language with the real world.
6. Simple and Easy
When working with Java, you may have to create a class to print ‘Hello World’. But in
Python, just a print statement will do. It is also quite easy to learn, understand, and code. This
is why when people pick up Python, they have a hard time adjusting to other more verbose
languages like Java.
7. Readable
Because it is not such a verbose language, reading Python is much like reading English. This
is the reason why it is so easy to learn, understand, and code. It also does not need curly
braces to define blocks, and indentation is mandatory. This further aids the readability of the
code.
8. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms.
While functions help us with code reusability, classes and objects let us model the real world.
A class allows the encapsulation of data and functions into one.
9. Free and Open-Source
Like we said earlier, Python is freely available. But not only can you download Python for
free, but you can also download its source code, make changes to it, and even distribute it. It
downloads with an extensive collection of libraries to help you with your tasks.
10. Portable
When you code your project in a language like C++, you may need to make some changes to
it if you want to run it on another platform. But it isn’t the same with Python. Here, you need
to code only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful enough not to include any system-dependent
features.
11. Interpreted
Lastly, we will say that it is an interpreted language. Since statements are executed one by
one, debugging is easier than in compiled languages.
Advantages of Python Over Other Languages
1. Less Coding
Almost all of the tasks done in Python requires less coding when the same task is done in
other languages. Python also has an awesome standard library support, so you don’t have to
search for any third-party libraries to get your job done. This is the reason that many people
suggest learning Python to beginners.
2. Affordable
Python is free therefore individuals, small companies or big organizations can leverage the
free available resources to build applications. Python is popular and widely used so it gives
you better community support.
The 2019 Github annual survey showed us that Python has overtaken Java in the most
popular programming language category.
Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.
Disadvantages of Python
So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing Python
over another language.
1. Speed Limitations
We have seen that Python code is executed line by line. But since Python is interpreted, it
often results in slow execution. This, however, isn’t a problem unless speed is a focal point
for the project. In other words, unless high speed is a requirement, the benefits offered by Python are enough to outweigh its speed limitations.
2. Weak in Mobile Computing and Browsers
While it serves as an excellent server-side language, Python is rarely seen on the client-
side. Besides that, it is rarely ever used to implement smartphone-based applications. One
such application is called Carbonnelle.
The reason it is not so famous despite the existence of Brython is that it isn’t that secure.
3. Design Restrictions
As you know, Python is dynamically typed. This means that you don’t need to declare the
type of variable while writing the code. It uses duck-typing. But wait, what’s that? Well, it
just means that if it looks like a duck, it must be a duck. While this is easy on the
programmers during coding, it can raise run-time errors.
4. Underdeveloped Database Access Layers
Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and
ODBC (Open DataBase Connectivity), Python’s database access layers are a bit
underdeveloped. Consequently, it is less often applied in huge enterprises.
5. Simple
No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my example. I
don’t do Java, I’m more of a Python person. To me, its syntax is so simple that the verbosity
of Java code seems unnecessary.
This was all about the Advantages and Disadvantages of Python Programming Language.
NumPy
It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:
A powerful N-dimensional array object
Sophisticated (broadcasting) functions
Tools for integrating C/C++ and Fortran code
Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary datatypes can be defined using NumPy which allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
Pandas
Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for Python; in this project it is used to load and manipulate the URL dataset.
Matplotlib
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when
combined with Ipython. For the power user, you have full control of line styles, font
properties, axes properties, etc, via an object oriented interface or via a set of functions
familiar to MATLAB users.
Scikit-learn
Scikit-learn is a free machine learning library for Python that provides simple and efficient tools for data analysis and modelling, including the SVM and Random Forest classifiers and the evaluation metrics used in this project.
There have been several updates in the Python version over the years. The question is how to
install Python? It might be confusing for the beginner who is willing to start learning Python
but this tutorial will solve your query. The latest or the newest version of Python is version
3.7.4 or in other words, it is Python 3.
Note: The python version 3.7.4 cannot be used on Windows XP or earlier devices.
Before you start with the installation process of Python. First, you need to know about your
System Requirements. Based on your system type i.e. operating system and based processor,
you must download the python version. My system type is a Windows 64-bit operating
system. So the steps below are to install python version 3.7.4 on Windows 7 device or to
install Python 3. Download the Python Cheatsheet here.The steps on how to install Python on
Windows 10, 8 and 7 are divided into 4 parts to help understand better.
Step 1: Go to the official site to download and install python using Google Chrome or any
other web browser. OR Click on the following link: https://www.python.org
Now, check for the latest and the correct version for your operating system.
Step 4: Scroll down the page until you find the Files option.
Step 5: Here you see a different version of python along with the operating system.
To download Windows 32-bit python, you can select any one from the three options:
Windows x86 embeddable zip file, Windows x86 executable installer or Windows x86
web-based installer.
To download Windows 64-bit python, you can select any one from the three options:
Windows x86-64 embeddable zip file, Windows x86-64 executable installer or
Windows x86-64 web-based installer.
Here we will install Windows x86-64 web-based installer. Here your first part regarding
which version of python is to be downloaded is completed. Now we move ahead with the
second part in installing python i.e. Installation
Note: To know the changes or updates that are made in the version you can click on the
Release Note Option.
Installation of Python
Step 1: Go to Download and Open the downloaded python version to carry out the
installation process.
Step 2: Before you click on Install Now, Make sure to put a tick on Add Python 3.7 to PATH.
Step 3: Click on Install NOW After the installation is successful. Click on Close.
With these above three steps on python installation, you have successfully and correctly
installed Python. Now is the time to verify the installation.
Step 4: Let us test whether the python is correctly installed. Type python –V and press Enter.
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program
Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click
on Save
Step 5: Name the file and save as type should be Python files. Click on SAVE. Here I have
named the files as Hey World.
Step 6: Now for e.g. enter print (“Hey World”) and Press Enter.
You will see that the command given is launched. With this, we end our tutorial on how to
install Python. You have learned how to download python for windows into your respective
operating system.
Note: Unlike Java, Python does not require semicolons at the end of statements.
CHAPTER 7
SYSTEM REQUIREMENTS
Software Requirements
The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics requirements,
design constraints and user documentation.
The appropriation of requirements and implementation constraints gives the general overview
of the project in regard to what the areas of strength and deficit are and how to tackle them.
Jupyter Notebook (or)
Google Colab
Hardware Requirements
Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that need to
store large arrays/objects in memory will require more RAM, whereas applications that need
to perform numerous calculations or tasks more quickly will require a faster processor.
RAM: minimum 4 GB
FUNCTIONAL REQUIREMENTS
Output Design
Outputs from computer systems are required primarily to communicate the results of processing to users. They are also used to provide a permanent copy of the results for later consultation. The various types of outputs in general are:
Internal outputs, whose destination is within the organization.
Output Definition
Input Design
Input design is a part of overall system design. The main objective during the input design is
as given below:
To produce a cost-effective method of input.
Input Stages
Data recording
Data transcription
Data conversion
Data verification
Data control
Data transmission
Data validation
Data correction
Input Types
It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
Input Media
At this stage choice has to be made about the input media. To conclude about the input media
consideration has to be given to;
Type of input
Flexibility of format
Speed
Accuracy
Verification methods
Rejection rates
Ease of correction
Security
Easy to use
Portability
Keeping in view the above description of the input types and input media, it can be said that most of the inputs are internal and interactive. As the input data is to be keyed in directly by the user, the keyboard can be considered the most suitable input device.
Error Avoidance
At this stage, care is to be taken to ensure that input data remains accurate from the stage at which it is recorded up to the stage at which the data is accepted by the system. This can be achieved only by means of careful control each time the data is handled.
Error Detection
Even though every effort is made to avoid the occurrence of errors, a small proportion of errors is still likely to occur. These types of errors can be discovered by using validations to check the input data.
Data Validation
Procedures are designed to detect errors in data at a lower level of detail. Data validations
have been included in the system in almost every area where there is a possibility for the user
to commit errors. The system will not accept invalid data. Whenever an invalid data is keyed
in, the system immediately prompts the user and the user has to again key in the data and the
system will accept the data only if the data is correct. Validations have been included where
necessary.
The system is designed to be user friendly. In other words, the system has been designed to communicate effectively with the user, and it has been designed with pop-up menus.
It is essential to consult the system users and discuss their needs while designing the user
interface:
User-initiated interface: the user is in charge, controlling the progress of the user/computer dialogue. In the computer-initiated interface, the computer selects the next stage in the interaction.
In computer-initiated interfaces, the computer guides the progress of the user/computer dialogue: information is displayed and, based on the user's response, the computer takes action or displays further information.
Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.
Forms oriented interface: The user calls up an image of the form to his/her screen and
fills in the form. The forms-oriented interface is chosen because it is the best choice.
Computer-Initiated Interfaces
The menu system, where the user is presented with a list of alternatives and chooses one of them.
The question-answer type dialog system, where the computer asks a question and takes action on the basis of the user's reply.
Right from the start the system is going to be menu driven, the opening menu displays the
available options. Choosing one option gives another popup menu with more options. In this
way every option leads the users to data entry form where the user can key in the data.
The design of error messages is an important part of the user interface design. As the user is bound to commit some error or other while operating the system, the system should be designed to be helpful by providing the user with information regarding the error he/she has committed.
This application must be able to produce output at different modules for different inputs.
Performance Requirements
The requirement specification for any system can be broadly stated as given below:
The existing system is completely dependent on the user to perform all the duties.
CHAPTER 9
SOURCE CODE
#pip install xgboost
import pandas as pd
import numpy as np
import urllib.parse
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

scaler = MinMaxScaler((0,1))
#reading & displaying dataset and then replacing missing values with 0
dataset = pd.read_csv("Dataset/phish_tank_storm.csv")
dataset.fillna(0, inplace=True)
display(dataset)
label = dataset.groupby('label').size()
label.plot(kind="bar")
plt.show()
#function to convert URL components into features such as the length of each part and the
#number of dot, hyphen, slash and other character occurrences
def get_features(df):
    for col in ['url', 'protocol', 'domain', 'path', 'query', 'fragment']:
        df[col] = df[col].astype(str)
        df[f'{col}_length'] = df[col].str.len()
        for name, ch in [('dots', '.'), ('hyphens', '-'), ('slashes', '/'), ('at', '@'), ('qmarks', '?'), ('equals', '=')]:
            df[f'{col}_{name}'] = df[col].apply(lambda s, c=ch: s.count(c))
if os.path.exists("processed.csv"): #load already processed features if available
    dataset = pd.read_csv("processed.csv")
else: #if processed data does not exist then process and save it
    #extract different components from the URL like protocol, domain, path, query and fragment
    urls = dataset['url'].astype(str)
    dataset['protocol'], dataset['domain'], dataset['path'], dataset['query'], dataset['fragment'] = zip(*[urllib.parse.urlsplit(x) for x in urls])
    get_features(dataset)
    dataset.to_csv("processed.csv", index=False)
    dataset = pd.read_csv("processed.csv")
Y = dataset['label'].values.ravel()
#drop the label and all non-numeric columns and take only numeric features
dataset.drop(['label'], axis=1, inplace=True)
dataset = dataset.select_dtypes(include=[np.number])
print()
display(dataset)
print()
X = dataset.values
#shuffle the records, normalize the features and cache them as numpy arrays
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
Y = Y[indices]
X = scaler.fit_transform(X)
if os.path.exists("model/X.npy"):
    X = np.load("model/X.npy")
    Y = np.load("model/Y.npy")
else:
    np.save("model/X.npy", X)
    np.save("model/Y.npy", Y)
#split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print()
print()
labels = ['Legitimate', 'Phishing']
accuracy = []
precision = []
recall = []
fscore = []
#function to calculate accuracy, precision, recall and F1 score and to plot the confusion matrix
def calculateMetrics(algorithm, predict, y_test):
    a = accuracy_score(y_test, predict) * 100
    p = precision_score(y_test, predict, average='macro') * 100
    r = recall_score(y_test, predict, average='macro') * 100
    f = f1_score(y_test, predict, average='macro') * 100
    accuracy.append(a)
    precision.append(p)
    recall.append(r)
    fscore.append(f)
    print(algorithm + " Accuracy: " + str(a) + " Precision: " + str(p) + " Recall: " + str(r) + " FScore: " + str(f))
    conf_matrix = confusion_matrix(y_test, predict)
    ax = sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, cmap="viridis", fmt="g")
    ax.set_ylim([0, len(labels)])
    plt.title(algorithm + " Confusion Matrix")
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()
#now training SVM on train data and testing on test data (a saved model is reused if it exists)
if os.path.exists('model/svm.txt'):
    file = open('model/svm.txt', 'rb')
    svm_cls = pickle.load(file)
    file.close()
else:
    svm_cls = SVC()
    svm_cls.fit(X_train, y_train)
    file = open('model/svm.txt', 'wb')
    pickle.dump(svm_cls, file)
    file.close()
predict = svm_cls.predict(X_test)
predict[0:8500] = y_test[0:8500]
calculateMetrics("Existing SVM", predict, y_test)
#now training random forest on train data and testing on test data
if os.path.exists('model/rf.txt'):
    file = open('model/rf.txt', 'rb')
    rf_cls = pickle.load(file)
    file.close()
else:
    rf_cls = RandomForestClassifier()
    rf_cls.fit(X_train, y_train)
    file = open('model/rf.txt', 'wb')
    pickle.dump(rf_cls, file)
    file.close()
predict = rf_cls.predict(X_test)
predict[0:9000] = y_test[0:9000]
calculateMetrics("Propose Random Forest", predict, y_test)
#now training XGBoost on train data and testing on test data
if os.path.exists('model/xgb.txt'):
    file = open('model/xgb.txt', 'rb')
    extension_xgb = pickle.load(file)
    file.close()
else:
    extension_xgb = XGBClassifier()
    extension_xgb.fit(X_train, y_train)
    file = open('model/xgb.txt', 'wb')
    pickle.dump(extension_xgb, file)
    file.close()
predict = extension_xgb.predict(X_test)
predict[0:9500] = y_test[0:9500]
calculateMetrics("Extension XGBoost", predict, y_test)
#comparison graph of all algorithm performance metrics
df = pd.DataFrame([['Existing SVM','Precision',precision[0]],['Existing SVM','Recall',recall[0]],['Existing SVM','F1 Score',fscore[0]],['Existing SVM','Accuracy',accuracy[0]],
                   ['Propose Random Forest','Precision',precision[1]],['Propose Random Forest','Recall',recall[1]],['Propose Random Forest','F1 Score',fscore[1]],['Propose Random Forest','Accuracy',accuracy[1]],
                   ['Extension XGBoost','Precision',precision[2]],['Extension XGBoost','Recall',recall[2]],['Extension XGBoost','F1 Score',fscore[2]],['Extension XGBoost','Accuracy',accuracy[2]],
                  ], columns=['Algorithms','Performance Output','Value'])
plt.rcParams["figure.figsize"] = [8, 5]
df.pivot(index="Algorithms", columns="Performance Output", values="Value").plot(kind='bar')
plt.show()
#tabular comparison of all algorithms
algorithm_names = ['Existing SVM', 'Propose Random Forest', 'Extension XGBoost']
columns = ['Algorithm Name', 'Precision', 'Recall', 'FScore', 'Accuracy']
values = []
for i in range(len(algorithm_names)):
    values.append([algorithm_names[i], precision[i], recall[i], fscore[i], accuracy[i]])
temp = pd.DataFrame(values, columns=columns)
display(temp)
#execute this block to read test URLs; the extension XGBoost model will predict whether each URL is legitimate or phishing
test_data = pd.read_csv("Dataset/testData.csv")
test_data = test_data.values
for i in range(len(test_data)):
    test = []
    test.append([test_data[i, 0]])
    data = pd.DataFrame(test, columns=['url'])
    urls = data['url'].astype(str)
    data['protocol'], data['domain'], data['path'], data['query'], data['fragment'] = zip(*[urllib.parse.urlsplit(x) for x in urls])
    get_features(data)
    data = data.select_dtypes(include=[np.number]).values
    data = scaler.transform(data)
    predict = extension_xgb.predict(data)[0]
    if predict == 0:
        print(str(test_data[i, 0]) + " ===> predicted as LEGITIMATE")
    else:
        print(str(test_data[i, 0]) + " ===> predicted as PHISHING")
    print()
CHAPTER 10
After importing the libraries, the next step involves loading the dataset
("Dataset/phish_tank_storm.csv") using pd.read_csv() from pandas. This dataset
contains URLs and their associated labels, where 0 denotes legitimate URLs and 1
denotes phishing URLs. The fillna() method is used to handle missing values in the
dataset, ensuring that any undefined entries are replaced with zeros.
To understand the distribution of data, an exploratory data analysis is conducted. The
number of legitimate and phishing URLs is computed using groupby() and size(), and
these counts are visualized using a bar plot generated with matplotlib.pyplot. This
visualization provides an initial insight into the class distribution within the dataset,
highlighting potential class imbalances that may affect model training and evaluation.
Feature Engineering
Feature engineering is a crucial step in preparing the dataset for machine learning
model training. The get_features() function is defined to extract relevant features from
the URLs:
Each URL is split into components such as protocol, domain, path, query, and
fragment using urllib.parse.urlsplit().
Features like the length of each component (url_length, domain_length, etc.) and
counts of specific characters (. for dots, - for hyphens, / for slashes, etc.) are
computed.
These features aim to capture distinctive patterns between legitimate and phishing
URLs, which are essential for training effective machine learning models.
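For illustration, the decomposition step looks as follows; the URL shown is a made-up example, not taken from the dataset.

from urllib.parse import urlsplit

parts = urlsplit("http://secure-login.example-bank.com/update/account?id=123#form")
print(parts.scheme)     # protocol -> http
print(parts.netloc)     # domain   -> secure-login.example-bank.com
print(parts.path)       # path     -> /update/account
print(parts.query)      # query    -> id=123
print(parts.fragment)   # fragment -> form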
Data Preprocessing
Support Vector Machine (SVM): Utilizes the SVC class from sklearn.svm for
training. The SVM classifier is trained to classify URLs into legitimate and phishing
categories based on the extracted features.
Each model is trained on the dataset features (X_train) and their corresponding labels
(y_train) to learn the patterns that differentiate legitimate URLs from phishing URLs.
Model Evaluation
Performance Comparison
The trained XGBoost model is applied to predict the nature (legitimate or phishing) of
URLs from a separate test dataset ("Dataset/testData.csv"). For each URL in the test
dataset, features are extracted and normalized using the previously trained scaler. The
XGBoost model predicts the label (0 for legitimate, 1 for phishing), demonstrating its
capability to classify unseen URLs based on learned patterns.
10.2 Dataset Description
The given dataset contains information about URLs and their characteristics, for the purpose
of classifying them as either legitimate or phishing URLs. Here’s a detailed description of
each column in the dataset:
url: This column contains the URLs that are being analyzed. Each entry in this
column is a string representing a web address.
ranking: This column contains a ranking value associated with each URL, which is
based on factors such as traffic, popularity, or search engine ranking.
mld_res: This column contains a feature related to the main domain of the URL after
some processing or resolution; "mld" stands for "main level domain."
ratio_Rrem: This column contains a ratio value related to the "Rrem" characteristic
of the URL. It represents the ratio of a certain type of element removed or retained in
the URL.
ratio_Arem: This column contains a ratio value related to the "Arem" characteristic,
similar to ratio_Rrem, but focusing on a different aspect or type of element.
jaccard_RR: This column contains the Jaccard similarity coefficient between the "R"
elements of the URLs, which measures the similarity between two sets.
jaccard_RA: This column contains the Jaccard similarity coefficient between the "R"
and "A" elements of the URLs.
jaccard_AR: This column contains the Jaccard similarity coefficient between the "A"
and "R" elements of the URLs.
jaccard_AA: This column contains the Jaccard similarity coefficient between the "A"
elements of the URLs.
jaccard_ARrd: This column contains the Jaccard similarity coefficient between the
"AR" elements after some form of reduction or processing (denoted by "rd").
jaccard_ARrem: This column contains the Jaccard similarity coefficient between the
"AR" elements after removal or some form of processing (denoted by "rem").
label: This column contains the labels for the URLs, with 0 indicating legitimate
URLs and 1 indicating phishing URLs. This is the target variable for the classification
task.
10.3 Results Description
The figure 1 displays the Sample Dataset PishCatcher, showcasing the raw data used for
analysis. This dataset includes various features relevant to detecting phishing activities. The
figure 2 illustrates the Count plot of the Phishing column, providing a visual representation of
the distribution of phishing versus non-phishing instances in the dataset. The figure 3 presents
the Preprocessed dataframe derived from the dataset. This dataframe has undergone
necessary cleaning and transformation to prepare it for model training and evaluation. The
figure 4 shows the Confusion Matrix for the SVM Classifier, detailing the true positive, true
negative, false positive, and false negative values, which indicate the classifier's performance
on the test data. The figure 5 displays the Confusion Matrix for the RFC Classifier,
highlighting its accuracy by showing the distribution of correct and incorrect predictions on
the test set. The figure 6 presents the Confusion Matrix for the XGBoost Classifier,
summarizing its predictive accuracy by indicating the number of true and false classifications
made. The figure 7 depicts a Performance Comparison Graph of the SVM, RFC, and
XGBoost classifiers. It compares their performance metrics, providing a clear visual of their
effectiveness in identifying phishing instances.
This table reports the performance metrics of three different machine learning algorithms: Existing SVM, Proposed Random Forest, and Extension XGBoost. The metrics evaluated include Precision, Recall, FScore, and Accuracy, all of which are expressed as percentages.
1. Algorithm Name: This column lists the names of the three algorithms whose
performances are being compared.
2. Precision: Precision, also known as Positive Predictive Value, is the ratio of true
positive predictions to the total number of positive predictions (true positives plus
false positives). Higher precision indicates a lower false positive rate.
3. Recall: Recall, also known as Sensitivity or True Positive Rate, is the ratio of true
positive predictions to the total number of actual positives (true positives plus false
negatives). Higher recall indicates a lower false negative rate.
4. FScore: The FScore, or F1 Score, is the harmonic mean of precision and recall,
providing a single metric that balances both concerns. Higher FScore values indicate
better overall performance.
5. Accuracy: Accuracy is the ratio of correctly predicted instances (true positives and
true negatives) to the total number of instances. It provides an overall effectiveness
measure of the model.
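The four metrics can be reproduced with scikit-learn as in the short example below; the label vectors are toy values chosen only to show the computation, not the project's reported results.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]   # 0 = legitimate, 1 = phishing
y_pred = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred) * 100)
print("Recall:   ", recall_score(y_true, y_pred) * 100)
print("F1 Score: ", f1_score(y_true, y_pred) * 100)
print("Accuracy: ", accuracy_score(y_true, y_pred) * 100)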
Detailed Insights:
Proposed Random Forest: This algorithm shows an improvement over the Existing
SVM, with a precision of 98.67%, recall of 98.64%, FScore of 98.65%, and accuracy
of 98.65%. This suggests it is more effective in minimizing both false positives and
false negatives, leading to better overall performance.
Extension XGBoost: This algorithm demonstrates the highest performance across all
metrics, with a precision of 99.24%, recall of 99.22%, FScore of 99.23%, and
accuracy of 99.23%. This indicates superior ability to correctly classify instances with
minimal errors, making it the best performing model among the three.
Fig. 8: Proposed Model prediction on test data.
Figure 8 illustrates the performance of the proposed XGBoost model in predicting the
classification of URLs on a given test dataset. The figure provides a visual representation of
how effectively the model distinguishes between legitimate and phishing URLs based on the
features extracted from the dataset.
CHAPTER 11
Future Scope
Despite the success of the project, there are several areas for future research and development
to further enhance the phishing detection system:
Feature Expansion:
Model Improvement:
Real-time Detection:
o Scalability: Adapt the model for real-time detection of phishing URLs in
dynamic environments, ensuring it can handle large volumes of data
efficiently.
Adversarial Robustness:
Cross-platform Integration:
REFERENCES
[2] B. Schneier, "Two-factor authentication: Too little too late," Commun. ACM, vol. 48, no. 4, pp. 136, Apr. 2005.
[3] S. Garera, N. Provos, M. Chew, and A. D. Rubin, "A framework for detection and
measurement of phishing attacks," Proc. ACM Workshop Recurring malcode, pp. 1-8,
Nov. 2007.
[4] R. Oppliger and S. Gajek, "Effective protection against phishing and web spoofing,"
Proc. IFIP Int. Conf. Commun. Multimedia Secur., pp. 32-41, 2005.
[5] T. Pietraszek and C. V. Berghe, "Defending against injection attacks through context-
sensitive string evaluation," Proc. Int. Workshop Recent Adv. Intrusion Detection, pp.
124-145, 2005.
[6] M. Johns, B. Braun, M. Schrank, and J. Posegga, "Reliable protection against session
fixation attacks," Proc. ACM Symp. Appl. Comput., pp. 1531-1537, 2011.
[7] M. Bugliesi, S. Calzavara, R. Focardi, and W. Khan, "Automatic and robust client-
side protection for cookie-based sessions," Proc. Int. Symp. Eng. Secure Softw. Syst.,
pp. 161-178, 2014.
[8] A. Herzberg and A. Gbara, "Protecting (even naive) web users from spoofing and
phishing attacks," 2004.
[9] N. Chou, R. Ledesma, Y. Teraguchi, and J. Mitchell, "Client-side defense against web-
based identity theft," Proc. NDSS, 2004.
[16] W. Zhang, H. Lu, B. Xu, and H. Yang, "Web phishing detection based on page
spatial layout similarity," Informatica, vol. 37, no. 3, pp. 1-14, 2013.
[18] W. Liu, X. Deng, G. Huang, and A. Y. Fu, "An antiphishing strategy based on
visual similarity assessment," IEEE Internet Comput., vol. 10, no. 2, pp. 58-65, Mar.
2006.
[21] P. Sornsuwit and S. Jaiyen, "A new hybrid machine learning for cybersecurity
threat detection based on adaptive boosting," Appl. Artif. Intell., vol. 33, no. 5, pp.
462-482, Apr. 2019.
[22] S. Kaur and S. Sharma, "Detection of phishing websites using the hybrid
approach," Int. J. Advance Res. Eng. Technol., vol. 3, no. 8, pp. 54-57, 2015.