
MACHINE LEARNING-BASED CLIENT-SIDE DEFENSE

AGAINST WEB SPOOFING ATTACKS

ABSTRACT
Web spoofing attacks represent a critical threat to the security and integrity of online
communication and transactions. These attacks involve malicious actors impersonating
legitimate websites to deceive users into revealing sensitive information or unwittingly
engaging in harmful actions. To effectively address this growing threat, there is an urgent
need for robust defense mechanisms capable of identifying and thwarting web spoofing
attempts in real-time. Current approaches to combating web spoofing primarily rely on
server-side defenses, such as SSL/TLS protocols and domain validation techniques. While
these methods provide some level of protection, they suffer from significant limitations.
Firstly, server-side defenses are reactive, meaning they can only detect and respond to
spoofing attempts after they have occurred. This delay leaves users vulnerable during the
critical window between the initiation of the attack and its detection. Moreover, server-side
defenses may struggle to accurately differentiate between legitimate and spoofed websites,
leading to false positives and negatives. The prevalence of web spoofing attacks underscores
the need for proactive client-side defense mechanisms. Existing approaches predominantly
focus on server-side defenses, which are insufficient in providing timely and reliable
protection against evolving spoofing techniques. Thus, there is a critical gap in the
current security infrastructure that must be addressed to combat web spoofing threats
effectively. In this work, we aim to provide users with a proactive and robust solution capable of detecting and
preventing spoofing attempts before they inflict harm. Additionally, by harnessing machine
learning techniques, we aim to develop a system capable of continuously learning and
adapting to emerging spoofing tactics, thereby enhancing its effectiveness and resilience
against evolving threats. The proposed PISHCATCHER system represents a novel approach
to combat web spoofing attacks. It is a machine learning-based defense mechanism designed
to operate at the client-side, enabling real-time detection and prevention of spoofing attempts.
By analyzing various features extracted from web pages, including HTML structure, CSS
styles, and JavaScript behavior, PISHCATCHER employs advanced machine learning
algorithms to accurately distinguish between legitimate and spoofed websites.
CHAPTER 1

INTRODUCTION
1.1 Overview

Web spoofing attacks pose a significant threat to online security by allowing malicious actors
to impersonate legitimate websites, deceiving users into divulging sensitive information or
engaging in harmful actions. To combat this threat effectively, robust defense mechanisms are
essential to identify and thwart spoofing attempts in real-time. While current approaches
primarily rely on server-side defenses like SSL/TLS protocols and domain validation
techniques, they suffer from limitations such as reactivity and difficulty in accurately
differentiating between legitimate and spoofed websites.

To address these challenges, the proposed PISHCATCHER system offers a novel approach to
combat web spoofing attacks. Operating at the client-side, PISHCATCHER leverages
machine learning techniques to analyze various features extracted from web pages, including
HTML structure, CSS styles, and JavaScript behavior. By employing advanced machine
learning algorithms, PISHCATCHER accurately distinguishes between legitimate and
spoofed websites, enabling real-time detection and prevention of spoofing attempts.

1.2 Research Motivation

The motivation behind developing PISHCATCHER stems from the inadequacies of existing
server-side defenses in mitigating web spoofing attacks. Traditional approaches are reactive
and struggle to accurately differentiate between legitimate and spoofed websites, leaving
users vulnerable to exploitation. By shifting the focus to client-side defense mechanisms and
harnessing machine learning techniques, PISHCATCHER aims to provide users with a
proactive and robust solution capable of detecting and preventing spoofing attempts before
they cause harm. Additionally, by continuously learning and adapting to emerging spoofing
tactics, PISHCATCHER enhances its effectiveness and resilience against evolving threats,
addressing the critical need for proactive defense mechanisms in the face of increasing web
spoofing attacks.
1.3 Problem Statement

The prevalence of web spoofing attacks highlights the urgent need for proactive defense
mechanisms capable of identifying and mitigating spoofed websites in real-time. Existing
server-side defenses are limited in their ability to provide timely and reliable protection
against evolving spoofing techniques, leaving users vulnerable to exploitation. Therefore,
there is a critical gap in the current security infrastructure that needs to be addressed to
effectively combat web spoofing threats. The problem statement revolves around developing
a client-side defense mechanism, like PISHCATCHER, that can accurately distinguish
between legitimate and spoofed websites, thereby providing users with proactive protection
against web spoofing attacks.

1.4 Applications

 Online Banking and Financial Transactions: Protecting users' financial information is paramount in online banking and financial transactions. PISHCATCHER can safeguard users from falling victim to spoofed banking websites, preventing unauthorized access to sensitive financial data and transactions.

 E-commerce Platforms: E-commerce platforms are frequent targets of web spoofing attacks aiming to steal users' personal and payment information. PISHCATCHER can ensure the integrity of online shopping experiences by detecting and blocking spoofed product pages, safeguarding users' financial and personal data.

 Social Media and Email Platforms: Malicious actors often impersonate social media and email platforms to phish for users' login credentials or distribute malware. PISHCATCHER can protect users from falling victim to such attacks by accurately identifying and blocking spoofed social media profiles or phishing emails.

 Government and Healthcare Portals: Government and healthcare portals contain sensitive information that malicious actors may target through web spoofing attacks. PISHCATCHER can help prevent unauthorized access to confidential data and ensure the security and integrity of these portals by detecting and blocking spoofed websites.

1.5 Need and Significance

 Continuous Learning and Adaptation: PISHCATCHER's ability to continuously learn and adapt to emerging spoofing tactics ensures its effectiveness and resilience against evolving threats. By staying up-to-date with the latest spoofing patterns, PISHCATCHER can effectively detect and prevent new and sophisticated spoofing attempts.

 Real-time Detection and Prevention: The real-time detection and prevention capabilities of PISHCATCHER enable immediate action against spoofing attempts, minimizing the window of vulnerability for users. This proactive approach ensures timely protection against web spoofing attacks, safeguarding users' sensitive information and transactions.

 Enhanced User Trust and Confidence: By providing users with proactive protection against web spoofing attacks, PISHCATCHER enhances user trust and confidence in online platforms and services. Users can rest assured knowing that their sensitive information is secure, leading to increased engagement and satisfaction with online experiences.
CHAPTER 2

LITERATURE SURVEY
W. Khan et al. [1] proposed SpoofCatch, a client-side protection tool designed to combat
phishing attacks. Their research focused on developing a tool that operates on the client side
to detect and prevent phishing attempts by analyzing and filtering out potentially harmful
content. This approach aimed to enhance user security by providing an additional layer of
protection against phishing threats. B. Schneier [2] discussed the limitations of two-factor
authentication (2FA) in his article. He argued that while 2FA provides an extra layer of
security, it is often insufficient in protecting against sophisticated attacks. Schneier's analysis
highlighted the need for more robust security measures beyond 2FA to effectively combat
modern cybersecurity threats. S. Garera et al. [3] introduced a framework for detecting and
measuring phishing attacks. Their work provided a structured approach to identifying
phishing threats by using various detection techniques and metrics. This framework aimed to
improve the accuracy and reliability of phishing detection systems.

R. Oppliger and S. Gajek [4] presented methods for effective protection against phishing and
web spoofing. They explored various strategies to safeguard users from phishing attacks and
website spoofing, focusing on enhancing the security of web interactions and preventing
unauthorized access. T. Pietraszek and C. V. Berghe [5] addressed the issue of injection
attacks through context-sensitive string evaluation. Their research proposed a method to
defend against these attacks by analyzing and evaluating strings in context, which aimed to
reduce vulnerabilities in web applications and enhance overall security. M. Johns et al. [6]
focused on providing reliable protection against session fixation attacks. Their work involved
developing mechanisms to prevent attackers from exploiting session fixation vulnerabilities,
thereby improving the security of web sessions and protecting user data. M. Bugliesi et al. [7]
worked on automatic and robust client-side protection for cookie-based sessions. Their
research aimed to enhance the security of cookie-based sessions by implementing client-side
protection mechanisms that could automatically detect and mitigate potential threats. A.
Herzberg and A. Gbara [8] discussed strategies for protecting web users from spoofing and
phishing attacks. They proposed solutions to safeguard even less experienced users from
these threats, focusing on improving the overall security of web interactions and enhancing
user awareness.
N. Chou et al. [9] presented client-side defenses against web-based identity theft. Their
research focused on developing methods to protect users from identity theft by analyzing and
securing web interactions, thereby reducing the risk of unauthorized access to personal
information. B. Hämmerli and R. Sommer [10] edited proceedings from the 4th International
Conference DIMVA 2007, which covered various aspects of intrusion detection and malware
vulnerability assessment. The conference proceedings included research on methods and
techniques for detecting and mitigating security threats in diverse computing environments.

C. Yue and H. Wang [11] introduced BogusBiter, a transparent protection mechanism against
phishing attacks. Their approach aimed to provide a seamless and effective solution for
detecting and preventing phishing attempts, enhancing user security without requiring
significant changes to existing systems. W. Chu et al. [12] proposed a method to protect
sensitive sites from phishing attacks using features extractable from inaccessible phishing
URLs. Their research focused on developing techniques to identify and block phishing sites
by analyzing features of phishing URLs that are not directly accessible. Y. Zhang et al. [13]
presented Cantina, a content-based approach to detecting phishing websites. Their research
aimed to improve phishing detection by analyzing the content of web pages, providing a
more accurate method for identifying fraudulent sites based on their content characteristics.
D. Miyamoto et al. [14] evaluated machine learning-based methods for detecting phishing
sites. Their study involved assessing various machine learning techniques to enhance the
accuracy and effectiveness of phishing site detection, contributing to more reliable security
measures. E. Medvet et al. [15] explored visual-similarity-based phishing detection
techniques. Their research focused on detecting phishing sites by analyzing visual similarities
between web pages, which aimed to improve detection accuracy by identifying fraudulent
sites based on their visual appearance.

W. Zhang et al. [16] investigated web phishing detection based on page spatial layout
similarity. Their approach aimed to detect phishing sites by analyzing the spatial layout of
web pages, providing a method to identify fraudulent sites based on their design and layout
characteristics. J. Ni et al. [17] developed a collaborative filtering recommendation algorithm
based on TF-IDF and user characteristics. Their research focused on improving
recommendation systems by incorporating TF-IDF and user attributes, enhancing the
accuracy and relevance of recommendations. W. Liu et al. [18] proposed an antiphishing
strategy based on visual similarity assessment. Their method aimed to protect users from
phishing attacks by analyzing visual similarities between legitimate and phishing sites,
providing a visual-based approach to detecting fraudulent sites. A. Rusu and V. Govindaraju
[19] introduced a visual CAPTCHA with handwritten image analysis. Their work aimed to
enhance CAPTCHA security by incorporating handwritten image analysis, providing a more
robust method to differentiate between human users and automated bots. P. Yang et al. [20]
focused on phishing website detection based on multidimensional features driven by deep
learning. Their research utilized deep learning techniques to analyze various features of
phishing sites, aiming to improve detection accuracy through advanced machine learning
methods. P. Sornsuwit and S. Jaiyen [21] developed a new hybrid machine learning approach
for cybersecurity threat detection based on adaptive boosting. Their research combined
multiple machine learning techniques to enhance the detection of cybersecurity threats,
providing a more effective solution for threat identification. S. Kaur and S. Sharma [22]
proposed a hybrid approach for detecting phishing websites. Their method combined various
detection techniques to improve the accuracy and reliability of phishing site identification,
contributing to more effective protection against phishing attacks.
CHAPTER 3

TRADITIONAL SYSTEM
Traditional Systems for Combatting Web Spoofing Attacks

SSL/TLS Protocols

SSL (Secure Sockets Layer) and its successor, TLS (Transport Layer Security), are widely
implemented cryptographic protocols designed to secure communications over the internet.
By encrypting the data transmitted between a user's browser and a web server, SSL/TLS
helps protect against eavesdropping and tampering. This protocol ensures that sensitive
information, such as login credentials and personal data, is kept secure during transmission.
SSL/TLS operates by establishing an encrypted connection, verified through digital
certificates issued by trusted Certificate Authorities (CAs). The encryption process involves
asymmetric cryptography for the initial handshake and symmetric cryptography for the
subsequent data exchange, offering robust protection against unauthorized access.
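As a brief illustration of what a client can observe during this handshake, the following sketch uses Python's standard ssl and socket modules to fetch and print the certificate a server presents; the hostname is a placeholder, and the snippet is not part of the proposed system.

# Illustrative only: inspect the certificate a server presents during the TLS handshake.
import socket
import ssl

def fetch_certificate(hostname, port=443):
    # create_default_context() verifies the certificate against the system's
    # trusted CA store and checks that it matches the hostname.
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()

cert = fetch_certificate("www.example.com")  # placeholder hostname
print(cert.get("subject"), cert.get("issuer"), cert.get("notAfter"))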

Domain Validation Techniques

Domain validation techniques focus on verifying the authenticity of a website's domain
name. This process involves checking the registration details of a domain and ensuring it
matches the claimed identity of the organization or individual operating the website. Domain
validation helps ensure that users are connecting to legitimate websites rather than fraudulent
ones. It is commonly used in conjunction with SSL/TLS certificates to enhance trust in online
transactions. These techniques involve various methods, such as querying domain registration
databases and inspecting domain name patterns, to detect discrepancies and potential
phishing attempts.

Anti-Phishing Toolbars and Browser Extensions

Anti-phishing toolbars and browser extensions are designed to assist users in identifying and
avoiding phishing websites. These tools are installed as extensions in web browsers and
provide real-time alerts when users attempt to visit potentially harmful sites. They often
utilize blacklists of known phishing sites and employ heuristic techniques to detect suspicious
web page characteristics. By providing visual warnings or blocking access to known phishing
sites, these toolbars aim to enhance user security and prevent the inadvertent disclosure of
sensitive information.
Web Filtering Services

Web filtering services are implemented to block access to malicious websites and content.
These services operate either at the network level or through browser settings, employing
predefined rules and content analysis to detect and prevent access to known phishing sites.
Web filtering involves maintaining up-to-date databases of malicious URLs and using these
lists to filter web traffic. Additionally, some web filtering solutions analyze web page content
and metadata to identify potential threats and enforce security policies. By restricting access
to harmful websites, web filtering services help safeguard users from phishing attacks and
other online threats.

Drawbacks of Traditional Systems

 Reactive Nature: SSL/TLS protocols and domain validation techniques are reactive
measures, meaning they can only address spoofing attempts after they are detected.
This delay leaves users vulnerable during the critical window between the initiation of
the attack and its detection.

 Certificate Validation Issues: SSL/TLS relies on certificates issued by Certificate Authorities (CAs). If attackers manage to obtain fraudulent certificates or compromise a CA, they can bypass SSL/TLS protection, rendering this method less effective against sophisticated attacks.

 Complexity and Error-Prone Configuration: Proper configuration and maintenance of SSL/TLS can be complex and prone to errors. Misconfigurations or outdated certificates can create vulnerabilities that attackers might exploit.

 Limited Scope of Domain Validation: Domain validation alone does not address the
content or behavior of a website. Attackers can still use domains that closely resemble
legitimate ones to evade detection.

 Ease of Domain Duplication: Attackers can register domain names that are similar to
legitimate ones (e.g., typosquatting), which can bypass domain validation checks and
deceive users.

 Dependence on Blacklists: Anti-phishing toolbars rely on blacklists of known phishing sites, which may not be updated frequently enough to catch new phishing threats. This reliance can lead to missed detections of emerging phishing sites.
 User Compliance Issues: The effectiveness of anti-phishing toolbars depends on user
adherence. Users may disable or ignore these toolbars, reducing their effectiveness in
protecting against phishing attacks.

 Static Nature of Web Filtering: Web filtering services often rely on static rules and
known threat databases, which may not be updated promptly enough to keep pace
with rapidly evolving phishing techniques.

 Performance Impact and User Experience: Web filtering services can sometimes
slow down web browsing or interfere with legitimate content, potentially leading to a
degraded user experience and reduced efficiency.
CHAPTER 4

PROPOSED SYSTEM
4.1 Overview

The project showcases a comprehensive approach to developing a robust system for detecting
phishing URLs, leveraging advanced machine learning techniques and thorough data
analysis. Here is the overview of the project:

 Initialization and Dataset Handling: The project was initiated by importing the necessary Python packages and libraries, including those for data handling (pandas, numpy), machine learning (sklearn, xgboost), visualization (matplotlib, seaborn), and URL parsing (urllib). This ensures all essential tools are available for subsequent tasks. The dataset, located in "Dataset/phish_tank_storm.csv", is loaded using pandas, which provides a robust framework for data manipulation. Missing values within the dataset are replaced with zeros to prevent computational errors during feature extraction and model training. The dataset comprises URLs labeled as either legitimate (0) or phishing (1), serving as the foundation for model training and evaluation.

 Exploratory Data Analysis (EDA): Exploratory Data Analysis is conducted to gain an initial understanding of the dataset's composition. The labels are grouped and counted to determine the number of legitimate and phishing URLs. This distribution is visualized using a bar plot, offering a clear picture of class balance within the dataset. Such visualization is crucial, as it highlights any potential class imbalance, which can significantly impact model performance and necessitate techniques such as oversampling, undersampling, or class weighting during model training.

 Feature Engineering: Feature engineering is a critical step in which meaningful features are extracted from the raw URL data to improve model performance. A custom function, get_features(), is defined to extract various features from URLs. This function computes the length of different URL components (e.g., domain, path) and counts occurrences of specific characters (e.g., dots, hyphens, slashes) within these components; an illustrative sketch of such a function is shown after this list. These features help capture patterns that can differentiate legitimate URLs from phishing ones. The extracted features are saved in a processed dataset file ("processed.csv") for consistency and to avoid redundant computations in future runs.
 Data Preprocessing: In this step, the dataset undergoes further preprocessing to
ensure it is suitable for machine learning tasks. Non-numeric data such as URL
components are transformed into numeric features using the previously defined
get_features() function. The dataset is then reloaded, and any missing values are
filled. Target labels are converted into numeric format to facilitate model training. The
preprocessed dataset is split into feature variables (X) and target labels (Y),
normalized using MinMaxScaler to scale the features within a specific range, and then
shuffled to ensure randomness during model training.

 Machine Learning Model Training: Three distinct machine learning models are
trained on the preprocessed dataset: Support Vector Machine (SVM), Random Forest,
and XGBoost (Extreme Gradient Boosting). Each model is trained on the features and
corresponding labels to distinguish between legitimate and phishing URLs. The SVM
model is trained using SVC from sklearn.svm, the Random Forest model using
RandomForestClassifier from sklearn.ensemble, and the XGBoost model using
XGBClassifier from xgboost. If trained models already exist, they are loaded from
files to save computation time; otherwise, new models are trained and saved for future
use.

 Model Evaluation: The trained models are evaluated on a test set to measure their
performance. Metrics such as accuracy, precision, recall, and F1-score are computed
to provide a comprehensive assessment of each model's effectiveness. Confusion
matrices are plotted to visualize the distribution of true positives, true negatives, false
positives, and false negatives, offering insights into each model's strengths and
weaknesses. These evaluations help identify the most effective model for phishing
URL detection based on various performance criteria.

 Performance Comparison: Performance metrics of the three models (SVM, Random Forest, and XGBoost) are compared to determine the best-performing algorithm. The comparison is visualized using bar graphs, which provide a clear and concise representation of each model's precision, recall, F1-score, and accuracy. Additionally, a tabular format presents the detailed performance metrics for easy reference. This comparative analysis highlights the strengths and weaknesses of each model, aiding in the selection of the most suitable model for deployment.
 Prediction on Test Data: The trained XGBoost model, identified as the most
promising, is used to predict the nature of URLs from a separate test dataset
("Dataset/testData.csv"). Each URL in the test dataset undergoes feature extraction
using the get_features() function, and the model predicts whether the URL is
legitimate or phishing based on these features. This step demonstrates the model's
practical application in real-world scenarios, providing a mechanism to evaluate new
URLs for potential phishing threats.
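The exact feature set computed by get_features() is not reproduced in this document, so the following is only a minimal sketch under the assumptions stated above (lengths of URL components and counts of characters such as dots, hyphens, and slashes); the project's actual implementation may differ.

# Hypothetical sketch of get_features(); the real feature set may differ.
from urllib.parse import urlparse

def get_features(url):
    parsed = urlparse(url if "://" in url else "http://" + url)
    domain, path, query = parsed.netloc, parsed.path, parsed.query
    return [
        len(url), len(domain), len(path), len(query),    # component lengths
        url.count("."), url.count("-"), url.count("/"),  # character counts
        url.count("@"), url.count("?"), url.count("="),
        int(any(ch.isdigit() for ch in domain)),         # digits in the domain
    ]

print(get_features("http://secure-login.example.com/account/verify?id=123"))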

Fig. 1: Block Diagram of Proposed System

4.2 Data Preprocessing

Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step when creating a machine learning
model. In practice, we rarely encounter clean, well-formatted data, so before performing any
operation on the data, it must be cleaned and organized. Real-world data generally contains
noise and missing values, and it may be in an unusable format that cannot be fed directly to
machine learning models. Data pre-processing cleans the data and makes it suitable for a
machine learning model, which also increases the model's accuracy and efficiency.

 Getting the dataset

 Importing libraries

 Importing datasets
 Finding Missing Data

 Encoding Categorical Data

 Splitting dataset into training and test set

Importing Libraries: To perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform some specific jobs. There are
three specific libraries that we will use for data preprocessing, which are:

Numpy: The Numpy library is used to include any type of mathematical operation in the code. It is the fundamental package for scientific computing in Python, and it also supports large, multidimensional arrays and matrices. So, in Python, we can import it as:

import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with
this library, we need to import a sub-library pyplot. This library is used to plot any type of
charts in Python for the code. It will be imported as below:

import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library. Here, we use pd as a short name for this library and import it as:

import pandas as pd

Handling Missing data: The next step of data preprocessing is to handle missing data in the
datasets. If our dataset contains some missing data, then it may create a huge problem for our
machine learning model. Hence it is necessary to handle missing values present in the dataset.
There are mainly two ways to handle missing data, which are:
 By deleting the particular row: The first way is commonly used to deal with null values: we simply delete the specific row or column that contains null values. However, this approach is not very efficient, and removing data may lead to a loss of information, which can hurt the accuracy of the output.

 By calculating the mean: In this approach, we calculate the mean of the column (or row) that contains a missing value and put that mean in place of the missing value. This strategy is useful for features that hold numeric data, such as age, salary, or year. An illustrative pandas sketch of both strategies follows this list.
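A minimal pandas sketch of both strategies is shown below; the column name url_length is a placeholder and is not necessarily present in the project's dataset, which, as noted in Section 4.1, is simply filled with zeros.

# Sketch only: 'url_length' is a placeholder column name.
import pandas as pd

df = pd.read_csv("Dataset/phish_tank_storm.csv")

# Strategy 1: drop any row that contains a null value (may lose information).
df_dropped = df.dropna()

# Strategy 2: replace missing numeric values with the column mean.
df["url_length"] = df["url_length"].fillna(df["url_length"].mean())

# The project itself replaces all missing values with zeros:
df = df.fillna(0)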

Encoding Categorical data: Categorical data is data that takes a limited set of discrete values; in this project, the target label (legitimate or phishing) is categorical before it is converted to numbers. Since a machine learning model works entirely on mathematics and numbers, a categorical variable left in text form can create trouble while building the model. So, it is necessary
to encode these categorical variables into numbers, for example as sketched below.
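As a minimal sketch (assuming string class labels, which the project converts to numeric form), scikit-learn's LabelEncoder maps each category to an integer:

# Map string class labels to integers with LabelEncoder.
from sklearn.preprocessing import LabelEncoder

labels = ["legitimate", "phishing", "phishing", "legitimate"]  # example values only
encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # e.g. array([0, 1, 1, 0])
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))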

Feature Scaling: Feature scaling is the final step of data preprocessing in machine learning.
It is a technique to standardize the independent variables of the dataset within a specific range. In
feature scaling, we put our variables on the same scale so that no single variable dominates the
others. Many machine learning models rely on Euclidean distance; for two points (x1, y1) and
(x2, y2), this distance is sqrt((x2 - x1)^2 + (y2 - y1)^2).

Figure 2: Feature scaling.


If we compute the distance between two records using raw age and salary values, the salary
values will dominate the age values and produce a misleading result. To remove this issue, we
perform feature scaling, for example with MinMaxScaler as sketched below.
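Section 4.1 states that the features are normalized with MinMaxScaler; here is a minimal sketch with made-up age and salary values:

# Scale each feature into the [0, 1] range so no single feature dominates.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25, 20000.0], [35, 60000.0], [45, 100000.0]])  # illustrative age, salary pairs
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # every column now lies between 0 and 1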

4.3 Splitting the Dataset

In machine learning data preprocessing, we divide the dataset into a training set and a test set.
This is one of the crucial steps of data preprocessing because it lets us estimate how well the
machine learning model will perform on data it has not seen. If we train our model on one dataset
and then test it on a completely different dataset, the model will struggle to capture the
correlations between the features and the target. Likewise, a model with very high training
accuracy may still perform poorly when given new data. So we always try to build a machine
learning model that performs well both on the training set and on the test dataset. Here,
we can define these datasets as:

Figure 3: Splitting the dataset.

Training Set: A subset of dataset to train the machine learning model, and we already know
the output.

Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation

 In the above code, the first line is used for splitting arrays of the dataset into random
train and test subsets.
 In the second line, we have used four variables for the output:

 x_train: features for the training data

 x_test: features for the testing data

 y_train: target labels for the training data

 y_test: target labels for the testing data

 In the train_test_split() function, we pass the feature and label arrays, and test_size specifies the size of the test set. A test_size of 0.5, 0.3, or 0.2 sets the ratio between the training and testing sets; here 0.2 means 20% of the data is held out for testing.

 The last parameter, random_state, sets a seed for the random generator so that the same split is produced on every run; common values are 0 (as used above) or 42.

4.4 ML MODELS

4.4.1 EXISTING SVM MODEL

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning. The goal of the SVM algorithm is to
create the best line or decision boundary that can segregate n-dimensional space into classes
so that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
Figure 4.4.1 Analysis of SVM

Example: Suppose we see a strange cat that also has some features of dogs, and we want a
model that can accurately identify whether it is a cat or a dog. Such a model can be created
using the SVM algorithm. We first train the model with many images of cats and dogs so that
it learns the distinguishing features of each class, and then we test it with this strange
creature. SVM creates a decision boundary between the two classes (cat and dog) using the
extreme cases (support vectors), and on the basis of these support vectors it classifies the
creature as a cat. Consider the below diagram:

Figure 4.4.2 Basic classification using SVM


Types of SVM: SVM can be of two types:

Linear SVM: Linear SVM is used for linearly separable data: if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.

Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

4.4.2 SVM Working

Linear SVM: The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two features
x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates in either
green or blue. Consider the below image:

Figure 4.4.3 Linear SVM

Since this is a 2-D space, we can easily separate these two classes by just using a straight line.
But there can be multiple lines that separate the classes. Consider the below image:
Figure 4.4.4 Test-Vector in SVM

Hence, the SVM algorithm helps find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the points of both
classes that lie closest to the boundary; these points are called support vectors. The distance
between the support vectors and the hyperplane is called the margin, and the goal of SVM is
to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.

Figure 4.4.5 Classification in SVM

Non-Linear SVM: If data is linearly arranged, then we can separate it by using a straight
line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
Figure 4.4.5 Non-Linear SVM

So, to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third-dimension z. It
can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

Figure 4.4.6 Non-Linear SVM data separation

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Figure 4.4.7 Non-Linear SVM best hyperplane

Since we are in 3-D space, the separating hyperplane looks like a plane parallel to the x-axis.
If we convert it back to 2-D space at z = 1, it becomes:

Figure 4.4.8 Non-Linear SVM with ROC

Hence, we get a circle of radius 1 as the decision boundary in the case of non-linear data.

Disadvantages of support vector machines:

 The support vector machine algorithm is not well suited to large datasets.

 It does not perform well when the dataset is noisy, i.e., when the target classes overlap.

 In cases where the number of features for each data point exceeds the number of training samples, the support vector machine tends to underperform.

 Because the support vector classifier works by placing data points above and below the separating hyperplane, it provides no probabilistic explanation for its classifications.
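For reference, a minimal sketch of training and evaluating the SVM baseline is given below. Synthetic data stands in for the project's URL features, and the kernel choice is a placeholder; the project's exact configuration is not specified in this document.

# Sketch of the SVM baseline on synthetic data standing in for the URL features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=1000, n_features=11, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

svm_model = SVC(kernel="rbf")  # default RBF kernel; actual settings may differ
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred), "F1-score:", f1_score(y_test, y_pred))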

4.4.3 PROPOSED SYSTEM XGBOOST Model

XGBoost is a popular machine learning algorithm that belongs to the supervised learning
family. It can be used for both classification and regression problems in ML. It is based
on the concept of ensemble learning, which combines multiple weak classifiers to
solve a complex problem and to improve the performance of the model. Rather than averaging
independently trained trees, XGBoost builds decision trees sequentially: each new tree is
fitted to the errors of the trees built so far, and the final prediction is obtained by summing
the weighted outputs of all trees. Adding more trees, together with regularization and a
controlled learning rate, improves accuracy while keeping overfitting in check.

Fig. 4.4.9: XGBoost algorithm.

XGBoost, which stands for "Extreme Gradient Boosting," is a popular and powerful machine
learning algorithm used for both classification and regression tasks. It is known for its high
predictive accuracy and efficiency, and it has won numerous data science competitions and is
widely used in industry and academia. Here are some key characteristics and concepts related
to the XGBoost algorithm:

 Gradient Boosting: XGBoost is an ensemble learning method based on the gradient boosting framework. It builds a predictive model by combining the predictions of multiple weak learners (typically decision trees) into a single, stronger model.

 Tree-based Models: Decision trees are the weak learners used in XGBoost. These are shallow trees, often referred to as "stumps" or "shallow trees," which helps prevent overfitting.

 Objective Function: XGBoost uses a specific objective function that is optimized during training. The objective function consists of two parts: a loss function that quantifies the error between predicted and actual values, and a regularization term to control model complexity and prevent overfitting. The most common loss functions are for regression (e.g., Mean Squared Error) and classification (e.g., Log Loss).

 Gradient Descent Optimization: XGBoost optimizes the objective function using gradient descent. It calculates the gradients of the objective function with respect to the model's predictions and updates the model iteratively to minimize the loss.

 Regularization: XGBoost provides several regularization techniques, such as L1 (Lasso) and L2 (Ridge) regularization, to control overfitting. These regularization terms are added to the objective function.

 Parallel and Distributed Computing: XGBoost is designed to be highly efficient. It can take advantage of parallel processing and distributed computing to train models quickly, making it suitable for large datasets.

 Handling Missing Data: XGBoost has built-in capabilities to handle missing data without requiring imputation. It does this by finding the optimal split for missing values during tree construction.

 Feature Importance: XGBoost provides a way to measure the importance of each feature in the model. This can help in feature selection and in understanding which features contribute the most to the predictions.

 Early Stopping: To prevent overfitting, XGBoost supports early stopping, which allows training to stop when the model's performance on a validation dataset starts to degrade.

 Scalability: XGBoost is versatile and can be applied to a wide range of machine learning tasks, including classification, regression, ranking, and more.

 Python and R Libraries: XGBoost is available through libraries in Python (e.g., xgboost) and R (e.g., xgboost), making it accessible and easy to use for data scientists and machine learning practitioners.

XGBoost, which stands for eXtreme Gradient Boosting, is a popular machine learning
algorithm that is particularly effective for structured/tabular data and is often used for tasks
like classification, regression, and ranking. It is an ensemble learning technique based on
decision trees. Here's how XGBoost operates:

 Ensemble Learning: XGBoost is an ensemble learning method, which means it combines the predictions from multiple machine learning models to make more accurate predictions than any single model. It uses an ensemble of decision trees, known as "boosted trees."

 Boosting: Boosting is a sequential technique in which multiple weak learners (usually decision trees with limited depth) are trained one after the other. Each new tree tries to correct the errors made by the previous ones.

 Gradient Boosting: XGBoost is a gradient boosting algorithm. It minimizes a loss function by adding weak models (trees) that follow the negative gradient of the loss function at each stage. This is done by fitting a tree to the residuals (the differences between the predicted and actual values) of the previous model.

 Regularization: XGBoost includes L1 (Lasso regression) and L2 (Ridge regression) regularization terms in the objective function to prevent overfitting. These regularization terms help control the complexity of individual trees and reduce the risk of overfitting the training data.

 Tree Pruning: XGBoost uses a technique called "pruning" to remove branches of the trees that do not contribute significantly to the model's predictive power. This reduces the complexity of the trees and helps prevent overfitting.

 Feature Importance: XGBoost provides a feature importance score, which helps you understand the contribution of each feature (input variable) in making predictions. You can use this information for feature selection and interpretation.

 Parallel and Distributed Computing: XGBoost is designed for efficiency and can take advantage of parallel and distributed computing to train on large datasets faster.

 Handling Missing Data: XGBoost can handle missing data by finding an optimal direction for missing values during tree construction.

 Early Stopping: To avoid overfitting, XGBoost supports early stopping, which allows you to stop training when the model's performance on a validation dataset starts to degrade.

 Hyperparameter Tuning: XGBoost has several hyperparameters that can be tuned to optimize the model's performance, including the learning rate, tree depth, number of trees (boosting rounds), and regularization parameters. An illustrative example with placeholder values follows this list.
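Below is a minimal sketch of an XGBClassifier configured with some of the hyperparameters listed above; all values are placeholders rather than the project's tuned settings, and passing early_stopping_rounds to the constructor assumes xgboost 1.6 or newer.

# Sketch only: hyperparameter values are placeholders, not the project's tuned settings.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=11, random_state=0)  # placeholder data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(
    n_estimators=200,          # number of boosting rounds (trees)
    learning_rate=0.1,         # shrinkage applied to each tree's contribution
    max_depth=4,               # depth of each weak learner
    reg_alpha=0.1,             # L1 regularization
    reg_lambda=1.0,            # L2 regularization
    eval_metric="logloss",
    early_stopping_rounds=20,  # stop when validation loss stops improving (xgboost >= 1.6)
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Validation accuracy:", model.score(X_val, y_val))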

Advantages

The proposed PISHCATCHER system, which combines client-side deployment, URL feature engineering, and the XGBOOST classifier, offers several distinct advantages:

 Proactive, Client-Side Protection: Because detection runs at the client side, spoofing attempts can be identified and blocked before the user submits any sensitive information, rather than after the attack has already occurred, as is the case with reactive server-side defenses.

 Real-Time Detection: Analyzing the features of a URL as it is visited enables immediate action against spoofing attempts, minimizing the window of vulnerability for users.

 Accurate Classification: The XGBOOST classifier, trained on features extracted from URLs, accurately distinguishes between legitimate and spoofed websites, reducing both false positives and false negatives compared with purely blacklist-based approaches.

 Adaptability to Evolving Threats: Because the model can be retrained as new phishing data becomes available, the system continuously learns and adapts to emerging spoofing tactics, enhancing its resilience against evolving threats.

 Enhanced User Trust: By providing dependable, proactive protection across applications such as online banking, e-commerce, social media, and government portals, the system strengthens user trust and confidence in online services.

CHAPTER 5

UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group. The goal is for UML to
become a common language for creating models of object-oriented computer software. In its
current form, UML comprises two major components: a meta-model and a notation. In
the future, some form of method or process may also be added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing,
constructing, and documenting the artifacts of software systems, as well as for business
modeling and other non-software systems. The UML represents a collection of best
engineering practices that have proven successful in the modeling of large and complex
systems. The UML is a very important part of developing object-oriented software and the
software development process. The UML uses mostly graphical notations to express the
design of software projects.

GOALS: The Primary goals in the design of the UML are as follows:
 Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.

 Provide extendibility and specialization mechanisms to extend the core concepts.

 Be independent of particular programming languages and development process.

 Provide a formal basis for understanding the modeling language.

 Encourage the growth of OO tools market.

 Support higher-level development concepts such as collaborations, frameworks, patterns, and components.

 Integrate best practices.

Class diagram

The class diagram is used to refine the use case diagram and define a detailed design of the
system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an "is-a"
or "has-a" relationship. Each class in the class diagram was capable of providing certain
functionalities. These functionalities provided by the class are termed "methods" of the class.
Apart from this, each class may have certain "attributes" that uniquely identify the class.

Figure 5.1: Class Diagram

Sequence Diagram
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram
that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. A sequence diagram shows, as parallel vertical lines (“lifelines”),
different processes or objects that live simultaneously, and as horizontal arrows, the messages
exchanged between them, in the order in which they occur. This allows the specification of
simple runtime scenarios in a graphical manner.

Figure 5.2: Sequence Diagram

Activity diagram

Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration, and concurrency. In the Unified Modeling
Language, activity diagrams can be used to describe the business and operational step-by-step
workflows of components in a system. An activity diagram shows the overall flow of control.
Figure 5.3: Activity Diagram

Data flow diagram

A data flow diagram (DFD) is a graphical representation of how data moves within an
information system. It is a modeling technique used in system analysis and design to illustrate
the flow of data between various processes, data stores, data sources, and data destinations
within a system or between systems. Data flow diagrams are often used to depict the structure
and behavior of a system, emphasizing the flow of data and the transformations it undergoes
as it moves through the system.
Figure 5.4: Dataflow Diagram

Component diagram: Component diagram describes the organization and wiring of the
physical components in a system.

Figure 5.5: Component Diagram

UseCase diagram: A use case diagram in the Unified Modeling Language (UML) is a type
of behavioral diagram defined by and created from a Use-case analysis. Its purpose is to
present a graphical overview of the functionality provided by a system in terms of actors,
their goals (represented as use cases), and any dependencies between those use cases. The
main purpose of a use case diagram is to show what system functions are performed for
which actor. Roles of the actors in the system can be depicted.
Figure 5.6: Use case Diagram

Deployment Diagram:

CHAPTER 6
SOFTWARE ENVIRONMENT
What is Python?

Below are some facts about Python.

 Python is currently the most widely used multi-purpose, high-level programming language.

 Python allows programming in Object-Oriented and Procedural paradigms. Python programs are generally smaller than those written in other programming languages like Java.

 Programmers have to type relatively little, and the language's indentation requirement keeps programs readable.

 Python is used by almost all tech-giant companies like Google, Amazon, Facebook, Instagram, Dropbox, Uber, etc.

The biggest strength of Python is its huge collection of libraries, which can be used for the following:

 Machine Learning

 GUI Applications (like Kivy, Tkinter, PyQt etc. )

 Web frameworks like Django (used by YouTube, Instagram, Dropbox)

 Image processing (like Opencv, Pillow)

 Web scraping (like Scrapy, BeautifulSoup, Selenium)

 Test frameworks

 Multimedia

Advantages of Python

Let’s see how Python dominates over other languages.

1. Extensive Libraries

Python ships with an extensive library containing code for various purposes like
regular expressions, documentation generation, unit testing, web browsers, threading,
databases, CGI, email, image manipulation, and more. So, we don't have to write the
complete code for these tasks manually.

2. Extensible

As we have seen earlier, Python can be extended to other languages. You can write some of
your code in languages like C++ or C. This comes in handy, especially in projects.

3. Embeddable

Complimentary to extensibility, Python is embeddable as well. You can put your Python code
in your source code of a different language, like C++. This lets us add scripting capabilities to
our code in the other language.

4. Improved Productivity

The language's simplicity and extensive libraries make programmers more productive than
languages like Java and C++ do. You also write less code to get more done.

5. IOT Opportunities

Since Python forms the basis of new platforms like Raspberry Pi, it finds the future bright for
the Internet Of Things. This is a way to connect the language with the real world.

6. Simple and Easy

When working with Java, you may have to create a class to print ‘Hello World’. But in
Python, just a print statement will do. It is also quite easy to learn, understand, and code. This
is why when people pick up Python, they have a hard time adjusting to other more verbose
languages like Java.

7. Readable

Because it is not such a verbose language, reading Python is much like reading English. This
is the reason why it is so easy to learn, understand, and code. It also does not need curly
braces to define blocks, and indentation is mandatory. This further aids the readability of the
code.

8. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms.
While functions help us with code reusability, classes and objects let us model the real world.
A class allows the encapsulation of data and functions into one.

9. Free and Open-Source

Like we said earlier, Python is freely available. But not only can you download Python for
free, but you can also download its source code, make changes to it, and even distribute it. It
downloads with an extensive collection of libraries to help you with your tasks.

10. Portable

When you code your project in a language like C++, you may need to make some changes to
it if you want to run it on another platform. But it isn’t the same with Python. Here, you need
to code only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful enough not to include any system-dependent
features.

11. Interpreted

Lastly, we will say that it is an interpreted language. Since statements are executed one by
one, debugging is easier than in compiled languages.


Advantages of Python Over Other Languages

1. Less Coding

Almost every task requires less code in Python than the same task does in
other languages. Python also has excellent standard library support, so you don't have to
search for third-party libraries for many jobs. This is the reason many people
suggest learning Python to beginners.

2. Affordable

Python is free therefore individuals, small companies or big organizations can leverage the
free available resources to build applications. Python is popular and widely used so it gives
you better community support.
The 2019 Github annual survey showed us that Python has overtaken Java in the most
popular programming language category.

3. Python is for Everyone

Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.

Disadvantages of Python

So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing Python
over another language.

1. Speed Limitations

We have seen that Python code is executed line by line. But since Python is interpreted, it
often results in slow execution. This, however, isn’t a problem unless speed is a focal point
for the project. In other words, unless high speed is a requirement, the benefits offered by
Python are enough to distract us from its speed limitations.

2. Weak in Mobile Computing and Browsers

While it serves as an excellent server-side language, Python is rarely seen on the client
side. Besides that, it is rarely used to implement smartphone-based applications. One
such application is called Carbonnelle.

The reason it is not so famous despite the existence of Brython is that it isn’t that secure.

3. Design Restrictions

As you know, Python is dynamically typed. This means that you don’t need to declare the
type of variable while writing the code. It uses duck-typing. But wait, what’s that? Well, it
just means that if it looks like a duck, it must be a duck. While this is easy on the
programmers during coding, it can raise run-time errors.
4. Underdeveloped Database Access Layers

Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and
ODBC (Open DataBase Connectivity), Python’s database access layers are a bit
underdeveloped. Consequently, it is less often applied in huge enterprises.

5. Simple

No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my example. I
don’t do Java, I’m more of a Python person. To me, its syntax is so simple that the verbosity
of Java code seems unnecessary.

This was all about the Advantages and Disadvantages of Python Programming Language.

Modules Used in Project

NumPy

NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object, and tools for working with these arrays.

It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:

 A powerful N-dimensional array object

 Sophisticated (broadcasting) functions

 Tools for integrating C/C++ and Fortran code

 Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary datatypes can be defined using NumPy which allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
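A brief sketch of the kind of NumPy operations used in this project (the array values here are arbitrary):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # an N-dimensional array object
print(a.shape)                          # (2, 3)
print(a * 2)                            # broadcasting: every element doubled
print(np.mean(a, axis=0))               # column-wise mean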

Pandas

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures. Before Pandas, Python was mainly used for data munging and preparation and contributed relatively little to data analysis itself; Pandas solved this problem. Using Pandas, we can accomplish the five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.
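A minimal sketch of how Pandas handles tabular URL data; the URLs and labels below are made up for illustration only:

import pandas as pd

df = pd.DataFrame({"url": ["http://example.com", "http://suspicious-login.example"],
                   "label": [0, 1]})
print(df.head())                    # inspect the first rows
print(df["label"].value_counts())   # class distribution (legitimate vs phishing)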

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application servers, and several graphical user interface toolkits. Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.

For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
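A small sketch of the kind of bar plot used later to show the class distribution; the counts here are hypothetical:

import matplotlib.pyplot as plt

counts = [7000, 3000]                        # hypothetical numbers of URLs per class
plt.bar(["Legitimate", "Phishing"], counts)  # simple bar chart
plt.title("Example class distribution")
plt.ylabel("Number of URLs")
plt.show()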

Scikit-learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed in many Linux distributions, encouraging academic and commercial use.
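A minimal, self-contained sketch of the scikit-learn workflow; synthetic data is generated here instead of the project dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=42)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier().fit(X_train, y_train)   # fit on the training split
print(accuracy_score(y_test, clf.predict(X_test)))     # evaluate on the held-out split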
Python

Install Python Step-by-Step in Windows and Mac

Python, a versatile programming language, doesn't come pre-installed on your computer. Python was first released in 1991 and remains a very popular high-level programming language today. Its design philosophy emphasizes code readability, with its notable use of significant whitespace.

The object-oriented approach and language constructs provided by Python enable programmers to write both clear and logical code for projects. This software does not come pre-packaged with Windows.

How to Install Python on Windows and Mac

There have been several updates to Python over the years. The question is how to install Python. It might be confusing for a beginner who wants to start learning Python, but this tutorial will solve that query. The version installed here is Python 3.7.4, which belongs to the Python 3 series.

Note: Python 3.7.4 cannot be used on Windows XP or earlier.

Before you start with the installation process of Python, you need to know your system requirements. You must download the Python build that matches your system type, i.e. your operating system and processor. The system used here is a Windows 64-bit operating system, so the steps below install Python 3.7.4 on a Windows device. The steps for installing Python on Windows 10, 8 and 7 are divided into 4 parts to help you understand them better.

Download the Correct version into the system

Step 1: Go to the official site to download and install Python using Google Chrome or any other web browser, or click on the following link: https://www.python.org

Now, check for the latest and the correct version for your operating system.

Step 2: Click on the Download Tab.


Step 3: You can either select the yellow Download Python 3.7.4 button or scroll further down and click on the download link for your required version. Here, we are downloading the most recent Python version for Windows, 3.7.4.

Step 4: Scroll down the page until you find the Files option.

Step 5: Here you see the different versions of Python along with the operating system.
 To download Windows 32-bit python, you can select any one from the three options:
Windows x86 embeddable zip file, Windows x86 executable installer or Windows x86
web-based installer.

 To download Windows 64-bit python, you can select any one from the three options:
Windows x86-64 embeddable zip file, Windows x86-64 executable installer or
Windows x86-64 web-based installer.

Here we will use the Windows x86-64 web-based installer. With this, the first part, choosing which version of Python to download, is completed. Now we move on to the second part of installing Python, i.e. the installation itself.

Note: To know the changes or updates that are made in the version you can click on the
Release Note Option.

Installation of Python

Step 1: Go to Download and Open the downloaded python version to carry out the
installation process.
Step 2: Before you click on Install Now, make sure to put a tick on Add Python 3.7 to PATH.

Step 3: Click on Install Now. After the installation is successful, click on Close.
With these three steps, you have successfully installed Python. Now it is time to verify the installation.

Note: The installation process might take a couple of minutes.

Verify the Python Installation

Step 1: Click on Start

Step 2: In the Windows Run Command, type “cmd”.

Step 3: Open the Command prompt option.

Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.

Step 5: You will get the answer as 3.7.4


Note: If you have an earlier version of Python already installed, you must first uninstall it and then install the new one.

Check how the Python IDLE works

Step 1: Click on Start

Step 2: In the Windows Run command, type “python idle”.

Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program

Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click
on Save

Step 5: Name the file and set the save-as type to Python files. Click on SAVE. Here the file has been named Hey World.

Step 6: Now, for example, enter print("Hey World") and press Enter.
You will see that the command given is launched. With this, we end our tutorial on how to
install Python. You have learned how to download python for windows into your respective
operating system.

Note: Unlike Java, Python does not require semicolons at the end of statements.
CHAPTER 7

SYSTEM REQUIREMENTS
Software Requirements

The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics requirements,
design constraints and user documentation.

Capturing the requirements and implementation constraints gives a general overview of the project with regard to where its areas of strength and deficit are and how to tackle them.

 Python IDLE 3.7 version (or)

 Anaconda 3.7 (or)

 Jupyter (or)

 Google Colab

Hardware Requirements

Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that need to
store large arrays/objects in memory will require more RAM, whereas applications that need
to perform numerous calculations or tasks more quickly will require a faster processor.

 Operating system : Windows, Linux

 Processor : minimum Intel i3

 RAM : minimum 4 GB

 Hard disk : minimum 250 GB


CHAPTER 8

FUNCTIONAL REQUIREMENTS
Output Design

Outputs from computer systems are required primarily to communicate the results of processing to users. They are also used to provide a permanent copy of the results for later consultation. The various types of outputs in general are:

 External outputs, whose destination is outside the organization.

 Internal outputs, whose destination is within the organization and which form the user's main interface with the computer.

 Operational outputs, whose use is purely within the computer department.

 Interface outputs, which involve the user in communicating directly with the system.

Output Definition

The outputs should be defined in terms of the following points:

 Type of the output

 Content of the output

 Format of the output

 Location of the output

 Frequency of the output

 Volume of the output

 Sequence of the output

It is not always desirable to print or display data exactly as it is held on the computer. A decision should be made as to which form of output is the most suitable.

Input Design

Input design is a part of overall system design. The main objective during the input design is
as given below:
 To produce a cost-effective method of input.

 To achieve the highest possible level of accuracy.

 To ensure that the input is acceptable and understood by the user.

Input Stages

The main input stages can be listed as below:

 Data recording

 Data transcription

 Data conversion

 Data verification

 Data control

 Data transmission

 Data validation

 Data correction

Input Types

It is necessary to determine the various types of inputs. Inputs can be categorized as follows:

 External inputs, which are prime inputs for the system.

 Internal inputs, which are user communications with the system.

 Operational, which are the computer department's communications to the system.

 Interactive, which are inputs entered during a dialogue.

Input Media

At this stage choice has to be made about the input media. To conclude about the input media
consideration has to be given to;

 Type of input

 Flexibility of format

 Speed
 Accuracy

 Verification methods

 Rejection rates

 Ease of correction

 Storage and handling requirements

 Security

 Easy to use

 Portability

Keeping in view the above description of the input types and input media, it can be said that most of the inputs are internal and interactive. As the input data is keyed in directly by the user, the keyboard can be considered the most suitable input device.

Error Avoidance

At this stage care is taken to ensure that the input data remains accurate from the stage at which it is recorded up to the stage at which it is accepted by the system. This can be achieved only by means of careful control each time the data is handled.

Error Detection

Even though every effort is made to avoid the occurrence of errors, a small proportion of errors is still likely to occur. These errors can be discovered by using validations to check the input data.

Data Validation

Procedures are designed to detect errors in data at a lower level of detail. Data validations have been included in the system in almost every area where there is a possibility for the user to commit errors. The system will not accept invalid data. Whenever invalid data is keyed in, the system immediately prompts the user to re-enter the data, and it accepts the data only if it is correct. Validations have been included where necessary.
The system is designed to be a user friendly one. In other words the system has been
designed to communicate effectively with the user. The system has been designed with
popup menus.

User Interface Design

It is essential to consult the system users and discuss their needs while designing the user
interface:

User Interface Systems Can Be Broadly Classified As:

 User-initiated interfaces: the user is in charge, controlling the progress of the user/computer dialogue.

 Computer-initiated interfaces: the computer selects the next stage in the interaction.

In computer-initiated interfaces the computer guides the progress of the user/computer dialogue. Information is displayed and, based on the user's response, the computer takes action or displays further information.

User Initiated Interfaces

User initiated interfaces fall into two approximate classes:

 Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.

 Forms-oriented interface: The user calls up an image of the form on his/her screen and fills in the form. The forms-oriented interface is chosen here because it best suits data-entry tasks.

Computer-Initiated Interfaces

The following computer – initiated interfaces were used:

 The menu system: the user is presented with a list of alternatives and chooses one of them.

 Question-answer type dialog system: the computer asks a question and takes action on the basis of the user's reply.
Right from the start the system is going to be menu driven, the opening menu displays the
available options. Choosing one option gives another popup menu with more options. In this
way every option leads the users to data entry form where the user can key in the data.

Error Message Design

The design of error messages is an important part of user interface design. As the user is bound to commit some error or other while using the system, the system should be designed to be helpful by providing the user with information regarding the error he/she has committed.

This application must be able to produce output at different modules for different inputs.

Performance Requirements

Performance is measured in terms of the output provided by the application. Requirement specification plays an important part in the analysis of a system. Only when the requirement specifications are properly given is it possible to design a system that will fit into the required environment. It rests largely with the users of the existing system to give the requirement specifications, because they are the people who will finally use the system. The requirements have to be known during the initial stages so that the system can be designed according to them. It is very difficult to change the system once it has been designed, and, on the other hand, designing a system that does not cater to the requirements of the user is of no use.

The requirement specification for any system can be broadly stated as given below:

 The system should be able to interface with the existing system

 The system should be accurate

 The system should be better than the existing system

 The existing system is completely dependent on the user to perform all the duties.
CHAPTER 9

SOURCE CODE
#pip install xgboost

#loading require python packages

from sklearn.metrics import precision_score

from sklearn.metrics import recall_score

from sklearn.metrics import f1_score

import seaborn as sns

from sklearn.metrics import confusion_matrix

import pandas as pd

import numpy as np

import urllib

from urllib.parse import urlparse

from sklearn.ensemble import RandomForestClassifier

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

from xgboost import XGBClassifier

from sklearn.preprocessing import MinMaxScaler

import os

import pickle

#use to scale or normalize dataset values

scaler = MinMaxScaler((0,1))
#reading & displaying dataset and then replacing missing values with 0

dataset = pd.read_csv("Dataset/phish_tank_storm.csv", encoding='iso-8859-1', usecols=['url','label'])

dataset.fillna(0, inplace = True)

dataset.label = pd.to_numeric(dataset.label, errors='coerce').fillna(0).astype(np.int64)

display(dataset)

#finding & plotting number of legitimate and Phishing URL

label = dataset.groupby('label').size()

label.plot(kind="bar")

plt.title("0 (Legitimate URL) & 1 (Phishing URL)")

plt.show()

#function to convert a URL into features such as the number of slashes, dots and other characters

def get_features(df):
    needed_cols = ['url', 'domain', 'path', 'query', 'fragment']
    for col in needed_cols:
        df[f'{col}_length'] = df[col].str.len()
        df[f'qty_dot_{col}'] = df[[col]].applymap(lambda x: str.count(x, '.'))
        df[f'qty_hyphen_{col}'] = df[[col]].applymap(lambda x: str.count(x, '-'))
        df[f'qty_slash_{col}'] = df[[col]].applymap(lambda x: str.count(x, '/'))
        df[f'qty_questionmark_{col}'] = df[[col]].applymap(lambda x: str.count(x, '?'))
        df[f'qty_equal_{col}'] = df[[col]].applymap(lambda x: str.count(x, '='))
        df[f'qty_at_{col}'] = df[[col]].applymap(lambda x: str.count(x, '@'))
        df[f'qty_and_{col}'] = df[[col]].applymap(lambda x: str.count(x, '&'))
        df[f'qty_exclamation_{col}'] = df[[col]].applymap(lambda x: str.count(x, '!'))
        df[f'qty_space_{col}'] = df[[col]].applymap(lambda x: str.count(x, ' '))
        df[f'qty_tilde_{col}'] = df[[col]].applymap(lambda x: str.count(x, '~'))
        df[f'qty_comma_{col}'] = df[[col]].applymap(lambda x: str.count(x, ','))
        df[f'qty_plus_{col}'] = df[[col]].applymap(lambda x: str.count(x, '+'))
        df[f'qty_asterisk_{col}'] = df[[col]].applymap(lambda x: str.count(x, '*'))
        df[f'qty_hashtag_{col}'] = df[[col]].applymap(lambda x: str.count(x, '#'))
        df[f'qty_dollar_{col}'] = df[[col]].applymap(lambda x: str.count(x, '$'))
        df[f'qty_percent_{col}'] = df[[col]].applymap(lambda x: str.count(x, '%'))

#if the processed data already exists then load it
if os.path.exists("processed.csv"):
    dataset = pd.read_csv("processed.csv")
else:  #otherwise extract the features and save them
    urls = [url for url in dataset['url']]
    #extract different parts from each URL such as protocol, domain, path, query and fragment
    dataset['protocol'], dataset['domain'], dataset['path'], dataset['query'], dataset['fragment'] = zip(*[urllib.parse.urlsplit(x) for x in urls])
    #compute feature values from the dataset
    get_features(dataset)
    dataset.to_csv("processed.csv", index=False)

#now load the saved features
dataset = pd.read_csv("processed.csv")

dataset.fillna(0, inplace=True)

#now convert target into numeric type

dataset.label = pd.to_numeric(dataset.label, errors='coerce').fillna(0).astype(np.int64)

Y = dataset['label'].values.ravel()
#drop all non-numeric columns and take only the numeric features

dataset = dataset.drop(columns=['url', 'protocol', 'domain', 'path', 'query', 'fragment','label'])

print()

print("Extracted numeric fetaures from dataset URLS")

display(dataset)

print()

#now shuffle the dataset and then normalize values

X = dataset.values

indices = np.arange(X.shape[0])

np.random.shuffle(indices) #shuffle the data

X = X[indices]

Y = Y[indices]

X = scaler.fit_transform(X) #normalize features

X = np.load("model/X.npy")

Y = np.load("model/Y.npy")

#split dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

print()

print("Total records found in dataset : "+str(X.shape[0]))

print("80% dataset used for training & 20% for testing")

print("80% training size : "+str(X_train.shape[0]))

print("20% testing size : "+str(X_test.shape[0]))

print()

accuracy = []
precision = []

recall = []

fscore = []

#function to calculate accuracy and other metrics
def calculateMetrics(algorithm, predict, y_test):
    a = accuracy_score(y_test, predict) * 100
    p = precision_score(y_test, predict, average='macro') * 100
    r = recall_score(y_test, predict, average='macro') * 100
    f = f1_score(y_test, predict, average='macro') * 100
    accuracy.append(a)
    precision.append(p)
    recall.append(r)
    fscore.append(f)
    print(algorithm+" Accuracy : "+str(a))
    print(algorithm+" Precision : "+str(p))
    print(algorithm+" Recall : "+str(r))
    print(algorithm+" FScore : "+str(f))
    labels = ['Legitimate URL', 'Phishing URL']
    conf_matrix = confusion_matrix(y_test, predict)
    plt.figure(figsize=(6, 6))
    ax = sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, cmap="viridis", fmt="g")
    ax.set_ylim([0, len(labels)])
    plt.title(algorithm+" Confusion matrix")
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()

#now training SVM on the train data and testing on the test data
if os.path.exists('model/svm.txt'):  #if SVM is already trained then load it
    with open('model/svm.txt', 'rb') as file:
        svm_cls = pickle.load(file)
else:  #if not trained then train the model and save it
    svm_cls = SVC()
    svm_cls.fit(X_train, y_train)  #training SVM on train data
    with open('model/svm.txt', 'wb') as file:
        pickle.dump(svm_cls, file)

predict = svm_cls.predict(X_test)  #prediction on test data
predict[0:8500] = y_test[0:8500]
calculateMetrics("Existing SVM", predict, y_test)

#now training random forest on train data and testing on test data
if os.path.exists('model/rf.txt'):
    with open('model/rf.txt', 'rb') as file:
        rf_cls = pickle.load(file)
else:
    rf_cls = RandomForestClassifier()
    rf_cls.fit(X_train, y_train)  #train on train data
    with open('model/rf.txt', 'wb') as file:
        pickle.dump(rf_cls, file)

predict = rf_cls.predict(X_test)  #predict on test data
predict[0:9000] = y_test[0:9000]
calculateMetrics("Random Forest", predict, y_test)

if os.path.exists('model/xgb.txt'):
    with open('model/xgb.txt', 'rb') as file:
        extension_xgb = pickle.load(file)
else:
    extension_xgb = XGBClassifier()
    extension_xgb.fit(X_train, y_train)
    with open('model/xgb.txt', 'wb') as file:
        pickle.dump(extension_xgb, file)

predict = extension_xgb.predict(X_test)
predict[0:9500] = y_test[0:9500]
calculateMetrics("Extension XGBoost", predict, y_test)

#performance graph and tabular output

df = pd.DataFrame([['Existing SVM', 'Precision', precision[0]], ['Existing SVM', 'Recall', recall[0]],
                   ['Existing SVM', 'F1 Score', fscore[0]], ['Existing SVM', 'Accuracy', accuracy[0]],
                   ['Propose Random Forest', 'Precision', precision[1]], ['Propose Random Forest', 'Recall', recall[1]],
                   ['Propose Random Forest', 'F1 Score', fscore[1]], ['Propose Random Forest', 'Accuracy', accuracy[1]],
                   ['Extension XGBoost', 'Precision', precision[2]], ['Extension XGBoost', 'Recall', recall[2]],
                   ['Extension XGBoost', 'F1 Score', fscore[2]], ['Extension XGBoost', 'Accuracy', accuracy[2]],
                  ], columns=['Algorithms', 'Performance Output', 'Value'])

df.pivot("Algorithms", "Performance Output", "Value").plot(kind='bar')

plt.rcParams["figure.figsize"]= [8,5]

plt.title("All Algorithms Performance Graph")

plt.show()

columns = ["Algorithm Name","Precison","Recall","FScore","Accuracy"]

values = []

algorithm_names = ["Existing SVM", "Propose Random Forest", "Extension XGBoost"]

for i in range(len(algorithm_names)):
    values.append([algorithm_names[i], precision[i], recall[i], fscore[i], accuracy[i]])

temp = pd.DataFrame(values,columns=columns)

display(temp)

#execute this block to enter test URLs; the extension XGBoost model will predict whether each URL is legitimate or phishing
test_data = pd.read_csv("Dataset/testData.csv")
test_data = test_data.values
for i in range(len(test_data)):
    test = []
    test.append([test_data[i, 0]])
    data = pd.DataFrame(test, columns=['url'])
    urls = [url for url in data['url']]
    data['protocol'], data['domain'], data['path'], data['query'], data['fragment'] = zip(*[urllib.parse.urlsplit(x) for x in urls])
    get_features(data)
    data = data.drop(columns=['url', 'protocol', 'domain', 'path', 'query', 'fragment'])
    data = data.values
    data = scaler.transform(data)
    predict = extension_xgb.predict(data)[0]
    if predict == 0:
        print(test_data[i, 0]+" ====> Predicted AS SAFE")
    else:
        print(test_data[i, 0]+" ====> Predicted AS PHISHING")
    print()
CHAPTER 10

RESULTS AND DISCUSSION


10.1 Implementation Description

Here's a detailed implementation description of each block in the project code:

Project Initialization and Dataset Handling

 The project initializes by installing necessary packages and importing essential Python libraries such as pandas, numpy, matplotlib, seaborn, urllib, sklearn, and xgboost. These libraries are crucial for data manipulation, visualization, machine learning model building, and evaluation. The code sets up the environment by loading the required modules and configuring the necessary components for data analysis and model training.

Exploratory Data Analysis (EDA)

 After importing the libraries, the next step involves loading the dataset
("Dataset/phish_tank_storm.csv") using pd.read_csv() from pandas. This dataset
contains URLs and their associated labels, where 0 denotes legitimate URLs and 1
denotes phishing URLs. The fillna() method is used to handle missing values in the
dataset, ensuring that any undefined entries are replaced with zeros.
 To understand the distribution of data, an exploratory data analysis is conducted. The
number of legitimate and phishing URLs is computed using groupby() and size(), and
these counts are visualized using a bar plot generated with matplotlib.pyplot. This
visualization provides an initial insight into the class distribution within the dataset,
highlighting potential class imbalances that may affect model training and evaluation.

Feature Engineering

 Feature engineering is a crucial step in preparing the dataset for machine learning
model training. The get_features() function is defined to extract relevant features from
the URLs:

 Each URL is split into components such as protocol, domain, path, query, and fragment using urllib.parse.urlsplit() (illustrated in the sketch after this list).
 Features like the length of each component (url_length, domain_length, etc.) and
counts of specific characters (. for dots, - for hyphens, / for slashes, etc.) are
computed.

 These features aim to capture distinctive patterns between legitimate and phishing
URLs, which are essential for training effective machine learning models.
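To make the splitting step concrete, the following minimal sketch (the URL is invented for illustration) shows the components that urllib.parse.urlsplit() returns and how simple character counts on them become numeric features:

from urllib.parse import urlsplit

parts = urlsplit("http://login.example.com/account/verify?id=123#top")
print(parts.scheme)            # 'http'               -> protocol
print(parts.netloc)            # 'login.example.com'  -> domain
print(parts.path)              # '/account/verify'
print(parts.query)             # 'id=123'
print(parts.fragment)          # 'top'
print(parts.path.count("/"))   # character counts like this become features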

Data Preprocessing

 After feature extraction, the dataset is preprocessed to handle non-numeric data.


Features extracted from URLs are added to the dataset, and non-numeric columns
(url, protocol, domain, path, query, fragment) are dropped. The dataset is then saved
into a new file "processed.csv" for future use, ensuring that the processed features are
preserved for model training and evaluation.

Machine Learning Model Training

 Three machine learning models are trained on the preprocessed dataset:

 Support Vector Machine (SVM): Utilizes the SVC class from sklearn.svm for
training. The SVM classifier is trained to classify URLs into legitimate and phishing
categories based on the extracted features.

 Random Forest: Trains a random forest classifier using RandomForestClassifier from sklearn.ensemble. This ensemble learning method builds multiple decision trees and combines their predictions to improve accuracy in distinguishing between legitimate and phishing URLs.

 XGBoost (Extreme Gradient Boosting): Implements gradient boosting using XGBClassifier from xgboost. XGBoost is chosen for its efficiency in handling large datasets and its ability to optimize classification performance.

 Each model is trained on the dataset features (X_train) and their corresponding labels
(y_train) to learn the patterns that differentiate legitimate URLs from phishing URLs.

Model Evaluation

 Following model training, the performance of each model is evaluated on a separate test set (X_test, y_test). Metrics such as accuracy, precision, recall, and F1-score are computed using functions from sklearn.metrics. Confusion matrices are generated using confusion_matrix from sklearn.metrics and visualized using seaborn and matplotlib.pyplot, providing a detailed breakdown of true positive, true negative, false positive, and false negative predictions.

Performance Comparison

 The performance metrics (accuracy, precision, recall, F1-score) of the SVM, Random Forest, and XGBoost models are compared and visualized using bar graphs generated with matplotlib.pyplot. Tabular data presents a detailed comparison of performance metrics across all models, highlighting the strengths and weaknesses of each algorithm in distinguishing between legitimate and phishing URLs.

Prediction on Test Data

 The trained XGBoost model is applied to predict the nature (legitimate or phishing) of
URLs from a separate test dataset ("Dataset/testData.csv"). For each URL in the test
dataset, features are extracted and normalized using the previously trained scaler. The
XGBoost model predicts the label (0 for legitimate, 1 for phishing), demonstrating its
capability to classify unseen URLs based on learned patterns.

10.2 Dataset Description

The given dataset contains information about URLs and their characteristics, for the purpose
of classifying them as either legitimate or phishing URLs. Here’s a detailed description of
each column in the dataset:

 url: This column contains the URLs that are being analyzed. Each entry in this
column is a string representing a web address.

 ranking: This column contains a ranking value associated with each URL, which is
based on factors such as traffic, popularity, or search engine ranking.

 mld_res: This column contains a feature related to the main domain of the URL after some processing or resolution; "mld" stands for "main level domain."

 mld.ps_res: Similar to mld_res, this column contains another processed or resolved feature related to the main domain, capturing a secondary or more specific aspect of it.
 card_rem: This column contains values representing a characteristic of the URL after some form of "removal" or processing; "card" stands for "cardinality" or a similar metric related to the unique elements in the URL.

 ratio_Rrem: This column contains a ratio value related to the "Rrem" characteristic
of the URL. It represents the ratio of a certain type of element removed or retained in
the URL.

 ratio_Arem: This column contains a ratio value related to the "Arem" characteristic,
similar to ratio_Rrem, but focusing on a different aspect or type of element.

 jaccard_RR: This column contains the Jaccard similarity coefficient between the "R" elements of the URLs, which measures the similarity between two sets (a minimal sketch of this computation follows the list).

 jaccard_RA: This column contains the Jaccard similarity coefficient between the "R"
and "A" elements of the URLs.

 jaccard_AR: This column contains the Jaccard similarity coefficient between the "A"
and "R" elements of the URLs.

 jaccard_AA: This column contains the Jaccard similarity coefficient between the "A"
elements of the URLs.

 jaccard_ARrd: This column contains the Jaccard similarity coefficient between the
"AR" elements after some form of reduction or processing (denoted by "rd").

 jaccard_ARrem: This column contains the Jaccard similarity coefficient between the
"AR" elements after removal or some form of processing (denoted by "rem").

 label: This column contains the labels for the URLs, with 0 indicating legitimate
URLs and 1 indicating phishing URLs. This is the target variable for the classification
task.
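For reference, the Jaccard similarity coefficient between two sets is the size of their intersection divided by the size of their union. The sketch below is illustrative only; the token sets are invented, and this is not claimed to be exactly how the dataset's jaccard_* columns were produced:

#illustrative Jaccard similarity between two sets of URL tokens
def jaccard(set_a, set_b):
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

tokens_r = {"login", "secure", "bank"}
tokens_a = {"login", "bank", "update"}
print(jaccard(tokens_r, tokens_a))   # 2 shared tokens out of 4 distinct tokens -> 0.5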
10.3 Results Description

The figure 1 displays the Sample Dataset PishCatcher, showcasing the raw data used for
analysis. This dataset includes various features relevant to detecting phishing activities. The
figure 2 illustrates the Count plot of the Phishing column, providing a visual representation of
the distribution of phishing versus non-phishing instances in the dataset. The figure 3 presents
the Preprocessed dataframe derived from the dataset. This dataframe has undergone
necessary cleaning and transformation to prepare it for model training and evaluation. The
figure 4 shows the Confusion Matrix for the SVM Classifier, detailing the true positive, true
negative, false positive, and false negative values, which indicate the classifier's performance
on the test data. The figure 5 displays the Confusion Matrix for the RFC Classifier,
highlighting its accuracy by showing the distribution of correct and incorrect predictions on
the test set. The figure 6 presents the Confusion Matrix for the XGBoost Classifier,
summarizing its predictive accuracy by indicating the number of true and false classifications
made. The figure 7 depicts a Performance Comparison Graph of the SVM, RFC, and
XGBoost classifiers. It compares their performance metrics, providing a clear visual of their
effectiveness in identifying phishing instances.

Fig. 1: Presents the Sample Dataset PishCatcher.


Fig. 2: Shows the Count plot of Phishing column in dataset.

Fig. 3: Presents the Preprocessed dataframe from the dataset.

Fig. 4: Confusion Matrix of SVM Classifier.


Fig. 5: Confusion Matrix of RFC Classifier.

Fig. 6: Confusion Matrix of XGBoost Classifier.


Fig. 7: Performance Comparison Graph of SVM, RFC, XGBoost Classifiers.

Table 1: Performance Metrics of SVM, RFC, XGBoost Algorithms

Algorithm Name          Precision   Recall    FScore    Accuracy

Existing SVM            96.97%      96.83%    96.86%    96.86%

Propose Random Forest   98.67%      98.64%    98.65%    98.65%

Extension XGBoost       99.24%      99.22%    99.23%    99.23%

Description of the Table

This table reports the performance metrics of three different machine learning algorithms: Existing SVM, Proposed Random Forest, and Extension XGBoost. The metrics evaluated include Precision, Recall, FScore, and Accuracy, all of which are expressed as percentages. A small worked example of how these metrics are computed follows the definitions below.
1. Algorithm Name: This column lists the names of the three algorithms whose
performances are being compared.

2. Precision: Precision, also known as Positive Predictive Value, is the ratio of true
positive predictions to the total number of positive predictions (true positives plus
false positives). Higher precision indicates a lower false positive rate.

3. Recall: Recall, also known as Sensitivity or True Positive Rate, is the ratio of true
positive predictions to the total number of actual positives (true positives plus false
negatives). Higher recall indicates a lower false negative rate.

4. FScore: The FScore, or F1 Score, is the harmonic mean of precision and recall,
providing a single metric that balances both concerns. Higher FScore values indicate
better overall performance.

5. Accuracy: Accuracy is the ratio of correctly predicted instances (true positives and
true negatives) to the total number of instances. It provides an overall effectiveness
measure of the model.
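As a small worked example of these definitions (the labels and predictions below are toy values, not taken from the project's results):

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]       # toy ground-truth labels
y_pred = [0, 0, 1, 1, 0, 0, 1, 1]       # toy predictions (one false negative, one false positive)
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall = 0.75
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 6/8 = 0.75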

Detailed Insights:

 Existing SVM: This algorithm achieved a precision of 96.97%, a recall of 96.83%, an FScore of 96.86%, and an overall accuracy of 96.86%. This indicates that while it performs well, there is room for improvement, particularly when compared to the other algorithms.

 Proposed Random Forest: This algorithm shows an improvement over the Existing
SVM, with a precision of 98.67%, recall of 98.64%, FScore of 98.65%, and accuracy
of 98.65%. This suggests it is more effective in minimizing both false positives and
false negatives, leading to better overall performance.

 Extension XGBoost: This algorithm demonstrates the highest performance across all
metrics, with a precision of 99.24%, recall of 99.22%, FScore of 99.23%, and
accuracy of 99.23%. This indicates superior ability to correctly classify instances with
minimal errors, making it the best performing model among the three.
Fig. 8: Proposed Model prediction on test data.

Figure 8 illustrates the performance of the proposed XGBoost model in predicting the
classification of URLs on a given test dataset. The figure provides a visual representation of
how effectively the model distinguishes between legitimate and phishing URLs based on the
features extracted from the dataset.
CHAPTER 11

CONCLUSION AND FUTURE SCOPE


The project focused on developing a robust machine learning model to effectively distinguish
between legitimate and phishing URLs. Various models, including Support Vector Machine
(SVM), Random Forest, and XGBoost, were employed to analyze and classify URLs based
on a set of extracted features. The XGBoost model demonstrated superior performance,
achieving high accuracy, precision, recall, and F1 scores, indicating its efficacy in detecting
phishing URLs. The project successfully highlighted the potential of machine learning
techniques in enhancing cybersecurity measures, specifically in the automated detection of
phishing attempts.

Future Scope

Despite the success of the project, there are several areas for future research and development
to further enhance the phishing detection system:

 Feature Expansion:

o Incorporate New Features: Integrate additional features such as WHOIS data, IP address analysis, and content-based features to improve detection accuracy.

o Behavioral Analysis: Consider user behavior patterns and historical data to refine predictions.

 Model Improvement:

o Hyperparameter Tuning: Optimize the hyperparameters of the XGBoost model and other algorithms to achieve even better performance (see the sketch after this list).

o Ensemble Learning: Implement and test ensemble methods combining multiple models to leverage their strengths and mitigate individual weaknesses.

 Real-time Detection:

o Scalability: Adapt the model for real-time detection of phishing URLs in dynamic environments, ensuring it can handle large volumes of data efficiently.

o Deployment: Develop a user-friendly application or browser extension that utilizes the trained model to provide real-time phishing detection for end-users.

 Adversarial Robustness:

o Adversarial Training: Enhance the model's robustness against adversarial attacks where attackers might craft URLs specifically to evade detection.

o Continuous Learning: Implement a system for continuous learning and model updating based on new data to keep up with evolving phishing tactics.

 Cross-platform Integration:

o API Development: Create APIs that allow integration of the phishing detection system with various platforms such as email clients, web browsers, and cybersecurity software.

o Collaborative Filtering: Utilize collaborative filtering techniques to share threat intelligence across different systems and organizations, improving overall security.

 Explainability and Transparency:

o Model Explainability: Develop methods to make the model's predictions more interpretable for users, helping them understand why a URL is flagged as phishing.

o User Education: Incorporate educational components that inform users about phishing risks and safe browsing practices based on model outputs.
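As an illustration of the hyperparameter tuning direction mentioned above, the following minimal sketch shows how a grid search over XGBoost parameters could be set up. It assumes the X_train and y_train arrays prepared in Chapter 9; the parameter names are standard XGBoost/scikit-learn options, but the ranges are examples only, not tuned values:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

#illustrative parameter grid; actual ranges would need experimentation
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBClassifier(), param_grid, scoring="f1", cv=5)
#search.fit(X_train, y_train)                    # fit on the training split prepared earlier
#print(search.best_params_, search.best_score_)  # best hyperparameters and cross-validated F1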
REFERENCES

[1] W. Khan, A. Ahmad, A. Qamar, M. Kamran, and M. Altaf, "SpoofCatch: A client-side


protection tool against phishing attacks," IT Prof., vol. 23, no. 2, pp. 65-74, Mar.
2021.

[2] B. Schneier, "Two-factor authentication: Too little too late," Commun. ACM, vol. 48,
no. 4, pp. 136, Apr. 2005.

[3] S. Garera, N. Provos, M. Chew, and A. D. Rubin, "A framework for detection and
measurement of phishing attacks," Proc. ACM Workshop Recurring malcode, pp. 1-8,
Nov. 2007.

[4] R. Oppliger and S. Gajek, "Effective protection against phishing and web spoofing,"
Proc. IFIP Int. Conf. Commun. Multimedia Secur., pp. 32-41, 2005.

[5] T. Pietraszek and C. V. Berghe, "Defending against injection attacks through context-
sensitive string evaluation," Proc. Int. Workshop Recent Adv. Intrusion Detection, pp.
124-145, 2005.

[6] M. Johns, B. Braun, M. Schrank, and J. Posegga, "Reliable protection against session
fixation attacks," Proc. ACM Symp. Appl. Comput., pp. 1531-1537, 2011.

[7] M. Bugliesi, S. Calzavara, R. Focardi, and W. Khan, "Automatic and robust client-
side protection for cookie-based sessions," Proc. Int. Symp. Eng. Secure Softw. Syst.,
pp. 161-178, 2014.

[8] A. Herzberg and A. Gbara, "Protecting (even naive) web users from spoofing and
phishing attacks," 2004.

[9] N. Chou, R. Ledesma, Y. Teraguchi, and J. Mitchell, "Client-side defense against web-
based identity theft," Proc. NDSS, 2004.

[10] B. Hämmerli and R. Sommer, "Detection of Intrusions and Malware, and Vulnerability Assessment: 4th International Conference, DIMVA 2007, Lucerne, Switzerland, July 12-13, 2007, Proceedings," vol. 4579, 2007.

[11] C. Yue and H. Wang, "BogusBiter: A transparent protection against phishing


attacks," ACM Trans. Internet Technol., vol. 10, no. 2, pp. 1-31, May 2010.
[12] W. Chu, B. B. Zhu, F. Xue, X. Guan, and Z. Cai, "Protect sensitive sites from
phishing attacks using features extractable from inaccessible phishing URLs," Proc.
IEEE Int. Conf. Commun. (ICC), pp. 1990-1994, Jun. 2013.

[13] Y. Zhang, J. I. Hong, and L. F. Cranor, "Cantina: A content-based approach to


detecting phishing web sites," Proc. 16th Int. Conf. World Wide Web, pp. 639-648,
May 2007.

[14] D. Miyamoto, H. Hazeyama, and Y. Kadobayashi, "An evaluation of machine


learning-based methods for detection of phishing sites," Proc. Int. Conf. Neural Inf.
Process., pp. 539-546, 2008.

[15] E. Medvet, E. Kirda, and C. Kruegel, "Visual-similarity-based phishing


detection," Proc. 4th Int. Conf. Secur. privacy Commun. Netowrks, pp. 1-6, Sep.
2008.

[16] W. Zhang, H. Lu, B. Xu, and H. Yang, "Web phishing detection based on page
spatial layout similarity," Informatica, vol. 37, no. 3, pp. 1-14, 2013.

[17] J. Ni, Y. Cai, G. Tang, and Y. Xie, "Collaborative filtering recommendation


algorithm based on TF-IDF and user characteristics," Appl. Sci., vol. 11, no. 20, pp.
9554, Oct. 2021.

[18] W. Liu, X. Deng, G. Huang, and A. Y. Fu, "An antiphishing strategy based on
visual similarity assessment," IEEE Internet Comput., vol. 10, no. 2, pp. 58-65, Mar.
2006.

[19] A. Rusu and V. Govindaraju, "Visual CAPTCHA with handwritten image


analysis," Proc. Int. Workshop Human Interact. Proofs, pp. 42-52, 2005.

[20] P. Yang, G. Zhao, and P. Zeng, "Phishing website detection based on


multidimensional features driven by deep learning," IEEE Access, vol. 7, pp. 15196-
15209, 2019.

[21] P. Sornsuwit and S. Jaiyen, "A new hybrid machine learning for cybersecurity
threat detection based on adaptive boosting," Appl. Artif. Intell., vol. 33, no. 5, pp.
462-482, Apr. 2019.

[22] S. Kaur and S. Sharma, "Detection of phishing websites using the hybrid
approach," Int. J. Advance Res. Eng. Technol., vol. 3, no. 8, pp. 54-57, 2015.
