Major Project Final Report

The document discusses the use of machine learning techniques to detect phishing websites, which are designed to deceive users into revealing sensitive information. It highlights the increasing prevalence of phishing attacks and the need for effective detection methods, emphasizing the advantages of machine learning in terms of accuracy, scalability, and cost-effectiveness. The project aims to develop a system that can automatically identify phishing sites, thereby providing users with enhanced protection against identity theft and financial fraud.

Uploaded by

ajithsonny00

by NAAC with A++ Grade CHENNAI 600 062,

TAMILNADU, INDIA
PHISHING WEBSITES DETECTION USING MACHINE
LEARNING

ABSTRACT

Phishing is a common attack that tricks credulous people into disclosing their
personal information through counterfeit websites. The objective of phishing website
URLs is to steal personal information such as usernames, passwords, and online
banking transactions. Phishers use websites that are visually and semantically similar
to the real ones. As technology continues to grow, phishing techniques have
progressed rapidly, and anti-phishing mechanisms are needed to detect them. Machine
learning is a powerful tool for combating phishing attacks. Phishing is among the
most dangerous criminal activities in cyberspace. Since most users go online to
access services provided by governments and financial institutions, phishing attacks
have increased significantly over the past few years. Phishing has become a
profitable business for attackers. Phishers use various channels to reach vulnerable
users, such as messaging, VoIP, spoofed links, and counterfeit websites. It is very
easy to create a counterfeit website that looks like a genuine one in layout and
content; even the content can be identical to that of the legitimate website. These
websites are created to obtain private data from users, such as account numbers,
login IDs, and debit and credit card passwords. Moreover, attackers pose security
questions under the guise of a high-level security measure, and users who answer
them are easily trapped. Communities around the world continue to research ways to
prevent phishing attacks. Phishing attacks can be prevented by detecting phishing
websites and by making users aware of how to identify them. Machine learning
algorithms are among the most powerful techniques for detecting phishing websites.
This project discusses various methods of detecting phishing websites.

Keywords: AdaBoost, Random Forest, XGBoost, Performance Analysis, SVM,
Gradient Boosting.

LIST OF ACRONYMS AND
ABBREVIATIONS

ADABOOST Adaptive Boosting

SVM Support Vector Machine

URL Uniform Resource Locator

UML Unified Modeling Language

XGBOOST Extreme Gradient Boosting

INTRODUCTION

Introduction

In recent years, advancements in Internet and cloud technologies have led to a
significant increase in electronic trading. This growth has led to unauthorized
access to users' sensitive information and damage to enterprise resources. Phishing
is one of the most familiar attacks: it tricks users into accessing malicious content
in order to obtain their information. In terms of website interface and uniform
resource locator (URL), most phishing web pages look identical to the actual web
pages. Phishing commonly targets credulous people, making them disclose their
personal information through counterfeit websites. Phishing website URLs aim to
steal personal information such as usernames, passwords, and online banking
transactions [2]. Various strategies for detecting phishing websites, such as
blacklisting and heuristics, have been suggested.
Existing research shows that the performance of phishing detection systems is
limited, and there is a demand for intelligent techniques to protect users from
cyber-attacks. A machine-learning-based URL detection technique that uses a
recurrent method has been employed to detect phishing URLs. However, because
security technologies remain inefficient, the number of victims is increasing
exponentially. The anonymous and uncontrollable framework of the Internet makes it
more vulnerable to phishing attacks.

Aim of the project

The aim of this project is to develop a machine learning system that automatically
identifies and flags potentially dangerous phishing websites. The ultimate goal is to
give users an extra layer of protection against scams and phishing attacks in order
to minimize the risks of identity theft and financial fraud. By analyzing website
content, behavior, and other relevant factors, the machine learning algorithm can
learn to distinguish between legitimate and fraudulent websites, allowing it to make
more accurate and effective predictions over time. This can help users make
better-informed decisions about which websites to trust and which to avoid.

Project Domain

The project domain is "machine learning", which encompasses a variety of
applications and areas of study that involve using algorithms and statistical models to
enable machines to improve their performance on specific tasks or learn from data.
Some common areas within the project domain of machine learning include natural
language processing, computer vision, robotics, speech recognition, fraud detection,
recommendation systems, and predictive analytics, among others. These applica-
tions typically involve training machines to recognize patterns, make predictions,
and make decisions based on data, with the aim of improving their accuracy and ef-
ficiency over time. Ultimately, the project domain of machine learning is concerned
with enabling machines to learn from data and improve their performance, leading
to more intelligent and capable systems that can improve human productivity and
quality of life.

Scope of the Project

The scope of phishing website detection using machine learning is significant and
offers a range of potential benefits. Some of the key areas where machine learning
can be applied to this problem include:

Early Detection

By training machine learning algorithms on large datasets of known phishing and
legitimate websites, it is possible to identify the patterns and characteristics
common to phishing sites and use them to classify new websites as either phishing or
legitimate. This approach detects phishing websites faster and more accurately than
manual review, and it can also detect new and emerging phishing techniques that
manual review may miss.

Real-Time Detection

With machine learning, it is possible to detect phishing sites in real-time, which can
help prevent potential victims from falling prey to these scams. One approach to real-
time phishing website detection using machine learning is to use supervised learning
algorithms, such as decision trees, random forests, and support vector machines, that
are trained on a large dataset of known phishing and legitimate websites. These
algorithms can then be used to classify new websites as either phishing or legitimate
based on their features.

Accuracy

Machine learning algorithms can be trained to recognize subtle differences between
legitimate and fraudulent sites with high accuracy, which allows them to flag
potentially malicious sites that may otherwise go undetected. The accuracy of
phishing website detection varies depending on the dataset, the features used for
training, and the specific algorithm used for classification. For the dataset used in
this project, the highest accuracy, 83.786 percent, was obtained with the random
forest classifier. Several studies have also reported high levels of accuracy in
detecting phishing websites using machine learning.

Scalability

As the number of new phishing sites continues to grow, machine learning can help
scale the detection and screening process, allowing organizations to stay ahead of
the curve. By leveraging large datasets of known phishing and legitimate websites,
machine learning algorithms can identify patterns and features that are common to
phishing sites, such as the use of misleading URLs, the presence of suspicious links,
and the incorporation of fake login pages. This allows for the automation of the
detection process and the efficient screening of large volumes of web traffic.

Reduced Cost

Machine learning can be a cost-effective solution for detecting phishing sites, as
it can automate the process and reduce the need for manual review. Overall, the
scope of phishing website detection using machine learning is vast, and the potential
benefits for organizations are significant.

LITERATURE REVIEW

Anjali et al. [1] provided an overview of different machine learning techniques
used for detecting phishing attacks, including decision trees, neural networks, and
support vector machines. The authors also compared the performance of different
machine learning algorithms and datasets used in phishing detection studies.

Al-Rfou et al. [2] focused on a new approach to detecting phishing websites
using ensemble machine learning algorithms, which combines multiple machine
learning models to improve the accuracy of the phishing detection system. The
authors evaluated the performance of different ensemble algorithms, including
bagging, boosting, and stacking.

C. Santhosh Kumar et al. [3] proposed a new approach to detecting phishing
websites using machine learning, which uses the random forest algorithm and a
hybrid feature selection technique based on principal component analysis (PCA) and
correlation-based feature selection (CFS).

J. Shad et al. [4] observed that the Web harms users through the theft of
confidential information such as account IDs, user names, and passwords. Phishing is
a social engineering attack that now also targets mobile devices and can result in
financial losses. The paper describes many detection techniques based on URL and
hyperlink features that can be used to differentiate between phishing and legitimate
websites. There are six main approaches: heuristic, blacklist, fuzzy rule, machine
learning, image processing, and CANTINA-based approaches. The paper gives a good
overview of the phishing problem, presents a machine learning solution, and outlines
future studies of phishing threats using the machine learning approach.

Kumar A. et al. [5], in a paper published by IEEE in the proceedings of the
International Conference on Internet of Things Smart Innovation and Usages
(pp. 1-6), proposed a machine learning-based approach for detecting phishing
websites by analyzing features such as the URL, domain, and content. The approach
was evaluated on a dataset of phishing and legitimate websites and achieved high
accuracy rates.

K. Shima et al. [6], in a paper presented at the 21st Conference on Innovation in
Clouds, Internet and Networks and Workshops (ICIN), noted that websites are now
largely responsible for the rapid growth of criminal activity on the Internet, so
preventive steps are needed to stop it. The authors proposed a model that classifies
a given URL into one of three classes (benign, spam, or malware) without using any
website content.

M. Karabatak et al. [7] noted that numerous anti-phishing systems are being
developed to identify phishing content in online communication systems. Despite the
availability of many anti-phishing systems, phishing continues unabated because of
inadequate detection of zero-day attacks, unnecessary computational overhead, and
high false rates. Although machine learning approaches have achieved promising
accuracy rates, the choice and performance of the feature vector limit their
effective detection. The work proposes an enhanced machine learning-based predictive
model to improve the efficiency of anti-phishing schemes.

Naik et al. [8] proposed a new approach to detecting phishing websites using
convolutional neural networks (CNNs) and a feature extraction technique based on
the distribution of ASCII characters in the URL. The paper also surveys the existing
literature on phishing website detection using machine learning, covering different
approaches and techniques as well as challenges and future directions.

T. Peng et al. [9] observed that phishing attacks are one of the most common
and least defended security threats today. They presented an approach that uses
natural language processing techniques to analyze text and detect inappropriate
statements that are indicative of phishing attacks.

Y. Sonmez et al. [10] noted that phishing commonly targets credulous people,
making them disclose their personal information through counterfeit websites.
Phishing website URLs aim to steal personal information such as usernames,
passwords, and online banking transactions. Phishers use websites that are visually
and semantically similar to the real ones. As technology continues to grow, phishing
techniques have progressed rapidly, and anti-phishing mechanisms are needed to
detect them. Machine learning is a powerful tool for combating phishing attacks. The
paper surveys the features used for detection and the detection techniques that use
machine learning.

PROJECT DESCRIPTION

Existing System

The existing system relies on manual human intervention, which is error-prone and
does not scale. Legacy and conventional data mining algorithms cannot handle huge
volumes of data and are slower and less accurate. The existing system is slow to
process its modules, takes longer to compute the accuracy of the classifiers, and
ultimately gives less accurate results than the proposed system.

Proposed System

Machine learning is a cutting-edge approach used in many diverse applications. It
can handle very large volumes of data and benefits from refined and revised
algorithms and from the heavy processing power now available in GPUs. The proposed
system processes data faster than the existing system, takes less time to provide
results, and gives better accuracy.

Feasibility Study

The feasibility study examines whether machine learning can effectively detect
phishing websites. The accuracy, precision, and recall of the models will determine
whether they can be used in real-world applications. The study will also reveal
the most effective features for identifying phishing websites, which can be used to
improve the accuracy of the models.
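
The three measures named above can be computed directly from the counts of correct and incorrect predictions. The following is a minimal sketch, not the project's own evaluation code; it assumes that label 1 means phishing and label 0 means legitimate, and the names ConfusionCounts and count_outcomes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # phishing sites correctly flagged as phishing
    fp: int  # legitimate sites wrongly flagged as phishing
    tn: int  # legitimate sites correctly passed
    fn: int  # phishing sites that were missed


def count_outcomes(y_true, y_pred, positive=1):
    """Count the four outcomes; `positive` is the assumed phishing label."""
    counts = ConfusionCounts(tp=0, fp=0, tn=0, fn=0)
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                counts.tp += 1
            else:
                counts.fp += 1
        else:
            if truth == positive:
                counts.fn += 1
            else:
                counts.tn += 1
    return counts


def accuracy(c):
    """Share of all websites that were classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c):
    """Share of flagged websites that really were phishing."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c):
    """Share of real phishing websites that were flagged."""
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0
```

For a phishing detector, low recall means phishing sites slip through, while low precision means legitimate sites are blocked, so both are worth reporting alongside accuracy.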

Economic Feasibility

This study is carried out to check the economic impact that the system will have on
the organization. The amount of funds that the company can pour into the research
and development of the system is limited. The expenditures must be justified. Thus,
the developed system is well within the budget and this was achieved because most
of the technologies used are freely available. Only the customized products had to
be purchased.

Technical Feasibility

This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. A system must not place a high demand on the available
technical resources, because this would in turn place high demands on the client.
The developed system has modest requirements, as only minimal or no changes are
required to implement it.

Social Feasibility

This aspect of the study checks the level of acceptance of the system by the user.
This includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a
necessity. The level of acceptance depends solely on the methods employed to educate
users about the system and to make them familiar with it. Their confidence must be
raised so that they are also able to offer constructive criticism, which is welcomed,
as they are the final users of the system.

System Specification

Functional and non-functional requirements

Requirement analysis is a very critical process that enables the success of a system
or software project to be assessed. Requirements are generally split into two types:
Functional and non-functional requirements.

Functional Requirements

These are the requirements that the end user specifically demands as basic facilities
that the system should offer. All these functionalities need to be necessarily incor-
porated into the system as a part of the contract. These are represented or stated in
the form of input to be given to the system, the operation performed and the output
expected. They are basically the requirements stated by the user which one can see
directly in the final product, unlike the non-functional requirements.

Examples of functional requirements

• Authentication of the user whenever he/she logs into the system


• System shutdown in case of a cyber-attack
• A verification email is sent to the user whenever he/she registers for the first time
on some software system.

Non-functional requirements

These are basically the quality constraints that the system must satisfy according to
the project contract. The priority or extent to which these factors are implemented
varies from one project to other. They are also called non-behavioral requirements.
They basically deal with issues like:
• Portability
• Security
• Maintainability
• Reliability
• Scalability
• Performance
• Reusability

Examples of non-functional requirements

• Emails should be sent with a latency of no greater than 12 hours from such an
activity.
• The processing of each request should be done within 10 seconds
• The site should load within 3 seconds when there are 10,000 simultaneous users

Hardware Specification

• Operating system: Windows 7 or above
• RAM: 8 GB or above
• Hard disk or SSD: more than 500 GB
• Processor: Intel 3rd generation or higher, or Ryzen, with 8 GB RAM

Software Specification

• Software: Python 3.6 or higher
• IDE: PyCharm
• Framework: Flask

Standards and Policies

The details of the User and Admin are private and will not be released unless the
user allows them to be shared for a specific reason. For example, the details of the
user will not be shared with the Admin until the user chooses a course to enroll in
and needs to fill in certain details required for the enrollment process. Similarly,
the user will not be allowed to get the data of the Admin until the Admin allows the
details to be shared with the user.

Details provided during the registration on a role basis will allow you to provide
or get service through this web application. A user will not be allowed to use the
feature of the admin part until he is admitted as an admin.

METHODOLOGY

General Architecture

Figure 4.1: Architecture Diagram for Phishing Website Prediction

As Figure 4.1 shows, the architecture begins with a data collection component,
followed by a preprocessing component. The preprocessed data is fed into a feature
selection component, followed by a model selection component and then a model
training component. The trained model is then evaluated by a model evaluation
component. A model deployment component deploys the model in a real-time
environment. Finally, a monitoring and updates component ensures that the model
remains effective over time.

4.1.1 Data Flow Diagram

Figure 4.2: Data Flow Diagram

Inputs:
The inputs for the system would be the URLs or web addresses of the websites that
need to be checked for phishing. This could be provided by a human user or gener-
ated automatically by a system that collects URLs from various sources.

Preprocessing:
The URLs would then be preprocessed to extract the relevant features that could be
used in the machine learning model. Features could include characteristics such as
the length of the URL, the number of subdomains, the presence of suspicious key-
words, etc.
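
The three example features mentioned above can be derived from the URL string alone. The following is a minimal sketch under stated assumptions rather than the project's Feature Extractor: the function name extract_url_features and the keyword list are hypothetical, and the subdomain count naively treats the last two host labels as the registered domain, which is wrong for suffixes such as co.uk.

```python
from urllib.parse import urlparse

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def extract_url_features(url):
    """Return a small dictionary of numeric features for one URL."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    labels = [part for part in host.split(".") if part]
    # Naive assumption: the last two labels form the registered domain.
    num_subdomains = max(len(labels) - 2, 0)
    lowered = url.lower()
    has_keyword = any(word in lowered for word in SUSPICIOUS_KEYWORDS)
    return {
        "url_length": len(url),
        "num_subdomains": num_subdomains,
        "has_suspicious_keyword": int(has_keyword),
    }
```

Each dictionary can then become one row of the feature table that is passed to the machine learning model.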

Machine Learning Model: The preprocessed data would be fed into a machine
learning model that has been trained to detect phishing websites. The model could
use techniques such as supervised learning, unsupervised learning, or a combination
of both, to classify the websites as legitimate or phishing.

Outputs: The output of the system would be a prediction of whether each website
is a phishing site or not. This could be presented to the user in various formats, such
as a list of URLs with phishing scores or a visualization that highlights suspicious
websites.

Feedback: The system could also include a feedback loop that allows users to
report false positives or false negatives, which can be used to improve the accuracy
of the model over time.

Database: The data flow diagram would also include a database that stores the
URLs and their corresponding classifications. This could be used to train the ma-
chine learning model or to monitor the performance of the system over time.
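
A store of this kind can be sketched with a single table. The example below uses Python's built-in sqlite3 module purely so that it runs on its own; it is an assumption for illustration and not the database the project uses. The table name url_classification and the function names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; the default is an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS url_classification (
            url   TEXT PRIMARY KEY,
            label TEXT NOT NULL CHECK (label IN ('phishing', 'legitimate')),
            score REAL
        )
        """
    )
    return conn


def save_classification(conn, url, label, score=None):
    """Insert a URL with its classification, replacing any earlier entry."""
    conn.execute(
        "INSERT OR REPLACE INTO url_classification (url, label, score) "
        "VALUES (?, ?, ?)",
        (url, label, score),
    )
    conn.commit()


def load_labelled(conn):
    """Return (url, label) pairs, for example to retrain the model."""
    rows = conn.execute(
        "SELECT url, label FROM url_classification ORDER BY url"
    )
    return rows.fetchall()
```

Because the URL is the primary key, saving a corrected label for a URL that is already stored overwrites the old entry, which is one simple way to record user feedback.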

Overall, the data flow diagram for phishing website detection using machine learn-
ing would be designed to efficiently process large volumes of data and provide accu-
rate predictions of phishing websites. Through a combination of machine learning
algorithms and user feedback loops, the system could continue to improve and adapt
to new threats over time.

Use Case Diagram

Figure 4.3: Use Case Diagram

Figure 4.3 depicts the different actors and their interactions with the machine
learning system, as well as the different use cases and their relationships. For exam-
ple, the user may interact with the system through a web browser, while the phishing
website may attempt to deceive the user by masquerading as a legitimate website.
The machine learning system would sit in between these two actors, analyzing the
web traffic and identifying potential threats. The system would then take appropriate
actions such as alerting the user, blocking access to the phishing website, or updat-
ing the machine learning model. The reports generated by the system would provide
valuable insights into the effectiveness of the machine learning system in detecting
and preventing phishing attacks.

Class Diagram

Figure 4.4: Class Diagram

Figure 4.4 depicts the different classes and their relationships within the phishing
website detection system. The main class serves as the interface between the user
and the phishing website, relying on sub-systems such as the Feature Extractor to detect
and prevent phishing attacks. The different classes and their relationships provide a
clear understanding of how the system functions and how different components work
together to achieve the overall goal of detecting and preventing phishing attacks.

Sequence Diagram

Figure 4.5: Sequence Diagram

Figure 4.5 depicts the different components of the system and their interactions,
showing the flow of data and control between them. For example, the diagram
would show how the user's web request is processed and how the system uses its
sub-systems to detect and prevent phishing attacks. The diagram would also show
how the system communicates with the user through their web browser to provide
warnings and alerts when potential phishing attacks are detected. Overall, the se-
quence diagram would provide a clear understanding of the different steps involved
in detecting and preventing phishing attacks using machine learning.

Collaboration Diagram

Figure 4.6: Collaboration Diagram

Figure 4.6 depicts the interactions and communication between the system's components,
showing how they collaborate to detect and prevent phishing attacks using machine
learning. For example, the diagram would show how the user's web browser
communicates with the system to send web requests and receive warnings and alerts when
potential phishing attacks are detected. The diagram would also show how the system
communicates with its sub-systems for analyzing website data, training and evaluating
machine learning models, and updating the system periodically. The diagram would
also show how it communicates with potentially malicious websites to analyze their
content and determine whether they are phishing websites or not. Overall, the collab-
oration diagram would provide a clear understanding of the different components of
the system and how they collaborate to achieve the goal of detecting and preventing
phishing attacks using machine learning.

Activity Diagram

Figure 4.7: Activity Diagram

Figure 4.7 depicts the different activities involved in detecting and preventing
phishing attacks using machine learning, showing the flow of control between them.
For example, the diagram would show how data is collected, how features are
extracted by the Feature Extractor, how machine learning algorithms are selected,
and how the machine learning model is trained and evaluated. The diagram would also
show how the system analyzes website content, alerts the user, blocks the website if
necessary, and periodically updates the machine learning model to improve its
accuracy. Overall, the activity diagram would provide a clear understanding of the
different activities involved in detecting and preventing phishing attacks using
machine learning algorithms.

Algorithm and Pseudo Code

Algorithm for Phishing Website

1. Set up the project environment by installing the necessary dependencies and
tools (e.g., PyCharm, SQLyog, XAMPP).

2. Create the basic file structure for the project (Index.html, Graph.html, App.py,
Load Data.html, Login.html, Model.html, Prediction.html).

3. Implement the App.py component, which serves as the main component of the
website.

4. This component generates the link to the website.

5. The algorithms used are AdaBoost, XGBoost, Random Forest Classifier, and
Gradient Boost Classifier.

6. These classifiers are trained on the collected and preprocessed data.

7. Collect a dataset of URLs with labels indicating whether they are phishing or
legitimate. Preprocess the data by extracting relevant features such as URL length,
domain age, and presence of certain keywords.

8. Split the dataset into training and testing sets. The training set is used to
train the models, and the testing set is used to evaluate the performance of
the models.

9. For the boosting models, train the model on the training set by sequentially
adding weak classifiers to the model. Each weak classifier is trained on a subset
of the training data and assigned a weight based on its performance. The final
model is a weighted combination of the weak classifiers.

10. Evaluate the performance of the models on the testing set by calculating metrics
such as accuracy, precision, recall, and F1-score.

11. Tune the hyperparameters of the models to optimize their performance.
Hyperparameters include the number of weak classifiers and the learning rate.

12. Deploy the trained models to detect phishing websites in real time by inputting
a URL and predicting whether it is phishing or legitimate.
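
Steps 8 to 11 can be sketched with scikit-learn for one of the listed classifiers. This is an illustrative sketch and not the project's App.py: the function name train_and_evaluate and the small hyperparameter grid are assumptions, label 1 is assumed to mean phishing, and the 70/30 split with random_state 20 mirrors the values used in the project code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split


def train_and_evaluate(X, y, seed=20):
    """Split, tune, train and score an AdaBoost model (steps 8 to 11)."""
    # Step 8: hold out 30% of the rows for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)

    # Step 11: try a few values for the number of weak classifiers and the
    # learning rate. The grid values are illustrative assumptions.
    search = GridSearchCV(
        AdaBoostClassifier(random_state=seed),
        param_grid={"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]},
        cv=3,
        scoring="accuracy",
    )

    # Step 9: fitting adds weak classifiers one after another; the search
    # refits the best configuration on the whole training set.
    search.fit(X_train, y_train)

    # Step 10: score the tuned model on the held-out test set.
    pred = search.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    return search.best_estimator_, search.best_params_, metrics


if __name__ == "__main__":
    # Synthetic stand-in data; a real run would load the labelled URL features.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    _, params, metrics = train_and_evaluate(X, y)
    print(params)
    print(metrics)
```

The same function shape would work for the other classifiers by swapping the estimator and its parameter grid.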

Pseudo Code

# Preprocess data
features = extract_features(website_data)

# Train model
model = train_model(training_data, labels)

# Evaluate model
accuracy = evaluate_model(model, test_data, test_labels)

# Use model to classify new website
prediction = model.predict(features)

# Output result
if prediction == 'phishing':
    print("WARNING: This website may be a phishing site!")
else:
    print("This website appears to be legitimate.")

19 def login () :
20 if r e q u e s t . method== ’POST ’ :
21 u s e r e m a i l = r e q u e s t . form [ ’ u s e r e m a i l ’ ]
22 session [ ’useremail ’]=useremail
23 u s e r p a s s w o r d = r e q u e s t . form [ ’ u s e r p a s s w o r d ’ ]
24 s q l =” s e l e c t c o u n t ( * ) from u s e r where Email=’% s ’ and Password=’% s ’ ”%( u s e r e m a i l , u s e r p a s s w o r d )
25 u e r y ( s q l , db )
26 print (x)
27 p r i n t ( ’ ######################## ’ )
28 c o u n t =x . v a l u e s [ 0 ] [ 0 ]
29

30 if c o u n t == 0 :
31 msg=” u s e r C r e d e n t i a l s Are n o t valid”
32 return r e n d e r t e m p l a t e ( ” l o g i n . html ” , name=msg )
33 else :
34 s=” s e l e c t * from u s e r where Email=’% s ’ and Password=’% s ’ ”%( u s e r e m a i l , u s e r p a s s w o r d )
35 z=pd . r e a d s q l q u e r y ( s , db )
36 session [ ’email ’]=useremail
37 pno= s t r ( z . v a l u e s [ 0 ] [ 5 ] )
38 p r i n t ( pno )
39 name= s t r ( z . v a l u e s [ 0 ] [ 1 ] )
40 p r i n t ( name )
41 s e s s i o n [ ’ pno ’ ] = pno
42 s e s s i o n [ ’ name ’ ] = name
43 return r e n d e r t e m p l a t e ( ” userhome . html ” , myname=name )
44 return r e n d e r t e m p l a t e ( ’ l o g i n . html ’ )
45

@app.route('/model', methods=['GET', "POST"])
def model():
    global score1, score2, score3, score4, score5
    if request.method == "POST":
        model = int(request.form['selected'])
        print(model)
        # Use the first file found in the upload folder as the dataset
        path = os.listdir(app.config['upload_folder'])
        file = os.path.join(app.config['upload_folder'], path[0])
        df = pd.read_csv(file)

        print(df.columns)
        print('#######################################################')

        X = df.drop(['Label', 'Domain', 'Web_Traffic'], axis=1)
        y = df.Label
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)

        print(df)
        if model == 1:
            from sklearn.ensemble import RandomForestClassifier
            rfr = RandomForestClassifier()
            rfr.fit(x_train, y_train)
            pred = rfr.predict(x_test)
            score1 = accuracy_score(y_test, pred) * 100
            print(score1)
            msg = 'The accuracy obtained by Random Forest Classifier is ' + str(score1) + str('%')
            return render_template('model.html', msg=msg)
        elif model == 2:
            classifier = AdaBoostClassifier()
            classifier.fit(x_train, y_train)
            pred = classifier.predict(x_test)
            score2 = accuracy_score(y_test, pred) * 100
            print(score2)
            msg = 'The accuracy obtained by Ada Boost Classifier is ' + str(score2) + str('%')
            return render_template('model.html', msg=msg)
        elif model == 3:
            from xgboost import XGBClassifier
            xgb = XGBClassifier()
            xgb.fit(x_train, y_train)
            pred = xgb.predict(x_test)
            score3 = accuracy_score(y_test, pred) * 100
            print(score3)
            msg = 'The accuracy obtained by XGBoost Classifier is ' + str(score3) + str('%')
            return render_template('model.html', msg=msg)
        elif model == 4:
            cf = SVC(kernel='linear')
            cf.fit(x_train, y_train)
            pred = cf.predict(x_test)
            score4 = accuracy_score(y_test, pred) * 100
            print(score4)
            msg = 'The accuracy obtained by Support Vector Machine is ' + str(score4) + str('%')
            return render_template('model.html', msg=msg)
        elif model == 5:
            gb = GradientBoostingClassifier()
            gb.fit(x_train, y_train)
            pred = gb.predict(x_test)
            score5 = accuracy_score(y_test, pred) * 100
            print(score5)
            msg = 'The accuracy obtained by Gradient Boosting Classifier is ' + str(score5) + str('%')
            return render_template('model.html', msg=msg)
    return render_template('model.html')

        # fit the model
        forest.fit(X_train, y_train)
        print('aa')
        print(url1)
        print(type(url1))
        my_features = featureExtraction(url1)
        prob_of_doms = top_doms[1].values
        if my_features[0] in prob_of_doms:
            return render_template('prediction.html', msg='success')
        else:
            pred_1 = forest.predict([my_features[1:]])
            print(pred_1)
            if pred_1 == 0:
                msg = ""
            else:
                msg = ""
            return render_template('prediction.html', result=pred_1, msg=msg)
    return render_template('prediction.html')

The code above defines a Flask application ('app') that loads the uploaded dataset
and renders the results of the selected classifier. The overall workflow is as follows.
First, a large dataset of labeled website examples is gathered, including both
legitimate and phishing websites. The website data is then converted into a format
that the machine learning algorithm can use. This may include extracting features
such as URL length, domain age, and the presence of suspicious keywords. A supervised
learning algorithm, such as a decision tree, random forest, XGBoost, Gradient Boosting,
or Support Vector Machine, is trained on the preprocessed training data. A separate
test dataset is used to evaluate the performance of the trained model. Finally, the
model is used to classify new websites as either legitimate or potential phishing sites.
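
The feature extraction step described above can be illustrated with a minimal sketch. It is not the project's own featureExtraction function, which is not shown in this section. The function name extract_url_features, the feature names, and the keyword list are assumptions made for illustration. Domain age is left out because it needs a WHOIS lookup over the network.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def extract_url_features(url):
    """Return a small dict of lexical features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address used in place of a domain name is a common phishing signal.
    try:
        ipaddress.ip_address(host)
        has_ip_host = 1
    except ValueError:
        has_ip_host = 0

    lowered = url.lower()
    return {
        "url_length": len(url),
        "has_ip_host": has_ip_host,
        "has_at_symbol": int("@" in url),
        "dot_count": host.count("."),
        "has_suspicious_keyword": int(any(k in lowered for k in SUSPICIOUS_KEYWORDS)),
    }


features = extract_url_features("http://192.168.0.1/secure-login")
print(features)
```

Each URL becomes one row of numeric values, which is the form the classifiers expect as input.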

Module Description

Data Collection

Machine learning models learn from the data they are given, so it is of the utmost
importance to collect reliable data from which the model can find the correct
patterns. The quality of the data fed to the model determines how accurately it
performs. Incorrect or outdated data leads to wrong outcomes or to predictions that
are not relevant.

Pre-Processing The Data

All the available data is put together and randomized. This helps ensure that the
data is evenly distributed and that its ordering does not affect the learning
process. The data is then cleaned: unwanted data, missing values, and duplicate
values are removed, unneeded rows and columns are dropped, data types are converted,
and so on. The dataset may even have to be restructured by changing its rows and
columns or their indexes. Finally, the data is visualized to understand how it is
structured and how the variables and classes relate to one another.
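
The cleaning and randomizing steps can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the project's code, which reads the dataset with pandas. The function name preprocess, the column names, and the sample rows are hypothetical.

```python
import random


def preprocess(rows, seed=20):
    """Drop incomplete rows, remove duplicates, then shuffle."""
    # Remove rows that contain a missing value (None or empty string).
    complete = [r for r in rows if all(v is not None and v != "" for v in r.values())]

    # Remove exact duplicates while keeping the first occurrence.
    seen = set()
    unique = []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Shuffle with a fixed seed so the ordering is random but reproducible.
    random.Random(seed).shuffle(unique)
    return unique


raw = [
    {"url_length": 31, "has_ip_host": 1, "Label": 1},
    {"url_length": 20, "has_ip_host": 0, "Label": 0},
    {"url_length": 20, "has_ip_host": 0, "Label": 0},    # duplicate row
    {"url_length": None, "has_ip_host": 0, "Label": 0},  # missing value
]
clean = preprocess(raw)
print(len(clean))
```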

Splitting the Data

The cleaned data is split into two sets: a training set and a testing set. The
training set is the set the model learns from. The testing set is used to check the
accuracy of the model after training.
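
The project's code performs this split with scikit-learn's train_test_split, using test_size = 0.3 and random_state = 20. The sketch below is a standard-library stand-in that only illustrates the idea of a shuffled 70/30 split. The helper split_rows is hypothetical and does not reproduce scikit-learn's exact shuffling.

```python
import random


def split_rows(rows, test_size=0.3, seed=20):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train, test = split_rows(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so accuracies from different classifiers are computed on the same test rows.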

Model Training

Training is the most important step in machine learning. In training, you pass the
prepared data to your machine-learning model to find patterns and make predictions.
It results in the model learning from the data so that it can accomplish the task set.
Over time, with training, the model gets better at predicting.

Making Predictions

Finally, the trained model is used on unseen data to make predictions. The
application generates a graph showing which model achieved the highest accuracy and
reports whether a given URL is legitimate or phishing.

Steps to execute/run/implement the project

Home page

The user first sees the home page of the phishing website prediction web
application, which provides options such as register, login, and logout.

Uploading dataset

Download the dataset from Kaggle, where a list of legitimate and phishing websites
has already been collected and uploaded. Submit this dataset in the Load Data
section.

Viewing the dataset uploaded

The uploaded dataset, which contains the phishing and legitimate URLs, is shown on
the screen.

Testing

Test the dataset using a selected algorithm, such as AdaBoost, Support Vector
Machine, or Random Forest Classifier.

Prediction

Obtain the accuracy of the selected algorithms, evaluate the model's performance,
and use the model to classify new websites as either legitimate or potential
phishing sites.

IMPLEMENTATION AND TESTING

Input and Output

Input Design

In an information system, input is the raw data that is processed to produce output.
During input design, developers must consider input devices such as PC, MICR, OMR,
and so on. The quality of system input determines the quality of system output.

Well-designed input forms and screens have the following properties:


• They serve a specific purpose effectively, such as storing, recording, and
retrieving information.
• They ensure proper completion with accuracy.
• They are easy to fill in and straightforward.
• They focus on the user's attention, consistency, and simplicity.
• All these objectives are obtained using the knowledge of basic

Output Design

The design of output is the most important task of any system. During output design,
developers identify the type of outputs needed and consider the necessary output
controls and prototype report layouts.

Objectives of Output Design


• To develop an output design that serves the intended purpose and eliminates the
production of unwanted output.
• To develop the output design that meets the end user’s requirements.
• To deliver the appropriate quantity of output.
• To form the output in the appropriate format and direct it to the right person.

Testing

To test the performance of a phishing website detection project that uses machine
learning, the following steps are typically followed:

• Collect a dataset of websites that includes both legitimate and phishing websites.
• Preprocess the dataset by extracting relevant features from each website, such as
URL structure, HTML content, and the presence of certain keywords.
• Split the dataset into training and testing sets, with the majority of the data in
the training set.
• Train several different machine learning algorithms on the training set, using
techniques such as cross-validation to optimize hyperparameters and prevent
overfitting.
• Evaluate the performance of each algorithm on the testing set, using metrics such
as accuracy, precision, recall, and F1 score to assess its effectiveness at
detecting phishing websites.
• Compare the performance of the different algorithms and choose the one with the
best overall performance.
• Deploy the chosen algorithm in a real-world setting and monitor its performance
over time, making adjustments as needed to improve its accuracy and effectiveness.

The effectiveness of the project depends heavily on the quality and size of the
dataset used for training and testing, as well as on the choice of features and
machine learning algorithms. It is therefore important to invest significant time
and effort in data collection and preprocessing, and in algorithm selection and
tuning, to achieve the best possible performance.
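
The evaluation metrics named above can be computed directly from the true and predicted labels. The sketch below is a plain-Python illustration. It assumes that label 1 means phishing and label 0 means legitimate, and the sample labels are made up for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = phishing, 0 = legitimate (an assumption for this example).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can hide missed phishing sites when the classes are imbalanced, which is why precision and recall are reported alongside it.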

Types of Testing

Several types of testing are carried out depending on the project's functionality and
requirements. Some of them are described below.

Unit testing

Unit testing is a software development process in which the smallest testable parts
of an application, called units, are individually and independently tested for proper
operation. This testing helps the developer verify that each component works as
expected.
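
As a small illustration of a unit test, the sketch below checks a file-extension rule like the one used when a dataset is uploaded. The helper is_csv_upload is hypothetical. It mirrors the '.csv' comparison in the upload route but is not part of the project's code.

```python
import os
import unittest


def is_csv_upload(filename):
    """Hypothetical helper mirroring the extension check in the upload route."""
    return os.path.splitext(filename)[1] == ".csv"


class IsCsvUploadTest(unittest.TestCase):
    def test_accepts_csv(self):
        self.assertTrue(is_csv_upload("phishing_urls.csv"))

    def test_rejects_other_extensions(self):
        self.assertFalse(is_csv_upload("phishing_urls.xlsx"))
        self.assertFalse(is_csv_upload("csv"))


# Run the test case programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsCsvUploadTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```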

Input

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <meta content="width=device-width, initial-scale=1.0" name="viewport">

  <title>Sailor Bootstrap Template - Index</title>
  <meta content="" name="description">
  <meta content="" name="keywords">

  <!-- Favicons -->
  <link href="static/assets/img/favicon.png" rel="icon">
  <link href="static/assets/img/apple-touch-icon.png" rel="apple-touch-icon">

  <!-- Google Fonts -->
  <link
    href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i|Raleway:300,300i,400,400i,500,500i,600,600i,700,700i|Poppins:300,300i,400,400i,500,500i,600,600i,700,700i"
    rel="stylesheet">

  <!-- Vendor CSS Files -->
  <link href="static/assets/vendor/bootstrap/css/bootstrap.min.css" rel="stylesheet">
  <link href="static/assets/vendor/icofont/icofont.min.css" rel="stylesheet">
  <link href="static/assets/vendor/boxicons/css/boxicons.min.css" rel="stylesheet">
  <link href="static/assets/vendor/animate.css/animate.min.css" rel="stylesheet">
  <link href="static/assets/vendor/remixicon/remixicon.css" rel="stylesheet">
  <link href="static/assets/vendor/venobox/venobox.css" rel="stylesheet">
  <link href="static/assets/vendor/owl.carousel/assets/owl.carousel.min.css" rel="stylesheet">

  <!-- Template Main CSS File -->
  <link href="static/assets/css/style.css" rel="stylesheet">

</head>

<body>

  <!-- ======= Header ======= -->
  <header style="color: white;" id="header" class="fixed-top">
    <div class="container d-flex align-items-center">

      <h1 class="logo"><a href="/">PHISHING WEBSITE DETECTION</a></h1>
      <!-- Uncomment below if you prefer to use an image logo -->
      <a href="index.html" class="logo"><img src="static/assets/img/logo.png" alt="" class="img-fluid"></a>

      <nav class="nav-menu d-none d-lg-block">

        <ul>
          <li><a href="/">Home</a></li>
          <li class="active"><a href="/login">Log In</a></li>
          <li><a href="/registration">Register</a></li>

          <!-- <li><a href="/load_data">Load Data</a></li>
          <li><a href="/view_data">View Data</a></li>
          <li><a href="/model">Select Model</a></li>
          <li><a href="/prediction">Prediction</a></li>
          <li><a href="/graph">Graph</a></li> -->

        </ul>

      </nav><!-- .nav-menu -->

      <!-- <a href="index.html" class="get-started-btn ml-auto">Get Started</a>-->

    </div>
  </header><!-- End Header -->

  <div class="overlay">
    <div class="gtco-container">
      <div class="row">
        <div class="col-md-12 col-md-offset-0 text-center">
          <div class="display-t">
            <div class="display-tc animate-box" data-animate-effect="fadeIn">
              <center>

                <!-- <h3 style="bottom: 151px; color: rgb(11, 203, 236); top: -222;">{{msg}}</h3> -->
              </center>
              <!-- <h3 style="color: rgb(11, 203, 236); bottom: 115px;"> Welcome To the website</h3> -->
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
  <!-- ======= Hero Section ======= -->
  <section id="hero">
    <div id="heroCarousel" class="carousel slide carousel-fade" data-ride="carousel">

      <ol class="carousel-indicators" id="hero-carousel-indicators"></ol>

      <!-- <div class="carousel-inner" role="listbox">-->

      <!-- Slide 1 -->
      <div class="carousel-item active" style="background-image: url(static/assets/img/slide/2.jpg)">

        <div class="container">
          <!-- Registration Start -->
          <div class="container-fluid bg-registration py-5" style="margin: 90px 0;">
            <div class="container py-5">
              <div class="row align-items-center">

                <div class="col-lg-5" style="left: 361px; top: 120px;">
                  <div class="card border-0">
                    <div class="card-header bg-primary text-center p-4">
                      <h1 class="text-white m-0">User Login</h1>
                    </div>
                    <div class="card-body rounded-bottom bg-white p-5">
                      <form action="{{url_for('login')}}" method="post">

                        <div class="form-group">
                          <input type="email" name="useremail" class="form-control p-4" placeholder="Enter Your Email" required="required" />
                        </div>
                        <div class="form-group">
                          <input type="password" name="userpassword" class="form-control p-4" placeholder="Enter Your Password" required="required" />
                        </div>
                        <div>
                          <button class="btn btn-primary btn-block py-3" type="submit">Login</button>
                        </div>
                      </form>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </div>
          <!-- Registration End -->
          <!-- <h2 class="animate__animated animate__fadeInDown" style="margin-top: 244px;">Welcome to Phishing <br><span>Website Detection</span></h2>--> -->
          <!-- <p class="animate__animated animate__fadeInUp">Ut velit est quam dolor ad a aliquid qui aliquid. Sequi ea ut

Test result

Figure 5.1: Unit Testing

Integration testing

Ensure that the system can correctly receive input data from various sources, such
as user input or website data feeds. Test the preprocessing stage to ensure that the
data is cleaned and formatted correctly. Verify that the feature extraction stage is
accurately extracting relevant features from the preprocessed data. Test the feature
selection techniques to ensure that they are effectively selecting the most relevant
features for the machine learning model. Test the machine learning model selection
process to ensure that the appropriate model is chosen for the detection task. Verify
that the model is trained correctly on the preprocessed and feature-selected data.
Test the evaluation metrics to ensure that the performance of the model is accurately
measured. Verify that the model parameters and feature selection techniques are
optimized to improve its performance. Test the deployment process to ensure that the
optimized model is successfully integrated into the system for real-world phishing
website detection. Verify that the model is being monitored and updated periodically
to maintain its effectiveness against evolving phishing tactics.

Input

/*
SQLyog Enterprise - MySQL GUI v6.56
MySQL - 5.5.5-10.1.13-MariaDB : Database - phishing
*/

CREATE TABLE `user` (
  `Id` int(200) NOT NULL AUTO_INCREMENT,
  `Name` varchar(200) DEFAULT NULL,
  `Email` varchar(200) DEFAULT NULL,
  `Password` varchar(200) DEFAULT NULL,
  `Age` varchar(200) DEFAULT NULL,
  `Mob` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;

/* Data for the table `user` */

insert into `user`(`Id`,`Name`,`Email`,`Password`,`Age`,`Mob`) values (1,'Balaram','balaram@gmail.com','1234','26','7853011277');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
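
A login check against a table like the one above is safest when the user's input is passed as query parameters rather than formatted into the SQL string. The sketch below illustrates this under stated assumptions. It uses an in-memory SQLite database as a stand-in for MySQL, so the placeholder is ? where MySQL Connector uses %s. The helper credentials_match and the sample row are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `phishing` database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (Id INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT, "
    "Email TEXT, Password TEXT, Age TEXT, Mob TEXT)"
)
conn.execute(
    "INSERT INTO user (Name, Email, Password, Age, Mob) VALUES (?, ?, ?, ?, ?)",
    ("Test User", "test@example.com", "secret", "30", "0000000000"),
)


def credentials_match(conn, email, password):
    """Return True when a row with this email and password exists."""
    row = conn.execute(
        "SELECT COUNT(*) FROM user WHERE Email = ? AND Password = ?",
        (email, password),
    ).fetchone()
    return row[0] > 0


print(credentials_match(conn, "test@example.com", "secret"))
# A classic injection string is treated as plain data and matches nothing.
print(credentials_match(conn, "' OR '1'='1", "' OR '1'='1"))
```

Because the driver sends the values separately from the statement, quote characters in the input cannot change the structure of the query.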

Test result

Figure 5.2: Integration Testing

System testing

Input

</header><!-- End Header -->

<section id="hero">
  <div id="heroCarousel" class="carousel slide carousel-fade" data-ride="carousel">

    <h1 style="color: white;">{{msg}}</h1>

    <ol class="carousel-indicators" id="hero-carousel-indicators"></ol>

    <!-- <div class="carousel-inner" role="listbox">-->
    <div class="carousel-item active" style="color: bright; background-image: url(static/assets/img/slide/2.jpg)">

      <h1 style="color: white;">{{msg}}</h1>

      <div class="carousel-container">
        <center><h4 style="color: white;">{{msg}}</h4></center>

        <div class="container">

          {% block body %}

          {% if msg == 'success' %}
          <h3 style="color: white; background: green"><i>The website is "Legitimate"</i></h3>
          {% elif result == [0] %}
          <h3 style="color: white; background: green"><i>The website is "Legitimate"</i></h3>
          {% elif result == [1] %}
          <h3 style="color: white; background: green"><i>The website is "phishing"</i></h3>
          {% endif %}
          {% endblock %}
          <h3 style="color: white">ENTER URL</h3>
          <form action="{{url_for('prediction')}}" method="post">
            <input type="url" name="a" style="width: 500px" placeholder="Enter URL"><br><br>

            <input type="submit" value="submit" class="btn btn-info">

          </form>

        </div>
      </div>
    </div>
    <!-- </div>-->
  </div>
</section>

<script src="static/assets/vendor/jquery/jquery.min.js"></script>
<script src="static/assets/vendor/bootstrap/js/bootstrap.bundle.min.js"></script>
<script src="static/assets/vendor/jquery.easing/jquery.easing.min.js"></script>
<script src="static/assets/vendor/php-email-form/validate.js"></script>
<script src="static/assets/vendor/isotope-layout/isotope.pkgd.min.js"></script>
<script src="static/assets/vendor/venobox/venobox.min.js"></script>
<script src="static/assets/vendor/waypoints/jquery.waypoints.min.js"></script>
<script src="static/assets/vendor/owl.carousel/owl.carousel.min.js"></script>

<!-- Template Main JS File -->
<script src="static/assets/js/main.js"></script>

</body>

</html>

Test result

Figure 5.3: System Testing Result

Figure 5.3 shows the models and the accuracy each one obtained. The Random Forest
Classifier achieved the highest accuracy (89.78%), followed by the XGBoost
Classifier (83.54%) and the Gradient Boosting Classifier (82.34%). The AdaBoost
Classifier and the Support Vector Machine obtained the same accuracy (78.67%).
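
The comparison can be expressed as a simple ranking. The sketch below takes the accuracies reported for Figure 5.3 and sorts them; the variable names are assumptions made for illustration.

```python
# Accuracies (in percent) as reported for Figure 5.3.
accuracies = {
    "Random Forest": 89.78,
    "XGBoost": 83.54,
    "Gradient Boosting": 82.34,
    "AdaBoost": 78.67,
    "Support Vector Machine": 78.67,
}

# Sort from highest to lowest accuracy and pick the best model.
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
best_model, best_accuracy = ranked[0]

for name, acc in ranked:
    print(f"{name:<24}{acc:6.2f}")
print("Best model:", best_model)
```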

Test Result

Figure 5.4: Final test Image

In Figure 5.4 the classifiers are trained and their accuracies are compared. Based
on these accuracies a model is selected, which is then ready to predict whether a
given URL is legitimate or a phishing URL.

RESULTS AND DISCUSSIONS

Efficiency of the Proposed System

Phishing websites are fraudulent websites that are designed to trick users into giving
away their personal or financial information. Phishing attacks are becoming more
sophisticated, and traditional anti-phishing measures are no longer enough. There-
fore, a proposed system of phishing website detection using machine learning can be
developed. Machine learning algorithms are designed to detect patterns in data that
human analysts may not be able to spot. The proposed system of phishing website
detection would utilize machine learning algorithms to analyze website features such
as domain age, certificate issuer, IP address location, and HTML code. The analysis
can then be compared to known phishing websites to determine if the website poses
a threat.
Once the system identifies a potential phishing website, it would send an alert to the
user warning them of the threat. The user can then choose to proceed with caution or
avoid the site altogether. Overall, the proposed system of phishing website detection
using machine learning has the potential to significantly reduce the risk of phishing
attacks. However, it is important to note that no system is foolproof and the human
factor remains critical in ensuring online safety. Therefore, users must remain vig-
ilant and aware of potential threats to their personal and financial information, even
when using machine learning-based detection systems.

Comparison of Existing and Proposed System

Criterion        Existing System     Proposed System

Security         Better              High
Loading Time     Low                 Speed
Complexity       Low                 High
Interface        No Interface        Interface Available
User Friendly    Better              Best
Flexibility      Less Flexibility    High Flexibility
Table 6.1: Comparison of Existing and Proposed System

Table 6.1 compares the existing and proposed systems. The proposed system utilizes
advanced machine learning algorithms, feature engineering, and feature selection to
improve the accuracy of phishing website detection. Additionally, it uses advanced
phishing detection techniques to identify more sophisticated phishing websites.
Finally, it generates alerts and reports so that users and authorities are quickly
alerted to phishing websites.

Sample Code

import os
from flask import *
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from urllib.parse import urlparse
import ipaddress
import re
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime
import requests

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import mysql.connector

db = mysql.connector.connect(user="root", password="", port='3306', database='phishing')
cur = db.cursor()

app = Flask(__name__)
app.secret_key = "fghhdfgdfgrthrttgdfsadfsaffgd"

app.config['upload_folder'] = r'uploads'
top_doms = pd.read_csv('top-1m.csv', header=None)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/login', methods=['POST', 'GET'])
def login():
    if request.method == 'POST':
        useremail = request.form['useremail']
        session['useremail'] = useremail
        userpassword = request.form['userpassword']
        # Pass the values as query parameters instead of formatting them into the SQL text
        sql = "select count(*) from user where Email=%s and Password=%s"
        # cur.execute(sql)
        # data=cur.fetchall()
        # db.commit()
        x = pd.read_sql_query(sql, db, params=(useremail, userpassword))
        print(x)
        print('########################')
        count = x.values[0][0]

        if count == 0:
            msg = "user Credentials Are not valid"
            return render_template("login.html", name=msg)
        else:
            s = "select * from user where Email=%s and Password=%s"
            z = pd.read_sql_query(s, db, params=(useremail, userpassword))
            session['email'] = useremail
            pno = str(z.values[0][5])
            print(pno)
            name = str(z.values[0][1])
            print(name)
            session['pno'] = pno
            session['name'] = name
            return render_template("userhome.html", myname=name)
    return render_template('login.html')

@app.route('/registration', methods=["POST", "GET"])
def registration():
    if request.method == 'POST':
        username = request.form['username']
        useremail = request.form['useremail']
        userpassword = request.form['userpassword']
        conpassword = request.form['conpassword']
        Age = request.form['Age']

        contact = request.form['contact']
        if userpassword == conpassword:
            sql = "select * from user where Email=%s and Password=%s"
            cur.execute(sql, (useremail, userpassword))
            data = cur.fetchall()
            db.commit()
            print(data)
            if data == []:
                sql = "insert into user(Name,Email,Password,Age,Mob)values(%s,%s,%s,%s,%s)"
                val = (username, useremail, userpassword, Age, contact)
                cur.execute(sql, val)
                db.commit()
                flash("Registered successfully", "success")
                return render_template("login.html")
            else:
                flash("Details are invalid", "warning")
                return render_template("registration.html")
        else:
            flash("Password doesn't match", "warning")
            return render_template("registration.html")
    return render_template('registration.html')

@app.route('/load_data', methods=["POST", "GET"])
def load_data():
    if request.method == "POST":
        file = request.files['file']
        filetype = os.path.splitext(file.filename)[1]
        print(filetype)
        # Only CSV uploads are accepted
        if filetype == '.csv':
            mypath = os.path.join(app.config['upload_folder'], file.filename)
            file.save(mypath)
            return render_template('load_data.html', msg='success')
        else:
            return render_template('load_data.html', msg='invalid')
    return render_template('load_data.html')

@app.route('/view_data', methods=["POST", "GET"])
def view_data():
    # Show the first file found in the upload folder
    path = os.listdir(app.config['upload_folder'])
    file = os.path.join(app.config['upload_folder'], path[0])
    df = pd.read_csv(file)
    return render_template('view_data.html', col_name=df.columns, row_val=list(df.values.tolist()))
120

@app.route('/model', methods=['GET', "POST"])
def model():
    # Accuracy scores are stored in module-level globals.
    global score1, score2, score3, score4, score5
    if request.method == "POST":
        model = int(request.form['selected'])
        print(model)
        path = os.listdir(app.config['upload_folder'])
        file = os.path.join(app.config['upload_folder'], path[0])
        df = pd.read_csv(file)

        print(df.columns)
        print('#######################################################')

        # Features are every column except the label, the domain name, and web traffic.
        X = df.drop(['Label', 'Domain', 'Web_Traffic'], axis=1)
        y = df.Label
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)

        print(df)
        if model == 1:
            from sklearn.ensemble import RandomForestClassifier
            rfr = RandomForestClassifier()
            rfr.fit(x_train, y_train)
            pred = rfr.predict(x_test)
            score1 = accuracy_score(y_test, pred) * 100
            print(score1)
            msg = 'The accuracy obtained by Random Forest Classifier is ' + str(score1) + '%'
            return render_template('model.html', msg=msg)
        elif model == 2:
            classifier = AdaBoostClassifier()
            classifier.fit(x_train, y_train)
            pred = classifier.predict(x_test)
            score2 = accuracy_score(y_test, pred) * 100
            print(score2)
            msg = 'The accuracy obtained by Ada Boost Classifier is ' + str(score2) + '%'
            return render_template('model.html', msg=msg)
        elif model == 3:
            from xgboost import XGBClassifier
            xgb = XGBClassifier()
            xgb.fit(x_train, y_train)
            pred = xgb.predict(x_test)
            score3 = accuracy_score(y_test, pred) * 100
            print(score3)
            msg = 'The accuracy obtained by XGBoost Classifier is ' + str(score3) + '%'
            return render_template('model.html', msg=msg)
        elif model == 4:
            cf = SVC(kernel='linear')
            cf.fit(x_train, y_train)
            pred = cf.predict(x_test)
            score4 = accuracy_score(y_test, pred) * 100
            print(score4)
            msg = 'The accuracy obtained by Support Vector Machine is ' + str(score4) + '%'
            return render_template('model.html', msg=msg)
        elif model == 5:
            gb = GradientBoostingClassifier()
            gb.fit(x_train, y_train)
            pred = gb.predict(x_test)
            score5 = accuracy_score(y_test, pred) * 100
            print(score5)
            msg = 'The accuracy obtained by Gradient Boosting Classifier is ' + str(score5) + '%'
            return render_template('model.html', msg=msg)
    return render_template('model.html')

@app.route('/prediction', methods=["POST", "GET"])
def prediction():
    if request.method == "POST":
        url1 = request.form['a']

        def getDomain(url):
            domain = urlparse(url).netloc
            if re.match(r"^www.", domain):
                domain = domain.replace("www.", "")
            return domain
Output

Figure 6.1: Home Page

Figure 6.1 shows the home page that users see when they open the website. It
contains the load data, view data, select model, prediction, graph, and log out
sections.

Figure 6.2: Prediction

Figure 6.2 shows the prediction page. The available algorithms are displayed, and
the user selects one to obtain its accuracy. After that, the user enters any URL
to check whether it is legitimate or phishing.
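
As a rough illustration of the first step of that check, the sketch below turns a URL into a few lexical features of the kind computed in the project's source code (IP address as host, "@" sign, the 54-character length threshold, path depth, and "//" redirection). The function name extract_features, the dictionary keys, and the sample URLs are assumptions made for illustration. Unlike the project listing, this sketch tests the host part of the URL for an IP address rather than the whole URL string.

```python
import ipaddress
from urllib.parse import urlparse


def extract_features(url):
    """Return a small dict of lexical features for one URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # 1 if the host part is a literal IP address, else 0.
    try:
        ipaddress.ip_address(host)
        having_ip = 1
    except ValueError:
        having_ip = 0

    # Number of non-empty path segments.
    depth = sum(1 for part in parsed.path.split("/") if part)

    return {
        "having_ip": having_ip,
        "have_at": 1 if "@" in url else 0,
        "long_url": 0 if len(url) < 54 else 1,
        "depth": depth,
        # A '//' beyond the scheme separator suggests an embedded redirect.
        "redirection": 1 if url.rfind("//") > 7 else 0,
    }


# Hypothetical sample URLs (example.com and 192.0.2.1 are reserved for documentation).
for sample in ("https://www.example.com/a/b", "http://192.0.2.1//login@x"):
    print(sample, extract_features(sample))
```

A trained classifier would then receive these values, in the same column order as the training data, to label the URL as legitimate or phishing.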

CONCLUSION AND FUTURE ENHANCEMENTS

Conclusion

This project reviewed various algorithms and approaches that researchers have
proposed for detecting phishing websites with machine learning. On reviewing the
papers, we concluded that most of the work uses familiar machine learning
algorithms such as XGBoost, Decision Tree, Random Forest, and the MLP classifier,
which is a neural network model. Some authors proposed new systems, such as Phish
Score and Phish Checker, for detection. These studies compared combinations of
features in terms of accuracy, precision, and recall. As phishing websites increase
day by day, some features may need to be added or replaced with new ones to detect them.
Several things can be polished or added in future work. As the source code shows,
this project uses five classifiers: Random Forest, AdaBoost, XGBoost, a
linear-kernel Support Vector Machine, and Gradient Boosting.
Other classifiers, such as the ID3, Naive Bayes, Bayesian network, neural network,
and C4.5 classifiers, were not included in our project and could be added in the
future to provide more results for comparison.
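
Since the comparison in this conclusion rests on accuracy, precision, and recall, the following minimal sketch shows how these three metrics are computed from true and predicted labels. It assumes that label 1 means phishing and label 0 means legitimate; the function name and the toy labels are hypothetical and are not results from this project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels for illustration only (1 = phishing, 0 = legitimate).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
print(classification_metrics(y_true, y_pred))  # (0.625, 0.6, 0.75)
```

Precision penalizes legitimate sites flagged as phishing, while recall penalizes phishing sites that are missed, so reporting both alongside accuracy gives a fuller comparison between classifiers.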

Future Enhancements

In the future, with a well-structured phishing dataset, detection could be
performed much faster. A combination of two or more classifiers could also be used
to maximize accuracy. We also plan to explore phishing detection techniques that
use lexical, network-based, content-based, and webpage-based features, as well as
HTML and JavaScript features of web pages, which could improve the performance of
the system. In the current system, we extract features from URLs and pass them to
the classifiers.
INDUSTRY DETAILS

SOURCE CODE

11.1 Source Code

import os
from flask import *
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from urllib.parse import urlparse
import ipaddress
import re
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime
import requests

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import mysql.connector

db = mysql.connector.connect(user="root", password="", port='3306', database='phishing')
cur = db.cursor()

app = Flask(__name__)
app.secret_key = "fghhdfgdfgrthrttgdfsadfsaffgd"

app.config['upload_folder'] = r'uploads'
top_doms = pd.read_csv('top-1m.csv', header=None)

@app.route('/')
def home():
    return render_template('index.html')


@app.route('/login', methods=['POST', 'GET'])
def login():
    if request.method == 'POST':
        useremail = request.form['useremail']
        session['useremail'] = useremail
        userpassword = request.form['userpassword']
        # Parameterized query: form values are passed separately to avoid SQL injection.
        sql = "select count(*) from user where Email=%s and Password=%s"
        # cur.execute(sql)
        # data = cur.fetchall()
        # db.commit()
        x = pd.read_sql_query(sql, db, params=(useremail, userpassword))
        print(x)
        print('########################')
        count = x.values[0][0]

        if count == 0:
            msg = "user Credentials Are not valid"
            return render_template("login.html", name=msg)
        else:
            s = "select * from user where Email=%s and Password=%s"
            z = pd.read_sql_query(s, db, params=(useremail, userpassword))
            session['email'] = useremail
            pno = str(z.values[0][5])
            print(pno)
            name = str(z.values[0][1])
            print(name)
            session['pno'] = pno
            session['name'] = name
            return render_template("userhome.html", myname=name)
    return render_template('login.html')

@app.route('/registration', methods=["POST", "GET"])
def registration():
    if request.method == 'POST':
        username = request.form['username']
        useremail = request.form['useremail']
        userpassword = request.form['userpassword']
        conpassword = request.form['conpassword']
        Age = request.form['Age']

        contact = request.form['contact']
        if userpassword == conpassword:
            # Parameterized query: form values are passed separately to avoid SQL injection.
            sql = "select * from user where Email=%s and Password=%s"
            cur.execute(sql, (useremail, userpassword))
            data = cur.fetchall()
            db.commit()
            print(data)
            if data == []:
                sql = "insert into user(Name,Email,Password,Age,Mob) values(%s,%s,%s,%s,%s)"
                val = (username, useremail, userpassword, Age, contact)
                cur.execute(sql, val)
                db.commit()
                flash("Registered successfully", "success")
                return render_template("login.html")
            else:
                flash("Details are invalid", "warning")
                return render_template("registration.html")
        else:
            flash("Password doesn't match", "warning")
            return render_template("registration.html")
    return render_template('registration.html')


@app.route('/load_data', methods=["POST", "GET"])
def load_data():
    if request.method == "POST":
        file = request.files['file']
        filetype = os.path.splitext(file.filename)[1]
        print(filetype)
        if filetype == '.csv':
            mypath = os.path.join(app.config['upload_folder'], file.filename)
            file.save(mypath)
            return render_template('load_data.html', msg='success')
        else:
            return render_template('load_data.html', msg='invalid')
    return render_template('load_data.html')


@app.route('/view_data', methods=["POST", "GET"])
def view_data():
    path = os.listdir(app.config['upload_folder'])
    file = os.path.join(app.config['upload_folder'], path[0])
    df = pd.read_csv(file)
    return render_template('view_data.html', col_name=df.columns, row_val=list(df.values.tolist()))


@app.route('/model', methods=['GET', "POST"])
def model():
    # Accuracy scores are stored in module-level globals.
    global score1, score2, score3, score4, score5
    if request.method == "POST":
        model = int(request.form['selected'])
        print(model)
        path = os.listdir(app.config['upload_folder'])
        file = os.path.join(app.config['upload_folder'], path[0])
        df = pd.read_csv(file)

        print(df.columns)
        # Features are every column except the label, the domain name, and web traffic.
        X = df.drop(['Label', 'Domain', 'Web_Traffic'], axis=1)
        y = df.Label
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)

        print(df)
        if model == 1:
            from sklearn.ensemble import RandomForestClassifier
            rfr = RandomForestClassifier()
            rfr.fit(x_train, y_train)
            pred = rfr.predict(x_test)
            score1 = accuracy_score(y_test, pred) * 100
            print(score1)
            msg = 'The accuracy obtained by Random Forest Classifier is ' + str(score1) + '%'
            return render_template('model.html', msg=msg)
        elif model == 2:
            classifier = AdaBoostClassifier()
            classifier.fit(x_train, y_train)
            pred = classifier.predict(x_test)
            score2 = accuracy_score(y_test, pred) * 100
            print(score2)
            msg = 'The accuracy obtained by Ada Boost Classifier is ' + str(score2) + '%'
            return render_template('model.html', msg=msg)
        elif model == 3:
            from xgboost import XGBClassifier
            xgb = XGBClassifier()
            xgb.fit(x_train, y_train)
            pred = xgb.predict(x_test)
            score3 = accuracy_score(y_test, pred) * 100
            print(score3)
            msg = 'The accuracy obtained by XGBoost Classifier is ' + str(score3) + '%'
            return render_template('model.html', msg=msg)
        elif model == 4:
            cf = SVC(kernel='linear')
            cf.fit(x_train, y_train)
            pred = cf.predict(x_test)
            score4 = accuracy_score(y_test, pred) * 100
            print(score4)
            msg = 'The accuracy obtained by Support Vector Machine is ' + str(score4) + '%'
            return render_template('model.html', msg=msg)
        elif model == 5:
            gb = GradientBoostingClassifier()
            gb.fit(x_train, y_train)
            pred = gb.predict(x_test)
            score5 = accuracy_score(y_test, pred) * 100
            print(score5)
            msg = 'The accuracy obtained by Gradient Boosting Classifier is ' + str(score5) + '%'
            return render_template('model.html', msg=msg)
    return render_template('model.html')


@app.route('/prediction', methods=["POST", "GET"])
def prediction():
    if request.method == "POST":
        url1 = request.form['a']

        def getDomain(url):
            domain = urlparse(url).netloc
            if re.match(r"^www.", domain):
                domain = domain.replace("www.", "")
            return domain

        def havingIP(url):
            try:
                ipaddress.ip_address(url)
                ip = 1
            except ValueError:
                ip = 0
            return ip

        def haveAtSign(url):
            if "@" in url:
                at = 1
            else:
                at = 0
            return at

        def getLength(url):
            if len(url) < 54:
                length = 0
            else:
                length = 1
            return length

        def getDepth(url):
            s = urlparse(url).path.split('/')
            depth = 0
            for j in range(len(s)):
                if len(s[j]) != 0:
                    depth = depth + 1
            return depth

        def redirection(url):
            pos = url.rfind('//')
            if pos > 6:
                if pos > 7:
                    return 1
                else:
                    return 0
            else:
                return 0

References

[1] Anjali, S., Kumar, S., Singh, S. Machine learning techniques for detecting
phishing websites. International Journal of Computer Applications, 164(6),
14-18, 2017.
[2] Al-Rfou, R., Dawoud, A., Alafandi, A., Saad, M. Ensemble machine learning
for detecting phishing websites. Journal of Intelligent Fuzzy Systems, 37(4),
5587-5596, 2019.
[3] Santhosh Kumar, C. S., Sundararajan, E., Kalpana, R. Phishing website
detection using random forest and hybrid feature selection. International Journal
of Advanced Science and Technology, 28(3), 67-72, 2019.
[4] J. Shad and S. Sharma, "A Novel Machine Learning Approach to Detect Phishing
Websites," Jaypee Institute of Information Technology, pp. 425-430, 2018.
[5] Kumar, A., Singhal, M., Jain, M. Machine learning-based approach for detecting
phishing websites. In 2019 International Conference on Internet of Things Smart
Innovation and Usages (IoT-SIU) (pp. 1-6). IEEE, 2019.
[6] K. Shima et al., "Classification of URL bitstreams using bag of bytes," in 2018
21st Conference on Innovation in Clouds, Internet and Networks and Workshops
(ICIN), vol. 91, pp. 1-5, 2018.
[7] M. Karabatak and T. Mustafa, "Performance comparison of classifiers on
reduced phishing website dataset," 6th Int. Symp. Digit. Forensic Secur. ISDFS,
pp. 1-5, 2018.
[8] Naik, S. P., Patil, R. B., Jadhav, D. A. Detection of phishing websites using
CNN and feature extraction technique. International Conference on Computing
Methodologies and Communication (ICCMC) (pp. 285-289). IEEE, 2019.
[9] T. Peng, I. Harris, and Y. Sawa, "Detecting Phishing Attacks Using Natural
Language Processing and Machine Learning," Proc. - 12th IEEE Int. Conf. Semant.
Comput. ICSC, pp. 300-301, 2018.
[10] Y. Sonmez, T. Tuncer, H. Gokal, and E. Avci, "Phishing web sites feature
classification based on extreme learning machine," 6th Int. Symp. Digit. Forensic
Secur. ISDFS - Proceeding, vol. 2018–Janua, pp. 1-5, 2018.
