
Report on

Classification and Detection of Phishy Websites


submitted in partial fulfillment of the requirements for the completion of

Design Thinking Course (20EC22P01)

T NITIN KUMAR Roll No. 21R11A04R4


SYED IRFAN Roll No.21R11A04R3
P AKHILESH Roll No.21R11A04Q5

Department of Electronics and Communication Engineering


GEETHANJALI COLLEGE OF ENGINEERING AND TECHNOLOGY, (UGC AUTONOMOUS)
Cheeryal (V), Keesara (M), Medchal Dist, Hyderabad– 501 301
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad, accredited by NAAC ‘A+’ and NBA, New Delhi)

2022-2023

GEETHANJALI COLLEGE OF ENGINEERING AND TECHNOLOGY

Department of Electronics and Communication Engineering

CERTIFICATE

This is to certify that the report titled "CLASSIFICATION AND DETECTION OF PHISHY WEBSITES" is being submitted by T NITIN KUMAR, S IRFAN and P AKHILESH, bearing roll numbers 21R11A04R4, 21R11A04R3 and 21R11A04Q5 respectively, in partial fulfillment of the requirements for the completion of the Design Thinking course (20EC22P01).

Signature                                HoD-ECE
Name of in-charge faculty
Designation

CONTENTS

Page No.

Chapter 1. Introduction to Design Thinking 1

Chapter 2. Identifying the problem statement: Empathy Phase 6


2.1 Key aspects of empathy phase

2.2 Importance of the Empathy Phase

2.3 Empathy phase for phishing websites

2.4 Empathy Map 1: people facing problems with phishing websites

2.5 Empathy Map 2: people who want to classify the phishy websites

Chapter 3. Finalizing the problem statement: Define and Ideate Process 16


3.1 Define phase
3.1.2 Key Aspects of the Define Phase
3.1.3 Importance of the Define Phase
3.2 Ideate phase
3.2.2 Ideas generated

Chapter 4. Developing the solution: Prototype and Test 21


4.1 Key aspects of the prototype phase
4.2 Importance of the prototype phase
4.3 Prototype: software which predicts the phishy websites
4.4 Steps to use the application

Chapter 5. Results and Analysis 31


5.1 Website is predicted as legitimate or phishing

Chapter 6. Learning outcomes 36

Chapter 7. Power point presentation slides 37

References 51

ABSTRACT
With the deepening integration of the Internet into social life, the Internet is changing how people learn and work, while also exposing us to increasingly serious security attacks. How to recognize various network threats, especially attacks never seen before, is a key issue that needs to be addressed urgently.
The aim of phishing site URLs is to collect private information such as the user's identity, passwords and online financial transactions. Phishers use sites that are visually and semantically similar to authentic websites.
Since the majority of users go online to access services provided by the government and financial organizations, there has been a significant increase in phishing threats and attacks over recent years. As technology grows, phishing methods have started to evolve rapidly, and this should be countered by making use of anti-phishing techniques to detect phishing. Machine learning is a powerful tool that can be used to fight phishing attacks.
This study develops a model that can predict whether a URL link is legitimate or phishing. Cybersecurity practitioners are looking for trustworthy and stable techniques for detecting phishing websites. By extracting and evaluating numerous aspects of authentic and phishing URLs, this project uses machine learning to detect phishing URLs. In conclusion, the study provides a model for classifying URLs into phishing and legitimate URLs.
This would be very valuable in assisting individuals and companies in identifying phishing attacks by checking any link supplied to them to verify its validity.
Keywords: Phishing attacks, legitimate, trustworthy, machine learning, personal information, malicious links, phishing domain characteristics

Chapter 1. Introduction to Design Thinking

Design Thinking is a human-centered, iterative problem-solving approach that


empowers individuals and teams to tackle complex challenges and innovate effectively. It
originated from design methodologies but has since been adopted across various fields, such
as business, education, and social services.
At its core, Design Thinking involves empathizing with users to gain deep insights into
their needs, desires, and pain points. This empathetic understanding forms the foundation
of the entire process. The key stages of Design Thinking typically include the following.

Empathize: Design Thinkers immerse themselves in the users' experiences, aiming to understand their perspectives, motivations, and emotions. This empathetic approach helps uncover unmet needs and identify opportunities for improvement.

Define: In this stage, the insights gained from empathizing are synthesized to define the problem clearly. The focus shifts from understanding the users to framing the problem in a way that guides the subsequent stages of the process.

Ideate: The ideation stage encourages brainstorming and the exploration of diverse solutions. It encourages participants to think creatively, without judgment, and come up with a wide range of ideas to address the defined problem.

Prototype: Design Thinking emphasizes creating tangible representations of ideas. Prototypes can take various forms, from sketches and wireframes to physical models or interactive mock-ups. These prototypes are used to test and gather feedback from users.

Test and iterate: Design Thinking is characterized by its iterative nature, meaning that the process often involves going back and forth between stages as new insights are gained and ideas evolve. This flexible approach allows for continuous improvement and refinement of solutions.
Moreover, Design Thinking encourages collaboration and cross-disciplinary
teamwork, as diverse perspectives enrich the problem-solving process. By placing users at
the heart of the process, Design Thinking
enables the creation of products, services, or solutions that truly resonate with their
intended audience.

Design Thinking was popularized by design consultancy firms like IDEO and the Stanford
d.school. It draws inspiration from design processes but has been adapted and integrated
into various disciplines due to its effectiveness in fostering creativity and problem-solving.

The central tenet of Design Thinking is the focus on understanding and empathizing
with end-users. By putting users at the center of the process, designers can create solutions
that truly address their needs,
preferences, and pain points.
Design Thinking is not a linear process but rather iterative, meaning that it involves
continuous cycles of exploration, ideation, prototyping, and testing. Each iteration brings
the design closer to an optimal solution
through continuous refinement.

Beyond a mere process, Design Thinking embodies a mindset characterized by


curiosity, open- mindedness, and a willingness to learn from failure. It encourages
embracing ambiguity and seeing challenges
as opportunities for growth.

Design Thinking is versatile and applicable to a wide array of industries and sectors.
It has been
successfully used in product development, service design, business strategy, social
innovation, healthcare,
education, and more.
Visual tools, such as sketches, storyboards, and mind maps, play a significant role in
Design Thinking.
They help externalize ideas, facilitate communication, and promote a shared
understanding among team
members.
Various tools and methods are employed throughout the Design Thinking process, such as persona development, journey mapping, brainstorming techniques, rapid prototyping, and usability testing.

Design Thinking aligns with the principles of Human-Centered Design (HCD), which
emphasizes the importance of designing for the needs and experiences of people. HCD and
Design Thinking often go hand in hand in creating impactful solutions.

Design Thinking has been associated with numerous successful case studies where
innovative products,
services, or systems have been developed to address real-world problems. Its ability to
foster user-centricity has led to the creation of solutions that resonate with their intended
audiences.

As the landscape of challenges continues to evolve, Design Thinking remains a


powerful approach to
tackle complex problems, foster innovation, and drive positive change in diverse domains.
Its adaptability and
focus on human needs make it a valuable asset for individuals, teams, and organizations
seeking to create
meaningful impact.

Design Thinking is closely linked to the innovation process. It provides a structured


approach to generate
innovative ideas and solutions by encouraging a deep understanding of users and their
contexts. By challenging
assumptions and exploring new possibilities, Design Thinking fosters breakthrough
innovations.
Many successful companies have embraced Design Thinking as a strategic tool to
drive innovation and
improve customer experiences. By integrating Design Thinking into their processes,
businesses can stay
customer-focused, identify new market opportunities, and remain competitive in dynamic
environments.

Fig:1.1-steps of design thinking

Design Thinking and User Experience (UX) design are closely intertwined. UX design
applies Design
Thinking principles to create intuitive, enjoyable, and user-friendly products and services.
The iterative nature
of Design Thinking aligns well with the continuous improvement cycle that
characterizes UX design.
Design Thinking and Agile methodologies share some similarities, such as their
iterative nature and
emphasis on user feedback. Both approaches aim to deliver value early and often, but they
differ in their core
focus. Design Thinking concentrates on problem-solving and user empathy, while Agile
focuses on project
management and software development.
Design Thinking is not limited to business and product development; it has also been
effectively used in
social innovation and addressing complex societal challenges. Nonprofits,
governments, and social enterprises
leverage Design Thinking to create impactful solutions for issues like poverty,
healthcare, education, and
sustainability.

Despite its benefits, Design Thinking also faces some challenges. Ensuring effective
collaboration within
diverse teams, maintaining the right balance between creativity and feasibility, and
avoiding "design for
design's sake" are some of the common challenges designers encounter.
As Design Thinking gains popularity, many institutions and organizations offer Design
Thinking courses
and certifications. These programs equip individuals with the knowledge and skills to apply
the methodology in
various settings.

Designers using the Design Thinking approach should also consider ethical
implications. Understanding
the potential consequences of a design solution on different stakeholders is crucial to ensure
that the outcome
aligns with ethical standards and avoids unintended negative impacts.

Design Thinking has found its way into educational settings, transforming the way
students learn and
solve problems. It encourages active learning, critical thinking, and creativity, preparing
students to become
adaptable problem-solvers in the real world.

Chapter 2. Identifying the problem statement: Empathy Phase

The Empathy phase is one of the foundational stages of the Design Thinking
process. During this phase, designers and problem solvers aim to deeply
understand the perspectives, needs, desires, and challenges of the users or
stakeholders for whom they are designing a solution. It involves putting aside
preconceptions and immersing oneself in the users' experiences to gain valuable
insights that will guide the rest of the design
process.

Fig:2.1-spectrum of empathy

Key Aspects of the Empathy Phase:

1. Observation: Designers observe users in their natural environment, paying close
attention to their behaviors, actions, and interactions. This helps identify patterns
and understand how users currently approach and deal with the problem at hand.
2. Interviewing: Engaging in one-on-one interviews with users allows
designers to delve deeper into their thoughts, feelings, and motivations. Open-
ended questions encourage users to express their needs and preferences, leading
to more profound insights.
3. Empathetic Listening: Empathy is at the core of this phase. Designers
actively listen to users without judgment, seeking to understand their emotions
and perspectives. By putting themselves in the users' shoes, designers can develop
a more comprehensive understanding of their experiences.
4. Building Empathy Tools: Designers often create empathy tools, such as
empathy maps and personas, to synthesize and visualize the collected user
insights. Empathy maps help in organizing observations, emotions,
thoughts, and pain points, while personas represent fictional characters embodying
specific user characteristics.

Fig:2.2-empathy phase

Importance of the Empathy Phase:

1. Human-Centered Approach: The Empathy phase ensures a human-


centered perspective, putting the needs and experiences of users at the forefront
of the design process. By understanding users deeply, designers can create
solutions that resonate with their intended audience.
2. Uncovering Unmet Needs: Empathy allows designers to uncover unmet

needs and pain points that users might not articulate explicitly. These hidden
insights can lead to innovative solutions that address real and meaningful problems.
3. Building Empathy in the Team: By actively engaging in empathetic
research, the design team cultivates a shared understanding and empathy for
the users. This shared empathy lays the foundation for collaborative problem-
solving and creative idea generation.
4. Enhancing Creativity: Empathy provides designers with a rich pool of
experiences and emotions to draw upon during the subsequent stages of the
Design Thinking process, fueling creativity and ideation. The Empathy phase sets
the stage for a successful Design Thinking process. By gaining a profound
understanding of users, designers can define the problem more accurately and
develop solutions that are tailored to meet real user needs, resulting in more
impactful and user-centric outcomes.

CHAPTER 2

LITERATURE SURVEY

2.1 Overview of the Study

This chapter offers an insight into various important studies conducted by scholars, drawn from articles, books, and other sources relevant to the detection of phishing websites. It also provides the project with a theoretical review, a conceptual review, and an empirical review to demonstrate understanding of the project.

2.2 Literature Survey

A literature survey is an insightful review that presents the existing information, including significant findings as well as theoretical and methodological contributions to a specific topic.

The surveyed papers are summarised below, giving for each the title, the method or technique used, the publication year, and the limitations.

1. "OFS-NN: An Effective Phishing Websites Detection Model Based on Optimal Feature Selection and Neural Network" (2019)
   Method: The proposed method has three stages: it defines a new index, designs an optimal feature selection algorithm, and produces the OFS-NN model.
   Limitation: The continuous growth of features that are sensitive to phishing attacks requires the collection of more features for the OFS.

2. "Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection" (2019)
   Method: The proposed method uses Fuzzy Rough Set (FRS) theory to identify the features. The decision boundary is defined by lower and upper approximation regions; using the lower and upper approximation memberships, a set member is assigned to the category it belongs to.
   Limitation: The specific features used in the method are not specified.

3. "Phishing Website Detection based on Multidimensional Features driven by Deep Learning" (2019)
   Method: The proposed method has the following stages: 1) character sequence features of the URL are extracted and used for fast classification; 2) an LSTM (long short-term memory) network is used to capture context semantics and dependency features of URL character sequences; 3) softmax classifies the extracted features.
   Limitation: It requires more computation and is therefore an expensive method.

4. "WC-PAD: Web Crawling based Phishing Attack Detection" (2019)
   Method: A three-phase phishing attack detection approach. The three phases of WC-PAD are 1) a DNS blacklist, 2) a heuristics-based approach, and 3) a web-crawler-based approach. Both feature extraction and phishing attack detection make use of the web crawler.
   Limitation: Time-consuming, as each website has to go through all three phases.

5. "Phishing URL Detection via CNN and Attention-Based Hierarchical RNN" (2019)
   Method: An NN module is used to derive character-level spatial feature representations of the URLs. The representational features are then combined using a three-layer CNN to create precise feature representations of URLs, which are used for training the phishing URL classifier.
   Limitation: The false positive rate is high.

6. "An Adaptive Machine Learning Based Approach for Phishing Detection Using Hybrid Features" (2020)
   Method: A phishing detection system developed using the XCS machine learning classifier, an online adaptive ML technique that evolves a set of rules called classifiers. This model derives 38 features from webpage source code and URLs.

7. "A new method for Detection of Phishing Websites: URL Detection" (2020)
   Method: The three major phases in this work are parsing, heuristic classification of data, and performance analysis. All of these phases use various distinctive methods for data processing to obtain better results.
   Limitation: Does not give full information about the techniques used.

Table 2.1: Literature Survey.

From the above, ML methods play a vital role in many applications of cybersecurity and remain an encouraging path that attracts further investigation. In practice, however, there are several barriers that limit implementations. As discussed, many approaches have been proposed earlier for detecting phishing website attacks, and each has its own limitations. Therefore, the aim of this project is the detection of phishing website attacks using a machine learning technique.
2.3 Analysis of Existing System

The existing phishing detection techniques suffer from low detection accuracy and high false alarm rates, especially when new phishing approaches are introduced. Moreover, the most commonly used technique is the blacklist-based method, which is inefficient in responding to emerging phishing attacks: since registering a new domain has become easier, no comprehensive blacklist can ensure a perfectly up-to-date database for phishing detection.
2.4 Proposed System

The proposed phishing detection system utilizes machine learning models. The system comprises two major parts: the machine learning models and a web application. The models include Decision Tree, Support Vector Machine, XGBoost, Logistic Regression, and Random Forest. These models were selected after comparing the performance of multiple machine learning algorithms. Each of these models is trained and tested on website content-based features extracted from both phishing and legitimate datasets. The model with the highest accuracy is then selected and integrated into a web application that enables a user to predict whether a URL link is phishing or legitimate.
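
As a rough illustration of this comparison and selection step, the following sketch trains a few scikit-learn classifiers on a labelled dataset and keeps the one with the highest test accuracy. The file name "phishing_dataset.csv" and the "Result" label column are assumptions about the dataset layout, and XGBoost is omitted here for brevity.

    # Hedged sketch of the model-comparison step (file name and column names are assumptions).
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    data = pd.read_csv("phishing_dataset.csv")        # hypothetical file name
    X = data.drop(columns=["Result"])                 # the extracted feature columns
    y = data["Result"]                                # class label column

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    models = {
        "Decision Tree": DecisionTreeClassifier(),
        "SVM": SVC(kernel="rbf"),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100),
    }

    best_name, best_acc = None, 0.0
    for name, model in models.items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: {acc:.4f}")
        if acc > best_acc:
            best_name, best_acc = name, acc

    print("Selected model:", best_name)

On real data, cross-validation rather than a single split would give a more reliable comparison.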
2.5 Benefits of the new system

i. It will be able to differentiate between phishing (0) and legitimate (1) URLs.
ii. It will help reduce phishing data breaches for an organization.
iii. It will be helpful to individuals and organizations.
iv. It is easy to use.
2.6 Summary

In this chapter we mainly focused on the existing systems through a literature survey; various research papers were analysed, and some important points of each paper were highlighted along with related diagrams or graphs. In the comparison we mainly highlighted a few important advantages and disadvantages of each paper and compared the papers with one another. This chapter also introduces the drawbacks of the existing system, the functionality of the proposed system, and its advantages.

CHAPTER 3

ANALYSIS

3.1 Overview of System Analysis

This chapter describes the various processes, methods, and procedures adopted by the researchers to achieve the set aim and objectives, and the conceptual structure within which the research was conducted. The methodology of any research work refers to the research approach adopted to tackle the stated problem. Since the efficiency and maintainability of any application are solely dependent on how its designs are prepared, this chapter provides detailed descriptions of the methods employed to proffer solutions to the stated objectives of the research work.
According to the Merriam-Webster dictionary (11th.Ed), system
analysis is "the process of studying a procedure or business to
identify its goals and purposes and create systems and procedures
that will efficiently achieve them". It is also the act, process, or
profession of studying an activity (such as a procedure, a business, or
a physiological function) typically by mathematical means to define
its goals or purposes and to discover operations and procedures for
accomplishing them most efficiently. System analysis is used in every
field where the development of something is done. Before planning
and development, you need to understand the old systems thoroughly
and use the knowledge to determine how best your new system can
work.
In ML and statistics, classification is a supervised learning approach in which a computer program learns from input data and then uses this learning to classify new observations. A few classification techniques used in the detection of phishing URLs are described later in this chapter.
3.2 Software Requirement Specification

3.2.1 Installation Requirements


The hardware (physical components of a computer system that can
be seen, touched, or felt) and software (both system software and the
application software installed and used in the system development)
tools needed to satisfy these objectives highlighted

Dept. of ECE,
below:

3.2.2 Hardware Requirements:

1. Processor (CPU) - Intel Pentium Dual Core or higher
2. Hard disk capacity - 512 MB minimum free space
3. RAM - 4 GB minimum
3.2.3 Software requirements

1. Programming language – Python
2. Operating system – Windows 8.1 or above
3. IDE – Anaconda or PyCharm
4. Python version 3.x
3.3 Other Non-Functional Requirements
A non-functional requirement is a specification that describes the system's operational capabilities and constraints and improves its usefulness. Some of them are as follows:

Reusability: The same code, with limited changes, can be used for detecting phishing-attack variants like smishing, vishing, etc.

Maintainability: The implementation is very basic and includes print statements that make it easy to debug.

Usability: The software used is very user friendly and open source. It also runs on any operating system.

Scalability: The implementation can be extended to include detection of vishing, smishing, etc.

3.4 System Architecture

Figure 3.1: System Architecture

The architecture of the system is as shown in Fig 3.1; the URLs to be classified as legitimate or phishing are fed as input to the appropriate classifier. The classifier, which has been trained on the training dataset to classify URLs as phishing or legitimate, uses the patterns it recognized to classify the newly fed input. Features such as the IP address, URL length, domain, presence of a favicon, etc. are extracted from the URL and a list of their values is generated. The list is fed to classifiers such as KNN, kernel SVM, Decision Tree and Random Forest. These models' performance is then evaluated and an accuracy score is generated. The trained classifier uses the generated list to predict whether the URL is legitimate or phishing. The list contains the values 1, 0 and -1 depending on whether a feature exists, is not applicable, or does not exist, respectively.
There are 30 features considered in this project.
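
For illustration only, the short snippet below shows what such an encoded list might look like for a handful of features; the feature names and values shown are hypothetical examples rather than actual extraction output.

    # Hypothetical example of the 1 / 0 / -1 encoding for a few of the 30 features
    # (feature names and values are illustrative, not actual extraction output).
    feature_names = ["having_IP_Address", "URL_Length", "Favicon", "SSLfinal_State"]
    feature_vector = [1, -1, 1, 0]   # 1 = feature exists, 0 = not applicable, -1 = feature does not exist

    for name, value in zip(feature_names, feature_vector):
        print(f"{name:>20}: {value}")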
3.5 Supporting Python modules

Python provides a way to place definitions in a file and use them in a script or in an interactive instance of the interpreter. Such a file is known as a module; definitions from a module can be imported into other modules or into the main module.
Some of the modules used in the project are as shown in Table 3.1

S.No  Python Module    Description

1     ipaddress        Provides the ability to create, manipulate and operate on IPv4 and IPv6 addresses and networks.

2     re               Provides regular expression matching operations similar to those found in Perl.

3     urllib.request   Defines functions and classes which help in opening URLs (mostly HTTP) in a complex world.

4     BeautifulSoup    A Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract information from HTML, which is useful for web scraping.

5     socket           Gives access to the BSD socket interface.

6     requests         Allows HTTP requests to be sent using Python.

7     whois            WHOIS is a query and response protocol that is widely used for querying databases that store the registered users or assignees of an Internet resource, such as a domain name, an autonomous system or an IP address block, as well as a broader range of information.

Table 3.1: Supporting Python modules
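
As a brief, hedged illustration of how these modules fit together, the sketch below gathers a few raw pieces of information about a URL (IP usage, resolved address, page title, WHOIS creation date); error handling is minimal, and the exact features derived from such values follow the dataset's definitions.

    # Rough sketch: using the supporting modules to gather raw information about a URL.
    import re
    import socket
    import ipaddress
    from urllib.parse import urlparse

    import requests
    import whois                      # pip install python-whois
    from bs4 import BeautifulSoup     # pip install beautifulsoup4

    def inspect_url(url):
        domain = urlparse(url).netloc

        # Does the URL use a raw IP address instead of a domain name?
        try:
            ipaddress.ip_address(domain)
            uses_ip = True
        except ValueError:
            uses_ip = False

        # Resolve the domain and fetch the page title.
        try:
            ip = socket.gethostbyname(domain)
        except socket.error:
            ip = None
        try:
            soup = BeautifulSoup(requests.get(url, timeout=5).text, "html.parser")
            title = soup.title.string if soup.title else ""
        except requests.RequestException:
            title = ""

        # WHOIS registration details (age of domain, registrar, etc.).
        try:
            created = whois.whois(domain).creation_date
        except Exception:
            created = None

        return {"domain": domain, "uses_ip": uses_ip, "resolved_ip": ip,
                "title": title, "creation_date": created,
                "has_at_symbol": bool(re.search(r"@", url)),
                "url_length": len(url)}

    print(inspect_url("https://www.example.com/login"))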

3.6 Machine learning models

The machine learning model is nothing but a piece of code; an


engineer or data scientist makes it smart through training with data.
So, if you give garbage to the model, you will get garbage in return,
i.e. the trained model will provide false or wrong predictions.

3.6.1 Supervised learning


Supervised learning, in the context of artificial intelligence (AI) and machine learning, is a type of system in which both input and desired output data are provided. Input and output data are labelled for classification to provide a learning basis for future data processing. Supervised learning models have some benefits over the unsupervised approach, but they also have limitations. The systems are more likely to make decisions that humans can relate to, for example, because humans have provided the basis for those decisions. However, in the case of a retrieval-based method, supervised learning systems have difficulty dealing with new information. Supervised learning is categorized into two further categories, "Classification" and "Regression".
A classification problem is when the target variable is categorical (i.e. the output can be classified into classes — it belongs to either Class A or B or something else), while a regression problem is when the target variable is continuous.

3.6.2 Logistic Regression


Logistic Regression is a classification model that is used when the dependent variable (output) is binary, such as 0 (False) or 1 (True). This makes logistic regression a good algorithm for the purpose of our work of predicting whether a URL is a phishing URL (1) or not (0), as in the case of this project.

Logistic Regression is an extension of the Linear Regression model.


Let us understand this with a simple example. If we want to classify whether an email is spam or not and we apply a Linear Regression model, we would get only continuous values such as 0.4, 0.7, etc. Logistic Regression extends this linear regression model by setting a threshold at 0.5: the data point is classified as spam if the output value is greater than 0.5 and not spam if the output value is less than 0.5. In this way, we can apply Logistic Regression to classification problems and get accurate predictions.

The logistic function, also called the sigmoid function, was initially used by statisticians to describe properties of population growth in ecology. The sigmoid function is a mathematical function used to map predicted values to probabilities. It has an S-shaped curve and can take values between 0 and 1, but never exactly at those limits.
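
A small sketch of this idea is given below, using NumPy for the sigmoid and scikit-learn's LogisticRegression with the 0.5 threshold; the synthetic data is only a stand-in for the 30-feature phishing dataset.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    def sigmoid(z):
        # Maps any real-valued score to a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(np.array([-4.0, 0.0, 4.0])))    # ~[0.018, 0.5, 0.982]

    # Synthetic stand-in for the 30-feature phishing dataset (illustration only).
    X, y = make_classification(n_samples=200, n_features=30, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    probs = clf.predict_proba(X)[:, 1]            # probability of class 1
    labels = (probs > 0.5).astype(int)            # apply the 0.5 threshold described above
    print("Agreement with clf.predict:", np.mean(labels == clf.predict(X)))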

3.6.3 Random Forest Algorithm

The random forest algorithm is one of the most powerful algorithms in machine learning and is based on the concept of the decision tree algorithm. The random forest algorithm creates a forest of a number of decision trees; a higher number of trees gives higher detection accuracy. Trees are created using the bootstrap method, in which features and samples of the dataset are randomly selected with replacement to construct a single tree.
Among the randomly selected features, the random forest algorithm chooses the best splitter for the classification and, like the decision tree algorithm, uses the Gini index and information gain methods to find the best splitter. This process continues until the random forest has created n trees.
Each tree in the forest predicts the target value, and the algorithm then counts the votes for each predicted target. Finally, the random forest algorithm takes the most-voted predicted target as the final prediction.

3.6.4 Support Vector Machine Algorithm


The support vector machine is another powerful algorithm in machine learning. In the support vector machine algorithm, each data item is plotted as a point in n-dimensional space, and the algorithm constructs a separating line for the classification of two classes; this separating line is known as the hyperplane.
The support vector machine seeks the closest points, called support vectors, and once it finds them it draws a line connecting them. The support vector machine then constructs a separating line which bisects and is perpendicular to the connecting line. In order to classify data well, the margin should be as large as possible, where the margin is the distance between the hyperplane and the support vectors. In real scenarios it is not always possible to separate complex and non-linear data; to solve this problem, the support vector machine uses the kernel trick, which transforms the lower-dimensional space into a higher-dimensional space.
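
A short sketch of this, again on synthetic stand-in data, is shown below; the RBF kernel illustrates the kernel trick described above.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic stand-in data (illustration only).
    X, y = make_classification(n_samples=500, n_features=30, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # kernel="rbf" applies the kernel trick: data is implicitly mapped to a
    # higher-dimensional space where a separating hyperplane can be found.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    svm.fit(X_train, y_train)
    print("Kernel SVM accuracy:", svm.score(X_test, y_test))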

CHAPTER 4

DESIGN

4.1 System Modelling

System modelling involves the process of developing an abstract model of a system, with each model presenting a different view or perspective of the system. It is the process of representing a system using various graphical notations that show how users will interact with the system and how certain parts of the system function. The proposed system was modelled using the following diagrams:
i. Architecture diagram
ii. Use case diagram
iii. Flowcharts
The proposed system will be implemented using the Python programming language along with different machine learning models and libraries such as pandas, scikit-learn, python-whois, BeautifulSoup, NumPy, seaborn, matplotlib, etc.

4.2 UML Activity Diagram

An activity diagram is a behavioural diagram. Fig 4.1 shows the activity diagram of the system. It depicts the control flow from a start point to an end point, showing the various paths which exist during the execution of the activity.

Figure 4.1: UML activity diagram

4.3 Data Flow Diagrams

DFDs are used to graphically depict the data flow in a system. They explain the processes involved in a system from the input to the report generation and show all possible paths from one entity to another in a system. The detail of a data flow diagram can be represented at three different levels, numbered 0, 1 and 2. There are several notations for drawing a data flow diagram, among which the Yourdon-Coad and Gane-Sarson methods are popular. The DFDs depicted in this chapter use the Gane-Sarson notation.

4.3.1 Data Flow Diagram – Level 0

DFD level 0 is called a Context Diagram. It is a simple overview of


the whole system being modeled. Fig 4.2 shows the DFD level 0 of
the system.

Figure 4.2: DFD - level 0

It shows the system as a high-level process with its relationships to the external entities. It should be easily understood by a wide range of audiences, from stakeholders to developers to data analysts.

4.3.2 Data Flow Diagram – Level 1

DFD level 1 gives a more detailed explanation of the Context


diagram. The high-level process of the Context diagram is broken
down into its subprocesses. The DFD level
1 of the system is depicted in fig 4.3

Figure 4.3: DFD - level 1

The Level 1 DFD goes a step deeper by including the processes involved in the system, such as feature extraction, splitting of the dataset, building the classifier, etc., and hence gives a more detailed view of the system.

4.3.3 Data Flow Diagram – Level 2

DFD level 2 goes one more step deeper, into the subprocesses of Level 1. Fig 4.4 shows the DFD level 2 of the system. It may require more text to reach the necessary level of detail about the functioning of the system. Level 2 gives a more detailed view of the system by categorizing the processes involved into three categories, namely preprocessing, feature scaling and classification. It also graphically depicts each of these categories in detail and gives a complete idea of how the system works.

Figure 4.4: DFD - level 2

4.4 Summary
The system's architecture, the processes involved from input to output at varying levels of detail, and the system's behaviour are graphically represented in this chapter for a better understanding of the system.
CHAPTER 5

IMPLEMENTATION

This chapter illustrates the approach employed to classify the URLs


as either phishing or legitimate. The methodology involves building a
training set. The training set is used for training a machine learning
model, i.e., the classifier. Fig 5.1 shows the diagrammatic
representation of the implementation.

Figure 5.1: Implementation

5.1 Technology Used
PYTHON

In technical terms, Python is an object-oriented, high-level


programming language with integrated dynamic semantics primarily
for web and app development. It is extremely attractive in the field of
Rapid Application Development because it offers dynamic typing and
dynamic binding options.
Python is relatively simple, so it is easy to learn, and its syntax focuses on readability. Developers can read and translate Python code much more easily than code in many other languages. In turn, this reduces the cost of program maintenance and development because it allows teams to work collaboratively without significant language and experience barriers.
Additionally, Python supports the use of modules and packages,
which means that programs can be designed in a modular style and
code can be reused across a variety of projects. Once you have
developed a module or package you need, it can be scaled for use in
other projects, and it’s easy to import these modules.
One of the most promising benefits of Python is that both the
standard library and the interpreter are available free of charge, in
both binary and source form. There is no exclusivity either, as
Python and all the necessary tools are available on all major
platforms. Therefore, it is an enticing option for developers who
don't want to worry about paying high development costs.
That makes Python accessible to almost anyone. If you have the time
to learn, you can create some amazing things with the language.

MACHINE LEARNING

Machine learning provides simplified and efficient methods for data


analysis. It has indicated promising outcomes in real time
classification problems recently. The key advantage of machine
learning is the ability to create flexible models for specific tasks like
phishing detection. Since phishing is a classification problem, Machine
learning models can be used as a powerful tool. Machine learning
models could adapt to changes quickly to identify patterns of
fraudulent transactions that help to develop a learning-based
identification system. Most of the machine learning models discussed
here are classified as supervised machine learning, this is where an
algorithm tries to learn a function that maps an input to an output
based on example input-output pairs. It infers a function from
labeled training data consisting of a set of training examples.

PANDAS

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools using its powerful data structures. The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data. In 2008, developer Wes McKinney started developing pandas when in need of a high-performance, flexible tool for data analysis. Prior to Pandas, Python was mainly used for data munging and preparation; it had very little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics, and analytics.
NUMPY

NumPy is a Python package. It stands for "Numerical Python". It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was also developed, having some additional functionalities. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into the Numeric package. There are many contributors to this open-source project.
Using NumPy, a developer can perform the following operations:
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra. NumPy has in-built functions for linear algebra and random number generation.
5.2 Flowchart of the system

A flowchart is a diagram that depicts a process, system, or computer


algorithm. It is a graphical representation of the steps that are to be
performed in a system and shows the steps in sequential order. It is
used in presenting the flow of algorithms and to communicate
complex processes in clear, easy-to-understand diagrams.
Figure 5.2 shows the flow of phishing detection systems using the machine
learning process.
Figure 5.3 shows the phishing detection web interface system. The
user inputs a URL link and the website validates the format of the
URL and then predicts if the link is phishing or legitimate.

Figure 5.2 Flowchart of the proposed System

Figure 5.3 Flowchart of the web interface

5.3 Dataset

The first step of the research work was determining the right dataset. The dataset selected for this task was collected from Kaggle. There are several reasons behind selecting this dataset:
1. The dataset is large, so working with it is intriguing.
2. The number of features in the dataset is 30, giving a wide range of features and making the predictions a little more accurate.
3. The number of URLs is quite evenly distributed between the two categories.
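
One possible way to load and inspect such a dataset with pandas is sketched below; the file name and the label column name are assumptions about the downloaded dataset's layout.

    import pandas as pd

    # Assumed file name and label column for the downloaded Kaggle dataset.
    data = pd.read_csv("phishing.csv")

    print(data.shape)                      # expected: rows x (30 feature columns + 1 label column)
    print(data.head())
    print(data["Result"].value_counts())   # check that the two classes are roughly balanced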

5.4 Process involved in implementation

5.4.1 Splitting:

The dataset was split into a training set and a testing set, with 75% of the data used for training and 25% for testing, using the "train_test_split" method. The splitting was done after assigning the dependent and independent variables.
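
In scikit-learn terms, this splitting step might look like the following sketch (file and column names are assumptions, as above).

    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("phishing.csv")     # assumed file name, as above
    X = data.drop(columns=["Result"])      # independent variables (the 30 features)
    y = data["Result"]                     # dependent variable (class label)

    # 75% of the rows for training, 25% for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    print(len(X_train), "training rows,", len(X_test), "testing rows")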
5.4.2 Preprocessing:

Preprocessing involves filling in or removing missing data to obtain a clean dataset. However, the dataset chosen was already preprocessed and did not require any further preprocessing on our end. The only step to be performed in preprocessing was feature scaling.
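
Feature scaling, where needed, could be done with scikit-learn's StandardScaler, as in the minimal sketch below; tiny arrays are used so that the snippet runs on its own.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # X_train / X_test would come from the split above; small arrays are used here
    # purely for illustration.
    X_train = np.array([[10.0, 200.0], [12.0, 180.0], [8.0, 220.0]])
    X_test = np.array([[11.0, 190.0]])

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on training data only
    X_test_scaled = scaler.transform(X_test)         # reuse the same scaling for test data
    print(X_train_scaled)
    print(X_test_scaled)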

5.4.3 Feature extraction

Feature values are extracted using Python modules like whois, requests, socket, re, ipaddress, BeautifulSoup, etc., to get information regarding the IP address, length of the URL, domain name, subdomains, presence of a favicon, and so on. The values obtained are stored in a list. This is done because the dataset is in this format, and hence the classifier is trained with input of this format.
Therefore, when a URL is passed as input to the system, it is converted into a Python list of 30 elements, each representing its respective feature, and that list is then fed to the trained classifier. The classifiers used include KNN, kernel SVM, Decision Tree and Random Forest.
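
Conceptually, the prediction step then reduces to something like the sketch below, where extract_features is a placeholder for the project's 30-feature extraction routine and the trained classifier is assumed to be available.

    # Conceptual sketch of the prediction step; extract_features is a placeholder
    # for the project's actual 30-feature extraction routine.
    def extract_features(url):
        # ... whois / requests / BeautifulSoup / re / socket / ipaddress based checks ...
        return [1, -1, 0] + [1] * 27       # dummy 30-element vector, values in {1, 0, -1}

    def classify_url(url, trained_classifier):
        features = extract_features(url)
        prediction = trained_classifier.predict([features])[0]
        return "legitimate" if prediction == 1 else "phishing"

    # Usage, assuming `model` is one of the trained classifiers:
    # print(classify_url("https://www.example.com/login", model))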
5.5 General Working of the System

A one-page phishing detection web application has been developed to run on any browser. The application was developed using HTML, CSS, PHP, and JavaScript. The phishing detection web application has the following pages:

5.5.1 The home page

The home page contains a section for a user to enter a URL and predict whether it is phishing or legitimate. It predicts the state of the URL based on the selected features. The purpose of this page is to help users validate a URL link.

5.5.2 FAQs Page

The FAQs page contains a series of questions and answers about phishing attacks and how users can protect themselves from phishing sites.

Figure 5.4 The Home Page

Figure 5.5 Code for the web application

5.6 Summary

This chapter discusses the working of the system through the proposed system architecture. The flowcharts show the working of the proposed system, and the software requirement specification is given. The project is also explained through the architecture of the proposed system.

CHAPTER 6
RESULTS AND DISCUSSIONS

6.1 Table and Graphs of Results

Table 6.1 performance of the proposed system

Figure 6.1 Graph of accuracy
6.2 Results comparison and graphs
The phishing website classification model is generated by implementing the Random Forest, Logistic Regression and Support Vector Machine algorithms. The goal of this project is to compare the performance of the different classifiers and find the best approach for classifying phishing and non-phishing websites. These algorithms were implemented in Python.

Table 6.2 Accuracy classification

CHAPTER 7

TESTING AND VALIDATION


System testing is actually a series of different tests whose primary purpose is to fully exercise the computer-based system. Although each test has a different purpose, all work to verify that all the system elements have been properly integrated and perform their allocated functions. The testing process is carried out to make sure that the web application does exactly what it is supposed to do. In the testing stage, the following goals are pursued:
● To affirm the quality of the project.
● To find and eliminate any residual errors from previous stages.
● To validate the software as a solution to the original problem.
● To provide operational reliability of the system.

In this chapter, we check for the working of the proposed system by


testing and comparing the result of the algorithm and the actual
result. It is basically validating the system. The testing is done for
each algorithm with a legitimate and phishing URL and the results
are as follows.

7.1 Testing Types

7.1.1 Unit Testing

Unit testing, also known as component testing, refers to tests that verify the functionality of a specific section of code, usually at the function level. In an object-oriented environment, this is usually at the class level, and the minimal unit tests include the constructors and destructors. Unit testing is a software development process that involves the synchronized application of a broad spectrum of defect prevention and detection strategies in order to reduce software development risks, time and costs.
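
As an illustrative example, a unit test for a (hypothetical) extract_features helper might be written with Python's unittest module as follows; the module and function names are assumptions.

    import unittest

    # `extract_features` is assumed to be the project's feature-extraction helper;
    # the module name used here is hypothetical.
    from feature_extraction import extract_features


    class TestFeatureExtraction(unittest.TestCase):
        def test_feature_vector_length(self):
            # The extractor should always return the 30 features the classifiers expect.
            features = extract_features("https://www.example.com/")
            self.assertEqual(len(features), 30)

        def test_feature_values_are_valid(self):
            # Every feature must be encoded as 1, 0 or -1.
            features = extract_features("https://www.example.com/")
            self.assertTrue(all(value in (1, 0, -1) for value in features))


    if __name__ == "__main__":
        unittest.main()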

7.1.2 Validation Testing

At the culmination of integration testing, the software is completely assembled as a package, and interfacing errors have been uncovered and corrected. Validation testing can be defined in many ways; here, the testing validates that the software functions in a manner that is reasonably expected by the customer. In software project management, software testing, and software engineering, verification and validation (V&V) is the process of checking that a software system meets specifications and that it fulfills its intended purpose. It may also be referred to as software quality control.
7.1.3 Functional Testing

Functional testing is a type of testing that seeks to establish whether


each application feature works as per the software requirements.
Each function is compared to the corresponding requirement to
ascertain whether its output is consistent with the end user’s
expectations. The testing is done by providing sample inputs, capturing
resulting outputs, and verifying that actual outputs are the same as
expected outputs.

7.1.4 Integration Testing

Integration testing is any type of software testing that seeks to verify


the interfaces between components against a software design.
Software components may be integrated in an iterative way or all
together. Normally the former is considered a better practice since it
allows interface issues to be located more quickly and fixed.
Integration testing works to expose defects in the interfaces and
interaction between integrated components (modules).
Progressively larger groups of tested software components
corresponding to elements of the architectural design are integrated
and tested until the software works as a system.
7.2 Test Cases

S.No  Input URL                          Expected Output          Actual Output            Remarks

1     HTTPS://WWW.MLRITM.AC.IN/          Legitimate               Legitimate               Success
2     https://www.2498.b.hostable.me/    Phishing                 Phishing                 Success
3     HTTPS://FACEBOOK.COM               Legitimate               Legitimate               Success
4     WWW.FACEBOOK.COM                   Please input full URL    Please input full URL    Success

Table 7.1: Test Cases Table

7.3 Summary

This chapter discusses the importance of testing and the various methods that are used to test the model built. This helps us to understand the performance of the system and make the necessary changes accordingly.
CHAPTER 8

CONCLUSION AND FUTURE SCOPE

8.1 Conclusion

The act of phishing is becoming an advanced threat to this rapidly developing world of technology. Today, every nation is focusing on cashless transactions, online business, paperless tickets and so on to keep pace with the growing world, yet phishing is becoming an impediment to this advancement, and people no longer feel that the web is dependable. It is possible to utilize machine learning to process data and build powerful data products. A lay person, completely unaware of how to recognize a security threat, would never risk making financial transactions on the web. Phishers are targeting the payment industry and cloud services the most.
The project aims to investigate this area by demonstrating a use case of recognizing phishing websites using ML. It aimed to build a phishing detection mechanism, using machine learning tools and techniques, which is efficient, accurate and cost effective. The project was carried out in the Anaconda IDE and was written in Python. The proposed method used four machine learning classifiers to achieve this, and a comparative study of the four algorithms was made. A good accuracy score was also achieved.
8.2 Future Enhancement

Further work can be done to enhance the model by using ensemble models to achieve a greater accuracy score. Ensemble methods are a machine learning technique that combines many base models to generate an optimal predictive model. Further-reaching future work would be to combine multiple classifiers, trained on different aspects of the same training set, into a single classifier that may provide a more robust prediction than any of the single classifiers on their own.
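
One concrete way to realize this, sketched below under the assumption of a scikit-learn workflow, is a VotingClassifier that combines several base classifiers by majority vote; synthetic data stands in for the phishing dataset.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for the 30-feature phishing dataset (illustration only).
    X, y = make_classification(n_samples=500, n_features=30, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    ensemble = VotingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("svm", SVC(kernel="rbf")),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        voting="hard",  # majority vote over the base classifiers' predictions
    )
    ensemble.fit(X_train, y_train)
    print("Ensemble accuracy:", ensemble.score(X_test, y_test))

A "soft" voting variant that averages predicted class probabilities is also possible when all base classifiers can output probabilities.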


The project can also include other variants of phishing like smishing,
vishing, etc. to complete the system. Looking even further out, the
methodology needs to be evaluated on how it might handle collection
growth. The collections will ideally grow incrementally over time so
there will need to be a way to apply a classifier incrementally to the new
data, but also potentially have this classifier receive feedback that might
modify it over time.

8.3 Recommendation

Through this project, one can know a lot about phishing attacks and
how to prevent them. This project can be taken further by creating a
browser extension that can be installed on any web browser to detect
phishing URL Links.

REFERENCES

[1] Reid G. Smith and Joshua Eckroth. Building ai applications:


Yesterday, today, and tomorrow. AI Magazine, 38(1):6–22, Mar.
2017.
[2] Panos Louridas and Christof Ebert. Machine learning. IEEE
Software, 33:110– 115, 09 2016.
[3] Michael Jordan and T.M. Mitchell. Machine learning: Trends,
perspectives, and prospects. Science (New York, N.Y.), 349:255–60,
07 2015.
[4] Steven Aftergood. Cybersecurity: The cold war online. Nature,
547:30+, Jul 2017. 7661.
[5] Aleksandar Milenkoski, Marco Vieira, Samuel Kounev, Alberto
Avritzer, and Bryan Payne. Evaluating computer intrusion
detection systems: A survey of common practices. ACM Computing
Surveys, 48:12:1–, 09 2015.
[6] Chirag N. Modi and Kamatchi Acha. Virtualization layer security
challenges and intrusion detection/prevention systems in cloud
computing: a comprehensive review. The Journal of
Supercomputing, 73(3):1192–1234, Mar 2017.
[7] Eduardo Viegas, Altair Santin, Andre Fanca, Ricardo Jasinski,
Volnei Pedroni, and Luiz Soares de Oliveira. Towards an energy-
efficient anomaly-based intrusion detection engine for embedded
systems. IEEE Transactions on Computers, 66:1–1, Jan 2016.

[8] Y. Xin, L. Kong, Z. Liu, Y. Chen, Y. Li, H. Zhu, M. Gao, H. Hou, and C.
Wang.
Machine learning and deep learning methods for cybersecurity. IEEE
Access, 6:35365– 35381, 2018.
[9] Neha R. Israni and Anil N. Jaiswal. A survey on various phishing
and anti- phishing measures. International journal of engineering

research and technology, 4, 2015.
[10] Pingchuan Liu and Teng-Sheng Moh. Content based spam e- mail
filtering. pages 218–224, 10 2016.
[11] N. Agrawal and S. Singh. Origin (dynamic blacklisting) based
spammer detection and spam mail filtering approach. In 2016 Third
International Conference on Digital Information Processing, Data
Mining, and Wireless Communications (DIPDMWC), pages 99–104,
2016.
[12] Vikas Sahare, Sheetalkumar Jain, and Manish Giri. Survey:anti-
phishing framework using visual cryptography on cloud. JAFRC, 2,
01 2015.
[13] S. Patil and S. Dhage. A methodical overview on phishing
detection along with an organized way to construct an anti- phishing
framework. In 2019 5th International Conference on Advanced
Computing Communication Systems (ICACCS), pages 588– 593,
2019.
[14] Dipesh Vaya, Sarika Khandelwal, and Teena Hadpawat. Visual cryptography: A review. International Journal of Computer Applications, 174:40–43, 09 2017.
[15] Saurabh Saoji. Phishing detection system using visual cryptography, 03 2015.

[16] C. Pham, L. A. T. Nguyen, N. H. Tran, E. Huh, and C. S.


Hong. Phishing-aware:
A neuro-fuzzy approach for anti-phishing on fog networks. IEEE
Transactions on Network and Service Management, 15(3):1076–1089,
2018.
[17] K. S. C. Yong, K. L. Chiew, and C. L. Tan. A survey of the qr code
phishing: the current attacks and countermeasures. In 2019 7th
International Conference on Smart Computing Communications
(ICSCC), pages 1–5, 2019.
[18] G. Egozi and R. Verma. Phishing email detection using robust nlp
techniques. In
2018 IEEE International Conference on Data Mining Workshops

(ICDMW), pages 7– 12, 2018.
[19] J. Mao, W. Tian, P. Li, T. Wei, and Z. Liang. Phishing-alarm:
Robust and efficient phishing detection via page component similarity.
IEEE Access, 5:17020– 17030, 2017.
[20] G. J. W. Kathrine, P. M. Praise, A. A. Rose, and E. C. Kalaivani.
Variants of phishing attacks and their detection techniques. In 2019
3rd International Conference on Trends in Electronics and Informatics
(ICOEI), pages 255–259, 2019.
[21] Muhammet Baykara and Zahit Gurel. Detection of phishing attacks. pages 1–5, 03 2018.
[22] Prof. Gayathri Naidu. A survey on various phishing detection and prevention techniques. International Journal of Engineering and Computer Science, 5(9), May 2016.
[23] E. Zhu, Y. Chen, C. Ye, X. Li, and F. Liu. Ofs-nn: An effective
phishing websites detection model based on optimal feature selection
and neural network. IEEE Access, 7:73271–73284, 2019.
[24] Mahdieh Zabihimayvan and Derek Doran. Fuzzy rough set
feature selection to enhance phishing attack detection, 03 2019.
[25] P. Yang, G. Zhao, and P. Zeng. Phishing website detection based
on multidimensional features driven by deep learning. IEEE Access,
7:15196–15209, 2019.
[26] T. Nathezhtha, D. Sangeetha, and V. Vaidehi. Wc-pad: Web
crawling based phishing attack detection. In 2019 International
Carnahan Conference on Security
Technology (ICCST), pages 1–6, 2019.
