Phishing Website Detection DOCUMENTATION
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
KALLAM HARANADHAREDDY INSTITUTE OF TECHNOLOGY
AN ISO 9001:2015 CERTIFIED INSTITUTION ACCREDITED BY NBA &
NAAC WITH ‘A’ GRADE
(APPROVED BY AICTE, AFFILIATED TO JNTUK, KAKINADA)
CERTIFICATE
This is to certify that the project work entitled "PHISHING WEBSITE DETECTION USING
MACHINE LEARNING ALGORITHMS" being submitted by Phanindhra Kumar Sanam
(178X1A0585), Mani Sai Sankar Pasupuleti (178X1A0561), Satish Namala (178X1A0569), and Naga
Sai Teja G (178X1A0568) in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering at Kallam Haranadhareddy Institute of Technology is a
bonafide work carried out by them.
External Examiner
DECLARATION
" under the guidance of Dr. Md. Sirajuddin Sir is submitted in partial
fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering.
This is a record of bonafide work carried out by us and the results embodied in
this project have not been reproduced or copied from any source. The results
embodied in this project have not been submitted to any other university for the
award of any other degree.
ACKNOWLEDGEMENT
We are profoundly grateful and express our deep sense of gratitude and respect towards our
honorable Chairman, our grandfather Sri KALLAM HARANADHA REDDY, Chairman of the
Kallam Group, for his precious support to the college.
We are thankful to Dr. M. UMA SANKAR REDDY, Director, KHIT, GUNTUR for his
encouragement and support for the completion of the project.
We are much thankful to Dr. B. SIVA BASIVI REDDY, Principal KHIT, GUNTUR for his
support during and till the completion of the project.
We are greatly indebted to Dr. K. VENKATA SUBBA REDDY, Professor & Head,
Department of Computer Science and Engineering, KHIT, GUNTUR for providing the laboratory
facilities to the fullest extent as and when required, and also for giving us the opportunity to carry
out the project work in the college.
We are also thankful to our Project Coordinators, including Mr. N. Md. Jubair Basha, for their support.
We extend our deep sense of gratitude to our Internal Guide Dr. Md. Sirajuddin, and to the other
faculty members and support staff, for their valuable suggestions, guidance and constructive ideas at
each and every step, which were indeed of great help towards the successful completion of our
project.
ABSTRACT
The main objective of this project is to classify phishing websites using machine learning
algorithms. The Internet has become an important part of our lives, and the use of smart
internet-based devices for financial transactions provides a platform for attackers to launch
various attacks. This project addresses the phishing attack. Phishing is a type of social
engineering attack that is often used to steal users' personal data. Due to the COVID-19
crisis there has been a rise in phishing attacks through which attackers seize the personal
information of users. One common vector for launching a phishing attack is a phishing
website: a site that gives users the impression of being a legitimate website and steals their
personal information. Machine learning is a powerful tool for fighting phishing attacks, and
in this project different machine learning algorithms are evaluated for detecting phishing websites.
TABLE OF CONTENTS
1.1 Introduction 1
CHAPTER 4: SOFTWARE REQUIREMENT SPECIFICATION 13-15
5.1 Python 17
5.6 Variables 21
5.9 Datasets 26
6.1 Introduction 27
6.2 Normalization 27
6.5 UML Diagrams 32-40
CHAPTER 8: CONCLUSION 71
CHAPTER - 1
1. INTRODUCTION
1.1 Introduction to project
Internet use has become an essential part of our daily activities as a result of rapidly
growing technology. Due to this rapid growth of technology and intensive use of digital
systems, data security of these systems has gained great importance. The primary
objective of maintaining security in information technologies is to ensure that necessary
precautions are taken against threats and dangers likely to be faced by users during the
use of these technologies. Phishing is defined as imitating reliable websites in order to
obtain the proprietary information entered into websites every day for various purposes,
such as usernames, passwords and citizenship numbers. Phishing websites contain
various hints among their contents and web browser-based information. Individuals
committing the fraud send the fake website or e-mail to the target address as if it comes
from an organization, bank or any other reliable source that performs trusted transactions.
The contents of the website or the e-mail include requests aiming to lure individuals into
entering or updating their personal information or changing their passwords, as well as
links to websites that look like exact copies of the websites of the organizations concerned.
1.2 Purpose of the project
Phishing is one of the most common and most dangerous attacks among cybercrimes.
The aim of these attacks is to steal the information used by individuals and organizations
to conduct transactions. Phishing websites contain various hints among their contents and
web browser-based information. The purpose of this study is to perform Extreme
Learning Machine (ELM) based classification over the 30 features of the Phishing
Websites dataset from the UC Irvine Machine Learning Repository.
1.4 Solution for the problem statement:
We proposed a system that uses machine learning techniques and algorithms such as
Logistic Regression, KNN, SVC, Random Forest, Decision Tree, XGB Classifier and
Naïve Bayes to predict phishing. We trained our models on a large dataset with more
than 30 features. We performed hyperparameter tuning to improve the accuracy of the
machine learning algorithms, considered all the classification metrics while testing
the models, and tried to improve the models for precision and recall.
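The hyperparameter tuning step mentioned above is not shown in code elsewhere in this report; the following is a minimal sketch of how it can be done with scikit-learn's GridSearchCV (the Random Forest grid values are illustrative assumptions, not the project's actual search space, and x_train/y_train come from the train/test split described in Chapter 7):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid of candidate hyperparameter values
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(x_train, y_train)                       # cross-validated search over the grid
print(search.best_params_, search.best_score_)     # best setting and its CV accuracy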
CHAPTER - 2
2. REQUIREMENTS
• RAM: 4GB
• Processor: Intel i3
• Software : Anaconda
• Jupyter IDE
CHAPTER - 3
3. SYSTEM ANALYSIS
1. Numpy
2. Pandas
3. Matplotlib
4. Scikit-learn
1. Numpy:
NumPy is the fundamental package for numerical computation in Python; it provides the
multidimensional array object and the mathematical routines used throughout this project.
2. Pandas:
Pandas makes data easy to load, manipulate and analyze. Python with Pandas is used in a wide
range of fields, including academic and commercial domains such as finance, economics,
statistics and analytics.
3. Matplotlib
For simple plotting the pyplot module provides a MATLAB-like interface, particularly
when combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc, via an object oriented interface or via a set of functions
familiar to MATLAB users.
4. Scikit-learn
Scikit-learn provides simple and efficient tools for machine learning in Python, including the
classification algorithms and evaluation metrics used in this project.
3.2 EXISTING SYSTEM
➢ In the existing system, phishing website detection is done with data mining
algorithms such as J48 and C4.5 in the Weka Explorer, which are not suitable for very large
datasets.
➢ In most of the existing systems the dataset is too small to improve precision, and
as a result they produce false positives and false negatives.
➢ The number of features in the existing models is very small, and as a result the existing
systems face the threat of becoming outdated.
➢ With the rapid growth in technology, there are many ways open for an attacker to pose
phishing threats to an internet user.
➢ Attackers are finding new ways to break into the existing systems and are able to
phish the data of users. Traditional features are being readily exploited by cyber
attackers.
Phishing is one of the most common and most dangerous attacks among cybercrimes.
The aim of these attacks is to steal the information used by individuals and organizations to
conduct transactions. Phishing websites contain various hints among their contents and web
browser-based information. The purpose of this study is to perform Extreme Learning
Machine (ELM) based classification over the 30 features of the Phishing Websites dataset from
the UC Irvine Machine Learning Repository.
Inputs:
➢ Importing all the required packages, such as numpy, pandas, matplotlib, scikit-learn and
the required machine learning algorithm packages.
3.5 PROCESS MODELS USED WITH JUSTIFICATION
This project uses an iterative development lifecycle, where components of the application
are developed through a series of tight iterations. The first iteration focuses on very basic
functionality, with subsequent iterations adding new functionality to the previous work and or
correcting errors identified for the components in production.
The six stages of the SDLC are designed to build on one another, taking outputs from the
previous stage, adding additional effort, and producing results that leverage the previous effort
and are directly traceable to the previous stages. During each stage, additional information is
gathered or developed, combined with the inputs, and used to produce the stage deliverables.
It is important to note that the additional information is restricted in scope; new ideas that
would take the project in directions not anticipated by the initial set of high-level
requirements, or features that are out of scope, are preserved for later consideration.
Too many software development efforts go awry when the development team and
customer personnel get caught up in the possibilities of automation. Instead of focusing on
high-priority features, the team can become mired in a sea of nice-to-have features that are not
essential to solving the problem but are in themselves highly attractive. This is the root cause
of a large percentage of failed and/or abandoned development efforts, and is the primary
reason the development team utilizes the iterative model.
When Object orientation is used in analysis as well as design, the boundary between
OOA and OOD is blurred. This is particularly true in methods that combine analysis and
design. One reason for this blurring is the similarity of basic constructs (i.e.,objects and
classes) that are used in OOA and OOD. Though there is no agreement about what parts of
the object-oriented development process belong to analysis and what parts to design, there is
some general agreement about the domains of the two activities.
The fundamental difference between OOA and OOD is that the former models the
problem domain, leading to an understanding and specification of the problem, while the
latter models the solution to the problem. That is, analysis deals with the problem domain,
while design deals with the solution domain. In OOAD, however, the solution domain
representation subsumes much of the problem domain representation. That is, the solution
domain representation, created by OOD, generally contains much of the representation created
by OOA. The separating line is a matter of perception, and different people have different views
on it. The lack of clear separation between analysis and design can also be considered one of the
strong points of the object-oriented approach; the transition from analysis to design is "seamless".
This is also the main reason OOAD methods perform analysis and design together.
The main difference between OOA and OOD, due to the different domains of modeling,
is in the type of objects that come out of the analysis and design process.
Features of OOAD:
• All objects can be represented graphically including the relation between them.
• All Key Participants in the system will be represented as actors and the actions done by
them will be represented as use cases.
• A typical use case is nothing but a systematic flow of a series of events, which can be well
described using sequence diagrams, and each event can be described diagrammatically by
activity as well as state chart diagrams.
• So the entire system can be well described using the OOAD model, hence this model is
chosen as the SDLC model.
Preliminary investigation examines project feasibility, the likelihood the system will be
useful to the organization. The main objective of the feasibility study is to test the Technical,
Operational and Economical feasibility for adding new modules and debugging old running
systems. All systems are feasible if they are given unlimited resources and infinite time. There are
three aspects in the feasibility study portion of the preliminary investigation:
● Technical Feasibility
● Operational Feasibility
● Economical Feasibility
A system that can be developed technically, and that will be used if installed, must still be a
good investment for the organization. In the economical feasibility, the development cost in
creating the system is evaluated against the ultimate benefit derived from the new systems.
Financial benefits must equal or exceed the costs.
The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available at NIC, there is only nominal expenditure, and economic feasibility is
certain.
Proposed projects are beneficial only if they can be turned into an information system
that will meet the organization's operating requirements. Operational feasibility aspects of
the project are to be taken as an important part of the project implementation. Some of the
important issues raised to test the operational feasibility of a project include the following:
The well-planned design would ensure the optimal utilization of the computer resources
and would help in the improvement of performance status.
3.9 TECHNICAL FEASIBILITY
The technical issues usually raised during the feasibility stage of the investigation include
the following:
CHAPTER - 4
4. SOFTWARE REQUIREMENT SPECIFICATION
PURPOSE
In software engineering, the same meanings of requirements apply, except that the focus
of interest is the software itself.
4.1 FUNCTIONAL REQUIREMENTS
• Data analysis
• Data preprocessing
• Model building
• Prediction
4.2 NON-FUNCTIONAL REQUIREMENTS
Introduction to Django: Django is a Web development framework that saves you time and
makes Web development a joy. Using Django, you can build and maintain high-quality Web
applications with minimal fuss. At its best, Web development is an exciting, creative act; at
its worst, it can be a repetitive, frustrating nuisance. Django lets you focus on the fun stuff —
the crux of your Web application — while easing the pain of the repetitive bits. In doing so, it
provides high-level abstractions of common Web development patterns, shortcuts for frequent
programming tasks, and clear conventions for how to solve problems. At the same time,
Django tries to stay out of your way, letting you work outside the scope of the framework as
needed. The goal of this book is to make you a Django expert. The focus is twofold. First, we
explain, in depth, what Django does and how to build Web applications with it. Second, we
discuss higher-level concepts where appropriate, answering the question “How can I apply
these tools effectively in my own projects?” By reading this book, you’ll learn the skills
needed to develop powerful Web sites quickly, with code that is clean and easy to maintain.
With a one-off dynamic page such as this one, the write-it-from-scratch approach isn’t
necessarily bad. For one thing, this code is simple to comprehend — even a novice developer
can read these 16 lines of Python and understand all it does, from start to finish. There’s
nothing else to learn; no other code to read. It’s also simple to deploy: just save this code in a
file called latestbooks.cgi, upload that file to a Web server, and visit that page with a browser.
But as a Web application grows beyond the trivial, this approach breaks down, and you face a
number of problems:
Should a developer really have to worry about printing the “Content-Type” line and
remembering to close the database connection? This sort of boilerplate reduces programmer
productivity and introduces opportunities for mistakes. These setup- and teardown-related
tasks would best be handled by some common infrastructure.
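For context, the kind of write-it-from-scratch CGI page being described might look like the following sketch, in the spirit of the latestbooks.cgi example; the database name, credentials and the books table are illustrative assumptions, not code from this project:
#!/usr/bin/env python
import MySQLdb

print("Content-Type: text/html\n")      # the boilerplate header the text refers to
connection = MySQLdb.connect(user='me', passwd='letmein', db='my_db')
cursor = connection.cursor()
cursor.execute("SELECT name FROM books ORDER BY pub_date DESC LIMIT 10")
print("<ul>")
for (name,) in cursor.fetchall():
    print("<li>%s</li>" % name)
print("</ul>")
connection.close()                      # teardown that is easy to forget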
CHAPTER : 5
5. LANGUAGES OF IMPLEMENTATION
5.1 Python
What Is A Script?
Basically, a script is a text file containing the statements that comprise a Python
program. Once you have created the script, you can execute it over and over without
having to retype it each time.
Scripts are editable
Perhaps, more importantly, you can make different versions of the script by
modifying the statements from one file to the next using a text editor. Then you can
execute each of the individual versions. In this way, it is easy to create different programs
with a minimum amount of typing.
Just about any text editor will suffice for creating Python script files.
You can use Microsoft Notepad, Microsoft WordPad, Microsoft Word, or just about
any word processor if you want to.
Script:
Scripts are distinct from the core code of the application, which is usually written in a
different language, and are often created or at least modified by the end-user. Scripts are
often interpreted from source code or bytecode, whereas the applications they control are
traditionally compiled to native machine code.
Program:
The program has an executable form that the computer can use directly to execute the
instructions.
The same program in its human-readable source code form, from which executable
programs are derived (e.g., compiled)
Python
What is Python?
Chances are you are asking yourself this. You may have found this book because you
want to learn to program but don’t know anything about programming languages. Or you
may have heard of programming languages like C, C++, C#, or Java and want to know
what Python is and how it compares to “big name” languages. Hopefully I can explain it
for you.
Python concepts
If you're not interested in the hows and whys of Python, feel free to skip to the next chapter.
In this chapter I will try to explain to the reader why I think Python is one of the best
languages available and why it’s a great one to start programming with.
• Great interactive environment
• Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to PERL and PHP.
• Python is Interactive − You can actually sit at a Python prompt and interact with the interpreter
directly to write your programs.
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, and Unix shell and other scripting languages.
Python is copyrighted. Like Perl, Python source code is now available under the GNU General
Public License (GPL).
Python is now maintained by a core development team at the institute, although Guido van
Rossum still holds a vital role in directing its progress.
• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax. This
allows the student to pick up the language quickly.
• Easy-to-read − Python code is more clearly defined and visible to the eyes.
• Easy-to-maintain − Python's source code is fairly easy to maintain.
• A broad standard library − The bulk of Python's library is very portable and cross-
platform compatible on UNIX, Windows, and Macintosh.
• Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
• Portable − Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.
• Extendable − You can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.
• Scalable − Python provides a better structure and support for large programs than shell
scripting.
Apart from the above-mentioned features, Python has a big list of good features; a few are listed
below −
• It supports functional and structured programming methods as well as OOP.
• It can be used as a scripting language or can be compiled to byte-code for building large
applications.
• It provides very high-level dynamic data types and supports dynamic type checking.
• It supports automatic garbage collection.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Types
Python is a dynamically typed language. Many other languages are statically typed, such as
C/C++ and Java. A statically typed language requires the programmer to explicitly tell the
computer what type of "thing" each data value is.
For example, in C if you had a variable that was to contain the price of something, you would
have to declare the variable as a “float” type.
This tells the compiler that the only data that can be used for that variable must be a floating
point number, i.e. a number with a decimal point.
Python, however, doesn’t require this. You simply give your variables names and assign
values to them. The interpreter takes care of keeping track of what kinds of objects your
program is using. This also means that you can change the size of the values as you develop
the program. Say you have another decimal number (a.k.a. a floating point number) you need
in your program.
With a static typed language, you have to decide the memory size the variable can take when
you first initialize that variable. A double is a floating point value that can handle a much
larger number than a normal float (the actual memory sizes depend on the operating
environment). If you declare a variable to be a float but later on assign a value that is too big
to it, your program will fail; you will have to go back and change that variable to be a double.
With Python, it doesn’t matter. You simply give it whatever number you want and Python will
take care of manipulating it as needed. It even works for derived values.
For example, say you are dividing two numbers. One is a floating point number and one is an
integer. Python realizes that it’s more accurate to keep track of decimals so it automatically
calculates the result as a floating point number.
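A small, illustrative example of the dynamic-typing behaviour described above (not code from the project itself):
price = 10            # starts life as an int
price = 19.99         # the same name can be rebound to a float; no declaration needed
result = 7 / 2        # mixing int operands still yields a float: 3.5
print(type(price), result)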
5.6 Variables
Variables are nothing but reserved memory locations to store values. This means that when
you create a variable you reserve some space in memory.
Based on the data type of a variable, the interpreter allocates memory and decides what can be
stored in the reserved memory. Therefore, by assigning different data types to variables, you
can store integers, decimals or characters in these variables.
The data stored in memory can be of many types. For example, a person's age is stored as a
numeric value and his or her address is stored as alphanumeric characters. Python has various
standard data types that are used to define the operations possible on them and the storage
method for each of them.
• Numbers
• String
• List
• Tuple
• Dictionary
Python Numbers
Number data types store numeric values. Number objects are created when you assign a value
to them
Python Strings
Strings in Python are identified as a contiguous set of characters represented in the quotation
marks. Python allows for either pairs of single or double quotes. Subsets of strings can be
taken using the slice operator ([ ] and [:] ) with indexes starting at 0 in the beginning of the
string and working their way from -1 at the end.
Python Lists
Lists are the most versatile of Python's compound data types. A list contains items separated
by commas and enclosed within square brackets ([]). To some extent, lists are similar to arrays
in C. One difference between them is that all the items belonging to a list can be of different
data type.
The values stored in a list can be accessed using the slice operator ([ ] and [:]) with indexes
starting at 0 in the beginning of the list and working their way to end -1. The plus (+) sign is
the list concatenation operator, and the asterisk (*) is the repetition operator.
Python Tuples
A tuple is another sequence data type that is similar to the list. A tuple consists of a number of
values separated by commas. Unlike lists, however, tuples are enclosed within parentheses.
The main differences between lists and tuples are: Lists are enclosed in brackets ( [ ] ) and
their elements and size can be changed, while tuples are enclosed in parentheses ( ( ) ) and
cannot be updated. Tuples can be thought of as read-only lists.
Python Dictionary
Python's dictionaries are kind of hash table type. They work like associative arrays or hashes
found in Perl and consist of key-value pairs. A dictionary key can be almost any Python type,
but are usually numbers or strings. Values, on the other hand, can be any arbitrary Python
object.
Dictionaries are enclosed by curly braces ({ }) and values can be assigned and accessed using
square braces ([]).
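A short, illustrative snippet showing the standard data types described above (the values are chosen arbitrarily for the example):
age = 21                                      # number
name = "phishing detector"                    # string
print(name[0:8])                              # slicing -> 'phishing'
features = [1, -1, 0, 1]                      # list: mutable, items may differ in type
print(features + [1], features * 2)           # + concatenates, * repeats
split = (0.8, 0.2)                            # tuple: like a list, but cannot be updated
labels = {1: 'legitimate', -1: 'phishing'}    # dictionary of key-value pairs
print(split[0], labels[-1])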
The normal mode is the mode where the scripted and finished .py files are run in the Python
interpreter.
Interactive mode is a command line shell which gives immediate feedback for each statement,
while running previously fed statements in active memory. As new lines are fed into the
interpreter, the fed program is evaluated both in part and in whole
20 Python libraries
1. Requests. The most famous http library written by kenneth reitz. It’s a must
have for every python developer.
2. Scrapy. If you are involved in webscraping then this is a must have library for
you. After using this library you won’t use any other.
3. wxPython. A gui toolkit for python. I have primarily used it in place of tkinter.
You will really love it.
4. Pillow. A friendly fork of PIL (Python Imaging Library). It is more user friendly
than PIL and is a must have for anyone who works with images.
5. SQLAlchemy. A database library. Many love it and many hate it. The choice is
yours.
6. BeautifulSoup. I know it’s slow but this xml and html parsing library is very
useful for beginners.
7. Twisted. The most important tool for any network application developer. It has
a very beautiful api and is used by a lot of famous python developers.
8. NumPy. How can we leave out this very important library? It provides some
advanced math functionality to Python.
9. SciPy. When we talk about NumPy then we have to talk about scipy. It is a
library of algorithms and mathematical tools for python and has caused many
scientists to switch from ruby to python.
10. matplotlib. A numerical plotting library. It is very useful for any data
scientist or any data analyzer.
11. Pygame. Which developer does not like to play games and develop them ? This
library will help you achieve your goal of 2d game development.
12. Pyglet. A 3d animation and game creation engine. This is the engine in which
the famous python port of minecraft was made.
13. pyQT. A GUI toolkit for python. It is my second choice after wxpython for
developing GUI’s for my python scripts.
14. pyGtk. Another python GUI library. It is the same library in which the famous
Bittorrent client is created.
15. Scapy. A packet sniffer and analyzer for python made in python.
16. pywin32. A python library which provides some useful methods and classes
for interacting with windows.
17. nltk. Natural Language Toolkit – I realize most people won't be using this one,
but it's generic enough. It is a very useful library if you want to manipulate
strings. But its capacity is beyond that. Do check it out.
20. IPython. I just can’t stress enough how useful this tool is. It is a python prompt
on steroids. It has completion, history, shell capabilities, and a lot more. Make
sure that you take a look at it.
Numpy:
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually
numbers), all of the same type, indexed by a tuple of positive integers. In NumPy dimensions are
called axes, and the number of axes is the rank. NumPy offers MATLAB-like capabilities within
Python.
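A tiny example of the array object described above:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])   # a rank-2 array (two axes)
print(a.ndim)     # 2  -> number of axes
print(a.shape)    # (2, 3)
print(a.dtype)    # all elements share a single type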
Matplotlib
DataSets
The DataSet object is similar to the ADO Recordset object, but more powerful, and
with one other important distinction: the DataSet is always disconnected. The DataSet
object represents a cache of data, with database-like structures such as tables, columns,
relationships, and constraints. However, though a DataSet can and does behave much
like a database, it is important to remember that DataSet objects do not interact
directly with databases, or other source data. This allows the developer to work with a
programming model that is always consistent, regardless of where the source data
resides. Data coming from a database, an XML file, from code, or user input can all be
placed into DataSet objects. Then, as changes are made to the DataSet they can be
tracked and verified before updating the source data. The GetChanges method of the
DataSet object actually creates a second DataSet that contains only the changes to the
data. This DataSet is then used by a DataAdapter (or other objects) to update the
original data source.
The DataSet has many XML characteristics, including the ability to produce and consume
XML data and XML schemas. XML schemas can be used to describe schemas interchanged
via WebServices. In fact, a DataSet with a schema can actually be compiled for type safety
and statement completion.
CHAPTER : 6
6. SYSTEM DESIGN
6.1 INTRODUCTION
Software design sits at the technical kernel of the software engineering process and is applied
regardless of the development paradigm and area of application. Design is the first step in the
development phase for any engineered product or system. The designer’s goal is to produce a
model or representation of an entity that will later be built. Once system requirements
have been specified and analyzed, system design is the first of the three technical activities -
design, code and test - that are required to build and verify the software.
The importance can be stated with a single word “Quality”. Design is the place where
quality is fostered in software development. Design provides us with representations of
software that can be assessed for quality. Design is the only way that we can accurately translate a
customer’s view into a finished software product or system. Software design serves as a
foundation for all the software engineering steps that follow. Without a strong design we risk
building an unstable system – one that will be difficult to test, one whose quality cannot be
assessed until the last stage.
6.2 NORMALIZATION
Normalization is a process applied to handle the problems that can arise due to data redundancy,
i.e. repetition of data in the database, to maintain data integrity, and to handle the problems that
can arise due to insertion, updation and deletion anomalies.
Insertion anomaly: Inability to add data to the database due to absence of other data.
Deletion anomaly: Unintended loss of data due to deletion of other data.
Update anomaly: Data inconsistency resulting from data redundancy and partial update.
Normal Forms: These are the rules for structuring relations that eliminate anomalies.
A relation is said to be in first normal form if the values in the relation are atomic for every
attribute in the relation. By this we mean simply that no attribute value can be a set of values
or, as it is sometimes expressed, a repeating group.
A relation is said to be in second normal form if it is in first normal form and it satisfies
any one of the following rules.
3) Every non-key attribute is fully functionally dependent on the full set of the primary key.
Transitive Dependency: If two non key attributes depend on each other as well as on the
primary key then they are said to be transitively dependent.
The above normalization principles were applied to decompose the data in multiple tables
thereby making the data to be maintained in a consistent state.
6.3 E – R DIAGRAMS
• The relations within the system are structured through a conceptual ER diagram, which not only
specifies the existential entities but also the standard relations through which the system exists
and the cardinalities that are necessary for the system state to continue.
• The Entity Relationship Diagram (ERD) depicts the relationships between the data objects. The
ERD is the notation that is used to conduct the data modeling activity; the attributes of each data
object noted in the ERD can be described using a data object description.
• The set of primary components that are identified by the ERD are data objects, attributes and
relationships.
The primary purpose of the ERD is to represent data objects and their relationships.
6.4 DATA FLOW DIAGRAMS
A data flow diagram is a graphical tool used to describe and analyze the movement of data
through a system. These are the central tool and the basis from which the other components
are developed. The transformation of data from input to output, through processes, may be
described logically and independently of the physical components associated with the system.
These are known as the logical data flow diagrams. The physical data flow diagrams show the
actual implementation and movement of data between people, departments and workstations. A
full description of a system actually consists of a set of data flow diagrams, developed using the
two familiar notations of Yourdon and Gane & Sarson. Each component in a DFD is labeled
with a descriptive name, and a process is further identified with a number that will be used for
identification purposes. The development of DFDs is done in several levels. Each process in
lower level diagrams can be broken down into a more detailed DFD in the next level. The
top-level diagram is often called the context diagram. It consists of a single process bubble,
which plays a vital role in studying the current system. The process in the context level
diagram is exploded into other processes at the first level DFD.
The idea behind the explosion of a process into more processes is that understanding at
one level of detail is exploded into greater detail at the next level. This is done until no further
explosion is necessary and an adequate amount of detail is described for the analyst to understand
the process.
Larry Constantine first developed the DFD as a way of expressing system requirements in a
graphical form; this led to the modular design.
A DFD, also known as a "bubble chart", has the purpose of clarifying system requirements
and identifying the major transformations that will become programs in system design. It is thus
the starting point of the design, down to the lowest level of detail. A DFD consists of a series of
bubbles joined by data flows in the system.
1. The DFD shows the flow of data, not of control; loops and decisions are control considerations
and do not appear on a DFD.
2. The DFD does not indicate the time factor involved in any process, i.e. whether the data flows
take place daily, weekly, monthly or yearly.
TYPES OF DATA FLOW DIAGRAMS
1. Current Physical
2. Current Logical
3. New Logical
4. New Physical
CURRENT PHYSICAL:
In the Current Physical DFD, process labels include the names of people or their positions, or the
names of computer systems, that might provide some of the overall system processing. The label
includes an identification of the technology used to process the data. Similarly, data flows and
data stores are often labeled with the names of the actual physical media on which data are
stored, such as file folders, computer files, business forms or computer tapes.
CURRENT LOGICAL:
The physical aspects of the system are removed as much as possible so that the current system
is reduced to its essence: the data and the processes that transform them, regardless of their
actual physical form.
NEW LOGICAL:
This is exactly like the current logical model if the user were completely happy with the
functionality of the current system but had problems with how it was implemented. Typically,
though, the new logical model will differ from the current logical model by having additional
functions, obsolete functions removed, and inefficient flows reorganized.
NEW PHYSICAL:
The new physical represents only the physical implementation of the new system.
6.5 UML Diagrams
6.5.1 Use Case Diagram
Figure: 6.5.1 Use case diagram with use cases for Data Understanding, Data Analytics (EDA), Train/Test Split, Model Building, Predictive Learning, Trained Dataset, Particular Data and Model Evaluation.
EXPLANATION:
The primary purpose of a use case diagram is to show which system functions are performed for which
actor. The roles of the actors in the system can be depicted. The above diagram has the user as the
actor, and each use case plays a specific part in achieving the overall goal.
6.5.2 Class Diagram
Figure: 6.5.2 Class diagram showing a Model Evaluation class with a dataset attribute and traineddataset() and particulardata() operations.
EXPLANATION:
This class diagram shows how the classes, with their attributes and methods, are linked together to
perform the verification with security. The diagram above shows the different classes involved in our
project.
6.5.3 Object Diagram
Figure: 6.5.3 Object diagram for Model Evaluation.
EXPLANATION:
The above diagram shows the flow of objects between the classes. It is a diagram that shows a
complete or partial view of the structure of a modeled system. This object diagram represents how
the classes, with their attributes and methods, are linked together to perform the verification with
security.
6.5.4 Component Diagram
Figure: 6.5.4 Component diagram with Trained Dataset and Particular Data components.
EXPLANATION:
A component provides the set of required interfaces that a component realizes or implements. These are
static diagrams of the Unified Modeling Language. Component diagrams are used to represent the
working and behavior of the various components of a system.
6.5.5 Deployment Diagram
Figure: 6.5.5 Deployment diagram with nodes for Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation.
EXPLANATION:
A UML deployment diagram is a diagram that shows the configuration of run-time processing nodes and
the components that live on them. Deployment diagrams are a kind of structure diagram used in modeling
the physical aspects of an object-oriented system. They are often used to model the static deployment
view of a system.
6.5.6 STATE DIAGRAM
Figure: 6.5.6 State diagram with states for Dataset, Split Data, Model-Building Phase, Machine Learning and Particular Data.
EXPLANATION:
State diagrams are a loosely defined diagram to show workflows of stepwise activities and actions, with
support for choice, iteration and concurrency. State diagrams require that the system described is composed
of a finite number of states; sometimes this is indeed the case, while at other times it is a reasonable
abstraction. Many forms of state diagrams exist, which differ slightly and have different semantics.
6.5.7 Sequence Diagram
Figure: 6.5.7 Sequence diagram with messages for Datasets Transfers, Datasets, Training Data, Predictive Learning, Machine Learning, Trained Dataset, Split Data and Particular Data.
EXPLANATION:
UML sequence diagrams are interaction diagrams that detail how operations are carried out. They capture
the interaction between objects in the context of a collaboration. Sequence diagrams are time-focused, and
they show the order of the interaction visually by using the vertical axis of the diagram to represent time:
what messages are sent and when.
6.5.8 Collaboration Diagram
Figure: 6.5.8 Collaboration diagram with the numbered messages 1: Datasets Transfers, 2: Datasets, 3: Training Data, 4: Predictive Learning, 5: Machine Learning, 6: Trained Dataset, 8: Split Data and 9: Particular Data, exchanged among Data Understanding, Data Analysis (EDA), Dataset, Model Building and Model Evaluation.
EXPLANATION:
Collaboration diagrams are used to show how objects interact to perform the behavior of a particular use
case, or a part of a use case. Along with sequence diagrams, collaboration diagrams are used by designers
to define and clarify the roles of the objects that perform a particular flow of events of a use case. They are
the primary source of information used to determine class responsibilities and interfaces.
6.5.9 Activity Diagram
Figure: 6.5.9 Activity diagram starting from the Dataset.
EXPLANATION:
Activity diagrams are a loosely defined diagram to show workflows of stepwise activities and actions, with
support for choice, iteration and concurrency. In UML, activity diagrams can be used to describe the
business and operational step-by-step workflows of components in a system. UML activity diagrams may
also model the internal logic of a complex operation. In many ways UML activity diagrams are the
object-oriented equivalent of flow charts and data flow diagrams (DFDs) from structured development.
6.5.10 System Architecture
Figure: 6.5.10 System architecture: DATASET -> EXPLORATORY DATA ANALYTICS -> TRAIN/TEST SPLIT -> MODEL BUILDING / HYPERPARAMETER TUNING -> MODEL EVALUATION -> RESULT.
CHAPTER : 7
7. IMPLEMENTATION
7.1 Data Collection
We collected the phishing websites dataset from the Kaggle website. It consists
of a mix of phishing and legitimate URL features. The dataset has 11055 rows
and 31 columns.
7.2 Exploratory Data Analysis
We loaded the dataset into the Python IDE with the help of the pandas package
and checked whether there are any missing values in the data. We found that
there are no missing values in the data, and we removed an unwanted column (the
index column) for our process. After removing the unwanted column, below are the
columns left in our dataset.
Figure: 7.2.1
We tried to analyze the obtained data and arrived at the following observations:
Figure: 7.2.2
From the above count plot, which is plotted with the help of the seaborn
package, we can observe the count of values of the target variable.
And below is the plot, drawn with the help of the matplotlib
package, used to find out the correlation among the features.
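The exact plotting call is not reproduced in this copy; a typical way to draw such a correlation plot, as a sketch, is:
corr = df.corr()                      # pairwise correlation of the features and Result
plt.figure(figsize=(15, 12))
sns.heatmap(corr)                     # seaborn heatmap of the correlation matrix
plt.show()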
From the analysis we clearly found out that ours is a classification problem,
where the target variable is Result. The features and the target are separated as follows:
X = data.drop('Result', axis=1)
y = data['Result']
Figure: 7.3.1
Figure: 7.3.2
7.4.1 Logistic Regression
Logistic regression is named for the function used at the core of the method, the logistic function. The
logistic function, also called the sigmoid function was developed by
statisticians to describe properties of population growth in ecology, rising
quickly and maxing out at the carrying capacity of the environment. It’s an
S-shaped curve that can take any real-valued number and map it into a
value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP()
function in your spreadsheet) and value is the actual numerical value that
you want to transform. Below is a plot of the numbers between -5 and 5
transformed into the range 0 and 1 using the logistic function.
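As an illustration (not the report's original figure code, and assuming numpy and matplotlib are imported as np and plt), the plot described above can be produced as follows:
x = np.linspace(-5, 5, 100)
y = 1 / (1 + np.exp(-x))      # the logistic (sigmoid) function
plt.plot(x, y)
plt.title('Logistic function')
plt.show()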
For example, a logistic regression model of a person's sex given their height models the
probability P(sex=male|height).
Written another way, we are modeling the probability that an input (X)
belongs to the default class (Y=1), we can write this formally as:
P(X) = P(Y=1|X)
We’re predicting probabilities? I thought logistic regression was a classification
algorithm?
Note that the probability prediction must be transformed into binary values
(0 or 1) in order to actually make a class prediction. Logistic
regression is a linear method, but the predictions are transformed using the
logistic function. The impact of this is that we can no longer understand the
predictions as a linear combination of the inputs as we can with linear
regression, for example, continuing on from above, the model can be stated
as:
as: ln(p(X) / (1 - p(X))) = b0 + b1 * X
We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
We trained logistic regression model with the help of training split and tested with
test split.
Accuracy score - 0.92
Confusion Matrix:
Phishing    Non-Phishing
Table:7.4.1.1
Classification Report:
Precision Recall F1-Score Support
Phishing 0.94 0.92 0.93 901
Table:7.4.1.2
7.4.2 Random Forest
Random forest is a supervised learning algorithm. It can be used both for classification and
regression. It is also a flexible and easy-to-use algorithm. A forest is comprised of
trees, and it is said that the more trees it has, the more robust a forest is. Random forest
creates decision trees on randomly selected data samples, gets a prediction from each tree
and selects the best solution by means of voting. It also provides a pretty good indicator
of the feature importance.
Advantages:
• Random forest is considered a highly accurate and robust method because of the
number of decision trees participating in the process.
• It does not suffer from the overfitting problem. The main reason is that it takes the
average of all the predictions, which cancels out the biases.
• Random forests can also handle missing values. There are two ways to handle these: using median
values to replace continuous variables, and computing the proximity-weighted average of missing
values.
It works in four steps (a short scikit-learn sketch follows the list):
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision
tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.
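A minimal scikit-learn sketch of the procedure above, including the feature-importance indicator mentioned earlier (n_estimators=100 is an illustrative choice, not the project's tuned value):
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)   # 100 bootstrapped trees vote on each sample
model.fit(x_train, y_train)
print(model.predict(x_test)[:5])                   # majority-vote predictions
print(model.feature_importances_)                  # per-feature importance scores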
We trained the random forest model with the help of the training split and tested it with the test split.
Accuracy score - 0.96
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.2.1
Classification Report:
Precision Recall F1-Score Support
Non-Phishing 0.95 0.96 0.95 854
Table:7.4.2.2
7.4.3 Decision Tree
A decision tree is a flowchart-like tree structure where an internal node represents a feature,
a branch represents a decision rule, and each leaf node represents the outcome. The
topmost node in a decision tree is known as the root node. It learns to partition on the basis
of the attribute value, and it partitions the tree in a recursive manner called recursive partitioning.
This flowchart-like structure helps you in decision making. Its visualization, like a
flowchart diagram, easily mimics human-level thinking. That is why decision
trees are easy to understand and interpret.
1. Select the best attribute using Attribute Selection Measures (ASM) to split the
records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of
the following conditions matches:
o All the tuples belong to the same attribute value.
Pros
It requires less data preprocessing from the user; for example, there is no need to
normalize columns.
It can be used for feature engineering, such as predicting missing values, and is suitable for
variable selection.
The decision tree has no assumptions about distribution because of the non-parametric
nature of the algorithm.
Cons
A small variation (or variance) in the data can result in a different decision tree. This
can be reduced by bagging and boosting algorithms.
Decision trees are biased with imbalanced datasets, so it is recommended to balance
out the dataset before creating the decision tree.
Confusion Matrix:
Phishing Non-Phishing
Classification Report:
Table:7.4.3.1
Table:7.4.3.2
7.4.4 Naïve Bayes
The naive Bayes classifier is a generative model for classification. Before the advent of
deep learning and its easy-to-use libraries, the Naive Bayes classifier was one of the
widely deployed classifiers for machine learning applications. Despite its simplicity, the
naive Bayes classifier performs quite well in many applications.
A Naive Bayes classifier is a probabilistic machine learning model that’s used for
classification task. The crux of the classifier is based on the Bayes theorem.
Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Using Bayes theorem, we can find the probability of A happening given that B has
occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that
the predictors/features are independent, that is, the presence of one particular feature does not
affect the others. Hence it is called naive.
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.4.1
Classification Report:
Precision Recall F1-Score Support
Macro avg 0.80 0.64 0.60 1755
Table:7.4.4.2
7.4.5 Support Vector Machine
Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms,
used for classification as well as regression problems. However, it is primarily used
for classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes, so that we can easily put a new data point in the
correct category in the future. This best decision boundary is called a hyperplane. SVM
chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Types of SVM
Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then such data
is termed as linearly separable data, and the classifier used for it is called the Linear SVM
classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset: if
there are 2 features then the hyperplane will be a straight line, and if there are 3 features then the
hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
Advantages
SVM classifiers offer good accuracy and perform faster prediction compared to the Naïve Bayes
algorithm. They also use less memory because they use a subset of the training points in the
decision phase. SVM works well with a clear margin of separation and with high-dimensional
space.
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.5.1
Classification Report:
Precision Recall F1-Score Support
Table:7.4.5.2
7.4.6 K-Nearest Neighbors
The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other. KNN captures the idea of similarity (sometimes
called distance, proximity, or closeness) with some mathematics we might have learned in
our childhood— calculating the distance between points on a graph.
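For illustration, the distance KNN typically relies on is the Euclidean distance between two feature vectors (a small sketch, not project code):
import math

def euclidean_distance(p, q):
    # straight-line distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance([1, 2], [4, 6]))   # 5.0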
Advantages
3. The algorithm is versatile. It can be used for classification, regression, and search (as we will see
in the next section).
Accuracy score - 0.928
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.6.1
Classification Report:
Precision Recall F1-Score Support
Table:7.4.6.2
7.4.7 XGB Classifier
XGBoost is a powerful machine learning algorithm, especially where speed and accuracy
are concerned. XGBoost (eXtreme Gradient Boosting) is an advanced implementation
of the gradient boosting algorithm.
ADVANTAGES
1. Regularization:
• A standard GBM implementation has no regularization like XGBoost; this regularization
therefore also helps XGBoost to reduce overfitting.
High Flexibility
Tree Pruning:
• A GBM would stop splitting a node when it encounters a negative loss in the
split. Thus it is more of a greedy algorithm.
• XGBoost, on the other hand, makes splits up to the max_depth specified and then
starts pruning the tree backwards, removing splits beyond which there is no
positive gain.
• Another advantage is that sometimes a split with a negative loss, say -2, may be
followed by a split with a positive loss of +10. GBM would stop as soon as it encounters
the -2, but XGBoost will go deeper, see the combined effect of +8 for the two splits,
and keep both.
Built-in Cross-Validation
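A minimal sketch of fitting the XGB classifier described in this section, assuming the xgboost package is installed (the parameter values are illustrative, and the -1/1 Result labels are mapped to 0/1 because recent xgboost versions expect that encoding):
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1)
xgb.fit(x_train, (y_train == 1).astype(int))      # map labels {-1, 1} -> {0, 1}
xgb_pred = xgb.predict(x_test)                    # predictions in {0, 1}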
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.7.1
Classification Report:
Precision Recall F1-Score Support
Macro avg 0.94 0.94 0.94 1755
Table:7.4.7.2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math   # used below for math.sqrt() when computing RMSE
df = pd.read_csv('C:/Users/k.anusha/Documents/phishing/dataset.csv')
df.head()
df.describe()
df.isnull().sum()
df.dtypes
sns.countplot(x='Result',data=df)
x=df.drop(['Result','index'],axis=1)
y=df['Result']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
# Accuracy Score
from sklearn.metrics import accuracy_score
a1 = accuracy_score(y_test, y_pred)
a1
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# Decision Tree Classifier
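The decision-tree cell itself is not reproduced in this copy; a minimal sketch of what it likely contained, assuming scikit-learn's DecisionTreeClassifier and the variable-naming pattern (a2) of the surrounding cells:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
a2 = accuracy_score(y_test, y_pred)       # a2 is an assumed name, following a1/a4/a6
print(classification_report(y_test, y_pred))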
Output:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
a3 = accuracy_score(y_test, y_pred)       # a3 is an assumed name, following a1/a4/a6
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE3 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE3)
rf.predict([[-1,0,-1,1,-1,-1,1,1,-1,1,1,-1,1,0,0,-1,-1,-1,0,1,1,1,1,1,1,1,-1,1,-1,-1]])
Output:
# Support Vector Machine
from sklearn.svm import SVC
sv = SVC()
sv.fit(x_train, y_train)
y_pred = sv.predict(x_test)
from sklearn.metrics import accuracy_score
a4 = accuracy_score(y_test, y_pred)
a4
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE4 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE4)
sv.predict([[-1,-1,-1,1,-1,1,-1,1,1,-1,-1,1,1,1,1,-1,-1,-1,-1,1,-1,1,-1,-1,-1,1,-1,1,-1,-1]])
Output:
# Naive Bayes
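The Naive Bayes cell is likewise not reproduced in this copy; a minimal sketch, assuming scikit-learn's GaussianNB and the variable-naming pattern (a5, RMSE5) of the surrounding cells:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
nb = GaussianNB()
nb.fit(x_train, y_train)
y_pred = nb.predict(x_test)
a5 = accuracy_score(y_test, y_pred)       # a5 is an assumed name, following a1/a4/a6
print(classification_report(y_test, y_pred))
MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE5 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE5)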
# Gradient Boosting Algorithm
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(x_train, y_train)
y_pred = gb.predict(x_test)
from sklearn.metrics import accuracy_score
a6 = accuracy_score(y_test, y_pred)
a6
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE6 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE6)
gb.predict([[1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1]])
# K-Nearest Neighbors
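As with the cells above, the KNN code is not reproduced here; a minimal sketch, assuming scikit-learn's KNeighborsClassifier and an assumed accuracy variable a7:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
knn = KNeighborsClassifier()              # default k = 5 neighbours
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
a7 = accuracy_score(y_test, y_pred)       # a7 is an assumed name, following a1/a4/a6
print(classification_report(y_test, y_pred))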
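The dataframe df1 used by the bar plot below is not shown in this copy; a plausible construction from the accuracy scores computed above (the names a2, a3, a5 and a7 come from the sketches inserted above and are assumptions):
df1 = pd.DataFrame({
    'Algorithm': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM',
                  'Naive Bayes', 'Gradient Boosting', 'KNN'],
    'Accuracy': [a1, a2, a3, a4, a5, a6, a7]
})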
# Accuracy levels for various algorithms
sns.barplot(x='Algorithm', y='Accuracy', data=df1)
plt.xticks(rotation=90)
plt.title('Comparison of Accuracy Levels for various algorithms')
CONCLUSION
The present project is aimed at the classification of phishing websites based on their features.
For that we have taken the phishing dataset, which was collected from the UCI Machine Learning
Repository, and we built our models with seven different classifiers: Logistic Regression, SVC,
Naïve Bayes, XGB Classifier, Random Forest, K-Nearest Neighbours and Decision Tree, and we
obtained good accuracy scores. There is scope to enhance this work further: if we can have more
data, our project will be much more effective and we can get very good results. For this we need
API integrations to get the data of different websites.