
A Project Report on

PHISHING WEBSITE DETECTION


USING
MACHINE LEARNING ALGORITHMS
Submitted in partial fulfillment of the requirements for the award of the degree in

BACHELOR OF TECHNOLOGY
IN

COMPUTER SCIENCE AND ENGINEERING


BY
Phanindra Kumar S <178X1A0585>
Mani Sai Sankar P <178X1A0561>
Satish N <178X1A0569>
Naga Sai Teja G <178X1A0568>
Under the esteemed guidance of

Dr. Md. Sirajuddin Sir, Professor,


CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KALLAM HARANADHAREDDY INSTITUTE OF TECHNOLOGY
AN ISO 9001:2015 CERTIFIED INSTITUTION
ACCREDITED BY NBA & NAAC WITH ‘A’ GRADE
(APPROVED BY AICTE, AFFILIATED TO JNTUK, KAKINADA)

NH-5, CHOWDAVARAM, GUNTUR – 522019


2017 – 2021

KALLAM HARANADHAREDDY INSTITUTE OF TECHNOLOGY
AN ISO 9001:2015 CERTIFIED INSTITUTION ACCREDITED BY NBA &
NAAC WITH ‘A’ GRADE
(APPROVED BY AICTE, AFFILIATED TO JNTUK, KAKINADA)

NH-5, CHOWDAVARAM, GUNTUR-522019

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
This is to certify that the project work entitled "PHISHING WEBSITE DETECTION USING
MACHINE LEARNING ALGORITHMS", being submitted by Phanindra Kumar Sanam
(178X1A0585), Mani Sai Sankar Pasupuleti (178X1A0561), Satish Namala (178X1A0569), and Naga
Sai Teja G (178X1A0568) in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering at Kallam Haranadhareddy Institute of Technology, is a
bonafide work carried out by them.

Internal Guide Head of the Department


Dr. Md. Sirajuddin Sir Dr.K.V. Subba Reddy Sir
Professor Professor & HOD

External Examiner

DECLARATION

We, Phanindra Kumar Sanam (178X1A0585), Mani Sai Sankar

Pasupuleti (178X1A0561), Satish Namala (178X1A0569), and Naga Sai Teja
G (178X1A0568), hereby declare that the project report titled "PHISHING
WEBSITE DETECTION USING MACHINE LEARNING ALGORITHMS",

carried out under the guidance of Dr. Md. Sirajuddin Sir, is submitted in partial
fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering.

This is a record of bonafide work carried out by us and the results embodied in
this project have not been reproduced or copied from any source. The results
embodied in this project have not been submitted to any other university for the
award of any other degree.

Phanindra Kumar S <178X1A0585>


Mani Sai Sankar P <178X1A0561>
Satish N <178X1A0569>
Naga Sai Teja G <178X1A0568>

ACKNOWLEDGEMENT
We are profoundly grateful and express our deep sense of gratitude and respect towards our
honorable chairman, our grandfather Sri KALLAM HARANADHA REDDY, Chairman of the
Kallam Group, for his precious support to the college.

We are thankful to Dr. M. UMA SANKAR REDDY, Director, KHIT, GUNTUR for his
encouragement and support for the completion of the project.

We are much thankful to Dr. B. SIVA BASIVI REDDY, Principal, KHIT, GUNTUR, for his
support throughout the completion of the project.

We are greatly indebted to Dr. K. VENKATA SUBBA REDDY, Professor & Head,
Department of Computer Science and Engineering, KHIT, GUNTUR, for providing the laboratory
facilities to the fullest extent as and when required and also for giving us the opportunity to carry
out the project work in the college.

We are also thankful to our Project Coordinators Mr. N. Md. Jubair Basha and

Mr. P. LAKSHMIKANTH who helped us in each step of our Project.

We extend our deep sense of gratitude to our Internal Guide, Dr. Md. Sirajuddin Sir, and the other
Faculty Members & Support Staff for their valuable suggestions, guidance and constructive ideas at
each and every step, which were indeed of great help towards the successful completion of our
project.

Phanindra Kumar S <178X1A0585>


Mani Sai Sankar P <178X1A0561>
Satish N <178X1A0569>
Naga Sai Teja G <178X1A0568>

ABSTRACT
The main objective of this project is to classify phishing websites by using machine learning
algorithms. The Internet has become an important part of our lives, and the use of smart
internet-based devices for financial transactions provides a platform for attackers to launch
various attacks. This project addresses the phishing attack. Phishing is a type of social
engineering attack that is often used to steal users' personal data. Due to the COVID-19 crisis
there has been a rise in phishing attacks, through which attackers have seized the personal
information of users. One of the ways of launching a phishing attack is through phishing
websites. A phishing website gives users the impression of being a legitimate website and
steals their personal information. Machine learning is a powerful tool to strive against phishing
attacks. In this project different machine learning algorithms are considered for detecting
phishing websites.

Keywords--- Phishing, Phishing Websites, Detection, Machine Learning.

Project Supervisor: DR. MD. SIRAJUDDIN SIR

TABLE OF CONTENTS

TITLE Page No.

CHAPTER 1: INTRODUCTION 1-2

1.1 Introduction 1

1.2 Purpose of the System 1

1.3 Problem Statement 1

1.4 Solution of Problem Statement 2

CHAPTER 2: REQUIREMENTS 3-3

2.1 Hardware Requirements 3

2.2 Software Requirements 3

CHAPTER 3: SYSTEM ANALYSIS 4-12

3.1 Study of System 4

3.2 Existing System 6

3.3 Proposed System 6

3.4 Input and Output 7

3.5 Process Models Used with Justification 8

3.6 Feasibility Study 10

3.7 Economic Feasibility 10

3.8 Operational Feasibility 11

3.9 Technical Feasibility 12

CHAPTER 4: SOFTWARE REQUIREMENT SPECIFICATION 13-15

4.1 Functional Requirements 14

4.2 Non-Functional Requirements 15

CHAPTER 5: LANGUAGES OF IMPLEMENTATION 17-26

5.1 Python 17

5.2 Difference Between a Script and a Program 17

5.3 History of Python 19

5.4 Python features 19

5.5 Dynamic vs Static 21

5.6 Variables 21

5.7 Standard Data Types 22

5.8 Different Modes in Python 23

5.9 Datasets 26

CHAPTER 6: SYSTEM DESIGN 27-40

6.1 Introduction 27

6.2 Normalization 27

6.3 ER- Diagram 29

6.4 Data Flow Diagram 29

6.5 UML Diagrams 32-40

6.5.1 Use case Diagram 32

6.5.2 Class Diagram 33

6.5.3 Object Diagram 33

6.5.4 Component Diagram 34

6.5.5 Deployment Diagram 35

6.5.6 State Diagram 36

6.5.7 Sequence Diagram 37

6.5.8 Collaboration Diagram 38

6.5.9 Activity Diagram 39

6.5.10 System Architecture 40

CHAPTER 7: IMPLEMENTATION 41-59

7.1 Data Collection 41

7.2 Data Analysis 42

7.3 Data Processing 43

7.4 Modeling 44-56

7.4.1 Logistic Regression 44

7.4.2 Random Forest Classifier 47

7.4.3 Decision Tree Classifier 49

7.4.4 Naive Bayes Classifier 51

7.4.5 Support Vector Machine 53


7.4.6 K-Nearest Neighbors 55

7.4.7 XGB Classifier 56

7.5 Coding and Execution 59

CHAPTER 8: CONCLUSION 71

CHAPTER - 1

1. INTRODUCTION
1.1 Introduction to project
Internet use has become an essential part of our daily activities as a result of rapidly
growing technology. Due to this rapid growth of technology and intensive use of digital
systems, data security of these systems has gained great importance. The primary
objective of maintaining security in information technologies is to ensure that necessary
precautions are taken against threats and dangers likely to be faced by users during the
use of these technologies. Phishing is defined as imitating reliable websites in order to
obtain the proprietary information entered into websites every day for various purposes,
such as usernames, passwords and citizenship numbers. Phishing websites contain
various hints among their contents and web browser-based information. The individual(s)
committing the fraud send the fake website or e-mail to the target address
as if it came from an organization, bank or any other reliable source that performs
reliable transactions. The contents of the website or the e-mail include requests aiming to lure
the individuals to enter or update their personal information or to change their passwords
as well as links to websites that look like exact copies of the websites of the organizations
concerned.
1.2 Purpose of the project
Phishing is one of the most common and most dangerous attacks among cybercrimes.
The aim of these attacks is to steal the information used by individuals and organizations
to conduct transactions. Phishing websites contain various hints among their contents and
web browser-based information. The purpose of this study is to perform Extreme
Learning Machine (ELM) based classification on the 30 features of the Phishing
Websites dataset in the UC Irvine Machine Learning Repository.

1.3 Problem statement


• Low performance
• Low accuracy
• Complex detection process

1.4 Solution for the problem statement:

We proposed a system that uses machine learning techniques and algorithms such as
Logistic Regression, KNN, SVC, Random Forest, Decision Tree, XGB Classifier and
Naïve Bayes to predict phishing websites. We trained our model with a large dataset containing
more than 30 features. We performed hyperparameter tuning to improve the accuracy of the
machine learning algorithms. We considered all the classification metrics for testing
the models and tried to improve the models for precision and recall.
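As a rough illustration of this workflow, the sketch below trains a few of the listed classifiers with a small grid search and reports precision and recall. The file name "phishing.csv", the target column "Result", and the parameter grids are assumptions made for illustration, not the exact values used in the project.

# Hedged sketch: training several candidate classifiers with hyperparameter tuning.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = pd.read_csv("phishing.csv")                      # assumed file name
X, y = data.drop("Result", axis=1), data["Result"]      # assumed target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Small illustrative parameter grids for a subset of the candidate models.
candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "Random Forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5)            # hyperparameter tuning
    search.fit(X_train, y_train)
    print(name, search.best_params_)
    print(classification_report(y_test, search.predict(X_test)))   # precision and recall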

CHAPTER - 2

2. REQUIREMENTS

2.1 Hardware Requirement: -

• RAM: 4GB

• Processor: Intel i3

• Hard Disk: 120GB

2.2 Software Requirement: -

• OS: Windows or Linux

• Software : Anaconda

• Jupyter IDE

• Language : Python Scripting

CHAPTER - 3

3. SYSTEM ANALYSIS

3.1 STUDY OF THE SYSTEM

1. Numpy
2. Pandas
3. Matplotlib
4. Scikit –learn

1. NumPy:

NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. It contains various features, including these
important ones:

• A powerful N-Dimensional array object.

• Sophisticated (broadcasting) functions.

• Tools for integrating C/C++ and Fortran code.

• Useful linear algebra, Fourier transform, and random number capabilities
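A brief NumPy example (values chosen purely for illustration) touching the array object, broadcasting, linear algebra and random number capabilities listed above:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-dimensional array object
print(a.shape)                           # (2, 2)
print(a * 10)                            # broadcasting a scalar over the array
print(np.linalg.inv(a))                  # linear algebra: matrix inverse
print(np.random.rand(3))                 # random number capabilities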

2. Pandas

Pandas is an open-source Python library providing high-performance data manipulation
and analysis tools built on its powerful data structures. Previously, Python was mostly used for
data munging and preparation and contributed very little to data analysis itself; Pandas solved
this problem. Using Pandas, we can accomplish the five typical steps in the processing and
analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and
analyze. Python with Pandas is used in a wide range of academic and commercial domains,
including finance, economics, statistics, and analytics.
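A minimal Pandas example (the file name is an assumption used for illustration) covering the load, prepare and analyze steps described above:

import pandas as pd

df = pd.read_csv("phishing.csv")      # load (assumed file name)
df = df.dropna()                      # prepare: drop missing rows, if any
print(df.describe())                  # analyze: summary statistics
print(df["Result"].value_counts())    # class balance of an assumed target column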
3. Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats and interactive environments across platforms. Matplotlib can be
used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application
servers, and four graphical user interface toolkits. Matplotlib tries to make easy things easy
and hard things possible. You can generate plots, histograms, power spectra, bar charts, error
charts, scatter plots, etc., with just a few lines of code. For examples, see the sample plots
and thumbnail gallery.

For simple plotting the pyplot module provides a MATLAB-like interface, particularly
when combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
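For instance, a few lines of pyplot (with illustrative values only) are enough to produce a simple labelled plot:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
plt.plot(x, [v ** 2 for v in x], marker="o")   # simple line plot with markers
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("A minimal Matplotlib example")
plt.show()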

4. Scikit-learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a
consistent interface in Python. It is licensed under a permissive simplified BSD license and is
distributed in many Linux distributions, encouraging academic and commercial use. The
library is built upon the SciPy (Scientific Python) stack, which must be installed before you can
use scikit-learn. This stack includes:

• NumPy: Base n-dimensional array package


• SciPy: Fundamental library for scientific computing
• Matplotlib: Comprehensive 2D/3D plotting
• IPython: Enhanced interactive console
• Sympy: Symbolic mathematics
• Pandas: Data structures and analysis
Extensions or modules for SciPy are conventionally named SciKits. As such, the
module that provides learning algorithms is named scikit-learn.
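A short scikit-learn sketch showing the consistent fit/predict interface; synthetic data stands in here for the phishing features, so nothing below is specific to the project's dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a 30-feature classification problem.
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=5).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))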

3.2 EXISTING SYSTEM

➢ In the existing system, phishing website detection is done by data mining
algorithms like J48 and C4.5 in the Weka Explorer, which are not suitable for very large
datasets.
➢ In most of the existing systems the dataset is too small to improve precision, and
as a result they produce false positives and false negatives.
➢ The number of features in the existing models is very small, and as a result the existing
systems face the threat of becoming outdated.
➢ With rapid growth in technology, there are many ways open for an attacker to pose
phishing threats for an internet user.
➢ Attackers are finding new ways to break into the existing systems and are able to
phish the data of users. Traditional features are being perfectly exploited by cyber
attackers.

3.3. PROPOSED SYSTEM

This study considers the problem of predicting online purchase conversions in an e-commerce
site. To understand user behavior and intent on the web, existing predictors leverage the
traditional search pattern of entering queries then clicking on interesting results. However,
conversion takes more than a click. That is, after repeatedly clicking around and being
exposed to advertising (i.e., retargeted), users’ ultimate success metric of the marketplace
search is buying products. Beyond the traditional mechanism, our contribution is to allow the
predictors to consider dynamic marketplace mechanisms for a deeper prediction of both
clicks and purchases. Specifically, inspired by traditional search problems we focus on two
research questions: “Prediction from market” and “Predictability from individual” for
conversion.

Phishing is one of the most common and most dangerous attacks among cybercrimes.
The aim of these attacks is to steal the information used by individuals and organizations to
conduct transactions. Phishing websites contain various hints among their contents and web
browser-based information. The purpose of this study is to perform Extreme Learning
Machine (ELM) based classification on the 30 features of the Phishing Websites dataset in the
UC Irvine Machine Learning Repository.

3.4 INPUT AND OUTPUT

The following are the project's inputs and outputs.

Inputs:

➢ Importing all the required packages, such as NumPy, Pandas, Matplotlib, scikit-learn and
the required machine learning algorithm packages.

➢ Setting the dimensions of the visualization graphs.

➢ Downloading and importing the dataset and converting it to a data frame.


Outputs:
➢ Preprocessing the imported data frame by imputing nulls with the related
information.

➢ Displaying the cleaned outputs.

➢ Applying the machine learning algorithms and producing the results and
visualization plots.
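A hedged sketch of the input steps listed above (the figure size and the file name are illustrative assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (10, 6)   # set the dimensions of visualization graphs

df = pd.read_csv("phishing.csv")           # assumed local copy of the downloaded dataset
print(df.shape)                            # rows and columns of the data frame
print(df.isnull().sum())                   # check for nulls before preprocessing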

3.5 PROCESS MODELS USED WITH JUSTIFICATION

SDLC Model: (Software Development Life Cycle)

This project uses an iterative development lifecycle, where components of the application
are developed through a series of tight iterations. The first iteration focuses on very basic
functionality, with subsequent iterations adding new functionality to the previous work and or
correcting errors identified for the components in production.

The six stages of the SDLC are designed to build on one another, taking outputs from the
previous stage, adding additional effort, and producing results that leverage the previous effort
and are directly traceable to the previous stages. During each stage, additional information is
gathered or developed, combined with the inputs, and used to produce the stage deliverables.
It is important to note that the additional information is restricted in scope; new ideas that
would take the project in directions not anticipated by the initial set of high-level
requirements, or features that are out of scope, are preserved for later consideration.

Too many software development efforts go awry when the development team and
customer personnel get caught up in the possibilities of automation. Instead of focusing on
high-priority features, the team can become mired in a sea of nice-to-have features that are not
essential to solve the problem but are in themselves highly attractive. This is the root cause
of a large percentage of failed and/or abandoned development efforts and is the primary
reason the development team utilizes the iterative model.

DESIGN PRINCIPLES & METHODOLOGY:

Object Oriented Analysis And Design

When Object orientation is used in analysis as well as design, the boundary between
OOA and OOD is blurred. This is particularly true in methods that combine analysis and
design. One reason for this blurring is the similarity of basic constructs (i.e., objects and
classes) that are used in OOA and OOD. Though there is no agreement about what parts of
the object-oriented development process belong to analysis and what parts to design, there is
some general agreement about the domains of the two activities.

The fundamental difference between OOA and OOD is that the former models the
problem domain, leading to an understanding and specification of the problem, while the
latter models the solution to the problem. That is, analysis deals with the problem domain,
while design deals with the solution domain. However, the solution domain representation
generally subsumes the problem domain representation; that is, the representation created by
OOD generally contains much of the representation created by OOA. The separating line is a matter of
perception, and different people have different views on it. The lack of a clear separation
between analysis and design can also be considered one of the strong points of the object-
oriented approach; the transition from analysis to design is "seamless". This is also the main
reason for OOAD methods, where analysis and design are both performed together.

The main difference between OOA and OOD, due to the different domains of modeling,
is in the type of objects that come out of the analysis and design process.

Features of OOAD:

• It uses objects as the building blocks of the application rather than functions.

• All objects can be represented graphically including the relation between them.

• All Key Participants in the system will be represented as actors and the actions done by
them will be represented as use cases.

• A typical use case is nothing but a systematic flow of a series of events, which can be well
described using sequence diagrams, and each event can be described diagrammatically by
activity as well as state chart diagrams.

• So the entire system can be well described using the OOAD model, hence this model is
chosen as the SDLC model.

3.6 FEASIBILITY Study

Preliminary investigation examines project feasibility, the likelihood the system will be
useful to the organization. The main objective of the feasibility study is to test the Technical,
Operational and Economical feasibility for adding new modules and debugging old running
systems. All systems are feasible if they are given unlimited resources and infinite time. There are
three aspects in the feasibility study portion of the preliminary investigation:

● Technical Feasibility
● Operational Feasibility
● Economical Feasibility

3.7 ECONOMIC FEASIBILITY

A system that can be developed technically, and that will be used if installed, must still be a
good investment for the organization. In the economic feasibility study, the development cost of
creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs.

The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available at NIC, there is nominal expenditure, and economic feasibility is
certain.

3.8 OPERATIONAL FEASIBILITY

Proposed projects are beneficial only if they can be turned into an information system
that will meet the organization's operating requirements. Operational feasibility aspects of
the project are to be taken as an important part of the project implementation. Some of the
important issues raised to test the operational feasibility of a project include the
following:

● Is there sufficient support for the management from the users?


● Will the system be used and work properly if it is being developed and
implemented?
● Will there be any resistance from the user that will undermine the possible
application benefits?

The well-planned design would ensure the optimal utilization of the computer resources
and would help in the improvement of performance status.

This system is targeted to be in accordance with the above-mentioned issues. Beforehand,


the management issues and user requirements have been taken into consideration. So there is
no question of resistance from the users that can undermine the possible application benefits.


3.9 TECHNICAL FEASIBILITY

The technical issue usually raised during the feasibility stage of the investigation includes
the following:

● Does the necessary technology exist to do what is suggested?


● Does the proposed equipment have the technical capacity to hold the data required
to use the new system?
● Will the proposed system provide adequate response to inquiries, regardless of the
number or location of users?
● Can the system be upgraded if developed?
● Are there technical guarantees of accuracy, reliability, ease of access and data
security?

Earlier no system existed to cater to the needs of ‘Secure Infrastructure Implementation


System’. The current system developed is technically feasible. It is a web based user interface
for audit workflow at NIC-CSD. Thus it provides easy access to the users. The database’s
purpose is to create, establish and maintain a workflow among various entities in order to
facilitate all concerned users in their various capacities or roles. Permission to the users would
be granted based on the roles specified. Therefore, it provides the technical guarantee of
accuracy, reliability and security. The software and hardware requirements for the development of
this project are not many and are already available in-house at NIC or are available free as
open source.
software technology. Necessary bandwidth exists for providing fast feedback to the users
irrespective of the number of users using the system.

CHAPTER - 4

4. SOFTWARE REQUIREMENT SPECIFICATION

A Software Requirements Specification (SRS) – a requirements specification for a


software system – is a complete description of the behavior of a system to be developed. It
includes a set of use cases that describe all the interactions the users will have with the
software. In addition to use cases, the SRS also contains non-functional requirements. Non-
functional requirements are requirements which impose constraints on the design or
implementation (such as performance engineering requirements, quality standards, or design
constraints).

System requirements specification: A structured collection of information that embodies


the requirements of a system. A business analyst, sometimes titled system analyst, is
responsible for analyzing the business needs of their clients and stakeholders to help identify
business problems and propose solutions. Within the systems development life cycle domain,
the analyst typically performs a liaison function between the business side of an enterprise and the
information technology department or external service providers. Projects are subject to three
sorts of requirements:

● Business requirements describe in business terms what must be delivered or


accomplished to provide value.
● Product requirements describe properties of a system or product (which could be
one of several ways to accomplish a set of business requirements).
● Process requirements describe activities performed by the developing
organization. For instance, process requirements could specify specific
methodologies that must be followed, and constraints that the organization must
obey.
Product and process requirements are closely linked. Process requirements often specify
the activities that will be performed to satisfy a product requirement. For example, a
maximum development cost requirement (a process requirement) may be imposed to help
achieve a maximum sales price requirement (a product requirement); a requirement that the
product be maintainable is often addressed by imposing requirements to follow particular
development styles.

PURPOSE

In systems engineering, a requirement can be a description of what a system must do,


referred to as a Functional Requirement. This type of requirement specifies something that
the delivered system must be able to do. Another type of requirement specifies something
about the system itself, and how well it performs its functions. Such requirements are often
called Non-functional requirements, or 'performance requirements' or 'quality of service
requirements.' Examples of such requirements include usability, availability, reliability,
supportability, testability and maintainability.

A collection of requirements define the characteristics or features of the desired system. A


'good' list of requirements as far as possible avoids saying how the system should implement
the requirements, leaving such decisions to the system designer. Specifying how the system
should be implemented is called "implementation bias" or "solution engineering". However,
implementation constraints on the solution may validly be expressed by the future owner, for
example for required interfaces to external systems; for interoperability with other systems;
and for commonality (e.g. of user interfaces) with other owned products.

In software engineering, the same meanings of requirements apply, except that the focus
of interest is the software itself.

4.1 FUNCTIONAL REQUIREMENTS


• Load data

• Data analysis

• Data preprocessing

• Model building

• Prediction

4.2 NON-FUNCTIONAL REQUIREMENTS

1. Secure access of confidential data (user’s details). SSL can be used.


2. 24 X 7 availability.
3. Better component design to get better performance at peak time
4. Flexible service based architecture will be highly desirable for future extension

Introduction to Django

Django is the Web development framework that saves you time and
makes Web development a joy. Using Django, you can build and maintain high-quality Web
applications with minimal fuss. At its best, Web development is an exciting, creative act; at
its worst, it can be a repetitive, frustrating nuisance. Django lets you focus on the fun stuff —
the crux of your Web application — while easing the pain of the repetitive bits. In doing so, it
provides high-level abstractions of common Web development patterns, shortcuts for frequent
programming tasks, and clear conventions for how to solve problems. At the same time,
Django tries to stay out of your way, letting you work outside the scope of the framework as
needed. The goal of this book is to make you a Django expert. The focus is twofold. First, we
explain, in depth, what Django does and how to build Web applications with it. Second, we
discuss higher-level concepts where appropriate, answering the question “How can I apply
these tools effectively in my own projects?” By reading this book, you’ll learn the skills
needed to develop powerful Web sites quickly, with code that is clean and easy to maintain.

What Is a Web Framework?

Django is a prominent member of a new generation of Web frameworks. So what


exactly does that term mean? To answer that question, let’s consider the design of a Web
application written using the Common Gateway Interface (CGI) standard, a popular way to
write Web applications circa 1998. In those days, when you wrote a CGI application, you did
everything yourself — the equivalent of baking a cake from scratch. For example, here’s a
simple CGI script, written in Python, that displays the ten most recently published books from
a database:
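The original listing did not survive extraction here; the sketch below is a reconstruction in the same spirit, assuming a MySQL database accessed through the MySQLdb module and a books table with name and pub_date columns, not the exact code from the report.

#!/usr/bin/env python
# Hedged reconstruction of a classic CGI-style page; the table and column
# names (books, name, pub_date) and the connection details are assumptions.
import MySQLdb

print("Content-Type: text/html\n")
print("<html><head><title>Books</title></head><body>")
print("<h1>The 10 most recently published books</h1>")
print("<ul>")

connection = MySQLdb.connect(user="me", passwd="letmein", db="my_db")
cursor = connection.cursor()
cursor.execute("SELECT name FROM books ORDER BY pub_date DESC LIMIT 10")
for row in cursor.fetchall():
    print("<li>%s</li>" % row[0])

print("</ul></body></html>")
connection.close()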

With a one-off dynamic page such as this one, the write-it-from-scratch approach isn’t
necessarily bad. For one thing, this code is simple to comprehend — even a novice developer
can read these 16 lines of Python and understand all it does, from start to finish. There’s
nothing else to learn; no other code to read. It’s also simple to deploy: just save this code in a
file called latestbooks.cgi, upload that file to a Web server, and visit that page with a browser.
But as a Web application grows beyond the trivial, this approach breaks down, and you face a
number of problems:

Should a developer really have to worry about printing the “Content-Type” line and
remembering to close the database connection? This sort of boilerplate reduces programmer
productivity and introduces opportunities for mistakes. These setup- and teardown-related
tasks would best be handled by some common infrastructure.

CHAPTER : 5
5. LANGUAGES OF IMPLEMENTATION
5.1 Python

What Is A Script?

Up to this point, I have concentrated on the interactive programming capability of
Python. This is a very useful capability that allows you to type in a program and to have
it executed immediately in an interactive mode.

Scripts are reusable

Basically, a script is a text file containing the statements that comprise a Python
program. Once you have created the script, you can execute it over and over without
having to retype it each time.
Scripts are editable

Perhaps, more importantly, you can make different versions of the script by
modifying the statements from one file to the next using a text editor. Then you can
execute each of the individual versions. In this way, it is easy to create different programs
with a minimum amount of typing.

You will need a text editor

Just about any text editor will suffice for creating Python script files.

You can use Microsoft Notepad, Microsoft WordPad, Microsoft Word, or just about
any word processor if you want to.

5.2 Difference between a script and a program

Script:

Scripts are distinct from the core code of the application, which is usually written in a
different language, and are often created or at least modified by the end user. Scripts are
often interpreted from source code or bytecode, whereas the applications they control are
traditionally compiled to native machine code.

Program:

The program has an executable form that the computer can use directly to execute the
instructions.

The same program also exists in its human-readable source code form, from which executable
programs are derived (e.g., compiled).
Python

What is Python?

Chances are you are asking yourself this. You may have found this book because you
want to learn to program but don’t know anything about programming languages. Or you
may have heard of programming languages like C, C++, C#, or Java and want to know
what Python is and how it compares to “big name” languages. Hopefully I can explain it
for you.

Python concepts

If you're not interested in the hows and whys of Python, feel free to skip to the next chapter.
In this chapter I will try to explain to the reader why I think Python is one of the best
languages available and why it’s a great one to start programming with.

Open source general-purpose language.

• Object Oriented, Procedural, Functional

• Easy to interface with C/ObjC/Java/Fortran

• Easy-ish to interface with C++ (via SWIG)

• Great interactive environment

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python


is designed to be highly readable. It uses English keywords frequently, whereas other
languages use punctuation, and it has fewer syntactical constructions than other languages.

• Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to PERL and PHP.

• Python is Interactive − You can actually sit at a Python prompt and interact with the interpreter
directly to write your programs.

• Python is Object-Oriented − Python supports Object-Oriented style or technique of


programming that encapsulates code within objects.

• Python is a Beginner's Language − Python is a great language for the beginner-level


programmers and supports the development of a wide range of applications from simple text
processing to WWW browsers to games.

5.3 History of Python

Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, and Unix shell and other scripting languages.

Python is copyrighted. Like Perl, Python source code is now available under the GNU General
Public License (GPL).

Python is now maintained by a core development team at the institute, although Guido van
Rossum still holds a vital role in directing its progress.

5.4 Python Features

Python's features include −

• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax. This
allows the student to pick up the language quickly.

• Easy-to-read − Python code is more clearly defined and visible to the eyes.

• Easy-to-maintain − Python's source code is fairly easy to maintain.

• A broad standard library − The bulk of Python's library is very portable and cross-
platform compatible on UNIX, Windows, and Macintosh.

• Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.

• Portable − Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.

• Extendable − You can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.

• Databases − Python provides interfaces to all major commercial databases.

• GUI Programming − Python supports GUI applications that can be created and ported to many
system calls, libraries and windowing systems, such as Windows MFC, Macintosh, and the X
Window System of Unix.

• Scalable − Python provides a better structure and support for large programs than shell
scripting.

Apart from the above-mentioned features, Python has a big list of good features, few are listed
below −

• It supports functional and structured programming methods as well as OOP.

• It can be used as a scripting language or can be compiled to byte-code for building large
applications.

• It provides very high-level dynamic data types and supports dynamic type checking.

• It supports automatic garbage collection.

• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

5.5 Dynamic vs Static Types

Python is a dynamically typed language. Many other languages are statically typed, such as
C/C++ and Java. A statically typed language requires the programmer to explicitly tell the
computer what type of "thing" each data value is.

For example, in C if you had a variable that was to contain the price of something, you would
have to declare the variable as a “float” type.
This tells the compiler that the only data that can be used for that variable must be a floating
point number, i.e. a number with a decimal point.

Python, however, doesn’t require this. You simply give your variables names and assign
values to them. The interpreter takes care of keeping track of what kinds of objects your
program is using. This also means that you can change the size of the values as you develop
the program. Say you have another decimal number (a.k.a. a floating point number) you need
in your program.

With a static typed language, you have to decide the memory size the variable can take when
you first initialize that variable. A double is a floating point value that can handle a much
larger number than a normal float (the actual memory sizes depend on the operating
environment). If you declare a variable to be a float but later on assign a value that is too big
to it, your program will fail; you will have to go back and change that variable to be a double.
With Python, it doesn’t matter. You simply give it whatever number you want and Python will
take care of manipulating it as needed. It even works for derived values.

For example, say you are dividing two numbers. One is a floating point number and one is an
integer. Python realizes that it’s more accurate to keep track of decimals so it automatically
calculates the result as a floating point number
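A small illustration of this dynamic behaviour (values are arbitrary):

price = 19.99              # no type declaration needed; Python infers a float
price = "19.99 USD"        # the same name can later hold a string instead

x = 7.5                    # a float
y = 2                      # an int
print(x / y, type(x / y))  # 3.75 <class 'float'> -- the result is kept as a float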

5.6 Variables

Variables are nothing but reserved memory locations to store values. This means that when

you create a variable you reserve some space in memory.

Based on the data type of a variable, the interpreter allocates memory and decides what can be
stored in the reserved memory. Therefore, by assigning different data types to variables, you
can store integers, decimals or characters in these variables.

5.7 Standard Data Types

The data stored in memory can be of many types. For example, a person's age is stored as a
numeric value and his or her address is stored as alphanumeric characters. Python has various
standard data types that are used to define the operations possible on them and the storage
method for each of them.

Python has five standard data types −

• Numbers

• String

• List

• Tuple

• Dictionary

Python Numbers

Number data types store numeric values. Number objects are created when you assign a value
to them

Python Strings

Strings in Python are identified as a contiguous set of characters represented in the quotation
marks. Python allows for either pairs of single or double quotes. Subsets of strings can be
taken using the slice operator ([ ] and [:] ) with indexes starting at 0 in the beginning of the

string and working their way from -1 at the end.

Python Lists

Lists are the most versatile of Python's compound data types. A list contains items separated
by commas and enclosed within square brackets ([]). To some extent, lists are similar to arrays
in C. One difference between them is that all the items belonging to a list can be of different
data type.

The values stored in a list can be accessed using the slice operator ([ ] and [:]) with indexes
starting at 0 in the beginning of the list and working their way to end -1. The plus (+) sign is
the list concatenation operator, and the asterisk (*) is the repetition operator.

Python Tuples

A tuple is another sequence data type that is similar to the list. A tuple consists of a number of
values separated by commas. Unlike lists, however, tuples are enclosed within parentheses.
The main differences between lists and tuples are: Lists are enclosed in brackets ( [ ] ) and
their elements and size can be changed, while tuples are enclosed in parentheses ( ( ) ) and
cannot be updated. Tuples can be thought of as read-only lists.

Python Dictionary

Python's dictionaries are a kind of hash table. They work like the associative arrays or hashes
found in Perl and consist of key-value pairs. A dictionary key can be almost any Python type,
but keys are usually numbers or strings. Values, on the other hand, can be any arbitrary Python
object.

Dictionaries are enclosed by curly braces ({ }) and values can be assigned and accessed using
square braces ([]).
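A compact example (illustrative values only) touching each of the five standard data types described above:

count = 3                                    # Number
name = "phishing"                            # String; slicing: name[0:5] -> 'phish'
features = ["url_length", "has_ssl", 30]     # List: mutable, items of different types
row = ("http://a.example", -1)               # Tuple: a read-only sequence
labels = {1: "legitimate", -1: "phishing"}   # Dictionary: key-value pairs

print(name[0:5], features + ["redirect"], labels[-1])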

5.8 Different Modes in Python

Python has two basic modes: normal and interactive.

The normal mode is the mode where the scripted and finished .py files are run in the Python
interpreter.

Interactive mode is a command line shell which gives immediate feedback for each statement,
while running previously fed statements in active memory. As new lines are fed into the
interpreter, the fed program is evaluated both in part and in whole.

20 Python libraries
1. Requests. The most famous HTTP library, written by Kenneth Reitz. It's a must
have for every Python developer.

2. Scrapy. If you are involved in webscraping then this is a must have library for
you. After using this library you won’t use any other.

3. wxPython. A gui toolkit for python. I have primarily used it in place of tkinter.
You will really love it.

4. Pillow. A friendly fork of PIL (Python Imaging Library). It is more user friendly
than PIL and is a must have for anyone who works with images.

5. SQLAlchemy. A database library. Many love it and many hate it. The choice is
yours.

6. BeautifulSoup. I know it’s slow but this xml and html parsing library is very
useful for beginners.

7. Twisted. The most important tool for any network application developer. It has
a very beautiful api and is used by a lot of famous python developers.

8. NumPy. How can we leave out this very important library? It provides some
advanced math functionality to Python.

9. SciPy. When we talk about NumPy then we have to talk about scipy. It is a
library of algorithms and mathematical tools for python and has caused many
scientists to switch from ruby to python.

10. matplotlib. A numerical plotting library. It is very useful for any data

scientist or any data analyzer.

11. Pygame. Which developer does not like to play games and develop them ? This
library will help you achieve your goal of 2d game development.
12. Pyglet. A 3d animation and game creation engine. This is the engine in which
the famous python port of minecraft was made.

13. pyQT. A GUI toolkit for python. It is my second choice after wxpython for
developing GUI’s for my python scripts.

14. pyGtk. Another python GUI library. It is the same library in which the famous
Bittorrent client is created.

15. Scapy. A packet sniffer and analyzer for python made in python.

16. pywin32. A python library which provides some useful methods and classes
for interacting with windows.

17. nltk. Natural Language Toolkit – I realize most people won't be using this one,
but it's generic enough. It is a very useful library if you want to manipulate
strings, but its capacity is beyond that. Do check it out.

18. nose. A testing framework for python. It is used by millions of python


developers. It is a must have if you do test driven development.

19. SymPy. SymPy can do algebraic evaluation, differentiation, expansion,


complex numbers, etc. It is contained in a pure Python distribution.

20. IPython. I just can’t stress enough how useful this tool is. It is a python prompt
on steroids. It has completion, history, shell capabilities, and a lot more. Make
sure that you take a look at it.

Numpy:
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually
numbers), all of the same type, indexed by a tuple of positive integers. In NumPy, dimensions are called axes.
The number of axes is the rank.
It offers MATLAB-ish capabilities within Python:

• Fast array operations

• 2D arrays, multi-D arrays, linear algebra etc.

matplotlib

• High quality plotting library.

5.9 DATA SETS

DataSets

The DataSet object is similar to the ADO Recordset object, but more powerful, and
with one other important distinction: the DataSet is always disconnected. The DataSet
object represents a cache of data, with database-like structures such as tables, columns,
relationships, and constraints. However, though a DataSet can and does behave much
like a database, it is important to remember that DataSet objects do not interact
directly with databases, or other source data. This allows the developer to work with a
programming model that is always consistent, regardless of where the source data
resides. Data coming from a database, an XML file, from code, or user input can all be
placed into DataSet objects. Then, as changes are made to the DataSet they can be
tracked and verified before updating the source data. The GetChanges method of the
DataSet object actually creates a second DataSet that contains only the changes to the
data. This DataSet is then used by a DataAdapter (or other objects) to update the
original data source.
The DataSet has many XML characteristics, including the ability to produce and consume
XML data and XML schemas. XML schemas can be used to describe schemas interchanged
via WebServices. In fact, a DataSet with a schema can actually be compiled for type safety
and statement completion.

CHAPTER : 6

6. SYSTEM DESIGN

6.1 INTRODUCTION

Software design sits at the technical kernel of the software engineering process and is applied
regardless of the development paradigm and area of application. Design is the first step in the
development phase for any engineered product or system. The designer’s goal is to produce a
model or representation of an entity that will later be built. Beginning once system
requirements have been specified and analyzed, system design is the first of the three technical
activities - design, code and test - that are required to build and verify software.

The importance can be stated with a single word “Quality”. Design is the place where
quality is fostered in software development. Design provides us with representations of
software that can be assessed for quality. Design is the only way that we can accurately translate a
customer’s view into a finished software product or system. Software design serves as a
foundation for all the software engineering steps that follow. Without a strong design we risk
building an unstable system – one that will be difficult to test, one whose quality cannot be
assessed until the last stage.

During design, progressive refinements of data structure, program structure, and
procedural details are developed, reviewed and documented. System design can be viewed
from either technical or project management perspective. From the technical point of view,
design is comprised of four activities – architectural design, data structure design, interface
design and procedural design.

6.2 NORMALIZATION

It is a process of converting a relation to a standard form. The process is used to handle
the problems that can arise due to data redundancy, i.e. repetition of data in the database, to
maintain data integrity, and to handle problems that can arise due to insertion, updation and
deletion anomalies.

Decomposition is the process of splitting relations into multiple relations to eliminate
anomalies and maintain data integrity. To do this we use normal forms, or rules for structuring
relations.

Insertion anomaly: Inability to add data to the database due to the absence of other data.

Deletion anomaly: Unintended loss of data due to deletion of other data.

Update anomaly: Data inconsistency resulting from data redundancy and partial update.

Normal Forms: These are the rules for structuring relations that eliminate anomalies.

FIRST NORMAL FORM:

A relation is said to be in first normal form if the values in the relation are atomic for every
attribute in the relation. By this we mean simply that no attribute value can be a set of values
or, as it is sometimes expressed, a repeating group.

SECOND NORMAL FORM:

A relation is said to be in second normal form if it is in first normal form and it satisfies
any one of the following rules.

1) The primary key is not a composite primary key.

2) No non-key attributes are present.

3) Every non-key attribute is fully functionally dependent on the full set of the primary key.

THIRD NORMAL FORM:

A relation is said to be in third normal form if there exist no transitive dependencies.

Transitive Dependency: If two non-key attributes depend on each other as well as on the
primary key then they are said to be transitively dependent.

The above normalization principles were applied to decompose the data in multiple tables
thereby making the data to be maintained in a consistent state.

6.3 E – R DIAGRAMS

• The relations within the system are structured through a conceptual ER diagram, which not only
specifies the existential entities but also the standard relations through which the system exists
and the cardinalities that are necessary for the system state to continue.

• The Entity Relationship Diagram (ERD) depicts the relationships between the data objects. The
ERD is the notation that is used to conduct the data modeling activity; the attributes of each data
object noted in the ERD can be described using a data object description.

• The set of primary components that are identified by the ERD are:

◆ Data objects
◆ Relationships
◆ Attributes
◆ Various types of indicators

The primary purpose of the ERD is to represent data objects and their relationships.
6.4 DATA FLOW DIAGRAMS

A data flow diagram is graphical tool used to describe and analyze movement of data
through a system. These are the central tool and the basis from which the other components
are developed. The transformation of data from input to output, through processed, may be
described logically and independently of physical components associated with the system.
These are known as the logical data flow diagrams. The physical data flow diagrams show the
actual implements and movement of data between people, departments and workstations. A
full description of a system actually consists of a set of data flow diagrams. Using two familiar
notations Yourdon, Gane and Sarson notation develops the data flow diagrams. Each
component in a DFD is labeled with a descriptive name. Process is further identified with a

29 | P a g e
number that will be used for identification purpose. The development of DFD’S is done in
several levels. Each process in lower level diagrams can be broken down into a more detailed
DFD in the next level. The lop-level diagram is often called context diagram. It consists a

single process bit, which plays vital role in studying the current system. The process in the
context level diagram is exploded into other process at the first level DFD.

The idea behind the explosion of a process into more processes is that understanding at
one level of detail is exploded into greater detail at the next level. This is done until no further
explosion is necessary and an adequate amount of detail is described for the analyst to understand
the process.

Larry Constantine first developed the DFD as a way of expressing system requirements in a
graphical form; this led to modular design.

A DFD, also known as a "bubble chart", has the purpose of clarifying system requirements
and identifying major transformations that will become programs in system design. So it is the
starting point of the design, down to the lowest level of detail. A DFD consists of a series of
bubbles joined by data flows in the system.

SALIENT FEATURES OF DFDs

1. The DFD shows the flow of data, not of control; loops and decisions are control considerations
and do not appear on a DFD.

2. The DFD does not indicate the time factor involved in any process, whether the data flows take
place daily, weekly, monthly or yearly.

3. The sequence of events is not brought out on the DFD.

TYPES OF DATA FLOW DIAGRAMS
1. Current Physical

2. Current Logical

3. New Logical

4. New Physical

CURRENT PHYSICAL:

In the Current Physical DFD, process labels include the names of people or their positions or the
names of computer systems that might provide some of the overall system processing. The label
includes an identification of the technology used to process the data. Similarly, data flows and
data stores are often labeled with the names of the actual physical media on which data are
stored, such as file folders, computer files, business forms or computer tapes.

CURRENT LOGICAL:

The physical aspects of the system are removed as much as possible so that the current system
is reduced to its essence: the data and the processes that transform them, regardless of
actual physical form.

NEW LOGICAL:

This is exactly like the current logical model if the user were completely happy with the
functionality of the current system but had problems with how it was implemented. Typically,
though, the new logical model will differ from the current logical model by having additional
functions, obsolete functions removed, and inefficient flows reorganized.

NEW PHYSICAL:

The new physical represents only the physical implementation of the new system.
6.5 UML Diagrams

6.5.1 Use case diagram

[Use case diagram: the user interacts with use cases for Dataset, Data Understanding, Data Analytics (EDA), Train/Test Split, Model Building, Predictive Learning, Trained Dataset, Particular Data, and Model Evaluation.]

EXPLANATION:

The primary purpose of a use case diagram is to show which system functions are performed for which
actor. The roles of the actors in the system can be depicted. The above diagram has the user as the
actor; each use case plays a specific part in achieving the overall goal.

6.5.2 Class Diagram

[Class diagram: classes Data Understanding, Data Analysis (EDA), Train Data, Model Building, and Model Evaluation, with attributes such as analysis, datamodel, dataprocess, tuningprocess and dataset, and operations such as datasetstransfers(), datasets(), trainingdata(), modelbuildingphase(), splitdata(), predictivelearning(), machinelearning(), traineddataset(), and particulardata().]

EXPLANATION:
The class diagram shows how the classes, with their attributes and methods, are linked together to perform
the verification with security. The diagram above shows the different classes involved in our project.

6.5.3 Object Diagram


Data Understanding Data Analysis (EDA) Train Data Model Building

Model Evaluation

EXPLANATION:
The above diagram describes the flow of objects between the classes. It is a diagram that shows a
complete or partial view of the structure of a modeled system. The object diagram represents how the
classes, with their attributes and methods, are linked together to perform the verification with security.
6.5.4 Component Diagram

[Component diagram: components Data Understanding, Datasets Transfers, Deep Learning, Model-Building Phase, Datasets, Data Analysis (EDA), Train Data, Training Data, Split Data, Model Evaluation, Predictive Learning, Model Building, Trained Dataset, and Particular Data.]

EXPLANATION:
A component provides the set of required interfaces that a component realizes or implements. These are
static diagrams of the Unified Modeling Language. Component diagrams are used to represent the working
and behavior of the various components of a system.

6.5.5 Deployment Diagram

[Deployment diagram: nodes Data Understanding, Data Analysis (EDA), Train Data, Model Building, and Model Evaluation.]

EXPLANATION:
A UML deployment diagram is a diagram that shows the configuration of run-time processing nodes and
the components that live on them. Deployment diagrams are a kind of structure diagram used in modeling
the physical aspects of an object-oriented system. They are regularly used to model the static deployment
view of a system.

6.5.6 STATE DIAGRAM

[State diagram: states Dataset, Data Understanding, Data Analysis (EDA), Train Data, Model Building, Model Evaluation, Datasets, Datasets Transfers, Training Data, Trained Dataset, Predictive Learning, Split Data, Model-Building Phase, Particular Data, and Machine Learning.]

EXPLANATION:
State diagrams are a loosely defined diagram used to show workflows of stepwise activities and actions, with
support for choice, iteration and concurrency. State diagrams require that the system described is composed of a
finite number of states; sometimes this is indeed the case, while at other times it is a reasonable
abstraction. Many forms of state diagrams exist, which differ slightly and have different semantics.

6.5.7 Sequence Diagram

Figure: Sequence diagram with the lifelines Data Understanding, Data Analysis (EDA), Train Data, Model Building, Model Evaluation and Dataset, exchanging the messages Datasets Transfers, Datasets, Training Data, Predictive Learning, Machine Learning, Trained Dataset, Model Building Phase, Split Data and Particular Data.

EXPLANATION:
UML sequence diagrams are interaction diagrams that detail how operations are carried out. They capture the interaction between objects in the context of a collaboration. Sequence diagrams are time-focused, and they show the order of the interaction visually by using the vertical axis of the diagram to represent time, showing what message is sent and when.

6.5.8 Collaboration Diagram

Figure: Collaboration diagram with the objects Data Understanding, Data Analysis (EDA), Train Data, Model Building, Model Evaluation and Dataset, linked by the numbered messages 1: Datasets Transfers, 2: Datasets, 3: Training Data, 4: Predictive Learning, 5: Machine Learning, 6: Trained Dataset, 7: Model Building Phase, 8: Split Data and 9: Particular Data.

EXPLANATION:
Collaboration diagrams are used to show how objects interact to perform the behavior of a particular use case, or a part of a use case. Along with sequence diagrams, collaboration diagrams are used by designers to define and clarify the roles of the objects that perform a particular flow of events of a use case. They are the primary source of information used to determine class responsibilities and interfaces.

6.5.9 Activity Diagram

Figure: Activity diagram flowing from the Dataset through Data Understanding, Data Analysis (EDA), Model Building and Model Evaluation, with the activities Train Data, Datasets Transfers, Datasets, Training Data, Trained Dataset, Predictive Learning, Model-Building Phase, Split Data, Machine Learning and Particular Data.

EXPLANATION:
Activity diagrams are a loosely defined diagram type used to show workflows of stepwise activities and actions, with support for choice, iteration and concurrency. In UML, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system. UML activity diagrams can also model the internal logic of a complex operation. In many ways, UML activity diagrams are the object-oriented equivalent of flow charts and data flow diagrams (DFDs) from structured development.

6.5.10 System Architecture

Figure: System architecture: DATASET → EXPLORATORY DATA ANALYTICS → TRAIN/TEST SPLIT → MODEL BUILDING / HYPERPARAMETER TUNING → MODEL EVALUATION → RESULT.

CHAPTER : 7
7. Implementation
7.1 Data Collection
We collected the phishing websites dataset from the Kaggle website. It consists
of a mix of phishing and legitimate URL features. The dataset has 11055 rows
and 31 columns.
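A minimal sketch of loading the dataset and confirming its size with pandas (the file name and location are assumptions; the full project code appears in section 7.5):

import pandas as pd

# path/file name is an assumption; point it at the downloaded Kaggle CSV
df = pd.read_csv('dataset.csv')
print(df.shape)   # expected: (11055, 31)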

7.2 Exploratory Data Analysis
We loaded the dataset into a Python IDE with the help of the pandas package
and checked whether there are any missing values in the data. We found that
there are no missing values in the data, and we removed an unwanted column
for our process. After removing the unwanted column, below are the
columns left in our dataset.

Figure: 7.2.1

We analyzed the obtained data and found the following observations:

Figure: 7.2.2
From the above count plot, which is plotted with the help of the seaborn package, we can observe the count of values of the target variable.

Below is the plot, drawn with the help of the matplotlib package, to find out the correlation among the features.
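A minimal sketch of how such a correlation plot can be produced (the original figure was drawn with matplotlib; a seaborn heatmap of df.corr() is used here as one convenient rendering, and the figure size is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()                      # pairwise correlation of all feature columns
plt.figure(figsize=(18, 15))          # assumed size, large enough for 30 features
sns.heatmap(corr, cmap='viridis')     # each cell shows the correlation between a pair of features
plt.title('Correlation among features')
plt.show()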

From the analysis we found that ours is a classification problem, where the target variable is Result.

7.3 Data Processing


We have to prepare the data for the algorithms for training and testing purposes.
With the help of the scikit-learn package, we have split the data into 70% for
training and 30% for testing.

7.3.1 Data Splitting


y = data['Result']

X = data.drop('Result', axis=1)

Figure: 7.3.1

Figure: 7.3.2

# to split train and test sets
from sklearn.model_selection import train_test_split

# split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
7.4 Modeling

7.4.1 Logistic Regression

Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values). Logistic regression is named for the function used at the core of the method, the logistic function. The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP()
function in your spreadsheet) and value is the actual numerical value that
you want to transform. Below is a plot of the numbers between -5 and 5
transformed into the range 0 and 1 using the logistic function.
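A minimal sketch that reproduces this transformation (the step size of 0.1 is an assumption):

import numpy as np
import matplotlib.pyplot as plt

values = np.arange(-5, 5, 0.1)        # numbers between -5 and 5
probs = 1 / (1 + np.exp(-values))     # the logistic (sigmoid) function
plt.plot(values, probs)               # S-shaped curve bounded by 0 and 1
plt.xlabel('value')
plt.ylabel('1 / (1 + e^-value)')
plt.show()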

Logistic regression uses an equation as the representation, very much like linear regression. Input values (x) are combined linearly using weights or coefficient values to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.

Below is an example logistic regression equation:


y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Where y is the predicted output, b0 is the bias or intercept term and b1 is
the coefficient for the single input value (x). Each column in your input
data has an associated b coefficient (a constant real value) that must be
learned from your training data. The actual representation of the model that
you would store in memory or in a file is the coefficients in the equation
(the beta values or b's). Logistic regression models the probability of the
default class (e.g. the first class).
For example, if we are modeling people’s sex as male or female from their
height, then the first class could be male and the logistic regression model
could be written as the probability of male given a person’s height, or more
formally:

P(sex=male|height)
Written another way, we are modeling the probability that an input (X) belongs to the default class (Y=1); we can write this formally as:

P(X) = P(Y=1|X)
We’re predicting probabilities? I thought logistic regression was a classification
algorithm?
Note that the probability prediction must be transformed into binary values
(0 or 1) in order to actually make a class prediction. Logistic
regression is a linear method, but the predictions are transformed using the
logistic function. The impact of this is that we can no longer understand the
predictions as a linear combination of the inputs as we can with linear
regression; for example, continuing on from above, the model can be stated
as:

p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

ln(p(X) / (1 – p(X))) = b0 + b1 * X

We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
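As a small worked example (the coefficient values below are purely illustrative and are not taken from the trained model):

import math

b0, b1 = 0.5, 1.2                                    # illustrative coefficients only
X = 1
p = math.exp(b0 + b1*X) / (1 + math.exp(b0 + b1*X))  # probability of the default class
print(p)                                             # about 0.85, i.e. class 1 at a 0.5 threshold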
We trained the logistic regression model with the help of the training split and tested it with the test split.
Accuracy score - 0.92
Confusion Matrix:
                 Phishing     Non-Phishing

Phishing         827(TP)      74(FN)

Non-Phishing     57(FP)       797(TN)

Table:7.4.1.1
Classification Report:
              Precision   Recall   F1-Score   Support

Phishing         0.94      0.92      0.93       901

Non-Phishing     0.92      0.93      0.92       854

Accuracy                             0.93      1755

Macro avg        0.93      0.93      0.93      1755

Weighted avg     0.93      0.93      0.93      1755

Table:7.4.1.2

7.4.2 Random Forest Classifier

Random forest is a supervised learning algorithm. It can be used both for classification and
regression. It is also the most flexible and easy to use algorithm. A forest is comprised of
trees. It is said that the more trees it has, the more robust a forest is. Random forest
creates decision trees on randomly selected data samples, gets a prediction from each tree
and selects the best solution by means of voting. It also provides a pretty good indicator
of feature importance.

It technically is an ensemble method (based on the divide-and-conquer approach) of decision trees generated on a randomly split dataset. This collection of decision tree classifiers is also known as the forest. The individual decision trees are generated using an attribute selection indicator such as information gain, gain ratio, and Gini index for each attribute. Each tree depends on an independent random sample. In a classification problem, each tree votes and the most popular class is chosen as the final result. In the case of regression, the average of all the tree outputs is considered as the final result. It is simpler and more powerful compared to the other non-linear classification algorithms.
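A minimal sketch of this voting idea with scikit-learn, reusing the split from section 7.3.1 (the number of trees is an assumption; the project's actual training code is listed in section 7.5):

from sklearn.ensemble import RandomForestClassifier

# 100 randomly grown decision trees; each tree votes and the majority class wins
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))        # mean accuracy on the test split
print(rf.feature_importances_)         # the feature-importance indicator mentioned above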

Advantages:

• Random forest is considered as a highly accurate and robust method because of the
number of decision trees participating in the process.

• It does not suffer from the overfitting problem. The main reason is that it takes the
average of all the predictions, which cancels out the biases.

• The algorithm can be used in both classification and regression problems.

• Random forests can also handle missing values. There are two ways to handle these: using median
values to replace continuous variables, and computing the proximity-weighted average of missing
values.
It works in four steps:
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision
tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.

We trained the random forest model with the help of the training split and tested it with the test split.
Accuracy score - 0.96

Confusion Matrix:
Phishing Non-Phishing

Phishing 858(TP) 43(FN)

Non-Phishing 35(FP) 819(TN)

Table:7.4.2.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.96 0.95 0.96 901

Non-Phishing 0.95 0.96 0.95 854

Accuracy 0.96 1755

Macro avg 0.96 0.96 0.96 1755

Weighted avg 0.96 0.96 0.96 1755

Table:7.4.2.2

7.4.3 Decision Tree Classifier:

A decision tree is a flowchart-like tree structure where an internal node represents a feature,
a branch represents a decision rule, and each leaf node represents the outcome. The
topmost node in a decision tree is known as the root node. It learns to partition on the basis
of the attribute value. It partitions the tree in a recursive manner, called recursive partitioning.
This flowchart-like structure helps in decision making. Its visualization, like a
flowchart diagram, easily mimics human-level thinking. That is why decision
trees are easy to understand and interpret.

Decision Tree is a white-box type of ML algorithm. It shares internal decision-making logic, which is not available in black-box types of algorithms such as Neural Networks. Its training time is faster compared to the neural network algorithm. The time complexity of decision trees is a function of the number of records and the number of attributes in the given data. The decision tree is a distribution-free or non-parametric method, which does not depend upon probability distribution assumptions. Decision trees can handle high-dimensional data with good accuracy.

The basic idea behind any decision tree algorithm is as follows:

1. Select the best attribute using Attribute Selection Measures (ASM) to split the
records.

2. Make that attribute a decision node and break the dataset into smaller subsets.

3. Start tree building by repeating this process recursively for each child until one of
the following conditions is matched:
o All the tuples belong to the same attribute value.

o There are no more remaining attributes.

o There are no more instances.

Pros

Decision trees are easy to interpret and visualize.

It can easily capture Non-linear patterns.

It requires fewer data preprocessing from the user, for example, there is no need to
normalize columns.

It can be used for feature engineering such as predicting missing values, suitable for
variable selection.

The decision tree has no assumptions about distribution because of the non-parametric
nature of the algorithm.

Cons

Sensitive to noisy data. It can overfit noisy data.

A small variation (or variance) in the data can result in a different decision tree. This
can be reduced by bagging and boosting algorithms.

Decision trees are biased with an imbalanced dataset, so it is recommended to balance
out the dataset before creating the decision tree.

Accuracy score - 0.935

Confusion Matrix:
Phishing Non-Phishing

Phishing 849(TP) 52(FN)

Non-Phishing 62(FP) 792(TN)

Table:7.4.3.1

Classification Report:

Precision Recall F1-Score Support

Phishing 0.93 0.94 0.94 901

Non-Phishing 0.94 0.93 0.93 854

Accuracy 0.94 1755

Macro avg 0.94 0.93 0.93 1755

Weighted avg 0.94 0.94 0.94 1755

Table:7.4.3.2

7.4.4 Naïve Bayes Classifier

The naive Bayes classifier is a generative model for classification. Before the advent of
deep learning and its easy-to-use libraries, the Naive Bayes classifier was one of the
widely deployed classifiers for machine learning applications. Despite its simplicity, the
naive Bayes classifier performs quite well in many applications.

A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on the Bayes theorem.

Bayes Theorem:

Using Bayes theorem, we can find the probability of A happening, given that B has
occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that
the predictors/features are independent; that is, the presence of one particular feature does not
affect the other. Hence it is called naive.
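In symbols, the theorem reads:

P(A|B) = P(B|A) * P(A) / P(B)

where P(A|B) is the probability of the hypothesis A given the evidence B, P(B|A) is the likelihood of the evidence given the hypothesis, P(A) is the prior probability of the hypothesis, and P(B) is the probability of the evidence.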

Accuracy score - 0.65

Confusion Matrix:
Phishing Non-Phishing

Phishing 901(TP) 0(FN)

Non-Phishing 607(FP) 247(TN)

Table:7.4.4.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.60 1.00 0.75 901

Non-Phishing 1.00 0.29 0.45 854

Accuracy 0.65 1755

Macro avg 0.80 0.64 0.60 1755

Weighted avg 0.79 0.65 0.60 1755

Table:7.4.4.2

7.4.5 Support Vector Machine (SVM)

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point in the
correct category in the future. This best decision boundary is called a hyperplane. SVM
chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine.

Types of SVM

SVM can be of two types:

Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data, and the classifier used is called a Linear SVM
classifier.

Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset, which means
that if there are 2 features then the hyperplane will be a straight line, and if there are 3 features
then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane,
they are called support vectors.
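A minimal sketch showing how the support vectors of a fitted scikit-learn SVC can be inspected, reusing the split from section 7.3.1 (the kernel choice is an assumption; the project's training code is listed in section 7.5):

from sklearn.svm import SVC

sv = SVC(kernel='rbf')                 # assumed non-linear (RBF) kernel, scikit-learn's default
sv.fit(X_train, y_train)
print(sv.n_support_)                   # number of support vectors per class
print(sv.support_vectors_.shape)       # the training points that define the margin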

Advantages
SVM classifiers offer good accuracy and perform faster prediction compared to the Naïve Bayes
algorithm. They also use less memory because they use a subset of training points in the
decision phase. SVM works well with a clear margin of separation and with high-dimensional
spaces.

Accuracy score - 0.943

Confusion Matrix:
Phishing Non-Phishing

Phishing 838(TP) 63(FN)

Non-Phishing 37(FP) 817(TN)

Table:7.4.5.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.96 0.93 0.94 901

Non-Phishing 0.93 0.94 0.94 854

Accuracy 0.94 1755

Macro avg 0.94 0.94 0.94 1755

Weighted avg 0.94 0.94 0.94 1755

Table:7.4.5.2

7.4.6 K-Nearest Neighbors (KNN)

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm) is one that relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.

The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other. KNN captures the idea of similarity (sometimes
called distance, proximity, or closeness) with some mathematics we might have learned in
our childhood— calculating the distance between points on a graph.
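A minimal sketch of that distance idea (the two feature vectors are toy values; scikit-learn's KNeighborsClassifier uses the Euclidean/Minkowski distance by default, and the project's training code in section 7.5 uses n_neighbors=2):

import numpy as np

# Euclidean distance between two toy feature vectors
a = np.array([1, 0, 1])
b = np.array([-1, 1, 1])
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)   # the smaller the distance, the more similar the two points are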

Advantages

1. The algorithm is simple and easy to implement.

2. There’s no need to build a model, tune several parameters, or make additional assumptions.
3. The algorithm is versatile. It can be used for classification, regression, and search (as we will see
in the next section).
Accuracy score - 0.928
Confusion Matrix:
Phishing Non-Phishing

Phishing 855(TP) 46(FN)

Non-Phishing 79(FP) 775(TN)

Table:7.4.6.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.92 0.95 0.93 901

Non-Phishing 0.94 0.91 0.93 854

Accuracy 0.93 1755

Macro avg 0.93 0.93 0.93 1755

Weighted avg 0.93 0.93 0.93 1755

Table:7.4.6.2

7.4.7 XGB Classifier:

XGBoost is a powerful machine learning algorithm especially where speed and accuracy
are concerned. XGBoost (eXtreme Gradient Boosting) is an advanced implementation
of gradient boosting algorithm.
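A minimal sketch using the xgboost package's scikit-learn style wrapper (the hyperparameter values are assumptions; note that the executed code in section 7.5 uses scikit-learn's GradientBoostingClassifier for the boosting model):

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# if the Result labels are {-1, 1}, recent xgboost versions expect them mapped to {0, 1} first
xgb = XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print(accuracy_score(y_test, y_pred))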

ADVANTAGES

1. Regularization:

• Standard GBM has no regularization like XGBoost; the regularization therefore also helps
to reduce overfitting.

• In fact, XGBoost is also known as a ‘regularized boosting’ technique.


2. Parallel Processing:
• XGBoost implements parallel processing and is blazingly faster as compared
to GBM.
• XGBoost also supports implementation on Hadoop.

3. High Flexibility

• XGBoost allows users to define custom optimization objectives and evaluation criteria.
• This adds a whole new dimension to the model and there is no limit to what we
can do.

4. Handling Missing Values

• XGBoost has an in-built routine to handle missing values.


• The user is required to supply a different value than other observations and pass
that as a parameter. XGBoost tries different things as it encounters a missing
value on each node and learns which path to take for missing values in
future.

5. Tree Pruning:

• A GBM would stop splitting a node when it encounters a negative loss in the
split. Thus it is more of a greedy algorithm.
• XGBoost, on the other hand, makes splits up to the max_depth specified and then
starts pruning the tree backwards, removing splits beyond which there is no
positive gain.
• Another advantage is that sometimes a split with a negative loss, say -2, may be
followed by a split with a positive loss of +10. GBM would stop as soon as it encounters the -2,
but XGBoost will go deeper, see the combined effect of +8 of the splits, and keep both.

6. Built-in Cross-Validation

• XGBoost allows the user to run a cross-validation at each iteration of the boosting
process, and thus it is easy to get the exact optimum number of boosting
iterations in a single run.
• This is unlike GBM, where we have to run a grid search and only limited
values can be tested.

Accuracy score - 0.94

Confusion Matrix:
Phishing Non-Phishing

Phishing 836(TP) 65(FN)

Non-Phishing 39(FP) 815(TN)

Table:7.4.7.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.96 0.93 0.94 901

Non-Phishing 0.93 0.95 0.94 854

Accuracy 0.94 1755

Macro avg 0.94 0.94 0.94 1755

Weighted avg 0.94 0.94 0.94 1755

Table:7.4.7.2

7.5 Coding and Execution

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
df=pd.read_csv('C:/Users/k.anusha/Documents/phishing/dataset.csv')
df.head()
df.describe()
df.isnull().sum()
df.dtypes
sns.countplot(x='Result',data=df)
x=df.drop(['Result','index'],axis=1)
y=df['Result']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)

# Logistic Regression

from sklearn.linear_model import LogisticRegression


lr=LogisticRegression()
lr.fit(x_train,y_train)
x_test
y_pred=lr.predict(x_test)
y_pred

# Accuracy Score

from sklearn.metrics import accuracy_score
a1=accuracy_score(y_test,y_pred)
a1

# Classification Report

from sklearn.metrics import classification_report


print(classification_report(y_test,y_pred))
import math
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE1 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE1)
x.keys()
lr.predict([[1,0,1,1,1,-1,-1,-1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,1,-1,1,0,-1]])

# Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier


tree=DecisionTreeClassifier()
tree.fit(x_train,y_train)
y_pred=tree.predict(x_test)
from sklearn.metrics import accuracy_score
a2=accuracy_score(y_test,y_pred)
a2
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE2 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE2)
tree.predict([[1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1]])

Output:

# Random Forest

from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)
from sklearn.metrics import accuracy_score
a3=accuracy_score(y_test,y_pred)
a3
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE3 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE3)
rf.predict([[-1,0,-1,1,-1,-1,1,1,-1,1,1,-1,1,0,0,-1,-1,-1,0,1,1,1,1,1,1,1,-1,1,-1,-1]])

Output:

# Support Vector Machine

from sklearn.svm import SVC
sv=SVC()
sv.fit(x_train,y_train)
y_pred=sv.predict(x_test)
from sklearn.metrics import accuracy_score
a4=accuracy_score(y_test,y_pred)
a4
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE4 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE4)
sv.predict([[-1,-1,-1,1,-1,1,-1,1,1,-1,-1,1,1,1,1,-1,-1,-1,-1,1,-1,1,-1,-1,-1,1,-1,1,-1,-1]])

Output:

# Naive Bayes

from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
nb.fit(x_train,y_train)
y_pred=nb.predict(x_test)
from sklearn.metrics import accuracy_score
a5=accuracy_score(y_test,y_pred)
a5
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE5 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE5)
nb.predict([[-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1]])

# Gradient Boosting Algorithm

from sklearn.ensemble import GradientBoostingClassifier
gb=GradientBoostingClassifier()
gb.fit(x_train,y_train)
y_pred=gb.predict(x_test)
from sklearn.metrics import accuracy_score
a6=accuracy_score(y_test,y_pred)
a6
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE6 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE6)
gb.predict([[1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1]])

#K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=2)
knn.fit(x_train,y_train)
y_pred=knn.predict(x_test)
from sklearn.metrics import accuracy_score
a7=accuracy_score(y_test,y_pred)
a7
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE7 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE7)
knn.predict([[-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1]])

# Accuracy levels for various algorithms

sns.barplot(x='Algorithm',y='Accuracy',data=df1)
plt.xticks(rotation=90)
plt.title('Comparison of Accuracy Levels for various algorithms')
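The DataFrame df1 used above is assumed to hold one row per algorithm with its accuracy score; a minimal sketch of building it from the scores computed earlier (a1 to a7 follow the order of the code above):

df1 = pd.DataFrame({
    'Algorithm': ['Logistic Regression', 'Decision Tree', 'Random Forest',
                  'SVM', 'Naive Bayes', 'Gradient Boosting', 'KNN'],
    'Accuracy': [a1, a2, a3, a4, a5, a6, a7]
})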

CONCLUSION

The present project is aimed at the classification of phishing websites based on their features.
For that we have taken the phishing dataset, which was collected from the UCI machine learning
repository, and we built our model with seven different classifiers: Logistic Regression, SVC,
Naïve Bayes, XGB Classifier, Random Forest, K-Nearest Neighbours and Decision Tree, and we
obtained good accuracy scores. There is scope to enhance this work further: if we can obtain more
data, our project will be much more effective and we can get very good results. For this we need
API integrations to get the data of different websites.

