Computer Science Project
BY
ARUNESH M
REG NO RA1731241040018
Dr.J.PADMAVATHI
Associate Professor & Head
Department of Computer Science and Application
JUNE 2020
BONAFIDE CERTIFICATE
Certified that this project report titled "Detecting Phishing Attacks Using Natural
Language Processing and Machine Learning" is the bonafide work of ARUNESH M
(Reg. No: RA1731241040018), who carried out the project under my supervision. Certified
further that, to the best of my knowledge, the work reported herein does not form part of any
other project report or dissertation on the basis of which a degree or award was conferred on
an earlier occasion to this or any other candidate.
Place: VADAPALANI
Date :
ACKNOWLEDGEMENT
First and foremost, I would like to express my heartfelt thanks and deep sense of gratitude
to the Management of SRM Institute of Science & Technology for their constant support and
endorsement.
ABSTRACT
Cloud computing has revolutionized the IT industry over the past decade and is
still developing creative ways to solve current problems. Innovations such as cloud storage
and non-native applications were unimaginable 10 years ago but are a reality nowadays.
Companies and research institutes are steadily moving to the cloud to address their computing
needs, and services and applications are now commonly hosted in the cloud.
LIST OF FIGURES
S.NO. FIG.NO. TITLE OF THE FIGURE PAGE NO.
1. 4.3.1 .Net Framework Architecture 11
2. 5.1 System Architecture 12
3. 5.2.1 Data Flow Diagram Level 0 13
4. 5.3 Class Diagram 14
5. 6.1 Module Description 15
6. 9.2.1 Cloud Controller 51
7. 9.2.2 Client 1 52
8. 9.2.3 Client 2 53
9. 9.2.4 Client 3 54
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO.
BONAFIDE CERTIFICATE i
ACKNOWLEDGEMENT ii
ABSTRACT v
LIST OF FIGURES vi
1 INTRODUCTION 1
1.1 Overview 1
2 LITERATURE SURVEY 3
2.1 The Survey and Future Evolution of Green Computing 3
2.2 A Study On Green Computing: The Future Computing And Eco-Friendly Technology 3
2.3 Cloud Load Balancing Techniques: A Step Towards Green Computing 3
2.4 Holistic Approach to Cloud Service Computing 4
2.5 Balancing Energy in Processing and Transport 4
3 SYSTEM ANALYSIS 6
3.1 OBJECTIVE 6
3.2 EXISTING SYSTEM 6
3.2.1 Disadvantage Of Existing System 6
3.3 PROPOSED SYSTEM 6
3.3.1 Advantage Of Proposed System 7
3.4 FEASIBILITY STUDY 7
3.4.1 Operational Feasibility 7
3.4.2 Technical Feasibility 7
3.4.3 Economical Feasibility 7
4 SYSTEM REQUIREMENTS 8
4.1 Hardware Requirements 8
4.2 Software Requirements 8
4.3 Software Description 8
5 SYSTEM DESIGN 12
5.1 System Architecture 12
6 IMPLEMENTATION 15
6.1 Module Description 15
6.2 Techniques 16
7 SYSTEM TESTING 18
9 REFERENCES 19
10 APPENDICES 20
10.1 Appendix – A (Coding) 20
10.2 Appendix – B (Screen Shot) 50
CHAPTER 1
INTRODUCTION
While the Internet has brought unprecedented convenience to many people for managing
their finances and investments, it also provides opportunities for conducting fraud on a
massive scale at little cost to the fraudsters. Fraudsters can manipulate users instead of
hardware/software systems, where the barriers to technological compromise have increased
significantly. Phishing is one of the most widely practiced Internet frauds. It focuses on the
theft of sensitive personal information such as passwords and credit card details. Phishing
attacks take two forms: those carried out through specialized malware, and those that deceive
users directly.
The specific malware used in phishing attacks is the subject of research by the virus and malware
community and is not addressed in this thesis. Phishing attacks that proceed by deceiving
users are the research focus of this thesis, and the term 'phishing attack' will be used to refer
to this type of attack.
CHAPTER 2
WORKING ENVIRONMENT
HARDWARE REQUIREMENTS:
SOFTWARE REQUIREMENTS:
SYSTEM SOFTWARE:
• Basic utilities
• Web browser
CHAPTER 3
SYSTEM ANALYSIS
Feasibility Study
People often purchase products online and make payments through e-banking, and there
are many e-banking phishing websites. In order to detect an e-banking phishing website, our
system uses an effective heuristic algorithm. The e-banking phishing website can be detected
based on some important characteristics such as URL and domain identity, and security and
encryption criteria.
Malicious Web sites largely promote the growth of Internet criminal activities and constrain
the development of Web services. As a result, there has been strong motivation to develop a
systematic solution to stop users from visiting such Web sites. We propose a learning-based
approach to classifying Web sites into three classes: Benign, Spam and Malicious. Our
mechanism analyzes only the Uniform Resource Locator (URL) itself, without accessing the
content of the Web sites. Thus, it eliminates run-time latency and the possibility of exposing
users to browser-based vulnerabilities. By employing learning algorithms, our scheme
achieves better performance on generality and coverage compared with blacklisting services.
Existing System:
A poorly structured NN model may cause the model to underfit the training dataset. On
the other hand, exaggeration in restructuring the system to suit every single item in the
training dataset may cause the system to be overfitted. One possible solution to avoid the
overfitting problem is to restructure the NN model in terms of tuning some parameters,
adding new neurons to the hidden layer, or sometimes adding a new layer to the network.
An ANN with a small number of hidden neurons may not have satisfactory
representational power to model the complexity and diversity inherent in the data. On the
other hand, networks with too many hidden neurons could overfit the data. However, at a
certain stage the model can no longer be improved, and therefore the structuring process should
be terminated. Hence, an acceptable error rate should be specified when creating any NN
model, which is itself considered a problem, since it is difficult to determine an acceptable
error rate a priori. For instance, the model designer may set the acceptable error rate to a
value that is unreachable, which causes the model to get stuck in local minima, or the
model designer may set the acceptable error rate to a value that can be further improved.
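The tuning described above can be sketched with scikit-learn's MLPClassifier, which is our illustrative choice here (the report names no particular NN library, and the dataset below is synthetic). Early stopping on a held-out validation split sidesteps the problem of fixing an acceptable error rate a priori:

```python
# Hypothetical sketch: compare hidden-layer sizes and rely on early
# stopping, so training halts once validation score stops improving
# instead of chasing a preset "acceptable error rate".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for hidden in (2, 16, 64):  # too few neurons underfit; too many risk overfitting
    model = MLPClassifier(hidden_layer_sizes=(hidden,),
                          early_stopping=True,      # hold out validation data
                          validation_fraction=0.2,  # stop when it stops improving
                          max_iter=1000, random_state=0)
    model.fit(X_train, y_train)
    print(hidden, round(model.score(X_test, y_test), 3))
```

The loop makes the underfit/overfit trade-off visible empirically rather than requiring the designer to guess the right structure in advance.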
Disadvantage:
Proposed System:
Lexical features are based on the observation that the URLs of many illegal sites look
different from those of legitimate sites. Analyzing lexical features enables us to capture this
property for classification purposes. We first distinguish the two parts of a URL: the hostname
and the path, from which we extract a bag-of-words (strings delimited by '/', '?', '.', '=', '-'
and '_').
We find that phishing websites tend to have longer URLs, more levels (delimited by
dots), more tokens in the domain and path, and longer tokens. Besides, phishing and malware websites
may pretend to be benign by containing popular brand names as tokens other than
those in the second-level domain. Phishing and malware websites may also use an
IP address directly so as to disguise the suspicious URL, which is very rare in the benign case. Also,
phishing URLs are found to contain several suggestive word tokens (confirm, account,
banking, secure, ebayisapi, webscr, login, signin), so we check the presence of these security-
sensitive words and include the binary value in our features. Intuitively, malicious sites are
usually less popular than benign ones. For this reason, site popularity can be considered an
important feature. The traffic rank feature is acquired from Alexa.com. Host-based features are
based on the observation that malicious sites are usually registered in less reputable hosting
centres or regions.
Advantage:
Though there are many phishing detection approaches, the scope of the project is limited to feature-
based phishing detection techniques. It extracts discriminative features from the websites
which help in identifying the website class. In this process, rules play an important role, as
they are easily understood by humans. The rules are formed as "IF condition
THEN class category", where class category represents the category to which a class belongs.
This rule induction helps facilitate the decision-making process, which ensures
reliability and completeness.
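A hand-written sketch of such IF-condition-THEN-class rules is shown below. The specific conditions, thresholds, and feature names are invented for illustration only; actual rule induction would learn them from the training data:

```python
# Hypothetical rules of the IF-condition-THEN-class form discussed above.
def classify_by_rules(features):
    if features["has_ip"]:                 # IF URL host is a raw IP THEN phishing
        return "phishing"
    if features["url_length"] > 75 and features["has_sensitive_word"]:
        return "phishing"                  # long URL plus security-sensitive token
    if features["traffic_rank"] is None:   # unranked (unpopular) sites are suspect
        return "suspicious"
    return "benign"

print(classify_by_rules({"has_ip": False, "url_length": 30,
                         "has_sensitive_word": False, "traffic_rank": 1200}))
```

Because each rule reads as a plain conditional, a human reviewer can audit exactly why a URL was assigned to a class, which is the readability advantage the paragraph above points to.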
CHAPTER 4
SYSTEM DESIGN
UML DIAGRAMS
GOALS:
USE CASE DIAGRAM
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors
in the system can be depicted.
CLASS DIAGRAM
SEQUENCE DIAGRAM
CHAPTER 5
DATA FLOW DIAGRAM
LEVEL 0
[Figure: Dataset Collection → Pre-processing → Random split → Trained & Testing dataset]
LEVEL 1
[Figure: Dataset Collection → Pre-processing → Feature Extraction → Apply Algorithm]
LEVEL 2
[Figure: Classify the URLs → Detection of malicious URLs → Find possibility → Accuracy]
CHAPTER 6
IMPLEMENTATION
OBJECTIVE
The main objective of this paper is to detect Benign, Malicious and Malware URLs with
the use of NLP.
Modules
One of the challenges faced by our research was the unavailability of reliable training
datasets. In fact, any researcher in the field faces this challenge. Although plenty of
articles about predicting phishing websites using data mining techniques have been
disseminated, no reliable training dataset has been published publicly, perhaps
because there is no agreement in the literature on the definitive features that characterize phishing
websites; hence it is difficult to shape a dataset that covers all possible features.
In this article, we shed light on the important features that have proved to be sound and
effective in predicting phishing websites. In addition, we proposed some new features,
experimentally assign new rules to some well-known features and update some other features.
Using the IP Address: If an IP address is used as an alternative to the domain name in the
URL, such as "http://125.98.3.123/fake.html", users can be sure that someone is trying to
steal their personal information. Sometimes, the IP address is even transformed into
hexadecimal code, as shown in the following link:
"http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html".
It is unusual to find a legitimate website asking users to submit their personal information
through a pop-up window. On the other hand, this feature has been used in some legitimate
websites, its main goal being to warn users about fraudulent activities or broadcast a welcome
announcement, though no personal information is asked to be filled in through these pop-up
windows.
4. Classification
To ensure that our approach works well irrespective of the underlying classifier chosen for
the task, we performed the experiments using two different classifiers: Random Forest and
Support Vector Machine, as these are among the most commonly used classifiers for the task
of text-data classification. The scikit-learn implementations of these classifiers with their default
parameter settings are used for our experiments. The tf-idf feature is used to represent each
URL in the database.
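A minimal sketch of this setup is shown below. The four-URL dataset is made up purely for illustration (the real experiments used a full labelled dataset), and `LinearSVC` stands in for the SVM:

```python
# Sketch of the setup described above: tf-idf features over raw URLs,
# fed to Random Forest and a linear SVM with scikit-learn defaults.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

urls = ["http://125.98.3.123/fake.html",
        "http://secure-login.example.net/webscr/confirm",
        "https://www.wikipedia.org/",
        "https://www.python.org/downloads/"]
labels = ["malicious", "phishing", "benign", "benign"]

for clf in (RandomForestClassifier(), LinearSVC()):
    # tokenize URLs on non-alphanumeric characters before tf-idf weighting
    pipe = make_pipeline(TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+"), clf)
    pipe.fit(urls, labels)
    print(type(clf).__name__, pipe.predict(["https://www.wikipedia.org/"]))
```

Wrapping the vectorizer and classifier in a pipeline keeps the tf-idf vocabulary fitted only on training data, so the same object can be reused for prediction on unseen URLs.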
CHAPTER 7
SYSTEM TESTING
Testing
White-box testing (also known as clear box testing, glass box testing,
transparent box testing, and structural testing) is a method of testing software that tests
internal structures or workings of an application, as opposed to its functionality (i.e. black-
box testing). In white-box testing an internal perspective of the system, as well as
programming skills, are used to design test cases. The tester chooses inputs to exercise paths
through the code and determine the appropriate outputs. This is analogous to testing nodes in
a circuit,
e.g. in-circuit testing (ICT).
White-box testing is a method of testing the application at the level of the source code.
The test cases are derived through the use of the design techniques mentioned above: control
flow testing, data flow testing, branch testing, path testing, statement coverage and decision
coverage as well as modified condition/decision coverage. White-box testing is the use of
these techniques as guidelines to create an error free environment by examining any fragile
code.
These White-box testing techniques are the building blocks of white-box testing, whose
essence is the careful testing of the application at the source code level to prevent any hidden
errors later on. These different techniques exercise every visible path of the source code to
minimize errors and create an error-free environment. The whole point of white-box testing is
the ability to know which line of the code is being executed and being able to identify what
the correct output should be.
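The idea of exercising every visible path can be shown with a toy example. The function and its threshold below are our own illustration, not taken from the project code:

```python
# White-box path coverage: the function has two branches, so the tester
# picks one input that forces each path and knows the expected output.
def classify_url_length(url):
    if len(url) > 54:            # branch 1: suspiciously long URL
        return "suspicious"
    return "legitimate"          # branch 2: ordinary length

# one input per path, expected output determined from the source code
assert classify_url_length("http://a.com/" + "x" * 60) == "suspicious"
assert classify_url_length("http://a.com/home") == "legitimate"
print("both paths exercised")
```

Because the tester can see the `> 54` condition in the source, the inputs are chosen deliberately to land on each side of it, which is exactly what distinguishes white-box from black-box test design.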
Levels
1. Unit testing. White-box testing is done during unit testing to ensure that the code is
working as intended, before any integration happens with previously tested code.
White-box testing during unit testing catches defects early on, before the code is
integrated with the rest of the application, and therefore prevents them from
propagating into later stages.
2. Integration testing. White-box tests at this level are written to test the interactions of
the interfaces with each other. Unit-level testing made sure that each unit was
tested and working accordingly in an isolated environment, and integration examines
the correctness of the behavior in an open environment through the use of white-box
testing for any interactions of interfaces that are known to the programmer.
3. Regression testing. White-box testing during regression testing is the use of recycled
white-box test cases at the unit and integration testing levels.
White-box testing's basic procedures involve the understanding of the source code that
you are testing at a deep level to be able to test them. The programmer must have a deep
understanding of the application to know what kinds of test cases to create so that every
visible path is exercised for testing. Once the source code is understood then the source code
can be analyzed for test cases to be created. These are the three basic steps that white-box
testing takes in order to create test cases:
Test procedures
Test cases
Test cases are built around specifications and requirements, i.e., what the application
is supposed to do. Test cases are generally derived from external descriptions of the software,
including specifications, requirements and design parameters.
Although the tests used are primarily functional in nature, non-functional tests may
also be used. The test designer selects both valid and invalid inputs and determines the correct
output without any knowledge of the test object's internal structure.
Unit testing
Ideally, each test case is independent from the others. Substitutes such as method
stubs, mock objects, fakes, and test harnesses can be used to assist testing a module in
isolation. Unit tests are typically written and run by software developers to ensure that code
meets its design and behaves as intended. Its implementation can vary from being very
manual (pencil and paper) to being formalized as part of build automation.
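The stubs and mocks mentioned above can be sketched with Python's `unittest` and `unittest.mock`. The `fetch_rank`/`score_url` functions are hypothetical, invented to show a module tested in isolation from its network dependency:

```python
# Sketch: a unit test isolates score_url() from its (hypothetical)
# network dependency fetch_rank() by patching in a stub return value.
import unittest
from unittest.mock import patch

def fetch_rank(url):
    # stand-in for a real network lookup of the site's traffic rank
    raise RuntimeError("network unavailable in unit tests")

def score_url(url):
    return "benign" if fetch_rank(url) < 10_000 else "suspicious"

class ScoreUrlTest(unittest.TestCase):
    @patch(f"{__name__}.fetch_rank", return_value=50)        # stub: popular site
    def test_popular_site_is_benign(self, _mock):
        self.assertEqual(score_url("https://example.com"), "benign")

    @patch(f"{__name__}.fetch_rank", return_value=2_000_000)  # stub: obscure site
    def test_obscure_site_is_suspicious(self, _mock):
        self.assertEqual(score_url("http://0x58.0xCC.0xCA.0x62/"), "suspicious")

result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(ScoreUrlTest))
```

Patching the dependency means each test case exercises only the unit under test, independent of the network, which is the isolation the paragraph above describes.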
Testing will not catch every error in the program, since it cannot evaluate every
execution path in any but the most trivial programs. The same is true for unit testing.
Additionally, unit testing by definition only tests the functionality of the units themselves.
Therefore, it will not catch integration errors or broader system-level errors (such as functions
performed across multiple units, or non-functional test areas such as performance).
Unit testing should be done in conjunction with other software testing activities, as they
can only show the presence or absence of particular errors; they cannot prove a complete
absence of errors. In order to guarantee correct behavior for every execution path and every
possible input, and ensure the absence of errors, other techniques are required, namely the
application of formal methods to proving that a software component has no unexpected
behavior.
This obviously takes time and its investment may not be worth the effort. There are
also many problems that cannot easily be tested at all – for example those that are
nondeterministic or involve multiple threads. In addition, code for a unit test is likely to be at
least as buggy as the code it is testing. Fred Brooks, in The Mythical Man-Month, quotes:
"never take two chronometers to sea. Always take one or three." Meaning: if two chronometers
contradict, how do you know which one is correct?
Another challenge related to writing the unit tests is the difficulty of setting up
realistic and useful tests. It is necessary to create relevant initial conditions so the part of the
application being tested behaves like part of the complete system. If these initial conditions
are not set correctly, the test will not be exercising the code in a realistic context, which
diminishes the value and accuracy of unit test results.
To obtain the intended benefits from unit testing, rigorous discipline is needed
throughout the software development process. It is essential to keep careful records not only
of the tests that have been performed, but also of all changes that have been made to the
source code of this or any other unit in the software. Use of a version control system is
essential. If a later version of the unit fails a particular test that it had previously passed, the
version-control software can provide a list of the source code changes (if any) that have been
applied to the unit since that time.
It is also essential to implement a sustainable process for ensuring that test case failures
are reviewed daily and addressed immediately. If such a process is not implemented and
ingrained into the team's workflow, the application will evolve out of sync with the unit test
suite, increasing false positives and reducing the effectiveness of the test suite.
Unit testing embedded system software presents a unique challenge: Since the
software is being developed on a different platform than the one it will eventually run on, you
cannot readily run a test program in the actual deployment environment, as is possible with
desktop programs
Functional testing
Functional testing is a quality assurance (QA) process and a type of black-box testing
that bases its test cases on the specifications of the software component under test. Functions
are tested by feeding them input and examining the output; internal program structure is
rarely considered (unlike in white-box testing). Functional testing usually describes what
the system does.
Functional testing differs from system testing in that functional testing "verifies a program
by checking it against ... design document(s) or specification(s)", while system testing
"validates a program by checking it against the published user or system requirements" (Kaner,
Falk, Nguyen 1999, p. 52).
Functional testing typically involves five steps:
1. The identification of functions that the software is expected to perform
2. The creation of input data based on the function's specifications
3. The determination of output based on the function's specifications
4. The execution of the test case
5. The comparison of actual and expected outputs
Performance testing
In software engineering, performance testing is in general testing performed to determine
how a system performs in terms of responsiveness and stability under a particular
workload. It can also serve to investigate, measure, validate or verify other quality
attributes of the system, such as scalability, reliability and resource usage.
Testing types
Load testing
Load testing is the simplest form of performance testing. A load test is usually
conducted to understand the behavior of the system under a specific expected load. This load
can be the expected concurrent number of users on the application performing a specific
number of transactions within the set duration.
This test will give the response times of all the important business-critical
transactions. If the database, application server, etc. are also monitored, then this simple test
can itself point towards bottlenecks in the application software.
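A toy version of such a load test can be sketched in Python. The `transaction` function here is a local stand-in we invented for illustration; a real load test would target the deployed application:

```python
# Toy load-test sketch: fire a fixed number of concurrent "transactions"
# at a stand-in function and record each response time.
import time
from concurrent.futures import ThreadPoolExecutor

def transaction():
    time.sleep(0.01)            # stand-in for one business-critical request
    return "ok"

def timed_call(_):
    start = time.perf_counter()
    transaction()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=10) as pool:   # 10 concurrent "users"
    response_times = list(pool.map(timed_call, range(50)))

print(f"max: {max(response_times):.4f}s  "
      f"avg: {sum(response_times) / len(response_times):.4f}s")
```

Recording the maximum and average response times under a known concurrency level is the simple baseline measurement that load testing starts from.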
Stress testing
Stress testing is normally used to understand the upper limits of capacity within the
system. This kind of test is done to determine the system's robustness in terms of extreme
load and helps application administrators to determine if the system will perform sufficiently
if the current load goes well above the expected maximum.
Soak testing
Soak testing, also known as endurance testing, is usually done to determine if the
system can sustain the continuous expected load. During soak tests, memory utilization is
monitored to detect potential leaks. Also important, but often overlooked is performance
degradation. That is, to ensure that the throughput and/or response times after some long
period of sustained activity are as good as or better than at the beginning of the test. It
essentially involves applying a significant load to a system for an extended, significant period
of time. The goal is to discover how the system behaves under sustained use.
Spike testing
Spike testing is done by suddenly increasing the number of, or the load generated by, users
by a very large amount and observing the behavior of the system. The goal is to determine
whether performance will suffer, the system will fail, or it will be able to handle dramatic
changes in load.
Configuration testing
Rather than testing for performance from the perspective of load, tests are created to
determine the effects of configuration changes to the system's components on the system's
performance and behavior. A common example would be experimenting with different
methods of load-balancing.
Isolation testing
Isolation testing is not unique to performance testing but involves repeating a test
execution that resulted in a system problem. Often used to isolate and confirm the fault
domain.
Integration testing
Purpose
Cases are simulated via appropriate parameter and data inputs. Simulated usage of
shared data areas and inter-process communication is tested, and individual subsystems are
exercised through their input interface.
Test cases are constructed to test whether all the components within assemblages
interact correctly, for example across procedure calls or process activations, and this is done
after testing individual modules, i.e. unit testing. The overall idea is a "building block"
approach, in which verified assemblages are added to a verified base which is then used to
support the integration testing of further assemblages.
Some different types of integration testing are big bang, top-down, and bottom-up.
Other Integration Patterns are: Collaboration Integration, Backbone Integration, Layer
Integration, Client/Server Integration, Distributed Services Integration and High-frequency
Integration.
Big Bang
In this approach, all or most of the developed modules are coupled together to form a
complete software system or major part of the system and then used for integration testing.
The Big Bang method is very effective for saving time in the integration testing process.
However, if the test cases and their results are not recorded properly, the entire integration process
will be more complicated and may prevent the testing team from achieving the goal of
integration testing.
A type of Big Bang Integration testing is called Usage Model testing. Usage Model
Testing can be used in both software and hardware integration testing. The basis behind this
type of integration testing is to run user-like workloads in integrated user-like environments.
In doing the testing in this manner, the environment is proofed, while the individual
components are proofed indirectly through their use.
For integration testing, Usage Model testing can be more efficient and provides better
test coverage than traditional focused functional integration testing. To be more efficient and
accurate, care must be taken in defining the user-like workloads for creating realistic
scenarios in exercising the environment. This gives confidence that the integrated
environment will work as expected for the target customers.
Verification and Validation are independent procedures that are used together for
checking that a product, service, or system meets requirements and specifications and that it
fulfills its intended purpose. These are critical components of a quality management system
such as ISO 9000. The words "verification" and "validation" are sometimes preceded with
"independent" (as in IV&V), indicating that the verification and validation is to be performed
by a disinterested third party.
It is sometimes said that validation can be expressed by the query "Are you
building the right thing?" and verification by "Are you building it right?" In practice, the usage
of these terms varies. Sometimes they are even used interchangeably.
The PMBOK guide, an IEEE standard, defines them as follows in its 4th edition:
• "Validation. The assurance that a product, service, or system meets the needs of the
customer and other identified stakeholders. It often involves acceptance and suitability
with external customers. Contrast with verification."
• Verification is intended to check that a product, service, or system (or portion thereof,
or set thereof) meets a set of initial design specifications. In the development phase,
verification procedures involve performing special tests to model or simulate a portion,
or the entirety, of a product, service or system, then performing a review or analysis of
the modeling results. In the post-development phase, verification procedures involve
regularly repeating tests devised specifically to ensure that the product, service, or system
continues to meet the initial design requirements, specifications, and regulations as time
progresses.
• Validation is intended to check that development and verification procedures for a
product, service, or system (or portion thereof, or set thereof) result in a product,
service, or system (or portion thereof, or set thereof) that meets initial requirements.
For a new development flow or verification flow, validation procedures may involve
modeling either flow and using simulations to predict faults or gaps that might lead to
invalid or incomplete verification or development of a product, service, or system(or
portion thereof, or set thereof). A set of validation requirements, specifications, and
regulations may then be used as a basis for qualifying a development flow or
verification flow for a product, service, or system(or portion thereof, or set thereof).
Additional validation procedures also include those that are designed specifically to
ensure that modifications made to an existing qualified development flow or
verification flow will have the effect of producing a product, service, or system (or
portion thereof, or set thereof) that meets the initial design requirements,
specifications, and regulations; these validations help to keep the flow qualified.
• It is a process of establishing evidence that provides a high degree of assurance that a
product, service, or system accomplishes its intended requirements. This often
involves acceptance of fitness for purpose with end users and other product
stakeholders. This is often an external process.
• It is sometimes said that validation can be expressed by the query "Are you building
the right thing?" and verification by "Are you building it right?". "Building the right
thing" refers back to the user's needs, while "building it right" checks that the
specifications are correctly implemented by the system. In some contexts, it is
required to have written requirements for both, as well as formal procedures or
protocols for determining compliance.
• It is entirely possible that a product passes when verified but fails when validated.
This can happen when, say, a product is built as per the specifications but the
specifications themselves fail to address the user's needs.
Activities
Each template of DQ, IQ, OQ and PQ can usually be found on the Internet,
whereas DIY qualifications of machinery/equipment can be assisted either
by the vendor's training course materials and tutorials, or by published guidance books,
such as step-by-step series, if the acquisition of the machinery/equipment is not bundled with on-
site qualification services. This kind of DIY approach is also applicable to the
qualification of software, computer operating systems and manufacturing processes. The
most important and critical task, as the last step of the activity, is to generate and archive
machinery/equipment qualification reports for auditing purposes, if regulatory compliance
is mandatory.
Qualification of machinery/equipment is venue dependent, in particular for items that are
shock sensitive and require balancing or calibration, and re-qualification needs to be
conducted once the objects are relocated.
The full scales of some equipment qualifications are even time dependent, as
consumables are used up (i.e. filters) or springs stretch out, requiring recalibration; hence
re-certification is necessary when a specified due time lapses. Re-qualification of
machinery/equipment should also be conducted when replacement of parts, coupling with
another device, installation of new application software, or restructuring of the computer
affects the pre-settings, such as the BIOS, registry, disk drive partition table,
dynamically-linked (shared) libraries, or an ini file. In such a
situation, the specifications of the parts/devices/software and the restructuring proposals should
be appended to the qualification document, whether the parts/devices/software are genuine or
not.
Torres and Hyman have discussed the suitability of non-genuine parts for clinical use
and provided guidelines for equipment users to select appropriate substitutes which are
capable of avoiding adverse effects. In the case when genuine parts/devices/software are
demanded by some regulatory requirements, re-qualification does not need to be
conducted on the non-genuine assemblies. Instead, the asset has to be recycled for non-
regulatory purposes.
System testing
The following examples are different types of testing that should be considered during
System testing:
o Sanity testing
o Smoke testing
o Exploratory testing
o Ad hoc testing
o Regression testing
o Installation testing
o Maintenance testing
o Recovery testing and failover testing
o Accessibility testing, including compliance with:
o Americans with Disabilities Act of 1990
o Section 508 Amendment to the Rehabilitation Act of 1973
o Web Accessibility Initiative (WAI) of the World Wide Web Consortium (W3C)
Although different testing organizations may prescribe different tests as part of System
testing, this list serves as a general framework or foundation to begin with.
Structure Testing:
Output Testing:
• Output of test cases compared with the expected results created during design of test
cases.
• Asking the user about the format required by them tests the output generated or
displayed by the system under consideration.
• Here, the output format is considered into two was, one is on screen and another one
is printed format.
• The output on the screen is found to be correct as the format was designed in the
system design phase according to user needs.
• The output comes out as per the specified requirements in the user's hard copy.
• Final stage, before handing over to the customer, which is usually carried out by the
customer, where the test cases are executed with actual data.
• The system under consideration is tested for user acceptance, constantly keeping in touch
with the prospective system users at the time of development and making changes
whenever required.
• It involves planning and execution of various types of test in order to demonstrate that
the implemented software system satisfies the requirements stated in the requirement
document.
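The on-screen versus printed format check described above can be sketched as a simple assertion against the agreed layout. `format_result` is a hypothetical helper introduced only for illustration.

```python
# Hypothetical helper that renders one prediction for display or print.
def format_result(url, label):
    status = "PHISHING" if label == 1 else "SAFE"
    return f"{url}\t{status}"

# The displayed output must follow the tab-separated layout agreed with the user.
line = format_result("http://example.com", 0)
assert line == "http://example.com\tSAFE"
print(line)
```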
CHAPTER 8
SAMPLE CODING
Decision_tree

import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Class distribution of the training and test labels
train_data = labels_train.value_counts()
test_data = labels_test.value_counts()

objects = ('Yes', 'No')
y_pos = np.arange(len(objects))
performance = test_data
plt.show()

# Train the decision tree and predict on the held-out test data
model = DecisionTreeClassifier()
model.fit(data_train, labels_train)
pred_label = model.predict(data_test)

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

accuracy_score(labels_test, pred_label)
labels = [0, 1]
plt.show()
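The confusion matrix and accuracy computed above can be illustrated on a toy label set without any plotting; building the 2x2 matrix by hand makes the counts explicit. The labels below are made up for illustration only.

```python
# Toy true and predicted labels (1 = phishing, 0 = legitimate).
labels_test = [0, 0, 1, 1, 1, 0]
pred_label = [0, 1, 1, 1, 0, 0]

# cm[true][pred]: rows are true labels, columns are predicted labels.
cm = [[0, 0], [0, 0]]
for true, pred in zip(labels_test, pred_label):
    cm[true][pred] += 1

# Accuracy is the diagonal (correct predictions) over the total count.
accuracy = sum(cm[i][i] for i in range(2)) / len(labels_test)
print(cm)        # [[2, 1], [1, 2]]
print(accuracy)  # 0.666...
```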
Feature_Extraction

import numpy as np
import pandas as pd
import re

# Read the raw URL list; the unmatchable separator keeps each line intact
raw_data = pd.read_csv('1000.txt', sep='delimiter', header=None, engine='python')
raw_data.head()
raw_data.columns = ["websites"]

# Split each URL on "://" to separate the protocol from the rest
seperation_of_protocol = raw_data['websites'].str.split("://", expand=True)
seperation_of_protocol.head()
type(seperation_of_protocol)
seperation_of_protocol.columns = ["Protocol", "domain_name", "address"]

# Split the remainder on the first "/" into domain name and page address
seperation_domain_name = seperation_of_protocol['domain_name'].str.split("/", 1, expand=True)
seperation_domain_name.columns = ["Domain_name", "Address"]
seperation_domain_name.head()

# Recombine into a single frame and drop rows with missing parts
splitted_data = pd.concat([seperation_of_protocol['Protocol'], seperation_domain_name], axis=1)
splitted_data.columns = ['protocol', 'domain_name', 'address']
splitted_data.head()
splitted_data.isnull().sum()
splitted_data = splitted_data.dropna()

def long_url(l):
    """This function is defined in order to differentiate websites based on
    the length of the URL."""
    if len(l) < 54:
        return 0
    elif 54 <= len(l) <= 75:
        return 2
    return 1

splitted_data['long_url'] = splitted_data['address'].apply(long_url)
splitted_data[splitted_data.long_url == 0].head()

def have_at_symbol(l):
    """This function is used to check whether the URL contains the '@' symbol or not."""
    if "@" in l:
        return 1
    return 0

splitted_data['having_@_symbol'] = splitted_data['address'].apply(have_at_symbol)

def redirection(l):
    """If the URL has the symbol '//' after the protocol, then such a URL is
    to be classified as phishing."""
    if "//" in l:
        return 1
    return 0

splitted_data['redirection_//_symbol'] = splitted_data['domain_name'].apply(redirection)

def sub_domains(l):
    """Count dots in the domain name to estimate the number of sub-domains."""
    if l.count('.') < 3:
        return 0
    elif l.count('.') == 3:
        return 2
    return 1

splitted_data['sub_domains'] = splitted_data['domain_name'].apply(sub_domains)

def having_ip_address(url):
    """Check whether the URL uses a raw IP address instead of a domain name."""
    match = re.search(
        '(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.'
        '([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\/)|'  # IPv4
        '((0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.'
        '(0x[0-9a-fA-F]{1,2})\\/)|'  # IPv4 in hexadecimal
        '(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6
    if match:
        return 1
    return 0
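To show how these feature functions fit together, the sketch below builds one numeric feature vector per URL using simplified stand-ins: the IP check is deliberately reduced to a dotted-quad pattern and is not the report's full IPv4/hex/IPv6 regex, and the `features` helper is illustrative rather than part of the project code.

```python
import re

# Simplified stand-ins for the feature functions defined above.
def long_url(l):
    if len(l) < 54:
        return 0
    elif 54 <= len(l) <= 75:
        return 2
    return 1

def have_at_symbol(l):
    return 1 if "@" in l else 0

def having_ip_address(url):
    # Reduced dotted-quad check, not the full regex used in the report.
    return 1 if re.search(r"(\d{1,3}\.){3}\d{1,3}", url) else 0

def features(url):
    # One numeric feature vector per URL, ready for a classifier.
    return [long_url(url), have_at_symbol(url), having_ip_address(url)]

print(features("http://192.168.0.1/@login"))  # [0, 1, 1]
```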
CHAPTER 9
OUTPUT SCREENS
(Figure: confusion matrix of the classifier on the test set; axes show true label vs. predicted label.)
CHAPTER 10
CONCLUSION AND
FUTURE ENHANCEMENT
Conclusion
Phishing attacks are a major problem, and it is important that they are countered. The
work reported in this thesis indicates how the understanding of the nature of phishing may be
increased, and it provides a method to identify phishing problems in systems. It also contains a
prototype of a system that catches those phishing attacks that evaded other defences, i.e. those
attacks that have "slipped through the net". An original contribution has been made in this
important field, and the work reported here has the potential to make the internet a
safer place for a significant number of people.
Future Work
In the future, we intend to provide technical solutions that improve the efficiency of spam
filters, so that more mails are classified correctly and legitimate users can surf the
internet with less fear. The user-phishing interaction model was derived from the application of
cognitive walkthroughs. A large-scale controlled user study and follow-on interviews could
be carried out to provide a more rigorous conclusion. The current model does not describe
irrational decision making, nor does it address influence from other external factors such as
emotion, pressure, and other human factors. It would be very useful to expand the model to
accommodate these factors. We have theoretically and experimentally evaluated PhishLimiter.
We have evaluated the trustworthiness of each SDN flow to identify potential
hazards based on deep packet inspection. Likewise, we have observed how the proposed
inspection approach of the two SF and FI modes within PhishLimiter detects and mitigates
phishing attacks before they reach end users when a flow has been determined untrustworthy.
Using our real-world experimental evaluation on GENI and a phishing dataset, we have
demonstrated that PhishLimiter is an effective and efficient solution for detecting and
mitigating phishing attacks, with an accuracy of 98.39%.
CHAPTER 11
BIBLIOGRAPHY & REFERENCE