Classification of Features For Detecting Phishing Web Sites Based On Machine Learning Techniques

This document discusses classification of features for detecting phishing web sites using machine learning techniques. It begins by introducing phishing attacks and how they work, tricking users into entering sensitive information on fake websites. It then discusses the problems with phishing detection and the objectives of classifying features to better identify phishing sites. The document explores phishing kits, which make launching phishing attacks easier, and different types of phishing attacks. The overall goal is to determine the best features and machine learning algorithms for improved phishing website detection.

Classification Of Features For Detecting Phishing Web Sites Based On Machine Learning Techniques

CHAPTER 1
INTRODUCTION

Department of Computer Science and Engineering, PLITMS, Buldana Page 1



1.1 Introduction
The Internet has tremendously changed the way we work and communicate with
each other. Applications such as e-mail, file transfer, voice communication and
YouTube are available for anyone to use. But with its success have
come weaknesses and vulnerabilities. The protocols and applications responsible
for its success are being exploited by malicious users and hackers seeking
notoriety. Phishing websites are one such area where administrators need new
techniques and algorithms to protect naive users from being exploited. Phishing is an
attempt at fraud aimed at stealing our information, and it is mostly carried out by email.
The best way to protect ourselves from phishing attacks is to recognize them when
they occur. Phishing emails usually appear to come from trusted sources and try to obtain our
valuable information, for instance our passwords, bank details or even SSN. Often,
these attacks claim to come from sites where we have never created any type of
account. The procedure followed by phishers is to bring us to their website
by means of an email. In those emails, they get us to click on a link
that directs us to their websites.
These phishing websites look very similar to their respective
legitimate ones, and often the only distinguishing factor is their URLs. Messages
that appear to come from social websites, banks and online payment portals are used to deceive
users. These phishing emails mostly contain links to websites that are infected with
malware. Some of the ways to tackle phishing attacks include raising
awareness among people and training the users.[1]

"Phish" is pronounced just like it's spelled, which is to say like the word "fish"
— the analogy is of an angler throwing a baited hook out there (the phishing email)
and hoping you bite. The term arose in the mid-1990s among hackers aiming to trick
AOL users into giving up their login information. The "ph" is part of a tradition of
whimsical hacker spelling, and was probably influenced by the term "phreaking,"
short for "phone phreaking," an early form of hacking that involved playing sound
tones into telephone handsets to get free phone calls.


Some phishing scams have succeeded well enough to make waves:

• Perhaps one of the most consequential phishing attacks in history happened in
  2016, when hackers managed to get Hillary Clinton campaign chair John
  Podesta to offer up his Gmail password.
• The "fappening" attack, in which intimate photos of a number of celebrities were
  made public, was originally thought to be a result of insecurity on Apple's
  iCloud servers, but was in fact the product of a number of successful phishing
  attempts.
• In 2016, employees at the University of Kansas responded to a phishing email
  and handed over access to their paycheck deposit information, resulting in
  them losing pay.[2]

Fig. 1.1: Block diagram of phishing attack


1.2 Explanation of Phishing Attack


1) Creating a fake website: As part of a phishing attack, attackers create a fake
website that looks similar to the original website. They reuse the main features
of the original website, such as its logo and design, so that users do not
suspect the fake.
2) Linking a fake website through email: Once the fake website has been
created, attackers send thousands of e-mails to multiple users and try to get the
recipients to click a URL that redirects to the fake website.
3) Clicking a malicious URL: Users who are not aware that the URL provided in
the email is malicious click it and are directed to the fake website
set up by the attackers. This is where the phishing attack begins.
4) Entering sensitive information: Once redirected to the fake
website, the user enters sensitive information such as login credentials and other details
in order to access the website created by the attacker.
5) Compiling the stolen data and using it: Once the user enters the sensitive
information, all of it is collected so that the attacker can sell the
data or use it for his/her own purposes.[3]

1.3 Problem Identification


Phishing URLs are considered the fastest-rising online crime method used for stealing
personal financial data and perpetrating identity theft. Individuals who respond to
a phishing URL and enter the requested financial or personal information into e-mails,
websites, or pop-up windows put themselves and their institutions at risk.
The Microsoft Consumer Safety Index survey showed that the annual worldwide
impact of phishing email was US $5 billion, while the cost of repairing
that impact was US $6 billion (MCSI reveals the impact of poor online safety
behaviors). Despite the large body of existing work on phishing detection, no set
of features has been established as the best for detecting phishing. The
same uncertainty applies to the underlying classification algorithm.
Finally, there is a need to keep improving the accuracy of the detection techniques.
Overall, the problems addressed in this project are as follows:
• How to determine the best set of features to be used for phishing detection.
• How to select the best classification algorithm to be used for phishing
  detection.
• How to enhance the performance of the best selected features and classifiers.
• How to integrate multiple classification algorithms for phishing detection and
  how to evaluate such integration.

1.4 Objectives
• To carry out an exploratory analysis of the Phishing Websites Data Set and an
  interpretation of it.
• To determine and evaluate the best set of features to be used for
  phishing detection.
• To create a new dataset which has recent website entries to get better
  accuracy.
• To determine the best classification algorithm for phishing detection.
• To distinguish phishing websites from legitimate websites and ensure
  secure transactions to users.

1.5 Phishing Kit


The availability of phishing kits makes it easy for cyber criminals, even those
with minimal technical skills, to launch phishing campaigns, because the kits make
the techniques easy to understand. A phishing kit bundles phishing website resources and
tools that need only be installed on a server. Once installed, all the attacker needs to
do is send out emails to potential victims. Phishing kits as well as mailing lists are
available on the dark web. A couple of sites, PhishTank and OpenPhish, keep
crowd-sourced lists of known phishing kits.
The Duo Labs report, Phish in a Barrel, includes an analysis of phishing kit
reuse. Of the 3,200 phishing kits that Duo discovered, 900 (27 percent) were found on
more than one host. That number might actually be higher, however. “Why don’t we
see a higher percentage of kit reuse? Perhaps because we were measuring based on
the SHA1 hash of the kit contents. A single change to just one file in the kit would
appear as two separate kits even when they are otherwise identical,” said Jordan
Wright, a senior R&D engineer at Duo and the report’s author.
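
The effect Wright describes can be illustrated with a minimal sketch. It assumes, purely for illustration, that a kit is fingerprinted by feeding its file names and contents, in sorted order, into a single SHA-1 hash. The function name kit_fingerprint and the sample file contents are hypothetical and are not taken from the Duo report.

```python
import hashlib


def kit_fingerprint(files):
    """Return a SHA-1 hex digest over a kit's files (hypothetical scheme).

    `files` maps a file name to its bytes. Names are processed in sorted
    order so the result does not depend on dictionary ordering.
    """
    digest = hashlib.sha1()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(files[name])
    return digest.hexdigest()


# Two made-up kits that differ only in the address the credentials are sent to.
kit_a = {
    "index.html": b"<form action='post.php'>",
    "post.php": b"mail('a@example.com');",
}
kit_b = {
    "index.html": b"<form action='post.php'>",
    "post.php": b"mail('b@example.com');",
}

same_kit = kit_fingerprint(kit_a) == kit_fingerprint(dict(kit_a))
changed_kit = kit_fingerprint(kit_a) == kit_fingerprint(kit_b)
print(same_kit, changed_kit)
```

Under this scheme an identical copy yields the same digest, while a one-line change yields a different one, which is why hash-based counting may understate kit reuse.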


1.6 Explanation of Phishing Kit


The following procedure is followed in building and using a phishing kit:
• First, a sensitive website is cloned to make an exact copy
  of all its features.

• Next, once the logic of the original website's login pattern is understood,
  a script is written to capture the login credentials.

• This logic, or script data, is then bundled into a zip file to
  form the phishing kit.

• The original website's login credential script is then modified by
  unzipping the bundle into it.

• The misuse then continues by spoofing and sending phishing
  links to various users.
Analyzing phishing kits allows security teams to track who is using them. “One of the
most useful things that can be learned from analyzing phishing kits is where credentials
are being sent. By tracking email addresses found in phishing kits, we can correlate
actors to specific campaigns and even specific kits,” said Wright in the report. “It gets
even better. Not only can we see where credentials are sent, but we also see where
credentials claim to be sent from. Creators of phishing kits commonly use the ‘From’
header like a signing card, letting us find multiple kits created by the same author.”[4]

1.7 Types of Phishing


Phishing attacks are broadly classified into four categories: Phishing, Spear
Phishing, Whale Phishing and Clone Phishing.
Phishing: A technique in which an attacker impersonates a trustworthy entity
in order to obtain sensitive information such as usernames, passwords, bank account
numbers etc.
Spear phishing
When attackers craft a message to appeal to a specific individual, that's
called spear phishing. (The image is of a fisherman aiming for one specific fish, rather
than just casting a baited hook in the water to see who bites.) Phishers identify their
targets (sometimes using information on sites like LinkedIn) and use spoofed
addresses to send emails that could plausibly look like they're coming from
co-workers. For instance, the spear phisher might target someone in the finance
department and pretend to be the victim's manager requesting a large bank transfer on
short notice.

Whale phishing
Whale phishing, or whaling, is a form of spear phishing aimed at the very big fish:
CEOs or other high-value targets. Many of these scams target company board
members, who are considered particularly vulnerable: they have a great deal of
authority within a company, but since they aren't full-time employees, they often use
personal email addresses for business-related correspondence, which doesn't have the
protections offered by corporate email.

1.8 Organization of the Report.


This report is organized as follows.

Chapter No. Chapter name


1. INTRODUCTION
2. LITERATURE SURVEY
3. PROBLEM DEFINITION
4. METHODS OF FEATURE EXTRACTION OF URL’s
5. REQUIREMENT ANALYSIS AND SPECIFICATION
6. PROPOSED WORK
7. IMPLEMENTATION AND TESTING
8. RESULT AND DISCUSSION
9. CONCLUSION
10. LIMITATION AND FUTURE SCOPE
REFERENCES

Table No. 1.1 Report Organization


CHAPTER 2
LITERATURE SURVEY


2.1 Literature Survey


Sr. no.: 1
Name of Author: Ishant Tyagi; Jatin Shad; Shubham Sharma; Siddharth Gaur; Gagandeep Kaur
Publication: 5th International Conference on Signal Processing and Integrated Networks (SPIN)
Year of Publication: 27 September 2018
Title: A Novel Machine Learning Approach to Detect Phishing Websites
Description: This paper is focused on various Machine Learning algorithms aimed at
predicting whether a website is phishing or legitimate. Machine learning
solutions are able to detect zero hour phishing attacks and they are better at
handling new types of phishing attacks, so they are preferred. In our
implementation, we managed an accuracy of 98.4% in prediction a website
to be phishing or legitimate.

Sr. no.: 2
Name of Author: Miss Sneha Mande, Prof. D. S. Thosar
Publication: www.ijariie.com
Year of Publication: Vol-4 Issue-6 2018, IJARIIE-ISSN(O)-2395-4396
Title: Detection of Phishing Web Sites Based On Extreme Machine Learning
Description: Phishing makes utilization of spoof messages that are made to look valid
and implied to be originating from honest to goodness sources like money
related foundations, e-commerce destinations and so forth, to draw clients
to visit fake sites through joins gave in the phishing email.

Sr. no.: 3
Name of Author: Ebubekir Buber, Önder Demir, Ozgur Koray Sahingoz
Publication: IEEE
Year of Publication: 978-1-5386-1880-6/17/
Title: Feature Selections for the Machine Learning based Detection of Phishing Websites
Description: As a software detection scheme, two main approaches are widely used:
blacklists/whitelists and machine learning approaches. Machine learning
solutions are able to detect zero-hour phishing attacks and they have
superior adaption for new types of phishing attacks, therefore they are
mainly preferred.

Sr. no.: 4
Name of Author: Melanie Volkamer, Benjamin Maximilian Berens
Publication: Computers & Security
Year of Publication: February 2017
Title: User experiences of TORPEDO: TOoltip-powered phishing email DetectiOn
Description: We propose a concept called TORPEDO to improve phish detection by
providing just in time and just-in-place trustworthy tooltips to help people
judge links embedded in emails. TORPEDO's tooltips contain the actual
URL with the domain highlighted and delay link activation for a short
period, giving the person time to inspect the URL before they click.

Sr. no.: 5
Name of Author: Junaid Ahsenali Chaudhry, Robert G. Rittenhouse
Publication: 7th International Conference on Multimedia, Computer Graphics and Broadcasting
Year of Publication: 2015
Title: Phishing: Classification and Counter Measures
Description: We highlight countermeasures to tackle phishing and propose suggestions
to businesses in order to minimize the loss of revenue and reputation to
phishing attacks. The paper discusses technical methods of defending
against phishing attacks hardening the infrastructure and training
employees against phishing attacks.

Table No.2.1 Technical Clarification of various authors


CHAPTER 3
PROBLEM DEFINITION


3.1 Problem Definition

The process of protecting cyberspace from attacks has come to be
known as cyber security [16, 32, 37]. Cyber security covers protecting
internet-connected resources from cyber-attacks, preventing such attacks, and recovering from them
[20, 38, 47]. The complexity of the cyber security domain increases daily, which
makes identifying, analyzing, and controlling the relevant risk events a significant
challenge. Cyber attacks are malicious digital attempts to steal, damage, or intrude
into personal or organizational confidential data [2]. A phishing attack uses fake
websites to capture sensitive client data, for example account login credentials and credit
card numbers. In 2018, the Anti-Phishing Working Group (APWG)
reported more than 51,401 unique phishing websites. Another report, by RSA, estimated that
organizations worldwide suffered losses of $9 billion due to phishing
attacks in 2016 [26]. These statistics suggest that current
anti-phishing techniques and efforts are not effective enough. Figure 1 shows how a typical
phishing attack happens.

Personal computer users fall victim to phishing attacks for
five primary reasons [60]: (1) users lack detailed knowledge of Uniform
Resource Locators (URLs), (2) they do not know exactly which pages can be trusted, (3)
they cannot see the entire location of a page because of redirection or hidden URLs, (4) a URL
has many possible variants, or some pages are entered accidentally, and (5) users cannot
differentiate a phishing web page from a legitimate one.

Phishing websites are common entry points for online social engineering
attacks, including numerous ongoing web scams [30]. In such attacks, the
attackers create web pages by copying genuine websites and send suspicious URLs
to the targeted victims through spam messages, texts, or online social networking. An
attacker distributes a fake version of an original website through email, phone, or text
messages [5], in the expectation that the targeted victims will accept the claims made in
the message. The aim is to get the victim to enter personal or highly
sensitive data (e.g., bank details, government savings number, etc.). A phishing attack
results in the attacker acquiring bank card information and login data. There are,
however, several methods to combat phishing [27]. The expanded use of Artificial
Intelligence (AI) has affected essentially every industry, including cyber security. In
email security, AI has brought speed, accuracy, and the capacity for
detailed investigation. AI can detect spam, phishing, spear phishing, and other
kinds of attacks using previous knowledge in the form of datasets. These types of
attacks tend to reduce clients' trust in online services such as
web services. According to the APWG report, 1,220,523 phishing attacks were
reported in 2016, a 65% increase over 2015 [1]. Figure 2 shows the
Phishing Report for the third quarter of 2019.

As per Parekh et al. [51], a generic phishing attack has four stages.
First, the phisher creates and sets up a fake website that looks like an authentic website.
Second, the phisher sends a URL link to the website to a targeted victim,
pretending to be a genuine organization, user, or association. Third, the victim
is tempted to visit the fake website. Fourth, the
victim clicks on the fake link and enters his/her valuable data.
Using the victim's personal data, the phisher then carries out impersonation
activities. APWG publishes reports on
phishing URLs and analyzes the constantly evolving nature and techniques of
cybercrime. The APWG tracks the number of
unique phishing websites, a primary measure of phishing across the globe.
Phishing sites are counted by their unique base URLs. The total number of phishing
websites detected by APWG in the third quarter of 2019 was 266,387 [3]. This was
up 46% from the 182,465 seen in Q2, and almost double the 138,328 seen in
Q4 2018.

Attack techniques are grouped into two categories: attack launching and
data collection. For attack launching, several techniques have been identified, such as email
spoofing, attachments, abusing social settings, URL spoofing, website spoofing,
interactive voice response, collaboration in a social network, reverse social engineering,
man-in-the-middle attacks, spear phishing, spoofed mobile internet browsers and
installed web content. For data collection during and after the victim’s
interaction with the attack, various techniques are used [49]. There are
two types: automated data collection techniques
(such as fake website forms, key loggers, and recorded messages) and
manual data collection techniques (such as human deception and social
networking). Then, there are counter-measures for the victim’s data collected or used
before and after the attack. These counter-measures are used to detect and prevent
attacks. We categorize the counter-measures into four groups: (1) deep learning-based
techniques, (2) machine learning techniques, (3) scenario-based techniques,
and (4) hybrid techniques.

To the best of our knowledge, the existing literature [11, 18, 28, 40, 62]
includes a limited number of surveys, which focus mainly on providing an overview of attack
detection techniques. These surveys do not cover all deep learning,
machine learning, hybrid, and scenario-based techniques in detail. In addition, they lack
an extensive discussion of current and future challenges for phishing
attack detection.

3.2 Objectives

Keeping in sight the above limitations, this article makes the following contributions:
1. To provide a comprehensive and easy-to-follow survey focusing on deep learning,
machine learning, hybrid learning, and scenario-based techniques for phishing
attack detection.
2. To provide an extensive discussion on various phishing attack techniques and
comparison of results reported by various studies.
3. To provide an overview of current practices, challenges, and future research
directions for phishing attack detection.

3.3 Advantages of Proposed system


Phishing is a deceitful attempt to obtain sensitive data, for example
usernames and passwords, using social engineering approaches to
deceive website users into giving up their credentials [24]. Phishers prey on
human emotion and the urge to follow instructions in a flow. Phishing is so
widespread on the internet that it has become a constant threat. The
biggest challenge is that attackers are continuously devising new approaches to
deceive clients into falling for their phishing traps.
A comparative study of previous work using different approaches is
discussed in detail in the section above. Machine learning-based, deep
learning-based, scenario-based, and hybrid techniques have been
deployed in the past to tackle this problem. A detailed comparative analysis revealed that
machine learning methods are the most frequently used and effective methods for
detecting a phishing attack. Different classification methods such as SVM, RF, ANN,
C4.5, k-NN and DT have been used. Techniques with feature reduction give better
performance. Classification with ELM, SVM, LR, C4.5, LC-ELM, k-NN and
XGB, combined with feature selection using ANOVA, detected phishing attacks with 99.2%
accuracy, the highest among the methods proposed so far, but with trade-offs in
terms of computational cost.


CHAPTER 4
METHODS OF FEATURE EXTRACTION OF URL’s


4.1 Algorithm:

The algorithm below describes how a website is classified as legitimate or
phishing, based on a classification of its parameters. Features are extracted
from the wording of the website in terms of words, phrases and
letters, and are verified against the prepared dataset.

1. Prepare the dataset, which includes the set of words and phrases.

2. Once pre-processing is complete, extract the features from the URL.

3. Compute the attribute values:

Attribute present: value = 1

Attribute absent: value = -1

Attribute not considered: value = 0

3.1 Select attributes X and Y.

3.2 Compute the equation for X and Y.

4. Calculate the threshold by exact matching against the dataset.

5. Find the range value.

6. Distinguish phishing from legitimate sites using the attribute values and by
comparison with the dataset.

7. Compute sensitivity and specificity. [4][6]
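
Steps 3 and 7 can be sketched as follows. This is a minimal illustration rather than the report's implementation: the function names are hypothetical, the phishing class is assumed to be the positive class, and the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP) are used.

```python
def encode_attribute(present):
    """Map an observation to the values of step 3.

    True -> 1 (present), False -> -1 (absent), None -> 0 (not considered).
    """
    if present is None:
        return 0
    return 1 if present else -1


def sensitivity_specificity(actual, predicted, positive="phishing"):
    """Return (sensitivity, specificity) with `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Made-up labels for five sites, for illustration only.
actual = ["phishing", "phishing", "legitimate", "legitimate", "legitimate"]
predicted = ["phishing", "legitimate", "legitimate", "legitimate", "phishing"]
sens, spec = sensitivity_specificity(actual, predicted)
print(encode_attribute(True), encode_attribute(False), encode_attribute(None))
print(sens, spec)
```

Here one of the two phishing sites is caught (sensitivity 0.5) and two of the three legitimate sites are correctly passed (specificity 2/3).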

4.2 Feature Categories of URL Features

The URL features are grouped into 4 categories as shown below.

• Address bar related features
• Abnormal based features
• HTML and JavaScript based features
• Domain based features


4.1.1. Address bar-related features: The features related to the address of
a URL are referred to as address bar-related features. These include the length of the
host URL, the number of dots and slashes, special characters, the HTTP and SSL check,
the @ symbol, and the IP address.
a. Length of the host URL: A URL is an alphanumeric string used to access
network resources on the World Wide Web (WWW). It is a combination of the
network protocol, the hostname and the path. The length of the hostname of a URL is one of
the key features extracted when detecting phishing URLs.

b. Number of dots and slashes: A URL sometimes contains multiple sub domains.
Sub domains are parts of the domain name that further narrow down the hierarchy of
the Domain Name System. The number of dots and slashes in a URL
indicates the number of sub domains, which is used to judge whether the URL is
phishing, legitimate or suspicious.[12]
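
The two features above can be extracted with Python's standard urllib.parse module. The sketch below is illustrative only: the function names and the sample URL are assumptions, and no classification threshold is applied because the text does not state one.

```python
from urllib.parse import urlsplit


def host_length(url):
    """Length of the hostname part of the URL (0 if there is none)."""
    return len(urlsplit(url).hostname or "")


def count_dots_and_slashes(url):
    """Return (dots in the hostname, slashes in the path)."""
    parts = urlsplit(url)
    dots = (parts.hostname or "").count(".")
    slashes = parts.path.count("/")
    return dots, slashes


# Made-up URL with several sub domains in front of the registered domain.
sample_url = "http://secure.login.paypal.example.com/a/b/index.html"
print(host_length(sample_url))
print(count_dots_and_slashes(sample_url))
```

Counting dots only in the hostname and slashes only in the path is a design choice made here so that the scheme separator "://" does not inflate the counts.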

c. Special Characters: A URL contains special characters such as # and - in order to
differentiate terms in the hostname and to distinguish between the hostname and the domain
name of a particular website. Attackers use certain special characters to
trick web users into following links that contain them in order to carry out
phishing attacks.[5]

d. HTTP with SSL Check: Some URLs use transport layer security to protect the
connection from attacks. The HTTPS protocol adds a security layer so that
sensitive information can be transferred across the network safely. So, in order to
determine whether a URL is legitimate or not, parameters such as the use of HTTPS, the authenticity
of the certificate and the age of the certificate play a vital role.

e. @ Symbol: The @ symbol is used by attackers because it makes the web browser ignore
everything before it and take the user to the address that follows it.
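
This behaviour can be observed with urllib.parse, which treats the text before "@" as user information and only the text after it as the host. The sketch below uses a made-up URL and a hypothetical helper name.

```python
from urllib.parse import urlsplit


def has_at_symbol(url):
    """Return True if '@' appears in the authority part of the URL."""
    return "@" in urlsplit(url).netloc


# Made-up URL: the familiar name before '@' is not the host that is contacted.
deceptive_url = "http://www.paypal.com@phish.example.net/login"
parts = urlsplit(deceptive_url)
print(parts.username)   # text before '@', ignored when resolving the host
print(parts.hostname)   # host that would actually be contacted
print(has_at_symbol(deceptive_url))
```

Checking only the authority part avoids flagging URLs that merely contain "@" in a path or query string.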

f. IP Address: An IP address is a unique identifier assigned to a computer on
the network. Attackers use the IP address instead of the domain name to trick web
users. A legitimate URL is normally formed from a hostname and a path name rather than
from an IP address and a path name.[10]
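
A minimal sketch of this check, using the standard ipaddress module, is given below. The function name is hypothetical, and the sketch only recognises dotted-decimal IPv4 and bracketed IPv6 hosts; obfuscated forms such as hexadecimal-encoded addresses would need additional handling.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_address(url):
    """Return True if the URL's host is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so it is treated as a hostname.
        return False
    return True


print(host_is_ip_address("http://125.98.3.123/fake.html"))
print(host_is_ip_address("https://www.example.com/index.html"))
```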


4.1.2. Abnormal based features: URL features that relate to anomalies or
discrepancies between the W3C objects and the web identity are known as abnormal
based features. They are mostly related to the source code of the web page
and play an important role in identifying phishing websites. These
features include Request URL (RURL), URL of an anchor (AURL) and Server Form
Handler (SFH).

a. Request URL: For most legitimate websites, external objects such as
external scripts, CSS, images and other attachments are tied to the site's own domain. So,
the Request URL feature can be used to categorize websites by checking
whether the external files are linked to the original domain or not.
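
One way to quantify the Request URL feature is the share of embedded objects that are loaded from a different host than the page itself. The sketch below is an assumption-laden illustration: the class and function names are hypothetical, hosts are compared by exact match (so a site's own CDN sub domain would count as external), and no decision threshold is applied.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ResourceCollector(HTMLParser):
    """Collect URLs of embedded objects (images, scripts, stylesheets)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.urls.append(attrs["href"])


def external_request_ratio(page_url, html):
    """Fraction of embedded objects served from a host other than the page's."""
    collector = ResourceCollector()
    collector.feed(html)
    if not collector.urls:
        return 0.0
    page_host = urlsplit(page_url).hostname
    external = sum(
        1
        for u in collector.urls
        if urlsplit(urljoin(page_url, u)).hostname != page_host
    )
    return external / len(collector.urls)


# Made-up page: two local objects and two objects from another host.
page = (
    '<img src="/logo.png">'
    '<script src="http://evil.example.net/x.js"></script>'
    '<link rel="stylesheet" href="style.css">'
    '<img src="http://evil.example.net/banner.png">'
)
ratio = external_request_ratio("http://shop.example.com/index.html", page)
print(ratio)
```

The same collector could be extended to anchor tags to approximate the anchor feature, again under the exact-host-match assumption.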

b. Anchor of a URL: This feature is similar to the Request URL. It verifies
that all the anchors in a web page point to the same domain as the
web page itself. In this way, all the anchors in a tested URL can be checked to determine
whether that particular website is a phishing site.

c. Server Form Handler: For web pages or websites that need authentication
or authorization, a server form with a username and password must be filled in to
access the site. The server form handler serves as an important feature for
differentiating phishing sites from legitimate websites. It takes the
following form:
<form action="/login/login.jsp" method="post" target="_login">
The form tag above is processed by a server-side script, which performs an action based
on the user’s navigation. The "action" attribute describes the path and
"method" describes the type of HTTP method used in handling the page request.
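
The action attribute can be extracted and compared with the page's own host as sketched below. The three-way labelling (empty or about:blank action, action on a different host, action on the same host) is an assumption made for illustration and is not stated in the text; the class and function names are likewise hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class FormActionCollector(HTMLParser):
    """Collect the action attribute of every form tag in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action") or "")


def classify_form_action(page_url, action):
    """Label a form action (illustrative rule, not taken from the report)."""
    if action.strip() in ("", "about:blank"):
        return "phishing"
    target_host = urlsplit(urljoin(page_url, action)).hostname
    if target_host == urlsplit(page_url).hostname:
        return "legitimate"
    return "suspicious"


collector = FormActionCollector()
collector.feed('<form action="/login/login.jsp" method="post" target="_login">')
page_url = "https://bank.example.com/home"
labels = [classify_form_action(page_url, a) for a in collector.actions]
print(collector.actions, labels)
```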

4.1.3. HTML and JavaScript based Features: The features which are related to
HTML tags and JavaScript functions are treated as HTML and JavaScript based
features. These features include Redirect Page, Disabling Right Click and Using Pop-
up window and on Mouse Over.

a. Redirect Page: Redirecting a web page is a technique of navigating the web users
to a different webpage other than the requested page. Many attackers use open redirect
pages found on the web pages to redirect the link to their illegitimate sites. So, the

Department of Computer Science and Engineering, PLITMS, Buldana Page 19


CLASSIFICATION OF FEATURES FOR DETECTING PHISHING WEB SITES BASED ON MACHINE LEARNING TECHNIQUES

number of redirects used in a web page determines whether the website is a legitimate
or not.

b. on Mouse Over feature: The on Mouse Over event is triggered in JavaScript


whenever the mouse pointer over an element. The attackers use the Java Script code to
display an unauthentic URL in the status bar of a web page in order to trick the web
users. So, this feature is used to determine whether the status bar is changed or not
when on Mouse Over event is triggered.

c. Right Click: This feature is similar to the onMouseOver feature. Attackers use
JavaScript code to hide the source code from web users by disabling the right-click
function. This feature helps distinguish phishing websites from legitimate websites.

4.1.4. Domain based features: The URL features related to domain name-based
information are known as Domain based features. These features include Alexa Page
Rank, Age of the Domain, DNS Record and Website Traffic.

a. Alexa Page Rank: The rank of a website indicates its popularity, that is, how many
users access it. Attackers typically maintain phishing URLs only for a limited amount
of time. Once the links or URLs expire, they no longer appear on the Internet, so they
rarely achieve a good rank. The Alexa page rank is therefore one of the many features
used to detect malicious URLs.

b. Age of the Domain: This feature gives the approximate age of the domain of a
website. The older the domain, the more likely it is to be legitimate. So, this
feature is used to classify a website as legitimate, suspicious or phished based on
the domain age extracted from the WHOIS database.

c. DNS Record: DNS records are mapping entries that inform DNS servers about the
IP address associated with each website on the Internet. Many legitimate websites
have records containing the Owner of the Domain and the Date and Time Created,
which differentiates legitimate sites from phished websites.


d. Website Traffic: In general, legitimate websites are visited frequently and
therefore have high traffic. In contrast, phishing sites usually have little or no web
traffic, which makes website traffic a useful indicator.

4.2 Heuristics used in the system


A heuristic is a practical rule or method used in problem solving. This section
states the heuristics on which the proposed system is built and which it uses to
classify a website or URL.

Heuristic 1: Length of host URL

If the length of the URL is < 54: Legitimate
Else if the length of the URL is >= 54 and <= 75: Suspicious
Otherwise: Phishing
Description: This rule indicates that if the length of the URL is more than 75
characters, the URL is categorized as a phishing URL.
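
The length rule can be illustrated with a small Python function. This is a minimal
sketch and not the project's actual code: the function name classify_url_length and
the lowercase labels are assumptions made for the example, and the length is measured
over the whole URL string, as the rule states.

```python
def classify_url_length(url):
    """Apply the Heuristic 1 thresholds (54 and 75 characters) to a URL string."""
    length = len(url)
    if length < 54:
        return "legitimate"
    if length <= 75:  # 54 to 75 characters inclusive
        return "suspicious"
    return "phishing"  # more than 75 characters


if __name__ == "__main__":
    for candidate in ("http://example.com/", "http://example.com/" + "a" * 60):
        print(len(candidate), classify_url_length(candidate))
```

Each later heuristic can follow the same pattern of one function per rule, which keeps
the thresholds easy to adjust.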

Heuristic 2: Number of dots and slashes

If the number of dots in the domain part of the URL is < 3: Legitimate
Else if the number of dots in the domain part is = 3: Suspicious
Otherwise: Phishing
Description: The number of dots in the domain part indicates the number of
subdomains used in the URL. Therefore, if the domain part contains more than 3 dots,
the URL is classified as phishing.
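
A possible way to count the dots in the domain part is sketched below, using the
standard urllib.parse module. The function name classify_dot_count is an assumption
made for illustration. The sketch expects a URL that includes its scheme (for example
"http://"), because urlsplit only recognises the host when "//" is present.

```python
from urllib.parse import urlsplit


def classify_dot_count(url):
    """Apply the Heuristic 2 thresholds to the dots in the host part of the URL."""
    host = urlsplit(url).hostname or ""  # hostname is None when no host is found
    dots = host.count(".")
    if dots < 3:
        return "legitimate"
    if dots == 3:
        return "suspicious"
    return "phishing"


if __name__ == "__main__":
    # Dots in the path are ignored because only the host is inspected.
    print(classify_dot_count("http://example.com/a.b.c.d"))
    print(classify_dot_count("http://a.b.c.example.com/"))
```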

Heuristic 3: Using @ symbol

If the domain name of the URL includes the @ symbol: Phishing
Otherwise: Legitimate
Description: This rule indicates that a URL is marked as a phishing URL if it
contains the special character @.

Heuristic 4: Using Special Character

If the URL contains any special characters such as dash, underscore, comma or
semicolon: Phishing
Otherwise: Legitimate
Description: This rule indicates that when special characters such as dash, underscore,
comma or semicolon are part of the input URL, it is treated as a phishing URL.
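
The two character-based rules (Heuristics 3 and 4) can be sketched as follows. The
function names and the SPECIAL_CHARACTERS constant are assumptions for illustration,
and both checks are applied to the whole URL string. The helper real_host shows why
the @ symbol matters: a standard URL parser treats everything before the @ as user
information, so the host actually contacted is the part after it.

```python
from urllib.parse import urlsplit

# Characters named in Heuristic 4: dash, underscore, comma, semicolon.
SPECIAL_CHARACTERS = "-_,;"


def classify_at_symbol(url):
    """Heuristic 3: a URL containing @ is marked as phishing."""
    return "phishing" if "@" in url else "legitimate"


def classify_special_characters(url):
    """Heuristic 4: a URL containing any listed special character is phishing."""
    found = any(ch in url for ch in SPECIAL_CHARACTERS)
    return "phishing" if found else "legitimate"


def real_host(url):
    """Return the host a parser would contact; text before @ is user info."""
    return urlsplit(url).hostname


if __name__ == "__main__":
    url = "http://paypal.com@evil.example/login"
    print(classify_at_symbol(url), real_host(url))
```

Note that Heuristic 4, applied literally, also flags the many legitimate URLs whose
paths contain dashes or underscores, so it is best combined with the other rules.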

Heuristic 5: Using HTTPS with Secure Socket Layer

If the URL uses HTTPS with a trusted SSL certificate: Legitimate
Else if the URL uses HTTPS with an untrusted SSL certificate: Suspicious
Otherwise: Phishing
Description: Most legitimate URLs use the HTTPS protocol with a trusted SSL
certificate. It is also deduced that a phishing URL does not use HTTPS, as attackers
usually do not add the extra security layer of SSL to their malicious phishing URLs.

Heuristic 6: Using IP Address

If the URL contains an IP address as part of it: Phishing
Otherwise: Legitimate
Description: If an IP address is part of the URL, it is an indication that someone is
trying to access sensitive information through a phishing attack. So, if a URL
contains an IP address, the system marks it as phishing, otherwise as legitimate.
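
One way to test whether the host part of a URL is an IP address is to let the standard
ipaddress module try to parse it. The sketch below is illustrative only; the name
classify_ip_host is an assumption, and obfuscated forms such as hexadecimal-encoded
addresses are not handled.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_ip_host(url):
    """Heuristic 6: mark the URL as phishing when its host is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for ordinary domain names
    except ValueError:
        return "legitimate"
    return "phishing"


if __name__ == "__main__":
    print(classify_ip_host("http://125.98.3.123/fake.html"))
    print(classify_ip_host("http://example.com/"))
```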

Heuristic 7: Using Request URL

If less than 22% of the request URLs in the web page are loaded from other domains:
Legitimate
Else if the percentage of request URLs from other domains is between 22 and 61:
Suspicious
Otherwise: Phishing
Description: According to security analysts, legitimate pages load less than 22% of
their content, such as images and CSS, from other domains, whereas phishing websites
load a higher percentage of their content from multiple domains.
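
The request URL percentage can be estimated from the page's HTML with the standard
html.parser module, as in the sketch below. The class and function names, and the
choice of which tags count as embedded resources, are assumptions made for this
example; the thresholds are the ones stated in Heuristic 7.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Tags whose "src" attribute loads an embedded resource (an assumed selection).
RESOURCE_TAGS = {"img", "script", "audio", "video", "embed", "iframe"}


class ResourceCollector(HTMLParser):
    """Collect the src attribute of every embedded resource in a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in RESOURCE_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def external_request_percentage(page_url, html):
    """Percentage of embedded resources served from a host other than the page's."""
    collector = ResourceCollector()
    collector.feed(html)
    if not collector.urls:
        return 0.0
    page_host = urlsplit(page_url).hostname
    external = sum(
        1 for u in collector.urls
        if urlsplit(urljoin(page_url, u)).hostname != page_host  # resolve relative URLs
    )
    return 100.0 * external / len(collector.urls)


def classify_request_urls(percentage):
    """Apply the Heuristic 7 thresholds (22% and 61%)."""
    if percentage < 22:
        return "legitimate"
    if percentage <= 61:
        return "suspicious"
    return "phishing"
```

The same collector could be pointed at anchor tags to compute the percentage used by
the anchor rule.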

Heuristic 8: Using Anchor URL

If less than 31% of the anchors in the page are connected to other domains:
Legitimate
Else if the percentage of such anchors is between 31 and 67: Suspicious
Otherwise: Phishing
Description: This rule indicates that the fewer anchors in a page point to
different domains, the less likely the URL is to be malicious.


Heuristic 9: Using Server Form Handler (SFH)

If the SFH of a URL is blank or empty: Phishing
Else if the SFH of a URL belongs to a distinct domain: Suspicious
Otherwise: Legitimate
Description: This rule indicates that a blank or empty SFH points to phishing, and
that an SFH on a different domain is suspicious because the information submitted by
web users is not normally handled by external domains. Based on the SFH, the URL is
categorized accordingly.
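
The SFH rule can be sketched as a function that takes the page URL and the value of
the form's action attribute. The function name classify_sfh is hypothetical, and
treating "about:blank" like an empty action is an assumption of this sketch rather
than something the rule states.

```python
from urllib.parse import urljoin, urlsplit


def classify_sfh(page_url, action):
    """Heuristic 9: classify a form by where its action attribute submits data."""
    action = (action or "").strip()
    # Assumption: "about:blank" is treated the same as a blank or empty handler.
    if action == "" or action.lower() == "about:blank":
        return "phishing"
    page_host = urlsplit(page_url).hostname
    target_host = urlsplit(urljoin(page_url, action)).hostname  # resolve relative paths
    if target_host != page_host:
        return "suspicious"
    return "legitimate"


if __name__ == "__main__":
    page = "http://example.com/login"
    print(classify_sfh(page, "/login/login.jsp"))
    print(classify_sfh(page, "http://collector.test/steal"))
    print(classify_sfh(page, ""))
```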

Heuristic 10: Using Redirect Page

If the number of redirects in the website is <= 1: Legitimate
Else if the number of redirects in the webpage is between 1 and 4: Suspicious
Otherwise: Phishing
Description: This rule indicates that the greater the number of redirects in a web
page, the more likely the URL is to be malicious, because authentic code using HTML
or a server-side script usually redirects to a single page.

Heuristic 11: Using onMouseOver

If an onMouseOver event is triggered that changes the status bar: Phishing
Else if it does not change the status bar: Suspicious
Otherwise: Legitimate
Description: This rule indicates that attackers use a JavaScript function to make
sure that a fake URL is displayed in the status bar.

Heuristic 12: Alexa Page Rank

If the web page rank is less than 100000: Legitimate
Otherwise: Phishing
Description: The web page rank is a key indicator governed by factors such as the
number of pages visited by web users and the number of visitors to a website or URL.
Therefore, the lower the rank number of a web site, the more likely the URL is
legitimate.

Heuristic 13: Website Traffic

If the website traffic of a URL is high: Legitimate
Otherwise: Phishing
Description: Legitimate URLs have high traffic because users visit them frequently.
If a URL is not visited frequently, it is marked as phishing.

Heuristic 14: Shortened URL

Description: Phishers often shorten a URL on the “World Wide Web”, that is, they
make a website URL much shorter while it still leads to the required website. This is
done by using an “HTTP Redirect” on a short domain name, which directs to the
webpage with the long URL. For example, “http://portal.hud.ac.uk/” may be shortened
to “bit.ly/19DXSk4”. So, if a URL uses a shortening service, it can be considered
phishing.
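
A simple way to apply this rule is to compare the host of the URL with a list of
shortening services. In the sketch below, the KNOWN_SHORTENERS set and the function
name classify_shortened are assumptions made for illustration; a real deployment
would need a maintained list.

```python
from urllib.parse import urlsplit

# Illustrative list only (an assumption); real shortener lists are much longer.
KNOWN_SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}


def classify_shortened(url):
    """Heuristic 14: mark URLs hosted on a known shortening service as phishing."""
    if "//" not in url:
        url = "http://" + url  # allow inputs such as "bit.ly/19DXSk4"
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return "phishing" if host in KNOWN_SHORTENERS else "legitimate"


if __name__ == "__main__":
    print(classify_shortened("bit.ly/19DXSk4"))
    print(classify_shortened("http://portal.hud.ac.uk/"))
```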

Heuristic 15: Double slash redirecting

Description: The presence of “//” within the URL path implies that the user will be
directed to another website. Phishers put the address of their malicious website
after the original “//” and so redirect users to their own webpage. Finding the
location of the last “//” is therefore useful for detecting whether we are being
redirected. If the last “//” is at the 6th position (HTTP) or the 7th position
(HTTPS), then we can be assured that we are not being redirected.
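
The position check can be written with str.rfind, which returns the zero-based index
of the last occurrence of “//”. In “http://” that index is 5 (the 6th character) and
in “https://” it is 6 (the 7th character), so any later occurrence signals a possible
redirect. The function name has_double_slash_redirect is an assumption for this
sketch.

```python
def has_double_slash_redirect(url):
    """Heuristic 15: True when "//" occurs after the scheme separator."""
    last_position = url.rfind("//")  # -1 when "//" does not occur at all
    return last_position > 6


if __name__ == "__main__":
    print(has_double_slash_redirect("http://www.legit.example//http://www.phish.example"))
    print(has_double_slash_redirect("https://example.com/a/b"))
```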

Heuristic 16: Prefix suffix

Description: Phishers add prefixes or suffixes separated by a dash (“-”) symbol to
the domain name to mislead users. In reality, the “-” symbol is rarely found in
legitimate URLs. For instance, in the weblink http://www.Confirme-paypal.com/,
phishers have added a dash symbol between Confirme and paypal, and it is difficult
for a regular user to detect the trick.
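
Unlike the special character rule, this check looks only at the domain name, so a
dash in the path does not trigger it. A minimal sketch, with the hypothetical name
classify_prefix_suffix:

```python
from urllib.parse import urlsplit


def classify_prefix_suffix(url):
    """Heuristic 16: a dash in the domain name marks the URL as phishing."""
    host = urlsplit(url).hostname or ""
    return "phishing" if "-" in host else "legitimate"


if __name__ == "__main__":
    print(classify_prefix_suffix("http://www.Confirme-paypal.com/"))
    print(classify_prefix_suffix("http://example.com/my-page"))
```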

Heuristic 17: Iframe tag

Description: Another trick that phishers use is to hide a webpage within a legitimate
webpage by using the “iframe” tag in the HTML. Phishers can make use of the
“frameBorder” attribute, which controls whether the browser renders a visual
border around the frame, so that the embedded page goes unnoticed.

Heuristic 18: Links in <Meta>, <Script>, and <Link> tags

Description: Legitimate websites use <Meta> tags to provide metadata about the HTML
document, <Script> tags to create client-side scripts, and <Link> tags to obtain other
web resources. One can assume that these tags are linked to the same domain as the
webpage. So, if the percentage of links in “<Meta>”, “<Script>”, and “<Link>” tags
that point to other domains is less than 17%, the website is called legitimate. It
falls in the suspicious category if the percentage is greater than 17 but less than
81. If the percentage exceeds 81, it falls in the phishing category.

Heuristic 19: Domain registration length

Description: The domain registration length should be at least one year, because
trustworthy domains are regularly paid for several years in advance.

Heuristic 20: Using non-standard port

Description: This feature checks whether a certain service (such as HTTP) is up or
down on a server. It is advisable to open only those ports which are required. Some
firewalls, proxies and Network Address Translation (NAT) servers block almost every
port by default and open only the selected ones. Opening all the ports puts users'
information in danger, as an attacker may run a service of their choice.

Heuristic 21: Abnormal URL

Description: The WHOIS database is used to extract this feature. For a legitimate
website, the identity is part of its URL. So, if the URL of a website does not
contain the host name, it is considered phishing; otherwise it is acceptable.

Heuristic 21: Statistical based reports feature

Description: This feature looks up the domain name and IP address of the website
among the “Top 10 Domains” and the “Top 10 IPs”. If a match occurs, the website is
considered phishing. The statistics come from PhishTank's statistical reports
published in the years 2010 to 2012.

Heuristic 22: Sub Domain and Multi Sub Domains

Description: This feature is based on the number of dots in the URL. For example,
consider http://www.iitd.ac.in. Here “in” is the country-code Top Level Domain
(ccTLD). The “ac” part is an abbreviation for “academic”, the combined “ac.in” is
called a Second-Level Domain (SLD), and “iitd” is the actual name of the domain.
First, the (www.) part and then the ccTLD (if present) are removed from the URL
before the remaining dots are counted.


Heuristic 23: Favicon

Description: Many phishers use a fake favicon, but it can potentially expose them. A
favicon is an icon associated with a particular webpage. Many current user agents,
such as graphical browsers and newsreaders, display the favicon in the address bar as
a visual reminder of the website's identity. If the favicon is loaded from a domain
other than the one shown in the address bar, then one can be fairly sure that it is a
phishing attempt.

Heuristic 24: Submitting Information to Email

Description: Users often submit their personal information through various web forms.
A user's private information can be diverted to the attacker's email. This may be
done in two ways: by using the mail() function in PHP, or by using “mailto”. So, if
either of these is used, the website may be a phishing site.

Heuristic 25: Using Pop-up Window

Description: If a pop-up window in a website asks for private information, the
website may well be a phishing one, as legitimate websites do not usually ask their
users to submit important information through a pop-up window.


CHAPTER 5
REQUIREMENT ANALYSIS AND SPECIFICATION


5.1 Techniques
Research methodology defines how the development work should be carried out in the
form of research activity. Research methodology can be understood as a tool that is
used to investigate some area, for which data is collected and analyzed, and on the basis of
the analysis conclusions are drawn. There are three types of research i.e. quantitative,
qualitative and mixed approach as defined in.

5.2 Quantitative Approach


This approach investigates the problem by collecting data and running experiments
and simulations that give results; these results are analyzed and decisions are made
on their basis. This approach is used when researchers want to verify the theories
they proposed, or to observe the information in greater detail.

5.3 Qualitative Approach


This approach usually involves knowledge claims. These claims are based on
participatory and/or constructivist perspectives. This approach follows strategies
such as ethnography, phenomenology and grounded theory. When researchers want to
study the context or focus on a single phenomenon or concept, they use the
qualitative approach to achieve their desired goals.

5.4 Mixed Approach


The mixed approach combines the quantitative and qualitative approaches. It is
followed when researchers want to base their knowledge claims on matter-of-fact
grounds. The mixed approach can produce the more complete knowledge needed to
inform theory and practice, as it combines both quantitative and qualitative
approaches.

5.5 Author’s Approach


The author's approach to this thesis is quantitative. It starts by studying the
related literature on website security issues in regular browsing. The literature
review is followed by the classification of features for detecting phishing web
sites based on machine learning techniques.


5.6 Hardware Requirements


• Core i3 processor
• 4 GB RAM
• Hard disk 10 GB

5.7 Software Requirements


• Operating System: Windows
• Front-End: HTML, CSS, JavaScript
• Back-End: MySQL
• Web Server: GlassFish Server
• Tools: NetBeans [8.0.2], SQLyog


CHAPTER 6
PROPOSED WORK


6.1 Proposed System

The proposed method creates a database of website features, which are classified by
determining the input and output parameters through a learning mechanism. This
learning gives good results when identifying phishing websites through the Support
Vector Machine and Naive Bayes classifiers. The learning mechanism was identified as
a high-performing classifier against the major phishing activity on various websites.
In the research conducted for identification, this learning mechanism provides higher
accuracy and test performance in identifying legitimate websites.

This learning mechanism uses a feed-forward neural network with a single hidden layer
for classification, which does not need to be tuned continuously. Its tuning depends
on the classification pattern, and its hidden nodes are randomly assigned values
that never change. These hidden nodes are used to build a linear model that is
learned in a single step. The automatically generated models show good generalization
performance and are reported to learn up to a thousand times faster than expected.

6.2 System Design


This section focuses on the system architecture of the proposed system for detecting
phishing URLs. The main goal of the proposed system is to classify a URL provided as
input by the user as a phished, suspicious or legitimate URL. The system design
involves a User Interface through which the user inputs a URL and through which the
system then displays the results. Once the input URL is submitted, the system
extracts the website features using Python standard built-in functions and collects
all the features that are used in the classification phase to classify the input
URL.


Figure.6.1 - System Design of Phishing Detection System


The following steps are included in the system architecture shown in Figure 6.1.
• User Input: As the first step, the user inputs a URL (either a phishing or a
legitimate URL). Once the URL is fed to the system, the system extracts features
such as the length of the URL and the HTTPS with SSL check. These features give a
brief overview of the category of the URL, thereby improving the response time of
the system.

• Feature Extraction: In this step, all the relevant features of the URL that are
used to differentiate between phishing URLs and legitimate URLs are extracted. The
URL features are classified into four groups: Address-bar based features,
Abnormal features, HTML and JavaScript based features and Domain based
features.

• Predictive Analysis: The features extracted in the previous step are subjected to
the different heuristics. A total of 30 features are used to determine whether a
URL is a phished, suspicious or legitimate one. Based on the features extracted,
the proposed rules are applied in order to categorize the URL.

• Evaluation: The results of the classification are evaluated and the user is
notified whether the given URL is phished or legitimate.
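
The four steps can be sketched end to end in Python. This is only an illustration
under stated assumptions: it uses three of the heuristics (URL length, dots in the
host and the @ symbol), all function names are hypothetical, and the way the
individual verdicts are combined (the most severe verdict wins) is an assumption of
this sketch, not the combination method of the proposed system.

```python
from urllib.parse import urlsplit


def extract_features(url):
    """Feature Extraction: collect a few URL-based features."""
    host = urlsplit(url).hostname or ""
    return {
        "url_length": len(url),
        "dots_in_host": host.count("."),
        "has_at": "@" in url,
    }


def apply_heuristics(features):
    """Predictive Analysis: apply Heuristics 1, 2 and 3 to the features."""
    n = features["url_length"]
    d = features["dots_in_host"]
    return {
        "length": "legitimate" if n < 54 else "suspicious" if n <= 75 else "phishing",
        "dots": "legitimate" if d < 3 else "suspicious" if d == 3 else "phishing",
        "at_symbol": "phishing" if features["has_at"] else "legitimate",
    }


def evaluate(verdicts):
    """Evaluation: assumed rule, the most severe individual verdict wins."""
    values = set(verdicts.values())
    if "phishing" in values:
        return "phishing"
    if "suspicious" in values:
        return "suspicious"
    return "legitimate"


def check_url(url):
    """User Input to final result, chaining the three stages above."""
    return evaluate(apply_heuristics(extract_features(url)))


if __name__ == "__main__":
    print(check_url("https://example.com/"))
    print(check_url("http://paypal.com@evil.example/login"))
```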


6.3 Use Case Diagram

Fig. 6.2 Use Case Diagram

6.4 Sequence Flow Diagram

Fig. 6.3 Sequence Flow Diagram


CHAPTER 7
IMPLEMENTATION AND TESTING


7.1 Implementation steps:


Step 1:
Registration and Login
In this step a user profile is created by registering with a user id and password.
Once the credentials are authenticated, the profile is created.

Step 2:
Start the dataset run
In this step the dataset available for classifying the URL features is run and a
model is created against which every URL is compared. The dataset is built from the
various words and character sequences that occur in URLs; the larger the dataset,
the higher the accuracy that can be achieved.

Step 3:
URL inserted
In this step the URL entered on the website is read, and its extraction in terms of
the heuristic patterns starts.

Step 4:
Feature Extraction
During extraction every heuristic feature of the URL is gathered and compared, using
the KNN classifier, with the model created from the dataset. Feature extraction
breaks the URL into distinguishable words and phrases.

• Model Creation
The dataset is used for training, and a model is created from the available data for
further comparison.
• KNN Classifier
The classifier used in the project is the KNN classifier, which predicts the class
of the input URL from the similarity of its words and letters to those in the
dataset.


Step 5:
Comparison with dataset
Once the entered URL has been compared with the dataset, the result is shown based on
the model comparison. The KNN algorithm is used to find the nearest matches among
authentic and phishing websites.

Step 6:
Identification
Once the system identifies a URL as a phishing website, it is blacklisted so that it
cannot be accessed in future.

7.2 K-nearest neighbors (KNN)


KNN can be used for both classification and regression predictive problems. However,
it is more widely used for classification problems in industry. It is commonly used
for its ease of interpretation and low calculation time.

Fig.7.1 KNN Diagram 1


The three closest points to the Blue Star are all Red Circles. Hence, with a good
confidence level we can say that the Blue Star should belong to the class Red Circle.
Here, the choice is obvious, as all three votes from the closest neighbors went to
Red Circle. The choice of the parameter K is crucial in this algorithm. First, let
us try to understand what exactly K influences in the algorithm. In the last
example, given that all six training observations remain constant, a given value of
K lets us draw the boundaries of each class. These boundaries segregate RC from
GS. In the same way, let us look at the effect of the value of “K” on the class
boundaries. The following are the boundaries separating the two classes for
different values of K. [16]

Fig.7.2.KNN Diagram 2
It can be observed that the boundary becomes smoother with increasing value of K.
With K increasing to infinity, the prediction finally becomes all blue or all red,
depending on the overall majority. The training error rate and the validation error
rate are two parameters we need to assess for different values of K. The following
is the curve for the training error rate with varying values of K:

Fig.7.3 - KNN Curve Graph


The error rate at K=1 is always zero for the training sample. This is because
the closest point to any training data point is itself. Hence the prediction is always
accurate with K=1. If the validation error curve had been similar, our choice of K
would have been 1. The validation error curve with varying values of K makes the
story clearer. At K=1, we were overfitting the boundaries. Hence, the error rate
initially decreases and reaches a minimum. After the minimum point, it increases with
increasing K. To get the optimal value of K, you can split the initial dataset into
training and validation sets. Now plot the validation error curve to get the optimal
value of K. This value of K should be used for all predictions.
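
The procedure described above can be sketched in plain Python (version 3.8 or later,
for math.dist). The function names, the Euclidean distance and the tiny two-class data
set in the usage example are assumptions made for illustration; they are not the
dataset or code used in the project.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict the label of query by majority vote among its k nearest neighbors.

    train is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def error_rate(train, samples, k):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for features, label in samples
                if knn_predict(train, features, k) != label)
    return wrong / len(samples)


def best_k(train, validation, candidates):
    """Return the candidate K with the lowest validation error rate."""
    return min(candidates, key=lambda k: error_rate(train, validation, k))


if __name__ == "__main__":
    train = [((0, 0), "legitimate"), ((0, 1), "legitimate"), ((1, 0), "legitimate"),
             ((5, 5), "phishing"), ((5, 6), "phishing"), ((6, 5), "phishing")]
    validation = [((0.2, 0.2), "legitimate"), ((5.5, 5.5), "phishing")]
    print("training error at K=1:", error_rate(train, train, 1))
    print("chosen K:", best_k(train, validation, [1, 3, 5]))
```

Evaluating error_rate on the training set itself with K=1 reproduces the zero
training error discussed above, which is why K is chosen on a separate validation set.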

CHAPTER 8
RESULT AND DISCUSSION


8.1 Result
8.1.1 Signup/Registration
This page is used to register with the system by entering the details of an
individual, including the user id and password.

Fig. 8.1 Signup/Registration Page


8.1.2 Login Page
Once registration is done, the login page appears, where the user needs to enter the
user name and password that were chosen during the registration process.

Fig. 8.2 Login Page


8.1.3 Home page


After login, the home page opens. It contains the profile details and an input field
for checking whether a website is a phishing website.

Fig.8.3 Home page


8.1.4 Result Page
This page shows the result generated for the input URL, identifying it as an
authentic or a phishing website based on the classification performed.

Fig.8.4 Result Page


8.1.5 Password Reset


If the user forgets the password during login, the forgot password section of the
system helps to reset it; a reset link is sent to the user's mail id.

Fig. 8.5 Password Reset Page

Sr. no.   URL                                                                             Detected       Accuracy
1         http://www.abcd.org                                                             Phishing       80%
2         https://www.iitb.ac.in/                                                         Authenticate   95%
3         http://cidofindia.com/                                                          Phishing       90%
4         http://www.gstcouncil.gov.in/                                                   Phishing       95%
5         https://www.indgovtjobs.in/2014/03/sports-authority-of-india-recruitment.html   Phishing       67%

Table 8.1 Test Result of URL


Fig. 8.6 Graph showing result of URL searched


CHAPTER 9
CONCLUSION


9.1 Conclusion
Phishing websites mainly collect users' information through login pages, and they are
particularly interested in users' bank details. Of the many features considered for
detecting phishing websites, the most important is HTTPS with SSL, that is, whether
a website uses HTTPS, whether the issuer of the certificate is trusted, and whether
the age of the certificate is at least one year. In this regard, the Phishing Website
Dataset was tested to measure the accuracy of phishing detection with four classifier
algorithms (KNN, RBF-SVM, Decision Tree and Random Forest). The accuracy is 95.20%
using K-Nearest Neighbors, 94.70% using the RBF (Radial Basis Function) Support
Vector Machine, 91.94% using Decision Tree and 87.74% using Random Forest. Hence, we
conclude that the K-Nearest Neighbors classifier gives the best accuracy among the
four classifier algorithms.
Testing was done on legitimate websites as well as on malicious websites collected
from PhishTank. It was carried out on combinations of multiple heuristics as well as
on individual heuristics to ensure the efficient functioning of the system. The
majority of the URLs tested were classified correctly by the system. The system is
evaluated using a confusion matrix, which lists the True Positives, True Negatives,
False Positives and False Negatives. From these counts, the precision and recall of
the system are calculated. The precision and recall vary with the heuristics selected
by the user. Reducing the false positives and false negatives improves the precision
and recall and therefore the accuracy of the classification.
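
The evaluation measures mentioned above follow directly from the four confusion
matrix counts. The sketch below uses the standard definitions; the function name and
the counts in the usage example are made up for illustration and are not results of
this project.

```python
def precision_recall_accuracy(tp, tn, fp, fn):
    """Compute precision, recall and accuracy from confusion matrix counts.

    tp: phishing URLs correctly flagged      fp: legitimate URLs wrongly flagged
    tn: legitimate URLs correctly passed     fn: phishing URLs that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


if __name__ == "__main__":
    # Illustrative counts only (assumed, not measured).
    precision, recall, accuracy = precision_recall_accuracy(tp=40, tn=50, fp=10, fn=0)
    print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```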


CHAPTER 10
LIMITATION AND FUTURE SCOPE


10.1 Limitations
Although it is important to protect our data at any cost, the detection system and
the technology used in the process have some disadvantages and limitations. The
following techniques introduce limitations during the detection process.

10.1.1 Mimicking User Response to Detect Phishing Attacks


• As this technique provides a fake user identity, phishers may detect the fake user
by consistently analyzing the fake responses.
• Phishers can use CAPTCHA (completely automated public Turing tests to tell
computers and humans apart) to ensure that a response comes from a legitimate user.

10.1.2 Page Classifier


• The network can be in an unprotected state as the system builds its profile.
• If malicious activity looks like normal traffic to the system, it will never send
an alarm.
• False positives can become cumbersome with an anomaly-based setup. Normal
usage such as checking e-mail after a meeting has the potential to signal an alarm.

10.2 Future Scope


Feature selection techniques need further improvement to cope with the continuous
development of new techniques by phishers over time. As part of future work, the
system can be enhanced to include the following functionalities. Additional
heuristics such as Number of Links Pointing to Page, Google Index and TTL value of
the domain can be implemented in addition to the existing heuristics. Discriminative
classifier algorithms such as Generalized Linear Model, Gradient Boosting (GBM) and
Boosting can be used to predict the URL category by training on the large amount of
data extracted from the datasets.
Phishing attacks are a critical problem, and it is important to have a mechanism to
detect them. As very important and personal information of the user can be leaked
through phishing websites, it becomes even more critical to address this issue. This
problem can be addressed by using a machine learning algorithm with a classifier.
Classifiers that give a good prediction rate for phishing already exist, but our
survey suggests that it would be better to use a hybrid approach for prediction to
further improve the accuracy of detecting phishing websites. Malicious URLs are
created every day, and attackers use techniques to fool users and modify URLs in
order to attack. Nowadays deep learning and machine learning methods are used to
detect phishing attacks. Classification methods such as RF, SVM, C4.5, DT, PCA and
k-NN are also common. These methods are useful and effective for detecting phishing
attacks. Future research can work towards a more scalable and robust method,
including smart plug-in solutions that tag or label whether a website is legitimate
or leads to a phishing attack.


REFERENCES
[1] Ishant Tyagi, Jatin Shad, Shubham Sharma “A Novel Machine Learning
Approach to Detect Phishing Websites”, in 5th International Conference on
Signal Processing and Integrated Networks (SPIN),2018
[2] Detection of Phishing Web Sites Based On Extreme Machine Learning Miss
Sneha Mande1 , Prof.D.S.Thosar2 , Vol-4 Issue-6 2018 IJARIIE-ISSN(O)-2395-
4396 9322
[3] Ebubekir Buber, ÖnderDemir and OzgurKoraySahingoz, "Feature Selections for
the Machine Learning based Detection of Phishing Websites", 2017 International
Artificial Intelligence and Data Processing Symposium (IDAP), 2017.
[4] M. Volkamer, K. Renaud, B. Reinheimer and A. Kunz, "User experiences of
TORPEDO: TOoltip-poweRed Phishing Email DetectiOn", Computers &
Security (2017).
[5] P. Singh, Y.P.S Maravi and S. Sharma, "Phishing websites detection through
supervised learning networks", IEEE International Conference on Computing and
Communications Technologies (ICCCT), pp. 61-65, 2015.
[6] A. A. Ahmed and N. A. Abdullah, "Real time detection of phishing websites", 7th
IEEE Annual Information Technology Electronics and Mobile Communication
Conference IEEE IEMCON, 2016.
[7] Z. Dan Dong, A. Kapadia, J. Blythe and L. J. Camp, "Beyond the Lock Icon:
Real-Time Detection of Phishing Websites Using Public Key Certificates", IEEE
APWG Symposium on Electronic Crime Research, pp. 1-12, May 2015.
[8] Mustafa Aydin and Nazife Baykal, "Feature Extraction and Classification
Phishing Websites Based on URL", IEEE International Conference on
Communications and Network Security (CNS), pp. 769-770, 2015
[9] S. Marchal, J. Francois, R. State and T. Engel, "PhishScore: hacking phishers’
minds", proceedings of the 10th International Conference on Network and
Service Management 2014 (CNSM 2014), vol. 11, no. 4, pp. 458-471, 2014.
[10] Luong Anh, Tuan Nguyen, Ba Lam To, HuuKhuong Nguyen and Minh Hoang
Nguyen, "A novel approach for phishing detection using URL-based
heuristic", IEEE International Conference on Computing Management and
Telecommunications (ComManTel), pp. 298-303, 2014.
[11] Zheng Dong, Apu Kapadia, Jim Blythe and L. Jean Camp “Beyond the Lock
Icon: Real- time Detection of Phishing Websites Using Public

Department of Computer Science and Engineering, PLITMS, Buldana Page 48


CLASSIFICATION OF FEATURES FOR DETECTING PHISHING WEB SITES BASED ON MACHINE LEARNING TECHNIQUES

KeyCertificates”,2015
[12] Samuel Marchal, Jérôme Francois, Radu State, Thomas Engel “Phish Score:
Hacking Phishers’ Minds in 10th CNSM and Workshop at 2014IFIP
[13] A. Belabed, E. Aïmeur, A. Chikh “A personalized whitelist approach for phishing
webpagedetection”,in2012SeventhInternationalConferenceonAvailability,Reliabil
ityand Security
[14] NuttapongSanglerdsinlapachai,ArnonRungsawang“UsingDomainTop-
pageSimilarity Feature in Machine Learning-based Web Phishing Detection”, in
2010 ThirdInternational Conference on Knowledge Discovery and DataMining.
[15] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: a content-basedapproach
todetecting phishing web sites,” in The 16th internationalconference on World
Wide Web, 2007, pp. 639–648.
[16] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, “Cantina+: a feature-richmachine
learning framework for detecting phishing web sites,” ACMTransactions on
Information and System Security, vol. 14, no. 2, pp.1–28, Sept. 2011.
[17] M. E. Maurer and D. Herzner, “Using visual website similarity forphishing
detection and reporting,” in CHI ’12 Extended Abstracts onHuman Factors in
Computing Systems, 2012, pp. 1625–1630.
[18] A. Sunil and A. Sardana, “A pagerank based detection technique forphishingweb
sites,” in IEEE Symposium on Computers & Informatics,2012, pp. 58–63.
[19] M. G. Alkhozae and O. A. Batarfi, “Phishing websites detected basedon phishing
characteristic in the webpage source code,” in InternationalJournal of Information
and Communication Technology Research, vol. 1,no. 6, Oct. 2011, pp. 283–291.
[20] L. A. T. Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen,“A novel approach
for phishing detection using url-based heuristic,”in IEEE International
Conference on Computing, Management and Telecommunications
(ComManTel), 2014, pp. 298–303.
[21] G. P. Zhang, “Neural networks for classification: a survey,” in Systems,Man, and
Cybernetics, Part C: Applications and Reviews, IEEE Transactionson, Vol 30,
2000, S. H. Liao and C. H. Wen, “Artificial neural networks classification
andclustering of methodologies and applications - literature analysis from1995 to
2005,” in Expert Systems with Applications, vol. 32, 2007, pp.1–11.
[22] Anti-Phishing Working Group, “Phishing Activity Trends Report,” 2014.
[23] S. Sheng, M. Holbrook, P. Kumaraguru, L. F. Cranor, and J. Downs, “Who

Department of Computer Science and Engineering, PLITMS, Buldana Page 49


CLASSIFICATION OF FEATURES FOR DETECTING PHISHING WEB SITES BASED ON MACHINE LEARNING TECHNIQUES

fallsfor phish?,” in Proceedings of the 28thinternational conference on Human


factors in computing systems - CHI ’10, 2010, p. 373.
[24] M. Rader and S. Rahman, “EXPLORING HISTORICAL AND EMERGING
PHISHING TECHNIQUES AND MITIGATINGTHE ASSOCIATED
SECURITY RISKS,” Int. J. Netw. Secur. …, 2013.
[25] E. Earley, “Understanding social engineering,” Help Net Security, 2010.
[Online]. Available: http://www.netsecurity.
[26] J. Hong, “The state of phishing attacks,” Commun. ACM, vol. 55, no. 1, p. 74,
Jan. 2012.
[27] S. A. Robila and J. W. Ragucci, “Don’t be a phish,” ACM SIGCSE Bull., vol. 38,
no. 3, p. 237, Jun. 2006.
[28] D. D. Caputo, S. L. Pfleeger, J. D. Freeman, and M. E. Johnson, “Going Spear
Phishing: Exploring Embedded Training andAwareness,” IEEE Secur. Priv., vol.
12, no. 1, pp. 28–38, Jan. 2014.
[29] S. Sheng, B. Magnien, P. Kumaraguru, A. Acquisti, L. F. Cranor, J.
Hong, and E. Nunge, “Anti-Phishing Phil,” in Proceedings ofthe 3rd symposium
on Usable privacy and security - SOUPS ’07, 2007, p. 88.
[30] M. Blythe, H. Petrie, and J. A. Clark, “F for fake,” in Proceedings of the 2011
annual conference on Human factors in computingsystems - CHI ’11, 2011, p.
3469.
[31] R. Dhamija, J. D. Tygar, and M. Hearst, “Why phishing works,” in Proceedings
of the SIGCHI conference on Human Factors incomputing systems - CHI ’06,
2006, p. 581.
[32] M. L. Hale, R. F. Gamble, and P. Gamble, “CyberPhishing: A Game-Based
Platform for Phishing Awareness Testing,” in 201548th Hawaii International
Conference on System Sciences, 2015, pp. 5260–5269.
[33] M. Khonji, A. Jones, and Y. Iraqi, “A study of feature subset evaluators and
feature subset searching methods for phishingclassification,” in Proceedings of
the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam
Conference on -CEAS ’11, 2011, pp. 135–144.
[34] R. Verma, N. Shashidhar, and N. Hossain, “Two-Pronged Phish Snagging,” in
2012 Seventh International Conference onAvailability, Reliability and Security,
2012, pp. 174–179.
[35] M. Hale and R. Gamble, “Toward Increasing Awareness of Suspicious Content

Department of Computer Science and Engineering, PLITMS, Buldana Page 50


CLASSIFICATION OF FEATURES FOR DETECTING PHISHING WEB SITES BASED ON MACHINE LEARNING TECHNIQUES

through Game Play,” in 2014 IEEE WorldCongress on Services, 2014, pp. 113–
120.
[36] P. Singh, Y. P. S. Maravi, and S. Sharma, “Phishing websites detection through
supervised learning networks,” in 2015International Conference on Computing
and Communications Technologies (ICCCT), 2015, pp. 61–65.
[37] Y.-S. Chen, Y.-H. Yu, H.-S. Liu, and P.-C. Wang, “Detect phishing by checking
content consistency,” in Proceedings of the 2014IEEE 15th International
Conference on Information Reuse and Integration (IEEE IRI 2014), 2014, pp.
109–119.
[38] L. Wu, X. Du, and J. Wu, “MobiFish: A lightweight anti-phishing scheme for
mobile phones,” in 2014 23rd InternationalConference on Computer
Communication and Networks (ICCCN), 2014, pp. 1–8.
[39] S.-S. Tseng, C.-H. Ku, A.-C. Lu, Y.-J. Wang, and G.-G. Geng, “Building a Self-
Organizing Phishing Model Based uponDynamic EMCUD,” in 2013 Ninth
International Conference on Intelligent Information Hiding and Multimedia
SignalProcessing, 2013, pp. 509–512.

Department of Computer Science and Engineering, PLITMS, Buldana Page 51
