
Telecommunication Systems (2018) 68:687–700

https://doi.org/10.1007/s11235-017-0414-0

Towards detection of phishing websites on client-side using machine learning based approach
Ankit Kumar Jain1 · B. B. Gupta1

Published online: 26 December 2017


© Springer Science+Business Media, LLC, part of Springer Nature 2017

Abstract
Existing anti-phishing approaches use either blacklist methods or feature-based machine learning techniques. Blacklist methods fail to detect new phishing attacks and produce a high false positive rate. Moreover, existing machine learning based methods extract features from third parties, search engines, etc. Therefore, they are complicated, slow, and not fit for a real-time environment. To solve this problem, this paper presents a novel machine learning based anti-phishing approach that extracts features from the client side only. We have examined the various attributes of phishing and legitimate websites in depth and identified nineteen outstanding features to distinguish phishing websites from legitimate ones. These nineteen features are extracted from the URL and source code of the website and do not depend on any third party, which makes the proposed approach fast, reliable, and intelligent. Compared to other methods, the proposed approach has relatively high accuracy in the detection of phishing websites, as it achieved a true positive rate of 99.39% and an overall detection accuracy of 99.09%.

Keywords Phishing attack · Social engineering · Website · Machine learning · Hyperlink

B. B. Gupta
[email protected]

1 National Institute of Technology Kurukshetra, Kurukshetra, India

1 Introduction

Phishing is an online identity theft that deceives Internet users into revealing their secret information and credentials, e.g., login id, password, credit card number, etc. Phishing is one of the major computer security threats faced by the cyber-world and could lead to financial losses for both industries and individuals [1]. Among the various cybersecurity attacks, phishing has received special attention because of its adverse effect on the economy [2–4]. According to the APWG report, 1,220,523 phishing attacks were found worldwide in 2016, a growth of 65% over 2015 [5]. The number of attacks per month also increased by 5753% over the 12 years from 2004 to 2016 (1609 phishing attacks per month in 2004 and an average of 92,564 attacks per month in 2016). Quarter 2 of 2016 saw an all-time high of 466,065 phishing attacks [5]. The motive of a phishing attack is not only to gain credentials; phishing has now become the number 1 delivery method for other types of malicious software such as ransomware [6]. In August 2016, the financial loss due to phishing scams was more than $17.36 million in the US alone, followed by Japan and the UK with losses of $8.38 million and $7.21 million, respectively [6].

Today, cyber experts and phishers are locked in a rat race. Cyber experts continue to improve anti-phishing solutions with the help of researchers and developers. Developers have built various anti-phishing tools that alert users to malicious emails and websites (e.g., Calling-ID Toolbar, Netcraft, Cloudmark Anti-Fraud Toolbar, etc.). A recent study found that only 3 out of 14 tools identified a phishing website hosted locally, which raises a critical concern about trust in the conventional tools [7]. Moreover, these tools are exposed in the public domain. Raising human awareness is not a sufficient mitigation method, and deploying complementary technical solutions is a crucial requirement [8,9]. In the past few years, researchers and developers have built various phishing detection solutions. However, the phishing problem persists, and the development of an efficient anti-phishing approach remains a challenging task. Moreover, most anti-phishing solutions produce a high false positive rate and are not capable of dealing with zero-hour attacks. Blacklist based detection approaches have a quick access time; however, they cannot identify zero-hour attacks. Moreover, other solutions such as heuristic and visual comparison produce many false predictions.
Therefore, it is essential to design an approach that can efficiently classify phishing webpages. Recent developments in phishing detection have employed numerous machine learning based approaches. These approaches train a classification algorithm with features that can distinguish a phishing webpage from a legitimate one [10]. The efficiency of a detection approach depends on the training data, the selection of a good feature set, and the classification algorithm used to train on these features [11].

The existing machine learning based approaches extract features from various sources such as the URL, page source, search engines, and third party services such as website traffic, DNS, whois records, etc. The extraction of third party features is a complicated and time-consuming process [12]. Integration of features from different sources is also a difficult process. Therefore, these approaches are complicated and slow, and do not produce results in real time. To cope with this problem, this paper presents an efficient solution that extracts features from the client side only. Identifying outstanding features is one of the preconditions for the design of a good phishing detection approach. Therefore, we have examined the various attributes of phishing and legitimate websites in depth and identified various efficient client side features to detect phishing websites. These nineteen features are obtained from the URL and source code of the website, which makes our approach fast and reliable. We have evaluated the proposed features on various machine learning algorithms using a dataset of 4059 phishing and legitimate websites. Evaluation results show that the proposed approach accurately filters phishing sites, as it has a true positive rate of 99.39% and a very low false positive rate of 1.25%. The main advantages of the proposed approach compared to existing phishing detection solutions are: (1) it is fast, reliable, and provides real-time phishing detection, (2) it can detect phishing webpages hosted on a compromised domain, (3) it can detect webpages written in any textual language, (4) it does not require any dedicated resources for phishing detection, (5) it is platform independent, and (6) it is available as a client side desktop application.

The remainder of this paper is structured as follows. Section 2 describes the related work. Section 3 presents the overview of our proposed approach. Section 4 describes the proposed feature set. Section 5 shows the training dataset and performance metrics. We present the implementation and evaluation details in Sect. 6. Section 7 discusses the advantages of our proposed approach. Finally, Sect. 8 concludes the paper and presents future work.

2 Related work

This section presents an overview of the phishing detection approaches proposed in the literature. Phishing detection approaches split into two classes: user education based and software based. Software-based approaches are further classified into blacklist, visual similarity, search engine, and machine learning based solutions.

2.1 User education

The user education approach aims to improve the capacity of Internet users to detect phishing attacks [13]. Internet users can be educated to distinguish the characteristics of phishing and legitimate emails and websites. Sheng et al. [14] developed an interactive educational game, “Anti-Phishing Phil”, that teaches users how to identify phishing websites. After spending 15 min on the game, users were better able to identify phishing websites than users who did not play the game. The main motive behind this game design is to give computer users conceptual knowledge about phishing attacks. This conceptual knowledge may help users avoid phishing attacks.

2.2 Phishing blacklist

A blacklist contains a list of malicious domains, URLs, and IP addresses [15]. Sheng et al. [16] showed that a fake domain is added to a blacklist only after a substantial amount of time, and approximately 50–80% of fake domains are added after the attack has been performed. A blacklist needs to be updated regularly from its source because thousands of fake websites are launched every day.

2.3 Visual similarity based techniques

These techniques [17] utilize various features to compute the similarity between websites, such as page source code, images, textual content, text formatting, HTML tags, CSS, website logo, etc. Most visual similarity based approaches compare the new website with previously visited or stored websites. Therefore, these techniques cannot detect new phishing websites and produce a high false negative rate. Some of the techniques take a snapshot of each website for comparison, which requires high computation time and therefore does not fit a time-constrained environment.

2.4 Machine learning based techniques

These methods [10,18–21] train a classification algorithm with features that can distinguish a genuine website from a phishing one. A website is declared as phishing if the design of the website matches the predefined feature set. The performance of these solutions depends on the feature set, training data, and classification algorithm. These features are extracted from various sources like URL, page
source, website traffic, search engine, DNS, etc. Some of these features are difficult to access, slow, third party dependent, and time consuming. Therefore, some of the machine learning solutions require heavy computation to obtain and compute the features from various sources.

2.5 Search engine based techniques

Search engine (SE) based techniques extract identity features (e.g., title, copyright, logo, domain name, etc.) from the webpage and make use of a search engine to check the legitimacy of the webpage [22–24]. The FPR of these methods is high because newly constructed genuine sites do not appear in the top search results. Previous search based techniques assume that a legitimate site appears in the top results of the search engine. However, only popular sites appear in the top search results. Moreover, these techniques do not provide the desired results when webpages are in a language other than English, because search engines like Google do not give precise results for non-English search queries.

3 Proposed phishing detection approach

3.1 Design objectives

We designed our anti-phishing approach to satisfy the following requirements:

• High detection accuracy (D1) Misclassification of a genuine website as phishing (false positive) must be minimal, and correct classification of phishing websites (true positive) must be high, to provide high detection accuracy.
• Language independent detection (D2) The detection approach must not rely on language specific content of the webpage.
• Real time protection (D3) The phishing detection method must provide its prediction before the user reveals credentials on the phishing website.
• Target independent detection (D4) Phishing detection should not depend on a particular brand or sector; a phishing detection approach should be capable of detecting any kind of phishing website, including newly created websites (zero-hour attack).
• Nondisclosure of privacy (D5) The detection approach must not share the user’s data (e.g., browser history) with any third party.

3.2 Design concepts

We adopted the following design concepts to fulfil the above-mentioned requirements:

• Client side implementation The proposed approach is implemented at the client side on the user’s system. Therefore, it provides better user privacy (D5).
• Feature set selection The proposed features are extracted from the URL and source code of the webpage (no third party features). Therefore, extracting and computing the features is easy and fast, and the approach provides real-time phishing prediction (approximately 2–6 s) (D3). Moreover, most of the features are not affected by the textual language of the webpage (D2) and can detect any kind of phishing website (D4). Moreover, we proposed some new features that increase the detection accuracy of our method.
• Sensitivity analysis of features We conducted a sensitivity analysis on the feature set to ensure higher detection accuracy (D1). Sensitivity analysis identifies the most powerful features in the detection of phishing websites.

3.3 System architecture

Figure 1 presents the system architecture of the proposed approach. Our approach extracts and analyses various features of suspicious websites for successful detection of wide-ranging phishing attacks. The selection of an outstanding feature set is the major contribution of this paper. We proposed six new features to improve the detection accuracy for phishing webpages. Our proposed features identify the relation between the page content and the URL of the webpage. We used pattern matching algorithms to match the domain name of page resource elements with the domain name of the queried webpage. Our features are based on the URL and content of the webpage. The content is obtained from the page source and document object model (DOM) of the webpage. A web crawler is used to gather the website features automatically. In particular, features 1, 2, 3, 5, 6, 17, 18, 19 are taken from other approaches [10,18–21]; features 7, 8, 11, 14, 15, 16 are novel and proposed by us. Moreover, features 4, 9, 10, 12, 13 were proposed by other approaches, but we modified them for better results. The features of our approach are classified into five categories as shown in Table 1. Section 4 of this paper gives a detailed explanation of the proposed features. After extraction of the features, we apply heuristics to generate the feature vector, creating a unique feature vector for every website to generate the labelled dataset. The feature vector is the numerical representation of the features for the statistical procedures in the machine learning algorithm. Here, F = {F1, F2, F3, …, F19} is defined as the feature vector, with one element corresponding to each feature. Each feature produces a value of 1 or 0, where 1 indicates phishing and 0 indicates legitimate. In the training stage, a random forest (RF) classifier is trained using the feature vector taken from every
entry in the training dataset. In the testing phase, the classifier determines whether a given website is a phishing site or not. A binary classifier classifies websites into two possible categories, namely phishing and legitimate. When a user requests a new website, the trained classifier identifies the legitimacy of the given website from the generated feature vector.

Fig. 1 System architecture of proposed approach

Table 1 Features used in the proposed approach

S. No | Category | Feature names | Total features
1 | URL forgery | Number of dots, presence of special symbol, URL length, suspicious words in URL, position of top-level domain, http count, brand name in URL, Data URI | 8
2 | Fake login form | Fake login form identification | 1
3 | Hyperlink information | Number of hyperlinks, no hyperlink feature, foreign hyperlinks, empty hyperlinks, erroneous hyperlinks, hyperlinks redirection | 6
4 | Copied CSS | Suspicious CSS identification | 1
5 | Fake web identity | Copyright, identity keywords, favicon | 3

4 Features extraction

Given the limitations of the search engine and third party dependent approaches presented in the literature, we utilize client-side specific features in our approach. We have selected eight URL-based features (F1–F8), one login form feature (F9), six hyperlink specific features (F10–F15), one CSS feature (F16), and three web identity features
(F17–F19). We discuss all these features in the following subsections.

4.1 URL based features

A webpage is addressed by a Uniform Resource Locator (URL). Figure 2 presents the structure of a URL. The structure divides the URL into five parts: protocol, subdomain, base domain, top-level domain, and path segment. An attacker has control over the full URL and can set any value for the subdomain, base domain, and path segment. This subsection presents how a cybercriminal traps users using URL obfuscation techniques.

Fig. 2 URL structure

F1: Number of dots in URL. This feature checks the number of dots (.) in the URL. In general, a legitimate website does not contain more than three dots in the URL, whereas a phishing URL may contain more than three dots. A phishing URL contains many subdomains to confuse users. These subdomains are separated by the dot symbol. Consider the example https://support.appleid.itune.com-txutwo3wfh.store; it is a phishing site, but some users may believe that they are visiting the official Apple website. Therefore, if a URL has more than three dot symbols, it is considered as phishing.

$$F_1 = \begin{cases} 1, & \text{if dots in URL} \geq 4 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

F2: Presence of special symbol in URL. This feature determines whether the URL contains the special symbols at sign (“@”) and dash (-). The presence of the “@” symbol in a URL causes everything written before it to be ignored, and the part after “@” is treated as the real domain for retrieving the website. The dash (-) symbol is used in fake URLs to make them look like the original URL. Attackers add prefix or suffix keywords to a brand name with a dash symbol, so users believe that they are accessing the right website, e.g., www.paypal-india.com. It is a fake site, but a user may think that it is the official Indian site of PayPal.

$$F_2 = \begin{cases} 1, & \text{if URL contains @ or - symbol} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

F3: Length of URL. Phishers generally use a long URL in the address bar to hide the brand or organization name. Usually, legitimate website URLs are short, meaningful, and easy to remember. On the other hand, phishing URLs are normally longer and may not contain any meaningful domain. Moreover, phishers also hide the redirection information in a long URL. In our experiment, we found that if the URL length is greater than 74, then the website is more likely to be phishing.

$$F_3 = \begin{cases} 1, & \text{if URL length} \geq 74 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

F4: Suspicious words in URL. This feature examines the presence of suspicious words in the URL. Phishers add suspicious keywords to the URL to gain the user's trust. We identified nine keywords frequently present in phishing URLs, namely security, login, signin, bank, account, update, include, webs, and online. If any of these keywords is found in the URL, this feature marks the URL as phishing.

$$F_4 = \begin{cases} 1, & \text{if URL contains any suspicious word} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

F5: Position of the top-level domain. This feature examines two things regarding the top-level domain (TLD) in the URL. First, it checks the position of the TLD, i.e., whether a TLD appears in the base domain part. Second, it verifies whether more than one TLD occurs in the URL. In a legitimate site, the top-level domain appears once in the URL and is not present in the base domain part.

Example 1: http://support.paypal.com.prodigitalmedia.org/signin/?country.x=US&loc. In the given phishing URL, the top-level domain .com appears in the base domain part.

Example 2: http://romeiroseromarias.com.br/verify/0/asb.co.nz/. In the given phishing URL, two top-level domains appear (.com.br and .co.nz).

$$F_5 = \begin{cases} 1, & \text{if top-level domain name present in base domain} \\ 1, & \text{if occurrence of top-level domain in URL} > 1 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

F6: http count in URL. This feature counts the appearances of the ‘http’ protocol in the URL. In a phishing URL, ‘http’ may appear more than once; in a legitimate site, however, ‘http’ appears only once.

$$F_6 = \begin{cases} 1, & \text{if http count in URL} > 1 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

F7: Brand name in URL. Most phishing websites have the brand name of the targeted website somewhere in the URL. According to the current APWG report, 45.97% of phishing websites contain the brand name of the targeted site in the URL [25]. For this feature, if a brand name is present and its position is not at the right place, the site is marked as phishing. We have selected the top 500 phishing targets, including banks, payment gateways, etc. The top names found in phishing URLs are PayPal, Amazon, Apple, Yahoo, Dropbox, Google, AOL, USAA, etc.

Example: http://forlittledrops.org/asd/Paypalaccount/. In the given phishing URL, “PayPal” is found in the path segment.

$$F_7 = \begin{cases} 1, & \text{if any top brand name present at incorrect position in URL} \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

Algorithm 1: Fake login form detector algorithm
Input: DOM tree of suspicious URL
Output: Existence of fake login form, F9 ∈ {0, 1}
Start
1. If the value of the action field is blank, # or javascript:void(0), then set F9 = 1
2. Otherwise, if the value of the action field is in the form of “filename.php”, then set F9 = 1
3. Otherwise, if the action field contains a foreign base domain, then set F9 = 1, else set F9 = 0
End

Fig. 3 Algorithm for detection of fake login form
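The binary URL checks F1, F2, F3, F4 and F6 of Sect. 4.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `url_features` and the plain substring tests are hypothetical choices, not the authors' implementation. F5 and F7 are left out because they need a TLD list and the list of targeted brands, which are not reproduced in the text.

```python
# Nine keywords listed for feature F4 in Sect. 4.1.
SUSPICIOUS_WORDS = ("security", "login", "signin", "bank", "account",
                    "update", "include", "webs", "online")


def url_features(url):
    """Return the binary URL features F1, F2, F3, F4 and F6 as a dict.

    Hypothetical helper: each value follows the corresponding equation,
    1 meaning "phishing-like" and 0 meaning "legitimate-like".
    """
    lowered = url.lower()
    return {
        "F1": int(url.count(".") >= 4),        # Eq. (1): four or more dots
        "F2": int("@" in url or "-" in url),   # Eq. (2): "@" or dash present
        "F3": int(len(url) >= 74),             # Eq. (3): long URL
        "F4": int(any(w in lowered for w in SUSPICIOUS_WORDS)),  # Eq. (4)
        "F6": int(lowered.count("http") > 1),  # Eq. (6): "http" repeated
    }


if __name__ == "__main__":
    print(url_features("https://support.appleid.itune.com-txutwo3wfh.store"))
```

For the example URL given under F1, this sketch sets F1 and F2 to 1 and leaves F3, F4 and F6 at 0, since the URL has four dots and a dash but is short and contains none of the nine keywords.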

F8: Data URI. These days, data URI (uniform resource identifier) based attacks appear to be one of the most common phishing attacks [26]. The data URI scheme provides a facility to add data in-line in web pages as if they were external resources. This scheme allows fetching different elements such as HTML, images, and JavaScript in a single HTTP request rather than multiple HTTP requests. The syntax of a data URI is given below:

data:[<media type>][;base64],<data>

With the data URI method, it is possible to show media content in a web browser without hosting the actual data on the internet. Traditional anti-phishing techniques fail to detect it because the phishing web pages are not hosted anywhere on the internet. In this attack, users do not need to communicate with a server to get phished.

$$F_8 = \begin{cases} 1, & \text{if Data URI present} \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

4.2 Login form based feature

F9: Fake login form. Fake websites always include a login form because it is the only way to obtain the user’s personal data. In a legitimate website, the action field of the login form usually contains a link that has the same base domain as appears in the address bar of the browser. However, as per our observation, the form action field of phishing websites includes a URL having a different base domain, null links (the URL may be in the footer section), or a simple PHP file. The action attribute of a phishing website includes a PHP file, which is named mail.php, login.php, index.php, etc. The PHP file contains a script that saves the entered data (e.g., user id or password) in a text file on the hacker’s computer. The algorithm to detect a fake login form is presented in Fig. 3.

4.3 Hyperlink specific features

F10: Number of webpages. A legitimate website usually contains many web pages, while a phishing website has very limited pages. Moreover, sometimes a phishing site consists of only one or two web pages (usually attackers create only the login page). This feature calculates the number of pages in a website by visiting the hyperlinks in the source code. In our approach, we extract the hyperlinks from the “src” attribute of the img, script, frame, input, and link tags and the “href” attribute of the anchor tag.

F10 = total hyperlinks present in website (9)

F11: No hyperlink feature. This feature checks whether any hyperlinks are present on the website. Sometimes attackers use hyperlink hiding techniques [27] to bypass anti-phishing solutions. Moreover, attackers also use server side scripts to cover up the page source content. From our study, we observe that a legitimate website contains at least one hyperlink. Therefore, if a website does not include any hyperlinks, it indicates a phishing attack.

$$F_{11} = \begin{cases} 1, & \text{if number of hyperlinks is zero} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

F12: Foreign hyperlinks. A foreign hyperlink contains a domain name different from the website domain name. Cybercriminals usually copy the HTML code from their targeted official website to construct the phishing website, and it may have numerous foreign hyperlinks that point to the targeted site [28]. In a legitimate website, most of the hyperlinks point to the browsed domain name. On the other hand, phishing sites have many hyperlinks that point to a foreign domain. In this feature, we calculate the ratio of foreign hyperlinks to the total hyperlinks present in the website. The feature results in 1 if the ratio is greater than 0.5; otherwise, the result is 0.

$$F_{12} = \begin{cases} 1, & \text{if } \dfrac{\text{Foreign Hyperlinks}}{\text{Total Hyperlinks}} > 0.5 \text{ and Total Hyperlinks in Webpage} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{11}$$
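The page-source features described so far (the login form check F9 of Algorithm 1, the no-hyperlink feature F11 and the foreign-hyperlink ratio F12) all rest on the same two steps: collect form actions and link targets from the HTML, then compare their base domains with that of the page. A minimal Python sketch using only the standard library is given below. The names `PageLinks`, `base_domain` and `page_features` are hypothetical, the "last two host labels" rule for the base domain is a simplifying assumption (it is wrong for suffixes such as .co.uk), and a form without an action attribute is treated as a blank action.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageLinks(HTMLParser):
    """Collect form actions and hyperlink targets from raw HTML."""

    def __init__(self):
        super().__init__()
        self.form_actions = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Assumption: a missing action attribute counts as blank.
            self.form_actions.append((attrs.get("action") or "").strip())
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("img", "script", "frame", "input", "link"):
            target = attrs.get("src") or attrs.get("href")
            if target:
                self.links.append(target)


def base_domain(url):
    """Naive base domain: the last two labels of the host (assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def page_features(html, page_url):
    """Return F9 (Algorithm 1), F11 (Eq. 10) and F12 (Eq. 11) for one page."""
    parser = PageLinks()
    parser.feed(html)
    own = base_domain(page_url)

    def is_foreign(target):
        absolute = urljoin(page_url, target)   # resolve relative links
        scheme = urlparse(absolute).scheme
        return scheme in ("http", "https") and base_domain(absolute) != own

    f9 = 0
    for action in parser.form_actions:
        if (action in ("", "#")
                or action.lower().startswith("javascript:void(0)")
                or re.fullmatch(r"[\w-]+\.php", action)
                or is_foreign(action)):
            f9 = 1
            break

    total = len(parser.links)
    foreign = sum(is_foreign(link) for link in parser.links)
    return {
        "F9": f9,
        "F11": int(total == 0),
        "F12": int(total > 0 and foreign / total > 0.5),
    }
```

Parsing the page once and reusing the collected targets keeps the three features consistent with each other; a production version would likely replace `base_domain` with a lookup against a public suffix list.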

F13: Empty hyperlinks. An empty or null hyperlink returns to the same page when a user clicks on it. It increases the chance of the user falling for the phishing scam, since if a hyperlink is active, the user may end up reaching the original website when it is clicked. Thus, the attacker prevents any chance of the user being redirected to the original website by removing hyperlinks. Moreover, phishers also exploit vulnerabilities of the web browser with the help of empty links. The HTML codes <a href="#">, <a href="#content"> and <a href="JavaScript:void(0)"> are used to create empty hyperlinks. This feature calculates the ratio of empty hyperlinks to the total hyperlinks present on the website. The feature results in 1 if the ratio is greater than 0.34; otherwise, the result is 0.

$$F_{13} = \begin{cases} 1, & \text{if } \dfrac{\text{Empty Hyperlinks}}{\text{Total Hyperlinks}} > 0.34 \text{ and Total Hyperlinks in Webpage} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

F14: Error in hyperlinks. This feature checks for errors in hyperlinks. The error “404 not found” occurs when a user requests a URL and the server cannot locate it. Attackers also add hyperlinks to the fake page that do not exist. We consider the 403 and 404 response codes of hyperlinks. In this feature, we calculate the ratio of hyperlinks that return an error.

$$F_{14} = \begin{cases} 1, & \text{if } \dfrac{\text{Number of errors in Hyperlinks}}{\text{Total Hyperlinks}} > 0.3 \text{ and Total Hyperlinks in Webpage} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{13}$$

F15: Hyperlink redirection. In this feature, the system checks the number of hyperlinks redirected to some other place out of the total hyperlinks available in the website. In a phishing attack, URL redirection sometimes confuses users about which website they are surfing. The proposed approach considers response codes 301 and 302 for URL redirection. This feature results in 1 if the ratio of redirecting hyperlinks is greater than 0.3; otherwise, the result is 0.

$$F_{15} = \begin{cases} 1, & \text{if } \dfrac{\text{Number of hyperlinks which are redirecting}}{\text{Total Hyperlinks}} > 0.3 \text{ and Total Hyperlinks} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

4.4 CSS based feature

F16: Copied CSS. Cascading Style Sheets (CSS) is a language used for setting the visual appearance of a website. An attacker always tries to give the phishing website the same visual design as the legitimate website. The CSS of a website is included either through an external CSS file or within the HTML tags themselves. A phishing website usually contains an external CSS file that links to the targeted legitimate site. However, numerous genuine websites use more than one external CSS file or include internal CSS styles.

$$F_{16} = \begin{cases} 1, & \text{if CSS file is external and contains foreign domain name} \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

4.5 Web identity based features

A phishing website is a mimicked fake copy of a popular brand or organisation, and it may have many identity features that are copied from the targeted page (e.g., favicon, copyright information, etc.), thereby claiming a false identity.

Table 2 Datasets used for training and testing

# | Dataset | Number of instances | Category
1 | Phishtank [29] | 1528 | Phishing
2 | Openphish [30] | 613 | Phishing
3 | Alexa [31] | 1600 | Legitimate
5 | Payment gateway [32] | 66 | Legitimate
6 | Top banking website [33] | 252 | Legitimate

Algorithm 2: To find fake identity of website
Input: the DOM tree of a website
Output: F18 ∈ {0, 1}, 1 - phishing, 0 - legitimate
Start
1. Extract the identity keywords from title and meta tag
2. Extract the top keywords using tf-idf algorithm from website
3. Construct the identity keywords set from step 1 and step 2
4. If one of the identity keywords matched with the domain name then set F18 = 0
5. else set F18 = 1
End

Fig. 4 Algorithm to find fake identity of the website
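The steps of Algorithm 2 can be sketched in Python as follows. Several simplifications are assumptions of this sketch and not part of the paper: the title, meta and body text are passed in as already-extracted strings instead of a DOM tree, plain term frequency stands in for the TF-IDF ranking (which needs a reference corpus), the stop-word list is a small illustrative one, and the keyword set is matched against the host name of the URL. The name `identity_feature` and the parameter `top_n` are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Small illustrative stop-word list (assumption, not taken from the paper).
STOP_WORDS = {"the", "and", "for", "with", "your", "you", "our", "are",
              "this", "that", "from", "www", "com", "org", "net",
              "home", "page", "sign"}


def tokens(text):
    """Lower-case alphabetic tokens of length >= 3, minus stop words."""
    return [t for t in re.findall(r"[a-z]{3,}", text.lower())
            if t not in STOP_WORDS]


def identity_feature(url, title, meta, body_text, top_n=5):
    """Return F18: 0 if an identity keyword occurs in the host, else 1."""
    # Step 1: identity keywords from the title and meta tags.
    keywords = set(tokens(title)) | set(tokens(meta))
    # Step 2: most frequent body terms. Plain term frequency is used here
    # as a stand-in for the TF-IDF ranking named in Algorithm 2.
    keywords |= {t for t, _ in Counter(tokens(body_text)).most_common(top_n)}
    # Steps 3 to 5: match the combined keyword set against the host name.
    host = (urlparse(url).hostname or "").lower()
    return 0 if any(k in host for k in keywords) else 1
```

With a page whose title and text revolve around "amazon" but whose host is www.shoppiingg.com, no keyword occurs in the host and the sketch returns 1; for the same content served from www.amazon.com it returns 0.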


F17—Copyright features The identity of a website can 1, if foreign domain found in favicon link
be extracted using copyright information given in the text F19 =
0, Otherwise
form. Copyright field of a website contains the name of the (17)
organization. This feature extracts the keywords from the
copyright field, tokenized them, and matches with the suspi-
cious domain name. The symbol and keywords used to locate
5 Training dataset and performance metric
the copyright information are the @ symbol, © symbol, &
copy, copyright and all right reserved.
The proposed approach build a binary classifier based on
 the features described in Sect. 4, which classify phishing
0, if copyright keyword matched with base domain
F17 = 1, otherwise    (16)

F18 – Identity Keywords Certain keywords in a website reveal its exact identity. We build a set of identity keywords from the title, the meta tags, and the most frequently appearing terms. The approach applies the TF-IDF algorithm [22] to extract the most frequently appearing keywords. The extracted keywords are matched against the domain name of the suspicious site. If the site is legitimate, at least one identity keyword should be part of the domain name. Phishing websites, however, place the identity keyword in the path segment of the URL to fool users, e.g. http://www.shoppiingg.com/www.amazon.com/cgi-bin/index.htm. The algorithm to find the fake identity of a website is explained in Fig. 4.

F19 – Favicon A favicon is an image icon associated with a particular website. An attacker may reuse the favicon of the targeted website to fool users. The favicon is an .ico file referenced by a URL in a link tag of the DOM tree. If the favicon shown in the address bar belongs to a website other than the current one, it is considered a phishing attempt. Therefore, if the favicon URL contains a foreign domain, the feature takes the value 1; otherwise it takes the value 0.

and legitimate websites correctly. This section describes the training and testing dataset and the performance metrics used in our approach.

5.1 Training dataset

Our training dataset consists of 2141 phishing and 1918 legitimate websites. Table 2 presents the number of instances and the sources of the phishing and legitimate datasets. We collected the phishing dataset from two sources, Phishtank [29] and Openphish [30]. These phishing datasets consist of verified URLs. Phishing websites are short-lived, so we crawled them while they were active. The legitimate dataset is taken from various sources, as shown in Table 2. Alexa is a highly reliable source of legitimate websites; it ranks websites based on page views and unique site users, so popular sites receive a high rank and unpopular sites a low rank. We added both high-ranked and low-ranked websites to our dataset. We also added payment gateway websites, because these are prime targets of cyber-criminals. Moreover, our dataset comprises websites in numerous languages (e.g., English, Russian, Spanish, Portuguese, Hindi, Chinese) to test the language-independent performance of our method. Every feature vector has one entry in the dataset for the nineteen defined features.

<link rel="shortcut icon" href="https://www.facebook.com/rsrc.php/yl/r/H3nktOa7ZMg.ico" />
<link rel="shortcut icon" href="//in.bmscdn.com/webin/common/favicon.ico" type="image/x-icon" />
<link type="image/png" href="/css/img/favicon.png" rel="shortcut icon">
Example of HTML coding for favicon (feature F19)
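
The F19 rule, applied to link tags like those in the example above, could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `FaviconLinkParser` and `favicon_feature` are hypothetical, the standard-library `html.parser` is used in place of BeautifulSoup to keep the sketch self-contained, and a "foreign domain" is assumed to mean a host name that differs from the page's host name (the paper does not state how domains are compared).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FaviconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel attribute contains 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        href = attributes.get("href")
        if "icon" in rel.split() and href:
            self.hrefs.append(href)


def favicon_feature(page_url, html):
    """Return 1 if any favicon is served from a host other than the page's, else 0.

    Assumption: hosts are compared exactly; the paper does not specify this.
    """
    parser = FaviconLinkParser()
    parser.feed(html)
    page_host = urlparse(page_url).hostname
    for href in parser.hrefs:
        # urljoin resolves relative and protocol-relative hrefs against the page URL.
        icon_host = urlparse(urljoin(page_url, href)).hostname
        if icon_host != page_host:
            return 1
    return 0
```

Under the exact-host assumption, a legitimate site that serves its favicon from a separate CDN host (as in the second link tag of the example) would also be flagged, so a real implementation would probably need a looser domain comparison.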


Feature vectors with identical values are removed from the dataset.

5.2 Performance metric

We calculated the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), and accuracy to evaluate the performance of the proposed anti-phishing approach. These are the standard metrics for judging any anti-phishing approach. $N_L$ and $N_P$ denote the total number of legitimate and phishing websites, respectively. $N_{L \to L}$ is the number of legitimate websites classified as legitimate, and $N_{L \to P}$ is the number of legitimate websites misclassified as phishing. $N_{P \to P}$ is the number of phishing websites classified as phishing, and $N_{P \to L}$ is the number of phishing websites misclassified as legitimate. Table 3 presents the measures used for the classification of phishing and legitimate websites.

Table 3 Performance measures used in our approach
Measure        Formula                                                        Description
TPR            $\mathrm{TPR} = \frac{N_{P \to P}}{N_P} \times 100$            Rate of phishing websites classified as phishing out of total phishing websites
FPR            $\mathrm{FPR} = \frac{N_{L \to P}}{N_L} \times 100$            Rate of legitimate websites classified as phishing out of total legitimate websites
FNR            $\mathrm{FNR} = \frac{N_{P \to L}}{N_P} \times 100$            Rate of phishing websites classified as legitimate out of total phishing websites
TNR            $\mathrm{TNR} = \frac{N_{L \to L}}{N_L} \times 100$            Rate of legitimate websites classified as legitimate out of total legitimate websites
Accuracy (A)   $A = \frac{N_{L \to L} + N_{P \to P}}{N_L + N_P} \times 100$   Rate of phishing and legitimate websites identified correctly with respect to all websites

6 Implementation and evaluation

6.1 Implementation details

In the process of phishing website identification, we first identified the relevant and useful features and then constructed the dataset by extracting these features from legitimate and phishing websites. The labelled dataset is used to train the random forest classifier. A laptop with a core Pentium i5 processor (2.4 GHz clock speed) and 4 GB RAM is used to implement the proposed anti-phishing solution. Our approach is implemented in the Python programming language. Python offers extensive library support, and it has a reasonable compile time. We created a separate function for each feature. Different libraries are required for extracting the features from the webpage. These libraries can be installed individually, either with the pip installer for Python or by downloading and extracting them from the official websites. The following libraries are used during execution of the code:

BeautifulSoup: used for pulling data from HTML and XML files.
urllib2: used to get a response object from the URL, from which all resources of the webpage are extracted.
re: used to match a regular expression against a string.
time: used to capture the time of a particular instance.

6.2 Complexity of the proposed approach

Extracting features from the source code of the webpage reduces both the processing time and the response time, which makes the approach more reliable and efficient. The computational complexity of the proposed approach depends on extracting and computing the proposed features. The URL-based features are easy to calculate. To compute features F1, F2, F3, F5, F6, and F8, we implemented a single-pattern matching algorithm (the Knuth–Morris–Pratt algorithm), which requires O(n) time and space, where n is the length of the URL. The URL-based features F4 and F7 are calculated using multiple-pattern matching, for which we implemented the Karp–Rabin algorithm. Its best and average case running time is O(n + m), where m is the combined length of all patterns. To compute the hyperlink-specific features F10, F11, F12, F13, F14, F15, F16, and F18, we need to obtain all hyperlinks from the webpage. We use a regular expression that covers all the ways in which hyperlinks can appear on the webpage. Every piece of text in the page source that matches this regular expression is identified as a hyperlink, which takes linear time O(t), where t is the length of the webpage source code. The login form feature (F9) requires regex pattern matching, which is calculated in O(t) time. The copyright feature F17 needs a string matching algorithm and is computed in O(t) time. Feature F18 extracts identity keywords using the TF-IDF algorithm. TF-IDF is based on the frequency of each term, and indexing a document of p tokens is O(p). The learning model of our approach is the random forest. The time complexity of building a complete unpruned decision tree is O(x * y log(y)), where y is the number of records and x is the number of features. In our approach the value of x (number of features = 19) is constant, so the time complexity of the random forest is O(y log(y)). In summary, our crawler calculates the proposed features in O(t) time, and the learning algorithm requires O(y log(y)) time. Moreover, the proposed method does not depend on any third-party services, so it does not need to wait for the results returned by these services.

Table 4 Performance of proposed approach on various classifiers
Algorithm                 TPR (%)   FPR (%)   Accuracy (%)
Random forest             99.39     1.25      99.09
Support vector machine    98.23     6.15      96.16
Neural networks           98.93     2.92      98.05
Logistic regression       98.41     1.93      98.25
Naive Bayes               98.46     3.39      97.59

Fig. 5 ROC curve on random forest classifier

Fig. 6 Results of URL based features (F1–F8): TPR 66.37%, TNR 75.03%, FPR 24.97%, FNR 33.63%, accuracy 70.46%

Fig. 7 Performance of login form feature (F9): TPR 82.48%, TNR 95.20%, FPR 4.80%, FNR 17.52%, accuracy 88.49%
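
As a worked check of the Table 3 formulas, the short sketch below computes the five measures from raw classification counts. The function name `performance_measures` and its parameter names are illustrative assumptions; the counts passed in are the ones reported in the confusion matrix of Table 5, and they reproduce the random forest row of Table 4.

```python
def performance_measures(n_ll, n_lp, n_pl, n_pp):
    """Compute the Table 3 measures (in percent) from classification counts.

    n_ll: legitimate classified as legitimate
    n_lp: legitimate classified as phishing
    n_pl: phishing classified as legitimate
    n_pp: phishing classified as phishing
    """
    n_l = n_ll + n_lp  # total legitimate websites (N_L)
    n_p = n_pl + n_pp  # total phishing websites (N_P)
    return {
        "TPR": 100.0 * n_pp / n_p,
        "FPR": 100.0 * n_lp / n_l,
        "FNR": 100.0 * n_pl / n_p,
        "TNR": 100.0 * n_ll / n_l,
        "Accuracy": 100.0 * (n_ll + n_pp) / (n_l + n_p),
    }


# Counts taken from the confusion matrix in Table 5.
measures = performance_measures(n_ll=1894, n_lp=24, n_pl=13, n_pp=2128)
for name, value in measures.items():
    print(f"{name}: {value:.2f}%")
# TPR: 99.39%, FPR: 1.25%, FNR: 0.61%, TNR: 98.75%, Accuracy: 99.09%
```

TPR and FNR sum to 100, as do TNR and FPR, since each pair splits the same class total.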
6.3 Results on popular classification algorithms

Table 4 presents the performance of our approach on popular and widely accepted classifiers in terms of TPR, FPR, and accuracy. WEKA software is used to judge the performance of the proposed technique on various machine learning classifiers. We evaluated our dataset with 10-fold cross-validation, which uses 90% of the data for training and the remaining 10% for testing. Random forest outperformed SVM, neural networks, logistic regression, and naïve Bayes, achieving the highest TPR and accuracy. We also explored the area under the ROC (Receiver Operating Characteristic) curve to find a better metric of precision. In our experiment, the area under the ROC curve for phishing websites is 99.85 for the random forest, as shown in Fig. 5, which shows that our approach classifies websites with high accuracy.

Fig. 8 Results of hyperlink based features (F10–F15): TPR 93.69%, TNR 89.73%, FPR 10.27%, FNR 6.31%, accuracy 91.82%

6.4 Evaluation of features

In this experiment, we evaluated the performance of our approach. The random forest classification algorithm is used to classify the websites. We also evaluated the efficiency of each category of the proposed feature set. Figure 6 shows the classification results of the URL-based features (feature 1 to feature 8). As seen in the figure, URL-based features can correctly filter 75.03% of legitimate websites and 66.37% of phishing websites. Figure 7 presents the performance of the fake login form detector. As seen in the figure, our login form detection algorithm correctly classifies a large share of legitimate websites and provides 88.49% accuracy. This is because attackers change the link of the login form handler to send the user's details to their desired destination, and our fake login form algorithm successfully detects it. Figure 8 presents the results of the hyperlink-based features and shows that these features are most significant in the classification of
phishing websites accurately. The hyperlink-specific features alone detect 93.69% of phishing websites and yield 91.82% overall detection accuracy. As seen in Fig. 9, the identity-based features deliver high accuracy because phishing websites always claim a wrong identity to trap the Internet user. The results demonstrate that these features successfully determine the correct identity of the website. Figure 10 presents the results of the proposed approach when all kinds of features are combined. URL, hyperlink, identity, login form, and CSS features are all useful in phishing detection. However, a single kind of feature is not sufficient to detect all types of phishing websites and does not produce high accuracy. Therefore, we integrated all features to improve the detection accuracy of the proposed approach. An approach that uses only URL-based features yields a high false negative rate, i.e., it wrongly judges phishing websites as legitimate. Our approach achieves a high true positive rate (more than 99% of phishing sites correctly identified) and a low false positive rate (less than 1.3% of legitimate websites misclassified as phishing). Table 5 presents the confusion matrix of the proposed approach, which shows the number of correct and false predictions. The results for the various feature combinations are stated in numeric form in Table 6.

Fig. 9 Results of identity based features (F17–F19): TPR 91.64%, TNR 88.69%, FPR 11.31%, FNR 8.36%, accuracy 90.24%

Fig. 10 Overall results of proposed approach: TPR 99.39%, TNR 98.75%, FPR 1.25%, FNR 0.61%, accuracy 99.09%

Table 5 Confusion matrix
                       Classified as legitimate   Classified as phishing
Legitimate websites    1894                       24
Phishing websites      13                         2128

Table 6 Performance of proposed approach on different combination of features
Features                                                      TPR (%)   FPR (%)   Accuracy (%)
F_URL                                                         66.37     24.97     70.46
F_HYPERLINK                                                   93.69     10.27     91.82
F_IDENTITY                                                    91.64     11.31     90.24
F_CSS + F_LOGINFORM                                           85.61     6.26      89.46
F_URL + F_HYPERLINK + F_IDENTITY + F_CSS + F_LOGINFORM        99.39     1.25      99.09

6.5 Importance of proposed features

The proposed feature set is carefully chosen to ensure the correct classification of legitimate and phishing websites. We experimentally determine the importance of the proposed features using the Pearson product-moment correlation coefficient (PCC). PCC measures the linear correlation between two variables, producing a value between +1 and −1. A higher absolute value indicates a more dominant feature in the classification result; for example, a PCC value of −0.71 is more significant than +0.34. We calculated the PCC value of each feature. If a feature is relevant to correct classification, it produces a non-zero value. Figure 11 plots the PCC of each of the 19 features of the proposed approach against the label. From the figure, we observe that the PCC values of all features are non-zero, which indicates that every feature in the proposed approach is important.

Fig. 11 PCC values of proposed feature set (x-axis: feature 1–19; y-axis: PCC)

6.6 Runtime analysis

The response time is the time between inputting a URL and producing the output. When the user inputs a URL, the approach
tries to fetch all defined features from the webpage URL and textual content, as discussed in Sect. 4. Then, based on the extracted feature values, the trained classifier classifies the current URL and reports the result as phishing or legitimate. We selected some random URLs from our dataset for the runtime analysis of our approach. The total response time of our approach for extracting and computing the feature vector and producing the result is 2350 ± 3500 ms. This response time is relatively low and acceptable in a real-time environment. The classifier does not depend on any third-party services, so it does not need to wait for the results returned by these services. Hence, our approach is fast compared to other solutions.

6.7 Comparison with existing anti-phishing approaches

In this experiment, we compared the proposed method with benchmark anti-phishing approaches. Table 7 presents a comparison based on TPR, FPR, accuracy (ACC), search engine independence (SEI), language independence (LI), and third-party service independence (TSI). As seen in the table, our work gives the highest detection accuracy among the approaches discussed in the literature. The works of Tan et al. [23] and Chiew et al. [24] give a higher TPR than our approach. However, these two approaches produce a very high FPR compared to our approach. There is a trade-off between TPR and FPR, so a good anti-phishing system should provide a balanced TPR and FPR. Most of the previous approaches have used a search engine in the feature set [10,19,20,22–24]. However, search-engine-based features have several drawbacks. First, new genuine sites do not appear in the top search results, so this feature leads to wrong predictions. Second, search engines do not produce accurate outcomes for non-English queries [34]. Therefore, our approach detects the websites missed by the search engine and produces a low FPR.

Table 7 Comparison of proposed approach with other standard approaches
Approach               TPR (%)   FPR (%)   ACC (%)   SEI   LI    TSI
Montazer et al. [18]   88        12        88        Yes   Yes   No
Xiang et al. [19]      92        0.4       95.8      No    No    No
Gowtham et al. [10]    98.24     1.71      98.25     No    Yes   No
Zhang et al. [22]      97        6         95        No    No    No
Tan et al. [23]        99.68     7.48      96.10     No    No    No
Chiew et al. [24]      99.8      13        93.4      No    Yes   Yes
El-Alfy et al. [20]    97.24     3.88      96.74     No    No    No
Zhang et al. [21]      98.64     0.53      99.04     Yes   Yes   No
Proposed approach      99.39     1.25      99.09     Yes   Yes   Yes

7 Advantages of our approach

7.1 Language independency

The language barrier has been a bottleneck in most of the existing approaches. English is used as the textual language for only about 52.1% [35] of websites. Therefore, language independence becomes a critical issue for any anti-phishing scheme. In our approach, most of the features are language independent (F1–F16, F19). Therefore, our system yields accurate results to a large extent. Most of the previous approaches have used a search engine and textual content in the feature set [10,19–23]. Search engines do not produce correct outcomes for non-English queries [34]. Therefore, our approach detects the websites missed by the search engine and produces a low FPR. Furthermore, our testing dataset comprises websites in various languages, and the experimental results on the dataset show the high detection accuracy of our approach.

7.2 Low response time

Low response time is a necessary and critical requirement for a phishing detection system, and it is another reason for choosing client-side features. Using the proposed features for the detection of phishing webpages provides an average response time of around 2–6 s, which is quite low compared to existing alternatives such as other machine learning and visual similarity techniques, which give a response time of around 10–13 s. Accessing the source code and producing the result requires a negligible amount of time.

7.3 Third party independency

Several approaches use third-party dependent features in the classification [10,18–23]. We have not chosen these features (such as DNS, blacklist/whitelist, whois record, certifying authority, etc.) for the following reasons.

• Blacklists and whitelists contain a limited number of URLs and do not cover all websites.
• Some approaches verify the domain age from a whois lookup. From the Phishtank dataset [29], we found that more than 30% of phishing webpages are hosted on compromised domains. If a website is hosted on a compromised domain, the domain age feature gives the age of the compromised domain, which leads the approach to a wrong prediction.
• Certifying authorities do not certify every legitimate website, so new legitimate websites are wrongly classified.
• The DNS database may also be poisoned.
• Third-party dependent features create additional network delay, which can cause a high prediction time.

7.4 Compromised domain detection

Nowadays, cybercriminals host phishing webpages on publicly available websites by exploiting vulnerabilities using various phishing tools. Hosting phishing webpages on compromised domains is done on a large scale, and it provides numerous advantages to cybercriminals. A hacker does not require a web hosting server to deploy the phishing webpage. Our approach is based on the source code and does not include features that give false predictions in the case of a compromised domain. Some features, such as the age of the domain, certifying authority, and WHOIS lookup, provide incorrect information in the case of a compromised website and produce false results. Search-based techniques compare the domain name with the top 'T' search results to check legitimacy. In a compromised domain attack, the domain name is genuine and most of the time appears in the top 'T' search results, so these methods fail to detect the compromised domain.
Visual similarity techniques compare the visual appearance of the suspicious webpage with that of its corresponding authentic webpage stored in a local database. These techniques detect the compromised webpage only if its corresponding legitimate webpage is present in the local database. Maintaining a large database that contains every legitimate webpage is a tough task. Our approach, in contrast, can identify the compromised domain to a large extent because it does not depend on a local database for comparisons.

7.5 Client side application

The webpage is labelled as phishing or legitimate based only on its page source, and no other resource is required. Most anti-phishing techniques require additional resources such as OCR, DNS, or search engines (SE); avoiding them makes our approach platform independent. Our approach uses the best features and delivers a competitive detection rate with minimum resources. This makes our approach portable and platform independent (i.e., its performance is not affected by changes in the facilities supplied by third parties or in protocols).

8 Conclusion and future work

This paper presented a novel approach for filtering phishing websites at the client side, where URL, hyperlink, CSS, login form, and identity features are used. The main contribution of this paper is the identification of various new client-side specific features that were not previously studied. Furthermore, we created a new heuristic for each feature. We constructed a dataset collected from various sources and included a variety of websites to validate the proposed solution. Our experimental results on the dataset showed that the solution is very efficient, as it has a 99.39% true positive rate and only a 1.25% false positive rate. The proposed approach also has good accuracy compared to other existing anti-phishing solutions.
The feature set of our phishing detection approach depends entirely on the URL and source code of the website, so it can detect only webpages written in HTML. Therefore, the identification of non-HTML websites is part of our future work. Nowadays, mobile devices are more popular and seem to be a perfect target for malicious attacks such as mobile phishing. Therefore, detecting phishing websites in the mobile environment is a challenge for further research and development.

Acknowledgements This research work is being supported by Sir Visvesvaraya Young Faculty Research Fellowship Grant from Ministry of Electronics & Information Technology (MeitY), Government of India.

References

1. Jain, A. K., & Gupta, B. B. (2017). Detection of phishing attacks in financial and e-banking websites using link and visual similarity relation. International Journal of Information and Computer Security, Inderscience, 2017 (Forthcoming Articles).
2. Gupta, S., & Gupta, B. B. (2017). Detection, avoidance, and attack pattern mechanisms in modern web application vulnerabilities: Present and future challenges. International Journal of Cloud Applications and Computing, 7(3), 1–43.
3. Almomani, A., et al. (2013). A survey of phishing email filtering techniques. IEEE Communications Surveys & Tutorials, 15(4), 2070–2090.
4. Gupta, B. B., et al. (2017). Fighting against phishing attacks: State of the art and future challenges. Neural Computing and Applications, 28(12), 3629–3654.
5. APWG Q4 2016 Report. Available at: http://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf. Last accessed on September 22, 2017.
6. Razorthorn phishing report. Available at: http://www.razorthorn.co.uk/wp-content/uploads/2017/01/Phishing-Stats-2016.pdf. Last accessed on September 22, 2017.
7. Purkait, S. (2015). Examining the effectiveness of phishing filters against DNS based phishing attacks. Information & Computer Security, 23(3), 333–346.
8. Huang, Z., Liu, S., Mao, X., Chen, K., & Li, J. (2017). Insight of the protection for data security under selective opening attacks. Information Sciences, 412–413, 223–241.
9. Li, J., Chen, X., Huang, X., Tang, S., Xiang, Y., Hassan, M. M., et al. (2015). Secure distributed deduplication systems with improved reliability. IEEE Transactions on Computers, 64(12), 3569–3579.
10. Gowtham, R., & Krishnamurthi, I. (2014). A comprehensive and efficacious architecture for detecting phishing webpages. Computers & Security, 40, 23–37.
11. Aboudi, N. E., & Benhlima, L. (2017). Parallel and distributed population based feature selection framework for health monitoring.
International Journal of Cloud Applications and Computing, 7(1), 57–71.
12. Sahoo, D., Liu, C., & Hoi, S. C. H. (2017). Malicious URL detection using machine learning: A survey. arXiv:1701.07179.
13. Arachchilage, N. A. G., Love, S., & Beznosov, K. (2016). Phishing threat avoidance behaviour: An empirical investigation. Computers in Human Behavior, 60, 185–197.
14. Sheng, S., Magnien, B., Kumaraguru, P., Acquisti, A., Cranor, L. F., Hong, J., & Nunge, E. (2007). Anti-phishing phil: The design and evaluation of a game that teaches people not to fall for phish. In Proceedings of the 3rd symposium on usable privacy and security, Pittsburgh, (pp. 88–99).
15. Jain, A. K., & Gupta, B. B. (2016). A novel approach to protect against phishing attacks at client side using auto-updated white-list. EURASIP Journal of Information Security, 2016, 1–11.
16. Sheng, S., Wardman, B., Warner, G., Cranor, L. F., Hong, J., & Zhang, C. (2009). An empirical analysis of phishing blacklists. In Proceedings of the 6th Conference on Email and Anti-Spam (CEAS'09).
17. Jain, A. K., & Gupta, B. B. (2017). Phishing detection: Analysis of visual similarity based approaches. Security and Communication Networks, 2017, Article ID 5421046, 20 pages, https://doi.org/10.1155/2017/5421046.
18. Montazer, G. A., & ArabYarmohammadi, S. (2015). Detection of phishing attacks in Iranian e-banking using a fuzzy-rough hybrid system. Applied Soft Computing, 35, 482–492.
19. Xiang, G., Hong, J., Rose, C. P., & Cranor, L. (2011). Cantina+: A feature-rich machine learning framework for detecting phishing web sites. ACM Transactions on Information and System Security (TISSEC), 14(2), 21.
20. El-Alfy, E. S. M. (2017). Detection of phishing websites based on probabilistic neural networks and K-medoids clustering. The Computer Journal. https://doi.org/10.1093/comjnl/bxx035.
21. Zhang, W., Jiang, Q., Chen, L., & Li, C. (2017). Two-stage ELM for phishing Web pages detection using hybrid features. World Wide Web, 20(4), 797–813.
22. Zhang, Y., Hong, J. I., & Cranor, L. F. (2007). Cantina: A content-based approach to detecting phishing web sites. In Proceedings of the 16th international conference on world wide web, (pp. 639–648).
23. Tan, C. L., Chiew, K. L., Wong, K., & Sze, S. N. (2016). PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder. Decision Support Systems, 88, 18–27.
24. Chiew, K. L., Chang, E. H., & Tiong, W. K. (2015). Utilisation of website logo for phishing detection. Computers & Security, 54, 16–26.
25. APWG 2014 H2 Report. Available at: https://docs.apwg.org/reports/apwg_trends_report_q3_2014.pdf. Last accessed on September 22, 2017.
26. Dataurization of URLs for a more effective phishing campaign. Available at: https://thehackerblog.com/dataurization-of-urls-for-a-more-effective-phishing-campaign/index.html. Last accessed on September 10, 2017.
27. Geng, G. G., Yang, X. T., Wang, W., & Meng, C. J. (2014). A Taxonomy of hyperlink hiding techniques. In Asia-Pacific web conference, (pp. 165–176).
28. Jain, A. K., & Gupta, B. B. (2017). Two-level authentication approach to protect from phishing attacks in real time. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-017-0616-z.
29. Verified Phishing URL. Available at: https://www.phishtank.com. Last accessed on September 22, 2017.
30. Phishing dataset. Available at: https://www.openphish.com/. Last accessed on September 27, 2017.
31. Alexa Most Popular sites. Available at: http://www.alexa.com/topsites. Last accessed on September 22, 2017.
32. List of online payment gateways. Available at: http://research.omicsgroup.org/index.php/List_of_online_payment_service_providers. Last accessed on September 27, 2017.
33. Top banking websites in the world. Available at: https://www.similarweb.com/top-websites/category/finance/banking. Last accessed on September 27, 2017.
34. Chu, P., Komlodi, A., & Rózsa, G. (2015). Online search in english as a non-native language. Proceedings of the Association for Information Science and Technology, 52(1), 1–9.
35. Percentages of websites using various content languages. Available at: https://w3techs.com/technologies/overview/content_language/all. Last accessed on September 22, 2017.

Ankit Kumar Jain is presently working as Assistant Professor in National Institute of Technology, Kurukshetra, India. He received Master of technology from Indian Institute of Information Technology Allahabad (IIIT) India. Currently, he is pursuing Ph.D. in cyber security from National Institute of Technology, Kurukshetra. His general research interest is in the area of Information and Cyber security, Phishing Website Detection, Web security, Mobile Security, Online Social Network and Machine Learning. He has published many papers in reputed journals and conferences.

B. B. Gupta received Ph.D. degree from Indian Institute of Technology Roorkee, India in the area of Information and Cyber Security. In 2009, he was selected for Canadian Commonwealth Scholarship awarded by Government of Canada. He published more than 100 research papers (including 02 books and 14 book chapters) in International Journals and Conferences of high repute including IEEE, Elsevier, ACM, Springer, Wiley, Taylor & Francis, Inderscience, etc. He has visited several countries, i.e. Canada, Japan, Malaysia, China, Hong-Kong, etc. to present his research work. His biography was selected and published in the 30th Edition of Marquis Who's Who in the World, 2012. Dr. Gupta also received Young Faculty research fellowship award from Ministry of Electronics and Information Technology, government of India in 2017. He is also working as principal investigator of various R&D projects. He is serving as associate editor of IEEE Access and Executive editor of IJITCA, Inderscience, respectively. He is also serving as reviewer for Journals of IEEE, Springer, Wiley, Taylor & Francis, etc. He is also serving as guest editor of various reputed Journals. He was also visiting researcher with Yamaguchi University, Japan in January 2015. At present, Dr. Gupta is working as Assistant Professor in the Department of Computer Engineering, National Institute of Technology Kurukshetra India. His research interest includes Information security, Cyber Security, Cloud Computing, Web security, Intrusion detection and Phishing.