Towards Detection of Phishing Websites on Client-Side Using Machine Learning Based Approach
https://doi.org/10.1007/s11235-017-0414-0
Abstract
The existing anti-phishing approaches use blacklist methods or feature-based machine learning techniques. Blacklist methods fail to detect new phishing attacks and produce a high false positive rate. Moreover, existing machine learning based methods extract features from third parties, search engines, etc. Therefore, they are complicated, slow, and not fit for real-time environments. To solve this problem, this paper presents a novel machine learning based anti-phishing approach that extracts features from the client side only. We have examined the various attributes of phishing and legitimate websites in depth and identified nineteen outstanding features to distinguish phishing websites from legitimate ones. These nineteen features are extracted from the URL and source code of the website and do not depend on any third party, which makes the proposed approach fast, reliable, and intelligent. Compared to other methods, the proposed approach has relatively high accuracy in detection of phishing websites, as it achieved a 99.39% true positive rate and 99.09% overall detection accuracy.
B. B. Gupta, National Institute of Technology Kurukshetra, Kurukshetra, India

1 Introduction

Phishing is an online identity theft, which can deceive Internet users into revealing their secret information and credentials, e.g., login id, password, credit card number, etc. Phishing is one of the major computer security threats faced by the cyber-world and could lead to financial losses for both industries and individuals [1]. Among the various cybersecurity attacks, phishing has received special attention because of its adverse effect on the economy [2–4]. According to the APWG report, 1,220,523 phishing attacks were found worldwide in 2016, a growth of 65% over 2015 [5]. The number of attacks per month also increased by 5753% over 12 years from 2004 to 2016 (1609 phishing attacks per month in 2004 and an average of 92,564 attacks per month in 2016). Quarter 2 of 2016 represented an all-time high of 466,065 phishing attacks [5]. The motive of a phishing attack is not only to gain credentials; phishing has also become the number 1 delivery method for other types of malicious software like ransomware [6]. In August 2016, the financial loss due to phishing scams was more than $17.36 million in the US alone, followed by Japan and the UK with losses of 8.38 and 7.21 million dollars, respectively [6].

Today, cyber experts and phishers are in a rat race. The cyber experts continue to improve anti-phishing solutions with the help of researchers and developers. Developers build various anti-phishing tools that alert users to malicious emails and websites (e.g., Calling-ID Toolbar, Netcraft Cloudmark Anti-Fraud Toolbar, etc.). A recent study found that only 3 out of 14 tools identified a phishing website hosted locally, which is a critical concern for the trust placed in conventional tools [7]. Moreover, these tools are exposed in the public domain. Raising human awareness is not a sufficient mitigation method, and deploying complementary technical solutions is a crucial requirement [8,9]. In the previous few years, researchers and developers have built various phishing detection solutions. However, the phishing problem still exists, and the development of an efficient anti-phishing approach remains a challenging task. Moreover, most anti-phishing solutions produce a high false positive rate and are not capable of dealing with zero-hour attacks. Blacklist based detection approaches have a quick access time; however, they cannot identify zero-hour attacks. Moreover, other solutions like heuristic and visual comparison produce high false predictions.
688 A. K. Jain, B. B. Gupta
Therefore, it is essential to design an approach that can efficiently classify phishing webpages. Recent developments in phishing detection have employed numerous machine learning based approaches. These approaches train a classification algorithm with some features that can distinguish a phishing webpage from a legitimate one [10]. The efficiency of the detection approaches depends on the training data, the selection of a good feature set, and the classification algorithm used to train these features [11].

The existing machine learning based approaches extract features from various sources like URL, page source, search engine, and third party services like website traffic, DNS, whois record, etc. The extraction of third party features is a complicated and time-consuming process [12]. Integration of features from different sources is also a difficult process. Therefore, these approaches are complicated, slow, and do not produce results in real time. To cope with this problem, this paper presents an efficient solution, which extracts the features from the client side only. Identifying the outstanding features is one of the preconditions for the design of a good phishing detection approach. Therefore, we have examined the various attributes of phishing and legitimate websites in depth and identified various efficient client side features in order to detect phishing websites. These nineteen features are obtained from the URL and source code of the website, which makes our approach fast and reliable. We have evaluated the proposed features on various machine learning algorithms using a dataset of 4059 phishing and legitimate websites. Evaluation results show that the proposed approach accurately filters phishing sites, as it has a true positive rate of 99.39% and a false positive rate of only 1.25%. The main advantages of the proposed approach compared to existing phishing detection solutions are (1) it is fast, reliable and provides real-time phishing detection, (2) it can detect phishing webpages hosted on a compromised domain, (3) it can detect webpages written in any textual language, (4) it does not require any dedicated resources for phishing detection, (5) it is platform independent, and (6) it is available as a client side desktop application.

The remainder of this paper is structured as follows. Section 2 describes the related work. Section 3 presents the overview of our proposed approach. Section 4 describes the proposed feature set. Section 5 shows the training dataset and performance metrics. We present the implementation and evaluation details in Sect. 6. Section 7 discusses the advantages of our proposed approach. Finally, Sect. 8 concludes the paper and presents future work.

2 Related work

This section presents an overview of the phishing detection approaches proposed in the literature. Phishing detection approaches split into two classes: user education based and software based. Software-based approaches are further classified into blacklist, visual similarity, search engine, and machine learning based solutions.

2.1 User education

The user education approach aims to improve the capacity of Internet users to detect phishing attacks [13]. Internet users can be educated to distinguish the characteristics of phishing and legitimate emails and websites. Sheng et al. [14] developed an interactive educational game, “Anti-Phishing Phil”, that teaches users how to identify phishing websites. After spending 15 min on the game, users were better able to identify phishing websites compared to users who did not play the game. The main motive behind this game design is to give computer users conceptual knowledge about phishing attacks. This conceptual knowledge may help users avoid phishing attacks.

2.2 Phishing blacklist

A blacklist contains a list of malicious domains, URLs, and IP addresses [15]. Sheng et al. [16] showed that a fake domain is added to a blacklist only after a substantial amount of time, and approximately 50–80% of fake domains are added after the attack has been performed. A blacklist needs regular updates from its source because thousands of fake websites are launched every day.

2.3 Visual similarity based techniques

These techniques [17] utilize various features to compute the similarity between websites, like page source code, images, textual content, text formatting, HTML tags, CSS, website logo, etc. Most of the visual similarity based approaches compare the new website with previously visited or stored websites. Therefore, these techniques cannot detect new phishing websites and produce a high false negative rate. Some of the techniques take a snapshot of websites for comparison, which requires high computation time and therefore does not fit a time-constrained environment.

2.4 Machine learning based techniques

These methods [10,18–21] train a classification algorithm with some features that can distinguish a genuine website from a phishing one. A website is declared as phishing if the design of the website matches the predefined feature set. The performance of these solutions depends on the feature set, training data and classification algorithm. These features are extracted from various sources like URL, page
Towards detection of phishing websites on client-side using machine learning based approach 689
source, website traffic, search engine, DNS, etc. Some of these features are difficult to access, slow, third party dependent, and time-consuming. Therefore, some of the machine learning solutions require heavy computation to obtain and compute the features from various sources.

2.5 Search engine based techniques

The search engine (SE) based techniques extract identity features (e.g., title, copyright, logo, domain name, etc.) from the webpage and make use of a search engine to check the legitimacy of the webpage [22–24]. The FPR of these methods is high because newly constructed genuine sites do not appear in the top search results. Previous search based techniques assume that a legitimate site appears in the top results of the search engine. However, only popular sites appear in the top search results. Moreover, these techniques do not provide the desired results when webpages are in a language other than English, because a search engine like Google does not give precise results for non-English search queries.

• Client side implementation The proposed approach is implemented at the client side on the user’s system. Therefore, it provides better user privacy (D5).
• Feature set selection The proposed features are extracted from the URL and source code of the webpage (no third party features). Therefore, extracting and computing the features is easy and fast, and it provides a real time phishing prediction (approximately 2–6 s) (D3). Moreover, most of the features are not affected by the textual language of the webpage (D2) and can detect any kind of phishing website (D4). Moreover, we proposed some new features that increase the detection accuracy of our method.
• Sensitivity analysis of features We conducted a sensitivity analysis on the feature set to ensure higher detection accuracy (D1). Sensitivity analysis predicts the most powerful features in the detection of phishing websites.

3.3 System architecture
entry in the training dataset. In the testing phase, the classifier determines whether a given website is a phishing site or not. A binary classifier classifies the websites into two possible categories, namely phishing and legitimate. When a user requests a new website, the trained classifier identifies the legitimacy of the given website from the generated feature vectors.

4 Features extraction

Given the limitation of search engine and third party dependent approaches presented in the literature, we utilize client-side specific features in our approach. We have selected eight URL-based features (F1–F8), one login form feature (F9), six hyperlink specific features (F10–F15), one CSS feature (F16), and three web identity features
one CSS feature (F16), and three web identity features
in URL [25]. In this feature, if a brand name is present and its position is not at the right place, then the site is marked as phishing. We have selected the top 500 phishing targets including banks, payment gateways, etc. The top names found in the phishing URLs are PayPal, Amazon, Apple, Yahoo, Dropbox, Google, AOL, USAA, etc.

Example http://forlittledrops.org/asd/Paypalaccount/, in the given phishing URL, “PayPal” is found in the path segment.

$$
F7 = \begin{cases} 1, & \text{if any top brand name is present at an incorrect position in the URL} \\ 0, & \text{otherwise} \end{cases} \tag{7}
$$

Algorithm 1: Fake Login form Detector Algorithm
Input: DOM tree of Suspicious URL
Output: Existence of Fake login form, F9 ∈ {0, 1}
Start
1: If the value of action field is blank, # or javascript:void(0) then set F9 = 1
2: If the value of action field is in the form of “filename.php” then set F9 = 1
3: If action field contains foreign base domain then set F9 = 1 else set F9 = 0
End

Fig. 3 Algorithm for detection of fake login form
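The three checks of Algorithm 1 (Fig. 3) can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: it uses the standard-library `html.parser` instead of BeautifulSoup, the helper names (`FormActionCollector`, `base_domain`, `fake_login_form`) are hypothetical, and the base domain is approximated by the last two host labels, which is an assumption that fails for suffixes such as `.co.uk`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class FormActionCollector(HTMLParser):
    """Collects the action attribute of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or valueless action attribute is treated as blank.
            self.actions.append(dict(attrs).get("action") or "")


def base_domain(host):
    # Assumption: base domain = last two labels of the host name.
    # A production system would consult the Public Suffix List instead.
    return ".".join(host.lower().split(".")[-2:])


def fake_login_form(html, page_url):
    """Return F9 = 1 if any form action looks suspicious, else 0."""
    page_base = base_domain(urlparse(page_url).hostname or "")
    collector = FormActionCollector()
    collector.feed(html)
    for action in collector.actions:
        action = action.strip()
        # Step 1: blank, "#", or javascript:void(0)
        if action in ("", "#") or action.lower().startswith("javascript:void(0)"):
            return 1
        # Step 2: a bare "filename.php" handler
        if re.fullmatch(r"[\w\-]+\.php", action, flags=re.IGNORECASE):
            return 1
        # Step 3: absolute action URL pointing at a foreign base domain
        host = urlparse(action).hostname
        if host and base_domain(host) != page_base:
            return 1
    # No suspicious action found (a page without forms also yields 0 here,
    # a case Algorithm 1 does not address explicitly).
    return 0
```

For instance, `fake_login_form('<form action="login.php"></form>', "https://www.example.com/")` returns 1 under step 2, while a form posting to a subdomain of the page's own base domain returns 0.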
$$
F12 = \begin{cases} 1, & \text{if } \dfrac{\text{Foreign Hyperlinks}}{\text{Total Hyperlinks}} > 0.5 \text{ and Total Hyperlinks in Webpage} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{11}
$$

F13: Empty hyperlinks Empty or null hyperlink returns on the same page when a user clicks on it. It increases the chance of user falling for the phishing scam since if a hyperlink is active, the user may end up reaching the original website if it is clicked. Thus, attacker prevents

F14: Error in hyperlinks This feature checks for errors in hyperlinks. The error “404 not found” occurs when a user requests a URL and the server cannot locate it. The attacker also adds some hyperlinks to the fake page which do not exist. We consider the 403 and 404 response codes of hyperlinks. In this feature, we calculate the ratio of hyperlinks that return an error.

$$
F14 = \begin{cases} 1, & \text{if } \dfrac{\text{Number of errors in Hyperlinks}}{\text{Total Hyperlinks}} > 0.3 \text{ and Total Hyperlinks in Webpage} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{13}
$$

F15: Hyperlink redirection In this feature, the system checks the number of hyperlinks redirected to some other place out of the total hyperlinks available in the website. In a phishing attack, URL redirection sometimes confuses users about which website they are surfing. The proposed approach considers response codes 301 and 302 for URL redirection. This feature results in 1 if the ratio of redirecting hyperlinks is greater than 0.3, else it results in 0.

$$
F15 = \begin{cases} 1, & \text{if } \dfrac{\text{Number of hyperlinks which are redirecting}}{\text{Total Hyperlinks}} > 0.3 \text{ and Total Hyperlinks} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{14}
$$

… legitimate site. However, numerous genuine websites use more than one external CSS file or include internal CSS style.

$$
F16 = \begin{cases} 1, & \text{if the CSS file is external and contains a foreign domain name} \\ 0, & \text{otherwise} \end{cases} \tag{15}
$$

4.5 Web identity based features

The phishing website is a mimicked fake copy of a popular brand or organisation, and it may have many identity features which are copied from the targeted page (e.g., favicon, copyright information, etc.), thereby claiming a false identity.
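The ratio-based hyperlink features F12, F14 and F15 share one pattern: a count divided by the total number of hyperlinks, compared against a threshold, with the guard that the page has at least one hyperlink. A minimal Python sketch of that pattern is given below. The function names and the input format (a list of `(href, status_code)` pairs whose status codes were obtained beforehand) are assumptions made for illustration, and the base domain is again approximated by the last two host labels.

```python
from urllib.parse import urlparse


def base_domain(host):
    # Assumption: base domain = last two labels of the host name.
    return ".".join(host.lower().split(".")[-2:])


def ratio_feature(matching, total, threshold):
    """Return 1 if total > 0 and matching / total > threshold, else 0."""
    return 1 if total > 0 and matching / total > threshold else 0


def hyperlink_features(page_url, links):
    """Compute F12, F14 and F15 from (href, status_code) pairs.

    Fetching the links and recording their HTTP status codes is assumed
    to happen elsewhere; this function only applies the thresholds.
    """
    page_base = base_domain(urlparse(page_url).hostname or "")
    total = len(links)
    foreign = errors = redirects = 0
    for href, status in links:
        host = urlparse(href).hostname
        if host and base_domain(host) != page_base:
            foreign += 1          # hyperlink points to a foreign domain
        if status in (403, 404):
            errors += 1           # hyperlink returns an error
        if status in (301, 302):
            redirects += 1        # hyperlink redirects elsewhere
    return {
        "F12": ratio_feature(foreign, total, 0.5),
        "F14": ratio_feature(errors, total, 0.3),
        "F15": ratio_feature(redirects, total, 0.3),
    }
```

Keeping the threshold test in one helper makes the "Total Hyperlinks > 0" guard explicit, so a page without hyperlinks yields 0 for all three features rather than a division by zero.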
F17: Copyright features The identity of a website can be extracted using the copyright information given in text form. The copyright field of a website contains the name of the organization. This feature extracts the keywords from the copyright field, tokenizes them, and matches them with the suspicious domain name. The symbols and keywords used to locate the copyright information are the @ symbol, the © symbol, &copy, copyright and all rights reserved.

$$
F17 = \begin{cases} 0, & \text{if a copyright keyword matches the base domain} \\ 1, & \text{otherwise} \end{cases} \tag{16}
$$

F18: Identity keywords Some specific keywords present in the website reveal its exact identity. We make a set of identity keywords, which includes title, meta and frequently appearing keywords. The approach applies the TF-IDF algorithm [22] to extract the most frequently appearing keywords. These extracted keywords are matched with the domain name of the suspicious site. If the site is legitimate, then one of the identity keywords should be part of the domain name. However, phishing websites include the identity keyword in the path segment of the URL to fool users, e.g. http://www.shoppiingg.com/www.amazon.com/cgi-bin/index.htm. The algorithm to find the fake identity of a website is explained in Fig. 4.

Algorithm 2: To find fake identity of Website
Input: the DOM tree of a website
Output: F18 ∈ {0, 1}, 1 - Phishing, 0 - legitimate
Start
1. Extract the identity keywords from title and meta tag
2. Extract the top keywords using tf-idf algorithm from website
3. Construct the identity keywords set from step 1 and step 2
4. If one of the identity keyword matched with the domain name then set F18 = 0
5. else set F18 = 1
End

Fig. 4 Algorithm to find fake identity of the website

F19: Favicon A favicon is a unique image icon associated with a particular website. An attacker may use the favicon of the targeted website to fool innocent users. A favicon is an .ico file linked to a URL, which is available in the link tag of the DOM tree. If the favicon shown in the address bar belongs to a different website than the present one, it is considered a phishing attempt. Therefore, if the favicon link contains a foreign domain, the feature results in 1, otherwise it results in 0.

$$
F19 = \begin{cases} 1, & \text{if a foreign domain is found in the favicon link} \\ 0, & \text{otherwise} \end{cases} \tag{17}
$$

5 Training dataset and performance metric

The proposed approach builds a binary classifier based on the features described in Sect. 4, which classifies phishing and legitimate websites correctly. This section describes the training and testing dataset, and the performance metrics used in our approach.

5.1 Training dataset

Our training dataset consists of 2141 phishing and 1918 legitimate websites. Table 2 presents the number of instances and the sources of the phishing and legitimate datasets. We have collected the phishing dataset from two sources, namely Phishtank [29] and Openphish [30]. These phishing datasets consist of verified URLs. Phishing websites are short lived. Therefore, we crawled them while they were active. The legitimate dataset is taken from various sources as shown in Table 2. Alexa is a highly reliable source of legitimate websites, and it ranks websites based on page views and unique site users. Popular sites get a high rank and unpopular sites are placed at a low rank. We added some high ranked and some low ranked websites to our dataset. Moreover, we have added payment gateway websites to the dataset because these are the perfect target of cyber-criminals. Moreover, our dataset comprises websites in numerous languages (e.g., English, Russian, Spanish, Portuguese, Hindi, Chinese, etc.) to test the language independent performance of our method. Every feature vector has one entry in the dataset for the nineteen defined features.

Table 2 Datasets used for training and testing
#   Dataset                    Number of instances   Category
1   Phishtank [29]             1528                  Phishing
2   Openphish [30]             613                   Phishing
3   Alexa [31]                 1600                  Legitimate
5   Payment gateway [32]       66                    Legitimate
6   Top banking website [33]   252                   Legitimate
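Algorithm 2 (Fig. 4) for the identity keyword feature F18 can be sketched in Python along the following lines. This is only an illustration under stated assumptions: plain term frequency over the page text is used as a stand-in for TF-IDF (which would need a reference corpus), the stop-word list and `top_k` value are arbitrary, the domain name is approximated by the second-to-last host label, and all names (`IdentityTextCollector`, `tokenize`, `fake_identity`) are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: a tiny illustrative stop-word list, not the one used in the paper.
STOPWORDS = {"the", "and", "for", "with", "your", "you", "this", "that", "are", "from"}


class IdentityTextCollector(HTMLParser):
    """Separates title/meta text (step 1) from other visible text (step 2)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.skip_depth = 0          # inside <script> or <style>
        self.identity_text = []
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("keywords", "description"):
            self.identity_text.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        target = self.identity_text if self.in_title else self.body_text
        target.append(data)


def tokenize(text):
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if len(w) > 2 and w not in STOPWORDS]


def fake_identity(html, page_url, top_k=5):
    """Return F18 = 0 if an identity keyword occurs in the domain name, else 1."""
    parser = IdentityTextCollector()
    parser.feed(html)
    # Step 1: keywords from title and meta tags.
    keywords = set(tokenize(" ".join(parser.identity_text)))
    # Step 2: most frequent page terms (term frequency as a TF-IDF stand-in).
    frequent = Counter(tokenize(" ".join(parser.body_text))).most_common(top_k)
    # Step 3: union of both keyword sets.
    keywords.update(word for word, _ in frequent)
    # Steps 4 and 5: match against the domain label (assumed: second-to-last label).
    labels = (urlparse(page_url).hostname or "").lower().split(".")
    domain_label = labels[-2] if len(labels) >= 2 else labels[0]
    return 0 if any(k in domain_label for k in keywords) else 1
```

With the example URL from the text, a page whose title and body mention "Amazon" but which is served from `www.shoppiingg.com` gives F18 = 1, because the keyword appears only in the path segment and not in the domain name.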
… created a separate function for each feature. Different libraries are required for the extraction of features from the webpage. These libraries can be installed individually using either the pip installer for Python or by downloading and extracting them from the official websites. The following libraries are used during execution of the code:

• BeautifulSoup This library is used for pulling data from HTML and XML files.
• urllib2 This library is used to get the response object from the URL, which extracts all the resources from the webpage.
• re This library is used to perform a regular expression match of the desired string to another.
• time This library is used to capture the time of a particular instance.

Table 3 Performance measures used in our approach
Measure        Formula                                                                       Description
TPR            $\mathrm{TPR} = \frac{N_{P \to P}}{N_P} \times 100$                           Rate of phishing websites classified as phishing out of total phishing websites
FPR            $\mathrm{FPR} = \frac{N_{L \to P}}{N_L} \times 100$                           Rate of legitimate websites classified as phishing out of total legitimate websites
FNR            $\mathrm{FNR} = \frac{N_{P \to L}}{N_P} \times 100$                           Rate of phishing websites classified as legitimate out of total phishing websites
TNR            $\mathrm{TNR} = \frac{N_{L \to L}}{N_L} \times 100$                           Rate of legitimate websites classified as legitimate out of total legitimate websites
Accuracy (A)   $\mathrm{Accuracy} = \frac{N_{L \to L} + N_{P \to P}}{N_L + N_P} \times 100$  Rate of phishing and legitimate websites which are identified correctly with respect to all the websites
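The measures of Table 3 follow directly from the four confusion-matrix counts. The short Python sketch below computes them; the function and parameter names are illustrative. The counts in the usage note are hypothetical: the paper does not list raw counts, so they were chosen only because, for 2141 phishing and 1918 legitimate websites, they round to the reported rates.

```python
def performance_measures(n_pp, n_pl, n_ll, n_lp):
    """Compute TPR, FPR, FNR, TNR and accuracy (all in percent).

    n_pp: phishing classified as phishing      (N_{P->P})
    n_pl: phishing classified as legitimate    (N_{P->L})
    n_ll: legitimate classified as legitimate  (N_{L->L})
    n_lp: legitimate classified as phishing    (N_{L->P})
    """
    n_p = n_pp + n_pl   # total phishing websites, N_P
    n_l = n_ll + n_lp   # total legitimate websites, N_L
    return {
        "TPR": 100.0 * n_pp / n_p,
        "FPR": 100.0 * n_lp / n_l,
        "FNR": 100.0 * n_pl / n_p,
        "TNR": 100.0 * n_ll / n_l,
        "Accuracy": 100.0 * (n_ll + n_pp) / (n_l + n_p),
    }


# Hypothetical counts consistent with the dataset sizes (2141 and 1918).
measures = performance_measures(n_pp=2128, n_pl=13, n_ll=1894, n_lp=24)
rounded = {name: round(value, 2) for name, value in measures.items()}
```

With these assumed counts, `rounded` is `{"TPR": 99.39, "FPR": 1.25, "FNR": 0.61, "TNR": 98.75, "Accuracy": 99.09}`, which matches the overall figures reported for the proposed approach. Note that TPR + FNR and TNR + FPR each sum to 100 by construction.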
6.2 Complexity of the proposed approach
Fig. 5 ROC curve on random forest classifier

6.3 Results on popular classification algorithms

Table 4 presents the performance of our approach on popular and widely accepted classifiers in terms of TPR, FPR, and accuracy. WEKA software is used to judge the performance of the proposed technique on various machine learning classifiers. We have evaluated our dataset with 10-fold cross-validation, which uses 90% of the data for training and the remaining 10% for testing. Random forest outperformed SVM, neural networks, logistic regression and naïve Bayes, achieving the highest TPR and accuracy. We have also explored the area under the ROC (Receiver Operating Characteristic) curve to find a better metric of precision. In our experiment, the area under the ROC curve for phishing websites is 99.85 for the random forest as shown in Fig. 5, which shows that our approach has high accuracy in classifying websites correctly.

6.4 Evaluation of features

In this experiment, we evaluated the performance of our approach. The random forest classification algorithm is used to classify the websites. Moreover, we also evaluated the efficiency of each category of the proposed feature set. Figure 6 shows the classification results of URL-based features (feature 1 to feature 8). As seen in the figure, URL based features can correctly filter 75.03% of legitimate websites and 66.37% of phishing websites. Figure 7 presents the performance of the fake login form detector. As seen in the figure, our login form detection algorithm can correctly classify a high proportion of legitimate websites and provides 88.49% accuracy. This is because attackers change the link of the login form handler to send the user’s details to their desired source, and our fake login form algorithm successfully detects it. Figure 8 presents the results of hyperlink based features, and it shows that these features are most significant in the classification of

[Figure: bar chart titled “Feature {F9}”; categories: TPR, TNR, FPR, FNR, Accuracy; surviving data labels: 95.20%, 88.49%, 82.48%, 17.52%, 4.80%]

Fig. 8 Results of hyperlink based features [bar chart; categories: TPR, TNR, FPR, FNR, Accuracy; surviving data labels: 10.27%, 6.31%]
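The 10-fold cross-validation protocol described in Sect. 6.3 (each fold trains on roughly 90% of the data and tests on the remaining 10%) can be illustrated with a small Python sketch. This is not WEKA's procedure: it produces plain contiguous folds without the shuffling or class stratification a real evaluation would normally apply, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Fold sizes differ by at most one sample when n_samples is not
    divisible by k. In practice the data should be shuffled (and
    ideally stratified by class) before splitting.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# The dataset in the paper has 2141 + 1918 = 4059 websites.
folds = list(k_fold_indices(4059, k=10))
test_sizes = [len(test) for _, test in folds]
```

For 4059 samples this yields nine test folds of 406 websites and one of 405, so every website is used for testing exactly once and for training in the other nine rounds; the reported measures are then averaged over the ten rounds.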
Fig. 9 Results of identity based features [bar chart; categories: TPR, TNR, FPR, FNR, Accuracy; surviving data labels: 11.31%, 8.36%]

Table 6 Performance of proposed approach on different combination of features
Features                                             TPR (%)   FPR (%)   Accuracy (%)
FURL                                                 66.37     24.97     70.46
FHYPERLINK                                           93.69     10.27     91.82
FIDENTITY                                            91.64     11.31     90.24
FCSS + FLOGINFORM                                    85.61     6.26      89.46
FURL + FHYPERLINK + FIDENTITY + FCSS + FLOGINFORM    99.39     1.25      99.09

Fig. 10 Overall results of proposed approach [bar chart; categories: TPR, TNR, FPR, FNR, Accuracy; data labels: 99.39%, 98.75%, 1.25%, 0.61%, 99.09%]

[Figure: PCC plot; x-axis: FEATURE (1 to 19); y-axis: PCC, from -0.8 to 1]
tries to fetch all defined features from the webpage URL and textual content as discussed in Sect. 4. Then, based on the extracted feature values, the trained classifier classifies the current URL and shows the result as phishing or legitimate. We selected some random URLs from our dataset to check the runtime of our approach. The total response time of our approach for extracting and computing the feature vector and producing the result is 2350 ± 3500 ms. This response time is relatively low and acceptable in a real time environment. The classifier is not dependent on any third party services, and it does not need to wait for the results returned by these services. Hence, our approach is fast compared to other solutions.

6.7 Comparison with existing anti-phishing approaches

In this experiment, we have compared the proposed method with benchmark anti-phishing approaches. Table 7 presents a comparison based on TPR, FPR, accuracy (ACC), language independent solution (LI), search engine independent solution (SEI), and third party services independent (TSI). As seen in the table, our work gives the highest detection accuracy among the approaches discussed in the literature. The works of Tan et al. [23] and Chiew et al. [24] give a TPR higher than our approach. However, these two approaches produce a very high FPR compared to our approach. There is a trade-off between TPR and FPR, so a good anti-phishing system should provide balanced TPR and FPR. Most of the previous approaches have used the search engine in the feature set [10,19,20,22–24]. However, there are several drawbacks to search engine based features. First, new genuine sites do not appear in top search results, and this feature leads to wrong predictions. Second, search engines do not produce accurate outcomes for non-English queries [34]. Therefore, our approach detects the websites missed by the search engine and produces a low FPR.

Table 7 Comparison of proposed approach with other standard approaches
Approach               TPR (%)   FPR (%)   ACC (%)   SEI   LI    TSI
Montaze et al. [18]    88        12        88        Yes   Yes   No
Xiang et al. [19]      92        0.4       95.8      No    No    No
Gowtham et al. [10]    98.24     1.71      98.25     No    Yes   No
Zhang et al. [22]      97        6         95        No    No    No
Tan et al. [23]        99.68     7.48      96.10     No    No    No
Chiew et al. [24]      99.8      13        93.4      No    Yes   Yes
El-Alfy et al. [20]    97.24     3.88      96.74     No    No    No
Zhang et al. [21]      98.64     0.53      99.04     Yes   Yes   No
Proposed approach      99.39     1.25      99.09     Yes   Yes   Yes

7 Advantages of our approach

7.1 Language independency

The language barrier has been a bottleneck in most of the existing approaches. English is used as the textual language for only about 52.1% [35] of websites. Therefore, language independence becomes a critical issue for any anti-phishing scheme. In our approach, most of the features are language independent (F1–F16, F19). Therefore, our system yields accurate results to a large extent. Most of the previous approaches have used the search engine and textual content in the feature set [10,19–23]. Search engines do not produce correct outcomes for non-English queries [34]. Therefore, our approach detects the websites missed by the search engine and produces a low FPR. Furthermore, our testing dataset comprises websites in various languages, and experimental results on the dataset show the high detection accuracy of our approach.

7.2 Low response time

Low response time is a necessary and critical requirement for a phishing detection system, which is another reason for choosing client side features. Using the proposed features for the detection of phishing webpages provides an average response time of around 2–6 s, which is quite low compared to existing alternatives such as other machine learning and visual similarity techniques, which give a response time of around 10–13 s. Accessing the source code and producing the result requires a negligible amount of time.

7.3 Third party independency

Several approaches use third-party dependent features in the classification [10,18–23]. We have not chosen these features (such as DNS, blacklist/whitelist, whois record, certifying authority, etc.) for the following reasons.

• Blacklists and whitelists contain a limited number of URLs and do not cover all websites.
• Some of the approaches verify the domain age from a whois lookup. From the Phishtank dataset [29], we find that more than 30% of phishing webpages are hosted on a compromised domain. If a website is hosted on a compromised domain, the domain age feature gives the age of the compromised domain, which leads the approach to a wrong prediction.
• Certifying authorities do not certify every legitimate website, so such features wrongly classify new legitimate websites.
• The DNS database may also be poisoned.
• Third party dependent features create additional network delay that can cause a high prediction time.

7.4 Compromised domain detection

Nowadays, cybercriminals host phishing webpages on publicly available websites by exploiting vulnerabilities using various phishing tools. Hosting phishing webpages on compromised domains allows large scale deployment and provides numerous advantages to cybercriminals. A hacker does not require a web hosting server to deploy the phishing webpage. Our approach is based on source code, and it does not include features which give false predictions in the case of a compromised domain. Some features like the age of the domain, certifying authority, WHOIS lookup, etc. provide incorrect information in the case of a compromised website and produce false results. The search based techniques compare the domain name with the top ‘T’ search results to check the legitimacy. In a compromised domain attack, the domain name is genuine and most of the time it appears in the top ‘T’ search results, so these methods fail to detect the compromised domain.

Visual similarity techniques compare the visual appearance of the suspicious webpage with its corresponding authentic webpage stored in a local database. These techniques detect the compromised webpage only if its corresponding legitimate webpage is present in the local database. Maintaining a large database that contains every legitimate webpage is a tough task. In contrast, our approach can identify the compromised domain to a large extent because it does not depend on a local database for comparisons.

7.5 Client side application

The webpage is labelled as phishing or legitimate based on the page source of the webpage, and no other resource is required. Most anti-phishing techniques require resources such as OCR, DNS, SE, etc. Our approach uses the best features and delivers a competitive detection rate with minimum resources. This makes our approach portable and platform independent (i.e. performance is not affected by changes in the facilities supplied by third parties or protocols).

8 Conclusion and future work

This paper presented a novel approach for filtering phishing websites at the client side where URL, hyperlink, CSS, login form, and identity features are used. The main contributions

… constructed a dataset collected from various sources and included a variety of websites to validate the proposed solution. Our experimental results on the dataset showed that the solution is very efficient as it has a 99.39% true positive rate and only a 1.25% false positive rate. The proposed approach also has good accuracy compared to other existing anti-phishing solutions.

The feature set of our phishing detection approach entirely depends on the URL and source code of the website, so it can detect only webpages written in HTML code. Therefore, the identification of non-HTML websites is the aim of our future work. Nowadays, mobile devices are more popular and seem to be a perfect target for malicious attacks like mobile phishing. Therefore, detecting phishing websites in the mobile environment is a challenge for further research and development.

Acknowledgements This research work is being supported by Sir Visvesvaraya Young Faculty Research Fellowship Grant from Ministry of Electronics & Information Technology (MeitY), Government of India.

References

1. Jain, A. K., & Gupta, B. B. (2017). Detection of phishing attacks in financial and e-banking websites using link and visual similarity relation. International Journal of Information and Computer Security, Inderscience, 2017 (Forthcoming Articles).
2. Gupta, S., & Gupta, B. B. (2017). Detection, avoidance, and attack pattern mechanisms in modern web application vulnerabilities: Present and future challenges. International Journal of Cloud Applications and Computing, 7(3), 1–43.
3. Almomani, A., et al. (2013). A survey of phishing email filtering techniques. IEEE Communications Surveys & Tutorials, 15(4), 2070–2090.
4. Gupta, B. B., et al. (2017). Fighting against phishing attacks: State of the art and future challenges. Neural Computing and Applications, 28(12), 3629–3654.
5. APWG Q4 2016 Report. Available at: http://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf. Last accessed on September 22, 2017.
6. Razorthorn phishing report. Available at: http://www.razorthorn.co.uk/wp-content/uploads/2017/01/Phishing-Stats-2016.pdf. Last accessed on September 22, 2017.
7. Purkait, S. (2015). Examining the effectiveness of phishing filters against DNS based phishing attacks. Information & Computer Security, 23(3), 333–346.
8. Huang, Z., Liu, S., Mao, X., Chen, K., & Li, J. (2017). Insight of the protection for data security under selective opening attacks. Information Sciences, 412–413, 223–241.
9. Li, J., Chen, X., Huang, X., Tang, S., Xiang, Y., Hassan, M. M., et al. (2015). Secure distributed deduplication systems with improved reliability. IEEE Transactions on Computers, 64(12), 3569–3579.
10. Gowtham, R., & Krishnamurthi, I. (2014). A comprehensive and
efficacious architecture for detecting phishing webpages. Comput-
of this paper is the identification of various new client-side ers & Security, 40, 23–37.
specific features that are previously not studied. Furthermore, 11. Aboudi, N. E., & Benhlima, L. (2017). Parallel and distributed pop-
we also created a new heuristic for each feature. We have con- ulation based feature selection framework for health monitoring.
International Journal of Cloud Applications and Computing, 7(1), 57–71.
12. Sahoo, D., Liu, C., & Hoi, S. C. H. (2017). Malicious URL detection using machine learning: A survey. arXiv:1701.07179.
13. Arachchilage, N. A. G., Love, S., & Beznosov, K. (2016). Phishing threat avoidance behaviour: An empirical investigation. Computers in Human Behavior, 60, 185–197.
14. Sheng, S., Magnien, B., Kumaraguru, P., Acquisti, A., Cranor, L. F., Hong, J., & Nunge, E. (2007). Anti-phishing phil: The design and evaluation of a game that teaches people not to fall for phish. In Proceedings of the 3rd symposium on usable privacy and security, Pittsburgh, (pp. 88–99).
15. Jain, A. K., & Gupta, B. B. (2016). A novel approach to protect against phishing attacks at client side using auto-updated white-list. EURASIP Journal of Information Security, 2016, 1–11.
16. Sheng, S., Wardman, B., Warner, G., Cranor, L. F., Hong, J., & Zhang, C. (2009). An empirical analysis of phishing blacklists. In Proceedings of the 6th Conference on Email and Anti-Spam (CEAS’09).
17. Jain, A. K., & Gupta, B. B. (2017). Phishing detection: Analysis of visual similarity based approaches. Security and Communication Networks, 2017, Article ID 5421046, 20 pages, https://doi.org/10.1155/2017/5421046.
18. Montazer, G. A., & ArabYarmohammadi, S. (2015). Detection of phishing attacks in Iranian e-banking using a fuzzy-rough hybrid system. Applied Soft Computing, 35, 482–492.
19. Xiang, G., Hong, J., Rose, C. P., & Cranor, L. (2011). Cantina+: A feature-rich machine learning framework for detecting phishing web sites. ACM Transactions on Information and System Security (TISSEC), 14(2), 21.
20. El-Alfy, E. S. M. (2017). Detection of phishing websites based on probabilistic neural networks and K-medoids clustering. The Computer Journal. https://doi.org/10.1093/comjnl/bxx035.
21. Zhang, W., Jiang, Q., Chen, L., & Li, C. (2017). Two-stage ELM for phishing Web pages detection using hybrid features. World Wide Web, 20(4), 797–813.
22. Zhang, Y., Hong, J. I., & Cranor, L. F. (2007). Cantina: A content-based approach to detecting phishing web sites. In Proceedings of the 16th international conference on world wide web, (pp. 639–648).
23. Tan, C. L., Chiew, K. L., Wong, K., & Sze, S. N. (2016). PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder. Decision Support Systems, 88, 18–27.
24. Chiew, K. L., Chang, E. H., & Tiong, W. K. (2015). Utilisation of website logo for phishing detection. Computers & Security, 54, 16–26.
25. APWG 2014 H2 Report. Available at: https://docs.apwg.org/reports/apwg_trends_report_q3_2014.pdf. Last accessed on September 22, 2017.
26. Dataurization of URLs for a more effective phishing campaign. Available at: https://thehackerblog.com/dataurization-of-urls-for-a-more-effective-phishing-campaign/index.html. Last accessed on September 10, 2017.
27. Geng, G. G., Yang, X. T., Wang, W., & Meng, C. J. (2014). A Taxonomy of hyperlink hiding techniques. In Asia-Pacific web conference, (pp. 165–176).
28. Jain, A. K., & Gupta, B. B. (2017). Two-level authentication approach to protect from phishing attacks in real time. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-017-0616-z.
29. Verified Phishing URL. Available at: https://www.phishtank.com. Last accessed on September 22, 2017.
30. Phishing dataset. Available at: https://www.openphish.com/. Last accessed on September 27, 2017.
31. Alexa Most Popular sites. Available at: http://www.alexa.com/topsites. Last accessed on September 22, 2017.
32. List of online payment gateways. Available at: http://research.omicsgroup.org/index.php/List_of_online_payment_service_providers. Last accessed on September 27, 2017.
33. Top banking websites in the world. Available at: https://www.similarweb.com/top-websites/category/finance/banking. Last accessed on September 27, 2017.
34. Chu, P., Komlodi, A., & Rózsa, G. (2015). Online search in english as a non-native language. Proceedings of the Association for Information Science and Technology, 52(1), 1–9.
35. Percentages of websites using various content languages. Available at: https://w3techs.com/technologies/overview/content_language/all. Last accessed on September 22, 2017.

Ankit Kumar Jain is presently working as Assistant Professor in National Institute of Technology, Kurukshetra, India. He received Master of technology from Indian Institute of Information Technology Allahabad (IIIT) India. Currently, he is pursuing Ph.D. in cyber security from National Institute of Technology, Kurukshetra. His general research interest is in the area of Information and Cyber security, Phishing Website Detection, Web security, Mobile Security, Online Social Network and Machine Learning. He has published many papers in reputed journals and conferences.

B. B. Gupta received Ph.D. degree from Indian Institute of Technology Roorkee, India in the area of Information and Cyber Security. In 2009, he was selected for Canadian Commonwealth Scholarship awarded by Government of Canada. He published more than 100 research papers (including 02 books and 14 book chapters) in International Journals and Conferences of high repute including IEEE, Elsevier, ACM, Springer, Wiley, Taylor & Francis, Inderscience, etc. He has visited several countries, i.e. Canada, Japan, Malaysia, China, Hong-Kong, etc to present his research work. His biography was selected and published in the 30th Edition of Marquis Who’s Who in the World, 2012. Dr. Gupta also received Young Faculty research fellowship award from Ministry of Electronics and Information Technology, government of India in 2017. He is also working as principal investigator of various R&D projects. He is serving as associate editor of IEEE Access and Executive editor of IJITCA, Inderscience, respectively. He is also serving as reviewer for Journals of IEEE, Springer, Wiley, Taylor & Francis, etc. He is also serving as guest editor of various reputed Journals. He was also visiting researcher with Yamaguchi University, Japan in January 2015. At present, Dr. Gupta is working as Assistant Professor in the Department of Computer Engineering, National Institute of Technology Kurukshetra India. His research interest includes Information security, Cyber Security, Cloud Computing, Web security, Intrusion detection and Phishing.