Phishing Website Detection Using Fuzzy Logic: Twinkll Sisodia Simran Choudhary
Abstract: Detecting phishing websites is one of the most important research topics in information security. Because of the nonlinear nature of phishers' deception tricks, the random and vague behavior of network traffic, and the large number of features in the problem space, phishing website detection is a complex problem. Since these parameters cannot easily be combined using a mathematical formula, fuzzy logic can be used to combine them. This paper presents a novel approach, based on fuzzy logic and fuzzy inference rules, to detect phishing websites. The results show that applying fuzzy logic reduces the false positive rate and the miss rate.

Keywords: Phishing website; False Positive Rate; Miss Rate

I. INTRODUCTION

Phishing is a form of online identity theft that employs both social engineering and technical subterfuge to steal consumers' personal identity data and financial account credentials. It is an attempt to obtain sensitive data and valid user credentials by luring users into entering them on legitimate-looking websites [1]. Attackers try to maximize the number of successfully fooled users while minimizing the chance of their attack being detected and removed by an authority [2].

Because the number of internet users is increasing rapidly, the number of phishing attacks is also increasing. Phishing attack volumes rose by 59% from 2011 to 2012, and global losses due to phishing were estimated at $1.5 billion in 2012, an increase of 22% over 2011. Phishers mostly attack financial activities, and because of Canada's stronger economic growth, phishing attacks there increased by 400% in 2012 [9,11]. In 2012, cyber criminals used simple hosting tactics and hijacked websites to launch phishing attacks. Web shells, smarter web analytics tools and automated toolkits were used to hack large numbers of websites. RSA analysts noted that combined attack schemes were in use to phish users and redirect them to infection points.

With the launch of 4G channels in mobile communication in 2013 and the growth of mobile usage in personal and office life, phishing attacks were forecast to be directed more at mobile and smartphone users. The expected attacks would arrive by voice (vishing), mobile applications, SMS (smishing) and spam emails that users open on their mobile devices. The use of social networking, shopping and gaming applications is very common: from the Google and Apple stores alone, 195 billion apps were downloaded in 2015.

According to Steven Myers, a phishing attack in general consists of The Lure (e.g. email spamming), The Hook (e.g. a phishing website) and The Catch/Kill (e.g. identity theft using the credentials). Aaron Emigh uses a more detailed eight-step model:

Step 0: A preparation phase, e.g. registering a domain.
Step 1: A malicious message is sent out.
Step 2: The user responds to the message in some way.
Step 3: The user is prompted to provide confidential information.
Step 4: The user answers this prompt.
Step 5: The confidential information is transmitted back to the attacker.
Step 6: The attacker impersonates the person.
Step 7: The attacker engages in further fraud, e.g. monetizing the data.

Besides differing in granularity, the two models differ slightly in where they place the end of the life cycle.

At least 67,677 phishing attacks were reported by the Anti-Phishing Working Group (APWG) in the last six months of 2013 [9]. The latest reports showed that most phishing attacks are spear phishing attacks that target the financial, business and payment sectors [3]. The number of phishing attacks and phishing sites is increasing rapidly; on average, Sophos identifies 16,173 malicious web pages every day [7]. Even though web browsers (e.g. Mozilla Firefox, Internet Explorer, Opera, Google Chrome) provide add-on tools for blocking phishing emails and phishing sites, phishers still manage to override these security mechanisms. Phishing has become so sophisticated that phishers can bypass the filters set by current anti-phishing techniques. The rapidly increasing number of phishing attacks suggests that it is difficult to find a single logical procedure to detect phishing emails and that existing anti-phishing tools are not sufficient. This may be attributed to the mostly passive approach of anti-phishing techniques: they do not stop the source of the phishing emails but simply classify and detect them [6]. A proactive approach to minimizing phishing has been proposed in which the system removes a phishing page from the host server rather than just filtering email and flagging suspected messages as spam [5]. The study in [6], however, assumes that emails have already been classified as phishing or legitimate. That study ignored phishing email classification and was more concerned with how to deal with the phisher once a phishing website has been detected [8]. This paper proposes a fuzzy rule-based phishing detection framework that combines phishing website classification with the proactive approach of stopping phishers at the source as a result of the classification. The study takes into consideration different features of the URL in classifying phishing websites.

II. FEATURE EXTRACTION

We investigated a large number of the features proposed in [8] that contribute to classifying the type of a website, and selected 24 major features, which we further classified into four categories. The dataset used in the experiments consists of 500 legitimate and 600 fake websites, collected from Phishtank (www.phishtank.com).

A. URL Based

1) URL Length: Phishers hide the suspicious part of the URL in order to redirect the information submitted by users, or to redirect the uploaded page to a suspicious domain. There is no established length that reliably differentiates phishing URLs from legitimate ones. Here, a URL longer than 50 characters is considered phishy.

2) IP Address: Phishing URLs often contain IP addresses to hide the actual URL of the website. A website URL may be extremely long and look suspicious, whereas a URL that contains an IP address is typically shorter and looks more standard, like http://96.145.190.245. IP addresses are used by phishers to conceal the actual domain name of the website being visited. Sometimes IP addresses are combined with actual keywords or legitimate URLs that are not actually used during URL navigation, for example http://[email protected].

3) Sub-domains: Another technique phishers use to scam users is adding a sub-domain to the URL so that users believe they are dealing with an authentic website, like http://www.paypal.it.ascendancetheatrearts.co.uk.

4) Hexadecimal Characters: Legitimate sites often use hexadecimal notation for punctuation and symbols such as the question mark, period, apostrophe, comma and space. Web browsers understand hexadecimal values, which can be used in URLs by preceding the hexadecimal value with a % symbol. Phishing sites typically use hexadecimal values to disguise the actual letters and numbers in the URL.

5) Registered URL: If the website identity does not match a record in the WHOIS database, the website is classified as phishy.

6) Unusual port number: Spoofguard identified several standard port numbers: 21, 70, 80, 443 and 1080. These correspond to common services used in web browsers: FTP, Gopher, web, secure web and SOCKS. If a suspicious, unknown port number is used, the phishing score is increased, because attackers often use different port numbers to bypass security detection programs that may monitor a specific port number.

B. Anomalous Based

1) Suspicious Symbol: Two symbols are common in phishing URLs: @ and -. The most common and dangerous character used in phishing URLs is the @ character. Web browsers use this character to automatically pass a username to a secure site: the username precedes the @ symbol and the destination URL follows it. The problem is that if the website is not set up to handle a secure connection, the web browser navigates to the destination URL without any error message. Phishers exploit this weakness by filling the username field with a legitimate URL while the destination URL is a phishing site. For example, a URL of the form http://www.sbi.com@phishingsite.com navigates to the destination phishingsite.com and attempts to log in using www.sbi.com as the username. The dash (-) is the other suspicious character; however, it was determined that the dash is not a good indicator of a phishing site.

2) Adding prefix and suffix: Phishers try to scam users by reshaping the suspicious URL so that it looks legitimate. One technique is to add a prefix or suffix to the legitimate URL so that the user may not notice any difference, like www.sbi-online.com.

3) Domain misspelled: Here the URL either does not indicate which organization is being phished, or the domain name is misspelled, with a repeated or missing character, as in www.paypayl.com.

4) Fake HTTPS protocol/SSL Final: The presence of the HTTPS protocol suggests that the user is connected to an honest website. However, phishers may use a fake HTTPS protocol to deceive users. It is therefore recommended to check that the HTTPS certificate is issued by a trusted Certificate Authority such as GeoTrust, GoDaddy, Thawte or VeriSign.

5) Redirection: Sometimes phishers want to hide the URL completely by using a URL redirection service. This type of service shortens the URL into a series of letters or numbers. Services like tinyurl.com and notlong.com allow anyone to enter a URL, and the service creates a shortcut to it, such as http://www.tinyurl.com/12345. This can be used both for good purposes, to shorten a URL, and for malicious purposes, to hide the actual URL.
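
Several of the URL-based and anomalous checks described in this section can be computed directly from the URL string. The following Python sketch is only an illustration under stated assumptions, not the implementation used in this work. The function name url_flags and the sub-domain threshold MAX_DOTS are hypothetical (the text gives no numeric threshold for sub-domains), while the 50-character limit, the standard port list and the two redirection services are taken from the descriptions above. Checks that need external data (WHOIS records, certificate validation, misspelling detection) are left out.

```python
import ipaddress
import re
from urllib.parse import urlsplit

MAX_LENGTH = 50                              # threshold from "URL Length"
STANDARD_PORTS = {21, 70, 80, 443, 1080}     # ports from "Unusual port number"
SHORTENERS = {"tinyurl.com", "notlong.com"}  # services named under "Redirection"
MAX_DOTS = 3  # assumption: more dots than this in the host counts as "many sub-domains"


def is_ip(host):
    """Return True if the host part is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_flags(url):
    """Return boolean indicators for a URL that includes a scheme (http://...)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        port = parts.port          # None when the URL gives no explicit port
    except ValueError:
        port = -1                  # malformed port: treat as unusual
    bare_host = host[4:] if host.startswith("www.") else host
    return {
        "long_url": len(url) > MAX_LENGTH,
        "ip_address": is_ip(host),
        "many_subdomains": host.count(".") > MAX_DOTS,
        "hex_chars": re.search(r"%[0-9a-fA-F]{2}", url) is not None,
        "unusual_port": port is not None and port not in STANDARD_PORTS,
        "at_symbol": "@" in parts.netloc,
        "dash_in_domain": "-" in host,   # weak indicator, as noted in the text
        "redirect_service": bare_host in SHORTENERS,
    }


if __name__ == "__main__":
    for example in ("http://www.sbi.com@phishingsite.com",
                    "http://96.145.190.245",
                    "http://www.tinyurl.com/12345"):
        raised = [name for name, hit in url_flags(example).items() if hit]
        print(example, raised)
```

Each flag is a crisp yes/no value. To feed the fuzzy stage, such indicators would still have to be mapped onto the 0 to 10 input scale, and that mapping is not specified here.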
C. Content-Based

1) External Object Sources (URL): A webpage usually consists of text and some objects such as images and videos. Typically, these objects are loaded into the webpage from the same server as the webpage. If the objects are loaded from a domain other than the one typed in the URL address bar, the webpage is potentially suspicious.

7) Pop-up Window: Authentic sites do not ask users to submit their credentials via a pop-up window.

8) Hiding the links: Phishers often hide the suspicious link by showing a fake link in the browser's status bar or by hiding the status bar itself. This can be achieved by tracking the mouse cursor and changing the status bar content once the user reaches the suspicious link. Some fraudsters use the JavaScript event handler onMouseOver to show a false URL in the status bar of the user's email application.

9) Server Form Handler (SFH): Once the user has submitted information, the webpage transfers it to a server for processing. Normally, the information is processed by the same domain from which the webpage is loaded. Phishers either leave the server form handler empty or have the information transferred somewhere other than the legitimate domain.

III. FUZZY DATA MINING ALGORITHMS & TECHNIQUES

The approach described here applies fuzzy logic and fuzzy rule-based inference to assess phishing website risk from the 24 characteristics and factors that mark a forged website. The essential advantage offered by fuzzy logic techniques is the use of linguistic variables to represent key phishing characteristic indicators and relate them to phishing website probability.

1) Fuzzification
In this step, linguistic descriptors such as High, Low and Medium are assigned to a range of values for each key phishing characteristic indicator. Valid ranges of the inputs are considered and divided into classes, or fuzzy sets. For example, the length of a URL can range from low to high with other values in between. We cannot specify clear boundaries between classes. The degree to which a value of a variable belongs to a selected class is called the degree of membership. A membership function is designed for each phishing characteristic indicator; it is a curve that defines how each point in the input space is mapped to a membership value in [0, 1].

Linguistic values are assigned to each phishing indicator as Low, Moderate and High, and to the phishing website risk rate as Very legitimate, Legitimate, Suspicious, Phishy and Very phishy (using triangular and trapezoidal membership functions). Each input ranges from 0 to 10, while the output ranges from 0 to 100. The fuzzy representation more closely matches human cognition, thereby facilitating expert input and more reliably representing experts' understanding of the underlying dynamics.

2) Rule Generation
Having specified the risk of a phishing website and its key phishing characteristic indicators, the next step is to specify how the phishing website probability varies. Experts provide fuzzy rules in the form of if...then statements that relate phishing website probability to various levels of the key phishing characteristic indicators, based on their knowledge and experience.

3) Aggregation of the rule outputs
This is the process of unifying the outputs of all discovered rules: the membership functions of all the rule consequents, previously scaled, are combined into a single fuzzy set (the output).

4) Defuzzification
This is the process of transforming the fuzzy output of a fuzzy inference system into a crisp output. Fuzziness helps to evaluate the rules, but the final output has to be a crisp number. The input to the defuzzification process is the aggregate output fuzzy set and the output is a number. This step was done using the Centroid technique, since it is a commonly used method. The output is the phishing website risk rate, defined in fuzzy sets ranging from very phishy to very legitimate. The fuzzy output set is then defuzzified to arrive at a scalar value.

5) False Positive Rate
The rate at which legitimate websites are reported as phishing websites is called the False Positive Rate.

6) Miss Rate
The rate at which phishing websites are reported as legitimate websites is called the Miss Rate.

IV. FUZZY INFERENCE RULES

Rule Base for Category 1: There are six input parameters for the rule base and it has one output. It contains all the if-then rules of the system. In each rule base, every component is assumed to take one of three values (based on the linguistic descriptor), and each criterion has six components. Hence rule base 1 contains 3^6 = 729 entries. The output of rule base 1 is one of the phishing website rate fuzzy sets (Genuine, Suspicious or Fraud), representing the URL criteria phishing risk rate.

Rule Base for Category 2:

Rule #   Spelling Errors   Keywords   Embedded Links   Email Content Domain
1        Low               Low        Moderate         Genuine
2        Low               Moderate   Moderate         Suspicious
3        High              High       High             Fraud
4        Low               Low        Low              Genuine
5        High              Moderate   Moderate         Fraud
6        Moderate          Low        Moderate         Suspicious
7        Moderate          Moderate   Low              Suspicious

V. RESULTS

Publicly available datasets from Phishtank were used for simulation. There are two stages in determining the fuzzy data mining inference rules. 1100 sample instances from the Phishtank archive are used. For rule base 1, there are 6 identified phishing website characteristics based on the URL-based approach; the assigned weight is 0.4. For rule base 2, there are 5 identified characteristics of phishing websites based on the Anomalous-based approach; the assigned weight is 0.3. For rule base 3, there are 9 identified content-based characteristics; the assigned weight is 0.2. For rule base 4, there are 4 general characteristics for detection; the assigned weight is 0.1. The website rating is computed as

Website rating = 0.4 * URL crisp (rule base 1) + 0.3 * Anomalous crisp (rule base 2) + 0.2 * Content-based crisp (rule base 3) + 0.1 * General crisp (rule base 4).

The initial results showed that the URL and Anomalous Domain and the Content-based Domain are important criteria for identifying and detecting phishing emails. If the URL and Anomalous Domain is Valid or Genuine, it will likely follow that the website is legitimate. The same is true if all of the criteria are Valid or Genuine. Likewise, if the first criterion is Fraud, the website is considered a phishing website.

VI. CONCLUSION

We have proposed a new technique to detect phishing sites effectively. In the proposed technique, a system model is built to detect phishing sites using a fuzzy logic model. The technique was evaluated with a training dataset containing 500 sites and two testing datasets, each containing either 300 phishing sites or 300 legitimate sites. The results show that 97% of phishing websites are detected by the proposed technique. There is also a reduction in the false positive rate and the miss rate.
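
The inference steps of Section III, the rule table of Section IV and the weighted rating of Section V can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the system evaluated in this paper. The breakpoints of the membership functions are invented for illustration (only the ranges 0 to 10 and 0 to 100 come from the text), the output uses the three sets Genuine, Suspicious and Fraud from the rule table, and the min/max (Mamdani-style) operators are an assumption because the text does not name the operators. The names tri, infer and website_rating are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (a == b or b == c gives a shoulder)."""
    if x <= a:
        return 1.0 if a == b else 0.0
    if x >= c:
        return 1.0 if b == c else 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Assumed breakpoints: inputs on 0..10, output on 0..100.
IN_SETS = {"Low": (0, 0, 5), "Moderate": (0, 5, 10), "High": (5, 10, 10)}
OUT_SETS = {"Genuine": (0, 0, 50), "Suspicious": (0, 50, 100), "Fraud": (50, 100, 100)}

# The seven rules of the "Rule Base for Category 2" table: three inputs, one output.
RULES = [
    (("Low", "Low", "Moderate"), "Genuine"),
    (("Low", "Moderate", "Moderate"), "Suspicious"),
    (("High", "High", "High"), "Fraud"),
    (("Low", "Low", "Low"), "Genuine"),
    (("High", "Moderate", "Moderate"), "Fraud"),
    (("Moderate", "Low", "Moderate"), "Suspicious"),
    (("Moderate", "Moderate", "Low"), "Suspicious"),
]


def infer(inputs, rules=RULES, steps=1000):
    """Fuzzify, fire the rules, aggregate and defuzzify by centroid.

    Returns a crisp value in 0..100, or None when no rule fires.
    """
    strengths = {}
    for antecedent, label in rules:
        # Rule strength: minimum membership over the rule's input terms.
        w = min(tri(x, *IN_SETS[term]) for x, term in zip(inputs, antecedent))
        strengths[label] = max(strengths.get(label, 0.0), w)
    num = den = 0.0
    for i in range(steps + 1):
        y = 100.0 * i / steps
        # Aggregation: maximum over the clipped output sets.
        mu = max((min(w, tri(y, *OUT_SETS[label])) for label, w in strengths.items()),
                 default=0.0)
        num += y * mu
        den += mu
    return num / den if den > 0 else None


# Category weights from Section V.
WEIGHTS = {"url": 0.4, "anomalous": 0.3, "content": 0.2, "general": 0.1}


def website_rating(crisp):
    """Weighted sum of the four per-category crisp outputs."""
    return sum(WEIGHTS[name] * crisp[name] for name in WEIGHTS)


if __name__ == "__main__":
    score = infer((2, 6, 6))
    print("crisp output for one category:", score)
    # The other three category values below are made-up placeholders.
    print("rating:", website_rating(
        {"url": 80.0, "anomalous": score, "content": 40.0, "general": 20.0}))
```

Because only seven of the possible input combinations appear in the table, some inputs fire no rule at all; the sketch returns None in that case rather than guessing a default rating.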
REFERENCES

[1] X. Dong, J. A. Clerk, J. L. Jacob, Defending the weakest link: Phishing Website Detection by analysing User Behaviours, IEEE Telecommun System, 45: pp. 215-226, 2010.
[2] B. Wardman, T. Stallings, G. Warner, A. Skjellum, High-Performance Content-Based Phishing Attack Detection, eCrime Researchers Summit (eCrime), pp. 1-9, Conference: 7-9 Nov. 2011, San Diego, CA.
[3] A. P. Barraclough, M. A. Hossain, M. A. Tahir, G. Sexton, N. Aslam, Intelligent phishing detection and protection scheme for online transactions, Expert Systems with Applications 40, pp. 4697-4706, 2013.
[4] A. Le, A. Markopoulou, M. Faloutsos, Phishdef: Url names say it all, INFOCOM, Proceedings IEEE, pp. 191-195, 2010.
[5] Shah, R.; Trevathan, J.; Read, W.; Ghodosi, H., A Proactive Approach to Preventing Phishing Attacks Using the Pshark Model, IEEE Sixth International Conference on Information Technology: New Generations, March 2009, pp. 915-921.
[6] Afroz, S.; Greenstadt, R., PhishZoo: Detecting Phishing Websites by Looking at Them, IEEE Fifth International Conference on Semantic Computing (ICSC), 2011, pp. 368-375.
[7] Sophos, Security Threat Report, July 2008.
[8] Maher Aburrous, M. A. Hossain, Keshav Dahal, Fadi Thabatah, Modelling Intelligent Phishing Detection System for e-Banking using Fuzzy Data Mining, IEEE International Conference on CyberWorlds, 2009, pp. 265-272.
    A. P. W. Group, Phishing activity trends report, 2009, http://www.antiphishing.org/reports/apwg reportQ42009.pdf.
[9] Phishing Activity Trends Report 3Q 2013 of the Anti-Phishing Working Group (APWG) [online] http://www.apwg.org.
[10] Phishtank. Phishtank feed: validated and online. http://data.phishtank.com/data/online-valid/index.xml, 2015.
[11] Gartner, Inc. Gartner says number of phishing e-mails sent to U.S. adults nearly doubles in just two years. http://www.gartner.com/it/page.jsp?id=498245, November 9 2006.
[12] PhishingCorpus Homepage. Available: http://monkey.org/~jose/wiki/doku.php?id=PhishingCorpus