Phishing Web Page Detection Methods: URL and HTML Features Detection
Abstract—Phishing is a type of Internet fraud that uses fake web pages mimicking original web pages to trick users into sending sensitive information to the phisher. Statistics presented by APWG and PhishTank show that the number of phishing websites increased continuously from 2015 to 2020. To overcome this problem, several studies have been carried out, including studies that detect phishing web pages using various web page features and various methods. Unfortunately, several of these methods are not really effective, because their design and evaluation focus too narrowly on detection accuracy in a research setting and the evaluation does not represent application in the real world. A security detection tool, however, should be effective, perform well, and be deployable. In this study the authors evaluated several methods and proposed a rule-based application that can detect phishing more efficiently.

Keywords—phishing webpage, URL and HTML features, information security, phishing detection.

I. INTRODUCTION

Phishing is a type of Internet fraud in the form of web pages that mimic legitimate web pages to trick users into sending their sensitive information, such as usernames, passwords, bank account numbers, or credit card numbers [1]. Phishing is not a new type of attack on the Internet. It is an old type of attack that attackers still use because it is considered one of the most effective ways to reach their targets. Attackers choose phishing because it is not too complicated to carry out, yet it can directly reach a wide range of internet users [3]. This is evidenced by the number of phishing attack cases, which has tended to increase from year to year.

One way to address the phishing problem is to increase internet users' knowledge of the characteristics of phishing. If internet users are given enough knowledge and are willing to look more carefully at anything odd in their internet activities, the number of phishing cases and victims may decrease. The problem, however, is that most internet users neglect these security measures. According to Tan et al. [4], the key factor that lets phishing continue to take its toll is the habit of internet users who are often in a hurry when they receive interesting information and are not careful about irregularities, even though most internet users can recognize what kind of web pages they should access. Therefore, several stakeholders in the technology industry and researchers have started to develop applications that can prevent internet users from being exposed to phishing attacks.

Other researchers use features of web pages, such as URL, HTML, and CSS features, together with artificial intelligence methods, and some make use of various machine learning algorithms. Unfortunately, these methods are also considered ineffective because they are too complex, which makes them difficult to implement. This is in accordance with the statement of Marchal et al. [14] that there is something wrong with the design and evaluation in much of the existing literature, because it focuses only on achieving detection accuracy in research while the evaluation does not represent real-world application. In fact, a security detection tool should offer effectiveness, good performance, applicability, and efficiency. Most of the existing literature deals only with accuracy. The optimal solution can only be obtained if the anti-phishing tool meets anti-phishing criteria such as detection ability and effective usability. It is therefore suggested that ongoing research concentrate on both [3].

In this study, the authors propose a rule-based method with the aim of making the application more effective in terms of accuracy and detection speed. Several machine learning methods were also tried as a comparison, to see whether they increase detection accuracy. The authors then evaluate these methods and determine a more efficient strategy for detecting phishing web pages.

II. TAXONOMY OF PHISHING

Based on how the attacker carries out the attack, phishing attacks can be classified into three categories: social engineering, malware-based attacks, and network-based attacks. Social engineering attacks are usually carried out using fake websites and email spoofing. Malware-based attacks usually take advantage of applications such as keyloggers, screenloggers, and phishing malware (Trojans), while network-based attacks can be carried out with DNS poisoning, session hijacking, and hosts file poisoning [7].

The following are the phases through which a phishing attack usually works, until the user is either phished or survives the attempt:
• Phishers share a web link that contains malware, spread it through social media, and wait until someone clicks it, so that users are automatically phished
if the computer's operating system does not have good security.
• Phishers share links via social media to phishing forms or to web pages containing phishing forms, then hope that unaware users will hand over their personal data and even other important account credentials.
• Users use public computers, such as those in airports, malls, and hospitals, which can be accessed by anyone. If such a computer is not well secured, a phisher can install a keylogger and wait for users to become victims. Users will be phished if they log in to their personal accounts or carry out online transactions on that computer.
• Users can also be phished if they do personal things, such as logging in to important accounts or carrying out online transactions, while using free Wi-Fi in malls, cafes, and other public hotspots.

Fig. 1. Phases of How Phishing Attacks Work

There are several techniques for preventing phishing attacks; see Fig. 2 [8].
• User education is one of the best ways to prevent phishing attacks. By understanding the characteristics or signs of phishing, users can more easily identify suspicious things. Of course, users should also be regularly reminded to stay aware and alert to anything unusual.
• Network-level phishing defenses can be implemented by setting control points for traffic entering and leaving the network. From the network side, network administrators can manage which destinations and sources are allowed in and out. They can even block access to unknown sites by IP address and domain name.
• Authentication mechanisms are widely used because they are considered to add another level of security. With such a mechanism, the server requests authentication information from the user in case of unusual activity or when the user uses a new device. Authentication systems usually use SMS or email to send an authentication code that the user must enter to access the server. So even if an attacker has managed to obtain a username and password through other phishing methods, the attacker will not be able to get in without that authentication code.
• Another way to deal with phishing attacks is to use software. This is quite effective in covering for human error on the Internet. With anti-phishing applications, users can be warned, and their activity can even be restricted by blocking sites that are considered dangerous or unusual.

Phishing web pages are web pages used to steal important information from victims by resembling a real web page. They usually contain a fake login page or form page, or they may have malware inserted into the page. Several types of web pages are the main targets for faking by attackers: web banking login pages, social media, webmail, and e-commerce.

Fig. 3. Current Phishing Target

Most phishing is motivated by the desire to earn money. More skilled attackers, however, carry out deliberate phishing against a main target, for reasons such as revenge or political opposition. They usually build their web pages in a very detailed and similar way because of this political motive. Then they usually look for personal data and credentials from
top websites that the victim commonly uses every day, for example email, cloud storage, and important company data, up to the target victim's credit card number.

TABLE 1
MOST PHISHING TARGETS

No   Target                       Main Business
1    PayPal                       Payment
2    Amazon                       E-Commerce
3    Microsoft                    IT
4    Apple                        IT
5    Facebook                     Social media
6    Google                       IT
7    AOL                          News Forum
8    Internal Revenue Service     Finance
9    US Automobile Association    Finance
10   JPMorgan Chase and Co.       Finance

About 2 million phishing web pages have been confirmed as valid; 11 thousand of them are still active online, while about 2 million others are already offline. This shows that most phishing websites are made to exist only temporarily, and that some old web pages reappear under different domain names. A domain name blacklist therefore needs to be updated continuously so that the data is stored more efficiently.

IV. WEB FEATURES SELECTION

A web page contains many features, but not all of them can be used to distinguish phishing web pages from legitimate ones. The authors designed a feature selection algorithm to select the most suitable features. The algorithm assesses web features: each predetermined feature is rated according to how often it is detected during the feature assessment process. In other words, the authors select features based on the ratio of detections on phishing web pages and on legitimate web pages. The initial step of feature assessment is to identify the features present on web pages that are known to be phishing or legitimate. This process also checks whether the features selected and used by previous researchers are in fact common on the web pages to be identified. A dataset of 500 phishing web pages and 500 legitimate web pages is examined, and the features are then rated and ranked based on the results of this assessment. The feature value is determined by the ratio of the number of times the feature appears on web pages in the dataset. The calculation algorithm is the one also used by Rao and Pais [2]. Under this algorithm, every feature value detected on phishing web pages is reduced by the feature value detected on legitimate web pages. So, if a feature is detected on the same number of legitimate web pages, or more, the value will be 0 or below 0, which means that the feature is not selected for the phishing detection application.

Here are the features that have been selected:
• HTTP protocol
SSL is a security layer that uses data encryption technology. A URL served over SSL is indicated by HTTPS, while websites that do not have SSL are marked with HTTP.
• URL length
URL length is an indication of the difference between phishing and legitimate web pages. A URL that is too long can indicate the URL of a phishing web page.
• Many dots (.)
Normally the URL of a legitimate web page has no more than 3 dots in the domain name, while a phishing URL can have more than 3 because the extra dots are used to create phishing subdomains.
• Sensitive words
Certain words are often used in phishing web page URLs to attract the attention of potential victims. Words like 'secure', 'banking', 'confirm', 'free', 'sale', 'porn' and others are often found in phishing URLs.
• Dash in the domain name
Strange or unusual symbols in the URL are also an indication of phishing, such as the dash (-) symbol, which is commonly used to add sensitive words or the brands of phishing targets.
• Double top domain
Two top-level domains are used to disguise the real domain name, for example: instagram.com.contoso.info.
• Many paths
Many path segments can be characteristic of a phishing web page because they are often used to insert the name of the target website, for example contoso.com/login/account/instagram.com
• Shortlink
Shortlinks are a relatively new feature. They are used because they give a shorter URL and can include enticing words so that potential victims click on them. Shortlinks are also commonly used for phishing that relies on a free online form.
• Top target
Most phishing web pages online today target data from top target websites such as Facebook, OneDrive, Instagram, and others. This feature therefore indicates that a web page is considered phishing if a top target brand is used in the domain name.
• Fake login
One indication of a phishing web page is a login form that asks for sensitive information but does not log you in when you enter your username and password, or the information you enter just seems to disappear.
• HTML file size
Phishing web pages usually contain only relatively simple content. Their HTML code typically appears to be shorter than 10 kilobytes.
• Favicon
Most phishing websites do not have a favicon. A favicon is an icon that appears on the browser tab when the website is opened.
• Cheap domain
Cheap domains such as .xyz, .info, and others are commonly used for phishing. Apart from their low rental price, many cheap domain names are still available because they are not widely used.
• Free hosting
Free hosting is preferred because a domain name and hosting are obtained at the same time for free, so if the site is blacklisted, phishers do not feel they have much to lose.

the identification in the form of information on whether the website is considered phishing or legitimate.

In this research, we obtained good results using the rule-based method we created. In the experiments, 86.6% of phishing web pages were detected correctly. The remaining 13.4% were missed for several reasons. One is the system's failure to detect certain features, especially in HTML files: several times the system failed to detect the favicon and fake login features. Another is that the phishing page's creator used a URL that looks like that of a top website, so the URL features of the phishing page were not detected. However, the authors assess that creating a page with such a URL reduces the likelihood of it being clicked by potential victims. Table 2 shows the results of the experiments conducted:
TABLE 3
EXPERIMENT RESULTS WITH SEVERAL METHODS

SVM   14   95.4%
GNB   14   95.3%
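
The feature assessment and rule-based check described in Section IV can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the feature value is the detection ratio on phishing pages minus the detection ratio on legitimate pages, covers only a handful of the URL features, and uses hypothetical function names, word lists, and thresholds (only the "more than 3 dots" rule comes from the text).

```python
from urllib.parse import urlparse

# Illustrative lists built from the examples given in Section IV (assumptions).
SENSITIVE_WORDS = ("secure", "banking", "confirm", "free", "sale")
CHEAP_TLDS = (".xyz", ".info")
MAX_URL_LENGTH = 75      # assumed threshold for "URL too long"
MAX_DOTS = 3             # "no more than 3 dots in the domain name"
MAX_PATH_SEGMENTS = 3    # assumed threshold for "many paths"


def url_features(url):
    """Return a few of the Section IV URL features as booleans."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]
    lowered = url.lower()
    return {
        "no_https": parsed.scheme != "https",
        "long_url": len(url) > MAX_URL_LENGTH,
        "many_dots": host.count(".") > MAX_DOTS,
        "sensitive_word": any(w in lowered for w in SENSITIVE_WORDS),
        "dash_in_domain": "-" in host,
        "many_paths": len(segments) > MAX_PATH_SEGMENTS,
        "cheap_domain": host.endswith(CHEAP_TLDS),
    }


def feature_scores(phishing_urls, legitimate_urls):
    """Score each feature: ratio on phishing pages minus ratio on legitimate pages.

    Both lists must be non-empty. The subtraction is an assumed reading of
    "reduced by the feature value detected on legitimate web pages".
    """
    scores = {}
    for name in url_features(""):
        phish = sum(url_features(u)[name] for u in phishing_urls)
        legit = sum(url_features(u)[name] for u in legitimate_urls)
        scores[name] = phish / len(phishing_urls) - legit / len(legitimate_urls)
    return scores


def select_features(scores):
    """Keep only features whose score is above 0, as described in Section IV."""
    return [name for name, score in scores.items() if score > 0]


def is_phishing(url, selected, min_hits=2):
    """Flag a URL when at least `min_hits` selected features fire (assumed rule)."""
    features = url_features(url)
    return sum(features[name] for name in selected) >= min_hits
```

In use, `feature_scores` would be run once over the labelled dataset (500 phishing and 500 legitimate pages in this study), `select_features` would drop every feature that scores 0 or below, and `is_phishing` would then be applied to each new URL. The HTML features (file size, favicon, fake login) would need the page to be downloaded first and are left out of this sketch.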
Likewise, other features show a clear difference between phishing web pages and legitimate web pages.

VII. CONCLUSION

Based on the results of the experiments and analyses carried out in this study, the authors draw the following conclusions:
1) Using URL features alone makes the detection process faster, but accuracy decreases. Using HTML features adds to the load of the application, and there is a risk of failure when downloading and opening HTML files. Therefore, a combination of URL and HTML features is the better choice.
2) The results of the tested methods show that the accuracy of machine learning is not much better than that of the rule-based method, even though machine learning uses more hardware resources for training and for deployment.
3) Because the tests produce roughly the same accuracy, and because URL and HTML syntax stays roughly the same from year to year, machine learning is less efficient, so rules are more suitable for phishing web page detection.

REFERENCES

[1] Li, Yukun, Zhenguo Yang, Xu Chen, Huaping Yuan, and Wenyin Liu, ‘A Stacking Model Using URL and HTML Features for Phishing Webpage Detection’, Future Generation Computer Systems, 94 (2019).
[2] Rao, Routhu Srinivasa, and Alwyn R. Pais, ‘Detecting Phishing Websites Using Automation of Human Behavior’, CPSS 2017 - Proceedings of the 3rd ACM Workshop on Cyber-Physical System Security, Co-Located with ASIA CCS 2017, 2017.
[3] Alsharnouby, Mohamed, Furkan Alaca, and Sonia Chiasson, ‘Why Phishing Still Works: User Strategies for Combating Phishing Attacks’, International Journal of Human Computer Studies, 82 (2015).
[4] Tan, Choon Lin, Kang Leng Chiew, Kok Sheik Wong, and San Nah Sze, ‘PhishWHO: Phishing Webpage Detection via Identity Keywords Extraction and Target Domain Name Finder’, Decision Support Systems, 88 (2016).
[5] Marchal, Samuel, Kalle Saari, Nidhi Singh, and N. Asokan, ‘Know Your Phish: Novel Techniques for Detecting Phishing Sites and Their Targets’, 2015.
[6] Sonowal, Gunikhan, and K. S. Kuppusamy, ‘PhiDMA – A Phishing Detection Model with Multi-Filter Approach’, Journal of King Saud University - Computer and Information Sciences, 32.1 (2020).
[7] Gupta, Surbhi, Abhishek Singhal, and Akanksha Kapoor, ‘A Literature Survey on Social Engineering Attacks: Phishing Attack’, Proceeding - IEEE International Conference on Computing, Communication and Automation, ICCCA 2016, 2017.
[8] Gupta, B. B., Nalin A.G. Arachchilage, and Kostas E. Psannis, ‘Defending against Phishing Attacks: Taxonomy of Methods, Current Issues and Future Directions’, Telecommunication Systems, 67.2 (2018).
[9] Chiew, Kang Leng, Choon Lin Tan, Kok Sheik Wong, Kelvin S.C. Yong, and Wei King Tiong, ‘A New Hybrid Ensemble Feature Selection Framework for Machine Learning-Based Phishing Detection System’, Information Sciences, 484 (2019).
[10] Kim, Sungjin, Jinkook Kim, and Brent Byung Hoon Kang, ‘Malicious URL Protection Based on Attackers’ Habitual Behavioral Analysis’, Computers and Security, 77 (2018).
[11] Orunsolu, A. A., A. S. Sodiya, and A. T. Akinwale, ‘A Predictive Model for Phishing Detection’, Journal of King Saud University - Computer and Information Sciences, 2020.
[12] Jansen, Jurjen, and Paul van Schaik, ‘The Design and Evaluation of a Theory-Based Intervention to Promote Security Behaviour against Phishing’, International Journal of Human Computer Studies, 123.January 2018 (2019).
[13] Adebowale, M. A., K. T. Lwin, E. Sánchez, and M. A. Hossain, ‘Intelligent Web-Phishing Detection and Protection Scheme Using Integrated Features of Images, Frames and Text’, Expert Systems with Applications, 115.December 2017 (2019).
[14] Marchal, Samuel, and N. Asokan, ‘On Designing and Evaluating Phishing Webpage Detection Techniques for the Real World’, 11th USENIX Workshop on Cyber Security Experimentation and Test, CSET 2018, Co-Located with USENIX Security 2018, (2018).
[15] Ferreira, Ana, and Soraia Teles, ‘Persuasion: How Phishing Emails Can Influence Users and Bypass Security Measures’, International Journal of Human Computer Studies, 125.November 2018 (2019).
[15] Mahalakshmi, A., N. Swapna Goud, and G. Vishnu Murthy, ‘A Survey on Phishing and It’s Detection Techniques Based on Support Vector Method (SVM) and Software Defined Networking (SDN)’, International Journal of Engineering and Advanced Technology, 8.2 (2018).
[16] Chen, Jing, Scott Mishler, Bin Hu, Ninghui Li, and Robert W. Proctor, ‘The Description-Experience Gap in the Effect of Warning Reliability on User Trust and Performance in a Phishing-Detection Context’, International Journal of Human Computer Studies, 119.November 2017 (2018).
[17] Alam, Safwan, and Khalil El-Khatib, ‘Phishing Susceptibility Detection through Social Media Analytics’, ACM International Conference Proceeding Series, 20-22-July (2016).
[18] Sahingoz, Ozgur Koray, Ebubekir Buber, Onder Demir, and Banu Diri, ‘Machine Learning Based Phishing Detection from URLs’, Expert Systems with Applications, 117 (2019).
[19] Tewari, Aakanksha, A. K. Jain, and B. B. Gupta, ‘Recent Survey of Various Defense Mechanisms against Phishing Attacks’, Journal of Information Privacy and Security, 12.1 (2016).
[20] Baykara, Muhammet, and Zahit Ziya Gürel, ‘Detection of Phishing Attacks’, 6th International Symposium on Digital Forensic and Security, ISDFS 2018 - Proceeding, 2018-January (2018).
[21] Mao, Jian, Wenqian Tian, Pei Li, Tao Wei, and Zhenkai Liang, ‘Phishing-Alarm: Robust and Efficient Phishing Detection via Page Component Similarity’, IEEE Access, 5 (2017).
[22] Rao, Routhu Srinivasa, and Alwyn Roshan Pais, ‘Detection of Phishing Websites Using an Efficient Feature-Based Machine Learning Framework’, Neural Computing and Applications, 31.8 (2019).