The 2020 IEEE International Conference on Internet of Things and Intelligence System (IoTaIS)

Phishing Web Page Detection Methods: URL and HTML Features Detection
2020 IEEE International Conference on Internet of Things and Intelligence System (IoTaIS) | 978-1-7281-9448-6/20/$31.00 ©2021 IEEE | DOI: 10.1109/IoTaIS50849.2021.9359694

Humam, Faris
Dept. Computer Science
Universitas Indonesia
Depok, Indonesia
[email protected]

Setiadi, Yazid
Dept. Computer Science
Universitas Indonesia
Depok, Indonesia
[email protected]

Abstract—Phishing is a type of Internet fraud in which fake web pages mimic original web pages to trick users into sending sensitive information to the phisher. Statistics presented by APWG and PhishTank show that the number of phishing websites tended to increase continuously from 2015 to 2020. To overcome this problem, several studies have been carried out, including studies that detect phishing web pages from various web page features using various methods. Unfortunately, several of these methods are not really effective, because their design and evaluation focus too much on achieving detection accuracy in research, and the evaluation does not represent application in the real world. A security detection tool, however, should be effective, perform well, and be deployable. In this study the authors evaluated several methods and proposed a rules-based application that can detect phishing more efficiently.

Keywords—phishing webpage, URL and HTML features, information security, phishing detection.

I. INTRODUCTION

Phishing is a type of fraud on the Internet in the form of web pages that mimic legitimate web pages to trick users into sending their sensitive information, such as usernames, passwords, bank account numbers, or credit card numbers [1]. Phishing is not a new type of attack on the Internet; it is an old one that attackers still use because it is considered one of the most effective ways to reach their targets. Attackers choose phishing because it is not too complicated to carry out, yet it can directly reach a wide range of internet users [3]. This is evidenced by the large number of reported phishing attack cases, which tends to increase from year to year.

One way to address the phishing problem is to increase internet users' knowledge of the characteristics of phishing. If internet users have enough knowledge and are willing to look more carefully at oddities in their internet activities, the number of cases and victims of phishing may decrease. The problem, however, is that most internet users often neglect these security measures. According to Tan et al. [4], the key factor that lets phishing attacks continue to take their toll is the habit of internet users who are often in a hurry when they receive interesting information and are not careful about irregularities, even though most internet users can recognize what kind of web pages they should access. Therefore, several stakeholders in the technology industry and researchers have started to develop applications that can prevent internet users from being exposed to phishing attacks.

Other related studies use web page features such as the URL, HTML, and CSS features, and then apply artificial intelligence methods; some make use of various machine learning algorithms. Unfortunately, these methods are also considered ineffective because they are too complex, which makes them difficult to implement. This agrees with the statement of Marchal et al. [14] that there is something wrong with the design and evaluation in much of the existing literature, because it focuses only on achieving detection accuracy in research while the evaluation does not represent real-world application. A security detection tool should be effective, perform well, be applicable, and be efficient. Most of the existing literature deals only with accuracy. The optimal solution can only be obtained if the anti-phishing tool meets anti-phishing criteria such as detection ability and effective usability. It is therefore suggested that ongoing research concentrate on both [3].

This study proposes a rules-based method with the aim of making the application more effective in terms of accuracy and detection speed. Several machine learning methods were also tried as a comparison, to see whether they increase detection accuracy. The authors then evaluate these methods and determine a more efficient strategy for detecting phishing web pages.

II. TAXONOMY OF PHISHING

Based on how the attacker carries out the attack, phishing attacks can be classified into three categories: social engineering, malware-based attacks, and network-based attacks. Social engineering attacks are usually carried out using fake websites and email spoofing. Malware-based attacks usually take advantage of applications such as keyloggers/screenloggers and phishing malware (Trojans), while network-based attacks can use DNS poisoning, session hijacking, and hosts file poisoning [7].

The following are some of the phases through which a phishing attack usually works, until the user either falls victim to the attack or survives it:
• Phishers share a web link that contains malware, spread it through social media, and wait until someone clicks it; users are then phished automatically
if the computer's operating system does not have good security.
• Phishers share links via social media, or share phishing forms or web pages containing phishing forms, and then hope that some unaware users will give away their personal data and even other important account credentials.
• Users use public computers, such as those in airports, malls, and hospitals, which can be accessed by anyone. If the computer is not well secured, the phisher can install a keylogger and wait for users to become victims. Users will be phished if they log in to their personal accounts or make online transactions on that computer.
• Users can also be phished if they do personal things, such as logging in to important accounts or making online transactions, while using free Wi-Fi in malls, cafes, and other public Wi-Fi hotspots.

Fig. 1. Phases of How Phishing Attacks Work

There are several techniques for preventing phishing attacks; see Fig. 2 [8].
• User education is one of the best ways to prevent phishing attacks. By understanding the characteristics or signs of phishing, users can more easily identify suspicious things. Of course, users should also be reminded to always stay aware of and alert to anything unusual.
• Network-level phishing defenses can be implemented by setting control points for traffic entering and leaving the network. On the network side, administrators can manage which destinations and sources are allowed in and out. Network administrators can even block access to unknown sites by IP address and domain name.
• Authentication mechanisms are widely used because they are considered to add more levels of security. With such a mechanism, the server requests authentication information from the user in case of unusual activity or when the user uses a new device. The authentication system usually uses SMS or email to send an authentication code that the user must enter to access the server. So even if the attacker manages to get a username and password through other phishing methods, the attacker cannot get in without that authentication code.
• Another way to deal with phishing attacks is to use software. This is quite effective in covering for human errors on the Internet. With anti-phishing applications, users can be warned, and their activity can even be restricted by blocking sites that are considered dangerous or unusual.

Fig. 2. Taxonomy Prevents Phishing Attacks

III. PHISHING WEB TARGET

Phishing web pages are web pages used to steal important information from victims by resembling a real web page. They usually contain a fake login page or form page, or they may have malware inserted into the page. Several types of web pages are the main targets of being faked by attackers: web banking login pages, social media, webmail, and e-commerce.

Fig. 3. Current Phishing Target

Phishing is mostly motivated by the desire to earn money. More skilled attackers, however, seriously carry out phishing against a main target, for example for reasons of revenge or against political opponents; they usually build their web pages in a very detailed and similar way because the attack is politically motivated. They then usually look for personal data and credentials from
top websites that the victim commonly uses every day, for example email, cloud storage, and important company data, up to the target victim's credit card number.

TABLE 1
MOST PHISHING TARGETS

No   Target                       Main Business
1    Paypal                       Payment
2    Amazon                       E-Commerce
3    Microsoft                    IT
4    Apple                        IT
5    Facebook                     Social media
6    Google                       IT
7    AOL                          News Forum
8    Internal Revenue Service     Finance
9    US Automobile Association    Finance
10   JPMorgan Chase and Co.       Finance

About 2 million phishing websites have been verified as valid; 11 thousand are still active online, while about 2 million others are already offline. This shows that most phishing websites are only made temporarily, and also that some old web pages reappear under different domain names. A blacklist of domain names therefore needs to be updated continuously so that the data is stored more efficiently.

IV. WEB FEATURES SELECTION

A web page contains many features, but not all of them can be used to distinguish phishing web pages from legitimate ones. The author designed a feature selection algorithm, which assesses web features, to select the most suitable ones. Each predetermined feature is rated according to the ratio of the number of times it is detected during the feature assessment process; in other words, the author selects features based on the ratios at which they are detected on phishing web pages and on legitimate web pages. The feature assessment begins by identifying the features present on web pages that are known to be phishing or legitimate. This process also verifies whether the features selected and used by previous researchers really occur frequently on the web pages to be identified. A dataset of 500 phishing web pages and 500 legitimate web pages is examined, and the features are then rated and ranked based on the results of this assessment. The feature value is determined by the ratio of the number of times the feature appears on web pages in the dataset. The calculation algorithm used here was also used by Rao and Pais [2]. According to this algorithm, the value of every feature detected on phishing web pages is reduced by the value of that feature detected on legitimate web pages. So, if a feature is detected the same number of times or more on legitimate web pages, its value will be 0 or below 0, which means that the feature is not selected for the phishing detection application.

Here are the features that have been selected:
• HTTP protocol
  SSL is a security layer that uses data encryption technology. A URL served over SSL is indicated by HTTPS, while websites that do not have SSL are marked with HTTP.
• URL length
  URL length is one indication of the difference between phishing and legitimate web pages. A URL that is too long can indicate a phishing web page.
• Many dots (.)
  The URL of a legitimate web page normally has no more than 3 dots in the domain name, while a phishing URL can have more than 3 because the dots are used to create phishing subdomains.
• Sensitive words
  Certain words are often used in phishing web page URLs to attract the attention of potential victims. Words like 'secure', 'banking', 'confirm', 'free', 'sale', 'porn', and others are often found in phishing web URLs.
• Dash in the domain name
  Strange or unusual symbols in the URL are also an indication of phishing, such as the dash (-) symbol, which is commonly used to add sensitive words or the brand names of phishing targets.
• Double top-level domain
  Two top-level domains are used to disguise the real domain name, for example: instagram.com.contoso.info.
• Many paths
  Many path segments can be characteristic of a phishing web page, because they are often used to insert the name of the target website, for example contoso.com/login/account/instagram.com
• Shortlink
  Shortlinks are a relatively new feature; they are used because they make the URL shorter and can contain interesting words that entice potential victims to click. Shortlinks are also commonly used for phishing through free online forms.
• Top target
  Most phishing web pages online today target data from top target websites such as Facebook, OneDrive, Instagram, and others. This feature therefore indicates that a web page is considered phishing if a top target brand is used in its domain name.
• Fake login
  One indication of a phishing web page is a login form that asks for sensitive information, but when you enter your username and password you cannot log in, or the information you enter just seems to disappear.
• HTML file size
  Phishing web pages usually contain only relatively simple content. Their HTML code is usually short, at less than 10 kilobytes.
• Favicon
  Most phishing websites don't have a favicon. A favicon is the icon that appears on the browser tab when the website is opened.
• Cheap domain
  Cheap domains such as .xyz, .info, and others are commonly used for phishing websites. Apart from their low rental prices, many cheap domain names are still available because they are not widely used.
• Free hosting
  Free hosting is preferred because it provides a domain name and hosting at the same time for free, so that if the website is blacklisted, phishers do not lose much.

V. WEB PHISHING DETECTOR METHOD

The application is designed to check whether a web page is phishing or legitimate by checking its features. The result of feature detection is the main reference, and certain features can be a strong indication of phishing. After going through the stages of feature detection and rule checking, the application displays the identification result, stating whether the website is considered phishing or legitimate.

The process of detecting phishing web pages begins with retrieving the URL data and downloading the HTML data of the web page. The data is then parsed to find the URL and HTML features of interest. The flow of the application algorithm is illustrated in Figure 4.

Fig. 4. Flowchart of Data Processing in Applications

VI. RESULT

The application is designed to check whether a web page is phishing or legitimate by checking its features. The result of feature detection is the main reference, and certain features can be a strong indication of phishing. After going through the stages of feature detection and rule checking, the application displays the identification result, stating whether the website is considered phishing or legitimate.

In this research, the rule-based method gave good results. In the experiments, 86.6% of phishing web pages were detected correctly. The remaining 13.4% were not detected, for several reasons. One is system failure to detect certain features, especially in HTML files: several times the system failed to detect the favicon and fake login features. Another is that the creator of the phishing page used a URL resembling that of the top website, so that the phishing URL features were not detected. However, the author considers that creating a page with such a URL reduces the likelihood of it being clicked on by potential victims. Table 2 shows the results of the experiments conducted:

TABLE 2
EXPERIMENTAL RESULTS

                      True Values
                      True     False
Prediction   True     325      50
             False    50       0

The author also tried three machine learning methods, namely SVM, Decision Tree, and Gaussian Naïve Bayes, on the same dataset to see whether machine learning can significantly improve accuracy. The following are the results of tests carried out with the Scikit-Learn library; the results were validated by averaging 10 tests. The training and testing data make up 75% and 25% of the dataset, respectively.

TABLE 3
EXPERIMENT RESULTS WITH SEVERAL METHODS

Method          Total Features   Accuracy
Rule based      14               93.3%
SVM             14               95.4%
Decision Tree   14               96.8%
GNB             14               95.3%

From the test results in Table 3, detecting phishing pages using machine learning does not really increase the accuracy; in fact, it is relatively the same. The results also show that accurate feature selection is very influential and is at the core of detecting phishing web pages. In addition, limiting feature values with hand-written rules can be more efficient than leaving everything to machine learning. An example is setting a limit on the number of dots: the rule writer directly determines whether the URL is normal or not. Normally, the URL of a web page does not have more than 3 dots.
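The dot limit described here is a simple threshold rule. As a minimal sketch, and not the authors' implementation, a few of the URL features from Section IV could be turned into such rules as shown below. The function names, the 75-character length cut-off, and the "at least two rules fire" decision policy are assumptions made for illustration; only the 3-dot limit and the example words come from the text.

```python
from urllib.parse import urlparse

# Example words quoted in Section IV; a real list would be longer.
SENSITIVE_WORDS = ("secure", "banking", "confirm", "free", "sale", "porn")
MAX_DOTS = 3          # from the text: legitimate URLs have at most 3 dots
MAX_URL_LENGTH = 75   # assumed cut-off, not given in the paper


def url_features(url: str) -> dict:
    """Evaluate a handful of URL rules; True means the rule fired."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "no_https": parsed.scheme != "https",
        "long_url": len(url) > MAX_URL_LENGTH,
        "many_dots": host.count(".") > MAX_DOTS,
        "dash_in_domain": "-" in host,
        "sensitive_word": any(word in lowered for word in SENSITIVE_WORDS),
    }


def is_suspicious(url: str, min_hits: int = 2) -> bool:
    """Flag a URL when at least `min_hits` rules fire (assumed policy)."""
    return sum(url_features(url).values()) >= min_hits
```

For instance, `is_suspicious("http://secure-login.instagram.com.contoso.info/confirm")` fires the HTTP, dot-count, dash, and sensitive-word rules, while `is_suspicious("https://www.example.com/")` fires none. Because each rule is a direct comparison, no training step is needed, which is the efficiency argument made in this section.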

The same applies to other features that show a clear difference between phishing web pages and legitimate web pages.

VII. CONCLUSION

Based on the results of the experiments and analyses carried out in this study, the authors draw the following conclusions:
1) Using URL features alone makes the detection process faster, but the accuracy decreases. Using HTML features adds to the load of the application, and there is a risk of failure when downloading and opening HTML files. Therefore, a combination of URL and HTML features is the better choice.
2) The results of the tested methods show that the accuracy of machine learning is not much better than that of the rule-based method, even though machine learning uses more hardware resources for training and application.
3) Given that the tests produce relatively the same accuracy, and that URL and HTML syntax stays relatively the same from year to year, machine learning is less efficient, so rules are more suitable for phishing web page detection.

REFERENCES

[1] Li, Yukun, Zhenguo Yang, Xu Chen, Huaping Yuan, and Wenyin Liu, ‘A Stacking Model Using URL and HTML Features for Phishing Webpage Detection’, Future Generation Computer Systems, 94 (2019).
[2] Rao, Routhu Srinivasa, and Alwyn R. Pais, ‘Detecting Phishing Websites Using Automation of Human Behavior’, CPSS 2017 - Proceedings of the 3rd ACM Workshop on Cyber-Physical System Security, Co-Located with ASIA CCS 2017, 2017.
[3] Alsharnouby, Mohamed, Furkan Alaca, and Sonia Chiasson, ‘Why Phishing Still Works: User Strategies for Combating Phishing Attacks’, International Journal of Human Computer Studies, 82 (2015).
[4] Tan, Choon Lin, Kang Leng Chiew, Kok Sheik Wong, and San Nah Sze, ‘PhishWHO: Phishing Webpage Detection via Identity Keywords Extraction and Target Domain Name Finder’, Decision Support Systems, 88 (2016).
[5] Marchal, Samuel, Kalle Saari, Nidhi Singh, and N. Asokan, ‘Know Your Phish: Novel Techniques for Detecting Phishing Sites and Their Targets’, 2015.
[6] Sonowal, Gunikhan, and K. S. Kuppusamy, ‘PhiDMA – A Phishing Detection Model with Multi-Filter Approach’, Journal of King Saud University - Computer and Information Sciences, 32.1 (2020).
[7] Gupta, Surbhi, Abhishek Singhal, and Akanksha Kapoor, ‘A Literature Survey on Social Engineering Attacks: Phishing Attack’, Proceeding - IEEE International Conference on Computing, Communication and Automation, ICCCA 2016, 2017.
[8] Gupta, B. B., Nalin A.G. Arachchilage, and Kostas E. Psannis, ‘Defending against Phishing Attacks: Taxonomy of Methods, Current Issues and Future Directions’, Telecommunication Systems, 67.2 (2018).
[9] Chiew, Kang Leng, Choon Lin Tan, Kok Sheik Wong, Kelvin S.C. Yong, and Wei King Tiong, ‘A New Hybrid Ensemble Feature Selection Framework for Machine Learning-Based Phishing Detection System’, Information Sciences, 484 (2019).
[10] Kim, Sungjin, Jinkook Kim, and Brent Byung Hoon Kang, ‘Malicious URL Protection Based on Attackers’ Habitual Behavioral Analysis’, Computers and Security, 77 (2018).
[11] Orunsolu, A. A., A. S. Sodiya, and A. T. Akinwale, ‘A Predictive Model for Phishing Detection’, Journal of King Saud University - Computer and Information Sciences, 2020.
[12] Jansen, Jurjen, and Paul van Schaik, ‘The Design and Evaluation of a Theory-Based Intervention to Promote Security Behaviour against Phishing’, International Journal of Human Computer Studies, 123.January 2018 (2019).
[13] Adebowale, M. A., K. T. Lwin, E. Sánchez, and M. A. Hossain, ‘Intelligent Web-Phishing Detection and Protection Scheme Using Integrated Features of Images, Frames and Text’, Expert Systems with Applications, 115.December 2017 (2019).
[14] Marchal, Samuel, and N. Asokan, ‘On Designing and Evaluating Phishing Webpage Detection Techniques for the Real World’, 11th USENIX Workshop on Cyber Security Experimentation and Test, CSET 2018, Co-Located with USENIX Security 2018, (2018).
[15] Ferreira, Ana, and Soraia Teles, ‘Persuasion: How Phishing Emails Can Influence Users and Bypass Security Measures’, International Journal of Human Computer Studies, 125.November 2018 (2019).
[15] Mahalakshmi, A., N. Swapna Goud, and G. Vishnu Murthy, ‘A Survey on Phishing and It’s Detection Techniques Based on Support Vector Method (SVM) and Software Defined Networking (SDN)’, International Journal of Engineering and Advanced Technology, 8.2 (2018).
[16] Chen, Jing, Scott Mishler, Bin Hu, Ninghui Li, and Robert W. Proctor, ‘The Description-Experience Gap in the Effect of Warning Reliability on User Trust and Performance in a Phishing-Detection Context’, International Journal of Human Computer Studies, 119.November 2017 (2018).
[17] Alam, Safwan, and Khalil El-Khatib, ‘Phishing Susceptibility Detection through Social Media Analytics’, ACM International Conference Proceeding Series, 20-22-July (2016).
[18] Sahingoz, Ozgur Koray, Ebubekir Buber, Onder Demir, and Banu Diri, ‘Machine Learning Based Phishing Detection from URLs’, Expert Systems with Applications, 117 (2019).
[19] Tewari, Aakanksha, A. K. Jain, and B. B. Gupta, ‘Recent Survey of Various Defense Mechanisms against Phishing Attacks’, Journal of Information Privacy and Security, 12.1 (2016).
[20] Baykara, Muhammet, and Zahit Ziya Gürel, ‘Detection of Phishing Attacks’, 6th International Symposium on Digital Forensic and Security, ISDFS 2018 - Proceeding, 2018-January (2018).
[21] Mao, Jian, Wenqian Tian, Pei Li, Tao Wei, and Zhenkai Liang, ‘Phishing-Alarm: Robust and Efficient Phishing Detection via Page Component Similarity’, IEEE Access, 5 (2017).
[22] Rao, Routhu Srinivasa, and Alwyn Roshan Pais, ‘Detection of Phishing Websites Using an Efficient Feature-Based Machine Learning Framework’, Neural Computing and Applications, 31.8 (2019).