Classification of Features For Detecting Phishing Web Sites Based On Machine Learning Techniques
CHAPTER 1
INTRODUCTION
1.1 Introduction
The Internet has tremendously changed the way we work and communicate with
each other. Applications such as e-mail, file transfer, voice communication and
YouTube are available to users. But with its success have
come weaknesses and vulnerabilities. The protocols and applications responsible
for its success are being exploited by malicious users and hackers to gain
attention. Phishing websites are one such area where administrators need new
techniques and algorithms to protect naive users from being exploited. Phishing is an
attempt at fraud aimed at stealing our information, and it is mostly carried out by email.
The best way to protect ourselves from these phishing attacks is to learn to recognize such an
attack. Phishing emails mostly appear to come from trusted sources and try to obtain our
valuable information, for instance our passwords, bank details or even SSN. Many
times, these attacks come from sites where we have never even created an
account. The procedure followed by phishers is to bring us to their website
by means of an email. In those emails, they make us click on a certain link
that directs us to their websites.
These phishing websites look quite similar to their respective
legitimate ones, and the only distinguishing factor is their URLs. Messages that
appear to come from social websites, banks and online payment portals are used to deceive
users. These phishing emails mostly contain links to websites that are infected with
malware. Some of the ways to tackle these phishing attacks include generating
awareness among people and training the users [1].
"Phish" is pronounced just like it's spelled, which is to say like the word "fish"
— the analogy is of an angler throwing a baited hook out there (the phishing email)
and hoping you bite. The term arose in the mid-1990s among hackers aiming to trick
AOL users into giving up their login information. The "ph" is part of a tradition of
whimsical hacker spelling, and was probably influenced by the term "phreaking,"
short for "phone phreaking," an early form of hacking that involved playing sound
tones into telephone handsets to get free phone calls.
detection.
How to enhance the performance of the best selected features and classifiers.
How to integrate multiple classification algorithms for phishing detection and
to evaluate such integration.
1.4 Objectives
To carry out an exploratory analysis of the Phishing Websites Data Set and an
interpretation of it.
To determine and evaluate the best set of features to be used for
phishing detection.
To create a new dataset containing recent website entries in order to obtain better
accuracy.
To determine the best classification algorithm for phishing detection.
To distinguish phishing websites from legitimate websites and ensure
secure transactions for users.
Then the login credentials are scripted with the original website
and after understanding the logic of its pattern.
This logic or script data is then bundled or packed into a zip file to
make a phishing kit.
than just casting a baited hook in the water to see who bites.) Phishers identify their
targets (sometimes using information on sites like LinkedIn) and use spoofed
addresses to send emails that could plausibly look like they're coming from
co-workers. For instance, the spear phisher might target someone in the finance
department and pretend to be the victim's manager requesting a large bank transfer on
short notice.
Whale phishing
Whale phishing, or whaling, is a form of spear phishing aimed at the very big fish —
CEOs or other high-value targets. Many of these scams target company board
members, who are considered particularly vulnerable: they have a great deal of
authority within a company, but since they aren't full-time employees, they often use
personal email addresses for business-related correspondence, which doesn't have the
protections offered by corporate email.
CHAPTER 2
LITERATURE SURVEY
2. Authors: Miss Sneha Mande, Prof. D. S. Thosar
Publication: www.ijariie.com, Vol-4 Issue-6 2018, IJARIIE-ISSN(O)-2395-4396
Title: Detection of Phishing Web Sites Based On Extreme Machine Learning
Description: Phishing makes use of spoofed messages that are made to look valid
and purport to originate from genuine sources such as financial
institutions, e-commerce sites and so forth, to lure users
into visiting fake sites through links provided in the phishing email.
3. Authors: Ebubekir Buber, Önder Demir, Ozgur Koray Sahingoz
Publication: IEEE 978-1-5386-1880-6/17/
Title: Feature Selections for the Machine Learning based Detection of Phishing Websites
Description: As a software detection scheme, two main approaches are widely used:
CHAPTER 3
PROBLEM DEFINITION
account of email security, AI has brought speed, accuracy, and the capacity to do a
detailed investigation. AI can detect spam, phishing, spear phishing, and other
sorts of attacks using previous knowledge in the form of datasets. These types of
attacks are likely to have a negative impact on clients’ trust in services such as
web services. According to the APWG report, 1,220,523 phishing attacks were
reported in 2016, which is 65% more than in 2015 [1]. Figure 2 shows the
Phishing Report for the third quarter of 2019.
As per Parekh et al. [51], a generic phishing attack has four stages.
First, the phisher creates and sets up a fake website that looks like an authentic website.
Second, the phisher sends a link to the website to a targeted victim,
posing as a genuine organization, user, or association. Third, the victim
is tempted to visit the fake website. Fourth, the
targeted victim clicks on the fake link and enters his/her valuable data as
input. Using the victim's personal data, the phisher then carries out
impersonation activities. APWG publishes reports on
phishing URLs and analyzes the constantly evolving nature and techniques of
cybercrime. The Anti-Phishing Working Group (APWG) tracks the number of
unique phishing websites, a primary measure of phishing across the globe.
Unique phishing sites are determined by their unique base URLs. The total number of phishing
websites detected by APWG in the third quarter of 2019 was 266,387 [3]. This was
up 46% from the 182,465 seen in Q2 and almost double the 138,328
seen in Q4 2018.
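The quarter-over-quarter comparison quoted above can be checked with a short calculation. This is a minimal sketch; the function name percent_change is only an illustrative choice, and the three counts are the ones given in the text.

```python
def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0

# Counts of phishing websites as quoted in the text.
q3_2019 = 266_387
q2_2019 = 182_465
q4_2018 = 138_328

print(f"Q2 2019 -> Q3 2019: {percent_change(q2_2019, q3_2019):.1f}% increase")
print(f"Q3 2019 relative to Q4 2018: {q3_2019 / q4_2018:.2f}x")
```

The first figure rounds to 46%, and the second ratio is a little under 2, which matches the "almost double" reading.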
Attack techniques are grouped into two categories: attack launching and
data collection. For attack launching, several techniques are identified, such as email
spoofing, attachments, abuse of social settings, URL spoofing, website spoofing,
interactive voice response, collaboration in a social network, reverse social engineering,
man-in-the-middle attacks, spear phishing, spoofed mobile internet browsers and
installed web content. Meanwhile, various techniques are used for data collection during
and after the victim’s interaction with an attack [49]. There are
two types of data collection techniques: automated techniques
(such as fake website forms, key loggers, and recorded messages) and
manual techniques (such as human deception and social
networking). Then, there are countermeasures for the victim’s data collected or used
before and after the attack. These countermeasures are used to detect and prevent
attacks. We categorize countermeasures into four groups: (1) deep learning-based
techniques, (2) machine learning techniques, (3) scenario-based techniques,
and (4) hybrid techniques.
To the best of our knowledge, the existing literature [11, 18, 28, 40, 62]
includes only a limited number of surveys, which focus mainly on providing an overview of attack
detection techniques. These surveys do not cover all deep learning,
machine learning, hybrid, and scenario-based techniques in detail. Besides, these surveys
lack an extensive discussion of current and future challenges for phishing
attack detection.
3.2 Objectives
Keeping in sight the above limitations, this article makes the following contributions:
1. To provide a comprehensive and easy-to-follow survey focusing on deep learning,
machine learning, hybrid learning, and scenario-based techniques for phishing
attack detection.
2. To provide an extensive discussion on various phishing attack techniques and
comparison of results reported by various studies.
3. To provide an overview of current practices, challenges, and future research
directions for phishing attack detection.
deployed in the past to tackle this problem. A detailed comparative analysis revealed that
machine learning methods are the most frequently used and most effective methods for
detecting a phishing attack. Different classification methods such as SVM, RF, ANN,
C4.5, k-NN and DT have been used. Techniques with feature reduction give better
performance. Classification with ELM, SVM, LR, C4.5, LC-ELM, kNN and
XGB, combined with ANOVA feature selection, detected phishing attacks with 99.2%
accuracy, which is the highest among all methods proposed so far, but with trade-offs in
terms of computational cost.
CHAPTER 4
METHODS OF FEATURE EXTRACTION OF URLs
4.1 Algorithm:
The algorithm described below determines whether a website is genuine or a
phishing website; to do so, its features are classified by a set of parameters.
Features are extracted from the text of the website in terms of words, phrases and
letters and are verified against the database mentioned below.
1. During pre-processing, the dataset is prepared; it includes the set of words and
phrases.
2. Once pre-processing completes, the features are extracted from the URL.
6. Distinguish phishing and legitimate sites using the attribute values and comparing them with the
dataset.
4.1.1. Address bar-related features: The features which are related to the address of
a URL are referred to as address bar-related features. These include the length of the
host URL, number of dots and slashes, special characters, HTTP and SSL check,
@ symbol, and IP address.
a. Length of the host URL: A URL is an alphanumeric string which is used to access
network resources on the World Wide Web (WWW). The URL is a combination of
network protocol, hostname and path. The length of the hostname of a URL is one of
the key features to be extracted when detecting phishing URLs.
b. Number of dots and slashes: Sometimes, a URL consists of multiple domains.
Subdomains are parts of the domain name which further narrow down the hierarchy of
the Domain Name System. The number of dots and slashes in a URL
indicates the number of subdomains in the URL, which is used to verify whether the URL is
phishing, legitimate or suspicious [12].
d. HTTP with SSL Check: Certain URLs use Transport Layer Security to protect
against attacks. The HTTPS protocol adds a security layer so that sensitive
information can be transferred across the network securely. So, in order to
determine whether a URL is legitimate or not, parameters such as the use of HTTPS, the authenticity
of the certificate and the age of the certificate play a vital role.
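The address bar-related features of 4.1.1 can be sketched with the Python standard library. This is an illustrative sketch, not the implementation used in this work: the function name address_bar_features and the dictionary keys are assumptions, and the certificate authenticity and certificate age checks are left out because they require a network connection.

```python
import ipaddress
from urllib.parse import urlparse

def address_bar_features(url: str) -> dict:
    """Extract simple address-bar features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        # Succeeds only when the host is a literal IPv4/IPv6 address.
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "host_length": len(host),          # length of the hostname
        "num_dots": url.count("."),        # dots in the whole URL
        "num_slashes": url.count("/"),     # slashes in the whole URL
        "has_at_symbol": "@" in url,       # presence of the @ symbol
        "host_is_ip": host_is_ip,          # IP address used instead of a name
        "uses_https": parsed.scheme == "https",
    }

features = address_bar_features("http://192.168.0.1/login/index.html")
for name, value in features.items():
    print(f"{name}: {value}")
```

How each value is mapped to phishing, suspicious or legitimate depends on the rules and thresholds chosen for the dataset; none are fixed by this sketch.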
4.1.2. Abnormal based features: The URL features which relate to anomalies or
discrepancies between the W3C objects and Web Identity are known as abnormal
based features. These features are mostly related to the source code of the web page
and play an important role in identifying phishing websites. They
include Request URL (RURL), URL of an anchor (AURL) and Server Form
Handler (SFH).
a. Request URL: For most legitimate websites, external objects such as
external scripts, CSS, images and other attachments are tied to the site's own domain. So,
the Request URL feature can be used to categorize websites by checking
whether the external files are linked to the original domain or not.
b. Anchor of a URL: This feature is similar to the Request URL. It verifies
that all the anchors in a specific web page point to the same domain as the
web page itself. In this way, all the anchors in a tested URL can be checked to determine
whether that particular website is a phishing site.
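The Request URL and anchor checks can be illustrated as the share of links on a page that point to a different host. The sketch below uses only the standard library; the class LinkCollector, the function external_share and the example page and hosts are assumptions made for illustration, and no decision threshold is implied.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect resource URLs (img/script src, link href) and anchor hrefs."""

    def __init__(self):
        super().__init__()
        self.resources = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])
        elif tag == "a" and attrs.get("href"):
            self.anchors.append(attrs["href"])

def external_share(page_url: str, links: list) -> float:
    """Percentage of links whose host differs from the host of the page."""
    if not links:
        return 0.0
    page_host = urlparse(page_url).hostname
    external = sum(
        1 for link in links
        if urlparse(urljoin(page_url, link)).hostname != page_host
    )
    return 100.0 * external / len(links)

# Hypothetical page used only to exercise the functions.
page_url = "http://shop.example/index.html"
html = """
<img src="/logo.png">
<script src="http://cdn.other.example/app.js"></script>
<a href="/about">About</a>
<a href="http://evil.example/login">Login</a>
"""
collector = LinkCollector()
collector.feed(html)
print("Request URL share:", external_share(page_url, collector.resources))
print("Anchor URL share:", external_share(page_url, collector.anchors))
```

Relative links are resolved against the page URL before comparison, so "/logo.png" counts as belonging to the page's own domain.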
c. Server Form Handler: For any web page or website which needs authentication
or authorization, a form with username and password must be filled in to
access that particular site. The server form handler serves as an important feature to
differentiate phishing sites from legitimate websites. It takes the
following form:
<form action="/login/login.jsp" method="post" target="_login">
The above form tag is handled by a server-side script to perform an action based
on the user’s navigation. The “action” attribute in the above tag gives the path of the handler and
“method” gives the HTTP method used in handling the page request.
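Reading the form handler out of a page can be sketched as follows. The class name FormCollector is an assumption; the sketch only extracts the action and method of each form, and how an empty or foreign action is scored is left to the rules of the detection system.

```python
from html.parser import HTMLParser

class FormCollector(HTMLParser):
    """Record the action and method of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attrs = dict(attrs)
            self.forms.append({
                # A missing action is stored as an empty string.
                "action": attrs.get("action") or "",
                # HTML forms default to GET when no method is given.
                "method": (attrs.get("method") or "get").lower(),
            })

collector = FormCollector()
collector.feed('<form action="/login/login.jsp" method="post" target="_login"></form>')
print(collector.forms)
```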
4.1.3. HTML and JavaScript based Features: The features which are related to
HTML tags and JavaScript functions are treated as HTML and JavaScript based
features. These features include Redirect Page, Disabling Right Click, Using Pop-up
Window and onMouseOver.
a. Redirect Page: Redirecting a web page is a technique of sending web users
to a webpage other than the one requested. Many attackers use open redirect
pages found on the web to redirect the link to their illegitimate sites. So, the
number of redirects used in a web page helps determine whether the website is legitimate
or not.
c. Right Click: This feature is similar to the onMouseOver feature. Attackers use
JavaScript code to hide the source code from web users by disabling the right-click
function. This feature helps distinguish phishing websites from
legitimate websites.
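A simple way to look for these script-based features is to search the page source for typical idioms. The patterns below are assumptions that cover only a few common forms (for example oncontextmenu="return false" and an onmouseover handler that rewrites window.status); a real page may disable right click in other ways, so this is a sketch rather than a complete detector.

```python
import re

# Hypothetical patterns for common right-click-disabling idioms.
RIGHT_CLICK_PATTERNS = [
    re.compile(r'''event\.button\s*==\s*2'''),
    re.compile(r'''oncontextmenu\s*=\s*["']?\s*return\s+false''', re.IGNORECASE),
]
# Hypothetical pattern for an onmouseover handler that changes the status bar.
STATUS_BAR_PATTERN = re.compile(
    r'''onmouseover\s*=\s*["'][^"']*window\.status''', re.IGNORECASE
)

def js_features(html: str) -> dict:
    """Flag right-click disabling and status-bar rewriting in page source."""
    return {
        "disables_right_click": any(p.search(html) for p in RIGHT_CLICK_PATTERNS),
        "rewrites_status_bar": bool(STATUS_BAR_PATTERN.search(html)),
    }

suspicious_page = (
    '<body oncontextmenu="return false">'
    '<a href="#" onmouseover="window.status=\'http://bank.example\'">x</a>'
    '</body>'
)
print(js_features(suspicious_page))
print(js_features("<body><p>hello</p></body>"))
```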
4.1.4. Domain based features: The URL features related to domain name-based
information are known as domain based features. These features include Alexa Page
Rank, Age of the Domain, DNS Record and Website Traffic.
a. Alexa Page Rank: The rank of a website indicates its popularity,
that is, how many users access it. Attackers maintain
phishing URLs only for a limited amount of time; once the links or URLs expire, they
no longer appear on the Internet. The Alexa page rank is one of the many features
used to detect malicious URLs.
b. Age of the Domain: This feature gives the approximate age of the
domain of a website. The older the domain, the more likely it is to be legitimate. So, this
feature is used to detect whether a website is legitimate, suspicious or phished
based on the domain age extracted from the WHOIS database.
c. DNS Record: DNS records are mapping entries which inform the DNS server
about the IP address associated with each website on the Internet. Many legitimate
websites have the owner of the domain and the date and time of creation on record, which
differentiates legitimate sites from phished websites.
d. Website Traffic: In general, legitimate websites are visited frequently and
therefore have high traffic. In contrast, phishing sites
can often be identified by their website traffic, as they have little or no web traffic.
Description: This rule indicates that when special characters such as a dash, underscore,
comma or semicolon are part of the input URL, it is treated as a phishing URL.
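The special-character rule can be written as a one-line check. This is a minimal sketch: the name has_special_chars is an assumption, and only the four characters named in the description are tested.

```python
# Characters named in the rule: dash, underscore, comma, semicolon.
SPECIAL_CHARS = "-_,;"

def has_special_chars(url: str) -> bool:
    """Return True when the URL contains any of the listed special characters."""
    return any(ch in SPECIAL_CHARS for ch in url)

print(has_special_chars("http://secure-login.example/"))
print(has_special_chars("http://example.com/"))
```

Under the rule as described, a URL for which the check returns True would be marked as phishing.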
Description: Legitimate URLs have high traffic because of frequent visits
by users. If a URL is not visited frequently, it is marked as phishing.
web resources. One can assume that these tags are linked to the same domain as the
webpage. So, if the percentage of “<Meta>”, “<Script>”, and “<Link>” tags is less
than 17%, the website is considered legitimate. It is considered suspicious if the
percentage is greater than 17% but less than 81%. If the
percentage exceeds 81%, the website falls in the category of phishing.
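The thresholds of this rule translate directly into a small function. The function name is an assumption, and so is the treatment of the exact boundary values 17 and 81, which the rule does not specify; they are counted as suspicious here.

```python
def classify_tag_percentage(percentage: float) -> str:
    """Map the tag percentage to a category using the 17% and 81% thresholds."""
    if percentage < 17:
        return "legitimate"
    if percentage <= 81:
        # Assumption: the boundary values 17 and 81 are treated as suspicious.
        return "suspicious"
    return "phishing"

for value in (10, 50, 90):
    print(value, classify_tag_percentage(value))
```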
CHAPTER 5
REQUIREMENT ANALYSIS AND SPECIFICATION
5.1 Techniques
Research methodology defines how the development work should be carried out in the
form of research activity. Research methodology can be understood as a tool that is
used to investigate an area: data is collected and analyzed, and conclusions are
drawn on the basis of the analysis. There are three types of research, i.e. quantitative,
qualitative and mixed approach, as defined in.
CHAPTER 6
PROPOSED WORK
The proposed method creates a database of website features that are classified by
determining the input and output parameters through a learning mechanism. The learning
mechanism shows better results when identifying phishing websites through the Support Vector
Machine and Naive Bayes classifiers. The design of this learning mechanism was
identified as the best-performing classification against the major phishing activity for
various websites. In the research conducted, this learning mechanism
provided higher accuracy and test performance in identifying legitimate websites.
It uses a feed-forward neural network with a single
hidden layer for classification, which need not be tuned continuously. Its tuning depends on
the classification pattern; its hidden nodes are randomly assigned
values that never change. These hidden
nodes are used to build a linear model that is learned in a single step. The
automatically generated model achieves good generalization performance and is able to learn
thousands of times faster than expected.
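The mechanism described above (random hidden nodes that are never updated, followed by a linear model fitted in one step) resembles what is usually called an extreme learning machine. The sketch below assumes that reading and assumes NumPy is available; the function names, the tanh activation, the number of hidden nodes and the toy data are illustrative choices, not details taken from this work.

```python
import numpy as np

def train_elm(X, y, n_hidden=20, seed=0):
    """Fit a single-hidden-layer network with fixed random hidden nodes."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never tuned
    b = rng.normal(size=n_hidden)                # random hidden biases, never tuned
    H = np.tanh(X @ W + b)                       # hidden-layer outputs
    beta = np.linalg.pinv(H) @ y                 # output weights from one least-squares solve
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Return +1 or -1 for each row of X."""
    return np.sign(np.tanh(X @ W + b) @ beta)

# Toy data: two binary features with labels +1 / -1 (not linearly separable).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])

W, b, beta = train_elm(X, y)
print(predict_elm(X, W, b, beta))
```

Only beta is learned; because it comes from a single pseudo-inverse rather than iterative weight updates, training is a one-step computation.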
Feature Extraction: In this step, all the relevant features of the URLs are extracted;
these are used to differentiate between phishing URLs and legitimate URLs.
URL features are classified into four groups: Address-bar based features,
Abnormal features, HTML and JavaScript based features, and Domain based
features.
Predictive Analysis: The features extracted in the previous step are
subjected to different heuristics. A total of 30 features are used to determine
whether a URL is phished, suspicious or legitimate. Based on the features
extracted, the proposed rules are applied in order to categorize a URL.
Evaluation: The results of the classification are evaluated and the user is notified
whether the given URL is phished or legitimate.
CHAPTER 7
IMPLEMENTATION AND TESTING
Step 2:
Start the dataset run
During this process, the dataset available for classifying the URL features
is loaded and a model is created for comparison of each and every letter of a URL.
Dataset creation makes use of various words and sequences in
sentences, so that a larger dataset makes higher accuracy
possible.
Step 3:
URL inserted
In this process, the URL entered by the user is read and extraction of
its features begins in terms of heuristic patterns.
Step 4:
Feature Extraction
During extraction, each heuristic feature of the URL is gathered and compared with
the dataset we created, using the KNN classifier.
Feature extraction breaks the URL down into
distinguishable words and phrases.
Model Creation
The model is trained on the available dataset and is
then used for further comparison.
KNN Classifier
The classifier used in the project is the KNN classifier, which
predicts the similarity of the words and letters in the input URL to those in the dataset.
Step 5:
Comparison with dataset
Once the inserted URL has been compared with the dataset, the result is shown based on the model
comparison. In this method the KNN algorithm is used to find the
nearest matches among authentic and phishing websites.
Step 6:
Identification
Once the system identifies a URL as a phishing website, it is blacklisted against access in
future.
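Steps 4 to 6 can be illustrated with a small K-nearest-neighbour classifier. This is a sketch under stated assumptions: the two numeric features (hostname length and number of dots), the training points, the Euclidean distance and the names knn_predict and blacklist are all hypothetical and stand in for the project's actual dataset and features.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Label a query by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (hostname length, number of dots) -> label.
train = [
    ((12, 1), "legitimate"), ((15, 2), "legitimate"), ((14, 2), "legitimate"),
    ((45, 6), "phishing"), ((52, 7), "phishing"), ((48, 5), "phishing"),
]

blacklist = set()
url, features = "http://long.suspicious.host.example/", (50, 6)
label = knn_predict(train, features, k=3)
if label == "phishing":
    blacklist.add(url)  # Step 6: remember the URL so it can be blocked later
print(label, blacklist)
```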
Fig. 7.2. KNN Diagram 2
It can be observed that the boundary becomes smoother with
increasing value of K. As K increases to infinity, the prediction finally becomes all blue or all
red depending on the overall majority. The training error rate and the validation error rate
are two parameters we need to assess for different values of K. Following is the curve for
the training error rate with varying value of K:
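The behaviour of the training error as K grows can be illustrated on a toy data set. The points, labels and function names below are assumptions chosen for illustration: three blue points, three red points, and one red point placed inside the blue cluster.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def training_error(train, k):
    """Fraction of training points that are misclassified for a given K."""
    wrong = sum(1 for x, label in train if knn_predict(train, x, k) != label)
    return wrong / len(train)

train = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "blue"),
    ((6.0, 6.0), "red"), ((6.5, 7.0), "red"), ((7.0, 6.0), "red"),
    ((1.5, 1.5), "red"),  # a red point inside the blue cluster
]

for k in (1, 3, 7):
    print(f"K={k}: training error = {training_error(train, k):.3f}")
```

With K=1 every training point is its own nearest neighbour, so the training error is zero. With K equal to the size of the data set every point receives the overall majority label (red here), which is the "all blue or all red" case described above.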
CHAPTER 8
RESULT AND DISCUSSION
8.1 Result
8.1.1 Signup/Registration
This page is used to register with the system by entering the details of an individual,
including the user ID and password.
rts-authority-of-india-recruitment.html
CHAPTER 9
CONCLUSION
9.1 Conclusion
Phishing websites mainly retrieve users’ information through login pages. They are
interested in the bank details of the users. Out of the many features considered to
detect a phishing website, the most important one is HTTPS with SSL, i.e. whether the
website uses HTTPS, whether the issuer of the certificate is trusted, and whether the certificate
is at least one year old. In this regard, the Phishing Website Dataset was used to
evaluate the accuracy of phishing detection with four classifier
algorithms (KNN, RBF-SVM, Decision Tree and Random Forest). The accuracy of
K-Nearest Neighbors is 95.20%, RBF (Radial Basis Function) Support Vector Machine
94.70%, Decision Tree 91.94% and Random Forest 87.74%. Hence, we
conclude that the K-Nearest Neighbors classifier performs best in terms of accuracy among
the four classifier algorithms.
The testing is done on legitimate websites as well as malicious websites
collected from PhishTank. The testing is done on combinations of multiple heuristics as
well as on individual heuristics to ensure that the system functions efficiently. From the
set of URLs tested, the majority of the URLs were classified correctly by the
system. The evaluation of the system is done using a confusion matrix which lists the
True Positives, True Negatives, False Positives and False Negatives. Once all this
information is collected, the precision and recall are calculated for the system. The
precision and recall vary with the heuristics selected by the user. To obtain
better precision and recall, the false positives and false negatives must be reduced, which
will also improve the accuracy of the classification.
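The evaluation described above can be sketched as follows. The function names and the eight example labels are assumptions used only to show how precision, recall and accuracy follow from the confusion matrix; they are not results of this work.

```python
def confusion_counts(actual, predicted, positive="phishing"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision_recall_accuracy(tp, fp, tn, fn):
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy

# Hypothetical labels for eight tested URLs.
actual = ["phishing", "phishing", "phishing", "legitimate",
          "legitimate", "legitimate", "legitimate", "phishing"]
predicted = ["phishing", "phishing", "legitimate", "legitimate",
             "legitimate", "phishing", "legitimate", "phishing"]

counts = confusion_counts(actual, predicted)
print("TP, FP, TN, FN =", counts)
print("precision, recall, accuracy =", precision_recall_accuracy(*counts))
```

Reducing FP raises precision and reducing FN raises recall; both changes also raise accuracy, since TP + TN grows while the total stays fixed.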
CHAPTER 10
LIMITATION AND FUTURE SCOPE
10.1 Limitations
Although it is important to protect our data at any cost, the
detection system and the technology used in the process have some disadvantages and
limitations.
The following techniques impose some limitations on the detection
process:
phishing beside, but our survey suggests that it will be better to use a hybrid approach for
prediction in order to further improve the prediction accuracy for phishing websites.
Malicious URLs are created every day, and attackers use techniques
to fool users and modify URLs to attack. Nowadays deep learning and machine
learning methods are used to detect phishing attacks. Classification methods such as
RF, SVM, C4.5, DT, PCA and k-NN are also common. These methods are the most useful
and effective for detecting phishing attacks. Future research can address a more
scalable and robust method, including smart plug-in solutions to tag/label whether a
website is legitimate or leading towards a phishing attack.
REFERENCES
[1] Ishant Tyagi, Jatin Shad, Shubham Sharma, “A Novel Machine Learning
Approach to Detect Phishing Websites”, in 5th International Conference on
Signal Processing and Integrated Networks (SPIN), 2018.
[2] Sneha Mande and D. S. Thosar, “Detection of Phishing Web Sites Based On
Extreme Machine Learning”, Vol-4 Issue-6 2018, IJARIIE-ISSN(O)-2395-4396,
9322.
[3] Ebubekir Buber, Önder Demir and Ozgur Koray Sahingoz, "Feature Selections for
the Machine Learning based Detection of Phishing Websites", 2017 International
Artificial Intelligence and Data Processing Symposium (IDAP), 2017.
[4] M. Volkamer, K. Renaud, B. Reinheimer and A. Kunz, "User experiences of
TORPEDO: TOoltip-poweRed Phishing Email DetectiOn", Computers &
Security (2017).
[5] P. Singh, Y.P.S Maravi and S. Sharma, "Phishing websites detection through
supervised learning networks", IEEE International Conference on Computing and
Communications Technologies (ICCCT), pp. 61-65, 2015.
[6] A. A. Ahmed and N. A. Abdullah, "Real time detection of phishing websites", 7th
IEEE Annual Information Technology Electronics and Mobile Communication
Conference IEEE IEMCON, 2016.
[7] Z. Dan Dong, A. Kapadia, J. Blythe and L. J. Camp, "Beyond the Lock Icon:
Real-Time Detection of Phishing Websites Using Public Key Certificates", IEEE
APWG Symposium on Electronic Crime Research, pp. 1-12, May 2015.
[8] Mustafa Aydin and Nazife Baykal, "Feature Extraction and Classification
Phishing Websites Based on URL", IEEE International Conference on
Communications and Network Security (CNS), pp. 769-770, 2015
[9] S. Marchal, J. Francois, R. State and T. Engel, "PhishScore: hacking phishers’
minds", proceedings of the 10th International Conference on Network and
Service Management 2014 (CNSM 2014), vol. 11, no. 4, pp. 458-471, 2014.
[10] Luong Anh Tuan Nguyen, Ba Lam To, Huu Khuong Nguyen and Minh Hoang
Nguyen, "A novel approach for phishing detection using URL-based
heuristic", IEEE International Conference on Computing Management and
Telecommunications (ComManTel), pp. 298-303, 2014.
[11] Zheng Dong, Apu Kapadia, Jim Blythe and L. Jean Camp, “Beyond the Lock
Icon: Real-time Detection of Phishing Websites Using Public
Key Certificates”, 2015.
[12] Samuel Marchal, Jérôme Francois, Radu State, Thomas Engel, “PhishScore:
Hacking Phishers’ Minds”, in 10th CNSM and Workshop at 2014 IFIP
[13] A. Belabed, E. Aïmeur, A. Chikh, “A personalized whitelist approach for phishing
webpage detection”, in 2012 Seventh International Conference on Availability,
Reliability and Security.
[14] Nuttapong Sanglerdsinlapachai, Arnon Rungsawang, “Using Domain Top-page
Similarity Feature in Machine Learning-based Web Phishing Detection”, in
2010 Third International Conference on Knowledge Discovery and Data Mining.
[15] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: a content-based approach
to detecting phishing web sites,” in The 16th international conference on World
Wide Web, 2007, pp. 639–648.
[16] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, “Cantina+: a feature-rich machine
learning framework for detecting phishing web sites,” ACM Transactions on
Information and System Security, vol. 14, no. 2, pp. 1–28, Sept. 2011.
[17] M. E. Maurer and D. Herzner, “Using visual website similarity for phishing
detection and reporting,” in CHI ’12 Extended Abstracts on Human Factors in
Computing Systems, 2012, pp. 1625–1630.
[18] A. Sunil and A. Sardana, “A pagerank based detection technique for phishing web
sites,” in IEEE Symposium on Computers & Informatics, 2012, pp. 58–63.
[19] M. G. Alkhozae and O. A. Batarfi, “Phishing websites detected based on phishing
characteristic in the webpage source code,” in International Journal of Information
and Communication Technology Research, vol. 1, no. 6, Oct. 2011, pp. 283–291.
[20] L. A. T. Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen, “A novel approach
for phishing detection using url-based heuristic,” in IEEE International
Conference on Computing, Management and Telecommunications
(ComManTel), 2014, pp. 298–303.
[21] G. P. Zhang, “Neural networks for classification: a survey,” in Systems, Man, and
Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, Vol 30,
2000; S. H. Liao and C. H. Wen, “Artificial neural networks classification
and clustering of methodologies and applications - literature analysis from 1995 to
2005,” in Expert Systems with Applications, vol. 32, 2007, pp. 1–11.
[22] Anti-Phishing Working Group, “Phishing Activity Trends Report,” 2014.
[23] S. Sheng, M. Holbrook, P. Kumaraguru, L. F. Cranor, and J. Downs, “Who
through Game Play,” in 2014 IEEE World Congress on Services, 2014, pp. 113–
120.
[36] P. Singh, Y. P. S. Maravi, and S. Sharma, “Phishing websites detection through
supervised learning networks,” in 2015 International Conference on Computing
and Communications Technologies (ICCCT), 2015, pp. 61–65.
[37] Y.-S. Chen, Y.-H. Yu, H.-S. Liu, and P.-C. Wang, “Detect phishing by checking
content consistency,” in Proceedings of the 2014 IEEE 15th International
Conference on Information Reuse and Integration (IEEE IRI 2014), 2014, pp.
109–119.
[38] L. Wu, X. Du, and J. Wu, “MobiFish: A lightweight anti-phishing scheme for
mobile phones,” in 2014 23rd International Conference on Computer
Communication and Networks (ICCCN), 2014, pp. 1–8.
[39] S.-S. Tseng, C.-H. Ku, A.-C. Lu, Y.-J. Wang, and G.-G. Geng, “Building a Self-
Organizing Phishing Model Based upon Dynamic EMCUD,” in 2013 Ninth
International Conference on Intelligent Information Hiding and Multimedia
Signal Processing, 2013, pp. 509–512.