Classification of Phishing Website Using Hybrid Machine Learning Techniques

Cyber security faces the problem of scam websites that steal information by exploiting people's trust. Phishing amounts to enticing internet users into giving up their personal data, including user names and passwords. In this study, we present a method for identifying phishing websites. The technology works as an add-on to a web browser, alerting the user when it finds a phishing website.
Volume 8, Issue 7, July – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Classification of Phishing Website Using Hybrid Machine Learning Techniques

T. Pavansai, Vignan University, Information Technology, Guntur, AP
Ziaul Haque Choudhury, Vignan University, Information Technology, Guntur, AP
G. Gowtham Sai, Vignan University, Information Technology, Guntur, AP

Abstract:- Cyber security faces the problem of scam websites that steal information by exploiting people's trust. Phishing amounts to enticing internet users into giving up their personal data, including user names and passwords. In this study, we present a method for identifying phishing websites. The technology works as an add-on to a web browser, alerting the user when it finds a phishing website. A machine learning technique, specifically supervised learning, is proposed in our study. Logistic regression, Principal Component Analysis (PCA) and the Apriori algorithm are chosen because of their success in classification. By examining the characteristics of phishing websites and selecting the strongest combination of them, we developed a classifier that performs better.

Keywords:- Phishing Website, Cyber Security, Machine Learning.

I. INTRODUCTION

In the modern world, technologies have merged completely. One of them, advancing quickly each day and with a significant effect on people's lives, is the web and the internet. It has evolved into a valuable and handy platform for public transactions such as e-banking and e-commerce. As a result, users now find it convenient to give their private information to the internet. A significant security issue has arisen because thieves have started to target this material. One such issue is phishing sites. They rely on social engineering, which may be characterized as con artists trying to influence the consumer into providing personal information. According to the Anti-Phishing Working Group (APWG) [1] and McAfee Labs [2], the frequency of phishing assaults is rising, posing a threat to user data; they reported an increase of 47.48% compared to all phishing attacks discovered in 2016.

Internet-connected devices and their services are becoming increasingly widely used all over the world as a result of technical advancement. IoT devices, despite being regarded as novel technology, also draw attention to other web system security challenges. Several efforts have been made to address these difficulties, and machine learning methods are frequently used in them [1–3]. An attack known as phishing tricks victims into accessing malicious files and divulging personal details. The majority of fake sites mimic the domain and web experience of trustworthy websites. There is a great need for an intelligent plan to protect customers from cyber-attacks [3].

A person who clicks on a phishing link is redirected to the fake website. After taking the victim's information, the attacker uses it to gain access to other official websites. Several alternative detection procedures have been developed in the literature to identify this kind of phishing attempt. The simplest strategy is signature-based (rule-based) detection [4]. In this method the signatures of known phishing assaults are listed. For the purpose of detecting attacks, such a signature might be a description of the URL addresses.

Many studies have recently been conducted to address the phishing issue. Some researchers compared the URL with existing blacklists of harmful websites that they have been developing, while others used the URL the other way, comparing it with a whitelist of trustworthy websites [4]. The latter strategy makes use of heuristics and a database of signatures. Additionally, some studies have used machine learning methods. Machine learning is a subfield of artificial intelligence (AI) in which programs carry out tasks with the capacity to learn or behave intelligently. Its two main branches are supervised learning and unsupervised learning. In supervised learning, a model is trained on a collection of measurable characteristics of the data, each linked to a target label. Once the classifier is trained, it can assign a label to unseen data. Unsupervised learning, in contrast, learns from the data without a target label being provided during the training phase.

Phishing is among the main problems of data security. Users may click on links that take them straight to a fake website, or they may receive a malicious email that links to the phoney website. The two approaches have one thing in common: rather than technical flaws, the attacker targets human weaknesses [3]. Phishing is the practice of fraudsters tricking victims into divulging their personal information, including usernames,

IJISRT23JUL950 www.ijisrt.com 1385


passwords, and credit card numbers. Users are experiencing economic and financial troubles as a result of these frauds.

In the 1990s, phishers set up fraudulent profiles on America Online (AOL), a corporation that offered an online service and had a web portal, using false names and counterfeit credit cards. In this way the phishers could use its services at no expense to themselves. In the middle of the 1990s, AOL upgraded its anti-phishing system. The phishers then switched to a different strategy, hijacking legitimate accounts by impersonating an AOL representative and asking customers to surrender their passwords for safety. Both emails and text messaging were used for this. The signature-based technique compares any attack pattern with the known signatures stored in a database of attack signatures. It has the drawback of failing to identify novel threats, since signatures may be easily evaded through obfuscation. Considering the rise in novel assaults, particularly zero-day threats, upgrading the detection system is also a laborious process [7]. To identify phishing websites, content analysis uses well-known methods such as term frequency-inverse document frequency (TF-IDF). It examines the text content of each page on the site to determine whether the website is a scam. Other techniques used by researchers to identify phishing websites include monitoring site traffic with Alexa. Machine learning makes use of predictive capability: after learning the traits of fake websites, it makes predictions about new phishing sites. There are various techniques, including artificial neural networks (ANN), support vector machines (SVM), logistic regression, naive Bayes (NB) and Bayesian networks (BN). Phishing detection performance differs from one algorithm to another.

In this study, we describe a method for recognizing phishing URLs using ML techniques. A recurrent neural technique operating on URLs is used to identify phishing websites. Our work aims to increase cyber-attack detection rates by offering good performance with low false-negative and false-positive rates as phishing schemes grow more prevalent. False-negative sites are phishing sites that are misidentified as authentic websites, whereas false-positive sites are legitimate websites that are misidentified as phishing. Figure 1 shows a straightforward description of phishing. Phishing begins when a client opens an email and follows a link to a website, or follows an outside connection, for example a pop-up or advertisement [5].

A. Problem statement
When new phishing strategies are launched, phishing detection solutions suffer from low detection quality and high rates of false alarms. Additionally, because registering new domains has become simpler, the most popular methodology, the blacklist method, is ineffective at reacting to the rising number of phishing assaults. No blacklist can guarantee a complete and fully up-to-date dataset. In addition, several solutions have made use of page content analysis to address the false-negative issue and to compensate for the weaknesses of stale lists. Different web page analysis techniques each employ a unique approach to accurately identifying malicious URLs. Because aggregation may combine the overall strengths in correctness and the various probability qualities of the chosen methods, it can be viewed as a better option. As a result, the following research will be included in this study.

II. LITERATURE SURVEY

Anti-phishing tools are employed to stop consumers from visiting shady websites, which might result in phishing attacks. AntiPhish tracks the sensitive information that the user fills in and warns the user whenever he or she tries to give that information to an untrusted website. The most effective solution would be to encourage consumers to visit only reliable websites. This strategy, nevertheless, is impractical, because the user can be duped in any case [4]. As a result, it is necessary for the parties involved to provide such solutions in order to combat the phishing issue. Widely recognized options rely on examining suspicious websites to identify "clones" and on keeping track of malicious sites on a blacklist. Another option for identifying these attacks is a machine learning procedure based on features that reflect how the targeted user is deceived through electronic communication. This method may be used to identify phishing emails or texts that are used to capture victims. Roughly 800 phishing emails and 7,000 non-phishing emails were examined, and over 95% of the phishing emails were correctly identified, while 0.09% of the legitimate emails were misclassified. We can conclude only with the techniques for spotting the deception and the evolving nature of the attacks.

For phishing websites aimed at e-banking providers [7], identification and classification are quite dynamic and intricate. Because detection involves various uncertainties and weighs numerous quality criteria rather than precise numbers, data mining techniques may offer an excellent way to keep e-commerce websites safe [8]. This research proposes an intelligent, robust and efficient model for detecting e-banking phishing websites in order to overcome the "fuzziness" in the evaluation of e-banking phishing. The model uses data mining methods and fuzzy logic to take into account the many elements that characterize an e-banking phishing website. Here, two methods are described for correlating data from several DNS servers and numerous suspect fast-flux (FF) domains. Real-world examples may be used to demonstrate how these correlation techniques, which are based on an analytical solution that quantifies the number of DNS queries needed to validate an FF domain, speed up the identification of FF domains [9].

The publish/subscribe correlation model, also called LARSID, illustrates how correlation schemes may be implemented on a large scale using a distributed architecture that is more scalable than a centralized one. Since the FF mothership is protected by a proxy layer, it is quite challenging to identify FF domains accurately and quickly. As a theoretical approach to FF detection, the number of DNS requests necessary to obtain a specific number of distinct IP addresses is calculated. Several models are offered for various locations due to the variance

in the allocation of attributes in the various phishing areas [10]. Gaining sufficient information from a new location to retrain the detection algorithm is almost impossible, so a transfer learning technique is used to adapt the current model. Use of a URL-based technique is a suitable strategy for phishing detection [6]. We must use the transfer approach to create a more effective model in order to deal with all the conditions in which feature identification fails. A comparative study of the classifiers' model-based features is shown in Table 1.

III. PROPOSED SYSTEM

We propose a novel anti-phishing technique, based on machine learning, that is intended to provide strong protection. The method handles both URLs (Uniform Resource Locators) and URIs (Uniform Resource Identifiers), checks them using machine learning, and predicts whether or not they are phishing websites. Here, a web application for checking entered URLs is developed. Every time we visit a website, the associated URL is verified using a machine learning algorithm. We employed logistic regression, Principal Component Analysis and the Apriori algorithm to construct our trained model; in the end, our system chose logistic regression since it provided a more precise estimate. Phishing is a widespread kind of fraud in which a malicious website poses as a legitimate one with the intention of obtaining sensitive data, such as usernames, account login information, or credit card numbers. It is a deception technique that combines social engineering and technology to get private information by impersonating a reliable person or organization in a digital communication.

A. Logistic regression
Logistic regression is employed as a classification algorithm whenever the response variable (output) is binary, such as 0 (False) or 1 (True). Because of this, logistic regression is an effective technique for our goal of determining whether a URL is a phishing URL (1) or not (0).

Logistic regression extends the linear regression model. Let's use a clear example to better grasp this. If we used a linear regression model to determine whether a message is spam or not, we would only obtain continuous scores such as 0.4 or 0.7. Logistic regression, in contrast, extends this linear regression model by establishing a cutoff at 0.5: a data point is classed as spam if the output value is larger than 0.5 and as not spam if it is less than 0.5. By applying logistic regression to classification problems, we obtain definite class predictions.

Statisticians first employed the logistic function, also known as the sigmoid function, to characterize population growth in ecology. The predicted values are transformed to probabilities using this mathematical tool. The S-shaped curve of the logistic function takes values between 0 and 1, but never exactly at those boundaries.

Logistic(n) = 1/(1 + exp(−n))    (1)

Fig 1: logistic regression

B. Principal Component Analysis
Principal component analysis (PCA) is an unsupervised learning approach used in machine learning to reduce dimensionality. It is a statistical procedure that uses an orthogonal transformation to convert observations of correlated features into a set of linearly uncorrelated features. These newly transformed features are the principal components. It is one of the widely used tools for exploratory data analysis and predictive modelling. It identifies significant patterns in the dataset by reducing the number of dimensions while retaining as much of the variance as possible. PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional data.

The PCA algorithm is based on mathematical concepts such as:
• Variance and covariance
• Eigenvalues and eigenvectors

Mathematical equation of PCA: Z = X S^(−1)

• Variance and covariance
In statistics and probability theory, the concepts of variance and covariance are often used. Covariance is a measure of the directional relationship between two random variables, whereas variance is the dispersion of a data set around its mean value.

• Eigenvalues and eigenvectors
Eigenvalues are a special set of scalar values associated with a system of linear equations, most commonly seen in matrix equations. Eigenvalues are also called characteristic roots. An eigenvector is a non-zero vector that, after a linear transformation is applied, changes only by a scalar factor.
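The PCA ingredients named above (centering, the covariance matrix, and its leading eigenvector) can be illustrated with a minimal standard-library sketch. The toy data and the function names are assumptions made for illustration, and power iteration is used here only as a simple way to find the largest eigenvalue; the paper does not say how its principal components were computed.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every observation."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered observations."""
    n = len(centered)
    d = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

def project(centered, component):
    """Project each centered observation onto one principal component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]

# Hypothetical toy data: two strongly correlated features.
data = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.2], [8.0, 7.8]]
centered = center(data)
cov = covariance_matrix(centered)
eigenvalue, component = top_eigenpair(cov)
scores = project(centered, component)  # one number per observation instead of two
```

The variance of `scores` equals `eigenvalue`, which is the sense in which the first principal component keeps the direction of greatest variance while dropping a dimension.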

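Returning to Section III-A, the logistic function of Eq. (1) and the 0.5 cutoff can be sketched in a few lines. The `weights` and `bias` values are hypothetical placeholders, not parameters fitted in this study, and the two-branch form of `logistic` is only a numerical safeguard against overflow for very negative inputs.

```python
import math

def logistic(n: float) -> float:
    """Eq. (1): map any real score n to a probability strictly between 0 and 1."""
    if n >= 0:
        return 1.0 / (1.0 + math.exp(-n))
    e = math.exp(n)  # algebraically the same value, avoids overflow in exp(-n)
    return e / (1.0 + e)

def classify(score: float, cutoff: float = 0.5) -> int:
    """Return 1 (phishing) if the probability exceeds the cutoff, else 0."""
    return 1 if logistic(score) > cutoff else 0

def linear_score(features, weights, bias):
    """Weighted sum of the features, the input to the logistic function."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical weights for two illustrative URL features.
weights, bias = [1.5, -2.0], -0.25
label = classify(linear_score([1.0, 0.0], weights, bias))
```

With these placeholder weights the score is 1.25, the probability is above 0.5, and the example is labelled 1.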
• Apriori algorithm
The Apriori algorithm is designed to operate on databases of transactions and to construct association rules from frequent item-sets. These association rules establish how strongly or weakly two items are associated. The approach computes item-set associations efficiently by using a breadth-first search and a hash tree. Finding the frequent item-sets in a large dataset is an iterative procedure.
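A minimal sketch of this level-wise (breadth-first) search, followed by rule generation with confidence and lift, is given here. It is an illustration under stated assumptions rather than the implementation used in the study: the transactions are hypothetical, thresholds are treated as inclusive (at least the minimum), and a plain dictionary stands in for the hash tree.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item-set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Breadth-first search: size-1 item-sets first, then size 2, and so on."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        kept = {}
        for s in level:
            sup = support(s, transactions)
            if sup >= min_support:
                kept[s] = sup
        frequent.update(kept)
        k += 1
        # Join step: merge frequent sets into candidates of size k.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in kept for sub in combinations(c, k - 1))]
    return frequent

def association_rules(frequent, min_confidence):
    """Rules lhs -> rhs meeting the confidence threshold, sorted by falling lift."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), r):
                lhs = frozenset(lhs_items)
                rhs = itemset - lhs
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    lift = confidence / frequent[rhs]
                    rules.append((lhs, rhs, confidence, lift))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)

# Hypothetical transactions: each set lists the traits observed for one URL.
transactions = [
    {"long_url", "no_https", "phishing"},
    {"long_url", "no_https", "phishing"},
    {"long_url", "https"},
    {"no_https", "phishing"},
    {"https"},
]
frequent = frequent_itemsets(transactions, min_support=0.4)
rules = association_rules(frequent, min_confidence=0.8)
```

In this toy data the rule from `no_https` to `phishing` has confidence 1.0, because every transaction containing the first item also contains the second.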

• Apriori Algorithm Procedure:


The stages of the Apriori algorithm are as follows:
Step 1: Establish the minimum support and confidence for item-sets in the transactional database.
Step 2: Take all item-sets in the transactions whose support value is greater than the minimum (chosen) support value.
Step 3: Find all the rules over these subsets with confidence values greater than the threshold (minimum confidence).
Step 4: Sort the rules in decreasing order of lift.

IV. EXPERIMENTS

There are several traits and patterns that may be regarded as features of fake sites. In this section, we cover as many as feasible of the phishing web page features that were employed in past investigations. In addition, when examining the patterns and characteristics of phishing, we discovered a few fresh traits that qualify as features. There are 37 phishing features in all, of which 3 are new. We divide the features in Table I into three major categories as follows [9].

Features that can be extracted automatically from the URL.
Features that can be extracted from the web page.

Because the goal of a fake website is to collect sensitive data such as an e-mail address and a password, we use the number of e-mail input fields and the number of password input fields as additional features. The number of login or passcode inputs is thus regarded as a phishing website characteristic. Another novel feature is the number of icons. As we were researching phishing features, we discovered that many scam websites utilize.

A. Types of phishing websites

Fig. 2. Different types of attacks

• Benign
The opposite of malicious: harmless or well-intentioned. In Fig. 2, benign has the highest count, 400000.

• Defacement
Long after the hacker's message has been removed, the damage a defacement assault does to a website's identity and credibility remains as a visible sign that the website has been hacked. In Fig. 2, defacement has a count of 100000.

• Phishing
Phishing means stealing data from users through fake websites whose links are sent to the user by mail. In Fig. 2, phishing has a count of 100000.

• Malware
Malware is any program used to gain unauthorized access to IT systems in order to obtain data, obstruct system performance, or otherwise impair IT networks. Ransomware, with which hackers encrypt data or hold devices hostage until a ransom is paid, is considered a subset of malware. In Fig. 2, malware has a count of 50000.

• Abnormal URL
An extremely lengthy URL is produced by a hacker who attempts to attack the web email system's parsing process, since the system might not be designed to handle very long strings. This typically indicates a buffer overflow or denial-of-service attack.

Fig. 3. Abnormal Url

• HTTPS
HTTPS is a combination of HTTP with SSL/TLS. TLS is a popular authentication and security tool for web servers and browsers.
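As a hedged illustration of how two of the URL traits described above (abnormal length and use of HTTPS) could be turned into numeric features, consider the sketch below. The function name `url_features` and the 75-character threshold are assumptions made for the example; the paper does not state which threshold, if any, it used.

```python
from urllib.parse import urlparse

# Hypothetical cutoff for calling a URL "abnormally long"; not taken from the paper.
LONG_URL_THRESHOLD = 75

def url_features(url: str) -> dict:
    """Extract simple numeric features directly from the URL string."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "is_long": int(len(url) > LONG_URL_THRESHOLD),
        "uses_https": int(parsed.scheme == "https"),
    }

short = url_features("https://example.com/login")
padded = url_features("http://example.com/" + "a" * 200)
```

Each dictionary can then serve as one row of the feature table passed to a classifier.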

Fig. 4: HTTPS

• Shortening service
A URL shortening service is a third-party website that converts a lengthy URL into a short, case-sensitive alphanumeric code. Simply put, a URL shortening service reduces the number of characters in excessively long URLs (web addresses).

Fig 5. Shortening services

• Confusion matrix
A confusion matrix is used to show the performance of a classification system. It presents the result of a classification algorithm visually.

Fig. 6. Confusion Matrix

The total accuracy from the confusion matrix is 91.28%.

B. Accuracy of the phishing models

• Decision tree classifier
A decision tree is a supervised machine learning tool that may be used to categorize or predict data based on how a sequence of previous questions has been answered. The model undergoes supervised learning in that it is trained on a set of data containing the target category before being put to the test. In Fig. 7, the accuracy of the decision tree classifier is 0.91.

• AdaBoost classifier
AdaBoost, also known as Adaptive Boosting, is the boosting technique used among ML ensemble methods. At every iteration the weights are redistributed, with samples that were wrongly classified receiving higher weights; hence the name "adaptive boosting". In Fig. 7, the accuracy of the AdaBoost classifier is 0.82.

• K-Neighbors classifier
The K-Neighbors classifier looks for the five nearest neighbours. The classifier has to be explicitly told to use Euclidean distance to calculate how close neighbouring points are to one another. The trained model then classifies a URL according to the labels of its nearest neighbours. In Fig. 7, the accuracy of K-Neighbors is 0.89.

Fig. 7: Accuracy of models

• SGD classifier
The SGD classifier employs a simple stochastic gradient descent (SGD) learning routine that supports a variety of classification loss functions and penalties. Scikit-learn provides the SGDClassifier class to implement SGD classification. In Fig. 7, the accuracy is 0.82.

• Technique
We examine every possible combination of the 36 features in order to identify the best and worst features and to eliminate any unnecessary ones. This equation may be used to determine the number of combinations of any size:

Σ = n! / (k!(n − k)!)    (2)
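To give a sense of scale, Eq. (2) can be evaluated for n = 36 features and every subset size k from 1 to 36. The sketch below is only a worked calculation of that formula; the helper name `combinations_of_size` is chosen for illustration.

```python
import math

def combinations_of_size(n: int, k: int) -> int:
    """Eq. (2): number of ways to choose k features out of n."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

n = 36  # number of features stated in the text
per_size = {k: combinations_of_size(n, k) for k in range(1, n + 1)}
total = sum(per_size.values())  # all non-empty feature subsets, i.e. 2**36 - 1
```

The total is 2^36 − 1, roughly 6.9 × 10^10 subsets, which is why only the best and worst result for each k is retained rather than every combination.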

Here k is the number of selected features, ranging from 1 to 36, and n is the total number of features, which is 36. Because there is an enormous number of possible combinations, the search is condensed to obtaining the highest and the lowest result for every k. In conclusion, the optimum combination is the one with the greatest accuracy and the fewest features. The method for selecting features is illustrated in Fig. 1.

This system's major purpose is to determine whether a web page is legitimate or a phishing site and to classify it accordingly. The algorithm depicted in Fig. 2 can be used for this purpose. Every time the person visits a new web page, this algorithm is activated. Its job is to retrieve the web page features using the URL and the DOM objects. The URL is used to retrieve the URL features and the site rank, whereas the DOM, which is the interface between scripts and the page, is used to extract the remaining features in the table.

• Training and testing of model
The process of training the methods on a portion of the data and evaluating how well they do in accurately classifying the datasets is referred to simply as the "training and evaluation model". Logistic regression, loaded as a classifier from the scikit-learn module, is used to build the model.

• Regression Analysis
Different URLs may now be provided to the trained model as inputs. It makes a prediction about the quality of the URL and outputs good or bad. First, load the logistic regression package, then use the LogisticRegression class to build a logistic regression classifier instance.

Table 1. Models and their accuracy:

No.  Model                      Accuracy
1    Decision Tree Classifier   0.909528
2    Random Forest Classifier   0.914749
3    AdaBoost Classifier        0.820077
4    K-Neighbors Classifier     0.890409
5    SGD Classifier             0.820591
6    Extra Trees Classifier     0.914672
7    Gaussian NB                0.789548

V. CONCLUSION

Because we use the Internet more often in our everyday lives, cybercriminals target their victims through this medium. One of the most common attacks is "phishing", which involves creating a fake website to steal customers' private data, including their user ID and password, from financial websites utilizing social media tools. The malicious website is made to look exactly like a legitimate website, even down to replicating the original website word for word. Because of the semantic nature of such pages, which exploit human weaknesses, detecting them is not a simple problem. Software packages are only effective as a support mechanism for this sort of threat identification and avoidance, and they employ a whitelist/blacklist strategy in particular to thwart such attacks. These are static methods, though, and are therefore unable to detect newly introduced assaults on the system. As a result, we suggest using a logistic regression based machine learning system to recognize incoming URLs as an effective option. According to the research observations, this technique produces an acceptable effectiveness rate of roughly 98%.

REFERENCES

[1]. R. B. Basnet, A. H. Sung, "Mining web to detect phishing URLs", Proceedings of the International Conference on Machine Learning and Applications, vol. 1, pp. 568-573, Dec 2012.
[2]. Abdelhamid N., Thabtah F., Ayesh A. (2014) Phishing detection based associative classification data mining. Expert Systems with Applications Journal. 41 (2014) 5948-5959.
[3]. Mohammad, R. M., Thabtah, F. & McCluskey, L. (2013) Predicting Phishing Websites using Neural Network trained with Back Propagation. Las Vegas, World Congress in Computer Science, Computer Engineering, and Applied Computing, pp. 682-686.
[4]. Aburrous M., Hossain M., Dahal K.P. and Thabtah F. (2010) Experimental Case Studies for Investigating E-Banking Phishing Techniques and Attack Strategies. Journal of Cognitive Computation, Springer Verlag, 2 (3): 242-253.
[5]. Mohammad R., Thabtah F., McCluskey L., (2014B) Intelligent Rule based Phishing Websites Classification. Journal of Information Security (2), 1-17. ISSN 17518709. IET.
[6]. Jain, Ankit Kumar, and B. B. Gupta. "Comparative analysis of features-based machine learning approaches for phishing detection." Computing for Sustainable Global Development (INDIACom), 2016 3rd International Conference on. IEEE, 2016, pp. 2125-2130.
[7]. R. Aravindhan, Dr. R. Shanmugalakshmi, Certain Investigation on Web Application Security: Phishing Detection and Phishing Target Discovery, January 2016.
[8]. L. A. T. Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen, "A novel approach for phishing detection using URL-based heuristic," 2014 Int. Conf. Comput. Manage. Telecommun. ComManTel 2014, pp. 298–303, 2014.
[9]. A. Berthold, et al., "Improved phishing detection using model-based features," in Proc. Conference on Email and Anti-Spam (CEAS). Mountain View Conf, CA, Aug 2008.
[10]. L. Ma, et al., "Detecting phishing emails using hybrid features," IEEE Conf, 2009, pp. 493-497.
