Phish Guard Phishing Website using Machine Learning Algorithms

Published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-8 | Issue-5, October 2024, URL: https://www.ijtsrd.com/papers/ijtsrd69425.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/69425/phish-guard-phishing-website-using-machine-learning-algorithms/abhishek-jadhao


International Journal of Trend in Scientific Research and Development (IJTSRD)

Volume 8 Issue 5, Sep-Oct 2024 Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470

Phish Guard Phishing Website using Machine Learning Algorithms

Abhishek Jadhao¹, Lakshmi Mahindre², Komal Rahangdale³, Vinita Singh⁴, Prof. Rina Shipurkar⁵, Prof. Usha Kosarkar⁶
¹,²,³,⁴ School of Science, G. H. Raisoni University, Amravati, Maharashtra, India
⁵ Assistant Professor, G. H. Raisoni University, Amravati, Maharashtra, India
⁶ Assistant Professor, G H Raisoni College of Engineering & Management, Nagpur, Maharashtra, India

ABSTRACT

Phishing attacks pose a significant threat to individuals and organizations, leading to substantial financial and reputational damage. Traditional detection methods, such as blacklists and signature-based techniques, often fall short in identifying sophisticated phishing attempts. This research proposes a comprehensive system that leverages machine learning and deep learning techniques to detect and remove phishing threats in emails and websites. The system integrates multiple modules to analyze email structures, text content, and URLs, ensuring a robust defense against phishing attacks. By employing advanced algorithms like Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, the system achieves high accuracy in identifying phishing attempts. Experimental results demonstrate the system's effectiveness in real-world scenarios, significantly reducing the risk of phishing attacks. This study contributes to the field of cybersecurity by providing a scalable and efficient solution for phishing detection and mitigation, paving the way for safer online interactions.

The anonymous and uncontrollable framework of the Internet makes it more vulnerable to phishing attacks. Existing research shows that the performance of phishing detection systems is limited, so there is a demand for an intelligent technique to protect users from cyber-attacks. In this study, the authors propose a URL detection technique based on machine learning approaches. A recurrent neural network method is employed to detect phishing URLs. The proposed method was evaluated with 7900 malicious and 5800 legitimate sites, respectively. The experimental outcome shows that the proposed method performs better than recent approaches in malicious URL detection. Phishing is one of the familiar attacks that trick users into accessing malicious content in order to gain their information. In terms of website interface and uniform resource locator (URL), most phishing webpages look identical to the actual webpages. Various strategies for detecting phishing websites, such as blacklist and heuristic methods, have been suggested.

KEYWORDS: machine learning, phish attack, anti phishing tool, cybersecurity solutions, url scanning

How to cite this paper: Abhishek Jadhao | Lakshmi Mahindre | Komal Rahangdale | Vinita Singh | Prof. Rina Shipurkar | Prof. Usha Kosarkar "Phish Guard Phishing Website using Machine Learning Algorithms" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-8 | Issue-5, October 2024, pp.625-634, URL: www.ijtsrd.com/papers/ijtsrd69425.pdf (Paper ID: IJTSRD69425)

Copyright © 2024 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)

I. INTRODUCTION
Nowadays, phishing has become a main area of concern for security researchers because it is not difficult to create a fake website that looks very close to a legitimate one. Experts can identify fake websites, but not all users can, and such users become victims of phishing attacks. The main aim of the attacker is to steal bank account credentials. Businesses in the United States lose US$2 billion per year because their clients fall victim to phishing. The 3rd Microsoft Computing Safety Index Report, released in February 2014, estimated that the annual worldwide impact of phishing could be as high as $5 billion. Phishing attacks succeed because of a lack of user awareness. Since phishing exploits weaknesses found in users, it is very difficult to mitigate, which makes it important to enhance phishing detection techniques.

The general method of detecting phishing websites is to add blacklisted URLs and Internet Protocol (IP) addresses to the antivirus database, which is known as the "blacklist" method. To evade blacklists, attackers use creative techniques to fool users by modifying the URL to appear legitimate via obfuscation and many other simple techniques, including fast-flux, in which proxies are automatically generated to host the web page, and algorithmic generation of new URLs. The major drawback of this method is that it cannot detect zero-hour phishing attacks. Heuristic-based detection relies on characteristics that are found to exist in real phishing attacks and can detect zero-hour phishing attacks, but these characteristics are not guaranteed to always exist in such attacks, and the false positive rate in detection is very high.

Fig 1. Multiple forms of phishing attacks.
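To make the blacklist limitation discussed in the introduction concrete, the following is a minimal sketch, not the paper's implementation. The host names, the IP address, and the `is_blacklisted` helper are hypothetical values invented for illustration. It shows that a blacklist is an exact lookup, so a newly generated variant of a known phishing host passes unnoticed.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries, invented for this sketch.
BLACKLISTED_HOSTS = {"login-paypa1.example", "secure-bank.example"}
BLACKLISTED_IPS = {"203.0.113.7"}


def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host is a known blacklisted name or IP."""
    host = (urlsplit(url).hostname or "").lower()
    return host in BLACKLISTED_HOSTS or host in BLACKLISTED_IPS


# A known entry is caught by the lookup.
print(is_blacklisted("http://login-paypa1.example/signin"))
# A freshly generated variant (a zero-hour URL) is not in the set yet.
print(is_blacklisted("http://login-paypa1-x9.example/signin"))
```

Because the check is a set-membership test, its coverage depends entirely on how quickly new entries are added, which is the zero-hour gap the text describes.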
II. RELATED WORK
Phishing attacks are categorized according to the phisher's mechanism for trapping targeted users. Several forms of these attacks are keyloggers, DNS poisoning, etc. The initiation channels in social engineering include online blogs, short message services (SMS), social media platforms that use Web 2.0 services, such as Facebook and Twitter, peer-to-peer file-sharing services, and Voice over IP (VoIP) systems where the attackers use caller ID spoofing. Each form of phishing differs slightly in how the process is carried out in order to defraud the unsuspecting consumer. E-mail phishing attacks occur when an attacker sends an e-mail with a link to potential users to direct them to phishing websites.

A. CLASSIFICATION OF PHISHING ATTACK TECHNIQUE
Phishing websites are challenging for organizations and individuals because of their similarity to legitimate websites. Fig 1 presents the multiple forms of phishing attacks. Technical subterfuge refers to attacks that include keylogging, DNS poisoning, and malware. In these attacks, the attacker intends to gain access through a tool or technique: users trust the network while the network has in fact been compromised by the attackers. Social engineering attacks include spear phishing, whaling, SMS, vishing, and mobile applications. In these attacks, attackers focus on a group of people or an organization and trick them into using the phishing URL. Apart from these attacks, many new attacks are emerging as the technology constantly evolves.

B. PHISHING DETECTION APPROACHES
Phishing detection schemes that detect phishing on the server side are better than phishing prevention strategies and user training systems. These systems can be used either via a web browser on the client or through specific host-site software. Fig 2 presents the classification of phishing detection approaches. The heuristic and ML-based approach is based on supervised and unsupervised learning techniques. It requires features or labels for learning an environment to make a prediction. Proactive phishing URL detection is similar to the ML approach. Here, however, the URLs themselves are processed to help the system predict whether a URL is legitimate or malicious. Blacklist and whitelist approaches are the traditional methods of identifying phishing sites. The exponential growth of web domains reduces the performance of these traditional methods.

Fig 2. Anti-phishing approaches

The existing methods keep reliance on new internet users to a minimum. Once they identify a phishing website, the site is made inaccessible, or the user is informed of the probability that the website is not genuine. This approach requires minimal user training and no modifications to existing website authentication systems. The performance of the detection systems is calculated according to the following:

• Number of True Positives (TP): The total number of malicious websites correctly predicted as malicious.
• Number of True Negatives (TN): The total number of legitimate websites correctly predicted as legitimate.
• Number of False Positives (FP): The total number of incorrect predictions of legitimate websites as malicious websites.
• Number of False Negatives (FN): The total number of incorrect predictions of malicious websites as legitimate websites.

C. RESEARCH QUESTIONS
The researchers framed the Research Questions (RQ) according to the objective of the study and its background. They are as follows:
• RQ1: How do URL detectors identify phishing URLs or websites?
• RQ2: How can ML methods be applied to classify malicious and legitimate websites?
• RQ3: How can the performance of a URL detector be evaluated?

RQ1 and RQ2 assist in developing an ML-based phishing detection system for securing a network from phishing attacks, while RQ3 specifies the importance of evaluating the performance of a phishing detection technique. To address RQ1, the authors reviewed recent literature related to URL detection using Artificial Intelligence (AI) techniques. The remainder of this section presents the studies in detail, with a summary in Table 1.

The authors of one study proposed a URL-based anti-phishing machine learning method. They took 14 features of the URL to classify a website as malicious or legitimate and to test the efficiency of their method. More than 33,000 phishing and valid URLs were used to train Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. The phishing detection method focused on the learning process. The 14 extracted features are those that distinguish phishing websites from legitimate websites. The outcome of their experiment exceeded 90% precision when websites were classified with SVM.

Another study explored multiple ML methods to detect URLs by analyzing various URL components using machine learning and deep learning methods. The authors addressed various methods of supervised learning for the identification of phishing URLs based on lexical features, WHOIS properties, PageRank, traffic rank information and page importance properties. They studied how the volume of training data influences the accuracy of classifiers. The research includes Support Vector Machine (SVM), K-NN, random forest classification (RFC) and Artificial Neural Network (ANN) techniques for the classification.

A comparative study of machine learning algorithms was carried out based on their output with and without feature selection. Experiments were carried out on a phishing dataset with 30 features that included 4898 phished and 6157 benign web pages. Several ML methods were used to yield a better outcome. A feature selection method was subsequently employed to increase model performance. The random forest algorithm achieved the highest accuracy both before and after feature selection and dramatically improved model building time. The results of the experiment showed that using feature selection with machine learning algorithms can boost the effectiveness of classification models for phishing detection without reducing their performance.

In another study, the authors proposed URLNet, a CNN-based deep neural network for URL detection. They argued that current methods often use features such as Bag of Words (BoW) and suffer from essential limitations, such as the failure to detect sequential concepts in a URL string, the lack of automated feature extraction, and the failure to handle unseen features in real-time URLs. They developed character-level CNNs and word-level CNNs and combined them in one network. In addition, they suggested advanced techniques that were particularly effective for handling uncommon terms, a problem that commonly exists in malicious URL detection tasks. This allows URLNet to learn embeddings and to use subword information from words that were unseen until the testing phase.

Other authors suggested a high-precision URL detector for phishing attacks. They argued that the technique could be scaled to various sizes and proactively adapted. A limited data collection of 572 cases was employed for both legitimate and malicious URLs. The characteristics were extracted and then weighted as cases for use in the prediction process. The test results were highly reliable with and without online phishing threats. A Genetic Algorithm (GA) was used to improve the accuracy.
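The contrast drawn above between word-based features and character-level input can be illustrated with a small sketch. This is not URLNet itself: the vocabulary, the `encode_words` and `encode_chars` helpers, and the example URL are assumptions made for illustration. A word vocabulary maps an unseen token to an "unknown" id, while a character vocabulary still represents every symbol of the URL.

```python
import re
import string

# Character vocabulary over printable ASCII; id 0 is reserved for padding.
CHAR_IDS = {ch: i + 1 for i, ch in enumerate(string.printable)}
UNKNOWN_CHAR_ID = len(CHAR_IDS) + 1


def encode_chars(url: str, max_len: int = 20) -> list[int]:
    """Map each character to an integer id, truncating or padding to max_len."""
    ids = [CHAR_IDS.get(ch, UNKNOWN_CHAR_ID) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


def encode_words(url: str, vocab: dict[str, int]) -> list[int]:
    """Split the URL on non-alphanumeric characters and look each word up."""
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url.lower()) if w]
    return [vocab.get(w, 0) for w in words]  # 0 marks a word never seen before


# Hypothetical word vocabulary learned from earlier URLs.
word_vocab = {"http": 1, "www": 2, "bank": 3, "com": 4, "login": 5}
url = "http://www.bank-secure.com/login"

print(encode_words(url, word_vocab))   # the unseen word "secure" becomes 0
print(encode_chars(url, max_len=12))   # every character still has an id
```

The integer sequences produced here are the kind of input that an embedding layer followed by a CNN or LSTM would consume.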

TABLE NO 1. COMPARISON STUDY OF LITERATURE

| S. No. | Authors | Contributions | Limitations |
|---|---|---|---|
| 1 | Jain A.K. and Gupta B.B. [2] | Employed both NB and SVM algorithms to identify the malicious websites. | Both SVM and NB are slow learners and do not store the previous results in memory. Thus, the efficiency of the URL detector may be reduced. |
| 2 | Purbay M. and Kumar D. [3] | Utilized multiple ML methods for classifying URLs. | They compared the performance of different types of ML methods. However, there was no discussion of the retrieval capacity of the algorithms. |
| 3 | Gandotra E. and Gupta D. [4] | Applied multiple classification algorithms for detecting malicious URLs. | The outcome of the experiments demonstrated that the system performed better than other ML methods. However, it is weak at handling larger volumes of data. |
| 4 | Hung Le et al. [5] | Proposed a deep learning based URL detector. Authors argued that the method can produce insights from a URL. | Deep learning methods demand more time to produce an output. In addition, the method processes the URL and matches it against a library to generate an output. |
| 5 | Hong J. et al. [6] | Developed a crawler to extract URLs from data repositories. Applied a lexical features approach to identify the phishing websites. | The performance evaluation was based on a crawler-based dataset. Thus, there is no assurance of the effectiveness of the URL detector with real-time URLs. |
| 6 | Kumar J. et al. [7] | Proposed a URL detector based on a blacklisted dataset. Also, a lexical feature approach was employed to classify malicious and legitimate websites. | Authors employed an older dataset, which can reduce the performance of the detector with real-time URLs. |
| 7 | Hassan Y.A. and Abdelfettah B. [8] | Suggested a URL detector for classifying websites and predicting the phishing websites. They used the GA technique to improve the performance. | The performance of the GA-based URL detector was better; nonetheless, the prediction time was very long with a complex set of URLs. |
| 8 | Rao RS and Pais AR. [9] | Authors employed page attributes including logo, favicon, scripts and styles. | The method employed a server for updating the page attributes, which reduces the performance of the detecting system. |
| 9 | Aljofey A et al. [10] | A CNN-based detecting system for identifying the phishing page. A sequential pattern is used to find URLs. | The existing research shows that CNN performs better for retrieving images than text. |
| 10 | AlEroud A and Karabatis G. [11] | A generative adversarial network is used in the research to bypass a detection system. | A neural network based detection system can identify the impression of an adversarial network by learning the environment. |

III. RESEARCH METHODOLOGY

RQ2 asks how an ML method can be employed to classify a URL as malicious or legitimate. To present a solution, the authors propose the research framework shown in the figure below.

FIG NO.1 RESEARCH FRAMEWORK

Let $\sum_{n=0}^{m} x_n$ be the set of URLs, where m is the maximum limit for the number (n) of URLs. Let $M, L \in x_n$ be the malicious and legitimate URLs, respectively. Suppose M and L contain the properties $P_m$ and $P_l$, respectively. The proposed framework employs RNN-LSTM to identify the properties $P_m$ and $P_l$ in order to declare a URL as malicious or legitimate. Equations 1 to 4 present the method for identifying a malicious URL.

The term "recurrent neural network" covers two broad groups of networks with a similar general structure, where one is finite impulse and the other is infinite impulse. Both groups exhibit temporal dynamic behaviour. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced by a purely feedforward neural network, whereas an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled. LSTM is a modified version of RNN. It is a deep learning method that mitigates the vanishing gradient problem of RNN. Multiple gates are employed to improve the performance of LSTM. In comparison with a plain RNN, LSTM keeps the error signal from vanishing during backpropagation. Each input of LSTM generates an output that becomes an input for the following layer or module of LSTM. Eqs 1 to 4 illustrate the concept of the proposed study.

$$\sum (M + L) = x_n \tag{1}$$
$$\text{Input} = \sum_{n=0}^{m} x_n \tag{2}$$
$$\text{Malicious} = \text{Output\_RNN}(\text{Input}(P_m)) \tag{3}$$
$$\text{Legitimate} = \text{Output\_RNN}(\text{Input}(P_l)) \tag{4}$$

Cell state (CS): The cell space that accommodates both long-term and short-term memories.

Hidden state (HS): The output status information used to classify a URL with respect to the current data, the previous hidden state and the current cell input. The hidden state is used to recover both short-term and long-term memory in order to make a prediction.

Input gate (IT): Controls how much new information flows into the cell state.

Forget gate (FT): Controls how much of the past cell state is carried, together with the current input, into the present cell state.

Output gate (OT): Controls how much information flows from the cell state to the hidden state.

ALGORITHM: DATA COLLECTION

The data pre-process algorithm illustrates the steps of data pre-processing. Here, url is one of the elements of the URL dataset. In this process, the raw data is pre-processed by scanning each URL in the dataset. A set of functions is developed in order to remove the irrelevant data. Finally, D2 is the set of features returned by the pre-process activity.

ALGORITHM: DATA PRE-PROCESS

The data transformation algorithm represents the processes of data transformation. "Num" is the vector returned by the data transformation process. During this process, each feature of D2 is converted into a vector. Each item in D2 is processed using the Generate Vectors function. A vector is generated and passed as an input to the training phase.

ALGORITHM: DATA TRANSFORMATION

The training phase algorithm provides the processes involved in the training phase. Each URL is processed with the support of its vector. LSTMLib is one of the functions in the LSTM that predicts an output using the vectors. The library is updated with the extracted features, which contain the necessary data related to malicious and normal web pages. Thus, an iterative process is used to scan each vector and suspicious URL and generate a final outcome. Lastly, op is the prediction returned by the proposed method during the training phase.
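The roles of the cell state, hidden state, and the input, forget, and output gates described in this section can be sketched with a single scalar LSTM step. This is the standard textbook LSTM update, not the authors' LSTMLib or training algorithm; the `lstm_step` function, the weight values in `PARAMS`, and the toy input vector are all assumptions chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; squashes a gate's pre-activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step with a scalar input, hidden state, and cell state."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f_t = sigmoid(pre("forget"))        # forget gate: share of old cell state kept
    i_t = sigmoid(pre("input"))         # input gate: share of new information admitted
    o_t = sigmoid(pre("output"))        # output gate: share of cell state exposed
    g_t = math.tanh(pre("candidate"))   # candidate value for the cell state
    c_t = f_t * c_prev + i_t * g_t      # new cell state (CS)
    h_t = o_t * math.tanh(c_t)          # new hidden state (HS)
    return h_t, c_t


# Hypothetical weights (w_x, w_h, bias), chosen only for illustration.
PARAMS = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, -0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.2, 0.0),
}

h, c = 0.0, 0.0
for value in [0.2, 0.7, 0.1]:   # toy numeric vector standing in for an encoded URL
    h, c = lstm_step(value, h, c, PARAMS)
print(h, c)
```

In a trained detector the final hidden state would feed a classification layer; here the weights are fixed, so the output only demonstrates how the gates combine.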

ALGORITHM: TRAINING PHASE

The testing phase algorithm indicates the testing phase of the proposed URL detection. The proposed process compares each element from the LSTMMemory function with the vector of the URL and decides an output. Here, f is the element of the feedback collected from the crawler, which indicates the page rank of a website. The page rank indicates the value of a website, and the lowest-ranking websites are declared malicious or suspicious to alert the users.

ALGORITHM: TESTING PHASE

The epoch settings are configured in the training phase. The epoch value sets the number of passes over the training data and therefore governs the execution time of the method. The learning rate can be tuned to improve the performance of the method.

IV. PROPOSED WORK
MACHINE LEARNING ALGORITHM
Three machine learning classification models, Decision Tree, Random Forest and Support Vector Machine, have been selected to detect phishing websites.

A. DECISION TREE ALGORITHM
The decision tree is one of the most widely used algorithms in machine learning. It is easy to understand and easy to implement. A decision tree begins by choosing the best splitter from the available attributes for classification, which becomes the root of the tree. The algorithm continues to build the tree until it reaches the leaf nodes. The decision tree creates a training model that is used to predict the target value or class. In the tree representation, each internal node corresponds to an attribute and each leaf node corresponds to a class label. In the decision tree algorithm, the Gini index and information gain are used to select the attributes for these nodes.

B. RANDOM FOREST ALGORITHM
The random forest algorithm is one of the most powerful algorithms in machine learning, and it is based on the concept of the decision tree algorithm. It creates a forest with a number of decision trees. A high number of trees generally gives high detection accuracy. Trees are created using the bootstrap method, in which features and samples of the dataset are randomly selected with replacement to construct a single tree. Among the randomly selected features, the random forest algorithm chooses the best splitter for classification and, like the decision tree algorithm, uses the Gini index and information gain to find the best splitter. This process continues until the random forest has created n trees. Each tree in the forest predicts the target value, and the algorithm then counts the votes for each predicted target. Finally, the random forest takes the predicted target with the most votes as the final prediction.

C. SUPPORT VECTOR MACHINE ALGORITHM
The support vector machine (SVM) is another powerful algorithm in machine learning. In the SVM algorithm, each data item is plotted as a point in n-dimensional space, and the algorithm constructs a separating boundary between the two classes, known as a hyperplane. The SVM looks for the closest points of the two classes, called support vectors, and draws a line connecting them. It then constructs the separating line that perpendicularly bisects the connecting line. To classify data well, the margin should be maximal. Here the margin is the distance between the hyperplane and the support vectors.

V. PROPOSED RESEARCH MODEL
A. Objective
The primary goal is to develop an effective system for detecting phishing websites to enhance cybersecurity measures.

B. Data Collection
1. Data Sources:
Phishing Dataset: Utilize publicly available datasets like the Phishing Websites Data Set from the UCI Machine Learning Repository or datasets from Kaggle.
Legitimate Websites: Scrape data from well-known legitimate websites to create a balanced dataset.
Real-Time Data: Integrate APIs (e.g., Google Safe Browsing) to get real-time data on phishing URLs.
2. Data Attributes:
URL characteristics (length, entropy)
Domain age and registration details
Presence of HTTPS
Use of special characters or IP addresses
Page content features (e.g., keywords, meta tags)
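Several of the data attributes listed under Data Collection can be computed from the URL string alone. The sketch below is an illustration under assumptions, not the paper's feature extractor: the `url_attributes` function, the chosen set of "special" characters, and the example URL are hypothetical. Domain age, registration details, and page content are omitted because they need WHOIS or network access.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character of the given string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def host_is_ip(host: str) -> bool:
    """True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_attributes(url: str) -> dict:
    """Compute a few static URL attributes (assumed feature set)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "length": len(url),
        "entropy": shannon_entropy(url),
        "uses_https": parts.scheme == "https",
        # Assumed list of characters treated as "special" in this sketch.
        "special_chars": sum(url.count(ch) for ch in "@-_%=&?"),
        "host_is_ip": host_is_ip(host),
    }


print(url_attributes("http://192.0.2.10/login?user=a@b"))
```

Each dictionary can be turned into one row of a feature table for the classifiers discussed in Section IV.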

C. Feature Extraction
1. Static Features:
URL-Based Features: Analyze the structure of URLs (e.g., presence of subdomains, length).
Domain Features: Examine WHOIS information, age of domain, and registration details.
2. Dynamic Features:
Content Analysis: Use NLP techniques to analyze the content of the webpage (e.g., identifying phishing-related keywords).
JavaScript Analysis: Inspect scripts for malicious behavior.

D. Model Selection
1. Machine Learning Algorithms:
Supervised Learning: Train classifiers like Random Forest, Decision Trees, Support Vector Machines (SVM), and Neural Networks.
Ensemble Methods: Consider using ensemble techniques (e.g., Bagging, Boosting) to improve accuracy.
2. Deep Learning Approaches:
Convolutional Neural Networks (CNN): For image-based phishing detection.
Recurrent Neural Networks (RNN): To analyze sequences in URL patterns.

E. Implementation Framework
1. Tech Stack:
Backend: Python (Flask/Django) for server-side implementation.
Frontend: HTML, CSS, JavaScript frameworks (e.g., React) for user interface.
Database: SQL/NoSQL databases to store website data and user queries.
2. API Integration:
Incorporate third-party APIs for real-time checking and threat intelligence feeds.

F. Evaluation Metrics
1. Performance Metrics:
Accuracy: Percentage of correctly identified phishing vs. legitimate sites.
Precision and Recall: To evaluate the balance between false positives and false negatives.
F1 Score: Harmonic mean of precision and recall to assess overall model performance.
2. Cross-Validation:
Use k-fold cross-validation to ensure robustness of the model.

G. User Interface Design
1. Web Interface:
Simple user input form to enter URLs for analysis.
Display results with confidence scores and actionable insights.
2. User Feedback:
Implement a feedback loop where users can report false positives/negatives to improve the model.

H. Deployment and Monitoring
1. Deployment:
Deploy the model using cloud platforms (e.g., AWS, Azure) for scalability.
2. Monitoring:
Continuously monitor the model's performance and update it based on new phishing techniques and data.

I. Ethical Considerations
Ensure user privacy and data security.
Maintain transparency about data usage and model limitations.
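The performance metrics named under Evaluation Metrics in this section follow directly from the four confusion-matrix counts. The sketch below uses a toy set of labels invented for illustration; the function names and the label strings are assumptions, and the numbers are not results from the paper.

```python
def confusion_counts(actual, predicted, positive="phishing"):
    """Tally TP, TN, FP, FN, treating `positive` as the phishing class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for this sketch (P = phishing, L = legitimate).
P, L = "phishing", "legitimate"
actual = [P, P, P, L, L, L, L, P]
predicted = [P, P, L, L, P, L, L, P]

counts = confusion_counts(actual, predicted)
print(counts)
print(metrics(*counts))
```

The zero-denominator guards return 0.0 when a model predicts no phishing URLs at all, which is a design choice of this sketch rather than a universal convention.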

Fig1. Proposed research model
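The k-fold cross-validation step named in the proposed research model can be sketched as an index splitter. This is a minimal illustration, assuming contiguous folds without shuffling or stratification; the `k_fold_indices` helper is hypothetical, and a real evaluation would typically shuffle the data and keep the phishing/legitimate ratio similar in every fold.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for fold, (train, test) in enumerate(k_fold_indices(10, 3), start=1):
    print(f"fold {fold}: train={train} test={test}")
```

Each sample appears in exactly one test fold, so averaging a metric over the k folds uses every labelled URL for both training and testing.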

@ IJTSRD | Unique Paper ID – IJTSRD69425 | Volume – 8 | Issue – 5 | Sep-Oct 2024 Page 631
International Journal of Trend in Scientific Research and Development @ www.ijtsrd.com eISSN: 2456-6470
VI. RESULT ANALYSIS – where phishing website is considered to involve
To evaluate the efficiency of a system, we use certain automatic categorization of websites into a
parameters. For each machine learning model, we predetermined set of class values based on several
calculate the Accuracy, Precision, Recall, F1 Score features and the class variable. The ML based
and ROC curve to determine its performance. Each of phishing techniques depend on website functionalities
these metrics is calculated based on True Positive to gather information that can help classify websites
(TP), True Negative (TN), False Positive (FP) and for detecting phishing sites. The problem of phishing
False Negative (FN). cannot be eradicated, nonetheless can be reduced by
In the case of URL classification, True Positive (TP) combating it in two ways, improving targeted anti-
is the number of phishing URLs that are correctly phishing procedures and techniques and informing the
classified as phishing. True Negative (TN) is the public on how fraudulent phishing websites can be
number of legitimate URLs that are correctly detected and identified. To combat the ever evolving
and complexity of phishing attacks and tactics, ML
classified as legitimate. False Positive (FP) is the
number of legitimate URLs that are classified as anti-phishing techniques are essential. Authors
phishing. False Negative (FN) is the number of employed LSTM technique to identify malicious and
legitimate websites. A crawler was developed that
phishing URLs that are classified as legitimate. These
crawled 7900 URLs from AlexaRank portal and also
values are summarized in Table IV called Confusion
Matrix. employed Phishtank dataset to measure the efficiency
of the proposed URL detector. The outcome of this
Predicted Predicted study reveals that the proposed method presents
Phishing Legitimate superior results rather than the existing deep learning
Actual Phishing TP FN methods. A total of 7900 malicious URLS were
Actual Legitimate FP TN detected using the proposed URL detector. It has
TABLE:- CONFUSION MATRIX FOR achieved better accuracy and F1—score with limited
PHISHING DETECTION amount of time. The future direction of this study is
to develop an unsupervised deep learning method to
Precision is the number of URLs that are actually
generate insight from a URL. In addition, the study
phishing out of all the URLs predicted as phishing. It
can be extended in order to generate an outcome for a
measures the classifiers exactness. The formula to
larger network and protect the privacy of an
calculate precision is given by Equation (1) below.
individual.
1. Recall is the proportion of actually phishing URLs that the classifier identified as phishing. It is also called sensitivity or true positive rate. It is an important measure and should be as high as possible.
2. F1-Score is the harmonic mean of precision and recall. It summarizes both measures in a single value.
3. Accuracy is the proportion of instances in the test data that were correctly classified.

VII. CONCLUSION
This paper aims to improve the detection of phishing websites using machine learning. We achieved 97.14% detection accuracy using the random forest algorithm, with the lowest false positive rate. The results also show that the classifiers perform better when more data is used for training. In future work, a hybrid approach combining the random forest algorithm with the blacklist method will be implemented to detect phishing websites more accurately. The proposed study emphasized the phishing technique in the context of classification,

The findings underscore the critical need for a multi-layered approach to cybersecurity. User education emerges as a cornerstone in this defense strategy, empowering individuals to recognize and avoid phishing attempts. Additionally, the implementation of robust cybersecurity measures, including multi-factor authentication, secure browsing practices, and regular software updates, is essential in fortifying defenses against these threats. Advanced detection algorithms, particularly those leveraging artificial intelligence and machine learning, have shown promise in identifying and neutralizing phishing websites with greater accuracy and speed. These technologies can analyze vast amounts of data to detect patterns and anomalies indicative of phishing activities, thereby providing a proactive defense mechanism. Despite these advancements, the dynamic and evolving nature of phishing tactics necessitates continuous research and development. Future efforts should focus on enhancing detection methods, improving user awareness programs, and fostering collaboration between cybersecurity professionals and organizations. By staying ahead of the increasingly complex tactics employed by phishers, we can better safeguard our digital

environments. In conclusion, while phishing websites remain a formidable challenge, a comprehensive and adaptive approach to cybersecurity can significantly mitigate the risks. Through ongoing education, technological innovation, and collaborative efforts, we can build a more resilient defense against the ever-present threat of phishing.

VIII. REFERENCES
[1] Whitten, A., & Tygar, J. D. (1999). “Why Johnny can't encrypt: A usability evaluation of PGP 5.0.” In Proceedings of the 8th conference on USENIX Security Symposium, Volume 8 (pp. 169-184).
[2] Jakobsson, M., & Myers, S. (2007). “Phishing and countermeasures: understanding the increasing problem of electronic identity theft.” Wiley Publishing.
[3] Kumaraguru, P., & Cranor, L. F. (2008). “Phishing in Indian cyber space.” In Proceedings of the 4th annual workshop on Cyber security and information intelligence research (pp. 1-1).
[4] Wang, X., & Zhang, Y. (2011). “Design and implementation of a phishing website detection system based on visual similarity.” In Proceedings of the 2011 international conference on Internet computing and information services (pp. 1-4).
[5] Wang, W., & Li, J. (2012). “A new phishing website detection method based on visual similarity and URL features.” In Proceedings of the 2012 international conference on computer science and electronics engineering (Vol. 3, pp. 518-521).
[6] Choo, K. K. R., & Smith, R. G. (2010). “Phishing for phools: An examination of the cyberspace deception techniques and their effectiveness.” Journal of Financial Crime, 17(3), 273-286.
[7] Dhamija, R., Tygar, J. D., & Hearst, M. (2006). “Why phishing works.” In Proceedings of the SIGCHI conference on Human Factors in computing systems (pp. 581-590).
[8] Sheng, S., Holbrook, M., Kumaraguru, P., & Cranor, L. F. (2010). “Who falls for phish? a demographic analysis of phishing susceptibility and effectiveness of interventions.” In Proceedings of the SIGCHI conference on Human Factors in computing systems (pp. 373-382).
[9] Blythe, J., & Wright, P. (2006). “Phishing and the online banking customer.” In Proceedings of the SIGCHI conference on Human Factors in computing systems (pp. 1117-1122).
[10] Kumar, S., Kumar, D., & Lal, S. (2021). “Phishing Websites Detection Using Machine Learning Techniques: A Review.” International Journal of Advanced Trends in Computer Science and Engineering, 10(2), 6-12.
[11] Rezgui, A., Mosbah, M., & Braham, R. (2020). “Phishing websites detection based on textual and visual features using machine learning techniques.” Journal of Information Security and Applications, 52, 102507.
[12] AlShurideh, M., & Alkhayat, M. (2020). “Phishing websites detection using machine learning techniques.” Journal of Physics: Conference Series, 1654(1), 012020.
[13] Wang, W., & Li, J. (2019). “Phishing website detection based on HTML feature analysis and machine learning algorithms.” Security and Communication Networks, 2019, 1-12.
[14] Dey, S., & Guha, S. (2019). “A novel approach for detection and classification of phishing websites using machine learning techniques.” Procedia Computer Science, 157, 576-585.
[15] Patel, S., Kotecha, K., & Patel, A. (2018). “Classification and detection of phishing websites using machine learning techniques.” International Journal of Computer Science and Information Technologies, 9(6), 5096-5100.
[16] Kumar, R., Pateriya, R., & Tiwari, R. (2018). “Detection of phishing websites using machine learning techniques.” International Journal of Computer Sciences and Engineering, 6(6), 239-243.
[17] Ren, S., Chen, X., Guo, B., & Zhang, Z. (2017). “Phishing website detection using machine learning techniques.”
[18] The set of phishing URLs is collected from an open-source service called PhishTank. This service provides a set of phishing URLs in multiple formats such as CSV and JSON, and the set is updated hourly. To download the data: https://www.phishtank.com/developer_info.php. From this dataset, 5000 random phishing URLs are collected to train the ML models.
[19] The legitimate URLs are obtained from the open datasets of the University of New Brunswick, https://www.unb.ca/cic/datasets/url-2016.html. This dataset has a collection of benign, spam, phishing, malware & defacement URLs. Out of all these types, the benign URL dataset is considered for this project. From this dataset, 5000 random legitimate URLs are collected to train the ML models.
[20] Usha Kosarkar, Gopal Sakarkar, Shilpa Gedam (2022), “An Analytical Perspective on Various Deep Learning Techniques for Deepfake Detection”, 1st International Conference on Artificial Intelligence and Big Data Analytics (ICAIBDA), 10th & 11th June 2022, 2456-3463, Volume 7, PP. 25-30, https://doi.org/10.46335/IJIES.2022.7.8.5
[21] Usha Kosarkar, Gopal Sakarkar, Shilpa Gedam (2022), “Revealing and Classification of Deepfakes Videos Images using a Customize Convolution Neural Network Model”, International Conference on Machine Learning and Data Engineering (ICMLDE), 7th & 8th September 2022, 2636-2652, Volume 218, PP. 2636-2652, https://doi.org/10.1016/j.procs.2023.01.237
[22] Usha Kosarkar, Gopal Sakarkar (2023), “Unmasking Deep Fakes: Advancements, Challenges, and Ethical Considerations”, 4th International Conference on Electrical and Electronics Engineering (ICEEE), 19th & 20th August 2023, 978-981-99-8661-3, Volume 1115, PP. 249-262, https://doi.org/10.1007/978-981-99-8661-3_19
[23] Usha Kosarkar, Gopal Sakarkar, Shilpa Gedam (2021), “Deepfakes, a threat to society”, International Journal of Scientific Research in Science and Technology (IJSRST), 13th October 2021, 2395-602X, Volume 9, Issue 6, PP. 1132-1140, https://ijsrst.com/IJSRST219682
[24] Usha Kosarkar, Prachi Sasankar (2021), “A study for Face Recognition using techniques PCA and KNN”, Journal of Computer Engineering (IOSR-JCE), 2278-0661, PP 2-5
[25] Usha Kosarkar, Gopal Sakarkar (2024), “Design an efficient VARMA LSTM GRU model for identification of deep-fake images via dynamic window-based spatio-temporal analysis”, Journal of Multimedia Tools and Applications, 1380-7501, https://doi.org/10.1007/s11042-024-19220-w
[26] Usha Kosarkar, Dipali Bhende, “Employing Artificial Intelligence Techniques in Mental Health Diagnostic Expert System”, International Journal of Computer Engineering (IOSR-JCE), 2278-0661, PP-40-45, https://www.iosrjournals.org/iosr-jce/papers/conf.15013/Volume%202/9.%2040-45.pdf?id=7557
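
As a supplementary sketch, the evaluation measures defined before the conclusion (recall, F1-score, and accuracy) can be computed directly from confusion-matrix counts. The Python function below is illustrative only and is not taken from the authors' implementation: the function name classification_metrics, the label convention (1 = phishing, 0 = legitimate), and the sample labels are all assumptions made for this example. F1 is computed as the harmonic mean of precision and recall, which is its standard definition.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute recall, precision, F1 and accuracy for binary labels.

    Assumed convention (not from the paper): `positive` marks a phishing URL.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with "phishing" as the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels for ten URLs: four phishing (1) and six legitimate (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these made-up labels the classifier misses one of the four phishing URLs and raises no false alarms, so recall is 0.75, precision is 1.0, and accuracy is 0.9. The gap between accuracy and recall illustrates why recall is reported separately: a missed phishing URL barely moves accuracy but lowers recall noticeably.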