ABSTRACT – With the advancement of technology and the evolution of humans, cybercrimes are increasing immensely. Phishing is a fraudulent activity that aims at stealing a user's confidential information, such as passwords, bank details and login IDs, using deceptive techniques. The attacker appears legitimate in these frauds, and it is very difficult to differentiate between the real and the fake because the difference between them is minute. These crimes are increasing every day: according to the records, around 75,689 frauds were registered in the year 2023. The main reasons are that people are not aware of phishing and do not have an accurate method to detect it. In this paper, we apply three algorithms: Random Forest, Decision Tree and an Autoencoder Neural Network. After training the models on a sufficient dataset, we conclude that, among the three algorithms used, the Autoencoder Neural Network achieved good performance in predicting fake URLs. Therefore, our approach can alert online users about a phishing website whenever they try to access it through its URL.

Keywords – Phishing, Fake URLs, Autoencoder Neural Network, Random Forest, Decision Tree, Machine Learning.

I. INTRODUCTION

Phishing is an online fraud that targets online users through e-mails, calls and messages and tries to obtain the user's confidential information [1]. These scams are increasing with time, as cyber criminals invent new ways to commit crimes every day. The scams peaked during COVID-19: as online work increased, the number of scams grew because users were not aware of illegitimate websites, which can leave them with nothing. It is very difficult to detect false URLs, as the difference between them and legitimate ones is minute; this is the biggest reason people get trapped in online scams. Datasets of these online scams are smaller than in other domains, because some URLs are deactivated after a short period of time and new techniques are introduced now and then [8]. This drawback can be overcome by using algorithms such as neural networks and machine learning. Phishers have also increased their usage of encryption protocols such as HTTPS; the usage of HTTPS is an example of using an internet security feature to trap online users.

The amount of money stolen and the number of phishing attacks are increasing constantly. According to the records, these attacks have jumped over the past few years and are growing by around 15% every year [3]. The reasons for this increase are, firstly, that online users are not aware of these phishing attacks and, secondly, that the attacks are very effective because users cannot differentiate between fake and legitimate URLs [12]. The movement towards a digital world can be another reason, as it gives attackers a great opportunity to run scams: there are many poorly secured websites which may allow fake announcements [6].

Machine learning techniques are used in many models, such as image recognition, price prediction, biometrics and fraud detection, as well as for detecting phishing websites. It is a good approach, as machine learning algorithms can produce precise and more accurate results [10]. In this paper we use Random Forest and Decision Trees to detect fake URLs. With Random Forest there is a high probability of an accurate result, as many decision trees are combined to produce an averaged output, which can also increase the detection rate of a model.

The industries that are particular targets of phishing attackers are finance, e-commerce, cryptocurrencies and software services (SaaS). In 2023, over 23% of phishing attacks targeted financial institutions, and social media accounted for around 22.6%. Other web-based targets, such as software services and webmail, account for around 22.3% of attacks [10]. The attackers' adoption of new technologies every day leads to a constant increase in the percentage of attacks. So, to control these attacks and achieve good accuracy, we focus on the combination of an Autoencoder Neural Network, Random Forest and Decision Tree in this research paper.
Fig (a) - Structure of a URL

II. LITERATURE

Nabeel Al-Milli [4] proposed a deep learning convolutional neural network model in their research paper to detect illegitimate URLs. The model aimed to reduce the number of cyber-attacks conducted through fake websites. A few experiments were carried out to evaluate its performance on a benchmark dataset, after which the CNN model was able to perform well on unseen URLs and identify the illegitimate ones. It gave an accuracy of 94.31% in the training phase and an overall accuracy of 91.3%.

Jeevanandham J [15] used a Dragonfly Optimization (DFO) technique for detecting fake URLs, training a Recurrent Neural Network model that uses this technique for feature selection. Previously he had used a firewall model, but it could only recognize and block predetermined attacks, whereas a model was needed that could also detect and block future attacks; this disadvantage motivated the Dragonfly technique. In DFO, results are compared using the accuracy, precision and F1-score of the model. Between 100 and 500 epochs were run during training and testing on the dataset: the accuracy after the first 100 epochs was around 71.3%, and after the fifth block of 100 epochs it was 72.1%. Detecting fake URLs is difficult, as the dataset is very small.

Saikiran Boppana [13] used machine learning models to find infected websites and prevent the user from becoming part of a scam or fraud. Various machine learning algorithms are available, such as Decision Tree, Random Forest, SVM and XGBoost. The machine learning model is given an input and trained on the input dataset; after preprocessing, the model produces an output, on the basis of which one can decide whether a given URL is fake or legitimate. His proposed model successfully classified phishing URLs with accuracies of 85.5% and 82.4% for Random Forest and Decision Tree respectively.

Gayathri Priya [7] proposed a model which uses Grey Wolf Optimization (GWO) and the Firefly Algorithm to identify phishing websites more efficiently. The model is then combined with an artificial neural network for classifying or segmenting various parts of the website. This hybrid technique improved the classification accuracy to a great extent: it gave an accuracy of 95.75% with feature optimization and 88.75% without feature optimization. Recently, AI-based phishing detection appears to have grown.

Yajian Zhou [2] used a Convolutional Neural Network to detect malicious URLs, proposing a simple URL detection model that works on the basis of URL content. URLs which cannot be classified or segmented are taken as input to the model and then processed by the CNN to determine whether the URL is malicious or legitimate. A machine learning baseline gave an accuracy of 73%, whereas their CNN model gave an accuracy of 81%.

Shewtha M [8] proposed a Particle Swarm Optimization (PSO) model to detect fake URLs more precisely by examining multiple properties of a website. In this model, a feature-weighting technique ranks the various elements of the website according to their properties and their importance in recognizing real websites. It gave the model excellent detection accuracy: 95% during training and 91% during testing.
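The studies above compare models by accuracy, precision and F1 score. These metrics follow directly from confusion-matrix counts; a generic sketch (not the cited authors' code, and with invented counts for illustration):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the URLs flagged, how many were phishing
    recall = tp / (tp + fn)      # of the phishing URLs, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 80 phishing URLs caught, 20 false alarms, 10 missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.889 0.842
```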
III. METHODOLOGY

In our research, we have collected data from all around the world. Using the collected data, we applied machine learning and neural network techniques to find out whether a given URL is fake, as per the requirement of the user. We use 60-70% of the data for training, 10-15% for validation and the remaining 10-20% for testing.

A. DATASET COLLECTION

Data collection refers to the process of gathering, recording and organizing information for reference or analysis. Its purpose is to systematically collect data from various sources such as interviews, observations and sensors. Effective data collection is essential for research, business intelligence and many other fields where data is considered crucial.

B. DATA PREPROCESSING

Data preprocessing is a technique for converting raw data into ordered data. Once the data is processed, it becomes easier to interpret and use. Preprocessing techniques are used because clean data yields accurate, high-quality results. The necessary tasks in data preprocessing are cleaning, integration, transformation and reduction. Real-world data usually contains missing values, noise and sometimes redundant information or an impractical format that models cannot use as-is. Data preprocessing therefore handles missing values, cleans the data and makes it suitable for model training. Data can be cleaned using a technique called binning, in which we sort the data and then partition it into equal-frequency bins.

C. MODEL TRAINING

First, the preprocessed dataset is divided into two parts: training data and testing data. 70-80 percent of the data is used to train our machine learning and neural models, and the remainder is used for model testing. Model training means fitting the training dataset with various algorithms; testing then checks the accuracy and correctness of the model. If the model is trained correctly on a large quantity of data, the prediction accuracy will increase.

D. MODEL TESTING

In machine learning, model testing refers to the operation in which the performance of a trained model is evaluated on the testing dataset that we set aside earlier. Once the model has been trained on the training data, its accuracy is measured on this held-out dataset.

Testing explicitly identifies which part of the code fails and provides a relatively coherent coverage measure. In machine learning testing, the programmer enters an input and observes the behaviour and logic of the machine. Hence, the purpose of testing a machine learning model is to verify that the logic learned by the machine remains consistent: it should not vary even if the program is called multiple times.
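The equal-frequency binning and the dataset split described above can be sketched as follows. The values and the 70/15/15 proportions are illustrative, not the paper's actual data:

```python
import random

def equal_frequency_bins(values, n_bins):
    """Sort values and cut them into bins holding (roughly) equal counts."""
    ordered = sorted(values)
    size, rem = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < rem else 0)  # spread any remainder
        bins.append(ordered[start:end])
        start = end
    return bins

def train_val_test_split(rows, seed=42, train=0.70, val=0.15):
    """Shuffle rows and split them into train/validation/test partitions."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train, n_val = int(n * train), int(n * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

lengths = [23, 91, 45, 67, 12, 88, 54, 39, 76, 60]  # e.g. URL lengths
bins = equal_frequency_bins(lengths, 2)
print(bins)  # [[12, 23, 39, 45, 54], [60, 67, 76, 88, 91]]

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```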
Fig (b) - Flow Diagram for the Training and Testing of the Model

We have used the following algorithms in this research.

I. RANDOM FOREST

Random Forest is a machine learning algorithm in which the outputs of multiple decision trees are combined to produce an averaged output. This is more accurate and more precise than using a decision tree alone. The advantages of Random Forest (reduced risk of overfitting, ease of use, flexibility and feature-importance estimates) have increased its usability in most present-day models. It can be used to solve both regression and classification problems, and it has been applied in many industries to support better business decisions.

II. DECISION TREE

Decision Tree is a supervised learning technique used for both classification and regression tasks. It is a tree-like, flowchart structure in which each internal node denotes a feature, each branch denotes a decision rule, and each leaf node denotes the result of the model. The main goal of the decision tree is to find, at each split, the attribute that maximizes the information gain. It is simple to understand, well suited to decision-related problems, and requires relatively little data cleaning.

III. AUTOENCODER NEURAL NETWORK

An autoencoder neural network is a type of neural network that can learn to reconstruct images, text and other data from a compressed form. It has three parts, the encoder, the latent space and the decoder, as shown in the figure. The encoder represents the input data in compressed form in the latent space, the latent space passes the compressed representation on, and the decoder finally reconstructs the data in its original dimension. Autoencoders are used in many cases, such as anomaly detection, image inpainting and information retrieval. It is an efficient algorithm, as it does not need any labels to learn a representation of the input.

IV. SUPPORT VECTOR MACHINE

A Support Vector Machine is a supervised learning algorithm used for regression and classification tasks. It works by finding the optimal hyperplane that separates the different classes, which is done by identifying the support vectors: the data points closest to the hyperplane. It is a very powerful and versatile algorithm and can handle high-dimensional data effectively. It can be applied even to large datasets, though it requires careful selection of its parameters.
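The information-gain criterion that decision tree splitting relies on, as mentioned above, can be computed from label entropy. A small self-contained sketch with made-up labels (1 = phishing, 0 = legitimate):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy reduction achieved by splitting `parent` into `subsets`."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]
# A perfect split on some URL feature separates the classes entirely,
# so all of the parent's 1.0 bits of entropy are removed.
gain = information_gain(parent, [[1, 1, 1, 1], [0, 0, 0, 0]])
print(gain)  # 1.0
```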
Fig (c) - Random Forest Architecture

IV. RESULT

The model will help users differentiate between a fake and a legitimate URL, which reduces the risk of online scams and frauds by cyber criminals. Due to the lack of accurate detection methods, these scams are increasing every day.

Fig (e) - Accuracy of Random Forest
Table (a) - Training and Testing Accuracy of the techniques used

The graph below shows the feature importance of the various elements of a website, according to which one can classify a given URL as fake or legitimate. Some attributes of a website are the length of the URL, its depth, the domain age and end, prefix, suffix, etc.

V. CONCLUSION

In summary, this research paper concentrates on techniques that can differentiate between real and fake URLs. In this research, the Autoencoder Neural Network, Random Forest and Decision Tree algorithms have been used. After comparison with the other algorithms, we came to the conclusion that Random Forest gives better accuracy than the other two techniques used. Our research also highlighted the importance of fairness and ethics in using this technology: it is crucial to make sure our predictions are unbiased and transparent. Even though our work is successful, there is still much to explore in order to make our model more reliable, so that it can detect fake URLs more accurately.
REFERENCES

[1] Sujatha, G., Ayyannan, M., Priya, S. G., Arun, V., Arularasan, A. N., & Kumar, M. J. (2023). Hybrid optimization algorithm to mitigate phishing URL attacks in Smart Cities. 2023 3rd International Conference on Innovative Practices in Technology and Management (ICIPTM). https://doi.org/10.1109/iciptm57143.2023.10118171